forms + ML = less bureaucracy     
It's hard to find time to post when we're busy. Here's a glimpse of a recent piece of work we're quite excited about. Our form-comprehension algorithms have improved at dynamically understanding a form's structure and at extracting user inputs as structured values.

Consider the following manual form (pay no attention to the content, it is purely illustrative).

We previously had difficulty accurately identifying checkboxes, as well as determining whether they were checked or not.
Updated output:
[Note 1: First character indicates checked "X" vs. unchecked "O"]
[Note 2: Second character is the cluster id for grouped form elements]

X 1 What is your favorite editor? --> Vim
O 1 What is your favorite editor? --> Emacs
O 2 Is pineapple OK on pizza ? --> Yes
X 2 Is pineapple OK on pizza ? --> No
X 3 What are your favorite days of the week? --> Monday
O 3 What are your favorite days of the week? --> Tuesday
O 3 What are your favorite days of the week? --> Wednesday
O 3 What are your favorite days of the week? --> Thursday
O 3 What are your favorite days of the week? --> Friday
O 3 What are your favorite days of the week? --> Saturday
O 3 What are your favorite days of the week? --> Sunday
X 4 Oxford comma or normal comma ? --> Oxford
O 4 Oxford comma or normal comma ? --> Normal
X 5 Star wars is better than Star Trek? --> Yes
O 5 Star wars is better than Star Trek? --> No

We're now able to generate fairly accurate output without any special fine-tuning. There are still some hiccups, e.g. the O from Oxford interpreted as a checkbox, but the performance is already quite strong, especially in standard computer-generated forms where handwriting is less of a challenge.
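
To give a feel for how this output is consumed downstream, here is a minimal Python sketch that turns the lines above into structured answers, keeping only the checked options (the parsing logic is simplified for illustration and is not our production code):

from collections import defaultdict

raw = """X 1 What is your favorite editor? --> Vim
O 1 What is your favorite editor? --> Emacs
X 4 Oxford comma or normal comma ? --> Oxford
O 4 Oxford comma or normal comma ? --> Normal"""

answers = defaultdict(list)
for line in raw.splitlines():
    checked, _cluster, rest = line.split(" ", 2)  # check flag, cluster id, remainder
    question, option = (part.strip() for part in rest.split("-->"))
    if checked == "X":
        answers[question].append(option)

print(dict(answers))
# {'What is your favorite editor?': ['Vim'], 'Oxford comma or normal comma ?': ['Oxford']}
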
ML in a racing game     
Bye-bye 2023     
We are so very, very grateful for 2023 and would like to thank you all for this wonderful year. We wish you all a peaceful and prosperous 2024: stay safe, kiss your loved ones, keep smiling.

Comparing Llama2 LLM 7B and 13B     
Earlier this year, Meta made a significant open-source contribution with Llama2, an alternative to ChatGPT. They released 3 models: one with 7 billion parameters (7B), one with 13 billion parameters (13B) and one with 70 billion parameters (70B). For reference, the 70B model performs similarly to, or slightly worse than, the ChatGPT 3.5 model, depending on which benchmark you read. We've deployed the 2 smaller models - Llama2 7B and 13B - and ran a small benchmark.


Approach


We used 2 queries to test the chat variants of the Llama2 13B and 7B models:
  • "I need instructions for a Christmas soup. Write as a pirate and use funny metaphors"
  • We requested a summary from the first 11 paragraphs of this nytimes article: approx. 600 words

The outputs from the models are at the bottom of this post. Note that the pirate-speak is only half a joke: we intentionally wanted to test how the LLM responds in a style that doesn't match the vast majority of its training data.
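
For those curious how such a query looks in practice, here is a minimal sketch of prompting the 7B chat model via the Hugging Face transformers library. The model id, prompt format and generation settings are illustrative assumptions, not our actual deployment code:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # gated model, requires access approval on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" needs the accelerate package and spreads the model over available GPUs
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# The official Llama2 chat models expect the [INST] ... [/INST] prompt format
prompt = "[INST] I need instructions for a Christmas soup. Write as a pirate and use funny metaphors [/INST]"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))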


Findings


  • First of all, we're incredibly proud that we can run these models on our local infrastructure with good performance. This opens the door to new state-of-the-art solutions for our customers while guaranteeing that their data remains in Switzerland and adheres to the highest security standards. We'll be deploying the 70B model shortly and building up a cluster of Llama2 LLMs.
  • 13B is better: the summary is more incisive and the pirate recipe flows better (and we actually want to cook it). However ...
  • 7B didn't do badly, and 7B can run on a single 24 GB VRAM GPU whereas 13B needs to be distributed across two. In other words, 7B could process roughly twice the number of queries that 13B does on similar hardware (more or less, we're simplifying).


Conclusion


Ultimately it boils down to an age-old principle: using the right tool for the right job. Certain areas of your use-case may require only a simple interaction, and a 7B LLM is the best choice, offering speed, simplicity and low cost (for example summarizing large emails into a few bullet points).
But other areas of your use-case may require more sophistication (for example drafting an email response to certain types of emails), and then a 13B or 70B model needs to do the heavy lifting.

We'll continue to learn how to run, fine-tune and train these models. It's incredibly exciting that this technology is available to build solutions on.


Output 13B - pirate soup query


  • Ahoy matey! Yer lookin' fer a Christmas soup, eh? Well, hoist the sails and set course fer a hearty, swashbucklin' soup that'll make ye feel like the scurviest of scallywags!
  • First, ye'll need to gather yer ingredients like a treasure chest filled with gold doubloons. Ye'll need some onions, garlic, carrots, potatoes, and a wee bit o' beef or seafood (arrr, make it count, matey!). Now, don't be a landlubber, chop those veggies up like ye're singin' a sea shanty!
  • Next, ye'll need to cook up yer soup like ye're sailin' the high seas. In a great big pot, mind ye! Add in some olive oil like a drop o' rum, and let it swirl like the ocean currents. Throw in yer chopped veggies and cook 'em until they're soft as a sea breeze.
  • Now, here be the fun part, me hearty! Add in yer beef or seafood and cook it until it's browned like a well-worn leather sea chest. Don't ferget the spices, me matey! A pinch o' salt, a grind o' pepper, and a dash o' paprika'll make yer soup taste like the treasure of the seven seas!
  • Now, let it simmer like a ship at anchor for a good long while, until the veggies be soft as a mermaid's song and the meat be fallin' off the bone like a piece o' booty.
  • And that's it, me hearty! Yer Christmas soup be ready to be served like a treasure chest overflowin' with booty! Serve it up with some crusty bread like a ship's biscuit, and a mug o' grog like a pint o' rum.
  • So hoist the colors, me hearties, and enjoy yer Christmas soup like the scurviest of scallywags! Arrrr!



Output 13B - nytimes article summary


  • Sam Altman, CEO of OpenAI, was unexpectedly fired by the company's board members via a video call on November 17th.
  • The ouster was the culmination of longstanding tensions within OpenAI, pitting those concerned about AI's power against those seeking profit and prestige.
  • The drama has raised questions about the trustworthiness of OpenAI and the AI industry as a whole, as well as the motivations and ethics of its leaders and investors.



Output 7B - pirate soup query


  • Ahoy matey! Here be the instructions for a swashbucklin' Christmas soup, fit fer a scurvy dog like yerself!
  • Step 1: Ye'll be needin' some o' that fancy-schmancy "vegetable broth" (translation: water with a bit o' salt and some veggies floating in it). Don't be forgettin' the "cream o' mushroom soup" (translation: a can o' mushroom soup, because ye don't want to be eatin' nothin' too fancy on a first date). And o' course, ye can't forget the "chopped ham" (translation: a bit o' ham that's been chopped up into bite-sized pieces).
  • Step 2: Chop up yer ingredients like a pirate choppin' up a treasure chest. Take yer veggies and chop 'em up into little pieces that'll fit in yer mouth. Don't be worryin' about how big or small they are, just make sure they're all roughly the same size so ye can eat 'em easy. And when ye get to the ham, make sure ye chop it up into bite-sized pieces so ye can eat it easy too.
  • Step 3: Cook yer soup like a pirate cookin' up a feast. Pour yer broth and mushroom soup into a big ol' pot and heat it up on the stove. Don't be afraid to add a bit o' seasonin' (translation: salt and pepper) to give it a bit o' flavor. And when ye add the chopped ham, make sure ye stir it in good so it's nice and hot all the way through.
  • Step 4: Serve yer soup like a pirate servin' up a treasure chest full o' booty. Pour yerself a big ol' bowl and enjoy! Don't be forgettin' to add a bit o' crusty bread or crackers on top to soak up all the goodness. And if ye be feelin' extra hungry, ye can always add a bit o' shredded cheese or a dollop o' sour cream on top to make it extra special.
  • So there ye have it, me hearty! A swashbucklin' Christmas soup that'll make ye feel like a pirate on the high seas. Enjoy yer feast, and don't be forgettin' to save me a bit o' the loot!



Output 7B - nytimes article summary


  • Sam Altman, the CEO of OpenAI, was fired from the company after a 15-minute video call with the independent board members who had been secretly plotting against him for months.
  • The board members accused Altman of dishonesty and voted to remove him from OpenAI, which he co-founded.
  • Altman's ouster was the culmination of years of tensions at OpenAI, where divisions deepened between those who were concerned about AI's power and those who saw it as a profit and prestige opportunity.
ML for consultants     
Hey there! We've been pretty busy lately, and we've got some cool stuff to share. One thing we're excited about is how we've made our tech work together like Lego blocks. It's like taking different parts and creating something fun! We recently decided to link our OCR pipeline with a slide-generating AI, so that we can get our scribbles into a presentation more quickly.



It's still early, but we're happy with the results so far. Plus, it does a neat trick by turning PDF slides back into their original format. We're going to share this cool feature with our subscribers soon, and we're working on making it even better.
Simple powerful examples with SQL     
Let's dive into the most useful SQL statements and give some examples of how they are used. These next few keywords may seem simple, but they can be used to create powerful queries.

SELECT:
The SELECT statement is the cornerstone of SQL, allowing us to retrieve data from a database. It determines which columns from the specified tables to include in the result set.

FROM:
The FROM clause specifies the table or tables from which to retrieve data. It defines the source of information for subsequent operations.

WHERE:
The WHERE clause filters records based on specified conditions. This keyword allows us to narrow down the results to only those that meet specific criteria, enhancing data relevance and accuracy.

ORDER BY:
ORDER BY sorts the result set in ascending or descending order based on specified columns. It's crucial for arranging data in a meaningful way for analysis and reporting.

LIMIT:
LIMIT restricts the number of rows returned by a query, allowing us to focus on a manageable subset of results. This is particularly useful when dealing with extensive datasets.

Now let's look at an example of creating SQL queries on the following dataset:

Name     EmploymentStartDate   Origin      Age
John     2005-04-15            USA         38
Emily    2020-08-22            Canada      31
Michael  2012-12-10            UK          38
Sarah    2006-06-05            Australia   47
David    2021-03-25            USA         28
Claire   2019-03-25            Australia   28

Question 1:
Retrieve names and ages of individuals from the USA, sorted by age in descending order.

Solution 1:
SELECT Name, Age
FROM Dataset
WHERE Origin = 'USA'
ORDER BY Age DESC;


Explanation: We're selecting the "Name" and "Age" columns from our dataset where the "Origin" is 'USA'. The results are then ordered by "Age" in descending order, showcasing individuals from the USA with the highest ages first.

Question 2:
List names and origins of individuals hired after 2020, ordered alphabetically by name.

Solution 2:
SELECT Name, Origin
FROM Dataset
WHERE EmploymentStartDate > '2020-01-01'
ORDER BY Name;


Explanation: Here, we're selecting the "Name" and "Origin" columns for individuals hired after January 1, 2020. The results are ordered alphabetically by "Name", presenting a list of individuals meeting this condition.

Question 3:
Retrieve names and employment start dates of individuals from Australia, sorted by employment date in ascending order.

Solution 3:
SELECT Name, EmploymentStartDate
FROM Dataset
WHERE Origin = 'Australia'
ORDER BY EmploymentStartDate ASC;


Explanation: In this query, we're selecting the "Name" and "Employment Start Date" columns for individuals from Australia. The results are ordered by "Employment Start Date" in ascending order, presenting the dates from earliest to latest.

Question 4:
List names and ages of individuals older than 35, ordered by age in descending order.

Solution 4:
SELECT Name, Age
FROM Dataset
WHERE Age > 35
ORDER BY Age DESC;


Explanation: Here, we're selecting the "Name" and "Age" columns for individuals with an age greater than 35. The results are ordered by "Age" in descending order, showcasing individuals aged over 35 from highest to lowest age.

Question 5:
Select the employee with longest tenure.

Solution 5:
SELECT Name
FROM Dataset
ORDER BY EmploymentStartDate ASC
LIMIT 1;


Explanation: Here we don't need a WHERE clause: selecting all individuals ordered by their employment start date and limiting the result to 1 row gives us the person with the longest tenure.

SQL is hard to avoid: if you have MS Excel spreadsheets that extract data from a third-party database, or have MS Access running, then chances are you've already had to write a couple of queries. At Sense6, even though we mostly work on NoSQL databases, we could not escape the pull that SQL databases (commonly known as relational databases) exert. We do have some tailored reports for a handful of clients running on MySQL.
Combined offsite with Denteo     
The team at Denteo invited us to their Hackathon offsite, which took place this week. For three days, we shared meals, discussions, fun activities and a lot of coding.

Most interesting was to observe how the Denteo team tackled various challenges during the Hackathon. It gave us a perspective on how another team problem-solves, collaborates, innovates, persists and ultimately succeeds. The team achieved a harmony of competitiveness and collaboration that was fully on display during the awards ceremony. We won't go into details of the ideas presented and pilots demoed, but we think this is going to be an exciting year for the Denteo team and its customers.

In the future, we will seek out other startups for combined offsites: this experience was too valuable not to reproduce.
Where does data come from?     
In today's AI landscape, where chatbots seamlessly answer queries, languages are effortlessly translated, and images are conjured from mere characters, it is imperative to understand the origins of the data underpinning these recent innovations. In this blog post, we discuss some of the myriad data sources that power our data-driven world. Let's dive into this!

User-generated data

The data you generate every day is a significant contributor to the digital universe. When you post on social media, shop online, or use a fitness app, you're creating user-generated data. This can include your posts, likes, comments, and even the products you browse. Companies collect and analyze this data to understand your preferences and behavior.

Sensors and IoT devices

Your smartphone, smartwatch, refrigerator, car, home, office, ... are equipped with small sensors that constantly gather data. These Internet of Things (IoT) devices are increasingly interwoven in our daily lives; they monitor heart rates, sleeping phases, temperatures, number of people in a building, traveling speed, the weather, etc. This data helps us better understand our environment and is mostly used to enhance our daily lives. It's also a huge data privacy and security concern, but that's a different topic.

Websites and apps

Have you ever wondered how websites and apps offer personalized experiences? Many collect data about your interactions, like the pages you visit, the time you spend, and the buttons you click. This information helps tailor content and recommendations just for you.

Government and public data

Governments collect a vast amount of data and conduct national studies (for example BFS Admin in Switzerland). These studies are regularly made available to the public and can be used for various purposes, such as research, business planning, and policy analysis.

Business data

Companies generate data through their day-to-day operations. This includes sales transactions, inventory records, customer feedback, and employee performance metrics. Analyzing this data helps businesses make informed decisions and optimize their processes. Note that laws such as GDPR protect consumers from overly data-hungry companies and also put an expiration date on how long consumer data can be stored.

Web scraping and APIs

Sometimes, data is collected by scraping information from websites or using Application Programming Interfaces (APIs) (for example OpenWeather APIs). These tools allow developers to access and retrieve data from various online sources. While collecting data via APIs doesn't create new data per se, combining data from different sources in useful ways can generate additional insights and therefore data.
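
As a small illustration, pulling current weather data from an API takes only a few lines of Python. The endpoint and parameters below follow OpenWeather's documented current-weather API; the key is a placeholder:

import requests

response = requests.get(
    "https://api.openweathermap.org/data/2.5/weather",
    params={"q": "Zurich,CH", "appid": "YOUR_API_KEY", "units": "metric"},
    timeout=10,
)
weather = response.json()
print(weather.get("name"), weather.get("main", {}).get("temp"), "degrees C")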

Surveys and feedback forms

When you fill out surveys or feedback forms, you're contributing valuable data. Organizations use this input to understand customer satisfaction, product improvement areas, and market trends.

Open data initiatives

Some governments and organizations promote open data initiatives, making datasets freely available to the public. This fosters innovation and transparency while encouraging citizens to participate in data-driven projects.

Having the appropriate data gathering and storing policies is a cornerstone of modern society, and a fine line to balance competing interests. On one hand, access to the right data allows companies and governments to provide effective services to consumers. On the other hand, consumers' privacy needs to be protected in order for third-parties not to gain an unfair advantage in using data to manipulate behavior. We don't believe there's a right solution, rather there needs to be a healthy debate and corrective actions whenever we stray too far on either side of the right balance.

At Sense6 we curate our access to data sources (or build up our own) with clear use-cases in mind. If there is no specific service that the data supports, then we don't use it and we don't store it.
Measuring OCR performance     
Extracting data from images and scanned documents is a crucial part of what we do. We've deployed 3 OCR engines on our platform: the first 2 are deployed locally (Tesseract, Microsoft Azure OCR), the third is available via an API (Google Vision). We thought we'd illustrate performance by measuring them on 2 basic use-cases.
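
As a reference point, calling the locally deployed Tesseract engine from Python takes only a few lines. The file name is a placeholder; pytesseract requires a local Tesseract installation:

from PIL import Image
import pytesseract

image = Image.open("drivers_log_scan.png")  # placeholder for the raw scan
text = pytesseract.image_to_string(image, lang="eng")
print(text)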



Use-case 1, with the following raw file mimicking a hastily scribbled driver's log:
  • Microsoft Azure OCR:
    Numbers: managed everything except for number "ZH42651A", which it confused with "ZH42659A"
    Dates: managed all dates except for 10012023
    Comments: managed all 3 handwritten comments
  • Google Vision API:
    Numbers: managed everything except for number "ZH42651A", where it only gave back the numbers "42651"
    Dates: managed all dates except for the first 09072023
    Comments: managed all 3 handwritten comments
  • Tesseract (opensource): all over the place, almost nothing correct




Use-case 2: a screen-shot of a dummy Excel file:
  • Microsoft Azure OCR: Everything correct (numbers, headers, row numbers) with 1 exception, instead of 0.56 it sent 0:56
  • Google Vision API: Everything correct without exception, however it misaligned the 3 bottom rows (capital city and data no longer aligned to capital)
  • Tesseract (opensource): got all the numbers typed into the cells correctly, however it didn't get 3 row numbers (before the country name) correctly: 7, 31, 32. It also didn't manage the color-inverted header row, where it only got gibberish

Your takeaway: if your use-case doesn't have a lot of handwriting, then opensource solutions such as Tesseract will take you a long way. However, if you do have handwritten notes, then both Azure and Google offer similarly impressive performance and are way beyond Tesseract. All 3 were similar in terms of speed.

Conclusion: we find these results impressive. OCR has come a long way in a short time and - despite us not going easy on them in our 2 use-cases - especially Google and Azure did an incredible job of digitizing the data. We've deployed all 3 OCR engines and continue to experiment (different languages, different colors, different resolutions, etc.) in order to leverage their strengths. For customers with strong data security requirements, we tend to gravitate to Azure and Tesseract, as these are deployed locally and we haven't found an easy way to deploy Google Vision locally (yet).
Jenkins     
In the fast-paced software-development world of startups where innovation and speed are of the essence, setting up the right automated testing tools is key. We finally began migrating our Python unit tests to the industry-leading open source automation server Jenkins. Here are our main thoughts so far.

Open-source

Jenkins, being open source, turned out to be a cost-effective and simple solution. There are neither hefty licensing fees nor complex licensing agreements, enabling us to focus on test coverage rather than admin overhead.

Simplicity in set-up and configuration

Jenkins has a user-friendly installation process, and we were running complex test cases within the first hour of setting it up. It dealt with more complex topics - for example storing user credentials for Git servers - in an exceptionally simple way.

User-Friendly Interface

Jenkins' charm lies in its unintimidating, user-friendly interface. Navigating through the various functionalities is a breeze, making the learning curve gentle for startups. This approachability ensures that even those who are new to Jenkins can quickly get on board and start reaping its benefits.

Plugins

Jenkins doesn't just stop at being user-friendly—it empowers startups with a range of plugins that open up a world of possibilities. From connecting with version control systems to integrating with cloud services, these plugins seamlessly extend Jenkins' capabilities. This versatility meant we could craft a testing environment that is suited to our startup's unique requirements: connection to Git, building and management of Docker images, deployment to our ECR, etc.

Continuity in Integration and Delivery

Continuous Integration (CI) and Continuous Delivery (CD) are the bedrock of efficient development, and Jenkins is our guiding star. With CI, Jenkins automates the integration of code changes from different contributors, curbing integration issues. CD takes us a step further by automating the deployment process, ensuring that our code journey from development to production is both seamless and error-free.

Docker Integration

In the modern tech landscape, Docker has revolutionized the way we deploy applications. Jenkins recognized this, and its seamless Docker integration caught our attention. The ability to incorporate Docker containers into our testing and deployment workflows streamlined our processes, adding another layer of efficiency to our startup journey.

We continue our journey hoping to set up a fully automated testing environment and maybe even deployment. We'll keep you updated on how it goes.
Data visualization with Chart.js     
We've been working on our own form tool these past weeks and have been reminded of how simple and powerful it is to visualize data with Chart.js.

For those who don't know what it is, Chart.js provides Javascript libraries to visualize your data. Makes us wish something similar were available in PowerPoint.



Why are we building our own form tools? Alternatives keep constraining us in terms of how large the files we upload can be, whether they treat our data confidentially, etc. More on this later.
How to Choose the Ideal Database?     
In today's data-driven world, choosing the right database is a critical decision for businesses and organizations of all sizes. The abundance of data types, varying workloads, and emerging technologies has made this decision rather complex. In this blog post, we'll delve into the different types of databases available, and provide a comprehensive guide to help you choose the ideal database for your specific needs.

Database types

Relational databases (SQL)
These databases use structured query language (SQL) to manage and manipulate structured data. They are known for their reliability, ACID compliance, and well-established schemas. Examples include MySQL, PostgreSQL, and Oracle Database.

NoSQL databases
We already had a blogpost about NoSQL databases here, which you can check out for more details. These databases accommodate unstructured or semi-structured data and offer flexibility and scalability. Categories include:
  • Document stores: Suitable for semi-structured data. Examples are MongoDB and Couchbase.
  • Key-value stores: Great for high-speed read and write operations. Redis and Cassandra fall into this category.
  • Column-family stores: Optimal for managing and querying large datasets. HBase and Cassandra are prominent examples.
  • Graph databases: Designed for managing complex relationships in data. Neo4j and Amazon Neptune are notable names.
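
To make the document-store flexibility concrete, here is a minimal sketch assuming a local MongoDB instance and the pymongo driver: two records with different fields can live in the same collection without a predefined schema.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
customers = client["demo"]["customers"]

# No schema to declare up front: documents in the same collection can differ
customers.insert_one({"name": "Alice", "email": "alice@example.com"})
customers.insert_one({"name": "Bob", "loyalty_points": 120, "tags": ["premium"]})

print(customers.count_documents({}))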

Choosing the ideal database, best practices:

Selecting the right database requires careful consideration of several factors:
  1. Data structure and type: Determine if your data is structured, semi-structured, or unstructured. This influences the choice between SQL and NoSQL databases.
  2. Scalability: Analyze your growth projections. NoSQL databases are generally more scalable, making them suitable for rapidly expanding datasets.
  3. Performance requirements: Consider the nature of your queries. If your application requires complex joins and transactions, a relational database might be preferable.
  4. Consistency and availability: For critical applications, where data consistency and high availability are essential, relational databases are a strong choice due to their ACID properties.
  5. Flexibility: NoSQL databases excel at handling evolving data schemas. If your data requirements change frequently, a NoSQL solution might be more fitting.
  6. Budget and costs: Factor in licensing fees, hardware costs, and operational expenses. Open-source databases like PostgreSQL can be cost-effective alternatives.
  7. Ecosystem and integration: Evaluate how well the database integrates with your existing tools, frameworks, and analytics platforms.
  8. Security: Ensure the database provides robust security features, encryption, and compliance with relevant regulations like GDPR.
  9. Ease of management: Consider the ease of administration, monitoring, and maintenance. Cloud-managed databases can simplify these aspects.
  10. Future growth and innovation: Anticipate how your data needs might evolve in the coming years. Choose a solution that aligns with your long-term strategy.
Selecting the ideal database involves understanding your data, workload, and long-term goals. The landscape offers various options, from traditional relational databases to NoSQL solutions. By evaluating your requirements and following best practices, you can make an informed decision that empowers your organization's data-driven journey.
Why is it misleading to give clean datasets to scholars?     
In data science, students are introduced to the field with nicely organized, clean, and balanced datasets for their academic projects. These datasets are meticulously pre-processed and free from any real-life complications that arise when working with raw, unfiltered data. While such pristine datasets serve as an excellent starting point for learning essential concepts and techniques, they can also be misleading. In reality, data scientists and ML/AI practitioners rarely encounter such clean datasets in their professional careers. This blog post explores the pitfalls of relying solely on sanitized data and highlights the importance of preparing scholars for real-world challenges in the data-driven landscape.

The Illusion of Ideal Data

When students work with clean datasets, they often get the illusion that data is naturally well-structured and readily available. In reality, real-world data can be messy, incomplete, and riddled with errors. From missing values to outliers, data practitioners must be prepared to tackle these challenges head-on.

Lack of understanding of data bias

Clean datasets, by their very nature, are designed to be unbiased and balanced. However, real-world data can be heavily skewed and possess inherent biases. When students solely work with balanced data, they miss the opportunity to understand the importance of bias detection and mitigation. Scholars should be exposed to diverse datasets, ensuring they learn how to handle and rectify biases appropriately.

Unrealistic model performance

In controlled environments, ML models will exhibit exceptional performance after tweaking a handful of parameters. Students should also be confronted with the practical side of building models in real-world situations: experience the "frustration" of a poorly performing model despite "doing everything right", and learn the grit and relentlessness that iteratively builds great models.

Lack of data exploration skills

Working with clean datasets rarely requires extensive exploratory data analysis. In the real world, understanding the data is a critical step in the modeling process. Uncovering hidden patterns, visualizing trends, and gaining insights from the data are skills that students may overlook if they only deal with cleaned data. Data exploration is an essential aspect of data science that should be emphasized in educational settings.

The importance of data preprocessing

Data preprocessing can consume a significant portion of a data scientist's time. Unfortunately, clean datasets sidestep this crucial step. Learners may not fully grasp the significance of data cleaning, feature engineering, and transformation techniques. Yet, these skills are vital for preparing data for analysis and building robust ML models.

Overlooking data quality issues

When learners are accustomed to working with clean datasets, they may fail to recognize the importance of data quality assessment. In real-world scenarios, data can be collected from various sources, leading to inconsistencies, inaccuracies, and duplicate entries. Without the experience of handling imperfect data, learners may struggle to identify and rectify data quality issues, resulting in flawed analyses and inaccurate models.

Complementing theory with safeguards

Data collection and training will unveil edge cases and errors where the model performs abysmally. Instead of trying a pure-ML-only approach, students should learn how to complement their model with standard code for safeguarding and error-handling rules. This will prevent the model from producing extremely erroneous outputs.

Poor model generalization

Models trained on clean and balanced datasets may exhibit high accuracy during testing, but they could struggle to generalize well in real-world situations. Real data is diverse and often imbalanced, and students must understand how to build models that can perform reliably in such scenarios. Failure to address this issue may result in ML models that underperform or fail to meet the intended objectives when applied to real applications.

While clean datasets serve as an excellent starting point for students' learning journeys in data science, machine learning, and artificial intelligence, relying exclusively on such data can have detrimental effects on their preparedness for real-world challenges. From overlooking data quality issues to assuming unrealistic project timelines, the problems stemming from this approach are numerous.

By incorporating real-world complexities into coursework, scholars can better prepare themselves for the multifaceted and unpredictable nature of data-driven careers. The ability to handle imperfect data, navigate biases, and build robust models in dynamic environments is essential for becoming a successful data scientist in the real world.
ML lessons learned - July 2023     
We've been working on a predictive ML model for one of our customers in the construction industry; here are some unstructured thoughts on the experience:

  • The importance of feature engineering, especially when using unstructured data sources as inputs. Are you going to stem or lemmatize your keywords? Are you going to select the words with the most occurrences, or the ones with the most differentiation vs. the output variable? Are you going to one-hot-encode some of your numeric data? It's impressive to see how differently the models performed with different answers to these questions.

  • Over-fitting is your enemy, do everything you can to detect it and avoid it

  • Take an industrialized approach: build or buy the capabilities to train multiple models with tweaked parameters. Having the results for multiple combinations will give you a deeper understanding than one-by-one steps.

  • Squeeze the lemon: there were 3 separate occurrences where we thought "that's it", we're doing all we can with the data, there's no way to improve. But taking a step back led to 3 "wait-a-minute-why-don't-we-use-that" moments. If you feel you're stuck, take a step back: are you really using all the data you have?
Our model's median absolute error started at 300% and is down to 25% in the span of 2 weeks (mean absolute error 10% higher, mostly due to a few crazy outliers). It's been a tough, fun, chaotic, insightful, frustrating and satisfying journey all at once. We're going to do our best to get MAE down to 20%, but for now the week-end beckons. Happy week-end!
Machine learning jargon     
Machine learning is an exciting field with the power to transform industries and revolutionize our lives. As a machine learning engineer, it's important to understand the everyday terms used in this field to communicate effectively with peers, researchers, and stakeholders. In this blog post, we'll explore key terms in machine learning, including supervised learning, unsupervised learning, neural networks, feature engineering, overfitting, underfitting, hyperparameters, bias-variance tradeoff, regularization, cross-validation, early stopping and ensemble learning.

  • Supervised learning: Supervised learning is a type of machine learning where the algorithm learns from labeled data. The algorithm's goal is to learn patterns in the data and make predictions on new, unseen inputs. Popular examples of supervised learning include predicting housing prices based on features like size and location, or classifying emails as spam or not spam.

  • Unsupervised learning: Unsupervised learning deals with unlabeled data. The algorithm's objective is to find patterns, structures, or relationships within the data without any predefined labels. Common applications of unsupervised learning include grouping similar documents together or discovering hidden themes in customer feedback.

  • Neural networks: Neural networks are algorithms inspired by the human brain. They consist of interconnected nodes, or "neurons," organized in layers. These networks are great at learning complex patterns from data. For example, they can recognize faces in images or understand the sentiment in text.

  • Feature engineering: Feature engineering involves selecting, transforming, and creating meaningful input features from raw data. It's like preparing ingredients before cooking a meal. Feature engineering helps machine learning models by providing relevant information that captures the underlying patterns in the data. For instance, in a spam email classification task, features like the presence of certain words or the length of the email could be important indicators.

  • Overfitting and Underfitting: Overfitting happens when a model becomes too specialized in the training data and fails to generalize well to new, unseen data. It's like memorizing a book word for word but struggling to understand a similar book with different sentences. Underfitting, on the other hand, occurs when a model is too simple to capture the underlying patterns in the data. It's like trying to understand a complex book with a limited vocabulary. Both overfitting and underfitting need to be avoided for accurate predictions.

  • Hyperparameters: Hyperparameters are configuration settings that you need to tune before training a machine learning model. They control the behavior and performance of the model. Think of them as the knobs and switches you adjust to get the best performance. Examples of hyperparameters include the learning rate, which determines how quickly the model learns, or the number of layers in a neural network, which affects its complexity.

  • Bias-Variance tradeoff: The bias-variance tradeoff is a fundamental concept in machine learning. Bias refers to the assumptions made by a model. If a model has high bias, it might oversimplify the data and make strong assumptions. Variance, on the other hand, refers to the model's sensitivity to fluctuations in the training data. Too much variance can cause the model to be easily influenced by noise. Balancing bias and variance is important to create a model that can make accurate predictions on new, unseen data.

  • Regularization: Regularization is a technique used to prevent overfitting in machine learning models. It adds a penalty to the model's objective function, discouraging overly complex or large parameter values. Regularization helps the model focus on important features and reduces its tendency to memorize noise or outliers in the training data.

  • Cross-validation: Cross-validation is a technique used to evaluate the performance of a machine learning model. It involves splitting the data into subsets for training and testing. By using multiple train-test splits, cross-validation provides a more robust assessment of the model's ability to generalize to new, unseen data. It helps in estimating how well the model will perform in real-world scenarios.

  • Early stopping: Early stopping is a technique used to prevent overfitting and determine the optimal number of training iterations or epochs for a machine learning model. It involves monitoring the model's performance on a validation set during training. When the performance starts to degrade, early stopping stops the training process, preventing the model from becoming overly specialized on the training data.

  • Ensemble learning: Ensemble learning involves combining multiple machine learning models to improve performance. It's like having a group of experts with different perspectives, and their collective decision is often better than that of a single expert. Examples of ensemble learning techniques include bagging, boosting, and random forests.
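
To make a couple of these terms concrete, here is a minimal sketch (using scikit-learn and synthetic data, our own choices for illustration) that evaluates a regularized linear model with 5-fold cross-validation:

from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

model = Ridge(alpha=1.0)  # alpha is a hyperparameter controlling the regularization strength
scores = cross_val_score(model, X, y, cv=5, scoring="r2")  # 5 train-test splits
print(f"mean R2: {scores.mean():.3f} (+/- {scores.std():.3f})")
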
Understanding the everyday terms and concepts in machine learning is required for day-to-day communication and collaboration. The terms discussed in this blog post provide a foundation for further exploration. Keep expanding your knowledge, and stay updated with the latest advancements.
Preprocessing Techniques     
In one of our previous posts, we mentioned we'd delve deeper into the topic of pre-processing. Data preprocessing is crucial in any data analysis or machine learning project: it involves transforming raw data into a format that is suitable for further analysis and modeling. We will focus on analytical data this time, as we already covered the basics of NLP and text classification. In this blog post, we explore various preprocessing techniques along with real-world examples.

  1. Handling Missing Values: Missing values are a common occurrence in datasets and can impact the accuracy and reliability of the analysis. Here are some techniques to handle them:

    • Deleting Rows/Columns: If the missing values are few and randomly distributed, you can consider deleting the respective rows or columns. However, be cautious as this approach may result in the loss of valuable information.

    • Imputation: For cases where deleting rows/columns is not feasible, imputation techniques can be employed. Imputing missing values means filling them with estimated values based on existing data. Common imputation techniques include mean, median, and mode imputation, as well as advanced methods like regression imputation. In order to choose the best technique, domain knowledge is indispensable.

  2. Data Cleaning and Formatting: Data inconsistencies and errors can hinder the analysis process. Here are some techniques to clean and format your data:

    • Removing Duplicates: Duplicates introduce a bias towards certain results (think of it as a higher "weighting"); unless this is what you really want, you should remove duplicates from your data-set.

    • Correcting Inconsistent Data: Data manipulation functions and formulas can be used to identify and correct inconsistent data, such as inconsistent date formats, currency symbols, or misspelled words.

  3. Feature Scaling and Normalization: Feature scaling is essential when dealing with variables that have different scales or units. There are various techniques to scale and normalize data; let's discuss 2 of the most common ones:

    • Standardization: Standardizing variables brings them to a common scale, with a mean of zero and a standard deviation of one. This step is especially important if you are planning to apply ML models that consider the distribution of the data rather than its absolute values. Depending on the distribution, you may want to consider applying a further transformation, for example taking the distribution of the log of the parameter if your parameter behaves in an exponential way (e.g. Covid infections in a population). Even if the distribution of your data is already Gaussian, you should still consider standardizing it.

    • Min-Max Scaling: Min-max scaling scales the variables to a specific range, typically between 0 and 1. Use min-max scaling when you have prior knowledge or domain expertise that the data has a specific range, or when the algorithm you plan to use requires input features to be on a bounded interval (e.g., neural networks with activation functions that expect input values between 0 and 1).

  4. Handling Categorical Variables

    • One-Hot Encoding: One-hot encoding converts categorical variables into binary vectors, where each category is represented by a binary column. For example, if you want to use a color as an input parameter, and you have the options "red", "green", and "blue", these can be transformed into red=(1,0,0), green=(0,1,0) and blue=(0,0,1). This step is mandatory in case you want to use a model that doesn't support categorical features (some models support text labels).

    • Label Encoding: Label encoding assigns a numerical label to each category. This is important because some models are only able to predict numerical labels. The encoding is simple: each category is mapped to a number.

  5. Dealing with Outliers: Outliers are data points that significantly deviate from the overall pattern. You can identify them via data exploration (scatter-plots, box-plots or histograms will reveal them visually):

    • Keep or remove: There's no one-size-fits-all approach here. You need to consider on a case-by-case basis whether to keep the extreme values (e.g. because they indicate something important in the data) or whether to remove them, usually only if they are truly erroneous values (e.g. sensor values that a hardware malfunction is triggering).

    • Winsorization: If you want a middle ground between keep or remove, Winsorization replaces extreme values with less extreme values, reducing the impact of outliers on analysis results. You can think about these outliers as missing values and use the same solution that you identified in the previous steps.
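
Pulling several of the steps above together, here is a compact sketch on a toy DataFrame (pandas, scikit-learn and scipy are our tool choices here; the data is invented for illustration):

import numpy as np
import pandas as pd
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 41, 230],  # one missing value, one obvious outlier
    "salary": [50000, 62000, 58000, 71000, 71000, 66000],
    "color": ["red", "green", "blue", "red", "red", "green"],
})

df = df.drop_duplicates()                                        # remove duplicate rows
df["age"] = df["age"].fillna(df["age"].median())                 # median imputation
df["age"] = winsorize(df["age"], limits=[0, 0.2])                # cap the top 20% of values (tames the 230)
df[["salary"]] = StandardScaler().fit_transform(df[["salary"]])  # mean 0, standard deviation 1
df = pd.get_dummies(df, columns=["color"])                       # one-hot encoding

print(df)
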
By effectively handling missing values, cleaning and formatting data, scaling and normalizing variables, dealing with categorical variables, and addressing outliers, you can unlock valuable insights and enhance the accuracy of your model later on.
Takeaways from Alvasys SI Day     
We spent last Thursday listening to talks from various players in the building automation industry. In a nutshell, the following items stood out:

  • Opensource gravitational pull continues. Smaller companies are tackling established players by touting their openness and their embrace of open standards (and simplified license structures). This applies to protocols (MQTT, Haystack), software (Node-RED, Grafana) and even hardware (Raspberry Pi: not quite open-source but close enough). The company Revolution Pi is a strong case study for this topic.

  • There's growing opportunity for AI. Currently, software and data are siloed according to different vendor solutions, which creates barriers to deploying consistent AI at scale, and the industry lags behind what it could achieve with improved algorithms and holistic data sources. However, the opportunity is growing as we consider 2 trends: (i) data is being freed from silos and can be consolidated into ML pipelines for training and deployment, and (ii) protocols to trigger actions are becoming more standardized and therefore easier to implement.

  • Success takes a blend of innovation and pragmatism. We were very interested to see what Lumenradio is doing to remove cables (and had an engaging discussion around radio frequencies and alternatives such as Lora), as well as the refreshingly pragmatic approach demonstrated by Fidelix, also within large-scale projects.



We're looking forward to pushing some use-cases of our own in this space over the coming months. It's an exciting space to be working in. Many thanks again to Alvasys for hosting this fun event.
The importance of preprocessing your data     
Lately, we've been working on a data analysis project, and we wanted to highlight why it's so important to prepare your data properly. Even if we already know it's crucial, let's take a moment to discuss the problems that can arise when we skip data preparation.

  1. Biased Analysis: When you have missing values, outliers, or duplicated data, it can mess up your conclusions. For example, if you don't handle missing values correctly, your analysis might end up with biased results because incomplete data can throw off calculations and misrepresent trends. Similarly, ignoring outliers can mess up your statistical measurements, impacting the overall analysis. Plus, duplicated data can make results look bigger than they actually are and give you a false idea of patterns or trends, leading to wrong interpretations.
  2. Wrong Predictions: Inconsistent data formats, like different ways of writing addresses or inconsistent capitalization, can introduce errors and mess with accurate predictions. Also, when your features are measured on different scales (like time in hours/minutes), not using data preparation techniques can lead to unreliable predictions and misleading comparisons. If you don't address these issues, your predictions might not be reliable or valid.
  3. Inefficient Analysis: When your features have different scales, it can be hard to make accurate calculations and comparisons, making the analysis process slow and inefficient. Also, duplicated data can make the analysis take longer and use up unnecessary computing power, making the whole process less efficient.
All these problems can be easily avoided with some simple but thorough data preparation. In the next few weeks, we'll share some suggestions to help you with this.
Enter the Prophet model     
In an ongoing project - where we are tasked with modelling a complex organization - we were recently made aware of the "Prophet model": an opensource model released by Facebook that is apt for modeling time series data with seasonality.

Prophet provides two enhancements we find interesting.

First, it makes it easy to add "changepoints": discrete events that have a strong impact on your output depending on what you are trying to model. For example you would want to account for public holidays or the Superbowl if you were trying to model traffic.



Second, it makes it straightforward to break down the seasonality components of the forecast into a daily, monthly or yearly view. It is possible to address these points in ARIMA or KNN, but Prophet does make it easier.
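
Here is a minimal sketch of both points. Prophet expects a DataFrame with columns "ds" and "y"; discrete events go in via the holidays argument. The file name and event dates are placeholders for illustration:

import pandas as pd
from prophet import Prophet

df = pd.read_csv("traffic.csv")  # placeholder; needs columns "ds" (date) and "y" (value)

# Discrete events such as the Superbowl are passed via the "holidays" argument
events = pd.DataFrame({
    "holiday": "superbowl",
    "ds": pd.to_datetime(["2023-02-12", "2024-02-11"]),
    "lower_window": 0,
    "upper_window": 1,
})

m = Prophet(holidays=events)
m.add_country_holidays(country_name="US")  # built-in public holiday calendar
m.fit(df)

forecast = m.predict(m.make_future_dataframe(periods=90))
m.plot_components(forecast)  # trend, weekly/yearly seasonality and holiday effects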



We're going to be experimenting with it a lot more in the coming days and pit it against our usual favorites. If you are interested in more information, we recommend Facebook's posts on the topic, two examples located here and here.
The beauty of regular expressions     
Every programmer has encountered a seemingly simple task that requires a complex solution. Often, such tasks involve handling and understanding text, whether it's cleaning data, validating input, or extracting information from disorganized files. Enter the world of regular expressions, an essential tool that is easy to start and difficult to master.


What are Regular Expressions?

Regular expressions (also known as regex) are a sequence of characters that define a pattern to be searched within text. They are widely supported in popular programming languages like Python, JavaScript, Java, and PHP, and commonly used in text editors and Unix-based systems to search, manipulate and validate text.

Enough theory already, let's try it out. If you don't have a regexp interpreter handy, there are multiple websites that provide the service for free: regex101 or regexr


Example 1: Extracting dates
Extract date formats such as "01.05.2026" with the following regex:

\d{2}\.\d{2}\.\d{4}

Feel free to follow along with regex101: copy-paste the regexp into the "regular expression" text input, and then type a sentence such as "today's date is the 20.05.2023 and it's sunny"
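
If you prefer to follow along in code, the same pattern works directly with Python's standard library:

import re

date_pattern = r"\d{2}\.\d{2}\.\d{4}"
text = "today's date is the 20.05.2023 and it's sunny"
print(re.findall(date_pattern, text))  # ['20.05.2023']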


Example 2: Check for 1 word or the other
You can check for one entire word or another, for example months in a year, with the following regexp:

(August)|(September)
You can type: "This is a line with August in it"


Example 3: Validate an email address
The following regex pattern checks whether a string contains an "@" and a period, along with valid email characters.

[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,6}


We hope you'll play around with these, or at least know they exist. Here are a few more examples to help you discover this beautiful tool.

1. Alphanumeric Check: [a-zA-Z0-9]*
Ensure a provided string has only letters or digits. This is helpful for usernames, IDs, or simple passwords.

2. Numeric Range: ([1-9]\d?|100)
Validate if a number is between 1 and 100, useful for percentage or rating checks.

3. US Phone Number: (\+\d{1,2}\s)?\(?\d{3}[\-\s)]?\d{3}[\-\s]?\d{4}
Validate formatting of US phone numbers, supporting multiple formats like "(123) 456-7890" and "123-456-7890."

4. Simple Password Check: (?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]{8,}
Ensure a password has a minimum length of eight characters, at least one letter and one number.

5. URL: (https?:\/\/)?([\da-z.-]+)\.([a-z.]{2,6})([\/\w.-]*)*\/?
Validate a URL with support for various domains and paths, including optional "http://" or "https://."
Programming for beginners with ChatGPT     
Software engineering has a steep learning curve early on. Not knowing how to start, how to structure, how to debug has meant hours of frustration for beginner software engineers. And most of the time, these hours could have been saved if a helpful tutor had been available with a couple of pointers or examples.

We decided to see if we could write a simple program by prompting ChatGPT and not programming a single line.

Prompt: Can you recommend an approach to extract the main headlines from a newspaper such as the nytimes via a Python program
ChatGPT: Certainly! Extracting the main headlines from a newspaper like The New York Times using a Python program typically involves web scraping and parsing HTML content.

Prompt: What if i don't want to scrape the page, is there another way ?
ChatGPT: If you don't want to scrape the webpage directly, you can explore alternative options for obtaining the main headlines from The New York Times. Here are a couple of alternatives:
NYTimes API: The New York Times provides an API that allows you [...]
RSS Feeds: The New York Times offers RSS feeds that provide a structured format of their content, including headlines. [...]

Prompt: I like the RSS approach. Can you write a simple python program which extracts the main world feeds and prints them out in a readable way with a brief summary ?
ChatGPT: Generates this file with explanations. The program works as requested (example output here) and is an easy example of iterating through a list and printing specific data
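
For readers following along, a minimal sketch in the spirit of the generated program looks like this. The feed URL and the feedparser library are our assumptions for illustration; the actual file ChatGPT produced for us is linked above:

import feedparser

FEED_URL = "https://rss.nytimes.com/services/xml/rss/nyt/World.xml"  # assumed NYT world feed

feed = feedparser.parse(FEED_URL)
for entry in feed.entries[:10]:
    print(entry.title)
    print("  " + entry.get("summary", "No summary available"))
    print()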

Prompt: Can you generate a web page with this data, and add an image for each feed entry ?
ChatGPT: Generates this file with explanations. The program works as requested (example output here) and demonstrates how to print out into special templates (e.g. HTML)

Prompt: Can you update the code to send me an email with a synthesis of the headlines by email ?
ChatGPT: Generates this file with explanations. The program works as requested and demonstrates how to interact with more advanced libraries to make an API call to an email server

Such tutoring is invaluable in getting started, and we're excited that it makes programming more accessible to millions of people.

However, a word of caution: don't think that being able to chat with ChatGPT makes you a programmer de facto. There is also a learning curve in knowing what and how to prompt an LLM, similar to learning to use the Google search engine. For example: "explain this piece of code", "how can we make the algorithm more modular", "can we write test cases", etc. The better you understand what you are doing, the more you know how to ask specific prompts, the better you understand... it's a virtuous circle.

We'll continue working with ChatGPT to write better code and we'll give some updates when we figure out how to use it in further unexpected ways.
Shoutout to HITS innovation breakfast     
We'll keep this week's post short and sweet: HITS is hosting an innovation breakfast on the topic ChatGPT applied to insurance. We certainly have our own views on the topic, but we think it will be fun to listen to Samyr and Kilian share their perspective.

See you there!
Data-driven decision-making     
Effective communication between engineers and management is critical for data-driven decision-making. However, it is a challenging task, as the two groups almost don't speak the same language:
  • engineers can be too tech-savvy and not have enough focus on the "so-what" of their analyses, missing the forest for the trees
  • management can overlook the subtleties of the analysis and press for cartoonish statements

Here are three strategies that can help you effectively communicate complex data insights to a non-technical audience:


Use simple language and make analogies


Use clear and concise language, and avoid technical jargon or complex terminology that may be unfamiliar to your audience. Analogies and metaphors can be a powerful tool to explain the data in a relatable way. For example, if you're communicating insights about website traffic, you can use the analogy of a physical store. Unique IPs become new customers, page views are product views, a sales funnel can be customers moving from room to room. Einstein had a powerful quote related to this: "If you can't explain it to a six-year-old, you don't understand it yourself".


1 picture is worth 1000 words


Charts, graphs, and diagrams are the most powerful tools to help you convey complex insights to non-technical stakeholders. They can illustrate trends, relationships or complex patterns in an instant, whereas a text description would be long and laborious. It's important to choose formats that are easy to understand. Where possible, stick to basic line or column graphs and avoid the more complex dual-axes plots or histograms. Keep your layouts clean and concise (and never forget the units!), and use color coding or labels to help distinguish different data types. Also, keep in mind that different visual aids work better for different types of data insights. For example, a line chart may be suitable for displaying trends over time, while a scatter plot may be more appropriate for analyzing relationships between variables.
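
As a small example of the kind of chart we favour (one clean line chart, labelled units, no dual axes), here is a sketch using matplotlib and made-up data:

import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
visitors = [1200, 1350, 1600, 1580, 1900, 2100]

plt.figure(figsize=(6, 3))
plt.plot(months, visitors, marker="o")
plt.title("Website visitors per month")
plt.ylabel("Unique visitors")
plt.tight_layout()
plt.show()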

At Sense6 we use Grafana and Python plots to present our data in an effective way, and so far, we can highly recommend these tools. We already have several blog posts about Grafana, which you can check out here and here, as well as about data visualization use-cases, which you can find here and here.


Focus your message to your audience


A byproduct of a good analysis is the breadth of insights that are available for discussion, which makes it easy to overwhelm your stakeholders. Therefore it is especially important to understand your audience and what they want to get out of the analysis. Focus your message and graphs relentlessly, and resist the urge to show all the details. Different stakeholders have different priorities, concerns, and levels of technical expertise, and therefore will require different types of information. Your communication of the analysis must change depending on who you present to.

A crucial part of Sense6 (and any data-analytics company) is not only to generate strong analyses, but also to deliver impactful high-quality presentations to enable our customers to move forward. We hope the above points serve you well. If you are interested in what else we do, or if you have any suggestions, feel free to reach out.
Learning is a lifelong process     
Simply put, we think of learning as a lifelong process: you stop when you die and not before. This is obviously true in a professional setting - where staying up-to-date with your industry's trends is crucial - but also in a personal one, where the same mindset can be applied to trying to become a better person. A bit of learning every day gradually builds up to a mountain of acquired skills and knowledge. These skills can open new opportunities, help you stay ahead of the curve, or give you a profound sense of satisfaction and progress. But most of all, they give you context. You'll start to discern the inner-workings and underlying dynamics, you'll build your own unique perspective and evaluate developments (the good and the bad) according to your own values. Not those spouted by some arbitrary influencer.

Lifelong learning is especially relevant in the IT industry, where advances can change the world within a couple of years (if even that). As technology continues to advance, it applies pressure on other industries to further evolve and there is always something new to experiment with. The amount and quality of options actually creates the opposite problem: a need for focus. Are you going to try a family reunion in the Metaverse, or write a wedding speech with Chat-GPT, or setup a blockchain for your local library, or try to sell some NFT art, or build a Raspberry-Pi powered smart-clock, or 3D print an enclosure for a Lora-Node to track your pet, or shoot a small special effects movie with some friends using Blender? There's enough interesting stuff for 100 lifetimes of learning.

Not everyone has the same approach to learning. Some require a plan and a clear time allocation. Others combine the skill they want to learn with a use-case relevant to them. Others build small groups and work together, keeping each other motivated. Whatever works for you is fine, but it must enable you to keep the momentum and drive once the novelty wears off (it always does) and the first problems arise.

Finding learning materials has never been easier: podcasts, tutorials, online courses, and webinars are all available on the internet. Why not attend a networking event, engage with experts around you, or join a mentorship program within your company? These resources are often free, varied and flexible. Here are some suggestions that we use at Sense6:

We'll close this post by echoing a relevant quote from Steve Jobs: stay hungry, stay foolish.
Predicting the Future: ARIMA vs. KNN     
Have you ever wondered how businesses and researchers can accurately predict future trends and behaviors? One way is through time series forecasting methods such as ARIMA and KNN. Both methods have their strengths and weaknesses, and in this blog post, we compare these two popular techniques.

ARIMA, or Autoregressive Integrated Moving Average, is a statistical model used to analyze and forecast time series data. It considers three factors: autoregression, or the relationship between a variable and its past values; integration, or the differencing of the data to make it stationary; and moving average, or the relationship between the error terms in a time series. ARIMA is widely used in fields such as finance, economics, and engineering due to its ability to capture long-term trends and seasonal patterns.
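
For the curious, here is a minimal sketch of how an ARIMA forecast can be produced in Python with statsmodels; the series and the order=(1, 1, 1) setting are purely illustrative:

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Illustrative monthly series (invented numbers)
series = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118])

model = ARIMA(series, order=(1, 1, 1))   # (p, d, q): autoregression, differencing, moving average
fitted = model.fit()
print(fitted.forecast(steps=3))          # forecast the next 3 periods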

ARIMA forecasting results
On the other hand, KNN, or K-Nearest Neighbors, is a machine-learning algorithm used for classification and regression. It works by finding the K closest points in the training data to a given input and using their values to predict the output. In time series forecasting, KNN is used to find the K closest historical data points to a current point and use their values to predict the future. KNN is known for its simplicity and flexibility, making it popular in fields such as healthcare and marketing.
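
And here is a minimal sketch of KNN used for forecasting, where the series is turned into (lagged values, next value) pairs and scikit-learn's KNeighborsRegressor does the rest; the series, the number of lags and n_neighbors are illustrative choices:

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Same illustrative series as above (invented numbers)
series = np.array([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118], dtype=float)

lags = 3
X = np.array([series[i:i + lags] for i in range(len(series) - lags)])   # the 3 previous values
y = series[lags:]                                                       # the value that followed

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)
print(knn.predict(series[-lags:].reshape(1, -1)))   # one-step-ahead forecast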

KNN forecasting results
So, what are the pros and cons of each method? ARIMA has the advantage of capturing complex time series patterns and trends, making it useful for long-term forecasting. However, ARIMA requires a stationary time series, which is not always achievable. Additionally, ARIMA can be difficult to interpret and requires expertise in statistical modeling.

KNN, on the other hand, is easy to understand and implement, making it accessible to a wider range of users. It also does not require the data to be stationary, making it more versatile in some cases. However, KNN is sensitive to noise and outliers in the data, which can result in inaccurate predictions. It is also not as effective at capturing long-term trends as ARIMA.

You can see how the two algorithms worked in an anonymized representation of one of our projects.

Although ARIMA is a state-of-the-art forecasting technique, it did not work very well for our use case, whereas KNN provided good results out of the box. We could probably have gotten ARIMA to perform better if we had invested more effort and time.

In conclusion, there is a tradeoff between sophisticated and complex versus pragmatic and simple algorithms. KNN is definitely an effective first step.
Wireguard after 2 years     
When a startup talks about what it does, data analytics and machine learning often steal the show. While these areas can be game-changers for a business, it's important not to forget about the foundational infrastructure that keeps everything running smoothly.

One critical piece of this infrastructure is a secure VPN. A VPN, or virtual private network, creates an encrypted connection between our company's users, services and the internet. This is particularly important in today's remote work environment and for handling confidential information in a secure way. All of our data is encrypted, even within our protected network.

To that end, Wireguard has been invaluable. Take a look at these stats from the past 6 months. Our VPN networks haven't dropped since we made the switch from OpenVPN 2 years ago.



While data analytics and machine learning may be sexier, investing in stable infrastructure is crucial for us to succeed in the long term. By prioritizing security and reliability, we avoid costly data breaches and downtime that would harm our reputation (and bottom line).
NoSQL databases     
NoSQL databases have gained immense popularity over the past decade due to their ability to handle large-scale data storage and processing. Unlike traditional relational databases, NoSQL databases do not rely on a fixed schema and can handle semi-structured and unstructured data. In this post, we'll take a look at some of the popular NoSQL databases and their main concepts.

  • Document-based database: MongoDB
    MongoDB is a popular document-based NoSQL database that stores data in the form of JSON-like documents. MongoDB is schema-less, which means you don't have to define a schema before you start storing data. You can store different types of data in a single collection, making it a popular choice for applications with varying data structures. MongoDB is known for its scalability, high performance, and flexibility.

  • Column-family database: Cassandra
    Cassandra is a distributed column-family NoSQL database that was originally developed by Facebook. It is designed to handle large amounts of data across multiple nodes and data centers. In Cassandra, data is organized in column families, which are collections of key-value pairs. Each column family can have a different schema, which allows you to store different types of data in the same database. Cassandra is known for its scalability, fault-tolerance, and high availability.

  • Graph database: Neo4j
    Neo4j is a popular graph database that stores data in the form of nodes and relationships. It is designed to handle complex data relationships and is commonly used in social networking, recommendation engines, and fraud detection systems. Neo4j is known for its performance, flexibility, and scalability.

  • Key-value database: Redis
    Redis is an in-memory key-value store that is commonly used for caching and session management. It stores data in the form of key-value pairs, where the key is a unique identifier for the value. Redis is known for its speed, high performance, and scalability.
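
As a small illustration of the key-value model, here is a minimal sketch using the redis-py client; it assumes a Redis server is running locally on the default port:

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Store and retrieve a simple key-value pair, e.g. a cached session flag
r.set("session:42", "active")
print(r.get("session:42"))       # -> "active"
r.expire("session:42", 3600)     # optional: let the key expire after one hour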

These are just a few examples of popular NoSQL databases and their main concepts. It's important to note that each database has its strengths and weaknesses, and the choice of the database depends on the specific needs of your application.

One tool that can help you visualize and monitor your NoSQL databases is Grafana. Grafana is an open-source platform that allows you to create custom dashboards and visualizations for your data. It supports a wide range of data sources, including MongoDB, Cassandra, Neo4j, Redis, and many others. With Grafana, you can easily monitor the performance and health of your databases in real time, making it an essential tool for any NoSQL database administrator.

In conclusion, NoSQL databases offer a flexible and scalable approach to data storage and processing. Whether you're storing documents, columns, graphs, or key-value pairs, there's a NoSQL database that can meet your needs. And with tools like Grafana, you can easily monitor and visualize your data, making it easier to extract insights and make data-driven decisions.
1 resource and 2 points     
A lot has been written about AI and its impact on humanity. Instead of covering what you've read before, we thought we'd focus on the new and less discussed. We'll keep this short:
 
  • The resource: Lex Fridman recently interviewed Sam Altman, the co-founder and CEO of OpenAI (the company behind Chat-GPT). In the 2 hour podcast, they cover the company, their approach, the benefits and dangers, what we need to build, what we need to learn, and much more. It's an interesting interview straight from the horse's mouth, recommended: link
     
  • Point 1: A return to nuance. Consider that two of the most consumer-impacting products of the last decades - internet search and social media - have an incentive to show users what they search for. Search engines do a much better job than social networks at balancing their results, but the risk of falling into an echo chamber is significant. An interactive AI has the opportunity to provide context, counter-opinions, and even debate with the user during the interaction. We're very aware that this puts more pressure on removing bias from these systems, a hard problem. Still, a return to more nuanced opinions, more self-doubt, more civilized discourse, is something we'd welcome.
     
  • Point 2: A new way of interacting. Because Chat-GPT provides such impressive results, something that is overlooked is the new user interaction it supports. Consider examples of innovations that dramatically enhanced and/or simplified the interaction abilities of their users: the iPhone, the ATM, mobile banking platforms, ride-sharing apps. We think Chat-GPT also falls into this category: imagine that you want to cook spaghetti pesto and are looking for a special recipe ("give me a recipe for spaghetti pesto with extra garlic"). Before, you would have had to browse recipes until you found the right one. Now you can converse with the AI until it has generated a result tailored to you ("can I add cauliflower to the recipe?") as well as amend it on the fly ("I accidentally spilled milk into the bowl, how do I save it"). Beyond the accuracy of the results, the very way we can interact with these algorithms is different and a marked improvement.
What do you think is not talked about enough ?
The CRISP-DM process     
The CRISP-DM (Cross Industry Standard Process for Data Mining) process provides a structured approach to help guide data analysts and data scientists through a project from beginning to end. It was introduced in this paper in 2000 and has been widely used ever since. There are even articles like this discussing how this process makes better data scientists. It is important that not only the data scientists, but everybody taking part in the project (e.g. stakeholders, managers, etc.) understands the main concept, so if you are interested in the area, we recommend you check it out.

Let's take a look at the CRISP-DM process and its six steps:
  1. Business understanding: The first step of the CRISP-DM process is to understand the business problem that needs to be solved. This involves identifying the key stakeholders, defining the project objectives, and formulating the initial hypotheses. This step helps ensure that the data mining project is focused on solving a specific business problem and aligns with the organization's goals.
  2. Data understanding: Once the business problem is defined, the next step is to gather and explore the data. Data sources are identified and evaluated for their quality, relevance, and completeness. This step helps data analysts and data scientists understand the nature of the data they'll be working with and identify potential data quality issues.
  3. Data preprocessing: In this step, the data is cleaned, transformed, and prepared for analysis. This includes handling missing data, dealing with outliers, and selecting relevant features. This step is critical for ensuring that the data is in the right format for analysis and that any issues are addressed before modeling.
  4. Modeling: In this step, data analysts and data scientists develop predictive models to answer the business questions identified in step 1. This involves selecting appropriate modeling techniques, testing different models, and selecting the best model for the data. This step helps identify the best approach for modeling the data and ensures that the model is accurate and reliable.
  5. Evaluation: Once the model is developed, it's time to evaluate its performance against the business objectives defined in step 1. This involves testing the model on a holdout dataset and comparing the results against the business requirements. This step ensures that the model meets the business needs and performs well in real-world scenarios.
  6. Deployment: The final step of the CRISP-DM process is to deploy the model into the production environment. This involves creating documentation, implementing the model, and monitoring its performance over time. This step ensures that the model is integrated into the organization's processes and that it continues to deliver value over the long term.
By following the six steps outlined above, data analysts and data scientists can develop accurate and reliable models that solve specific business problems and deliver value to the organization.
Using ML in Excel     
The goal of this short blog post is to demystify Machine Learning for the Excel-only users out there. Let's imagine you are responsible for a small post office and have 4 different data points stored in an Excel for each week:
  • how many customers showed up
  • how much time it took on average to serve the customers
  • a rating of how happy your customers were with the service, including the wait time
  • how many hours your employees worked
We're going to pull data from an Excel model and then create a RandomForestRegressor, a simple yet powerful ML algorithm to predict continuous values such as our customer rating (a percentage score).

We're going to take some shortcuts in the spirit of simplicity, so dive in and follow along. The only 2 functions we'll need are below. You can skip understanding them, they aren't central; just copy-paste them (and the imports) into your code (one helps extract data from Excel, the other formats the data into a dataframe so that the ML library can read it).

import plotly.graph_objects as go
import openpyxl
import os
import pandas as pd
import matplotlib.pyplot as plt
from plotly.subplots import make_subplots
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

def get_data_from_excel_row(row_num, wb, column_start, column_stop):
   return_list = []
   for col_i in range(column_start, column_stop + 1):
      return_list.append( wb.cell(row_num, col_i).value )
   return return_list

def create_dataframe_from_list_of_lists(data_list, col_list):
   transposed_tuples = list(zip(*data_list))
   transposed_data = [list(sublist) for sublist in transposed_tuples]
   df = pd.DataFrame(transposed_data, columns=col_list)
   return df


Step 1: We're going to pull data from your Excel file (here's an example) using our get_data_from_excel_row function

excel_path = <path of the excel file on your computer>
excel_file = openpyxl.load_workbook(excel_path, data_only=True)
wb = excel_file["Sheet1"]
dates = get_data_from_excel_row(2, wb, 3, 168)
customers_num = get_data_from_excel_row(3, wb, 3, 168)
avg_handling_time = get_data_from_excel_row(4, wb, 3, 168)
achievement_rating = get_data_from_excel_row(5, wb, 3, 168)
employee_work_hours = get_data_from_excel_row(6, wb, 3, 168)



Step 2: Now that we have our data loaded into Python, we can prepare it as is required by the RandomForestRegressor model:

x_data_list = [customers_num, avg_handling_time, employee_work_hours]
x_col_list = ["customers_num", "avg_handling_time", "employee_work_hours"]
y_data_list = [achievement_rating]
y_col_list = ["achievement_rating"]
df_x = create_dataframe_from_list_of_lists(x_data_list, x_col_list)
df_y = create_dataframe_from_list_of_lists(y_data_list, y_col_list)

This means that we want to train a machine learning model to predict our customer achievement rating using the 3 other values as input parameters: number of customers, handling time and work hours.

Step 3: Finally, we split our data into testing and training sets (an ML model is evaluated on data it has never seen before, which is why we separate the training data from the testing data):

x_train, x_test, y_train, y_test = train_test_split(df_x, df_y, test_size=0.25)
regr = RandomForestRegressor(max_depth=3)
regr.fit(x_train, y_train)
y_predict = regr.predict(x_test)

Note that the above 4 lines are the core of the program; all the rest is about importing, preparing and visualizing the data. Think about it: 4 lines of code is all you need to create an ML model and generate a prediction. We think that's amazing.

Step 4: Let's visualize the prediction accuracy by plotting the actual results versus the predicted test data:

num_vals = y_predict.shape[0]
plt.plot(range(num_vals), y_test, color="black", label="actuals")
plt.plot(range(num_vals), y_predict, color="red", label="prediction")
plt.ylim(0, 1)
plt.legend()
plt.show()
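
If you also want a single number summarizing how far off the predictions are, a quick option is the mean absolute error (a minimal sketch, reusing the variables from above):

from sklearn.metrics import mean_absolute_error

mae = mean_absolute_error(y_test, y_predict)
print("Average prediction error on the rating:", mae)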



We'll be sending the full code to our mailing-list subscribers. We really hope you have a go at this, there's no better way than learning by doing.
Version control - benefits and misuses     
Version control is a practice that allows developers to manage changes made to their code over time. It is an essential part of modern software development and plays a crucial role in ensuring the quality and reliability of code. In this post, we will discuss the benefits of using version control and how it can help developers work more efficiently. We will also highlight common mistakes that developers make when using version control and provide tips on how to avoid them.

Version control provides many benefits for developers. Here are a few examples:
  1. Manage changes made to code over time: Version control allows developers to manage changes made to their codebase over time. They can see when a change was made, who made it, and what exactly was changed. This information is crucial when trying to debug issues that arise later in the development process.
  2. Enabling collaboration among multiple developers: With version control, multiple developers can work on the same codebase simultaneously. They can merge their changes together and resolve conflicts easily, which makes it possible to build large and complex software projects collaboratively.
  3. Allowing developers to experiment with new features without worrying about breaking existing code: Version control makes it possible to create branches off of the main codebase, allowing developers to experiment with new features without affecting the main codebase. If something goes wrong, they can simply delete the branch and start over, without affecting the main codebase.
  4. Facilitating code review and bug tracking: Version control allows teams to review changes before they are merged into the main codebase. This helps catch bugs and other issues before they become major problems.
  5. Making it easy to roll back to a previous version of code if necessary: Version control makes it easy to revert to a previous version of code if something goes wrong. This is especially useful if a bug is introduced in a new code version, as developers can quickly roll back to the previous version while fixing the issue.
Despite its many benefits, developers often misuse or underuse version control. Here are a few common mistakes:
  1. Failing to commit changes regularly, leading to large, difficult-to-merge changesets: Developers should commit their changes regularly to keep their codebase up to date. Failing to do so can result in large changesets that are difficult to merge, leading to conflicts and other issues.
  2. Failing to use branching and merging effectively, leading to conflicts and lost work: Developers should use branching and merging effectively to keep their codebase organized. Failing to do so can result in conflicts and lost work.
  3. Not taking advantage of features such as tags and releases to keep track of important milestones: Developers should use features such as tags and releases to keep track of important milestones in their development process. Failing to do so can make it difficult to track the evolution of the codebase over time.
  4. Not using version control at all, or using it only for a single developer working on a small project: Some developers do not use version control at all, or only use it for small projects. This can be problematic, as it makes it difficult to collaborate effectively and track changes over time.
Version control is a critical tool for software development. It allows teams to collaborate effectively, track changes over time, and roll back to previous code versions if necessary. However, it is important to use version control effectively and take advantage of its features to get the most out of it. By committing changes regularly, using branching and merging effectively, and taking advantage of features such as tags and releases, developers can ensure that they are getting the most out of version control and building better software as a result.

Here is a nice tutorial that we suggest for learning version control with git.
Whisper     
Being able to seamlessly interact with humans on various channels is (should be) a key objective for any customer-facing application. Visuals have long been the preferred medium and will continue to be as increasingly immersive technologies such as augmented reality mature. However, interacting via speech has proven to be more difficult.

To much less fanfare than ChatGPT, OpenAI open-sourced their speech-to-text model Whisper in September of last year. We only recently took it for a spin.
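
Getting started is refreshingly simple. Here is a minimal sketch using the open-source whisper package; the model size and the file name are placeholders:

import whisper

model = whisper.load_model("small")                         # larger models are slower but more accurate
result = model.transcribe("recording_1.mp3", language="de")
print(result["text"])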

We first tested it against a couple of well-enunciated German audio files, located here and here (credit to www.audio-lingua.eu for providing these free recordings). The results were quasi-perfect; note that the AI also understood when individual letters were being spelled out:

Recording 1:
Hallo, wir sind Mara, Luis Unico und ich bin Nico und meine Lieblingsmusik Richtung ist Hip-Hop und Rap. Ich höre einen sehr gerne, einen deutschen Künstler namens Pascha-Nim, den Buch stabiliert man, P-A-S-H-A-N-I-M. Ich höre Generog und meine Lieblingsdeutsche Band heißt Plont, B-L-O-N-D. Ich höre einen Hip-Hop und mein Lieblingskünstler ist Tupac, es schreibt durch zwei P-A-C.

Recording 2:
Was tu ich für die Umwelt? Das Thema Umweltschutz ist in den letzten Jahren ziemlich in den Vordergrund gerückt und hat auch für mich eine wichtige Rolle in meinem Leben eingenommen. Um dem Klimawandel ein wenig entgegenzuwerten, versuche ich möglichst viel bzw. so gutes geht, alle Wege mit dem Fahrrad zu fahren...


We then decided to give it something harder and found sample call-center recordings from pac-biz. We transcribed the first 2 calls (transcription takes approx. 2 seconds per minute of audio) and once again obtained an impressive result, especially given the audio quality:

Recording 1:
It's time to be tomorrow. What time? 10 a.m. They're heading where? I am heading to... What's your name? My name is... Can comedy prosecutors will be there? My tell. Just one. Let me just confirm everything. I'm going to pick up the address in the hospital. I think to end the press call, recruit number three. Pick up time will be tomorrow at 10 a.m. Correct? Yes, that's correct. I am...

Recording 2:
Thank you for calling. This is how can I help? I'm calling the verify that my husband has been assigned to a driver tomorrow for his ride please. Okay, sure, ma'am. Let me check it for you here one moment. Thank you. You're welcome. What's the name of the passenger under the booking ma'am? I'm his name and rather thank you. Let me check that one here. All right, the verify ma'am that will be picked up for tomorrow 11 a.m. I'm going to Chicago. Is that correct? ...


You'll observe that some words look out of place; however, having listened to the recordings several times, it's hard to make out what the correct words are, and the transcription stays very close to the phonetics.

Whisper is released under the MIT license, and we are incredibly grateful to companies such as OpenAI for releasing these open-source models. It allows the community to learn from their expertise, to tinker, to discuss, to improve. It also allows smaller companies like ours to stand on the shoulders of giants and build value-adding services for our customers, which would be impossible without deep R&D pockets.

Next step: we will be testing it on Swiss-German, which is going to be a lot of fun :)
Visualizing forecast information     
In the world of predictions we try to summarize how well we do with some over-arching metrics such as P-values, F-scores, or precision and recall.

While a useful shorthand, these metrics are abstract and don't provide all relevant details (how could they). So as an alternative, it's nice to be able to pull out a time-series database (Influxdb) and visualization platform (Grafana) and plot our predictions alongside the actual results. Sometimes a simple graph can be worth a thousand KPIs.



The above is a trivial example where we pull the weather forecast from a public API, and later the actual weather from sensors managed by our friends at Alvasys. The red and blue values are predictions, and the green values are actuals.
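
For reference, writing such prediction/actual pairs into Influxdb only takes a few lines with the official Python client (a minimal sketch; the URL, token, org, bucket and values are placeholders):

from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086", token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One predicted and one actual temperature, tagged so Grafana can plot them as separate series
write_api.write(bucket="forecasts", record=Point("temperature").tag("type", "prediction").field("celsius", 21.5))
write_api.write(bucket="forecasts", record=Point("temperature").tag("type", "actual").field("celsius", 20.8))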

Now imagine that instead of weather data, you can visualize your ML/AI predictions alongside the actual results. Influxdb and Grafana make it easy to query by time, model type or any dynamic metric you add as metadata. These graphs can be a precious complement to your over-arching metrics and help you form a more holistic understanding of your model performance.
Jupyter Notebook for health insurance analysis     
In this blog post we show how a Jupyter Notebook can be used to easily present data-analytics results. This is useful when relatively simple analysis needs to be done on a relatively small dataset. A Jupyter Notebook is interactive, and if slight modifications are needed they can easily be made in real time.
We are going to walk you through a step-by-step analysis of an insurance dataset. The dataset can be found here. If you want to try out the following tutorial yourself, you need to download the dataset (insurance.csv). It contains the following information:
  • age: Age of primary beneficiary
  • sex: Insurance contractor gender, female / male
  • bmi: Body mass index, an objective index of body weight relative to height (kg/m2); ideally 18.5 to 24.9
  • children: Number of children covered by health insurance / Number of dependents
  • smoker: Smoker / non-smoker
  • region: The beneficiary's residential area in the US, northeast, southeast, southwest, northwest.
  • charges: Individual medical costs billed by health insurance

In order to perform the analysis, first you need to install Python (preferably Python 3) alongside some necessary packages like pandas, numpy, and matplotlib. You also need to install Jupyter Notebook. In case you are a Windows user, we suggest following these step-by-step instructions for installation.
Once everything is installed, it's time to perform the analysis. We will cover the following steps:
  1. Loading the dataset
  2. Data exploration (datatypes, missing data, distributions, etc)
  3. Correlation analysis between the attributes
  4. Hypothesis testing (based on previous findings)
  5. Conclusions

1. Loading the dataset

Before starting the process we need to make sure that the necessary libraries are imported. To import a library in Python you run import <library name>. We are going to need the following libraries:
  • Pandas: used for data analysis and data manipulation.
  • Matplotlib: used for creating static, animated, and interactive visualizations.
  • NumPy: offers mathematical functions.
  • Seaborn: a data visualization library based on Matplotlib.
As a very first step, we need to load the dataset and check that it is loaded correctly. You can do this by running the following commands in a Jupyter Notebook:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df = pd.read_csv("insurance.csv")
df.head(10)

The output of the above command should look like this:


2. Data exploration

Once the dataset is loaded, we can do some exploratory data analysis to better understand the data that we are working with.

2.1 Datatypes

As a first step, we check the number of rows in the dataset and the datatype of the columns. You can check this by running the following command:
df.info()
The output should look something like this:


2.2 Distribution of numerical attributes

For visualizing the distribution of the numerical attributes we will use the following function:
def distribution(x):
  sns.histplot(df[x])
  plt.title(f'{x.capitalize()} Density Distribution')

We will call this function by replacing x with the corresponding numerical column names: age, bmi, and charges. The distribution of numerical attributes is the following:


2.3 Distribution of categorical attributes

Visualizing the distribution of the categorical attributes is a bit simpler, because we simply count the occurrences of each attribute's values. We will use this line of code:
sns.countplot(x = x, data = df)
We will again replace x with the corresponding categorical column names: sex, smoker, region and children. The distribution of categorical attributes is the following:


3. Correlation analysis between the attributes

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate in relation to each other. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.
Correlation analysis can be done only among numerical attributes, so firstly, we transform the categorical attributes using a simple mapping. We map each categorical value of each attribute to a number and just replace the values with the corresponding number. This can be done by the following piece of code:
for x in ['sex', 'children', 'smoker', 'region']:
  df[x] = df[x].astype('category')
cat_columns = df.select_dtypes(['category']).columns
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

Then we create the correlation heatmap with the following code:
correlation = df.corr()
sns.heatmap(correlation, annot=True, cmap = 'summer_r')

The result is the following:

4. Hypothesis testing

A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. You can read more about it here. We are going to visually investigate the following 2 hypotheses:
  1. Do the charges of people who smoke differ significantly from the people who don't?
  2. Does the BMI of males differ significantly from that of females?


4.1 Do the charges of people who smoke differ significantly from the people who don't?


As you can see in the graphs above, the charges of people who smoke indeed differ from the people who don't.
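
If you want to reproduce such a comparison yourself, a minimal sketch is below; run it before the categorical mapping in step 3 (or map the codes back to labels first) so the smoker axis shows yes/no instead of 0/1:

import matplotlib.pyplot as plt
import seaborn as sns

# df is the dataframe loaded in step 1
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(x="smoker", y="charges", data=df, ax=axes[0])        # charges split by smoker status
sns.histplot(data=df, x="charges", hue="smoker", ax=axes[1])     # charge distributions per group
plt.show()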

4.2 Does the BMI of males differ significantly from that of females?


As you can see on the graph, the BMI of males does not differ from that of females in our data.

5. Conclusions

The conclusions gathered from the dataset during our analysis are the following:
  • The charges of people who smoke indeed differ significantly from the people who don't.
  • The BMI of males does not differ significantly from that of females in our data.


As you can see, with some very simple steps, you can easily gather meaningful information from a dataset. Using simple Python commands you can do similar analyses on your own dataset.
If you would like to receive our Jupyter Notebook with some extended analysis, you can subscribe to our newsletter here.
Python for business users     
We think everyone should learn to program (at least a bit), and the secret to starting is finding an achievable project that is useful to you. For this mini-tutorial, let's turn to business users who increasingly have to deal with large data files (some examples here).

If you have Python installed then tag along (if you don't, quickly install it and also tag along). Type "cmd" in your Windows Start Menu to open a terminal, navigate to a directory with large files (in our case we downloaded a large CSV into the download folder) and follow the steps:


Step 1: open a file
Python provides a native library to interact with your operating system and its files. The following three commands list all files in a directory, and then open a specific file:

import os
os.listdir(".")
file_for_analysis = open("Data7602DescendingYearOrder.csv")




Congrats: although nothing visual happened, you've successfully loaded the file into the variable 'file_for_analysis'. Did you notice it was instantaneous? Now let's find out what this file contains.


Step 2: first explorations

ctr = 0
for l in file_for_analysis.readlines():
   if ctr < 10: print( l )
   ctr = ctr + 1

print("Total amount of lines:", ctr)




OK, so what's going on here? First we initialized a counter variable 'ctr'. Using the 'readlines()' command, we loop through the file line-by-line and increment the counter for each line. We print the first 10 lines and print the final ctr value after going through all lines in the file. In our case it's 5,985,365 lines, and executing the above lines takes about 7 seconds.

If you think you can do the same with Excel, this is what we got after 45 seconds of waiting for Excel to load the file:



This is not meant as Excel-bashing: we love Excel and use it a lot. However, it's not suited to big data files, and 5 lines of code have enabled us to identify the size of the file and see some sample information. Let's keep going.


Step 3: more details

Let's say we want to understand how many different years there are in the data. Once we've split a line at the commas into a list of values, we can check the text at the 3rd position. In most/all programming languages, the index of a list starts at 0, so the third item is at index 2. This is something you quickly get used to, so don't worry about it.

file_for_analysis.seek(0,0)     # optional line to re-initialize the lines for reading
years_set = set([])

for l in file_for_analysis.readlines():
   line_parts = l.split(",")
   year_in_row = line_parts[2]
   years_set.add( year_in_row )

print("Different years available:", years_set )


We create an empty set to store unique values. We add the year from each line to the set, and then print out its contents. If you want to be fancy and store the years in order, you can transform the set into a list and then sort it like so:

years_list = list( years_set )
years_list.sort()
print(years_list)





Step 4: let's get out of here

OK, we want to analyze the 2021 data in a tool we know better. But how do we extract the data for 2021 only? We can tweak our code to copy the lines with the 2021 date to a new file like so:

file_for_analysis.seek(0,0)     # optional line to re-initialize the lines for reading

new_file = open("2021_data.csv", "w")

for l in file_for_analysis.readlines():
   line_parts = l.split(",")
   year_in_row = line_parts[2]
   if year_in_row == "2021" or year_in_row == "year":
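      # the second condition keeps the header row (where the year column contains the text "year")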
      new_file.write( l )

new_file.close()


Yay, only 274,329 lines in the new csv file, which Excel will be able to open without a problem.


Putting it all together

As the data around us continues to grow in size and quantity, we think everyone needs to find a way to explore large data-sets. We're skimming the surface here; things become a lot more exciting once you start tapping into data processing and visualization libraries, which are not that hard to use. We intend this as a fun little project for you to get to know Python better, and a help if you have a large file you cannot open.
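
For reference, once you're comfortable with the basics, the pandas library can do the same filtering in a few lines. A minimal sketch, assuming the file's header names the year column "year" (as in the file above):

import pandas as pd

df = pd.read_csv("Data7602DescendingYearOrder.csv")
df[df["year"] == 2021].to_csv("2021_data.csv", index=False)
print(len(df), "rows in total,", (df["year"] == 2021).sum(), "rows for 2021")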

Finally, several years ago when some of us were consultants, we put together various functions into an executable and shared it within the team (split a file, random sample of lines, filter, etc.). It still works, so anyone who needs a quick solution is welcome to use this .exe file. If you don't trust the .exe file, you can use the source code here (note that it was programmed in Python 2.6; Python is now on the 3.x version, which provides better, simpler, faster functions).
Text classification     
In our previous post, we talked about NLP and its exciting everyday applications. This time we present one specific use case, text classification.

Text classification is a method of automatically assigning predefined categories or labels to a text. This process is commonly used to classify text data such as news articles, social media posts, and customer feedback.

For example, a text classification algorithm could be used to sort emails into different folders, such as "spam" or "not spam", or to categorize news articles by topic, such as "sports," "books," or "entertainment."

The process of text classification typically involves three main steps:

  1. Preprocessing: In this step, the text data is cleaned and formatted so that the algorithm can use it. This might include removing special characters, converting all the text to lowercase, and removing stop words (common words that are unlikely to be useful for the classification task, like "the," "and," and "is").

  2. Feature extraction: In this step, the algorithm extracts meaningful features from the text data that can be used to classify it. These features might include the frequency of certain words or phrases, the presence of certain keywords, or even the sentiment (positive or negative) of the text.

  3. Classification: In this final step, the algorithm uses the features extracted in step 2 to classify the text into one or more predefined categories or labels. This is typically done using a machine learning algorithm, such as logistic regression or neural networks.
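
To make these three steps concrete, here is a minimal scikit-learn sketch; the toy emails and labels are invented for illustration:

from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy training data (invented for illustration)
texts = ["win a free prize now", "cheap pills, click here",
         "meeting moved to 3pm", "please review the attached report"]
labels = ["spam", "spam", "not spam", "not spam"]

# TfidfVectorizer handles preprocessing + feature extraction, LogisticRegression does the classification
classifier = make_pipeline(TfidfVectorizer(lowercase=True, stop_words="english"), LogisticRegression())
classifier.fit(texts, labels)
print(classifier.predict(["claim your free prize"]))   # most likely ['spam']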

Text classification is a widely used technique in NLP that can provide valuable business insights.

Overall, text classification is a process used to sort and categorize text data in a way that can be easily understood and used. It's a powerful tool for understanding and making sense of large amounts of text data.
Natural Language Processing (NLP)     
Have you ever used a virtual assistant like Siri or Alexa? Or maybe a language translation app to communicate with someone who speaks a different language? If so, you've experienced the power of Natural Language Processing (NLP).

NLP is a field of artificial intelligence that enables computers to understand, interpret, and generate human language. It makes it easier for people to communicate with each other and with machines, and it has a wide range of everyday applications. Here are the most important ones:
  • Email filtering (ex. spam)
  • Language translation
  • Smart assistants
  • Search engines
  • Sentiment analysis (ex. customer satisfaction analysis)
  • Chatbots (ex. ChatGPT)
In short, NLP is changing the way we communicate and extract insights from text data. It's making our lives easier and more efficient.

You may wonder why NLP is important to us. Well, let's say you have a bunch of scanned documents and you want to extract insights from them. To do so, you first need to extract the text. That's where OCR comes in - it's a tool that can do just that. But extracting the text is only the beginning. To analyze that text and gain insights from it, NLP is needed. It's the next step in the process, and it's what allows us to really unlock the power of data. NLP is the key to understanding and making sense of all that extracted text, and it's what allows us to take data analysis to the next level.
Augmented Reality for Kids     
Before we start explaining the how, it's probably easiest just to try it out. If you want to try our sample treasure hunt, you first need to scan the following QR code with the camera application of your phone.



... and then, once the page has loaded (you'll need to grant access to your camera), simply hover over any of the following markers:


You should see something similar to the screenshot below. Whenever the page recognizes a marker, it displays a riddle as well as a hint image. You can print out these markers, hide them in the fridge, the shoe cupboard and under a pillow respectively, and let your children have fun. We will create a few more markers and riddles and send the full treasure hunt to those signed up to our newsletter.



The how:

It takes a few minutes to generate something like a treasure hunt. Our Augmented Reality dashboard lets you configure what markers you want to use and the response to them. In the below screenshot, you can see how we've configured 3 markers to trigger the 'plane' reaction (the red square), the text overlay, as well as the 'hint' buttons at the bottom of the screen. It's very easy to add more markers, more buttons, or change the Augmented Reality visualization completely.



We'll cover in a future post some more advanced uses of Augmented Reality. For example how to show dynamic data (think key metrics from a machine) or more interactive content (video how-to for a complex program). That's for later, for now enjoy the treasure hunt.
Welcome 2023     
It's back to work for us and we're still getting used to writing the year 2023. There's a lot we want to share with you in the next few days on OCR, Augmented Reality and examples of how we use APIs. For now, we're excited about the new year and we wish you a great start.
Applications of OCR     
In one of our previous posts, we discussed what OCR (Optical Character Recognition) is. Now let's look at some applications.

Here are a few common use cases of OCR across industries:
  • Healthcare: OCR is used to extract patient information from scanned documents and electronic medical records, such as insurance claims and medical reports.
  • Finance: OCR is used to process invoices, extract data from financial statements, and process check and credit card payments.
  • Legal: OCR is used to digitize and organize legal documents, such as contracts and court filings.
  • Government: OCR is used to process passports, visas, and other identification documents, as well as to extract data from census records and voter registration forms.
  • Retail: OCR is used to process and track customer orders, as well as to extract data from receipts and invoices.
  • Manufacturing: OCR is used to extract data from production reports, quality control documents, and shipping manifests.
  • Transportation: OCR is used to process shipping labels and track packages, as well as to extract data from vehicle registration and maintenance documents.
  • Education: OCR is used to digitize and organize student records, such as transcripts and enrollment documents.

At Sense6, we have the most expertise using this technology within the insurance industry. Here are a few more applications:

  • Claims processing: Extracting information from insurance claims forms and supporting documentation, such as medical bills and repair estimates. Verifying the accuracy of the information on the claims forms. Identifying and flagging potentially fraudulent claims. Routing the claim to the appropriate department or adjuster for further review
  • Underwriting: Extracting information from insurance applications and supporting documentation, such as medical records and driving records. Verifying the accuracy of the information on the application. Evaluating the risk of insuring a particular individual or business based on the extracted data to determine the appropriate premium for the insurance policy
  • Other insurance applications: Extracting data from policy documents and other legal documents. Digitizing and organizing paper-based records, such as policy manuals and customer files. Generating reports and analytics based on extracted data.


We hope the above applications have given you ideas of how to use OCR (and subsequent classification analytics) to improve your workflows, whether you are a small business owner looking to streamline your operations or a large corporation looking to digitize your records. We believe in the power of OCR!
Bye bye 2022     
We often read about industries or companies being at "inflection points". Well 2022 truly was one for Sense6. It's the year we hired our first employees. It's the year we established our license model. It's the year we built our Swiss IT infrastructure in support of our solutions.

This is not a victory lap, quite the opposite: the acknowledgement of first steps taken in a new arena. Many more steps need to be taken, and 2023 will come soon enough with its intensity and deadlines and successes and failures.

But for now, we feel fortunate for 2022. We feel fortunate for collaborating with great companies and people. We feel fortunate to love the work we do. We feel fortunate, dear reader, that you take the time to read these lines. And so we wish you a relaxing holiday and a happy new year :)
Optical Character Recognition (OCR)     
Have you ever had to manually transcribe a document or image into text? It's a tedious and time-consuming process, and prone to errors. That's where OCR (Optical Character Recognition) comes in.

OCR is a technology that allows you to extract text and structure from scanned documents and images. Machine learning algorithms trained to identify patterns in the shapes and arrangements of characters extract text and labels from 'chaotic' data sources. Accuracy has improved greatly in recent years, making it possible to digitize and organize your paper documents, PDFs, and images in a matter of seconds, even in the case of hand-written notes.

So why is OCR cool?

For starters, it's more efficient. It's much faster and produces fewer errors than manual transcription, which is a thankless task anyway, so your employees will be happy to automate it away.

A second reason is that it creates transparency. You can finally search and organize your documents using their entire content, not just the few manually created labels that inevitably restrict you in the future. Need to classify customer application PDFs by street name? No problem (regular expressions are your friend in such cases). OCR allows you to sort and categorize your documents dynamically.
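
As a small illustration, here is a minimal sketch of such a regular-expression filter over OCR output; the text and the street pattern are invented:

import re

ocr_text = "Applicant: Jane Doe, Bahnhofstrasse 12, 8001 Zurich"   # invented OCR output

# Capture a street name ending in -strasse/-gasse/-weg, followed by a house number
match = re.search(r"([A-Za-zäöüÄÖÜ\-]+(?:strasse|gasse|weg))\s+(\d+)", ocr_text)
if match:
    print("Street:", match.group(1), "Number:", match.group(2))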

The third and most exciting reason is that OCR opens the door to new opportunities. You can now create services, models or processes that use all the data available to you or your company, not just the subset stored in a database. In that sense, OCR is a key enabler and foundation for break-through services.

Last but not least, OCR is environmentally friendly. By supporting the digitization of documents, it helps reduce the amount of physical paper printed, stored, and physically sent around.
Creative uses of ChatGPT     
Unlike GPT-3, ChatGPT was released to the entire internet. Here are a few examples of creative uses:

  • To get therapy: ChatGPT provides generally good advice when asked personal questions. We tried queries like "I am overly critical of others and that makes me sad", or "Is being in the moment really that important? What's wrong with planning ahead?" and the answers were very good, considering they were free and instantaneous.


  • To do your homework: We don't really condone this behavior, but ChatGPT can solve basic math problems. We tried examples such as "My sister was half my age when I was 10. Now I am 39, how old is my sister ?" and it got the answer and explanation right. It got the same answer wrong a few minutes later, but we weren't able to reproduce the behavior.


  • To write movie or book scripts: We tried generating movie scripts and received results of varying quality. It seemed the AI performed better if it recognized the context. For example: "Can you write me synopsis for a thriller movie that takes place on an oil rig. It should involve a storm and a great white shark".


  • To write code: The implications of this one still blow our minds. We tested the example "Can you write a basic Python function that tests for palindromes in a sentence?" and it got it right for the sentence as a whole. When we were more explicit in broadening the definition, it got the answer wrong (in a very confident way). It's also able to explain code, which would make it invaluable for reverse engineering documentation from legacy systems.


  • For advertisement and marketing: Short stories and slogans seem to be one of ChatGPT's fortes. If we were working in a marketing agency, we'd definitely be tapping into its creativity if only to get new ideas. Try examples such as "Can you write a brutally honest slogan for a luxury watch brand? It should be funny too".
As a final comment, throughout our experiments, we did notice that ChatGPT makes some mistakes. These are hard to identify as (i) there aren't many of them and (ii) ChatGPT is able to formulate sketchy responses into fluent and self-confident answers. So this is not an AI that you should blindly trust.

We recommend everyone try ChatGPT: explore new interactions and tell us about your exciting dialogues. And also take a step back to marvel at the breadth of different queries that this AI can respond to.

We'll email the transcripts of our Q&A with ChatGPT to those signed up to our newsletter. Please sign-up here if you want to receive the transcript of the full interaction (we'll send it by email later today)
Thank you HITS!     
As part of House of Insurtech Switzerland (HITS), we have access to many possibilities that are critical for young startups such as ours. We can interact with like-minded entrepreneurs, pitch our solutions to relevant corporates, and get insights from the experienced HITS leadership. Besides the professional advantages, HITS offers us great open spaces with a coffee machine for work hours and a fridge for cold beverages after work hours.

We are glad to be part of the HITS community! Thank you HITS!
Newsletter     
Fellow AI enthusiasts, free thinkers, industry leaders, geeks, and anyone interested in technology and its impact on our lives: we're setting up a newsletter where we hope you'll sign-up.



The newsletter is intended as an in-depth add-on to our blog, where we can dive into interesting details, provide our audience with some goodies and behind-the-scenes peeks, as well as give you the space to respond.

For example, we might outline what we think of OpenAI ChatGPT on the blog, but only those in the newsletter will receive the Christmas present recommendations for their family generated by the AI (as well as the bonus fondue joke).

Did we help you find some present ideas? You're welcome, and you should sign-up.

Fun fact: the airplane image was also generated by AI.

We commit to not sending advertisements and not sharing your emails with any third-parties, and you can unsubscribe at any time. We hope to see you around :)
Data visualization with Grafana     
While working with a growing amount of diverse data, we quickly realized that we needed a powerful visualization tool, both for ourselves and our customers. Enter Grafana, a Swiss army-knife of data-querying dashboard-creating capabilities. Here are a few reasons why we think it's very cool.

Firstly, Grafana is open-source, so you aren't limited in terms of licenses per user or data source. Secondly, it supports a variety of data sources: from Influxdb, to SQL, to static CSV or text files. Thirdly, it's quite user-friendly in helping you create the right query and select the right graph.



Fourthly, there are a lot of visualization options, ranging from standard time-value graphs to more complex geomaps or heatmaps.



Finally, it plays nicely with user authentication and access management solutions, meaning that you can keep your user management separate and Grafana will provide admin / edit / read-only rights depending on your configuration.

We've already witnessed a couple of instances where a solution became clear (or a problem apparent) without the need for complex analysis; having the right metrics displayed in Grafana was enough. We'll continue to expand its use for our customers as well as for our internal needs (test coverage, prediction model accuracy, etc.). A great tool we recommend.
Kubernetes and parenthood     
In the course of moving our applications to Kubernetes, we noticed a few analogies to parenthood.

Applications start small and need little space to be happy: a small bare-metal server is more than enough. It's easy to access them, they're usually at the same place you left them. A quick ping will check they're doing OK, and if you want to find out what's going on behind the front-end then a brief ssh is all you need.

As they start to grow these enclosures become too small. Initially you applaud their development and help them explore new geographies with a couple of new servers. Complexity is almost the same, and it actually helps you address concerns such as disaster-recovery location diversification.

But they keep growing. The more they want to break free, the more you doubt the next move. It all seems so complicated. What if they get lost and have a problem? Will you be able to help? Will you even find them? Finally, you take the leap and let them get onto a container orchestrator.

Life changes... Communication is harder: now you need to announce yourself with ingresses and interact through weird deployment facades. You never know where they are, you're not even sure how many there are, it seems like a small miracle when they respond after 3-4 service bounces. They're more distant than before. Gone are the days when you would access their command line. Now it's all about keeping secrets, predefined configmaps and resource quotas.

But goodness: the things they can do! They scale at the flick of a variable, they become resilient to hardware failures, they seamlessly jump from local machine to private cloud, they're never off even when they upgrade. In short they turn into something you don't quite comprehend, that scares you a bit, but that also makes you proud.
War in Ukraine     
We've been silent on social media and on our blog over the last few months, mostly because of how appalled we've been by the Russian government’s brutal and senseless invasion of Ukraine. A lot of day-to-day stuff just seemed secondary compared to Ukraine's fight for survival.

We'll start writing again about what we love to do: technology and data analytics. These remain secondary to the challenge to the liberal Western values that underpin the startup ecosystem in which companies like Sense6 can exist. We continue to follow the conflict closely and support Ukraine in whatever small way we can.
The humility of entrepreneurship     
1. It doesn't matter if a job is glamorous, only if it is necessary.

2. Be stubborn, sometimes listen to others. Seek the balance between confidence and humility.

3. Flexibility is more important than having a strategy.

4. Build it! Fail. Improve. Be resilient to failure, it's what makes you grow.

5. Be curious, don't settle. Aim between average and perfection.

6. Have a strong partner in your personal life.

Start of the 3D printing age     
We've been experimenting with IOT edge protocols and sensors over the past months and one thing quickly became a problem: the lack of enclosures for our custom devices.

When you are putting together a Raspberry Pi with a light & humidity sensor, a Lora WAN antenna, 2 rechargeable batteries and some 5G connectivity, then there aren't a lot of cases out there for your prototype.


Say hello to "Porcu-Pi", a beast of connectivity

So when we ordered the TOFU board from the Swiss company Oratek (@Oratek: the integrated SIM card is a game-changer, thanks!), they offered a case to go with it. The options were either to pay CHF 35 to receive the case by post, or CHF 8 for the digital 3D model files only. Sense6 has a 3D printer, so we selected the second option.

So there it is: the revolutionary benefits of 3D printing, delivered to us for the first time and without fanfare. Within a few hours we had the parts printed and assembled: no shipping, no delays, and a result completely custom to our requirements. We modified the 3D model, making the enclosure taller to accommodate the additional battery pack and 5G antennas.

Perhaps the most interesting realization was that the process felt simple and familiar. As if it had been around for the past 10 years.
"First mile" data extraction with LoraWAN     
For real-world AI applications, your ability to access more relevant data is just as important as your ability to deploy sophisticated machine-learning models. That's why at Sense6, we also specialize in some IOT-related technologies (MQTT, Wireguard, deploying cool sensors). So what does this have to do with LoraWAN?

The situation: a lot of use-cases need to react in real-time to sensor data. To do that, you need to quickly get the data out of the sensors and onto a secured network in order to analyze it. If your sensors are indoors and have access to WiFi, or they're on some fancy robot and you don't mind adding a GSM cellular connection (and have the money and electric power to spare), then maybe LoraWAN is not for you. But in other cases things are not so simple: it's hard to get the data out of the sensors, which is what we mean by the "first mile".

The problem: imagine a logistics company wanting to monitor shipping containers, or a farmer wanting to survey crops in open fields, or a company wanting to monitor its office room temperatures but without the resources to add internet everywhere.
In these cases, it's not practical to use traditional communication protocols, usually for 3 reasons:
  • Too costly: wiring up a warehouse or an office with Ethernet requires significant investment
  • Low reach of WiFi: while WiFi would be great, its range is a few dozen metres indoors and maybe a couple hundred metres outdoors. Our use-cases require hundreds of metres indoors and kilometres outdoors
  • Power constraints: most of these nodes need to run outdoors and therefore on battery. We can't be recharging batteries every week, nodes must be power efficient

A solution: LoraWAN. In a nutshell, LoraWAN enables you to connect cheap sender and receiver devices across many kilometres outdoors and hundreds of metres indoors. These Lora devices can run for many months on a single battery charge (we haven't tested this ourselves yet). There is a trade-off, however: you are very restricted in the amount of data you can send. Forget sending images or even long texts; it's best to restrict Lora messages to a few numbers. Tough, but not a showstopper: a heat sensor detecting a fire doesn't need to send a newspaper article, a single bit could suffice (fire = 1, or something like that).
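
To make the "few numbers" constraint concrete, here is a minimal Python sketch of how a node might pack its readings into a handful of bytes. The field layout, scaling factors and function names are our own illustration, not part of the LoraWAN specification or of any production firmware:

import struct

def encode_payload(temperature_c: float, humidity_pct: float, fire_alarm: bool) -> bytes:
    # Pack three readings into 4 bytes: temperature in 0.1 degC steps (signed
    # 16-bit), humidity in whole percent (1 byte) and a 1-byte alarm flag.
    temp_raw = int(round(temperature_c * 10))
    hum_raw = int(round(humidity_pct)) & 0xFF
    return struct.pack(">hBB", temp_raw, hum_raw, 1 if fire_alarm else 0)

def decode_payload(payload: bytes):
    temp_raw, hum_raw, alarm = struct.unpack(">hBB", payload)
    return temp_raw / 10.0, float(hum_raw), bool(alarm)

msg = encode_payload(23.4, 56.0, False)
print(len(msg), decode_payload(msg))   # 4 bytes on the wire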

It's not glamorous, but we've got a strong signal here... and we think that's amazing.

Our first experiment: we tested prototype #1 during a storm in the countryside. Strong rain and howling winds (not very apparent in the photo) couldn't keep us indoors. Right away the "sender" node (half the size of a credit card) established a strong connection with a "receiver" gateway (the size of a pocket book) more than 1 kilometre away, despite the receiver being indoors and a few trees and houses separating us. So cool!

We'll be testing the stability of these devices in the months to come. I think you can tell we're excited about this technology: it opens more doors for powerful AI applications.
Show don't tell #1     
To follow this demo, you'll need both a computer screen and your phone. Keep your computer on this page. Take out your phone, start the camera app and point the camera at the QR code on the left. After a few seconds a link should appear; click on it to load the augmented reality demo on your phone (preferably using the Chrome browser).



If for some reason the below instructions don't work on your phone, here's a video showing what should have happened ;)






Point your phone camera at this first marker: it should automatically start a video overlaid on top of the marker. We can imagine using this in marketing (creating engaging advertisements) or in construction (visual explanations for complex machinery).

It's also interesting to think that different videos can be started depending on the context, for example overriding with an evacuation video in case of a fire.

(Credits to Blender for the video used)








This second marker demonstrates the ability to display dynamic data, calculated on the spot or read from a database or sensor.

A button should also appear at the bottom of the screen that can be used to reference further relevant content, for example user manuals or detailed reports.








This last marker shows that you can trigger other apps. If you press the button that appears at the bottom of your screen, it should open your maps application and point you to ETH Zurich. Admittedly this is a basic interaction, but more complex ones are possible via RESTful APIs.

Also note that the marker images don't have to be blobs of color or black-and-white boxes, they can contain text.


We're excited about this technology and are busy developing our first commercial applications. More coming soon. Please get in touch if you want a more detailed demonstration.
The urgency of (some) cloud computing     
A great many articles have been written on innovation. Rather than contribute to the overall body of literature, we'd like to focus on one important enabler: access (or lack thereof) to cloud computing resources.

Why nitpick on such a specific topic? Because we don't see it mentioned enough and time and time again it's one of the greatest hurdles we face, especially in the financial services industry.

If we segment companies into 2 distinct groups - the "haves" and "have-nots" of cloud computing - we observe drastically different expectations with respect to the speed of IT innovation.

When we ask a "haves" company (approx. 30% of those we know) if they can give us access to a prototyping environment, they'll usually shrug and say "Sure. Is tomorrow OK? How much RAM and storage space do you need?" And that's great, and we can start the next day.

When we ask the same question of a "have-nots" company, the response is a cold, icy stare followed by something like "Sure, why don't we schedule a meeting in 10 years to finalize the details; it will likely cost $20,000, and we first need to get this huge bureaucratic process out of the way." And that's not great, and we're basically stuck.

We should back up here and explain why cloud computing is important. In a nutshell it provides the necessary infrastructure for innovation: the aforementioned rapid prototyping environment.
IT innovation doesn't happen in end-user applications such as Microsoft Excel or PowerPoint. You need server technologies to build 24/7 responsive services, sustainable data repositories, intuitive web interfaces and embedded workflows. Prototyping servers are no longer physical machines that you store in office corners; rather, they are virtual and served from a company's cloud.

Imagine an application that receives an address list via email, corrects and validates these addresses using a Swiss address API, and finally generates an output file ready for back-end lead-management systems. This is a typical use-case for a short innovation cycle (one we recently completed): impossible to run on a laptop, but deployable on a virtual server (or container) in 5 minutes.
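
As a rough sketch of how small such a prototype can be, here is an illustrative Python version of the validation step. The validate_address() helper and its endpoint are hypothetical placeholders (not the actual Swiss address API we used), and the email intake is omitted:

import csv
import requests

def validate_address(raw: str) -> dict:
    # Hypothetical REST call; the real endpoint, payload and credentials differ.
    resp = requests.post("https://api.example.ch/validate", json={"address": raw}, timeout=10)
    resp.raise_for_status()
    return resp.json()

def process(input_csv: str, output_csv: str) -> None:
    # Read raw addresses, validate each one, and write a file ready for the
    # back-end lead-management system.
    with open(input_csv, newline="") as src, open(output_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["name", "address", "valid"])
        writer.writeheader()
        for row in reader:
            result = validate_address(row["address"])
            writer.writerow({"name": row["name"],
                             "address": result.get("normalized", row["address"]),
                             "valid": result.get("valid", False)})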

Why not just use a public cloud provider such as AWS if the company isn't ready? Financial services companies are rightfully restrictive about their customer data; careful design is therefore needed to set up a cloud that provides secure virtual environments before it can be used for prototyping.

We're not advocating a complete shift to cloud computing (not required), nor a particular mix of public versus private cloud (it depends), nor a specific technology to get there (we like Kubernetes/Docker; there are alternatives). What we are advocating is that any bank or insurer should be able, within days, to provide a secured virtual environment where its teams can prototype and experiment. Cloud computing technologies have matured drastically over the last 10 years (fueled by the rise of AWS, Google Cloud and many others), and configuring a functional and secure private/public cloud is within the reach of any company.

Cloud computing gives you the means to prototype, to experiment, to innovate in a secured environment. Please, if you are a company where securing cloud resources takes more than 2 weeks: someone in IT needs to tackle this head-on. Competitors are moving, customer expectations are rising, and Excel can only get you so far. If you need help getting organized and setting things up, please contact us.
Boosting search result accuracy with ML     
We've been super busy, so sorry for the low number of blog posts. Here's another tech one lifted from production: we'll keep it short and sweet.

Say you've got 800'000 recipes stored in a database as pure text (no structured data) and you want to automate a workflow that, based on a customer-submitted description, pulls out exactly the right recipe (or the best top 10 recipes). Database text-search tools aren't designed to be accurate for "fuzzy" queries on large data volumes. Traditional keyword approaches won't help either: how would you know which keywords to focus on? And to make things even more interesting, some of the 800'000 recipes are updated, removed or added every month, so static mappings won't help; and some recipes use the metric system while others use the imperial system.

Enter machine learning, more specifically LSTM models. You define 40 recipe categories (soups, desserts, salads, etc.) and sort a few thousand recipes into these categories. Using this data you can train an LSTM model to predict the category of a recipe based on its text description... and the same LSTM can help you classify which category a user is looking for. So when a user enters a query, you first identify the recipe category and then focus your search on that category alone. In our example, and assuming recipes are uniformly distributed across categories, this would reduce the search space to roughly 20'000 recipes, a much easier problem than before.
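
For the curious, here is a minimal Keras sketch of the classification step (Keras is our framework choice for this illustration; the post doesn't prescribe one). The tiny stand-in texts and labels, layer sizes and example query are assumptions made for the sketch, not a production configuration:

import numpy as np
import tensorflow as tf

NUM_CATEGORIES = 40
# Tiny stand-ins for the few thousand manually sorted recipes
texts = ["pumpkin soup with ginger and cream", "chocolate cake with vanilla frosting"]
labels = [0, 1]   # e.g. 0 = soups, 1 = desserts

# Turn free text into fixed-length integer sequences
vectorizer = tf.keras.layers.TextVectorization(max_tokens=20000, output_sequence_length=200)
vectorizer.adapt(np.array(texts))
X = vectorizer(np.array(texts))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(20000, 64),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(NUM_CATEGORIES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, np.array(labels), epochs=5)

# At query time: classify the user's query first, then search only that category
query = vectorizer(np.array(["a light creamy soup for a cold evening"]))
category_id = int(np.argmax(model.predict(query)[0]))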

There are further steps to the algorithm (think "word2vec" to identify the best keywords) but we'll stop here: it already illustrates how including machine learning at distinct points in an automation process has a significant impact. The above use-case is something we are dealing with at one of our clients (not with recipes) and the high accuracy we achieve wouldn't be possible without machine learning.
Triangulating data with machine learning     
We’ve had a little breakthrough at Sense6 recently, solving a problem that had been occupying our attention for a few months. Ready for a small technical rant? We’ll keep things as simple as possible and start with an example.

Suppose you’re developing your own OCR engine and have trained 3 neural networks with different specializations. The first is trained for dealing with black-and-white documents, the second for very colorful documents, the third is a jack-of-all-trades performing reasonably well whatever the situation.

Let’s further assume that the best overall performance comes from the “jack of all trades”, with an accuracy of 75%. However, in the 25% of cases it gets wrong, you’ve observed that at least one of the specialized models gets the right result. If you were able to always select the best result from your 3 models, your accuracy would skyrocket.

Programming static rules atop 3 neural networks is a recipe for disaster. Therefore you decide to create a second type of neural network whose goal is to triangulate the best result from the outputs of the 3 underlying models. You set everything up, excitedly run your first test-cases, wait for accuracy bliss and … nothing happens. Still stuck at 75%. Cue weeks of debugging 2 layers of neural networks, an activity best described as searching for a needle in a haystack... in the dark... with mittens on.
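
For concreteness, here is a bare-bones Python sketch of that second layer using scikit-learn. The randomly generated stand-in features and the network size are invented purely for the illustration:

import numpy as np
from sklearn.neural_network import MLPClassifier

# Stand-in training data:
#   X[i] = features built from the three OCR models' outputs on sample i
#          (confidence scores, agreement flags, etc.)
#   y[i] = index (0, 1 or 2) of the model that was actually correct
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 9))
y = rng.integers(0, 3, size=1000)

triangulator = MLPClassifier(hidden_layer_sizes=(32, 16), max_iter=500)
triangulator.fit(X, y)

# At inference time, trust whichever model the triangulator picks
chosen_model = triangulator.predict(X[:1])[0]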

We’ll pause here to let the reader wonder about the solution for a few seconds.



There’s an important step whenever training classification models - handling “class imbalance” - a step we were taking into account for the first-layer models but had forgotten in the second layer, mistakenly thinking it had already been dealt with. What is it? Let’s take another example.

Suppose you wanted to train a neural network to “turn left” or “turn right”, but 99% of your training data pertained to a “turn left” scenario. Your training data is “imbalanced” and if you trained a model without any changes, it would quickly learn that the best strategy is to simply turn left all the time: 1% error rate isn’t bad. BTW, this is typical when working on “fraud” use-cases where fraudulent banking transactions or fraudulent insurance claims are a tiny fraction of the total data volume.

To obtain the correct behavior you need to rebalance your data, i.e. transform your training data until “turn left” versus “turn right” are about 50:50. There are multiple ways to achieve this and this is a separate topic altogether; a basic approach is to randomly delete unwanted data until you reach the desired equilibrium.
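
As a quick illustration of that random-deletion approach, here is what rebalancing by undersampling can look like in Python (the helper below is our own sketch, not a library function):

import numpy as np

def undersample(X, y, seed=0):
    # Randomly drop samples from the larger classes until every class
    # has as many samples as the smallest one (roughly 50:50 for two classes).
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False) for c in classes]
    keep = rng.permutation(np.concatenate(keep))
    return X[keep], y[keep]

# Example: 99% "turn left" (0), 1% "turn right" (1)
y = np.array([0] * 990 + [1] * 10)
X = np.arange(len(y)).reshape(-1, 1)
Xb, yb = undersample(X, y)
print(np.bincount(yb))   # [10 10]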

So back to our original OCR example: what was happening? Well, the triangulation model had learned that one of the underlying OCR models had a better performance and that the best strategy was simply to stick to it and ignore the others. To remedy this, all you need to do is to rebalance the data, thereby removing the incentive to always default to the “jack of all trades” output (33% accuracy from a rebalanced scenario isn’t that great), and hey presto: performance improvement.

The more experienced of you could point out: “Why bother with 2 layers of neural networks for OCR? Why not just create a bigger, deeper neural network and let it deal automatically with jack-of-all-trades versus specialized internal configurations?”
That is a valid criticism, and that’s why OCR is a simplistic example. However, there are use-cases where (we think) a second neural-network layer makes sense, mostly because it has access to more data (features) than was available in a “first run” prediction. For example:
  • Data extraction from financial statements, where the relationship between financial metrics can be codified as features in a second layer (e.g. total assets = liquid assets + non-liquid assets)
  • Decision-making in self-driving cars, where the master model receives as input the predictions from the various car sensors (layer 1 neural networks). We’re less sure of this one though, having never worked on self-driving neural networks

In closing, every time we describe something technical we seem to go on a rant and write long posts. Apologies for this: we’re trying to break down complex problems into small, digestible parts, and unfortunately that takes up space.
Two charts on Covid-19     
We love data at Sense6 and, now that a few months have passed in the "Covid-19 world", we thought we'd share two charts with what we think are interesting observations. Unfortunately we only have data on Switzerland, so this is a single-country analysis.

Chart 1: Overview Covid-19 # of deaths versus # of positive test cases per day



To build this chart we visited bag.admin.ch (almost) every day and copied the key stats (# of deaths, # of positive test cases) into our own Excel file. We opted for a 7-day average to reduce noise and to replicate the approach used by the NYTimes, so that results can be compared.
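
For those curious about the mechanics, the 7-day average itself is a one-liner in pandas. The file and column names below are assumptions about the layout of such an Excel file, not our exact sheet:

import pandas as pd

df = pd.read_excel("covid_stats.xlsx", parse_dates=["date"])
df = df.sort_values("date").set_index("date")
df["deaths_7d"] = df["deaths"].rolling(window=7).mean()
df["positives_7d"] = df["positive_tests"].rolling(window=7).mean()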

Chart 2: Net growth of A.G. and GmbH companies in Switzerland (Jan-Aug 2020 vs 2019)



To build this chart, we leveraged Cloudlink and filtered only the A.G. and GmbH companies from the federal publications. We could have gone deeper in the analysis, for example looking at differences between regions or activity sectors, but the population size could then become too small for meaningful conclusions.

What do you think? Do you agree with our observations and hypotheses? Do let us know if you have some feedback via email, we're happy to continue this conversation.
Thoughts on the Pareto Principle     
Many have heard of the Pareto principle - otherwise known as the 80:20 rule - but most only remember the first part of the principle: "you can achieve roughly 80% of the effect by focusing on 20% of the causes".

To be fair, Pareto is not a rule but more of a guiding principle. While not always accurate, it does show up in a variety of settings: a prototype provides 80% of the benefits, focusing on a couple of customer segments yields 80% of the profit, and 20% of your apartment is where you spend 80% of your time.

We do think that the second part is just as interesting as the first: "completing the last mile, the remaining 20%, requires 80% of the work". This is what separates a pilot from an end-to-end solution, efficiency gains from an automated process, being top 10 from reaching number 1.
First steps with big data     
Several years ago, companies weren't very diligent in digitizing their data, and most commercial IT systems would therefore generate data extracts with a few thousand entries at most. This has changed, and we've witnessed more and more people running into big data issues. Not just geeks, but consultants, project managers, and auditors; individuals not trained as data scientists.

We were reminded of this during a recent engagement where we used Cloud Link to analyze significant swaths of the Swiss population. We combined, filtered and scored millions of text and numeric data entries, a task impossible for Excel to achieve.

Now don't get us wrong: we love Excel and use it a lot. However, if Excel is all you know and you have a few minutes to spare, then in this post we'd like to invite you to experiment with something different: R.
We'll focus on a few basic but useful operations: opening a data file, doing some exploratory stats, filtering relevant entries, storing these entries into a separate smaller file.

Why should you join in? R is orders of magnitude faster than Excel when working with big data sets and chances are you'll run into this challenge sooner than you think. It's a good investment of your time.

Step 1: Download and install R with this public link (it's free, how great is that). If you want to use your work-computer but don't have installation rights then don't worry because you should still be OK. Most companies include R in their accepted applications so you can order it via your company's standard procurement process.

Step 2 (optional): Install RStudio, which provides a more intuitive programming environment than the R console does. You don't need it for this tutorial though.

Step 3: Create a folder on your Desktop called "R_Experiment" and store in it this example data file. It's a CSV (Comma Separated Values) file, one of the main standard formats by which large data sets are shared. Values are usually separated by commas or tabs, the latter being the case in this example. CSVs are human-readable, so you can open the file with any text editor (like Notepad) if you want to look at the 8 entries in our example.

Step 4: Great! All set to start. Launch R (or R-Studio) and type into the R console the "set working directory" command:
setwd("C:/Users/YourUserName/Desktop/R_Experiment")
With this command you are asking R to work from the folder you previously created. Note that R treats the backslash as an escape character, so use forward slashes even on Windows; your path will also look different if you are on Mac or Linux.

Step 5: Now that you are in the right folder, you can read the file into R as so:
mydata <- read.csv("example.csv", sep="\t")
Variable "mydata" will hold the entire table of data from the CSV file, it's type is a data-frame. This is also a good place to get used to the "<-" notation in R, which you can simply imagine as "=". Finally the above command specifies the value-separator as tab ("\t" denotes a tab). You can change the separator to include any character you want, for example sep=","

Step 6: OK, so you've successfully loaded the contents from the CSV and stored them into variable "mydata". Let's quickly check by printing the first 2 rows:
head(mydata, 2)
Let's go a bit further and ask for some exploratory statistics:
summary(mydata)
Did you see that R automatically recognized which columns contain text versus numbers and applied the relevant statistics based on each case? Pretty cool.

Step 7: Now for the most complex command of this tutorial, we're going to filter all entries where the department is "IT":
filterdata <- mydata[mydata$dept == "IT", ]
Let's break it down.
First, consider the left side of the command "filterdata <- ". This simply means that, whatever happens on the right side, we're going to store it into a variable called "filterdata".
The rest of the command tells R to take data-frame "mydata" and keep only the rows where mydata$dept equals "IT". The $ sign is how you refer to a column within a data-frame, so "mydata$dept" means column "dept" from data-frame "mydata". Also, the "," followed by nothing means that we don't want to apply a second filter on the columns: we want the full row returned wherever dept == "IT".
It's a bit much, we know, but you'll quickly get used to this notation. It's repetitive and there are tons of tutorials out there to help.

Step 8 (final): All that's left is to store the filtered data into a new file. Execute:
write.table(filterdata, file="filtered.csv", sep="\t", row.names = FALSE)
This command tells R to write the data-frame stored in variable "filterdata" into a file called "filtered.csv". The file is saved to the working directory by default. Finally, it requests that the output file use tabs as separators (sep="\t") and disables the row.names option, otherwise R would start each row with a row number, which we don't need.

Congrats, we're done! The powerful thing is that this approach will still work if you process a file with millions of entries. You can now open it, explore the data, filter out the relevant entries and store them in a separate file so that you can continue the analysis in a tool you are more familiar with, like Excel.
ML Chicken and egg     
Well-annotated data is a rare thing in real-world machine-learning projects. So how can developers still use machine learning when there is no training data available to start with? For example, let's say you're building a machine-learning model that classifies companies based on their text description (something we're building at Sense6). The challenge is that no data-set exists for the classification you have in mind.

A pragmatic way to tackle this challenge - we think - is by setting up 3 components:
  • A basic algorithm to kick-start the classification: in our company classification example it could be something as basic as a keyword ranking algorithm, or alternatively something slightly more sophisticated like tf-idf (see the sketch after this list)
  • A feedback loop: a combination of process and technology that enables end-users to review and update the classification that your model is generating
  • A machine-learning model: in our example we'd recommend using an LSTM, which is simply magical at "divining" the right classifications
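
Here is a minimal Python sketch of such a "good enough" kick-start classifier, scoring each company description against a few hand-written seed texts per category with tf-idf. The categories, seed texts and example description are invented purely for illustration:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# One short seed text per target category (hypothetical examples)
seeds = {
    "restaurant": "restaurant cafe food catering kitchen menu",
    "construction": "construction building renovation masonry roofing",
    "it_services": "software development IT consulting cloud hosting",
}

categories = list(seeds)
vectorizer = TfidfVectorizer()
seed_matrix = vectorizer.fit_transform(list(seeds.values()))

def classify(description: str) -> str:
    # Score the description against every seed text and pick the closest category
    scores = cosine_similarity(vectorizer.transform([description]), seed_matrix)[0]
    return categories[int(np.argmax(scores))]

print(classify("We offer software development and cloud consulting for SMEs"))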

So how does this work? The motto for your basic algorithm is "good enough": as soon as its output is a usable starting point for your end-users, move on.
The motto for the feedback loop is "smooth and beautiful". It's a vital step and one that often doesn't get enough attention. One unnecessary click, one slow load time, one unnecessary text field: any of these can mean the difference between abundant feedback and nothing at all. Enhance the feedback mechanism until it's as seamless and easy as possible for your end-users.
Finally the motto for your machine-learning model is "retrain, retrain, retrain".

Very soon you'll observe that the machine-learning model picks up where the basic model stopped and becomes orders of magnitude more precise, assuming you are gathering regular and quality feedback from your end-users via your feedback loop.

There are other ways around this chicken-and-egg problem (for example deep reinforcement learning), however they are often more complex and not always applicable in real-world situations. We've found that the above approach is pragmatic and rapidly leads to well-performing models.

Please contact us if you have alternative ways of dealing with the chicken-and-egg problem, or if you have a similar situation that you'd like us to take a look at.
When you give to the community     
We didn't expect anything in return when we started coronaschool.ch at the start of the Covid-19 crisis. The goal was to spend one week-end and hack together a handful of web-applications that would help families with homeschooling. Word-of-mouth spread and, within a couple of weeks, the website was receiving requests in the low hundreds every day. We were happy that others were using it and didn't expect the learnings that followed.

1. Back-end lessons learned:
Corona School is built using Fargate from Amazon: an infrastructure-as-a-service set-up that we'd been eyeing for some time. It turns out that traffic to coronaschool.ch is cyclical: according to our logs most parents ask their children to do math from 9AM to 2PM. This taught us a few tricks about load-balancing and how resilient Amazon Fargate is. We're happy with its performance and cost, and we'll re-use it next time we can.

2. Front-end lessons learned:
Mike Tyson has a quote about everyone having a plan until they get punched in the face. I'll paraphrase and say that every GUI developer thinks they're pretty good until their GUI lands in the hands of a bunch of five-to-ten-year-olds. I've never seen a GUI break down that fast, repeatedly. Our young users are still teaching us lessons and we'll make sure to integrate the improvements into our normal front-end.

The most valuable lesson has been the unexpected benefits of opening up parts of our technology to the public. Progress was much faster than if we'd tested internally for the same amount of time, something we'll keep in mind going forward.
Hello World !     
Most virtual explorations start with a "hello world". This applies when you are exploring a new platform, a new programming language, a new database, or in some cases even new APIs.

Let's say that you want to learn to program in Python but have no idea how to begin. A Google search for "hello world python" will give you a good place to start, for example here.

We think it's appropriate to start the Sense6 blog with a "hello world" of our own.