In one of our previous posts, we mentioned we'd delve deeper into the topic of preprocessing. Data preprocessing is crucial in any data analysis or machine learning project: it transforms raw data into a format that is suitable for further analysis and modeling. We will focus on analytical data this time, as we already covered the basics of NLP and text classification. In this blog post, we explore various preprocessing techniques along with real-world examples.
- Handling Missing Values
Missing values are a common occurrence in datasets and can impact the accuracy and reliability of the analysis. Here are some techniques to handle them:
- Deleting Rows/Columns
If the missing values are few and randomly distributed, you can consider deleting the respective rows or columns. However, be cautious as this approach may result in the loss of valuable information.
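As a minimal sketch of row deletion, the snippet below drops every record that has a missing field and reports how much data is lost. The sample records and the helper name `drop_incomplete_rows` are hypothetical, and `None` is assumed to mark a missing value.

```python
# Hypothetical sample records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": 58000},
    {"age": 29, "income": None},
]


def drop_incomplete_rows(records):
    """Keep only the records in which every field has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


complete_rows = drop_incomplete_rows(rows)
lost_fraction = 1 - len(complete_rows) / len(rows)

print(complete_rows)
print(f"Rows lost: {lost_fraction:.0%}")
```

Checking the lost fraction first is a cheap way to see whether deletion is acceptable: here half of the rows would disappear, which is usually too much.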
- Imputation
For cases where deleting rows/columns is not feasible, imputation techniques can be employed. Imputing missing values means filling them with estimated values based on existing data. Common imputation techniques include mean, median, and mode imputation, as well as advanced methods like regression imputation. To choose the best technique, domain knowledge is indispensable.
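The simple strategies can be sketched with Python's `statistics` module. The helper `impute` and the sample values are hypothetical, and `None` is again assumed to mark a missing value; regression imputation is not shown.

```python
from statistics import mean, median, mode


def impute(values, strategy="mean"):
    """Fill None entries with the mean, median, or mode of the observed values."""
    strategies = {"mean": mean, "median": median, "mode": mode}
    observed = [v for v in values if v is not None]
    fill_value = strategies[strategy](observed)
    return [fill_value if v is None else v for v in values]


# Hypothetical income column with one very large value.
incomes = [52000, None, 58000, 61000, None, 250000]
print(impute(incomes, "mean"))
print(impute(incomes, "median"))

# Mode imputation also works for categorical columns.
colors = ["red", None, "blue", "red"]
print(impute(colors, "mode"))
```

The single large income pulls the mean far above the typical value, while the median stays close to it. This is one place where domain knowledge decides which estimate is more sensible.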
- Data Cleaning and Formatting
Data inconsistencies and errors can hinder the analysis process. Here are some techniques to clean and format your data:
- Removing Duplicates
Duplicates bias the analysis towards certain results (think of it as a higher "weighting"). Unless this is what you really want, you should remove duplicates from your dataset.
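A minimal sketch of exact-duplicate removal, assuming each record is a dictionary with hashable values. The helper name `remove_duplicates` and the sample orders are hypothetical.

```python
def remove_duplicates(records):
    """Return the records with exact duplicates dropped, keeping the first occurrence."""
    seen = set()
    unique = []
    for record in records:
        # Sort the items so that key order does not affect the comparison.
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Hypothetical orders; the first and third entries are identical.
orders = [
    {"customer": "A", "amount": 10},
    {"customer": "B", "amount": 25},
    {"customer": "A", "amount": 10},
]
print(remove_duplicates(orders))
```

Whether two identical rows are a true duplicate or two genuine events (the same customer buying twice) depends on the data, so a unique identifier per row, if one exists, is a safer key than the full record.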
- Correcting Inconsistent Data
Data manipulation functions and formulas can be used to identify and correct inconsistent data, such as inconsistent date formats, currency symbols, or misspelled words.
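For inconsistent date formats, one possible approach is to try a list of known formats and convert every match to ISO 8601. The format list below is an assumption about what the data contains, and `normalize_date` is a hypothetical helper name.

```python
from datetime import datetime

# Assumed formats; adjust to what actually occurs in your data.
# Note that "%m/%d/%Y" and "%d/%m/%Y" cannot be told apart automatically.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d.%m.%Y", "%m/%d/%Y")


def normalize_date(text):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


raw_dates = ["2023-06-01", "01.06.2023", "06/01/2023", "June 1st"]
print([normalize_date(d) for d in raw_dates])
```

Returning `None` for unparseable entries, rather than guessing, turns them into ordinary missing values that can be reviewed or imputed later.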
- Feature Scaling and Normalization
Feature scaling is essential when dealing with variables that have different scales or units. There are various techniques to scale and normalize data; let's discuss two of the most common ones:
- Standardization
Standardizing variables brings them to a common scale, with a mean of zero and a standard deviation of one. This step is especially important if you are planning to apply ML models that consider the distribution of the data rather than its absolute values. Depending on the distribution, you may want to consider applying a further transformation, for example taking the log of the parameter if it behaves in an exponential way (e.g. Covid infections in a population). Even if the distribution of your data is already Gaussian, you should still consider standardizing it.
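Standardization computes z = (x - mean) / standard deviation for every value. The sketch below applies it after a log transform to a series that grows exponentially. The case counts are made-up illustration data, `standardize` is a hypothetical helper, and the population standard deviation is an assumed choice.

```python
from math import log10
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mu) / sigma for v in values]


# Made-up counts that grow by a factor of ten per step.
cases = [10, 100, 1000, 10000, 100000]

# The log transform turns exponential growth into a straight line.
log_cases = [log10(v) for v in cases]
z_scores = standardize(log_cases)
print(z_scores)
```

In a modeling pipeline the mean and standard deviation should be computed on the training data only and then reused for new data, so that both are scaled in the same way.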
- Min-Max Scaling
Min-max scaling scales the variables to a specific range, typically between 0 and 1. Use min-max scaling when you have prior knowledge or domain expertise that the data has a specific range or when the algorithm you plan to use requires input features to be on a bounded interval (e.g., neural networks with activation functions that expect input values between 0 and 1).
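Min-max scaling maps the smallest value to the lower bound and the largest to the upper bound, with everything else placed linearly in between. A minimal sketch, with `min_max_scale` as a hypothetical helper name:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that they span [new_min, new_max]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        raise ValueError("cannot scale a constant column")
    return [new_min + (v - lo) / span * (new_max - new_min) for v in values]


print(min_max_scale([10, 20, 30]))         # [0.0, 0.5, 1.0]
print(min_max_scale([10, 20, 30], -1, 1))  # [-1.0, 0.0, 1.0]
```

Because the bounds come from the observed minimum and maximum, a single extreme value compresses all other values into a narrow band, so it is worth checking for outliers before scaling this way.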
- Handling Categorical Variables
- One-Hot Encoding
One-hot encoding converts categorical variables into binary vectors, where each category is represented by a binary column. For example, if you want to use a color as an input parameter and you have the options "red", "green", and "blue", these can be transformed into red=(1,0,0), green=(0,1,0) and blue=(0,0,1). This step is mandatory if you want to use a model that doesn't support categorical features (some models support text labels directly).
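The color example can be reproduced in a few lines. The helper `one_hot_encode` is hypothetical, and the category order is passed in explicitly so that the column order is fixed.

```python
def one_hot_encode(values, categories):
    """Return one binary tuple per value, with one position per category."""
    return [tuple(1 if value == c else 0 for c in categories) for value in values]


colors = ["red", "green", "blue", "green"]
encoded = one_hot_encode(colors, categories=("red", "green", "blue"))
print(encoded)  # [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]
```

In this sketch a value that is not in `categories` becomes an all-zero vector, which is one possible way to treat unseen categories at prediction time.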
- Label Encoding
Label encoding assigns a numerical label to each category. This is important because some models are only able to predict numerical labels. The encoding is simple: each category is mapped to a number.
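A minimal sketch of the mapping and its inverse, which is needed to turn numeric predictions back into readable labels. The helper `fit_label_mapping` and the sample labels are hypothetical; sorting the categories is an assumed choice that keeps the mapping reproducible.

```python
def fit_label_mapping(labels):
    """Map each distinct category to an integer, in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


labels = ["spam", "ham", "spam", "ham", "ham"]
mapping = fit_label_mapping(labels)
encoded = [mapping[label] for label in labels]

# Invert the mapping to translate numeric predictions back to category names.
inverse = {index: label for label, index in mapping.items()}
decoded = [inverse[index] for index in encoded]

print(mapping)  # {'ham': 0, 'spam': 1}
print(encoded)  # [1, 0, 1, 0, 0]
```

The integers carry no meaning beyond identity. If such codes are used as input features rather than as target labels, a model may read an order into them, which is where one-hot encoding is usually the safer choice.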
- Dealing with Outliers
Outliers are data points that deviate significantly from the overall pattern. You can identify them via data exploration: scatter plots, box plots, or histograms will reveal them visually. Here are some ways to handle them:
- Keep or remove
There's no one-size-fits-all approach here. You need to consider on a case-by-case basis whether to keep the extreme values (e.g. because they indicate something important in the data) or to remove them. Removal is usually justified only if they are truly erroneous values (e.g. sensor readings caused by a hardware malfunction).
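Before deciding, it helps to list the candidate outliers explicitly. The sketch below uses the 1.5 × IQR rule, the same rule box plots conventionally use for their whiskers. The sensor readings are made up, `find_outliers` is a hypothetical helper, and the factor 1.5 is a convention rather than a fixed law.

```python
from statistics import quantiles


def find_outliers(values, factor=1.5):
    """Return the values that lie more than factor * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - factor * iqr, q3 + factor * iqr
    return [v for v in values if v < lower or v > upper]


# Made-up temperature readings with one suspicious value.
readings = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 85.0]
print(find_outliers(readings))  # [85.0]
```

The rule only flags candidates. Whether 85.0 is a malfunction or a real event is still the case-by-case judgment described above.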
- Winsorization
If you want a middle ground between keeping and removing, winsorization replaces extreme values with less extreme ones, reducing the impact of outliers on analysis results. You can also treat these outliers as missing values and use the same solution that you identified in the previous steps.
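Winsorization can be sketched as clamping the lowest and highest share of the values to the nearest value that is kept. The helper name `winsorize`, the 10% share, and the readings below are assumptions for illustration.

```python
def winsorize(values, fraction=0.1):
    """Clamp the lowest and highest `fraction` of values to the nearest retained value."""
    ordered = sorted(values)
    k = int(len(ordered) * fraction)  # number of values to clamp at each end
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


# Made-up readings with one extreme value at each end.
readings = [20.1, 19.8, 20.4, 20.0, 19.9, 20.2, 85.0, 20.3, 2.0, 20.0]
print(winsorize(readings))
```

With ten readings and a 10% share, one value at each end is clamped: 85.0 becomes 20.4 and 2.0 becomes 19.8, while the number of data points stays the same.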
By effectively handling missing values, cleaning and formatting data, scaling and normalizing variables, dealing with categorical variables, and addressing outliers, you can unlock valuable insights and enhance the accuracy of your model later on.