The most time-consuming part of a Data Scientist’s job is preparing and preprocessing the data at hand. Data from real-life scenarios is rarely clean or suitable for modelling. It needs to be cleaned, brought into a consistent format and transformed before it is fed to Machine Learning models.
At the end of this tutorial, you will know the following
Why Data Preprocessing?
When data is retrieved by scraping websites or gathered from other data sources, it is generally full of discrepancies. These can be formatting issues, missing values, garbage values and text, or even errors in the data. Several preprocessing steps are needed to make sure that the data fed to the model is up to the mark so that the model can learn from it and generalize.
The first and most essential step is to clean the irregularities in the data, such as formatting issues, garbage values and outliers. Without this step, we cannot make much sense of the statistics of the data.
Most of the time we need the data in a tabular format, but it rarely arrives that way. The data might have missing or incorrect column names, or blank columns. Moreover, when dealing with unstructured data such as images and text, it becomes essential to get the 2D or 3D data loaded into Dataframes for modelling.
Many instances, or even complete columns, might have garbage characters attached to the value that is actually required. For example, consider a column “rank” with values such as “#1”, “#3”, “#12”, “#2”, etc. Here it is important to remove the leading “#” characters so that the numeric value can be fed to the model.
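As a minimal sketch of the “rank” example above, assuming the data sits in a pandas DataFrame (the DataFrame itself is made up for illustration), the leading character can be stripped and the column converted to integers:

```python
import pandas as pd

# Hypothetical column mirroring the "rank" example from the text.
df = pd.DataFrame({"rank": ["#1", "#3", "#12", "#2"]})

# Strip the leading "#" and convert the remaining text to integers.
df["rank"] = df["rank"].str.lstrip("#").astype(int)

print(df["rank"].tolist())  # [1, 3, 12, 2]
```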
Often, certain numeric values are far larger or far smaller than the typical values of their column. These are considered outliers. Outliers are a sensitive matter and need special treatment: they might be measurement errors, or they might be real values. They either need to be removed completely or handled separately, as they might contain a lot of important information.
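The text does not prescribe a detection rule, so the following is only one possible sketch. It uses the common 1.5 × IQR rule (an assumption, not the article's method) on made-up values to flag points that lie far from the bulk of the column:

```python
import pandas as pd

# Made-up measurements; 95 is far from the rest of the values.
values = pd.Series([10, 12, 11, 13, 12, 11, 95])

# Interquartile range: the spread of the middle 50% of the data.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Flag anything more than 1.5 * IQR outside the quartiles.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (values < lower) | (values > upper)

outliers = values[is_outlier]   # inspect these separately
cleaned = values[~is_outlier]   # or continue without them

print(outliers.tolist())  # [95]
```

Whether a flagged point should be dropped or kept for separate analysis still depends on whether it is a measurement error or a real value.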
It is seldom the case that your data will contain all the values for every instance. Many values are missing or filled with garbage entries, and these missing values need to be treated. Values can be missing for a specific reason, such as a sensor error, or they can be missing completely at random.
The easiest and most straightforward way is to drop the rows where values are missing. This has disadvantages, chiefly the loss of crucial information. Dropping rows can be a reasonable step when you have a huge amount of data. But if the data set is small and there are a lot of missing values, you need better ways to tackle the issue.
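A small sketch of dropping incomplete rows with pandas, using a made-up DataFrame, shows how quickly rows can be lost:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "distance": [1300.0, np.nan, 560.0, 800.0],
    "time": [1.0, 2.5, np.nan, 0.8],
})

# dropna() removes every row that has at least one missing value.
dropped = df.dropna()

print(len(df), len(dropped))  # 4 2
```

Here two missing cells cost half of the rows, which is why this approach suits large data sets better than small ones.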
The quickest way to impute missing values is to fill them with the mean value of the column. However, this has a disadvantage: it disturbs the original distribution of the data, and the mean itself is pulled around by outliers. You can also impute the median, which is less sensitive to outliers than the mean, or the mode, which also works for categorical columns.
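The difference between the three fill values can be seen in a short pandas sketch. The numbers are made up and deliberately include one very large value:

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value and one very large value.
s = pd.Series([1.0, 2.0, 2.0, np.nan, 100.0])

mean_filled = s.fillna(s.mean())            # fills with 26.25
median_filled = s.fillna(s.median())        # fills with 2.0
mode_filled = s.fillna(s.mode().iloc[0])    # fills with 2.0

print(mean_filled[3], median_filled[3], mode_filled[3])  # 26.25 2.0 2.0
```

The single large value drags the mean far away from the typical values, while the median and mode stay at 2.0.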
Linear interpolation & KNN
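Smarter methods can also be used to impute missing values. One is linear interpolation, which fills a gap from the known values on either side of it. A related, model-based approach treats the column with blank values as the target to be predicted from the other features. Another is KNN imputation. Strictly speaking, KNN is a nearest-neighbour method rather than a clustering method: it finds the k rows most similar to the row with the missing value and fills the gap from their values, for example by averaging them.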
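Both ideas can be sketched briefly. The example below assumes pandas' `interpolate` for the linear case and Scikit-learn's `KNNImputer` for the nearest-neighbour case. The data and the choice of `n_neighbors=2` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Linear interpolation: fill each gap from its neighbouring values.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
interpolated = s.interpolate(method="linear")
print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]

# KNN imputation: fill a gap from the most similar rows.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],   # second feature is missing here
    [3.0, 6.0],
    [4.0, 8.0],
])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The two nearest rows (by the first feature) have 2.0 and 6.0,
# so the missing value is filled with their average.
print(X_filled[1, 1])  # 4.0
```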
In a data set with multiple numerical features, the features might not all be on the same scale. For example, a feature “Distance” has distances in meters such as 1300, 800, 560, etc., and another feature “time” has times in hours such as 1, 2.5, 3.2, 0.8, etc. When these two features are fed to the model, it gives more weight to the distance feature simply because its values are larger. To avoid this and to get faster convergence, it is necessary to bring all the features onto the same scale.
A common way to scale the features is by normalizing them. This can be implemented using Scikit-learn’s Normalizer. Note that it works on the rows, not on the columns: L2 normalization is applied to each observation so that the values in a row have a unit norm after scaling.
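A minimal sketch of row-wise L2 normalization with Scikit-learn's `Normalizer`, using made-up rows chosen so the arithmetic is easy to check:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two made-up observations (rows) with two features each.
X = np.array([
    [3.0, 4.0],   # L2 norm is 5
    [1.0, 0.0],   # L2 norm is already 1
])

# Each row is divided by its own L2 norm.
X_norm = Normalizer(norm="l2").fit_transform(X)
print(X_norm)  # rows become [0.6, 0.8] and [1.0, 0.0]

# Every row now has unit length.
row_norms = np.linalg.norm(X_norm, axis=1)
```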
Min Max Scaling
Min Max scaling can be implemented using Scikit-learn’s MinMaxScaler class. It subtracts the minimum value of the feature from each value and then divides by the range, where the range is the difference between the original maximum and the original minimum. It preserves the shape of the original distribution and, by default, maps the values into the range 0 to 1.
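The formula can be checked against Scikit-learn's `MinMaxScaler` in a short sketch. The distance values are taken from the earlier example. The manual computation is included only for comparison.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The "Distance" values from the example, as a single-column array.
distance = np.array([[1300.0], [800.0], [560.0]])

# Scale the column into the default range [0, 1].
scaled = MinMaxScaler().fit_transform(distance)

# The same computation by hand: (x - min) / (max - min).
manual = (distance - distance.min()) / (distance.max() - distance.min())

print(scaled.ravel())  # largest value maps to 1, smallest to 0
```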
Standard scaling can be implemented using Scikit-learn’s StandardScaler class. It standardizes a feature by subtracting the mean and then dividing by the standard deviation, which gives the feature unit variance. The resulting distribution has a mean of 0 and a standard deviation of 1.
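A brief sketch with Scikit-learn's `StandardScaler`, applied to the “time” values from the earlier example, confirms the resulting mean and standard deviation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The "time" values from the example, as a single-column array.
time = np.array([[1.0], [2.5], [3.2], [0.8]])

# Subtract the column mean and divide by the column standard deviation.
standardized = StandardScaler().fit_transform(time)

# Up to floating-point error, the mean is 0 and the standard deviation is 1.
print(standardized.mean(), standardized.std())
```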
Often the data is not in numeric form but in categorical form. For example, consider a feature “temperature” with the values “High”, “Low” and “Medium”. These textual values need to be encoded in numerical form before the model can train on them.
Categorical data is label encoded to bring it into numerical form. So “High”, “Medium” and “Low” can be label encoded as 3, 2 and 1. Categorical features can be either nominal or ordinal. Ordinal categorical features are those which have a certain order. For example, in the above case, we can say that 3>2>1 because the temperatures can be measured and ranked.
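For an ordinal feature the numeric codes should follow the real order, so one simple sketch is an explicit mapping in pandas. The DataFrame and the `order` dictionary below are illustrative names, not part of any library:

```python
import pandas as pd

# Made-up column mirroring the "temperature" example.
df = pd.DataFrame({"temperature": ["High", "Low", "Medium", "Low"]})

# Explicit mapping so that the codes respect Low < Medium < High.
order = {"Low": 1, "Medium": 2, "High": 3}
df["temperature_encoded"] = df["temperature"].map(order)

print(df["temperature_encoded"].tolist())  # [3, 1, 2, 1]
```

An explicit mapping is used here because Scikit-learn's `LabelEncoder` assigns codes in sorted (alphabetical) order of the labels, which would not match the Low < Medium < High ordering.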
However, a feature such as “City”, with values like “Delhi”, “Jammu” and “Agra”, has no such order. In other words, if we label encode them as 3, 2, 1, we cannot say that 3>2>1, because “Delhi” > “Jammu” does not make sense. For such nominal features, we use One Hot Encoding, which creates one binary column per category.
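One way to sketch One Hot Encoding is with pandas' `get_dummies` on a made-up “City” column. Each category becomes its own 0/1 column, so no order is implied between cities:

```python
import pandas as pd

# Made-up column mirroring the "City" example.
df = pd.DataFrame({"City": ["Delhi", "Jammu", "Agra", "Delhi"]})

# One binary column per city; dtype=int gives 0/1 instead of booleans.
one_hot = pd.get_dummies(df["City"], prefix="City", dtype=int)

print(list(one_hot.columns))  # ['City_Agra', 'City_Delhi', 'City_Jammu']
print(one_hot["City_Delhi"].tolist())  # [1, 0, 0, 1]
```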
Features with continuous values can also be discretized by binning the values into bins of specific ranges. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges of the continuous values. This comes in handy when you want to see the trends based on what range the data point falls in.
For example, say we have marks for 7 kids, ranging from 0 to 100. We can assign every kid’s marks to a particular “bin”. If we divide the range into 3 bins covering 0-50, 51-70 and 71-100, labelled 1, 2 and 3 respectively, the feature will now only contain one of these 3 values. Pandas offers 2 functions to achieve binning quickly: qcut and cut.
Pandas qcut takes in the number of quantiles and assigns the data points to bins based on the data distribution, so that each bin holds roughly the same number of points.
Pandas cut, on the other hand, takes in the custom ranges defined by us and divides the data points in those ranges.
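The two functions can be compared on the marks example. This is a sketch with made-up marks for 7 kids. The bin edges follow the ranges given above, while the choice of 4 quantile bins for `qcut` is an assumption for illustration.

```python
import pandas as pd

# Made-up marks for 7 kids.
marks = pd.Series([35, 48, 55, 62, 71, 88, 95])

# cut: custom ranges 0-50, 51-70 and 71-100, labelled 1, 2 and 3.
custom_bins = pd.cut(
    marks, bins=[0, 50, 70, 100], labels=[1, 2, 3], include_lowest=True
)
print(custom_bins.tolist())  # [1, 1, 2, 2, 3, 3, 3]

# qcut: 4 quantile-based bins, with edges chosen from the data itself.
quantile_bins = pd.qcut(marks, q=4, labels=[1, 2, 3, 4])
print(quantile_bins.tolist())  # [1, 1, 2, 2, 3, 4, 4]
```

With `cut` the bin sizes depend on where the marks happen to fall, while `qcut` moves the edges so that the bins are as evenly filled as the data allows.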
Data Preprocessing is an essential step in any Data Mining and Machine Learning task. The steps we discussed are not exhaustive, but they cover most of the basics of the process. Preprocessing techniques also differ for NLP and image data. Make sure to try out examples of the above steps and implement them in your Data Mining pipeline.