Data Preprocessing in Machine Learning: 7 Easy Steps To Follow

Data preprocessing in Machine Learning is a crucial step that enhances the quality of data and promotes the extraction of meaningful insights from it. It refers to the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training Machine Learning models. In simple words, data preprocessing in Machine Learning is a data mining technique that transforms raw data into an understandable and readable format.

Why Data Preprocessing in Machine Learning?

When it comes to creating a Machine Learning model, data preprocessing is the first step, marking the initiation of the process. Typically, real-world data is incomplete, inconsistent, inaccurate (it contains errors or outliers), and often lacks specific attribute values or trends. This is where data preprocessing enters the scenario – it helps to clean, format, and organize the raw data, thereby making it ready to go for Machine Learning models.

Steps in Data Preprocessing in Machine Learning

 There are seven significant steps in data preprocessing in Machine Learning:

 1. Acquire the dataset

To build and develop Machine Learning models, you must first acquire the relevant dataset. This dataset is composed of data gathered from multiple, disparate sources and then combined in a proper format. Dataset formats differ according to use cases. For instance, a business dataset will be entirely different from a medical dataset: while a business dataset will contain relevant industry and business data, a medical dataset will include healthcare-related data.

There are several online sources from which you can download datasets, such as Kaggle (https://www.kaggle.com/uciml/datasets) and the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). You can also create a dataset by collecting data via different Python APIs. Once the dataset is ready, you must save it in a CSV, HTML, or XLSX file format.
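For example, if you have collected the data into a pandas DataFrame, a minimal sketch of saving it as a CSV file (the column names and records here are illustrative, not taken from a real dataset) would be:

import pandas as pd

# A small, made-up table standing in for data collected from an API
records = [{'country': 'India', 'age': 38, 'salary': 68000},
           {'country': 'France', 'age': 43, 'salary': 45000}]
df = pd.DataFrame(records)

# Save the dataset in CSV format so it can be loaded later with read_csv()
df.to_csv('Dataset.csv', index=False)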

2. Import all the crucial libraries

Since Python is the most extensively used and most preferred language among Data Scientists around the world, we’ll show you how to import Python libraries for data preprocessing in Machine Learning. Read more about Python libraries for Data Science here. Predefined Python libraries can perform specific data preprocessing jobs. The three core Python libraries used for data preprocessing in Machine Learning are listed below, followed by the corresponding import statements:

  • NumPy – NumPy is the fundamental package for scientific computation in Python. It is used to perform any kind of mathematical operation in the code. Using NumPy, you can also work with large multidimensional arrays and matrices in your code. 
  • Pandas – Pandas is an excellent open-source Python library for data manipulation and analysis. It is extensively used for importing and managing datasets. It packs high-performance, easy-to-use data structures and data analysis tools for Python.
  • Matplotlib – Matplotlib is a Python 2D plotting library that is used to plot any type of chart in Python. It can deliver publication-quality figures in numerous hard-copy formats and interactive environments across platforms (IPython shells, Jupyter notebooks, web application servers, etc.). 
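As a quick reference, the corresponding import statements (using the same aliases as the combined snippet at the end of this article) are:

# Core libraries for data preprocessing
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd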

Read: Machine Learning Project Ideas for Beginners

3. Import the dataset

In this step, you need to import the dataset(s) that you have gathered for the ML project at hand. However, before you can import the dataset(s), you must set the current directory as the working directory. You can set the working directory in Spyder IDE in three simple steps (a programmatic alternative is sketched after these steps):

  1. Save your Python file in the directory containing the dataset.
  2. Go to the File Explorer option in the Spyder IDE and choose the required directory.
  3. Press F5 or click the Run option to execute the file.
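If you prefer to set the working directory from code rather than through the Spyder interface, a minimal sketch using Python's os module (the path shown is a placeholder, not from the article) is:

import os

# Point the working directory at the folder that contains Dataset.csv
# ('path/to/project' is a placeholder for your own directory)
os.chdir('path/to/project')
print(os.getcwd())  # confirm the current working directory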

(Screenshot: the Spyder File Explorer showing the working directory that contains the dataset.)

Once you’ve set the working directory containing the relevant dataset, you can import the dataset using the “read_csv()” function of the Pandas library. This function can read a CSV file (either locally or through a URL) and also perform various operations on it. The read_csv() call is written as:

data_set = pd.read_csv('Dataset.csv')

In this line of code, “data_set” denotes the name of the variable wherein you store the dataset. The function takes the name of the dataset file as its argument. Once you execute this code, the dataset will be successfully imported. 
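Since read_csv() also accepts a URL, a hosted copy of the file could be loaded in the same way (the address below is purely illustrative):

# read_csv works with a URL as well as a local path (hypothetical address for illustration)
data_set = pd.read_csv('https://example.com/Dataset.csv')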

During the dataset importing process, there’s another essential thing you must do – extracting dependent and independent variables. For every Machine Learning model, it is necessary to separate the independent variables (matrix of features) and dependent variables in a dataset. 

Consider this dataset:

Country   Age   Salary   Purchased
India     38    68000    No
France    43    45000    Yes
Germany   30    54000    No
France    48    65000    No
Germany   40    NaN      Yes
India     35    58000    Yes
Germany   NaN   53000    No
France    49    79000    Yes
India     50    88000    No
France    37    77000    Yes

This dataset contains three independent variables – country, age, and salary, and one dependent variable – purchased. 

How to extract the independent variables?

To extract the independent variables, you can use Pandas’ “iloc[ ]” indexer. It selects rows and columns from the dataset by position.

x= data_set.iloc[:,:-1].values  

In the line of code above, the first colon (:) selects all the rows, while “:-1” selects all the columns except the last one, which contains the dependent variable. By executing this code, you will obtain the matrix of features, like this – 

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]

How to extract the dependent variable?

You can use “iloc[ ]” to extract the dependent variable as well. Here’s how you write it:

y= data_set.iloc[:,3].values  

This line of code selects all the rows but only the last column. By executing the above code, you will get the array of dependent variables, like so – 

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

4. Identifying and handling the missing values

In data preprocessing, it is pivotal to identify and correctly handle the missing values; failing to do this, you might draw inaccurate and faulty conclusions and inferences from the data. Needless to say, this will hamper your ML project. 

Basically, there are two ways to handle missing data:

  • Deleting a particular row – In this method, you remove a specific row that has a null value for a feature, or a particular column where more than 75% of the values are missing. However, this method is not 100% efficient, and it is recommended that you use it only when the dataset has an adequate number of samples. You must also ensure that deleting the data does not introduce bias. 
  • Calculating the mean – This method is useful for features with numeric data such as age, salary, year, etc. Here, you calculate the mean, median, or mode of the particular feature, column, or row that contains a missing value and replace the missing value with the result (see the sketch after this list). This method can add variance to the dataset, and any loss of data can be efficiently negated. Hence, it yields better results than simply omitting rows or columns. Another way of approximation is through the deviation of neighbouring values; however, this works best for linear data.
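As a minimal sketch of the mean-replacement approach, recent scikit-learn versions provide the SimpleImputer class (the combined snippet at the end of this article uses the older Imputer class), which can be applied to the numeric columns like this:

import numpy as np
from sklearn.impute import SimpleImputer

# Replace NaN entries in the numeric columns of x (age and salary, columns 1 and 2)
# with the mean of each column; x is the matrix of features built in step 3.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:3] = imputer.fit_transform(x[:, 1:3])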

Read: Applications of Machine Learning Using Cloud

5. Encoding the categorical data

Categorical data refers to the information that has specific categories within the dataset. In the dataset cited above, there are two categorical variables – country and purchased.

Machine Learning models are primarily based on mathematical equations. Intuitively, then, keeping categorical (text) data in the equations will cause problems, since the equations only work with numbers.

How to encode the country variable?

As seen in our dataset example, the country column will cause problems, so you must convert it into numerical values. To do so, you can use the LabelEncoder() class from the scikit-learn library. The code will be as follows –

#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

And the output will be – 

Out[15]:
array([[2, 38.0, 68000.0],
       [0, 43.0, 45000.0],
       [1, 30.0, 54000.0],
       [0, 48.0, 65000.0],
       [1, 40.0, 65222.22222222222],
       [2, 35.0, 58000.0],
       [1, 41.111111111111114, 53000.0],
       [0, 49.0, 79000.0],
       [2, 50.0, 88000.0],
       [0, 37.0, 77000.0]], dtype=object)

Here we can see that the LabelEncoder class has successfully encoded the country variable into digits. However, the countries are encoded as 0, 1, and 2 in the output shown above, so the ML model may assume that there is some correlation (ordering) between these three values, thereby producing faulty output. To eliminate this issue, we will now use Dummy Encoding.

Dummy variables are those that take the values 0 or 1 to indicate the absence or presence of a specific categorical effect that can shift the outcome. Here, the value 1 indicates the presence of that category in a particular column, while the other columns take the value 0. In dummy encoding, the number of columns equals the number of categories.

Since our dataset has three categories, it will produce three columns having the values 0 and 1. For Dummy Encoding, we will use the OneHotEncoder class of the scikit-learn library. The input code will be as follows – 

#for Country Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
onehot_encoder = OneHotEncoder(categorical_features=[0])
x = onehot_encoder.fit_transform(x).toarray()

 On execution of this code, you will get the following output –

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
        4.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        6.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.52222222e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
        5.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
        8.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        7.70000000e+04]])

In the output shown above, the country variable is split into three dummy columns encoded with the values 0 and 1 (the first three columns), followed by the age and salary values.

How to encode the purchased variable?

For the second categorical variable, that is, purchased, you can use the “labelencoder” object of the LabelEncoder class. We are not using the OneHotEncoder class since the purchased variable has only two categories, yes or no, which are encoded into 0 and 1.

The input code for this variable will be – 

labelencoder_y= LabelEncoder()  

y= labelencoder_y.fit_transform(y) 

The output will be – 

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
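Note that recent scikit-learn releases have removed the categorical_features argument from OneHotEncoder; a minimal sketch of the equivalent dummy encoding with ColumnTransformer (assuming the same feature matrix x as above) is:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 0 (country) and pass the age and salary columns through unchanged
ct = ColumnTransformer(
    transformers=[('country', OneHotEncoder(), [0])],
    remainder='passthrough')
x = ct.fit_transform(x)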

6. Splitting the dataset

Every dataset for a Machine Learning model must be split into two separate sets – a training set and a test set. 


Training set denotes the subset of a dataset that is used for training the machine learning model. Here, you are already aware of the output. A test set, on the other hand, is the subset of the dataset that is used for testing the machine learning model. The ML model uses the test set to predict outcomes. 

Usually, the dataset is split in a 70:30 or 80:20 ratio. This means that you take either 70% or 80% of the data for training the model and leave out the remaining 30% or 20% for testing. The exact split varies according to the shape and size of the dataset in question. 

To split the dataset, you have to write the following lines of code – 

 from sklearn.model_selection import train_test_split  

x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)  

Here, the first line imports the train_test_split function from scikit-learn, and the second line splits the arrays of the dataset into random train and test subsets, assigning them to four variables:

  • x_train – features for the training data
  • x_test – features for the test data
  • y_train – dependent variable for the training data
  • y_test – dependent variable for the test data

Thus, the train_test_split() call here takes four parameters, the first two of which are the arrays of data. The test_size parameter specifies the size of the test set; it may be 0.5, 0.3, or 0.2, which sets the dividing ratio between the training and test sets. The last parameter, “random_state”, sets the seed for a random generator so that the output is always the same. 
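For instance, with the ten-row dataset used above and test_size=0.2, eight rows go to the training set and two to the test set; a quick sanity check (a sketch, assuming the encoded feature matrix has five columns) is:

# With 10 rows and test_size=0.2, 8 rows go to training and 2 to testing
print(x_train.shape, x_test.shape)   # expected: (8, 5) (2, 5)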

7. Feature scaling

Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to standardize the independent variables of a dataset within a specific range. In other words, feature scaling limits the range of variables so that you can compare them on common grounds.

Consider, for example, the age and salary columns of the dataset used above.

In the dataset, you can notice that the age and salary columns do not have the same scale. In such a scenario, if you compute any two values from the age and salary columns, the salary values will dominate the age values and deliver incorrect results. Thus, you must remove this issue by performing feature scaling for Machine Learning.

Most ML models are based on Euclidean distance, which for two points (x1, y1) and (x2, y2) is represented as:

distance = √((x2 - x1)² + (y2 - y1)²)

You can perform feature scaling in Machine Learning in two ways:

Standardization

x_stand = (x - mean(x)) / standard deviation(x)

Normalization

x_norm = (x - min(x)) / (max(x) - min(x))
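Although we will use standardization below, normalization can be applied in the same way with scikit-learn's MinMaxScaler (a minimal sketch, assuming the x_train and x_test arrays from step 6):

from sklearn.preprocessing import MinMaxScaler

# Rescale every feature to the [0, 1] range (min-max normalization)
mm_scaler = MinMaxScaler()
x_train_norm = mm_scaler.fit_transform(x_train)
x_test_norm = mm_scaler.transform(x_test)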

For our dataset, we will use the standardization method. To do so, we will import the StandardScaler class of the scikit-learn library using the following line of code:

from sklearn.preprocessing import StandardScaler  

The next step will be to create an object of the StandardScaler class for the independent variables. After this, you can fit and transform the training dataset using the following code:

st_x= StandardScaler()  

x_train= st_x.fit_transform(x_train) 

For the test dataset, you can directly apply the transform() function (there is no need to use fit_transform() because the scaler is already fitted on the training set). The code will be as follows – 

x_test= st_x.transform(x_test) 

After executing this code, x_train and x_test hold the scaled values. All the variables are now on a comparable scale, with most values lying roughly between -1 and 1.

Now, to combine all the steps we’ve performed so far, you get: 

 

# Importing libraries
import numpy as nm
import matplotlib.pyplot as mtp
import pandas as pd

# Importing the dataset
data_set = pd.read_csv('Dataset.csv')

# Extracting the independent variables (country, age, salary)
x = data_set.iloc[:, :-1].values

# Extracting the dependent variable (purchased)
y = data_set.iloc[:, 3].values

# Handling missing data (replacing missing values with the mean value)
# Note: Imputer was removed from newer scikit-learn releases; sklearn.impute.SimpleImputer is its replacement
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)

# Fitting the imputer object to the numeric independent variables in x
imputer = imputer.fit(x[:, 1:3])

# Replacing missing data with the calculated mean value
x[:, 1:3] = imputer.transform(x[:, 1:3])

# Encoding the country variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

# Encoding for dummy variables
# Note: categorical_features was removed in newer scikit-learn; see the ColumnTransformer sketch in step 5
onehot_encoder = OneHotEncoder(categorical_features=[0])
x = onehot_encoder.fit_transform(x).toarray()

# Encoding the purchased variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Feature scaling of the datasets
from sklearn.preprocessing import StandardScaler
st_x = StandardScaler()
x_train = st_x.fit_transform(x_train)
x_test = st_x.transform(x_test)

 

So, that’s data preprocessing in Machine Learning in a nutshell!

If you’re interested in learning more about Machine Learning, check out IIIT-B’s PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.

