Data Preprocessing in Machine Learning: 7 Easy Steps To Follow


In this article, you will learn about data preprocessing in Machine Learning: 7 easy steps to follow.

  1. Acquire the dataset
  2. Import all the crucial libraries
  3. Import the dataset
  4. Identifying and handling the missing values
  5. Encoding the categorical data
  6. Splitting the dataset
  7. Feature scaling

Read on to learn about each step in detail.

Data preprocessing in Machine Learning is a crucial step that improves the quality of data so that meaningful insights can be extracted from it. It refers to the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training Machine Learning models. In simple words, data preprocessing is a data mining technique that transforms raw data into an understandable and readable format.

Why Data Preprocessing in Machine Learning?

When creating a Machine Learning model, data preprocessing is the first step. Typically, real-world data is incomplete, inconsistent, inaccurate (contains errors or outliers), and often lacks specific attribute values/trends. This is where data preprocessing comes in: it helps to clean, format, and organize the raw data, thereby making it ready for Machine Learning models. Let’s explore the various steps of data preprocessing in machine learning.


Steps in Data Preprocessing in Machine Learning

 There are seven significant steps in data preprocessing in Machine Learning:

 1. Acquire the dataset

Acquiring the dataset is the first step in data preprocessing in machine learning. To build and develop Machine Learning models, you must first acquire the relevant dataset. This dataset will comprise data gathered from multiple, disparate sources, which are then combined in a proper format. Dataset formats differ according to use cases. For instance, a business dataset will be entirely different from a medical dataset. While a business dataset will contain relevant industry and business data, a medical dataset will include healthcare-related data.

There are several online sources from which you can download datasets. You can also create a dataset by collecting data via different Python APIs. Once the dataset is ready, you must put it in a CSV, HTML, or XLSX file format.

2. Import all the crucial libraries

Since Python is one of the most widely used and preferred languages among Data Scientists around the world, we’ll show you how to import Python libraries for data preprocessing in Machine Learning. The predefined Python libraries can perform specific data preprocessing jobs. Importing all the crucial libraries is the second step in data preprocessing in machine learning. The three core Python libraries used for data preprocessing in Machine Learning are:

  • NumPy – NumPy is the fundamental package for scientific computing in Python. It is used for performing mathematical operations in the code, and it lets you work with large multidimensional arrays and matrices.
  • Pandas – Pandas is an excellent open-source Python library for data manipulation and analysis. It is extensively used for importing and managing the datasets. It packs in high-performance, easy-to-use data structures and data analysis tools for Python.
  • Matplotlib – Matplotlib is a Python 2D plotting library that is used to plot any type of charts in Python. It can deliver publication-quality figures in numerous hard copy formats and interactive environments across platforms (IPython shells, Jupyter notebook, web application servers, etc.). 
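As a minimal sketch, the three libraries are usually imported under the short aliases shown below. The aliases `np`, `pd`, and `plt` are a common convention rather than a requirement, and the tiny array and frame are illustrative values only.

```python
import numpy as np               # numerical arrays and math
import pandas as pd              # tabular data loading and manipulation
import matplotlib.pyplot as plt  # plotting

# Illustrative values only: three ages held in a NumPy array,
# then wrapped in a one-column pandas DataFrame.
ages = np.array([38.0, 43.0, 30.0])
frame = pd.DataFrame({"age": ages})

print(ages.mean())
print(frame.shape)
```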


3. Import the dataset

In this step, you need to import the dataset/s that you have gathered for the ML project at hand. Importing the dataset is one of the important steps in data preprocessing in machine learning. However, before you can import the dataset/s, you must set the current directory as the working directory. You can set the working directory in Spyder IDE in three simple steps:

  1. Save your Python file in the directory containing the dataset.
  2. Go to File Explorer option in Spyder IDE and choose the required directory.
  3. Now, click on the F5 button or Run option to execute the file.

[Image: how the working directory should look in Spyder.]

Once you’ve set the working directory containing the relevant dataset, you can import the dataset using the “read_csv()” function of the Pandas library. This function can read a CSV file (either locally or through a URL) and also perform various operations on it. The read_csv() is written as:

data_set= pd.read_csv('Dataset.csv')

In this line of code, “data_set” denotes the name of the variable wherein you stored the dataset. The function contains the name of the dataset as well. Once you execute this code, the dataset will be successfully imported. 

During the dataset importing process, there’s another essential thing you must do – extracting dependent and independent variables. For every Machine Learning model, it is necessary to separate the independent variables (matrix of features) and dependent variables in a dataset. 

Consider this dataset:

[Image: sample dataset with the columns country, age, salary, and purchased.]
This dataset contains three independent variables – country, age, and salary, and one dependent variable – purchased. 
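If you do not have Dataset.csv at hand, the sketch below rebuilds an equivalent dataset in memory and reads it with read_csv(). The ten rows are taken from the feature matrix and label array printed in this article. The column header names and the use of io.StringIO in place of a file on disk are assumptions made for illustration.

```python
import io
import pandas as pd

# Rows copied from the arrays printed in this article; the empty fields
# stand for the missing age and salary values. Header names are assumed.
csv_text = """Country,Age,Salary,Purchased
India,38,68000,No
France,43,45000,Yes
Germany,30,54000,No
France,48,65000,No
Germany,40,,Yes
India,35,58000,Yes
Germany,,53000,No
France,49,79000,Yes
India,50,88000,No
France,37,77000,Yes
"""

# read_csv accepts a file path, a URL, or any file-like object.
data_set = pd.read_csv(io.StringIO(csv_text))

print(data_set.shape)
print(data_set.isna().sum())  # number of missing values per column
```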

How to extract the independent variables?

To extract the independent variables, you can use the “iloc[ ]” indexer of the Pandas library. It extracts selected rows and columns from the dataset by position.

x= data_set.iloc[:,:-1].values  

In the line of code above, the first colon (:) selects all the rows, and the second slice (:-1) selects all the columns except the last one, which contains the dependent variable. By executing this code, you will obtain the matrix of features, like this – 

[['India' 38.0 68000.0]
 ['France' 43.0 45000.0]
 ['Germany' 30.0 54000.0]
 ['France' 48.0 65000.0]
 ['Germany' 40.0 nan]
 ['India' 35.0 58000.0]
 ['Germany' nan 53000.0]
 ['France' 49.0 79000.0]
 ['India' 50.0 88000.0]
 ['France' 37.0 77000.0]]


How to extract the dependent variable?

You can use the “iloc[ ]” indexer to extract the dependent variable as well. Here’s how you write it:

y= data_set.iloc[:,3].values  

This line of code considers all the rows with the last column only. By executing the above code, you will get the array of dependent variables, like so – 

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'], dtype=object)
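The two slices can be tried on a small frame, as in the sketch below. The three-row DataFrame is a hypothetical stand-in with the same column layout as the article's dataset. It uses -1 instead of 3 for the label column, which selects the last column regardless of how many features there are.

```python
import pandas as pd

# Hypothetical three-row frame with the same column layout as the dataset.
data_set = pd.DataFrame({
    "Country": ["India", "France", "Germany"],
    "Age": [38.0, 43.0, 30.0],
    "Salary": [68000.0, 45000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

x = data_set.iloc[:, :-1].values  # all rows, every column except the last
y = data_set.iloc[:, -1].values   # all rows, last column only

print(x.shape)
print(y.shape)
```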



4. Identifying and handling the missing values

In data preprocessing, it is pivotal to identify and correctly handle the missing values. If you fail to do this, you might draw inaccurate and faulty conclusions and inferences from the data, which will hamper your ML project.

Basically, there are two ways to handle missing data:

  • Deleting a particular row – In this method, you remove a specific row that has a null value for a feature, or a particular column where more than 75% of the values are missing. However, this method discards information, so it is recommended only when the dataset has adequate samples. You must also ensure that deleting the data does not introduce bias.
  • Calculating the mean – This method is useful for features having numeric data like age, salary, year, etc. Here, you calculate the mean, median, or mode of the feature or column that contains a missing value and substitute the result for the missing value. Because no rows are discarded, the loss of data is avoided, and this method often yields better results than the first one (omission of rows/columns). Note that filling in a single value reduces the variance of the feature rather than adding to it. Another way of approximation is through the deviation of neighbouring values. However, this works best for linear data.
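Both approaches can be sketched in a few lines. The example below assumes a recent scikit-learn, where mean imputation is provided by SimpleImputer in sklearn.impute. The age and salary values are the ones printed in this article, with np.nan marking the gaps.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Age and salary values as printed in this article; np.nan marks the gaps.
numeric = pd.DataFrame({
    "Age": [38, 43, 30, 48, 40, 35, np.nan, 49, 50, 37],
    "Salary": [68000, 45000, 54000, 65000, np.nan,
               58000, 53000, 79000, 88000, 77000],
})

# Option 1: drop every row that contains at least one missing value.
dropped = numeric.dropna()

# Option 2: replace each missing value with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputed = imputer.fit_transform(numeric)

print(len(numeric), len(dropped))    # rows before and after dropping
print(imputed[6, 0], imputed[4, 1])  # the two filled-in values
```

Dropping leaves 8 of the 10 rows, while imputation keeps all 10 and fills the gaps with the column means (about 41.11 for age and 65222.22 for salary, matching the values that appear in the encoded output later in this article).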


5. Encoding the categorical data

Categorical data refers to the information that has specific categories within the dataset. In the dataset cited above, there are two categorical variables – country and purchased.

Machine Learning models are primarily based on mathematical equations. Thus, you can intuitively understand that keeping the categorical data in the equation will cause certain issues since you would only need numbers in the equations.

How to encode the country variable?

As seen in our dataset example, the country column will cause problems, so you must convert it into numerical values. To do so, you can use the LabelEncoder class from the scikit-learn library. The code will be as follows –

#Categorical data
#for Country Variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

 And the output will be – 


array([[2, 38.0, 68000.0],
       [0, 43.0, 45000.0],
       [1, 30.0, 54000.0],
       [0, 48.0, 65000.0],
       [1, 40.0, 65222.22222222222],
       [2, 35.0, 58000.0],
       [1, 41.111111111111114, 53000.0],
       [0, 49.0, 79000.0],
       [2, 50.0, 88000.0],
       [0, 37.0, 77000.0]], dtype=object)

Here we can see that the LabelEncoder class has successfully encoded the country names into digits. However, the countries are now encoded as 0, 1, and 2. So, the ML model may assume that there is some order or correlation between the three countries, thereby producing faulty output. To eliminate this issue, we will now use Dummy Encoding.

Dummy variables are those that take the values 0 or 1 to indicate the absence or presence of a specific categorical effect that can shift the outcome. In this case, the value 1 in a column indicates that the row belongs to that category, while the other category columns hold the value 0. In dummy encoding, the number of columns equals the number of categories.

Since our country variable has three categories, it will produce three columns having the values 0 and 1. For Dummy Encoding, we will use the OneHotEncoder class of the scikit-learn library. The input code will be as follows – 

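#for Country Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])
#Encoding for dummy variables
#(the categorical_features argument of OneHotEncoder was removed in scikit-learn 0.22; ColumnTransformer now selects the column to encode)
column_transformer= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= column_transformer.fit_transform(x).astype(float)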

 On execution of this code, you will get the following output –

array([[0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.30000000e+01,
        4.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        6.50000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.52222222e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.11111111e+01,
        5.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.90000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 5.00000000e+01,
        8.80000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        7.70000000e+04]])

In the output shown above, the country variable is split into three columns encoded with the values 0 and 1, followed by the age and salary columns.
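If the data is still in a pandas DataFrame, the same dummy columns can also be produced with pandas itself. The sketch below uses pandas.get_dummies, which is a different technique from the scikit-learn encoder used in this article. The four-row frame and its column name are illustrative assumptions.

```python
import pandas as pd

# Illustrative four-row frame; the column name "Country" is assumed.
countries = pd.DataFrame({"Country": ["India", "France", "Germany", "France"]})

# One 0/1 column per category, named <column>_<category>.
dummies = pd.get_dummies(countries, columns=["Country"], dtype=int)

print(dummies)
```

Each row contains exactly one 1, in the column of its own country, so no artificial ordering between the countries is introduced.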

How to encode the purchased variable?

For the second categorical variable, that is, purchased, you can use a “labelencoder” object of the LabelEncoder class. We are not using the OneHotEncoder class since the purchased variable has only two categories, yes and no, which are encoded as 1 and 0.

The input code for this variable will be – 

labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

The output will be – 

Out[17]: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
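To check which label maps to which number, you can inspect the fitted encoder. This is a small sketch that uses the purchased values printed in this article. The variable name y_encoded is introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# The purchased column as printed in this article.
y = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

labelencoder_y = LabelEncoder()
y_encoded = labelencoder_y.fit_transform(y)

print(labelencoder_y.classes_)  # position in this array = encoded value
print(y_encoded)
print(labelencoder_y.inverse_transform(y_encoded[:3]))  # back to labels
```

LabelEncoder sorts the labels, so “No” becomes 0 and “Yes” becomes 1, and inverse_transform() recovers the original strings when you need to report predictions.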


6. Splitting the dataset

Splitting the dataset is the next step in data preprocessing in machine learning. Every dataset for a Machine Learning model must be split into two separate sets: a training set and a test set.


The training set is the subset of the dataset that is used for training the machine learning model. Here, you are already aware of the output. The test set, on the other hand, is the subset of the dataset that is used for testing the machine learning model: the model predicts outcomes for the test set, and those predictions are compared with the known outputs.

Usually, the dataset is split in a 70:30 or 80:20 ratio. This means that you take either 70% or 80% of the data for training the model and hold out the remaining 30% or 20% for testing. The splitting process varies according to the shape and size of the dataset in question.

To split the dataset, you have to write the following lines of code – 

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

Here, the first line imports the train_test_split() function, and the second line splits the arrays of the dataset into random train and test subsets. The second line assigns four variables:

  • x_train – features for the training data
  • x_test – features for the test data
  • y_train – dependent variable for the training data
  • y_test – dependent variable for the test data

The call to train_test_split() passes four arguments, the first two of which are the arrays of data. The test_size parameter specifies the size of the test set. It may be .5, .3, or .2, which sets the dividing ratio between the training and test sets. The last parameter, “random_state”, sets the seed for the random generator so that the output is always the same.
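The effect of test_size and random_state can be checked on a toy array, as in the sketch below. The feature values are placeholders and the labels are the encoded purchased values from this article. The names x_train_again and x_test_again are introduced only to show that a fixed seed reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x = np.arange(20).reshape(10, 2)  # 10 samples, 2 placeholder features
y = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0
)

# Calling the function again with the same seed gives the same split.
x_train_again, x_test_again, _, _ = train_test_split(
    x, y, test_size=0.2, random_state=0
)

print(x_train.shape, x_test.shape)
print(np.array_equal(x_test, x_test_again))
```

With ten samples and test_size=0.2, eight rows go to the training set and two to the test set.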

7. Feature scaling

Feature scaling marks the end of the data preprocessing in Machine Learning. It is a method to standardize the independent variables of a dataset within a specific range. In other words, feature scaling limits the range of variables so that you can compare them on common grounds.

In the example dataset, the age and salary columns do not have the same scale: ages are in the tens, while salaries are in the tens of thousands. In such a scenario, if you compute a distance between two rows using both columns, the salary values will dominate the age values and deliver misleading results. Thus, you must remove this issue by performing feature scaling for Machine Learning.

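Many ML models rely on the Euclidean distance between two data points, which for the points $(x_1, y_1)$ and $(x_2, y_2)$ is represented as:

$$d = \sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}$$
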
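You can perform feature scaling in Machine Learning in two ways:

  • Standardization – rescales each feature to zero mean and unit standard deviation: $x' = \frac{x - \text{mean}(x)}{\text{std}(x)}$
  • Normalization – rescales each feature to the range 0 to 1: $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
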
For our dataset, we will use the standardization method. To do so, we will import the StandardScaler class of the scikit-learn library using the following line of code:

from sklearn.preprocessing import StandardScaler  

The next step will be to create the object of StandardScaler class for independent variables. After this, you can fit and transform the training dataset using the following code:

st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)

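For the test dataset, you can directly apply the transform() function. You need not use fit_transform() again because the scaler has already been fitted on the training set, and reusing those statistics keeps information from the test set out of the scaling. The code will be as follows – 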

x_test= st_x.transform(x_test) 

Printing x_train and x_test then shows the scaled values:

[Image: scaled values of x_train and x_test.]

After standardization, each column of x_train has a mean of 0 and a standard deviation of 1, so most values lie within a few units of zero rather than inside a fixed range such as -1 to 1.
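The following sketch checks what standardization actually produces and compares it with min-max normalization. The age and salary pairs are taken from this article's dataset, but their assignment to training and test rows is arbitrary. MinMaxScaler is a scikit-learn class that this article does not otherwise use, and the names ending in _std and _minmax are introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and salary pairs from this article's dataset; the train/test
# assignment is arbitrary and only for illustration.
x_train = np.array([
    [38.0, 68000.0], [43.0, 45000.0], [30.0, 54000.0],
    [48.0, 65000.0], [35.0, 58000.0], [49.0, 79000.0],
])
x_test = np.array([[50.0, 88000.0], [37.0, 77000.0]])

# Standardization: fit on the training set only, then reuse the same
# mean and standard deviation for the test set.
st_x = StandardScaler()
x_train_std = st_x.fit_transform(x_train)
x_test_std = st_x.transform(x_test)

# The same result computed by hand: (x - mean) / standard deviation.
manual = (x_train - x_train.mean(axis=0)) / x_train.std(axis=0)

# Min-max normalization, for comparison: training values land in [0, 1].
x_train_minmax = MinMaxScaler().fit_transform(x_train)

print(np.allclose(x_train_std, manual))
print(x_train_std.mean(axis=0).round(6), x_train_std.std(axis=0).round(6))
print(np.abs(x_test_std).max())
print(x_train_minmax.min(axis=0), x_train_minmax.max(axis=0))
```

Standardized values are not confined to a fixed interval: the test salary of 88000 lies more than two standard deviations above the training mean. Min-max normalization, by contrast, maps the training data exactly onto the range 0 to 1.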

Now, to combine all the steps we’ve performed so far, you get: 


# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing datasets
data_set= pd.read_csv('Dataset.csv')

#Extracting Independent Variable
x= data_set.iloc[:, :-1].values

#Extracting Dependent variable
y= data_set.iloc[:, 3].values

#handling missing data (Replacing missing data with the mean value)
#SimpleImputer replaces the Imputer class, which was removed in scikit-learn 0.22
from sklearn.impute import SimpleImputer
imputer= SimpleImputer(missing_values=np.nan, strategy='mean')

#Fitting imputer object to the independent variables x.
imputer= imputer.fit(x[:, 1:3])

#Replacing missing data with the calculated mean value
x[:, 1:3]= imputer.transform(x[:, 1:3])

#for Country Variable
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
label_encoder_x= LabelEncoder()
x[:, 0]= label_encoder_x.fit_transform(x[:, 0])

#Encoding for dummy variables
#(the categorical_features argument of OneHotEncoder was removed in scikit-learn 0.22; ColumnTransformer now selects the column to encode)
column_transformer= ColumnTransformer([('country', OneHotEncoder(), [0])], remainder='passthrough', sparse_threshold=0)
x= column_transformer.fit_transform(x).astype(float)

#encoding for purchased variable
labelencoder_y= LabelEncoder()
y= labelencoder_y.fit_transform(y)

# Splitting the dataset into training and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test= train_test_split(x, y, test_size= 0.2, random_state=0)

#Feature Scaling of datasets
from sklearn.preprocessing import StandardScaler
st_x= StandardScaler()
x_train= st_x.fit_transform(x_train)
x_test= st_x.transform(x_test)

So, that’s data preprocessing in Machine Learning in a nutshell!


What is the importance of data preprocessing?

Because errors, redundancies, missing values, and inconsistencies all jeopardize the dataset's integrity, you must address all of them for a more accurate result. Assume you're using a defective dataset to train a Machine Learning system to deal with your clients' purchases. The system is likely to generate biases and deviations, resulting in a bad user experience. As a result, before you use that data for your intended purpose, it must be as organized and 'clean' as feasible. Depending on the type of difficulty you're dealing with, there are numerous options.

What is data cleaning?

There will almost certainly be missing and noisy data in your data sets. Because the data collection procedure isn't ideal, you'll have a lot of useless and missing information. Data cleaning is the method you should employ to deal with this problem. It can be divided into two categories. The first deals with missing data: for example, you can choose to ignore a record (called a tuple) that contains missing values. The second deals with noisy data: it's critical to get rid of useless data that can't be read by the systems if you want the entire process to run smoothly.

What do you mean by data transformation and reduction?

After these concerns are dealt with, data preprocessing moves on to the transformation stage, which converts data into forms suitable for analysis. Normalization, attribute selection, discretization, and concept hierarchy generation are some of the approaches that can be used to accomplish this. Even for automated methods, sifting through large datasets can take a long time. That is why the data reduction stage is so crucial: it reduces the size of data sets by limiting them to the most important information, increasing storage efficiency while lowering the financial and time expenses of working with them.
