Data Preprocessing In Data Mining: Steps, Missing Value Imputation, Data Standardization

Last updated: 30th Dec, 2020

The most time-consuming part of a Data Scientist’s job is preparing and preprocessing the data at hand. The data we get in real-life scenarios is rarely clean or suitable for modelling. It needs to be cleaned, brought into a consistent format, and transformed before it is fed to Machine Learning models.

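By the end of this tutorial, you will know why preprocessing is needed and how to approach data cleaning, missing values, data standardization, and discretization.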

Why Data Preprocessing?

When data is retrieved by scraping websites or gathered from other data sources, it is generally full of discrepancies. These can be formatting issues, missing values, garbage values and text, or even outright errors in the data. Several preprocessing steps are needed to make sure that the data fed to the model is of good enough quality for the model to learn from it and generalize.

Data Cleaning

The first and most essential step is to clean the irregularities in the data. Without this step, we cannot make much sense of the statistics of the data. The irregularities can be formatting issues, garbage values, and outliers.

Formatting issues

Most of the time we need the data in a tabular format, but it rarely arrives that way. The data might have missing or incorrect column names, or blank columns. Moreover, when dealing with unstructured data such as images and text, it becomes essential to get the 2D or 3D data loaded into DataFrames for modelling.
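As a minimal sketch of fixing blank columns and inconsistent column names, assuming the data is already loaded in pandas (the column names and values below are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical raw table: padded/inconsistent column names and one blank column.
raw = pd.DataFrame({
    " Rank ": [1, 3],
    "Unnamed: 1": [np.nan, np.nan],
    "CITY": ["Delhi", "Agra"],
})

# Drop columns in which every value is missing.
clean = raw.dropna(axis="columns", how="all")

# Normalize the remaining column names: trim whitespace and lowercase.
clean.columns = clean.columns.str.strip().str.lower()

print(list(clean.columns))
```

Which naming convention to normalize to is a project choice; the point is only to apply one consistently before modelling.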

Garbage Values

Many instances, or even complete columns, might have garbage characters attached to the value that is actually required. For example, consider a column “rank” with values such as “#1”, “#3”, “#12”, “#2”, etc. Here it is important to remove the leading “#” characters so that the numeric value can be fed to the model.
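The “rank” example can be sketched with pandas string methods. This assumes every value carries only a leading “#”; messier data would likely need a regular expression instead.

```python
import pandas as pd

# The "rank" column from the example, with a leading "#" on every value.
df = pd.DataFrame({"rank": ["#1", "#3", "#12", "#2"]})

# Strip the leading "#" and convert the remaining text to integers.
df["rank"] = df["rank"].str.lstrip("#").astype(int)

print(df["rank"].tolist())  # [1, 3, 12, 2]
```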

Outliers

Often, certain numeric values are far larger or far smaller than the typical values of their column. These are considered outliers. Outliers are sensitive to handle and need special treatment: they might be measurement errors, or they might be real values. They either need to be removed completely or handled separately, as they might contain a lot of important information.
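The article does not prescribe a detection rule. One common heuristic, shown here purely as an assumption, is the 1.5 × IQR rule: flag values that lie more than 1.5 interquartile ranges outside the first or third quartile. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical measurements; 95 looks suspicious next to the rest.
values = pd.Series([10, 12, 11, 13, 12, 95])

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag, rather than silently delete, the values outside the fences.
is_outlier = (values < lower) | (values > upper)
outliers = values[is_outlier]

print(outliers.tolist())  # [95]
```

Flagging first keeps the decision (remove, cap, or model separately) explicit, which matters because an outlier may be a real value.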

Missing Values

It is seldom the case that your data contains every value for every instance. Many values are missing or filled with garbage entries, and these need to be treated. Values can be missing for several reasons: there may be a specific cause, such as a sensor error, or they may be missing completely at random.

Dropping

The most straightforward way is to drop the rows where values are missing. This has disadvantages, chiefly the loss of crucial information. Dropping can be a reasonable step when you have a huge amount of data. But if the dataset is small and has a lot of missing values, you need better ways to tackle the issue.
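A minimal sketch of dropping with pandas, using hypothetical “distance” and “time” columns. Checking the share of missing values first gives a rough idea of how much data dropping would cost.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "distance": [1300.0, np.nan, 560.0, 800.0],
    "time": [1.0, 2.5, np.nan, 0.8],
})

# Share of missing values per column.
missing_share = df.isna().mean()

# Drop every row that has at least one missing value.
complete_rows = df.dropna()

print(missing_share)
print(len(df), "rows before,", len(complete_rows), "rows after")
```

In this toy table, half of the rows disappear, which illustrates why dropping is risky on small datasets.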

Mean/Median/Mode imputation

The quickest way to impute missing values is to fill them with the mean value of the column. However, this distorts the original distribution of the data, for example by shrinking its variance. You can also impute the median, which is less sensitive to outliers than the mean, or the mode, which also works for categorical columns.
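A sketch of median and mode imputation with pandas `fillna`, on hypothetical columns. The choice of median for the numeric column and mode for the categorical one is an illustrative assumption, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one numeric and one categorical column, each with a gap.
df = pd.DataFrame({
    "time": [1.0, 2.5, np.nan, 0.8, 3.2],
    "temperature": ["High", "Low", None, "Low", "Medium"],
})

# Numeric column: fill the gap with the median of the observed values.
df["time_filled"] = df["time"].fillna(df["time"].median())

# Categorical column: fill the gap with the most frequent value (the mode).
most_frequent = df["temperature"].mode()[0]
df["temperature_filled"] = df["temperature"].fillna(most_frequent)

print(df)
```

Swapping `.median()` for `.mean()` gives mean imputation with the same pattern.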

Linear interpolation & KNN

Smarter methods can also be used to impute missing values. One is linear interpolation, which fills a gap from the values on either side of it; a related idea is to train a model that treats the column with blank values as the target to be predicted from the other features. Another is KNN imputation, which finds the rows most similar to the one with the missing value and fills the gap from the values those nearest neighbours have for that feature.
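A minimal sketch of both ideas, assuming pandas' `Series.interpolate` for linear interpolation and scikit-learn's `KNNImputer` for neighbour-based imputation. The data is hypothetical, and `n_neighbors=2` is an arbitrary choice for this tiny example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Linear interpolation: each gap is filled from the values on either side.
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
interpolated = s.interpolate(method="linear")
print(interpolated.tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]

# KNN imputation: the missing entry is filled with the average of that
# feature over the most similar rows.
X = np.array([
    [1.0, 2.0],
    [2.0, np.nan],
    [3.0, 6.0],
    [8.0, 16.0],
])
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
print(X_filled[1])  # [2. 4.]
```

Here the two rows closest to `[2.0, nan]` are `[1.0, 2.0]` and `[3.0, 6.0]`, so the gap is filled with the mean of 2.0 and 6.0.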

Data Standardization

In a data set with multiple numerical features, the features might not all be on the same scale. For example, a feature “Distance” has distances in meters such as 1300, 800, 560, etc., and another feature “time” has times in hours such as 1, 2.5, 3.2, 0.8, etc. When these two features are fed to a model, many algorithms will effectively give more weight to the distance feature simply because its values are larger. To avoid this, and to get faster convergence, it is necessary to bring all the features onto the same scale.

Normalization

A common way to scale the features is to normalize them. This can be implemented using Scikit-learn’s Normalizer. Note that it works on the rows, not on the columns: L2 normalization is applied to each observation so that the values in a row have unit norm after scaling.
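A short sketch of row-wise L2 normalization with scikit-learn's `Normalizer`, on a hypothetical two-row matrix:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Two hypothetical observations (rows) with two features each.
X = np.array([[3.0, 4.0],
              [1.0, 0.0]])

# Each row is divided by its own L2 norm.
X_norm = Normalizer(norm="l2").fit_transform(X)

# After scaling, every row has length 1.
row_norms = np.linalg.norm(X_norm, axis=1)

print(X_norm)     # first row becomes [0.6, 0.8]
print(row_norms)  # [1. 1.]
```

Because the scaling is per row, it does not put the columns on a common scale; the column-wise scalers described next do that.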

Min Max Scaling

Min Max scaling can be implemented using Scikit-learn’s MinMaxScaler class. For each feature, it subtracts the minimum value and then divides by the range, where the range is the difference between the original maximum and the original minimum. It preserves the shape of the original distribution and, by default, maps the values into the range 0 to 1.
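A minimal sketch using the distance values from the earlier example, comparing `MinMaxScaler` with the formula written out by hand:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# The "Distance" feature as a single column.
distance = np.array([[1300.0], [800.0], [560.0]])

# Scale to the default range [0, 1].
scaled = MinMaxScaler().fit_transform(distance)

# The same computation by hand: (x - min) / (max - min).
manual = (distance - distance.min()) / (distance.max() - distance.min())

print(scaled.ravel())
```

In a real pipeline the scaler would normally be fitted on the training data only and then applied to the test data with `transform`.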

Standard Scaling

Standard scaling can also be implemented with Scikit-learn, using the StandardScaler class. It standardizes a feature by subtracting the mean and then scaling to unit variance, which means dividing all the values by the standard deviation. The result has a mean of 0 and a standard deviation of 1.
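A sketch with the time values from the earlier example, again checked against the hand-written formula. Note that `StandardScaler` uses the population standard deviation, which matches NumPy's default `std()`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The "time" feature as a single column.
time = np.array([[1.0], [2.5], [3.2], [0.8]])

# Subtract the mean and divide by the standard deviation.
scaled = StandardScaler().fit_transform(time)

# The same computation by hand (np.std defaults to the population form).
manual = (time - time.mean()) / time.std()

print(scaled.ravel())
print(scaled.mean(), scaled.std())  # approximately 0 and 1
```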

Discretization

Often, data is not in numeric form but in categorical form. For example, consider a feature “temperature” with the values “High”, “Low”, and “Medium”. These textual values need to be encoded in numerical form before the model can train on them.

Categorical Data  

Categorical data is label encoded to bring it into numerical form. So “High”, “Medium”, and “Low” can be label encoded as 3, 2, and 1. Categorical features can be either nominal or ordinal. Ordinal categorical features are those that have a natural order. In the case above, we can say that 3 > 2 > 1 because the temperatures can be measured and ranked.

However, a feature such as “City”, with values like “Delhi”, “Jammu”, and “Agra”, has no such order. If we label encode these as 3, 2, 1, we cannot say that 3 > 2 > 1, because “Delhi” > “Jammu” does not make sense. In such cases, we use One Hot Encoding, which creates a separate 0/1 column for each category.
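A minimal sketch of both encodings with pandas. The explicit mapping for the ordinal feature and `pd.get_dummies` for the nominal one are choices made for this illustration; scikit-learn's encoders are an alternative.

```python
import pandas as pd

# Hypothetical rows using the categories from the text.
df = pd.DataFrame({
    "temperature": ["High", "Low", "Medium", "Low"],
    "city": ["Delhi", "Jammu", "Agra", "Delhi"],
})

# Ordinal feature: an explicit mapping keeps Low < Medium < High.
order = {"Low": 1, "Medium": 2, "High": 3}
df["temperature_encoded"] = df["temperature"].map(order)

# Nominal feature: one indicator column per city, with no implied order.
city_dummies = pd.get_dummies(df["city"], prefix="city")

encoded = pd.concat([df[["temperature_encoded"]], city_dummies], axis=1)
print(encoded)
```

Writing the mapping by hand matters for ordinal data: an automatic label encoder would typically assign codes alphabetically, which would not respect the Low, Medium, High order.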

Continuous Data

Features with continuous values can also be discretized by binning the values into bins of specific ranges. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges of the continuous values. This comes in handy when you want to see the trends based on what range the data point falls in. 

For example, say we have marks for 7 kids, each between 0 and 100. We can assign every kid’s marks to a particular “bin”: with 3 bins covering the ranges 0-50, 51-70, and 71-100, the marks fall into bins 1, 2, and 3 respectively. The feature will then contain only one of these 3 values. Pandas offers 2 functions to achieve binning quickly: qcut and cut.

Pandas qcut takes the number of quantiles and chooses the bin edges from the data distribution, so that each bin holds roughly the same number of data points.

Pandas cut, on the other hand, takes custom ranges defined by us and assigns the data points to those ranges.
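The marks example can be sketched as follows. The seven marks are hypothetical; the bin edges follow the ranges given above, and the labels 1 to 3 are the bin numbers.

```python
import pandas as pd

# Hypothetical marks for 7 kids.
marks = pd.Series([35, 48, 55, 62, 71, 88, 94])

# cut: custom edges giving the ranges 0-50, 51-70 and 71-100.
custom_bins = pd.cut(
    marks,
    bins=[0, 50, 70, 100],
    labels=[1, 2, 3],
    include_lowest=True,  # so that a mark of exactly 0 lands in bin 1
)
print(custom_bins.tolist())  # [1, 1, 2, 2, 3, 3, 3]

# qcut: 3 quantile-based bins; the edges are derived from the data itself.
quantile_bins = pd.qcut(marks, q=3, labels=[1, 2, 3])
print(quantile_bins.value_counts().sort_index())
```

With `cut` the bin sizes depend on where the data happens to fall; with `qcut` the edges move so that the bins are roughly equally populated.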

Conclusion

Data Preprocessing is an essential step in any Data Mining and Machine Learning task. The steps we discussed are not exhaustive, but they cover most of the basics of the process. Preprocessing techniques also differ for NLP and image data. Make sure to try out examples of the steps above and implement them in your Data Mining pipeline.

Rohit Sharma

Blog Author
Rohit Sharma is the Program Director for the UpGrad-IIIT Bangalore, PG Diploma Data Analytics Program.

Frequently Asked Questions (FAQs)

1. What is data preprocessing and what is its significance?

Data preprocessing is the technique of preparing raw, often unstructured data, such as images, text, and video, for analysis. The data is first preprocessed to remove inconsistencies, errors, and redundancies so that it can be analyzed later.

The raw data is transformed into a form that machines can work with. Preprocessing is an important step in getting data ready for modelling; without it, raw data is of little practical use.

2. What are the steps involved in data preprocessing?

Data preprocessing involves several steps. The data is first cleaned to remove noise and fill in missing values. After this, data from multiple sources is integrated into a single data set. These steps are then followed by transformation, reduction, and discretization.

Transformation of the raw data involves normalizing it. Reduction and discretization deal with reducing the attributes and dimensions of the data, which is followed by compressing the large data set.

3. What is the difference between univariate and multivariate methods?

The univariate method is the simplest way to handle an outlier. It looks at a single variable, so it does not examine relationships between variables; its main purpose is to describe that variable and the patterns in it. Mean, median, and mode are examples of summaries computed on univariate data.

The multivariate method, on the other hand, analyzes multiple variables together. It can be more precise than the univariate method because it takes the relationships and patterns between variables into account. Additive Tree, Canonical Correlation Analysis, and Cluster Analysis are some of the ways to perform multivariate analysis.
