How to Handle Missing Data: A Complete Guide for Data Scientists
By Rahul Singh
Updated on Jun 17, 2026 | 10 min read | 4.76K+ views
Handling missing data is a critical step in data preprocessing because incomplete values can reduce data quality and affect the accuracy of machine learning models. Missing values often occur due to data collection issues, system errors, or incomplete user inputs.
To handle missing data effectively, data scientists first identify why values are missing and then apply suitable techniques such as deletion, statistical imputation, or predictive methods. Choosing the right approach helps improve data reliability and model performance.
In this blog, you'll learn practical techniques to identify, understand, and handle missing data effectively. You'll discover why different approaches work in different situations, see Python code examples you can use immediately, and understand the mistakes that even experienced data scientists make.
Transform your career with upGrad’s Data Science Course. Learn from industry experts, work on hands-on projects, and gain the skills top employers demand.
Before picking a method, you need a simple process. Most beginners jump straight to deleting rows or filling in the average value without checking what is actually happening in their data. That is a mistake.
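Here is a step-by-step approach that works for almost any project: detect where values are missing, measure how much is missing in each column, work out why the values are missing, and only then choose between deletion and imputation.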
This is the core loop you will repeat every time you need to handle missing data. The exact technique changes depending on the size of your dataset, the percentage of missing values, and what you plan to do with the data next.
A quick way to decide your first move:
| Situation | Suggested First Step |
| --- | --- |
| Less than 5% missing in a column | Consider simple deletion or basic imputation |
| 5% to 20% missing | Use mean, median, or mode imputation, or a model based method |
| More than 20% missing | Investigate why, consider dropping the column or using advanced imputation |
| Missing data linked to another variable | Use that relationship to guide imputation |
Knowing how to handle missing data is not about memorizing one formula. It is about asking the right questions before you touch the dataset. Once you understand the scale and pattern of the problem, picking a technique becomes much easier.
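The rule-of-thumb thresholds in the table above can be sketched as a small helper. This is only an illustration: the function name `suggest_first_step`, the toy DataFrame, and its column names are assumptions made for the example, and the cut-offs are starting points rather than fixed rules.

```python
import pandas as pd


def suggest_first_step(missing_pct: float) -> str:
    """Map a column's missing percentage to the rule-of-thumb first step."""
    if missing_pct == 0:
        return "nothing to do"
    if missing_pct < 5:
        return "simple deletion or basic imputation"
    if missing_pct <= 20:
        return "mean/median/mode imputation or a model-based method"
    return "investigate why; consider dropping the column or advanced imputation"


# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age": [25, None, 31, 40, 52, 38, 29, 61, 45, 33],
    "income": [None, None, None, 52000, None, 61000, None, 48000, None, 75000],
    "city": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi",
             "Pune", "Delhi", "Mumbai", "Pune", "Delhi"],
})

# Percentage of missing values per column, then the suggested first step
missing_pct = df.isnull().mean() * 100
plan = missing_pct.apply(suggest_first_step)
print(plan)
```

The helper only looks at how much is missing. It cannot tell you why values are missing, so treat its output as a prompt for investigation, not a decision.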
Not all missing data behaves the same way. Before you can properly handle missing data, you need to understand why it is missing in the first place. Statisticians group missing data into three categories.
Missing Completely at Random (MCAR): This happens when there is no pattern at all. A value is missing simply by chance, not because of anything related to the data itself. For example, a lab machine randomly fails to record one reading out of a thousand.
Missing at Random (MAR): Here the missingness is related to another variable in the dataset, but not to the missing value itself. For example, older customers might be less likely to fill in their email address, but whether the email is missing has nothing to do with what that email actually is.
Missing Not at Random (MNAR): This is the trickiest type. The reason data is missing is directly connected to the value itself. For example, people with very high incomes may be less likely to report their income on a survey.
| Type | What It Means | Example |
| --- | --- | --- |
| MCAR | No pattern, pure chance | Random sensor glitch |
| MAR | Linked to another variable | Age affects whether email is filled in |
| MNAR | Linked to the missing value itself | High earners skip the income field |
Why does this matter? Because the type of missing data changes which method you should use. If your data is MCAR, simple deletion usually works fine. If it is MAR, imputation methods that use other columns work better. If it is MNAR, you may need domain knowledge or specialized models, since standard imputation can introduce bias.
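To see why the mechanism matters, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the synthetic `age` and `income` columns, the rule that income rises with age, and the missingness probabilities. It removes income values under each of the three mechanisms and compares the mean of what is left with the true mean.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Synthetic data (an assumption for illustration): income rises with age
age = rng.uniform(18, 80, size=n)
income = 20_000 + 600 * age + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})
true_mean = df["income"].mean()

# MCAR: every row has the same 20% chance of losing its income value
mcar_mask = rng.random(n) < 0.20

# MAR: missingness depends on age (another column), not on income itself
mar_mask = rng.random(n) < np.where(df["age"] > 49, 0.60, 0.10)

# MNAR: missingness depends on the very income value that goes missing
mnar_mask = rng.random(n) < np.where(
    df["income"] > df["income"].median(), 0.60, 0.10
)

for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    observed_mean = df.loc[~mask, "income"].mean()
    print(f"{name}: observed mean shifts by {observed_mean - true_mean:,.0f}")
```

In this setup the MCAR shift should be close to zero, while the MAR and MNAR shifts are clearly negative. The difference between the last two is that the MAR gap can be explained using the `age` column, whereas nothing in the observed data reveals the MNAR gap.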
Once you know the type and pattern of your missing data, it is time to choose a technique. Techniques for handling missing values during data preprocessing fall into two broad groups: deletion and imputation.
Deletion is the simplest way to handle missing data. You remove either the rows or the columns that contain missing values.
Deletion works well when the missing data is MCAR and the dataset is large enough that you can afford to lose some rows.
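Before deleting anything, it helps to check how many rows you would actually lose. The sketch below uses a hypothetical toy DataFrame (the column names and values are assumptions for illustration) where each column has only one or two gaps, yet row deletion removes most of the data.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age": [25, None, 31, 40, 52, 38],
    "income": [50000, 62000, None, 58000, 71000, None],
    "city": ["Pune", "Delhi", "Pune", None, "Delhi", "Pune"],
})

# True for every row that has at least one missing value
rows_with_gaps = df.isnull().any(axis=1)
share_lost = rows_with_gaps.mean()

print(
    f"Row deletion would remove {rows_with_gaps.sum()} of {len(df)} rows "
    f"({share_lost:.0%})"
)
```

Gaps in different columns rarely fall on the same rows, so the share of rows lost can be far higher than the missing percentage of any single column.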
Imputation means filling in the missing values instead of removing them. This is the more common path when handling missing values during data preprocessing, especially for machine learning projects.
| Method | Best For | Limitation |
| --- | --- | --- |
| Mean or median imputation | Numerical columns with MCAR pattern | Reduces variance, can distort relationships |
| Mode imputation | Categorical columns | Can overrepresent the most common category |
| Forward or backward fill | Time series data | Assumes values do not change much over time |
| Regression imputation | Data with strong relationships between columns | Can overfit if relationships are weak |
| KNN imputation | Medium sized datasets with mixed patterns | Slower on very large datasets |
| Multiple imputation | Datasets where bias matters a lot | More complex to implement and explain |
A few practical tips:
Choosing the right way to handle missing data is part science and part judgment. There is rarely one perfect answer. The goal is to pick a method that keeps your data useful without distorting the patterns you actually care about.
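The "reduces variance" limitation listed for mean imputation is easy to check on a small example. The values below are hypothetical and chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column, for illustration only
s = pd.Series([10, 20, 30, 40, 50, np.nan, np.nan, np.nan])

# Replace every gap with the mean of the observed values
filled = s.fillna(s.mean())

print("mean before:", s.mean(), "| mean after:", filled.mean())
print("std before: ", round(s.std(), 2))
print("std after:  ", round(filled.std(), 2))
```

The mean is unchanged, but the standard deviation shrinks because every imputed value sits exactly at the center of the distribution. The more values you fill this way, the stronger the effect.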
If you work with data professionally, you will eventually need to know how to handle missing data in Python. The pandas and scikit-learn libraries make this fairly straightforward once you understand the logic behind it.
```python
import pandas as pd

# Load the dataset ("data.csv" is a placeholder path)
df = pd.read_csv("data.csv")

# Count missing values per column
print(df.isnull().sum())

# Percentage of missing values per column
print(df.isnull().mean() * 100)
```
This is always your first step before handling missing data in Python. You need to know exactly where the gaps are before you decide what to do with them.
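Counting gaps tells you how much is missing, not why. One simple check is to compare the missing rate of one column across the groups of another. The sketch below uses hypothetical `age_group` and `email` columns, echoing the earlier email example; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age_group": ["18-35", "18-35", "18-35", "18-35", "60+", "60+", "60+", "60+"],
    "email": ["a@x.com", "b@x.com", None, "c@x.com", None, None, "d@x.com", None],
})

# Share of missing emails within each age group
missing_by_group = df["email"].isnull().groupby(df["age_group"]).mean()
print(missing_by_group)
```

A large difference between groups suggests the missingness is linked to that variable, which points toward imputation methods that use other columns rather than plain deletion.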
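```python
import math

# Option 1: drop rows with any missing value
df_clean = df.dropna()

# Option 2: drop columns with more than 40% missing values.
# thresh is the minimum number of non-missing values a column needs to be kept.
threshold = math.ceil(len(df) * 0.6)
df_clean = df.dropna(thresh=threshold, axis=1)
```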
```python
# Fill a numerical column with its median
df["age"] = df["age"].fillna(df["age"].median())

# Fill a categorical column with its mode (most frequent value)
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill for time series (rows must already be sorted by time)
df["sales"] = df["sales"].ffill()
```
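```python
from sklearn.impute import SimpleImputer, KNNImputer

# Option 1: SimpleImputer replaces each gap with the column mean
imputer = SimpleImputer(strategy="mean")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])

# Option 2: KNNImputer fills each gap from the 5 most similar rows.
# Run it instead of Option 1, not after it, or there is nothing left to fill.
knn_imputer = KNNImputer(n_neighbors=5)
df[["age", "income"]] = knn_imputer.fit_transform(df[["age", "income"]])
```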
These examples cover most situations you will face when handling missing data in Python. Start simple. Use dropna() or fillna() for small projects. Move to SimpleImputer or KNNImputer once your dataset is larger or the relationships between columns matter more.
One thing worth repeating: always split your data into training and test sets before you fit an imputer. If you calculate the mean or median on the full dataset, you leak information from the test set into training, which gives you an overly optimistic view of model performance.
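A minimal sketch of that order of operations, assuming a hypothetical toy DataFrame with `age` and `income` columns: split first, fit the imputer on the training rows only, then apply the same fitted imputer to the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data, for illustration only
df = pd.DataFrame({
    "age": [25, np.nan, 31, 40, 52, 38, np.nan, 61, 45, 33],
    "income": [50, 62, np.nan, 58, 71, 49, 66, np.nan, 80, 55],
})

# 1. Split before any statistics are computed
train, test = train_test_split(df, test_size=0.3, random_state=0)

# 2. Learn the medians from the training rows only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(
    imputer.fit_transform(train), columns=train.columns, index=train.index
)

# 3. Reuse the training medians on the test rows (transform, not fit_transform)
test_filled = pd.DataFrame(
    imputer.transform(test), columns=test.columns, index=test.index
)

print("Medians learned from training data:", imputer.statistics_)
```

The key detail is calling `transform` rather than `fit_transform` on the test set, so the test rows never influence the fill values.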
Knowing the techniques is only half the job. The other half is using them wisely. Here are some practices that separate beginners from experienced practitioners when they handle missing data.
| Mistake | Why It Hurts | Better Approach |
| --- | --- | --- |
| Filling every column with the mean | Ignores the actual distribution and relationships | Use the method that fits each column's data type and pattern |
| Deleting rows without checking the pattern | Can introduce serious bias if data is MAR or MNAR | Investigate the missing pattern before deleting anything |
| Imputing before splitting train and test data | Leaks information and inflates accuracy scores | Fit your imputer only on training data |
| Ignoring missing data in categorical columns | Categorical gaps often carry real meaning | Consider creating a separate "missing" category |
| Treating all missing values the same | Different columns may have different causes | Treat each column on its own terms |
A simple rule to remember: the goal of any method to handle missing data is to preserve the truth in your dataset as closely as possible, not just to make the rows look complete. Speed matters less than accuracy here.
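As one illustration of the table above, here is a sketch comparing mode imputation with a separate "missing" category on a hypothetical `city` column. The label `"Missing"` and the toy values are assumptions for the example.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
df = pd.DataFrame({"city": ["Pune", None, "Delhi", None, "Pune"]})

# Mode imputation: every gap becomes the most frequent city
mode_filled = df["city"].fillna(df["city"].mode()[0])

# Separate category: gaps stay visible as their own label
category_filled = df["city"].fillna("Missing")

print(mode_filled.value_counts())
print(category_filled.value_counts())
```

With mode imputation, "Pune" grows from two rows to four and the fact that two values were unknown disappears. The separate category keeps the original counts and lets a model learn from the gap itself if it carries meaning.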
Missing data is not something you can avoid. It shows up in almost every dataset, from spreadsheets to large scale machine learning pipelines. What matters is having a clear, repeatable way to handle missing data so you are not guessing every time.
Start by checking how much data is missing and why. Pick a deletion or imputation method that fits the pattern you find. If you are working in Python, pandas and scikit-learn give you everything you need to get this done quickly.
Want personalized guidance on Data Science and upskilling? Speak with an expert for a free 1:1 counselling session today.
For beginners, mean or median imputation for numbers and mode imputation for categories is the easiest starting point. It is quick and works fine for small amounts of missing data, though more advanced methods give better accuracy on larger datasets.
No. Deleting rows only makes sense when missing data is random and your dataset is large enough to absorb the loss. If a large portion of rows have gaps, deletion can throw away useful information.
There is no strict rule, but columns with more than 40 to 50 percent missing values are usually unreliable. At that point, consider dropping the column or finding the data through another source.
Yes, significantly. Most machine learning algorithms cannot process missing values directly, and poor handling can introduce bias, reduce variance unnaturally, or cause the model to learn incorrect patterns.
Deletion removes rows or columns with missing values entirely. Imputation fills in those gaps with estimated values based on statistics or relationships in the rest of the data, keeping the dataset's original size.
Not always, but it works best only when missing values are few and random. Overusing mean imputation on skewed data or large gaps can flatten natural variation and hurt model performance.
AI tools generally suggest the same core framework: detect missing values, identify the pattern, and choose deletion or imputation based on the percentage missing and the type of data, similar to standard data science practice.
Pandas is best for basic detection, filling, and dropping. Scikit-learn is better for advanced techniques like KNN or iterative imputation, especially in machine learning pipelines.
Yes. Time series data usually relies on forward fill, backward fill, or interpolation rather than mean imputation, since values are connected to time and order matters.
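The three options can be compared side by side. This sketch assumes a hypothetical daily `sales` series with made-up values and dates.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with gaps, for illustration only
dates = pd.date_range("2026-01-01", periods=6, freq="D")
sales = pd.Series([100.0, np.nan, np.nan, 130.0, np.nan, 150.0], index=dates)

comparison = pd.DataFrame({
    "original": sales,
    "ffill": sales.ffill(),                           # carry the last known value forward
    "bfill": sales.bfill(),                           # pull the next known value backward
    "interpolate": sales.interpolate(method="time"),  # straight line between known points
})
print(comparison)
```

Forward and backward fill produce flat steps, while interpolation fills the gaps with values that move gradually between the known points. Which one is appropriate depends on how the quantity actually behaves between observations.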
Multiple imputation creates several different filled in versions of your dataset and combines the results. It is useful when missing data is substantial and you need a more statistically robust estimate than a single imputation.
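A rough sketch of the idea, using scikit-learn's `IterativeImputer` with `sample_posterior=True` so that each run draws different plausible values. The synthetic `age` and `income` data, the five repetitions, and the choice to pool a simple column mean are all assumptions for illustration. `IterativeImputer` is still marked experimental in scikit-learn, which is why the extra `enable_iterative_imputer` import is required.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data (an assumption for illustration): income rises with age
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(18, 80, size=n)
income = 20_000 + 600 * age + rng.normal(0, 5_000, size=n)
df = pd.DataFrame({"age": age, "income": income})

# Remove roughly 30% of the income values at random
df.loc[rng.random(n) < 0.3, "income"] = np.nan

# Create several completed datasets, each with different sampled fill values
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    estimates.append(completed["income"].mean())

# Combine the per-dataset estimates
pooled_mean = np.mean(estimates)
between_imputation_spread = np.std(estimates, ddof=1)

print(f"Pooled mean income: {pooled_mean:,.0f}")
print(f"Spread across imputations: {between_imputation_spread:,.0f}")
```

The spread across imputations is what a single imputation hides: it reflects how uncertain the filled values are. A full multiple imputation analysis would fit the model of interest on each completed dataset and pool those results, usually with Rubin's rules, rather than pooling a simple mean as done here.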
No single method works for every project. The right approach depends on the type of missing data, the size of your dataset, and what you plan to do with it afterward.
Rahul Singh is an Associate Content Writer at upGrad, with a strong interest in Data Science, Machine Learning, and Artificial Intelligence. He combines technical development skills with data-driven s...