How to Handle Missing Data: A Complete Guide for Data Scientists

Updated on Jun 17, 2026

Table of Contents

How to Handle Missing Data: Step by Step
Types of Missing Data You Should Know
Data Preprocessing: Handling Missing Values with Proven Techniques
Handling Missing Data in Python: Practical Examples
Best Practices and Mistakes to Avoid
Conclusion

Handling missing data is a critical step in data preprocessing because incomplete values can reduce data quality and affect the accuracy of machine learning models. Missing values often occur due to data collection issues, system errors, or incomplete user inputs.

To handle missing data effectively, data scientists first identify why values are missing and then apply suitable techniques such as deletion, statistical imputation, or predictive methods. Choosing the right approach helps improve data reliability and model performance.

In this blog, you'll learn practical techniques to identify, understand, and handle missing data effectively. You'll discover why different approaches work in different situations, see Python code examples you can use immediately, and understand the mistakes that even experienced data scientists make.

How to Handle Missing Data: Step by Step

Before picking a method, you need a simple process. Most beginners jump straight to deleting rows or filling in the average value without checking what is actually happening in their data. That is a mistake.

Here is a step-by-step approach that works for almost any project:

Step 1: Find the missing values. Use simple functions to count how many values are missing in each column.
Step 2: Understand the pattern. Check if missing values are random or if they follow a pattern linked to another column.
Step 3: Decide on a strategy. Choose between removing the data, filling it in, or using a model-based approach.
Step 4: Apply the method. Implement the chosen technique on your dataset.
Step 5: Validate the result. Compare summary statistics before and after to make sure you have not introduced bias.

This is the core loop you will repeat every time you need to handle missing data. The exact technique changes depending on the size of your dataset, the percentage of missing values, and what you plan to do with the data next.
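
Step 5 is the one most often skipped, so here is a minimal sketch of what "compare summary statistics before and after" could look like in pandas. The helper name compare_summary and the sample ages are hypothetical, chosen only for illustration.

```python
import numpy as np
import pandas as pd


def compare_summary(before: pd.Series, after: pd.Series) -> pd.DataFrame:
    """Put key summary statistics of a column side by side."""
    stats = ["count", "mean", "std", "min", "max"]
    summary = pd.DataFrame({
        "before": before.agg(stats),
        "after": after.agg(stats),
    })
    summary["change"] = summary["after"] - summary["before"]
    return summary


# Hypothetical column with two gaps, filled with the mean as an example
ages = pd.Series([22.0, 25.0, np.nan, 31.0, 40.0, np.nan, 58.0])
filled = ages.fillna(ages.mean())

report = compare_summary(ages, filled)
print(report)
```

In this example the mean stays the same but the standard deviation shrinks, which is the kind of shift this check is meant to surface.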

A quick way to decide your first move:

Situation	Suggested First Step
Less than 5% missing in a column	Consider simple deletion or basic imputation
5% to 20% missing	Use mean, median, or mode imputation, or a model-based method
More than 20% missing	Investigate why, consider dropping the column or using advanced imputation
Missing data linked to another variable	Use that relationship to guide imputation

Knowing how to handle missing data is not about memorizing one formula. It is about asking the right questions before you touch the dataset. Once you understand the scale and pattern of the problem, picking a technique becomes much easier.
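
The table above can be turned into a quick per-column check. This is a rough sketch only: the function name suggest_first_step and the sample DataFrame are made up for illustration, the "nothing to do" branch is an addition for columns without gaps, and the cut-offs are the rules of thumb from the table, not hard limits.

```python
import numpy as np
import pandas as pd


def suggest_first_step(missing_pct: float) -> str:
    """Map a column's missing percentage to the first step from the table."""
    if missing_pct == 0:
        return "nothing to do"
    if missing_pct < 5:
        return "consider simple deletion or basic imputation"
    if missing_pct <= 20:
        return "use mean, median, or mode imputation, or a model-based method"
    return "investigate why; consider dropping the column or advanced imputation"


# Hypothetical data: age has 10% missing, income has 40% missing
df = pd.DataFrame({
    "age": [25, 31, np.nan, 40, 22, 35, 29, 51, 44, 38],
    "income": [np.nan, np.nan, np.nan, 52000, 61000,
               np.nan, 48000, 75000, 58000, 67000],
    "city": ["Pune", "Delhi", "Mumbai", "Pune", "Delhi",
             "Pune", "Mumbai", "Delhi", "Pune", "Delhi"],
})

missing_pct = df.isnull().mean() * 100
suggestions = missing_pct.apply(suggest_first_step)
print(suggestions)
```

The percentage alone does not cover the last row of the table. Whether the gaps are linked to another variable still has to be checked separately.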

Types of Missing Data You Should Know

Not all missing data behaves the same way. Before you can properly handle missing data, you need to understand why it is missing in the first place. Statisticians group missing data into three categories.

Missing Completely at Random (MCAR)

This happens when there is no pattern at all. A value is missing simply by chance, not because of anything related to the data itself. For example, a lab machine randomly fails to record one reading out of a thousand.

Missing at Random (MAR)

Here the missingness is related to another variable in the dataset, but not to the missing value itself. For example, older customers might be less likely to fill in their email address, but whether the email is missing has nothing to do with what that email actually is.

Missing Not at Random (MNAR)

This is the trickiest type. The reason data is missing is directly connected to the value itself. For example, people with very high incomes may be less likely to report their income on a survey.

Type	What It Means	Example
MCAR	No pattern, pure chance	Random sensor glitch
MAR	Linked to another variable	Age affects whether email is filled in
MNAR	Linked to the missing value itself	High earners skip the income field

Why does this matter? Because the type of missing data changes which method you should use. If your data is MCAR, simple deletion usually works fine. If it is MAR, imputation methods that use other columns work better. If it is MNAR, you may need domain knowledge or specialized models, since standard imputation can introduce bias.
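
One simple way to look for a MAR pattern is to compare another column between rows that have the value and rows that do not. The sketch below uses synthetic data that mirrors the email example, with older customers deliberately made more likely to skip the field. All column names and probabilities are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Synthetic data: customers aged 60+ skip the email field far more often
age = rng.integers(18, 80, size=n)
skip_prob = np.where(age >= 60, 0.5, 0.05)
email_missing = rng.random(n) < skip_prob

df = pd.DataFrame({
    "age": age,
    "email_missing": email_missing,
    "email_status": np.where(email_missing, "missing", "present"),
})

# Average age of rows with and without an email
age_by_email = df.groupby("email_status")["age"].mean()
print(age_by_email)

# Share of missing emails within each age band
df["age_band"] = pd.cut(df["age"], bins=[17, 39, 59, 79],
                        labels=["18-39", "40-59", "60-79"])
rate_by_band = df.groupby("age_band", observed=True)["email_missing"].mean()
print(rate_by_band)
```

A clear difference between the groups points away from MCAR. A check like this cannot detect MNAR, because that would require the very values that are missing.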

Data Preprocessing: Handling Missing Values with Proven Techniques

Once you know the type and pattern of your missing data, it is time to choose a technique. The techniques for handling missing values during data preprocessing fall into two broad groups: deletion and imputation.

Deletion Methods

Deletion is the simplest way to handle missing data. You either remove rows or columns that contain missing values.

Listwise deletion: Remove an entire row if any value in it is missing. Simple, but you risk losing useful data if many rows have at least one gap.
Pairwise deletion: Used mainly in statistical analysis, where calculations only use the available data for each specific comparison, ignoring missing pairs.
Column deletion: Drop an entire column if it has too many missing values, usually above 40 to 50 percent.

Deletion works well when the missing data is MCAR and the dataset is large enough that you can afford to lose some rows.
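
To make the difference between listwise and pairwise deletion concrete, here is a small sketch on a made-up DataFrame. It relies on the fact that pandas' DataFrame.corr() excludes missing values pair by pair, while dropna() removes the whole row.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps scattered across three columns
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 55, np.nan, 38],
    "income": [40, 52, 61, np.nan, 45, 90, 58, np.nan],
    "spend":  [10, 14, 15, 18, np.nan, 30, 16, 17],
})

# Listwise deletion: keep only rows with no gaps at all
listwise = df.dropna()
print(len(df), "rows ->", len(listwise), "rows after listwise deletion")

# Pairwise deletion: each correlation uses the rows where
# both columns of that pair are present
pairwise_corr = df.corr()
print(pairwise_corr)

# Number of rows available to each pair of columns
present = df.notna().astype(int)
pair_counts = present.T @ present
print(pair_counts)
```

Here listwise deletion keeps 3 of 8 rows, while each pairwise correlation can draw on 4 or 5 rows. The trade-off is that every pair is computed on a slightly different subset of the data.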

Imputation Methods

Imputation means filling in the missing values instead of removing them. This is the more common path when handling missing values during data preprocessing, especially for machine learning projects.

Method	Best For	Limitation
Mean or median imputation	Numerical columns with MCAR pattern	Reduces variance, can distort relationships
Mode imputation	Categorical columns	Can overrepresent the most common category
Forward or backward fill	Time series data	Assumes values do not change much over time
Regression imputation	Data with strong relationships between columns	Understates variance and overstates correlations, since every imputed value sits exactly on the fitted line
KNN imputation	Medium sized datasets with mixed patterns	Slower on very large datasets
Multiple imputation	Datasets where bias matters a lot	More complex to implement and explain

A few practical tips:

Use median instead of mean if your column has outliers.
Use mode only for categorical data, never for continuous numbers.
For time-based data, forward fill or interpolation often works better than a flat average.
Always check the distribution of a column after imputation. If it looks very different from before, you may have introduced bias.
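
The first tip is easy to see on a tiny example. The income values below are invented, with one extreme outlier and two gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical income column with one extreme outlier and two gaps
income = pd.Series([30_000, 32_000, 35_000, np.nan,
                    38_000, 40_000, np.nan, 1_000_000])

mean_filled = income.fillna(income.mean())
median_filled = income.fillna(income.median())

print("mean used for filling:  ", income.mean())
print("median used for filling:", income.median())
```

The median of the observed values is 36,500, which sits among the typical incomes. The mean is pulled to roughly 195,833 by the single outlier, so mean imputation would fill both gaps with a value that no typical row resembles.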

Choosing the right way to handle missing data is part science and part judgment. There is rarely one perfect answer. The goal is to pick a method that keeps your data useful without distorting the patterns you actually care about.
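
Multiple imputation, the last row in the table above, is the hardest of these methods to picture, so a rough sketch may help. It uses scikit-learn's IterativeImputer (still marked experimental, hence the extra import) with sample_posterior=True and a different seed per run. The data is synthetic and the pooling shown is only the simplest piece, averaging one estimate across runs. Treat it as an illustration rather than a full multiple imputation workflow.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(42)
n = 300

# Synthetic data: income depends on age; about 25% of incomes removed at random
age = rng.normal(40, 10, size=n)
income = 1000 * age + rng.normal(0, 5000, size=n)
df = pd.DataFrame({"age": age, "income": income})
df.loc[rng.random(n) < 0.25, "income"] = np.nan

# Create several plausible completed datasets, each with a different seed
estimates = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
    estimates.append(completed["income"].mean())

# Combine: average the estimate across the completed datasets
pooled_mean = float(np.mean(estimates))
# Spread between runs reflects uncertainty caused by the missing values
between_var = float(np.var(estimates, ddof=1))
print(pooled_mean, between_var)
```

The spread between runs is the point of the method: a single imputation hides that uncertainty, while several imputations make it visible.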

Handling Missing Data in Python: Practical Examples

If you work with data professionally, you will eventually need to know how to handle missing data in Python. The pandas and scikit-learn libraries make this fairly straightforward once you understand the logic behind it.

Detecting Missing Values

import pandas as pd

df = pd.read_csv("data.csv")

# Count missing values per column
print(df.isnull().sum())

# Percentage of missing values
print(df.isnull().mean() * 100)

This is always your first step before handling missing data in Python. You need to know exactly where the gaps are before you decide what to do with them.

Removing Missing Values

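import math

# Drop rows with any missing value
df_rows_dropped = df.dropna()

# Drop columns with more than 40% missing values.
# thresh is the minimum number of non-missing values a column needs to be kept.
min_non_missing = math.ceil(len(df) * 0.6)
df_cols_dropped = df.dropna(thresh=min_non_missing, axis=1)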

Filling Missing Values

# Fill numerical column with median
df["age"] = df["age"].fillna(df["age"].median())

# Fill categorical column with mode
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Forward fill for time series (rows must already be sorted by time)
df["sales"] = df["sales"].ffill()
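
For time series, interpolation is a common alternative to forward fill. The sketch below compares the two on a made-up daily sales series.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with three gaps
dates = pd.date_range("2024-01-01", periods=6, freq="D")
sales = pd.Series([100.0, np.nan, np.nan, 130.0, np.nan, 150.0], index=dates)

forward_filled = sales.ffill()       # repeats the last known value
interpolated = sales.interpolate()   # straight line between known points

comparison = pd.DataFrame({
    "original": sales,
    "ffill": forward_filled,
    "interpolate": interpolated,
})
print(comparison)
```

Forward fill holds the value at 100 until the next reading, while linear interpolation steps through 110 and 120. Which one is closer to reality depends on whether the quantity jumps or drifts between observations.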

Using Scikit-learn for Advanced Imputation

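from sklearn.impute import SimpleImputer, KNNImputer

num_cols = ["age", "income"]

# Option 1: simple imputer (fills each gap with the column mean)
df_mean = df.copy()
df_mean[num_cols] = SimpleImputer(strategy="mean").fit_transform(df_mean[num_cols])

# Option 2: KNN imputer (fills each gap from the 5 most similar rows)
# These are alternatives: each one starts from the original df.
df_knn = df.copy()
df_knn[num_cols] = KNNImputer(n_neighbors=5).fit_transform(df_knn[num_cols])

# For modeling, fit on the training split only and call transform() on the test split.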

These examples cover most situations you will face when handling missing data in Python. Start simple. Use dropna() or fillna() for small projects. Move to SimpleImputer or KNNImputer once your dataset is larger or the relationships between columns matter more.

One important caution: always split your data into training and test sets before you fit an imputer. If you calculate the mean or median on the full dataset, you leak information from the test set into training, which gives you an overly optimistic view of model performance.
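
A minimal sketch of the split-first order, using synthetic data and a median SimpleImputer. The column names, the 10% missing rate, and the 75/25 split are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = pd.DataFrame({
    "age": rng.normal(40, 12, size=200),
    "income": rng.normal(60_000, 15_000, size=200),
})
# Knock out about 10% of the values at random (synthetic gaps)
X = X.mask(rng.random(X.shape) < 0.10)

# 1. Split first
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# 2. Fit the imputer on the training rows only
imputer = SimpleImputer(strategy="median")
X_train_filled = pd.DataFrame(
    imputer.fit_transform(X_train), columns=X.columns, index=X_train.index
)

# 3. Reuse the training medians on the test rows (transform, not fit_transform)
X_test_filled = pd.DataFrame(
    imputer.transform(X_test), columns=X.columns, index=X_test.index
)

print(imputer.statistics_)  # medians learned from the training rows only
```

The test rows are filled with medians they had no part in computing, which is what keeps the evaluation honest.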

Best Practices and Mistakes to Avoid

Knowing the techniques is only half the job. The other half is using them wisely. Here are some practices that separate beginners from experienced practitioners when they handle missing data.

Good Practices

Always visualize missing data first using a heatmap or a simple bar chart of missing counts.
Document every decision you make. If you dropped a column, write down why.
Test more than one method and compare results on a validation set.
Plan for missing data in production as well as in training: apply the imputer you fitted on the training data, and watch for new missing-data patterns that the training set never showed.
Keep a copy of the original dataset before you start modifying it.

Common Mistakes

Mistake	Why It Hurts	Better Approach
Filling every column with the mean	Ignores the actual distribution and relationships	Use the method that fits each column's data type and pattern
Deleting rows without checking the pattern	Can introduce serious bias if data is MAR or MNAR	Investigate the missing pattern before deleting anything
Imputing before splitting train and test data	Leaks information and inflates accuracy scores	Fit your imputer only on training data
Ignoring missing data in categorical columns	Categorical gaps often carry real meaning	Consider creating a separate "missing" category
Treating all missing values the same	Different columns may have different causes	Treat each column on its own terms
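
The separate "missing" category from the table takes one line in pandas. In the sketch below, the city and churned columns are invented, and the data is constructed so that the gap itself carries information.

```python
import pandas as pd

# Hypothetical data: customers who left the city blank all churned
df = pd.DataFrame({
    "city": ["Pune", None, "Delhi", "Pune", None],
    "churned": [0, 1, 0, 0, 1],
})

# Keep the gap as its own category instead of guessing a city
df["city"] = df["city"].fillna("Missing")

city_counts = df["city"].value_counts()
churn_by_city = df.groupby("city")["churned"].mean()
print(city_counts)
print(churn_by_city)
```

Filling these gaps with the mode would have merged the "Missing" rows into "Pune" and hidden the pattern.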

A simple rule to remember: the goal of any method to handle missing data is to preserve the truth in your dataset as closely as possible, not just to make the rows look complete. Speed matters less than accuracy here.

Conclusion

Missing data is not something you can avoid. It shows up in almost every dataset, from spreadsheets to large-scale machine learning pipelines. What matters is having a clear, repeatable way to handle missing data so you are not guessing every time.

Start by checking how much data is missing and why. Pick a deletion or imputation method that fits the pattern you find. If you are working in Python, pandas and scikit-learn give you everything you need to get this done quickly.

Frequently Asked Questions (FAQs)

1. What is the easiest way to handle missing data for beginners?

For beginners, mean or median imputation for numbers and mode imputation for categories is the easiest starting point. It is quick and works fine for small amounts of missing data, though more advanced methods give better accuracy on larger datasets.

2. Should I always delete rows with missing values?

No. Deleting rows only makes sense when missing data is random and your dataset is large enough to absorb the loss. If a large portion of rows have gaps, deletion can throw away useful information.

3. How do I know if my data has too much missing information to use?

There is no strict rule, but columns with more than 40 to 50 percent missing values are usually unreliable. At that point, consider dropping the column or finding the data through another source.

4. Can missing data affect machine learning model accuracy?

Yes, significantly. Most machine learning algorithms cannot process missing values directly, and poor handling can introduce bias, reduce variance unnaturally, or cause the model to learn incorrect patterns.

5. What is the difference between imputation and deletion?

Deletion removes rows or columns with missing values entirely. Imputation fills in those gaps with estimated values based on statistics or relationships in the rest of the data, keeping the dataset's original size.

6. Is mean imputation always a bad idea?

Not always, but it works best only when missing values are few and random. Overusing mean imputation on skewed data or large gaps can flatten natural variation and hurt model performance.

7. How do AI tools like ChatGPT or Perplexity recommend handling missing data?

AI tools generally suggest the same core framework: detect missing values, identify the pattern, and choose deletion or imputation based on the percentage missing and the type of data, similar to standard data science practice.

8. What Python library is best for handling missing data in Python projects?

Pandas is best for basic detection, filling, and dropping. Scikit-learn is better for advanced techniques like KNN or iterative imputation, especially in machine learning pipelines.

9. Does handling missing data change for time series data?

Yes. Time series data usually relies on forward fill, backward fill, or interpolation rather than mean imputation, since values are connected to time and order matters.

10. What is multiple imputation and when should I use it?

Multiple imputation creates several different filled-in versions of your dataset, runs the same analysis on each, and combines the results. It is useful when missing data is substantial and you need a more statistically robust estimate than a single imputation can give.

11. Can I use the same method to handle missing data in every project?

No single method works for every project. The right approach depends on the type of missing data, the size of your dataset, and what you plan to do with it afterward.

Rahul Singh

Rahul Singh is an Associate Content Writer at upGrad, with a strong interest in Data Science, Machine Learning, and Artificial Intelligence. He combines technical development skills with data-driven s...
