Exploratory Data Analysis in Python: What You Need to Know?

By Rohit Sharma

Updated on Apr 02, 2025 | 9 min read | 6.89K+ views

Table of Contents

Looking at the Data
Missing Data
Data Imbalance
Key Techniques Used in Exploratory Data Analysis in Python
Conclusion

Exploratory Data Analysis (EDA) is a core practice for data scientists. In Python, it involves examining a dataset from different perspectives to build a comprehensive understanding of it. The process helps in cleaning and summarizing the data and in revealing insights and trends that might otherwise go unnoticed.

EDA has no fixed set of rules to follow, unlike, for example, ‘data analysis’. People new to the field often confuse the two terms, which are similar but differ in purpose. Unlike EDA, data analysis leans more towards applying probability and statistical methods to reveal facts and relationships among different variables.

Coming back, there is no right or wrong way to perform EDA. It varies from person to person; however, some major guidelines are commonly followed, and they are listed below.

Handling missing values: Null values appear when some of the data was not available or not recorded during collection.
Removing duplicate data: Repeated records can cause overfitting or bias when a machine learning algorithm is trained on them, so they should be removed.
Handling outliers: Outliers are records that differ drastically from the rest of the data and don’t follow the trend. They can arise from genuine exceptions or from inaccuracies during data collection.
Scaling and normalizing: This is only done for numerical variables. Variables often differ greatly in range and scale, which makes it difficult to compare them and find correlations.
Univariate and bivariate analysis: Univariate analysis examines one variable at a time; in this article, that usually means seeing how a single variable relates to the target variable. Bivariate analysis is carried out between any two variables, which can be numerical, categorical, or one of each.
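As a rough illustration of two of the guidelines above, the sketch below removes duplicate records and applies min-max scaling to the numerical columns. The toy DataFrame and its column names (income, age) are made up for illustration and are not part of the Home Credit data.

```python
import pandas as pd

# Hypothetical toy data; the last row duplicates the first.
df = pd.DataFrame({
    "income": [50000.0, 120000.0, 75000.0, 50000.0],
    "age": [25, 40, 33, 25],
})

# Drop exact duplicate rows, keeping the first occurrence.
deduped = df.drop_duplicates().reset_index(drop=True)

# Min-max scaling: map every numerical column to the [0, 1] range
# so that columns with very different ranges become comparable.
numeric = deduped.select_dtypes(include="number")
scaled = (numeric - numeric.min()) / (numeric.max() - numeric.min())

print(deduped)
print(scaled)
```

Min-max scaling is only one option; standardizing to zero mean and unit variance is an equally common choice, and which one fits depends on the analysis.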

We will look at how some of these are implemented using the well-known ‘Home Credit Default Risk’ dataset available on Kaggle (see the reference link at the end of the article). The data contains information about each loan applicant at the time of applying for the loan. It covers two types of scenarios:

The client with payment difficulties: he/she had a late payment of more than X days on at least one of the first Y instalments of the loan in our sample.

All other cases: the payment was made on time.

We’ll only be working with the application data file for the sake of this article.

Looking at the Data

import pandas as pd

app_data = pd.read_csv("application_data.csv")

app_data.info()

After reading the application data, we use the info() method to get a short overview of the data we’ll be dealing with. The output below tells us that we have 307,511 loan records with 122 variables. Of these, 16 are categorical and the rest are numerical.

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 307511 entries, 0 to 307510

Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR

dtypes: float64(65), int64(41), object(16)

memory usage: 286.2+ MB

It is always a good practice to handle and analyse numerical and categorical data separately.

categorical = app_data.select_dtypes(include = object).columns

app_data[categorical].apply(pd.Series.nunique, axis = 0)

Looking only at the categorical features below, we see that most of them have only a few categories, which makes them easier to analyse using simple plots.

NAME_CONTRACT_TYPE 2

CODE_GENDER 3

FLAG_OWN_CAR 2

FLAG_OWN_REALTY 2

NAME_TYPE_SUITE 7

NAME_INCOME_TYPE 8

NAME_EDUCATION_TYPE 5

NAME_FAMILY_STATUS 6

NAME_HOUSING_TYPE 6

OCCUPATION_TYPE 18

WEEKDAY_APPR_PROCESS_START 7

ORGANIZATION_TYPE 58

FONDKAPREMONT_MODE 4

HOUSETYPE_MODE 3

WALLSMATERIAL_MODE 7

EMERGENCYSTATE_MODE 2

dtype: int64

Now for the numerical features, the describe() method gives us the statistics of our data:

numer= app_data.describe()

numerical= numer.columns

numer

Looking at the entire table it’s evident that:

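DAYS_BIRTH is negative: it records the applicant’s age in days, counted backwards from the day of application.
DAYS_EMPLOYED has outliers: its maximum value is 365243 days, which is roughly 1,000 years.
AMT_ANNUITY: the mean is much smaller than the maximum value, which suggests a skewed distribution or outliers.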

So now we know which features will have to be analysed further.
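A minimal sketch of how the first two observations could be followed up is shown below. It uses a hypothetical three-row stand-in for the application data; the 100-year cut-off and the new column names (AGE_YEARS, DAYS_EMPLOYED_CLEAN) are assumptions made for illustration, not part of the dataset.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the application data.
sample = pd.DataFrame({
    "DAYS_BIRTH": [-9461, -16765, -19046],
    "DAYS_EMPLOYED": [-637, -1188, 400000],
})

# DAYS_BIRTH counts backwards from the application date, so negate it
# and divide by 365 to get an approximate age in years.
sample["AGE_YEARS"] = -sample["DAYS_BIRTH"] / 365

# Assumption: an employment length above 100 years cannot be real,
# so such values are replaced with NaN instead of being kept as numbers.
too_long = sample["DAYS_EMPLOYED"].abs() > 100 * 365
sample["DAYS_EMPLOYED_CLEAN"] = sample["DAYS_EMPLOYED"].mask(too_long)

print(sample)
```

Replacing implausible values with NaN keeps the rows in the data while stopping the extreme number from distorting means and plots.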

Missing Data

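We can make a point plot of the features, with the percentage of missing data in each on the y-axis.

import matplotlib.pyplot as plt

import seaborn as sns

# Percentage of missing values per feature, as a two-column DataFrame

missing = (app_data.isnull().sum() * 100 / app_data.shape[0]).reset_index()

missing.columns = ["feature", "pct_missing"]

plt.figure(figsize=(16, 5))

# Recent seaborn versions require x and y to be passed as keywords

ax = sns.pointplot(x="feature", y="pct_missing", data=missing)

plt.xticks(rotation=90, fontsize=7)

plt.title("Percentage of Missing values")

plt.ylabel("PERCENTAGE")

plt.show()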

Many columns have a lot of missing data (30-70%), some have a smaller amount (13-19%), and many have no missing data at all. It is not strictly necessary to modify the dataset when you only have to perform EDA. However, for the data pre-processing that follows, we should know how to handle missing values.

For features with few missing values, we can use regression to predict the missing values or fill them with the mean of the values present, depending on the feature. For features with a very high share of missing values, it is better to drop those columns, as they give very little insight during analysis.
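The two strategies described above can be sketched as follows. The toy DataFrame, its column names, and the 50% drop threshold are assumptions chosen for illustration; the right threshold depends on the dataset and the feature.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with different amounts of missing values.
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "few_missing": [10.0, np.nan, 30.0, 20.0],
    "complete": [1, 2, 3, 4],
})

# Percentage of missing values per column.
missing_pct = df.isnull().mean() * 100

# Assumption: drop columns with more than 50% missing values.
threshold = 50
kept = df.loc[:, missing_pct <= threshold].copy()

# Fill the remaining gaps in numerical columns with the column mean.
kept = kept.fillna(kept.mean(numeric_only=True))

print(missing_pct)
print(kept)
```

For skewed features the median is often a safer fill value than the mean, since it is less affected by outliers.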

Data Imbalance

In this dataset, loan defaulters are identified using the binary variable ‘TARGET’.

100 * app_data["TARGET"].value_counts() / len(app_data["TARGET"])

0 91.927118

1 8.072882

Name: TARGET, dtype: float64

We see that the data is highly imbalanced, with a ratio of roughly 92:8. Most of the loans were paid back on time (target = 0). When there is such a large imbalance, it is better to take individual features and compare them with the target variable (targeted analysis) to determine which categories in those features tend to default on their loans more than others.

Below are a few examples of graphs that can be made using the seaborn library in Python and simple user-defined functions.

Gender

Males (M) have a higher chance of defaulting than females (F), even though there are almost twice as many female applicants. In this dataset, female applicants are therefore more reliable than male applicants at paying back their loans.

Education Type

Even though most applicants have secondary or higher education, it is the applicants with lower secondary education whose loans are riskiest for the company, followed by those with secondary education.
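Comparisons like the two above can also be checked numerically before plotting. The sketch below computes the applicant count and default rate per category; default_rate_by is a hypothetical helper written for this illustration, and the six-row toy data is made up, so its numbers do not reflect the real dataset.

```python
import pandas as pd

def default_rate_by(df, feature, target="TARGET"):
    """Return applicant count and default rate (%) per category of `feature`."""
    summary = df.groupby(feature)[target].agg(applicants="count", default_rate="mean")
    summary["default_rate"] *= 100
    # Riskiest categories first.
    return summary.sort_values("default_rate", ascending=False)

# Hypothetical toy data using the column names from the article.
toy = pd.DataFrame({
    "CODE_GENDER": ["F", "F", "F", "F", "M", "M"],
    "TARGET": [0, 0, 0, 1, 1, 0],
})

rates = default_rate_by(toy, "CODE_GENDER")
print(rates)
```

Reporting the count next to the rate matters with imbalanced data: a category with a high default rate but very few applicants deserves less weight than its rate alone suggests.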

Key Techniques Used in Exploratory Data Analysis in Python

Several techniques are essential in Exploratory Data Analysis in Python, as they help understand and clean the data while identifying relevant features and testing hypotheses. Python libraries offer various functions and methods for implementing these techniques, making Exploratory Data Analysis in Python a powerful tool for data analysis.

Feature Engineering

The process of building new features from existing ones is known as feature engineering. It is a critical stage of EDA in Python since it enables you to extract additional information from your data. Python libraries used for feature engineering include NumPy, Pandas, and Scikit-learn.
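As a small sketch of the idea, the example below derives two ratio features from existing amount columns. It assumes the application data has AMT_INCOME_TOTAL and AMT_CREDIT columns alongside AMT_ANNUITY; the values are made up, and the new feature names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the column names are assumed to match the application data.
toy = pd.DataFrame({
    "AMT_INCOME_TOTAL": [100000.0, 200000.0],
    "AMT_CREDIT": [400000.0, 300000.0],
    "AMT_ANNUITY": [20000.0, 30000.0],
})

# How large is the loan relative to the applicant's income?
toy["CREDIT_INCOME_RATIO"] = toy["AMT_CREDIT"] / toy["AMT_INCOME_TOTAL"]

# What share of income goes into the loan annuity?
toy["ANNUITY_INCOME_RATIO"] = toy["AMT_ANNUITY"] / toy["AMT_INCOME_TOTAL"]

print(toy)
```

Ratios like these can be more informative than the raw amounts, because a large loan means something different for a high earner than for a low earner.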

Outlier Detection

Outliers are data points that differ significantly from the rest of the data in your dataset. Outliers can substantially affect your analysis, so they must be properly identified and handled. Outlier detection methods available in Python include the Z-score, the interquartile range (IQR), and isolation forests.
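Two of the methods named above, the IQR rule and the Z-score, can be sketched with plain Pandas. The series is made-up toy data, and the cut-offs (1.5 times the IQR, an absolute Z-score above 2) are conventional choices assumed for this example rather than fixed rules.

```python
import pandas as pd

# Hypothetical toy values with one obviously extreme point.
values = pd.Series([10, 12, 11, 13, 12, 11, 10, 95], name="amount")

# IQR rule: flag points more than 1.5 * IQR outside the middle 50% of the data.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

# Z-score: flag points far from the mean, measured in standard deviations.
# A cut-off of 3 is common for large samples; 2 is assumed here because
# the sample is tiny.
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > 2]

print(iqr_outliers)
print(z_outliers)
```

The IQR rule is based on quartiles and so is less influenced by the outlier itself, whereas the Z-score uses the mean and standard deviation, both of which the outlier inflates.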

Data Visualization

Data visualization is a crucial part of EDA, as it allows you to identify patterns and trends in your data. Python has many visualization libraries, including Matplotlib, Seaborn, and Plotly. These libraries provide an extensive set of charts and graphs that you can use to present your data.

Data Preprocessing

Data preprocessing is the process of cleaning and transforming your data before you start your analysis. It’s a crucial step in EDA because it can greatly affect the results of your analysis. Python provides several libraries for data preprocessing, including Pandas and Scikit-learn.
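One common transformation is turning categorical columns into numerical indicator columns (one-hot encoding). The sketch below uses pd.get_dummies on a made-up three-row table; the column names follow the application data, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows with two categorical columns and one numerical column.
toy = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "FLAG_OWN_CAR": ["N", "Y", "N"],
    "AMT_CREDIT": [400000.0, 135000.0, 300000.0],
})

# Columns to encode are listed explicitly to keep the example unambiguous.
categorical = ["NAME_CONTRACT_TYPE", "FLAG_OWN_CAR"]

# Each category becomes its own 0/1 column; numerical columns are left untouched.
encoded = pd.get_dummies(toy, columns=categorical, dtype=int)

print(encoded)
```

One-hot encoding suits features with few categories; for a feature like ORGANIZATION_TYPE with 58 categories it adds many columns, so grouping rare categories first may be worth considering.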

Hypothesis Testing

Hypothesis testing is a statistical method for determining whether a hypothesis about a population is supported by the data. It is a critical step in EDA, as it lets you draw sound conclusions from your data. SciPy and Statsmodels are two Python packages for testing hypotheses.
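As one possible example, a chi-square test of independence can check whether defaulting is associated with a categorical feature. The sketch below uses scipy.stats.chi2_contingency on made-up counts; the 100 toy rows are hypothetical and say nothing about the real dataset.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: 60 female and 40 male applicants with made-up outcomes.
toy = pd.DataFrame({
    "CODE_GENDER": ["F"] * 60 + ["M"] * 40,
    "TARGET": [0] * 55 + [1] * 5 + [0] * 30 + [1] * 10,
})

# Contingency table of observed counts: rows are genders, columns are target values.
table = pd.crosstab(toy["CODE_GENDER"], toy["TARGET"])

# Null hypothesis: gender and default status are independent.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, p-value = {p_value:.4f}, dof = {dof}")
```

A small p-value suggests the two variables are not independent in the sample; it does not by itself say how large or how practically important the difference is.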

Conclusion

The analysis techniques discussed above are widely used in risk analytics within banking and financial services. Exploratory Data Analysis in Python helps leverage historical data to minimize the risk of financial losses while lending to customers. The scope of Exploratory Data Analysis in Python extends across various industries, making it an essential practice for data-driven decision-making.

Reference Link:
https://www.kaggle.com/c/home-credit-default-risk/data
