
What is Exploratory Data Analysis in Python? Learn From Scratch

Last updated:
4th Mar, 2021
Exploratory Data Analysis, or EDA for short, makes up almost 70% of a Data Science project. EDA is the process of exploring data with various analytical tools to draw inferential statistics from it. These explorations are done either by inspecting plain numbers or by plotting graphs and charts of different types.

Each graph or chart tells a different story about, and offers a different angle on, the same data. For most data analysis and cleaning work, Pandas is the most widely used tool. For visualizations, plotting libraries such as Matplotlib, Seaborn, and Plotly are used.

Carrying out EDA is essential because it makes the data reveal its story. A Data Scientist who does a thorough EDA knows a lot about the data, and the model they build will naturally be better than one built by a Data Scientist who skips this step.

By the end of this tutorial, you will know the following:

  • Checking the basic overview of the data
  • Checking the descriptive statistics of the data
  • Manipulating column names and data types
  • Handling missing values & duplicate rows
  • Bivariate Analysis


Basic Overview of Data

We will be using the Cars Dataset for this tutorial which can be downloaded from Kaggle. The first step for almost any dataset is to import it and check its basic overview – its shape, columns, column types, top 5 rows, etc. This step gives you a quick gist of the data you’ll be working with. Let’s see how to do this in Python. 

# Importing the required libraries
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline
sns.set(color_codes=True)

Data Head & Tail

data = pd.read_csv("path/dataset.csv")
# Check the top 5 rows of the dataframe
data.head()

The head function prints the top 5 rows of the data frame by default. You can also specify how many rows you want to see by passing that value to head. Printing the head instantly gives us a quick look at what type of data we have, what type of features are present and what values they contain. Of course, this does not tell the whole story about the data, but it does give you a quick peek at it. You can similarly print the bottom part of the data frame using the tail function.


# Print the last 10 rows of the dataframe
data.tail(10)

One thing to notice here is that both functions, head and tail, give us the top or bottom rows. But the first or last rows are not always a good preview of the data, so you can also print any number of randomly sampled rows using the sample() function.

# Print 5 random rows
data.sample(5)

Descriptive Statistics

Next, let’s check out the descriptive statistics of the dataset. Descriptive stats consist of everything that “describes” the dataset. We check the shape of the data frame, which columns are present, and which features are numeric or categorical. We will also see how to do all this with simple functions.


Shape

# Checking the dataframe shape (mxn)
# m=number of rows
# n=number of columns
data.shape

As we see, this data frame contains 11914 rows and 16 columns. 


Columns

# Print the column names
data.columns

Dataframe information

# Print the column data types and the number of non-missing values
data.info()

As you can see, the info() function lists all the columns, how many non-null (non-missing) values each of them contains, and the data type of each column. This is a nice quick way of seeing which features are numeric and which are categorical/text-based. We also now know which columns have missing values; we will look at how to handle them later.

Manipulating Column Names and Data Types

Carefully checking and manipulating each column is extremely crucial in EDA. We need to see what kind of content a column/feature contains and which data type Pandas has assigned to it. The numeric data types are mostly int64 or float64, while text-based or categorical features are assigned the ‘object’ data type.

Date-time based features are assigned the datetime64 data type. There are times when Pandas doesn’t recognize a feature’s data type; in such cases, it simply falls back to the ‘object’ data type. We can specify the column data types explicitly while reading the data with read_csv.
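As a minimal sketch of passing dtypes to read_csv — using a small inline CSV with hypothetical column names, not the actual Cars dataset — the dtype and parse_dates parameters look like this:

```python
import io
import pandas as pd

# A tiny inline CSV standing in for a file on disk (hypothetical columns)
csv_data = io.StringIO(
    "Make,Year,MSRP,Launch Date\n"
    "BMW,2017,46135,2016-07-01\n"
    "Audi,2018,40990,2017-11-15\n"
)

# Declare column dtypes up front and parse the date column while reading,
# instead of letting Pandas guess and fall back to 'object'
df = pd.read_csv(
    csv_data,
    dtype={"Make": "object", "Year": "int64", "MSRP": "float64"},
    parse_dates=["Launch Date"],
)
print(df.dtypes)
```

With this, MSRP comes in as float64 and Launch Date as datetime64, with no post-hoc conversion needed.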

Selecting Categorical and Numerical Columns

# Add all the categorical and numerical columns to separate lists
categorical = data.select_dtypes('object').columns
numerical = data.select_dtypes('number').columns

Here, passing ‘number’ selects all columns with any numeric data type, be it int64 or float64.

Renaming the Columns

# Renaming the column names
data = data.rename(columns={"Engine HP": "HP",
                            "Engine Cylinders": "Cylinders",
                            "Transmission Type": "Transmission",
                            "Driven_Wheels": "Drive Mode",
                            "highway MPG": "MPG-H",
                            "MSRP": "Price"})
data.head(5)

The rename function takes a dictionary mapping the current column names to their new names.

Handling Missing Values and Duplicate Rows

Missing values are one of the most common issues in any real-life dataset. Handling missing values is a vast topic in itself, as there are multiple ways to do it: some are generic, while others are specific to the dataset at hand.

Checking Missing Values

# Checking missing values
data.isnull().sum()

This gives us the number of missing values in each column. We can also see the percentage of values missing.

# Percent of missing values
data.isnull().mean()*100

Checking the percentages might be useful when there are a lot of columns that have missing values. In such cases, the columns with a lot of missing values (for example, >60% missing) can be just dropped. 
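The percentage check and the >60% drop rule can be sketched on a toy frame (illustrative column names, not the Cars data):

```python
import pandas as pd

# Toy frame where one column is mostly missing
df = pd.DataFrame({
    "HP": [300.0, 250.0, None, 280.0, 310.0],   # 20% missing
    "Sparse": [None, None, None, None, 1.0],    # 80% missing
})

# Percentage of missing values per column
missing_pct = df.isnull().mean() * 100

# Keep only columns with at most 60% missing values
df = df.loc[:, missing_pct <= 60]
print(df.columns.tolist())
```

The boolean Series produced by `missing_pct <= 60` is used directly to select columns, so `Sparse` is dropped and `HP` survives.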

Imputing Missing Values

# Imputing missing values of numeric columns by their respective means
data[numerical] = data[numerical].fillna(data[numerical].mean())

# Imputing missing values of categorical columns by their modes
data[categorical] = data[categorical].fillna(data[categorical].mode().iloc[0])

Here we simply impute the missing values in the numeric columns by their respective means and the ones in the categorical columns by their modes. And as we can see, there are no missing values now.

Please note that this is the most basic way of imputing values; real-life cases often call for more sophisticated methods, for example, interpolation or KNN imputation.
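As one slightly more sophisticated alternative, pandas can interpolate a numeric gap from its neighbouring values. A minimal sketch on a toy series (not the Cars data):

```python
import pandas as pd

# Toy series with gaps; linear interpolation fills each gap
# from the values on either side of it
s = pd.Series([10.0, None, 30.0, None, 50.0])
filled = s.interpolate()
print(filled.tolist())
```

Unlike filling with one global mean, interpolation respects the local trend of the series, which matters for ordered data such as time series.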

Handling Duplicate Rows

# Drop duplicate rows
data.drop_duplicates(inplace=True)

This drops the duplicate rows, keeping the first occurrence of each by default.
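It can be useful to count the duplicates before dropping them. A small sketch on a toy frame (hypothetical values):

```python
import pandas as pd

df = pd.DataFrame({"Make": ["BMW", "BMW", "Audi"],
                   "Year": [2017, 2017, 2018]})

# Count fully duplicated rows before dropping them
n_dupes = df.duplicated().sum()

# Drop them, keeping the first occurrence of each row
df = df.drop_duplicates()
print(n_dupes, len(df))
```

Here one of the two identical BMW rows is flagged as a duplicate and removed, leaving two rows.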


Bivariate Analysis

Now let’s see how to gain more insight through bivariate analysis. Bivariate means an analysis involving two variables or features. Different types of plots are available for different combinations of feature types.

For Numerical – Numerical

  1. Scatter plot
  2. Line plot
  3. Heatmap for correlations

For Categorical-Numerical

  1. Bar Chart
  2. Violin plot
  3. Swarm plot

For Categorical-Categorical

  1. Bar chart
  2. Point plot

Heatmap for Correlations

# Checking the correlations between the numeric variables
plt.figure(figsize=(15, 10))
c = data.corr(numeric_only=True)
sns.heatmap(c, cmap="BrBG", annot=True)

Bar Plot

sns.barplot(x=data['Engine Fuel Type'], y=data['HP'])


Conclusion

As we saw, there are a lot of steps to cover while exploring a dataset. We only covered a handful of aspects in this tutorial, but it gives you more than just a basic grounding in good EDA.

If you are curious to learn more about Python and data science, check out IIIT-B & upGrad’s PG Diploma in Data Science, which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 sessions with industry mentors, 400+ hours of learning, and job assistance with top firms.


Rohit Sharma

Rohit Sharma is the Program Director for the upGrad-IIIT Bangalore PG Diploma in Data Analytics Program.

Frequently Asked Questions (FAQs)

1. What are the steps in exploratory data analysis?

The usual steps are the ones this tutorial walked through: getting a basic overview of the data (shape, columns, data types, head/tail), checking descriptive statistics, cleaning up column names and data types, handling missing values and duplicate rows, and then exploring relationships through bivariate analysis with appropriate plots.

2. What is the purpose of exploratory data analysis?

Exploratory analysis can be used by data scientists to ensure that the results they produce are accurate and relevant to the targeted business outcomes and goals. EDA also assists stakeholders by ensuring that they are addressing the right questions. Questions about standard deviations, categorical distributions, and confidence intervals can all be answered with EDA. Once EDA is complete and insights have been extracted, its findings can feed into more advanced data analysis or modelling, including machine learning.

3. What are the different types of exploratory data analysis?

Univariate approaches look at one variable (data column) at a time, whereas multivariate methods look at two or more variables at once to investigate relationships. Each of these can be done graphically or non-graphically, giving the four forms of EDA: univariate non-graphical, univariate graphical, multivariate non-graphical, and multivariate graphical. Quantitative (non-graphical) procedures are more objective, whereas pictorial methods are more subjective.
