For anyone who wants to get started with data analysis, the first languages that come to mind are R and Python. Developers are increasingly inclined towards Python because it is so widely adopted in general software development. Hence, data analysis using Python is one of the most frequently heard terms for someone starting their journey into data science.
Why Data Analysis?
Now first, why data analysis? Well, it is the first step towards knowing what type of data you are working with. It is the step where you find valuable patterns in data, which you might not see otherwise. Overall, it provides an intuitive understanding of the dataset at hand.
Here we do need to draw a line between data analysis and data pre-processing. Data pre-processing deals with transforming your dataset to make sure it is ready for training. Data analysis is about understanding the dataset, and it comes before data pre-processing. In data analysis, we reshape and summarize the data to view it better and, hence, gain insights about the dataset at hand.
The second question is, why Python? Well, we already stated that Python is a widely adopted language. It is not the only choice when it comes to data analysis, but it is a pretty good one. Python is easy to learn and has a large community of developers who can help you with data analysis. Moreover, data analysis using Python is quite enjoyable because of the wide range of libraries it offers for analysis and visualization.
In Python, the base library for data analysis is pandas. It is a high-level library built on NumPy, the library for scientific computing and numerical analysis. Pandas makes it easier to work with data by offering its own data structure, known as the DataFrame. A DataFrame stores your dataset, and pandas provides the base functions for reading and writing it, viewing its metadata, and querying it to extract insights.
It is important to note that data visualization is a considerable part of overall data analysis, because it not only helps you understand the data better yourself but also helps those to whom you are presenting the insights. We will discuss the two most used libraries for visualization: Matplotlib and Seaborn. Matplotlib is the base library for visualizations in Python. Seaborn is built on top of Matplotlib and offers higher-level functions for statistical visualization.
Set Up Environment
The first step is to set up your environment. While performing data analysis using Python, it is important to have a proper environment for keeping all your work. Data analysis using Python is not going to be just a script. It is going to be an interaction between you and the dataset, and for that, you need an appropriate place to work.
In Python, the Anaconda Distribution provides such an environment, and its most popular workspace is the Jupyter notebook. So, why Jupyter? Well, it lets you have the visualizations directly inside your notebook. It also has magic commands, such as %matplotlib inline, that render output directly below a cell without you explicitly stating where you want it.
The pandas and Matplotlib libraries come preinstalled with Anaconda, and hence there is no extra setup required for using them.
Here is a synopsis of the steps for doing data analysis using Python:
- Loading of the Dataset
- Viewing the metadata of the dataset using Pandas
- Data visualizations using Matplotlib
- Collecting insights on data
Import Necessary Libraries
Before we start looking at the code for each step, import the necessary libraries with aliases, that is, the short names we will use to refer to them throughout the program.
import numpy as np
import pandas as pd
# for data visualizations
import matplotlib.pyplot as plt
import seaborn as sns
Now we will look at each step and discuss which functions are available and how to use them.
First, reading datasets. Pandas provides some basic functions for loading a dataset into its core data structure, the DataFrame. We can use it as follows.
data_df = pd.read_csv('heart.csv')
The output of any read function is a DataFrame. Apart from the CSV reader, pandas provides readers for many other formats, including HTML, JSON, and Excel.
Apart from this, if you do not have any data as such and want to create your own dataset, you can build one directly with the pandas Series and DataFrame constructors.
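As a minimal sketch of building a dataset by hand, the snippet below creates a Series and a DataFrame. The names ages and toy_df and all the values are made up for illustration and do not come from heart.csv.

```python
import pandas as pd

# Made-up values purely for illustration; they are not taken from heart.csv.
ages = pd.Series([63, 37, 41], name="age")

# A DataFrame can be built from a dict that maps column names to values.
toy_df = pd.DataFrame({
    "age": ages,
    "chol": [233, 250, 204],
})

print(toy_df)
```

Each key of the dict becomes a column, and every column must have the same number of values.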
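So, once you have the data in hand, let us move on to viewing what the data is about. To get a first view of the data, you can use methods like df.info() and df.describe() to learn the structure of your dataset: the former lists the columns, their types, and non-null counts, while the latter gives summary statistics for the numeric columns.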
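Here is a small sketch of both calls. It assumes a made-up frame named toy_df standing in for a real dataset, so the numbers carry no meaning.

```python
import pandas as pd

# Small made-up frame standing in for a real dataset.
toy_df = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
})

# info() prints the column names, dtypes and non-null counts.
toy_df.info()

# describe() returns count, mean, std, min, quartiles and max per numeric column.
summary = toy_df.describe()
print(summary)
```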
Once you know what features your dataset contains, you might want to look at the values of those. You can use the df.head() function to get the first 5 samples.
You may also specify the number of samples to override the default value of 5. You can also use the df.tail() function for getting the last 5 values of the dataset.
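The following sketch shows head() and tail() with and without an explicit row count. The single-column frame toy_df is hypothetical and only there to make the row selection easy to follow.

```python
import pandas as pd

# Made-up frame with the values 0..9 so the selected rows are easy to see.
toy_df = pd.DataFrame({"value": range(10)})

first_rows = toy_df.head()     # first 5 rows by default
first_three = toy_df.head(3)   # override the default of 5
last_two = toy_df.tail(2)      # last 2 rows

print(first_three)
print(last_two)
```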
This is just to get a high-level overview of what your data might look like. Once ready, you can start the main data visualization tasks using Matplotlib. In a Jupyter notebook, the %matplotlib inline magic command is commonly placed at the top so that plots are rendered in the notebook itself.
We will look at five of the most common visualizations in Matplotlib. Before going into them, we should know some other functions that control our plots:
- Labels: xlabel() and ylabel() set the x-axis and y-axis labels.
- Legend: legend() adds a legend to the plot.
- Title: title() assigns a title to your plot.
- And finally, show() displays the plot.
Let us see the visualizations now. We will start with the basic plot. The plt.plot() function generates a simple line plot for your data. It is typically given two arguments: the x-axis data and the y-axis data. You may optionally provide the line style, a label, and a colour for the plot.
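A minimal sketch of a line plot that also uses the label, legend, and title functions listed earlier. The x and y values are made up for illustration.

```python
import matplotlib.pyplot as plt

# Made-up data points for illustration only.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

plt.figure()
# Optional keyword arguments set the colour, line style, marker and legend label.
plt.plot(x, y, color="green", linestyle="--", marker="o", label="y = x squared")
plt.xlabel("x")
plt.ylabel("y")
plt.title("A simple line plot")
plt.legend()

# Keep a handle on the current axes for later inspection.
ax = plt.gca()
# In a script, call plt.show() here; a notebook displays the figure on its own.
```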
The second plot is the histogram. A histogram helps you view the frequency distribution of a particular feature, that is, how many values fall into each range. plt.hist() is the base function to create a histogram of your data. You can set the bins parameter to control the number of bins in the plot. You only need to pass the data for a single feature if you want a univariate analysis.
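As a sketch, the snippet below draws a histogram with 10 bins. The ages array is synthetic (randomly generated) and only stands in for a real feature.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic, randomly generated values; not real patient data.
rng = np.random.default_rng(0)
ages = rng.normal(loc=54, scale=9, size=300)

plt.figure()
# hist() returns the count per bin, the bin edges and the drawn bars.
counts, bin_edges, _ = plt.hist(ages, bins=10, color="steelblue", edgecolor="white")
plt.xlabel("age")
plt.ylabel("frequency")
```

With bins=10 there are 10 counts and 11 bin edges, since each bin needs a left and a right boundary.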
Another plot that you would see a lot is the bar plot. It helps in analyzing and comparing different features. Unlike histograms, bar plots are used for working with categorical data.
You can directly apply the plot on the DataFrame, or you can specify the parameters inside the plt.bar() function. Here is how we use it.
df = pd.DataFrame(np.random.rand(15, 5), columns=['t1', 't2', 't3', 't4', 't5'])
df.plot.bar()  # one group of bars per row, one bar per column
You can also draw the bars horizontally by using the barh() function.
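The sketch below puts the variants side by side: plotting straight from the DataFrame, the horizontal version, and plt.bar() with explicit categories and heights. Plotting the column means is an illustrative choice, not something prescribed by the text.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Random values, as in the snippet above; the numbers carry no meaning.
df = pd.DataFrame(np.random.rand(15, 5), columns=["t1", "t2", "t3", "t4", "t5"])

# Plot directly from the DataFrame: vertical, then horizontal.
ax_bar = df.plot.bar()
ax_barh = df.plot.barh()

# Or pass categories and heights to plt.bar() yourself.
# Using the column means here is just an example.
plt.figure()
means = df.mean()
bars = plt.bar(means.index, means.values)
plt.ylabel("mean value")
```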
Another insightful graph is the boxplot. It helps in understanding the distribution of values within each feature by showing the median, the quartiles, and the outliers. You can use the plt.boxplot() function and pass it the data for which you want to generate a boxplot. The plot is especially useful when you need to quickly view the dispersion or skewness in the dataset.
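A minimal boxplot sketch with two groups. The arrays group_a and group_b are synthetic and hypothetical, chosen only so that the two boxes differ visibly.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic groups for illustration; not real measurements.
rng = np.random.default_rng(1)
group_a = rng.normal(130, 15, size=100)
group_b = rng.normal(140, 20, size=100)

plt.figure()
# Passing a list of arrays draws one box per array.
result = plt.boxplot([group_a, group_b])
plt.xticks([1, 2], ["group A", "group B"])
plt.ylabel("value")
```

The returned dictionary holds the drawn artists (boxes, medians, whiskers, and so on) in case you want to style them afterwards.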
Whenever you work with statistical data, you will almost certainly see a scatter plot. A scatter plot helps in observing the relationship between two features. The plot requires numeric values for both the x-axis and the y-axis. You can simply pass those two sets of values to the plt.scatter() function, or you can plot directly from the DataFrame by specifying column names in the x and y arguments.
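Both approaches can be sketched as follows. The frame toy_df and its age and chol columns are synthetic, and the linear relationship between them is invented for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic data with an invented linear trend plus noise.
rng = np.random.default_rng(2)
toy_df = pd.DataFrame({"age": rng.integers(30, 75, size=50)})
toy_df["chol"] = 150 + 1.5 * toy_df["age"] + rng.normal(0, 20, size=50)

# Option 1: pass the two sets of values to plt.scatter().
plt.figure()
points = plt.scatter(toy_df["age"], toy_df["chol"])
plt.xlabel("age")
plt.ylabel("chol")

# Option 2: plot from the DataFrame by naming the columns.
ax = toy_df.plot.scatter(x="age", y="chol")
```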
Now is an appropriate time to introduce you to Seaborn functions. A Seaborn scatter plot made with sns.lmplot() can be more informative than the plain Matplotlib one because, by default, it also fits and draws a regression line through the points.
sns.lmplot(x='age', y='chol', data=data_df)
As you can see in the plot above, the regression line helps you understand the relationship between the two features even better.
Another improvement using Seaborn is the swarm plot. It is used to draw a categorical scatter plot. One of the advantages of the swarm plot over the similar strip plot is that it adjusts the points so that they do not overlap. So, it is a cleaner plot and hence gives a better insight.
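To compare the two, here is a sketch that draws a strip plot and a swarm plot of the same data side by side. The frame toy_df is synthetic, and its target and age columns merely borrow names from the heart disease dataset.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data: two categories with 20 made-up ages each.
rng = np.random.default_rng(3)
toy_df = pd.DataFrame({
    "target": np.repeat([0, 1], 20),
    "age": np.concatenate([rng.normal(57, 8, 20), rng.normal(52, 9, 20)]),
})

fig, (ax_strip, ax_swarm) = plt.subplots(1, 2, sharey=True)

# Strip plot: points in a category may overlap.
sns.stripplot(x="target", y="age", data=toy_df, ax=ax_strip)
ax_strip.set_title("strip plot")

# Swarm plot: points are shifted sideways so they do not overlap.
sns.swarmplot(x="target", y="age", data=toy_df, ax=ax_swarm)
ax_swarm.set_title("swarm plot")
```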
So, these are some of the different types of plots in Matplotlib and Seaborn. This is just the tip of the iceberg, and there are many other ways of plotting your data to extract insights from it.
Now that you know the plots, let us see how to do actual data analysis using Python. We will take a look at some more plots and see what they tell us about the data.
After loading the data, a common first step for a data analyst is to generate a pandas profile. This can be viewed as a shortcut, but if you want to see all the relationships, counts, and histograms of the variables in the dataset in one report, you can use pandas profiling. It is very easy to generate: just install the pandas-profiling package (published as ydata-profiling in newer releases) and punch in the following code:
import pandas_profiling
profile = pandas_profiling.ProfileReport(data_df)
As you would be able to see, there is a huge amount of metadata information and also individual feature information. These could lead to some great understanding.
The second thing we can do is generate a heatmap. A heatmap shows the correlation of each feature with every other feature. If we find a pair with a high correlation, that means the two features carry nearly the same information. So, we can drop one of them, and the model will usually still work fine.
sns.heatmap(data_df.corr(), annot=True, cmap='Oranges')
Here we can see that no two features are highly correlated, so we can tell the model engineer that all the features are needed as input.
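The same check can also be done numerically instead of by eye. The sketch below lists every feature pair whose absolute correlation exceeds a cut-off. The frame toy_df, its columns, and the threshold of 0.9 are all hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: "a_copy" nearly duplicates "a", while "b" is independent.
rng = np.random.default_rng(4)
a = rng.normal(size=200)
toy_df = pd.DataFrame({
    "a": a,
    "a_copy": a * 2 + rng.normal(scale=0.01, size=200),
    "b": rng.normal(size=200),
})

corr = toy_df.corr().abs()
threshold = 0.9  # hypothetical cut-off; pick one that suits your data

# Keep only the upper triangle so that each pair is reported once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

high_pairs = [
    (row, col)
    for row in upper.index
    for col in upper.columns
    if upper.loc[row, col] > threshold
]
print(high_pairs)
```

One feature from each reported pair is then a candidate for dropping, as described above.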
Because we are dealing with a heart disease dataset, let us look at the age distribution. Seaborn's distplot was the usual choice for this, but it is deprecated in recent releases, so we use histplot with a density curve instead.
sns.histplot(data_df['age'], kde=True, color='cyan')
From the plot, you can say that most of the patients in this dataset are between the ages of 50 and 60. In the same way, we can also view some other important features like the resting blood pressure, which is denoted by trestbps. We can make a box plot to see its distribution for each target value, i.e. 0 and 1.
sns.boxplot(x='target', y='trestbps', data=data_df, palette='twilight')
We can conclude from the plot that a person with a lower trestbps has a lower chance of suffering from heart disease than one with a higher trestbps.
In the same way, we can also see the relation with cholesterol levels. We do see that people with lower cholesterol levels have a lower chance of suffering from heart disease.
You can document all these insights and provide them to the machine learning engineer, who can then use them to build an efficient model.
So, this is how you can do data analysis using Python. This is just the first step in the data science journey.