Any statistical data analysis project offers many handy tools. The basic idea is to identify the question and pick the function that answers it. For example, if you need to see how the data is distributed, the natural choice is to plot its distribution.
If you need to see the values and compare them with those of other columns, a bar plot or histogram works well. But what if you have a statistical query? A distribution plot shows the trend, but it offers no easy way to read off a specific percentile of the data.
The boxplot solves this problem. A boxplot summarizes the percentile values of one column, grouped by the column it is plotted against. Boxplots can be quite insightful in rule-based model engineering as well as in exploratory data analysis in general.
A boxplot is built from quartiles, the values that split the sorted data into four equal parts.
Let us first plot a pandas boxplot and then understand the parts of it.
Plotting a Pandas Boxplot
To draw a pandas boxplot, there are only two requirements: pandas and matplotlib. Matplotlib does the actual drawing and lets us see the plots inside the Jupyter notebook.
Here is how we import both libraries. We use the inline magic function so that the plots appear directly inside the notebook.
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
Now, we import our data and read it into a DataFrame. Here is how to do it.
data = pd.read_csv("FIFA 2018 Statistics.csv")
The DataFrame is the fundamental data structure of pandas. Calling data.head() shows the first five rows of our data.
After the data is imported, we can directly use the pandas boxplot function over the DataFrame object. Here is how to use it:
data.boxplot(by="Round", column=["Goal Scored"])
We pass two arguments to the pandas boxplot function here. The 'by' parameter selects the column used to group the data along the X-axis, and the 'column' parameter selects the data to plot on the Y-axis.
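If the CSV file is not at hand, the same call can be tried on a small made-up DataFrame. The sketch below is only an illustration: the rows are invented, the output file name toy_boxplot.png is arbitrary, and the Agg backend line is there so the script also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample rows that mimic the two columns used in this article.
toy = pd.DataFrame({
    "Round": ["Group Stage"] * 6 + ["Final"] * 4,
    "Goal Scored": [0, 1, 1, 2, 2, 5, 2, 3, 3, 4],
})

# One box per distinct value of "Round", showing the "Goal Scored" values.
toy.boxplot(by="Round", column=["Goal Scored"])

# Save the current figure to a file (the file name is arbitrary).
plt.savefig("toy_boxplot.png")
```

Inside a notebook, the plot appears under the cell as soon as boxplot is called, so the savefig line is optional.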
Here we are plotting the Goals Scored by Round.
Here is the plot:
Reading the boxplots
Now let us read the plots. First, understand what the axes show. The Y-axis has the number of goals scored in the match, and the X-axis shows the round in which the match was played. Let us take the example of the final round.
If we observe carefully, the box spans roughly two to four, with the middle line at three. The box is plotted using three values: the 25th, 50th, and 75th percentiles. The lower edge of the box denotes the 25th percentile of the goals scored in the match, the middle line denotes the 50th percentile (the median), and the upper edge denotes the 75th percentile. So the height of the box is the inter-quartile range (IQR) of the data, that is, the 75th percentile minus the 25th percentile.
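The three values behind the box can also be computed directly. The following is a minimal sketch on an invented list of goal counts (not the FIFA data), using the pandas quantile method with its default linear interpolation.

```python
import pandas as pd

# Invented goal counts, used only to illustrate the calculation.
goals = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 6], name="Goal Scored")

# 25th, 50th and 75th percentiles: the lower edge, middle line and upper edge of the box.
q1, median, q3 = goals.quantile([0.25, 0.5, 0.75])

# The inter-quartile range is the height of the box.
iqr = q3 - q1

print(q1, median, q3, iqr)  # 1.0 2.0 3.0 2.0
```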
There is one more thing drawn above and below the box. These lines are known as whiskers, which is why the boxplot is sometimes called the box-and-whiskers plot.
There is no unique way to draw the whiskers. One convention marks them at the minimum and maximum values in the data column. Another, used by matplotlib and by libraries built on it such as seaborn, limits them to a multiple of the IQR. A pandas boxplot is drawn by matplotlib, so by default each whisker extends to the most extreme data point within 1.5 times the IQR of the box edge, and any point beyond that is drawn individually.
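To make the whisker rule concrete, here is a small sketch of the 1.5 times IQR convention, which is the matplotlib default (whis=1.5) that pandas relies on for drawing. The numbers are invented for illustration, and the variable names are my own.

```python
import pandas as pd

# Invented goal counts with one unusually high value.
goals = pd.Series([0, 1, 1, 2, 2, 3, 3, 4, 9])

q1 = goals.quantile(0.25)
q3 = goals.quantile(0.75)
iqr = q3 - q1

# Fences: the furthest a whisker is allowed to reach from each box edge.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers stop at the most extreme data points that still lie inside the fences.
inside = goals[(goals >= lower_fence) & (goals <= upper_fence)]
lower_whisker = inside.min()
upper_whisker = inside.max()

# Everything outside the fences is drawn as an individual outlier point.
outliers = goals[(goals < lower_fence) | (goals > upper_fence)]

print(lower_whisker, upper_whisker, outliers.tolist())  # 0 4 [9]
```

Passing a different whis value to the boxplot call changes the multiplier, so the same data can show more or fewer outliers.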
If you look closely, there are some points between four and six. These are known as outliers. Boxplots are also useful in rule-based systems, because they make it quick to pick thresholds and to spot likely misclassifications. For example, if you only need to distinguish between 3rd place matches and finals in this graph, you can write a simple rule: if the goals scored are between zero and two, mark the match as 3rd place, and if they are between two and four, mark it as the final. Such a rule is only as accurate as the separation between the two boxes.
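The rule described above could be written as a small function. This is only a sketch: the function name guess_round and the labels "3rd Place" and "Final" are assumptions, and since the text leaves the boundary value of two ambiguous, it is assigned to the final here.

```python
def guess_round(goals_scored):
    """Guess the round from the goals scored, using thresholds read off the boxplot."""
    if 0 <= goals_scored < 2:
        return "3rd Place"  # assumed label for the 3rd place match
    if 2 <= goals_scored <= 4:
        return "Final"      # boundary value 2 is assigned here by assumption
    return None             # outside both ranges: the rule does not apply


print(guess_round(1))  # 3rd Place
print(guess_round(3))  # Final
print(guess_round(6))  # None
```

Values that fall outside both ranges, such as the outliers in the plot, are left unclassified rather than forced into one of the two rounds.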
Boxplots help you understand the overall distribution of the data columns. The plots show the distributions using the quartile values, which makes it easy to analyze the data quickly. The whiskers cover the remaining values in the column, apart from any outliers.
The lower whisker covers the data below the 25th percentile, while the upper whisker covers the data above the 75th percentile. If there are only a few outliers, pandas boxplots help identify them quickly. Overall, if you can read them properly, boxplots are incredibly useful in data analysis.