Any statistical data analysis project offers many handy tools. The basic idea is to identify the question and then use the function that answers it. For example, if you need to see how the data is distributed, you can plot its distribution.
If you need to see the values and compare them with the values in other columns, a bar plot or histogram works well. But what if you need to answer a statistical query? A distribution plot shows the overall trend, but it offers no easy way to read off a specific percentile of the data.
The boxplot solves this problem. A boxplot summarizes the percentile values of one column, grouped by the column it is plotted against. Boxplots can be quite insightful in rule-based model engineering as well as in exploratory data analysis in general.
Boxplot deals with quartiles.
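As a quick illustration of what quartiles are, here is a minimal sketch that computes them with pandas. The goal counts are made up for the example and are not taken from the FIFA dataset used later.

```python
import pandas as pd

# Hypothetical goal counts, invented for illustration only.
goals = pd.Series([0, 1, 1, 2, 2, 3, 4])

# quantile() returns the value below which the given fraction of the data falls
# (pandas interpolates linearly between neighbouring values by default).
q1, median, q3 = goals.quantile([0.25, 0.5, 0.75])

print("Q1:", q1, "Median:", median, "Q3:", q3)
```

The three numbers are the 25th, 50th, and 75th percentiles, which are the values a boxplot draws as the bottom, middle line, and top of the box.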
Let us first plot a pandas boxplot and then understand the parts of it.
Plotting a Pandas Boxplot
To plot a pandas boxplot, there are only two requirements: Pandas and matplotlib. Matplotlib renders the plots and displays them inside the Jupyter notebook.
Here is how we import both the libraries. We use the inline magic function so that the plots can be seen directly inside the notebook.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Now, we import our data and read it into a DataFrame. Here is how to do it.
data = pd.read_csv("FIFA 2018 Statistics.csv")
DataFrame is the fundamental data structure of Pandas. The first five rows of the data can be inspected with data.head().
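If the CSV file is not at hand, the loading step can still be tried out. The sketch below reads a tiny in-memory CSV instead of the real file. The three rows are invented, and only the two column names used in this article are assumed.

```python
import io

import pandas as pd

# A tiny stand-in for "FIFA 2018 Statistics.csv"; the rows are hypothetical.
csv_text = """Round,Goal Scored
Group Stage,5
Group Stage,0
Final,4
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

# head() shows the first five rows by default (here there are only three).
print(sample.head())
```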
After the data is imported, we can directly use the pandas boxplot function over the DataFrame object. Here is how to use it:
data.boxplot(by="Round", column=["Goal Scored"])
Here we pass two arguments to the pandas boxplot function. The 'by' parameter names the column used to group the data, and its values appear along the X-axis. The 'column' parameter names the data to plot on the Y-axis.
Here we are plotting the Goals Scored by Round.
Here is the plot:
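The same call can be tried without the original CSV. Below is a minimal sketch on a small hand-made DataFrame. The values are hypothetical, and the Agg backend line is only there so the script also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical matches: the round played and the goals scored in each.
toy = pd.DataFrame({
    "Round": ["Group Stage"] * 6 + ["Final"] * 4,
    "Goal Scored": [0, 1, 1, 2, 3, 5, 2, 2, 3, 4],
})

# One box per distinct value of "Round", showing the spread of "Goal Scored".
toy.boxplot(by="Round", column=["Goal Scored"])

# pandas adds a "Boxplot grouped by Round" super-title; clear it for a cleaner plot.
plt.suptitle("")
fig = plt.gcf()
```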
Reading the boxplots
Now let us read the plots. First, understand the axes. The Y-axis shows the number of goals scored in the match, and the X-axis shows the round in which the game was played. Let us take the example of the final round.
If we observe carefully, the box spans roughly two to four, with the middle line at three. The box is drawn from three values: the 25th, 50th, and 75th percentiles. The lower edge of the box marks the 25th percentile of the goals scored in the match, the middle line marks the 50th percentile (the median), and the upper edge marks the 75th percentile. The height of the box is therefore the inter-quartile range (IQR) of the data.
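The three values behind each box can also be computed directly. The sketch below does this per round on a hypothetical DataFrame. The numbers are invented, chosen so that the final round gives a box from two to four with the median at three.

```python
import pandas as pd

# Hypothetical goals per match; not the real FIFA data.
toy = pd.DataFrame({
    "Round": ["Final"] * 5 + ["3rd Place"] * 5,
    "Goal Scored": [2, 2, 3, 4, 4, 0, 1, 1, 2, 2],
})

# 25th, 50th and 75th percentile of "Goal Scored" within each round.
box_edges = (
    toy.groupby("Round")["Goal Scored"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()  # one row per round, one column per quantile
)
box_edges.columns = ["Q1", "Median", "Q3"]

# The inter-quartile range is the height of the box.
box_edges["IQR"] = box_edges["Q3"] - box_edges["Q1"]

print(box_edges)
```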
Now, there is one more thing drawn above and below the box. These lines are known as whiskers. Hence, sometimes boxplot is also known as the box-and-whiskers plot.
There is no unique way to plot the whiskers. One common convention marks them at the minimum and maximum values in the data column. Another, used by libraries such as seaborn, places them at a multiple of the IQR beyond the box. Pandas boxplot draws through matplotlib, which by default extends each whisker to the farthest data point that lies within 1.5 times the IQR of the box edge. Points beyond that are drawn individually.
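The IQR-based whisker rule can be worked out by hand. The following is a minimal sketch, assuming the commonly used multiplier of 1.5 and a made-up set of values. It is an illustration of the convention, not a copy of any library's internal code.

```python
import numpy as np

# Hypothetical values with one unusually large observation.
values = np.array([0, 1, 1, 2, 2, 2, 3, 3, 4, 9])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Fences sit 1.5 * IQR beyond the box edges (1.5 is an assumed multiplier).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Anything outside the fences would be drawn as an individual point.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(whisker_low, whisker_high, outliers)
```

Under the minimum/maximum convention, the upper whisker for the same data would instead reach all the way to 9.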
If you notice, there are some points between four and six, drawn separately from the box and whiskers. These are known as outliers. Boxplots are also useful in rule-based systems, because they make it quick to derive threshold rules and to spot likely misclassifications. For example, if you only need to distinguish between 3rd place rounds and final rounds in this graph, you can build a simple rule from the boxes: if the goals scored fall between zero and two, mark the match as the 3rd place round, and if they fall between two and four, mark it as the final round.
Boxplots help you understand the overall distribution of the data columns. The plots show the distributions by using the quartile values, which makes it easier to analyze the data quickly. The whiskers cover the remaining values outside the box.
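The rule described above can be written as a small function. This is only a sketch: the function name and the "Unknown" fallback are hypothetical, and assigning exactly two goals to the final round is an assumption, because the two ranges in the text share that boundary.

```python
def guess_round(goals_scored):
    """Hypothetical threshold rule read off the two boxes in the plot."""
    if 0 <= goals_scored < 2:
        return "3rd Place"
    if 2 <= goals_scored <= 4:
        # Assumption: the shared boundary value 2 goes to the final round.
        return "Final"
    # Values outside both ranges are not covered by the rule.
    return "Unknown"


for goals in (1, 3, 7):
    print(goals, "->", guess_round(goals))
```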
The lower whisker covers the data below the 25th percentile, while the upper whisker covers the data above the 75th percentile. When there are only a few outliers, pandas boxplots make them quick to spot. Overall, if you can read them properly, boxplots are incredibly useful in data analysis.
What type of data is portrayed by a box plot?
Box plots are widely used in descriptive statistics and are a common chart for exploratory data analysis. By displaying the quartiles (the 25th, 50th, and 75th percentiles), including the median, a box plot visually portrays the distribution of numerical data along with its skewness.
A box plot displays a visual summary of a dataset in five values. The values shown by the box plot are:
1. Minimum score
2. First, or lower, quartile
3. Median
4. Third, or upper, quartile
5. Maximum score
Dividing the data into these sections makes it easy to represent and to understand visually.
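These five values can be read off a pandas Series with describe(). The sketch below uses made-up scores.

```python
import pandas as pd

# Hypothetical scores, invented for illustration.
scores = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# describe() reports count, mean, std, min, the three quartiles and max.
summary = scores.describe()

# Keep only the five values that a box plot summarizes.
five_number = summary[["min", "25%", "50%", "75%", "max"]]

print(five_number)
```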
Why are box plots found to be useful?
A box plot divides a dataset into four sections, each containing approximately 25% of the data. Box plots are useful because they provide a visual summary of the data. This allows researchers to identify the median easily, spot signs of skewness, and see the dataset's dispersion.
The box plot gives you a visual way to see whether a dataset is skewed or roughly symmetric. If the distribution is symmetric, as a normal distribution is, the median will be in the middle of the box and the whiskers will be of similar length. On the other hand, when the distribution is skewed, the box will be asymmetric and the median will sit towards the bottom or top of the box.
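The position of the median inside the box can be turned into a number. The sketch below uses a hypothetical helper, median_position, on two made-up series: a value near 0.5 means the median sits in the middle of the box, and a value near 0 or 1 means it sits near the bottom or top.

```python
import pandas as pd

# Hypothetical data: one symmetric series and one with a long right tail.
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_skewed = pd.Series([1, 1, 2, 2, 3, 8, 15])


def median_position(series):
    """Relative position of the median between Q1 (0.0) and Q3 (1.0)."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    return (median - q1) / (q3 - q1)


print("symmetric:", median_position(symmetric))
print("right-skewed:", median_position(right_skewed))

# Series.skew() gives a complementary check: positive means a longer right tail.
print("skewness:", right_skewed.skew())
```

A median in the middle of the box suggests symmetry but does not by itself prove that the data is normally distributed.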
Can we utilize Pandas for Data Visualization?
Pandas is one of the most widely used Python libraries for data science. It is especially helpful for importing, manipulating, and cleaning datasets. Beyond that, Pandas is also widely used for data visualization.
In data visualization, Pandas is used for basic plots, and its plotting functions also work with time series data. In short, if you wish to plot simple bars, counts, or lines, Pandas is a reasonable choice.
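As an example of such a basic plot, here is a minimal sketch of a bar chart drawn straight from a pandas Series. The data is hypothetical, and the Agg backend line is only there so the script also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; drop this line inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical matches; not the real FIFA data.
toy = pd.DataFrame({
    "Round": ["Group Stage", "Group Stage", "Final", "Final"],
    "Goal Scored": [1, 3, 2, 4],
})

# Total goals per round, then one bar per round.
totals = toy.groupby("Round")["Goal Scored"].sum()

fig, ax = plt.subplots()
totals.plot(kind="bar", ax=ax)
ax.set_ylabel("Total goals")
```

Changing kind="bar" to kind="line" draws the same totals as a line plot.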