Pandas is one of the most widely used libraries for data science and data analysis with Python. What makes it special? In this tutorial, we will go over five functions that make Pandas an extremely useful tool in a data scientist’s toolkit.
By the end of this tutorial, you’ll know what the following Pandas functions do and how to use them in your own applications:
- value_counts
- groupby
- loc and iloc
- unique and nunique
- cut and qcut
Top Pandas Functions For Data Scientists
1. value_counts()
Pandas’ value_counts() function shows how many times each unique value occurs in a column (a Series) of a dataframe.
Pro Tip: While Pandas gives the output as plain text, you can easily plot the values using the inbuilt bar plot in Pandas for a graphical representation of the same information.
To demonstrate, I’ll be using the Titanic dataset, loaded into a dataframe called train.
Now, to find the counts of classes in the Embarked feature, we can call the value_counts function:
train['Embarked'].value_counts()
#Output:
S    644
C    168
Q     77
Also, if the raw counts are hard to interpret, you can view them as proportions instead:
train['Embarked'].value_counts(normalize=True)
#Output:
S    0.724409
C    0.188976
Q    0.086614
Moreover, value_counts ignores NaN (missing) values by default, even though the amount of missing data is essential to check. To include them, set the dropna parameter to False:
train['Embarked'].value_counts(dropna=False)
#Output:
S      644
C      168
Q       77
NaN      2
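The three calls above can be tried without the Titanic file. The following is a minimal sketch on a small hand-made Series; the values in `ports` are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an 'Embarked'-like column (made-up values).
ports = pd.Series(["S", "C", "S", "Q", "S", np.nan, "C", "S"])

counts = ports.value_counts()                  # NaN is skipped by default
shares = ports.value_counts(normalize=True)    # fractions of the non-missing rows
with_nan = ports.value_counts(dropna=False)    # NaN gets its own row

print(counts)
print(shares)
print(with_nan)
```

Note that normalize=True divides by the number of non-missing rows unless dropna=False is passed as well, so the shares always sum to 1.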
2. groupby()
With Pandas groupby, we can split and group our dataframe by certain columns to be able to view patterns and details in the data. A groupby operation involves 3 main steps: splitting the data into groups, applying a function to each group, and combining the results.
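train.groupby('Sex').mean(numeric_only=True)
Here we grouped the dataframe by the ‘Sex’ feature and aggregated every numeric column using its mean. In recent Pandas versions, numeric_only=True is needed so that text columns such as Name are skipped instead of raising an error.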
You can also plot it using Pandas’ built-in visualization:
train.groupby('Sex').sum(numeric_only=True).plot(kind='bar')
We can also group by multiple features for hierarchical splitting:
train.groupby(['Sex', 'Survived'])['Survived'].count()
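The split, apply and combine steps can be spelled out by hand to see what groupby does in a single call. This is only a sketch on made-up passenger records; the frame `toy` and its values are hypothetical.

```python
import pandas as pd

# Made-up passenger records, used only to illustrate the mechanics.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Survived": [0, 1, 1, 0, 0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 10.50],
})

# Split: one sub-frame per value of 'Sex'.
pieces = {key: group for key, group in toy.groupby("Sex")}

# Apply: compute the mean fare inside each piece.
means_by_hand = {key: group["Fare"].mean() for key, group in pieces.items()}

# Combine: groupby does all three steps at once and returns a Series.
means = toy.groupby("Sex")["Fare"].mean()

print(means_by_hand)
print(means)
```

Both routes give the same numbers; the one-line form is simply shorter and returns a result indexed by the group keys.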
3. loc and iloc
Indexing is one of the most basic operations in Pandas, and the best way to do it is with either loc or iloc. “loc” stands for location and “iloc” for integer location. In other words, when you want to index a dataframe using the names or labels of columns/rows, you’d use loc. And when you want to index columns or rows by their integer positions, you’d use iloc. Let’s check out loc first.
train.loc[2, 'Sex']
The above operation gives us the element at row label 2 and column ‘Sex’. Similarly, if you need all the values of the Sex column, you’d do:
train.loc[:, 'Sex']
Also, you can select multiple columns by passing a list:
train.loc[:, ['Sex', 'Embarked']]
You can also filter rows using boolean conditions within loc:
train.loc[train['Age'] >= 25]
To view only certain rows, you can slice the dataframe using loc. Unlike ordinary Python slices, label slices include both endpoints, so this returns the rows labelled 100 through 200:
train.loc[100:200]
Moreover, you can slice the dataframe on the column axis as:
train.loc[:, 'Sex':'Fare']
The above operation will slice the dataframe from the column ‘Sex’ to ‘Fare’, both included, for all the rows.
Now, let’s move on to iloc. iloc indexes only by integer position. You can slice dataframes like:
train.iloc[100:200, 2:9]
The above operation selects the rows at positions 100 to 199 and the columns at positions 2 through 8, because the end of a position slice is excluded. Similarly, to take only the first 300 rows of your data, you can do:
train.iloc[:300, :]
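The difference between labels and positions is easiest to see when the row labels are not 0, 1, 2 and so on. The sketch below uses a made-up frame with a hypothetical index of 10, 20, 30, 40 to contrast loc and iloc.

```python
import pandas as pd

# Made-up frame whose row labels deliberately differ from row positions.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22, 38, 26, 35],
        "Fare": [7.25, 71.28, 7.92, 53.10],
    },
    index=[10, 20, 30, 40],
)

by_label = toy.loc[20, "Age"]       # row labelled 20, column 'Age'
by_position = toy.iloc[1, 1]        # second row, second column (the same cell)

label_slice = toy.loc[10:30]        # label slices include the end label: 3 rows
position_slice = toy.iloc[0:2]      # position slices exclude the end: 2 rows

# Boolean mask for rows combined with a list of column labels.
adults = toy.loc[toy["Age"] >= 30, ["Sex", "Fare"]]

print(by_label, by_position)
print(label_slice)
print(position_slice)
print(adults)
```

On a freshly loaded dataframe with the default index, labels and positions coincide, which is why the two accessors can look interchangeable until the frame is filtered or re-indexed.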
4. unique() and nunique()
Pandas unique is used to get all the unique values of a feature. This is mostly used to get the categories of categorical features in the data. unique returns the values in order of first appearance and, if the column has missing data, includes NaN as a value of its own. Let’s take a look:
train['Sex'].unique()
#Output: ['male', 'female']
As we see, it gives us the unique values in the ‘Sex’ feature.
Similarly, you can check the number of unique values, which is handy when a feature has too many unique values to list:
train['Sex'].nunique()
#Output: 2
However, you should keep in mind that nunique() doesn’t count NaN as a unique value. If there are NaNs in your data, pass the dropna parameter as False to make Pandas include them in the count. The ‘Sex’ feature has no missing values, so the difference shows up on ‘Embarked’, which has three port codes plus two missing entries:
train['Embarked'].nunique(dropna=False)
#Output: 4
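How unique() and nunique() treat missing values can be checked on a tiny made-up column. This is a sketch only; the Series `ports` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up column with one missing value.
ports = pd.Series(["S", "C", "S", np.nan, "Q"])

values = ports.unique()                    # order of first appearance, NaN included
n_without_nan = ports.nunique()            # NaN is ignored by default
n_with_nan = ports.nunique(dropna=False)   # NaN counted as one extra value

print(values)
print(n_without_nan, n_with_nan)
```

So len(ports.unique()) and ports.nunique() can differ by one whenever a column contains missing data.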
5. cut() and qcut()
Pandas cut is used to bin values into ranges in order to discretize a feature. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges the continuous values fall in. This comes in handy when you want to see trends based on which range a data point falls in.
Let’s understand this with a small example.
Suppose we have marks for 8 kids, ranging from 0 to 100. Now, we can assign every kid’s marks to a particular “bin”.
import pandas as pd
df = pd.DataFrame(data={
    'Name': ['Ck', 'Ron', 'Mat', 'Josh', 'Tim', 'SypherPK', 'Dew', 'Vin'],
    'Marks': [37, 91, 66, 42, 99, 81, 45, 71]
})
df['marks_bin'] = pd.cut(df['Marks'], bins=[0, 50, 70, 100], labels=[1, 2, 3])
This appends the bin as a new feature; the Marks feature can then be dropped if it is no longer needed. The new dataframe looks something like:
#Output:
       Name  Marks marks_bin
0        Ck     37         1
1       Ron     91         3
2       Mat     66         2
3      Josh     42         1
4       Tim     99         3
5  SypherPK     81         3
6       Dew     45         1
7       Vin     71         3
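So, bins=[0, 50, 70, 100] defines 3 ranges. By default, each range excludes its left edge and includes its right edge:
above 0 up to 50 for bin 1,
above 50 up to 70 for bin 2, and
above 70 up to 100 for bin 3.
A mark of exactly 0 would therefore fall outside every bin and come back as NaN unless you pass include_lowest=True.
So, now the new feature contains not the marks themselves but the bin that each student’s marks fall into.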
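How cut treats values that sit exactly on a bin edge is easy to check directly. The sketch below uses made-up marks chosen to land on the edges, and compares the default behaviour with include_lowest=True.

```python
import pandas as pd

# Made-up marks that sit on or next to the bin edges.
marks = pd.Series([0, 50, 51, 70, 71, 100])

# Default: intervals are open on the left and closed on the right.
default_bins = pd.cut(marks, bins=[0, 50, 70, 100], labels=[1, 2, 3])

# include_lowest=True also puts the lowest edge (0) into the first bin.
with_lowest = pd.cut(
    marks, bins=[0, 50, 70, 100], labels=[1, 2, 3], include_lowest=True
)

print(default_bins.tolist())
print(with_lowest.tolist())
```

With the default settings, the mark of 0 is left unbinned (NaN), while 50 and 70 land in bins 1 and 2 respectively.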
Similar to cut(), Pandas also offers a sibling function called qcut(). Pandas qcut takes the number of quantiles and chooses the bin edges from the data distribution, so that each bin receives roughly the same number of data points. So, we can just change the cut function above to qcut:
df['marks_bin'] = pd.qcut(df['Marks'], q=3, labels=[1, 2, 3])
In the above operation, we tell Pandas to cut the feature into 3 bins of roughly equal size and assign them the labels. The output comes as:
#Output:
       Name  Marks marks_bin
0        Ck     37         1
1       Ron     91         3
2       Mat     66         2
3      Josh     42         1
4       Tim     99         3
5  SypherPK     81         3
6       Dew     45         1
7       Vin     71         2
Notice how the last value changed from 3 to 2: with these 8 marks, the middle bin now covers roughly 52 to 78, so 71 falls into bin 2.
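To see why the two functions disagree, it helps to count how many marks land in each bin. The following sketch reuses the marks from the example; retbins=True is used to ask qcut for the edges it derived from the data.

```python
import pandas as pd

# The eight marks from the example above.
marks = pd.Series([37, 91, 66, 42, 99, 81, 45, 71])

# cut: fixed, hand-picked edges.
fixed = pd.cut(marks, bins=[0, 50, 70, 100], labels=[1, 2, 3])

# qcut: edges taken from the data's quantiles; retbins=True also returns them.
terciles, edges = pd.qcut(marks, q=3, labels=[1, 2, 3], retbins=True)

print(fixed.value_counts().sort_index())     # bin sizes follow the fixed edges
print(terciles.value_counts().sort_index())  # bin sizes are as even as possible
print(edges)                                 # edges qcut derived from the data
```

With fixed edges the bins hold 3, 1 and 4 marks; with quantile edges they hold 3, 2 and 3, which is the trade-off between meaningful thresholds and balanced groups.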
Before you go
We covered some of the most used Pandas functions. These are not the only important ones, and we’d encourage you to learn more of the commonly used Pandas functions. Focusing on these is an efficient approach, since in practice you are likely to use only a small part of everything Pandas offers.