Top 5 Pandas Functions Essential for Data Scientists [2021]

Pandas is one of the most widely used and loved libraries for Data Science and Data Analysis with Python. What makes it special? In this tutorial, we will go over 5 functions that make Pandas an extremely useful tool in a Data Scientist’s toolkit.

By the end of this tutorial, you’ll know the following Pandas functions and how to use them in your applications:

  • value_counts
  • groupby
  • loc and iloc
  • unique and nunique
  • cut and qcut

Top Pandas Functions For Data Scientists

1. value_counts()

Pandas’ value_counts() function shows the count of each unique value in a column of a dataframe, sorted from the most frequent to the least frequent.

Pro Tip: While Pandas gives the output as plain text, you can easily plot the values with the built-in bar plot in Pandas, for example by chaining .plot(kind='bar') onto the result, for a graphical representation of the same information.

To demonstrate, I’ll be using the Titanic Dataset, loaded into a dataframe named train.

Now, to find the counts of classes in the Embarked feature, we can call the value_counts function:

train['Embarked'].value_counts()

#Output:
S      644
C      168
Q       77

Also, if these numbers don’t make much sense on their own, you can view them as proportions of the total instead (multiply by 100 for percentages):

train['Embarked'].value_counts(normalize=True)

#Output:
S    0.724409
C    0.188976
Q    0.086614

Moreover, value_counts ignores NaN (missing) values by default, and the number of missing values is essential to check. To include them, set the dropna parameter to False:

train['Embarked'].value_counts(dropna=False)

#Output:
S      644
C      168
Q       77
NaN      2
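
If you don’t have the Titanic data at hand, the three variants above can be tried on a tiny stand-in. This is only a sketch: the embarked Series below is made-up data, not the real Embarked column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the 'Embarked' column, with one missing value.
embarked = pd.Series(["S", "C", "S", "Q", "S", np.nan, "C"], name="Embarked")

counts = embarked.value_counts()                 # NaN is skipped by default
shares = embarked.value_counts(normalize=True)   # proportions of the non-missing values
with_nan = embarked.value_counts(dropna=False)   # NaN gets its own row

print(counts)
print(shares.round(3))
print(with_nan)
```

Note that the proportions are computed over the non-missing values only, so they still add up to 1 even though one entry is NaN.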

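2. groupby()

With Pandas groupby, we can split and group our dataframe by certain columns to view patterns and details in the data. A groupby operation involves 3 main steps: splitting the data into groups, applying a function to each group, and combining the results. Below, numeric_only=True restricts the mean to the numeric columns, since recent Pandas versions raise an error when asked to average text columns.

train.groupby('Sex').mean(numeric_only=True)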

Here, we grouped the dataframe by the ‘Sex’ feature and aggregated each numeric column using its mean, giving one row per group.
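
To make the three steps concrete, here is a minimal sketch that performs the split, apply and combine steps by hand and then lets groupby do the same work in one call. The toy dataframe and its values are hypothetical; only the column names are borrowed from the Titanic data.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the Titanic columns used in this article.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, 26.0, 35.0, 30.0],
    "Fare": [7.25, 71.28, 7.93, 8.05, 10.00],
})

# Split: one sub-frame per distinct value of 'Sex'.
pieces = {key: part for key, part in toy.groupby("Sex")}

# Apply: compute the mean of the numeric columns in each piece.
means = {key: part[["Age", "Fare"]].mean() for key, part in pieces.items()}

# Combine: put the per-group results back into one dataframe, one row per group.
manual = pd.DataFrame(means).T
manual.index.name = "Sex"

# groupby performs the same three steps in a single call.
grouped = toy.groupby("Sex")[["Age", "Fare"]].mean()

print(manual)
print(grouped)
```

Both printouts should show the same table, which is the point: groupby is a shortcut for this pattern, not a different computation.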

You can also plot it using Pandas’ built-in visualization:

train.groupby('Sex').sum(numeric_only=True).plot(kind='bar')

We can also group by multiple features for a hierarchical split:

train.groupby(['Sex', 'Survived'])['Survived'].count()
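
As a rough illustration of what the hierarchical result looks like, the sketch below groups a small made-up dataframe by two keys and then pivots the inner level into columns with unstack(). The data is hypothetical and the unstack step is my own addition for readability.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the Titanic columns above.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Grouping by two keys yields a Series with a two-level (hierarchical) index.
counts = toy.groupby(["Sex", "Survived"])["Survived"].count()

# unstack() pivots the inner level ('Survived') into columns for easier reading.
table = counts.unstack(fill_value=0)

print(counts)
print(table)
```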


3. loc and iloc

Indexing is one of the most basic operations in Pandas, and the best way to do it is with either loc or iloc. “loc” stands for location, and the “i” in iloc stands for integer, i.e. integer location. In other words, when you want to index a dataframe using the names or labels of columns/rows, you’d use loc. And when you want to index columns or rows using their integer positions, you’d use iloc. Let’s check out loc first.

train.loc[2, 'Sex']

The above operation gives us the element at row label 2 in the column ‘Sex’. Similarly, if you needed all the values of the Sex column, you’d do:

train.loc[:, 'Sex']

Also, you can select multiple columns by passing a list of labels:

train.loc[:, ['Sex', 'Embarked']]

You can also filter rows using boolean conditions within loc:

train.loc[train['Age'] >= 25]


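To only view certain rows, you can slice the dataframe using loc. Note that loc slices by label and includes both endpoints, so this returns the rows labelled 100 through 200:

train.loc[100:200]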

Moreover, you can slice the dataframe along the column axis:

train.loc[:, 'Sex':'Fare']

The above operation will slice the dataframe from the column ‘Sex’ up to and including ‘Fare’, for all the rows.
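
The loc patterns above can be tried end to end on a small frame. This is a sketch on made-up data; I have deliberately given the rows labels 10 to 13 so that it is obvious loc works with labels rather than positions.

```python
import pandas as pd

# Hypothetical toy frame; the index labels are deliberately not 0..n-1.
toy = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22, 38, 26, 35],
        "Fare": [7.25, 71.28, 7.93, 8.05],
        "Embarked": ["S", "C", "S", "Q"],
    },
    index=[10, 11, 12, 13],
)

one_cell = toy.loc[12, "Sex"]                # single value by row label and column label
two_cols = toy.loc[:, ["Sex", "Embarked"]]   # a list selects several columns
adults = toy.loc[toy["Age"] >= 25]           # boolean mask keeps the matching rows
rows = toy.loc[11:12]                        # label slice: both 11 and 12 are included
cols = toy.loc[:, "Sex":"Fare"]              # column slice: 'Fare' is included too

print(one_cell)
print(rows)
print(cols)
```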

Now, let’s move on to iloc. iloc indexes only by integer position, regardless of the row or column labels. You can slice dataframes like:

train.iloc[100:200, 2:9]


The above operation selects the rows at positions 100 through 199 and the columns at positions 2 through 8, because iloc excludes the end of a slice. Similarly, if you want to keep only the first 300 rows with all the columns, you can do:

train.iloc[:300, :]
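
The difference between the two indexers is easiest to see when the row labels do not start at 0. The following sketch uses a hypothetical frame with labels 100 to 104 and compares a position slice with a label slice.

```python
import pandas as pd

# Hypothetical toy frame whose index labels start at 100 rather than 0.
toy = pd.DataFrame(
    {"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50], "C": list("vwxyz")},
    index=[100, 101, 102, 103, 104],
)

by_position = toy.iloc[1:3, 0:2]       # rows at positions 1 and 2, columns at positions 0 and 1
by_label = toy.loc[101:103, "A":"B"]   # labels 101, 102 and 103; both ends are included
head = toy.iloc[:3, :]                 # first three rows, all columns

print(by_position)
print(by_label)
print(head)
```

Here iloc[1:3] returns two rows while loc[101:103] returns three, because only the label-based slice includes its end point.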

4. unique() and nunique()

Pandas unique is used to get all the unique values of a feature. This is mostly used to get the categories of categorical features in the data. unique returns all the unique values including NaN, which it treats as a distinct value of its own. Let’s take a look:

train['Sex'].unique()

#Output:
['female', 'male']

As we see, it gives us the unique values in the ‘Sex’ feature.

Similarly, you can check the number of unique values, which is handy when a feature has too many unique values to list:

train['Sex'].nunique()

#Output:
2

However, you should keep in mind that nunique() doesn’t count NaN as a unique value by default. If there are NaNs in your data, you need to pass dropna=False to make Pandas include them in the count. The Sex column has no missing values, so let’s use Embarked, which has three values plus NaN, as value_counts showed earlier:

train['Embarked'].nunique(dropna=False)

#Output:
4
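
Here is a small sketch of how unique() and nunique() treat missing values. The embarked Series is made-up data with three distinct values and one NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three distinct values and one missing value.
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "C"])

values = embarked.unique()                    # distinct values, NaN included
n_without_nan = embarked.nunique()            # NaN is ignored by default
n_with_nan = embarked.nunique(dropna=False)   # NaN counted as one extra value

print(values)
print(n_without_nan, n_with_nan)
```

If the two counts differ, the column contains at least one missing value, which makes this a quick sanity check.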

5. cut() and qcut()

Pandas cut is used to bin values into ranges in order to discretize a feature. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges the continuous values fall in. This comes in handy when you want to see trends based on the range a data point falls in.

Let’s understand this with a small example.

Suppose we have marks for 8 kids, ranging from 0 to 100. Now, we can assign every kid’s marks to a particular “bin”. 

import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Ck', 'Ron', 'Mat', 'Josh', 'Tim', 'SypherPK', 'Dew', 'Vin'],
    'Marks': [37, 91, 66, 42, 99, 81, 45, 71]
})

# Assign each mark to one of three bins: (0, 50], (50, 70], (70, 100]
df['marks_bin'] = pd.cut(df['Marks'], bins=[0, 50, 70, 100], labels=[1, 2, 3])

Here we append the output as a new feature, after which the Marks feature can be dropped if it is no longer needed. The new dataframe looks something like:

#Output:
       Name  Marks  marks_bin
0        Ck     37          1
1       Ron     91          3
2       Mat     66          2
3      Josh     42          1
4       Tim     99          3
5  SypherPK     81          3
6       Dew     45          1
7       Vin     71          3

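So, when I say bins = [0, 50, 70, 100], it means that there are 3 ranges, each of which excludes its lower edge and includes its upper edge:

above 0 and up to 50 for bin 1,

above 50 and up to 70 for bin 2, and 

above 70 and up to 100 belonging to bin 3.

A mark of exactly 0 would therefore fall outside every bin and get NaN, unless you pass include_lowest=True.

So, now our feature doesn’t contain the marks but the range or the bin to which each student’s marks belong.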
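
To see exactly where the bin edges fall, it helps to call cut without labels, so that Pandas returns the intervals themselves. The marks below are hypothetical values I picked to sit right on the edges; this is a sketch, not part of the original example.

```python
import pandas as pd

# Hypothetical marks chosen to sit on or next to the bin edges.
marks = pd.Series([0, 37, 50, 51, 70, 71, 100])

# Without labels, pd.cut returns the interval each value falls into.
default = pd.cut(marks, bins=[0, 50, 70, 100])

# include_lowest=True closes the first interval on the left, so a mark of 0 gets a bin.
closed = pd.cut(marks, bins=[0, 50, 70, 100], labels=[1, 2, 3], include_lowest=True)

print(pd.DataFrame({"marks": marks, "interval": default, "bin": closed}))
```

In the default call the mark of 0 gets NaN, while 50 still belongs to the first bin and 51 to the second.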

Similar to cut(), Pandas also offers a companion function called qcut(). Pandas qcut takes in the number of quantiles and chooses the bin edges from the data distribution, so that each bin holds roughly the same number of data points. So, we can just change the cut function in the above to qcut:

df['marks_bin'] = pd.qcut(df['Marks'], q=3, labels=[1, 2, 3])

In the above operation, we tell Pandas to cut the feature into 3 quantile-based bins of roughly equal size and assign them the labels. The output comes as:

       Name  Marks  marks_bin
0        Ck     37          1
1       Ron     91          3
2       Mat     66          2
3      Josh     42          1
4       Tim     99          3
5  SypherPK     81          3
6       Dew     45          1
7       Vin     71          2

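Notice how the last value changed from 3 to 2: with quantile-based edges, the middle bin now extends to about 77.7 instead of 70, so a mark of 71 falls in the middle bin rather than the top one. 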
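
If you want to check which edges qcut picked, you can ask for them with retbins=True. A minimal sketch using the same eight marks (the variable names here are my own):

```python
import pandas as pd

# The eight marks from the example above.
marks = pd.Series([37, 91, 66, 42, 99, 81, 45, 71])

# retbins=True also returns the bin edges that qcut computed from the data.
bins, edges = pd.qcut(marks, q=3, labels=[1, 2, 3], retbins=True)

print(edges)                              # edges at the 0, 1/3, 2/3 and 1 quantiles
print(bins.value_counts().sort_index())   # how many marks landed in each bin
```

Unlike cut, where you choose the edges yourself, the edges here depend entirely on the data, so adding or removing marks can move them.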


Before you go

We covered some of the most commonly used Pandas functions. These are not the only important ones, and we’d encourage you to learn more of them. Focusing on the most used functions is an efficient approach, since in practice you are likely to use only a small part of everything Pandas offers. 

If you are curious to learn about Pandas, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.
