Top 5 Pandas Functions Essential for Data Scientists [2024]

Pandas is one of the most widely used and loved libraries for data science and data analysis with Python. What makes it special? In this tutorial, we will go over 5 functions that make Pandas an extremely useful tool in a data scientist’s toolkit.

By the end of this tutorial, you’ll know how to use the following Pandas functions in your own applications:

value_counts
groupby
loc and iloc
unique and nunique
cut and qcut

Top Pandas Functions For Data Scientists

1. value_counts()

Pandas’ value_counts() function returns the count of each unique value in a column of a dataframe, sorted from the most to the least frequent by default.

Pro Tip: While Pandas gives the output as plain text, you can easily plot the values using the inbuilt bar plot in Pandas for a graphical representation of the same information.

To demonstrate, I’ll be using the Titanic dataset, assumed here to be already loaded into a dataframe named train.

Now, to find the counts of classes in the Embarked feature, we can call the value_counts function:

train['Embarked'].value_counts()

#Output:
S 644
C 168
Q 77

Also, if these raw numbers don’t tell you much, you can view them as fractions of the total instead:

train['Embarked'].value_counts(normalize=True)

#Output:
S 0.724409
C 0.188976
Q 0.086614

Moreover, value_counts() ignores NaN (missing) values by default, yet it is important to know how many of them there are. To include them in the counts, set the dropna parameter to False.

train['Embarked'].value_counts(dropna=False)

#Output:
S 644
C 168
Q 77
NaN 2
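
If you don’t have the Titanic data at hand, the three calls above can be tried on a tiny made-up column. This is only a sketch: the embarked Series below is hypothetical stand-in data, not the real Embarked feature.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Embarked column (not the real Titanic data).
embarked = pd.Series(["S", "C", "S", "Q", np.nan, "S", "C"], name="Embarked")

counts = embarked.value_counts()                # NaN is ignored by default
shares = embarked.value_counts(normalize=True)  # fractions of the non-missing values
with_nan = embarked.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares)
print(with_nan)
```

Note that with normalize=True the fractions are computed over the non-missing values only, unless you also pass dropna=False.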

2. groupby()

With Pandas groupby, we can split and group our dataframe by certain columns to view patterns and details in the data. groupby involves 3 main steps: splitting the data into groups, applying a function to each group, and combining the results.

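train.groupby('Sex').mean(numeric_only=True)

Here, we grouped the dataframe by the ‘Sex’ feature and aggregated every numeric column using its mean. Passing numeric_only=True restricts the aggregation to numeric columns, which recent Pandas versions need when the dataframe also holds text columns such as Name.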

You can also plot it using Pandas’ built-in visualization:

train.groupby('Sex').sum(numeric_only=True).plot(kind='bar')

We can also group by using multiple features for a hierarchical splitting.

train.groupby(['Sex', 'Survived'])['Survived'].count()
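
To make the split, apply and combine steps concrete, here is a minimal sketch on a small made-up dataframe. The passengers frame and its values are hypothetical and only loosely modelled on the Titanic columns.

```python
import pandas as pd

# Hypothetical passenger records (illustrative values, not the real dataset).
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 26.00],
})

# Split by Sex, apply mean to Fare, combine into one Series indexed by Sex.
mean_fare = passengers.groupby("Sex")["Fare"].mean()

# Several aggregations at once, using named aggregation.
summary = passengers.groupby("Sex").agg(
    survival_rate=("Survived", "mean"),
    total_fare=("Fare", "sum"),
)

# Hierarchical grouping: one row count per (Sex, Survived) pair.
pair_counts = passengers.groupby(["Sex", "Survived"]).size()

print(mean_fare)
print(summary)
print(pair_counts)
```

Selecting the column before aggregating, as in the first call, avoids aggregating columns you don’t need.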

3. loc and iloc

Indexing is one of the most basic operations in Pandas, and the usual way to do it is with either loc or iloc. “loc” stands for location and is label-based, while the “i” in iloc stands for integer, so iloc is position-based. In other words, when you want to index a dataframe using the names or labels of columns/rows, you’d use loc. And when you want to index columns or rows using their integer positions, you’d use iloc. Let’s check out loc first.

train.loc[2, 'Sex']

The above operation gives us the element at row label 2 and column ‘Sex’. Similarly, if you needed all the values of the Sex column, you’d do:

train.loc[:, 'Sex']

Also, you can select multiple columns by passing a list of labels:

train.loc[:, ['Sex', 'Embarked']]

You can also filter rows using boolean conditions within loc, like:

train.loc[train['Age'] >= 25]
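
The loc patterns above can be sketched on a small frame as follows. The people dataframe and its string index labels are assumptions made for illustration; string labels make it obvious that loc works with labels rather than positions.

```python
import pandas as pd

# Hypothetical data with string row labels.
people = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22, 38, 26, 35],
        "Embarked": ["S", "C", "S", "Q"],
    },
    index=["a", "b", "c", "d"],
)

one_value = people.loc["c", "Sex"]                # a single cell
two_columns = people.loc[:, ["Sex", "Embarked"]]  # a list of column labels
adults_25_plus = people.loc[people["Age"] >= 25]  # a boolean mask on the rows

# A row mask and a column label can be combined in one call.
ages_from_s = people.loc[people["Embarked"] == "S", "Age"]

print(one_value)
print(two_columns)
print(adults_25_plus)
print(ages_from_s)
```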

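To view only certain rows, you can slice the dataframe using loc. Note that loc slices by label and includes both endpoints, so the following returns the rows labeled 100 through 200: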

train.loc[100:200]

Moreover, you can slice the dataframe along the column axis:

train.loc[:, 'Sex':'Fare']

The above operation slices the dataframe from the column ‘Sex’ through ‘Fare’ (both included) for all the rows.

Now, let’s move on to iloc. iloc indexes purely by integer position, regardless of the row or column labels. You can slice dataframes like:

train.iloc[100:200, 2:9]

The above operation selects rows 100 through 199 and columns 2 through 8, because iloc excludes the end of each range. Similarly, if you want to keep only the first 300 rows of your data, you can do:

train.iloc[:300, :]
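
The difference between the two slicing rules is easiest to see when the index labels don’t match the positions. Here is a minimal sketch; the scores dataframe and its index values are made up for the purpose.

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, ...) differ from positions (0, 1, ...).
scores = pd.DataFrame(
    {"x": [1, 2, 3, 4, 5], "y": [10, 20, 30, 40, 50]},
    index=[10, 20, 30, 40, 50],
)

by_label = scores.loc[20:40]    # labels 20, 30 and 40: the end label is included
by_position = scores.iloc[1:3]  # positions 1 and 2: the end position is excluded
first_cell = scores.iloc[0, 0]  # first row, first column

print(by_label)
print(by_position)
print(first_cell)
```

On a dataframe with the default 0, 1, 2, ... index the labels and positions coincide, which is why loc[100:200] returns one row more than iloc[100:200].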

4. unique() and nunique()

Pandas unique() returns all the unique values of a feature, in order of first appearance. This is mostly used to get the categories of a categorical feature. unique() includes NaN in its result and treats it as a distinct value. Let’s take a look:

train['Sex'].unique()

#Output:
array(['male', 'female'], dtype=object)

As we see, it gives us the unique values in the ‘Sex’ feature.

Similarly, you can also check the number of unique values as there might be a lot of unique values in some features.

train['Sex'].nunique()

#Output:
2

However, keep in mind that nunique() doesn’t count NaN as a unique value by default. If there are NaNs in your data, pass dropna=False to include them in the count. The ‘Sex’ feature has no missing values, so the ‘Embarked’ feature, which we saw contains 2 NaNs, makes a better demonstration:

train['Embarked'].nunique(dropna=False)

#Output:
4
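
To see how unique() and nunique() treat missing values side by side, here is a small sketch. The ports Series is hypothetical data, not a column from the Titanic dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical column with three real categories and two missing values.
ports = pd.Series(["S", "C", np.nan, "Q", "S", np.nan])

distinct = ports.unique()                 # NaN appears once among the results
n_without_nan = ports.nunique()           # NaN is not counted
n_with_nan = ports.nunique(dropna=False)  # NaN counts as one extra value

print(distinct)
print(n_without_nan, n_with_nan)
```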

5. cut() and qcut()

Pandas cut is used to bin values into ranges in order to discretize a feature. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges of the continuous values. This comes in handy when you want to see trends based on the range a data point falls in.

Let’s understand this with a small example.

Suppose we have marks for 8 kids, ranging from 0 to 100. Now, we can assign every kid’s marks to a particular “bin”.

import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Ck', 'Ron', 'Mat', 'Josh', 'Tim', 'SypherPK', 'Dew', 'Vin'],
    'Marks': [37, 91, 66, 42, 99, 81, 45, 71]
})

# Assign each mark to bin 1, 2 or 3 using the edges 0, 50, 70 and 100
df['marks_bin'] = pd.cut(df['Marks'], bins=[0, 50, 70, 100], labels=[1, 2, 3])

Here we attach the output to the dataframe as a new feature; the original Marks feature can be dropped afterwards if it is no longer needed. The new dataframe looks something like:

#Output:
Name Marks marks_bin
0 Ck 37 1
1 Ron 91 3
2 Mat 66 2
3 Josh 42 1
4 Tim 99 3
5 SypherPK 81 3
6 Dew 45 1
7 Vin 71 3

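So, when I say bins = [0, 50, 70, 100], it means that there are 3 ranges, each of which excludes its lower edge and includes its upper edge:

Above 0 and up to 50 for bin 1,

above 50 and up to 70 for bin 2, and

above 70 and up to 100 for bin 3.

Note that a mark of exactly 0 would fall outside bin 1 unless you pass include_lowest=True.

So, the new feature doesn’t contain the marks themselves but the bin into which each student’s marks fall.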
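
How cut treats values that sit exactly on a bin edge can be checked with a short sketch. The marks below are made-up edge cases, and include_lowest and right are existing pd.cut parameters used here to show the alternatives.

```python
import pandas as pd

# Hypothetical marks chosen to sit on or next to the bin edges.
marks = pd.Series([0, 50, 51, 70, 100])
edges = [0, 50, 70, 100]

# Default: bins are (0, 50], (50, 70], (70, 100], so a mark of 0 gets no bin (NaN).
default_bins = pd.cut(marks, bins=edges, labels=[1, 2, 3])

# include_lowest=True also puts the lowest edge, 0, into the first bin.
with_lowest = pd.cut(marks, bins=edges, labels=[1, 2, 3], include_lowest=True)

# right=False flips the rule: [0, 50), [50, 70), [70, 100), so now 100 gets no bin.
left_closed = pd.cut(marks, bins=edges, labels=[1, 2, 3], right=False)

print(pd.DataFrame({
    "marks": marks,
    "default": default_bins,
    "include_lowest": with_lowest,
    "right=False": left_closed,
}))
```

Which convention fits best depends on the data; for marks from 0 to 100, include_lowest=True is probably the safer choice since a score of 0 is possible.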

Similar to cut(), Pandas also offers a companion function called qcut(). Pandas qcut takes in the number of quantiles and chooses the bin edges from the data distribution, so that each bin receives roughly the same number of data points. So, we can just change the cut function in the above to qcut:

df['marks_bin'] = pd.qcut(df['Marks'], q=3, labels=[1, 2, 3])

In the above operation, we tell Pandas to cut the feature into 3 quantile-based bins of roughly equal size and assign them the labels. The output comes as:

Name Marks marks_bin
0 Ck 37 1
1 Ron 91 3
2 Mat 66 2
3 Josh 42 1
4 Tim 99 3
5 SypherPK 81 3
6 Dew 45 1
7 Vin 71 2

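Notice how the last value changed from 3 to 2: qcut derives its bin edges from the quantiles of the data rather than from fixed values, so 71 now lands in the middle bin.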
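
To see where qcut actually places the edges, you can ask it to return them. The sketch below reuses the marks from the example and relies on qcut’s retbins parameter; the variable names are my own.

```python
import pandas as pd

# The same eight marks as in the example above.
marks = pd.Series([37, 91, 66, 42, 99, 81, 45, 71])

# retbins=True returns the computed bin edges alongside the bin labels.
bins, edges = pd.qcut(marks, q=3, labels=[1, 2, 3], retbins=True)

# Number of marks that ended up in each bin.
per_bin = bins.value_counts().sort_index()

print(edges)
print(per_bin)
```

With 8 values and 3 bins the split cannot be perfectly even, which is why the bin sizes are only roughly equal.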

Before you go

We saw some of the most used Pandas functions. These are not the only important ones, though, and we’d encourage you to learn more of the commonly used functions. Focusing on those is an efficient approach, since in practice you will likely use only a small part of everything Pandas offers.

Frequently Asked Questions (FAQs)

1. Why is the Pandas library so popular?

This library is indeed quite popular among data scientists and data analysts. The reason for this is its great support of a large number of file formats and a rich collection of features to manipulate the extracted data. It can easily integrate with other libraries and packages such as NumPy.

This powerful library provides various useful functions for manipulating large data sets in a flexible manner. Once you have mastered it, you can accomplish complex tasks with a few lines of code.

2. What is the merge function and why is it used?

The merge function combines two data frames by matching their rows on one or more key columns (or on the index). It performs in-memory join operations that closely resemble joins in relational databases. You can pass on='column_name' to merge data frames on a column that they share.

You can instead pass left_on='column_name' and right_on='column_name' when the key column is named differently in the left and right data frames.
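
As a rough illustration of these parameters, the sketch below merges two small made-up tables. The table names, the student_id and sid columns, and the values are all hypothetical.

```python
import pandas as pd

# Hypothetical tables that share a student_id key.
students = pd.DataFrame({"student_id": [1, 2, 3], "Name": ["Ck", "Ron", "Mat"]})
marks = pd.DataFrame({"student_id": [1, 2, 4], "Marks": [37, 91, 66]})

# Same key name in both frames: use on=. The default is an inner join,
# so only the ids present in both tables survive.
inner = students.merge(marks, on="student_id")

# Different key names: use left_on= and right_on=. how="left" keeps every
# student and fills Marks with NaN where no match exists.
marks_renamed = marks.rename(columns={"student_id": "sid"})
left = students.merge(marks_renamed, left_on="student_id", right_on="sid", how="left")

print(inner)
print(left)
```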

3. Apart from Pandas library, what are the other Python libraries for data science?

Apart from Pandas, several other Python libraries are widely used for data science. These include PySpark, TensorFlow, Matplotlib, scikit-learn, SciPy and many more. Each one of them is popular for its particular features and functions.

Every library has its own niche: for example, scikit-learn is most often used for machine learning, and Matplotlib for plotting. Apart from analysing the data, you can also create visual reports using the functions these libraries provide.
