
Top 5 Pandas Functions Essential for Data Scientists [2024]

Last updated:
1st Oct, 2022

Pandas is one of the most widely used libraries for data science and data analysis in Python. What makes it special? In this tutorial, we will go over five groups of functions that make Pandas an extremely useful tool in a data scientist’s toolkit.

By the end of this tutorial, you’ll have the knowledge of the below functions in Pandas and how to use them for your applications:

  • value_counts
  • groupby
  • loc and iloc
  • unique and nunique
  • cut and qcut

Top Pandas Functions For Data Scientists

1. value_counts()

Pandas’ value_counts() function returns the count of each unique value in a column of a dataframe, sorted from the most to the least frequent.

Pro Tip: While Pandas gives the output as plain text, you can easily plot the values using the inbuilt bar plot in Pandas for a graphical representation of the same information.

To demonstrate, I’ll be using the Titanic dataset, loaded into a dataframe named train.

Now, to find the counts of classes in the Embarked feature, we can call the value_counts function:

train['Embarked'].value_counts()

#Output:
S      644
C      168
Q       77

Also, if these numbers don’t make much sense on their own, you can view them as proportions of the total instead (multiply by 100 for percentages):

train['Embarked'].value_counts(normalize=True)

#Output:
S    0.724409
C    0.188976
Q    0.086614

Moreover, value_counts doesn’t count NaN or missing values by default, which is essential to check. To include them, set the parameter dropna to False.

train['Embarked'].value_counts(dropna=False)

#Output:
S      644
C      168
Q       77
NaN      2
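The three calls above can be tried without downloading the dataset. The following is a minimal sketch that assumes a small made-up Series named ports; its values are illustrative and are not the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an 'Embarked'-style column; not the real Titanic data.
ports = pd.Series(["S", "C", "S", "Q", "S", np.nan, "C", "S"], name="Embarked")

counts = ports.value_counts()                    # NaN is ignored by default
shares = ports.value_counts(normalize=True)      # fractions of the non-missing rows
with_missing = ports.value_counts(dropna=False)  # NaN gets its own row

print(counts)
print(shares.round(3))
print(with_missing)
```

Note that with normalize=True the fractions are computed over the non-missing rows only, unless dropna=False is passed as well.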

2. groupby()

With Pandas groupby, we can split and group our dataframe by certain columns to view patterns and details in the data. groupby involves 3 main steps: splitting the data into groups, applying a function to each group, and combining the results.

train.groupby('Sex').mean(numeric_only=True)

Here we grouped the data frame by the feature ‘Sex’ and aggregated each numeric column using the mean. Passing numeric_only=True skips text columns such as Name, which cannot be averaged.

You can also plot it using Pandas’ built-in visualization:

train.groupby('Sex').sum(numeric_only=True).plot(kind='bar')

We can also group by multiple features for a hierarchical split.

train.groupby(['Sex', 'Survived'])['Survived'].count()
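The split, apply and combine steps can be sketched on a tiny frame. The data below is made up for illustration (only the column names follow the Titanic dataset), and the result names survival_rate and total_fare are hypothetical.

```python
import pandas as pd

# Hypothetical mini-dataset with Titanic-style column names.
passengers = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
    "Fare": [7.25, 71.28, 7.92, 8.05, 53.10, 8.46],
})

# Split by 'Sex', apply the mean to the numeric columns, combine into one table.
mean_by_sex = passengers.groupby("Sex").mean(numeric_only=True)

# Named aggregation: a different function per column.
summary = passengers.groupby("Sex").agg(
    survival_rate=("Survived", "mean"),
    total_fare=("Fare", "sum"),
)

# Grouping by two keys yields a MultiIndex; size() counts rows per combination.
counts = passengers.groupby(["Sex", "Survived"]).size()

print(mean_by_sex)
print(summary)
print(counts)
```

Named aggregation is one convenient option when each column needs its own summary; calling a single method such as mean() stays simpler when one function fits every column.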


3. loc and iloc

Indexing is one of the most basic operations in Pandas, and the usual way to do it is with either loc or iloc. “loc” stands for location and is label-based; “iloc” stands for integer location and is position-based. In other words, when you want to index a dataframe using the names or labels of columns/rows, you’d use loc. And when you want to index columns or rows by their integer positions, you’d use iloc. Both are indexers used with square brackets rather than functions called with parentheses. Let’s check out loc first.

train.loc[2, 'Sex']

The above operation gives us the element in the row labelled 2 and the column ‘Sex’. Similarly, if you need all the values of the Sex column, you’d do:

train.loc[:, 'Sex']

Also, you can select multiple columns by passing a list:

train.loc[:, ['Sex', 'Embarked']]

You can also filter rows using boolean conditions within loc:

train.loc[train['Age'] >= 25]

 


To only view certain rows, you can slice the dataframe using loc. Because loc works on labels, both endpoints are included, so this returns the rows labelled 100 through 200:

train.loc[100:200]

Moreover, you can slice the dataframe on the column axis as:

train.loc[:, 'Sex':'Fare']

The above operation will slice the dataframe from the column ‘Sex’ to the column ‘Fare’, both included, for all the rows.


 

Now, let’s move on to iloc. iloc indexes purely by integer position, counting from 0. You can slice dataframes like:

train.iloc[100:200, 2:9]

The above operation will slice the rows at positions 100 to 199 and the columns at positions 2 through 8; unlike loc, the end of an iloc slice is excluded. Similarly, if you want to take only the first 300 rows of your data, you can do:

train.iloc[:300, :]
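The difference between labels and positions is easiest to see when the index is not simply 0, 1, 2 and so on. The sketch below assumes a made-up four-row frame with index labels 10, 20, 30 and 40; the data is illustrative only.

```python
import pandas as pd

# Hypothetical frame whose index labels (10, 20, 30, 40) differ from positions (0..3).
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0],
        "Fare": [7.25, 71.28, 7.92, 8.05],
    },
    index=[10, 20, 30, 40],
)

by_label = df.loc[20, "Age"]     # row labelled 20, column 'Age'
by_position = df.iloc[1, 1]      # second row, second column: the same cell

label_slice = df.loc[10:30]      # label slice includes the end label: 3 rows
position_slice = df.iloc[0:2]    # position slice excludes the end: 2 rows

# Boolean mask for the rows plus a list of column labels.
older = df.loc[df["Age"] >= 25, ["Sex", "Fare"]]

print(by_label, by_position)
print(len(label_slice), len(position_slice))
print(older)
```

On a dataframe with the default 0-based index the two indexers often return the same rows, which is why the distinction is easy to miss until the index changes, for example after filtering or sorting.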

4. unique() and nunique()

Pandas unique is used to get all the unique values of a feature, in order of first appearance. It is mostly used to get the categories of a categorical feature. Unlike value_counts, unique includes NaN and treats it as a value of its own. Let’s take a look:

train['Sex'].unique()

#Output:
['male', 'female']

As we see, it gives us the unique values in the ‘Sex’ feature.

Some features have far too many unique values to list, so you can also just check how many there are:

train['Sex'].nunique()

#Output:
2

However, keep in mind that nunique() doesn’t count NaN as a unique value by default. If there are NaNs in your data, pass dropna=False to include them in the count. The ‘Sex’ column has no missing values, so the difference shows up on ‘Embarked’, which has 3 ports plus 2 missing entries:

train['Embarked'].nunique(dropna=False)

#Output:
4
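How unique() and nunique() treat missing values can be checked on a small example. This sketch assumes a made-up Series named ports with one missing entry.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
ports = pd.Series(["S", "C", "S", np.nan, "Q", "C"])

values = ports.unique()                    # order of first appearance, NaN included
n_without_nan = ports.nunique()            # NaN is not counted by default
n_with_nan = ports.nunique(dropna=False)   # NaN counted as one more value

print(values)
print(n_without_nan, n_with_nan)
```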

5. cut() and qcut()

Pandas cut is used to bin values into ranges in order to discretize a feature. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges the continuous values fall into. This comes in handy when you want to see trends based on the range a data point falls in.

Let’s understand this with a small example.

Suppose we have marks for 8 kids, ranging from 0 to 100. Now, we can assign every kid’s marks to a particular “bin”.

import pandas as pd

df = pd.DataFrame(data={
    'Name': ['Ck', 'Ron', 'Mat', 'Josh', 'Tim', 'SypherPK', 'Dew', 'Vin'],
    'Marks': [37, 91, 66, 42, 99, 81, 45, 71]
})

# Assign each mark to one of three bins and label the bins 1, 2 and 3
df['marks_bin'] = pd.cut(df['Marks'], bins=[0, 50, 70, 100], labels=[1, 2, 3])

This appends the bin as a new feature; the original Marks feature can then be dropped if it is no longer needed. The new dataframe looks something like:

#Output:
      Name     Marks    marks_bin
0        Ck       37         1
1       Ron       91         3
2       Mat       66         2
3      Josh       42         1
4       Tim       99         3
5  SypherPK       81         3
6       Dew       45         1
7       Vin       71         3

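So, when I say bins=[0, 50, 70, 100], it means that there are 3 ranges, each excluding its left edge and including its right edge:

(0, 50] for bin 1,

(50, 70] for bin 2, and

(70, 100] for bin 3.

A mark of exactly 0 would fall outside the first range and become NaN unless you pass include_lowest=True.

So, now our new feature doesn’t contain the marks themselves but the bin into which each student’s marks fall.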
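To see which interval each bin covers, cut can be called without labels, in which case it returns the intervals themselves. The sketch below reuses the marks from the example; the variables edge_default and edge_inclusive are hypothetical names used to show how the lowest edge is handled.

```python
import pandas as pd

marks = pd.Series([37, 91, 66, 42, 99, 81, 45, 71])

# Without labels, cut returns the interval each value falls into.
intervals = pd.cut(marks, bins=[0, 50, 70, 100])

# A value equal to the lowest edge is left out unless include_lowest=True.
edge_default = pd.cut(pd.Series([0, 50]), bins=[0, 50, 70, 100])
edge_inclusive = pd.cut(pd.Series([0, 50]), bins=[0, 50, 70, 100], include_lowest=True)

print(intervals)
print(edge_default.isna().tolist())    # [True, False]
print(edge_inclusive.isna().tolist())  # [False, False]
```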


Similar to cut(), Pandas also offers a sibling function called qcut(). Pandas qcut takes the number of quantiles and chooses the bin edges from the data distribution, so that each bin holds roughly the same number of data points. So, we can just change the cut function in the above to qcut:

df['marks_bin'] = pd.qcut(df['Marks'], q=3, labels=[1, 2, 3])

In the above operation, we tell Pandas to split the feature into 3 quantile-based bins of roughly equal size and assign them the labels. The output comes as:

        Name   Marks    marks_bin
0        Ck     37         1
1       Ron     91         3
2       Mat     66         2
3      Josh     42         1
4       Tim     99         3
5  SypherPK     81         3
6       Dew     45         1
7       Vin     71         2

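Notice how the last value changed from 3 to 2: 71 lies above the fixed edge of 70 that we gave to cut, but below the upper boundary (about 77.7) that qcut computes from the data.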
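The edges that qcut derives from the data can be inspected with the retbins parameter. This is a small sketch on the same marks; the variable names bins and edges are chosen here for illustration.

```python
import pandas as pd

marks = pd.Series([37, 91, 66, 42, 99, 81, 45, 71])

# retbins=True also returns the edges that qcut computed from the data.
bins, edges = pd.qcut(marks, q=3, labels=[1, 2, 3], retbins=True)

print(edges)          # four edges: minimum, two tercile boundaries, maximum
print(bins.tolist())  # the bin label assigned to each mark
```

Since the edges depend on the data, the same mark can land in a different bin when the dataset changes, which is worth keeping in mind when comparing qcut results across datasets.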


 

Before you go

We covered some of the most commonly used Pandas functions, but they are not the only important ones, and we’d encourage you to explore more of them. Focusing on the most commonly used functions first is an efficient approach, since in practice you will rely on only a small part of everything Pandas offers.


Profile

Rohit Sharma

Blog Author
Rohit Sharma is the Program Director for the UpGrad-IIIT Bangalore, PG Diploma Data Analytics Program.

Frequently Asked Questions (FAQs)

1. Why is the Pandas library so popular?

Pandas is popular among data scientists and data analysts because it supports a large number of file formats and offers a rich collection of features for manipulating the data it loads. It also integrates easily with other libraries and packages such as NumPy.

The library provides many useful functions for manipulating large data sets flexibly. Once you have mastered it, you can accomplish a lot with a few lines of code.

2. What is the merge function and why is it used?

The merge function combines two data frames by matching rows on one or more key columns (or on the index). It is a high-performance, in-memory join operation that resembles joins in relational databases. You can use on='column_name' to merge data frames on a column they have in common.

When the key columns are named differently in the two data frames, you can set left_on='column_name' and right_on='column_name' to specify the key from the left and the right data frame respectively.
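As a rough illustration of the answer above, the sketch below merges two made-up tables whose key columns have different names. The frames students and scores and their columns are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing a key under different column names.
students = pd.DataFrame({"student_id": [1, 2, 3], "Name": ["Ck", "Ron", "Mat"]})
scores = pd.DataFrame({"id": [1, 2, 4], "Marks": [37, 91, 66]})

# Inner join (the default): keeps only keys present in both frames.
inner = students.merge(scores, left_on="student_id", right_on="id")

# Left join: keeps every student; students without a score get NaN in 'Marks'.
left = students.merge(scores, left_on="student_id", right_on="id", how="left")

print(inner)
print(left)
```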

3. Apart from the Pandas library, what are the other Python libraries for data science?

Apart from Pandas, there are many Python libraries that are widely used for data science. These include PySpark, TensorFlow, Matplotlib, scikit-learn, SciPy and many more. Each one is widely used for its own features and functions.

Every library has its own role: for example, scikit-learn is mostly used for machine learning, while Matplotlib is used for plotting. Apart from analysing the data, you can also create dashboards and visual reports using the functions provided by these libraries.
