Python Pandas Tutorial: Everything Beginners Need to Know about Python Pandas

In this article, we’ll look at Pandas, one of the most popular Python libraries for data professionals. You’ll learn its basics as well as some of its common operations.

Let’s get started. 

What is Pandas?

Python Pandas is popular for many reasons. Its primary applications are data manipulation, analysis, and cleaning. You can use it with many kinds of data, including tabular data, unlabelled data, and ordered time-series data. Put simply, Pandas is your data’s home: once your data is loaded, you can perform numerous operations on it with this tool.

You can convert a file from one data format to another, merge two datasets, make calculations, visualize the data with help from Matplotlib, and more. With so many capabilities, it’s a popular choice among data professionals, which is why learning it is essential. You can’t use it well without understanding how it works, so that’s what this Python Pandas tutorial focuses on.


Role of Pandas in Data Science

The Pandas library is an integral part of any data professional’s arsenal. It’s built on NumPy, another popular Python library. Pandas borrows much of NumPy’s structure, so if you’re familiar with the former, you won’t have any difficulty getting familiar with the latter.

Data professionals often use Pandas to prepare data before feeding it to SciPy for statistical analysis, to Matplotlib for plotting, or to Scikit-learn for machine learning.


Prerequisites

Before we discuss how Python Pandas works and what operations it offers, we should make clear what you need to know first. You should be familiar with Python’s fundamentals and with NumPy.

The first one, Python’s fundamentals, is vital for obvious reasons: without knowing how Python code works, you wouldn’t understand much of this tutorial or be able to try out the code.

The second one, NumPy, is worth learning because Pandas is built on it. An understanding of NumPy will help you considerably in getting familiar with Pandas.

You can learn about Python through our blogs on data science and Python. We have many helpful guides and articles that can make you familiar with the basics. They’re free, and if you have any questions, you can write them down in the comment section.

If you’re familiar with both of the topics we mentioned, let’s take a closer look at Pandas.

Installing Pandas

To use Pandas, you’ll have to install it. Fortunately, installing and importing Pandas is very easy. Open the command line (the Terminal on a Mac) and install Pandas with one of the commands below. Which one you need depends on the package manager you use, not on your operating system:

With pip: pip install pandas

With conda (Anaconda or Miniconda): conda install pandas
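Once the installation has finished, a quick way to check that it worked is to import the library and print its version. This is only a minimal sketch; the alias pd is a widely used convention, not a requirement.

```python
import pandas as pd  # "pd" is the conventional alias for pandas

# Printing the version confirms that the import succeeded.
version = pd.__version__
print(version)
```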

 

In Pandas, you’ll be dealing with Series and DataFrames. A Series is a one-dimensional labelled array, such as a single column, while a DataFrame is a two-dimensional table made up of multiple Series. Let’s now take a look at the operations you can perform in Pandas.
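To make the difference between the two structures concrete, here is a small sketch. The column names and values are made up for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
visitors = pd.Series([200, 100, 230], name="Visitors")

# A DataFrame built from a dict: each key becomes one column.
df = pd.DataFrame({"Day": [1, 2, 3], "Visitors": [200, 100, 230]})

# Selecting a single column from a DataFrame gives back a Series.
column = df["Visitors"]

print(type(visitors).__name__)  # Series
print(type(column).__name__)    # Series
print(df.ndim)                  # 2 (rows and columns)
```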

Operations in Pandas

Now that we’ve discussed what Pandas is and why it matters, let’s look at the operations you can perform with it. Pandas provides a lot of functions, and we discuss some of the most common ones below:

Data viewing

You’ll want to print out some of the rows of your dataset at the beginning to keep them as a visual reference. You can do so with the .head() method. In the examples below, file1 stands for a DataFrame you have already loaded.

file1.head()

This returns the first five rows of the DataFrame. If you want more rows than the first five, just pass the required number to the method. Suppose you want the first 15 rows of the DataFrame; you’d write the following code:

file1.head(15)

You can also view the last five rows of the DataFrame with the .tail() method. Just like .head(), .tail() accepts a number and returns that many rows.

file1.tail(20)

This code would give you the last 20 rows of your data frame. 
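If you don’t have a dataset at hand, you can try both methods on a small made-up DataFrame. In this sketch, file1 is a hypothetical stand-in with 30 rows; the column names are assumptions.

```python
import pandas as pd

# Hypothetical stand-in for file1: 30 rows of made-up values.
file1 = pd.DataFrame({"id": range(1, 31), "value": range(100, 130)})

first_five = file1.head()    # no argument: the default is 5 rows
first_15 = file1.head(15)    # first 15 rows
last_20 = file1.tail(20)     # last 20 rows

print(len(first_five), len(first_15), len(last_20))  # 5 15 20
print(last_20["id"].iloc[0])  # 11, the first id among the last 20 rows
```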

Getting Information

One of the first functions data scientists use with Pandas is .info(). That’s because it displays information about the data frame and gives you a deeper understanding of what you’re working with. Here’s how you use it in Pandas:

file1.info()

It provides you with a lot of useful information about the dataset, such as the number of rows, the number of non-null values in each column, and the data type of each column.

Knowing the data types of your DataFrame’s columns is essential in many cases. Suppose you need to perform arithmetic operations on a column, but it holds strings. When you run your mathematical operations, you’ll get an error or an unexpected result, because arithmetic doesn’t work on strings the way it does on numbers. If, on the other hand, you call .info() before doing any operations, you’ll already know that you have strings.

While .info() shows you general information about your dataset, the .shape attribute gives you a tuple with the dimensions of your DataFrame. It tells you how many rows and columns your dataset has. You can use it in the following way:
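The situation described above can be sketched as follows. The data is made up, and converting with pd.to_numeric() is one possible fix, shown here as an assumption rather than the only approach.

```python
import pandas as pd

# Hypothetical data: "price" was read in as text, not as numbers.
file1 = pd.DataFrame({"item": ["a", "b", "c"], "price": ["10", "20", "30"]})

# The Dtype column of the report shows that "price" is not numeric
# (it appears as object or str, depending on the pandas version).
file1.info()

# Convert the text column to numbers before doing arithmetic on it.
file1["price"] = pd.to_numeric(file1["price"])
total = file1["price"].sum()
print(total)  # 60
```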

file1.shape

There are no parentheses because .shape is an attribute, not a method; it simply holds a tuple of (rows, columns). You’ll be using the .shape attribute quite often while cleaning your data.
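As a rough illustration of why .shape is handy during cleaning, the sketch below compares the shape before and after dropping rows with missing values. The data is made up, and .dropna() is just one example of a cleaning step.

```python
import pandas as pd

# Hypothetical data with one missing value in column "a".
file1 = pd.DataFrame({"a": [1, 2, None, 4], "b": [5, 6, 7, 8]})

rows, cols = file1.shape   # the tuple unpacks into rows and columns
print(file1.shape)         # (4, 2)

# Dropping incomplete rows changes the row count, which .shape confirms.
cleaned = file1.dropna()
print(cleaned.shape)       # (3, 2)
```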


Concatenation

Let’s now discuss concatenation in this Python Pandas tutorial. Concatenation refers to joining two or more things together. With it, you can combine two datasets without modifying their values or data points in any way; they are combined as is. You’ll use the pd.concat() function for this purpose. Here’s how:

result = pd.concat([file1, file2])

It stacks the rows of the file1 and file2 DataFrames and returns them as a single DataFrame. Here’s a complete example:

import pandas as pd

df1 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]}, index=[2001, 2002, 2003, 2004])

df2 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]}, index=[2005, 2006, 2007, 2008])

# Stack the rows of df2 below the rows of df1.
concat = pd.concat([df1, df2])

print(concat)

The output of the above code: 

      HPI  Int_Rate  IND_GDP
2001   80         2       50
2002   90         1       45
2003   70         2       45
2004   60         3       67
2005   80         2       50
2006   90         1       45
2007   70         2       45
2008   60         3       67

As you can see, pd.concat() has combined the two DataFrames into one.
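In the example above, the two DataFrames have different index labels, so the combined index stays unique. When both inputs use the default 0, 1, 2, ... index, the labels repeat after concatenation. The sketch below, with made-up data, shows that behaviour and how the ignore_index parameter renumbers the rows.

```python
import pandas as pd

# Two hypothetical DataFrames that both use the default index 0, 1.
a = pd.DataFrame({"HPI": [80, 90]})
b = pd.DataFrame({"HPI": [70, 60]})

# By default, the original index labels are kept, so they repeat.
stacked = pd.concat([a, b])
print(stacked.index.tolist())     # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```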

Changing the Index

You can change the index of your DataFrame as well. For that purpose, you’ll use the .set_index() method. In the parentheses, you pass the name of the column that should become the new index. Take a look at the following example to understand it better.

import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300], "Bounce_Rate": [20, 45, 60, 10]})

# Use the "Day" column as the index; inplace=True modifies df directly.
df.set_index("Day", inplace=True)

print(df)

The output of the above code:

     Visitors  Bounce_Rate
Day                       
1         200           20
2         100           45
3         230           60
4         300           10

You can see that the Day column is now the index of the DataFrame.
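If you would rather keep the original DataFrame untouched, you can leave out inplace=True and work with the DataFrame that .set_index() returns. The sketch below reuses the data from the example above; the names by_day and restored are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300], "Bounce_Rate": [20, 45, 60, 10]})

# Without inplace=True, set_index() returns a new DataFrame; df is unchanged.
by_day = df.set_index("Day")

# With "Day" as the index, rows can be looked up by day using .loc.
print(by_day.loc[3, "Visitors"])  # 230

# reset_index() turns the index back into an ordinary column.
restored = by_day.reset_index()
print(restored.columns.tolist())  # ['Day', 'Visitors', 'Bounce_Rate']
```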

Changing the Column Headers

You can change the column headers in Python Pandas as well. All you have to do is use the .rename() method. You pass it a dictionary that maps each existing column name to the new name you want.

Suppose you have a table with its column header as ‘Time,’ and you want to change it into ‘Hours.’ You can change the name of this column with the following code:

df = df.rename(columns={"Time": "Hours"})

This code changes the column header from ‘Time’ to ‘Hours.’ Clear, descriptive column names make the rest of your code easier to read. Let’s take a look at how you can convert the formats of your data.
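Here is a self-contained version of the renaming step that you can run as is. The DataFrame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical table with a "Time" column.
df = pd.DataFrame({"Time": [1, 2, 3], "Visitors": [200, 100, 230]})

# Map the old column name to the new one; other columns are left alone.
df = df.rename(columns={"Time": "Hours"})

print(df.columns.tolist())  # ['Hours', 'Visitors']
```

Note that .rename() returns a new DataFrame, which is why the result is assigned back to df.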

Data Munging

Data munging (also called data wrangling) means transforming data from one form into another. One common example is converting the format of specific data: you can convert a .csv file into an .html file or vice versa. Here’s an example of how you can do so:

import pandas as pd

# Path to the CSV file on the author's machine; replace it with your own.
country = pd.read_csv(r"D:\Users\User1\Downloads\world-bank-youth-unemployment\API_ILO_country_YU.csv", index_col=0)

# Write the DataFrame out as an HTML table.
country.to_html("file1.html")

After you’ve run this code, you’ll have an HTML file that you can open in your browser. Converting formats like this is useful in many situations.
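If you don’t have that CSV file, you can try the same conversion entirely in memory. This is only a sketch: the CSV text and its column names are made up, and calling .to_html() without a file name returns the HTML as a string instead of writing a file.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for a file on disk.
csv_text = "Country,Rate\nA,10.5\nB,12.0\n"

# read_csv() accepts any file-like object, so StringIO works here.
country = pd.read_csv(io.StringIO(csv_text), index_col=0)

# With no file name, to_html() returns the HTML table as a string.
html = country.to_html()
print(html.startswith("<table"))  # True
```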

Conclusion

And now, we have reached the end of this Python Pandas tutorial. We hope you found it useful and informative. Python Pandas is a vast topic, and with the numerous functions it has, it would take some time for one to get familiar with it completely. 

If you’re interested in learning more about Python, its various libraries, including Pandas, and its application in data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Do I need to know Python for using Pandas?

Before you get started with Pandas, you need to understand that it is a package built for Python. So, you definitely need to have a firm grip on the basics as well as the syntax of Python programming to start using Pandas with ease. Whenever it comes down to working with tabular data in Python, Pandas is considered the best choice.

You don’t need to spend a huge amount of time on it, though. Just put in enough time to get comfortable with the basic syntax so that you can start on tasks involving Pandas.

How long does it take to learn Pandas in Python?

Pandas is the most widely used Python library for dealing with tabular data. You can use Pandas for many of the tasks that you might use Excel for. If you already know Python programming and its syntax, you can get familiar with how Pandas works in about two weeks. When you are starting out with Pandas, begin with basic data manipulation projects to get a grip on it.

As you progress further, you’ll notice that Pandas is a very useful data science tool that can be a key factor driving business decisions in several industries.

Should I learn NumPy or Pandas first?

It is preferable to learn NumPy before Pandas because NumPy is the most fundamental package in Python for scientific computing. It provides highly optimized multidimensional arrays, which are the basic data structure behind most machine learning algorithms.

Once you are done learning NumPy, you can move on to Pandas, which is often considered an extension of NumPy because its underlying code uses the NumPy library extensively.
