In this article, we’ll take a look at Pandas, one of the most popular Python libraries and an essential one for data professionals. You’ll learn its basics as well as its common operations.
Let’s get started.
What is Pandas?
Python Pandas is popular for many reasons. Its primary applications are data manipulation, analysis, and cleaning. You can use it with various data types and datasets, including unlabelled data and ordered time-series data. Put simply, Pandas is your data’s home: you can perform numerous operations on your data with this tool.
You can convert a file’s data format, merge two datasets, make calculations, visualize the data with help from Matplotlib, and more. With so many capabilities, it’s a popular choice among data professionals, which is why learning it is essential. You can’t use it well without understanding how it works, so that’s what this Python Pandas tutorial focuses on.
Role of Pandas in Data Science
The Pandas library is an integral part of any data professional’s arsenal. It’s built on NumPy, another popular Python library. A lot of NumPy’s structure is present in Pandas, so if you’re familiar with the former, you won’t have much difficulty getting familiar with the latter.
Data prepared in Pandas is often fed into SciPy for statistical analysis. It’s also commonly used with Matplotlib for plotting and with Scikit-learn for machine learning.
Prerequisites
Before we discuss how Python Pandas works and what operations it offers, we should make clear what you need to know to use it properly. You should be familiar with the fundamentals of Python and with NumPy.
The first one, Python’s fundamentals, is vital for obvious reasons. You won’t understand much of this tutorial without knowing how Python code works, and you won’t be able to try out the examples yourself.
The second one, NumPy, is essential to learn because Pandas is based on it. Having an understanding of NumPy will help you considerably in getting familiar with Pandas.
You can learn about Python through our blogs on data science and Python. We have many helpful guides and articles that can make you familiar with the basics. It’s free, and if you have any doubts, you can write them down in the comment section.
If you’re familiar with both of these topics, let’s take a deeper look at Pandas.
Installing Pandas
To use Pandas, you’ll have to install it first. Fortunately, installing and importing Pandas is easy. Open the command line (Command Prompt on Windows, Terminal on a Mac) and install Pandas with one of these commands. The choice depends on your Python setup, not on your operating system:
With pip (works on any operating system): pip install pandas
With Anaconda or Miniconda: conda install pandas
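To confirm that the installation worked, a quick check along these lines can help. It is only a sketch: it assumes the install command above finished without errors and uses the conventional alias pd.

```python
import pandas as pd

# If the import succeeds, Pandas is installed; the version string
# tells you which release you are working with.
installed_version = pd.__version__
print("Pandas version:", installed_version)
```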
In Pandas, you’ll be dealing with series and dataframes. A series is a one-dimensional labelled array, which you can think of as a single column. A data frame is a two-dimensional table made up of multiple series, one per column.
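As a minimal sketch of how the two relate, the snippet below builds two series and combines them into a data frame. The names visitors, bounce, and traffic, and the values in them, are made up for illustration.

```python
import pandas as pd

# Two one-dimensional Series (hypothetical sample values).
visitors = pd.Series([200, 100, 230], name="Visitors")
bounce = pd.Series([20, 45, 60], name="Bounce_Rate")

# A DataFrame is a two-dimensional table whose columns are Series.
traffic = pd.DataFrame({"Visitors": visitors, "Bounce_Rate": bounce})

# Selecting a single column gives you back a Series.
column = traffic["Visitors"]
print(type(column).__name__)
print(traffic)
```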
Operations in Pandas
Now that we’ve discussed what Pandas is and why it matters, let’s look at the operations you can perform with it. Pandas provides a lot of functions, and we discuss some of the most common ones below:
Data viewing
At the start, you’ll want to print out a few rows of your data set to keep as a visual reference. You can do so with the .head() function. (In the examples that follow, file1 stands for a data frame you have already loaded.)
file1.head()
This function gives you the first five rows of the data frame. If you want to get more rows than the first five, you can just pass the required number in the function. Suppose you want the first 15 rows of the data frame, you’ll write the following code:
file1.head(15)
You also have the option of viewing the last five rows of the data frame. You can do so by using the .tail() function. And just like the .head() function, the .tail() function can also accept a number and give you the required quantity of rows.
file1.tail(20)
This code would give you the last 20 rows of your data frame.
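To try these calls without a real file, here is a small sketch. The 30-row file1 below is a made-up stand-in for your own data frame.

```python
import pandas as pd

# Hypothetical stand-in for file1: 30 rows of sample data.
file1 = pd.DataFrame({"id": range(1, 31), "score": range(100, 130)})

first_rows = file1.head()    # first 5 rows by default
first_15 = file1.head(15)    # first 15 rows
last_20 = file1.tail(20)     # last 20 rows

print(first_rows)
print(len(first_15), len(last_20))
```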
Getting Information
One of the first functions data scientists use with Pandas is .info(). That’s because it displays information about the data frame and gives you a deeper understanding of what you’re working with. Here’s how you use it in Pandas:
file1.info()
It provides you with a lot of useful information about the dataset, such as the quantity of the non-null values, the number of rows, the type of data present in a column, etc.
Knowing the datatype of your data frame’s values is essential in many cases. Suppose you need to perform arithmetic operations on a column, but it contains strings. When you run your mathematical operations, you may see an error or get an unexpected result, because such operations don’t work on strings the way they do on numbers. If, on the other hand, you use the .info() function before doing any operations, you’ll already know that you have strings.
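The following sketch illustrates that workflow under some assumptions: the orders data frame and its columns are invented, and pd.to_numeric is one possible way to convert text to numbers, not the only one.

```python
import pandas as pd

# Hypothetical data: "price" was stored as text, "quantity" as integers.
orders = pd.DataFrame({"price": ["10", "20", "30"], "quantity": [1, 2, 3]})

# .info() prints each column with its non-null count and dtype,
# which reveals that "price" does not hold numbers.
orders.info()

# Convert the text column to numbers before doing arithmetic on it.
orders["price"] = pd.to_numeric(orders["price"])
orders["total"] = orders["price"] * orders["quantity"]
print(orders)
```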
While the .info() function shows you general information about your dataset, the .shape attribute gives you a tuple with the dimensions of your data frame: the number of rows followed by the number of columns. You can use it in the following way:
file1.shape
This attribute doesn’t have parentheses because it’s an attribute, not a function: it simply holds a tuple of the form (rows, columns). You’ll be using the .shape attribute quite often while cleaning your data.
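Here is a small sketch of why .shape is handy during cleaning. The people data frame is hypothetical, and .dropna() is used only as one example of a cleaning step that changes the row count.

```python
import pandas as pd

# Hypothetical data with one missing value in the "name" column.
people = pd.DataFrame({
    "name": ["Ann", "Bob", None, "Dee"],
    "age": [30, 25, 41, 38],
})

rows, cols = people.shape       # unpack the (rows, columns) tuple
print(rows, cols)

# Comparing .shape before and after shows how many rows were dropped.
cleaned = people.dropna()
print(people.shape, cleaned.shape)
```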
Concatenation
Let’s now discuss concatenation in this Python Pandas tutorial. Concatenation refers to joining two or more things together. With it, you can combine two datasets without modifying their values or data points in any way; they are combined as is. You’ll use the pd.concat() function for this purpose. Here’s how:
result = pd.concat([file1,file2])
It’ll combine the file1 and file2 dataframes into a single data frame. Here’s a complete example with two small dataframes:
import pandas as pd
df1 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]}, index=[2001, 2002, 2003, 2004])
df2 = pd.DataFrame({"HPI": [80, 90, 70, 60], "Int_Rate": [2, 1, 2, 3], "IND_GDP": [50, 45, 45, 67]}, index=[2005, 2006, 2007, 2008])
concat = pd.concat([df1, df2])
print(concat)
The output of the above code:
      HPI  Int_Rate  IND_GDP
2001   80         2       50
2002   90         1       45
2003   70         2       45
2004   60         3       67
2005   80         2       50
2006   90         1       45
2007   70         2       45
2008   60         3       67
Notice how the pd.concat() function has stacked the rows of the two dataframes into one.
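One detail worth knowing is that the combined data frame keeps the original index labels. The sketch below uses two made-up, smaller dataframes and also shows the ignore_index parameter of pd.concat(), which renumbers the rows from zero if you don’t need the old labels.

```python
import pandas as pd

# Two small hypothetical dataframes with different index labels.
df1 = pd.DataFrame({"HPI": [80, 90]}, index=[2001, 2002])
df2 = pd.DataFrame({"HPI": [70, 60]}, index=[2005, 2006])

# By default the original index labels are kept.
combined = pd.concat([df1, df2])

# With ignore_index=True the result gets a fresh 0..n-1 index.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(combined)
print(renumbered)
```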
Changing the Index
You can change the index of your data frame as well. For that purpose, use the .set_index() function and pass it the name of the column you want to use as the new index. Passing inplace=True modifies the data frame directly instead of returning a new one. Take a look at the following example to understand it better.
import pandas as pd
df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300], "Bounce_Rate": [20, 45, 60, 10]})
df.set_index("Day", inplace=True)
print(df)
The output of the above code:
     Visitors  Bounce_Rate
Day
1         200           20
2         100           45
3         230           60
4         300           10
You can see that the Day column has become the index of the data frame.
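If you’d rather keep the original data frame untouched, a sketch like the following may be useful. It calls .set_index() without inplace so that a new data frame is returned, looks up a row by its index label with .loc, and uses .reset_index() to turn the index back into a column. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Day": [1, 2, 3, 4], "Visitors": [200, 100, 230, 300]})

# Without inplace=True, set_index returns a new DataFrame and leaves df as is.
indexed = df.set_index("Day")

# Rows can now be looked up by their Day label.
visitors_day_3 = indexed.loc[3, "Visitors"]
print(visitors_day_3)

# reset_index moves the index back into an ordinary column.
restored = indexed.reset_index()
print(restored)
```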
Changing the Column Headers
You can change the column headers in Python Pandas as well. All you have to do is use the .rename() function. Pass it a dictionary that maps each existing column name to the name you want it to have.
Suppose you have a table with its column header as ‘Time,’ and you want to change it into ‘Hours.’ You can change the name of this column with the following code:
df = df.rename(columns={"Time": "Hours"})
This code changes the column header from ‘Time’ to ‘Hours.’ Note that .rename() returns a new data frame, which is why the result is assigned back to df.
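A self-contained version of the renaming example could look like this. The table and its second column, Tasks, are made up for illustration.

```python
import pandas as pd

# Hypothetical table with a "Time" column.
df = pd.DataFrame({"Time": [1, 2, 3], "Tasks": [4, 5, 6]})

# Map the old column name to the new one; other columns are unaffected.
df = df.rename(columns={"Time": "Hours"})

column_names = df.columns.tolist()
print(column_names)  # ['Hours', 'Tasks']
```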
Data Munging
Data munging (also called data wrangling) includes converting data from one format to another. For example, you can read a .csv file and write it out as an .html file, or do the reverse. Here’s an example of how you can do so:
import pandas as pd
# Replace this with the path to your own CSV file
country = pd.read_csv("D:UsersUser1Downloadsworld-bank-youth-unemploymentAPI_ILO_country_YU.csv", index_col=0)
country.to_html("file1.html")
After you’ve run this code, it’ll create an HTML file named file1.html, which you can open in your browser. Format conversion like this is useful in many situations.
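If you don’t have that CSV file at hand, the same conversion can be tried end to end with a sketch like the one below. Everything in it is an assumption made for illustration: the sample data, the file names, and the use of a temporary folder so that nothing is left behind on disk.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for a real CSV file.
sample = pd.DataFrame({"Country": ["A", "B"], "Rate": [10.5, 12.0]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "sample.csv")
    html_path = os.path.join(tmp, "sample.html")

    # Write a CSV file, then read it back with the first column as the index.
    sample.to_csv(csv_path, index=False)
    country = pd.read_csv(csv_path, index_col=0)

    # Convert the loaded data to an HTML table on disk.
    country.to_html(html_path)

    with open(html_path) as f:
        html_text = f.read()

print(html_text[:60])
```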
Conclusion
And now, we have reached the end of this Python Pandas tutorial. We hope you found it useful and informative. Python Pandas is a vast topic, and with the numerous functions it has, it would take some time for one to get familiar with it completely.
If you’re interested in learning more about Python, its various libraries, including Pandas, and its application in data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.