Mastering Pandas: Important Pandas Functions For Your Next Project

Last updated: 30th Nov, 2020

The pandas library is a long-standing favorite among data scientists and analysts because it is easy to use, offers a wide range of functionality, and makes results easy to interpret. Anyone starting out in data science is advised to get a good command of pandas and to build pipelines that reduce the manual effort of cleaning and preprocessing data.

Pandas is built on top of NumPy, which lets many operations run as fast vectorized code instead of slow Python loops. In this article, we will share some underrated pandas functions that can improve the quality of your project’s code.

Before moving ahead, here is a quick legend:

  • All the commands assume that the data frame is named df and is an instance of pd.DataFrame.
  • The pandas library has been imported under the alias pd (import pandas as pd).

String Accessors

String or text data makes up a major part of many datasets. Whether it is information about the author, title, and publisher of a book, or tweets posted under a particular hashtag, we have a lot of text data, and it comes in handy once it is cleaned properly and fed to a classifier such as Naive Bayes. Here are some tricks you can apply:

  • To work with string data, use the .str accessor. For example, df['column_name'].str
  • This makes vectorized string operations available on the selected column.
  • Some common operations include:
    • df['column_name'].str.len(): Returns the length of each string.
    • .str.split(): Splits each string at a particular character.
    • .str.contains(): Returns True/False for each string depending on whether the given word or pattern is present in it.
    • .str.count(): Returns the number of times the regular expression passed occurs in each string.
    • .str.findall(): Returns, for each string, a list of all matches of the expression passed.
    • .str.replace(): Finds matches like findall, but replaces them with the given string. Pass regex=True to treat the pattern as a regular expression.
    • Most Python string methods, such as .str.title(), .str.isalpha(), .str.isalnum(), and .str.isdecimal(), are also supported.
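
The string operations listed above can be sketched as follows. This is a minimal illustration rather than a complete reference; the column name title and its sample values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"title": ["data science 101", "Pandas in Action", "numpy basics"]})

lengths = df["title"].str.len()                                # length of each string
words = df["title"].str.split(" ")                             # list of words per row
has_pandas = df["title"].str.contains("pandas", case=False)    # boolean mask per row
a_counts = df["title"].str.count("a")                          # matches of "a" within each string
digits = df["title"].str.findall(r"\d+")                       # list of all digit runs per row
no_digits = df["title"].str.replace(r"\d+", "", regex=True).str.strip()  # remove digit runs
titled = df["title"].str.title()                               # capitalize each word

print(pd.DataFrame({"len": lengths, "has_pandas": has_pandas, "a_count": a_counts}))
```

Note that every operation returns a new Series aligned with the original rows, so the results can be assigned back as new columns or used as masks.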

Datetime Accessors

Dates and times are common in datasets in the form of timestamps, start times, end times, or any other timing associated with an event. It is useful to parse this data properly because it reveals trends along a timeline that can be used to predict future events, which is known as time-series analysis. Let’s see some useful commands:

  • To work with date and time data, first convert the column (date values are often read in as strings, i.e. the object data type) to datetime using the pd.to_datetime() function.
  • Now, using the .dt accessor, we can access any date or time information required, such as:
    • df['column_name'].dt.day: Returns the day of the month.
    • .dt.time: Returns the time component.
    • .dt.year: Returns the year of the date.
    • .dt.month: Returns the month of the date.
    • .dt.weekday: Returns the day of the week in numerical form, where 0 represents Monday and 6 represents Sunday. If you want day names, use .dt.day_name().
    • .dt.is_month_start: Returns True/False depending on whether the date is the first day of the month.
    • .dt.is_month_end: Same as is_month_start, but checks for the last day of the month.
    • .dt.quarter: Returns the quarter (1 to 4) in which the date lies.
    • .dt.is_quarter_start: Returns True/False depending on whether the date is the first day of a quarter.
    • .dt.is_quarter_end: Returns True/False depending on whether the date is the last day of a quarter.
    • .dt.normalize(): When the time component does not contribute to the analysis, it can be ignored. This method sets the time to midnight, i.e. 00:00:00, while keeping the datetime type.
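
As a minimal sketch of the datetime accessors listed above, the following example parses a string column and reads a few components from it. The column name event_time and the two timestamps are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data: timestamps that were read in as plain strings.
df = pd.DataFrame({"event_time": ["2020-11-30 14:45:00", "2021-01-01 08:00:00"]})

# Convert the string column to datetime before using the .dt accessor.
df["event_time"] = pd.to_datetime(df["event_time"])

days = df["event_time"].dt.day                          # day of the month
weekdays = df["event_time"].dt.weekday                  # 0 = Monday ... 6 = Sunday
day_names = df["event_time"].dt.day_name()              # method call, note the parentheses
quarters = df["event_time"].dt.quarter                  # 1 to 4
is_month_end = df["event_time"].dt.is_month_end         # boolean per row
is_quarter_start = df["event_time"].dt.is_quarter_start # boolean per row
midnight = df["event_time"].dt.normalize()              # time reset to 00:00:00

print(pd.DataFrame({"day_name": day_names, "quarter": quarters, "normalized": midnight}))
```

Attributes such as .dt.day and .dt.quarter are accessed without parentheses, while .dt.day_name() and .dt.normalize() are methods and must be called.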

Pandas Plotting

Plotting visualizations is one of the key components of data analysis and plays a major role in feature engineering. For example, outliers in a dataset can be detected using box plots, which show the median and interquartile range and leave outliers visible as individual points at the extreme ends.

Plotting is mostly done with other libraries such as seaborn, plotly, bokeh, or matplotlib, but what if you want to visualize data quickly without importing and setting up those libraries yourself? Pandas has a solution. Using the df.plot() method, you can plot graphs directly; internally it uses matplotlib, which still needs to be installed. The available options include:

  • df.plot() or df['column_name'].plot() (depending on the type of graph)
  • df.plot() has a parameter kind, which defines the type of graph. By default it is a 'line' plot, but other options include 'bar', 'barh', 'box', 'hist', 'kde', etc.
  • It uses matplotlib as the backend: the call returns a matplotlib Axes object for further customization, and an existing Axes can be passed in through the ax parameter.
  • The .plot() function also takes arguments such as 'title', 'xticks', 'xlim', 'xlabel', 'fontsize', and 'colormap', which reduces the need to call external libraries directly.
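
The plotting options above might be used as in the following sketch. The month and sales columns are hypothetical, and the Agg backend is selected only so that the example runs without a display; in a notebook that line can be dropped.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({"month": [1, 2, 3, 4], "sales": [10, 14, 9, 20]})

# DataFrame.plot returns a matplotlib Axes, which can be customized further.
ax = df.plot(x="month", y="sales", kind="line", title="Monthly sales")
ax.set_ylabel("units sold")

# An existing Axes can be passed in through the ax parameter.
fig, ax2 = plt.subplots()
hist_ax = df["sales"].plot(kind="hist", bins=4, ax=ax2)

fig.savefig("sales_hist.png")  # output file name is arbitrary
```

Passing ax explicitly is a design choice that keeps each chart on its own figure; without it, a Series plot is drawn on whichever axes matplotlib currently considers active.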

Miscellaneous Functions

  • pd.get_dummies(): While preprocessing data, we often encounter categorical data that needs to be converted into numerical form before it can be fed to a model. When the number of categories is fairly low, one-hot encoding is preferred, but doing it manually takes a long time. This function replaces each categorical column with one indicator column per category and, if drop_first is set to True, drops the indicator for the first category, since it can be inferred from the others.
  • df.query(): This function filters the data frame using a condition written as a string. The basic difference from normal masking is that it returns the matching rows directly, so there is no need to build a boolean mask first and then apply it to the data frame.
  • df.select_dtypes(): Sometimes we need to perform a task on columns of one data type only. For example, when reading data from external files, some columns are read in with the object data type. After cleaning, every column should have the correct data type, and converting columns one by one with df['column_name'].astype('data-type') is tedious when there are many of them. This function selects the columns of the specified data type and can be combined with the .apply() function. A sample would look like this:

df.select_dtypes(include='object').apply(lambda col: col.astype(str))
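
To make the three functions above more concrete, here is a minimal sketch on a small, made-up data frame. The column names city, age, and score are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "age": [25, 32, 41, 19],
    "score": ["7", "9", "6", "8"],  # numbers that were read in as strings
})

# One-hot encode "city"; drop_first=True drops the indicator of the first
# category (alphabetically "Delhi"), leaving city_Mumbai and city_Pune.
dummies = pd.get_dummies(df, columns=["city"], drop_first=True)

# Filter rows with a string expression instead of building a boolean mask.
pune_adults = df.query("age > 21 and city == 'Pune'")

# Select only the numeric columns; "score" is excluded because it holds strings.
numeric = df.select_dtypes(include="number")

print(dummies.columns.tolist())
print(pune_adults)
print(numeric.columns.tolist())
```

The equivalent of the query call with a mask would be df[(df["age"] > 21) & (df["city"] == "Pune")], which gives the same rows but is harder to read as conditions grow.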

Conclusion

Calling one method directly on the result of another, as in the sample above, is referred to as method chaining. It is very common in data science tasks because it removes the need to define a variable for every intermediate step.
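
A slightly longer chain could look like the following sketch, which combines the string and datetime accessors with query. The data and column names are hypothetical, and assign with lambda functions is just one common way to write such a chain, not the only one.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    "name": [" alice ", "BOB", "carol"],
    "joined": ["2020-01-15", "2020-06-30", "2021-03-01"],
    "plan": ["free", "pro", "pro"],
})

result = (
    df
    # Clean the names and parse the dates without intermediate variables.
    .assign(
        name=lambda d: d["name"].str.strip().str.title(),
        joined=lambda d: pd.to_datetime(d["joined"]),
    )
    # Keep only the rows on the "pro" plan.
    .query("plan == 'pro'")
    # Derive the joining year from the parsed date.
    .assign(join_year=lambda d: d["joined"].dt.year)
    .reset_index(drop=True)
)

print(result)
```

Each step returns a new data frame, so the original df is left unchanged and the chain reads top to bottom like a small pipeline.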

If you are curious to learn about Pandas, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.


Rohit Sharma

Blog Author
Rohit Sharma is the Program Director for the UpGrad-IIIT Bangalore, PG Diploma Data Analytics Program.
