Regular Expressions in Python [With Examples]: How to Implement?
By Rohit Sharma
Updated on Nov 30, 2022 | 7 min read | 6.28K+ views
When processing raw data from any source, extracting the right information is essential for drawing meaningful insights. Pulling a specific pattern out of the data can be difficult, especially when the data is textual.
Textual data often consists of paragraphs of information collected via survey forms, web scraping, and other sources. Chaining string accessors with pandas functions or other custom functions can get the work done, but what if a more specific pattern needs to be extracted? Regular expressions do this job with ease.
A regular expression is a sequence of characters that defines a search pattern for strings. It acts as a generalized formula for a particular pattern, which helps in separating the right information from a pool of data. The expression consists of symbols and characters that together form the rule; at first glance it may seem strange and difficult to grasp, but each symbol has a specific meaning, described here.
Meta-characters in RegEx
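As a quick illustration, the sketch below exercises a few common meta-characters from Python's re syntax (., ^, $, *, +, ?, [], | and {}). The sample strings and variable names are made up for this example.

```python
import re

# . matches any single character except a newline
dot = re.findall(r"c.t", "cat cot cut ct")            # ['cat', 'cot', 'cut']

# ^ and $ anchor the match to the start and end of the string
anchored = bool(re.search(r"^data$", "data"))         # True

# * means zero or more, + one or more, ? zero or one of the preceding item
star = re.findall(r"ab*", "a ab abb")                 # ['a', 'ab', 'abb']
plus = re.findall(r"ab+", "a ab abb")                 # ['ab', 'abb']
optional = re.findall(r"colou?r", "color colour")     # ['color', 'colour']

# [] defines a character set, | means alternation, {m,n} bounds repetition
vowels = re.findall(r"[aeiou]", "regex")              # ['e', 'e']
either = re.findall(r"cat|dog", "cat bird dog")       # ['cat', 'dog']
digits = re.findall(r"\d{2,3}", "7 42 123 12345")     # ['42', '123', '123', '45']
```

Note how {2,3} is greedy: the run 12345 is consumed as 123 first, and the remaining 45 forms a second match.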
Now that you are aware of the characters that make up a RegEx, let’s see how this works:
1. Email Filtering:
Suppose you want to extract all the email addresses from a long paragraph. The general format for an email address is:
username@domain_name.top_level_domain
The username can be alphanumeric, so \w could denote it, but a user may also create an account as firstName.surname. To handle this, we put the allowed characters, including the dot, into a character set: [a-zA-Z0-9_.]. Inside a set the dot is literal and needs no escaping. Next, the domain name is made of letters, digits, and hyphens, so [a-zA-Z0-9-] denotes it. The top-level domain is usually .com, .in, or .org; depending on the use case, you can allow the whole alphabet range or filter for specific domains.
The resulting regular expression looks like this:
^([a-zA-Z0-9_.]+)@([a-zA-Z0-9-]+)\.([a-zA-Z]{2,4})$
Here the start and end of the pattern are anchored with ^ and $, and the top-level domain can only contain 2 to 4 letters. The whole expression has three groups.
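The email pattern above can be tried out with a short sketch like the following. The helper name parse_email and the sample addresses are assumptions made for illustration, not part of the original article.

```python
import re

# Pattern from the text: username, domain name, and top-level domain as three groups
EMAIL_PATTERN = re.compile(r"^([a-zA-Z0-9_.]+)@([a-zA-Z0-9-]+)\.([a-zA-Z]{2,4})$")

def parse_email(candidate):
    """Return (username, domain, tld) if candidate fits the pattern, else None."""
    match = EMAIL_PATTERN.match(candidate)
    return match.groups() if match else None

print(parse_email("first.last@example.com"))  # ('first.last', 'example', 'com')
print(parse_email("not-an-email"))            # None
print(parse_email("user@mail.example.com"))   # None: the domain set excludes dots
```

This pattern is deliberately simple: it rejects subdomains and top-level domains longer than four letters, so it should be treated as a starting point rather than a complete email validator.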
2. Dates Filtering:
The text you are extracting from may contain dates with no separate column made available for them. Dates are essential for filtering data and for time series analysis. A date typically takes the format date/month/year, where the date and month can swap positions.
Months can also be written numerically or alphabetically, and in the alphabetic form either abbreviated or as full names. How many of these cases you must handle depends on your data and can usually only be worked out by trial and error.
A simple RegEx that covers a variety of dates is shown below:
^(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$
This pattern captures dates written with hyphens or forward slashes. The date and month are limited to one or two digits and the year to between two and four digits. Each part is captured as a group; the grouping is optional here and only matters if you want to extract the parts separately.
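A minimal sketch of the date pattern in use is shown below. The helper name split_date and the test strings are hypothetical and chosen only to show which inputs the pattern accepts.

```python
import re

# Date pattern from the text: day, month, and year captured as three groups
DATE_PATTERN = re.compile(r"^(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$")

def split_date(text):
    """Return the three captured parts as strings, or None if the shape does not fit."""
    match = DATE_PATTERN.match(text)
    return match.groups() if match else None

print(split_date("25-12-1999"))   # ('25', '12', '1999')
print(split_date("1/2/99"))       # ('1', '2', '99')
print(split_date("25.12.1999"))   # None: dots are not in the separator set
print(split_date("99/99/999"))    # ('99', '99', '999'): shape only, not a real date
```

As the last call shows, the pattern checks only the shape of the string. Validating that the values form a real calendar date would need an extra step, for example with the datetime module.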
The regular expressions we just built satisfy the criteria we assumed, and now it's time to implement them in Python code. Python has a built-in module called re that implements these expressions. Simply,
import re
# Raw string so the backslashes reach the regex engine unchanged; ^ and $ are
# dropped so the pattern can match dates anywhere inside a longer string.
pattern = r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})'
The re module offers a wide range of functions, each with a different use case. Let's look at some of the important ones:
string = '25-12-1999 random text here 25/12/1999'
print(re.findall(pattern, string))
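re.findall() returns all non-overlapping matches in a list. Because this pattern contains capture groups, each match comes back as a tuple of the captured day, month, and year strings.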
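To retrieve individual groups from the first match, use re.search():
match = re.search(pattern, string)
if match:  # re.search() returns None when nothing matches
    print(match.group(1))
group(0) returns the whole match, and group(1), group(2), and so on return the corresponding captured groups.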
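The difference between re.findall() and re.search() can be sketched as follows. This example assumes the date pattern without the ^ and $ anchors, since an anchored pattern cannot match a date embedded in a longer string; the variable names are illustrative.

```python
import re

# Unanchored date pattern (an assumption for this sketch) and a sample string
pattern = r"(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})"
text = "25-12-1999 random text here 25/12/1999"

# findall returns one tuple of groups per match when the pattern has groups
all_dates = re.findall(pattern, text)
print(all_dates)        # [('25', '12', '1999'), ('25', '12', '1999')]

# search returns a match object for the first match only
first = re.search(pattern, text)
whole = first.group(0)  # '25-12-1999', the whole match
day = first.group(1)    # '25', the first captured group

# finditer yields match objects, handy when the full matched text is wanted
full_matches = [m.group(0) for m in re.finditer(pattern, text)]
print(full_matches)     # ['25-12-1999', '25/12/1999']
```

If only the full date strings are needed, removing the parentheses from the pattern makes re.findall() return them directly.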
Regular expressions are a powerful way to capture patterns in textual data. It may take a bit of extra effort to master the various characters, but they simplify data extraction in complex use cases.
The following examples illustrate the functioning of regular expressions in Python:
a. Email Filtering
Regular expressions can be used efficiently to filter emails. A regular expression for email filtering is: ^([a-zA-Z0-9_.]+)@([a-zA-Z0-9-]+)\.([a-zA-Z]{2,4})$
This expression is divided into three groups and handles many cases, including usernames that are alphanumeric and usernames that contain a dot, e.g., “first.last@”. It matches only top-level domains of 2 to 4 letters.
b. Dates Filtering
Dates can be a crucial factor in data filtering, and the textual data you are dealing with often contains them. A regular expression (RegEx) that matches such a date is: ^(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$
The date and the month can be up to 2 digits, while the year can be up to 4 digits.
The following functions are involved in the implementation of regular expressions in Python:
1. re.findall() - This function accepts a pattern to be matched against the text string. It returns all the substrings that match, as a list.
2. re.sub() - Sub in “re.sub” stands for “substitution”. This function replaces every match of the pattern in the string with a replacement string and returns the new string.
3. re.split() - It splits the string at each match of the pattern passed to it and returns the pieces as a list. The separator can be any pattern.
4. re.search() - This function scans the string and returns a match object for the first match, which also gives access to the groups it has captured. It returns None if there is no match.
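The sketch below contrasts re.findall(), re.sub(), and re.split() on small made-up strings. The sample text, the [DATE] placeholder, and the variable names are assumptions for illustration only.

```python
import re

text = "Invoice dated 25-12-1999, paid on 02/01/2000"
date_pattern = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"  # no groups, so findall returns strings

# findall collects every match
found = re.findall(date_pattern, text)
print(found)      # ['25-12-1999', '02/01/2000']

# sub replaces every match with the replacement string
redacted = re.sub(date_pattern, "[DATE]", text)
print(redacted)   # Invoice dated [DATE], paid on [DATE]

# split cuts the string wherever the separator pattern matches
parts = re.split(r"[,;]\s*", "red, green;blue")
print(parts)      # ['red', 'green', 'blue']
```

Unlike str.split(), re.split() accepts a pattern, so a single call can handle several different separators.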
The following are some of the special sequences used in regular expressions:
1. \A: Matches only at the start of the string.
2. \b: Matches at a word boundary, that is, at the beginning or end of a word. \b(string) checks that the characters start a word, while (string)\b checks that they end one.
3. \B: The exact opposite of \b. Matches only where there is no word boundary.
4. \d: Matches any digit.
5. \D: Matches any non-digit character.
6. \s: Matches any whitespace character.
7. \S: Matches any non-whitespace character.
8. \w: Matches any alphanumeric character or underscore.
9. \W: Matches any character that is not alphanumeric or an underscore.
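These special sequences can be checked with a short sketch such as the one below. The sample strings are invented for this example, and the patterns are written as raw strings so that Python does not interpret the backslashes itself.

```python
import re

text = "Order 66 shipped to cat_lover, not concat"

starts_with_order = bool(re.search(r"\AOrder", text))  # True: \A anchors to the start

word_start = re.findall(r"\bcat", text)    # ['cat'] from "cat_lover" (starts a word)
word_end = re.findall(r"cat\b", text)      # ['cat'] from "concat" (ends a word)
inside_word = re.findall(r"\Bcat", text)   # ['cat'] from "concat" (not at a word start)

digits = re.findall(r"\d", text)               # ['6', '6']
non_digits = re.findall(r"\D+", "ab12cd")      # ['ab', 'cd']
spaces = re.findall(r"\s", "a b\tc")           # [' ', '\t']
non_spaces = re.findall(r"\S+", "a b\tc")      # ['a', 'b', 'c']
words = re.findall(r"\w+", "hi there!")        # ['hi', 'there']
non_words = re.findall(r"\W", "hi there!")     # [' ', '!']
```

The underscore counts as a word character, which is why cat\b does not match inside "cat_lover".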
Rohit Sharma is the Head of Revenue & Programs (International), with over 8 years of experience in business analytics, EdTech, and program management. He holds an M.Tech from IIT Delhi and specializes...