Regular Expressions in Python [With Examples]: How to Implement?

While processing raw data from any source, extracting the right information is important so that meaningful insights can be obtained from it. Pulling a specific pattern out of the data can be difficult, especially when the data is textual.

Textual data consists of paragraphs of information collected via survey forms, website scraping, and other sources. Chaining different string accessors with pandas functions or other custom functions can get the work done, but what if a more specific pattern needs to be obtained? Regular expressions do this job with ease.

What is a Regular Expression (RegEx)?

A regular expression is a sequence of characters that describes a pattern to look for in strings. It acts as a generalized formula for that pattern, which helps in separating the right information from the pool of data. The expression usually consists of symbols or characters that form the rule and, at first glance, it may seem weird and difficult to grasp. These symbols have associated meanings that are described here.

Meta-characters in RegEx

  1. ‘.’: is a wildcard, matches a single character (any character, but just once)
  2. ^: denotes start of the string
  3. $: denotes the end of the string
  4. [ ]: matches any one character from the set within [ ]
  5. [a-z]: matches one of the range of characters a,b,…,z
  6. [^abc] : matches a character that is not a,b or c.
  7. a|b: matches either a or b, where a and b are strings
  8. () : provides scoping for operators
  9. \ : enables escape for special characters (\t, \n, \b, \.)
  10. \b: matches word boundary
  11. \d : any digit, equivalent to [0-9]
  12. \D: any non digit, equivalent to [^0-9]
  13. \s : any whitespace, equivalent to [ \t\n\r\f\v]
  14. \S : any non-whitespace, equivalent to [^ \t\n\r\f\v]
  15. \w : any alphanumeric, equivalent to [a-zA-Z0-9_]
  16. \W : any non-alphanumeric, equivalent to [^a-zA-Z0-9_]
  17. ‘*’: matches zero or more occurrences
  18. ‘+’: matches one or more occurrences
  19. ‘?’: matches zero or one occurrence
  20. {n}: exactly n repetitions, n>=0
  21. {n,}: at least n repetitions
  22. {,n}: at most n repetitions
  23. {m,n}: at least m repetitions and at most n repetitions
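
A few of the meta-characters listed above can be tried out with a short sketch like the one below. The sample text and variable names are made up for illustration, and re.findall() (covered later in this article) is used only to list what each pattern matches.

```python
import re

# Hypothetical sample text, chosen only to exercise the meta-characters
text = "cat cot cut; call 555-0199 or 42"

# '.' matches any single character, so c.t matches cat, cot and cut
any_char = re.findall(r"c.t", text)

# A character set limits the choice to the listed characters
char_set = re.findall(r"c[ao]t", text)

# \d with '+' matches runs of one or more digits
digit_runs = re.findall(r"\d+", text)

# \b...\b with {2} keeps only numbers that are exactly two digits long
two_digit = re.findall(r"\b\d{2}\b", text)

# '|' chooses between alternatives; ^ and $ anchor to the start and end
anchored = re.findall(r"^cat|42$", text)

print(any_char)    # ['cat', 'cot', 'cut']
print(char_set)    # ['cat', 'cot']
print(digit_runs)  # ['555', '0199', '42']
print(two_digit)   # ['42']
print(anchored)    # ['cat', '42']
```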

Examples to Understand How It Works

Now that you are aware of the characters that make up a RegEx, let’s see how this works:

1. Email Filtering:

Suppose you want to filter out all the email ids from a long paragraph. The general format for an email is:

username@domain_name.<top_level_domain>

The username can be alphanumeric, so \w could denote it, but a user may also create an account as firstName.surname. To handle this, we list the allowed characters explicitly in a set that includes the dot (inside a set, the dot is literal and does not need escaping). Next, domain_name may contain letters, digits, and hyphens, so [a-zA-Z0-9-] denotes it. The top-level domain is usually .com, .in, or .org but, depending on the use case, you can choose either the whole alphabet range or filter specific domains.

The regular expression for this looks like:

^([a-zA-Z0-9_.]+)@([a-zA-Z0-9-]+)\.([a-zA-Z]{2,4})$

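Here the start and end of the pattern are anchored with ^ and $, so the whole string must be an email address, and the top-level domain can contain only 2-4 letters. The whole expression has 3 groups: the username, the domain name, and the top-level domain.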
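
As a minimal sketch, the email expression above can be tried in Python as follows. The sample addresses and variable names are invented for illustration. The second, unanchored pattern is an assumption added here and is not part of the expression above: since ^ and $ force the whole string to be an email, pulling addresses out of a paragraph needs a variant without the anchors.

```python
import re

# The anchored expression from above: the whole string must be one email
EMAIL_FULL = re.compile(r"^([a-zA-Z0-9_.]+)@([a-zA-Z0-9-]+)\.([a-zA-Z]{2,4})$")

# Assumed variant for searching inside longer text: no anchors, no groups,
# and a trailing \b so the top-level domain is not cut off mid-word
EMAIL_ANYWHERE = re.compile(r"[a-zA-Z0-9_.]+@[a-zA-Z0-9-]+\.[a-zA-Z]{2,4}\b")

m = EMAIL_FULL.match("first.last@example.com")
parts = m.groups() if m else None
print(parts)  # ('first.last', 'example', 'com')

# The domain part allows no dots, so multi-level domains are rejected
rejected = EMAIL_FULL.match("user@mail.example.co.uk")
print(rejected)  # None

paragraph = "Write to first.last@example.com or admin_01@my-site.org for details."
found = EMAIL_ANYWHERE.findall(paragraph)
print(found)  # ['first.last@example.com', 'admin_01@my-site.org']
```

The rejected address shows a limit of this simple expression: real-world email formats are broader than what it accepts, so treat it as a starting point to adapt to your data.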

2. Dates Filtering:

The textual information you are extracting may contain dates, with no separate date column made available to you. Dates are an essential factor in filtering data and in time series analysis. A date typically takes the format day/month/year, where the day and month can swap positions.

Also, months can be written as numbers or as words, and the words can be abbreviations or full names. The right pattern depends on which of these cases are present in your data, and usually has to be found by trial and error.

A simple RegEx that covers a variety of dates is shown below:

^(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$

This pattern captures dates separated by hyphens or forward slashes. The day and month are limited to one or two digits and the year to two to four digits. The parentheses capture each entity as a group; they are optional if you only need the whole match.
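
A short sketch of the date expression above in Python is shown below. The helper name split_date and the sample strings are hypothetical. The last sample is included to show that the pattern checks only the shape of the text, not whether the values form a real calendar date.

```python
import re

# The anchored date expression from above
DATE_FULL = re.compile(r"^(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})$")

def split_date(candidate):
    """Return the three captured groups as a tuple, or None if the shape does not match."""
    m = DATE_FULL.match(candidate)
    return m.groups() if m else None

for candidate in ["25-12-1999", "5/1/21", "25.12.1999", "1999-12-25", "99/99/999"]:
    print(candidate, "->", split_date(candidate))

# 25-12-1999 -> ('25', '12', '1999')
# 5/1/21 -> ('5', '1', '21')
# 25.12.1999 -> None   (dots are not in the separator set)
# 1999-12-25 -> None   (the first field allows at most two digits)
# 99/99/999 -> ('99', '99', '999')   (right shape, but not a valid date)
```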

How to Implement it in Python?

The regular expressions we just built satisfy the criteria we assumed, and now it’s time to implement them in Python code. Python has a built-in module called re that implements these expressions. To use it, import the module and define a pattern:

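import re

# Raw string (r'...') so backslashes reach the regex engine unchanged.
# ^ and $ are dropped so the pattern can match dates inside longer text.
pattern = r'(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})'

The re module offers a wide range of functions, each with different use cases. Let’s look at some of the important ones:

  1. re.findall(): This function returns a list of all the matches in the test string based on the pattern passed. When the pattern contains capturing groups, each match is returned as a tuple of those groups. Consider this example:

string = '25-12-1999 random text here 25/12/1999'

print(re.findall(pattern, string))

It will return only the dates from the string, as a list of (day, month, year) tuples: [('25', '12', '1999'), ('25', '12', '1999')].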

  2. re.sub(): Sub stands for substitution, which is exactly what this function does. It substitutes the matches with the replacement value provided. The function takes the pattern, the replacement value, the string, and an optional count parameter. The count parameter controls how many occurrences you want to replace. By default, it replaces all of them and returns the new string.
  3. re.split(): It splits the string at the matched sites and returns the parts as separate strings in a list.
  4. re.search(): This function returns a match object for the first match found in the string, along with all the groups it captured. It can come in handy when you want to store these groups as separate columns.

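To perform this:

match = re.search(pattern, string)

if match:
    print(match.group(1))

group(0) returns the whole match, and group(1), group(2), and so on return the captured groups in order. re.search() returns None when nothing matches, so check the result before calling group().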
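
The re.sub() and re.split() functions described above can be sketched in the same way. The placeholder text "<DATE>" and the variable names are chosen only for illustration. The date pattern is written here without the ^ and $ anchors so that it can match inside a longer string, and the split pattern drops the parentheses because re.split() would otherwise include the captured groups in its result.

```python
import re

# Unanchored date pattern so it can match inside longer text
date_pattern = r"(\d{1,2})[/-](\d{1,2})[/-](\d{2,4})"
text = "25-12-1999 random text here 25/12/1999"

# Replace every date with a placeholder
masked_all = re.sub(date_pattern, "<DATE>", text)

# Replace only the first date by passing count
masked_first = re.sub(date_pattern, "<DATE>", text, count=1)

# Same pattern without capturing groups, used for splitting
split_pattern = r"\d{1,2}[/-]\d{1,2}[/-]\d{2,4}"
pieces = re.split(split_pattern, text)

print(masked_all)    # <DATE> random text here <DATE>
print(masked_first)  # <DATE> random text here 25/12/1999
print(pieces)        # ['', ' random text here ', '']
```

The empty strings in the split result appear because the text starts and ends with a match; filter them out if they are not needed.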

Conclusion

Regular expressions are a powerful way to capture patterns in textual data. It may take a bit of extra effort to master the various characters, but doing so simplifies data extraction in complex use cases.
