Replacing characters and strings is a core task in data cleaning and text processing with Python. Your data may contain stray characters that need to be removed, category labels with inconsistent spelling, and so on. In text preprocessing for NLP problems, string replacement is one of the most basic and important steps in preparing textual data.
In this tutorial, we will go over multiple ways to replace different types of strings. By the end of this tutorial, you will know how to use the following:
- Python replace() method
- Regex sub() method
- join() and filter()
- Replacing numeric data in strings
Python replace()
The replace(old_str, new_str, count) method takes three arguments:
- old_str: The substring that needs to be replaced
- new_str: The string that replaces each occurrence of old_str
- count: Optional. The maximum number of occurrences to replace, counting from the left; if omitted, all occurrences are replaced
Let’s go over a few examples to see how it works.
Single replace
Mystr = "This is a sample string"
Newstr = Mystr.replace('is', 'was')
print(Newstr)
# Output: Thwas was a sample string
Recall that strings in Python are immutable. When we call replace(), it returns a new string object with the modified data and leaves the original unchanged. We also did not specify the count parameter in the above example; when count is omitted, replace() replaces every occurrence of the substring.
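To make the immutability point concrete, here is a minimal sketch. The variable names original, replaced and unchanged are only illustrative and are not part of the examples above.

```python
# replace() returns a new string; the original string is left untouched.
original = "This is a sample string"
replaced = original.replace("is", "was")

print(original)  # This is a sample string
print(replaced)  # Thwas was a sample string

# When the substring does not occur, the result is equal to the original.
unchanged = original.replace("xyz", "was")
print(unchanged == original)  # True
```

In practice this means you must assign the result of replace() to a name; calling it without using the return value has no visible effect.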
Multiple replace
Mystr = "This is a sample string"
Newstr = Mystr.replace("s", "X")
print(Newstr)
# Output: ThiX iX a Xample Xtring
Replacing the first n occurrences
If you only want to replace the first n occurrences, pass n as the count argument:
Mystr = "This is a sample string"
Newstr = Mystr.replace("s", "X", 3)
print(Newstr)
# Output: ThiX iX a Xample string
Multiple strings replace
In the above examples, we replaced one substring a varying number of times. What if you want to replace several different substrings in the same string? We can write a small function that applies the same method to each of them in turn.
Consider the same example as above, but now we want to replace “h”, “is” and “ng” with “X”.
def MultipleStrings(mainStr, strReplaceList, newStr):
    # Iterate over the strings to be replaced
    for elem in strReplaceList:
        # Check whether the string occurs in the main string
        if elem in mainStr:
            # Replace the string
            mainStr = mainStr.replace(elem, newStr)
    return mainStr

Mystr = "This is a sample string"
Newstr = MultipleStrings(Mystr, ['h', 'is', 'ng'], "X")
print(Newstr)
# Output: TXX X a sample striX
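If each substring needs its own replacement rather than a shared one, the same loop can be driven by a dictionary. This is only a sketch: replace_with_mapping is a hypothetical helper name, and the mapping shown is made up for illustration.

```python
def replace_with_mapping(main_str, replacements):
    """Apply each old -> new pair in turn (hypothetical helper)."""
    for old, new in replacements.items():
        main_str = main_str.replace(old, new)
    return main_str


text = "This is a sample string"
result = replace_with_mapping(text, {"is": "was", "sample": "test"})
print(result)  # Thwas was a test string
```

Because the pairs are applied one after another, a later pair can also match text produced by an earlier one, so the order of the dictionary entries can matter.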
Replacing with regex
Python’s re module is built for working with text using regular expressions, whether that means finding substrings, replacing them, or splitting strings. It provides the sub() function to find and substitute substrings easily. Let’s go over its syntax and a few use cases.
The re.sub(pattern, replacement, original_string) function takes three required arguments (it also accepts optional count and flags arguments):
- pattern: the regular expression that describes the text to be matched and replaced.
- replacement: can be a string which needs to be put in place, or a callable function which returns the value that needs to be put in place.
- original_string: the main string in which the substring has to be replaced.
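The replacement argument can be a callable, which none of the examples in this tutorial use. A minimal sketch, assuming we want to upper-case every word that starts with "s"; the function name shout is hypothetical:

```python
import re


def shout(match):
    # match.group(0) is the text matched by the whole pattern
    return match.group(0).upper()


text = "This is a sample string"
# \b anchors the match at the start of a word
result = re.sub(r"\bs\w+", shout, text)
print(result)  # This is a SAMPLE STRING
```

re.sub() calls the function once per match and inserts whatever string it returns.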
Like the replace() method, re.sub() returns a new string with the modifications applied. Let’s go over a few working examples.
Replacing whitespace
Whitespace characters can be matched with a pattern and replaced with other characters. In the below example, we replace whitespace with “X”.
import re

Mystr = "This is a sample string"
# Replace all whitespace in Mystr with 'X'
Newstr = re.sub(r"\s+", 'X', Mystr)
print(Newstr)
# Output: ThisXisXaXsampleXstring
As we can see, all the whitespace was replaced. The pattern r"\s+" matches one or more consecutive whitespace characters, so each run of whitespace is replaced by a single “X”.
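The difference between \s+ and \s only shows up when the text contains runs of whitespace. A small sketch with a made-up input string that has double spaces and a tab:

```python
import re

text = "This  is\ta sample   string"

# \s+ treats each run of whitespace as one match
per_run = re.sub(r"\s+", "X", text)
# \s matches every whitespace character individually
per_char = re.sub(r"\s", "X", text)

print(per_run)   # ThisXisXaXsampleXstring
print(per_char)  # ThisXXisXaXsampleXXXstring
```

Using \s+ with a single space as the replacement is a common way to collapse irregular spacing.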
Replacing all special characters
To replace all the special characters, we pass a pattern that matches any punctuation character. Replacing with an empty string instead of “X” removes them.
import re
import string

Mystr = "Tempo@@&[(000)]%%$@@66isit$$#$%-+Str"
pattern = r'[' + string.punctuation + ']'
# Replace all special characters in the string with X
Newstr = re.sub(pattern, 'X', Mystr)
print(Newstr)
# Output: TempoXXXXX000XXXXXXX66isitXXXXXXXStr
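When a character class is built from arbitrary characters, it is generally safer to escape them first so that characters such as ], ^, - or \ cannot change the meaning of the pattern. A hedged sketch of the same example using re.escape(); the names result and removed are illustrative:

```python
import re
import string

# re.escape() backslash-escapes characters that are special in a regex
pattern = "[" + re.escape(string.punctuation) + "]"

text = "Tempo@@&[(000)]%%$@@66isit$$#$%-+Str"
result = re.sub(pattern, "X", text)
removed = re.sub(pattern, "", text)

print(result)   # TempoXXXXX000XXXXXXX66isitXXXXXXXStr
print(removed)  # Tempo00066isitStr
```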
Replacing a substring case-insensitively
Real-life data often contains several versions of the same word that differ only in upper and lower case. Listing every variant in the pattern would be tedious and error-prone. The sub() function accepts the flag re.IGNORECASE to ignore case when matching. Let’s see how it works.
import re

Mystr = "This IS a sample Istring"
# Replace the substring in the string, ignoring case
Newstr = re.sub(r'is', '**', Mystr, flags=re.IGNORECASE)
print(Newstr)
# Output: Th** ** a sample **tring
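Note that the pattern also matches "is" inside longer words such as "This". If only the standalone word should be replaced, one option is to add word boundaries (\b) to the pattern. A minimal sketch using the same input:

```python
import re

text = "This IS a sample Istring"
# \b requires a word boundary on both sides of "is"
result = re.sub(r"\bis\b", "**", text, flags=re.IGNORECASE)
print(result)  # This ** a sample Istring
```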
Removing multiple characters using regex
The sub() function can easily remove multiple characters from a string by listing them in a character class. Below is an example.
import re

Mystr = "This is a sample string"
pattern = r'[hsa]'
# Remove characters 'h', 's' and 'a' from the string
Newstr = re.sub(pattern, '', Mystr)
print(Newstr)
# Output: Ti i  mple tring
# (two spaces remain where the word "a" was removed)
Replacing using join()
Another way to remove or replace characters is to iterate through the string and check them against some condition.
charList = ['h', 's', 'a']
Mystr = "This is a sample string"
# Remove all characters in the list from the string
Newstr = ''.join(elem for elem in Mystr if elem not in charList)
print(Newstr)
# Output: Ti i  mple tring
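The same pattern can replace characters instead of dropping them by using a conditional expression inside join(). This is a sketch only; the set chars_to_replace and the replacement "X" are assumptions chosen to mirror the earlier examples.

```python
chars_to_replace = {"h", "s", "a"}
text = "This is a sample string"

# Keep each character unless it is in the set, in which case emit "X"
result = "".join("X" if ch in chars_to_replace else ch for ch in text)
print(result)  # TXiX iX X XXmple Xtring
```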
Replacing using join() and filter()
The above example can also be written using the filter() function.
Mystr = "This is a sample string"
charList = ['h', 's', 'a']
# Remove all characters in the list from the string
Newstr = "".join(filter(lambda k: k not in charList, Mystr))
print(Newstr)
# Output: Ti i  mple tring
Replacing numbers
Strings often contain numerical data that needs to be removed or processed separately as a different feature. Let’s go over a few examples to see how this can be implemented.
Using regex
Consider the below string from which we need to remove the numeric data.
import re

Mystr = "Sample string9211 of year 20xx"
pattern = r'[0-9]'
# Match all digits in the string and replace them with an empty string
Newstr = re.sub(pattern, "", Mystr)
print(Newstr)
# Output: Sample string of year xx
In the above code, the pattern r'[0-9]' matches every digit individually.
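If the numbers should be kept for separate processing rather than simply discarded, re.findall() can collect them before they are removed. A minimal sketch; the names numbers and cleaned are illustrative:

```python
import re

text = "Sample string9211 of year 20xx"

# [0-9]+ groups consecutive digits into one match
numbers = re.findall(r"[0-9]+", text)
cleaned = re.sub(r"[0-9]+", "", text)

print(numbers)  # ['9211', '20']
print(cleaned)  # Sample string of year xx
```

The extracted values are strings; convert them with int() if they are needed as numbers.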
Using join() function
We can also iterate over the string and filter out the digits using the isdigit() method, which returns True for digit characters and False for everything else, such as letters and spaces.
Mystr = "Sample string9211 of year 20xx"
# Iterate over the characters and join all of them except digits
Newstr = "".join(item for item in Mystr if not item.isdigit())
print(Newstr)
# Output: Sample string of year xx
Using join() and filter()
Similarly, we can put the condition in the filter() function, which keeps only the characters for which the function returns True.
Mystr = "Sample string9211 of year 20xx"
# Filter out the digits and join the remaining characters
Newstr = "".join(filter(lambda item: not item.isdigit(), Mystr))
print(Newstr)
# Output: Sample string of year xx
Before you go
We covered a number of examples showing different ways to remove or replace characters, whitespace, and numbers in a string. We highly recommend trying out variations of the above examples as well as examples of your own.