Programs

Stemming & Lemmatization in Python: Which One To Use?

Natural Language Processing (NLP) is a communication processing technique that involves extracting important features from the language. It is an advancement in Artificial intelligence that involves building intelligent agents with previous experience. The previous experience here refers to the training that is performed over humongous datasets involving textual data from sources including social media, web scraping, survey forms, and many other data collection techniques.

The initial step after data gathering is the cleaning of this data and conversion into the machine-readable form, the numerical form that the machine can interpret. While the conversion process is a whole another thing, the cleaning process is the first step to be performed. In this cleaning task, inflection is an important concept that needs a clear understanding before moving on to stemming and lemmatization. 

Inflection

We know textual data comprises sentences with words and other characters that may or may not impact our predictions. The sentences comprise words and the words which are commonly used such as is, there, and, are called stop words. These can be removed easily by forming a corpus for them, but what about different forms of the same word? 

You don’t want your machine to consider ‘study’ and ‘studying’ as different words as the intent behind these words remains the same and both convey the same meaning. Handling this type of case is a common practice in NLP, and this is known as inflection. This is the base idea of stemming and lemmatization with different approaches. Let’s discover the differences between them and have a look at which one is better to use. 

Stemming

It is one of the text normalization techniques that focuses on reducing the ambiguity of words. The stemming focuses on stripping the word round to the stem word. It does so by removing the prefixes or suffixes, depending upon the word under consideration. This technique reduces the words according to the defined set of rules. 

The resulted words may or may not have any actual meaningful root words. Its main purpose is to form groups of similar words together so that further preprocessing can be optimized. For example, words like play, playing, and played all belong to the stem word “play”. This also helps in reducing the search time in search engines, as now more focus is given on the key element. 

Two cases need to be discussed regarding stemming, i.e., over steaming and under stemming. While removing the prefixes and suffixes from the word solves some cases, some words are stripped more than the requirements.

This can lead to more trash words with no meanings. Though this is the disadvantage of stemming as a whole, and if it happens more drastically, it is known as over stemming. Under stemming is the reverse where the stemming process results in very little or difference in words.

Lemmatization

Another approach for normalizing the text and converting them to root meanings is Lemmatization. This has the same motive of grouping similar intent words into one group, but the difference is that here the resultant words are meaningful.

They are not stripped off with pre-defined rules but are formed using a dictionary or we call it Lemma. Here the process of conversion takes more time because first, the words are matched with their parts of speech, which itself is time taking process. 

This ensures that the root word has a literal meaning that helps in deriving good results in analysis. This is useful when we don’t want to spend much time on data cleaning, and cleaner data is required for further analysis. One drawback of this technique is that as it focuses more on the grammar of the words, different languages would require separate corpora leading to more and more data handling. 

Checkout: Deep Learning Project Ideas for Beginners

Which One to Use?

Now comes the point of picking the one between the two of them. It is highly subjective to choose anyone as the use case you are targeting plays a major role here. 

If you want to analyze a chunk of text but time is a constraint, then you can opt for stemming as it performs this action in less time but with a low success rate, and the stems are provided via an algorithmic way that may not have any meaning. 

Adopting Lemmatization gives an added advantage of getting meaningful and accurate root words clubbed from different forms. If you can afford good computing resources with more time, then this is can be a better choice. This should be adopted where we want precise analysis. It can also be the case of some searching techniques on the search engines where the root word is enough to fetch the results user wants. 

Python Implementation

The NLTK (Natural Language Tool Kit) package is the Python implementation of the tasks around the NLP. This library has all the required tools such as Stemmers. Lemmatizers, stop words removal, creating custom parser trees, and much more. It also contains the corpus data from prominent sources included in the package itself. 

The stemming technique has many implementations, but the most popular and oldest one is the Porter Stemmer algorithm. Snowball stemmer is also used in some projects. For understanding the difference between stemming and lemmatization more clearly, look at the code below and the output of the same:

import nltk

from nltk.stem import PorterStemmer

from nltk.stem import WordNetLemmatizer

word_stemmer = PorterStemmer()

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize(‘flies’))

print(word_stemmer.stem(‘flies’))

Output:

fly

fli

The first output is from the lemmatizer and the second from the stemmer. You can see the difference that the lemmatizer gave the root word as the output while the stemmer just trimmed the word from the end. 

Also Read: Machine Learning Project Ideas

Conclusion

NLP is growing every day and new methods evolve with time. Most of them focus on how to efficiently extract the right information from the text data with minimum loss and eliminating all the noises. Both the techniques are popularly used. All it matters is that the analysis is carried on clean data.

If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.

Lead the AI Driven Technological Revolution

PG DIPLOMA IN MACHINE LEARNING AND ARTIFICIAL INTELLIGENCE
ENROLL NOW @ UPGRAD

Leave a comment

Your email address will not be published.

Accelerate Your Career with upGrad

Our Popular Machine Learning Course

×