Lemmatization and Stemming in Python
Updated on Mar 25, 2026 | 9 min read | 8.2K+ views
Stemming and lemmatization are text normalization methods in Natural Language Processing. They reduce words to a base form to simplify analysis. Stemming trims words using rules, while lemmatization returns meaningful base words using vocabulary. In Python, the NLTK library provides simple tools to apply both techniques in real tasks.
In this blog, you will learn what these techniques mean, how they differ, and how to apply them in Python with clear examples.
If you want to go beyond the basics of NLP and build real expertise, explore upGrad’s Artificial Intelligence courses and gain hands-on skills from experts today!
To get started with Natural Language Processing, you must first ask: what are stemming and lemmatization? Both processes reduce a word to its base form, but they go about it in very different ways.
Stemming is a rough, rule-based process that chops off the ends of words in the hope of reaching the root. For example, a stemmer might turn "laziness" into "lazi." It is fast and efficient but often results in "stems" that aren't actually real words.
Lemmatization in Python, on the other hand, is a more sophisticated approach. It uses a dictionary (like WordNet) and considers the context of the word to find its "lemma," or dictionary form. If you lemmatize "better" as an adjective, you get "good." This requires more computational power because the computer has to understand the part of speech, but the results are much more accurate.
Also Read: Types of Natural Language Processing with Examples
The key differences between the two are:
| Feature | Stemming | Lemmatization |
|---|---|---|
| Method | Heuristic (chopping off suffixes) | Morphological analysis (using a dictionary) |
| Accuracy | Lower (may produce non-words) | Higher (produces meaningful base words) |
| Speed | Very Fast | Slower (requires lookups) |
| Use Case | Search engines, large-scale indexing | Chatbots, Sentiment analysis, Q&A |
You can use libraries like NLTK to apply stemming in Python. The `PorterStemmer` reduces words to a base form using rule-based suffix stripping.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "studies", "playing"]
for word in words:
    print(stemmer.stem(word))
```

Output:

```
run
studi
play
```
Also Read: Why Do We Do Stemming in NLP?
Stemming is a good choice when you want quick results and can accept less accurate word forms.
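NLTK actually ships several stemmers with different levels of aggressiveness. A quick sketch comparing three of them on the same word (Porter and Snowball are the common defaults; Lancaster is the most aggressive):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball is also known as "Porter2"
lancaster = LancasterStemmer()

for name, s in [("porter", porter), ("snowball", snowball), ("lancaster", lancaster)]:
    print(name, s.stem("running"))
```

On simple words the stemmers often agree; differences show up on longer, rarer words, which is why it is worth trying more than one on your own data.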
You can use libraries like NLTK to perform lemmatization in Python. The `WordNetLemmatizer` converts words into their meaningful base form using the WordNet vocabulary. (It needs the WordNet data, which you can fetch once with `nltk.download("wordnet")`.)

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["running", "better", "studies"]
for word in words:
    print(lemmatizer.lemmatize(word))
```

Output:

```
running
better
study
```

Without a part-of-speech tag, the lemmatizer treats every word as a noun, which is why "running" and "better" come back unchanged here.
Lemmatization works better when you pass the correct part of speech. Here, `wordnet.VERB` (the string `"v"`) tells the lemmatizer to treat "running" as a verb.

```python
from nltk.corpus import wordnet

print(lemmatizer.lemmatize("running", pos=wordnet.VERB))
```

Output:

```
run
```

Also Read: What Is Tokenization and Stemming Techniques In NLP?
Lemmatization is the better choice when you want meaningful and accurate word forms instead of quick approximations.
You choose between lemmatization and stemming in Python based on your goal. Each method fits different types of tasks.
Pick stemming when you need speed and can ignore exact meaning.
Quick decision guide:

| Situation | Best choice |
|---|---|
| Speed is priority | Stemming |
| Accuracy is priority | Lemmatization |
| Large raw text data | Stemming |
Pick the method that matches your task, not just the tool.
Also Read: NLP Models in Machine Learning and Deep Learning
You use lemmatization and stemming in Python to clean and simplify text data. Stemming gives quick results with less accuracy, while lemmatization provides meaningful words with better precision. Choose based on your task: use stemming for speed and lemmatization when your model needs correct context and understanding.
Want personalized guidance on AI and upskilling opportunities? Connect with upGrad’s experts for a free 1:1 counselling session today!
Stemming and lemmatization are two ways to shorten words to their base form so a computer can understand them better. Stemming is like a fast pair of scissors that just snips off the ends of words, sometimes leaving them incomplete. Lemmatization is like a smart librarian who looks the word up in a dictionary to find its proper root, ensuring the final result is always a real word.
You should use lemmatization when the actual meaning and "real-word" status of the root are important for your project. This is common in chatbots, translation tools, and sentiment analysis where the context matters. Stemming is better when you have a huge amount of data and need to group similar words together very quickly, such as in a search engine's index.
The NLTK library is the best place to start because it offers a wide variety of stemmers and lemmatizers for educational purposes. For professional, high-speed production environments, the SpaCy library is often preferred because it performs lemmatization automatically as part of its processing pipeline. Both libraries are free and have massive community support in 2026.
The biggest disadvantage of stemming is "over-stemming" or "under-stemming." This happens when the algorithm chops off too much of a word, turning "university" into "univers," or not enough, leaving two related words with different stems. Because it doesn't use a dictionary, it often creates stems that are not real English words, which can be confusing for some downstream tasks.
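Both failure modes are easy to demonstrate with NLTK's `PorterStemmer`; a small sketch:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: two unrelated words collapse into the same stem
print(stemmer.stem("university"))  # -> univers
print(stemmer.stem("universe"))   # -> univers

# Under-stemming: two related words end up with different stems
print(stemmer.stem("alumnus"))  # -> alumnu
print(stemmer.stem("alumni"))   # -> alumni
```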
Yes, but it is much more complex for some languages than others. While English lemmatization is well-supported, languages with complex grammar like Arabic, Turkish, or even Hindi require specialized dictionaries and more advanced models. Most modern Python libraries now offer multilingual support, but the accuracy will vary depending on the availability of linguistic data for that specific language.
A "stem" is the part of the word that remains after you remove its suffixes; it doesn't have to be a valid word. A "lemma" is the actual dictionary form of a word, also known as its "canonical form." For example, for the word "saw," a rule-based stemmer simply returns whatever its rules leave behind, but the lemma would be "see" (if used as a verb) or "saw" (if used as a noun).
Yes, you can use other libraries like SpaCy, TextBlob, or Gensim. SpaCy is particularly popular in 2026 because it is built for speed and handles lemmatization automatically when you process a string of text. Many developers prefer SpaCy because it is "opinionated," meaning it chooses the best algorithms for you so you don't have to pick between different stemmers manually.
Generally, no. Stemming and lemmatization algorithms are designed to work on alphabetic strings and will usually ignore numbers or punctuation. If your dataset contains a lot of "noise" like hashtags or emojis, you should clean those out using "Regular Expressions" in Python before you start the stemming or lemmatization process.
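A typical pre-cleaning step with regular expressions might look like the sketch below (the sample string, the URL, and the exact patterns are illustrative, not a fixed recipe):

```python
import re

raw = "Loving the new #Python course!! Visit https://example.com @user"
text = re.sub(r"https?://\S+", " ", raw)   # drop URLs
text = re.sub(r"[@#]\w+", " ", text)       # drop hashtags and mentions
text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop digits, punctuation, emojis
text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
print(text)  # -> "Loving the new course Visit"
```

The cleaned string is then ready to be tokenized and stemmed or lemmatized.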
Providing Part of Speech (POS) tags helps the lemmatizer understand the context. For example, the word "meeting" could be a noun ("The meeting was long") or a verb ("I am meeting him"). Without a POS tag, the lemmatizer might not know whether to leave it as "meeting" or change it to "meet." Adding this context significantly improves the accuracy of the results.
In SEO, lemmatization helps search engines understand that a user searching for "running shoes" might also be interested in "run shoes" or "runner shoes." By normalizing the keywords, search engines can provide more relevant results. This is why content writers often use different variations of a keyword; the search engine's NLP brain will eventually map them all back to the same core intent.
Yes, they are still very relevant for "pre-processing" data. While large models like GPT-4 can understand different word forms on their own, using lemmatization can help reduce the size of your vocabulary and make smaller, specialized models much more efficient. It is also a vital step in "Topic Modeling" and "Word Clouds," where you want to count how many times a core concept appears.
Pavan Vadapalli is the Director of Engineering, bringing over 18 years of experience in software engineering, technology leadership, and startup innovation. Holding a B.Tech and an MBA from the India...