Lemmatization and Stemming in Python
Updated on Mar 25, 2026 | 9 min read | 8.2K+ views
Stemming and lemmatization are text normalization methods in Natural Language Processing. They reduce words to a base form to simplify analysis. Stemming trims words using rules, while lemmatization returns meaningful base words using vocabulary. In Python, the NLTK library provides simple tools to apply both techniques in real tasks.
In this blog, you will learn what these techniques mean, how they differ, and how to apply them in Python with clear examples.
If you want to go beyond the basics of NLP and build real expertise, explore upGrad’s Artificial Intelligence courses and gain hands-on skills from experts today!
To get started with Natural Language Processing, you must first ask: what are stemming and lemmatization? Both processes reduce a word to its base form, but they go about it in very different ways.
Stemming is a rough, rule-based process that chops off the ends of words in the hope of reaching the root. For example, a stemmer might turn "laziness" into "lazi." It is fast and efficient but often results in "stems" that aren't actually real words.
Lemmatization in Python, on the other hand, is a more sophisticated approach. It uses a dictionary (like WordNet) and considers the context of the word to find its "lemma," or dictionary form. If you lemmatize "better" as an adjective, you get "good." This requires more computational power because the computer has to understand the part of speech, but the results are much more accurate.
Also Read: Types of Natural Language Processing with Examples
The key differences between the two are:
| Feature | Stemming | Lemmatization |
|---|---|---|
| Method | Heuristic (chopping off suffixes) | Morphological analysis (using a dictionary) |
| Accuracy | Lower (may produce non-words) | Higher (produces meaningful base words) |
| Speed | Very Fast | Slower (requires lookups) |
| Use Case | Search engines, large-scale indexing | Chatbots, Sentiment analysis, Q&A |
You can use libraries like NLTK to apply stemming in Python. The `PorterStemmer` reduces words to a base form using rule-based suffix stripping.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["running", "studies", "playing"]
for word in words:
    print(stemmer.stem(word))
```

Output:

```
run
studi
play
```
Also Read: Why Do We Do Stemming in NLP?
Stemming is a good choice when you want quick results and can accept less accurate word forms.
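NLTK actually ships several stemmers with different levels of aggressiveness. A quick sketch comparing three of them on the same word (Porter and Snowball are the common defaults; Lancaster is the most aggressive):

```python
from nltk.stem import PorterStemmer, SnowballStemmer, LancasterStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")  # Snowball is also known as "Porter2"
lancaster = LancasterStemmer()

for name, s in [("porter", porter), ("snowball", snowball), ("lancaster", lancaster)]:
    print(name, s.stem("running"))
```

On simple words the stemmers often agree; differences show up on longer, rarer words, which is why it is worth trying more than one on your own data.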
You can use libraries like NLTK to perform lemmatization in Python. The `WordNetLemmatizer` converts words into their meaningful base form using the WordNet vocabulary. (It needs the WordNet data, which you can fetch once with `nltk.download("wordnet")`.)

```python
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
words = ["running", "better", "studies"]
for word in words:
    print(lemmatizer.lemmatize(word))
```

Output:

```
running
better
study
```

Without a part-of-speech tag, the lemmatizer treats every word as a noun, which is why "running" and "better" come back unchanged here.
Lemmatization works better when you pass the correct part of speech. Here, `wordnet.VERB` (the string `"v"`) tells the lemmatizer to treat "running" as a verb.

```python
from nltk.corpus import wordnet

print(lemmatizer.lemmatize("running", pos=wordnet.VERB))
```

Output:

```
run
```

Also Read: What Is Tokenization and Stemming Techniques In NLP?
Lemmatization is the better choice when you want meaningful and accurate word forms instead of quick approximations.
You choose between lemmatization and stemming in Python based on your goal. Each method fits different types of tasks.
Pick stemming when you need speed and can ignore exact meaning.
Quick decision guide:

| Situation | Best choice |
|---|---|
| Speed is priority | Stemming |
| Accuracy is priority | Lemmatization |
| Large raw text data | Stemming |
Pick the method that matches your task, not just the tool.
Also Read: NLP Models in Machine Learning and Deep Learning
You use lemmatization and stemming in Python to clean and simplify text data. Stemming gives quick results with less accuracy, while lemmatization provides meaningful words with better precision. Choose based on your task: use stemming for speed and lemmatization when your model needs correct context and understanding.
Want personalized guidance on AI and upskilling opportunities? Connect with upGrad’s experts for a free 1:1 counselling session today!
Stemming and lemmatization are two ways to shorten words to their base form so a computer can understand them better. Stemming is like a fast pair of scissors that just snips off the ends of words, sometimes leaving them incomplete. Lemmatization is like a smart librarian who looks the word up in a dictionary to find its proper root, ensuring the final result is always a real word.
You should use lemmatization when the actual meaning and "real-word" status of the root are important for your project. This is common in chatbots, translation tools, and sentiment analysis where the context matters. Stemming is better when you have a huge amount of data and need to group similar words together very quickly, such as in a search engine's index.
The NLTK library is the best place to start because it offers a wide variety of stemmers and lemmatizers for educational purposes. For professional, high-speed production environments, the SpaCy library is often preferred because it performs lemmatization automatically as part of its processing pipeline. Both libraries are free and have massive community support in 2026.
The biggest disadvantage of stemming is "over-stemming" or "under-stemming." This happens when the algorithm chops off too much of a word, turning "university" into "univers," or not enough, leaving two related words with different stems. Because it doesn't use a dictionary, it often creates stems that are not real English words, which can be confusing for some downstream tasks.
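Both failure modes are easy to demonstrate with NLTK's `PorterStemmer`; a small sketch:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Over-stemming: two unrelated words collapse into the same stem
print(stemmer.stem("university"))  # -> univers
print(stemmer.stem("universe"))   # -> univers

# Under-stemming: two related words end up with different stems
print(stemmer.stem("alumnus"))  # -> alumnu
print(stemmer.stem("alumni"))   # -> alumni
```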
Yes, but it is much more complex for some languages than others. While English lemmatization is well-supported, languages with complex grammar like Arabic, Turkish, or even Hindi require specialized dictionaries and more advanced models. Most modern Python libraries now offer multilingual support, but the accuracy will vary depending on the availability of linguistic data for that specific language.
A "stem" is the part of the word that remains after you remove its suffixes; it doesn't have to be a valid word. A "lemma" is the actual dictionary form of a word, also known as its "canonical form." For example, for the word "saw," a rule-based stemmer simply returns whatever its rules leave behind, but the lemma would be "see" (if used as a verb) or "saw" (if used as a noun).
Yes, you can use other libraries like SpaCy, TextBlob, or Gensim. SpaCy is particularly popular in 2026 because it is built for speed and handles lemmatization automatically when you process a string of text. Many developers prefer SpaCy because it is "opinionated," meaning it chooses the best algorithms for you so you don't have to pick between different stemmers manually.
Generally, no. Stemming and lemmatization algorithms are designed to work on alphabetic strings and will usually ignore numbers or punctuation. If your dataset contains a lot of "noise" like hashtags or emojis, you should clean those out using "Regular Expressions" in Python before you start the stemming or lemmatization process.
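A typical pre-cleaning step with regular expressions might look like the sketch below (the sample string, the URL, and the exact patterns are illustrative, not a fixed recipe):

```python
import re

raw = "Loving the new #Python course!! Visit https://example.com @user"
text = re.sub(r"https?://\S+", " ", raw)   # drop URLs
text = re.sub(r"[@#]\w+", " ", text)       # drop hashtags and mentions
text = re.sub(r"[^A-Za-z\s]", " ", text)   # drop digits, punctuation, emojis
text = re.sub(r"\s+", " ", text).strip()   # collapse whitespace
print(text)  # -> "Loving the new course Visit"
```

The cleaned string is then ready to be tokenized and stemmed or lemmatized.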
Providing Part of Speech (POS) tags helps the lemmatizer understand the context. For example, the word "meeting" could be a noun ("The meeting was long") or a verb ("I am meeting him"). Without a POS tag, the lemmatizer might not know whether to leave it as "meeting" or change it to "meet." Adding this context significantly improves the accuracy of the results.
In SEO, lemmatization helps search engines understand that a user searching for "running shoes" might also be interested in "run shoes" or "runner shoes." By normalizing the keywords, search engines can provide more relevant results. This is why content writers often use different variations of a keyword; the search engine's NLP brain will eventually map them all back to the same core intent.
Yes, they are still very relevant for "pre-processing" data. While large models like GPT-4 can understand different word forms on their own, using lemmatization can help reduce the size of your vocabulary and make smaller, specialized models much more efficient. It is also a vital step in "Topic Modeling" and "Word Clouds," where you want to count how many times a core concept appears.
Pavan Vadapalli is the Director of Engineering, bringing over 18 years of experience in software engineering, technology leadership, and startup innovation. Holding a B.Tech and an MBA from the India...