
Tokenization in Natural Language Processing

Updated on Dec 30, 2024

Table of Contents

Python Split()
Regular Expression
NLTK Tokenizers
SpaCy Tokenizers
Keras Tokenization
Gensim Tokenizer
Before You Go

When working with text data, the first step is usually to tokenize it. Tokens are the smallest units we choose to work with: individual words, sentences, or any other minimal unit. Tokenization is simply the process of breaking text into those units.

By the end of this tutorial, you will know:

What tokenization is
The different types of tokenization
The different ways to tokenize text

Tokenization is the most fundamental step in an NLP pipeline.

But why is that?

Computers cannot work with raw text directly, so the tokens are cleaned, pre-processed, and then converted into numeric vectors by a process called vectorization. These vectors can then be fed to machine learning algorithms and neural networks.
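As a rough illustration of that last step, the sketch below maps tokens to integer ids using a toy vocabulary. The build_vocab and encode helpers are hypothetical names made up for this example; real pipelines would normally rely on a library vectorizer instead.

```python
def build_vocab(tokens):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def encode(tokens, vocab):
    """Replace every token with its id from the vocabulary."""
    return [vocab[tok] for tok in tokens]


tokens = ["tokenization", "is", "essential", "tokenization", "is", "fun"]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)

print(vocab)  # {'tokenization': 0, 'is': 1, 'essential': 2, 'fun': 3}
print(ids)    # [0, 1, 2, 0, 1, 3]
```

Repeated tokens receive the same id, which is what lets a model treat them as the same word.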

Tokenization can happen at the word level or at the sentence level; that is, the tokens can be either words or sentences. Let's look at a few ways to perform both.

Python Split()

Python's built-in str.split() method returns a list of tokens obtained by splitting the string at the given separator. With no argument, it splits on any run of whitespace.

Word Tokenization

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."
Tokens = Mystr.split()

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods,', 'and', 'ways?', 'Tokenization', 'is', 'essential', 'in', 'NLP', 'tasks.']
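Calling split() with no argument is not the same as passing a single space explicitly. The short sketch below, using a made-up string with a double space, a tab, and a newline, shows the difference.

```python
messy = "Tokenization  is\tessential\nin NLP"

# No argument: split on any run of whitespace (spaces, tabs, newlines).
default_tokens = messy.split()

# Explicit single space: every space is a separator, so the double space
# yields an empty string, and the tab and newline stay inside a token.
space_tokens = messy.split(" ")

print(default_tokens)  # ['Tokenization', 'is', 'essential', 'in', 'NLP']
print(space_tokens)    # ['Tokenization', '', 'is\tessential\nin', 'NLP']
```

For word tokenization, the argument-free form is usually the safer choice.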

Sentence Tokenization

The same text can be split into sentences by passing "." as the separator.

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."

Tokens = Mystr.split(".")

#Output:
>> ['This is a tokenization tutorial', ' We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks', '']

Although this looks simple, it has several flaws. The split at the final "." leaves an empty string at the end of the list, and the second piece keeps its leading space. It also does not treat "?" as a sentence boundary, because split() accepts only one separator, which here is ".".
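Some of these flaws can be patched by hand. The sketch below strips the stray whitespace and drops empty pieces; it is only a partial fix, since the "?" boundary is still missed.

```python
text = ("This is a tokenization tutorial. We are learning different "
        "tokenization methods, and ways? Tokenization is essential in NLP tasks.")

# Strip surrounding whitespace from each piece and drop empty pieces.
sentences = [part.strip() for part in text.split(".") if part.strip()]

print(len(sentences))  # 2
print(sentences[0])    # This is a tokenization tutorial
print(sentences[1])    # still two sentences joined at the "?"
```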

Real-world text is messy and rarely arrives as neatly separated words and sentences. Stray symbols and inconsistent punctuation make this approach unreliable, so let's move on to more robust ways of tokenizing.

Regular Expression

A regular expression (RegEx) is a sequence of characters that defines a search pattern. We use regular expressions to find particular patterns, words, or characters in text so that we can replace them or operate on them in some other way. Python's built-in re module provides RegEx support. Let's see how to tokenize text with it.

Word Tokenization

import re

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = re.findall(r"\w+", Mystr)

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']

So, what happened here?

The re.findall() function returns every non-overlapping match of the pattern as a list. In the expression \w+, \w matches any single letter, digit, or underscore ("_"), and "+" means one or more of them in a row. Each token is therefore an unbroken run of such characters, and a token ends as soon as the scan hits whitespace or any other character, such as punctuation.

Notice that "NLP's" is a single word, but our pattern broke it into "NLP" and "s" because the apostrophe is not matched by \w.
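One way to keep such words together is to allow an apostrophe only when it sits between word characters. The pattern below is a sketch of that idea rather than a complete solution: it handles the straight apostrophe (') only, so curly apostrophes would need to be normalized first.

```python
import re

text = "Tokenization is essential in NLP's tasks."

# Plain \w+ stops at the apostrophe.
plain = re.findall(r"\w+", text)

# \w+ optionally followed by groups of "apostrophe + word characters".
with_apostrophes = re.findall(r"\w+(?:'\w+)*", text)

print(plain)             # ['Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']
print(with_apostrophes)  # ['Tokenization', 'is', 'essential', 'in', "NLP's", 'tasks']
```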

Sentence Tokenization

import re

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = re.split(r"[.!?] ", Mystr)

#Output:
>> ['This is a tokenization tutorial', 'We are learning different tokenization methods, and ways', "Tokenization is essential in NLP's tasks."]

Here we combined several splitting characters into one pattern and called re.split(). Whenever the scan hits ".", "!" or "?" followed by a space, a new sentence starts. The last sentence keeps its full stop because no space follows it. This is an advantage of RegEx over Python's str.split(), which accepts only a single fixed separator.
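If the sentence-ending punctuation should be kept, a lookbehind can be used so that only the whitespace after it is consumed. This is a minimal sketch; it would still split wrongly after abbreviations such as "Dr.".

```python
import re

text = ("This is a tokenization tutorial. We are learning different "
        "tokenization methods, and ways? Tokenization is essential in NLP's tasks.")

# Split on whitespace that follows ., ! or ?. The lookbehind (?<=...) only
# checks for the punctuation, so it stays attached to its sentence.
sentences = re.split(r"(?<=[.!?])\s+", text)

for sentence in sentences:
    print(sentence)
```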

NLTK Tokenizers

Natural Language Toolkit (NLTK) is a Python library built specifically for NLP tasks. It ships with modules and functions for the individual stages of an NLP pipeline. Let's look at how NLTK handles tokenization.

Word Tokenization

NLTK has a separate module, nltk.tokenize, for tokenization tasks. For word tokenization, one of the functions it provides is word_tokenize.

# Requires the Punkt tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases may ask for 'punkt_tab' instead).
from nltk.tokenize import word_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
word_tokenize(Mystr)

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']

Notice that word_tokenize returns punctuation marks as separate tokens. If we do not want them, we need to remove punctuation and special characters before this step, or filter the punctuation tokens out afterwards.
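Filtering afterwards can be done with the standard library alone. In the sketch below the token list is hard-coded to resemble the kind of output word_tokenize produces, so that the example runs without NLTK installed.

```python
import string

# Hard-coded tokens of the kind word_tokenize returns (assumption: written
# by hand for this example, not produced by calling NLTK here).
tokens = ["We", "are", "learning", "methods", ",", "and", "ways", "?"]

# Keep a token unless every character in it is ASCII punctuation.
words = [tok for tok in tokens
         if not all(ch in string.punctuation for ch in tok)]

print(words)  # ['We', 'are', 'learning', 'methods', 'and', 'ways']
```

A token such as "'s" survives this filter because it contains a letter, which is often the desired behaviour.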

Sentence Tokenization

from nltk.tokenize import sent_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
sent_tokenize(Mystr)

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

SpaCy Tokenizers

spaCy is one of the most advanced libraries for NLP tasks and supports dozens of languages. For tokenization alone we do not need to download a trained pipeline: importing the English class from spacy.lang.en and instantiating it gives a blank English pipeline that contains only the tokenizer. The tagger, parser, NER, and word vectors come with the trained pipelines, such as en_core_web_sm, which are downloaded separately.

Word Tokenization

from spacy.lang.en import English

nlp = English()
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
my_doc = nlp(Mystr)

Tokens = []
for token in my_doc:
    Tokens.append(token.text)
Tokens

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']

Here, when we call nlp with Mystr, it returns a document made up of word tokens. We then iterate over the tokens and store their text in a separate list.

Sentence Tokenization

from spacy.lang.en import English

nlp = English()
# spaCy v3 adds built-in components by name; in spaCy v2 this was
# nlp.add_pipe(nlp.create_pipe('sentencizer')).
nlp.add_pipe("sentencizer")

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

my_doc = nlp(Mystr)

Sents = []
for sent in my_doc.sents:
    Sents.append(sent.text)
Sents

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

For sentence tokenization, we add the sentencizer component to the pipeline with add_pipe. This rule-based component marks the sentence boundaries. When we pass the text to the nlp object, the resulting document exposes the sentences through its sents attribute, and we collect them in a list in the same way as in the previous example.

Keras Tokenization

Keras is one of the most popular deep learning frameworks. It also offers a dedicated module for text processing, keras.preprocessing.text, whose text_to_word_sequence function creates word-level tokens from text. This module is deprecated in recent releases and may not be available in Keras 3, so the import below assumes an older Keras 2.x installation. Let's have a quick look.

from keras.preprocessing.text import text_to_word_sequence

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
Tokens = text_to_word_sequence(Mystr)
Tokens

#Output:
>> ['this', 'is', 'a', 'tokenization', 'tutorial', 'we', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'tokenization', 'is', 'essential', 'in', "nlp's", 'tasks']

Notice that it kept "NLP's" as a single token. This tokenizer also lowercases all the tokens and strips punctuation by default, which saves a cleaning step.
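To make that behaviour concrete, here is a rough standard-library approximation of what text_to_word_sequence does with its default settings. The simple_word_sequence function is a hypothetical stand-in written for this tutorial, not the Keras implementation, and the FILTERS string is assumed to match the documented default.

```python
# Assumed default filter characters; the apostrophe is not among them,
# which is why "nlp's" survives as one token.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=FILTERS, lower=True, split=" "):
    """Lowercase, replace filtered characters with the separator, then split."""
    if lower:
        text = text.lower()
    table = str.maketrans({ch: split for ch in filters})
    return [tok for tok in text.translate(table).split(split) if tok]


tokens = simple_word_sequence("Tokenization is essential in NLP's tasks.")
print(tokens)  # ['tokenization', 'is', 'essential', 'in', "nlp's", 'tasks']
```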

Gensim Tokenizer

Gensim is another popular library for NLP tasks, best known for topic modelling. The gensim.utils module offers a tokenize function that we can use for word tokenization.

Word Tokenization

from gensim.utils import tokenize
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

list(tokenize(Mystr))

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']

Sentence Tokenization

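For sentence tokenization, we use the split_sentences function from the gensim.summarization.textcleaner module. The gensim.summarization package was removed in Gensim 4.0, so this example requires an older release (Gensim 3.x).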

from gensim.summarization.textcleaner import split_sentences

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = split_sentences(Mystr)
Tokens

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

Before You Go

In this tutorial we discussed several ways to tokenize text data, each suited to different applications. Tokenization is an essential step of the NLP pipeline, but the data should be cleaned before it is tokenized.

Pavan Vadapalli
