
Tokenization in Natural Language Processing

Last updated:
1st Dec, 2020

When dealing with textual data, the most basic step is to tokenize the text. A 'token' is a single unit of the text, such as a word, a sentence, or any other minimal unit. Tokenization, then, is nothing but breaking text into these separate units.


By the end of this tutorial, you will know the following:

  • What tokenization is
  • Different types of tokenization
  • Different ways to tokenize

Tokenization is the most fundamental step in an NLP pipeline.


But why is that?

These words or tokens are later converted into numeric values so that the computer can understand and make sense of them. The tokens are cleaned, pre-processed, and then converted into numeric values through 'vectorization'. These vectors can then be fed to machine learning algorithms and neural networks.
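
To make that concrete, here is a minimal sketch (an illustrative addition, not from the original article) of the simplest form of vectorization: assigning each unique token an integer ID. The sentence and vocabulary are made up purely for this example.

# Minimal sketch: turning tokens into numbers (integer IDs)
tokens = "this is a tokenization tutorial".split()

# Assign every unique token an integer ID
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}

# Encode the token sequence as a list of IDs that a model can consume
ids = [vocab[token] for token in tokens]

#Output:
>> vocab: {'a': 0, 'is': 1, 'this': 2, 'tokenization': 3, 'tutorial': 4}
>> ids: [2, 1, 0, 3, 4]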


Tokenization can be done not only at the word level but also at the sentence level; that is, text can be tokenized either with words as tokens or with sentences as tokens. Let's discuss a few ways to perform tokenization.

Python split()

Python's split() method returns a list of tokens split on the character you pass to it. By default, it splits on whitespace.

Word Tokenization

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."
Tokens = Mystr.split()

 

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods,', 'and', 'ways?', 'Tokenization', 'is', 'essential', 'in', 'NLP', 'tasks.']

Sentence Tokenization

The same text can be split into sentences by passing "." as the separator.

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."

Tokens = Mystr.split(".")

 

#Output:
>> ['This is a tokenization tutorial', ' We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks', '']

Though this seems simple and straightforward, it has plenty of flaws. Notice that it also splits after the final ".", leaving an empty string as the last element, and it does not treat "?" as the end of a sentence because it only splits on the single character ".".
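
As a rough illustration of the cleanup this naive approach forces on you, the following sketch (an illustrative addition, not part of the original walkthrough) strips the whitespace and drops the empty trailing element, but it still cannot treat "?" or "!" as sentence boundaries.

# Sketch: cleaning up the naive split(".") result
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."

# Strip surrounding whitespace and drop the empty string left by the trailing "."
Sentences = [s.strip() for s in Mystr.split(".") if s.strip()]
Sentences

#Output:
>> ['This is a tokenization tutorial', 'We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks']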

Text data in real-life scenarios is messy and rarely comes neatly organised into words and sentences. It often contains a lot of garbage text, which makes tokenizing this way very difficult. So let's move on to better, more robust ways of tokenizing.


Regular Expression

A Regular Expression (RegEx) is a sequence of characters used to match a pattern in text. We use RegEx to find certain patterns, words, or characters and then replace them or perform some other operation on them. Python's re module is used for working with RegEx. Let's see how we can tokenize the text using re.

Word Tokenization

import re

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = re.findall(r"[\w]+", Mystr)

 

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']

So, what happened here?

The re.findall() function finds all the substrings that match the given pattern and stores them in a list. The expression "[\w]+" matches any word character, that is, letters, digits, or the underscore ("_"), and the "+" means one or more consecutive occurrences. So it scans the text and closes off a token whenever it hits whitespace or any special character other than the underscore.

Notice that "NLP's" is a single word, but our regex broke it into "NLP" and "s" because of the apostrophe.

Sentence Tokenization

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = re.compile("[.!?] ").split(Mystr)

 

#Output:
>> ['This is a tokenization tutorial', 'We are learning different tokenization methods, and ways', "Tokenization is essential in NLP's tasks."]

Here we combined multiple splitting characters into a single pattern and called split() on the compiled expression. When the text hits any of these three characters followed by a space, it is treated as the end of a sentence. This is an advantage of RegEx over Python's split() method, which cannot split on more than one separator at a time.


NLTK Tokenizers

Natural Language Toolkit (NLTK) is a Python library built specifically for handling NLP tasks. It ships with built-in functions and modules for specific stages of the complete NLP pipeline. Let's have a look at how NLTK handles tokenization.
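
If NLTK is not already set up, a typical one-time installation looks roughly like this; the punkt resource is the pretrained tokenizer data that NLTK's tokenizers rely on (exact resource names can vary between NLTK versions).

# One-time setup for NLTK tokenizers (run once per environment)
# pip install nltk
import nltk
nltk.download("punkt")   # pretrained sentence/word tokenizer data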

Word Tokenization

NLTK has a dedicated module, nltk.tokenize, for handling tokenization tasks. For word tokenization, one of the methods it provides is word_tokenize.

from nltk.tokenize import word_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
word_tokenize(Mystr)

 

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']

 

Notice that word_tokenize treats the punctuation marks as separate tokens. To prevent this, we need to remove all punctuation and special characters before this step.
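
One common way to do that, shown here as an illustrative sketch rather than the article's own recipe, is NLTK's RegexpTokenizer, which keeps only runs of word characters and therefore never emits punctuation tokens.

# Sketch: punctuation-free word tokens via NLTK's RegexpTokenizer
from nltk.tokenize import RegexpTokenizer

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

tokenizer = RegexpTokenizer(r"\w+")   # keep runs of word characters only
tokenizer.tokenize(Mystr)             # no '.', ',' or '?' tokens in the result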

Sentence Tokenization

 

from nltk.tokenize import sent_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
sent_tokenize(Mystr)

 

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

spaCy Tokenizers

spaCy is probably one of the most advanced libraries for NLP tasks, with support for almost 50 languages. A full pretrained English pipeline bundles the tagger, parser, NER, and word vectors and has to be downloaded first; for tokenization alone, importing the English language class from spacy.lang.en, which carries the tokenizer, is enough.
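
The setup is usually done from the command line, roughly as below; en_core_web_sm is assumed here as the English model name, and downloading it is only needed for the full pretrained pipeline, not for the blank tokenizer used in the next snippet.

# One-time setup for spaCy (command line)
pip install spacy
python -m spacy download en_core_web_sm   # optional: full English pipeline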

Word Tokenization 

from spacy.lang.en import English

nlp = English()
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
my_doc = nlp(Mystr)

Tokens = []
for token in my_doc:
    Tokens.append(token.text)
Tokens 

 

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']

 

Here, when we call nlp on Mystr, it creates the word tokens. We then iterate over them and store each token's text in a separate list.

Sentence Tokenization

 

from spacy.lang.en import English

nlp = English()
sent_tokenizer = nlp.create_pipe("sentencizer")
nlp.add_pipe(sent_tokenizer)

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

my_doc = nlp(Mystr)

Sents = []
for sent in my_doc.sents:
    Sents.append(sent.text)
Sents

 

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

For sentence tokenization, we call the create_pipe method to create the sentencizer component, which generates sentence tokens, and then add it to the nlp object's pipeline. This time, when we pass the text string to the nlp object, it creates sentence tokens as well. They can then be collected into a list in the same way as in the previous example.
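
The snippet above follows the spaCy v2 API. If you are on spaCy v3 (an assumption about your installed version), the sentencizer is added by name instead, roughly like this:

# Sketch of the equivalent setup on spaCy v3.x, where components are added by name
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")   # replaces create_pipe(...) followed by add_pipe(...)

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
my_doc = nlp(Mystr)

Sents = [sent.text for sent in my_doc.sents]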

Keras Tokenization

Keras is one of the most popular deep learning frameworks today. Keras also offers a dedicated module for text processing tasks, keras.preprocessing.text. This module provides the text_to_word_sequence function, which creates word-level tokens from the text. Let's have a quick look.

 

from keras.preprocessing.text import text_to_word_sequence

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
Tokens = text_to_word_sequence(Mystr)
Tokens 

 

#Output:
>> ['this', 'is', 'a', 'tokenization', 'tutorial', 'we', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'tokenization', 'is', 'essential', 'in', "nlp's", 'tasks']

Notice that it treated the word "NLP's" as a single token. In addition, this Keras tokenizer lowercased all the tokens, which is a handy bonus.
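
That behaviour comes from the function's default arguments. As a small illustrative sketch (argument names as in recent Keras releases, so verify against your installed version), you can keep the original casing by overriding lower:

# Sketch: keep the original casing instead of lowercasing
Tokens = text_to_word_sequence(Mystr, lower=False)
Tokens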

Gensim Tokenizer

Gensim is another popular library for handling NLP tasks and topic modelling. The gensim.utils module offers a tokenize method, which can be used for our tokenization tasks.

Word Tokenization

from gensim.utils import tokenize
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

list(tokenize(Mystr))

 

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']

Sentence Tokenization

For sentence tokenization, we use the split_sentences method from the gensim.summarization.textcleaner module. (Note that the summarization package was removed in Gensim 4.0, so this snippet needs a Gensim 3.x release.)

 

from gensim.summarization.textcleaner import split_sentences

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = split_sentences(Mystr)
Tokens 

 

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

Before You Go

In this tutorial, we discussed various ways to tokenize your text data depending on the application. Tokenization is an essential step of the NLP pipeline, but the data needs to be cleaned before you proceed to it.

If you’re interested to learn more about machine learning & AI, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.


Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.