
Tokenization in Natural Language Processing


When dealing with textual data, the most basic step is to tokenize the text. 'Tokens' are the individual units a text is broken into, such as words, sentences, or any other minimal unit. Breaking text into these separate units is what we call tokenization.


By the end of this tutorial, you will know the following:

  • What tokenization is
  • Different types of tokenization
  • Different ways to tokenize

Tokenization is the most fundamental step in an NLP pipeline.


But why is that?

These tokens are cleaned, pre-processed, and then converted into numeric values through methods of "vectorization", so that a computer can understand and make sense of them. The resulting vectors can then be fed to machine learning algorithms and neural networks.
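For instance, here is a minimal sketch of vectorization using scikit-learn's CountVectorizer (an assumed helper library, not one of the tokenizers covered in this tutorial), which turns texts into vectors of word counts:

# A minimal vectorization sketch, assuming scikit-learn is installed.
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["Tokenization is essential in NLP tasks.",
          "We are learning different tokenization methods."]

vectorizer = CountVectorizer()          # tokenizes each text and builds a vocabulary
X = vectorizer.fit_transform(corpus)    # sparse matrix of token counts

print(vectorizer.get_feature_names_out())  # learned vocabulary (recent scikit-learn versions)
print(X.toarray())                         # numeric vectors, one row per text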


Tokenization can be done not only at the word level but also at the sentence level: the text is tokenized with either words or sentences as the tokens. Let's discuss a few ways to perform tokenization.

Python split()

Python's split() function returns a list of tokens split on the character you specify. By default, it splits on whitespace.

Word Tokenization

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."
Tokens = Mystr.split()
Tokens

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods,', 'and', 'ways?', 'Tokenization', 'is', 'essential', 'in', 'NLP', 'tasks.']

Sentence Tokenization

The same text can be split into sentences by passing "." as the separator.

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."

Tokens = Mystr.split(".")
Tokens

#Output:
>> ['This is a tokenization tutorial', ' We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks', '']

Though this seems straightforward and simple, it has a lot of flaws. Notice that it also splits after the last ".", leaving an empty string at the end, and it does not treat "?" as the end of a sentence, because split() accepts only a single separator.

Text data in real-life scenarios is messy and rarely comes neatly arranged into words and sentences. A lot of garbage text may be present, which makes it very difficult to tokenize this way. Therefore, let's move on to better and more robust ways of tokenization.


Regular Expression

A regular expression (RegEx) is a sequence of characters that defines a pattern to match against text. We use RegEx to find certain patterns, words, or characters and to replace them or perform other operations on them. Python provides the re module for working with regular expressions. Let's see how we can tokenize text using re.

Word Tokenization

import re

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = re.findall(r"[\w]+", Mystr)
Tokens

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']

So, what happened here?

The re.findall() function finds all the sequences that match the pattern and stores them in a list. The expression "[\w]+" matches any word character: letters, digits, or the underscore ("_"). The "+" means one or more occurrences of the pattern. So essentially, it scans the characters and emits a token whenever it hits whitespace or any special character other than an underscore.

Please notice that "NLP's" is a single word, but our regular expression broke it into "NLP" and "s" because the apostrophe is not a word character.
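If you want to keep such contractions together, one option is to include the apostrophe in the character class. This is a small sketch, assuming the text uses a straight apostrophe:

import re

text = "Tokenization is essential in NLP's tasks."
# The apostrophe inside the character class keeps "NLP's" as one token
re.findall(r"[\w']+", text)

#Output:
>> ['Tokenization', 'is', 'essential', 'in', "NLP's", 'tasks']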

Sentence Tokenization

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = re.compile(r"[.!?] ").split(Mystr)
Tokens

#Output:
>> ['This is a tokenization tutorial', 'We are learning different tokenization methods, and ways', "Tokenization is essential in NLP's tasks."]

Here we combined multiple sentence-ending characters into one pattern and called split() on the compiled expression. Whenever it hits any of these three characters followed by a space, it treats what follows as a separate sentence. This is an advantage of RegEx over Python's split() function, which cannot split on multiple characters at once.
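If you also want each sentence to keep its closing punctuation, one alternative is to split on the whitespace that follows a sentence-ending character using a lookbehind. This is a sketch, not part of the original tutorial:

import re

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

# Split at whitespace that is preceded by '.', '!' or '?', keeping the punctuation
re.split(r"(?<=[.!?])\s+", Mystr)

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]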


NLTK Tokenizers

The Natural Language Toolkit (NLTK) is a Python library built specifically for NLP tasks. It ships with built-in functions and modules for the individual steps of a complete NLP pipeline. Let's have a look at how NLTK handles tokenization.

Word Tokenization

NLTK has a dedicated module, nltk.tokenize, for tokenization tasks. For word tokenization, one of the functions it provides is word_tokenize.

import nltk
# nltk.download('punkt')  # run once to fetch the tokenizer models
from nltk.tokenize import word_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
word_tokenize(Mystr)

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']

 

Please notice that word_tokenize treats punctuation marks as separate tokens. To prevent this, we need to remove all punctuation and special characters before this step.
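One simple way to do that is the following sketch using Python's string module, assuming ASCII punctuation is all you need to strip (note that this also removes the apostrophe, so "NLP's" becomes "NLPs"):

import string
from nltk.tokenize import word_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

# Strip ASCII punctuation characters before tokenizing
cleaned = Mystr.translate(str.maketrans("", "", string.punctuation))
word_tokenize(cleaned)

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLPs', 'tasks']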

Sentence Tokenization

 

from nltk.tokenize import sent_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
sent_tokenize(Mystr)

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

SpaCy Tokenizers

spaCy is probably one of the most advanced libraries for NLP tasks, with support for almost 50 languages. For a full pipeline, the first step is to download the English core model; for tokenization alone, it is enough to import the English language class, which loads the tokenizer (the downloaded core model additionally provides the tagger, parser, NER, and word vectors).

Word Tokenization 

from spacy.lang.en import English

nlp = English()
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
my_doc = nlp(Mystr)

Tokens = []
for token in my_doc:
    Tokens.append(token.text)
Tokens

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']

 

Here, when we call nlp with Mystr passed, it creates a Doc object containing the word tokens. We then iterate through the tokens and store them in a separate list.
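If you want the full pipeline (tagger, parser, NER, and word vectors) rather than just the tokenizer, you would load a downloaded core model instead. Here is a sketch, assuming the small English model has been fetched with python -m spacy download en_core_web_sm:

import spacy

# Loads the full English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tokenization is essential in NLP's tasks.")

[token.text for token in doc]

#Output:
>> ['Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']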

Sentence Tokenization

 

from spacy.lang.en import English

nlp = English()
sent_tokenizer = nlp.create_pipe('sentencizer')
nlp.add_pipe(sent_tokenizer)

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

my_doc = nlp(Mystr)

Sents = []
for sent in my_doc.sents:
    Sents.append(sent.text)
Sents

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

For sentence tokenization, we call the create_pipe method to create the sentencizer component, which marks sentence boundaries, and then add it to the nlp pipeline. When we pass the text string to the nlp object this time, it produces sentence tokens, which we collect into a list in the same way as in the previous example.
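Note that create_pipe followed by add_pipe is the spaCy v2 API. In spaCy v3 and later, you add the component by its registered name instead; a small sketch:

from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")   # spaCy v3+: add the component by name

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
my_doc = nlp(Mystr)

[sent.text for sent in my_doc.sents]

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]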

Keras Tokenization

Keras is one of the most popular deep learning frameworks today. Keras also offers a dedicated module for text preprocessing tasks, keras.preprocessing.text. This module has the text_to_word_sequence function, which creates word-level tokens from the text. Let's have a quick look.

 

from keras.preprocessing.text import text_to_word_sequence

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
Tokens = text_to_word_sequence(Mystr)
Tokens

#Output:
>> ['this', 'is', 'a', 'tokenization', 'tutorial', 'we', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'tokenization', 'is', 'essential', 'in', "nlp's", 'tasks']

Please notice that it treats the word "NLP's" as a single token. In addition, this Keras function lowercases all the tokens and strips the punctuation, which is an added bonus.
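If you also want the numeric values discussed at the start of this article, Keras provides a Tokenizer class that learns a vocabulary and converts texts to integer sequences. Here is a short sketch (the exact import path can vary between standalone Keras and tensorflow.keras):

from keras.preprocessing.text import Tokenizer

corpus = ["This is a tokenization tutorial.",
          "Tokenization is essential in NLP's tasks."]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(corpus)                     # builds the word-to-index vocabulary
sequences = tokenizer.texts_to_sequences(corpus)   # each text becomes a list of integers

print(tokenizer.word_index)   # mapping from word to integer id
print(sequences)              # numeric representation of each text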

Gensim Tokenizer

Gensim is another popular library for handling NLP tasks and topic modelling. The module gensim.utils offers a tokenize method, which can be used for our tokenization tasks.

Word Tokenization

from gensim.utils import tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

list(tokenize(Mystr))

#Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']
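The same gensim.utils module also has a simple_preprocess helper, which lowercases the tokens and drops very short or very long ones; a quick sketch (the exact output may vary slightly across Gensim versions):

from gensim.utils import simple_preprocess

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

# Lowercased tokens; one-character tokens such as 'a' and the stranded 's' are dropped
simple_preprocess(Mystr)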

Sentence Tokenization

For sentence tokenization, we use the split_sentences method from the gensim.summarization.textcleaner module. Note that the summarization module was removed in Gensim 4.0, so this requires Gensim 3.x.

 

from gensim.summarization.textcleaner import split_sentences

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."

Tokens = split_sentences(Mystr)
Tokens

#Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]

Before You Go

In this tutorial, we discussed various ways to tokenize your text data depending on the application. Tokenization is an essential step of the NLP pipeline, but the data must be cleaned before proceeding to tokenization.

If you're interested in learning more about machine learning & AI, check out IIIT-B & upGrad's PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.
