How Natural Language Processing is used in Speech Recognition – Download EBook


Natural Language Processing (NLP) is an amalgamation of linguistics and Machine Learning (ML). NLP attempts to parse human-to-human and human-to-computer interactions expressed in language (speech or text) in order to deliver actionable results. Modern NLP is largely an ML application: machines “learn” from large collections of sample data to understand language in its natural form.

Deep dive: The layers of NLP

Language simply doesn’t fit into neat boxes. Because it is inherently ambiguous, trying to parse it into rows and columns is a poor fit. But since computers ultimately process information as zeros and ones, the attempt is unavoidable, and it has already produced many real-world applications. For this reason, NLP is often considered a “Core AI” technology: it is one of the few disciplines that tries to bring computers closer to the way the human brain processes language.

1. Input Processing

Input processing is the step of collecting, archiving, and segmenting speech or text in its raw form, i.e. breaking long passages into smaller, more digestible pieces such as sentences and individual words (tokens).
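As a rough illustration of segmentation, the sketch below splits raw text into sentences and then into tokens using two regular expressions. The names `segment`, `SENTENCE_BOUNDARY`, and `TOKEN` are hypothetical, and the rules are deliberately naive (abbreviations such as “Dr.” will be split incorrectly); production tokenizers handle many more cases.

```python
import re

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
# Assumption: abbreviations like "Dr." are not handled and will be split.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# A token is a run of word characters, optionally with one internal
# apostrophe (e.g. "isn't"), or any single punctuation character.
TOKEN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def segment(text):
    """Split raw text into sentences, then each sentence into tokens."""
    sentences = [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]
    return [TOKEN.findall(sentence) for sentence in sentences]


print(segment("Language is ambiguous. Isn't it? Computers still try!"))
```

Each sentence comes back as its own list of tokens, which is the form the later morphological steps work on.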

For text inputs: If the text isn’t available as machine-readable characters (for example, UTF-8 encoded text) but is embedded in an image, OCR (Optical Character Recognition) software is used to extract it. Because OCR is a mature technology, and image processing has advanced considerably over the years, plenty of open-source OCR software is available for this.
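The routing decision described above can be sketched as a small function that inspects raw input bytes: recognised image formats go to an OCR step, valid UTF-8 is used directly as text, and anything else is flagged. This is a minimal sketch under stated assumptions: `route_input` is a hypothetical name, only PNG and JPEG signatures are checked, and the OCR step itself is not shown. A real pipeline would usually rely on file metadata or a MIME-type check as well.

```python
# Magic bytes at the start of two common image formats.
IMAGE_SIGNATURES = (
    b"\x89PNG\r\n\x1a\n",  # PNG
    b"\xff\xd8\xff",       # JPEG
)


def route_input(data: bytes):
    """Decide how raw input bytes should be processed.

    Returns ("ocr", data) for recognised image formats, ("text", str) for
    valid UTF-8 text, and ("unknown", data) for anything else.
    """
    if data.startswith(IMAGE_SIGNATURES):
        return ("ocr", data)  # hand off to an OCR step (not shown)
    try:
        return ("text", data.decode("utf-8"))
    except UnicodeDecodeError:
        return ("unknown", data)


print(route_input("Grüße".encode("utf-8")))
print(route_input(b"\x89PNG\r\n\x1a\n" + bytes(8))[0])
```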

For speech inputs: Input processing gets more complicated with speech. An entire field, known as speech recognition (speech-to-text), is devoted to this problem, and today it relies heavily on deep learning. Let’s take a small segue into how speech-to-text is accomplished today.
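Speech-to-text front ends typically begin by slicing the audio waveform into short, overlapping frames before extracting acoustic features from each one. The sketch below shows only that framing step, plus per-frame log energy, a simple feature that already separates sound from silence. The 16 kHz sample rate and the 25 ms frame / 10 ms hop sizes are common conventions assumed here, the function names are hypothetical, and the synthetic tone stands in for real recorded speech.

```python
import math


def frame_signal(samples, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Split a waveform into overlapping fixed-length frames.

    25 ms frames with a 10 ms hop are common choices, not requirements.
    """
    frame_len = sample_rate * frame_ms // 1000  # 400 samples at 16 kHz
    hop = sample_rate * hop_ms // 1000          # 160 samples at 16 kHz
    if len(samples) < frame_len:
        return []
    n_frames = 1 + (len(samples) - frame_len) // hop
    return [samples[i * hop : i * hop + frame_len] for i in range(n_frames)]


def log_energy(frame):
    """Log of the frame's energy; a small constant avoids log(0) on silence."""
    return math.log(sum(x * x for x in frame) + 1e-10)


# One second of a synthetic 440 Hz tone followed by half a second of silence.
sr = 16000
tone = [0.5 * math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
signal = tone + [0.0] * (sr // 2)

frames = frame_signal(signal, sr)
energies = [log_energy(f) for f in frames]
print(len(frames), round(energies[0], 2), round(energies[-1], 2))
```

In a full recognizer, each frame would then be turned into richer features (for example, spectral features) and passed to an acoustic model; those stages are outside this sketch.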

2. Morphological Analysis

This phase aims to derive more meaning from the tokens themselves. The processes involved here are:

  • Predicting parts of speech

Each token is tagged as a noun, pronoun, verb, adverb, adjective, and so on. Part-of-speech (POS) tagging requires prediction, because many words play different roles depending on the context in which they are used. For example, in the sentence “The moral law is above the civil”, the word “above” acts as a preposition. In the sentence “Our blessings come from above”, the same word acts as a noun. Many taggers rely on statistical data, such as the surrounding words and common prefixes and suffixes, to tag each token.
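To make the “above” example concrete, here is a toy statistical tagger. It counts tags seen for each (previous word, word) pair in a tiny hand-tagged corpus, backs off to the word’s most frequent tag, then to crude suffix rules, and finally to NOUN. The corpus, the tag names (loosely modelled on Universal POS tags), and the functions `train` and `tag` are all invented for illustration; real taggers are trained on large annotated corpora with far richer features.

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus, invented for illustration only.
TRAIN = [
    [("the", "DET"), ("moral", "ADJ"), ("law", "NOUN"), ("is", "VERB"),
     ("above", "ADP"), ("the", "DET"), ("civil", "ADJ")],
    [("our", "PRON"), ("blessings", "NOUN"), ("come", "VERB"),
     ("from", "ADP"), ("above", "NOUN")],
]

# Crude suffix heuristics, used only for words never seen in training.
SUFFIX_RULES = [("ly", "ADV"), ("ing", "VERB"), ("ed", "VERB"), ("s", "NOUN")]


def train(corpus):
    """Count tags per (previous word, word) pair and per word alone."""
    by_context = defaultdict(Counter)
    by_word = defaultdict(Counter)
    for sentence in corpus:
        prev = "<s>"  # sentence-start marker
        for word, pos in sentence:
            by_context[(prev, word)][pos] += 1
            by_word[word][pos] += 1
            prev = word
    return by_context, by_word


def tag(tokens, by_context, by_word):
    """Tag tokens, backing off: context -> word -> suffix rule -> NOUN."""
    tagged = []
    prev = "<s>"
    for token in tokens:
        word = token.lower()
        if (prev, word) in by_context:
            best = by_context[(prev, word)].most_common(1)[0][0]
        elif word in by_word:
            best = by_word[word].most_common(1)[0][0]
        else:
            best = next(
                (t for suffix, t in SUFFIX_RULES if word.endswith(suffix)), "NOUN"
            )
        tagged.append((token, best))
        prev = word
    return tagged


by_context, by_word = train(TRAIN)
print(tag("Our blessings come from above".split(), by_context, by_word))
print(tag("The law is above the civil".split(), by_context, by_word))
```

Because the previous word is part of the context, the first call tags “above” as NOUN (after “from”) and the second as ADP (after “is”), even though a word-only lookup would treat both occurrences the same way.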

  • Lemmatization/Stemming

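Lemmatization reduces a word to its dictionary base form (its lemma) using a language dictionary; for example, “studies” becomes “study”. Stemming, on the other hand, simply strips suffixes using pattern-matching rules, so it can produce non-words such as “studi”. Lemmatization is generally more accurate than stemming, at the cost of requiring a dictionary (and often the word’s part of speech).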
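The contrast can be sketched in a few lines: a rule-based stemmer that strips the first matching suffix, next to a dictionary-lookup lemmatizer. Both are toys: the suffix rules are a small invented set (this is not the Porter algorithm), and `LEMMA_DICT` is a hypothetical mini-dictionary standing in for a real lexical resource.

```python
# Toy stemming rules, longest suffixes first (not the Porter algorithm).
STEM_RULES = [("ies", "i"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]

# Hypothetical mini-dictionary mapping inflected forms to lemmas.
LEMMA_DICT = {
    "studies": "study",
    "running": "run",
    "walked": "walk",
    "geese": "goose",
    "better": "good",  # adjective sense only; a real lemmatizer would check POS
}


def stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix, replacement in STEM_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: len(word) - len(suffix)] + replacement
    return word


def lemmatize(word):
    """Look the word up in the dictionary; fall back to the word itself."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


for w in ["studies", "running", "walked", "geese", "better"]:
    print(f"{w:8} stem={stem(w):8} lemma={lemmatize(w)}")
```

The stemmer turns “studies” into “studi” and “running” into “runn” and cannot handle irregular forms like “geese” at all, while the dictionary lookup returns real words, but only for entries the dictionary actually contains.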

  • Identifying stop words

Very common words such as ‘a’, ‘and’ and ‘the’ add little meaning on their own and can introduce noise and confusion into statistical processing. While they may be needed to identify parts of speech, they often contribute little beyond that. So some NLP pipelines categorize these words as stop words and filter them out. The choice and number of stop words used can affect the efficiency, and the accuracy, of the entire NLP system.
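A minimal sketch of stop-word filtering, assuming a small hand-picked list (`STOP_WORDS` and `remove_stop_words` are hypothetical names; library-provided lists are much longer):

```python
# Illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "is", "in"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens whose lowercase form is in the stop-word list."""
    return [t for t in tokens if t.lower() not in stop_words]


tokens = ["The", "moral", "law", "is", "above", "the", "civil"]
print(remove_stop_words(tokens))
```

The list is passed in as a parameter so it can be tuned per task. Words such as “not” are deliberately left out here, since dropping them can reverse the meaning of a sentence.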

  • Decompounding

In some languages, such as German, the Scandinavian languages, and the Dravidian languages, compound words are common and need to be split into their component parts. For example, the German word “hellblau” means “light blue” and can be split into its root words, “hell” (light) and “blau” (blue), to give the system better context.
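A common baseline for decompounding is dictionary-based splitting: find a way to cover the whole word with known vocabulary entries. The sketch below tries the longest known prefix first and recurses on the remainder, memoising results per position. `VOCAB`, `MIN_PART`, and `decompound` are hypothetical; the tiny vocabulary is invented for the example, and German linking elements (such as the “s” in “Arbeitszimmer”, from “Arbeit” + “Zimmer”) are not handled.

```python
from functools import lru_cache

# Hypothetical mini-vocabulary of known German root words.
VOCAB = frozenset({"hell", "dunkel", "blau", "rot", "haus", "tür", "schlüssel"})
MIN_PART = 3  # ignore very short fragments


def decompound(word, vocab=VOCAB):
    """Split a compound into known parts, preferring longer parts first.

    Returns [word] unchanged if no complete split is found.
    """
    word = word.lower()

    @lru_cache(maxsize=None)
    def split(start):
        # Parts covering word[start:], or None if no split exists.
        if start == len(word):
            return []
        for end in range(len(word), start + MIN_PART - 1, -1):
            part = word[start:end]
            if part in vocab:
                rest = split(end)
                if rest is not None:
                    return [part] + rest
        return None

    parts = split(0)
    return parts if parts else [word]


print(decompound("hellblau"))
print(decompound("Haustürschlüssel"))
```

Preferring the longest known prefix keeps words that are already in the vocabulary intact (“blau” stays “blau”), while the recursion ensures that a split is only accepted if the entire word is covered.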

Download this Ebook

