Introduction to Optical Character Recognition [OCR] For Beginners

Last updated:
8th Feb, 2021

Optical character recognition (OCR) is used to extract text from images of bills, receipts, or anything else that carries written content. To build such a solution, OpenCV can be used to process the images, which are then fed into the Tesseract OCR engine to extract the text.

However, the text extraction process works well only if the image is clear and the text is sufficiently visible. In retail applications, where text is extracted from invoices, the invoice may be covered with watermarks, or a shadow on the bill may prevent the information from being captured.

Capturing individual pieces of information from longer pages of text can also be an arduous task. To tackle these problems, the information extraction pipeline should include an image processing module that deals with the difficulties described above.

OCR comprises several sub-processes, i.e., localization of text, character segmentation, and recognition of those characters, although a few systems manage without segmentation. Such methods are built using several techniques, such as applying the least squares method to reduce the error rate and support vector machines to match the characters.

Still, Convolutional Neural Networks (CNNs) are often employed to identify the presence of a character in an image. Text can be viewed as a sequence of characters. Detecting and identifying these characters with greater accuracy is a problem that can be addressed with a special type of neural network, namely recurrent neural networks (RNNs), and in particular long short-term memory (LSTM) networks.

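In Tesseract, character outlines are gathered into blobs. The blobs are organized into text lines, and these lines and regions are examined for fixed-pitch or proportional text. Text lines are divided into words according to the kind of spacing between characters. Recognition is then split into two passes. In the first pass, each word is identified in turn, and every word that is recognized satisfactorily is passed to an adaptive classifier as training data. In the second pass, the page is read again so that words that were not recognized well enough the first time can benefit from what the adaptive classifier has learned.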

The input image is examined and processed in parts, and the text is fed into the LSTM model line by line. Tesseract, an optical character recognition engine, is available for various operating systems. It uses a combination of CNN and LSTM architectures to identify and extract text from image data. However, noise or shadows in the image reduce the extraction accuracy.

To minimize the noise or improve the image quality, the image can be preprocessed using the OpenCV library. Such preprocessing steps can include finding the region of interest (ROI), cropping the image, removing noise (or unwanted regions), thresholding, dilation and erosion, and detecting contours or edges. Once those steps are completed, the OCR engine can read the image and extract the relevant text from it far more reliably.

Tools Used 

1. OpenCV 

OpenCV is a library written in C/C++ that also provides bindings for Python, among other languages. It is commonly used for processing image data. The library ships with a large set of predefined functions that implement the necessary transformations on image samples. All of the operations mentioned above, such as dilation, erosion, slicing, edge detection, and many more, can easily be done using this library.

2. Tesseract OCR Engine 

Originally developed at Hewlett-Packard and later sponsored by Google, Tesseract is an open-source engine that is widely used for text recognition. It can detect and identify text in various languages. Processing is quite fast, and the textual output of an image is available almost immediately. Many scanning applications leverage this library and rely on its extraction techniques.

Steps Involved in the Text Extraction Process

(1) First, image processing techniques such as contour detection, noise removal, and erosion and dilation are applied to the incoming noisy image sample.

(2) After this step, watermarks and shadows are removed from the bill.

(3) Next, the bill is segmented into parts.

(4) The segmented parts are passed through the Tesseract OCR engine to get the complete text.

(5) Finally, regular expressions are used to pull out the vital information, such as the total amount, date of purchase, and expenses per item.
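Steps (3) and (4) can be sketched roughly as follows. This is only an illustration under stated assumptions: the page is assumed to be already binarized (dark text on white), segmentation is done by cutting at fully blank rows, and the helper names `split_into_bands` and `ocr_page` are hypothetical. The default OCR call uses `pytesseract.image_to_string`, which requires the pytesseract package and a Tesseract installation; any other callable can be passed in its place.

```python
from typing import Callable, List

import numpy as np


def split_into_bands(binary: np.ndarray) -> List[np.ndarray]:
    """Cut a binary page (text = 0, background = 255) at blank rows."""
    has_ink = (binary < 128).any(axis=1)  # True for rows containing text
    bands, start = [], None
    for row, ink in enumerate(has_ink):
        if ink and start is None:
            start = row                      # a band of text begins
        elif not ink and start is not None:
            bands.append(binary[start:row])  # the band ended on the row above
            start = None
    if start is not None:                    # text runs to the bottom edge
        bands.append(binary[start:])
    return bands


def _tesseract(image: np.ndarray) -> str:
    # Imported lazily so the segmentation code works without Tesseract.
    import pytesseract

    return pytesseract.image_to_string(image)


def ocr_page(binary: np.ndarray,
             ocr: Callable[[np.ndarray], str] = _tesseract) -> str:
    """Run OCR on each band and join the results into the complete text."""
    return "\n".join(ocr(band).strip() for band in split_into_bands(binary))
```

In practice it usually helps to leave a few pixels of white margin around each band before recognition; that padding is omitted here to keep the sketch short.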

Consider one specific kind of text image: invoices and bills. They usually carry watermarks, mostly of the company issuing the bill. As mentioned earlier, these watermarks get in the way of efficient text extraction. Oftentimes, the watermarks themselves contain text.

Such watermark text can be regarded as noise, because the Tesseract engine recognizes text of every size in a line. Like watermarks, shadows also reduce the accuracy with which the engine extracts text. Shadows are removed by enhancing the contrast and brightness of the image.

For images that have stickers or watermarks, a multi-step process is carried out. The process involves converting the image to grayscale, applying morphological transformations, applying thresholding (a binary inversion or an Otsu threshold), extracting the darker pixels in the watermark region, and lastly, pasting those darker pixels back into the watermark region.

Shadow removal works as follows. First, dilation is applied to the grayscale image. On top of this, a median blur with an appropriate kernel suppresses the text. The output of this step is an image that contains the shadows and any other discolorations present. A simple difference operation is then computed between the original image and this image. Finally, after applying thresholding, what we get is an image with no shadows.

Recognition and Extraction of Text 

A Convolutional Neural Network model can be built and trained on the printed text found in images. The model can then be used to detect text in other, similar images with the same font. The Tesseract OCR engine is used to recover text from the images that have been processed using the computer vision algorithms.

For Optical Character Recognition, we have to perform text localization, followed by character segmentation, and then recognition of the characters. All of these steps are carried out by Tesseract OCR. The Tesseract OCR engine is considerably more accurate on printed text than on handwritten text.

Getting Relevant Information 

For invoices specifically, vital information such as the date of purchase and the total amount can be readily obtained from the extracted text using multiple regular expressions. The total amount printed on the invoice can be picked out with a regular expression because it usually appears at the end of the invoice. Many such useful pieces of information can be stored by date so that they are easily accessible.
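A minimal sketch of this idea with Python's `re` module is shown below. The patterns, the function name `extract_invoice_fields`, and the assumed date format (day, month, and year separated by `/`, `.`, or `-`) are illustrative assumptions; real invoices vary widely and will need patterns tuned to their layout.

```python
import re

# Numeric dates such as 08/02/2021 or 8-2-21 (assumed format).
DATE_RE = re.compile(r"\b(\d{1,2}[/.-]\d{1,2}[/.-]\d{2,4})\b")

# The word "total" followed, on the same line, by an amount such as
# 6.21 or 1,234.50. The word boundary keeps "Subtotal" from matching.
TOTAL_RE = re.compile(
    r"\btotal\b[^\d\n]*(\d[\d,]*(?:\.\d{1,2})?)", re.IGNORECASE
)


def extract_invoice_fields(text: str) -> dict:
    """Pull the purchase date and total amount out of OCR text."""
    date_match = DATE_RE.search(text)
    totals = TOTAL_RE.findall(text)
    return {
        "date": date_match.group(1) if date_match else None,
        # The total usually appears at the end, so keep the last match.
        "total": totals[-1] if totals else None,
    }
```

Both fields are returned as strings, or as `None` when nothing matches, so the caller decides how to parse amounts and dates.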

Accuracy

Accuracy for text retrieval can be defined as the ratio of the number of words correctly extracted by Tesseract OCR (that is, extracted words that actually appear in the invoice) to the total number of words present in the text image. Higher accuracy indicates more effective pre-processing and a better ability of Tesseract OCR to extract information.
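This ratio can be computed with a few lines of Python. The sketch below assumes that words are separated by whitespace, that matching is exact and case-sensitive, and that a hand-typed ground-truth transcript of the invoice is available; the function name `word_accuracy` is hypothetical.

```python
from collections import Counter


def word_accuracy(extracted_text: str, ground_truth_text: str) -> float:
    """Fraction of ground-truth words that also appear in the OCR output."""
    truth = Counter(ground_truth_text.split())
    found = Counter(extracted_text.split())
    total = sum(truth.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps, per word, the smaller of the two counts,
    # so a word is only credited as often as it really occurs.
    correct = sum((truth & found).values())
    return correct / total
```

For example, if the OCR output misreads one word out of four, the function returns 0.75.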

Profile

Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.