Optical character recognition (OCR) is used to extract information from images of bills, receipts, or anything else that carries written content. To build such a solution, OpenCV can be used to process the images, which are then fed into the Tesseract OCR engine to extract the text.
However, text extraction is effective only if the image is clear and the text is legible. In retail applications that extract text from invoices, the invoice may be covered with watermarks, or a shadow on the bill may obscure the information to be captured.
Capturing specific pieces of information from long pages of text can also be an arduous task. To tackle these problems, the information extraction pipeline should include an image processing module that deals with these difficulties.
OCR comprises several sub-processes, i.e., localization of text, character segmentation, and recognition of those characters, although a few systems manage without segmentation. Such methods are built from several techniques, such as applying the least squares method to reduce the error rate and support vector machines to match the characters.
Still, Convolutional Neural Networks (CNNs) are often employed to identify the presence of a character in an image. Text can be viewed as a sequence of characters. Detecting and identifying these characters with high accuracy is a difficulty that can be addressed with a special type of neural network, namely recurrent neural networks (RNNs) and, in particular, long short-term memory (LSTM) networks.
In Tesseract, text outlines are first gathered into blobs. The blobs are organized into lines and regions, which are then analyzed for fixed-pitch or proportional text. Text lines are divided into words according to the kind of spacing between characters. Recognition is then split into two passes. In the first pass, each word is identified in turn, and every satisfactorily recognized word is passed to an adaptive classifier as training data. In the second pass, words that were not recognized well enough the first time are attempted again with the adapted classifier.
Tesseract, an optical character recognition engine, is available for various operating systems. The input image is examined and processed in parts, and the text is fed into the LSTM model line by line. It uses a combination of CNN and LSTM architecture to identify and extract text from image data accurately. However, noise or shadows in the image reduce the retrieval accuracy.
To minimize the noise and improve the image quality, the image can be preprocessed using the OpenCV library. Such preprocessing steps can include finding the region of interest (ROI), cropping the image, removing noise (or unwanted regions), thresholding, dilation and erosion, and detecting contours or edges. After these steps are completed, the OCR engine can read the image and extract the relevant text from it far more reliably.
1. OpenCV
OpenCV is a library written in C/C++ with bindings for Python. It is commonly used for processing image data. The library provides a large number of predefined functions that implement the necessary transformations on image samples. All the aforementioned operations, such as dilation, erosion, slicing, edge detection, and many more, can easily be performed using this library.
2. Tesseract OCR Engine
Originally developed at Hewlett-Packard and later sponsored by Google, Tesseract is an open-source library that is widely used for text recognition. It can be used to detect and identify text in various languages. The processing is quite fast and gives the textual output of an image almost immediately. Many scanning applications leverage this library and rely on its extraction techniques.
Steps Involved in the Text Extraction Process
(1) First, image processing techniques such as contour detection, noise removal, and erosion and dilation are applied to the incoming noisy image sample.
(2) After this step, removal of watermarks and shadows from the bill is done.
(3) Furthermore, the bill is segmented into parts.
(4) The segmented parts are passed through the Tesseract OCR engine to get the complete text.
(5) Finally, using regular expressions (regex), we extract all the vital information such as the total amount, date of purchase, and expenses per item.
Let us look at a specific kind of image with text: invoices and bills. They usually carry watermarks, most often of the company issuing the bill. As mentioned earlier, these watermarks are impediments to efficient text extraction. Oftentimes, the watermarks themselves contain text.
Watermark text can be regarded as noise, because the Tesseract engine recognizes text of every size in a line. Like watermarks, shadows also reduce the accuracy with which the engine extracts text. Shadows are removed by enhancing the contrast and brightness of the image.
For images that have stickers or watermarks, a multi-step process is carried out. The process involves converting the image to grayscale, applying morphological transformations, applying thresholding (either a binary inversion or an Otsu transformation), extracting the darker pixels in the darker region, and lastly, pasting those darker pixels into the watermark region.
Coming back to shadow removal: first, dilation is applied to the grayscale image. On top of this, a median blur with an appropriate kernel suppresses the text. The output of this step is an image that contains only the shadows and any other discolorations present. Next, a simple difference is computed between the original image and this background image. Finally, after applying thresholding, what we get is an image with no shadows.
Recognition and Extraction of Text
A Convolutional Neural Network model can be built and trained on the printed text found in images. The model can then be used to detect text in other, similar images that use the same font. The Tesseract OCR engine is used to recover text from the images that have been processed with the computer vision algorithms.
For optical character recognition, we have to perform text localization, followed by character segmentation, and then recognition of the characters. All of these steps are carried out by Tesseract. The Tesseract OCR engine is considerably more accurate on printed text than on handwritten text.
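As a rough sketch, the recognition step can be called from Python through the `pytesseract` wrapper. The wrapper is an assumption here, since the article does not name one, and it requires the Tesseract binary to be installed. The helper names `build_tesseract_config` and `extract_text` are hypothetical, while `--oem` and `--psm` are Tesseract's own command-line options.

```python
def build_tesseract_config(psm: int = 6, oem: int = 1) -> str:
    """Build the option string passed to Tesseract.

    psm 6 treats the image as a single uniform block of text;
    oem 1 selects the LSTM-based recognizer.
    """
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("OCR engine mode must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def extract_text(image, lang: str = "eng", psm: int = 6, oem: int = 1) -> str:
    """Run Tesseract on a preprocessed image (NumPy array or PIL image)."""
    # Imported here so this module still loads on machines without Tesseract.
    import pytesseract

    return pytesseract.image_to_string(
        image, lang=lang, config=build_tesseract_config(psm, oem)
    )
```

For a segmented part of a bill, `extract_text(part)` returns the recognized text as a single string. Other page segmentation modes may suit sparse receipts better than the default chosen here.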
Getting Relevant Information
Talking about invoices specifically, out of all the text extracted, vital information such as the date of purchase and the total amount can be readily obtained using multiple regular expressions. The total amount printed on the invoice can be extracted with a regular expression because it usually appears at the end of the invoice. Many such useful pieces of information can be stored according to their dates so that they are easily accessible.
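A minimal sketch of this regex step is shown below. The patterns, the function name `parse_invoice`, and the sample receipt text are all made up for illustration. Real invoices vary widely in layout, so the expressions would need to be adapted to the documents at hand.

```python
import re

# "Total" or "Grand total" at the start of a line, followed by an amount.
# Anchoring at the line start keeps "Subtotal" from matching.
TOTAL_RE = re.compile(
    r"^[ \t]*(?:grand[ \t]+)?total\b[^\d\n]*(\d[\d,]*(?:\.\d+)?)",
    re.IGNORECASE | re.MULTILINE,
)
# Numeric dates such as 12/03/2021 or 12-03-21.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")
# A line made of an item name followed by a price with two decimals.
ITEM_RE = re.compile(
    r"^[ \t]*([A-Za-z][A-Za-z ]*?)[ \t]+\$?(\d+\.\d{2})[ \t]*$",
    re.MULTILINE,
)
# Labels that look like items but are summary rows.
NON_ITEM_LABELS = {"subtotal", "tax", "total"}

# Made-up OCR output used only to demonstrate the patterns.
SAMPLE = """ACME STORE
Date: 12/03/2021
Milk 2.50
Bread 1.75
Subtotal 4.25
Tax 0.34
TOTAL: $4.59
"""


def parse_invoice(text: str) -> dict:
    """Pull the total, the purchase date, and per-item prices out of OCR text."""
    totals = TOTAL_RE.findall(text)
    date = DATE_RE.search(text)
    items = {
        name.strip(): float(price)
        for name, price in ITEM_RE.findall(text)
        if name.strip().lower() not in NON_ITEM_LABELS
    }
    return {
        # The total usually appears at the end, so the last match wins.
        "total": float(totals[-1].replace(",", "")) if totals else None,
        "date": date.group(1) if date else None,
        "items": items,
    }
```

Calling `parse_invoice(SAMPLE)` yields the total, the date string, and a dictionary of item prices. Fields that cannot be found come back as `None` or an empty dictionary instead of raising an error.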
Accuracy of text retrieval can be defined as the ratio of the number of words that Tesseract extracts correctly, i.e., words that actually appear in the invoice, to the total number of words present in the invoice image. Higher accuracy indicates more effective preprocessing and a better ability of the Tesseract OCR engine to extract information.
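This ratio can be computed with a few lines of code, given a hand-typed reference transcription of the invoice. The sketch below is one possible reading of the definition: it assumes whitespace-separated words compared case-insensitively, and the function name `word_accuracy` is hypothetical.

```python
from collections import Counter


def word_accuracy(ocr_text: str, reference_text: str) -> float:
    """Fraction of the reference words that also appear in the OCR output."""
    reference = Counter(reference_text.lower().split())
    extracted = Counter(ocr_text.lower().split())

    total_words = sum(reference.values())
    if total_words == 0:
        raise ValueError("reference text contains no words")

    # Multiset intersection: a word repeated in the reference only counts
    # as often as the OCR output actually produced it.
    correct_words = sum((reference & extracted).values())
    return correct_words / total_words
```

For example, if the invoice reads "Total amount 4.59 paid" and the engine returns "Total amount 4.S9 paid", three of the four words match and the accuracy is 0.75.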