Image Classification Gets a Makeover, Thanks to CNNs
Convolutional Neural Networks (CNNs) are the backbone of image classification, a deep learning task that takes an image as input and assigns it a class label. Image classification using CNNs forms a significant part of machine learning experiments.
Thanks to these capabilities, CNNs are now widely used across a range of applications, from Facebook picture tagging and Amazon product recommendations to healthcare imagery and autonomous cars. One reason CNNs are so popular is that they require very little pre-processing: they learn their filters directly from 2D images, whereas conventional algorithms rely on hand-engineered features. Let’s delve deeper into how image classification using CNNs works.
How Does a CNN Work?
A CNN has an input layer, an output layer, and hidden layers, all of which help process and classify images. The hidden layers comprise convolutional layers, ReLU layers, pooling layers, and fully connected layers, each of which plays a crucial role. Learn more about convolutional neural networks.
Let’s look at how image classification using CNN works:
Imagine that the input image is that of an elephant. This image, as a grid of pixels, is first passed to the convolutional layers. If it is a black-and-white picture, the image is interpreted as a 2D array, with every pixel assigned a value between 0 and 255, 0 being wholly black and 255 completely white. If, on the other hand, it is a colour picture, it becomes a 3D array with a red, green, and blue layer, each colour value again between 0 and 255.
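As a minimal sketch (using a hypothetical 4×4 toy image, not one from the text), this is how the two representations look as NumPy arrays:

```python
import numpy as np

# Grayscale: a 2D array, each value in [0, 255] (0 = black, 255 = white).
gray = np.array([
    [0,  64, 128, 255],
    [32, 96, 160, 224],
    [16, 80, 144, 208],
    [8,  72, 136, 200],
], dtype=np.uint8)
print(gray.shape)    # (4, 4) -> height x width

# Colour: a 3D array with one 2D layer per channel (red, green, blue).
colour = np.stack([gray, gray, gray], axis=-1)
print(colour.shape)  # (4, 4, 3) -> height x width x channels
```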
The reading of this array then begins. The software selects a smaller matrix of weights, known as the ‘filter’ (or kernel), whose depth is the same as the depth of the input. The filter then slides (convolves) across the input image, moving right by 1 unit at a time.
At each position, the filter’s values are multiplied element-wise with the underlying pixel values. All the products are added up, and a single number is generated. The process is repeated across the entire image, and a matrix is obtained that is smaller than the original input image.
The final array is called the feature map or activation map. Convolving an image with different filters performs operations such as edge detection, sharpening, and blurring. All one needs to specify are aspects such as the size of the filters, the number of filters, and the architecture of the network.
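The sliding multiply-and-sum described above can be sketched in a few lines of NumPy (the 4×4 toy image and the vertical-edge filter are illustrative choices, not from the text):

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding), multiplying
    element-wise and summing to produce one number per position."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])             # simple vertical-edge filter
feature_map = convolve2d(image, edge_filter)
print(feature_map.shape)  # (2, 2) -- smaller than the 4x4 input
```

Note how the output feature map is smaller than the input, exactly as described above.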
From a human perspective, this action is akin to identifying the simple colours and boundaries of an image. However, to classify the image and recognize the features that make it, say, an elephant and not a cat, distinctive features such as the elephant’s large ears and trunk need to be identified. This is where the non-linear and pooling layers come in.
The non-linear (ReLU) layer follows the convolution layer. Here an activation function is applied to the feature maps to increase the non-linearity of the network. The ReLU function replaces all negative values with zero and leaves positive values unchanged. Although there are other activation functions such as tanh or sigmoid, ReLU is the most popular since networks using it can be trained much faster.
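ReLU itself is a one-line function; here is a minimal sketch of it applied to a small, hypothetical feature map:

```python
import numpy as np

def relu(x):
    """ReLU: replace every negative value with zero, keep positives."""
    return np.maximum(0, x)

feature_map = np.array([[-3., 1.],
                        [ 2., -5.]])
print(relu(feature_map))  # [[0. 1.]
                          #  [2. 0.]]
```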
Next, the network must be able to recognize the same object whatever its size or location in the image. For instance, in the elephant picture, the network must recognize the elephant whether it is walking, standing still, or running. There must be this kind of flexibility, and that’s where the pooling layer comes in.
It works on the image’s spatial dimensions (height and width) to progressively reduce the size of the representation, so that objects can be spotted and identified wherever they are located.
Pooling also helps control ‘overfitting’, where the network memorizes the training data and generalizes poorly to new data. The most common form of pooling is max pooling, where the feature map is divided into a series of non-overlapping regions.
Max pooling keeps only the maximum value in each region, so that redundant information is discarded and the representation becomes smaller. This also helps make the network robust to small distortions in the image.
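A minimal 2×2 max-pooling sketch in NumPy (the 4×4 feature map is an illustrative example; the sketch assumes dimensions divisible by the pooling size):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Split the map into non-overlapping size x size regions and keep
    only the maximum value of each region (assumes h, w divisible by size)."""
    h, w = feature_map.shape
    return feature_map.reshape(h // size, size, w // size, size).max(axis=(1, 3))

fm = np.array([[1., 5., 2., 0.],
               [3., 4., 1., 1.],
               [7., 0., 6., 2.],
               [2., 1., 3., 8.]])
print(max_pool(fm))  # [[5. 2.]
                     #  [7. 8.]]
```

Each 2×2 region collapses to its single largest value, halving both height and width.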
Now comes the fully connected layer, which attaches a conventional artificial neural network to the end of the CNN. This network combines the different features and predicts the image classes with greater accuracy. At this stage, the gradient of the error function is calculated with respect to the neural network’s weights. The weights and feature detectors are adjusted to optimize performance, and this process is repeated over and over.
Here’s what the CNN architecture looks like:
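As a minimal sketch, the layer stack described above can be written with the Keras Sequential API (the layer sizes here are illustrative choices, not prescribed by the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),               # input layer: 28x28 grayscale image
    layers.Conv2D(32, (3, 3), activation="relu"),  # convolution + ReLU
    layers.MaxPooling2D((2, 2)),                   # pooling layer
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),           # fully connected layer
    layers.Dense(10, activation="softmax"),        # output layer: 10 class scores
])
model.summary()
```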
Leveraging Datasets for CNN Applications: MNIST
Several datasets can be used to apply CNNs effectively. The three most popular ones for image classification using CNNs are MNIST, CIFAR-10, and ImageNet. Let’s look at MNIST first.
MNIST is an acronym for the Modified National Institute of Standards and Technology dataset and comprises 70,000 small, square 28×28 grayscale images of single handwritten digits between 0 and 9, split into 60,000 training and 10,000 test images. MNIST is a popular and well-understood dataset that is, for the greater part, ‘solved.’ It can be used in computer vision and deep learning to practice, develop, and evaluate image classification using CNNs. Among other things, this includes steps to evaluate the performance of the model, explore possible improvements, and use it to predict new data.
Its USP is that it already comes with a well-defined train and test split that we can use. The training set can further be divided into train and validation subsets if one needs to evaluate performance during a training run. The model’s performance on the train and validation sets after each epoch can be recorded as learning curves for greater insight into how well it is learning the problem.
Keras, one of the leading neural network APIs, supports this through the “validation_data” argument to the model.fit() function. Training returns a history object that records the model’s loss and metrics after each epoch. Conveniently, MNIST ships with Keras by default, and the train and test sets can be loaded using just a few lines of code.
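Those few lines look like this (a minimal sketch; model definition and the `validation_data` call are left as comments since the text doesn't specify a model):

```python
from tensorflow.keras.datasets import mnist

# Load the built-in train/test split described above.
(train_x, train_y), (test_x, test_y) = mnist.load_data()
print(train_x.shape, train_y.shape)  # (60000, 28, 28) (60000,)
print(test_x.shape, test_y.shape)    # (10000, 28, 28) (10000,)

# With a compiled model, validation data is passed like:
# history = model.fit(train_x, train_y, validation_data=(test_x, test_y))
# history.history then holds the per-epoch loss and metrics.
```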
Interestingly, an article by Yann LeCun, Professor at the Courant Institute of Mathematical Sciences at New York University, and Corinna Cortes, Research Scientist at Google Labs in New York, points out that NIST’s Special Database 3 (SD-3) was originally assigned as the training set, while Special Database 1 (SD-1) was designated as the test set.
However, they note that SD-3 is much easier to recognize than SD-1, because SD-3 was gathered from employees of the Census Bureau, while SD-1 was sourced from high-school students. Since drawing accurate conclusions from learning experiments requires that the result be independent of the choice of training and test sets, it was deemed necessary to develop a fresh database by mixing the two datasets.
When using the dataset, it is recommended to divide it into minibatches, store it in shared variables, and access each minibatch by its index. You might wonder why shared variables are needed; this is connected with using the GPU. When copying data into GPU memory, if you copy each minibatch separately as and when it is needed, the GPU code will slow down and not be much faster than the CPU code. If instead you keep your data in Theano shared variables, the whole dataset can be copied onto the GPU in one go when the shared variables are built.
Later, the GPU can access any minibatch by slicing these shared variables, without needing to copy information from CPU memory. Also, because the data points are usually real numbers and the labels integers, it is good practice to use different variables for each, as well as separate variables for the training, validation, and test sets, to make the code easier to read.
The code below shows you how to store data and access a minibatch:
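A minimal sketch of the minibatch pattern, shown here with NumPy arrays and hypothetical sizes (with Theano you would wrap the arrays in `theano.shared(...)` so the whole dataset is copied to GPU memory once; the indexing logic is the same):

```python
import numpy as np

# Hypothetical dataset: real-valued inputs and integer labels, stored
# in separate variables as recommended above.
rng = np.random.default_rng(0)
data_x = rng.random((600, 28 * 28)).astype("float32")
data_y = rng.integers(0, 10, size=600)

batch_size = 100
n_batches = data_x.shape[0] // batch_size

def get_minibatch(index):
    """Access one minibatch by its index, slicing the stored data."""
    lo = index * batch_size
    hi = lo + batch_size
    return data_x[lo:hi], data_y[lo:hi]

xb, yb = get_minibatch(2)
print(xb.shape, yb.shape)  # (100, 784) (100,)
```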
CIFAR-10 Dataset
CIFAR stands for the Canadian Institute for Advanced Research, and the CIFAR-10 dataset was developed by researchers at the CIFAR institute, along with the CIFAR-100 dataset. The CIFAR-10 dataset consists of 60,000 32×32 pixel colour images of objects belonging to ten classes such as cats, ships, birds, frogs, etc. These images are much smaller than an average photograph and are intended for computer vision purposes.
CIFAR-10 is a well-understood, straightforward dataset: a basic CNN can reach around 80% classification accuracy on it, and well-tuned models exceed 90% on the test set. The 60,000 images are split into five training batches and one test batch of 10,000 images each.
The test batch contains exactly 1,000 randomly selected images from each class; the training batches may contain more images from one class than another, but between them they contain exactly 5,000 images from each class. The CIFAR-10 dataset is preferred for its ease of use as a starting point for solving image classification problems with CNNs.
The design of its test harness is modular and can be developed in five elements: dataset loading, dataset preparation, model definition, model evaluation, and result presentation. The example below loads the CIFAR-10 dataset using the Keras API and shows the first nine images in the training dataset:
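A minimal sketch of that loading-and-plotting step (the plot requires matplotlib):

```python
from tensorflow.keras.datasets import cifar10
import matplotlib.pyplot as plt

# Load the built-in train/test split and print the array shapes.
(train_x, train_y), (test_x, test_y) = cifar10.load_data()
print("Train:", train_x.shape, train_y.shape)  # (50000, 32, 32, 3) (50000, 1)
print("Test: ", test_x.shape, test_y.shape)    # (10000, 32, 32, 3) (10000, 1)

# Show the first nine training images in a 3x3 grid.
for i in range(9):
    plt.subplot(3, 3, i + 1)
    plt.imshow(train_x[i])
    plt.axis("off")
plt.show()
```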
Running the example loads the CIFAR-10 dataset and prints the shapes of the training and test sets.
ImageNet Dataset
ImageNet aims to categorize and label images into nearly 22,000 categories based on predefined words and phrases. To do this, it follows the WordNet hierarchy, in which words and phrases are grouped into sets of synonyms called ‘synsets’. In ImageNet, all images are organized according to these synsets, with the aim of having over a thousand images per synset.
However, when ImageNet is referred to in computer vision and deep learning, what is usually meant is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). The goal here is to classify an image into one of 1,000 categories; the training set contains around 1.2 million images, with a further 50,000 validation and 100,000 test images.
Perhaps the greatest challenge here is scale: ImageNet images are typically processed at 224×224 pixels, and working through such a large amount of data requires massive CPU, GPU, and RAM capacity. This might prove impossible on an average laptop, so how does one overcome the problem?
One way is to use Imagenette, a dataset extracted from ImageNet that doesn’t require too many resources. This dataset has two folders named ‘train’ (training) and ‘val’ (validation), with an individual folder for each class. All these classes keep the same IDs as in the original dataset, and each class has around 1,000 images, so the whole setup is pretty balanced.
Another option is transfer learning, a method that reuses weights pre-trained on large datasets. This is a very effective approach to image classification using CNNs because it lets us build well-performing models without training from scratch. What an image classification model must do is group images of the same class together and distinguish between those of different classes, and pre-trained weights already encode much of this ability. A further advantage is that we can choose among different transfer-learning strategies depending on the kind of dataset we’re working with.
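A minimal transfer-learning sketch with Keras: the pre-trained convolutional base is frozen and only a new classification head is trained (MobileNetV2 and the 10-class head are illustrative choices, not specified in the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

# Convolutional base with weights pre-trained on ImageNet.
base = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    include_top=False,      # drop the original 1000-class classifier
    weights="imagenet",     # reuse the pre-trained feature detectors
)
base.trainable = False      # freeze the pre-trained layers

# New head for our own (hypothetical) 10-class problem.
model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```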
To sum up, image classification using CNN has made the process easier, more accurate, and less process-heavy. If you’d like to delve deeper into machine learning, upGrad has a range of courses that help you master it like a pro!
upGrad offers various courses online with a wide range of subcategories; visit the official site for further information.
If you’re interested in learning more about machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.