
**Introduction**

While scrolling through your Facebook feed, have you ever wondered how the people in a group photo get labelled automatically? Behind the interface sits a sophisticated algorithm that recognizes and labels each picture uploaded to the platform, and every picture we upload helps to improve it further. Image Classification is one of the most widely used applications of Artificial Intelligence.

In recent years, Convolutional Neural Networks (CNNs) have become one of the most successful families of Deep Learning models, and Image Classification is one of their most popular applications. In this tutorial, we will go through the basics of Convolutional Neural Networks, look at the layers involved in building a CNN model, and finally work through an example of an Image Classification task.

**Image Classification**

Before we get into the details of Deep Learning and Convolutional Neural Networks, let us understand the basics of Image Classification. In general, Image Classification is the task in which we give an image as input to a model, which outputs the class the image belongs to or the probability of each class. Training such a model on images that have already been labelled with their classes is an instance of Supervised Learning.

There is a huge difference between how we see an image and how a machine (computer) sees it. We can visualize the image and characterize it by properties such as colour and size. The machine, on the other hand, sees only an array of numbers, with one or more values for each pixel of the image.

Each pixel value lies between 0 and 255 in a standard 8-bit image. The machine therefore needs some processing steps to derive the patterns or features in these numbers that distinguish one image from another. Convolutional Neural Networks help us build models that learn such patterns from images.

**What We See Vs What the Computer Sees**

Source – Difference between Computer and Human Eye

**Deep Learning for Image Classification**

Now that we have understood what Image Classification is, let us see how we can implement it using Artificial Intelligence. For this, we use Deep Learning. Deep Learning is a subset of Machine Learning, itself a branch of Artificial Intelligence, that uses multi-layered neural networks trained on large datasets. For Image Classification, these networks learn patterns from many example images in order to differentiate between the classes present in the dataset.

The major challenge with Deep Learning is that training on a huge dataset takes a very long time and has a high computational cost. Convolutional Neural Networks, a type of Deep Learning model, address this problem well for images because they reuse a small set of filter weights across the whole image.

**Convolutional Neural Networks**

In Deep Learning, Convolutional Neural Networks are a class of Deep Neural Networks that are mostly used to analyze visual imagery. They are a special architecture of Artificial Neural Networks (ANN), pioneered by Yann LeCun and colleagues, whose LeNet-5 network was published in 1998. A Convolutional Neural Network consists of two parts.

The first part consists of the Convolutional layers and the Pooling layers, in which the main feature extraction takes place. In the second part, the Fully Connected (Dense) layers perform several non-linear transformations on the extracted features and act as the classifier.

Consider the image above showing what the human and the machine see. As we see, the computer sees an array of pixels. For example, if the image size is 500×500, then the size of the array will be 500×500×3. Here, the two 500s stand for the height and the width, and 3 stands for the RGB channels, where each colour channel is represented by a separate array. The pixel intensity varies from 0 to 255.
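To make the array view concrete, here is a minimal NumPy sketch. The image is a hypothetical one filled with random values (not a real photograph), and the variable names are chosen only for illustration.

```python
import numpy as np

# A hypothetical 500x500 RGB image filled with random 8-bit intensities.
rng = np.random.default_rng(seed=0)
image = rng.integers(low=0, high=256, size=(500, 500, 3), dtype=np.uint8)

print(image.shape)   # (height, width, channels)
print(image.dtype)   # uint8, so every value lies in 0..255

# Each colour channel is its own 500x500 array.
red, green, blue = image[..., 0], image[..., 1], image[..., 2]
print(red.shape)

# The top-left pixel is a triple of (R, G, B) intensities.
print(image[0, 0])
```

A grayscale image has a single channel, so its array would have shape (500, 500) or (500, 500, 1).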

Now for Image Classification, the computer looks for features at the base level. For us as humans, the base-level features of a cat are its ears, nose and whiskers, while for the computer they are curves and boundaries. By stacking several layers, such as the Convolutional layers and the Pooling layers, the computer extracts these base-level features from the images.

A Convolutional Neural Network model is built from several types of layers and functions:

- Input Layer
- Convolutional Layer
- Pooling Layer
- Fully Connected Layer
- Output Layer
- Activation Functions

Let us go through each of the layers in brief before we get into their application in Image Classification.

**Input Layer**

As the name suggests, this is the layer through which the input image is fed into the CNN model. Depending upon our requirement, we can reshape the image to different sizes, such as (28, 28, 3), that is, 28×28 pixels with 3 colour channels.

**Convolutional Layer**

Then comes the most important layer, which consists of filters (also known as kernels) of a fixed size. The convolution operation is performed between the input image and each filter: the filter slides across the image, and at every position the overlapping values are multiplied element-wise and summed. This is the stage in which most of the base features, such as sharp edges and curves, are extracted from the image, and hence this layer is also known as the feature extractor layer.
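The sliding-filter operation can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions, not how Keras implements it internally: the 6×6 image and the vertical-edge kernel are hypothetical, there is no padding, and the stride is 1. Like most deep learning libraries, it computes a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

# Hypothetical 6x6 image: dark left half (0), bright right half (255).
image = np.zeros((6, 6))
image[:, 3:] = 255.0

# A hand-written vertical-edge filter; a CNN would learn such weights itself.
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)
print(feature_map)
```

The resulting feature map is non-zero only in the columns where the filter straddles the dark-to-bright boundary, which is exactly the sense in which a convolutional filter "detects" an edge.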

**Pooling Layer**

After the convolution operation, we perform the Pooling operation. This is also known as downsampling, where the spatial size of the image is reduced. For example, if we perform a 2×2 Pooling operation with a stride of 2 on an image with dimensions 28×28, the image size is reduced to 14×14, that is, half of its original height and width.
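As a rough illustration of the 28×28 to 14×14 reduction, the following NumPy sketch applies 2×2 max pooling with a stride of 2. The function name `max_pool_2x2` and the input values are hypothetical, and the sketch assumes an even height and width.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2-D array with even height and width."""
    h, w = feature_map.shape
    # Group the array into non-overlapping 2x2 blocks, then take each block's max.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))

# Hypothetical 28x28 feature map holding the values 0..783.
feature_map = np.arange(28 * 28, dtype=float).reshape(28, 28)
pooled = max_pool_2x2(feature_map)
print(feature_map.shape, "->", pooled.shape)
```

Replacing `max` with `mean` in the last line of the function would give average pooling instead.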

**Fully Connected Layer**

The Fully Connected (FC) layers are placed just before the final classification output of the CNN model. The feature maps are first flattened into a one-dimensional vector, which the FC layers then process using their neurons, weights and biases. The last FC layer produces an N-dimensional vector, where N is the number of classes from which the model has to choose.
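A fully connected layer boils down to a matrix multiplication plus a bias. The sketch below assumes a hypothetical set of 16 feature maps of size 7×7 and random, untrained weights, so the resulting scores are meaningless; only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical output of the last pooling layer: 16 feature maps of size 7x7.
features = rng.random((7, 7, 16))

# Flatten to a 1-D vector of 7 * 7 * 16 = 784 values.
flat = features.reshape(-1)

# A fully connected layer computes W @ x + b; here it maps 784 inputs to N = 10 scores.
num_classes = 10
weights = rng.normal(scale=0.01, size=(num_classes, flat.size))
biases = np.zeros(num_classes)
scores = weights @ flat + biases

print(flat.shape, scores.shape)
```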

**Output Layer**

Finally, the Output Layer produces one value per class, typically a probability. During training, these outputs are compared with the true labels, which are mostly encoded using the one-hot encoding method.
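One-hot encoding can be sketched with NumPy as follows. The helper name `one_hot` is hypothetical and the labels are made up for illustration.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one-hot rows."""
    labels = np.asarray(labels)
    return np.eye(num_classes, dtype=int)[labels]

encoded = one_hot([3, 0, 9], num_classes=10)
print(encoded)

# argmax recovers the original integer labels.
print(encoded.argmax(axis=1))
```

Each row has exactly one 1, placed at the index of the class, and 0 everywhere else.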

**Activation Function**

Activation Functions are at the core of any Convolutional Neural Network model. They determine the output of each neuron, in short, whether and how strongly a particular neuron should be activated ("fired"). They are usually non-linear functions applied to the input signals, and the transformed output is then sent as input to the next layer of neurons. There are several activation functions, such as Sigmoid, ReLU, Leaky ReLU, TanH and Softmax.
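The activation functions named above have short closed-form definitions. Here is a minimal NumPy sketch of them; the function names and the `alpha=0.01` slope for Leaky ReLU are choices made for this illustration, and TanH is taken directly from NumPy.

```python
import numpy as np

def sigmoid(x):
    """Squash values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Keep positive values, replace negative values with 0."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Like ReLU, but negative values keep a small slope `alpha`."""
    return np.where(x > 0, x, alpha * x)

def softmax(x):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = x - np.max(x)  # subtracting the max avoids overflow in exp
    exps = np.exp(shifted)
    return exps / np.sum(exps)

x = np.array([-2.0, 0.0, 3.0])
print("sigmoid   :", sigmoid(x))
print("relu      :", relu(x))
print("leaky relu:", leaky_relu(x))
print("tanh      :", np.tanh(x))
print("softmax   :", softmax(x))
```

Softmax differs from the others in that it acts on a whole vector at once, which is why it is typically used in the output layer of a classifier.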

**Basic CNN Architecture**

Source: Basic CNN Architecture

As described earlier, the diagram above shows the basic architecture of a Convolutional Neural Network model. Now that we are ready with the basics of Image Classification and CNNs, let us dive into their application to a real-world problem.

**Convolutional Neural Networks Implementation**

Now that we have understood the basics of Image Classification and Convolutional Neural Networks, let us look at an implementation in TensorFlow/Keras with Python. We shall build a simple Convolutional Neural Network model with a basic LeNet architecture, train it on the training set, and finally measure its accuracy on the test set.

**Problem Set**

To build and train the Convolutional Neural Network model in this article, we shall use the well-known Fashion MNIST dataset. MNIST stands for Modified National Institute of Standards and Technology. Fashion-MNIST is a dataset of Zalando’s article images, consisting of a training set of 60,000 examples and a test set of 10,000 examples. Each example is a 28×28 grayscale image, associated with a label from 10 classes.

Each training and test example is assigned to one of the following labels:

- 0 – T-shirt/top
- 1 – Trouser
- 2 – Pullover
- 3 – Dress
- 4 – Coat
- 5 – Sandal
- 6 – Shirt
- 7 – Sneaker
- 8 – Bag
- 9 – Ankle boot

Source: Fashion MNIST Dataset Images
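For use in code, the label table above can be kept as a simple lookup list. The sketch below is illustrative only: `CLASS_NAMES` and `label_to_name` are hypothetical helpers, not something the dataset loader provides.

```python
# Class names in label order, following the Fashion MNIST label table.
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]

def label_to_name(label):
    """Return the class name for an integer label in 0..9."""
    return CLASS_NAMES[label]

print(label_to_name(9))
```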

**Program Code**

**Step 1 – Importing the Libraries**

The first step in building any Deep Learning model is to import the libraries that the program needs. In our example, as we are using the TensorFlow framework, we import the Keras library along with other important libraries such as NumPy for numerical calculations and matplotlib for plotting.

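    # TensorFlow – Importing the Libraries
    import numpy as np
    import matplotlib.pyplot as plt
    # Jupyter/IPython magic that renders plots inline; omit it in a plain Python script
    %matplotlib inline
    import tensorflow as tf
    # The Keras API bundled with TensorFlow (the module name is lowercase)
    from tensorflow import keras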

**Step 2 – Getting and Splitting the Dataset**

Once we have imported the libraries, the next step is to download the Fashion MNIST dataset and split it into its 60,000 training and 10,000 test examples. Fortunately, Keras provides a predefined loader for the Fashion MNIST dataset, and its load_data() function returns the two splits as a pair of (images, labels) tuples.

    # TensorFlow – Getting and Splitting the Dataset
    fashion_mnist = keras.datasets.fashion_mnist
    (train_images_tf, train_labels_tf), (test_images_tf, test_labels_tf) = fashion_mnist.load_data()

**Step 3 – Visualizing the Data**

The dataset is downloaded as images with their corresponding labels. It is always advisable to view the data first, so that we understand the type of data we are dealing with and can build the Convolutional Neural Network model accordingly. Here, with the simple block of code given below, we display an image from the training dataset together with its label; calling the function with other indices shows further samples.

    # TensorFlow – Visualizing the Data
    def imshowTensorFlow(img, label):
        # Show one grayscale image and print its integer class label
        plt.imshow(img, cmap='gray')
        plt.show()
        print("Label:", label)

    imshowTensorFlow(train_images_tf[0], train_labels_tf[0])

Label: 9 Label: 0 Label: 3

The images above and their labels can be verified against the labels listed in the Fashion MNIST dataset details above. From this, we infer that each image is a grayscale image with a height of 28 pixels and a width of 28 pixels.

Hence, the model can be built with an input size of (28,28,1), where 1 is the single grayscale channel.

**Step 4 – Building the Model**

As mentioned above, in this article we will be building a simple Convolutional Neural Network with the LeNet architecture. LeNet is a convolutional neural network structure first proposed by Yann LeCun et al. in 1989. In general, LeNet refers to LeNet-5, the version published in 1998, and is a simple Convolutional Neural Network.

Source: The LeNet Architecture

From the architecture diagram of the LeNet CNN model above, we see that there are 5+2 layers. The first and second layers are a Convolutional layer followed by a Pooling layer. The third and fourth layers are again a Convolutional layer and a Pooling layer. Because the convolutions in our model use "same" padding and each pooling step halves the height and width, the input shrinks from 28×28 to 14×14 and then to 7×7.
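The 28×28 to 7×7 reduction can be checked with a little arithmetic. The sketch below uses the standard output-size formulas for "same" and "valid" padding, together with the kernel, pool and stride values from the model code in this step. The helper names `conv_output_size` and `pool_output_size` are hypothetical.

```python
import math

def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution for 'same' or 'valid' padding."""
    if padding == "same":
        return math.ceil(size / stride)
    if padding == "valid":
        return (size - kernel) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

def pool_output_size(size, pool, stride):
    """Spatial output size of a pooling layer without padding."""
    return (size - pool) // stride + 1

size = 28
size = conv_output_size(size, kernel=5, stride=1, padding="same")  # stays 28
size = pool_output_size(size, pool=2, stride=2)                     # 28 -> 14
size = conv_output_size(size, kernel=5, stride=1, padding="same")  # stays 14
size = pool_output_size(size, pool=2, stride=2)                     # 14 -> 7
print(size)
```

With "valid" padding instead, the first 5×5 convolution alone would already shrink the image from 28×28 to 24×24.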

The fifth layer of the LeNet model flattens the previous layer’s output so that it can be fed to the Fully Connected part. It is followed by two Dense layers, and the final output layer of the CNN model consists of 10 units with a Softmax activation function. The Softmax function predicts a class probability for each of the 10 classes of the Fashion MNIST dataset.

    # TensorFlow – Building the Model
    model = keras.Sequential([
        # Block 1: 6 filters of size 5x5; "same" padding keeps the 28x28 size
        keras.layers.Conv2D(input_shape=(28, 28, 1), filters=6, kernel_size=5, strides=1, padding="same", activation=tf.nn.relu),
        # 28x28 -> 14x14
        keras.layers.AveragePooling2D(pool_size=2, strides=2),
        # Block 2: 16 filters of size 5x5
        keras.layers.Conv2D(16, kernel_size=5, strides=1, padding="same", activation=tf.nn.relu),
        # 14x14 -> 7x7
        keras.layers.AveragePooling2D(pool_size=2, strides=2),
        # Classifier: flatten the 7x7x16 feature maps, then three Dense layers
        keras.layers.Flatten(),
        keras.layers.Dense(120, activation=tf.nn.relu),
        keras.layers.Dense(84, activation=tf.nn.relu),
        keras.layers.Dense(10, activation=tf.nn.softmax)
    ])

**Step 5 – Model Summary**

Once the layers of the LeNet model are finalized, we can proceed to compile the model and view a summarized version of the CNN model we designed.

    # TensorFlow – Model Summary
    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer='adam',
                  metrics=['acc'])
    model.summary()

Here, as the final output has more than 2 classes (10 classes), we use categorical cross-entropy as the loss function and Adam as the optimizer. Calling model.summary() prints each layer's output shape and its number of parameters.
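The parameter counts that model.summary() reports can be cross-checked by hand. The sketch below assumes the layer sizes from the model code in Step 4. The helper names `conv_params` and `dense_params` are hypothetical, and the pooling and flatten layers are left out because they have no trainable parameters.

```python
def conv_params(kernel, in_channels, filters):
    """Weights (kernel x kernel x in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

layer_params = {
    "conv2d (6 filters, 5x5)":  conv_params(5, 1, 6),
    "conv2d (16 filters, 5x5)": conv_params(5, 6, 16),
    "dense (120 units)":        dense_params(7 * 7 * 16, 120),
    "dense (84 units)":         dense_params(120, 84),
    "dense (10 units)":         dense_params(84, 10),
}

for name, count in layer_params.items():
    print(f"{name:26s} {count:>7,d}")

total_params = sum(layer_params.values())
print(f"{'total':26s} {total_params:>7,d}")
```

Most of the parameters sit in the first Dense layer, which connects all 784 flattened features to 120 units.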

**Step 6 – Training the Model**

Finally, we come to the part where we train the LeNet CNN model. First, we normalize the image data to the range 0 to 1 by dividing by 255.0, which helps training converge, and reshape it to add the channel dimension. Then the labels are converted from an integer class vector to a binary class matrix (one-hot encoding). For example, label 3 is converted to [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].

    # TensorFlow – Training the Model
    # Scale pixel values to [0, 1] and add the channel axis: (N, 28, 28) -> (N, 28, 28, 1)
    train_images_tensorflow = (train_images_tf / 255.0).reshape(train_images_tf.shape[0], 28, 28, 1)
    test_images_tensorflow = (test_images_tf / 255.0).reshape(test_images_tf.shape[0], 28, 28, 1)
    # One-hot encode the integer labels
    train_labels_tensorflow = keras.utils.to_categorical(train_labels_tf)
    test_labels_tensorflow = keras.utils.to_categorical(test_labels_tf)
    H = model.fit(train_images_tensorflow, train_labels_tensorflow, epochs=30, batch_size=32)

At the end of training after 30 epochs, we obtain the final training accuracy and loss as,

Epoch 30/30

1875/1875 [==============================] – 4s 2ms/step – loss: 0.0421 – acc: 0.9850

Training Accuracy: 98.294997215271 %

Training Loss: 0.04584110900759697

**Step 7 – Predicting the Results**

Finally, once the training of the CNN model is done, we run the same model on the test dataset and measure its accuracy on the 10,000 test images.

    # TensorFlow – Comparing the Results
    predictions = model.predict(test_images_tensorflow)
    correct = 0
    for i, pred in enumerate(predictions):
        # The predicted class is the index of the highest softmax probability
        if np.argmax(pred) == test_labels_tf[i]:
            correct += 1
    print('Test Accuracy of the model on the {} test images: {}% with TensorFlow'.format(test_images_tf.shape[0], 100 * correct / test_images_tf.shape[0]))

The output that we get is,

Test Accuracy of the model on the 10000 test images: 90.67% with TensorFlow
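The counting loop can also be written as a single vectorized NumPy expression. The sketch below runs on a small hypothetical batch of softmax outputs (4 images, 3 classes) that is made up for illustration and is not output from the model above.

```python
import numpy as np

# Hypothetical softmax outputs for 4 images and 3 classes (not real model output).
predictions = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.6, 0.3, 0.1],
])
true_labels = np.array([0, 1, 2, 1])

# argmax along axis 1 picks the most probable class for every image at once.
predicted_labels = np.argmax(predictions, axis=1)
accuracy = np.mean(predicted_labels == true_labels)
print(f"Accuracy: {100 * accuracy:.1f}%")
```

Applied to the tutorial's variables, the same expression would compare the argmax of the model's predictions with test_labels_tf.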

With this, we come to the end of the program for building an Image Classification model with Convolutional Neural Networks.


**Conclusion**

Thus, in this tutorial on implementing Image Classification with CNNs, we have covered the basic concepts behind Image Classification and Convolutional Neural Networks, along with an implementation in the Python programming language using the TensorFlow framework.


## Which CNN model is considered to be the best for image classification?

VGG-16 is one of the most widely used CNN models for image classification. It was introduced in the paper "Very Deep Convolutional Networks for Large-Scale Image Recognition"; the name VGG itself stands for Visual Geometry Group, the research group at Oxford that created it. Designed as a deep CNN, VGG also performs well on a wide range of tasks and datasets beyond ImageNet. Its distinguishing feature is that, rather than tuning a large number of hyperparameters, its designers used small 3×3 convolution filters consistently throughout the network. It has a total of 16 weight layers, arranged as 5 convolutional blocks that each end in a max pooling layer, followed by fully connected layers, making it quite a large network.

## What are the disadvantages of using CNN models for image classification?

When it comes to image classification, CNN models are highly successful. However, there are several drawbacks to employing CNNs. If the picture to be identified is slanted or rotated, the CNN model has trouble identifying it accurately. A CNN also does not build internal representations of an object's components and their part-whole relationships. Furthermore, if the CNN model includes numerous convolutional layers, training and classification take a long time.

## Why is the use of the CNN model preferred over the ANN for image data as input?

By combining filters or transformations, a CNN can learn many layers of feature representations for every image provided as input. Overfitting is reduced since the number of parameters the network has to learn in a CNN is substantially smaller than in fully connected multilayer networks. A plain ANN, by contrast, treats the image as a flat vector of pixels, so for complex images it struggles to produce good classifications because it does not exploit the spatial dependencies between neighbouring pixels.
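The difference in parameter count can be illustrated with simple arithmetic. The sketch below compares the first convolutional layer of the LeNet model in this tutorial (6 filters of size 5×5 on a 28×28×1 input) with a hypothetical fully connected layer that produces the same number of outputs from the flattened image. The dense layer is an assumption made only for this comparison.

```python
height, width, channels = 28, 28, 1
filters, kernel = 6, 5

# Convolutional layer: each filter reuses one 5x5 weight set at every image position.
conv_params = kernel * kernel * channels * filters + filters

# Hypothetical dense layer producing the same number of outputs (28 * 28 * 6)
# from the flattened image: one weight for every input-output pair.
inputs = height * width * channels
outputs = height * width * filters
dense_params = inputs * outputs + outputs

print(f"conv : {conv_params:,}")
print(f"dense: {dense_params:,}")
print(f"ratio: {dense_params / conv_params:,.0f}x")
```

The convolutional layer needs only a few hundred parameters where the dense equivalent needs millions, which is the weight sharing that keeps CNNs compact.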