One-Shot Learning with Siamese Network [For Facial Recognition]

The following article talks about the need for using One-shot learning along with its variations and drawbacks.

To begin with, in order to train any deep learning model, we need a large amount of data so that our model performs the desired prediction or classification task efficiently. For instance, detecting a dog from images will require you to train a neural network model on hundreds and thousands of dog and non-dog images for it to accurately distinguish one from the other. However, this neural network model will fail to work if it is trained on one or very few training data. 

With the lack of data, extracting relevant features at different layers becomes difficult. The model will not be able to generalize well between different classes thereby affecting its overall performance.

For illustration, consider the example of facial recognition at an airport. In this, we do not have the liberty to train our model of hundreds and thousands of images of each person containing different expressions, background lighting et al. With more than thousands of passengers arriving daily it is an impossible task! Besides, storing such a huge chunk of data adds up to the cost. 

To tackle the above problem, we use a technique in which classification or categorization tasks can be achieved with one or a few examples to classify many new examples. This technique is called One-shot learning. 

In recent years One-shot learning technology is being used extensively in facial recognition and passport checks. The concept being used is- The model takes input 2 images; one being the image from the passport and the other being the image of the person looking at the camera. The model then outputs a value which is the similarity between the 2 images. If the value of the output is low then the two images are similar else they are different.

Siamese Network

The architecture used for One-shot learning is called the Siamese Network. This architecture comprises two parallel neural networks with each taking different input. The output of the model is a value or a similarity index which indicates whether the two input images are alike or not. A value below a pre-defined threshold corresponds to the high similarity between the two images and visa versa. 

When the images are passed a series of Convolutional layers, max-pooling layers, and fully connected layers what we achieve is a vector that encodes the features of the images. Here because we input two images, two vectors encompassing the features of the input images will be generated. The value which we were talking about is the distance between the two feature vectors which can be calculated by finding the norm of the difference between the two vectors. 

Triplet loss function

As the name suggests, to train the model we require three images- one anchor (A) image, one positive (P), and one negative (N) image. Since two inputs can be provided to the model, an anchor image with either a positive or negative image is given. The model learns the parameter in such a fashion that the distance between the anchor image and the positive image is low while the distance between the anchor image and the negative image is high. 

The constructive loss function penalizes the model if the distance between A and N is low or A and P is high, while it encourages the model or learns features when the distance between A and N is high and A and P is low.

To understand more about the anchor, positive and negative images let’s consider the previous example of that at an airport. In such a case, the anchor image will be your image when you look at the camera, the positive image will be the one on your passport photo and the negative image will be a random image of a passenger present at the airport. 

Whenever we train a Siaseme network we provide it with the APN trios (Anchor, positive and negative) images. Creating this dataset is much easier and would require fewer images to train. 

Limitations of One-shot learning

One-shot learning is still a mature machine learning algorithm and does possess some limitations. For instance, the model will not work well if the input image has some modifications- a person wearing a hat, sunglasses et al. Further, a model that is trained for one application cannot be generalized for another application. 

Moving on let’s see a few variations of One-shot learning which entails Zero-shot learning and Few-shot learning.

Zero-shot learning

Zero-shot learning is the ability of the model to identify new or unseen labeled data while being trained on seen data and knowing the semantic features of new or unseen data. For instance, a child who has seen a cat can identify it by its distinct features. Moreover, if the child is aware that the dog’s bark and possesses more solid characteristics than a cat, then the child would have no problem in recognizing the dog.

To conclude, we can say that ZSL recognition functions in a manner that takes into account the labeled training set of seen classes coupled with the knowledge about how each unseen class is semantically related to the seen classes.

N-shot learning

As the name suggests, in N shot learning we will have n labeled data of each class available for training. The model is trained on K classes each containing n labeled data. After extracting relevant features and patterns the model has to categorize a new unlabelled image into one of the K classes. They use Matching networks that work on the nearest neighbors based approach trained fully end to end. 


In conclusion, the field of One-shot learning and its counterparts have immense potential to solve some of the challenging problems. Though, being a relatively new area of research, it is making fast progress, and researchers are working trying to bridge the gap between machines and humans. 

With this, we have come to an end of this post, I hope you enjoyed reading it. 

If you’re interested to learn more about machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.

Lead the AI Driven Technological Revolution


Leave a comment

Your email address will not be published.

Accelerate Your Career with upGrad

Our Popular Machine Learning Course