Deep learning is a major advance over classical machine learning in flexibility, accuracy, and the range of industry applications it makes possible. Whether it’s a chat application, grammar auto-correction, translation between languages, fake news detection, or automatic story writing from a few initial words, deep learning finds use in almost every sector.
With this much usage, it becomes important that these algorithms run on minimal resources so that we can reduce recurring costs and deliver results in less time. An optimizer is a method or algorithm that updates the model’s parameters to reduce the loss with as little effort as possible. Let’s look at some popular deep learning optimizers that deliver acceptable results.
Gradient Descent (GD)
This is the most basic optimizer. It uses the derivative of the loss function and a learning rate directly to reduce the loss and reach the minimum. Neural networks apply the same idea through backpropagation, which passes the gradients backward through the layers so that each layer’s parameters can be updated. It is easy to implement and its results are easy to interpret, but it has several issues.
The weights are updated only after the gradient over the whole dataset has been calculated, which slows down the process. It also requires a large amount of memory to hold this data, making it a resource-hungry process. Though the idea behind this algorithm is sound, it needs to be tweaked.
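As a minimal sketch of the idea, the Python snippet below fits a one-parameter model y = w * x with full-batch gradient descent. The toy data, the mean squared error loss, and the names full_gradient and gradient_descent are assumptions made for illustration, not part of any library.

```python
# Toy data for the assumed model y = w * x (the true w is 2.0).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def full_gradient(w, xs, ys):
    """Gradient of the mean squared error over the whole dataset."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, lr=0.01, epochs=200):
    """Hypothetical helper: one parameter update per full pass over the data."""
    w = 0.0
    for _ in range(epochs):
        # Every sample is visited before w changes once.
        w -= lr * full_gradient(w, xs, ys)
    return w

w = gradient_descent(xs, ys)
```

Note that every update needs a pass over all samples, which is where the time and memory cost described above comes from.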
Stochastic Gradient Descent
This is a modified version of the GD method, where the model parameters are updated on every iteration. That is, after every training sample, the loss is evaluated and the model is updated. These frequent updates can reach the minimum in less time, but they come at the cost of increased variance, which can make the model overshoot the required position.
An advantage of this technique is its low memory requirement compared to the previous one, because only one training sample has to be processed at a time rather than the whole dataset.
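The per-sample update can be sketched as follows. This is an illustrative example only: the toy data, the squared error loss, the fixed random seed, and the function name sgd are all assumptions.

```python
import random

# Toy data for the assumed model y = w * x (the true w is 2.0).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def sgd(xs, ys, lr=0.01, epochs=50, seed=0):
    """Hypothetical helper: update w after every single training sample."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the samples in a random order each epoch
        for i in indices:
            # Gradient of the squared error for this one sample only.
            grad = 2.0 * (w * xs[i] - ys[i]) * xs[i]
            w -= lr * grad
    return w

w = sgd(xs, ys)
```

Only one sample is needed to compute each update, so the whole dataset never has to be held for a single step.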
Mini-Batch Gradient Descent
Another variant of the GD approach is mini-batch, where the model parameters are updated using small batches of samples. That is, the parameters are updated after every batch of n samples, which keeps the model moving towards the minimum in fewer steps without getting derailed often. This results in less memory usage than full-batch GD and lower variance than SGD.
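A rough sketch of the mini-batch variant is shown below, again on assumed toy data. The batch size, the learning rate, and the name minibatch_gd are illustrative choices, not recommendations.

```python
import random

# Toy data for the assumed model y = w * x (the true w is 2.0).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def minibatch_gd(xs, ys, batch_size=2, lr=0.01, epochs=100, seed=0):
    """Hypothetical helper: update w once per batch of batch_size samples."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Average the per-sample gradients inside the batch.
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

w = minibatch_gd(xs, ys)
```

Setting batch_size to 1 reduces this sketch to SGD, and setting it to the dataset size reduces it to full-batch GD.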
Momentum Based Gradient Descent
Let’s revisit the method we are using to update the parameters. Based on the first-order derivative of the loss function, we back-propagate the gradients. The updates can happen after every sample, after every batch, or after a full pass over the data, but so far we have not considered the history of updates already made to the parameters.
If this history element is included in the next updates, it can speed up the whole process, and this is what momentum means in this optimizer. The history element works much like memory. If you are walking down a street and have already covered a fairly large distance in one direction, you can be confident that your destination still lies ahead, so you increase your speed.
This element depends on its previous value, the learning rate, the current gradient, and a new parameter called gamma, which controls how much history is carried forward. The update rule will be something like v = gamma * v + learning_rate * gradient followed by w = w - v, where v is the history element.
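To make the update rule concrete, here is a small sketch that minimizes an assumed toy loss f(w) = (w - 3)^2. The loss, the default values for lr and gamma, and the function names are assumptions chosen for illustration.

```python
def grad(w):
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def momentum_gd(w=0.0, lr=0.1, gamma=0.9, steps=300):
    """Hypothetical helper: gradient descent with a history element v."""
    v = 0.0  # history element, starts empty
    for _ in range(steps):
        # Keep a gamma-sized share of the past updates and add the new gradient step.
        v = gamma * v + lr * grad(w)
        # The update rule w = w - v.
        w = w - v
    return w

w = momentum_gd()
```

With gamma set to 0 the history is discarded and the sketch falls back to plain gradient descent.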
Nesterov Accelerated Gradient (NAG)
Momentum-based GD improved on the optimizers above by converging to the minimum sooner, but it introduced a new problem. The method takes a lot of U-turns and oscillates in and out of the minimum’s valley, adding to the total time. The time taken is still far less than with normal GD, but this issue also needs a fix, and NAG provides it.
The approach here is to first move the parameters using the history element and only then calculate the derivative at that new position, which can nudge the update forward or backward. This is called the look-ahead approach. It makes sense because, as the curve nears the minimum, the derivative can slow the movement down, so there are fewer oscillations and less time is wasted.
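The look-ahead step can be sketched like this, using one common formulation of NAG on an assumed toy loss f(w) = (w - 3)^2. The function names and default hyper-parameter values are illustrative assumptions.

```python
def grad(w):
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def nag(w=0.0, lr=0.1, gamma=0.9, steps=300):
    """Hypothetical helper: momentum with the gradient taken at a look-ahead point."""
    v = 0.0  # history element
    for _ in range(steps):
        # First move with the history element alone...
        lookahead = w - gamma * v
        # ...then evaluate the gradient at that look-ahead position.
        v = gamma * v + lr * grad(lookahead)
        w = w - v
    return w

w = nag()
```

The only difference from the plain momentum rule is where the gradient is evaluated: at the look-ahead point instead of at the current w.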
Adagrad
Until now we have focused only on how the model parameters affect training, but we haven’t talked about the hyper-parameters that are held at a constant value throughout training. One such important hyper-parameter is the learning rate, and varying it can change the pace of training.
For sparse feature input, where most of the values are zero, we can afford a higher learning rate, which boosts the small gradients that result from these sparse features. For dense data, we can learn more slowly.
The solution is an adaptive learning rate that changes according to the input provided. The Adagrad optimizer offers this adaptiveness by decaying the learning rate for each parameter according to the accumulated history of its squared gradients.
That is, when a parameter receives larger updates, its history element accumulates faster and its learning rate is reduced, and vice versa. One disadvantage of this approach is that the learning rate decays aggressively and approaches zero after some time.
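A single-parameter sketch of this behaviour is given below. It assumes a toy loss f(w) = (w - 3)^2 and records the effective learning rate at each step; the names adagrad, grad_sq_sum, and rates, as well as the default values, are illustrative assumptions.

```python
import math

def grad(w):
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def adagrad(w=0.0, lr=1.0, eps=1e-8, steps=200):
    """Hypothetical helper: returns the final w and the effective rate per step."""
    grad_sq_sum = 0.0  # cumulative sum of squared gradients (the history element)
    rates = []
    for _ in range(steps):
        g = grad(w)
        grad_sq_sum += g * g
        # The effective learning rate shrinks as the history accumulates.
        rate = lr / (math.sqrt(grad_sq_sum) + eps)
        rates.append(rate)
        w -= rate * g
    return w, rates

w, rates = adagrad()
```

Because grad_sq_sum can only grow, the values in rates never increase, which is the aggressive decay described above.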
RMSProp
It is an improvement on the Adagrad optimizer. It aims to reduce the aggressive decay of the learning rate by taking an exponential moving average of the squared gradients instead of their cumulative sum. The learning rate remains adaptive, because the exponential average still gives a larger rate to parameters whose recent gradients are small and a smaller rate to parameters whose recent gradients are large.
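The change from a cumulative sum to an exponential moving average can be sketched as follows, on an assumed toy loss f(w) = (w - 3)^2. The decay factor beta = 0.9, the learning rate, and the function names are assumptions for illustration.

```python
import math

def grad(w):
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def rmsprop(w=0.0, lr=0.01, beta=0.9, eps=1e-8, steps=2000):
    """Hypothetical helper: scale each step by a moving average of squared gradients."""
    avg_sq = 0.0  # exponential moving average of squared gradients
    for _ in range(steps):
        g = grad(w)
        # Old gradients fade out instead of piling up as in a cumulative sum.
        avg_sq = beta * avg_sq + (1.0 - beta) * g * g
        w -= lr * g / (math.sqrt(avg_sq) + eps)
    return w

w = rmsprop()
```

Since avg_sq can shrink again when recent gradients are small, the effective learning rate does not have to decay towards zero. With a constant lr, this sketch settles in a small neighbourhood of the minimum rather than exactly on it.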
Adam (Adaptive Moment Estimation)
Adaptive Moment Estimation (Adam) combines the power of RMSProp (root-mean-square propagation) and momentum-based GD. The ability of momentum GD to hold the history of updates, together with the adaptive learning rate provided by RMSProp, makes the Adam optimizer a powerful method. It also introduces two hyper-parameters, beta1 and beta2, which are usually kept around 0.9 and 0.999, but you can change them according to your use case.
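A compact sketch of how the two ideas fit together is shown below, again on an assumed toy loss f(w) = (w - 3)^2. It includes the bias-correction step from the standard formulation of Adam; the function names, the learning rate, and the step count are illustrative assumptions.

```python
import math

def grad(w):
    """Gradient of the assumed toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)

def adam(w=0.0, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, steps=1000):
    """Hypothetical helper: momentum-style average plus RMSProp-style scaling."""
    m = 0.0  # moving average of gradients (the momentum part)
    v = 0.0  # moving average of squared gradients (the RMSProp part)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

w = adam()
```

Here beta1 controls how much gradient history is kept, and beta2 controls how quickly the squared-gradient average forgets old values.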
In this article, we looked at 8 deep learning optimizers in order of ease of use and saw how each optimizer’s limitation is overcome by the next. There are further modifications of the optimizers mentioned here, but these are the fundamental ones to consider before going for more complex solutions.
Picking a winner among these depends heavily on the use case and the problem you are dealing with, but Adam can reasonably be ranked at the top. It combines the momentum concept, which changed how model parameters are updated, with an adaptive learning rate that adjusts to different scenarios, enabling efficient processing of many types of inputs.
A general trend shows that, for the same loss, these optimizers converge to different local minima. Adaptive learning rate optimizers tend to converge to sharper minima, while other techniques tend to converge to flatter minima, which are generally better for generalization. These techniques can only help to some extent: as deep neural networks become bigger, more efficient methods are required to get good results.