Gradient Descent in Machine Learning: How Does it Work?

Introduction

Optimization sits at the heart of Machine Learning. Almost every Machine Learning algorithm relies on an optimization algorithm at its core to tune the model. More broadly, getting the best result at the lowest cost is a goal shared by real-life decisions and technology products alike.

Optimization algorithms are used in many applications, such as face recognition, self-driving cars, and market analysis, and they play an equally important role in Machine Learning. One widely used optimization algorithm is Gradient Descent, which this article walks through.

What is Gradient Descent?

In Machine Learning, Gradient Descent is one of the most widely used algorithms, yet it puzzles many newcomers. Mathematically, Gradient Descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. In simple terms, it is used to find the values of a function’s parameters (or coefficients) that make a cost function as small as possible. The cost function quantifies the error between the values predicted by a Machine Learning model and the real values.
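As a small illustration of what a cost function can look like, the sketch below computes the mean squared error of a one-coefficient model. The model form (prediction = coefficient * x), the function name, and the sample data are assumptions chosen for illustration; they are only one of many possible cost functions.

```python
def mean_squared_error(coefficient, xs, ys):
    """Average squared gap between predictions (coefficient * x) and real values."""
    errors = [(coefficient * x - y) ** 2 for x, y in zip(xs, ys)]
    return sum(errors) / len(errors)

# Hypothetical data generated by y = 2x, so the best coefficient is 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

cost_at_zero = mean_squared_error(0.0, xs, ys)  # a poor guess gives a high cost
cost_at_two = mean_squared_error(2.0, xs, ys)   # the ideal coefficient gives zero cost
```

The further the coefficient is from the value that fits the data, the higher the cost, which is exactly the quantity Gradient Descent tries to drive down.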

Gradient Descent Intuition

Consider a large bowl, like the one you would keep fruit in or eat cereal from. This bowl represents the cost function (f).

A random position on the surface of the bowl represents the current values of the coefficients, and its height represents the cost at those values. The bottom of the bowl is the minimum of the function, which corresponds to the best set of coefficients.

The goal is to try different values of the coefficients with each iteration, evaluate the cost, and keep moving to coefficients that give a lower cost. Repeated over enough iterations, this leads to the bottom of the bowl, where the coefficients minimize the cost function.

In this way, the Gradient Descent algorithm arrives at the minimum cost.

Gradient Descent Procedure

The procedure begins by assigning initial values to the coefficients of the cost function. These can be 0.0 or small random values.

coefficient = 0.0

Next, the cost of the coefficients is obtained by plugging them into the cost function.

cost = f(coefficient)

Then, the derivative of the cost function is calculated at the current coefficient. The derivative, a concept from differential calculus, gives the slope of the function at that point. The sign of the slope tells us in which direction to move the coefficient in the next iteration to get a lower cost: a positive slope means the coefficient should decrease, and a negative slope means it should increase.

delta = derivative(f, coefficient)
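To make the derivative step concrete, here is a minimal sketch that estimates the slope numerically with a central difference. The bowl-shaped cost f, the step size h, and the variable names are assumptions for illustration; in practice the derivative is usually worked out analytically or by automatic differentiation.

```python
def derivative(f, x, h=1e-6):
    """Central-difference estimate of the slope of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

def f(coefficient):
    # Hypothetical bowl-shaped cost with its minimum at coefficient = 3.
    return (coefficient - 3.0) ** 2

slope_left = derivative(f, 0.0)   # negative: the cost falls as the coefficient increases
slope_right = derivative(f, 5.0)  # positive: the cost falls as the coefficient decreases
```

The sign of the slope is all that is needed to tell which way is downhill; its size indicates how steep the bowl is at that point.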

Once we know from the derivative which direction is downhill, we update the coefficient values. For this, a parameter known as the learning rate, alpha (α), is used. It controls how much the coefficients can change with every update.

coefficient = coefficient - (alpha * delta)

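This process is repeated until the cost stops decreasing, that is, until the derivative is zero or close enough to zero. For a cost function whose minimum value is zero, this is also the point at which the cost itself is close to 0.0. This is the procedure for the gradient descent algorithm.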
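The steps above can be put together in a short loop. This is a minimal sketch under stated assumptions: the cost f, its analytic derivative, the stopping tolerance, and the iteration cap are all hypothetical choices for illustration. The loop stops when the slope is nearly flat, which for this bowl-shaped cost is also where the cost is close to zero.

```python
def f(coefficient):
    # Hypothetical cost function with its minimum (cost 0.0) at coefficient = 3.
    return (coefficient - 3.0) ** 2

def gradient(coefficient):
    # Analytic derivative of f.
    return 2.0 * (coefficient - 3.0)

def gradient_descent(start=0.0, alpha=0.1, tolerance=1e-8, max_iterations=10_000):
    coefficient = start
    for _ in range(max_iterations):
        delta = gradient(coefficient)
        if abs(delta) < tolerance:  # slope is nearly flat: stop
            break
        coefficient = coefficient - (alpha * delta)
    return coefficient

best = gradient_descent()
final_cost = f(best)
```

With a much larger alpha the updates can overshoot the bottom of the bowl, and with a much smaller one the loop needs many more iterations, which is why the learning rate usually has to be tuned.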

Types of Gradient Descent Algorithms

There are three basic types of Gradient Descent used in modern machine learning and deep learning algorithms. They differ mainly in how much data is used to compute each update, which affects computational cost, speed, and the accuracy of each gradient estimate. The three types are:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini Batch Gradient Descent

Batch Gradient Descent

This is the basic version of Gradient Descent, in which the entire dataset is used at once to compute the cost function and its gradient. Because the whole dataset is used for a single update, computing the gradient can be very slow, and it is not feasible for datasets that do not fit in the device’s memory.

Thus, Batch Gradient Descent is typically used only for smaller datasets. When the number of training examples is large, the Stochastic and Mini Batch Gradient Descent algorithms are preferred instead.
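A minimal sketch of Batch Gradient Descent is shown below, assuming a one-coefficient model (prediction = w * x) with a mean squared error cost. The model, the data, and the hyperparameters are hypothetical and chosen only to keep the example short.

```python
def batch_gradient(w, xs, ys):
    """Gradient of the mean squared error of y = w * x over the whole dataset."""
    m = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / m

def batch_gradient_descent(xs, ys, alpha=0.05, epochs=200):
    w = 0.0
    for _ in range(epochs):
        # One update per pass: every training example contributes to the gradient.
        w = w - alpha * batch_gradient(w, xs, ys)
    return w

# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_batch = batch_gradient_descent(xs, ys)
```

Note that each call to batch_gradient touches every example, which is what makes a single update expensive on large datasets.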

Stochastic Gradient Descent

In this type of gradient descent, only one training example is processed per iteration. The first step is to shuffle the training dataset. Then, each update of the coefficients uses a single training example. This is in contrast to Batch Gradient Descent, in which the parameters (coefficients) are updated only after all the training examples have been evaluated.

Stochastic Gradient Descent (SGD) has the advantage that these frequent updates give a detailed picture of the rate of improvement. However, each update is based on a single example, so the updates are noisy, and in certain cases the large number of iterations needed to pass over the data can make it computationally expensive.
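The following sketch shows one way SGD could look for the same kind of one-coefficient model (prediction = w * x, squared error cost). The model, the data, the fixed random seed, and the hyperparameters are assumptions for illustration only.

```python
import random

def stochastic_gradient_descent(xs, ys, alpha=0.01, epochs=50, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # randomize the order of the training examples
        for x, y in data:
            # Gradient of the squared error on this single example.
            delta = 2.0 * (w * x - y) * x
            w = w - alpha * delta
    return w

# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_sgd = stochastic_gradient_descent(xs, ys)
```

Here the coefficient is updated four times per pass over the data, once per example, instead of once per pass as in Batch Gradient Descent.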

Mini Batch Gradient Descent

This approach is often faster in practice than both Batch and Stochastic Gradient Descent and is usually preferred because it combines the two. It splits the training set into several mini-batches and performs an update for each of them after calculating the gradient of that batch, so updates are frequent (as in SGD) but each gradient is averaged over several examples (as in Batch Gradient Descent).

Commonly, the batch size is somewhere between 30 and 500, but there is no fixed size, as it varies between applications. Hence, even if the training dataset is huge, this algorithm processes it in mini-batches of ‘b’ examples each. Thus, it is suitable for large datasets and needs fewer iterations than SGD.

If ‘m’ is the number of training examples, then with b == m Mini Batch Gradient Descent is identical to Batch Gradient Descent, and with b == 1 it reduces to Stochastic Gradient Descent.
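A minimal sketch of Mini Batch Gradient Descent, again assuming a one-coefficient model (prediction = w * x) with a mean squared error cost. The parameter b is the batch size; the data, the seed, and the other hyperparameters are hypothetical.

```python
import random

def minibatch_gradient_descent(xs, ys, b=2, alpha=0.05, epochs=100, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # Walk through the shuffled data in consecutive chunks of b examples.
        for start in range(0, len(data), b):
            batch = data[start:start + b]
            # Average gradient of the squared error over this mini-batch only.
            delta = sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)
            w = w - alpha * delta
    return w

# Hypothetical data generated by y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w_mini = minibatch_gradient_descent(xs, ys, b=2)
w_full = minibatch_gradient_descent(xs, ys, b=len(xs))  # b == m: one update per pass
```

Setting b to the number of training examples makes every mini-batch the whole dataset, so the second call behaves like Batch Gradient Descent.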

Variants of Gradient Descent in Machine Learning

Several other algorithms have been built on this basic form of Gradient Descent. A few of them are summarized below.

Vanilla Gradient Descent

This is the simplest form of the Gradient Descent technique. The name vanilla means plain, without any additions. Small steps are taken toward the minimum, in the direction opposite to the gradient of the cost function. As in the procedure described above, the update rule is given by,

coefficient = coefficient - (alpha * delta)

Gradient Descent with Momentum

Here, the algorithm takes the previous step into account before taking the next one. This is done by introducing a new term, which is the product of the previous change to the coefficient and a constant known as the momentum. The weight update rule is given by,

update = alpha * delta

velocity = previous_change * momentum

change = velocity - update

coefficient = coefficient + change

Here, previous_change is the value of change from the previous iteration (0.0 on the first one).
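The momentum rule can be sketched as a loop as follows. The cost function, the name change (the full step applied to the coefficient on the previous iteration), and the values of alpha and momentum are assumptions for illustration.

```python
def gradient(coefficient):
    # Derivative of the hypothetical cost (coefficient - 3)^2.
    return 2.0 * (coefficient - 3.0)

def momentum_descent(start=0.0, alpha=0.1, momentum=0.9, iterations=500):
    coefficient = start
    change = 0.0  # no previous step exists on the first iteration
    for _ in range(iterations):
        delta = gradient(coefficient)
        # Carry over a fraction of the previous step, then add the new gradient step.
        change = (momentum * change) - (alpha * delta)
        coefficient = coefficient + change
    return coefficient

best_momentum = momentum_descent()
```

Because part of each step is remembered, consecutive steps in the same direction build up speed, while steps that keep flipping direction partly cancel out.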

ADAGRAD

ADAGRAD stands for Adaptive Gradient Algorithm. As the name suggests, it adapts the learning rate of each parameter as training proceeds, which makes it well suited to sparse data. It scales each parameter’s learning rate according to the history of that parameter’s gradients: parameters that have accumulated large gradients get a smaller learning rate, so that we do not overshoot the minimum, while parameters with small or infrequent gradients keep a larger learning rate and train more quickly.
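A minimal sketch of the ADAGRAD idea for a single coefficient is given below. The cost function, the base learning rate alpha, and the small constant epsilon (added to avoid division by zero) are assumptions for illustration; with several parameters, each one would keep its own running sum.

```python
import math

def gradient(coefficient):
    # Derivative of the hypothetical cost (coefficient - 3)^2.
    return 2.0 * (coefficient - 3.0)

def adagrad(start=0.0, alpha=1.0, epsilon=1e-8, iterations=500):
    coefficient = start
    squared_sum = 0.0  # running sum of squared gradients seen so far
    for _ in range(iterations):
        delta = gradient(coefficient)
        squared_sum += delta ** 2
        # The effective learning rate shrinks as squared gradients accumulate.
        adapted_alpha = alpha / (math.sqrt(squared_sum) + epsilon)
        coefficient = coefficient - (adapted_alpha * delta)
    return coefficient

best_adagrad = adagrad()
```

Since squared_sum can only grow, the effective learning rate can only shrink over time, which is the behavior later variants try to relax.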

ADAM

Another adaptive optimization algorithm with its roots in Gradient Descent is ADAM, which stands for Adaptive Moment Estimation. It combines ideas from ADAGRAD and SGD with Momentum: it keeps a running average of past gradients (the momentum part) and a running average of past squared gradients, which it uses to scale the learning rate of each parameter (the adaptive part). In simple terms, ADAM = ADAGRAD + Momentum.
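The sketch below shows the ADAM update for a single coefficient. The cost function and the value of alpha are assumptions for illustration; beta1 = 0.9, beta2 = 0.999, and epsilon = 1e-8 are commonly used default values rather than anything this article prescribes.

```python
import math

def gradient(coefficient):
    # Derivative of the hypothetical cost (coefficient - 3)^2.
    return 2.0 * (coefficient - 3.0)

def adam(start=0.0, alpha=0.1, beta1=0.9, beta2=0.999, epsilon=1e-8, iterations=1000):
    coefficient = start
    m = 0.0  # running average of gradients (momentum part)
    v = 0.0  # running average of squared gradients (adaptive part)
    for t in range(1, iterations + 1):
        delta = gradient(coefficient)
        m = beta1 * m + (1 - beta1) * delta
        v = beta2 * v + (1 - beta2) * delta ** 2
        # Correct the bias caused by starting both averages at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        coefficient = coefficient - alpha * m_hat / (math.sqrt(v_hat) + epsilon)
    return coefficient

best_adam = adam()
```

Unlike the ever-growing sum in ADAGRAD, the average v gradually forgets old gradients, so the effective learning rate does not have to keep shrinking.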

Several other variants of Gradient Descent have been developed and continue to be developed, such as AMSGrad and AdaMax.

Conclusion

In this article, we have seen how Gradient Descent, one of the most commonly used optimization algorithms in Machine Learning, works, along with its types and the variants that have been developed from it.
