# Gradient Descent in Machine Learning: How Does it Work?

Last updated:
28th Jan, 2021

## Introduction

Optimization is one of the most crucial parts of Machine Learning. Almost every Machine Learning algorithm has an optimization algorithm at its core. Optimization is also a goal well beyond algorithms, whether in real-life decisions or in technology-based products on the market.

Many optimization algorithms are in use today in applications such as face recognition, self-driving cars, and market-based analysis. They play an equally important role in Machine Learning. One widely used optimization algorithm is Gradient Descent, which this article walks through.

In Machine Learning, Gradient Descent is one of the most widely used algorithms, yet it confuses most newcomers. Mathematically, Gradient Descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. In simple terms, it finds the values of a function's parameters (or coefficients) that make a cost function as low as possible. The cost function quantifies the error between the values a Machine Learning model predicts and the real values.

Consider a large bowl, the kind in which you would normally keep fruit or eat cereal. This bowl is the cost function (f).

Now, a random co-ordinate on any part of the surface of the bowl will be the current values of the coefficients of the cost function. The bottom of the bowl is the best set of coefficients and it is the minimum of the function.

Here, the goal is to try different values of the coefficients with each iteration, evaluate the cost, and keep the coefficients that give a lower cost. Over many iterations, the coefficients move toward the bottom of the bowl, which has the best coefficients to minimize the cost function.
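
As a minimal sketch of the bowl analogy, the Python snippet below evaluates a hypothetical bowl-shaped cost at a few hand-picked points. The function `bowl_cost` and the candidate coordinates are assumptions made for illustration only.

```python
def bowl_cost(x, y):
    # Hypothetical bowl-shaped cost: lowest at the bottom of the bowl, (0, 0).
    return x ** 2 + y ** 2


# Hand-picked coordinates on the surface of the bowl (illustrative values).
candidates = [(4.0, -3.0), (2.0, -1.5), (1.0, -0.5), (0.0, 0.0)]

# Evaluate the cost of each candidate and keep the one with the lowest cost.
costs = [bowl_cost(x, y) for x, y in candidates]
best = min(candidates, key=lambda point: bowl_cost(*point))

print(costs)
print(best)
```

The candidate closest to the bottom of the bowl has the lowest cost, which is the property gradient descent exploits.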

In this way, the Gradient Descent algorithm functions to result in minimum cost.

Gradient descent begins by assigning initial values to the coefficients of the cost function. These could be either 0.0 or small random values.

coefficient = 0.0

Next, the cost of the coefficients is obtained by applying it to the cost function and calculating the cost.

cost = f(coefficient)

Then, the derivative of the cost function is calculated using differential calculus. It gives the slope of the function at the current coefficient value. The sign of this slope tells us in which direction to move the coefficient in the next iteration to get a lower cost: a positive slope means the coefficient should decrease, and a negative slope means it should increase.

delta = derivative(f, coefficient)
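
One way to picture this step is a numerical estimate of the slope. The sketch below is only an illustration: the quadratic `cost` with its minimum at 3.0 and the central-difference helper `derivative` are assumptions, and in practice the derivative is usually worked out analytically.

```python
def cost(coefficient):
    # Hypothetical bowl-shaped cost with its minimum at coefficient = 3.0.
    return (coefficient - 3.0) ** 2


def derivative(f, x, h=1e-6):
    # Central-difference estimate of the slope of f at x.
    return (f(x + h) - f(x - h)) / (2 * h)


# Slope at the starting coefficient 0.0, left of the minimum.
delta = derivative(cost, 0.0)
print(delta)  # negative: the cost falls as the coefficient increases

# Slope to the right of the minimum has the opposite sign.
print(derivative(cost, 5.0))  # positive: the coefficient should decrease
```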

Once the derivative tells us which direction is downhill, we update the coefficient values. For this, a parameter known as the learning rate, alpha (α), is used. It controls how much the coefficients can change with every update.

coefficient = coefficient - (alpha * delta)

This process is repeated until the cost is equal to 0.0 or close enough to zero. In practice the minimum cost is often above zero, so the loop usually stops once the cost no longer decreases meaningfully. This is the procedure for the gradient descent algorithm.
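
The steps above can be put together in a short Python sketch. The quadratic `cost`, its hand-written `cost_derivative`, and the `tolerance` and `max_iterations` stopping parameters are assumptions chosen for illustration, not a definitive implementation.

```python
def cost(coefficient):
    # Hypothetical cost function with its minimum (cost 0.0) at coefficient = 3.0.
    return (coefficient - 3.0) ** 2


def cost_derivative(coefficient):
    # Analytic derivative of the cost above.
    return 2.0 * (coefficient - 3.0)


def gradient_descent(alpha=0.1, tolerance=1e-8, max_iterations=1000):
    coefficient = 0.0  # initial value
    for _ in range(max_iterations):
        if cost(coefficient) <= tolerance:
            break  # cost is close enough to zero
        delta = cost_derivative(coefficient)          # slope at the current point
        coefficient = coefficient - (alpha * delta)   # step downhill
    return coefficient, cost(coefficient)


coefficient, final_cost = gradient_descent()
print(coefficient, final_cost)
```

The `max_iterations` cap is a safeguard for cost functions whose minimum is above the tolerance, where the cost check alone would never stop the loop.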

## Types of Gradient Descent Algorithms

There are three basic types of Gradient Descent used in modern machine learning and deep learning algorithms. The major difference between them is computational cost and efficiency. They differ in the amount of data used per update, time complexity, and accuracy.

Batch Gradient Descent is the first and most basic version of Gradient Descent, in which the entire dataset is used at once to compute the cost function and its gradient. Because the entire dataset is used for a single update, calculating the gradient can be very slow, and it is not possible for datasets that exceed the device's memory capacity.

Thus, Batch Gradient Descent is used only for smaller datasets. When the number of training examples is large, it is not preferred, and the Stochastic and Mini Batch Gradient Descent algorithms are used instead.
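
A minimal sketch of Batch Gradient Descent follows, assuming a toy dataset generated by y = 2x, a model with a single coefficient `w`, and a mean squared error cost. All names and values here are illustrative assumptions.

```python
# Toy dataset (assumed for illustration): y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def batch_gradient(w, xs, ys):
    # Gradient of the mean squared error over the ENTIRE dataset.
    m = len(xs)
    return (2.0 / m) * sum((w * x - y) * x for x, y in zip(xs, ys))


def batch_gradient_descent(xs, ys, alpha=0.01, epochs=200):
    w = 0.0
    for _ in range(epochs):
        # One update per pass: every example contributes to each step.
        w = w - alpha * batch_gradient(w, xs, ys)
    return w


w = batch_gradient_descent(xs, ys)
print(w)  # approaches 2.0
```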

Stochastic Gradient Descent (SGD) is another type of gradient descent in which only one training example is processed per iteration. The first step is to shuffle the entire training dataset. Then the coefficients are updated using one training example at a time. This contrasts with Batch Gradient Descent, in which the parameters (coefficients) are updated only after all the training examples have been evaluated.

SGD has the advantage that its frequent updates give a detailed picture of the rate of improvement. However, in certain cases it may turn out to be computationally expensive, because it processes only one example per iteration, so the number of iterations can be very large.
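
The following sketch shows the one-example-per-update idea, assuming a toy dataset generated by y = 2x, a single coefficient `w`, and a squared error cost. The function name, the `seed` parameter, and the data are hypothetical.

```python
import random

# Toy dataset (assumed for illustration): y = 2 * x exactly.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def stochastic_gradient_descent(xs, ys, alpha=0.01, epochs=50, seed=0):
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # randomize the order of the training examples
        for x, y in data:
            # Gradient of the squared error for ONE example only.
            delta = 2.0 * (w * x - y) * x
            w = w - alpha * delta  # update immediately, before the next example
    return w


w = stochastic_gradient_descent(xs, ys)
print(w)  # approaches 2.0
```

Each pass over four examples performs four updates here, compared with a single update per pass in Batch Gradient Descent.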

Mini Batch Gradient Descent combines the two previous algorithms and is often faster in practice than both, which is why it is mostly preferred. It separates the training set into several mini-batches and performs an update for each mini-batch after calculating the gradient of that batch (like in SGD, but with several examples per update).

Commonly, the batch size varies between 30 and 500, but there is no fixed size, as it varies across applications. Hence, even if the training dataset is huge, this algorithm processes it in mini-batches of 'b' examples each. Thus, it is suitable for large datasets and needs fewer iterations than SGD.

If 'm' is the number of training examples, then with b == m Mini Batch Gradient Descent is equivalent to Batch Gradient Descent, and with b == 1 it is equivalent to Stochastic Gradient Descent.
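
A minimal sketch of the mini-batch scheme, assuming a toy dataset generated by y = 2x, a single coefficient `w`, and a mean squared error cost per batch. The batch size `b` follows the text; everything else is an illustrative assumption.

```python
import random

# Toy dataset (assumed for illustration): y = 2 * x exactly, so m = 4.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def mini_batch_gradient_descent(xs, ys, b, alpha=0.01, epochs=200, seed=0):
    rng = random.Random(seed)
    data = list(zip(xs, ys))
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        # Walk through the shuffled data in slices of b examples.
        for start in range(0, len(data), b):
            batch = data[start:start + b]
            # Mean squared error gradient over this mini-batch only.
            delta = (2.0 / len(batch)) * sum((w * x - y) * x for x, y in batch)
            w = w - alpha * delta
    return w


print(mini_batch_gradient_descent(xs, ys, b=2))        # mini-batches of 2
print(mini_batch_gradient_descent(xs, ys, b=len(xs)))  # b == m: one batch per pass
print(mini_batch_gradient_descent(xs, ys, b=1))        # b == 1: one example per update
```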

## Variants of Gradient Descent in Machine Learning

Several other algorithms have been developed on the basis of Gradient Descent. A few of them are summarized below.

Vanilla Gradient Descent is one of the simplest forms of the Gradient Descent technique. The name vanilla means pure, without any modification. Small steps are taken in the direction of the minimum by calculating the gradient of the cost function. As in the procedure described above, the update rule is given by,

coefficient = coefficient - (alpha * delta)

In Gradient Descent with Momentum, the algorithm takes the previous step into account before taking the next step. This is done by introducing a new term, the product of the previous update and a constant known as the momentum. The weight update rule is given by,

update = alpha * delta

velocity = previous_update * momentum

coefficient = coefficient + velocity - update
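
The update rule above can be sketched in Python as follows. This sketch assumes that `previous_update` means the total change applied to the coefficient on the previous step, which is one reading of the rule. The quadratic cost with its minimum at 3.0 is also an assumption.

```python
def cost_derivative(coefficient):
    # Derivative of a hypothetical cost (coefficient - 3.0) ** 2.
    return 2.0 * (coefficient - 3.0)


def momentum_descent(alpha=0.1, momentum=0.9, iterations=300):
    coefficient = 0.0
    previous_update = 0.0  # assumed: the change applied on the previous step
    for _ in range(iterations):
        delta = cost_derivative(coefficient)
        update = alpha * delta
        velocity = previous_update * momentum
        step = velocity - update           # total change for this iteration
        coefficient = coefficient + step   # same as coefficient + velocity - update
        previous_update = step             # remembered for the next iteration
    return coefficient


print(momentum_descent())  # approaches 3.0
```

With a momentum of 0.9 the coefficient overshoots and swings around the minimum before settling, so this sketch runs more iterations than the plain update would need on the same cost.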

ADAGRAD stands for Adaptive Gradient Algorithm. As the name says, it uses an adaptive technique to update the weights, and it is well suited to sparse data. It adapts the learning rate of each parameter during training according to how often, and by how much, that parameter has been updated. For example, parameters that have had larger gradients get a smaller learning rate so that they do not overshoot the minimum. Similarly, parameters with smaller gradients get a larger learning rate so that they train more quickly.
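
A minimal sketch of the per-parameter scaling, using the common formulation that divides the learning rate by the square root of the accumulated squared gradients. The two-parameter cost, the function names, and the `epsilon` constant are assumptions for illustration.

```python
import math


def bowl_gradient(params):
    # Gradient of a hypothetical cost 10 * x**2 + 0.1 * y**2:
    # steep along x (large gradients), shallow along y (small gradients).
    x, y = params
    return [20.0 * x, 0.2 * y]


def adagrad(gradient, start, alpha=0.5, epsilon=1e-8, iterations=100):
    params = list(start)
    squared_sum = [0.0] * len(params)  # running sum of squared gradients
    for _ in range(iterations):
        grads = gradient(params)
        for i, g in enumerate(grads):
            squared_sum[i] += g * g
            # Larger accumulated gradients give a smaller effective learning rate.
            params[i] -= alpha * g / (math.sqrt(squared_sum[i]) + epsilon)
    return params


print(adagrad(bowl_gradient, [1.0, 1.0]))  # both parameters approach 0.0
```

Although the gradients along `x` are a hundred times larger than those along `y`, the scaling lets both parameters shrink at almost the same pace in this sketch.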

Several other variants of Gradient Descent have been developed, and more continue to be developed, such as AMSGrad and AdaMax.

## Conclusion

In this article, we have seen how one of the most commonly used optimization algorithms in Machine Learning, Gradient Descent, works, along with its types and the variants that have been developed.

upGrad provides an Executive PG Programme in Machine Learning & AI and a Master of Science in Machine Learning & AI that may guide you toward building a career. These courses explain the need for Machine Learning and the further steps to gather knowledge in this domain, covering varied concepts including Gradient Descent in Machine Learning.

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.

1. Where can Gradient Descent Algorithm contribute maximally?

Optimisation is central to the quality of any machine learning algorithm. The Gradient Descent Algorithm helps minimise the cost function's error by improving the algorithm's parameters. Although Gradient Descent is used widely in Machine Learning and Deep Learning, its effectiveness depends on the quantity of data, the number of iterations, the accuracy preferred, and the amount of time available. For small-scale datasets, Batch Gradient Descent is optimal. Stochastic Gradient Descent (SGD) proves to be more efficient for detailed and more extensive data sets. In contrast, Mini Batch Gradient Descent is used for quicker optimisation.

2. What are the challenges faced in gradient descent?

Gradient Descent is the preferred way to optimise machine learning models by reducing the cost function. However, it has its shortcomings as well. If the gradient shrinks as it passes back through the layers of the model (a vanishing gradient), the iterations become less effective, because the weights and biases barely change and the model does not train fully. In the opposite case, the error gradient accumulates across layers and becomes too large to manage, which is called an exploding gradient. Infrastructure requirements, the choice of learning rate, and momentum also need to be addressed.

Convergence is when the gradient descent algorithm successfully minimises its cost function to an optimal level. The Gradient Descent Algorithm tries to minimise the cost function by adjusting the algorithm's parameters. However, it can settle at any of several optimal points, and not necessarily at the global optimum. One reason for not reaching optimal convergence is the step size. A larger step size results in more oscillations and may divert the algorithm from the global optimum. Hence, gradient descent may not always converge to the best point, but it still lands on a nearby optimal point.
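
The effect of step size can be seen on a simple example. The sketch below assumes the cost x**2 (derivative 2x), a starting point of 1.0, and three hand-picked learning rates. The function name `descend` is hypothetical.

```python
def descend(alpha, steps=20, start=1.0):
    # Gradient descent on the cost x**2, whose derivative is 2 * x.
    trajectory = [start]
    x = start
    for _ in range(steps):
        x = x - alpha * (2.0 * x)
        trajectory.append(x)
    return trajectory


small = descend(0.1)      # steady approach to the minimum at 0.0
large = descend(0.9)      # jumps across the minimum on every step
too_large = descend(1.1)  # each jump lands further away than the last

print(small[-1])
print(large[:4])
print(too_large[-1])
```

With the largest learning rate the coefficient moves away from the minimum instead of toward it, which is why the step size has to be balanced against speed.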
