
Gradient Descent in Machine Learning: How Does it Work?

Last updated: 28th Jan, 2021

Introduction

Optimization is one of the most crucial parts of Machine Learning. Almost every Machine Learning algorithm has an optimization algorithm at its core, and how well that core performs largely determines the quality of the model. Optimization matters everywhere, whether in real-life decisions or in a technology-based product in the market.

Many optimization algorithms are in use today across applications such as face recognition, self-driving cars, and market-basket analysis, and they play an equally important role in Machine Learning. One of the most widely used is the Gradient Descent algorithm, which we shall go through in this article.


What is Gradient Descent?

In Machine Learning, Gradient Descent is one of the most widely used algorithms, and yet it stupefies most newcomers. Mathematically, Gradient Descent is a first-order iterative optimization algorithm used to find a local minimum of a differentiable function. In simple terms, it finds the values of a function’s parameters (or coefficients) that make a cost function as small as possible. The cost function quantifies the error between the predicted values and the actual values of a Machine Learning model.
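
To make the idea of a cost function concrete, here is a minimal sketch in Python of the mean squared error for a one-coefficient model; the data and function names are illustrative, not part of any specific library:

import numpy as np

# Mean squared error for a one-coefficient model: prediction = coefficient * x
def cost(coefficient, x, y):
    predictions = coefficient * x
    return np.mean((predictions - y) ** 2)

x = np.array([1.0, 2.0, 3.0])   # toy inputs (illustrative)
y = np.array([2.0, 4.0, 6.0])   # toy targets, generated by y = 2x
print(cost(1.5, x, y))           # non-zero error: 1.5 is off the true value 2.0
print(cost(2.0, x, y))           # 0.0: the coefficient that minimizes the cost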

Gradient Descent Intuition

Consider a large bowl of the kind you would normally keep fruit in or eat cereal from. This bowl is a plot of the cost function (f).

A random position on the surface of the bowl represents the current values of the coefficients of the cost function. The bottom of the bowl represents the best set of coefficients, the minimum of the function.

The goal is to try different values of the coefficients on each iteration, evaluate their cost, and pick the coefficients with the better (lower) cost function value. Over multiple iterations, the values move toward the bottom of the bowl, where the coefficients minimize the cost function.

This, in essence, is how the Gradient Descent algorithm arrives at the minimum cost.

Gradient Descent Procedure

The gradient descent process begins by assigning initial values to the coefficients of the cost function. These can be zero or small random values.

coefficient = 0.0

Next, the coefficients are plugged into the cost function to obtain their cost.

cost = f(coefficient)

Then, the derivative of the cost function is calculated using differential calculus. The derivative gives the slope of the function at the current point, and the sign of that slope tells us in which direction the coefficient must be moved in the next iteration to reach a lower cost.

delta = derivative(cost)

Once the derivative tells us which direction is downhill, we need to update the coefficient values. For this, a parameter known as the learning rate, alpha (α), is used. It controls how much the coefficients change with every update.

coefficient = coefficient - (alpha * delta)

This process is repeated until the cost of the coefficients reaches 0.0, comes close enough to zero, or stops improving. That is the entire gradient descent procedure.
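
Putting the four steps together, here is a minimal runnable sketch in Python. It assumes a one-dimensional cost function f(coefficient) = coefficient², whose derivative 2 * coefficient we can write by hand; the names mirror the pseudocode above rather than any real library:

# Gradient descent on an assumed cost function f(x) = x^2 (derivative: 2x)
def cost(coefficient):
    return coefficient ** 2

def derivative(coefficient):
    return 2 * coefficient

coefficient = 5.0   # initial value; could also be 0.0 or a small random number
alpha = 0.1         # learning rate

for step in range(50):
    delta = derivative(coefficient)            # slope at the current point
    coefficient = coefficient - alpha * delta  # move downhill
    print(step, coefficient, cost(coefficient))

After a few dozen iterations the printed cost approaches zero, which is exactly the stopping condition described above.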

Types of Gradient Descent Algorithms

There are three basic types of Gradient Descent used in modern machine learning and deep learning algorithms. The major difference between the three is their computational cost and efficiency. Depending on the amount of data used, the time complexity, and the accuracy required, the three types are:

  1. Batch Gradient Descent
  2. Stochastic Gradient Descent
  3. Mini Batch Gradient Descent

Batch Gradient Descent

This is the first and most basic version of Gradient Descent, in which the entire dataset is used at once to compute the cost function and its gradient. Because the whole dataset is processed for a single update, the gradient calculation can be very slow, and it is not feasible for datasets that exceed the device’s memory capacity.

Thus, Batch Gradient Descent is suited only to smaller datasets. When the number of training examples is large, it is not preferred; the Stochastic and Mini Batch Gradient Descent algorithms are used instead.
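
As a hedged illustration, here is one way to write the batch update for simple linear regression in Python with NumPy; the toy data and variable names are made up for this example. Note that the gradient is averaged over all training examples before a single update is made:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)           # toy inputs (illustrative)
y = 2 * x + 1 + rng.normal(0, 0.1, 100)   # toy targets: y ≈ 2x + 1 with noise

w, b, alpha = 0.0, 0.0, 0.5

for epoch in range(200):
    error = (w * x + b) - y
    grad_w = 2 * np.mean(error * x)   # MSE gradient over ALL examples
    grad_b = 2 * np.mean(error)
    w -= alpha * grad_w
    b -= alpha * grad_b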

Stochastic Gradient Descent

In this type of gradient descent, only one training example is processed per iteration. The first step is to shuffle the entire training dataset; then a single training example at a time is used to update the coefficients. This is in contrast to Batch Gradient Descent, where the parameters (coefficients) are updated only after all the training examples have been evaluated.

Stochastic Gradient Descent (SGD) has the advantage that these frequent updates give a detailed picture of the rate of improvement. However, it can turn out to be computationally expensive: because only one example is processed per iteration, the total number of iterations can become very large.
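
A hedged sketch of SGD on the same kind of toy linear-regression data as above (again, the data and names are illustrative): the dataset is shuffled each epoch, and the coefficients are updated after every single example:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)           # toy inputs (illustrative)
y = 2 * x + 1 + rng.normal(0, 0.1, 100)   # toy targets: y ≈ 2x + 1 with noise

w, b, alpha = 0.0, 0.0, 0.1

for epoch in range(20):
    for i in rng.permutation(len(y)):     # shuffle, then one example per update
        error = (w * x[i] + b) - y[i]
        w -= alpha * 2 * error * x[i]     # gradient from a single example
        b -= alpha * 2 * error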

Mini Batch Gradient Descent

This algorithm is often faster than both Batch and Stochastic Gradient Descent, and it is usually the preferred choice because it combines the two. It splits the training set into several mini-batches and, after computing the gradient of each batch, performs one update per batch (as in SGD).

Commonly, the batch size varies between about 30 and 500, but there is no fixed size; it differs from application to application. Even a huge training dataset can be processed in ‘b’ mini-batches, which makes this algorithm suitable for large datasets while needing fewer iterations.

If ‘m’ is the number of training examples, then setting b == m makes Mini Batch Gradient Descent equivalent to the Batch Gradient Descent algorithm.
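
A hedged mini-batch sketch on the same illustrative toy data; the gradient is averaged within each batch, and setting batch_size equal to the dataset size would recover batch gradient descent:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)           # toy inputs (illustrative)
y = 2 * x + 1 + rng.normal(0, 0.1, 100)   # toy targets: y ≈ 2x + 1 with noise

w, b, alpha, batch_size = 0.0, 0.0, 0.1, 32

for epoch in range(50):
    order = rng.permutation(len(y))
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]
        error = (w * x[idx] + b) - y[idx]
        w -= alpha * 2 * np.mean(error * x[idx])  # gradient averaged per batch
        b -= alpha * 2 * np.mean(error)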

Variants of Gradient Descent in Machine Learning

With Gradient Descent as the basis, several further algorithms have been developed from it. A few of them are summarized below.

Vanilla Gradient Descent

This is the simplest form of the Gradient Descent technique; the name vanilla means plain, without any additions. Small steps are taken in the direction of the minima by calculating the gradient of the cost function. As in the procedure above, the update rule is given by,

coefficient = coefficient - (alpha * delta)

Gradient Descent with Momentum

Here, the algorithm takes the previous steps into account before making the next one. This is done by introducing a new term that is the product of the previous update and a constant known as the momentum. The weight update rule becomes,

update = alpha * delta

velocity = previous_update * momentum

coefficient = coefficient + velocity - update
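
A hedged sketch of this rule in Python, reusing the assumed cost f(x) = x² from earlier; the helper name, and the choice to carry the full step forward as the ‘previous update’, are my interpretation of the rule above rather than something stated in the article:

def momentum_step(coefficient, delta, previous_update, alpha=0.1, momentum=0.9):
    update = alpha * delta
    velocity = previous_update * momentum   # keep a fraction of the last step
    step = velocity - update
    return coefficient + step, step         # 'step' is next round's previous_update

coefficient, prev = 5.0, 0.0
for _ in range(100):
    delta = 2 * coefficient                 # derivative of assumed f(x) = x^2
    coefficient, prev = momentum_step(coefficient, delta, prev)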

ADAGRAD

The term ADAGRAD stands for Adaptive Gradient Algorithm. As the name says, it uses an adaptive technique to update the weights, and it is especially well suited to sparse data. The optimizer adjusts its learning rates according to the parameter updates seen during training: parameters that have accumulated larger gradients are given a smaller learning rate, so that we do not overshoot the minimum, while parameters with smaller gradients keep a larger learning rate and train more quickly.
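
A hedged sketch of a single ADAGRAD update for one coefficient; grad_sq_sum is the running sum of squared gradients, and the helper name and epsilon term are illustrative:

import numpy as np

def adagrad_step(coefficient, delta, grad_sq_sum, alpha=0.1, eps=1e-8):
    grad_sq_sum = grad_sq_sum + delta ** 2              # accumulate history
    step = alpha * delta / (np.sqrt(grad_sq_sum) + eps)
    return coefficient - step, grad_sq_sum              # bigger history, smaller step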

ADAM 

Yet another adaptive optimization algorithm that has its roots in Gradient Descent is ADAM, which stands for Adaptive Moment Estimation. It combines the ideas of ADAGRAD and SGD with Momentum, building on ADAGRAD’s per-parameter learning rates and adding a moving average of the gradient. In simple terms, ADAM = ADAGRAD + Momentum.
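
Here is a minimal sketch of the standard ADAM update for a single coefficient, following the published algorithm (Kingma and Ba); m tracks the momentum-style first moment, v the ADAGRAD-style second moment, and t counts steps starting from 1 for bias correction:

import numpy as np

def adam_step(coefficient, delta, m, v, t, alpha=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * delta          # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * delta ** 2     # second moment (ADAGRAD-like)
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return coefficient - alpha * m_hat / (np.sqrt(v_hat) + eps), m, v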

Beyond these, several other variants of Gradient Descent have been developed, and more continue to appear, such as AMSGrad and ADAMax.

Conclusion

In this article, we have looked at one of the most commonly used optimization algorithms in Machine Learning, Gradient Descent, along with its types and the variants that have been developed from it.

upGrad provides an Executive PG Programme in Machine Learning & AI and a Master of Science in Machine Learning & AI that may guide you toward building a career. These courses explain the need for Machine Learning and cover varied concepts in this domain, from Gradient Descent onwards.

Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast-moving orgs. Working on solving problems of scale and long-term technology strategy.

Frequently Asked Questions (FAQs)

1. Where can the Gradient Descent Algorithm contribute maximally?

Optimisation is integral to the performance of any machine learning algorithm. The Gradient Descent Algorithm helps minimise cost-function error and improve the algorithm’s parameters. Although Gradient Descent is used widely in Machine Learning and Deep Learning, its effectiveness depends on the quantity of data, the number of iterations, the accuracy required, and the amount of time available. For small-scale datasets, Batch Gradient Descent is optimal. Stochastic Gradient Descent (SGD) proves more efficient for larger and more detailed datasets, while Mini Batch Gradient Descent is used for quicker optimisation.

2. What are the challenges faced in gradient descent?

Gradient Descent is preferred for optimising machine learning models because it reduces the cost function. However, it has its shortcomings. If the gradient shrinks as it passes back through the model’s layers, the iterations lose their effect because the model can no longer meaningfully update its weights and biases; this is known as a vanishing gradient. Conversely, the error gradient can accumulate until the weight and bias updates become too large to manage, which is called an exploding gradient. Infrastructure requirements, the balance of the learning rate, and momentum also need to be addressed.

3. Does gradient descent always converge?

Convergence is when the gradient descent algorithm successfully minimises its cost function to an optimal level. The algorithm adjusts its parameters to minimise the cost function, but it can land on a local optimum rather than the global one. One reason for poor convergence is the step size: a larger step size causes more oscillation and may divert the search away from the global optimum. Hence, gradient descent may not always converge to the best possible point, but it usually lands on the nearest optimum.
