DA 5230 – Statistical & Machine Learning
Lecture 5 – Gradient Descent
Maninda Edirisooriya
manindaw@uom.lk
Linear Regression
• In its generic form, Multiple Linear Regression is
• Used when the X variables are linearly related to the Y variable
• An attempt to represent the data points with a linear hyperplane (e.g.: a flat plane in 2D),
• Denoted by Y = β0 + β1*X1 + β2*X2 + ... + βn*Xn
• The ML problem is to find the coefficients (βi values) of this hyperplane so that
the error (the total of the distances from the data points to the hyperplane) is
minimized
• We use the Mean Squared Error to represent this error
• We can use polynomials of the Xi as variables to represent non-linear
relationships between the Xi and Y
Linear Regression Method
• As Linear Regression is a function of the parameters, fβ(X) = Ŷ, we have to
find β so that the error ε (= Y − Ŷ) is minimized
• There are two ways to computationally minimize this error and find the
parameters
• In the Closed Form, the Normal Equation can directly find the parameter values (β
values) from the matrix formula, β = (XᵀX)⁻¹XᵀY
• Using the iterative technique, Gradient Descent
• In this lesson we learn about Gradient Descent because
• The Normal Equation is computationally expensive, as it requires inverting
matrices for large datasets
• Gradient Descent has algorithm-related parameters (hyperparameters) that can
be tuned to reach a stable solution, unlike the Normal Equation
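As an illustration of the closed-form route, here is a minimal NumPy sketch (not from the slides; the synthetic data and variable names are made up for the example) that fits a small dataset with the Normal Equation:

```python
import numpy as np

# Hypothetical example: solve a small linear regression with the Normal Equation.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))              # 100 data points, 2 features
X = np.column_stack([np.ones(100), X])     # prepend a column of 1s for the intercept β0
true_beta = np.array([1.0, 2.0, -3.0])
Y = X @ true_beta + rng.normal(scale=0.1, size=100)

# β = (XᵀX)⁻¹ XᵀY  (np.linalg.solve is preferred over forming an explicit inverse)
beta = np.linalg.solve(X.T @ X, X.T @ Y)
print(beta)                                # should be close to [1, 2, -3]
```

Even in this sketch, the cost of solving the system grows quickly with the number of features, which is one reason the iterative approach below is preferred for large problems.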
Gradient Descent – Simple Linear Regression
• In the simplest form of Linear Regression we have fβ(X) = β0 + β1*X1
where β1 is the gradient and β0 is the intercept of a straight line
• If we visualize how the error J(β) (also known as the Cost) varies with
β0 and β1, we get a 3D graph like the following
Gradient Descent – Simple Linear Regression
• In the Gradient Descent algorithm we first assign initial values to the
parameters β0 and β1 in some way. For example,
1. We can assign random values to β0 and β1 – known as Random Initialization
2. We can assign 0 (zero) values to β0 and β1 – known as Zero Initialization
• Then we iteratively move toward the lowest-cost point
Gradient Descent
• As it is difficult to explain this 3D scenario, let's assume we want to
minimize the cost function J(β) with respect to a single weight, β
(Figure: the cost J(β) plotted against the single parameter β)
Gradient Descent
• As we iteratively move toward the minimum-cost point, the gradient
(slope of the curve) decreases and approaches zero
• The gradient of a function is its derivative
• Therefore, the slope at β is dJ(β)/dβ
• But in reality there is more than one β, e.g. β0 and β1
• Therefore, we have to use the partial derivative, where the slope at β is ∂J(β)/∂β
Gradient Descent
• When the slope is positive, the current β value is higher than the
optimal (least-cost) value of β
• In that case we have to subtract some amount from the current β to bring it
to the optimal value
• How much should be subtracted from β?
• It is better to use a value proportional to the derivative, ∂J(β)/∂β
• But that amount should be sufficiently small too
• Otherwise, the new β will overshoot the optimal β
• For that we use a pre-defined small constant value α, known as the Learning Rate
• So we subtract the product of these two values: α·∂J(β)/∂β
Gradient Descent
• Now we have the Gradient Descent parameter update formula, to
be applied in each iteration (epoch),
β := β − α·∂J(β)/∂β
where α is a small value, like 0.01
• Once we have initialized β we can iteratively update its value until the
cost function shows no significant reduction
• Finally, we can use the value of β as the solution of Linear Regression
• The same formula can be used when there is more than one parameter,
taking β as the vector of all parameters β0, β1, …, βn
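To make the update rule concrete, here is a minimal one-parameter sketch (illustrative only, not from the slides): it minimizes the simple quadratic cost J(β) = (β − 3)², whose derivative is dJ/dβ = 2(β − 3), using exactly the rule β := β − α·dJ/dβ.

```python
# Toy example: minimize J(β) = (β - 3)² with gradient descent.
beta = 0.0          # zero initialization
alpha = 0.1         # learning rate

for epoch in range(50):
    gradient = 2 * (beta - 3)       # dJ/dβ
    beta = beta - alpha * gradient  # β := β − α·dJ/dβ

print(beta)         # converges towards the minimum at β = 3
```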
Gradient Descent – Derivative of Cost
In Linear Regression (which is what we discuss in this lesson), we use a slightly
different version of the Mean Squared Error (MSE) as the Cost Function, J(β):

J(β) = (1/2) Σi=1..n (Ŷi − Yi)²

where n is the number of data points
(This is why you get a convex, bowl-like shape for Simple Linear Regression, where
there are 2 parameters)

Let's find the derivative of the Cost with respect to any parameter βj:

∂J(β)/∂βj = 2 · (1/2) · Σi=1..n (Ŷi − Yi) · ∂(Ŷi − Yi)/∂βj   (from the chain rule of differentiation)

          = Σi=1..n (Ŷi − Yi) · ∂/∂βj (β0 + β1·Xi,1 + β2·Xi,2 + … + βj·Xi,j + … + βn·Xi,n − Yi)

          = Σi=1..n (Ŷi − Yi) · Xi,j
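As a sanity check of this derivation, the following sketch (illustrative, assuming NumPy and a small synthetic dataset) compares the derived gradient Σi(Ŷi − Yi)·Xi,j with a finite-difference approximation of J(β):

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(20), rng.normal(size=(20, 2))])  # intercept column + 2 features
Y = rng.normal(size=20)
beta = rng.normal(size=3)

def cost(b):
    # J(β) = ½ Σ (Ŷi − Yi)²
    return 0.5 * np.sum((X @ b - Y) ** 2)

analytic = X.T @ (X @ beta - Y)            # Σ (Ŷi − Yi)·Xi,j for every j

eps = 1e-6
numeric = np.array([
    (cost(beta + eps * np.eye(3)[j]) - cost(beta - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.allclose(analytic, numeric, atol=1e-4))   # True: the two gradients agree
```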
Gradient Descent – Update Rule
• Parameter update rule for parameter βj, where n is the total number
of data points,

βj := βj − α·∂J(β)/∂βj

βj := βj − α Σi=1..n (Ŷi − Yi)·Xi,j
Gradient Descent – Algorithm (Summary)
• Initialize the βj parameters
• Assign a small value to the Learning Rate α (e.g.: 0.01)
• In each epoch, apply the parameter update rule for every parameter βj
(where n is the number of data points),

βj := βj − α Σi=1..n (Ŷi − Yi)·Xi,j

• Stop the repetition when the reduction of the cost function becomes negligible
• Now you can use the βj values to predict Ŷ values for new X values
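A minimal NumPy sketch of this summarized algorithm might look as follows; the function name batch_gradient_descent and the stopping tolerance are illustrative, and X is assumed to already contain a leading column of 1s for the intercept β0.

```python
import numpy as np

def batch_gradient_descent(X, Y, alpha=0.01, epochs=1000, tol=1e-8):
    """Minimal sketch of the summarized algorithm (assumes X includes a column of 1s)."""
    beta = np.zeros(X.shape[1])                 # zero initialization
    prev_cost = np.inf
    for _ in range(epochs):
        errors = X @ beta - Y                   # Ŷi − Yi for every data point
        beta -= alpha * (X.T @ errors)          # βj := βj − α Σ (Ŷi − Yi)·Xi,j
        cost = 0.5 * np.sum(errors ** 2)        # J(β) before this update
        if prev_cost - cost < tol:              # stop when the reduction is negligible
            break
        prev_cost = cost
    return beta
```

Note that, because the update sums over all n data points, in practice the learning rate often has to be scaled down (or the gradient averaged over n) when n is large.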
Gradient Descent – Convergence
• In each epoch the cost is reduced at a
decreasing rate if the process is
Convergent (approaching a certain lower
error level)
• After a large number of epochs the cost
reduction becomes insignificant and the
cost stabilizes around a certain value
• Linear Regression is always Convergent
when a proper learning rate is used, as
there are no multiple local minima (i.e.
no more than one point where the cost
is minimized)
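One way to observe this behaviour is to record the cost after every epoch; the sketch below (synthetic data and illustrative values, not from the slides) shows the cost dropping quickly at first and then levelling off around a stable value.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
Y = 2.0 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

alpha, beta = 0.005, np.zeros(2)
costs = []
for epoch in range(500):
    errors = X @ beta - Y
    costs.append(0.5 * np.sum(errors ** 2))   # record J(β) for this epoch
    beta -= alpha * (X.T @ errors)

print(costs[0], costs[100], costs[-1])        # large drop early, tiny changes later
```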
Cost Function – 2D Visualization
• As the cost function of Simple Linear Regression, J(β), needs a 3D
visualization, we need a way to view it as a 2D image
• Contour Curves are a way of converting a 3D visualization to 2D
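For example, with matplotlib one could draw such contour curves of J(β0, β1) over a grid of parameter values; the dataset and grid ranges below are made up for the illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
x = rng.normal(size=30)
y = 1.0 + 2.0 * x + rng.normal(scale=0.2, size=30)

b0, b1 = np.meshgrid(np.linspace(-2, 4, 200), np.linspace(-1, 5, 200))
# J(β0, β1) = ½ Σ (β0 + β1·xi − yi)² evaluated at every grid point
J = 0.5 * ((b0[..., None] + b1[..., None] * x - y) ** 2).sum(axis=-1)

plt.contour(b0, b1, J, levels=30)   # each curve joins points of equal cost
plt.xlabel("β0"); plt.ylabel("β1")
plt.show()
```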
Cost Function – Effect of Learning Rate
• The learning rate is a hyperparameter that
has to be set manually, making sure that,
• The model converges to a solution
• i.e.: it should not diverge
• The training time is kept low
• The final cost is kept low
• Too large a Learning Rate has a higher
tendency to diverge
• Too small a Learning Rate trains slowly
• Hence, we have to find an optimum rate
Cost Function – Effect of Learning Rate
• The Learning Rate is a compromise: faster convergence comes with higher risk
• Depending on the situation, a higher learning rate may converge faster, or
convergence may even slow down due to stronger oscillation, or the process may
diverge altogether
• On the other hand, a lower learning rate converges slowly but is much more
likely to converge
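The following sketch (synthetic data, illustrative learning rates) compares a too-small, a moderate, and a too-large learning rate on the same problem: the small one is still far from the minimum after the given number of epochs, the moderate one converges, and the large one blows up.

```python
import numpy as np

rng = np.random.default_rng(4)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
Y = 2.0 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=50)

def final_cost(alpha, epochs=200):
    beta = np.zeros(2)
    for _ in range(epochs):
        errors = X @ beta - Y
        beta -= alpha * (X.T @ errors)
    return 0.5 * np.sum((X @ beta - Y) ** 2)

for alpha in (0.0001, 0.01, 0.05):
    print(alpha, final_cost(alpha))   # slow / converged / diverged
```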
Batch Gradient Descent
• The iteration step we have already learned is Batch Gradient Descent
• The update rule is,
• In each iteration (epoch)
• βj := βj − α Σi=1..n (Ŷi − Yi)·Xi,j
• Here we use the whole dataset (the Batch) of size n in each epoch
• Very good at updating in the correct direction in each epoch
• But very computationally expensive, as the whole batch of size n is
iterated over inside each epoch
Stochastic Gradient Descent (SGD)
• Instead of the whole batch, a single data point is used to update βj at a time
• The update rule is,
• In each epoch,
• For each data point i
• βj := βj − α (Ŷi − Yi)·Xi,j
• As every data point triggers an update, convergence is faster for
larger datasets (e.g.: 100000 data points)
• As an individual data point can deviate a lot from the overall distribution,
each update may not move in the correct direction
• It will not settle stably at a certain minimum cost, as the cost keeps
changing with every update
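A minimal SGD sketch might look as follows (illustrative; the function name and the random shuffling per epoch are assumptions, and X is assumed to include the intercept column of 1s).

```python
import numpy as np

def sgd(X, Y, alpha=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    n = len(Y)
    for _ in range(epochs):
        for i in rng.permutation(n):            # visit data points in random order
            error = X[i] @ beta - Y[i]          # Ŷi − Yi for a single point
            beta -= alpha * error * X[i]        # βj := βj − α (Ŷi − Yi)·Xi,j
    return beta
```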
Mini-Batch Gradient Descent
• This is a balance between Batch Gradient Descent and Stochastic
Gradient Descent
• The update rule is,
• In each iteration (epoch)
• For each of the mini-batches (there are n/m of them)
• βj := βj − α Σi=1..m (Ŷi − Yi)·Xi,j
• Here n is the size of the full dataset (batch) and m is the mini-batch size
• m is in general 64, 128, 256, 512 or 1024
• As m >> 1, the gradient moves in a more correct direction in each
update, and settles much closer to the optimum point than in SGD
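A minimal mini-batch sketch (illustrative, with the same assumptions about X as above) could look like this, with m data points used per update:

```python
import numpy as np

def minibatch_gd(X, Y, alpha=0.01, m=64, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    n = len(Y)
    for _ in range(epochs):
        order = rng.permutation(n)               # shuffle, then split into mini-batches
        for start in range(0, n, m):
            idx = order[start:start + m]
            errors = X[idx] @ beta - Y[idx]      # Ŷi − Yi over one mini-batch
            beta -= alpha * (X[idx].T @ errors)  # βj := βj − α Σ over the mini-batch
    return beta
```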
Convergence Patterns (Summary)
One Hour Homework
• Officially we have one more hour of work after the end of the lectures
• Therefore, for this week’s extra hour you have homework
• Gradient Descent is the core learning algorithm in almost all of the ML ahead,
including in the Deep Learning related subject modules
• Go through the slides until you clearly understand Gradient Descent
• Refer to external sources to clarify all the ambiguities related to it
• Good Luck!
Questions?