Gradient Descent Optimization
SKKU Data Mining Lab
Hojin Yang
Index
Gradient Descent Method – batch, mini-batch, stochastic method
Problem case of GD
Gradient Descent Optimization – momentum, Adagrad, RMSprop, Adam
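Intro
Data (experience):
X      Y
2      4
3      6
4      7.5
5      10
3.2    6.5
10     20
11     23
20     40
Hypothesis (task): $h_\theta(x) = \theta x$
Loss function (performance measure): $J(\theta) = \frac{1}{2 \cdot 8} \sum (h_\theta(x) - y)^2$
[Figure: $J(\theta)$ plotted against $\theta$, with an axis tick at $\theta = 2$]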
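Gradient Descent Method
A first-order iterative optimization algorithm for finding a minimum of a loss function.
It takes steps proportional to the negative of the gradient of the function at the current point:
$\theta := \theta - \eta \cdot \nabla_\theta J(\theta)$
$\eta$ : learning rate
$J(\theta)$ : loss function
$\nabla_\theta J(\theta)$ : gradient of the loss with respect to $\theta$
[Figure: $J(\theta)$ plotted against $\theta$, with an axis tick at $\theta = 2$]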
Gradient Descent Method
• Batch gradient descent: use all m examples in each iteration
• Stochastic gradient descent: use 1 example in each iteration
• Mini-batch gradient descent: use b examples in each iteration
For the eight (X, Y) pairs from the Intro (m = 8):
$J(\theta) = \frac{1}{2 \cdot 8} \sum (\theta x - y)^2$
$J'(\theta) = \frac{1}{8} \sum (\theta x - y) \cdot x$
Batch gradient descent
Every update uses all 8 examples:
$J(\theta) = \frac{1}{2 \cdot 8} \sum (\theta x - y)^2$
$J'(\theta) = \frac{1}{8} \sum (\theta x - y) \cdot x$
$\theta := \theta - \eta \cdot J'(\theta)$
$\theta := \theta - \eta \cdot \frac{1}{8} \{ (2\theta - 4) \cdot 2 + (3\theta - 6) \cdot 3 + \cdots + (20\theta - 40) \cdot 20 \}$
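The batch update can be sketched in a few lines of Python. This is only an illustration: the starting value, learning rate and number of steps are assumptions, not values from the slides, and the names `batch_gradient` and `batch_gd` are hypothetical.

```python
# Data from the slides: eight (x, y) pairs.
xs = [2, 3, 4, 5, 3.2, 10, 11, 20]
ys = [4, 6, 7.5, 10, 6.5, 20, 23, 40]


def batch_gradient(theta):
    """J'(theta) = (1/m) * sum((theta * x - y) * x) over all m examples."""
    m = len(xs)
    return sum((theta * x - y) * x for x, y in zip(xs, ys)) / m


def batch_gd(theta=0.0, lr=0.01, steps=100):
    # theta, lr and steps are illustrative choices (assumptions).
    for _ in range(steps):
        theta -= lr * batch_gradient(theta)
    return theta


theta_batch = batch_gd()

# Closed-form minimizer of J for comparison: sum(x * y) / sum(x * x).
theta_exact = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

Because every step uses all eight examples, the sequence of `theta` values is deterministic and settles on the closed-form minimizer.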
Stochastic gradient descent
One example (x, y) is randomly selected at each iteration, and the gradient is computed from that example alone:
$J'(\theta) = (\theta x - y) \cdot x$
$\theta := \theta - \eta \cdot J'(\theta)$
For example, when (3.2, 6.5) is selected:
$\theta := \theta - \eta \cdot (3.2\theta - 6.5) \cdot 3.2$
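A minimal sketch of the stochastic update, assuming a fixed random seed, a small learning rate and a step count that are not taken from the slides (`sgd` is a hypothetical helper name):

```python
import random

# Data from the slides: eight (x, y) pairs.
xs = [2, 3, 4, 5, 3.2, 10, 11, 20]
ys = [4, 6, 7.5, 10, 6.5, 20, 23, 40]


def sgd(theta=0.0, lr=0.001, steps=2000, seed=0):
    # lr, steps and seed are illustrative assumptions.
    rng = random.Random(seed)
    for _ in range(steps):
        i = rng.randrange(len(xs))          # one randomly selected example
        x, y = xs[i], ys[i]
        theta -= lr * (theta * x - y) * x   # J'(theta) = (theta * x - y) * x
    return theta


theta_sgd = sgd()
```

Each update pulls `theta` toward the ratio y / x of the selected example, so the result keeps fluctuating around 2 instead of settling on a single value.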
Mini-batch gradient descent
b examples are randomly selected at each iteration (here b = 2), and the gradient is averaged over them:
$J'(\theta) = \frac{1}{b} \sum (\theta x - y) \cdot x$
$\theta := \theta - \eta \cdot J'(\theta)$
For example, when (4, 7.5) and (3.2, 6.5) are selected:
$\theta := \theta - \eta \cdot \frac{1}{2} \{ (4\theta - 7.5) \cdot 4 + (3.2\theta - 6.5) \cdot 3.2 \}$
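The mini-batch rule can be sketched the same way. The function name `minibatch_gd`, the seed, the learning rate and the step count are assumptions made for illustration; setting `b` to the dataset size turns it into batch gradient descent.

```python
import random

# Data from the slides: eight (x, y) pairs.
xs = [2, 3, 4, 5, 3.2, 10, 11, 20]
ys = [4, 6, 7.5, 10, 6.5, 20, 23, 40]


def minibatch_gd(theta=0.0, lr=0.001, b=2, steps=2000, seed=0):
    # lr, steps and seed are illustrative assumptions.
    rng = random.Random(seed)
    for _ in range(steps):
        batch = rng.sample(range(len(xs)), b)   # b distinct examples
        grad = sum((theta * xs[i] - ys[i]) * xs[i] for i in batch) / b
        theta -= lr * grad
    return theta


theta_mini = minibatch_gd(b=2)
theta_full = minibatch_gd(b=len(xs))   # all examples: batch gradient descent
```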
Gradient Descent Method
[Figure sequence: $J(\theta) = \frac{1}{2 \cdot 8} \sum (\theta x - y)^2$ plotted against $\theta$ (axis tick at $\theta = 2$) for the same eight (X, Y) pairs. The first two frames are unlabeled; the last four are labeled "Stochastic gradient descent".]
https://cdnpythonmachinelearning.azureedge.net/wp-content/uploads/2017/09/GD-v-SGD.png?x64257
https://www.safaribooksonline.com/library/view/hands-on-machine-learning/9781491962282/assets/mlst_0410.png
Gradient Descent Method
Cost of computing the gradient at every iteration, for m training examples:
Batch: $\mathcal{O}(m)$
Mini-batch (with batch size b): $\mathcal{O}(b)$
Stochastic: $\mathcal{O}(1)$
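Python class
class SGD:
    """Plain gradient descent step: params <- params - lr * grads."""

    def __init__(self, lr=0.01):
        self.lr = lr  # learning rate (eta)

    def update(self, params, grads):
        # params and grads are dicts keyed by parameter name
        for key in params.keys():
            params[key] -= self.lr * grads[key]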
Figure: $w = \frac{x^2}{20} + y^2$, learning rate = 0.95, iter = 30
https://github.com/WegraLee
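As a usage sketch, the SGD class can be run on $w = \frac{x^2}{20} + y^2$ with learning rate 0.95 for 30 iterations. The starting point (-7, 2) and the helper `grad_w` are assumptions; the slide gives only the function, the learning rate and the iteration count.

```python
class SGD:
    """Same update rule as the SGD class in this deck."""

    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, params, grads):
        for key in params.keys():
            params[key] -= self.lr * grads[key]


def grad_w(x, y):
    """Gradient of w = x**2 / 20 + y**2 (hypothetical helper)."""
    return {"x": x / 10.0, "y": 2.0 * y}


params = {"x": -7.0, "y": 2.0}   # assumed starting point
optimizer = SGD(lr=0.95)
path = [(params["x"], params["y"])]

for _ in range(30):
    optimizer.update(params, grad_w(params["x"], params["y"]))
    path.append((params["x"], params["y"]))
```

With this learning rate each step multiplies y by 1 - 0.95 * 2 = -0.9, so y changes sign on every iteration while x shrinks by only about 10% per step. That produces a zigzag path along the steep direction.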
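Gradient Descent Problem
data   X1   X2    Y
#1     1    100   10
#2     2    200   20
#3     3    300   30
$h_\theta(x) = x_1 \theta_1 + x_2 \theta_2$
$J(\theta) = \frac{1}{3} \sum (h_\theta(x) - y)^2$
$J(\theta) = \frac{1}{3} \{ (1 \cdot \theta_1 + 100 \cdot \theta_2 - 10)^2 + (2 \cdot \theta_1 + 200 \cdot \theta_2 - 20)^2 + (3 \cdot \theta_1 + 300 \cdot \theta_2 - 30)^2 \}$
$= \frac{1}{3} \{ 14 \cdot \theta_1^2 + \cdots + 140000 \cdot \theta_2^2 + \cdots \}$
The $\theta_2^2$ coefficient is 10,000 times the $\theta_1^2$ coefficient, so the loss is far steeper along $\theta_2$ than along $\theta_1$.
[Figure: $J(\theta)$ plotted against $\theta_1$ and against $\theta_2$]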
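The two squared-term coefficients in the expansion can be checked directly from the data. The variable names below are illustrative only.

```python
# Rows of the table: (X1, X2, Y).
data = [(1, 100, 10), (2, 200, 20), (3, 300, 30)]

# Coefficient of theta1**2 inside the braces: sum of X1**2.
coef_theta1_sq = sum(x1 ** 2 for x1, _, _ in data)

# Coefficient of theta2**2 inside the braces: sum of X2**2.
coef_theta2_sq = sum(x2 ** 2 for _, x2, _ in data)

# How much steeper the loss is along theta2 than along theta1.
ratio = coef_theta2_sq / coef_theta1_sq
```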
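Each step is -1 ∙ slope ∙ learning rate, from $\theta := \theta - \eta \cdot \nabla_\theta J(\theta)$, with the same $\eta$ for every parameter.
iter 1:
  slope: -10    step: $10 \cdot \eta$
  slope: -0.1   step: $0.1 \cdot \eta$
iter 2:
  slope: 15     step: $-15 \cdot \eta$
  slope: -0.05  step: $0.05 \cdot \eta$
In the steep direction the slope changes sign between iterations, so the step overshoots the minimum. In the shallow direction the steps stay small.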
Figure: $w = \frac{x^2}{20} + y^2$, learning rate = 0.95
Figure: $w = \frac{x^2}{20} + y^2$, learning rate = 1.01
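The difference between the two learning rates can be reproduced with a short sketch. The starting point (-7, 2) and the step count are assumptions; `run_gd` is a hypothetical helper.

```python
def run_gd(lr, steps=30, start=(-7.0, 2.0)):
    """Plain gradient descent on w = x**2 / 20 + y**2."""
    x, y = start
    for _ in range(steps):
        x -= lr * (x / 10.0)   # dw/dx = x / 10
        y -= lr * (2.0 * y)    # dw/dy = 2 * y
    return x, y


x_095, y_095 = run_gd(0.95)
x_101, y_101 = run_gd(1.01)
```

Each step multiplies y by 1 - 2 * lr: -0.9 for a learning rate of 0.95 (oscillating but shrinking) and -1.02 for 1.01 (oscillating and growing). The steep direction therefore limits how large the learning rate can be.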
Feature Scaling
data   X1   X2    Y
#1     1    100   10
#2     2    200   20
#3     3    300   30
Before scaling: $1 \le X_1 \le 3$, $100 \le X_2 \le 300$
After scaling: $0 \le X_1 \le 1$, $0 \le X_2 \le 1$
https://stats.stackexchange.com/questions/111467/is-it-necessary-to-scale-the-target-value-in-addition-to-scaling-features-for-re
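One way to get both features into the range 0 to 1 is min-max scaling. The slide does not name the method, so this sketch is an assumption, and `min_max_scale` is a hypothetical helper.

```python
import numpy as np

# Feature columns X1 and X2 from the table.
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])


def min_max_scale(X):
    """Rescale each column to [0, 1]; assumes no column is constant."""
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return (X - lo) / (hi - lo)


X_scaled = min_max_scale(X)
```

After scaling, both columns cover the same range, so the loss no longer has very different curvature along $\theta_1$ and $\theta_2$.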
Gradient Descent Optimization
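Momentum (inertia)
Main idea
- Remember the movement in the past
- Reflect that on the current movement
Offset effect: past + current gives a smaller step when the two point in opposite directions.
Accelerate effect: past + current gives a larger step when the two point in the same direction.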
Momentum (inertia)
Saves a proportion $\gamma$ of the previous movement ($\gamma$ : usually about 0.9)
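The update equations on these slides did not survive as text. One common way to write the momentum update, matching the Python class in this deck, is sketched below. The symbol $v_t$ for the accumulated movement is an assumption.

```latex
v_t = \gamma \, v_{t-1} - \eta \, \nabla_\theta J(\theta)
\theta := \theta + v_t
```

With $\gamma = 0$ this reduces to the plain update $\theta := \theta - \eta \cdot \nabla_\theta J(\theta)$.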
Each step is -1 ∙ slope ∙ learning rate, from $\theta := \theta - \eta \cdot \nabla_\theta J(\theta)$.
iter 1:
  slope: -10    step: $10 \cdot \eta$
  slope: -0.1   step: $0.1 \cdot \eta$
iter 2 (vanilla GD, before adding the past step):
  slope: 15     step: $-15 \cdot \eta$
  slope: -0.05  step: $0.05 \cdot \eta$
iter 2 (after adding the past step: 0.9 × past step + 1 × current step):
  $0.9 \cdot 10 \cdot \eta + (-15 \cdot \eta) = -6 \cdot \eta$      (offset effect)
  $0.9 \cdot 0.1 \cdot \eta + 0.05 \cdot \eta = 0.14 \cdot \eta$    (accelerate effect)
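The arithmetic of this example can be checked with a few lines. Steps are expressed in units of the learning rate, and the variable names are illustrative.

```python
gamma = 0.9

# Steps in units of the learning rate eta (step = -1 * slope).
past_steps = [10.0, 0.1]        # iter 1: slopes -10 and -0.1
current_steps = [-15.0, 0.05]   # iter 2 (vanilla GD): slopes 15 and -0.05

# Momentum step: gamma * past step + 1 * current step.
momentum_steps = [gamma * p + c for p, c in zip(past_steps, current_steps)]
```

In the first direction the past and current steps have opposite signs and partly cancel. In the second they share a sign and add up.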
Momentum (inertia)
Because of momentum, the parameters can be expected to move out of a local minimum and on to a better one.
Avoiding Local Minima. Picture from http://www.yaldex.com.
Needs more memory (×2): a velocity value is stored for every parameter.
Python class
import numpy as np

class Momentum:
    """SGD with momentum: v <- momentum * v - lr * grads; params <- params + v."""

    def __init__(self, lr=0.01, momentum=0.9):
        self.lr = lr
        self.momentum = momentum
        self.v = None  # velocity per parameter, created on the first update

    def update(self, params, grads):
        if self.v is None:
            self.v = {}
            for key, val in params.items():
                self.v[key] = np.zeros_like(val)

        for key in params.keys():
            self.v[key] = self.momentum*self.v[key] - self.lr*grads[key]
            params[key] += self.v[key]
https://github.com/WegraLee
TensorFlow
import tensorflow as tf  # TensorFlow 1.x API
# encoder, X, n_hidden, n_input, learning_rate and momentum are defined elsewhere
W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.MomentumOptimizer(learning_rate, momentum).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
Adagrad (Adaptive Gradient)
Main idea
- Use a larger effective learning rate for variables that have not changed much so far
- Use a smaller effective learning rate for variables that have changed a lot so far
$\theta := \theta - \eta \cdot \nabla_\theta J(\theta)$ : the fixed $\eta$ becomes adaptive, per parameter.
[Figure: $J(\theta)$ plotted against $\theta_1$ and against $\theta_2$]
Accumulate the square of the gradient. As the cumulative value increases, the learning rate decreases.
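The Adagrad equations on these slides did not survive as text. A form that matches the Python class in this deck is sketched below. The symbols $G_t$ (accumulated squared gradient) and $\epsilon$ (a small constant, 1e-7 in the class) are assumptions about the slide notation.

```latex
G_t = G_{t-1} + \left( \nabla_\theta J(\theta) \right)^2
\theta := \theta - \frac{\eta}{\sqrt{G_t} + \epsilon} \, \nabla_\theta J(\theta)
```

Since $G_t$ never decreases, the effective learning rate $\eta / (\sqrt{G_t} + \epsilon)$ can only shrink over time.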
The values below are slope ∙ learning rate; the update subtracts them from $\theta$.
iter 1 (vanilla GD):
  slope: -10    $-10 \cdot \eta$
  slope: -0.1   $-0.1 \cdot \eta$
iter 1 (Adagrad):
  cache2 = $10^2$, cache1 = $0.1^2$
  $-10 \cdot (\eta / \sqrt{\text{cache2}}) = -\eta$
  $-0.1 \cdot (\eta / \sqrt{\text{cache1}}) = -\eta$
iter 2 (after update):
  slope: 0.3 and slope: -0.08
  cache2 = $10^2 + 0.3^2$, cache1 = $0.1^2 + 0.08^2$
  $0.3 \cdot (\eta / \sqrt{\text{cache2}}) \approx 0.03 \cdot \eta$
  $-0.08 \cdot (\eta / \sqrt{\text{cache1}}) \approx -0.62 \cdot \eta$
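The numbers in this example can be reproduced with a short sketch. Values are in units of the learning rate; the keys `steep` and `shallow` stand for the parameters tracked by cache2 and cache1, and all names are illustrative.

```python
import math

# Slopes from the slides; "steep" uses cache2, "shallow" uses cache1.
slopes_iter1 = {"steep": -10.0, "shallow": -0.1}
slopes_iter2 = {"steep": 0.3, "shallow": -0.08}

cache = {"steep": 0.0, "shallow": 0.0}


def adagrad_scaled(slopes):
    """Accumulate squared slopes, then return slope / sqrt(cache)."""
    scaled = {}
    for key, g in slopes.items():
        cache[key] += g * g
        scaled[key] = g / math.sqrt(cache[key])
    return scaled


iter1 = adagrad_scaled(slopes_iter1)
iter2 = adagrad_scaled(slopes_iter2)
```

In the first iteration both parameters move by the same amount despite a 100-fold difference in slope. In the second, the parameter with the small accumulated cache takes the larger step.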
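Python class
import numpy as np

class AdaGrad:
    """Adagrad: divide each step by the root of the accumulated squared gradients."""

    def __init__(self, lr=0.01):
        self.lr = lr
        self.h = None  # accumulated squared gradients, created on the first update

    def update(self, params, grads):
        if self.h is None:
            self.h = {}
            for key, val in params.items():
                self.h[key] = np.zeros_like(val)

        for key in params.keys():
            self.h[key] += grads[key] * grads[key]
            # 1e-7 avoids division by zero while h is still 0
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)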
https://github.com/WegraLee
TensorFlow
import tensorflow as tf  # TensorFlow 1.x API
# encoder, X, n_hidden, n_input, learning_rate and initial_accumulator_value are defined elsewhere
W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.AdagradOptimizer(learning_rate, initial_accumulator_value).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
RMSProp
- The G term, which Adagrad obtains by summing the squared gradients, is replaced with an exponential moving average.
- This keeps the relative size difference between the variables' recent amounts of change without letting G grow indefinitely.
https://www.google.co.kr/url?sa=i&rct=j&q=&esrc=s&source=images&cd=&ved=0ahUKEwi0uszPs7_YAhVFybwKHcWRDfYQjhwIBQ&url=https%3A%2F%2Fp.rizon.top%3A443%2Fhttps%2Finsidehpc.com%2F2015%2F06%2Fpodcast-geoffrey-hinton-on-the-rise-of-deep-learning%2F&psig=AOvVaw1Tpp31PE1Bg2r8cpN4KDUn&ust=1515192917829215
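The deck gives no code for RMSProp. A minimal sketch in the style of the other Python classes might look as follows. The class itself, the `decay_rate` parameter and its default of 0.99 are assumptions, not taken from the slides.

```python
import numpy as np


class RMSProp:
    """Sketch: Adagrad with the running sum replaced by an exponential average."""

    def __init__(self, lr=0.01, decay_rate=0.99):
        self.lr = lr
        self.decay_rate = decay_rate   # assumed name and default
        self.h = None                  # exponential average of squared gradients

    def update(self, params, grads):
        if self.h is None:
            self.h = {key: np.zeros_like(val) for key, val in params.items()}

        for key in params.keys():
            self.h[key] = (self.decay_rate * self.h[key]
                           + (1 - self.decay_rate) * grads[key] ** 2)
            params[key] -= self.lr * grads[key] / (np.sqrt(self.h[key]) + 1e-7)


# Usage with illustrative values: slopes of very different size.
params = {"w": np.array([1.0, -2.0])}
grads = {"w": np.array([10.0, -0.1])}
rms = RMSProp(lr=0.01)
rms.update(params, grads)
```

Because old squared gradients decay away, `h` stays bounded by the largest recent squared gradient and the effective learning rate does not shrink toward zero.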
Adam (Adaptive Moment Estimation)
Hybrid of Momentum and RMSprop
- Momentum part: exponential averages of previous slopes ($m_t$)
- RMSprop part: exponential averages of previous squared slopes ($v_t$)
In Adam, m and v are initialized to 0, so early in training $m_t$ and $v_t$ are expected to be biased toward 0, and a correction step is applied to make them unbiased.
Expanding the expressions for $m_t$ and $v_t$ into sums and taking the expectation of both sides shows that the correction $\hat{m}_t = m_t / (1 - \beta_1^t)$, $\hat{v}_t = v_t / (1 - \beta_2^t)$ gives unbiased expectations.
The update then uses these corrected values: $\hat{m}_t$ goes where the gradient would go, and $\hat{v}_t$ goes where $G_t$ would go.
($\beta_1$, $\beta_2$ : usually about 0.9, 0.999)
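The deck shows Adam only through the TensorFlow call. A minimal sketch in the style of the other Python classes, with the bias correction written out, might look as follows. The class layout, attribute names and the `eps` default are assumptions.

```python
import numpy as np


class Adam:
    """Sketch: momentum-style average m plus RMSprop-style average v."""

    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps   # assumed default, avoids division by zero
        self.t = 0       # number of updates so far
        self.m = None    # exponential average of gradients
        self.v = None    # exponential average of squared gradients

    def update(self, params, grads):
        if self.m is None:
            self.m = {key: np.zeros_like(val) for key, val in params.items()}
            self.v = {key: np.zeros_like(val) for key, val in params.items()}

        self.t += 1
        for key in params.keys():
            self.m[key] = self.beta1 * self.m[key] + (1 - self.beta1) * grads[key]
            self.v[key] = self.beta2 * self.v[key] + (1 - self.beta2) * grads[key] ** 2

            # Bias correction: m and v start at 0 and are biased toward 0 early on.
            m_hat = self.m[key] / (1 - self.beta1 ** self.t)
            v_hat = self.v[key] / (1 - self.beta2 ** self.t)

            params[key] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


# Usage with illustrative values: slopes of very different size.
params = {"w": np.array([1.0, -2.0])}
grads = {"w": np.array([10.0, -0.1])}
adam = Adam(lr=0.001)
adam.update(params, grads)
```

On the first update the corrected averages equal the gradient and its square, so each parameter moves by roughly the learning rate regardless of the size of its slope.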
TensorFlow
import tensorflow as tf  # TensorFlow 1.x API
# encoder, X, n_hidden, n_input, learning_rate, beta1, beta2 and epsilon are defined elsewhere
W_decode = tf.Variable(tf.random_normal([n_hidden, n_input]))
b_decode = tf.Variable(tf.random_normal([n_input]))
decoder = tf.nn.sigmoid(tf.matmul(encoder, W_decode) + b_decode)
cost = tf.reduce_mean(tf.pow(X - decoder, 2))
optimizer = tf.train.AdamOptimizer(learning_rate, beta1, beta2, epsilon).minimize(cost)
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/05/Comparison-of-Adam-to-Other-Optimization-Algorithms-Training-a-Multilayer-Perceptron.png
No single optimizer is the best choice for every problem.
Use Adam in most cases.
