Policy Gradient
Jie-Han Chen
NetDB, National Cheng Kung University
5/22, 2018 @ National Cheng Kung University, Taiwan
Some content and images in these slides were borrowed from:
1. Sergey Levine’s Deep Reinforcement Learning class at UC Berkeley
2. David Silver’s Reinforcement Learning class at UCL
3. Rich Sutton’s textbook
4. Deep Reinforcement Learning and Control at CMU (CMU 10703)
Disclaimer
Outline
● Pitfall of Value-based Reinforcement Learning
● Policy gradient
● Variance reduction
● Policy in policy gradient
● Off-policy policy gradient
● Reference
Value-based Reinforcement Learning
In the previous lecture, we introduced how to use a neural network to approximate the value
function and how to learn the optimal policy in a discrete action space.
Value-based Reinforcement Learning
In Deep Q-Network, we use a neural network to approximate the action-value
function Q(s, a).
The greedy policy is
Value-based Reinforcement Learning
● The optimal policy learned by value-based methods is deterministic
● It is hard to apply to continuous-action problems
Pitfall of Value-based Reinforcement Learning
Consider the following simple maze. The features consist of 4 elements,
each indicating whether the agent faces a wall in that direction (N, S, W, E).
feature: (1, 1, 0, 0)
Pitfall of Value-based Reinforcement Learning
For a deterministic policy:
● It will move either east or west in both grey states.
● It may get stuck and never reach the goal state.
Pitfall of Value-based Reinforcement Learning
Although a well-designed observation could help the agent distinguish
between different states, sometimes we prefer to use a stochastic policy.
Pitfall of Value-based Reinforcement Learning
In robotics, the action (control) is often continuous. We
need to decide the angle/torque of the robotic arm
given the observation.
It is hard to use argmax to obtain the optimal
action of a robotic arm, so we need other solutions for
continuous control problems.
Value-based and Policy-based RL
Value-Based
● Learnt Value Function
● Implicit policy
Policy-Based
● No Value Function
● Learnt Policy
Actor-Critic
● Learnt Value Function
● Learnt Policy
Policy Gradient
The objective of reinforcement learning is to maximize the expected episodic
reward (here, we take an episodic task as our example).
We let τ denote the episodic trajectory generated by the policy π_θ;
the objective can then be expressed as follows:
Policy Gradient
(Diagram: agent-environment loop; the policy outputs an action, and the environment returns the next state s according to p(s’|s, a).)
Policy Gradient
(Diagram: the same agent-environment loop, with transition probability p(s’|s, a) and state s.)
In this example, each sample is an episodic trajectory,
not a single transition experience (s, a, r, s’).
Policy Gradient
How do we find the optimal parameters of the neural network?
???
Policy Gradient
How do we find the optimal parameters of the neural network?
Gradient Ascent ! (maximize objective)
*** Math Caution ***
Policy Gradient
tips:
We just sample trajectories using the current policy and
adjust the likelihood of those trajectories according to their episodic rewards.
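For reference, a reconstruction of the standard derivation behind these slides (the original equations are images; τ denotes a trajectory, R(τ) its episodic reward, and p_θ(τ) the trajectory distribution induced by π_θ):

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$

Since the dynamics do not depend on θ, $\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.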
Policy Gradient
The gradient of the objective:
adjust the probability of the actions
taken in that trajectory
according to the magnitude
of the episodic rewards.
Policy Gradient
The gradient of the objective:
In practice, we replace the expectation with an average over multiple sampled trajectories.
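In symbols, the Monte Carlo estimator this refers to (standard form, assumed here; N sampled trajectories indexed by i, horizon T):

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)$$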
Vanilla Policy Gradient - REINFORCE algorithm
REINFORCE algorithm:
1. Sample a trajectory τ from the current policy π_θ
2. Compute the policy gradient estimate from the sampled trajectory
3. Update the parameters by gradient ascent: θ ← θ + α ∇_θ J(θ)
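A minimal PyTorch sketch of these three steps, assuming a discrete-action environment with the classic Gym API; PolicyNet, the hidden size, and the single-trajectory update are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Maps an observation vector to action logits."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def reinforce_step(policy, optimizer, env):
    """Steps 1-3 above: sample one trajectory, estimate the gradient, ascend."""
    obs, done = env.reset(), False          # classic Gym API assumed
    log_probs, rewards = [], []
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    episode_return = sum(rewards)                        # R(tau), undiscounted
    # Gradient ascent on J(theta) = gradient descent on -log pi(tau) * R(tau).
    loss = -torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Calling reinforce_step repeatedly with, for example, an Adam optimizer over the policy parameters gives the vanilla algorithm; the following slides discuss why this estimator has high variance.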
Features of Policy-Based RL
Advantages:
● Better convergence properties
● Effective in high-dimensional or continuous action spaces
● Can learn stochastic policies
Disadvantages:
● Converges to a local optimum rather than the global optimum
● Inefficient learning (learns from whole episodes)
● High variance
REINFORCE: bias and variance
The estimator is unbiased, because it uses the true
episodic rewards to evaluate the policy.
However, the REINFORCE estimator is known to have
high variance, because episodic rewards differ hugely
between trajectories. High variance results in slow
convergence.
Variance reduction
There are two methods to reduce variance:
1. Causality
2. Baseline
Variance reduction: causality
Original:
Causality: the policy at time t’ cannot affect the reward at time t when t < t’.
We can therefore drop rewards received before each action and weight every
log-probability only by the reward-to-go from that time step onward.
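For reference, the reward-to-go estimator this leads to (a standard reconstruction; the slide equation is an image):

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\Big)$$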
REINFORCE: reduce variance
There are two methods to reduce variance
Original:
1. Causality:
2. Baseline
Variance reduction: baseline
baseline: subtract a constant b from the episodic reward before weighting the gradient.
We can choose a baseline such as the
average reward of the sampled trajectories.
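A sketch of the baseline-subtracted estimator with this average-return baseline (notation as above):

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,\big(R(\tau_i) - b\big), \qquad b = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)$$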
Variance reduction: baseline
baseline:
Do baselines introduce bias in expectation?
Analyze:
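A reconstruction of the analysis (shown as an image on the slide): the baseline term vanishes in expectation, because

$$\mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, b\big] = b \int p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\, d\tau = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\,\nabla_\theta \int p_\theta(\tau)\, d\tau = b\,\nabla_\theta 1 = 0$$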
Variance reduction: baseline
*The baseline is independent of the policy.
Reducing variance with a baseline will not bias the model, as long as the
baseline is independent of the policy (not action-related).
Variance reduction: baseline
● Subtracting a baseline is unbiased in expectation; it will not make the
estimator biased.
● The baseline can be any function or random variable, as long as it does
not vary with the action.
Variance reduction: baseline
Variance:
Related paper:
Cathy Wu*, Aravind Rajeswaran*, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade,
Igor Mordatch, Pieter Abbeel (OpenAI). Variance Reduction for Policy Gradient with
Action-Dependent Factorized Baselines. Under review at ICLR 2018 at the time of writing.
Variance reduction: causality + baseline
Previously, we introduced two methods to reduce variance:
● Causality
● Baseline
Question: can we combine the two variance reduction methods?
Variance reduction: causality + baseline
The ideal form:
(Diagram labels: starting state, a state in the trajectory, terminal state.)
If you are in a certain state of a trajectory, there are many potential
paths to different terminal states, so the remaining
rewards will also differ.
The natural idea is to use the average remaining reward in
that state, in other words the value function, as the baseline.
Variance reduction: causality + baseline
We can learn the value function by the methods mentioned before
(a tabular method or a function approximator).
In the REINFORCE algorithm, the agent interacts with the environment
until reaching a terminal state, so we know the remaining reward
at each step. We can therefore evaluate the policy with the Monte Carlo
method, and the loss can be the MSE.
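A hedged PyTorch sketch of this combination, assuming the per-step log-probabilities and visited states of one episode have already been collected (as in the REINFORCE sketch earlier); rewards_to_go, value_net, and the optimizer are illustrative names:

```python
import torch
import torch.nn.functional as F

def rewards_to_go(rewards, gamma=1.0):
    """Remaining (discounted) reward from each time step to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

def policy_loss_with_baseline(log_probs, states, rewards, value_net, value_optimizer):
    """Fit V(s) to the Monte Carlo returns with MSE, then use it as a baseline."""
    targets = torch.tensor(rewards_to_go(rewards), dtype=torch.float32)

    values = value_net(states).squeeze(-1)      # predicted V(s_t) for each visited state
    value_loss = F.mse_loss(values, targets)    # Monte Carlo targets, MSE loss
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    advantages = targets - values.detach()      # reward-to-go minus the learned baseline
    return -(torch.stack(log_probs) * advantages).sum()   # backprop this through the policy
```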
Policy
Policy gradient can be applied to discrete action space problems as well as continuous
action space problems.
● Discrete action problem: Softmax Policy
● Continuous action problem: Gaussian Policy
Policy: Softmax Policy
Here, we suppose the function approximator is h(s, a) and θ denotes its parameters;
we sample the action according to its softmax probability.
(Diagram: observation (features) in, action probabilities out.)
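A minimal sketch of a softmax policy in PyTorch; the two-layer network stands in for h(s, a), producing one score per discrete action (sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 3                 # illustrative sizes, not from the slides
h = nn.Sequential(                        # plays the role of h(s, a): one score per action
    nn.Linear(obs_dim, 32), nn.ReLU(),
    nn.Linear(32, n_actions),
)

state = torch.randn(obs_dim)              # a dummy observation (features)
dist = Categorical(logits=h(state))       # softmax over the action scores
action = dist.sample()                    # sample the action by its softmax probability
log_prob = dist.log_prob(action)          # the term used in the policy gradient update
```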
Policy: Gaussian Policy
In continuous control, a Gaussian policy is common.
● Gaussian distribution:
● Gaussian policy: we use a neural network to approximate the mean and sample the
action from the resulting Gaussian distribution. The variance can also be
parameterized.
Policy: Gaussian Policy
Here, we use a fixed variance, and the mean is computed as a linear combination of
state features, where φ(s) is the feature transformation.
The gradient of the log of the policy is ∇_θ log π_θ(a|s) = (a - μ(s)) φ(s) / σ².
In a neural network, you just need to backpropagate through the log-probability of the sampled action.
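A minimal sketch of such a Gaussian policy in PyTorch, with a fixed standard deviation and a network for the mean (sizes and sigma are illustrative); backpropagating through the log-probability of the sampled action gives the grad-log-pi term:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

obs_dim, act_dim, sigma = 8, 2, 0.5       # illustrative sizes and a fixed std
mean_net = nn.Sequential(                 # approximates the mean of the Gaussian policy
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)

state = torch.randn(obs_dim)              # a dummy observation
mu = mean_net(state)
dist = Normal(mu, sigma)                  # fixed variance; sigma could also be learned
action = dist.sample()                    # continuous action
log_prob = dist.log_prob(action).sum()    # sum the log-density over action dimensions
# (-log_prob * return_estimate).backward() backpropagates through the sampled
# action's log-probability, i.e. the grad-log-pi term of the policy gradient.
```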
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
The learning process of REINFORCE:
1. Sample multiple trajectories from the current policy π_θ
2. Fit the model
If the sampled trajectories come from a different distribution, the
learning result will be wrong.
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
○ In vanilla policy gradient, we sample multiple trajectories but only update the model once.
Compared with TD learning, vanilla policy gradient learns much more slowly.
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
Improve by importance sampling
Improve by Actor-Critic
Importance sampling
Importance sampling is a statistical technique that lets us estimate
properties of one distribution using samples drawn from a different distribution.
Suppose the objective is an expectation under p(x), but the data are sampled from q(x); we
can apply the following transformation:
Objective of on-policy policy gradient:
Objective of off-policy policy gradient:
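In symbols (a standard reconstruction; the slide equations are images):

$$\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x) f(x)\, dx = \int q(x)\,\frac{p(x)}{q(x)}\, f(x)\, dx = \mathbb{E}_{x \sim q(x)}\Big[\frac{p(x)}{q(x)} f(x)\Big]$$

Accordingly, the off-policy objective replaces $\mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$ with $\mathbb{E}_{\tau \sim p_{\theta'}}\big[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} R(\tau)\big]$, where θ′ parameterizes the behavior policy (notation assumed here).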
Off-policy & importance sampling
● target policy ( ): the learning policy, which we are interested in.
● behavior policy ( ): the policy used to collect samples.
We sample the trajectories from the behavior policy; the objective would be:
Off-policy & importance sampling
● target policy ( ): the learning policy, which we are interested in.
● behavior policy ( ): the policy used to interact with environment.
We sample the trajectories from the behavior policy; the objective would be:
The importance sampling ratio ends up depending only on
the two policies and the sequence.
Off-policy & importance sampling
Suppose the off-policy objective function is:
target policy (learner neural net)
behavior policy (expert/behavior neural net)
Off-policy & importance sampling
Suppose the off-policy objective function is:
How about causality?
Off-policy & importance sampling
The gradient of the off-policy objective:
future actions won’t affect the current weights
Off-policy & importance sampling
The gradient of the off-policy objective:
1. This is the general form of the off-policy policy gradient. If we
use on-policy learning, the form is the same as the vanilla policy
gradient (the importance sampling ratio is 1).
2. In practice, we store each trajectory along with the action
probability at every step, and then update the neural network
with the importance sampling ratio included.
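A hedged sketch of point 2 in PyTorch: the behavior policy's per-step log-probabilities are stored at collection time, and the ratio against the current target policy reweights the update. For simplicity this uses a per-step ratio, whereas the general form on the slides multiplies the ratios along the trajectory; all names here are illustrative:

```python
import torch
from torch.distributions import Categorical

def off_policy_pg_loss(policy, states, actions, returns, behavior_log_probs):
    """Vanilla policy gradient reweighted by a per-step importance sampling ratio.

    behavior_log_probs holds log pi_behavior(a_t|s_t), stored when the data were collected.
    """
    dist = Categorical(logits=policy(states))                 # current target policy
    target_log_probs = dist.log_prob(actions)
    # ratio = pi_theta(a|s) / pi_behavior(a|s); detached so it only reweights the update
    ratio = torch.exp(target_log_probs - behavior_log_probs).detach()
    # With on-policy data the ratio is 1 and this reduces to the vanilla policy gradient.
    return -(ratio * target_log_probs * returns).mean()
```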
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy (already addressed above with importance sampling)
● We need to learn by the Monte Carlo method (addressed by Actor-Critic, in the next section)
Reference
● CS 294, Berkeley, lecture 4: http://rll.berkeley.edu/deeprlcourse/
● David Silver’s RL course, lecture 7: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
● Baseline Subtraction, Shan-Hung Wu: https://www.youtube.com/watch?v=XnXRzOB0Pc8
● Andrej Karpathy’s blog: http://karpathy.github.io/2016/05/31/rl/
● Policy Gradient in PyTorch: https://github.com/pytorch/examples/tree/master/reinforcement_learning
Outline
● Pitfall of Value-based Reinforcement Learning
○ Value-based policy is deterministic
○ Hard to handle continuous control
● Policy gradient
● Variance reduction
○ Causality
○ Baseline
● Policy in policy gradient
○ Softmax policy
○ Gaussian policy
● Off-policy policy gradient
○ Importance sampling
