Policy Gradient
Jie-Han Chen
NetDB, National Cheng Kung University
5/22, 2018 @ National Cheng Kung University, Taiwan
Some content and images in these slides were borrowed from:
1. Sergey Levine’s Deep Reinforcement Learning class at UC Berkeley
2. David Silver’s Reinforcement Learning class at UCL
3. Rich Sutton’s textbook
4. Deep Reinforcement Learning and Control at CMU (CMU 10703)
Disclaimer
Outline
● Pitfall of Value-based Reinforcement Learning
● Policy gradient
● Variance reduction
● Policy in policy gradient
● Off-policy policy gradient
● Reference
Value-based Reinforcement Learning
In the previous lecture, we introduced how to use a neural network to approximate the value
function and how to learn the optimal policy in a discrete action space.
Value-based Reinforcement Learning
In Deep Q-Network, we use a neural network to approximate the action-value
function Q(s, a).
The greedy policy is
Value-based Reinforcement Learning
● The optimal policy learned by value-based methods is deterministic
● It is hard to apply to continuous-action problems
Pitfall of Value-based Reinforcement Learning
Consider the following simple maze. The features consist of 4 elements,
each indicating whether the agent faces a wall in that direction (N, S, W, E).
feature: (1, 1, 0, 0)
Pitfall of Value-based Reinforcement Learning
For a deterministic policy:
● It will move either east or west in both grey states.
● It may get stuck and never reach the goal state.
Pitfall of Value-based Reinforcement Learning
Although a well-designed observation could help the agent distinguish
between different states, sometimes we prefer to use a stochastic policy.
Pitfall of Value-based Reinforcement Learning
In robotics, the action (control) is often continuous. We
need to decide the angle/torque of the robotic arm
given the observation.
It is hard to use argmax to obtain the optimal
action of a robotic arm, so we need other solutions for
continuous control problems.
Value-based and Policy-based RL
Value-Based
● Learnt Value Function
● Implicit policy
Policy-Based
● No Value Function
● Learnt Policy
Actor-Critic
● Learnt Value Function
● Learnt Policy
Policy Gradient
The objective of reinforcement learning is to maximize the expected episodic
reward (here, we take an episodic task as our example).
We let τ denote the episodic trajectory generated by the policy π_θ;
the objective can then be expressed as follows:
Policy Gradient
(Diagram: agent-environment loop; the policy outputs an action, and the environment returns the next state s according to p(s’|s, a).)
Policy Gradient
(Diagram: the same agent-environment loop, with transition probability p(s’|s, a) and state s.)
In this example, each sample is an episodic trajectory,
not a single transition experience (s, a, r, s’).
Policy Gradient
How do we find the optimal parameters of the neural network?
???
Policy Gradient
How do we find the optimal parameters of the neural network?
Gradient Ascent ! (maximize objective)
*** Math Caution ***
Policy Gradient
tips:
We just sample trajectories using the current policy and
adjust the likelihood of those trajectories according to their episodic rewards.
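For reference, a reconstruction of the standard derivation behind these slides (the original equations are images; τ denotes a trajectory, R(τ) its episodic reward, and p_θ(τ) the trajectory distribution induced by π_θ):

$$\nabla_\theta J(\theta) = \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau = \mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, R(\tau)\big]$$

Since the dynamics do not depend on θ, $\nabla_\theta \log p_\theta(\tau) = \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)$.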
Policy Gradient
The gradient of the objective:
adjust the probability of the actions
taken in that trajectory
according to the magnitude
of the episodic rewards.
Policy Gradient
The gradient of the objective:
In practice, we replace the expectation with an average over multiple sampled trajectories.
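In symbols, the Monte Carlo estimator this refers to (standard form, assumed here; N sampled trajectories indexed by i, horizon T):

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)\Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)$$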
Vanilla Policy Gradient - REINFORCE algorithm
REINFORCE algorithm:
1. Sample a trajectory τ from the current policy π_θ
2. Compute the policy gradient estimate from the sampled trajectory
3. Update the parameters by gradient ascent: θ ← θ + α ∇_θ J(θ)
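A minimal PyTorch sketch of these three steps, assuming a discrete-action environment with the classic Gym API; PolicyNet, the hidden size, and the single-trajectory update are illustrative choices, not taken from the slides:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class PolicyNet(nn.Module):
    """Maps an observation vector to action logits."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

def reinforce_step(policy, optimizer, env):
    """Steps 1-3 above: sample one trajectory, estimate the gradient, ascend."""
    obs, done = env.reset(), False          # classic Gym API assumed
    log_probs, rewards = [], []
    while not done:
        dist = Categorical(logits=policy(torch.as_tensor(obs, dtype=torch.float32)))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(reward)

    episode_return = sum(rewards)                        # R(tau), undiscounted
    # Gradient ascent on J(theta) = gradient descent on -log pi(tau) * R(tau).
    loss = -torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Calling reinforce_step repeatedly with, for example, an Adam optimizer over the policy parameters gives the vanilla algorithm; the following slides discuss why this estimator has high variance.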
Features of Policy-Based RL
Advantages:
● Better convergence properties
● Effective in high-dimensional or continuous action spaces
● Can learn stochastic policies
Disadvantages:
● Converges to a local optimum rather than the global optimum
● Inefficient learning (learns from whole episodes)
● High variance
REINFORCE: bias and variance
The estimator is unbiased, because it uses the true
episodic rewards to evaluate the policy.
However, the REINFORCE estimator is known to have
high variance, because episodic rewards differ hugely
between trajectories. High variance results in slow
convergence.
Variance reduction
There are two methods to reduce variance:
1. Causality
2. Baseline
Variance reduction: causality
Original:
Causality: the policy at time t’ cannot affect the reward at time t when t < t’.
We can therefore drop rewards received before each action and weight every
log-probability only by the reward-to-go from that time step onward.
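For reference, the reward-to-go estimator this leads to (a standard reconstruction; the slide equation is an image):

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \Big(\sum_{t'=t}^{T} r(s_{i,t'}, a_{i,t'})\Big)$$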
REINFORCE: reduce variance
There are two methods to reduce variance
Original:
1. Causality:
2. Baseline
Variance reduction: baseline
baseline: subtract a constant b from the episodic reward before weighting the gradient.
We can choose a baseline such as the
average reward of the sampled trajectories.
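A sketch of the baseline-subtracted estimator with this average-return baseline (notation as above):

$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log p_\theta(\tau_i)\,\big(R(\tau_i) - b\big), \qquad b = \frac{1}{N} \sum_{i=1}^{N} R(\tau_i)$$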
Variance reduction: baseline
baseline:
Do baselines introduce bias in expectation?
Analyze:
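A reconstruction of the analysis (shown as an image on the slide): the baseline term vanishes in expectation, because

$$\mathbb{E}_{\tau \sim p_\theta}\big[\nabla_\theta \log p_\theta(\tau)\, b\big] = b \int p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\, d\tau = b \int \nabla_\theta p_\theta(\tau)\, d\tau = b\,\nabla_\theta \int p_\theta(\tau)\, d\tau = b\,\nabla_\theta 1 = 0$$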
Variance reduction: baseline
*The baseline is independent of the policy.
Reducing variance with a baseline will not bias the model, as long as the
baseline is independent of the policy (not action-related).
Variance reduction: baseline
● Subtracting a baseline is unbiased in expectation; it will not make the
estimator biased.
● The baseline can be any function or random variable, as long as it does
not vary with the action.
Variance reduction: baseline
Variance:
Related paper:
Cathy Wu*, Aravind Rajeswaran*, Yan Duan, Vikash Kumar, Alexandre M. Bayen, Sham Kakade,
Igor Mordatch, Pieter Abbeel (OpenAI). Variance Reduction for Policy Gradient with
Action-Dependent Factorized Baselines. Under review at ICLR 2018 at the time of writing.
Variance reduction: causality + baseline
Previously, we introduced two methods to reduce variance:
● Causality
● Baseline
Question: can we combine the two variance reduction methods?
Variance reduction: causality + baseline
The ideal form:
(Diagram labels: starting state, a state in the trajectory, terminal state.)
If you are in a certain state of a trajectory, there are many potential
paths to different terminal states, so the remaining
rewards will also differ.
The natural idea is to use the average remaining reward in
that state, in other words the value function, as the baseline.
Variance reduction: causality + baseline
We can learn the value function by the methods mentioned before
(a tabular method or a function approximator).
In the REINFORCE algorithm, the agent interacts with the environment
until reaching a terminal state, so we know the remaining reward
at each step. We can therefore evaluate the policy with the Monte Carlo
method, and the loss can be the MSE.
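A hedged PyTorch sketch of this combination, assuming the per-step log-probabilities and visited states of one episode have already been collected (as in the REINFORCE sketch earlier); rewards_to_go, value_net, and the optimizer are illustrative names:

```python
import torch
import torch.nn.functional as F

def rewards_to_go(rewards, gamma=1.0):
    """Remaining (discounted) reward from each time step to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return list(reversed(out))

def policy_loss_with_baseline(log_probs, states, rewards, value_net, value_optimizer):
    """Fit V(s) to the Monte Carlo returns with MSE, then use it as a baseline."""
    targets = torch.tensor(rewards_to_go(rewards), dtype=torch.float32)

    values = value_net(states).squeeze(-1)      # predicted V(s_t) for each visited state
    value_loss = F.mse_loss(values, targets)    # Monte Carlo targets, MSE loss
    value_optimizer.zero_grad()
    value_loss.backward()
    value_optimizer.step()

    advantages = targets - values.detach()      # reward-to-go minus the learned baseline
    return -(torch.stack(log_probs) * advantages).sum()   # backprop this through the policy
```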
Policy
Policy gradient can be applied to discrete action space problems as well as continuous
action space problems.
● Discrete action problem: Softmax Policy
● Continuous action problem: Gaussian Policy
Policy: Softmax Policy
Here, we suppose the function approximator is h(s, a) and θ denotes its parameters;
we sample the action according to its softmax probability.
(Diagram: observation (features) in, action probabilities out.)
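A minimal sketch of a softmax policy in PyTorch; the two-layer network stands in for h(s, a), producing one score per discrete action (sizes are illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

obs_dim, n_actions = 4, 3                 # illustrative sizes, not from the slides
h = nn.Sequential(                        # plays the role of h(s, a): one score per action
    nn.Linear(obs_dim, 32), nn.ReLU(),
    nn.Linear(32, n_actions),
)

state = torch.randn(obs_dim)              # a dummy observation (features)
dist = Categorical(logits=h(state))       # softmax over the action scores
action = dist.sample()                    # sample the action by its softmax probability
log_prob = dist.log_prob(action)          # the term used in the policy gradient update
```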
Policy: Gaussian Policy
In continuous control, a Gaussian policy is common.
● Gaussian distribution:
● Gaussian policy: we use a neural network to approximate the mean and sample the
action from the resulting Gaussian distribution. The variance can also be
parameterized.
Policy: Gaussian Policy
Here, we use a fixed variance, and the mean is computed as a linear combination of
state features, where φ(s) is the feature transformation.
The gradient of the log of the policy is ∇_θ log π_θ(a|s) = (a - μ(s)) φ(s) / σ².
In a neural network, you just need to backpropagate through the log-probability of the sampled action.
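A minimal sketch of such a Gaussian policy in PyTorch, with a fixed standard deviation and a network for the mean (sizes and sigma are illustrative); backpropagating through the log-probability of the sampled action gives the grad-log-pi term:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

obs_dim, act_dim, sigma = 8, 2, 0.5       # illustrative sizes and a fixed std
mean_net = nn.Sequential(                 # approximates the mean of the Gaussian policy
    nn.Linear(obs_dim, 64), nn.Tanh(),
    nn.Linear(64, act_dim),
)

state = torch.randn(obs_dim)              # a dummy observation
mu = mean_net(state)
dist = Normal(mu, sigma)                  # fixed variance; sigma could also be learned
action = dist.sample()                    # continuous action
log_prob = dist.log_prob(action).sum()    # sum the log-density over action dimensions
# (-log_prob * return_estimate).backward() backpropagates through the sampled
# action's log-probability, i.e. the grad-log-pi term of the policy gradient.
```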
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
The learning process of REINFORCE:
1. Sample multiple trajectories from the current policy π_θ
2. Fit the model
If the sampled trajectories come from a different distribution, the
learning result will be wrong.
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
○ In vanilla policy gradient, we sample multiple trajectories but only update the model once.
Compared with TD learning, vanilla policy gradient learns much more slowly.
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy
● We need to learn by the Monte Carlo method
Improve by importance sampling
Improve by Actor-Critic
Importance sampling
Importance sampling is a statistical technique that lets us estimate
properties of one distribution using samples drawn from a different distribution.
Suppose the objective is an expectation under p(x), but the data are sampled from q(x); we
can apply the following transformation:
Objective of on-policy policy gradient:
Objective of off-policy policy gradient:
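In symbols (a standard reconstruction; the slide equations are images):

$$\mathbb{E}_{x \sim p(x)}[f(x)] = \int p(x) f(x)\, dx = \int q(x)\,\frac{p(x)}{q(x)}\, f(x)\, dx = \mathbb{E}_{x \sim q(x)}\Big[\frac{p(x)}{q(x)} f(x)\Big]$$

Accordingly, the off-policy objective replaces $\mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$ with $\mathbb{E}_{\tau \sim p_{\theta'}}\big[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)} R(\tau)\big]$, where θ′ parameterizes the behavior policy (notation assumed here).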
Off-policy & importance sampling
● target policy ( ): the learning policy, which we are interested in.
● behavior policy ( ): the policy used to collect samples.
We sample the trajectories from the behavior policy; the objective would be:
Off-policy & importance sampling
● target policy ( ): the learning policy, which we are interested in.
● behavior policy ( ): the policy used to interact with environment.
We sample the trajectories from the behavior policy; the objective would be:
The importance sampling ratio ends up depending only on
the two policies and the sequence.
Off-policy & importance sampling
Suppose the off-policy objective function is:
target policy (learner neural net)
behavior policy (expert/behavior neural net)
Off-policy & importance sampling
Suppose the off-policy objective function is:
How about causality?
Off-policy & importance sampling
The gradient of the off-policy objective:
future actions won’t affect the current weights
Off-policy & importance sampling
The gradient of the off-policy objective:
1. This is the general form of the off-policy policy gradient. If we
use on-policy learning, the form is the same as the vanilla policy
gradient (the importance sampling ratio is 1).
2. In practice, we store each trajectory along with the action
probability at every step, and then update the neural network
with the importance sampling ratio included.
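A hedged sketch of point 2 in PyTorch: the behavior policy's per-step log-probabilities are stored at collection time, and the ratio against the current target policy reweights the update. For simplicity this uses a per-step ratio, whereas the general form on the slides multiplies the ratios along the trajectory; all names here are illustrative:

```python
import torch
from torch.distributions import Categorical

def off_policy_pg_loss(policy, states, actions, returns, behavior_log_probs):
    """Vanilla policy gradient reweighted by a per-step importance sampling ratio.

    behavior_log_probs holds log pi_behavior(a_t|s_t), stored when the data were collected.
    """
    dist = Categorical(logits=policy(states))                 # current target policy
    target_log_probs = dist.log_prob(actions)
    # ratio = pi_theta(a|s) / pi_behavior(a|s); detached so it only reweights the update
    ratio = torch.exp(target_log_probs - behavior_log_probs).detach()
    # With on-policy data the ratio is 1 and this reduces to the vanilla policy gradient.
    return -(ratio * target_log_probs * returns).mean()
```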
Reason of Inefficient Learning
The main reasons for inefficient learning in REINFORCE are:
● The REINFORCE algorithm is on-policy (already addressed above with importance sampling)
● We need to learn by the Monte Carlo method (addressed by Actor-Critic, in the next section)
Reference
● CS 294, Berkeley, lecture 4: http://rll.berkeley.edu/deeprlcourse/
● David Silver’s RL course, lecture 7: http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching_files/pg.pdf
● Baseline Subtraction, Shan-Hung Wu: https://www.youtube.com/watch?v=XnXRzOB0Pc8
● Andrej Karpathy’s blog: http://karpathy.github.io/2016/05/31/rl/
● Policy Gradient in PyTorch: https://github.com/pytorch/examples/tree/master/reinforcement_learning
Outline
● Pitfall of Value-based Reinforcement Learning
○ Value-based policy is deterministic
○ Hard to handle continuous control
● Policy gradient
● Variance reduction
○ Causality
○ Baseline
● Policy in policy gradient
○ Softmax policy
○ Gaussian policy
● Off-policy policy gradient
○ Importance sampling
