Machine Learning

The REINFORCE Policy Gradient

Optimizing a parameterized policy directly by following the gradient of expected return.

Optimizing the policy itself

Instead of learning values and acting greedily, policy gradient methods parameterize a policy directly and adjust its parameters to raise expected return. REINFORCE is the foundational example, using complete episode returns.

The gradient estimator

The policy gradient theorem gives a gradient you can estimate from samples. For each action taken, you scale the gradient of its log probability by the return that followed:

  • Run an episode and record states, actions, and rewards.
  • Compute the return from each time step onward.
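  • Raise the log probability of each action in proportion to the return that followed it, so actions followed by high returns are reinforced and those followed by low or negative returns are reinforced less or pushed down.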
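
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the policy is assumed to be a tabular softmax whose logits live in a hypothetical nested list `theta[state][action]`, the discount `gamma` is an added parameter (default 1.0, meaning undiscounted returns), and the function names are made up for this example.

```python
import math

def softmax(logits):
    """Turn a list of logits into action probabilities."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def returns_to_go(rewards, gamma=1.0):
    """Compute G_t = r_t + gamma * r_{t+1} + ... for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_gradient(theta, states, actions, rewards, gamma=1.0):
    """Single-episode REINFORCE gradient estimate for a tabular softmax policy.

    theta[s][a] is the logit of action a in state s (an assumed layout).
    Returns a nested list with the same shape as theta.
    """
    grad = [[0.0] * len(row) for row in theta]
    for s, a, g in zip(states, actions, returns_to_go(rewards, gamma)):
        probs = softmax(theta[s])
        for b in range(len(probs)):
            # For a softmax policy: d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s)
            indicator = 1.0 if b == a else 0.0
            grad[s][b] += g * (indicator - probs[b])
    return grad

# A two-step episode: action 1 in state 0 (reward 0), then action 0 in state 1 (reward 1).
theta = [[0.0, 0.0], [0.0, 0.0]]
grad = reinforce_gradient(theta, states=[0, 1], actions=[1, 0], rewards=[0.0, 1.0])
```

A gradient ascent step would then add a small multiple of `grad` to `theta`. In this toy episode both actions are followed by a return of 1, so both of the chosen actions' logits are pushed up.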

Variance and baselines

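REINFORCE's gradient estimate is unbiased but has high variance, because each full return accumulates the randomness of every later action and reward in the episode. Subtracting a baseline that does not depend on the action taken, typically a state value estimate, leaves the gradient unbiased, since the expected gradient of the log probability under the policy is zero, and it often cuts variance substantially. The baseline centers returns so that only the relative advantage of an action drives the update.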
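
The effect of a baseline can be checked numerically on a toy problem. The sketch below assumes a hypothetical two-armed bandit (single-step episodes, arm 0 paying about 10 and arm 1 about 11, with unit Gaussian noise) and a policy that picks each arm with probability 0.5. It samples the REINFORCE estimate for one logit with and without a baseline. All names and numbers here are illustrative assumptions.

```python
import random
import statistics

def gradient_samples(baseline, n=20000, seed=0):
    """Sample single-step REINFORCE gradient estimates for the logit of arm 1.

    Hypothetical setup: arm 0 pays about 10, arm 1 about 11, and the
    current policy picks each arm with probability 0.5.
    """
    rng = random.Random(seed)
    p1 = 0.5  # probability of arm 1 under the current policy
    samples = []
    for _ in range(n):
        action = 1 if rng.random() < p1 else 0
        reward = (11.0 if action == 1 else 10.0) + rng.gauss(0.0, 1.0)
        # d log pi(action) / d logit_1 for a two-action softmax policy
        score = (1.0 if action == 1 else 0.0) - p1
        samples.append((reward - baseline) * score)
    return samples

raw = gradient_samples(baseline=0.0)
# 10.5 is the expected reward under this policy, playing the role of a state value.
centered = gradient_samples(baseline=10.5)

mean_raw, var_raw = statistics.fmean(raw), statistics.pvariance(raw)
mean_centered, var_centered = statistics.fmean(centered), statistics.pvariance(centered)
print(f"no baseline:   mean={mean_raw:.3f}  variance={var_raw:.3f}")
print(f"with baseline: mean={mean_centered:.3f}  variance={var_centered:.3f}")
```

Both sample means should land near the true gradient for this setup, 0.5 × 0.5 × (11 − 10) = 0.25, while the per-sample variance with the baseline should be far smaller. The gap is large here because the rewards sit far from zero, and other problems may show a smaller benefit.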

Key idea

REINFORCE follows the policy gradient by weighting each action's log probability gradient by the return that followed it, and a baseline reduces the otherwise high variance without adding bias.

Check yourself

1. What does REINFORCE scale the log probability gradient by?

2. Why subtract a baseline in REINFORCE?