The REINFORCE Policy Gradient

Optimizing a parameterized policy directly by following the gradient of expected return.

Optimizing the policy itself

Instead of learning values and acting greedily, policy gradient methods parameterize a policy directly and adjust its parameters to raise expected return. REINFORCE is the foundational example, using complete episode returns.

The gradient estimator

The policy gradient theorem gives a gradient you can estimate from samples. For each action taken, you scale the gradient of its log probability by the return that followed:

Run an episode and record states, actions, and rewards.
Compute the return from each time step onward.
Push up the log probability of actions that led to high returns and down for low ones.

Variance and baselines

REINFORCE is unbiased but high variance, because full returns are noisy. Subtracting a baseline, typically a state value estimate, from the return leaves the gradient unbiased while sharply cutting variance. The baseline centers returns so only the relative advantage of an action drives the update.

Key idea

REINFORCE follows the policy gradient by weighting each action's log probability gradient by the return it produced, and a baseline reduces the otherwise high variance without adding bias.

The REINFORCE Policy Gradient

Optimizing the policy itself

The gradient estimator

Variance and baselines

Key idea

Check yourself