Optimizing the policy itself
Instead of learning values and acting greedily, policy gradient methods parameterize a policy directly and adjust its parameters to raise expected return. REINFORCE is the foundational example, using complete episode returns.
The gradient estimator
The policy gradient theorem gives a gradient you can estimate from samples. For each action taken, you scale the gradient of its log probability by the return that followed:
- Run an episode and record states, actions, and rewards.
- Compute the return from each time step onward.
- Push up the log probability of actions that led to high returns and down for low ones.
Variance and baselines
REINFORCE is unbiased but high variance, because full returns are noisy. Subtracting a baseline, typically a state value estimate, from the return leaves the gradient unbiased while sharply cutting variance. The baseline centers returns so only the relative advantage of an action drives the update.
Key idea
REINFORCE follows the policy gradient by weighting each action's log probability gradient by the return it produced, and a baseline reduces the otherwise high variance without adding bias.