Policy Gradient Methods
Policy gradient methods optimize a parameterized policy directly, adjusting its parameters in the direction that increases expected return. Unlike value-based methods, they do not need a table of action values to choose actions.
The core idea
The policy is a function, often a neural network, that outputs action probabilities. We want parameters that maximize expected return, so we estimate the gradient of return with respect to those parameters and take ascent steps.
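As a minimal sketch of such a policy, assume the simplest possible parameterization: one logit per action, turned into probabilities with a softmax. The names softmax, grad_log_prob and theta are illustrative choices, not part of any particular library. A neural network policy would replace the logit vector but expose the same two operations: sample an action, and differentiate its log probability.

```python
import numpy as np

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability.
    z = logits - np.max(logits)
    e = np.exp(z)
    return e / e.sum()

def grad_log_prob(theta, action):
    # For a softmax over logits theta:
    # d/d theta log pi(action) = onehot(action) - pi.
    probs = softmax(theta)
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    return onehot - probs

theta = np.zeros(3)                       # hypothetical parameters: one logit per action
probs = softmax(theta)                    # uniform when all logits are equal
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)  # sample an action from the policy
g = grad_log_prob(theta, action)          # direction that makes `action` more likely
```

Moving theta along g raises the probability of the sampled action and lowers the others; how far to move, and in which sign, is what the return decides.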
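The policy gradient theorem gives a usable form: the gradient of expected return can be estimated from sampled episodes by weighting the gradient of each action's log probability by how good the outcome was. In effect, actions that led to high return become more likely and actions that led to low return become less likely.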
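Written out for the undiscounted episodic case, one common statement of this result is the following. The notation is an assumption introduced here: \theta for the policy parameters, \pi_\theta for the policy, \tau for a sampled episode of length T, r_k for the reward at step k, and G_t for the return that follows step t.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[
      \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t
    \right],
\qquad
G_t = \sum_{k=t}^{T-1} r_k .
```

The expectation is over episodes generated by the current policy, so it can be approximated by averaging over sampled episodes without knowing the environment's dynamics.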
REINFORCE
The simplest algorithm, REINFORCE, runs full episodes, then for each action taken scales the gradient of its log probability by the return that followed and steps the parameters in that direction. Over many episodes the policy shifts toward rewarding behavior.
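The loop described above can be sketched on a toy problem. Everything about the task is an assumption made for illustration: a four-state corridor where the agent starts at state 0, action 1 moves right, action 0 moves left, reaching state 3 pays 1.0 and every other step costs 0.1. The policy is a table of softmax logits, and the discount factor gamma, learning rate and episode count are arbitrary choices rather than recommended settings.

```python
import numpy as np

N_STATES, GOAL, MAX_STEPS = 4, 3, 50   # hypothetical corridor task
RIGHT = 1

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def step(state, action):
    # Action 1 moves right, action 0 moves left (bounded at state 0).
    nxt = min(state + 1, GOAL) if action == RIGHT else max(state - 1, 0)
    done = nxt == GOAL
    reward = 1.0 if done else -0.1
    return nxt, reward, done

def returns_to_go(rewards, gamma):
    # G_t = r_t + gamma * G_{t+1}, computed backwards over the episode.
    out = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def run_episode(theta, rng):
    # Roll out one full episode under the current policy.
    state, traj = 0, []
    for _ in range(MAX_STEPS):
        action = rng.choice(2, p=softmax(theta[state]))
        nxt, reward, done = step(state, action)
        traj.append((state, action, reward))
        state = nxt
        if done:
            break
    return traj

def reinforce(episodes=2000, lr=0.05, gamma=0.99, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.zeros((N_STATES, 2))    # one logit per (state, action)
    for _ in range(episodes):
        traj = run_episode(theta, rng)
        G = returns_to_go([r for _, _, r in traj], gamma)
        grad = np.zeros_like(theta)
        for (s, a, _), g in zip(traj, G):
            # Gradient of log softmax is onehot(a) - probs; scale it by the return.
            glog = -softmax(theta[s])
            glog[a] += 1.0
            grad[s] += g * glog
        theta += lr * grad             # gradient ascent step after the episode
    return theta

theta = reinforce()
p_right = softmax(theta[0])[RIGHT]     # probability of moving right from the start
```

Note that the update happens only after the episode ends, because the return for each action is not known until then. This sketch uses no baseline, so individual updates are noisy even on a task this small.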
Strengths
- It handles continuous action spaces naturally, which value-based methods struggle with because they must maximize over all actions.
- It can learn stochastic policies, which is useful when the best policy is itself random.
- It optimizes the objective we actually care about, return.
The variance problem
The plain estimator has high variance because returns vary wildly across episodes. A standard fix subtracts a baseline, often a value estimate for the current state, leaving an advantage that says how much better an action was than average. As long as the baseline does not depend on the action taken, this keeps the gradient unbiased while shrinking its variance.
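A small numerical check makes the effect concrete. The setup is an assumption chosen for clarity: a one-step problem with two actions, a fixed uniform softmax policy, and hypothetical mean payoffs of 10 and 11 with unit Gaussian noise. The baseline is the policy's expected reward, a constant that does not depend on the sampled action.

```python
import numpy as np

rng = np.random.default_rng(0)
probs = np.array([0.5, 0.5])           # fixed policy over two actions
mean_reward = np.array([10.0, 11.0])   # hypothetical payoffs per action

n = 100_000
actions = rng.choice(2, size=n, p=probs)
rewards = mean_reward[actions] + rng.normal(0.0, 1.0, size=n)

# Gradient of log pi(a) with respect to the logit of action 1
# under a softmax policy: 1[a == 1] - pi(1).
score = (actions == 1).astype(float) - probs[1]

# Constant baseline: the expected reward under the policy.
baseline = probs @ mean_reward

g_plain = rewards * score               # plain estimator, one sample per entry
g_base = (rewards - baseline) * score   # same estimator with the baseline subtracted

print("mean:", g_plain.mean(), g_base.mean())
print("variance:", g_plain.var(), g_base.var())
```

Both estimators target the same gradient, which works out analytically to 0.25 for this setup, so their sample means should agree closely. The variances differ sharply because the plain estimator multiplies the score by rewards near 10, while the baselined version multiplies it by deviations near zero.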
Key idea
Policy gradient methods adjust policy parameters along the gradient of expected return, naturally handling continuous and stochastic actions, with baselines used to cut variance.