Dynamic Programming for RL
When the transitions and rewards of an MDP are known, dynamic programming can compute the optimal policy exactly by repeatedly applying the Bellman equation.
Policy evaluation
Given a fixed policy, policy evaluation computes its value function. It sweeps over all states, updating each value to the immediate reward plus discounted neighbor values, and repeats until the values stop changing.
Policy improvement
Once values are known, policy improvement makes the policy greedy: in each state choose the action with the highest expected value. The new policy is guaranteed to be at least as good as the old one.
Two algorithms
- Policy iteration alternates full evaluation and improvement until the policy stops changing.
- Value iteration merges the two by applying the Bellman optimality update directly, skipping full evaluation between improvements.
Both converge to the optimal policy because each step is a contraction that moves values toward the unique fixed point.
The catch
Dynamic programming needs the full model and visits every state, so it does not scale to huge or unknown environments. It is the theoretical backbone, while sampling methods handle the realistic case.
Key idea
Dynamic programming alternates evaluating and improving a policy using a known model, converging to the optimal policy but requiring full knowledge of the MDP.