Dynamic Programming for RL

When the transitions and rewards of an MDP are known, dynamic programming can compute the optimal policy exactly by repeatedly applying the Bellman equation.

Policy evaluation

Given a fixed policy, policy evaluation computes its value function. It sweeps over all states, updating each value to the immediate reward plus discounted neighbor values, and repeats until the values stop changing.

Policy improvement

Once values are known, policy improvement makes the policy greedy: in each state choose the action with the highest expected value. The new policy is guaranteed to be at least as good as the old one.

Two algorithms

Policy iteration alternates full evaluation and improvement until the policy stops changing.
Value iteration merges the two by applying the Bellman optimality update directly, skipping full evaluation between improvements.

Both converge to the optimal policy because each step is a contraction that moves values toward the unique fixed point.

The catch

Dynamic programming needs the full model and visits every state, so it does not scale to huge or unknown environments. It is the theoretical backbone, while sampling methods handle the realistic case.

Key idea

Dynamic programming alternates evaluating and improving a policy using a known model, converging to the optimal policy but requiring full knowledge of the MDP.

Dynamic Programming for RL