A fixed point you can iterate
The Bellman optimality equation is a fixed point, and value iteration finds it by repeated application. You start with arbitrary value estimates and update every state using the current estimates of its neighbors.
The update sweep
Each sweep applies the Bellman optimality backup to all states:
- For each state, compute the best action value using current next state values.
- Set the new state value to that maximum.
- Repeat sweeps until values stop changing meaningfully.
The update is a contraction with factor gamma, so each sweep shrinks the error toward the true optimal values. Convergence is geometric.
Extracting the policy
Value iteration converges on the values, not the policy. Once values have settled, you read off the optimal policy by choosing, in each state, the action that achieves the max. You only need this greedy step once at the end.
Key idea
Value iteration repeatedly applies the Bellman optimality backup to every state, contracting toward the optimal value function, then reads off the policy greedily at the end.