The Value Iteration Algorithm

Turning the Bellman optimality equation into a repeated sweep that converges to optimal values.

A fixed point you can iterate

The Bellman optimality equation is a fixed point, and value iteration finds it by repeated application. You start with arbitrary value estimates and update every state using the current estimates of its neighbors.

The update sweep

Each sweep applies the Bellman optimality backup to all states:

For each state, compute the best action value using current next state values.
Set the new state value to that maximum.
Repeat sweeps until values stop changing meaningfully.

The update is a contraction with factor gamma, so each sweep shrinks the error toward the true optimal values. Convergence is geometric.

Extracting the policy

Value iteration converges on the values, not the policy. Once values have settled, you read off the optimal policy by choosing, in each state, the action that achieves the max. You only need this greedy step once at the end.

Key idea

Value iteration repeatedly applies the Bellman optimality backup to every state, contracting toward the optimal value function, then reads off the policy greedily at the end.

The Value Iteration Algorithm

A fixed point you can iterate

The update sweep

Extracting the policy

Key idea

Check yourself