Two alternating phases
Policy iteration improves a policy by cycling two steps until convergence:
- Policy evaluation computes the value function of the current fixed policy, solving the linear Bellman expectation equations.
- Policy improvement makes the policy greedy with respect to those values.
Why it works
The policy improvement theorem guarantees that acting greedily over a policy's own values yields a policy at least as good. Each improvement step strictly helps unless the policy is already optimal, in which case it stays put. Because there are finitely many deterministic policies, the loop terminates at the optimum.
Versus value iteration
Policy iteration often converges in remarkably few iterations because each evaluation fully solves the current policy before improving. The cost is that full evaluation is expensive. Value iteration truncates evaluation to a single sweep. Modified policy iteration sits between them, running a few evaluation sweeps before improving.
Key idea
Policy iteration alternates exact policy evaluation with greedy policy improvement; the improvement theorem ensures monotonic progress to the optimal policy in finitely many steps.