The Policy Iteration Algorithm

Two alternating phases

Policy iteration improves a policy by cycling two steps until convergence:

Policy evaluation computes the value function of the current fixed policy, solving the linear Bellman expectation equations.
Policy improvement makes the policy greedy with respect to those values.

Why it works

The policy improvement theorem guarantees that acting greedily over a policy's own values yields a policy at least as good. Each improvement step strictly helps unless the policy is already optimal, in which case it stays put. Because there are finitely many deterministic policies, the loop terminates at the optimum.

Versus value iteration

Policy iteration often converges in remarkably few iterations because each evaluation fully solves the current policy before improving. The cost is that full evaluation is expensive. Value iteration truncates evaluation to a single sweep. Modified policy iteration sits between them, running a few evaluation sweeps before improving.

Key idea

Policy iteration alternates exact policy evaluation with greedy policy improvement; the improvement theorem ensures monotonic progress to the optimal policy in finitely many steps.

The Policy Iteration Algorithm

Two alternating phases

Why it works

Versus value iteration

Key idea

Check yourself