← Lessons

quiz vs the machine

Gold1340

Machine Learning

The Policy Iteration Algorithm

Alternating between evaluating a policy and improving it until neither changes.

5 min read · core · beat Gold to climb

Two alternating phases

Policy iteration improves a policy by cycling two steps until convergence:

  • Policy evaluation computes the value function of the current fixed policy, solving the linear Bellman expectation equations.
  • Policy improvement makes the policy greedy with respect to those values.

Why it works

The policy improvement theorem guarantees that acting greedily over a policy's own values yields a policy at least as good. Each improvement step strictly helps unless the policy is already optimal, in which case it stays put. Because there are finitely many deterministic policies, the loop terminates at the optimum.

Versus value iteration

Policy iteration often converges in remarkably few iterations because each evaluation fully solves the current policy before improving. The cost is that full evaluation is expensive. Value iteration truncates evaluation to a single sweep. Modified policy iteration sits between them, running a few evaluation sweeps before improving.

Key idea

Policy iteration alternates exact policy evaluation with greedy policy improvement; the improvement theorem ensures monotonic progress to the optimal policy in finitely many steps.

Check yourself

Answer to earn rating on the learn ladder.

1. What guarantees progress in policy iteration?

2. How does value iteration relate to policy iteration?