A principled step size
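Trust Region Policy Optimization (TRPO) was designed so that each policy gradient update improves the policy, or at least does not degrade it. Its underlying theory gives a monotonic improvement guarantee, which the practical algorithm approximates. Instead of relying on a hand-tuned learning rate, to which policy gradient methods are very sensitive, it bounds how far the policy may move per update with a KL divergence constraint.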
The constrained objective
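TRPO maximizes a surrogate advantage objective subject to a hard constraint: the average KL divergence between the old and new policies must stay below a small threshold. The KL divergence measures how much the two policies' action distributions differ, averaged over the states visited by the old policy. Because the surrogate is estimated from data collected under the old policy, it is only accurate near that policy; staying inside this trust region keeps it a reliable approximation of the true objective.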
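Written out, the problem described above takes roughly the following form. The notation is an assumption made here for illustration: \theta_old and \theta are the old and new policy parameters, A is the advantage estimated under the old policy, and \delta is the KL threshold.

```latex
\max_{\theta} \;
  \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}
  \left[
    \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}
    \, A^{\pi_{\theta_{\mathrm{old}}}}(s, a)
  \right]
\quad \text{subject to} \quad
  \mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}
  \left[
    D_{\mathrm{KL}}\!\left(
      \pi_{\theta_{\mathrm{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s)
    \right)
  \right] \le \delta
```

The ratio inside the expectation reweights old-policy samples so the objective can be evaluated for a new policy without collecting fresh data.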
Solving it
The constrained problem is solved approximately each step:
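- Linearize the objective and quadratically approximate the KL constraint around the current policy parameters.
- Use the conjugate gradient method to find the natural gradient direction. This needs only Fisher-vector products, so the full Fisher information matrix is never formed or inverted.
- Apply a backtracking line search along that direction, shrinking the step until the actual KL limit holds and the surrogate improves.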
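The three steps above can be sketched on a toy problem. This is a minimal illustration, not a full TRPO implementation: it assumes a single-state softmax policy over four actions, exact advantages instead of sampled estimates, and a closed-form Fisher-vector product. The function names, the damping term, and the halving schedule are choices made for this sketch.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def surrogate(theta, adv):
    # E_{a ~ pi_old}[(pi_theta / pi_old) * A] reduces to pi_theta . A for one state.
    return softmax(theta) @ adv

def kl(pi_old, theta):
    # KL(pi_old || pi_theta) for categorical distributions.
    return np.sum(pi_old * (np.log(pi_old) - np.log(softmax(theta))))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    # Solves F x = g using only matrix-vector products fvp(v) = F v.
    x = np.zeros_like(g)
    r, p = g.copy(), g.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = fvp(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def trpo_step(theta_old, adv, delta=0.01, damping=1e-3, backtracks=10):
    pi_old = softmax(theta_old)
    # Step 1: gradient of the surrogate at theta_old (the linearized objective).
    g = pi_old * (adv - pi_old @ adv)

    def fvp(v):
        # Fisher matrix of a softmax policy is diag(pi) - pi pi^T; it is never
        # built explicitly. Damping keeps the system positive definite.
        return pi_old * v - pi_old * (pi_old @ v) + damping * v

    # Step 2: natural gradient direction x ~= F^{-1} g via conjugate gradient.
    x = conjugate_gradient(fvp, g)
    # Largest step allowed by the quadratic KL model: 0.5 * s^T F s = delta.
    full_step = np.sqrt(2.0 * delta / (x @ fvp(x))) * x

    # Step 3: backtracking line search against the actual KL and surrogate.
    base = surrogate(theta_old, adv)
    for k in range(backtracks):
        theta_new = theta_old + 0.5 ** k * full_step
        if kl(pi_old, theta_new) <= delta and surrogate(theta_new, adv) > base:
            return theta_new
    return theta_old  # no acceptable step found; keep the old policy

theta_old = np.zeros(4)
adv = np.array([1.0, 0.0, -1.0, 0.5])
theta_new = trpo_step(theta_old, adv, delta=0.01)
```

In a real agent the gradient and Fisher-vector products would come from automatic differentiation over sampled trajectories, but the structure of the update is the same.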
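This machinery is powerful but heavy. Proximal Policy Optimization (PPO) later approximated the same trust region idea by clipping the probability ratio in the objective, which needs only ordinary first-order optimization. That simplicity is why PPO largely replaced TRPO in practice while inheriting its core insight.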
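For comparison, the clipped objective can be sketched in a few lines. This assumes the commonly used form of the PPO clip with a clip range eps of 0.2; the function name and the default value are illustrative choices, not taken from the text above.

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s) for each sample, adv = advantage estimate.
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    # Taking the minimum removes any incentive to push the ratio
    # outside [1 - eps, 1 + eps] in the direction that raises the objective.
    return np.mean(np.minimum(ratio * adv, clipped * adv))
```

Where TRPO enforces the trust region with an explicit constraint, the clip only flattens the objective once the ratio moves too far, so it is a softer and cheaper stand-in.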
Key idea
TRPO maximizes a surrogate advantage under a KL divergence trust region, using conjugate gradient and a line search to take the largest step that stays inside the trust region and still improves the surrogate.