Machine Learning

The TRPO Trust Region Method

Guaranteeing monotonic improvement by constraining policy updates with a KL divergence limit.

7 min read · advanced

A principled step size

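Trust Region Policy Optimization (TRPO) was designed so that each policy gradient update improves the policy, or at least does not make it worse. Rather than relying on a fragile learning rate, it bounds how far the policy may move per update with a constraint on the KL divergence between the old and new policies. The monotonic improvement guarantee holds for the idealized algorithm; the practical version relies on approximations, so improvement is expected rather than strictly guaranteed.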

The constrained objective

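TRPO maximizes a surrogate advantage objective subject to a hard constraint. The surrogate is the expected advantage of the sampled actions, weighted by the ratio of new to old action probabilities. The constraint requires the average KL divergence between the old and new policies, taken over the states the old policy visited, to stay below a small threshold. KL divergence measures how much the two action distributions differ at a state. Staying inside this trust region keeps the surrogate a reliable approximation of the true objective.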
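The two quantities in the constrained problem can be sketched for a discrete action space. This is a minimal illustration, not a reference implementation: the function name `surrogate_and_kl`, the toy probabilities, and the threshold `delta = 0.01` are all assumptions chosen for the example, and every probability is assumed to be strictly positive.

```python
import numpy as np

def surrogate_and_kl(old_probs, new_probs, actions, advantages):
    """Return (surrogate objective, mean KL(old || new)) for categorical policies.

    old_probs, new_probs: shape (N, A), one action distribution per sampled state.
    actions: shape (N,), integer actions sampled under the old policy.
    advantages: shape (N,), advantage estimates for those actions.
    """
    idx = np.arange(len(actions))
    # Probability ratio of the sampled action under the new vs. old policy.
    ratio = new_probs[idx, actions] / old_probs[idx, actions]
    surrogate = np.mean(ratio * advantages)
    # KL divergence per state, then averaged over the sampled states.
    kl_per_state = np.sum(old_probs * (np.log(old_probs) - np.log(new_probs)), axis=1)
    return surrogate, np.mean(kl_per_state)

# Toy data (hypothetical): two states, two actions.
old = np.array([[0.5, 0.5], [0.5, 0.5]])
new = np.array([[0.6, 0.4], [0.5, 0.5]])
s, kl = surrogate_and_kl(old, new, np.array([0, 1]), np.array([1.0, -1.0]))

delta = 0.01  # assumed trust region threshold
inside_trust_region = kl <= delta
```

A candidate policy is acceptable only if `inside_trust_region` is true; among acceptable candidates, TRPO prefers the one with the largest surrogate value.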

Solving it

The constrained problem is solved approximately each step:

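  • Linearize the surrogate objective around the current parameters and approximate the KL constraint by its second-order (quadratic) expansion.
  • Use the conjugate gradient method to find the natural gradient direction from Fisher-vector products alone, so the full Fisher information matrix is never formed or inverted.
  • Apply a backtracking line search along that direction, shrinking the step until the actual KL limit holds and the surrogate improves.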
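The three steps above can be sketched on a toy problem. This is a hedged illustration under strong assumptions: a fixed symmetric positive definite matrix `F` stands in for the Fisher information matrix, the surrogate is exactly linear, and the KL is exactly quadratic. In a real implementation the Fisher-vector product would come from differentiating the policy's KL, often with a small damping term. All names here (`conjugate_gradient`, `trpo_step`, `fvp`, `toy_surrogate`, `toy_kl`) are hypothetical.

```python
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products fvp(v) = F v."""
    x = np.zeros_like(g)
    r = g.copy()   # residual g - F x, with x starting at zero
    p = g.copy()   # current search direction
    rr = r @ r
    for _ in range(iters):
        Fp = fvp(p)
        alpha = rr / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rr_new = r @ r
        if rr_new < tol:
            break
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x

def trpo_step(theta, g, fvp, surrogate, kl, delta=0.01, backtracks=10, shrink=0.5):
    """One update: natural gradient direction, scaled step, backtracking line search."""
    x = conjugate_gradient(fvp, g)
    # Largest step along x allowed by the quadratic KL model: 0.5 * s^T F s = delta.
    full_step = np.sqrt(2.0 * delta / (x @ fvp(x))) * x
    base = surrogate(theta)
    for k in range(backtracks):
        candidate = theta + (shrink ** k) * full_step
        # Accept only if the actual KL limit holds and the surrogate improves.
        if kl(theta, candidate) <= delta and surrogate(candidate) > base:
            return candidate
    return theta  # no acceptable step found, so keep the old parameters

# Toy problem (hypothetical stand-ins for a real policy).
F = np.array([[2.0, 0.3], [0.3, 1.0]])
g = np.array([1.0, -0.5])
theta = np.zeros(2)

def fvp(v):
    return F @ v

def toy_surrogate(th):
    return g @ th

def toy_kl(old, new):
    d = new - old
    return 0.5 * d @ F @ d

new_theta = trpo_step(theta, g, fvp, toy_surrogate, toy_kl)
```

The conjugate gradient routine only ever calls `fvp`, which is the point of the second step: the cost per iteration is one matrix-vector product rather than a matrix inversion.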

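This machinery is powerful but heavy. PPO later approximated the same trust region idea by clipping the probability ratio inside the surrogate objective, which needs only ordinary first-order gradient steps. That simplicity is why PPO largely replaced TRPO in practice while inheriting its core insight.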
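For comparison, the clipped objective can be sketched in a few lines. This is a minimal illustration, not any library's API: the function name `ppo_clip_objective` and the clip range `eps = 0.2` are assumptions chosen for the example.

```python
import numpy as np

def ppo_clip_objective(ratio, advantages, eps=0.2):
    """Clipped surrogate: mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Taking the minimum removes any gain from pushing the ratio outside the clip range.
    return np.mean(np.minimum(unclipped, clipped))

# Toy values (hypothetical): both ratios lie outside [0.8, 1.2].
ratio = np.array([1.5, 0.5])
advantages = np.array([1.0, -1.0])
objective = ppo_clip_objective(ratio, advantages)
```

Where TRPO rejects or shrinks a step that violates the KL limit, the clip simply stops rewarding the policy for moving the ratio further from 1, so no second-order machinery is needed.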

Key idea

TRPO maximizes a surrogate advantage under a KL divergence trust region, using conjugate gradient and a backtracking line search to take the largest step that stays within the KL limit and still improves the surrogate.

Check yourself

1. What does TRPO constrain on each update?

2. Which method finds the update direction in TRPO?

3. How does PPO relate to TRPO?