The TRPO Trust Region Method

Guaranteeing monotonic improvement by constraining policy updates with a KL divergence limit.

A principled step size

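Trust Region Policy Optimization (TRPO) was designed so that each policy gradient update improves the policy, or at least does not degrade it. Rather than relying on a fragile learning rate, it bounds how far the policy may move per update using a divergence constraint. The monotonic improvement guarantee holds exactly only for an idealized form of the update; the practical algorithm approximates it.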

The constrained objective

TRPO maximizes a surrogate advantage objective, the expected advantage weighted by the ratio of new to old action probabilities, subject to a hard constraint: the average KL divergence between the old and new policies must stay below a small threshold. The KL divergence measures how much the action distributions differ, averaged over states visited by the old policy. Staying inside this trust region keeps the surrogate a reliable approximation of the true objective.
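
As a rough illustration of the two quantities involved, the sketch below computes a sampled surrogate advantage and an average KL divergence for discrete (categorical) policies. The function names, the list-based data layout, and the threshold `delta=0.01` are assumptions made for this example, not part of any particular library.

```python
import math

def surrogate_advantage(new_probs, old_probs, advantages):
    """Mean of (new_prob / old_prob) * advantage over sampled actions."""
    terms = [(p_new / p_old) * adv
             for p_new, p_old, adv in zip(new_probs, old_probs, advantages)]
    return sum(terms) / len(terms)

def mean_kl(old_dists, new_dists):
    """Average KL(old || new) over states, for categorical action distributions."""
    total = 0.0
    for old, new in zip(old_dists, new_dists):
        # Terms with zero old probability contribute nothing to the KL.
        total += sum(p * math.log(p / q) for p, q in zip(old, new) if p > 0.0)
    return total / len(old_dists)

def inside_trust_region(old_dists, new_dists, delta=0.01):
    """True if the average KL stays within the (assumed) threshold delta."""
    return mean_kl(old_dists, new_dists) <= delta
```

When the new policy equals the old one, every ratio is 1, so the surrogate reduces to the mean advantage and the KL is zero. A large shift such as moving from `[0.5, 0.5]` to `[0.9, 0.1]` falls well outside a threshold of 0.01.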

Solving it

The constrained problem is solved approximately each step:

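Linearize the objective and approximate the KL constraint with a quadratic whose curvature is the Fisher information matrix.
Use the conjugate gradient method to find the natural gradient direction using only matrix-vector products, without forming or inverting the full matrix.
Apply a backtracking line search along that direction, shrinking the step until the actual KL limit holds and the surrogate improves.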
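
The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: `fvp` (a function returning the Fisher-vector product), `surrogate`, `kl`, and `trpo_step` are hypothetical names, parameters are flat lists of floats, and `delta=0.01` is an assumed threshold.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g using only products F v, never F itself."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - F x, with x starting at zero
    p = list(g)  # current search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Fp = fvp(p)
        alpha = rr / dot(p, Fp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * fi for ri, fi in zip(r, Fp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

def trpo_step(theta, g, fvp, surrogate, kl, delta=0.01, backtracks=10):
    """One update: natural gradient direction, then backtracking line search."""
    x = conjugate_gradient(fvp, g)
    curvature = dot(x, fvp(x))
    if curvature <= 0.0:
        return theta  # no usable direction (for example, a zero gradient)
    # Largest step along x allowed by the quadratic KL model 0.5 * s^T F s <= delta.
    scale = math.sqrt(2.0 * delta / curvature)
    base = surrogate(theta)
    for k in range(backtracks):
        frac = 0.5 ** k
        candidate = [t + frac * scale * xi for t, xi in zip(theta, x)]
        # Accept only if the actual KL limit holds and the surrogate improves.
        if kl(theta, candidate) <= delta and surrogate(candidate) > base:
            return candidate
    return theta  # reject the update if no shrunken step qualifies
```

The line search matters because the quadratic model of the KL is only an approximation: the full step can violate the real constraint or fail to improve the surrogate, in which case the step is halved until both checks pass.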

This machinery is powerful but heavy. PPO later approximated the same trust region idea by clipping the probability ratio inside the objective, which needs only ordinary gradient steps. That simplicity is why PPO largely replaced TRPO in practice while inheriting its core insight.
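
For comparison, a clipped objective of the kind PPO uses might look like the following sketch. The function name and the clip range `eps=0.2` are assumptions chosen for illustration.

```python
def ppo_clip_objective(new_probs, old_probs, advantages, eps=0.2):
    """Mean clipped surrogate over sampled actions (a sketch, not a full PPO loss)."""
    total = 0.0
    for p_new, p_old, adv in zip(new_probs, old_probs, advantages):
        ratio = p_new / p_old
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # Taking the minimum removes any gain from pushing the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Here the clip plays the role of the trust region: once the ratio leaves the interval from `1 - eps` to `1 + eps` in the direction that would raise the objective, the objective stops rewarding further movement.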

Key idea

TRPO maximizes a surrogate advantage under a KL divergence trust region, using conjugate gradient and a line search to take the largest step that stays inside the region and still improves the surrogate.
