Proximal Policy Optimization

Proximal policy optimization (PPO) is one of the most widely used deep reinforcement learning algorithms. It improves a policy with gradient steps while limiting how far each update can move the policy away from the one that collected the data.

The problem it solves

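Policy gradient updates can be destructive. The gradient is estimated from data collected under the old policy, so it is only reliable near that policy. A single large step can move the policy into a bad region, and because the next batch of data is then collected by that worse policy, it may never recover. PPO keeps updates proximal, that is, close to the current policy, for stability.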

The clipped objective

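For each sampled action, PPO computes the ratio of its probability under the new policy to its probability under the old policy, and maximizes that ratio weighted by the advantage. It also clips the ratio to a small range around one, from 1 - ε to 1 + ε (ε = 0.2 is a common choice), and uses the smaller of the clipped and unclipped terms.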
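Written out, the objective described above takes roughly the following form. The symbols are notation chosen here for illustration: r_t is the probability ratio, \hat{A}_t the advantage estimate at step t, θ and θ_old the new and old policy parameters, and ε the clip range.

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
\qquad
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[
  \min\!\Big( r_t(\theta)\,\hat{A}_t,\;
              \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t \Big)
\right]
```

Taking the minimum makes the objective a pessimistic bound: clipping only ever lowers the objective relative to the unclipped term, never raises it.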

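If an update would push the ratio outside that range in the direction the advantage favors, the clipped term stops changing, so the objective offers no incentive to go further.
This limits how much the policy shifts per update without the constrained optimization that trust region methods require. The limit is soft: clipping removes the incentive to move further rather than enforcing a hard bound.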
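A minimal per-sample sketch of how the clip removes the incentive, using only the Python standard library. The function name `clipped_surrogate`, its arguments, and the default `eps=0.2` are illustrative assumptions, not taken from any particular library.

```python
import math


def clipped_surrogate(new_logp, old_logp, advantage, eps=0.2):
    """Per-sample clipped objective (to be maximized); a sketch, not a full loss.

    new_logp / old_logp are log-probabilities of the sampled action under the
    new and old policy. eps is the assumed clip range.
    """
    # Probability ratio of new to old policy for the sampled action.
    ratio = math.exp(new_logp - old_logp)
    # Restrict the ratio to [1 - eps, 1 + eps].
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    # Use the smaller (more pessimistic) of the two weighted terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: raising the ratio from 1.5 to 3.0 gains nothing more.
small_step = clipped_surrogate(math.log(1.5), 0.0, advantage=1.0)
large_step = clipped_surrogate(math.log(3.0), 0.0, advantage=1.0)

# Negative advantage: lowering the ratio from 0.5 to 0.1 gains nothing more.
small_drop = clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0)
large_drop = clipped_surrogate(math.log(0.1), 0.0, advantage=-1.0)
```

Here `small_step` and `large_step` come out equal, as do `small_drop` and `large_drop`: once the ratio is past the clip boundary the objective is flat, so its gradient with respect to the policy is zero for that sample. A move in the harmful direction (for example a ratio of 1.5 with a negative advantage) is not clipped and is penalized in full.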

Why it caught on

It is far simpler to implement than earlier trust region methods such as TRPO, yet performs comparably.
It allows multiple epochs of updates on the same batch of data, which improves sample efficiency over policy gradient methods that use each batch only once.
It is robust across many tasks with relatively little tuning, making it a common default choice.
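The multiple-epoch reuse of one batch can be sketched as a generator of shuffled minibatch indices. The helper name `epoch_minibatches` and the specific sizes below are hypothetical choices for illustration. In a real training loop each yielded index list would select the samples for one gradient step on the clipped objective.

```python
import random


def epoch_minibatches(num_samples, minibatch_size, num_epochs, seed=0):
    """Yield (epoch, indices) pairs that revisit one collected batch several times."""
    rng = random.Random(seed)
    indices = list(range(num_samples))
    for epoch in range(num_epochs):
        # Reshuffle so each epoch sees the same data in a different order.
        rng.shuffle(indices)
        for start in range(0, num_samples, minibatch_size):
            # Slicing copies, so later shuffles do not alter yielded lists.
            yield epoch, indices[start:start + minibatch_size]


# Example: 10 collected samples, minibatches of 4, reused for 3 epochs.
schedule = list(epoch_minibatches(num_samples=10, minibatch_size=4, num_epochs=3))
```

Every sample is used once per epoch, so the batch here contributes three passes of gradient steps instead of one. The clipped objective is what makes this reuse tolerable, since the policy drifts away from the data-collecting policy as the epochs proceed.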

In practice

PPO is usually run as an actor-critic method: the critic's value estimates are used to compute advantages, and the clipped objective controls the actor's updates. It underlies many results in robotics and games, and in training language models with reinforcement learning from human feedback (RLHF).
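One common way to turn critic values into advantages is generalized advantage estimation (GAE). The text above does not say which estimator is used, so treat this as an assumed choice. The function name, the `gamma` and `lam` defaults, and the single uninterrupted trajectory are all assumptions made for this sketch.

```python
def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Advantage estimates for one trajectory segment with no episode boundary inside.

    rewards[t] and values[t] are the reward and critic value at step t.
    last_value is the critic value of the state after the final step
    (use 0.0 if the episode ended there).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    # Work backwards so each step can reuse the accumulated future terms.
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error from the critic.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With lam=0 the estimate reduces to the one-step TD error at each step.
td_only = gae_advantages([1.0, 2.0], [0.5, 0.5], last_value=0.0, gamma=1.0, lam=0.0)
```

Setting `lam` closer to 1 relies more on observed rewards (lower bias, higher variance), while values closer to 0 rely more on the critic.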

Key idea

PPO is a stable, simple policy gradient method that clips the probability ratio to keep each update close to the current policy, balancing improvement against safety.
