Controlling update size
Large policy gradient steps can collapse performance. Proximal Policy Optimization (PPO) keeps each update close to the current policy using a clipped surrogate objective, aiming for much of the stability of trust region methods while using only plain first-order optimization.
The probability ratio
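PPO tracks the ratio of the new policy's probability for a sampled action to the old policy's probability for that same action, where the old policy is the one that collected the data. This ratio is multiplied by the advantage estimate to form the surrogate objective. The ratio equals one when the two policies agree on that action; if it drifts far from one, the new policy has moved too far from the policy that generated the data.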
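As a minimal sketch of the ratio described above: implementations commonly store log-probabilities rather than raw probabilities, so the ratio is computed as the exponential of a difference. The function names `probability_ratio` and `surrogate` are hypothetical and chosen only for illustration, and the example works on a single action with plain floats instead of batched tensors.

```python
import math


def probability_ratio(new_log_prob: float, old_log_prob: float) -> float:
    """Return pi_new(a|s) / pi_old(a|s) from the two log-probabilities.

    Working in log space is an assumption of this sketch; it avoids
    dividing by very small probabilities.
    """
    return math.exp(new_log_prob - old_log_prob)


def surrogate(ratio: float, advantage: float) -> float:
    """Unclipped surrogate objective for one sample: ratio times advantage."""
    return ratio * advantage


# The old policy gave the action probability 0.5; the new policy gives 0.6.
ratio = probability_ratio(math.log(0.6), math.log(0.5))
print(ratio)                   # close to 1.2: the action became more likely
print(surrogate(ratio, 2.0))   # close to 2.4 for an advantage of 2.0
```

A ratio of exactly one means the two policies agree on that action, which is always the case on the very first gradient step after data collection.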
The clip
The clipped objective takes the minimum of two terms:
- The raw ratio times advantage.
- The ratio clipped to a small interval around one, times advantage.
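For positive advantage, clipping caps how much the objective can gain from increasing the action's probability; for negative advantage, it caps the gain from decreasing it. The min makes the bound pessimistic: moves that hurt the objective are still counted in full, while moves past the clip range earn no extra credit, which removes any incentive to push the ratio beyond the clip range. This is what makes it practical to run several epochs of minibatch updates on the same collected data.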
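The two terms and the min can be sketched for a single sample as follows. The function name `clipped_surrogate` is hypothetical, and the default `epsilon = 0.2` is an assumed clip range (a commonly used value, not one given in the text). Real implementations average this quantity over a minibatch and maximize it; this sketch only shows the per-sample arithmetic.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO clipped objective: min of raw and clipped terms.

    epsilon is the half-width of the clip interval around one
    (assumed value, chosen for illustration).
    """
    # Clip the ratio to the interval [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    raw_term = ratio * advantage
    clipped_term = clipped_ratio * advantage
    # Taking the minimum keeps the more pessimistic of the two estimates.
    return min(raw_term, clipped_term)


# Positive advantage, ratio above the range: the gain is capped.
print(clipped_surrogate(1.5, 2.0))    # uses the clipped term (about 2.4, not 3.0)

# Negative advantage, ratio below the range: the gain is capped as well.
print(clipped_surrogate(0.5, -2.0))   # uses the clipped term (about -1.6, not -1.0)

# Negative advantage, ratio above the range: the move hurts, so the
# raw term is kept and the penalty is counted in full.
print(clipped_surrogate(1.5, -2.0))   # uses the raw term (-3.0)
```

In the first two cases the returned value no longer depends on the ratio, so its gradient with respect to the policy parameters is zero for that sample. That is the sense in which the objective removes the incentive to move further, rather than enforcing a hard constraint on the ratio.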
Key idea
PPO clips the policy probability ratio inside a pessimistic min objective, which removes the reward for moving far from the old policy, so multiple optimization epochs stay stable without explicit trust region machinery.