Machine Learning

The Gradient Clipping Recap

Cap runaway gradients so a single huge step cannot wreck training.

The problem

Sometimes gradients become enormous, a failure known as exploding gradients. A single oversized step can throw the parameters far from any good region and produce NaNs, especially in recurrent networks and deep stacks.

What clipping does

Gradient clipping limits the gradient before the update.

  • Norm clipping: if the gradient's overall norm exceeds a threshold, rescale it down to that norm while keeping its direction.
  • Value clipping: clamp each component independently to a fixed range such as [-c, c], which can change the gradient's direction.

Norm clipping is usually preferred because it preserves the update direction.
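
The two variants above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function names clip_by_norm and clip_by_value, the use of the L2 norm, and the example gradient are all assumptions chosen for the demonstration.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # small gradients pass through untouched
    scale = max_norm / norm        # same factor for every component
    return [g * scale for g in grad]

def clip_by_value(grad, limit):
    """Clamp each component independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

# Hypothetical exploding gradient with L2 norm 50 and a 3:4 component ratio.
grad = [30.0, 40.0]

by_norm = clip_by_norm(grad, 5.0)    # roughly [3.0, 4.0]: ratio stays 3:4
by_value = clip_by_value(grad, 5.0)  # [5.0, 5.0]: ratio collapses to 1:1
```

Under norm clipping the components keep their 3:4 ratio, so the update still points the same way. Under value clipping both components hit the limit and the update direction shifts toward the diagonal.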

Why it helps

Clipping bounds the worst-case step size, which smooths training while leaving typical small gradients untouched. It is a cheap safety net, not a replacement for tuning the learning rate.
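
To make the bound concrete, here is a small sketch assuming plain SGD, where the update is the learning rate times the clipped gradient. Under that assumption no step can be longer than lr * max_norm. The function names, parameter values, and gradients are hypothetical, and adaptive optimizers rescale updates further, so the exact bound would differ there.

```python
import math

def sgd_step_with_clipping(params, grad, lr, max_norm):
    """One plain SGD update using a norm-clipped gradient."""
    norm = math.sqrt(sum(g * g for g in grad))
    # Scale is 1.0 for small gradients and max_norm / norm for large ones.
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [p - lr * scale * g for p, g in zip(params, grad)]

def step_length(old, new):
    """Euclidean distance the parameters moved in one update."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(old, new)))

params = [1.0, -2.0]
lr, max_norm = 0.1, 1.0

# Typical gradient (norm 0.5): below the threshold, so it is not changed.
small = sgd_step_with_clipping(params, [0.3, 0.4], lr, max_norm)
# Exploding gradient (norm 5e6): clipped down to norm 1.0 before the update.
huge = sgd_step_with_clipping(params, [3e6, 4e6], lr, max_norm)

small_len = step_length(params, small)  # about lr * 0.5 = 0.05
huge_len = step_length(params, huge)    # about lr * max_norm = 0.1
```

Without clipping, the second update would have moved the parameters by about 500,000. With clipping it moves them by about 0.1, while the ordinary update is exactly what it would have been anyway.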

It is especially common when training recurrent networks and transformers on long sequences.

Key idea

Gradient clipping caps the gradient norm before an update so exploding gradients cannot cause a single ruinous step, acting as a cheap safety net that preserves the update direction.

Check yourself

1. What problem does gradient clipping address?

2. Why is norm clipping usually preferred over value clipping?