The Gradient Clipping Recap

The problem

Sometimes gradients become enormous, a failure known as exploding gradients. A single oversized step can throw the parameters far from any good region and produce NaNs, especially in recurrent networks and deep stacks.

What clipping does

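Gradient clipping limits the size of the gradient before the optimizer applies the update.

Norm clipping: if the gradient's overall norm exceeds a threshold, multiply the gradient by threshold / norm so that its norm equals the threshold and its direction is unchanged.
Value clipping: clamp each component independently to a fixed range such as [-c, c], which can change the gradient's direction.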

Norm clipping is usually preferred because it preserves the update direction.
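
The difference between the two variants can be seen in a minimal sketch. The function names clip_by_norm and clip_by_value, the use of the L2 norm, and the example numbers are assumptions made for illustration; they are not taken from any particular library.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so its L2 norm is at most max_norm; direction is unchanged."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)          # small gradients pass through untouched
    scale = max_norm / norm        # a factor in (0, 1)
    return [g * scale for g in grad]

def clip_by_value(grad, limit):
    """Clamp every component independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

grad = [30.0, 40.0]                # L2 norm is 50, components in ratio 3:4

by_norm = clip_by_norm(grad, 5.0)  # norm becomes 5, ratio stays 3:4
by_value = clip_by_value(grad, 5.0)  # both components hit the limit, ratio becomes 1:1

print(by_norm, by_value)
```

In this example value clipping flattens both components to the same limit, so the update points in a different direction than the original gradient, while norm clipping only shortens it.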

Why it helps

Clipping bounds the worst-case step size, which stabilizes training, while leaving gradients below the threshold unchanged. It is a cheap safety net, not a replacement for choosing a sensible learning rate.
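
As a rough illustration of the bounded step, the sketch below applies one plain SGD update with and without norm clipping. The helper names sgd_step and step_length, the learning rate, the threshold, and the gradient values are all hypothetical. The bound of learning rate times threshold holds for plain SGD as written here; optimizers that rescale the gradient further (momentum, Adam and similar) behave differently.

```python
import math

def sgd_step(params, grad, lr, max_norm=None):
    """One plain SGD update, optionally clipping the gradient's L2 norm first."""
    if max_norm is not None:
        norm = math.sqrt(sum(g * g for g in grad))
        if norm > max_norm:
            grad = [g * (max_norm / norm) for g in grad]
    return [p - lr * g for p, g in zip(params, grad)]

def step_length(before, after):
    """Euclidean distance the parameters moved in one update."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(after, before)))

params = [1.0, -2.0]
lr, max_norm = 0.1, 1.0            # illustrative values only

small = [0.3, -0.4]                # norm 0.5, below the threshold
spike = [3000.0, -4000.0]          # norm 5000, an exploding gradient

small_plain = sgd_step(params, small, lr)
small_clipped = sgd_step(params, small, lr, max_norm)   # identical to small_plain
spike_plain = sgd_step(params, spike, lr)               # moves about 500 units
spike_clipped = sgd_step(params, spike, lr, max_norm)   # moves at most lr * max_norm

print(step_length(params, spike_plain), step_length(params, spike_clipped))
```

The small gradient produces the same update either way, while the spike's step shrinks from hundreds of units to no more than lr * max_norm.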

It is especially common when training recurrent networks and transformers on long sequences.

Key idea

Gradient clipping caps the gradient norm before an update so exploding gradients cannot cause a single ruinous step, acting as a cheap safety net that preserves the update direction.