Gradient Clipping

The danger of huge gradients

Sometimes a batch produces an enormous gradient. This can happen with rare tokens, deep recurrent networks, or simply an unlucky batch of data. A single giant update can throw the weights far from any good region and cause the loss to spike or become NaN (not a number).

What clipping does

Gradient clipping limits how large the gradient can be before it is used to update the weights. The most common form is clipping by norm.

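Compute the total norm of all gradients, treating every parameter's gradient as part of one long vector (usually the L2 norm)
If the norm exceeds a chosen threshold, multiply the whole gradient by threshold / norm so its norm equals the threshold
Otherwise leave it unchanged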

This preserves the direction of the update while capping its magnitude.
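The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function name clip_by_global_norm, the list-of-lists gradient layout, and the example numbers are assumptions made for the sketch.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    grads is a list of per-parameter gradients, each a list of floats
    (a hypothetical layout chosen to keep the sketch dependency-free).
    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    # Step 1: total norm over every element of every gradient.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))

    # Step 3: under the threshold, leave the gradients unchanged.
    if total_norm <= max_norm:
        return grads, total_norm

    # Step 2: one shared scale factor, so the direction is preserved.
    scale = max_norm / total_norm
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, total_norm

# Two "parameters" whose combined norm is sqrt(9 + 16 + 144) = 13.
grads = [[3.0, 4.0], [12.0]]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
```

Because every element is multiplied by the same factor, the ratios between elements (here 3 : 4 : 12) are the same before and after clipping; only the overall length changes.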

Clip by value

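A simpler variant clips each individual gradient element to lie within a fixed range, such as between minus one and one. This is easy to implement, but because large elements are shrunk while small ones are left alone, it can change the update direction. For that reason clipping by norm is usually preferred.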
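A small sketch makes the change in direction concrete. The helper name clip_by_value and the example gradient are assumptions chosen for illustration.

```python
def clip_by_value(grad, limit):
    """Clamp each gradient element independently to [-limit, limit]."""
    return [max(-limit, min(limit, g)) for g in grad]

# One small element and one very large element.
grad = [0.5, 10.0]
clipped = clip_by_value(grad, limit=1.0)

# Ratio of second element to first, before and after clipping.
ratio_before = grad[1] / grad[0]
ratio_after = clipped[1] / clipped[0]
```

Only the large element is reduced, so the ratio between the two elements changes and the clipped gradient points in a different direction from the original. Clipping by norm would have scaled both elements by the same factor and kept the ratio intact.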

Where it helps most

Training recurrent networks and transformers
Runs that show occasional loss spikes
Large batch or high learning rate settings

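A typical threshold for the gradient norm is around one, though the best value depends on the model and is usually tuned. Clipping is a safety net: on ordinary steps it does nothing, and on a rare bad batch it keeps one update from undoing thousands of good steps.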
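The safety-net effect can be seen in a toy experiment. The sketch below is an assumption-laden illustration rather than a realistic training run: it minimizes the one-dimensional loss 0.5 * w**2 with plain gradient descent and injects a single made-up gradient of one million at step 50 to stand in for a bad batch. The names sgd_step and run are hypothetical.

```python
def sgd_step(w, grad, lr, max_norm=None):
    """One gradient-descent step on a scalar weight, with optional norm clipping."""
    # For a single scalar, the gradient norm is just its absolute value.
    if max_norm is not None and abs(grad) > max_norm:
        grad = grad * max_norm / abs(grad)
    return w - lr * grad

def run(max_norm, steps=100, bad_step=50):
    """Minimize 0.5 * w**2 and return the final weight (the optimum is 0)."""
    w = 2.0
    for step in range(steps):
        grad = w  # exact gradient of 0.5 * w**2
        if step == bad_step:
            grad = 1e6  # hypothetical corrupted gradient from one bad batch
        w = sgd_step(w, grad, lr=0.1, max_norm=max_norm)
    return w

w_unclipped = run(max_norm=None)
w_clipped = run(max_norm=1.0)
```

In this setup the unclipped run is thrown to roughly minus one hundred thousand by the bad step and is still hundreds of units from the optimum when the loop ends, while the clipped run moves by at most lr * max_norm = 0.1 on that step and finishes very close to zero.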

Key idea

Gradient clipping caps the update magnitude so a single huge gradient cannot destabilize the entire run.
