Vanishing and Exploding Gradients

Why deep networks struggle to learn when gradients shrink or grow without bound.

The chain of multiplications

Training a deep network uses backpropagation, which multiplies many derivatives together as the error signal flows backward through layers. When you multiply many numbers, two failure modes appear.

If the factors are mostly less than one, the product shrinks toward zero
If the factors are mostly greater than one, the product grows without bound

These are the vanishing and exploding gradient problems.

What goes wrong

When gradients vanish, the early layers receive almost no learning signal, so they barely update and the network cannot learn long range structure. When gradients explode, the updates become huge and the loss spikes or turns into not a number.

Common cures

Better activations such as ReLU keep derivatives near one for positive inputs
Careful initialization sets weights so signal variance is preserved across layers
Normalization layers keep activations in a stable range
Residual connections give gradients a short path back to early layers
Gradient clipping directly caps the exploding case

Recurrent networks were especially prone to this, which motivated gated designs and ultimately the transformer, whose attention paths and residuals keep gradients well behaved across depth.

Key idea

Backpropagation multiplies many derivatives, so gradients can vanish or explode unless activations, initialization, normalization, and residuals keep them stable.

Vanishing and Exploding Gradients

The chain of multiplications

What goes wrong

Common cures

Key idea

Check yourself