← Lessons

quiz vs the machine

Gold1400

Machine Learning

Vanishing and Exploding Gradients

Why deep networks struggle to learn when gradients shrink or grow without bound.

5 min read · core · beat Gold to climb

The chain of multiplications

Training a deep network uses backpropagation, which multiplies many derivatives together as the error signal flows backward through layers. When you multiply many numbers, two failure modes appear.

  • If the factors are mostly less than one, the product shrinks toward zero
  • If the factors are mostly greater than one, the product grows without bound

These are the vanishing and exploding gradient problems.

What goes wrong

When gradients vanish, the early layers receive almost no learning signal, so they barely update and the network cannot learn long range structure. When gradients explode, the updates become huge and the loss spikes or turns into not a number.

Common cures

  • Better activations such as ReLU keep derivatives near one for positive inputs
  • Careful initialization sets weights so signal variance is preserved across layers
  • Normalization layers keep activations in a stable range
  • Residual connections give gradients a short path back to early layers
  • Gradient clipping directly caps the exploding case

Recurrent networks were especially prone to this, which motivated gated designs and ultimately the transformer, whose attention paths and residuals keep gradients well behaved across depth.

Key idea

Backpropagation multiplies many derivatives, so gradients can vanish or explode unless activations, initialization, normalization, and residuals keep them stable.

Check yourself

Answer to earn rating on the learn ladder.

1. Why do gradients vanish or explode in deep networks?

2. Which technique gives gradients a short path back to early layers?

3. What is a direct remedy specifically for exploding gradients?