← Lessons

quiz vs the machine

Gold1450

Machine Learning

Vanishing and Exploding Gradients Revisited

Why deep chains of multiplication corrupt the learning signal.

5 min read · core · beat Gold to climb

Vanishing and Exploding Gradients Revisited

Because backprop multiplies many local gradients together, the signal can shrink toward zero or blow up toward infinity as it travels through many layers.

The multiplication problem

  • If each layer scales the gradient by less than one, the product vanishes after many layers.
  • If each layer scales it by more than one, the product explodes.
  • Either way, early layers receive a corrupted update.

Symptoms

Vanishing gradients show up as early layers that barely change, so the network learns slowly or not at all in its depths. Exploding gradients show up as loss spikes, NaN values, and unstable training that diverges suddenly.

Modern remedies

The field developed several fixes that work together. Careful weight initialization keeps the scale near one at the start. Nonlinearities like ReLU avoid the saturating flats that crushed older sigmoid networks. Residual connections give gradients a shortcut path. Normalization layers keep activations in a healthy range, and gradient clipping caps explosions. Together these turned very deep networks from impossible to routine.

Key idea

Repeated multiplication makes gradients vanish or explode in deep nets, so we use initialization, residuals, normalization, and clipping to keep the signal stable.

Check yourself

Answer to earn rating on the learn ladder.

1. What causes vanishing gradients in deep networks?

2. Which remedy gives gradients a shortcut path through depth?

3. A symptom of exploding gradients is