Machine Learning

The Vanishing Gradient Problem

When gradients shrink to nothing in deep networks.

Gradients that fade away

In deep networks the gradient must travel backward through many layers. Because the chain rule multiplies many factors together, the gradient can shrink toward zero by the time it reaches the early layers. This is the vanishing gradient problem.
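As a rough sketch of that multiplication, assume a chain of n scalar layers with activations h_1, ..., h_n and a loss L (these symbols are illustrative and not part of the lesson). The chain rule then gives:

```latex
\frac{\partial L}{\partial h_1}
  = \frac{\partial L}{\partial h_n}
    \prod_{k=1}^{n-1} \frac{\partial h_{k+1}}{\partial h_k}

% If every factor satisfies |\partial h_{k+1} / \partial h_k| \le c for some c < 1, then
\left| \frac{\partial L}{\partial h_1} \right|
  \le c^{\,n-1} \left| \frac{\partial L}{\partial h_n} \right|
```

Under that assumption the bound shrinks geometrically with depth, which is the fading described above.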

Why it happens

  • Squashing activations like sigmoid have small derivatives: the sigmoid's derivative never exceeds 0.25.
  • Multiplying many small derivatives drives the product toward zero.
  • Early layers then receive almost no learning signal.
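The shrinkage can be checked numerically. The sketch below is a simplification: it assumes every layer is a single sigmoid unit with weight one, evaluated at the same pre-activation, so the gradient factor is just the sigmoid derivative raised to the depth. The function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def gradient_scale(depth, pre_activation=0.0):
    """Product of `depth` identical sigmoid derivatives.

    Assumption: weights are one and every layer sees the same input,
    so this is a best case (the derivative peaks at input 0).
    """
    return sigmoid_derivative(pre_activation) ** depth

# Print how the factor reaching the first layer decays with depth.
for depth in (1, 5, 10, 20):
    print(depth, gradient_scale(depth))
```

Even in this best case the factor falls below one in a million by ten layers, since each layer contributes at most 0.25.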

The result is that the first layers barely update, so the network trains slowly or never learns useful low-level features.

Common remedies

  • ReLU activations have a derivative of exactly one for positive inputs (and zero for negative inputs), so active units pass the gradient through without shrinking it.
  • Residual connections add shortcut paths that let gradients skip layers.
  • Careful initialization keeps the signal scale roughly constant across layers.
  • Normalization layers stabilize the activations that gradients depend on.
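The first two remedies can be compared on a toy model. This is a minimal sketch, assuming a chain of scalar layers with a fixed weight `w`. The helper `input_gradient` is hypothetical and accumulates the derivative of the output with respect to the input, which for a scalar chain equals what backprop would deliver to the first layer. Initialization and normalization are not modeled here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0

def input_gradient(act, d_act, depth, residual=False, w=1.0, x=1.0):
    """Return d(output)/d(input) for a scalar chain of `depth` layers.

    Plain layer:    h_next = act(w * h)
    Residual layer: h_next = h + act(w * h)
    """
    h, grad = x, 1.0
    for _ in range(depth):
        z = w * h
        local = w * d_act(z)      # derivative of act(w * h) with respect to h
        if residual:
            local += 1.0          # the shortcut adds an identity term
            h = h + act(z)
        else:
            h = act(z)
        grad *= local             # chain rule: multiply the local derivatives
    return grad

print("sigmoid, plain:   ", input_gradient(sigmoid, d_sigmoid, 20))
print("relu, plain:      ", input_gradient(relu, d_relu, 20))
print("sigmoid, residual:", input_gradient(sigmoid, d_sigmoid, 20, residual=True))
```

In this toy setting the plain sigmoid chain drives the gradient toward zero, the ReLU chain keeps it at one because every unit stays active, and the residual chain never drops below one because each layer's factor is one plus a non-negative term. Real networks have weight matrices and inactive ReLU units, so the numbers are illustrative only.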

Together, these fixes are what made very deep networks practical to train.

Key idea

Vanishing gradients arise when many small derivatives multiply during backprop, starving early layers of signal. ReLU, residual connections, good initialization, and normalization help restore it.

Check yourself

1. Why do gradients vanish in deep networks?

2. Which helps fight vanishing gradients?