The Vanishing Gradient Problem

Gradients that fade away

In deep networks the gradient must travel backward through many layers. Because the chain rule multiplies many factors together, the gradient can shrink toward zero by the time it reaches the early layers. This is the vanishing gradient problem.

Why it happens

Squashing activations like sigmoid have small derivatives; the sigmoid's derivative never exceeds 0.25.
Multiplying many small derivatives drives the product toward zero.
Early layers then receive almost no learning signal.
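
To make the multiplication concrete, here is a minimal sketch in plain Python. It assumes a chain of scalar sigmoid layers with every weight fixed at 1, so only the activation derivatives contribute; the function names and the depth of 10 are illustrative choices, not taken from any particular library.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s), which peaks at 0.25 when z = 0.
    s = sigmoid(z)
    return s * (1.0 - s)

def gradient_scale(pre_activations):
    # Product of the local derivatives along the chain (weights assumed to be 1).
    scale = 1.0
    for z in pre_activations:
        scale *= sigmoid_grad(z)
    return scale

# Best case for sigmoid: every layer sits at z = 0, where the derivative is largest.
best_case = gradient_scale([0.0] * 10)
print(best_case)  # 0.25 ** 10, roughly 9.5e-07
```

Even in this most favourable setting, ten layers shrink the gradient by about a factor of a million; pre-activations away from zero make it worse.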

The result is that the first layers barely update, so the network trains slowly or never learns useful low-level features.

Common remedies

ReLU activations have a derivative of exactly one for positive inputs, so active units pass the gradient through without shrinking it.
Residual connections add shortcut paths that let gradients skip layers.
Careful initialization keeps the signal scale roughly constant across layers.
Normalization layers stabilize the activations that gradients depend on.
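
The effect of the first two remedies can be sketched with the same kind of scalar chain. This is an illustration under simplifying assumptions: all weights are 1, every layer sees the same hypothetical positive pre-activation `z`, and a residual block is modelled as y = x + f(x), whose local derivative is 1 + f'(x).

```python
import math

def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    # 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def chain_scale(local_grads):
    # Multiply the local derivatives, as the chain rule does during backprop.
    scale = 1.0
    for g in local_grads:
        scale *= g
    return scale

depth = 20
z = 1.0  # hypothetical positive pre-activation, shared by every layer

sigmoid_chain = chain_scale([sigmoid_grad(z)] * depth)         # shrinks toward zero
relu_chain = chain_scale([relu_grad(z)] * depth)               # stays at 1
residual_chain = chain_scale([1.0 + sigmoid_grad(z)] * depth)  # never drops below 1

print(sigmoid_chain, relu_chain, residual_chain)
```

In this toy setting the residual product actually grows with depth, which hints at why initialization and normalization are still needed to keep the overall scale in check.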

These fixes are a large part of why very deep networks became trainable at all.
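
The initialization remedy can be illustrated with a back-of-the-envelope variance calculation. The sketch below assumes purely linear layers with independent, zero-mean weights and inputs, so each layer multiplies the signal variance by fan_in * weight_std ** 2; the function name and the specific numbers are hypothetical.

```python
def signal_variance(depth, fan_in, weight_std, input_var=1.0):
    # Under the stated assumptions, each linear layer scales the variance
    # of its input by fan_in * weight_std ** 2.
    var = input_var
    for _ in range(depth):
        var *= fan_in * weight_std ** 2
    return var

fan_in = 256

# weight_std = 1 / sqrt(fan_in) keeps the per-layer factor at exactly 1.
scaled_init = signal_variance(depth=10, fan_in=fan_in, weight_std=1 / 16)

# A smaller, depth-unaware standard deviation shrinks the signal at every layer.
small_init = signal_variance(depth=10, fan_in=fan_in, weight_std=0.01)

print(scaled_init, small_init)
```

Choosing the weight scale from the layer width is the idea behind common schemes such as Xavier and He initialization, which add corrections for the activation function that this linear sketch ignores.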

Key idea

Vanishing gradients arise when many small derivatives multiply during backpropagation, starving early layers of learning signal. ReLU activations, residual connections, careful initialization, and normalization layers help restore it.
