Xavier and He Initialization

Why starting weights matter

Before training, every weight in a neural network gets an initial value. A poor choice can make signals vanish to zero or explode to huge values as they pass through layers, so the network never learns. Good initialization keeps the scale of activations and gradients roughly stable from the first layer to the last.

Two failure modes

If weights are too small, each layer shrinks the signal until deep layers see almost nothing
If weights are too large, the signal grows with every layer, saturating activations such as tanh or overflowing, and gradients blow up
Setting all weights equal is also fatal because every neuron in a layer then computes the same output, receives the same update, and learns the same thing; random values are needed to break this symmetry
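
The first two failure modes can be reproduced in a few lines. The sketch below is illustrative only: it assumes a stack of purely linear layers (no activation function) so that the effect of weight scale is isolated, and the helper name `final_std`, the width of 256, and the depth of 30 are arbitrary choices, not part of any library.

```python
import numpy as np

def final_std(weight_std, depth=30, width=256, seed=0):
    """Pass a random batch through `depth` linear layers whose weights are
    drawn from N(0, weight_std**2) and return the standard deviation of
    the values coming out of the last layer."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((128, width))  # batch of 128 random inputs
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * weight_std
        x = x @ w  # linear layer only (an assumption made for clarity)
    return float(x.std())

too_small = final_std(0.01)                # each layer shrinks the signal
too_large = final_std(1.0)                 # each layer amplifies the signal
balanced = final_std(1.0 / np.sqrt(256))   # variance 1/width keeps the scale

print(f"too small: {too_small:.3e}")
print(f"too large: {too_large:.3e}")
print(f"balanced:  {balanced:.3e}")
```

With a weight standard deviation of 0.01 each layer multiplies the signal scale by roughly 0.01 × √256 = 0.16, and with 1.0 by roughly 16, so thirty layers are enough to drive the output toward zero or toward very large values. Only the scale tied to the layer width keeps the output near its starting size.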

Principled schemes

Modern methods choose the random scale based on the layer size:

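Xavier (Glorot) initialization suits symmetric activations like tanh by balancing input and output counts: the weight variance is 2 / (fan_in + fan_out), where fan_in and fan_out are the number of inputs and outputs of the layer
He (Kaiming) initialization scales for ReLU, which zeroes its negative inputs (about half of them), by using a larger variance of 2 / fan_in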
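
As a minimal sketch of the two schemes, the NumPy functions below draw weights from a zero-mean normal distribution with the standard variances, 2 / (fan_in + fan_out) for Xavier and 2 / fan_in for He. The function names `xavier_normal` and `he_normal` and the layer sizes are chosen here for illustration; deep learning frameworks ship their own versions, which also include uniform variants.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Xavier (Glorot) normal init: Var(w) = 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_in, fan_out)) * std

def he_normal(fan_in, fan_out, rng):
    """He (Kaiming) normal init: Var(w) = 2 / fan_in, sized for ReLU."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std

rng = np.random.default_rng(0)
w_xavier = xavier_normal(512, 256, rng)  # layer with 512 inputs, 256 outputs
w_he = he_normal(512, 256, rng)

# The empirical variances should land close to the target values.
print(w_xavier.var(), 2.0 / (512 + 256))
print(w_he.var(), 2.0 / 512)
```

Both functions depend only on the layer's shape, which is what "choosing the random scale based on the layer size" means in practice.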

Key idea

Initialization sets the variance of weights so activations and gradients stay stable through depth, with Xavier and He scaling chosen to match the activation function.
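
To see why the scheme has to match the activation, the following sketch pushes a random batch through a deep ReLU stack twice, once with He variance and once with Xavier variance. It is an illustration under stated assumptions: square layers of width 256, depth 30, no biases, and a hypothetical helper `relu_depth_rms` that reports the root-mean-square activation after the last layer.

```python
import numpy as np

def relu_depth_rms(weight_var, depth=30, width=256, seed=0):
    """Run a random batch through `depth` ReLU layers whose weights have
    variance `weight_var`; return the RMS of the final activations."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((128, width))
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * np.sqrt(weight_var)
        x = np.maximum(x @ w, 0.0)  # ReLU zeroes the negative half
    return float(np.sqrt(np.mean(x ** 2)))

width = 256
he_rms = relu_depth_rms(2.0 / width)                # He: 2 / fan_in
xavier_rms = relu_depth_rms(2.0 / (width + width))  # Xavier: 2 / (fan_in + fan_out)

print(f"He with ReLU:     {he_rms:.3e}")
print(f"Xavier with ReLU: {xavier_rms:.3e}")
```

Because ReLU discards about half of the signal's energy at each layer, the Xavier variance loses roughly a factor of two in mean-square activation per layer here, while the extra factor of two in the He variance compensates for it.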
