Why starting weights matter
Before training, every weight in a neural network gets an initial value. A poor choice can make signals shrink toward zero or grow to huge values as they pass through the layers, so gradients carry little useful information and training stalls or diverges. Good initialization keeps the scale of activations and gradients roughly stable from the first layer to the last.
Three failure modes
- If weights are too small, each layer shrinks the signal, so deep layers see almost nothing and the gradients flowing back vanish as well
- If weights are too large, the signal grows exponentially with depth until activations saturate or overflow, and gradients blow up
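- Setting all weights to the same value is also fatal: every neuron in a layer computes the same output and receives the same gradient update, so they all learn the same thing. This is the symmetry problem, and random initialization exists to break that symmetry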
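The failure modes above can be seen in a small experiment. The following is a minimal sketch, assuming a toy stack of purely linear layers with Gaussian weights. The function names, the width of 32, the depth of 20, and the 0.1x and 10x scale factors are illustrative choices, not values from any particular library or paper.

```python
import math
import random


def rms(values):
    """Root-mean-square size of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def rms_after_linear_layers(depth, width, weight_std, seed=0):
    """Pass a random vector through `depth` random linear layers; return its RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output is a dot product of x with a fresh row of Gaussian weights.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
    return rms(x)


width, depth = 32, 20  # illustrative sizes
balanced_std = 1.0 / math.sqrt(width)  # keeps the variance roughly constant per layer

too_small = rms_after_linear_layers(depth, width, 0.1 * balanced_std)
balanced = rms_after_linear_layers(depth, width, balanced_std)
too_large = rms_after_linear_layers(depth, width, 10.0 * balanced_std)
print(f"too small: {too_small:.3e}  balanced: {balanced:.3e}  too large: {too_large:.3e}")

# Equal weights: every neuron computes the same sum, so all outputs are identical.
x = [0.5, -1.0, 2.0, 0.25]
equal_outputs = [sum(0.3 * xi for xi in x) for _ in range(3)]
print(equal_outputs)
```

Because each layer multiplies the signal size by about 0.1 or 10 in the two bad cases, twenty layers separate them from the balanced case by roughly twenty orders of magnitude in each direction. The last two lines show why equal weights fail: the three neurons produce the same number, so nothing distinguishes them.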
Principled schemes
Modern methods set the scale of the random weights from the layer size, that is, its number of inputs (fan-in) and outputs (fan-out):
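- Xavier (Glorot) initialization suits zero-centered activations like tanh. It sets the weight variance to 2 / (fan_in + fan_out), balancing the input and output counts so that both forward activations and backward gradients keep their scale
- He initialization targets ReLU, which zeroes negative inputs (about half of them when pre-activations are zero-centered). It compensates with a larger variance of 2 / fan_in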
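The two schemes can be compared directly on a ReLU stack. This is a minimal sketch under stated assumptions: it uses the normal-distribution variants, with Var(w) = 2 / (fan_in + fan_out) for Xavier and Var(w) = 2 / fan_in for He, and the helper names, width, and depth are hypothetical choices made for illustration.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Xavier/Glorot (normal variant): Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """He (normal variant): Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def rms_after_relu_layers(depth, width, weight_std, seed=0):
    """Pass a random vector through `depth` random ReLU layers; return its RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        pre = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
        x = [max(0.0, p) for p in pre]  # ReLU zeroes the negative pre-activations
    return math.sqrt(sum(v * v for v in x) / width)


width, depth = 64, 20  # illustrative sizes
he_rms = rms_after_relu_layers(depth, width, he_std(width))
xavier_rms = rms_after_relu_layers(depth, width, xavier_std(width, width))
print(f"He: {he_rms:.3e}  Xavier: {xavier_rms:.3e}")
```

For square layers, Xavier's variance is half of He's, so under ReLU the mean-square activation roughly halves at every layer and the signal fades with depth, while He's scaling keeps it near its starting size. Under tanh the situation differs, which is why the scheme is matched to the activation.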
Key idea
Initialization sets the variance of weights so activations and gradients stay stable through depth, with Xavier and He scaling chosen to match the activation function.