Why the start matters
A deep network passes signals through many layers, multiplying by a weight matrix at each one. If weights start too large, activations explode; if they start too small, activations vanish toward zero. Either way, gradients become useless and training stalls before it begins.
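The effect is easy to reproduce. The sketch below is a minimal illustration, not a benchmark: it assumes purely linear layers (no activation function), a width of 64, a depth of 20, and Gaussian weights. The function name `final_activation_std` and all parameter values are chosen here for illustration only.

```python
import math
import random

def final_activation_std(weight_std, depth=20, width=64, seed=0):
    """Push one random input through `depth` linear layers and return
    the standard deviation of the final layer's outputs."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Fresh weight matrix per layer, entries drawn from N(0, weight_std^2).
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        x = [sum(wij * xj for wij, xj in zip(row, x)) for row in w]
    mean = sum(x) / width
    return math.sqrt(sum((v - mean) ** 2 for v in x) / width)

too_small = final_activation_std(0.01)               # shrinks at every layer
too_large = final_activation_std(1.0)                # grows at every layer
scaled = final_activation_std(1.0 / math.sqrt(64))   # std tied to layer width

print(f"too small: {too_small:.3e}")
print(f"too large: {too_large:.3e}")
print(f"scaled:    {scaled:.3e}")
```

Under these assumptions each layer multiplies the signal's scale by roughly `weight_std * sqrt(width)`, so only the third setting keeps the output on the same order of magnitude as the input.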
Keeping variance steady
The goal is to keep the variance of activations and gradients roughly constant from layer to layer. Two famous schemes do this by scaling random weights to the layer width.
- Xavier, also called Glorot, sets the weight variance to 2 / (fan_in + fan_out), the inverse of the average of a layer's input and output counts. It suits symmetric activations like tanh.
- He scales by the number of inputs only, using variance 2 / fan_in. The factor of 2 accounts for the fact that ReLU zeroes half its inputs. It is the usual choice for ReLU networks.
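Both schemes fit in a few lines. The following is a sketch using only the Python standard library and the normal-distribution variants of each scheme; the function names `xavier_normal`, `he_normal`, and `variance`, and the layer sizes, are illustrative choices rather than any framework's API.

```python
import math
import random

def xavier_normal(fan_in, fan_out, rng):
    """Xavier/Glorot: Var(w) = 2 / (fan_in + fan_out)."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

def he_normal(fan_in, fan_out, rng):
    """He: Var(w) = 2 / fan_in (fan_out is used only for the shape)."""
    std = math.sqrt(2.0 / fan_in)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

def variance(matrix):
    """Empirical variance over all entries of a weight matrix."""
    values = [v for row in matrix for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
xavier_var = variance(xavier_normal(256, 128, rng))  # target: 2 / 384
he_var = variance(he_normal(256, 128, rng))          # target: 2 / 256

print(f"Xavier variance: {xavier_var:.5f} (target {2 / 384:.5f})")
print(f"He variance:     {he_var:.5f} (target {2 / 256:.5f})")
```

Frameworks typically offer uniform-distribution variants as well; those draw from a different distribution but aim for the same variance.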
Choosing a scheme
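The pairing described above (Xavier for tanh, He for ReLU) can be captured as a small lookup keyed on the activation. This is a hypothetical helper for illustration: the names `STD_BY_ACTIVATION` and `init_std` are invented here, and only the two activations the text discusses are covered.

```python
import math

# Hypothetical table: activation name -> weight standard deviation.
STD_BY_ACTIVATION = {
    # Xavier/Glorot for symmetric activations such as tanh.
    "tanh": lambda fan_in, fan_out: math.sqrt(2.0 / (fan_in + fan_out)),
    # He for ReLU, which depends on fan_in only.
    "relu": lambda fan_in, fan_out: math.sqrt(2.0 / fan_in),
}

def init_std(activation, fan_in, fan_out):
    """Return the weight std suggested for a layer with this activation."""
    try:
        rule = STD_BY_ACTIVATION[activation]
    except KeyError:
        raise ValueError(f"no scheme registered for {activation!r}") from None
    return rule(fan_in, fan_out)

print(init_std("tanh", 100, 100))
print(init_std("relu", 200, 100))
```

Raising on an unknown activation, rather than silently falling back to one scheme, keeps a mismatch between scheme and activation visible.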
Practical notes
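- Never initialize all weights to zero, or every neuron in a layer receives the same gradient and learns the same thing.
- Biases usually start at zero, which is safe because the random weights already break the symmetry between neurons.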
- Modern frameworks pick a reasonable default, but matching the scheme to the activation still speeds convergence.
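The zero-weight problem in the first note can be shown on a toy network. The sketch below assumes a network with 2 inputs, 3 sigmoid hidden units, and one linear output, trained with plain SGD on squared error. The data, learning rate, and the function name `train_zero_init` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_zero_init(steps=50, lr=0.1, hidden=3):
    """Train a tiny network whose weights all start at zero."""
    data = [((0.0, 1.0), 1.0), ((1.0, 0.0), 0.0), ((1.0, 1.0), 1.0)]
    w1 = [[0.0, 0.0] for _ in range(hidden)]  # hidden-layer weights
    w2 = [0.0] * hidden                        # output-layer weights
    for _ in range(steps):
        for x, target in data:
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w1]
            y = sum(w * hj for w, hj in zip(w2, h))
            err = y - target  # d/dy of 0.5 * (y - target)^2
            for j in range(hidden):
                # Backpropagate through the output weight and the sigmoid,
                # using the output weight from before this step's update.
                grad_hidden = err * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * err * h[j]
                for i in range(2):
                    w1[j][i] -= lr * grad_hidden * x[i]
    return w1, w2

w1, w2 = train_zero_init()
for row in w1:
    print(row)  # every hidden unit ends with the same weights
```

The weights do move away from zero, but every hidden unit sees the same inputs and the same gradient at every step, so the rows of `w1` never diverge from one another. The layer behaves like a single neuron copied three times.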
A few lines of correct initialization can save many epochs of slow or failed training.
Key idea
Initialize weights so activation and gradient variance stay roughly constant across layers. Use Xavier for tanh and He for ReLU, and never start all weights at zero.