Weight Initialization Strategies
Before the first gradient step, weights need values. Set them poorly and signals either die or explode immediately, so initialization is a quiet but decisive choice.
Bad starting points
- All zeros makes every neuron identical, so they learn the same thing forever.
- Weights too large saturate activations and explode gradients.
- Weights too small shrink the signal until learning stalls.
Principled schemes
The fix is to scale random weights by the layer width so variance stays stable across layers. Xavier or Glorot initialization targets tanh style activations by accounting for both input and output sizes. He initialization scales for ReLU networks, which discard half their inputs and so need slightly larger weights to preserve variance.
Why it matters
Good initialization places the network in a regime where activations and gradients have sensible magnitudes from step one. That lets the optimizer make progress instead of spending epochs recovering from a doomed start. Combined with normalization, modern schemes make very deep training stable from the first batch.
Key idea
Initialization scales random weights to the layer width so variance stays stable and training can start in a healthy regime.