The Weight Initialization Deep Dive

Why the starting scale of weights decides whether a deep network learns or stalls.

Why the start matters

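A deep network passes its signals through many layers, and each layer multiplies them by a weight matrix. If the weights start too large, activations grow at every layer and explode; if they start too small, activations shrink at every layer and vanish toward zero. Gradients follow the same pattern on the way back, so either way they carry no usable signal and training stalls before it begins.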
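The effect is easy to reproduce. The sketch below is an illustration rather than a benchmark: it assumes a stack of purely linear layers with Gaussian weights, and the function name `final_rms`, the width of 64, and the depth of 20 are arbitrary choices made for this example.

```python
import math
import random

def final_rms(weight_std, width=64, depth=20, seed=0):
    """Push a random vector through `depth` linear layers and return its RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Fresh Gaussian weight matrix with the chosen standard deviation.
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        # Plain matrix-vector product, no activation function.
        x = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]
    return math.sqrt(sum(v * v for v in x) / width)

too_large = final_rms(1.0)                   # grows by roughly sqrt(64) = 8x per layer
too_small = final_rms(0.01)                  # shrinks by roughly 0.08x per layer
balanced = final_rms(1.0 / math.sqrt(64))    # stays near its starting scale

print(f"std=1.0    -> {too_large:.3e}")
print(f"std=0.01   -> {too_small:.3e}")
print(f"std=1/8    -> {balanced:.3e}")
```

Only the third setting, where the standard deviation is tied to the layer width, leaves the signal at a usable magnitude after 20 layers.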

Keeping variance steady

The goal is to keep the variance of activations and gradients roughly constant from layer to layer. Two famous schemes do this by drawing random weights whose variance shrinks in proportion to the layer width.
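A short derivation shows where the dependence on width comes from. It assumes, as a simplification, that the weights and inputs of a layer are independent and zero-mean, and it ignores the activation function. The symbols $n_{\mathrm{in}}$ and $n_{\mathrm{out}}$ (the number of inputs and outputs of the layer) are labels introduced here for illustration. For a single pre-activation $y = \sum_{i=1}^{n_{\mathrm{in}}} w_i x_i$:

```latex
\begin{aligned}
\operatorname{Var}(y) &= n_{\mathrm{in}}\,\operatorname{Var}(w)\,\operatorname{Var}(x), \\
n_{\mathrm{in}}\,\operatorname{Var}(w) &= 1 \quad \text{(activation variance preserved in the forward pass)}, \\
n_{\mathrm{out}}\,\operatorname{Var}(w) &= 1 \quad \text{(gradient variance preserved in the backward pass)}.
\end{aligned}
```

Both conditions hold together only when the two widths are equal, so a practical scheme has to compromise between them or pick one.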

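Xavier, also called Glorot, sets the weight variance to 2 / (fan_in + fan_out), the inverse of the average of a layer's input and output counts. It suits symmetric activations like tanh.
He sets the weight variance to 2 / fan_in, using the number of inputs only; the factor of 2 accounts for the fact that ReLU zeroes half its inputs. It is the standard choice for ReLU networks.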
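As a rough sketch, the two scales can be written as one-line functions and compared on a ReLU stack. This uses the normal-distribution variants of both schemes (uniform variants also exist), and the helper names `xavier_std`, `he_std`, and `relu_stack_rms`, along with the width and depth, are hypothetical choices for this example.

```python
import math
import random

def xavier_std(fan_in, fan_out):
    """Glorot/Xavier normal: Var(w) = 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def he_std(fan_in):
    """He normal: Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def relu_stack_rms(weight_std, width=64, depth=20, seed=0):
    """RMS activation after `depth` ReLU layers with Gaussian weights."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        w = [[rng.gauss(0.0, weight_std) for _ in range(width)]
             for _ in range(width)]
        # Linear layer followed by ReLU, which zeroes negative pre-activations.
        x = [max(0.0, sum(w_ij * x_j for w_ij, x_j in zip(row, x)))
             for row in w]
    return math.sqrt(sum(v * v for v in x) / width)

he_rms = relu_stack_rms(he_std(64))
xavier_rms = relu_stack_rms(xavier_std(64, 64))

print(f"He     -> {he_rms:.3e}")
print(f"Xavier -> {xavier_rms:.3e}")
```

With equal input and output widths, the Xavier variance is half the He variance, so under ReLU the signal shrinks by roughly a factor of sqrt(2) per layer; over 20 layers that compounds to about three orders of magnitude.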

Choosing a scheme

Practical notes

Never initialize all weights to zero (or to any single constant): every neuron in a layer then receives the same gradient and learns the same thing.
Biases usually start at zero, which is safe.
Modern frameworks pick a reasonable default, but matching the scheme to the activation still speeds convergence.
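The symmetry problem with zero weights can be checked by hand on a tiny network. The sketch below assumes one tanh hidden layer, a linear output, a squared-error loss, and no biases; the function name `hidden_weight_grads` and all the numbers are made up for illustration.

```python
import math
import random

def hidden_weight_grads(w1, w2, x, target):
    """Gradient of 0.5 * (y - target)**2 w.r.t. each hidden unit's weights."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in w1]
    y = sum(w * hj for w, hj in zip(w2, h))
    err = y - target
    # Chain rule: dL/dW1[j][i] = err * w2[j] * (1 - h[j]**2) * x[i]
    return [[err * w2_j * (1.0 - hj * hj) * xi for xi in x]
            for w2_j, hj in zip(w2, h)]

x, target = [0.5, -1.0, 2.0], 1.0
hidden = 4

# All-zero start: every hidden unit gets the same (here, zero) gradient.
zero_grads = hidden_weight_grads(
    [[0.0] * 3 for _ in range(hidden)], [0.0] * hidden, x, target)

# Constant start: gradients are nonzero but identical across hidden units.
const_grads = hidden_weight_grads(
    [[0.1] * 3 for _ in range(hidden)], [0.1] * hidden, x, target)

# Random start: each hidden unit gets its own gradient.
rng = random.Random(0)
rand_w1 = [[rng.gauss(0.0, 0.5) for _ in range(3)] for _ in range(hidden)]
rand_w2 = [rng.gauss(0.0, 0.5) for _ in range(hidden)]
rand_grads = hidden_weight_grads(rand_w1, rand_w2, x, target)

print("zero  :", zero_grads[0], zero_grads[1])
print("const :", const_grads[0], const_grads[1])
print("random:", rand_grads[0], rand_grads[1])
```

Because identical units receive identical updates, they stay identical after every step, and the layer behaves as if it had a single neuron. Random initialization is what breaks the tie.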

A few lines of correct initialization can save many epochs of slow or failed training.

Key idea