Machine Learning

The Weight Initialization Deep

Why the starting scale of weights decides whether a deep network learns or stalls.

Why the start matters

A deep network multiplies signals through many layers. If weights start too large, activations explode; too small, and they vanish toward zero. Either way gradients become useless and training stalls before it begins.
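
The effect is easy to reproduce. The following is a minimal sketch, not taken from the lesson: it assumes a stack of plain linear layers with no activation function, and the function name, layer width, depth, and the three weight scales are illustrative choices.

```python
import numpy as np

def final_activation_std(weight_std, depth=20, width=256, seed=0):
    """Push a random batch through `depth` linear layers; return the output std."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, width))          # batch of unit-variance inputs
    for _ in range(depth):
        w = rng.standard_normal((width, width)) * weight_std
        x = x @ w                                 # each layer rescales the signal
    return float(x.std())

# Hypothetical scales chosen only to illustrate the three regimes.
too_small = final_activation_std(0.01)                 # signal shrinks toward zero
balanced = final_activation_std(1.0 / np.sqrt(256))    # signal stays near its input scale
too_large = final_activation_std(0.2)                  # signal grows by orders of magnitude

print(f"too small: {too_small:.3e}")
print(f"balanced:  {balanced:.3e}")
print(f"too large: {too_large:.3e}")
```

Each layer multiplies the standard deviation by roughly `weight_std * sqrt(width)`, so any factor away from 1 compounds with depth.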

Keeping variance steady

The goal is to keep the variance of activations and gradients roughly constant from layer to layer. Two famous schemes do this by scaling the variance of the random weights according to the layer width.
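
Where the dependence on layer width comes from can be seen in a short derivation. This is a sketch under simplifying assumptions that the lesson does not state: a linear layer y_j = sum_i w_ij x_i with n_in inputs, and weights and inputs that are independent and zero-mean. The symbols n_in, w, x, and y are introduced here for illustration.

```latex
\begin{aligned}
\operatorname{Var}(y_j)
  &= \sum_{i=1}^{n_{\text{in}}} \operatorname{Var}(w_{ij}\, x_i)
   = n_{\text{in}} \, \operatorname{Var}(w) \, \operatorname{Var}(x) \\
\operatorname{Var}(y_j) = \operatorname{Var}(x)
  &\iff \operatorname{Var}(w) = \frac{1}{n_{\text{in}}}
\end{aligned}
```

The same argument applied to gradients flowing backward gives the condition Var(w) = 1 / n_out, with n_out the number of outputs, so the two directions generally ask for different scales.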

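  • Xavier, also called Glorot, sets the weight variance to 2 / (fan_in + fan_out), the reciprocal of the average of a layer's input and output counts. It suits symmetric activations like tanh.
  • He sets the weight variance to 2 / fan_in, using the number of inputs only. The factor of 2 accounts for the fact that ReLU zeroes roughly half its inputs. It is the standard choice for ReLU networks.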
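
Both schemes fit in a few lines. The sketch below assumes the normal-distribution variants (uniform variants with the same variance also exist) and a weight matrix of shape (fan_in, fan_out). The function names and layer sizes are illustrative, not a specific library's API.

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    """Xavier/Glorot: variance 2 / (fan_in + fan_out)."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.standard_normal((fan_in, fan_out)) * std

def he_normal(fan_in, fan_out, rng):
    """He: variance 2 / fan_in, intended for ReLU layers."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std

rng = np.random.default_rng(0)
fan_in, fan_out = 512, 256            # hypothetical layer sizes

w_xavier = xavier_normal(fan_in, fan_out, rng)
w_he = he_normal(fan_in, fan_out, rng)

# Compare the empirical spread of the samples with the target for each scheme.
print("Xavier std:", w_xavier.std(), "target:", np.sqrt(2.0 / (fan_in + fan_out)))
print("He std:    ", w_he.std(), "target:", np.sqrt(2.0 / fan_in))
```

For the same layer, He always produces weights at least as large as Xavier, since it ignores fan_out and compensates for the half of the signal that ReLU discards.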

Choosing a scheme

Practical notes

  • Never initialize all weights to zero, or every neuron in a layer receives the same gradient and learns the same thing.
  • Biases usually start at zero, which is safe.
  • Modern frameworks pick a reasonable default, but matching the scheme to the activation still speeds convergence.
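
The zero-initialization problem can be checked directly. The sketch below assumes a tiny two-layer tanh network trained with mean squared error on random data; the network, sizes, and function names are hypothetical and chosen only to make the symmetry visible.

```python
import numpy as np

def grad_w1(w1, w2, x, y):
    """Gradient of mean squared error w.r.t. the first-layer weights."""
    h = np.tanh(x @ w1)                  # hidden activations, shape (batch, hidden)
    pred = h @ w2                        # predictions, shape (batch, 1)
    d_pred = 2.0 * (pred - y) / len(x)   # derivative of the mean squared error
    d_h = d_pred @ w2.T                  # backpropagate into the hidden layer
    d_pre = d_h * (1.0 - h ** 2)         # tanh derivative
    return x.T @ d_pre                   # shape (inputs, hidden), one column per neuron

def columns_identical(g):
    """True when every hidden neuron receives the same gradient."""
    return bool(np.allclose(g, g[:, :1]))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))
y = rng.standard_normal((32, 1))

zero_grad = grad_w1(np.zeros((4, 8)), np.zeros((8, 1)), x, y)
rand_grad = grad_w1(rng.standard_normal((4, 8)) * 0.5,
                    rng.standard_normal((8, 1)) * 0.5, x, y)

zero_is_symmetric = columns_identical(zero_grad)
rand_is_symmetric = columns_identical(rand_grad)
print("zero init, identical neuron gradients:  ", zero_is_symmetric)
print("random init, identical neuron gradients:", rand_is_symmetric)
```

In this particular network the zero-initialized gradients are not just identical but exactly zero, so the first layer cannot move at all. Random initialization gives each neuron a different gradient, which is what breaks the symmetry.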

A few lines of correct initialization can save many epochs of slow or failed training.

Key idea

Initialize weights so activation and gradient variance stay roughly constant across layers. Use Xavier for tanh and He for ReLU, and never start all weights at zero.

Check yourself

1. Which initialization is the standard choice for a network of ReLU layers?

2. Why is initializing every weight to zero a mistake?