Machine Learning

Xavier And He Initialization

Why the starting weights of a deep network decide whether it trains at all.

Why starting weights matter

Before training, every weight in a neural network gets an initial value. A poor choice can make signals vanish to zero or explode to huge values as they pass through layers, so the network never learns. Good initialization keeps the scale of activations and gradients roughly stable from the first layer to the last.

Two failure modes

  • If weights are too small, each layer shrinks the signal until deep layers see almost nothing
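  • If weights are too large, the signal grows exponentially with depth and gradients blow up
  • Setting all weights equal is also fatal: every neuron in a layer computes the same output and receives the same gradient, so they all learn the same thing. This is the symmetry problem, and random initialization is what breaks the symmetry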
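
A minimal sketch of these failure modes in plain Python, assuming a stack of purely linear layers with zero-mean Gaussian weights. The helper names `forward_rms` and `hidden_outputs`, the width of 64, and the depth of 20 are illustrative choices, not taken from the lesson.

```python
import math
import random


def forward_rms(weight_std, width=64, depth=20, seed=0):
    """Pass a random vector through `depth` linear layers and return its final RMS."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        # Each output unit is a weighted sum of all inputs, with fresh Gaussian weights.
        x = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
    return math.sqrt(sum(v * v for v in x) / width)


def hidden_outputs(weights, x):
    """tanh outputs of one layer; `weights` has one row per hidden unit."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


width = 64
balanced_std = 1.0 / math.sqrt(width)  # keeps the signal scale roughly constant per layer

too_small = forward_rms(0.1 * balanced_std)   # shrinks by roughly 10x per layer
stable = forward_rms(balanced_std)
too_large = forward_rms(10.0 * balanced_std)  # grows by roughly 10x per layer
print(f"too small: {too_small:.3e}  balanced: {stable:.3e}  too large: {too_large:.3e}")

# Equal weights: every hidden unit computes exactly the same value.
equal_weights = [[0.5, 0.5, 0.5] for _ in range(4)]
print(hidden_outputs(equal_weights, [1.0, -2.0, 0.5]))
```

Under these assumptions the first value collapses toward zero, the third becomes astronomically large, and the four hidden units in the last line are indistinguishable from one another.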

Principled schemes

Modern methods choose the random scale based on the layer size:

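  • Xavier (Glorot) initialization suits symmetric activations like tanh. It balances the number of inputs (fan-in) and outputs (fan-out) of the layer, using a weight variance of 2 / (fan-in + fan-out)
  • He initialization is designed for ReLU, which zeroes roughly half of its inputs. It compensates by doubling the per-input variance to 2 / fan-in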
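
The two schemes can be sketched in plain Python as follows. This assumes the zero-mean Gaussian variants with variance 2 / (fan-in + fan-out) for Xavier and 2 / fan-in for He; uniform variants also exist. The function names `xavier_std`, `he_std`, and `init_weights` are hypothetical helpers, not a library API.

```python
import math
import random


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (Glorot) normal init: variance 2 / (fan_in + fan_out)."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation for He normal init: variance 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in, fan_out, std, rng):
    """Return a fan_out x fan_in matrix of zero-mean Gaussian weights."""
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


rng = random.Random(0)
fan_in, fan_out = 256, 128
w_tanh = init_weights(fan_in, fan_out, xavier_std(fan_in, fan_out), rng)  # for a tanh layer
w_relu = init_weights(fan_in, fan_out, he_std(fan_in), rng)               # for a ReLU layer
print(xavier_std(fan_in, fan_out), he_std(fan_in))
```

For any layer shape the He scale is at least as large as the Xavier scale, and for a square layer (fan-in equal to fan-out) it is larger by a factor of the square root of 2.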

Key idea

Initialization sets the variance of weights so activations and gradients stay stable through depth, with Xavier and He scaling chosen to match the activation function.
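
To see why the scale should match the activation, the sketch below pushes a random vector through a stack of ReLU layers twice, once with Xavier-scaled weights and once with He-scaled weights. It is an illustration under assumed settings (square layers of width 64, depth 20, Gaussian weights, no biases), and `relu_forward_rms` is a hypothetical helper.

```python
import math
import random


def relu_forward_rms(weight_std, width=64, depth=20, seed=0):
    """RMS of the signal after `depth` ReLU layers with Gaussian weights of the given std."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    for _ in range(depth):
        pre = [sum(rng.gauss(0.0, weight_std) * xi for xi in x) for _ in range(width)]
        x = [max(0.0, p) for p in pre]  # ReLU zeroes the negative pre-activations
    return math.sqrt(sum(v * v for v in x) / width)


width = 64
xavier_std = math.sqrt(2.0 / (width + width))  # Xavier scale for a square layer
he_std = math.sqrt(2.0 / width)                # He scale

xavier_rms = relu_forward_rms(xavier_std)
he_rms = relu_forward_rms(he_std)
print(f"Xavier + ReLU: {xavier_rms:.3e}   He + ReLU: {he_rms:.3e}")
```

In this setup the Xavier-scaled ReLU stack loses about a factor of the square root of 2 in signal scale per layer, which compounds to roughly a thousandfold shrinkage over 20 layers, while the He-scaled stack stays within the same order of magnitude as its input.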

Check yourself

1. Why is setting every weight to the same value harmful?

2. He initialization is designed mainly for which activation?