Batch Norm in CNNs

Batch normalization standardizes the activations inside a network so training is faster and more stable. In convolutional networks it operates per channel across the batch.

What it does

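For each channel, batch norm computes the mean and variance over the batch and all spatial positions, then subtracts the mean and divides by the standard deviation (with a small constant added to the variance for numerical stability), so the channel has zero mean and unit variance. Two learned per-channel parameters, a scale and a shift, then let the layer produce whatever output range the network finds useful, including undoing the normalization entirely.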

Statistics are gathered per channel, not per pixel.
Each channel's statistics pool every spatial location of every image in the batch.
Learned scale and shift restore representational power.
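
The computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not any library's implementation: the function name batch_norm_train, the nested-list layout x[n][c][h][w], the names gamma (scale) and beta (shift), and the default eps value are all choices made for this example.

```python
import math

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """Normalize x[n][c][h][w] per channel using the current batch's statistics.

    gamma and beta hold one learned scale and one learned shift per channel.
    Returns the normalized output plus the per-channel means and variances.
    """
    num_channels = len(x[0])
    # Deep copy of the input structure to hold the output.
    out = [[[list(row) for row in chan] for chan in img] for img in x]
    means, variances = [], []
    for c in range(num_channels):
        # Pool channel c over every image and every spatial position.
        values = [v for img in x for row in img[c] for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)  # biased variance
        inv_std = 1.0 / math.sqrt(var + eps)
        for img_out in out:
            for row in img_out[c]:
                for w, v in enumerate(row):
                    # Standardize, then apply the learned scale and shift.
                    row[w] = gamma[c] * (v - mean) * inv_std + beta[c]
        means.append(mean)
        variances.append(var)
    return out, means, variances

# Two images, two channels, 2x2 spatial size (made-up numbers).
batch = [
    [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [20.0, 20.0]]],
    [[[5.0, 6.0], [7.0, 8.0]], [[30.0, 30.0], [40.0, 40.0]]],
]
normalized, channel_means, channel_vars = batch_norm_train(
    batch, gamma=[1.0, 1.0], beta=[0.0, 0.0]
)
```

With gamma set to 1 and beta set to 0, each output channel has a mean of approximately zero and a variance of approximately one. Note that there is exactly one mean and one variance per channel, regardless of the image size.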

Why it helps

By keeping activations in a consistent range, batch norm lets you use higher learning rates and reduces sensitivity to weight initialization. Because the statistics vary slightly from one batch to the next, it also injects mild noise that acts as light regularization.

Train versus inference

During training, the layer normalizes with the statistics of the current batch. At inference, it uses running averages of the mean and variance collected during training, so even a single image produces stable, batch-independent outputs.
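
The two modes can be sketched as a small class. This is a hedged illustration rather than a reference implementation: the class name BatchNorm2dSketch, the x[n][c][h][w] nested-list layout, the exponential-moving-average update with momentum 0.1, and the initial running values are assumptions made for the example, and real libraries differ in details such as whether the running variance uses a bias correction.

```python
import math

class BatchNorm2dSketch:
    """Per-channel batch norm with separate training and inference behavior."""

    def __init__(self, num_channels, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        self.gamma = [1.0] * num_channels         # learned scale (fixed here)
        self.beta = [0.0] * num_channels          # learned shift (fixed here)
        self.running_mean = [0.0] * num_channels  # used only at inference
        self.running_var = [1.0] * num_channels

    def forward(self, x, training):
        out = [[[list(row) for row in chan] for chan in img] for img in x]
        for c in range(len(self.gamma)):
            if training:
                # Statistics of the current batch, pooled over images and space.
                values = [v for img in x for row in img[c] for v in row]
                mean = sum(values) / len(values)
                var = sum((v - mean) ** 2 for v in values) / len(values)
                # Fold them into the running averages for later inference.
                m = self.momentum
                self.running_mean[c] = (1 - m) * self.running_mean[c] + m * mean
                self.running_var[c] = (1 - m) * self.running_var[c] + m * var
            else:
                # Inference: fixed statistics, independent of the batch contents.
                mean, var = self.running_mean[c], self.running_var[c]
            scale = self.gamma[c] / math.sqrt(var + self.eps)
            for img_out in out:
                for row in img_out[c]:
                    for w, v in enumerate(row):
                        row[w] = scale * (v - mean) + self.beta[c]
        return out

bn = BatchNorm2dSketch(num_channels=1)
batch = [[[[1.0, 2.0], [3.0, 4.0]]], [[[5.0, 6.0], [7.0, 8.0]]]]
for _ in range(50):
    bn.forward(batch, training=True)   # running averages settle near batch stats

alone = bn.forward([batch[0]], training=False)
together = bn.forward(batch, training=False)
```

In inference mode the first image gets the same output whether it is processed alone or alongside the second image, which is the batch independence described above. In training mode the two results would differ, because the batch statistics change with the batch contents.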

Batch norm is usually placed right after a convolution and before the activation function, though variants exist.

Key idea

Batch norm normalizes activations per channel across the batch, enabling higher learning rates and using running averages at inference.
