The Mixed Precision Training

Half the bits, more speed

Mixed precision runs most operations in a lower precision format such as sixteen bit floats while keeping a few sensitive parts in higher precision. Lower precision halves memory traffic and runs faster on modern accelerators.

Forward and backward math runs in low precision.
A master copy of the weights is kept in higher precision.
Some reductions stay high precision for accuracy.

The stability tricks

Low precision has a narrow range, so small gradients can underflow to zero. Loss scaling multiplies the loss by a large factor before the backward pass, lifting tiny gradients into the representable range, then divides it back out before the update.

Loss scaling prevents gradient underflow.
The high precision master weights avoid update drift.
Dynamic scaling adjusts the factor automatically.

The precision path

The combination gives most of the speed of low precision with the stability of high precision.

Key idea

Mixed precision runs math in low precision for speed while keeping high precision master weights and loss scaling to preserve numerical stability.

The Mixed Precision Training

Half the bits, more speed

The stability tricks

The precision path

Key idea

Check yourself