Half the bits, more speed
Mixed precision runs most operations in a lower precision format such as sixteen bit floats while keeping a few sensitive parts in higher precision. Lower precision halves memory traffic and runs faster on modern accelerators.
- Forward and backward math runs in low precision.
- A master copy of the weights is kept in higher precision.
- Some reductions stay high precision for accuracy.
The stability tricks
Low precision has a narrow range, so small gradients can underflow to zero. Loss scaling multiplies the loss by a large factor before the backward pass, lifting tiny gradients into the representable range, then divides it back out before the update.
- Loss scaling prevents gradient underflow.
- The high precision master weights avoid update drift.
- Dynamic scaling adjusts the factor automatically.
The precision path
The combination gives most of the speed of low precision with the stability of high precision.
Key idea
Mixed precision runs math in low precision for speed while keeping high precision master weights and loss scaling to preserve numerical stability.