Mixed Precision Training

What it is

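Mixed precision training runs most operations in 16 bit floating point while keeping a few numerically sensitive parts in 32 bit. A 16 bit value takes half the bytes of a 32 bit one, so activation memory is roughly halved, and accelerators with dedicated tensor hardware run 16 bit matrix math faster, so each training step takes less time.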
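
As a rough illustration of the memory claim, the sketch below computes the size of a single activation tensor in both formats. The tensor shape and the helper name activation_bytes are made up for this example; real savings depend on which tensors a framework actually stores in 16 bit.

```python
import struct

# Bytes per element for IEEE half ("e") and single ("f") precision.
BYTES_FP16 = struct.calcsize("<e")
BYTES_FP32 = struct.calcsize("<f")

def activation_bytes(batch, seq_len, hidden, bytes_per_elem):
    """Memory for one (batch, seq_len, hidden) activation tensor."""
    return batch * seq_len * hidden * bytes_per_elem

# Hypothetical shape, chosen only for illustration.
shape = (32, 1024, 4096)
fp32_mib = activation_bytes(*shape, BYTES_FP32) / 2**20
fp16_mib = activation_bytes(*shape, BYTES_FP16) / 2**20
print(f"fp32: {fp32_mib:.0f} MiB, fp16: {fp16_mib:.0f} MiB")
# prints "fp32: 512 MiB, fp16: 256 MiB"
```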

Why not pure 16 bit

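Half precision (IEEE FP16) has a narrow range: the largest finite value is 65504, and the smallest positive value is about 6e-8. Gradient values below that range round to zero, an effect called underflow, and values above it overflow to infinity. Either one breaks training. Two techniques keep things stable.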

Master weights: the optimizer keeps a 32 bit copy of the weights and applies every update to that copy. An update much smaller than the weight itself would round away in 16 bit, but it accumulates in the 32 bit copy.
Loss scaling: multiply the loss by a large factor before the backward pass so small gradients land in the representable range, then divide the gradients by the same factor before the update.
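
Both techniques can be sketched with scalar arithmetic, using Python's struct module to round values to half and single precision. This is a toy illustration, not how a training framework implements either technique: the gradient value, the scale factor of 1024, the per-step update, and the helper names to_fp16 and to_fp32 are all assumptions made for the example.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

# Loss scaling
true_grad = 1e-8        # hypothetical gradient, below fp16's smallest value
loss_scale = 1024.0     # hypothetical scale factor (a power of two)

unscaled = to_fp16(true_grad)                 # underflows to 0.0
scaled = to_fp16(true_grad * loss_scale)      # survives as a small fp16 value
recovered = to_fp32(scaled / loss_scale)      # unscale in 32 bit before the update

# Master weights
update = 1e-4           # hypothetical per-step update, tiny next to the weight
steps = 1000

w16 = to_fp16(1.0)      # weight stored and updated only in fp16
master = to_fp32(1.0)   # 32 bit master copy held by the optimizer
for _ in range(steps):
    w16 = to_fp16(w16 + update)        # rounds back to 1.0 every step
    master = to_fp32(master + update)  # small updates accumulate
w16_from_master = to_fp16(master)      # 16 bit copy used for the forward pass

print("grad without scaling:", unscaled)
print("grad with scaling, then unscaled:", recovered)
print("fp16-only weight after training:", w16)
print("master weight after training:", master)
print("fp16 copy of master weight:", w16_from_master)
```

The fp16-only weight never moves because each update is smaller than half the gap between neighbouring fp16 values near 1.0, while the 32 bit master copy ends up close to 1.1.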

bfloat16

The bfloat16 format keeps the same 8 exponent bits as 32 bit floats, so it covers the same range, but it has only 7 mantissa bits (FP16 has 10), so each value is stored less precisely. Because its range matches single precision, loss scaling is often unnecessary, which is why bfloat16 is popular on modern accelerators.
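
The range versus precision trade-off can be seen by simulating bfloat16 in plain Python. The sketch below assumes bfloat16 is the top 16 bits of the float32 encoding, rounded to nearest, and it ignores NaN and overflow edge cases. The helper names to_bf16, to_fp16, and fits_fp16 are invented for this example. Note that struct raises OverflowError for values beyond FP16's range, whereas hardware would typically produce infinity.

```python
import struct

def to_bf16(x):
    """Simulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    Rounds to nearest (ties to even) on the dropped bits. NaN and
    overflow-to-infinity edge cases are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def fits_fp16(x):
    """True if x is within fp16's finite range (struct raises on overflow)."""
    try:
        struct.pack("<e", x)
        return True
    except OverflowError:
        return False

# Range: bfloat16 holds values that fp16 loses at both ends.
tiny, big = 1e-8, 70000.0
tiny_fp16 = to_fp16(tiny)    # underflows to 0.0
tiny_bf16 = to_bf16(tiny)    # stays close to 1e-8
big_ok_fp16 = fits_fp16(big) # False: above fp16's maximum of 65504
big_bf16 = to_bf16(big)      # representable, rounded to 70144.0

# Precision: fp16 resolves finer steps than bfloat16 near 1.0.
fine_fp16 = to_fp16(1.001)   # 1.0009765625
fine_bf16 = to_bf16(1.001)   # 1.0

print(tiny_fp16, tiny_bf16)
print(big_ok_fp16, big_bf16)
print(fine_fp16, fine_bf16)
```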

Key idea

Mixed precision uses 16 bit math for speed and memory but protects stability with 32 bit master weights and loss scaling. bfloat16 keeps the full 32 bit exponent range, so it usually needs no loss scaling.

Check yourself