Quantization-Aware Training

Why it exists

Post-training quantization (PTQ) is easy to apply but can lose accuracy at very low bit widths. Quantization-aware training (QAT) instead simulates quantization during training, so the network adapts its weights to the rounding error before deployment.

Fake quantization

QAT inserts fake-quantize operations into the graph. In the forward pass, weights and activations are rounded to the target precision, so the model sees the same coarse values it will use at inference. The rounded values are still stored and computed in floating point (hence "fake"), so the loss reflects the quantized behavior while training runs on ordinary float arithmetic.
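A minimal sketch of a fake-quantize step in plain Python may make this concrete. The function name `fake_quantize`, the symmetric signed grid, and the per-tensor scale taken from the largest absolute value are all assumptions chosen for illustration; real frameworks offer several schemes and this is not any particular library's API.

```python
def fake_quantize(values, num_bits=8):
    """Round floats onto a symmetric signed integer grid, then map back to float.

    Hypothetical helper for illustration only.
    """
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)                  # snap to the nearest integer level
        q = max(-qmax, min(qmax, q))          # clamp to the representable range
        out.append(q * scale)                 # dequantize: the result is still a float
    return out


weights = [0.7, -0.31, 0.04, 0.12]
quantized = fake_quantize(weights, num_bits=4)
print(quantized)  # values near 0.7, -0.3, 0.0, 0.1 (up to float rounding)
```

With 4 bits there are only 15 levels, so `-0.31` and `0.12` land on the nearest grid points and `0.04` collapses to zero. That coarse result is what the rest of the forward pass sees.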

The gradient problem

Rounding has a derivative of zero almost everywhere, which would block learning. QAT uses the straight-through estimator (STE): in the backward pass it treats the rounding step as the identity and passes the gradient through unchanged. This lets the optimizer keep adjusting the underlying float weights even though the forward pass is quantized.
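The toy loop below sketches the idea for a single weight, with the gradient written out by hand rather than computed by an autograd library. The names `quantize` and `train`, the grid step of 0.25, the squared-error loss, and the learning rate are all assumptions made for this illustration.

```python
def quantize(w, step=0.25):
    """Forward pass: snap w to the nearest multiple of `step`."""
    return round(w / step) * step


def train(w, x, target, lr=0.1, steps=50, step=0.25):
    """Fit one float weight so that quantize(w) * x approaches target."""
    for _ in range(steps):
        w_q = quantize(w, step)       # forward pass uses the quantized weight
        error = w_q * x - target      # loss = 0.5 * error ** 2
        grad_w_q = error * x          # dLoss/dw_q
        # Straight-through estimator: treat d(w_q)/dw as 1.
        # The true derivative of rounding is 0, which would freeze w forever.
        grad_w = grad_w_q * 1.0
        w -= lr * grad_w              # update the underlying float weight
    return w, quantize(w, step)


w, w_q = train(w=0.1, x=1.0, target=0.8)
print(w, w_q)  # w_q ends on 0.75, the grid level closest to the target 0.8
```

The float weight drifts in small steps that the grid cannot represent, and the quantized value jumps to a new level only when the float weight crosses a rounding boundary. Keeping a float copy of the weight is what allows those small updates to accumulate.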

Cost and payoff

QAT needs a full training or fine-tuning run and is more complex than post-training quantization. The payoff is markedly better accuracy at low precision, which matters for 4-bit or aggressive 8-bit deployment.

Key idea

QAT inserts fake quantization in the forward pass and uses a straight-through estimator in the backward pass, so the model learns to tolerate low-bit math and keeps more accuracy than post-training quantization.
