Why it exists
Post-training quantization (PTQ) is easy to apply but can lose accuracy at very low bit widths. Quantization-aware training (QAT) instead simulates quantization during training, so the network adapts its weights to the rounding error before deployment.
Fake quantization
QAT inserts fake-quantize operations into the graph. In the forward pass, weights and activations are rounded to the target precision, so the model sees the same coarse values it will use at inference. The rounded values are still stored as floating-point numbers rather than integers, so the rest of the training pipeline runs unchanged while the loss reflects the quantized behavior.
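To make the quantize-then-dequantize idea concrete, here is a minimal sketch in plain Python. The function name fake_quantize, the 4-bit default, and the fixed [-1, 1] range are assumptions chosen for illustration. Real frameworks usually learn or calibrate the range and operate on whole tensors.

```python
def fake_quantize(x, num_bits=4, x_min=-1.0, x_max=1.0):
    """Snap x to a uniform grid of 2**num_bits values and return a float.

    Hypothetical helper for illustration; the range is fixed here,
    whereas real QAT setups typically calibrate or learn it.
    """
    levels = 2 ** num_bits - 1              # number of steps across the range
    scale = (x_max - x_min) / levels        # width of one quantization step
    clipped = min(max(x, x_min), x_max)     # saturate values outside the range
    code = round((clipped - x_min) / scale) # integer code in [0, levels]
    return x_min + code * scale             # dequantize back to a float


weights = [-0.83, -0.10, 0.27, 0.94]
for w in weights:
    # Each output is a float, but only 16 distinct values are possible.
    print(w, "->", fake_quantize(w))
```

The output is an ordinary float that lies on the quantization grid, so each value moves by at most half a step while staying inside the range.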
The gradient problem
Rounding has a derivative of zero almost everywhere, so backpropagating through it exactly would produce zero gradients and block learning. QAT uses the straight-through estimator (STE): in the backward pass it pretends the rounding step was the identity, passing the gradient through unchanged. This lets the optimizer keep adjusting the float weights even though the forward pass is quantized.
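The effect of the straight-through estimator can be sketched with a one-weight toy problem and a hand-written backward pass. Everything here is an assumption for illustration: the names quantize_weight and ste_grad, the step size of 0.25, the learning rate, and the squared-error loss. In a real framework the same trick is expressed through the autodiff system instead.

```python
def quantize_weight(w, scale=0.25):
    """Round w to the nearest multiple of scale (result stays a float)."""
    return round(w / scale) * scale


def ste_grad(upstream_grad):
    """Straight-through estimator: treat the derivative of rounding as 1."""
    return upstream_grad


# Toy problem: fit y = w * x to one target using a quantized weight.
x, target = 2.0, 1.5
w = 0.1      # latent float weight that the optimizer updates
lr = 0.05    # learning rate (hypothetical value)

for step in range(50):
    w_q = quantize_weight(w)            # forward pass uses the quantized weight
    y = w_q * x
    grad_wq = 2.0 * (y - target) * x    # d(loss)/d(w_q), loss = (y - target)**2
    grad_w = ste_grad(grad_wq)          # backward: rounding treated as identity
    w -= lr * grad_w                    # update the float weight, not w_q

print("float weight:", w, "quantized weight:", quantize_weight(w))
```

The float weight drifts until its quantized value reaches 0.75, the grid point that fits the target exactly, and then the gradient becomes zero. With the true derivative of rounding in place of ste_grad, grad_w would be zero from the first step and w would never move.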
Cost and payoff
QAT needs a full training or fine-tuning run and is more complex to set up than PTQ. The payoff is markedly better accuracy at low precision, which matters for 4-bit or aggressive 8-bit deployment.
Key idea
QAT inserts fake quantization in the forward pass and uses a straight-through estimator in the backward pass, so the model learns to tolerate low-bit math and keeps more accuracy than post-training quantization.