Why post-training quantization can hurt
Post-training quantization rounds every weight and activation of an already trained model to a low-precision grid. At aggressive bit widths, the rounding error can noticeably reduce accuracy because the model was never trained to tolerate it.
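To make the effect of bit width concrete, here is a minimal sketch that quantizes a small list of weights and measures the rounding error. The symmetric abs-max scheme, the helper names, and the weight values are illustrative assumptions, not something prescribed by these notes.

```python
def quantize_dequantize(values, num_bits):
    """Symmetric uniform quantization: round to an integer grid, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits, 7 for 4 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    out = []
    for v in values:
        q = round(v / scale)                    # nearest integer level
        q = max(-qmax, min(qmax, q))            # clamp to the representable range
        out.append(q * scale)                   # back to floating point
    return out


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical weights, chosen only for illustration.
weights = [0.82, -0.41, 0.13, -0.07, 0.55, -0.96, 0.02, 0.30]

error_by_bits = {
    bits: mean_abs_error(weights, quantize_dequantize(weights, bits))
    for bits in (8, 4, 2)
}

for bits, err in error_by_bits.items():
    print(f"{bits}-bit mean absolute rounding error: {err:.4f}")
```

On this toy data the error grows as the bit width shrinks, which is the gap a model quantized after training has had no chance to compensate for.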
The QAT idea
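Quantization-aware training (QAT) inserts fake quantization into the forward pass during training: values are quantized and immediately dequantized, so the computation stays in floating point but carries the rounding error.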
- Weights and activations are rounded to the target precision in the forward pass.
- The model sees the rounding error and learns to be robust to it.
- In the backward pass, gradients flow through a straight-through estimator, because rounding has zero gradient almost everywhere.
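The three steps above can be sketched with a single-weight model in plain Python. This is a toy illustration under stated assumptions: the fixed scale of 0.5, the three-level grid, the learning rate, and the data are all hypothetical, and the gradient mask that some implementations apply outside the clipping range is omitted.

```python
def fake_quantize(w, scale, qmax):
    """Quantize then dequantize: the result is a float that sits on the integer grid."""
    q = round(w / scale)
    q = max(-qmax, min(qmax, q))
    return q * scale


def train(w, data, scale=0.5, qmax=1, lr=0.1, steps=50):
    """Gradient descent on a squared error, with fake quantization in the forward pass."""
    for _ in range(steps):
        grad = 0.0
        for x, y in data:
            w_q = fake_quantize(w, scale, qmax)   # forward pass uses the rounded weight
            pred = w_q * x
            # d(loss)/d(w_q) for the squared error. The straight-through estimator
            # then treats d(w_q)/d(w) as 1, although the true derivative of
            # rounding is 0 almost everywhere.
            grad += 2.0 * (pred - y) * x
        w -= lr * grad / len(data)                # update the full-precision weight
    return w


# Hypothetical data generated by y = 0.5 * x.
data = [(1.0, 0.5), (2.0, 1.0), (-1.0, -0.5)]

w_init = 0.1
w_trained = train(w_init, data)

print("quantized weight before training:", fake_quantize(w_init, 0.5, 1))
print("quantized weight after training: ", fake_quantize(w_trained, 0.5, 1))
```

The initial weight rounds to 0.0, so with the true gradient of rounding it would never move. With the straight-through estimator the full-precision weight is pushed until it rounds to 0.5, the grid point that fits the data.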
Key mechanics
- The straight-through estimator passes gradients as if quantization were the identity.
- Quantization ranges are calibrated or learned so the scale matches the data.
- Training often starts from a full-precision checkpoint and fine-tunes with fake quantization.
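As one way to picture range calibration, the sketch below derives a scale from the largest absolute value seen in a batch of activations and compares it with a scale picked without looking at the data. Abs-max calibration is only one possible choice, and the function names and sample values are hypothetical.

```python
def calibrate_scale(samples, num_bits):
    """Abs-max calibration: map the largest observed magnitude to the top integer level."""
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(s) for s in samples)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return scale, qmax


def fake_quantize(x, scale, qmax):
    q = max(-qmax, min(qmax, round(x / scale)))
    return q * scale


def mean_abs_error(samples, scale, qmax):
    return sum(abs(s - fake_quantize(s, scale, qmax)) for s in samples) / len(samples)


# Hypothetical activation samples collected from a calibration batch.
activations = [0.03, -0.12, 0.25, 0.08, -0.31, 0.19, -0.05, 0.27]

scale, qmax = calibrate_scale(activations, num_bits=4)
calibrated_error = mean_abs_error(activations, scale, qmax)

# A scale of 1.0 is far too coarse here: every sample rounds to zero.
mismatched_error = mean_abs_error(activations, 1.0, qmax)

print(f"calibrated scale: {scale:.4f}")
print(f"error with calibrated scale: {calibrated_error:.4f}")
print(f"error with mismatched scale: {mismatched_error:.4f}")
```

A learned range would instead treat the scale as a trainable parameter and update it during fine-tuning; that variant is not shown here.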
Trade-offs
QAT often recovers most of the accuracy lost at low bit widths, but it costs extra training compute and adds complexity to the training pipeline. When post-training quantization already meets the accuracy target, that simpler path is preferred.
Key idea
Quantization-aware training simulates low precision in the forward pass with fake quantization and uses a straight-through estimator in the backward pass, so the model adapts to the rounding error and the final quantized model retains most of its accuracy.