Trading precision for speed
Models train in floating point, but inference often does not need that precision. Quantization maps weights and activations to low precision formats such as 8 bit integers, shrinking the model and speeding up math.
Why it helps
- Smaller memory footprint means less data to move, easing bandwidth bound layers.
- Integer math runs faster and uses less energy than floating point.
- Cache efficiency improves because more values fit in fast memory.
The mapping
Quantization stores a scale and sometimes a zero point that convert between the integer grid and real values. The integer is recovered to an approximate float by multiplying by the scale.
Two main styles
- Post training quantization converts an already trained model, sometimes with a small calibration set.
- Quantization aware training simulates the rounding during training so the model learns to tolerate it, usually giving better accuracy at low bit widths.
Key idea
Quantization maps weights and activations to low precision integers using a scale and zero point, cutting memory and accelerating inference at a small accuracy cost.