The scale problem
To quantize activations to 8-bit integers, you must pick a scale that maps real values onto only 256 levels. Make the range too wide and you waste levels on rare extremes; make it too narrow and you clip common values. Calibration finds a good scale by observing real data.
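As a rough illustration of that trade-off, the sketch below assumes symmetric signed INT8 quantization, where one scale maps the chosen range onto integer levels and anything outside it is clamped. The helper names (scale_for_range, quantize, dequantize) and the sample values are hypothetical, chosen only for this example.

```python
def scale_for_range(max_abs: float) -> float:
    """Scale that maps [-max_abs, max_abs] onto the signed 8-bit grid."""
    return max_abs / 127.0


def quantize(x: float, scale: float) -> int:
    """Round x to the nearest level and clamp it to the INT8 range."""
    q = round(x / scale)
    return max(-128, min(127, q))


def dequantize(q: int, scale: float) -> float:
    """Map an integer level back to an approximate real value."""
    return q * scale


values = [0.02, 0.5, 1.3, 6.0]  # made-up activations for illustration

# First a range that is too wide, then one that is too narrow.
for max_abs in (100.0, 1.0):
    scale = scale_for_range(max_abs)
    recovered = [dequantize(quantize(v, scale), scale) for v in values]
    print(max_abs, recovered)
```

With the wide range, small values such as 0.02 round to zero because neighboring levels are far apart. With the narrow range, 1.3 and 6.0 both clamp to the top level and become indistinguishable.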
How calibration works
A small, representative calibration dataset is run through the model while the quantizer records the distribution of activations at each layer. Each layer's range is then chosen from that distribution by one of these methods:
- Min-max uses the observed extremes as the range.
- Percentile clips a small fraction of outliers so that more levels are spent on the bulk of the distribution.
- KL divergence chooses the range that minimizes the divergence between the original and the quantized activation distributions.
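The three methods above can be sketched in plain Python as follows. This is a simplified illustration under several assumptions, not any particular toolkit's implementation: ranges are symmetric around zero, the function names and the defaults (pct, bins, levels, eps) are hypothetical, and the KL search uses a crude epsilon floor where production calibrators apply more careful smoothing.

```python
import math
import random


def minmax_range(samples):
    """Min-max: the largest observed magnitude becomes the clip range."""
    return max(abs(s) for s in samples)


def percentile_range(samples, pct=99.9):
    """Percentile: clip the top (100 - pct) percent of magnitudes."""
    mags = sorted(abs(s) for s in samples)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100.0))
    return mags[idx]


def kl_range(samples, bins=512, levels=128, eps=1e-9):
    """KL divergence: sweep clip thresholds over a histogram of magnitudes
    and keep the one whose coarse version diverges least from the original.
    Assumes at least one non-zero sample."""
    top = max(abs(s) for s in samples)
    width = top / bins
    hist = [0] * bins
    for s in samples:
        hist[min(bins - 1, int(abs(s) / width))] += 1

    n = float(len(samples))
    best_kl, best_edge = math.inf, bins
    for edge in range(levels, bins + 1):
        kept = hist[:edge]
        # Reference P: kept bins, with all clipped mass folded into the last.
        p = list(kept)
        p[-1] += sum(hist[edge:])
        # Candidate Q: merge kept bins into `levels` groups, then spread each
        # group's count evenly over the bins that are non-empty in P.
        q = [0.0] * edge
        for g in range(levels):
            lo, hi = g * edge // levels, (g + 1) * edge // levels
            filled = [i for i in range(lo, hi) if p[i] > 0]
            if filled:
                share = sum(kept[lo:hi]) / len(filled)
                for i in filled:
                    q[i] = share
        q_total = sum(q)
        if q_total == 0:
            continue  # nothing falls below this threshold
        kl = 0.0
        for pi, qi in zip(p, q):
            if pi > 0:
                # eps keeps the log finite when Q has no mass where P does.
                kl += (pi / n) * math.log((pi / n) / max(qi / q_total, eps))
        if kl < best_kl:
            best_kl, best_edge = kl, edge
    return best_edge * width


random.seed(0)
# Made-up activations: a unit Gaussian bulk plus two large outliers.
acts = [random.gauss(0.0, 1.0) for _ in range(5000)] + [40.0, -55.0]
print(minmax_range(acts), percentile_range(acts), kl_range(acts))
```

Min-max returns the outlier magnitude directly, while the percentile rule lands inside the Gaussian bulk. Where the KL search settles depends on the histogram resolution and the smoothing, so its result here should be read as indicative only.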
The procedure
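A minimal sketch of the calibration loop, assuming a model that can be treated as an ordered set of named layers, each a plain function on a list of floats. For brevity it tracks only the largest magnitude per layer (the min-max rule) instead of a full distribution, and derives a symmetric INT8 scale from it. The names calibrate, toy_model, and calib_batches are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Layer = Callable[[List[float]], List[float]]


def calibrate(layers: Dict[str, Layer],
              batches: Sequence[List[float]]) -> Dict[str, float]:
    """Run each calibration batch through the layers in order and return
    one symmetric INT8 scale per layer from the largest magnitude seen."""
    max_abs = {name: 0.0 for name in layers}
    for batch in batches:
        x = batch
        for name, layer in layers.items():
            x = layer(x)  # this layer's output activations
            max_abs[name] = max(max_abs[name], max(abs(v) for v in x))
    return {name: m / 127.0 for name, m in max_abs.items()}


# Toy stand-ins for real layers and calibration data.
toy_model = {
    "double": lambda xs: [2.0 * v for v in xs],
    "relu": lambda xs: [max(0.0, v) for v in xs],
}
calib_batches = [[-1.0, 0.5, 2.0], [0.25, -3.0, 1.5]]

scales = calibrate(toy_model, calib_batches)
print(scales)
```

A fuller version would accumulate a histogram per layer inside the loop, so that a percentile or KL rule could pick the range once all batches have been seen.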
Why outliers matter
A few large activations can stretch the range so much that ordinary values collapse into a handful of levels. Clipping those outliers, and accepting some error on them, often improves overall accuracy. The calibration set should resemble production data so that the chosen ranges generalize.
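The collapse can be seen in a small experiment. The sketch below assumes symmetric INT8 quantization and made-up data: a unit Gaussian bulk plus three large outliers. The clip value of 4.0 is a hypothetical choice that roughly covers the bulk, and the helper names are illustrative.

```python
import random


def to_level(x, scale):
    """Symmetric INT8 level for x: round, then clamp."""
    return max(-128, min(127, round(x / scale)))


def levels_used(samples, max_abs):
    """Number of distinct integer levels the samples occupy."""
    scale = max_abs / 127.0
    return len({to_level(s, scale) for s in samples})


def quant_error(samples, max_abs):
    """Mean squared error after a quantize/dequantize round trip."""
    scale = max_abs / 127.0
    return sum((s - to_level(s, scale) * scale) ** 2
               for s in samples) / len(samples)


random.seed(0)
bulk = [random.gauss(0.0, 1.0) for _ in range(10000)]
acts = bulk + [80.0, -95.0, 120.0]  # three made-up outliers

wide = max(abs(a) for a in acts)  # min-max range, stretched by the outliers
clipped = 4.0                     # hypothetical clip covering the bulk

print("levels used by bulk:", levels_used(bulk, wide), "vs", levels_used(bulk, clipped))
print("bulk error:", quant_error(bulk, wide), "vs", quant_error(bulk, clipped))
```

Under the stretched range the bulk occupies only a few levels, while the clipped range spreads it across most of the grid and lowers its round-trip error. The price is that the three outliers saturate at the clip value, which is the error the text describes accepting.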
Key idea
INT8 calibration runs representative data through the model to measure activation ranges, then picks per-layer scales that balance clipping against wasted resolution.