What it is
Post training quantization, or PTQ, takes an already trained floating point model and converts its weights, and often its activations, to low bit integers such as eight bit. Integer arithmetic is typically faster than floating point, and eight bit values take a quarter of the storage of 32 bit floats, so the model is cheaper to run on the same hardware.
How values are mapped
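Quantization maps a float range to integers with a scale and a zero point. A value is divided by the scale, rounded, offset by the zero point, and clamped to the integer range. Multiplying the offset corrected integer by the scale recovers an approximation of the original value. The challenge is picking a range that covers most values without letting rare outliers stretch it, because a wider range means a larger scale and coarser steps for every other value.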
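The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the signed eight bit range of -128 to 127 is an assumption, and the range is taken from the simple minimum and maximum of the data rather than from any outlier aware method.

```python
def choose_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Return (scale, zero_point) mapping [x_min, x_max] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero maps exactly to an integer.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # degenerate all-zero input; any positive scale works
    zero_point = round(qmin - x_min / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Divide by the scale, round, offset by the zero point, then clamp."""
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float value."""
    return (q - zero_point) * scale


# Illustrative values only; the range here is simply their min and max.
values = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zp = choose_qparams(min(values), max(values))
quantized = [quantize(v, scale, zp) for v in values]
restored = [dequantize(q, scale, zp) for q in quantized]

for v, q, r in zip(values, quantized, restored):
    print(f"{v:+.4f} -> {q:4d} -> {r:+.4f}")
```

For values inside the chosen range, the round trip error is at most half a scale step; values outside the range are clamped and lose more.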
- Per tensor uses one scale for a whole tensor, which is the simplest option but coarse.
- Per channel uses a separate scale per output channel, so a channel with small weights is not forced onto the coarse grid set by a channel with large ones.
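A small sketch can make the difference between the two granularities concrete. It assumes symmetric quantization (zero point fixed at 0, integer range -127 to 127), which is a common choice for weights but is not stated in the text above; the weight values and helper names are made up for illustration.

```python
def symmetric_scale(values, qmax=127):
    """Scale that maps the largest magnitude in `values` onto qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def fake_quantize(values, scale, qmax=127):
    """Quantize and immediately dequantize, to expose the rounding error."""
    out = []
    for v in values:
        q = max(-qmax, min(qmax, round(v / scale)))
        out.append(q * scale)
    return out


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weight matrix: one row per output channel.
weights = [
    [0.012, -0.020, 0.015, -0.005],  # channel with a small range
    [8.0, -6.5, 7.25, -3.0],         # channel with a large range
]

# Per tensor: a single scale derived from every weight in the matrix.
flat = [w for row in weights for w in row]
tensor_scale = symmetric_scale(flat)
per_tensor = [fake_quantize(row, tensor_scale) for row in weights]

# Per channel: each output channel (row) gets its own scale.
channel_scales = [symmetric_scale(row) for row in weights]
per_channel = [fake_quantize(row, s) for row, s in zip(weights, channel_scales)]

small_err_tensor = max_error(weights[0], per_tensor[0])
small_err_channel = max_error(weights[0], per_channel[0])
print("small channel, per tensor error: ", small_err_tensor)
print("small channel, per channel error:", small_err_channel)
```

With these made up numbers, the shared scale is set by the large channel, so every weight in the small channel rounds to zero. With its own scale, the small channel keeps its values to within half a step.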
Static versus dynamic
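- Dynamic quantization computes activation scales on the fly at inference time, from the range of each incoming input, while the weights are still quantized ahead of time.
- Static quantization runs a small calibration set through the model first to record activation ranges, so the scales are fixed before deployment and cost nothing at inference time; accuracy holds up as long as the calibration data is representative of real inputs.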
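The two modes differ only in when the activation range is measured, which the following sketch tries to show. It is a simplified illustration under several assumptions: a plain min and max observer, an unsigned eight bit range of 0 to 255, and made up activation values. The class and function names are hypothetical and do not correspond to any particular library.

```python
def minmax_qparams(values, qmin=0, qmax=255):
    """Scale and zero point from the min and max of `values` (range includes 0)."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0
    zero_point = max(qmin, min(qmax, round(qmin - lo / scale)))
    return scale, zero_point


def quantize_all(values, scale, zero_point, qmin=0, qmax=255):
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q, scale, zero_point):
    return (q - zero_point) * scale


class StaticObserver:
    """Tracks the running min and max of activations seen during calibration."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, activations):
        self.lo = min(self.lo, min(activations))
        self.hi = max(self.hi, max(activations))

    def qparams(self):
        return minmax_qparams([self.lo, self.hi])


# Static: measure ranges once on unlabeled calibration data, then freeze them.
calibration_batches = [[0.1, 0.9, 2.0], [0.0, 1.5, 3.8], [0.3, 4.0, 1.1]]
observer = StaticObserver()
for batch in calibration_batches:
    observer.observe(batch)
static_scale, static_zp = observer.qparams()

# A new input at inference time; 6.0 lies outside the calibrated range.
new_input = [0.25, 1.3, 6.0]

# Static path reuses the frozen parameters, so 6.0 is clipped.
static_q = quantize_all(new_input, static_scale, static_zp)

# Dynamic path recomputes the parameters from this input alone.
dyn_scale, dyn_zp = minmax_qparams(new_input)
dynamic_q = quantize_all(new_input, dyn_scale, dyn_zp)

static_err = abs(new_input[2] - dequantize(static_q[2], static_scale, static_zp))
dynamic_err = abs(new_input[2] - dequantize(dynamic_q[2], dyn_scale, dyn_zp))
print("static error on 6.0: ", static_err)
print("dynamic error on 6.0:", dynamic_err)
```

In this example the static path saturates on the out of range value, which is why the calibration set needs to resemble real traffic. The dynamic path adapts to it, at the cost of computing a range for every input.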
PTQ is attractive because it needs no labels and no retraining, only a small unlabeled calibration set in the static case, but very low bit widths, such as four bit and below, can lose noticeable accuracy.
Key idea
Post training quantization converts a trained float model to low bit integers using scales and zero points; calibration and per channel scaling preserve accuracy without any retraining.