Post-Training Quantization

What it is

Post-training quantization, or PTQ, takes an already-trained floating-point model and converts its weights, and often its activations, to low-bit integers such as 8-bit. Integer arithmetic is faster and the smaller values use less memory, so the model is cheaper to run on the same hardware.

How values are mapped

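Quantization maps a float range to integers with a scale and a zero point. A value is divided by the scale, rounded, shifted by the zero point, and clamped to the integer range; subtracting the zero point and multiplying by the scale recovers an approximation of the original value. The challenge is picking a range that captures most values without wasting integer levels on rare outliers.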
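The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names choose_qparams, quantize, and dequantize are made up for this example, a signed 8-bit range of -128 to 127 is assumed, and the range is taken from the simple minimum and maximum of the data. Python's built-in round sends exact halves to the nearest even integer, which is one of several reasonable rounding choices.

```python
def choose_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Pick a scale and zero point so [x_min, x_max] maps onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    if scale == 0.0:
        scale = 1.0  # all-zero input: any positive scale works
    zero_point = int(round(qmin - x_min / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Divide by the scale, round, shift by the zero point, then clamp."""
    q = int(round(x / scale)) + zero_point
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zero_point):
    """Approximate inverse of quantize."""
    return (q - zero_point) * scale


# Illustrative values only.
values = [-1.0, -0.25, 0.0, 0.5, 3.0]
scale, zero_point = choose_qparams(min(values), max(values))
codes = [quantize(v, scale, zero_point) for v in values]
restored = [dequantize(q, scale, zero_point) for q in codes]
```

For values that fall inside the chosen range, the round trip is off by at most about half a scale step, which is why a range stretched by one outlier makes every other value less precise.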

Per-tensor quantization uses one scale for a whole tensor; it is the simplest option but coarse.
Per-channel quantization uses a separate scale for each output channel, which handles uneven weight ranges far better.
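A small comparison makes the difference concrete. The sketch below assumes symmetric 8-bit quantization (zero point fixed at 0, integers in -127 to 127), which is a common choice for weights but is an assumption here, and it uses a made-up two-channel weight matrix. The helper names are hypothetical.

```python
def quantize_symmetric(row, scale, qmax=127):
    """Round-trip a list of floats through symmetric int8 using one scale."""
    out = []
    for w in row:
        q = max(-qmax, min(qmax, int(round(w / scale))))
        out.append(q * scale)
    return out


def max_abs(values):
    return max(abs(v) for v in values)


def worst_error(original, restored):
    """Largest absolute difference between two same-shaped nested lists."""
    return max(
        abs(a - b)
        for row_o, row_r in zip(original, restored)
        for a, b in zip(row_o, row_r)
    )


# Hypothetical weight matrix: one row per output channel.
weights = [
    [0.010, -0.020, 0.015],  # channel with small weights
    [2.000, -1.500, 0.750],  # channel with large weights
]

# Per tensor: one scale taken from the largest magnitude in the whole matrix.
tensor_scale = max_abs([w for row in weights for w in row]) / 127
per_tensor = [quantize_symmetric(row, tensor_scale) for row in weights]

# Per channel: each row gets its own scale.
per_channel = [quantize_symmetric(row, max_abs(row) / 127) for row in weights]

# Compare the error on the small-weight channel only.
err_tensor_small = worst_error(weights[:1], per_tensor[:1])
err_channel_small = worst_error(weights[:1], per_channel[:1])
```

With a single shared scale, the small-weight channel is squeezed into one or two integer levels because the scale is set by the large channel. With its own scale, the same channel uses the full integer range and its error drops accordingly.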

Static versus dynamic

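Dynamic quantization quantizes weights ahead of time and computes activation scales on the fly at inference time, from the values it actually sees.
Static quantization runs a small calibration set through the model first to record activation ranges, so the scales are fixed ahead of time; this removes the runtime overhead, and accuracy holds up as long as the calibration data is representative.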
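The two approaches differ mainly in when the activation range is measured. The sketch below is an illustration under simple assumptions: ranges come from plain minimum and maximum tracking, the activations are made-up lists of floats, and MinMaxObserver, scale_from_range, and dynamic_scale are hypothetical names rather than any framework's API.

```python
def scale_from_range(lo, hi, levels=255):
    """Scale for an 8-bit range covering [lo, hi], widened to include 0.0."""
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / levels
    return scale if scale > 0.0 else 1.0


class MinMaxObserver:
    """Hypothetical helper that tracks the running min and max of activations."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")

    def observe(self, batch):
        self.lo = min(self.lo, min(batch))
        self.hi = max(self.hi, max(batch))


# Static: record ranges once on calibration batches, then freeze the scale.
calibration_batches = [[0.1, 0.9, -0.3], [1.2, -0.5, 0.4]]
observer = MinMaxObserver()
for batch in calibration_batches:
    observer.observe(batch)
static_scale = scale_from_range(observer.lo, observer.hi)


def dynamic_scale(batch):
    """Dynamic: derive the scale from the batch being processed right now."""
    return scale_from_range(min(batch), max(batch))


# An inference batch whose largest value lies outside the calibrated range.
inference_batch = [0.2, -0.1, 2.5]
runtime_scale = dynamic_scale(inference_batch)
```

The frozen static scale costs nothing at inference time but would clip the value 2.5, since calibration never saw anything above 1.2. The dynamic scale adapts to the batch at the price of measuring the range on every call.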

PTQ is attractive because it needs no labels and no retraining, but very low bit widths, such as 4-bit and below, can cause a noticeable drop in accuracy.

Key idea

Post-training quantization converts a trained float model to low-bit integers using scales and zero points; calibration and per-channel scaling help preserve accuracy without any retraining.
