Model Quantization for Inference

Trading precision for speed

Models are usually trained in floating point, but inference often does not need that much precision. Quantization maps weights and activations to lower-precision formats such as 8-bit integers, which shrinks the model and speeds up the arithmetic.

Why it helps

A smaller memory footprint means less data to move, which eases bandwidth-bound layers.
Integer math typically runs faster and uses less energy than floating-point math on hardware that supports it.
Cache efficiency improves because more values fit in fast memory.
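
To make the footprint claim concrete, here is a small arithmetic sketch. The parameter count is a hypothetical figure chosen for illustration, and the helper name weight_bytes is not from any library. It counts weight storage only and ignores the few extra bytes needed for scales and zero points.

```python
def weight_bytes(num_params, bits_per_weight):
    """Bytes needed to store the weights alone at a given bit width."""
    return num_params * bits_per_weight // 8

num_params = 1_000_000  # hypothetical model size, for illustration only

fp32_bytes = weight_bytes(num_params, 32)  # 32-bit floating point
int8_bytes = weight_bytes(num_params, 8)   # 8-bit integers
ratio = fp32_bytes / int8_bytes            # how many times smaller int8 is

print(fp32_bytes, int8_bytes, ratio)
```

Going from 32-bit floats to 8-bit integers cuts weight storage by a factor of four, and the same factor applies to the bytes that must be moved for each layer.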

The mapping

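Quantization stores a scale and sometimes a zero point that convert between the integer grid and real values. To quantize, a real value is divided by the scale, rounded, and shifted by the zero point. To recover an approximate float, the zero point (if there is one) is subtracted from the integer and the result is multiplied by the scale.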
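
The mapping can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it assumes an unsigned 8-bit grid (0 to 255), one scale and zero point for the whole tensor, and a range taken from the minimum and maximum of the values. The function names choose_qparams, quantize, and dequantize are made up for this example. The relation it implements is real ≈ scale * (q - zero_point).

```python
def choose_qparams(values, num_bits=8):
    """Pick a scale and zero point that map [min, max] onto the integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Widen the range to include 0.0 so that real zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:  # all values are zero; any scale works
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point

def quantize(values, scale, zero_point, num_bits=8):
    """Real values -> integers: divide by scale, round, shift, clamp to the grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]

def dequantize(q_values, scale, zero_point):
    """Integers -> approximate real values: subtract zero point, multiply by scale."""
    return [scale * (q - zero_point) for q in q_values]

weights = [-1.0, -0.25, 0.0, 0.4, 2.0]  # made-up example values
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
approx = dequantize(q, scale, zero_point)
max_error = max(abs(a - w) for a, w in zip(approx, weights))
print(q, max_error)
```

For values inside the chosen range, the round trip error is at most half the scale, which is why a tighter range (smaller scale) gives a finer grid.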

Two main styles

Post-training quantization converts an already trained model without retraining it, sometimes using a small calibration set to choose the ranges.
Quantization-aware training simulates the rounding during training so the model learns to tolerate it, usually giving better accuracy at low bit widths.
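
Both styles can be illustrated with one small sketch. It assumes simple min/max calibration and an unsigned 8-bit grid; real toolchains offer other range estimators. The names calibrate_range and fake_quantize, and the activation values, are hypothetical. The first function stands in for the calibration step of post-training quantization. The second is the quantize-then-dequantize round trip that quantization-aware training places in the forward pass, so the values stay floats but carry the rounding and clamping error. Gradient handling during training is not shown.

```python
def calibrate_range(batches):
    """Track the smallest and largest value seen across calibration batches."""
    lo = min(min(batch) for batch in batches)
    hi = max(max(batch) for batch in batches)
    # Include 0.0 so that real zero lands exactly on the integer grid.
    return min(lo, 0.0), max(hi, 0.0)

def fake_quantize(values, lo, hi, num_bits=8):
    """Quantize and immediately dequantize, returning floats with rounding error."""
    qmax = 2 ** num_bits - 1
    if hi == lo:  # degenerate range; nothing to quantize
        return list(values)
    scale = (hi - lo) / qmax
    zero_point = round(-lo / scale)
    out = []
    for v in values:
        q = max(0, min(qmax, round(v / scale) + zero_point))  # round and clamp
        out.append(scale * (q - zero_point))                  # back to float
    return out

# Hypothetical activations recorded from two calibration batches.
calibration_batches = [[0.1, 0.9, -0.3], [1.5, -0.5, 0.2]]
lo, hi = calibrate_range(calibration_batches)

# The last input lies outside the calibrated range and gets clamped.
simulated = fake_quantize([0.33, 1.5, 3.0], lo, hi)
print(lo, hi, simulated)
```

The clamped value shows why the calibration set matters: anything outside the observed range saturates at the edge of the grid.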

Key idea

Quantization maps weights and activations to low-precision integers using a scale and zero point, cutting memory and accelerating inference, usually at a small accuracy cost.
