Model Quantization

The idea

Quantization stores model weights, and sometimes activations, using fewer bits, such as 8-bit or 4-bit integers instead of 16-bit floats. Fewer bits per value mean a smaller model, less memory traffic, and faster, cheaper inference.
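
As a rough illustration of the size savings, the sketch below estimates weight storage at different bit widths. The 7-billion-parameter model and the helper name weight_storage_gb are assumptions chosen for the example, and the estimate ignores the small overhead of storing scale factors and other metadata.

```python
def weight_storage_gb(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in gigabytes (10**9 bytes).

    Ignores scale factors and other metadata that real formats also store.
    """
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical model size, chosen only for illustration.
params = 7_000_000_000

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: about {weight_storage_gb(params, bits):.1f} GB")
```

Under these assumptions, halving the bit width halves the storage for the weights.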

Why it works

Neural networks are surprisingly tolerant of low precision. Most of the information survives when values are rounded to a coarser grid, especially if the rounding is done carefully.

Post-training quantization converts an already trained model directly, with no further training.
Quantization-aware training simulates low precision during training so the model adapts to it.
A scale factor maps the real-valued range onto the integer grid.
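
The scale-factor mapping can be sketched in a few lines. This is a minimal example of symmetric quantization with a single scale, which is one common scheme among several and not necessarily the one any particular library uses. The function names quantize and dequantize and the sample weights are assumptions made for illustration.

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: map floats onto signed integers using one scale."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integers back to approximate real values."""
    return [x * scale for x in q]


# Hypothetical weights, chosen only for illustration.
weights = [0.4, -1.0, 0.25, 0.0]
q, scale = quantize(weights)
restored = dequantize(q, scale)

for w, r in zip(weights, restored):
    print(f"original {w:+.4f}  restored {r:+.4f}  error {abs(w - r):.4f}")
```

Each restored value differs from the original by at most about half a grid step (scale / 2), which is the sense in which most of the information survives rounding.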

Watch the outliers

The main risk is accuracy loss. A few weights or activations with very large values, called outliers, can dominate the range and force coarse rounding everywhere else. Common techniques either handle these outliers separately, for example by keeping them in higher precision, or use per-channel scales so that one large value only affects its own channel. With care, 4-bit models can run on consumer hardware with little quality loss.
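
The effect of an outlier, and the benefit of per-channel scales, can be seen in a small experiment. This is a sketch under stated assumptions: it uses symmetric 4-bit quantization, two made-up channels of four values each, and hypothetical helper names (quantize_dequantize, mean_abs_error).

```python
def quantize_dequantize(values, num_bits=4):
    """Quantize with one symmetric scale, then map back to floats."""
    qmax = 2 ** (num_bits - 1) - 1              # 7 for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two hypothetical channels; the second contains one outlier (20.0).
channels = [
    [0.10, -0.20, 0.15, 0.05],
    [0.12, -0.08, 20.0, 0.03],
]

# Per-tensor: a single scale for all values, so the outlier sets the grid.
flat = [v for ch in channels for v in ch]
per_tensor = quantize_dequantize(flat)
err_tensor = mean_abs_error(channels[0], per_tensor[:4])

# Per-channel: each channel gets its own scale.
per_channel = [quantize_dequantize(ch) for ch in channels]
err_channel = mean_abs_error(channels[0], per_channel[0])

print(f"channel 0 error, per-tensor scale:  {err_tensor:.4f}")
print(f"channel 0 error, per-channel scale: {err_channel:.4f}")
```

With a single scale, the outlier stretches the grid so far that every value in the first channel rounds to zero. With per-channel scales, the first channel keeps a fine grid and only the channel holding the outlier is rounded coarsely.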

Key idea

Quantization represents weights in fewer bits to shrink and speed up models, trading a little precision while managing outliers to limit accuracy loss.
