The idea
Quantization stores model weights, and sometimes activations, using fewer bits, such as eight-bit or four-bit integers instead of sixteen-bit floats. Fewer bits per value mean a smaller model and less data to move through memory, which generally makes inference faster and cheaper.
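To make the size saving concrete, here is a rough back-of-the-envelope sketch. The parameter count is a hypothetical figure chosen only for illustration, and the helper name `weight_storage_bytes` is made up here. The estimate counts weight storage only and ignores the small overhead of storing scale factors.

```python
def weight_storage_bytes(num_params: int, bits_per_weight: int) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


# Hypothetical model size, chosen only for illustration.
NUM_PARAMS = 7_000_000_000

for bits in (16, 8, 4):
    gigabytes = weight_storage_bytes(NUM_PARAMS, bits) / 1e9
    print(f"{bits:>2}-bit weights: {gigabytes:.1f} GB")
```

Under these assumptions, each halving of the bit width halves the storage needed for the weights.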
Why it works
Neural networks are surprisingly tolerant of low precision. Most of the information survives when values are rounded to a coarser grid, especially if the rounding is done carefully.
- Post-training quantization converts an already trained model directly, with no further training.
- Quantization-aware training simulates low precision during training so the model adapts to it.
- A scale factor maps the real-valued range onto the integer grid.
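The role of the scale factor can be sketched in a few lines of plain Python. This is a minimal illustration of one common scheme (symmetric quantization with a single scale), not a description of any particular library. The function names and the sample weights are made up for the example.

```python
def quantize(values, num_bits=8):
    """Symmetric quantization: map floats onto signed integers via one scale."""
    qmax = 2 ** (num_bits - 1) - 1              # 127 for eight bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each value to the nearest grid point and clamp to the integer range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the integers back to approximate real values."""
    return [qi * scale for qi in q]


# Made-up weights, for illustration only.
weights = [0.42, -1.27, 0.05, 0.88]

q, scale = quantize(weights)
restored = dequantize(q, scale)

for w, qi, r in zip(weights, q, restored):
    print(f"{w:+.4f} -> {qi:+4d} -> {r:+.4f}")
```

The largest absolute value sets the scale, so every other value is rounded to a multiple of that scale. The rounding error per value is at most half the scale.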
Watch the outliers
The main risk is accuracy loss. A few weights or activations with very large values, called outliers, can dominate the range and force coarse rounding everywhere else. Common techniques either handle these outliers separately, for example by keeping them in higher precision, or use per-channel scales so that one large value only affects its own channel. With care, four-bit models can run on consumer hardware with little quality loss.
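The effect of an outlier, and the benefit of per-channel scales, can be seen in a small sketch. It assumes symmetric four-bit quantization and uses two made-up channels, one of which contains an outlier. All names and values are illustrative.

```python
def quantize_dequantize(values, num_bits=4):
    """Round values to a symmetric integer grid and map them back to floats."""
    qmax = 2 ** (num_bits - 1) - 1              # 7 for four bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Two made-up channels; the second one contains an outlier (8.00).
channels = [
    [0.11, -0.20, 0.15, -0.05],
    [0.12, -0.18, 8.00, 0.07],
]
width = len(channels[0])

# Per-tensor: a single scale shared by every value, set by the outlier.
flat = [v for ch in channels for v in ch]
flat_restored = quantize_dequantize(flat)
restored_tensor = [flat_restored[:width], flat_restored[width:]]

# Per-channel: each channel gets its own scale.
restored_channel = [quantize_dequantize(ch) for ch in channels]

err_tensor_ch0 = mean_abs_error(channels[0], restored_tensor[0])
err_channel_ch0 = mean_abs_error(channels[0], restored_channel[0])

print(f"channel 0 error, per-tensor scale:  {err_tensor_ch0:.4f}")
print(f"channel 0 error, per-channel scale: {err_channel_ch0:.4f}")
```

With one shared scale, the outlier makes the grid so coarse that every value in the first channel rounds to zero. With per-channel scales, the first channel keeps its own fine grid. The channel that contains the outlier is still rounded coarsely, which is why some methods also treat the outliers themselves separately.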
Key idea
Quantization represents weights, and sometimes activations, in fewer bits to shrink and speed up models, trading a little precision for efficiency while managing outliers to limit accuracy loss.