Machine Learning

Model Quantization

Shrinking models by storing weights in fewer bits.

5 min read · advanced

The idea

Quantization stores model weights, and sometimes activations, in fewer bits, for example as 8-bit or 4-bit integers instead of 16-bit floats. Fewer bits per value means a smaller memory footprint and less data to move, which usually makes inference faster and cheaper.
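To make the size saving concrete, here is a rough back-of-the-envelope sketch. The 7-billion-parameter count is a hypothetical example, and the helper `model_size_gib` is an illustrative name; it counts only the raw weight bits and ignores scale factors and other metadata.

```python
def model_size_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and other metadata."""
    return num_params * bits_per_weight / 8 / 2**30


NUM_PARAMS = 7_000_000_000  # hypothetical model size, chosen for illustration

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: {model_size_gib(NUM_PARAMS, bits):.2f} GiB")
```

Halving the bit width halves the weight storage, so going from 16-bit to 4-bit cuts it to roughly a quarter.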

Why it works

Neural networks are surprisingly tolerant of low precision. Most of the information survives when values are rounded to a coarser grid, especially if the rounding is done carefully.

  • Post-training quantization converts an already trained model directly, with no further training
  • Quantization-aware training simulates low precision during training so the model adapts to it
  • A scale factor maps the real-valued range onto the integer grid
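The role of the scale factor can be sketched in a few lines. This is a minimal example of symmetric, per-tensor quantization, which is only one of several possible schemes; the function names and the sample weights are made up for illustration.

```python
def quantize(values, num_bits=8):
    """Symmetric per-tensor quantization: return integer codes and the scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Divide by the scale, round to the nearest integer, clamp to the grid.
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.03, 0.88]  # hypothetical weights
codes, scale = quantize(weights)
restored = dequantize(codes, scale)

for w, c, r in zip(weights, codes, restored):
    print(f"{w:+.4f} -> code {c:+4d} -> {r:+.4f}")
```

Each restored value differs from the original by at most half a scale step. The same quantize-then-dequantize round trip is, in many setups, what quantization-aware training inserts into the forward pass so the model sees the rounding error while it learns.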

Watch the outliers

The main risk is accuracy loss. A few weights or activations with very large values, called outliers, can dominate the range and force coarse rounding everywhere else. Some techniques handle these outliers separately, for example by keeping them at higher precision; others use per-channel scales so that one large value only affects its own channel. With care, 4-bit models can run on consumer hardware with little quality loss.
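The effect of an outlier is easy to see on a toy example. The sketch below assumes a signed 4-bit grid (integers from -7 to 7) and a hypothetical two-channel weight matrix; it compares one shared scale against a separate scale per channel. All names and values are illustrative.

```python
def scale_for(values, qmax=7):
    """Scale that maps the largest magnitude onto the edge of the grid."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize_dequantize(values, scale, qmax=7):
    """Round onto a signed integer grid, then map back to floats."""
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def max_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights: the second channel contains one large outlier.
channels = [
    [0.11, -0.20, 0.15, 0.05],
    [0.12, -0.08, 14.0, 0.03],
]

# Per-tensor: a single scale shared by every value, set by the outlier.
shared = scale_for([v for row in channels for v in row])
per_tensor_err = max_error(channels[0], quantize_dequantize(channels[0], shared))

# Per-channel: the first channel gets a scale fitted to its own range.
own = scale_for(channels[0])
per_channel_err = max_error(channels[0], quantize_dequantize(channels[0], own))

print(f"per-tensor error on channel 0:  {per_tensor_err:.4f}")
print(f"per-channel error on channel 0: {per_channel_err:.4f}")
```

With the shared scale, every value in the first channel rounds to zero, because the outlier stretches the grid step to 2.0. With its own scale, the same channel keeps its values to within half of a much smaller step.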

Key idea

Quantization represents weights in fewer bits to shrink and speed up models, trading a little precision while managing outliers to limit accuracy loss.

Check yourself

1. What is the main benefit of quantization?

2. What does quantization-aware training do?

3. Why are outliers a problem for quantization?