Machine Learning

Model Quantization for Inference

Converting weights to low-precision integers so models run faster and take less memory.

Trading precision for speed

Models are usually trained in floating point, but inference often does not need that much precision. Quantization maps weights and activations to low-precision formats such as 8-bit integers, which shrinks the model and speeds up the arithmetic.

Why it helps

  • A smaller memory footprint means less data to move, which eases bandwidth-bound layers.
  • Integer math typically runs faster and uses less energy than floating point on hardware with native low-precision integer support.
  • Cache efficiency improves because more values fit in fast memory.
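The footprint claim can be checked with simple arithmetic. The sketch below assumes a hypothetical model with 10 million parameters and compares 32-bit float storage with 8-bit integer storage. The function name `model_size_bytes` and the parameter count are illustrative, and the small overhead of storing scales and zero points is ignored.

```python
def model_size_bytes(num_params, bits_per_param):
    """Storage needed for the parameters alone, in bytes."""
    return num_params * bits_per_param // 8


# Hypothetical parameter count, chosen only for illustration.
num_params = 10_000_000

fp32_bytes = model_size_bytes(num_params, 32)  # 4 bytes per weight
int8_bytes = model_size_bytes(num_params, 8)   # 1 byte per weight
ratio = fp32_bytes / int8_bytes

print(f"float32: {fp32_bytes / 1e6:.0f} MB, int8: {int8_bytes / 1e6:.0f} MB, ratio: {ratio:.0f}x")
```

Under these assumptions the 8-bit copy needs a quarter of the bytes, so a quarter as much data has to cross the memory bus per pass over the weights.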

The mapping

Quantization stores a scale and sometimes a zero point that convert between the integer grid and real values. An approximate float is recovered from a stored integer q by subtracting the zero point (when one is used) and multiplying by the scale: x ≈ scale * (q - zero_point).
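The mapping can be sketched in a few lines of plain Python. This is a minimal illustration rather than any particular library's implementation. It assumes an unsigned 8-bit grid (0 to 255) and takes the scale from the minimum and maximum of the values; the helper names `choose_qparams`, `quantize`, and `dequantize` are made up for this example.

```python
def choose_qparams(values, num_bits=8):
    """Pick a scale and zero point that map [min, max] onto the integer grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:  # all values are zero; any positive scale works
        return 1.0, 0
    scale = (hi - lo) / (qmax - qmin)
    zero_point = round(qmin - lo / scale)
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, num_bits=8):
    """Real values -> integers: divide by scale, round, shift, clamp to the grid."""
    qmin, qmax = 0, 2 ** num_bits - 1
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Integers -> approximate real values: x ~= scale * (q - zero_point)."""
    return [scale * (q - zero_point) for q in q_values]


weights = [-1.0, -0.25, 0.0, 0.5, 2.0]
scale, zero_point = choose_qparams(weights)
q = quantize(weights, scale, zero_point)
recovered = dequantize(q, scale, zero_point)

for w, qi, r in zip(weights, q, recovered):
    print(f"{w:+.4f} -> {qi:3d} -> {r:+.4f}")
```

For values inside the covered range, the round trip is off by at most half a scale step, which is the rounding error that quantization trades for the smaller format.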

Two main styles

  • Post-training quantization converts an already trained model, sometimes using a small calibration set to choose the scales and zero points.
  • Quantization-aware training simulates the rounding during training so the model learns to tolerate it, usually giving better accuracy at low bit widths.
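The simulated rounding in quantization-aware training is often implemented as a "fake quantize" step: a value is quantized and immediately dequantized, so the forward pass sees the rounding error while the tensor stays in floating point. The sketch below shows only that forward step, under assumed settings (a 4-bit unsigned grid, a hand-picked scale of 0.1 and zero point of 8). The name `fake_quantize` is illustrative. Real training frameworks also need a rule for passing gradients through the rounding, commonly a straight-through estimator, which is not shown here.

```python
def fake_quantize(values, scale, zero_point, num_bits=8):
    """Quantize then dequantize, so the output carries the rounding error."""
    qmin, qmax = 0, 2 ** num_bits - 1
    out = []
    for v in values:
        q = round(v / scale) + zero_point      # snap to the integer grid
        q = max(qmin, min(qmax, q))            # clamp to the representable range
        out.append(scale * (q - zero_point))   # return to floating point
    return out


# Assumed settings for illustration: a 4-bit grid covering roughly [-0.8, 0.7].
activations = [0.23, -0.51, 2.0]
simulated = fake_quantize(activations, scale=0.1, zero_point=8, num_bits=4)
print(simulated)  # roughly [0.2, -0.5, 0.7]; 2.0 is clamped to the top of the range
```

Because the loss is computed on these perturbed values, training can adjust the weights to reduce the damage from rounding and clamping. Post-training quantization applies the same mapping only after training has finished.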

Key idea

Quantization maps weights and activations to low-precision integers using a scale and, optionally, a zero point, cutting memory use and accelerating inference at a usually small cost in accuracy.

Check yourself

1. What does quantization store to convert between integers and real values?

2. How does quantization-aware training improve accuracy at low bit widths?