Machine Learning

Int8 Quantization for Inference

Shrink weights to 8-bit integers for faster, cheaper serving.

5 min read · core

Why shrink the numbers

Models normally store weights as 16-bit or 32-bit floating-point numbers. Quantization maps those values to small integers, most commonly 8-bit integers (int8), so the model takes far less memory and, on hardware that supports it, runs faster.

How int8 works

A range of float values is mapped to the integers from -128 to 127 using a scale factor: each float is divided by the scale and rounded to the nearest integer. At inference, the integers feed fast integer math, and the scale converts the results back to the right magnitude.
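The mapping can be sketched in a few lines of plain Python. This is a minimal illustration, assuming symmetric quantization with one scale per tensor. The helper names (`compute_scale`, `quantize`, `dequantize`, `int_dot`) and the sample values are made up for this example and do not come from any library.

```python
def compute_scale(values, qmax=127):
    """Symmetric scale: the largest magnitude maps to qmax."""
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def quantize(values, scale, qmin=-128, qmax=127):
    """Divide by the scale, round to the nearest integer, clamp to the int8 range."""
    return [max(qmin, min(qmax, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Multiply by the scale to get back approximate floats."""
    return [q * scale for q in q_values]


def int_dot(q_a, q_b):
    """Integer-only dot product (real kernels accumulate in a wider integer type)."""
    return sum(a * b for a, b in zip(q_a, q_b))


# Hypothetical sample values, chosen only for illustration.
weights = [0.5, -1.27, 0.03, 1.0]
activations = [1.0, 0.5, -2.0, 0.25]

w_scale = compute_scale(weights)
a_scale = compute_scale(activations)
q_w = quantize(weights, w_scale)
q_a = quantize(activations, a_scale)

# Integer math first, then one float multiply restores the magnitude.
approx = int_dot(q_w, q_a) * (w_scale * a_scale)
exact = sum(w * a for w, a in zip(weights, activations))

print("int8 weights:", q_w)
print("approximate dot product:", approx)
print("exact dot product:", exact)
```

The two printed dot products should be close but not identical. The gap is the rounding error that quantization introduces.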

What you gain

  • Weights about four times smaller than 32-bit floats.
  • Faster matrix math on hardware that has int8 units.
  • Lower memory bandwidth, which is often the real bottleneck.
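To make the size saving concrete, here is a rough back-of-the-envelope calculation. The 7-billion-parameter count is a hypothetical example, and the sketch ignores the small extra storage needed for the scale factors themselves.

```python
def weight_bytes(num_params, bits_per_weight):
    """Storage for the weights alone, in bytes."""
    return num_params * bits_per_weight // 8


num_params = 7_000_000_000  # hypothetical model size, for illustration only

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8)]:
    gib = weight_bytes(num_params, bits) / 2**30
    print(f"{name}: {gib:.1f} GiB")

# fp32: 26.1 GiB
# fp16: 13.0 GiB
# int8: 6.5 GiB
```

Fewer bytes per weight also means fewer bytes moved from memory for every layer, which is where the bandwidth saving comes from.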

What you risk

  • Rounding many distinct float values onto only 256 integer levels loses precision.
  • Some layers are sensitive, and quantizing them degrades accuracy noticeably.
  • A poorly chosen scale factor makes the loss worse; calibration on sample data picks good scale factors to limit the damage.

Key idea

Int8 quantization replaces float weights with scaled integers, cutting memory and speeding up math at the cost of some precision. Calibrating the scale factors and leaving sensitive layers in higher precision keeps accuracy close to the original.

Check yourself

Answer to earn rating on the learn ladder.

1. What does a scale factor do in int8 quantization?

2. Why is calibration used when quantizing a model?