Int8 Quantization for Inference

Why shrink the numbers

Models normally store weights as 16-bit or 32-bit floating-point numbers. Quantization maps those values to small integers, most commonly 8-bit signed integers (int8), so the model takes far less memory and, on hardware with int8 support, can run faster.

How int8 works

A range of float values is mapped to the integers from -128 to 127 using a scale factor: each float is divided by the scale, rounded to the nearest integer, and clamped to that range. At inference time the arithmetic runs on the integers, and multiplying by the scale converts the results back to the right magnitude.
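The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular library implements it. It assumes a symmetric scheme, where the largest magnitude maps to 127 and there is no zero-point offset. The function names (quantize, dequantize, symmetric_scale, int8_dot) and the sample values are made up for the example.

```python
def symmetric_scale(values):
    """Choose a scale so the largest magnitude lands on 127 (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    return max_abs / 127 if max_abs > 0 else 1.0


def quantize(values, scale):
    """Divide by the scale, round, and clamp to the int8 range.

    Python's round() rounds exact halves to the nearest even integer.
    """
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Recover approximate floats by multiplying the integers by the scale."""
    return [q * scale for q in q_values]


def int8_dot(q_a, scale_a, q_b, scale_b):
    """Dot product on integers, then one float multiply to restore magnitude."""
    acc = sum(a * b for a, b in zip(q_a, q_b))  # integer-only accumulation
    return acc * scale_a * scale_b


# Hypothetical sample values, chosen only for illustration.
weights = [0.50, -1.27, 0.03, 1.00]
activations = [2.0, 0.5, -1.2, 1.5]

w_scale = symmetric_scale(weights)
a_scale = symmetric_scale(activations)
q_w = quantize(weights, w_scale)
q_a = quantize(activations, a_scale)

exact = sum(w * a for w, a in zip(weights, activations))
approx = int8_dot(q_w, w_scale, q_a, a_scale)

print("quantized weights:", q_w)
print("float result:", exact)
print("int8 result: ", approx)
```

The two results differ slightly, and that gap is the rounding error introduced by quantization. Real int8 kernels typically accumulate the integer products in a wider integer type so the sum does not overflow. Python integers do not overflow, so the sketch does not need to model that.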

What you gain

Weights about four times smaller than 32-bit floats, and about two times smaller than 16-bit floats.
Faster matrix math on hardware that has int8 units.
Lower memory bandwidth, which is often the real bottleneck.
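The size reduction is simple arithmetic, as the rough sketch below shows. The parameter count is a hypothetical figure chosen for illustration, and weight_bytes is a made-up helper. The calculation ignores the small overhead of storing the scale factors themselves.

```python
# Bytes needed to store one value in each format.
BYTES_PER_VALUE = {"float32": 4, "float16": 2, "int8": 1}


def weight_bytes(num_params, dtype):
    """Storage needed for the weights alone, in bytes."""
    return num_params * BYTES_PER_VALUE[dtype]


num_params = 7_000_000_000  # hypothetical model size, for illustration only

for dtype in ("float32", "float16", "int8"):
    gib = weight_bytes(num_params, dtype) / 2**30
    print(f"{dtype}: {gib:.1f} GiB")
```

Because every weight has to be read from memory during inference, the same ratio applies to the bytes moved per forward pass, which is where the bandwidth saving comes from.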

What you risk

Rounding many values to few integers loses precision.
Some layers are sensitive, and quantizing them degrades accuracy noticeably.
Calibration on sample data picks good scale factors to limit the damage.
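To see why the choice of scale matters, the sketch below compares two ways of picking it from sample data. Percentile clipping is only one of several possible calibration approaches, used here as an assumption for illustration. The function names and the synthetic samples (evenly spaced values in [-1, 1] plus a single outlier) are likewise made up.

```python
def quantize(values, scale):
    """Divide by the scale, round, and clamp to the int8 range."""
    return [max(-128, min(127, round(v / scale))) for v in values]


def dequantize(q_values, scale):
    """Multiply the integers by the scale to get approximate floats back."""
    return [q * scale for q in q_values]


def mean_abs_error(values, scale):
    """Average gap between each value and its quantized-then-restored copy."""
    restored = dequantize(quantize(values, scale), scale)
    return sum(abs(v - r) for v, r in zip(values, restored)) / len(values)


def max_abs_scale(samples):
    """Cover the full observed range, outliers included."""
    return max(abs(v) for v in samples) / 127


def percentile_scale(samples, fraction=0.99):
    """Cover only the magnitude below which `fraction` of samples fall.

    Larger values get clamped to -128 or 127 when quantized.
    """
    magnitudes = sorted(abs(v) for v in samples)
    index = min(len(magnitudes) - 1, int(fraction * len(magnitudes)))
    return magnitudes[index] / 127


# Synthetic calibration data: 1000 typical values plus one large outlier.
typical = [(i - 500) / 500 for i in range(1000)]
samples = typical + [50.0]

full_range = max_abs_scale(samples)
clipped = percentile_scale(samples)

print("max-abs scale:   ", full_range)
print("percentile scale:", clipped)
print("typical-value error, max-abs:   ", mean_abs_error(typical, full_range))
print("typical-value error, percentile:", mean_abs_error(typical, clipped))
```

With the max-abs scale, the single outlier stretches the range so far that the typical values share only a handful of integer levels. The percentile scale clamps the outlier and spends the 256 levels on the values that occur most often. Whether clamping outliers is acceptable depends on the layer, which is one reason sensitive layers are sometimes left in higher precision.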

Key idea

Int8 quantization replaces float weights with scaled integers, cutting memory and speeding math at the cost of some precision. Calibration and leaving sensitive layers in higher precision keep accuracy close to the original.
