Quantization to Int8 and Int4

Shrinking model weights from floating point to low precision integers to save memory.

The idea

Large models store weights as 16 or 32 bit floating point numbers. Quantization maps those values into low precision integers, such as 8 bit or 4 bit, so each weight takes far less memory. An int8 weight uses a quarter of the bytes of a 32 bit float and half the bytes of a 16 bit float. An int4 weight uses an eighth and a quarter, respectively.
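
To make the savings concrete, the sketch below computes the raw weight storage at each bit width. The 7 billion parameter count and the helper name `weight_bytes` are illustrative assumptions, not figures from this lesson.

```python
def weight_bytes(num_weights: int, bits_per_weight: int) -> float:
    """Bytes needed to store num_weights values at the given bit width."""
    return num_weights * bits_per_weight / 8


# Hypothetical model size, chosen only for illustration.
num_weights = 7_000_000_000

for name, bits in [("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)]:
    gib = weight_bytes(num_weights, bits) / 2**30
    print(f"{name}: {gib:.1f} GiB")
```

These figures count the weights only. A real quantized model also stores a scale (and possibly a zero point) per group of weights, which adds a small overhead.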

How the mapping works

Quantization picks a scale and sometimes a zero point to convert a range of floats into a small integer range:

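1. Find the range of values in a group of weights.
2. Choose a scale so the largest magnitude maps to the edge of the integer range (127 for int8, 7 for int4 when the range is symmetric).
3. Divide each weight by the scale and round the result to the nearest integer.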
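
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration of the symmetric variant, which uses a scale but no zero point. The function name `quantize_symmetric` and the sample values are assumptions made for the example, not part of any library.

```python
def quantize_symmetric(weights, bits=8):
    """Return (integers, scale) for one group of float weights."""
    qmax = 2 ** (bits - 1) - 1  # 127 for int8, 7 for int4

    # Step 1: find the range of the group (largest magnitude).
    max_abs = max(abs(w) for w in weights)

    # Step 2: choose a scale so that max_abs maps to qmax.
    scale = max_abs / qmax if max_abs > 0 else 1.0

    # Step 3: divide by the scale, round, and clamp to the integer range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


group = [0.42, -1.27, 0.05, 0.9]  # illustrative weights
q, scale = quantize_symmetric(group, bits=8)
print(q, scale)
```

Python's built-in `round` resolves exact ties to the nearest even integer. Production kernels may use a different tie rule, but the difference only matters for values that land exactly halfway between two integers.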

At inference the integers are multiplied by the scale to approximate the originals. The gap between an original value and its recovered value is the quantization error.
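
As a rough illustration of the round trip, the sketch below quantizes a few values, recovers them, and measures the error. The helper name `dequantize` and the sample numbers are assumptions for the example. For symmetric rounding with no clipping, the error of each weight should not exceed half the scale.

```python
def dequantize(q, scale):
    """Recover approximate floats from integers and their shared scale."""
    return [qi * scale for qi in q]


original = [0.423, -1.27, 0.048, 0.901]  # illustrative weights
scale = 1.27 / 127                       # largest magnitude maps to 127

q = [round(w / scale) for w in original]
recovered = dequantize(q, scale)

# Quantization error: gap between each original and its recovered value.
errors = [abs(w - r) for w, r in zip(original, recovered)]
max_error = max(errors)
print(q, recovered, max_error)
```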

Trade offs

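Lower precision saves memory and can speed up arithmetic on hardware with fast integer support, but it adds rounding noise. Int8 usually keeps accuracy close to the original. Int4 saves more but risks visible quality loss unless the method protects the most sensitive weights, for example by giving small groups of weights their own scale or by keeping outlier weights in higher precision.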
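
The noise difference between the two bit widths can be seen on synthetic data. The sketch below is a toy comparison under stated assumptions: the weights are random Gaussian values rather than weights from a real model, and `roundtrip_error` is a hypothetical helper. It measures reconstruction error only, which is not the same as the effect on model accuracy.

```python
import random


def roundtrip_error(weights, bits):
    """Mean absolute error after symmetric quantize then dequantize."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    recovered = [round(w / scale) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, recovered)) / len(weights)


random.seed(0)
# Synthetic stand-in for one group of model weights.
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]

err8 = roundtrip_error(weights, bits=8)
err4 = roundtrip_error(weights, bits=4)
print(f"int8 mean error: {err8:.6f}")
print(f"int4 mean error: {err4:.6f}")
```

For the same range, the int4 step size is 127/7 (about 18) times the int8 step size, so a much larger error at 4 bits is expected here.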

Key idea

Quantization replaces floating point weights with low bit integers using a scale, cutting memory at the cost of small rounding errors.
