The idea
Large models typically store weights as 16- or 32-bit floating-point numbers. Quantization maps those values to low-precision integers, such as 8-bit or 4-bit, so each weight takes far less memory. An int8 weight uses a quarter of the bytes of a 32-bit float (half those of a 16-bit float), and an int4 weight uses an eighth (a quarter).
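The arithmetic behind those ratios can be sketched in a few lines. This is a rough illustration only: the function name `weight_memory_gib` and the 7-billion-weight model size are assumptions made up for the example, and the estimate ignores the small extra storage needed for scales and zero points.

```python
# Bits per weight for a few common formats (scale and zero-point storage ignored).
BITS_PER_WEIGHT = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}


def weight_memory_gib(num_weights: int, fmt: str) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = num_weights * BITS_PER_WEIGHT[fmt] / 8
    return total_bytes / 2**30


# Hypothetical model size, chosen only for illustration.
NUM_WEIGHTS = 7_000_000_000

for fmt in BITS_PER_WEIGHT:
    print(f"{fmt}: {weight_memory_gib(NUM_WEIGHTS, fmt):.1f} GiB")
```

Under these assumptions the fp32 weights need roughly 26 GiB, and each halving of the bit width halves that figure.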
How the mapping works
Quantization picks a scale, and sometimes a zero point (an integer offset that lets a float range not centered on zero use the whole integer range), to convert a range of floats into a small integer range:
- Find the range of values in a group of weights.
- Choose a scale so the largest-magnitude values map to the edges of the integer range.
- Divide each weight by the scale (adding the zero point if one is used) and round the result to the nearest integer.
At inference, the integers are multiplied by the scale (after subtracting any zero point) to approximate the originals. The gap between an original value and its recovered value is the quantization error.
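The steps above can be sketched as a small round trip. This is a minimal sketch, assuming the simplest variant: symmetric quantization with one shared scale and no zero point. The function names and the sample weights are invented for illustration, and real libraries work on tensors rather than Python lists.

```python
def quantize_symmetric(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for int8, 7 for int4
    max_abs = max(abs(w) for w in weights)      # range of the group
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Divide by the scale, round, and clamp to the integer range.
    # Note: Python's round() rounds exact halves to the nearest even integer.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Scale the integers back to approximate the original floats."""
    return [v * scale for v in q]


# Made-up weights for illustration.
weights = [0.42, -1.27, 0.05, 0.88, -0.31]

q, scale = quantize_symmetric(weights)          # q -> [42, -127, 5, 88, -31]
recovered = dequantize(q, scale)
errors = [abs(w - r) for w, r in zip(weights, recovered)]

print("integers: ", q)
print("scale:    ", scale)
print("max error:", max(errors))
```

With a single scale and no clamping, each weight's error is at most half a quantization step, that is `scale / 2`, so a wider range of values in the group means a larger step and larger errors.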
Trade-offs
Lower precision saves memory and can speed up arithmetic, but it adds rounding noise. Int8 usually keeps accuracy close to the original. Int4 saves more but risks visible quality loss unless the quantization method takes extra care with the most sensitive weights, for example by keeping them at higher precision.
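A toy comparison makes the trade-off concrete. This is a sketch under stated assumptions, not a benchmark: the weights are made up, the helper `roundtrip_error` is a hypothetical name, and splitting the weights into smaller groups with their own scales stands in here as one simple way to limit int4 error. It is not a claim about which methods any particular library uses.

```python
def roundtrip_error(weights, bits):
    """Mean absolute error after symmetric quantize -> dequantize with one scale."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    recovered = [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]
    return sum(abs(w - r) for w, r in zip(weights, recovered)) / len(weights)


# Made-up weights: mostly small values plus one large outlier at the end.
weights = [0.02, -0.05, 0.11, -0.08, 0.04, -0.13, 0.07, 1.90]

err8 = roundtrip_error(weights, bits=8)
err4 = roundtrip_error(weights, bits=4)

# Same 4-bit budget, but two groups of four weights, each with its own scale.
# The group without the outlier gets a much finer step size.
groups = [weights[:4], weights[4:]]
err4_grouped = sum(roundtrip_error(g, 4) * len(g) for g in groups) / len(weights)

print(f"int8, one scale:   {err8:.4f}")
print(f"int4, one scale:   {err4:.4f}")
print(f"int4, two groups:  {err4_grouped:.4f}")
```

In this example the single outlier stretches the int4 scale so far that every small weight rounds to zero. Giving the first four weights their own scale recovers part of the lost precision, at the cost of storing one extra scale per group.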
Key idea
Quantization replaces floating-point weights with low-bit integers using a scale, cutting memory at the cost of small rounding errors.