The memory problem
A single embedding can be hundreds of floating point numbers. Store millions of them and memory explodes. Product quantization, or PQ, compresses each vector into a small code, often a handful of bytes, so huge collections fit in RAM.
How it works
PQ splits a vector into several subvectors. For each subvector position it learns a small codebook of representative points. Each subvector is then replaced by the id of its nearest codebook entry, so the whole vector becomes a short list of ids.
Searching with codes
Distances are estimated from the codes using precomputed lookup tables, which is far cheaper than working with full vectors. The result is approximate, since the code only stores a representative point, not the exact value.
The tradeoff
- More subvectors and larger codebooks mean better accuracy but bigger codes.
- Aggressive compression saves memory but lowers recall.
PQ is often combined with IVF, where IVF narrows the candidates and PQ keeps them tiny.
Key idea
Product quantization splits vectors, replaces each part with a codebook id, and estimates distances from these small codes, trading accuracy for large memory savings.