The KV Cache Optimization Deep Dive

What the cache holds

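During autoregressive generation, each new token attends to all previous tokens. To avoid recomputing them at every step, the model stores each past token's key and value vectors, for every layer and attention head, in the KV cache. The cache grows linearly with every generated token, and at long context lengths or large batch sizes it can take up more memory than the model weights.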
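The growth described above can be made concrete with a short calculation. The following is a minimal sketch, assuming a standard attention layout with one key and one value vector per token, per head, per layer. The function name and the example configuration are illustrative assumptions, not taken from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to hold the keys and values of every cached token."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len * batch_size


# Assumed configuration for illustration: 32 layers, 32 KV heads,
# head dimension 128, 16-bit values, one sequence of 4096 tokens.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30, "GiB")  # 2.0 GiB
```

Under these assumptions each token costs 512 KiB, so a batch of 16 such sequences would need 32 GiB for the cache alone.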

Why it is the bottleneck

Each decoding step produces one token but must read the entire cache from memory. The arithmetic per byte read is small, so generation is typically limited by memory bandwidth rather than raw compute. Shrinking the cache, or reading it more efficiently, therefore speeds up serving directly.
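A rough bound illustrates the bandwidth argument. This sketch assumes that every cached byte, plus the model weights, must cross the memory bus once per decode step and that compute is free. Both are simplifications. The `weight_bytes` term and all numbers are assumptions added for illustration.

```python
def min_step_seconds(cache_bytes, weight_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step when limited only by memory traffic."""
    return (cache_bytes + weight_bytes) / bandwidth_bytes_per_s


def max_tokens_per_second(cache_bytes, weight_bytes, bandwidth_bytes_per_s):
    """Upper bound on single-sequence decode throughput under the same model."""
    return 1.0 / min_step_seconds(cache_bytes, weight_bytes, bandwidth_bytes_per_s)


# Assumed numbers: 14 GB of weights, 1 TB/s of memory bandwidth.
full = max_tokens_per_second(8e9, 14e9, 1e12)
halved = max_tokens_per_second(4e9, 14e9, 1e12)
print(f"{full:.1f} tok/s with an 8 GB cache, {halved:.1f} tok/s with 4 GB")
```

In this model the bound depends only on bytes moved, which is why halving the cache raises the ceiling even though the amount of arithmetic is unchanged.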

Ways to shrink it

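Fewer key-value heads: grouped-query or multi-query attention lets several query heads share one key-value head, so fewer heads are cached.
Lower precision: keys and values are stored in quantized or otherwise compressed form, for example 8-bit integers instead of 16-bit floats.
Eviction: tokens unlikely to matter are dropped, while attention sinks (the first few tokens, which tend to receive a large share of attention) are kept.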
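The first two techniques can be sketched in a few lines. This is a toy illustration, assuming symmetric absmax quantization with one scale per vector. Real systems use a range of quantization schemes, and the function names here are hypothetical.

```python
def gqa_cache_fraction(num_query_heads, num_kv_heads):
    """Fraction of the multi-head cache left when query heads share KV heads."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("query heads must divide evenly into KV groups")
    return num_kv_heads / num_query_heads


def quantize_int8(vector):
    """Symmetric absmax quantization: one float scale plus int8 codes."""
    absmax = max(abs(x) for x in vector)
    scale = absmax / 127 if absmax > 0 else 1.0
    codes = [round(x / scale) for x in vector]
    return scale, codes


def dequantize_int8(scale, codes):
    """Recover approximate values; each is off by at most about scale / 2."""
    return [c * scale for c in codes]


print(gqa_cache_fraction(32, 8))  # 0.25, i.e. 8 KV heads shared by 32 query heads

key = [0.5, -1.27, 0.02, 1.0]     # made-up key vector
scale, codes = quantize_int8(key)
restored = dequantize_int8(scale, codes)
print(codes, max(abs(a - b) for a, b in zip(key, restored)))
```

The two savings multiply: going from 32 to 8 key-value heads and from 16-bit to 8-bit storage cuts the cache to roughly one eighth of its original size, ignoring the small overhead of the stored scales.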

Paged management

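Serving systems treat the cache like virtual memory, splitting it into fixed-size pages and mapping each request's logical token positions to physical pages through a block table. Because no request needs one large contiguous region, many requests can share memory with little fragmentation. This lets a server pack in more concurrent users and reuse the pages of a shared prefix, such as a common system prompt, across requests.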
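The bookkeeping behind paging can be sketched with a toy allocator. This is a simplified illustration, not the API of any serving system: the class `PagedKVCache` and its methods are hypothetical, it tracks page ownership only (no tensors), and it shares only completely filled pages so that no copy-on-write is needed.

```python
class PagedKVCache:
    """Toy block table mapping each request to fixed-size physical pages."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        self.free_pages = list(range(num_pages))
        self.ref_counts = {}    # physical page -> number of requests using it
        self.block_tables = {}  # request id -> list of physical pages
        self.lengths = {}       # request id -> number of tokens stored

    def add_request(self, request_id, shared_from=None):
        """Start a request, optionally reusing another request's full pages."""
        pages, length = [], 0
        if shared_from is not None:
            full = self.lengths[shared_from] // self.page_size
            pages = list(self.block_tables[shared_from][:full])
            for page in pages:
                self.ref_counts[page] += 1
            length = full * self.page_size
        self.block_tables[request_id] = pages
        self.lengths[request_id] = length

    def append_token(self, request_id):
        """Reserve room for one token, taking a new page only at a page boundary."""
        if self.lengths[request_id] % self.page_size == 0:
            if not self.free_pages:
                raise MemoryError("no free pages")
            page = self.free_pages.pop()
            self.ref_counts[page] = 1
            self.block_tables[request_id].append(page)
        self.lengths[request_id] += 1

    def free_request(self, request_id):
        """Release a request; a page returns to the pool when nobody uses it."""
        for page in self.block_tables.pop(request_id):
            self.ref_counts[page] -= 1
            if self.ref_counts[page] == 0:
                del self.ref_counts[page]
                self.free_pages.append(page)
        del self.lengths[request_id]


cache = PagedKVCache(num_pages=8, page_size=4)
cache.add_request("a")
for _ in range(10):
    cache.append_token("a")               # "a" now occupies 3 pages
cache.add_request("b", shared_from="a")   # "b" reuses the 2 full pages of "a"
cache.append_token("b")                   # and gets 1 private page
print(len(cache.free_pages))              # 4
```

Without sharing, the two requests would need six pages. With the shared prefix they need four, and the most a request can waste is the unused tail of its last page.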

The tradeoffs

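Aggressive shrinking risks dropping a token that later proves important, or rounding away precision the model needed. Designs therefore balance memory savings against output quality, often keeping the most recent tokens and the attention sinks at full fidelity and compressing or evicting only the tokens in between.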
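One way to express that balance is a selection rule that always protects sinks and recent tokens and spends the remaining budget on the highest-scoring middle tokens. This is a minimal sketch under stated assumptions: `scores` is a hypothetical per-token importance value (for example, accumulated attention weight), and the function is illustrative rather than any published policy.

```python
def select_kept_positions(scores, num_sinks, recent_window, budget):
    """Return the sorted token positions to keep within a fixed cache budget."""
    n = len(scores)
    if n <= budget:
        return list(range(n))  # everything fits, nothing to evict

    # Sinks (first tokens) and the recent window are never evicted.
    protected = set(range(min(num_sinks, n)))
    protected |= set(range(max(0, n - recent_window), n))
    if len(protected) > budget:
        raise ValueError("budget is smaller than sinks plus recent window")

    # Fill the remaining slots with the most important middle tokens.
    middle = [i for i in range(n) if i not in protected]
    middle.sort(key=lambda i: scores[i], reverse=True)
    extra = middle[: budget - len(protected)]
    return sorted(protected | set(extra))


scores = [0.9, 0.8, 0.1, 0.7, 0.05, 0.2, 0.3, 0.4]  # made-up importance values
print(select_kept_positions(scores, num_sinks=2, recent_window=2, budget=5))
# [0, 1, 3, 6, 7]
```

The risk named above is visible here: positions 2, 4, and 5 are gone for good, so if a later token needs one of them, its contribution cannot be recovered. Raising `budget` trades memory for a lower chance of that happening.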

Key idea

The KV cache stores past keys and values so they are not recomputed, but reading it at every step makes generation bandwidth bound. Optimizing it through fewer key-value heads, lower precision, eviction, and paged sharing shrinks memory and speeds up serving, at the risk of dropping a token that is needed later.