The memory problem
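During generation, a model stores the keys and values of every past token in the kv cache. The cache grows with the number of layers, the number of key and value heads, and the sequence length, so with many heads it becomes large, and reading it at every step often dominates inference cost. Grouped query attention shrinks it.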
The idea
Standard multi head attention gives every query head its own key head and value head. Grouped query attention keeps many query heads but lets each small group of query heads share one key head and one value head.
- Many query heads, as before.
- Fewer key and value heads, shared within a group.
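The sharing described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `grouped_query_attention`, the tensor layout `(heads, seq, head_dim)`, the causal mask, and the choice to group consecutive query heads together are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v):
    """q: (n_q_heads, seq, head_dim); k, v: (n_kv_heads, seq, head_dim)."""
    n_q_heads, seq_len, head_dim = q.shape
    n_kv_heads = k.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Assumed grouping: query head i reads key/value head i // group_size.
    # np.repeat only expands K and V for the computation; the cache itself
    # would still hold just n_kv_heads heads.
    k_shared = np.repeat(k, group_size, axis=0)
    v_shared = np.repeat(v, group_size, axis=0)

    scores = q @ k_shared.transpose(0, 2, 1) / np.sqrt(head_dim)

    # Causal mask: a token may not attend to later tokens.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)

    return softmax(scores) @ v_shared

rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))   # 8 query heads
k = rng.standard_normal((2, 5, 16))   # 2 key heads, each shared by 4 query heads
v = rng.standard_normal((2, 5, 16))   # 2 value heads
out = grouped_query_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

The output still has one slice per query head, so the rest of the model sees the same shape as with multi head attention; only the number of stored key and value heads changes.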
The tradeoff
Fewer key and value heads mean a smaller kv cache and less memory traffic per step, which speeds up generation. Quality drops only slightly compared to full multi head attention, and by far less than with multi query attention, the more extreme approach in which all query heads share a single key and value head.
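To make the memory saving concrete, here is a rough back-of-the-envelope calculation. The model dimensions (32 layers, head dimension 128, 4096 tokens, 2 bytes per stored value) and the helper name `kv_cache_bytes` are hypothetical, chosen only for illustration, and the estimate ignores any framework overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_value=2):
    # Factor 2: one tensor for keys and one for values, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

# Hypothetical model shape, identical across the three variants.
config = dict(n_layers=32, head_dim=128, seq_len=4096)

variants = [
    ("multi head", 32),     # one key/value head per query head
    ("grouped query", 8),   # 4 query heads share each key/value head
    ("multi query", 1),     # all query heads share one key/value head
]
for name, n_kv_heads in variants:
    size = kv_cache_bytes(n_kv_heads=n_kv_heads, **config)
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumed numbers the cache comes to 2048 MiB, 512 MiB, and 64 MiB respectively: the size scales linearly with the number of key and value heads, and so does the data read at each generation step.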
Where it sits
Grouped query attention is a middle ground. It interpolates between full multi head attention, where every head is independent, and multi query attention, where all query heads share a single key and value head.
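One way to see the interpolation is to look at which key and value head each query head reads as the number of key and value heads varies. The sketch below assumes consecutive query heads are grouped together; the function names are hypothetical.

```python
def kv_head_for_query_head(query_head, n_q_heads, n_kv_heads):
    # Assumed convention: consecutive query heads form a group.
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return query_head // group_size

def head_mapping(n_q_heads, n_kv_heads):
    return [kv_head_for_query_head(h, n_q_heads, n_kv_heads)
            for h in range(n_q_heads)]

print(head_mapping(8, 8))  # [0, 1, 2, 3, 4, 5, 6, 7]  multi head attention
print(head_mapping(8, 2))  # [0, 0, 0, 0, 1, 1, 1, 1]  grouped query attention
print(head_mapping(8, 1))  # [0, 0, 0, 0, 0, 0, 0, 0]  multi query attention
```

With as many key and value heads as query heads the mapping is the identity, and with a single key and value head every query head maps to it, so both endpoints fall out of the same rule.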
Key idea
Grouped query attention keeps many query heads but shares each key and value head across a group of them, shrinking the kv cache and speeding generation while losing little quality compared to full multi head attention.