Grouped Query Attention

The memory problem

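During generation, a model stores the keys and values of every past token in what is called the KV cache. The cache grows with the number of key and value heads and with the sequence length, so with many heads it becomes large, and reading it at every step can dominate inference cost. Grouped-query attention shrinks it.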
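To make the scale concrete, here is a rough sketch of the cache-size arithmetic. The function name, its parameters, and the example configuration (32 layers, 32 key and value heads, head dimension 128, 4096 tokens, 2 bytes per element) are illustrative assumptions, not figures from any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size in bytes (hypothetical helper).

    The leading factor of 2 accounts for storing both keys and values.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed example configuration, chosen only for illustration.
size = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
print(size / 2**30)  # 2.0 (GiB) for a single sequence
```

The size is linear in the number of key and value heads, which is the term grouped-query attention reduces.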

The idea

Standard multi-head attention gives every query head its own key head and value head. Grouped-query attention keeps the same number of query heads but lets each small group of query heads share one key head and one value head.

Many query heads, as before.
Fewer key and value heads, shared within a group.
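The sharing described above can be sketched for a single decoding step in plain Python. This is a minimal illustration, not a production implementation: the function names, the nested-list data layout, and the rule that consecutive query heads form a group are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_decode_step(queries, keys, values):
    """Attention for one new token under grouped-query attention.

    queries: [n_q_heads][head_dim]           one query vector per query head
    keys:    [n_kv_heads][seq_len][head_dim] cached keys per key/value head
    values:  [n_kv_heads][seq_len][head_dim] cached values per key/value head
    Assumes consecutive query heads form a group (an illustrative convention).
    """
    n_q_heads, n_kv_heads = len(queries), len(keys)
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    head_dim = len(queries[0])

    outputs = []
    for h, q in enumerate(queries):
        kv = h // group_size  # every query head in a group reads the same K/V head
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
                  for k in keys[kv]]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values[kv]))
               for j in range(head_dim)]
        outputs.append(out)
    return outputs


# Four query heads share two key/value heads (group size 2).
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
keys = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, 1.0]]]
values = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
outputs = gqa_decode_step(queries, keys, values)
```

Only two key/value heads are cached here, yet the step still produces one output per query head. Query heads 0 and 1 have equal queries and share a key/value head, so their outputs are identical.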

The tradeoff

Fewer key and value heads mean a smaller KV cache and less memory traffic per step, which speeds up generation. Quality drops only slightly compared to full multi-head attention, and typically much less than with the more extreme approach of a single key and value head shared by all query heads.

Where it sits

Grouped-query attention is a middle ground. It interpolates between full multi-head attention, where every query head has its own key and value head, and multi-query attention, where all query heads share a single key and value head.
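The interpolation can be seen by varying only the number of key and value heads. The sketch below is illustrative: the helper name, the choice of 8 query heads, and the assumption that groups are made of consecutive query heads are not from the text above.

```python
def head_layout(n_q_heads, n_kv_heads):
    """Return (mapping, relative_cache) for a head configuration.

    mapping[h] is the key/value head read by query head h, assuming
    consecutive query heads are grouped together (illustrative convention).
    relative_cache is the KV cache size as a fraction of full multi-head
    attention with the same number of query heads.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    mapping = [h // group_size for h in range(n_q_heads)]
    return mapping, n_kv_heads / n_q_heads


print(head_layout(8, 8))  # multi-head:    ([0, 1, 2, 3, 4, 5, 6, 7], 1.0)
print(head_layout(8, 2))  # grouped-query: ([0, 0, 0, 0, 1, 1, 1, 1], 0.25)
print(head_layout(8, 1))  # multi-query:   ([0, 0, 0, 0, 0, 0, 0, 0], 0.125)
```

Setting the number of key and value heads equal to the number of query heads recovers multi-head attention, and setting it to one recovers multi-query attention.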

Key idea

Grouped-query attention keeps many query heads but shares each key and value head across a group of them, shrinking the KV cache and speeding up generation while losing little quality compared to full multi-head attention.
