One view is not enough
A single attention computation produces one set of weights, which is one way of mixing tokens. But language involves many kinds of relations at once: syntax, coreference, topic. Multi-head attention runs several attention operations in parallel so the model can attend in several ways at the same time.
How heads are formed
The model projects each token into query, key, and value vectors and splits each of those vectors into smaller pieces, one per head. This is equivalent to giving each head its own learned projections, so each head works in a different subspace of the representation. For example:
- Head one might track the subject of a verb.
- Head two might track nearby punctuation.
- Head three might track long-range topic words.
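The split described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes, the random weights, and the names `split_heads` and `W_q` are assumptions made for the example. It shows that slicing one wide projection into pieces gives the same result as giving each head its own smaller projection.

```python
import numpy as np

def split_heads(x, num_heads):
    """Reshape (seq_len, d_model) into (num_heads, seq_len, d_head)."""
    seq_len, d_model = x.shape
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    # Cut the feature axis into num_heads pieces, then move heads to the front.
    return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

# Illustrative sizes and random weights (assumptions, not from the text).
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
d_head = d_model // num_heads

x = rng.normal(size=(seq_len, d_model))    # token representations
W_q = rng.normal(size=(d_model, d_model))  # hypothetical learned query projection

q = split_heads(x @ W_q, num_heads)
print(q.shape)  # (2, 5, 4)

# Head 0 is the same as projecting with the first d_head columns of W_q,
# which is what "each head has its own learned projection" means here.
print(np.allclose(q[0], x @ W_q[:, :d_head]))  # True
```

The same split is applied to the key and value projections, so every head gets its own query, key, and value slices.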
Combining the heads
Each head produces its own output. These outputs are concatenated back into a full-width vector and passed through a final linear projection, which lets information from different heads mix. The total compute is similar to that of one wide head, because the model dimension is divided among the heads rather than duplicated for each one.
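To make the concatenate-and-project step concrete, here is a small NumPy sketch of the whole operation. It assumes standard scaled dot-product attention inside each head (the scaling by the square root of the head size is not discussed above), and the function names, sizes, and random weights are hypothetical choices for illustration only.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ W_q), split(x @ W_k), split(x @ W_v)

    # Assumption: scaled dot-product attention within each head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)                   # one attention pattern per head
    per_head = weights @ v                               # (heads, seq, d_head)

    # Concatenate the heads back into a full-width vector per token.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    # The final projection lets the heads' outputs mix.
    return concat @ W_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads)
print(out.shape)      # (5, 8)
print(weights.shape)  # (2, 5, 5)
```

Without the final projection, each slice of the output would depend on exactly one head; multiplying by `W_o` is what allows every output feature to draw on all heads.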
Why it helps
Splitting capacity across heads gives the model many simultaneous attention patterns without much extra cost, which empirically improves quality over a single head.
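A rough count supports the cost claim. The sketch below tallies projection parameters and the multiply-adds of the attention step for a given number of heads, assuming the head size is the model dimension divided by the number of heads and ignoring biases, softmax, and other small terms. The function name and the example sizes are hypothetical.

```python
def attention_cost(seq_len, d_model, num_heads):
    """Rough cost estimate; assumes d_head = d_model // num_heads, no biases."""
    d_head = d_model // num_heads
    return {
        # Query, key, value, and output projections, each d_model x d_model.
        "projection_params": 4 * d_model * d_model,
        # Per head: seq_len^2 * d_head multiply-adds for the scores,
        # and the same again for mixing the values.
        "attention_macs": 2 * num_heads * seq_len * seq_len * d_head,
        # One seq_len x seq_len weight matrix per head.
        "weight_entries": num_heads * seq_len * seq_len,
    }

one_head = attention_cost(seq_len=128, d_model=512, num_heads=1)
eight_heads = attention_cost(seq_len=128, d_model=512, num_heads=8)

print(one_head["projection_params"] == eight_heads["projection_params"])  # True
print(one_head["attention_macs"] == eight_heads["attention_macs"])        # True
print(eight_heads["weight_entries"] // one_head["weight_entries"])        # 8
```

Under these assumptions the parameter count and the matrix-multiply work do not change with the number of heads. What does grow is the number of attention weights that must be stored, one matrix per head, which is part of why the cost is "not much extra" rather than strictly identical.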
Key idea
Multi-head attention divides the representation among several heads that can each learn a distinct attention pattern, then concatenates and projects their outputs, capturing richer relations at roughly the cost of one wide head.