Why one head is not enough
A single attention head produces one set of attention weights per token, so it can emphasize only one pattern of relationships at a time. Language carries many relationships at once, such as syntax, coreference, and topic.
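To make "one set of weights" concrete, here is a minimal NumPy sketch of the weights a single head computes, assuming standard scaled dot-product attention. The names `single_head_weights`, `w_q`, and `w_k`, the random inputs, and the sizes are illustrative assumptions, not taken from any particular model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def single_head_weights(x, w_q, w_k):
    """Return the (seq_len, seq_len) attention weights of one head."""
    q = x @ w_q                                # queries, one row per token
    k = x @ w_k                                # keys, one row per token
    scores = q @ k.T / np.sqrt(q.shape[-1])    # scaled dot products
    return softmax(scores, axis=-1)            # each row sums to 1

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                        # illustrative sizes (assumption)
x = rng.normal(size=(seq_len, d_model))        # stand-in token embeddings
w_q = rng.normal(size=(d_model, d_model))      # random stand-ins for learned weights
w_k = rng.normal(size=(d_model, d_model))

weights = single_head_weights(x, w_q, w_k)
print(weights.shape)                           # one weight matrix for the whole sequence
```

Each token gets exactly one row of weights over the sequence, so whatever pattern that row encodes is the only one this head can express for that token.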
The parallel trick
Multi-head attention runs several attention computations in parallel, each with its own learned query, key, and value projections.
- Each head looks at the sequence through a different lens
- One head might track subject-verb agreement
- Another might link a word to nearby modifiers
- Each head works in a smaller slice of the embedding (roughly the embedding size divided by the number of heads), so the total cost stays close to that of one full-size head
After each head produces its output, the results are concatenated and passed through a final linear layer that mixes them back together.
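The split, attend, concatenate, and mix steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: no batch dimension, no masking, no bias terms, and random matrices standing in for learned weights. The function name `multi_head_attention` and the parameters `w_q`, `w_k`, `w_v`, `w_o` are hypothetical names chosen for this sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Project once, then slice the result so each head gets its own piece.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # All heads are computed in one batched matmul: (num_heads, seq_len, seq_len).
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                          # (num_heads, seq_len, d_head)

    # Concatenate the heads back to (seq_len, d_model), then mix with w_o.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4           # illustrative sizes (assumption)
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)
```

The output keeps the input shape, while `weights` holds one separate attention matrix per head. Those per-head matrices are what let different heads emphasize different relationships.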
The payoff
Because each head has its own parameters, the heads can specialize. The combined output is more expressive than any single head's, and because no head waits on another, the whole thing still runs efficiently on hardware built for parallel work, such as GPUs.
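The claim that splitting the embedding keeps the cost similar can be checked with simple arithmetic. The sketch below counts only the query, key, and value projection weights, ignoring biases and the final output layer. The sizes 512 and 8 are illustrative assumptions, and `projection_params` is a hypothetical helper.

```python
def projection_params(d_model, num_heads):
    """Count query, key, and value projection weights across all heads."""
    d_head = d_model // num_heads          # each head's slice of the embedding
    per_head = 3 * d_model * d_head        # Q, K, V matrices of shape (d_model, d_head)
    return num_heads * per_head

d_model = 512                              # illustrative embedding size (assumption)
single = projection_params(d_model, num_heads=1)
multi = projection_params(d_model, num_heads=8)
print(single, multi)                       # identical counts
```

Eight heads of width 64 use the same number of projection weights as one head of width 512, so the extra expressiveness comes from how the work is divided and not from a larger parameter budget.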
Key idea
Multi-head attention runs several attention heads in parallel so the model can capture many relationship types at once, then combines them.