Many attentions in parallel
A single attention operation computes one set of weights, so it can capture only one kind of relationship at a time. Multi-head attention runs several attention operations in parallel, each with its own query, key, and value projections, then concatenates their outputs.
How it splits the work
- The model dimension is divided evenly among the heads, so with h heads each one works in 1/h of the full dimension.
- Each head computes its own scaled dot-product attention independently.
- Outputs are concatenated and passed through a final output projection.
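The three steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `multi_head_attention`, the weight names `w_q`, `w_k`, `w_v`, `w_o`, and the random weights are all hypothetical, and batching, masking, and bias terms are left out for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "d_model must divide evenly among heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Step 1: project, then divide the model dimension among the heads.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Step 2: scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ v                                   # (heads, seq, d_head)

    # Step 3: concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v, w_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape, weights.shape)  # prints (5, 16) (4, 5, 5)
```

The output has the same shape as the input, while `weights` holds one separate attention pattern per head, which is what lets each head attend differently.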
Why multiple heads help
Different heads can specialize. One head might track syntactic agreement, another might link a word to a distant antecedent, and another might attend to the previous token. Running them in parallel lets the block model several relationship types at once, instead of forcing a single set of attention weights to cover them all.
The cost balance
Because each head works in only a fraction of the model dimension, the total compute across all heads stays close to that of a single full-width attention, so adding heads costs almost nothing extra. The concatenation and output projection then mix the heads back into a single representation.
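A rough count makes the balance concrete. The sketch below tallies multiply-accumulate operations for one attention block under simplifying assumptions: the helper name `attention_cost` is hypothetical, the single-head baseline is assumed to have its own output projection, and biases, masking, and memory traffic are ignored.

```python
def attention_cost(seq_len, d_model, num_heads):
    """Approximate operation counts for one attention block (illustrative)."""
    d_head = d_model // num_heads

    # Q, K, and V projections cover the full model dimension either way.
    projections = 3 * seq_len * d_model * d_model
    # Each head builds a (seq_len x seq_len) score matrix from d_head-wide vectors.
    scores = num_heads * seq_len * seq_len * d_head
    # Each head then takes a weighted sum of its d_head-wide value vectors.
    weighted_sum = num_heads * seq_len * seq_len * d_head
    # The output projection mixes the concatenated heads.
    output_projection = seq_len * d_model * d_model

    return {
        "matmul_macs": projections + scores + weighted_sum + output_projection,
        # Softmax runs over one score matrix per head, so this part does grow.
        "softmax_entries": num_heads * seq_len * seq_len,
    }

single = attention_cost(seq_len=128, d_model=512, num_heads=1)
multi = attention_cost(seq_len=128, d_model=512, num_heads=8)
print(single["matmul_macs"] == multi["matmul_macs"])  # prints True
print(multi["softmax_entries"] // single["softmax_entries"])  # prints 8
```

Under these assumptions the matrix-multiply work is identical for 1 head and 8 heads, because `num_heads * d_head` always equals `d_model`. Only the softmax, which is small next to the matrix multiplies, scales with the number of heads.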
Key idea
Multi-head attention runs several lower-dimensional attentions in parallel so the model captures many relationship types at once, then concatenates and projects them back together.