Multi-Head Attention Revisited

Many attentions in parallel

A single attention head computes one set of attention weights per query, so it can capture only one kind of relationship at a time. Multi-head attention runs several attention operations in parallel, each with its own query, key, and value projections, then concatenates their outputs.

How it splits the work

The model dimension is divided among the heads, so each head works in a lower dimension (with h heads, typically the model dimension divided by h).
Each head computes its own scaled dot-product attention independently.
The head outputs are concatenated and passed through a final output projection.
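
The three steps above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: it assumes a single unbatched sequence, no masking, no bias terms, and a model dimension that divides evenly by the number of heads. The names multi_head_attention, w_q, w_k, w_v, and w_o are chosen here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (seq_len, d_model); each weight matrix: (d_model, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0, "assumes d_model divisible by num_heads"
    d_head = d_model // num_heads

    def split_heads(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    # Step 1: project, then divide the model dimension among the heads.
    q = split_heads(x @ w_q)
    k = split_heads(x @ w_k)
    v = split_heads(x @ w_v)

    # Step 2: scaled dot-product attention, independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                               # (heads, seq, d_head)

    # Step 3: concatenate the heads and apply the output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o, weights

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5)
```

The returned weights array holds one attention matrix per head, which makes it easy to inspect how the heads differ on the same input.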

Why multiple heads help

Different heads can specialize. One head might track syntactic agreement, another might link a word to a distant antecedent, and another might attend to the previous token. Because each head has its own projections and its own attention weights, the block can model several relationship types at once instead of blending them into a single weight pattern.

The cost balance

Because each head works in only a fraction of the model dimension, the total compute stays close to that of a single full-width attention, so adding heads costs little extra. The concatenation and output projection then mix the heads back into a single representation.
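
A rough count makes the balance concrete. The sketch below counts only the multiply-accumulate operations in the matrix products. It assumes the single-head baseline also has an output projection and that the model dimension divides evenly by the number of heads. The function names are hypothetical, introduced here for illustration.

```python
def attention_macs(seq_len, d_model, num_heads):
    """Multiply-accumulates in the matrix products of one attention block."""
    d_head = d_model // num_heads
    # Q, K, V, and output projections: four (seq_len x d_model) @ (d_model x d_model).
    projections = 4 * seq_len * d_model * d_model
    # Per head: Q @ K^T costs seq_len * seq_len * d_head.
    scores = num_heads * seq_len * seq_len * d_head
    # Per head: weights @ V costs seq_len * seq_len * d_head.
    weighted_sum = num_heads * seq_len * seq_len * d_head
    return projections + scores + weighted_sum

def softmax_elements(seq_len, num_heads):
    """Attention weights that go through softmax: one seq x seq matrix per head."""
    return num_heads * seq_len * seq_len

for h in (1, 8):
    print(h, attention_macs(128, 512, h), softmax_elements(128, h))
```

Under these assumptions the matrix-product count is the same for 1 head and for 8 heads, because the per-head dimension shrinks by the same factor that the head count grows. What does grow with the number of heads is the number of attention weights to normalize and store, which is why the total is close to, rather than exactly equal to, the single-head cost.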

Key idea

Multi-head attention runs several lower-dimensional attentions in parallel so the model can capture many relationship types at once, then concatenates their outputs and projects them back into a single representation.
