The mechanism
Attention lets a model compute a weighted average over a set of values, where the weights say how relevant each item is right now.
- Each position emits a query, a key, and a value.
- A query is compared against every key to produce a score.
- Scores pass through softmax to become weights that sum to one.
- The output is the weighted sum of the values.
Scaled dot product
The standard form divides the dot product scores by the square root of the key dimension before softmax, which keeps gradients stable. This is scaled dot product attention.
Self attention uses the same sequence for queries, keys, and values, letting every position attend to every other position directly.
Key idea
Attention scores queries against keys, softmaxes them into weights, and blends the values, giving the model direct dynamic access to whatever is most relevant.