The core idea
Self-attention lets every token in a sequence look at every other token and decide how much each one matters. This is how transformers build context-aware representations.
Queries, keys, and values
Each token's embedding is projected, through learned weight matrices, into three vectors:
- A query, which asks "what am I looking for?"
- A key, which advertises "what do I contain?"
- A value, which holds the information to pass along
For a given token, the model compares its query against every key using a dot product, giving one score per token in the sequence. A softmax turns these scores into weights that sum to one, and the token's output is the weighted sum of the values.
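The steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the function names, the shapes, and the random matrices standing in for learned projection weights are all assumptions made for the example, and it leaves out batching, masking, and multiple heads. It also includes the square-root scaling of the scores.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Project every token into a query, a key, and a value vector.
    q = x @ w_q
    k = x @ w_k
    v = x @ w_v

    # Compare each query against every key with a dot product,
    # scaled by the square root of the key dimension.
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)

    # Softmax over the keys turns each row of scores into weights that sum to 1.
    weights = softmax(scores, axis=-1)

    # Each token's output is the weighted sum of the value vectors.
    return weights @ v, weights

# Hypothetical sizes and random stand-ins for learned weights.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

output, weights = self_attention(x, w_q, w_k, w_v)
print(output.shape)   # one output vector per token
print(weights.shape)  # one weight for every (query token, key token) pair
```

Row i of `weights` shows how much token i draws from each token in the sequence, including itself.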
Why it helps
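Unlike older recurrent models, which process words one at a time and must carry information through every intermediate step, self-attention connects distant words directly. A pronoun can attend straight to the noun it refers to, no matter how far apart they are in the sequence. Before the softmax, the scores are also divided by the square root of the key dimension. This keeps their magnitude from growing with the vector size, so the softmax does not collapse onto a single token and training stays stable.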
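The effect of that scaling can be checked numerically. The sketch below assumes query and key entries are independent draws with zero mean and unit variance, which is a simplification rather than a property of trained models. The helper name `score_std` and the sample sizes are made up for the illustration. Under that assumption, the spread of the raw dot products grows roughly like the square root of the key dimension, while the scaled scores stay near 1.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_std(d_k, n_samples=10_000, scale=False):
    # Draw random query/key pairs with zero-mean, unit-variance entries.
    q = rng.normal(size=(n_samples, d_k))
    k = rng.normal(size=(n_samples, d_k))

    # One dot-product score per query/key pair.
    scores = np.sum(q * k, axis=1)
    if scale:
        # Divide by the square root of the key dimension.
        scores = scores / np.sqrt(d_k)
    return scores.std()

for d_k in (16, 256):
    raw = score_std(d_k)
    scaled = score_std(d_k, scale=True)
    print(f"d_k={d_k}: raw std={raw:.2f}, scaled std={scaled:.2f}")
```

Large unscaled scores push the softmax toward putting almost all of its weight on one token, which leaves very small gradients for the rest.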
Key idea
Self-attention computes each token's output as a softmax-weighted sum of value vectors, letting every token gather information from the whole sequence.