What self attention does
Self attention lets each token in a sequence gather information from all other tokens in the same sequence. Instead of a fixed window, a token can attend to anything, near or far, and weight it by relevance.
Queries keys and values
Every token is projected into three vectors:
- Query is what this token is looking for.
- Key is what each token offers as an advertisement.
- Value is the content actually passed along when a match happens.
A token compares its query against every key to get scores, turns those scores into weights, then takes a weighted sum of values. The result replaces the token with a blend tuned to its current context.
Why it matters
Because the weights are computed from the data, the same word picks up different meaning depending on its neighbors. The pronoun it can bind to the right noun many words back. This content based routing is what made transformers replace recurrence.
Cost
Comparing every token against every other token costs work that grows with the square of sequence length, which is the central scaling problem later tricks attack.
Key idea
Self attention turns each token into a context aware blend by letting its query score every key and mix the matching values, giving flexible long range routing at quadratic cost.