The Self Attention Deep

What self attention does

Self attention lets each token in a sequence gather information from all other tokens in the same sequence. Instead of a fixed window, a token can attend to anything, near or far, and weight it by relevance.

Queries keys and values

Every token is projected into three vectors:

Query is what this token is looking for.
Key is what each token offers as an advertisement.
Value is the content actually passed along when a match happens.

A token compares its query against every key to get scores, turns those scores into weights, then takes a weighted sum of values. The result replaces the token with a blend tuned to its current context.

Why it matters

Because the weights are computed from the data, the same word picks up different meaning depending on its neighbors. The pronoun it can bind to the right noun many words back. This content based routing is what made transformers replace recurrence.

Cost

Comparing every token against every other token costs work that grows with the square of sequence length, which is the central scaling problem later tricks attack.

Key idea