Machine Learning

The Self Attention Deep

How every token looks at every other token to build a context-aware mix.

What self attention does

Self attention lets each token in a sequence gather information from all other tokens in the same sequence. Instead of a fixed window, a token can attend to anything, near or far, and weight it by relevance.

Queries, keys, and values

Every token is projected into three vectors:

  • Query is what this token is looking for.
  • Key is what each token advertises about itself.
  • Value is the content actually passed along when a match happens.

A token compares its query against every key to get scores, turns those scores into weights that sum to one, then takes a weighted sum of the values. The result replaces the token's representation with a blend tuned to its current context.
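
The score, weight, and blend steps above can be sketched in a few lines of plain Python. This is a minimal single-head sketch, not a reference implementation: the function names, the toy inputs, and the identity projection matrices are illustrative assumptions. It also assumes the common choices of softmax for turning scores into weights and division by the square root of the key size for scaling, neither of which the lesson names explicitly.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Turn a list of scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Return (outputs, weights) for one attention head without masking."""
    queries = matmul(tokens, w_q)  # what each token is looking for
    keys = matmul(tokens, w_k)     # what each token advertises
    values = matmul(tokens, w_v)   # the content each token passes along
    scale = math.sqrt(len(keys[0]))  # assumed scaling by sqrt of key size
    weights = []
    for q in queries:
        # Compare this token's query against every key in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights.append(softmax(scores))
    # Each output row is a weighted sum of all value vectors.
    outputs = matmul(weights, values)
    return outputs, weights

# Toy input (illustrative): 3 tokens with 2 features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity projections keep the example easy to follow; real models learn these.
identity = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
```

Each row of `weights` shows how much one token attends to every token in the sequence, itself included, and each row of `outputs` is the resulting blend.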

Why it matters

Because the weights are computed from the data, the same word picks up a different meaning depending on its neighbors. The pronoun "it" can bind to the right noun many words back. This content-based routing is a large part of why transformers displaced recurrent models.
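
A small sketch can make the context dependence concrete. The 2-d embeddings below are hypothetical values chosen for illustration, and the raw embeddings stand in for keys and values (no learned projections), which is a simplification. The helper `attend` is likewise an illustrative name, and softmax is assumed as the way scores become weights.

```python
import math

def attend(query, keys, values):
    """Blend `values` using softmax weights from query-key dot products."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical embeddings: axis 0 loosely means "animal-like", axis 1 "street-like".
it = [1.0, 1.0]
animal = [2.0, 0.0]
street = [0.0, 2.0]
filler = [0.0, 0.0]

# The same token "it" placed in two different contexts.
context_a = [animal, filler, it]
context_b = [street, filler, it]

blend_a = attend(it, context_a, context_a)  # leans toward the animal axis
blend_b = attend(it, context_b, context_b)  # leans toward the street axis
```

The query for "it" is identical in both calls, yet the blends differ because the keys and values around it differ.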

Cost

Comparing every token against every other token costs work that grows with the square of the sequence length. This is the central scaling problem that later techniques try to address.
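
To see the quadratic growth, it helps to count the query-key comparisons directly. The helper below is an illustrative sketch with an assumed name. It counts comparisons for a single head and ignores the per-comparison cost, which depends on the vector size.

```python
def score_count(seq_len):
    """Number of query-key comparisons for one head over seq_len tokens."""
    # Every token's query is scored against every token's key.
    return seq_len * seq_len

for n in (1_000, 2_000, 4_000):
    print(n, score_count(n))
```

Doubling the sequence length from 1,000 to 2,000 tokens raises the count from 1,000,000 to 4,000,000, a fourfold increase.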

Key idea

Self attention turns each token into a context-aware blend by letting its query score every key and mix the matching values, giving flexible long-range routing at quadratic cost.

Check yourself

1. In self attention, what does a token use to search the others?

2. Why is plain self attention expensive on long inputs?