Self-Attention

How each token decides which other tokens to focus on.

The core idea

Self-attention lets every token in a sequence look at every token, itself included, and decide how much each one matters. This is how transformers build context-aware representations.

Queries, keys, and values

Each token is projected into three vectors.

  • A query that asks "what am I looking for?"
  • A key that advertises "what do I contain?"
  • A value that holds the information to pass along
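
The three projections can be sketched as matrix-vector products. This is a minimal illustration in plain Python, not a real model: the matrices W_q, W_k, and W_v and the token embedding are made-up values (a trained model learns them), and the helper name matvec is hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

# Hypothetical 2-dimensional token embedding.
token = [1.0, 2.0]

# Illustrative projection matrices; a trained model learns these values.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0]]
W_v = [[0.5, 0.5], [1.0, -1.0]]

query = matvec(W_q, token)  # what this token is looking for
key = matvec(W_k, token)    # what this token advertises
value = matvec(W_v, token)  # the information it passes along
```

The same token embedding feeds all three projections; only the matrices differ, which is what lets one token play three separate roles.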

For a given token, the model compares its query against every key using a dot product. These scores become weights after a softmax, and the output is a weighted sum of the values.
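
The steps in this paragraph can be sketched for a single query as follows. This is a rough illustration in plain Python under stated assumptions: the vectors are made up, the function names softmax and attend are hypothetical, and the square-root scaling mentioned later in the lesson is left out to keep the three steps visible.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Return (weights, output) for one query over all keys and values."""
    # Step 1: dot product of the query with every key, one score per token.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Step 2: softmax turns the scores into attention weights.
    weights = softmax(scores)
    # Step 3: the output is the weighted sum of the value vectors.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Made-up vectors for a three-token sequence.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [1.0, 0.0]

weights, output = attend(query, keys, values)
```

Here the query matches the first and third keys equally well, so those two tokens receive equal weight and the second token receives less.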

Why it helps

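Unlike older recurrent models, which process words one at a time and must carry information between distant words through every step in between, self-attention connects distant words directly. A pronoun can attend straight to the noun it refers to, no matter how far apart they are. The scores are also divided by the square root of the key dimension before the softmax. Without this scaling, dot products grow with the dimension and push the softmax toward putting nearly all its weight on a single token, which makes gradients very small.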
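
The effect of the scaling can be seen in a small experiment. This is a sketch under assumptions, not a benchmark: the key dimension d_k = 256, the random Gaussian vectors, and the seed are arbitrary choices made for illustration.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
d_k = 256  # hypothetical key dimension

# Random query and four random keys with unit-variance components.
query = [random.gauss(0, 1) for _ in range(d_k)]
keys = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(4)]

# Raw dot products have a spread that grows with d_k.
raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
# Dividing by sqrt(d_k) brings the spread back to roughly unit scale.
scaled = [s / math.sqrt(d_k) for s in raw]

raw_weights = softmax(raw)
scaled_weights = softmax(scaled)

print("largest weight without scaling:", max(raw_weights))
print("largest weight with scaling:   ", max(scaled_weights))
```

Dividing every score by the same constant greater than one always flattens the softmax, so the largest weight with scaling is never bigger than the largest weight without it.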

Key idea

Self-attention computes each token's output as a softmax-weighted sum of value vectors, letting every token gather information from the whole sequence.

Check yourself

1. What are the three vectors each token produces in self-attention?

2. How are attention weights computed from the scores?

3. Why is self-attention better at long-range dependencies than recurrence?