Machine Learning

The Attention Recap

Weighted lookups let a model focus on the most relevant inputs.

The mechanism

Attention lets a model compute a weighted average over a set of values, where each weight reflects how relevant that value is to the current query.

  • Each position is projected into three vectors by learned linear maps: a query, a key, and a value.
  • A query is compared against every key to produce a score.
  • Scores pass through softmax to become weights that sum to one.
  • The output is the weighted sum of the values.
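
The four steps above can be sketched for a single query in plain Python. This is a minimal illustration, not a reference implementation: the function names `softmax` and `attend` and the toy keys and values are assumptions made up for this example, and the scores here are raw dot products with no scaling.

```python
import math

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to one."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Single-query attention: score, softmax, then blend the values."""
    # Score the query against every key with a dot product.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors, one output dimension at a time.
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

# Toy data (hypothetical): three items with 2-d keys and values.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
weights, output = attend([1.0, 0.0], keys, values)
```

The query matches the first and third keys equally well, so those two items receive equal weight and the second item receives less. The output therefore leans toward the first value dimension.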

Scaled dot product

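The standard form divides the dot product scores by the square root of the key dimension before softmax. Without this scaling, the scores grow in magnitude as the dimension grows, softmax saturates toward a near one-hot output, and the gradients flowing through it become very small. Dividing keeps the scores in a range where gradients stay stable. This is scaled dot product attention.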
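
Written as a formula, scaled dot product attention is commonly given in the form below. The symbols are assumptions introduced here, not taken from the lesson: Q, K, and V are matrices whose rows are the queries, keys, and values, and d_k is the key dimension.

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

One way to see why the square root is the right factor: if the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so dividing by the square root of d_k brings the variance back to one.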

Self attention uses the same sequence for queries, keys, and values, letting every position attend to every other position directly.
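
A rough sketch of self attention with scaling, again in plain Python. The helper names and the projection matrices `w_q`, `w_k`, and `w_v` are hypothetical. In a trained model the projections would be learned; here they are set to the identity so the toy example stays easy to follow.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(scores):
    """Turn raw scores into non-negative weights that sum to one."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot product self attention over one sequence x (rows are positions)."""
    # Queries, keys, and values are all projections of the same sequence.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    # Transpose k so that scores[i][j] is the dot product of query i and key j,
    # then divide every score by sqrt(d_k).
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # One softmax per query row, so each position gets its own set of weights.
    weights = [softmax(row) for row in scores]
    return weights, matmul(weights, v)

# Toy sequence (hypothetical): three positions, each a 2-d vector.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, out = self_attention(x, identity, identity, identity)
```

`weights` is a 3 by 3 matrix: row i holds the weights that position i places on every position in the sequence, itself included, which is what lets each position reach any other in a single step.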

Key idea

Attention scores queries against keys, softmaxes them into weights, and blends the values, giving the model direct dynamic access to whatever is most relevant.

Check yourself

1. What turns attention scores into weights that sum to one?

2. What are the three projections in attention?

3. Why are dot product scores scaled before softmax?