Self Attention

The core idea

Self attention lets every token in a sequence look at every other token and decide how much each one matters. This is how transformers build context aware representations.

Queries keys and values

Each token is projected into three vectors.

A query that asks what am I looking for
A key that advertises what I contain
A value that holds the information to pass along

For a given token, the model compares its query against every key using a dot product. These scores become weights after a softmax, and the output is a weighted sum of the values.

Why it helps

Unlike older recurrent models that process words one at a time, self attention connects distant words directly. A pronoun can attend straight to the noun it refers to, no matter how far apart they are. The scores are also scaled down by the square root of the key dimension to keep them stable.

Key idea

Self attention computes each token output as a softmax weighted sum of value vectors, letting every token gather information from the whole sequence.

The core idea

Queries keys and values

Why it helps

Key idea

Check yourself