The Attention Mechanism Intro

Attention lets a model decide which parts of the input matter most at each step, instead of squeezing everything into one fixed vector. It was introduced to fix the bottleneck in sequence to sequence models.

How attention computes a focus

At each decoding step the model builds a weighted blend of all encoder states:

A query from the current decoder state is compared against a key for each input position.
The comparison scores become weights after a softmax, so they are positive and sum to one.
The output is a weighted sum of the value vectors, emphasizing the most relevant inputs.

Because the weights change every step, the decoder can look at different input words when producing different output words. In translation it can align an output word with its source word.

Why it matters

Attention removed the single vector bottleneck and let models handle long inputs gracefully. It also made behavior more interpretable, since the weights show what the model looked at. This idea became the foundation of the transformer, which uses attention in place of recurrence entirely.

Key idea

Attention forms a query key value weighted blend so the model focuses on the most relevant inputs at each step, removing the fixed vector bottleneck and seeding transformers.

The Attention Mechanism Intro