Attention In Seq2seq

Letting the decoder look back at the whole source.

Attention removes the fixed-vector bottleneck of the plain encoder-decoder. Instead of forcing the entire source sentence into one fixed-length vector, it lets the decoder look back at every encoder state each time it generates a target word.

The mechanism works in three moves at each decoding step:

  • Score how relevant each source position is to the current decoder state
  • Turn those scores into weights that sum to one using a softmax
  • Build a context vector as the weighted average of encoder states
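
The three moves above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes dot-product scoring (the lesson does not say which scoring function is used, and learned scoring functions are also common), and the names `attention_step`, `encoder_states`, and `decoder_state` are hypothetical, with toy values chosen by hand.

```python
import numpy as np

def softmax(x):
    # Subtract the max before exponentiating for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_step(decoder_state, encoder_states):
    """One decoding step of attention (dot-product scoring is an assumption).

    decoder_state:  shape (d,)   current decoder hidden state
    encoder_states: shape (T, d) one hidden state per source position
    """
    # 1. Score how relevant each source position is to the decoder state.
    scores = encoder_states @ decoder_state      # shape (T,)
    # 2. Turn the scores into weights that sum to one.
    weights = softmax(scores)                    # shape (T,)
    # 3. Build the context vector as the weighted average of encoder states.
    context = weights @ encoder_states           # shape (d,)
    return weights, context

# Toy example: three source positions, hidden size two (made-up values).
encoder_states = np.array([[1.0, 0.0],
                           [0.0, 1.0],
                           [1.0, 1.0]])
decoder_state = np.array([0.0, 2.0])

weights, context = attention_step(decoder_state, encoder_states)
print("weights:", weights)
print("context:", context)
```

With these toy values the second and third source positions score equally and well above the first, so most of the weight, and therefore most of the context vector, comes from them.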

The decoder then uses this fresh context, recomputed for the current word, to predict the next token. When translating a noun, for example, the weights tend to peak on the matching source word, so the model effectively performs a soft alignment between the two languages.

The benefits are concrete. Output quality on long sentences no longer degrades as badly, because no single vector has to hold everything. The attention weights are also interpretable, since you can visualize which source words received the most weight for each output word.
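
To make the visualization idea concrete, the weights from every decoding step can be stacked into a matrix with one row per output word and one column per source word. The sketch below is illustrative only: the tokens and the hidden states are invented so that the alignment comes out diagonal, dot-product scoring is again an assumption, and `attention_matrix` is a hypothetical helper name. A real model would supply learned states.

```python
import numpy as np

def attention_matrix(decoder_states, encoder_states):
    """Row i holds the attention weights used when producing output word i."""
    # Dot-product scores for all (output, source) pairs at once.
    scores = decoder_states @ encoder_states.T            # shape (T_out, T_in)
    # Row-wise softmax, shifted by the row max for numerical stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical tokens and hand-made states, chosen to align one-to-one.
source_tokens = ["le", "chat", "dort"]
target_tokens = ["the", "cat", "sleeps"]
encoder_states = 4.0 * np.eye(3)
decoder_states = np.eye(3)

alignment = attention_matrix(decoder_states, encoder_states)

# Report, for each output word, the source word with the largest weight.
for target, row in zip(target_tokens, alignment):
    best = source_tokens[int(np.argmax(row))]
    print(f"{target:>7} -> {best:<5} weights={np.round(row, 2)}")
```

Plotting `alignment` as a heatmap gives the kind of picture usually shown for attention: bright cells mark the source words that contributed most to each output word.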

Attention proved so useful that researchers asked whether the recurrent backbone was even necessary. The answer led to the transformer, which is built entirely from attention with no recurrence at all. So this idea is both a fix for seq2seq and the seed of modern architectures.

Key idea

Attention lets the decoder weight all encoder states for each output word, removing the fixed-vector bottleneck and giving interpretable alignments.

Check yourself

1. What problem does attention solve in seq2seq?

2. How are attention weights produced from scores?

3. Why are attention weights useful beyond accuracy?