Attention in Seq2seq
Attention removed the bottleneck of the plain encoder-decoder. Instead of forcing the entire source sentence into one fixed-size vector, attention lets the decoder look back at every encoder state when generating each target word.
The mechanism works in three moves at each decoding step:
- Score how relevant each source position is to the current decoder state
- Turn those scores into weights that sum to one using a softmax
- Build a context vector as the weighted average of encoder states
The decoder then uses this context vector, recomputed for every output word, to predict the next token. When translating a noun, for example, the weights tend to spike on the matching source word, which amounts to a soft alignment between the two languages.
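The three moves can be sketched for a single decoding step in plain Python. This is a minimal illustration, not a reference implementation: the text does not say how relevance is scored, so the dot product used here is an assumption (learned scoring functions are also common), and the names `attend`, `softmax`, `decoder_state`, and `encoder_states` are hypothetical, as are the toy vectors.

```python
import math

def softmax(scores):
    # Subtract the max before exponentiating for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_states):
    """One decoding step of attention with dot-product scoring (an assumption)."""
    # 1. Score each source position against the current decoder state.
    scores = [sum(d * h for d, h in zip(decoder_state, enc))
              for enc in encoder_states]
    # 2. Turn the scores into weights that sum to one.
    weights = softmax(scores)
    # 3. Build the context vector as the weighted average of encoder states.
    dim = len(decoder_state)
    context = [sum(w * enc[i] for w, enc in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Toy example: three source positions, two-dimensional states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]
weights, context = attend(decoder_state, encoder_states)
print(weights)
print(context)
```

In this toy case the second and third source positions score equally and share most of the weight, so the context vector is pulled toward their states. A real decoder would recompute `weights` and `context` at every step, using its updated state.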
The benefits are concrete. Output quality on long sentences no longer degrades as badly, because no single vector has to hold everything. The attention weights are also interpretable, since you can visualize which source words received the most weight for each output word.
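One simple way to read the weights is to report, for each output word, the source word that received the most weight. The sketch below assumes the weights have already been collected into a matrix with one row per target word. The sentences and numbers are made up for illustration and do not come from a trained model, and `soft_alignment` is a hypothetical helper name.

```python
source = ["le", "chat", "noir"]
target = ["the", "black", "cat"]

# Hypothetical attention weights: one row per target word, one column per
# source word. Each row sums to one, as a softmax output would.
attention = [
    [0.90, 0.05, 0.05],
    [0.05, 0.15, 0.80],
    [0.05, 0.85, 0.10],
]

def soft_alignment(attention, source, target):
    """Pair each target word with its highest-weighted source word."""
    pairs = []
    for tgt_word, row in zip(target, attention):
        best = max(range(len(row)), key=lambda j: row[j])
        pairs.append((tgt_word, source[best], row[best]))
    return pairs

for tgt_word, src_word, weight in soft_alignment(attention, source, target):
    print(f"{tgt_word:>6} <- {src_word:<5} (weight {weight:.2f})")
```

Taking only the largest weight per row throws away the "soft" part of the alignment. Plotting the full matrix as a heatmap shows how the remaining weight is spread across the other source words.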
Attention proved so useful that researchers asked whether the recurrent backbone was even necessary. The answer led to the transformer, which drops recurrence entirely and relies on attention to relate positions to one another. So this idea is both a fix for seq2seq and the seed of modern architectures.
Key idea
Attention lets the decoder reweight all encoder states for each output word, removing the fixed-vector bottleneck and giving interpretable alignments.