What it is
A sequence-to-sequence (seq2seq) model maps an input sequence to an output sequence that may have a different length. Translation, summarization, and speech transcription all fit this shape.
Two parts
The classic design has two networks.
- The encoder reads the whole input and compresses it into a context representation
- The decoder generates the output one token at a time, conditioned on that context and on the tokens it has produced so far
The decoder begins with a special start token and feeds each prediction back in as input until it emits an end token. This step-by-step generation is called autoregressive decoding.
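The decoding loop described above can be sketched in a few lines. This is a minimal illustration, not a real model: `greedy_decode`, `toy_step`, the `<s>` and `</s>` token strings, and the `max_len` cap are all hypothetical names chosen for the example. The toy decoder simply emits the context in reverse so the loop has something deterministic to run.

```python
from typing import Callable, List

START, END = "<s>", "</s>"  # assumed token names, for illustration only


def greedy_decode(
    step_fn: Callable[[List[str], List[str]], str],
    context: List[str],
    max_len: int = 20,
) -> List[str]:
    """Generate tokens one at a time, feeding each prediction back in."""
    tokens = [START]
    # Stop at the end token or after max_len generated tokens.
    while len(tokens) - 1 < max_len:
        next_token = step_fn(context, tokens)
        if next_token == END:
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start token


def toy_step(context: List[str], produced: List[str]) -> str:
    """Stand-in for a trained decoder: emits the context reversed, then END."""
    i = len(produced) - 1  # number of output tokens generated so far
    if i < len(context):
        return context[len(context) - 1 - i]
    return END


output = greedy_decode(toy_step, ["a", "b", "c"])
print(output)
```

A trained decoder would replace `toy_step` with a network that scores every vocabulary entry; taking the highest-scoring token at each step is the greedy strategy, and the `max_len` cap guards against a decoder that never emits the end token.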
The bottleneck and attention
In the earliest versions, the encoder squeezed the entire input into a single fixed-length vector, which became a bottleneck for long inputs.
- Attention fixed this by letting the decoder look back at all encoder states
- At each output step the decoder assigns a weight to every input position, concentrating on the ones most relevant to that step
- This made long-sequence translation far more accurate and inspired the transformer
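The weighting step can be illustrated with a small sketch. It assumes dot-product scoring, which is one common choice (early attention models used a learned additive score instead), and the vectors below are made-up values rather than outputs of a trained encoder. The names `softmax` and `attend` are hypothetical helpers for this example.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    decoder_state: List[float], encoder_states: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Score each input position, then build a weighted context vector."""
    # Dot product between the decoder state and each encoder state (assumed scoring rule).
    scores = [sum(q * k for q, k in zip(decoder_state, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # The context is the weighted average of the encoder states.
    context = [
        sum(w * h[d] for w, h in zip(weights, encoder_states)) for d in range(dim)
    ]
    return weights, context


# Made-up encoder states for three input positions, and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]

weights, context = attend(decoder_state, encoder_states)
print(weights, context)
```

Because the context vector is recomputed at every output step, the decoder no longer depends on one fixed summary of the input; positions whose encoder states align with the current decoder state receive larger weights.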
Key idea
A sequence-to-sequence model encodes the input into a context representation, then decodes the output token by token.