Sequence-to-Sequence Models

What it is

A sequence-to-sequence model maps an input sequence to an output sequence that may have a different length. Translation, summarization, and speech transcription all fit this pattern.

Two parts

The classic design has two networks.

The encoder reads the whole input and compresses it into a context representation
The decoder generates the output one token at a time, conditioned on that context and on the tokens it has produced so far

The decoder begins with a special start token and feeds each prediction back in as its next input until it emits an end token. This step-by-step generation is called autoregressive decoding.
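The loop described above can be sketched in a few lines of Python. This is only an illustration, not a real model: `encode`, `toy_step`, `greedy_decode`, and the `<s>` / `</s>` token strings are hypothetical names chosen for this sketch, and the "decoder" simply emits the input in reverse so the control flow is easy to follow. It also assumes greedy decoding (take one prediction per step), which is one common choice among several.

```python
from typing import Callable, List, Sequence

START, END = "<s>", "</s>"  # hypothetical start/end tokens for this sketch


def encode(source: Sequence[str]) -> List[str]:
    """Toy stand-in for an encoder: the 'context' is just the input tokens."""
    return list(source)


def toy_step(context: List[str], produced: List[str]) -> str:
    """Toy stand-in for one decoder step.

    Looks at the context and at the tokens produced so far, and returns
    the next token: the input in reverse order, followed by END.
    """
    emitted = len(produced) - 1  # tokens generated so far, excluding START
    if emitted < len(context):
        return context[len(context) - 1 - emitted]
    return END


def greedy_decode(
    source: Sequence[str],
    step: Callable[[List[str], List[str]], str] = toy_step,
    max_len: int = 50,
) -> List[str]:
    """Autoregressive loop: feed each prediction back in until END."""
    context = encode(source)
    produced = [START]
    for _ in range(max_len):  # hard cap in case END is never emitted
        token = step(context, produced)
        if token == END:
            break
        produced.append(token)  # the prediction becomes part of the next input
    return produced[1:]  # drop the START token
```

In a trained model, `step` would run the decoder network and pick a token from its output distribution. The `max_len` cap is a practical safeguard, since nothing guarantees that a model will ever produce the end token.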

The bottleneck and attention

In the earliest versions, the encoder squeezed the entire input into a single fixed-length vector. That vector became a bottleneck for long inputs, because it had to carry every detail the decoder might later need.

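Attention fixed this by letting the decoder look back at all encoder states instead of a single vector
At each output step the decoder assigns a weight to every input position and builds its context from the weighted combination of encoder states
This made translation of long sequences far more accurate and inspired the transformer, which is built around attention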
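The weighting step can be made concrete with a small sketch. It assumes dot-product scoring between the decoder state and each encoder state, followed by a softmax. That is one common choice, not necessarily the scoring function used in the earliest attention models. The names `softmax` and `attend` are illustrative, and vectors are plain Python lists to keep the example dependency-free.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: List[float], encoder_states: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """One attention step for a single decoder state (the query).

    Returns the weight given to each input position and the context
    vector, which is the weighted sum of the encoder states.
    """
    # Score each input position by its dot product with the query.
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context
```

For example, `attend([1.0, 0.0], [[10.0, 0.0], [0.0, 10.0]])` places almost all of the weight on the first input position, because only the first encoder state aligns with the query. The decoder recomputes these weights at every output step, so the context it sees changes as generation proceeds.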

Key idea

A sequence-to-sequence model encodes the input into a context representation, and a decoder then generates the output token by token.
