Three blueprints
The same transformer block assembles into three major families, each suited to different tasks.
Encoder-only
- Every token attends to all other tokens, both left and right.
- Produces rich contextual representations of an input.
- Best for understanding tasks like classification and retrieval.
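
As a rough illustration of unmasked attention, the sketch below turns a small grid of raw attention scores into attention weights without any mask, so every token ends up with a nonzero weight on every position. The score values and the function names are made up for the example, and learned projections and multiple heads are left out.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def bidirectional_attention_weights(scores):
    """Row i holds how much token i attends to each token j. No mask is applied."""
    return [softmax(row) for row in scores]

# Hypothetical raw attention scores for a 3-token input (made-up numbers).
scores = [
    [0.5, 1.0, -0.3],
    [0.2, 0.1, 0.9],
    [1.2, -0.5, 0.0],
]
weights = bidirectional_attention_weights(scores)

for i, row in enumerate(weights):
    print(i, [round(w, 3) for w in row])
```

Each row sums to 1 and no entry is zero, which is what "both left and right" amounts to in terms of attention weights.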
Decoder-only
- Uses a causal mask so each token attends only to itself and the tokens before it.
- Generates text one token at a time.
- Best for generation, and the basis of most large language models today.
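
The causal mask and the token-by-token loop can be sketched in a few lines. This is a minimal illustration, not a real model: `toy_next_token` is a hypothetical stand-in for a trained network, and the attention scores are all zeros so that the effect of the mask is easy to see.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(row, allowed):
    """Softmax in which disallowed positions receive a weight of exactly 0."""
    masked = [x if ok else NEG_INF for x, ok in zip(row, allowed)]
    peak = max(masked)  # finite, because the diagonal is always allowed
    exps = [math.exp(x - peak) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]

# With all-zero scores, each token spreads its weight evenly over the
# positions the mask lets it see.
scores = [[0.0] * 3 for _ in range(3)]
weights = [masked_softmax(row, allowed)
           for row, allowed in zip(scores, causal_mask(3))]

def toy_next_token(prefix):
    """Hypothetical stand-in for a model: depends only on the prefix so far."""
    return (prefix[-1] + 1) % 5

def generate(prompt, steps):
    """Append one token per step, each chosen from the tokens before it."""
    tokens = list(prompt)
    for _ in range(steps):
        tokens.append(toy_next_token(tokens))
    return tokens

print(weights)
print(generate([3], 4))
```

The first token can only attend to itself, the second splits its weight over two positions, and nothing above the diagonal ever receives weight. The generation loop has the same shape: each new token is computed from the prefix alone.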
Encoder-decoder
- An encoder reads the input bidirectionally; a decoder generates the output causally while using cross-attention to consult the encoder's representations.
- Best for sequence-to-sequence tasks like translation and summarization.
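
Cross-attention can be pictured as decoder positions asking questions of the encoder's outputs. The sketch below is a simplified single-head version under stated assumptions: the vectors are made-up numbers, the learned query, key, and value projections are omitted, and the function name `cross_attention` is chosen just for this example.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(decoder_queries, encoder_keys, encoder_values):
    """Each decoder position mixes encoder values, weighted by query-key similarity.

    No causal mask is applied here: every decoder position may look at the
    whole encoded input.
    """
    scale = math.sqrt(len(encoder_keys[0]))
    width = len(encoder_values[0])
    outputs = []
    for q in decoder_queries:
        weights = softmax([dot(q, k) / scale for k in encoder_keys])
        mixed = [sum(w * v[d] for w, v in zip(weights, encoder_values))
                 for d in range(width)]
        outputs.append(mixed)
    return outputs

# Three encoded input tokens and two decoder positions (hypothetical values).
encoder_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
decoder_queries = [[0.0, 0.0], [1.0, 0.0]]

outputs = cross_attention(decoder_queries, encoder_keys, encoder_values)
print(outputs)
```

The output has one vector per decoder position, not per input token. A zero query weights all encoder positions equally, so the first output is simply the average of the encoder values.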
Choosing one
If you only need to read and label, use an encoder. If you need to write, use a decoder. If you transform one sequence into another, the encoder-decoder pairing gives you both a full view of the input and a causal generator.
Key idea
Encoder-only models read bidirectionally for understanding, decoder-only models generate causally, and encoder-decoder models pair a bidirectional reader with a causal writer for sequence-to-sequence work.