The building block
The transformer is built from a stack of identical layers. Each layer has two main parts:
- A multi-head attention block that mixes information across tokens
- A feed-forward network that transforms each token independently
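Each part is wrapped in a residual connection and layer normalization. The residual path gives gradients a direct route through the stack and normalization keeps activations in a stable range, which together make deep stacks trainable.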
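The layer structure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: normalization is applied after the residual addition (one common ordering; the text does not specify which), there are no learned parameters, and `mean_mixing` and `toy_feed_forward` are hypothetical stand-ins for real attention and feed-forward blocks.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and roughly unit variance.
    Assumption: no learned scale or shift, to keep the sketch short."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def residual_then_norm(tokens, sublayer):
    """Run a sublayer, add its output back to its input, then normalize each token."""
    outputs = sublayer(tokens)
    return [layer_norm([x + y for x, y in zip(tok, out)])
            for tok, out in zip(tokens, outputs)]

def transformer_layer(tokens, attention, feed_forward):
    """One layer: attention mixes across tokens, feed-forward acts on each token alone."""
    tokens = residual_then_norm(tokens, attention)
    tokens = residual_then_norm(tokens, lambda toks: [feed_forward(t) for t in toks])
    return tokens

def mean_mixing(tokens):
    """Hypothetical stand-in for attention: every token receives the average of all tokens."""
    dim = len(tokens[0])
    avg = [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]
    return [avg[:] for _ in tokens]

def toy_feed_forward(vec):
    """Hypothetical stand-in for the feed-forward network: scale by 2, then ReLU."""
    return [max(0.0, 2.0 * v) for v in vec]

tokens = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 1.0, 0.5]]

# Stack three identical layers; the shape of the token list is preserved throughout.
out = tokens
for _ in range(3):
    out = transformer_layer(out, mean_mixing, toy_feed_forward)
```

Because every layer maps a list of token vectors to a list of the same shape, layers can be stacked to any depth without changing the surrounding code.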
Encoders and decoders
The original design had an encoder that reads the input and a decoder that writes the output, and it was used for translation. Modern large language models are usually decoder-only. They predict the next token using masked (causal) attention, so each position can attend only to itself and earlier positions, never to later ones.
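The masking rule can be made concrete with a small sketch of causal attention weights. It assumes single-head scaled dot-product attention with no learned projections; the function names are illustrative and not taken from any library.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def softmax(scores):
    """Numerically stable softmax; a score of -inf becomes a weight of exactly 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(queries, keys):
    """Scaled dot-product attention weights with later positions masked out."""
    n = len(queries)
    scale = math.sqrt(len(queries[0]))
    mask = causal_mask(n)
    weights = []
    for i in range(n):
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / scale
            if mask[i][j] else float("-inf")
            for j in range(n)
        ]
        weights.append(softmax(scores))
    return weights

# Three toy token vectors, used as both queries and keys.
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights = causal_attention_weights(vectors, vectors)
```

Each row of `weights` sums to 1, and every entry above the diagonal is zero: the first position attends only to itself, while the last position can attend to all three.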
Why it scaled
During training, transformers process all tokens of a sequence in parallel rather than one at a time, as recurrent networks do, which suits modern parallel hardware such as GPUs. Stacking many layers and widening them, then training on huge text corpora, produced the steady gains that define today's large models.
Key idea
A transformer stacks attention and feed-forward blocks, each wrapped in a residual connection and layer normalization, and processes all tokens in parallel, which lets it scale to massive models.