The Transformer Architecture

The building block

A transformer is a stack of layers that all share the same structure, each with its own weights. Each layer has two main parts:

A multi-head attention block that mixes information across tokens
A feed-forward network that transforms each token independently

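Each part is wrapped in a residual connection and layer normalization. The residual path gives gradients a direct route through the stack, and normalization keeps activation scales stable. Together they make deep stacks trainable.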
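The layer structure described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single attention head instead of several, applies normalization after each residual add (other designs normalize before the sub-block), leaves out learned scale and bias terms, and uses small random weights. All function and parameter names (`transformer_layer`, `wq`, `w1`, and so on) are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean and roughly unit variance."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def add(a, b):
    """Elementwise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product attention: mixes information across tokens."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def feed_forward(x, w1, w2):
    """Two-layer network with ReLU, applied to each token independently."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def transformer_layer(x, p):
    """One layer: each sub-block is followed by a residual add and layer norm."""
    x = layer_norm(add(x, self_attention(x, p["wq"], p["wk"], p["wv"])))
    x = layer_norm(add(x, feed_forward(x, p["w1"], p["w2"])))
    return x

def random_matrix(rows, cols, rng):
    """Small random weights, for illustration only."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_ff, n_tokens = 4, 8, 3  # illustrative sizes
params = {
    "wq": random_matrix(d_model, d_model, rng),
    "wk": random_matrix(d_model, d_model, rng),
    "wv": random_matrix(d_model, d_model, rng),
    "w1": random_matrix(d_model, d_ff, rng),
    "w2": random_matrix(d_ff, d_model, rng),
}
tokens = random_matrix(n_tokens, d_model, rng)
output = transformer_layer(tokens, params)  # same shape as the input
```

Because the output has the same shape as the input, the same function can be applied repeatedly with fresh weights to form a stack.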

Encoders and decoders

The original design had an encoder that reads the input and a decoder that writes the output, and it was used for translation. Modern large language models are usually decoder-only. They predict the next token using causally masked attention, so each token can attend only to itself and earlier tokens, never to later ones.
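The masking rule can be shown with a small sketch. It assumes attention scores arrive as a square matrix where entry (i, j) is how strongly token i attends to token j. The function name `causal_attention_weights` is hypothetical.

```python
import math

def causal_attention_weights(scores):
    """Mask future positions in an n x n score matrix, then softmax each row.

    Entries with j > i (later tokens) are set to -inf, so they receive
    exactly zero weight after the softmax.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # finite, because position i itself is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Equal raw scores make the effect of the mask easy to see.
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
weights = causal_attention_weights(scores)
# Token 0 attends only to itself; token 2 spreads weight evenly over tokens 0 to 2.
```

Each row still sums to 1, but the weight is distributed only over positions at or before the current token.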

Why it scaled

During training, transformers process all tokens of a sequence in parallel rather than one at a time, which suits modern parallel hardware such as GPUs. Stacking many layers and widening them, then training on huge text corpora, produced the steady gains that define today's large models.
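To make "stacking and widening" concrete, here is a rough parameter count for the layer stack. It is a back-of-the-envelope sketch under assumptions not stated in the text: four square projection matrices in the attention block (query, key, value, output), a feed-forward hidden size of four times the model width, and no biases, embeddings, or normalization parameters. The function names and the example sizes are illustrative only.

```python
def approx_layer_params(d_model, ff_multiplier=4):
    """Approximate weight count for one layer under the assumptions above."""
    attention = 4 * d_model * d_model  # query, key, value, output projections
    feed_forward = 2 * d_model * (ff_multiplier * d_model)  # up and down projections
    return attention + feed_forward

def approx_stack_params(n_layers, d_model, ff_multiplier=4):
    """Approximate weight count for a stack of identically shaped layers."""
    return n_layers * approx_layer_params(d_model, ff_multiplier)

base = approx_stack_params(n_layers=12, d_model=768)
deeper = approx_stack_params(n_layers=24, d_model=768)   # twice the depth
wider = approx_stack_params(n_layers=12, d_model=1536)   # twice the width
```

Under these assumptions the count grows linearly with depth and quadratically with width, so doubling the width adds more parameters than doubling the depth.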

Key idea

A transformer stacks attention and feed-forward blocks with residual connections and layer normalization, processing all tokens in parallel so that it can scale to massive models.
