Machine Learning

The Transformer Architecture

The stacked attention and feed-forward design behind modern LLMs.

The building block

The transformer is a stack of layers that share the same structure, each with its own learned weights. Each layer has two main parts.

  • A multi-head attention block that mixes information across tokens
  • A feed-forward network that transforms each token independently

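Around each part sit a residual connection, which adds the part's input back to its output, and layer normalization, which rescales each token's activations. Together they keep gradients from vanishing or exploding, which makes deep stacks trainable.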
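The layer structure described above can be sketched in a few lines of plain Python. This is a toy illustration, not a real implementation: toy_attention and toy_feed_forward are hypothetical stand-ins with no learned weights, and the normalization is placed after the residual addition, which is one common arrangement (many models normalize before each part instead).

```python
import math

def layer_norm(vec, eps=1e-5):
    """Rescale one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(tokens, sublayer):
    """Run a sublayer, add its input back (residual connection), then normalize each token."""
    outputs = sublayer(tokens)
    return [
        layer_norm([x + y for x, y in zip(tok_in, tok_out)])
        for tok_in, tok_out in zip(tokens, outputs)
    ]

def toy_attention(tokens):
    """Hypothetical stand-in for multi-head attention: each token receives the average of all tokens."""
    n = len(tokens)
    avg = [sum(col) / n for col in zip(*tokens)]
    return [avg[:] for _ in tokens]

def toy_feed_forward(tokens):
    """Hypothetical stand-in for the feed-forward network: a ReLU applied to each token on its own."""
    return [[max(0.0, v) for v in tok] for tok in tokens]

def transformer_block(tokens):
    """One layer: an attention part, then a feed-forward part, each wrapped in add-and-norm."""
    tokens = add_and_norm(tokens, toy_attention)
    return add_and_norm(tokens, toy_feed_forward)

def transformer_stack(tokens, num_layers):
    """Apply the same block structure num_layers times."""
    for _ in range(num_layers):
        tokens = transformer_block(tokens)
    return tokens

# Three tokens, each a 4-dimensional vector (made-up values).
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 3.0, -2.0, 1.0]]
out = transformer_stack(x, num_layers=2)
```

The output keeps the input's shape (three tokens of four values each), which is what allows the same block to be stacked any number of times.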

Encoders and decoders

The original design paired an encoder that reads the input with a decoder that writes the output, and was built for translation. Modern large language models are usually decoder-only. They predict the next token using masked (causal) attention, so each position can attend only to itself and earlier positions, never to later ones.
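To make the masking concrete, here is a minimal sketch of how attention weights could be computed with a causal mask. It assumes plain scaled dot-product scores and uses made-up query and key vectors; the function names are illustrative, not taken from any library.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1; -inf scores become exactly 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(queries, keys):
    """Return one row of attention weights per position, with later positions masked out."""
    d = len(queries[0])
    weights = []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if j > i:
                # Position j lies in the future of position i, so it is hidden.
                scores.append(float("-inf"))
            else:
                scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
        weights.append(softmax(scores))
    return weights

# Three positions with 2-dimensional queries and keys (made-up values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w = causal_attention_weights(q, k)
```

In the result, every entry above the diagonal is zero: the first position attends only to itself, while the last position spreads its weight over all three.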

Why it scaled

During training, transformers process all tokens of a sequence in parallel rather than one at a time as recurrent networks do, which suits modern hardware such as GPUs. Stacking more layers and widening them, then training on huge text corpora, produced the steady gains that define today's large models.
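A rough way to see what stacking and widening do to model size is to count weights. The sketch below is a back-of-the-envelope estimate under stated assumptions: the feed-forward hidden size is four times the model width (a common convention, not a requirement), and biases, embeddings, and normalization parameters are ignored. The function names and the example sizes are illustrative.

```python
def approx_block_params(d_model, d_ff=None):
    """Estimate the weight count of one layer, ignoring biases and normalization."""
    if d_ff is None:
        d_ff = 4 * d_model  # assumed convention: hidden size is 4x the model width
    attention = 4 * d_model * d_model   # query, key, value, and output projections
    feed_forward = 2 * d_model * d_ff   # two linear maps: up to d_ff and back down
    return attention + feed_forward

def approx_stack_params(num_layers, d_model):
    """Estimate the weight count of the whole stack, ignoring embeddings."""
    return num_layers * approx_block_params(d_model)

base = approx_stack_params(num_layers=12, d_model=768)
deeper = approx_stack_params(num_layers=24, d_model=768)
wider = approx_stack_params(num_layers=12, d_model=1536)
```

Under these assumptions, doubling the depth doubles the estimate, while doubling the width quadruples it, because each weight matrix grows in both dimensions.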

Key idea

A transformer stacks attention and feed-forward blocks with residual connections and normalization, processing all tokens in parallel to scale to massive models.

Check yourself

1. What are the two main sublayers inside a transformer block?

2. What does masked attention enforce in a decoder-only model?

3. Why do transformers scale well on modern hardware?