The Transformer Recap

Stacked self-attention and feed-forward blocks replace recurrence.

The architecture

The transformer processes a whole sequence in parallel using stacked blocks built from attention rather than recurrence.

Each block contains:

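  • Multi-head self-attention, which runs several attention computations in parallel, concatenates their outputs, and projects the result back to the model dimension.
  • A position-wise feed-forward network, applied independently and with the same weights to every token.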
  • Residual connections and layer normalization around each sublayer.
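
To make the three ingredients concrete, here is a minimal NumPy sketch of one block. It is an illustration under stated assumptions, not a reference implementation: the names (transformer_block, init_params, and so on) are hypothetical, the weights are random, layer normalization has no learned scale or shift, and normalization is applied after each residual addition (some variants normalize before the sublayer instead).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # Assumption: no learned scale/shift, to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                  # (heads, seq, d_head)
    # Concatenate the heads along the feature axis, then project.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

def feed_forward(x, w1, b1, w2, b2):
    # The same two-layer network is applied to every token row.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def transformer_block(x, p, num_heads):
    # Sublayer 1: attention, wrapped in a residual connection and layer norm.
    x = layer_norm(x + multi_head_self_attention(
        x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads))
    # Sublayer 2: feed-forward, wrapped the same way.
    return layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))

def init_params(d_model, d_ff, rng):
    def w(*shape):
        return rng.normal(scale=0.1, size=shape)
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, num_heads, seq_len = 8, 32, 2, 5
params = init_params(d_model, d_ff, rng)
x = rng.normal(size=(seq_len, d_model))
y = transformer_block(x, params, num_heads)
print(y.shape)  # (5, 8)
```

The output has the same shape as the input, which is what lets blocks be stacked one on top of another.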

Position information

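Self-attention on its own is insensitive to token order: permuting the input tokens simply permutes the outputs in the same way. Transformers therefore add positional encodings to the token representations to inject sequence order.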
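
The sketch below checks this claim numerically. It assumes a single attention head with identity projections and uses fixed sinusoidal encodings as one common choice of positional encoding (learned encodings are another); the function names are hypothetical and d_model is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # Sine on even feature indices, cosine on odd ones, with
    # wavelengths that grow geometrically across the feature axis.
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / np.power(10000.0, dims / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def self_attention(x):
    # Single head, identity projections: enough to show the order behaviour.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
seq_len, d_model = 6, 8
x = rng.normal(size=(seq_len, d_model))
perm = np.arange(seq_len)[::-1]  # reverse the token order

# Without positions: shuffling the inputs just shuffles the outputs.
order_blind = np.allclose(self_attention(x[perm]), self_attention(x)[perm])

# With positions: the same tokens in a different order give different outputs.
pe = sinusoidal_positions(seq_len, d_model)
order_aware = not np.allclose(self_attention(x[perm] + pe),
                              self_attention(x + pe)[perm])

print(order_blind, order_aware)
```

Both flags should come out true: attention alone cannot tell the two orderings apart, while attention over position-augmented inputs can.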

Why it won

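  • Parallelism over the sequence makes training far faster than with RNNs, which must process tokens one step at a time.
  • Direct attention paths capture long-range dependencies more easily: any two tokens are connected by a single attention step, regardless of how far apart they are.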
  • The design scales smoothly to billions of parameters.
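
As a rough illustration of the first two points, the sketch below contrasts a toy recurrent layer with a toy attention layer. Both are simplified assumptions (a plain tanh RNN, single-head attention with identity projections, hypothetical function names) and are meant only to show where the sequential dependency sits.

```python
import numpy as np

def rnn_forward(x, w_x, w_h):
    # Each hidden state needs the previous one, so this loop over time
    # steps cannot be run in parallel.
    h = np.zeros(w_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(x_t @ w_x + h @ w_h)
        states.append(h)
    return np.stack(states)

def attention_forward(x):
    # All pairwise scores come from one matrix product; no position
    # waits on another.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 10, 4, 4
x = rng.normal(size=(seq_len, d_model))
w_x = rng.normal(scale=0.5, size=(d_model, d_hidden))
w_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))

rnn_states = rnn_forward(x, w_x, w_h)
attn_out, weights = attention_forward(x)

# weights[-1, 0] is the direct link from the last token to the first;
# in the RNN that information has to pass through every step in between.
print(rnn_states.shape, attn_out.shape, weights.shape)
```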

Key idea

Transformers replace recurrence with parallel multi-head self-attention, combined with feed-forward layers, residual connections, layer normalization, and positional encodings. Together these enable fast training and large scale.

Check yourself

1. What replaces recurrence in a transformer?

2. Why do transformers add positional encodings?

3. What surrounds each sublayer in a transformer block?