The architecture
The transformer processes a whole sequence in parallel using stacked blocks built from attention rather than recurrence.
Each block contains:
- Multi-head self-attention, which runs several attention computations in parallel, concatenates their outputs, and projects the result back to the model dimension.
- A position-wise feedforward network, applied to each token independently with the same weights.
- Residual connections and layer normalization around each sublayer.
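The three components above can be sketched as a single block in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it applies layer normalization after each residual addition (the post-norm arrangement; other orderings exist), omits the learned scale and bias of layer normalization, uses ReLU in the feedforward network, and leaves out masking, dropout, and batching. The helper names (init_params, transformer_block, and so on) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector; learned scale and bias are omitted (assumption).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):
        # (seq, d_model) -> (heads, seq, d_head)
        return t.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    # Scaled dot-product attention, computed for all heads at once.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    heads = softmax(scores, axis=-1) @ v                 # (heads, seq, d_head)
    # Concatenate the heads and project back to d_model.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

def feed_forward(x, w1, b1, w2, b2):
    # The same two-layer network is applied to every token row.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def transformer_block(x, p, num_heads):
    # Residual connection plus layer normalization around each sublayer.
    attn = multi_head_self_attention(x, p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads)
    x = layer_norm(x + attn)
    x = layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))
    return x

def init_params(d_model, d_ff, seed=0):
    # Hypothetical helper: small random weights for illustration only.
    rng = np.random.default_rng(seed)

    def w(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))

    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model), "w_o": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

x = np.random.default_rng(1).normal(size=(5, 16))  # 5 tokens, d_model = 16
out = transformer_block(x, init_params(16, 64), num_heads=4)
print(out.shape)  # (5, 16)
```

The block maps a (sequence length, model dimension) array to one of the same shape, which is what allows blocks to be stacked.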
Position information
Self-attention is order-agnostic: permuting the input tokens simply permutes the outputs in the same way. Transformers therefore add positional encodings to the token representations to inject sequence order.
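One common choice is a fixed sinusoidal encoding, sketched below in NumPy. The text above does not say which scheme is meant, so the sinusoidal form, the base of 10000, and the requirement that d_model be even are assumptions made for illustration; learned position embeddings are another option. The function name and the token_embeddings array are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    # Assumes an even d_model so sine and cosine columns pair up.
    assert d_model % 2 == 0, "d_model must be even in this sketch"
    positions = np.arange(seq_len)[:, None]            # (seq, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even columns
    pe[:, 1::2] = np.cos(angles)  # odd columns
    return pe

# Hypothetical token embeddings: 6 tokens, d_model = 8.
token_embeddings = np.random.default_rng(0).normal(size=(6, 8))
x = token_embeddings + sinusoidal_positional_encoding(6, 8)
print(x.shape)  # (6, 8)
```

Because every position receives a different vector, two identical tokens at different positions no longer have identical representations, so attention can distinguish them.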
Why it won
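- Parallelism over the sequence makes training far faster than with RNNs, which must process tokens one at a time.
- Attention connects any two positions directly, so long-range dependencies do not have to survive many recurrent steps.
- The design scales to models with billions of parameters.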
Key idea
Transformers replace recurrence with parallel multi-head self-attention, plus feedforward layers, residual connections, layer normalization, and positional encodings, enabling fast training and large-scale models.