One block, repeated
A transformer is mostly the same block stacked many times. Each block has two sublayers: an attention sublayer that lets tokens look at one another, and a feed-forward sublayer that processes each position on its own. Stacking dozens of these blocks builds depth.
What a block does
- The attention sublayer mixes information across positions in the sequence.
- The feed-forward sublayer transforms each token independently, adding nonlinear capacity.
- Each sublayer is wrapped with a residual connection and layer normalization for stable training.
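The three points above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: it uses a single attention head, a ReLU feed-forward layer, layer normalization without learned gain or bias, and the "pre-norm" ordering (normalize, transform, add back). The names `block`, `init_params`, and the weight keys are made up for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance.
    # (Learned gain and bias are omitted to keep the sketch short.)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, w_q, w_k, w_v, w_o):
    # Single-head self-attention: every position attends to every position.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v @ w_o

def feed_forward(x, w_1, w_2):
    # Two-layer MLP applied to each row (token) separately.
    return np.maximum(x @ w_1, 0.0) @ w_2

def block(x, p):
    # Assumed pre-norm ordering: normalize, run the sublayer, add the residual.
    x = x + attention(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"])
    x = x + feed_forward(layer_norm(x), p["w_1"], p["w_2"])
    return x

def init_params(rng, d_model, d_ff):
    # Hypothetical helper: small random weights for one block.
    shapes = {
        "w_q": (d_model, d_model), "w_k": (d_model, d_model),
        "w_v": (d_model, d_model), "w_o": (d_model, d_model),
        "w_1": (d_model, d_ff), "w_2": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))          # 5 tokens, 16 features each
params = init_params(rng, d_model=16, d_ff=64)
y = block(x, params)
print(y.shape)  # (5, 16)
```

Each sublayer only adds to its input, which is what the residual connection means here. Real models usually add multiple heads, masking, and learned normalization parameters on top of this skeleton.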
Why the split matters
Attention handles relationships between tokens, such as a pronoun finding its noun. The feed-forward sublayer handles per-token computation, such as recalling stored facts. Separating the two keeps each job simple and lets the model scale by adding more blocks of the same design rather than redesigning anything.
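One way to see the split is to change a single token and check which output positions move. The sketch below assumes a stripped-down attention with no learned projections and a small ReLU feed-forward layer with random weights; `self_attention`, `feed_forward`, and `changed_rows` are illustrative names, not part of any library.

```python
import numpy as np

def self_attention(x):
    # Simplified self-attention with identity projections (an assumption
    # to keep the sketch short): each output row is a weighted average
    # of all input rows.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def feed_forward(x, w_1, w_2):
    # Per-position MLP: row i of the output depends only on row i of x.
    return np.maximum(x @ w_1, 0.0) @ w_2

def changed_rows(a, b):
    # Indices of positions whose output vectors differ.
    return np.flatnonzero(~np.isclose(a, b).all(axis=-1))

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))           # 4 tokens, 8 features each
w_1 = rng.normal(size=(8, 32))
w_2 = rng.normal(size=(32, 8))

x_perturbed = x.copy()
x_perturbed[2] += 1.0                 # change only the token at position 2

ffn_changed = changed_rows(feed_forward(x, w_1, w_2),
                           feed_forward(x_perturbed, w_1, w_2))
attn_changed = changed_rows(self_attention(x), self_attention(x_perturbed))

print("feed-forward positions changed:", ffn_changed)
print("attention positions changed:", attn_changed)
```

The feed-forward output should differ only at position 2, while the attention output should differ at other positions as well, because every position's weighted average includes the changed token.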
The shape of the data
A sequence enters as a matrix of token vectors, one row per position, and leaves with the same shape, so blocks compose cleanly. Depth comes from repetition, not from changing the interface between layers.
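The shape contract is what makes stacking trivial, and it can be checked directly. The sketch below uses a deliberately tiny stand-in for a block (a residual update with a `tanh` layer) instead of real attention and feed-forward sublayers; `make_toy_block` and `run_stack` are hypothetical names chosen for this example.

```python
import numpy as np

def make_toy_block(rng, d_model):
    # Hypothetical stand-in for a real block: a residual update that
    # maps (seq_len, d_model) to (seq_len, d_model). Each block gets
    # its own weights.
    w = rng.normal(0.0, 0.02, size=(d_model, d_model))
    return lambda x: x + np.tanh(x @ w)

def run_stack(x, blocks):
    # Apply the blocks in order, checking the interface at every step.
    for i, blk in enumerate(blocks):
        out = blk(x)
        assert out.shape == x.shape, f"block {i} changed the shape"
        x = out
    return x

rng = np.random.default_rng(0)
seq_len, d_model, depth = 6, 16, 12
blocks = [make_toy_block(rng, d_model) for _ in range(depth)]

x = rng.normal(size=(seq_len, d_model))
y = run_stack(x, blocks)
print(x.shape, y.shape)  # (6, 16) (6, 16)
```

Changing `depth` is the only edit needed to make the stack deeper, since no block has to know what came before or after it.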
Key idea
A transformer is a stack of structurally identical blocks, each pairing a cross-position attention sublayer with a per-position feed-forward sublayer, both wrapped in residual connections and normalization.