Machine Learning

The Feed-Forward Network

The per-position expander that holds much of a transformer's capacity.

The position-wise transform

After attention mixes information across tokens, each position is processed alone by a feed-forward network. It is the same small network, with the same weights, applied independently to every token vector in the sequence.

Its simple shape

  • A linear layer expands the dimension, often by a factor of four.
  • A nonlinearity such as GELU or ReLU is applied.
  • A second linear layer contracts back to the model dimension.
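
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the sizes (`d_model = 8`, `d_ff = 32`), the random weights, and the names `feed_forward` and `gelu` are assumptions chosen for the example, and GELU uses its common tanh approximation.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (an assumption; exact GELU uses the normal CDF).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def feed_forward(x, w1, b1, w2, b2):
    """Map x of shape (seq_len, d_model) to an output of the same shape."""
    hidden = gelu(x @ w1 + b1)  # expand to d_ff, then apply the nonlinearity
    return hidden @ w2 + b2     # contract back to d_model

# Illustrative sizes: the hidden width is four times the model dimension.
rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5

w1 = rng.normal(scale=0.1, size=(d_model, d_ff))
b1 = np.zeros(d_ff)
w2 = rng.normal(scale=0.1, size=(d_ff, d_model))
b2 = np.zeros(d_model)

x = rng.normal(size=(seq_len, d_model))  # one vector per token
y = feed_forward(x, w1, b1, w2, b2)
print(y.shape)  # (5, 8)
```

The matrix products act on each row of `x` separately, so one call processes every position with the same weights.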

Why expand then contract

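The wide hidden layer gives the network room to compute rich nonlinear features before squeezing back down. This expand-and-contract pattern is where a large share of a transformer's parameters, and much of its stored knowledge, lives: with a fourfold expansion, the two feed-forward matrices hold about twice as many weights as the four attention projections in a standard block.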
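
To make the parameter share concrete, here is a rough count for a single block. It assumes a standard layout with four square attention projections (query, key, value, output) and a fourfold expansion; `d_model = 768` is only an illustrative size, and the helper names are hypothetical. Embeddings and layer norms are left out.

```python
def ffn_params(d_model, expansion=4, bias=True):
    # Two matrices: d_model x d_ff and d_ff x d_model.
    d_ff = expansion * d_model
    weights = 2 * d_model * d_ff
    biases = (d_ff + d_model) if bias else 0
    return weights + biases

def attention_params(d_model, bias=True):
    # Assumes four d_model x d_model projections: query, key, value, output.
    weights = 4 * d_model * d_model
    biases = 4 * d_model if bias else 0
    return weights + biases

d_model = 768  # illustrative size, an assumption for this example
ffn = ffn_params(d_model)
attn = attention_params(d_model)
share = ffn / (ffn + attn)

print(ffn)   # 4722432
print(attn)  # 2362368
print(round(share, 2))  # 0.67
```

Under these assumptions the feed-forward network accounts for about two thirds of the block's parameters.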

Independent per token

Because the feed-forward network sees one position at a time, it cannot move information between tokens. That job belongs entirely to attention. The two roles stay cleanly separated, which makes the block easy to reason about.
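
A quick way to see this independence is to change one token and check that no other output moves. The sketch below uses a simplified network (ReLU, no biases) with made-up sizes and random weights, all of which are assumptions for the example.

```python
import numpy as np

def feed_forward(x, w1, w2):
    # Simplified for the demonstration: ReLU activation, biases omitted.
    return np.maximum(x @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 4, 16, 6
w1 = rng.normal(size=(d_model, d_ff))
w2 = rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(seq_len, d_model))
y = feed_forward(x, w1, w2)

# Replace only the token at position 2 and run the network again.
x_changed = x.copy()
x_changed[2] = rng.normal(size=d_model)
y_changed = feed_forward(x_changed, w1, w2)

# Outputs at every other position should be the same as before.
others = [i for i in range(seq_len) if i != 2]
unchanged = np.allclose(y[others], y_changed[others])

# Shuffling the tokens should shuffle the outputs in the same way.
perm = rng.permutation(seq_len)
equivariant = np.allclose(feed_forward(x[perm], w1, w2), y[perm])

print(unchanged, equivariant)
```

Both checks should come out true: each output row depends only on the input row at the same position.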

Key idea

The feed-forward network applies the same expand, activate, contract transform to each position independently, holding much of the model's parameters while leaving cross-token mixing to attention.

Check yourself

1. What is the typical shape of a transformer feed-forward network?

2. Can the feed-forward network move information between tokens?