The Feed-Forward Network

The position-wise transform

After attention mixes information across tokens, each position is processed on its own by a feed-forward network. The same small network, with the same weights, is applied independently to every token vector in the sequence.

Its simple shape

A linear layer expands each token vector from the model dimension to a wider hidden dimension, often four times larger.
A nonlinearity such as GELU or ReLU is applied.
A second linear layer contracts back to the model dimension.
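
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the sizes `d_model = 8` and `d_hidden = 32`, the random initialization, and the helper names `linear`, `feed_forward`, and `random_matrix` are all assumptions chosen for the example, and real models use tensor libraries instead of nested lists.

```python
import math
import random

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def linear(x, weight, bias):
    # weight has shape (out_dim, in_dim); computes weight @ x + bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def feed_forward(x, w1, b1, w2, b2):
    hidden = linear(x, w1, b1)           # expand: d_model -> d_hidden
    hidden = [gelu(h) for h in hidden]   # nonlinearity
    return linear(hidden, w2, b2)        # contract: d_hidden -> d_model

def random_matrix(rows, cols, rng):
    # Hypothetical initialization, only so the example has weights to use.
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_hidden = 8, 32  # illustrative sizes with a 4x expansion

w1, b1 = random_matrix(d_hidden, d_model, rng), [0.0] * d_hidden
w2, b2 = random_matrix(d_model, d_hidden, rng), [0.0] * d_model

token = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
out = feed_forward(token, w1, b1, w2, b2)
print(len(token), len(out))  # input and output share the model dimension
```

The output has the same length as the input, which is what lets the result be added back to the token vector and passed to the next block.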

Why expand then contract

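The wide hidden layer gives the network room to compute rich nonlinear features before projecting back down to the model dimension. The two weight matrices in this expand-and-contract pattern hold a large share of a transformer's parameters (roughly two thirds of each block's weights at a 4x expansion), and much of the model's stored knowledge is thought to live there.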
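
A quick count shows why the share is large. The sketch below assumes a standard block in which attention uses four square projection matrices (query, key, value, output) and the feed-forward network uses a 4x expansion; layer norms and embeddings are ignored, and `d_model = 768` is only an illustrative size.

```python
def attention_params(d_model):
    # Assumed layout: Q, K, V and output projections, each d_model x d_model
    # with a bias vector.
    return 4 * (d_model * d_model + d_model)

def ffn_params(d_model, expansion=4):
    d_hidden = expansion * d_model
    expand = d_model * d_hidden + d_hidden    # first linear layer + bias
    contract = d_hidden * d_model + d_model   # second linear layer + bias
    return expand + contract

d_model = 768  # illustrative model dimension
ffn = ffn_params(d_model)
attn = attention_params(d_model)
share = ffn / (ffn + attn)

print(ffn, attn)        # 4722432 2362368
print(round(share, 3))  # 0.667
```

Under these assumptions the feed-forward weights outnumber the attention weights about two to one, regardless of the model dimension.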

Independent per token

Because the feed-forward network sees one position at a time, it cannot move information between tokens. That job belongs entirely to attention. Keeping the two roles cleanly separated makes the block easy to reason about.
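
The independence can be checked directly. In the sketch below, which uses ReLU and small made-up sizes for brevity, changing one token leaves the outputs at every other position untouched, and reordering the tokens simply reorders the outputs. All names and values are hypothetical.

```python
import random

def linear(x, weight, bias):
    # weight has shape (out_dim, in_dim); computes weight @ x + bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def feed_forward(x, w1, b1, w2, b2):
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]  # expand, then ReLU
    return linear(hidden, w2, b2)                      # contract

def ffn_sequence(seq, params):
    # Same weights at every position; no term reads a neighbouring token.
    return [feed_forward(tok, *params) for tok in seq]

rng = random.Random(0)
d_model, d_hidden, seq_len = 4, 16, 3  # illustrative sizes

def rand_mat(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

params = (rand_mat(d_hidden, d_model), [0.0] * d_hidden,
          rand_mat(d_model, d_hidden), [0.0] * d_model)
seq = rand_mat(seq_len, d_model)  # three token vectors

out = ffn_sequence(seq, params)

# Replace the first token only; positions 1 and 2 are unaffected.
changed = [[v + 1.0 for v in seq[0]]] + seq[1:]
out_changed = ffn_sequence(changed, params)
print(out_changed[1:] == out[1:])  # True

# Reversing the token order just reverses the outputs.
out_reversed = ffn_sequence(seq[::-1], params)
print(out_reversed == out[::-1])   # True
```

An attention layer would fail the first check, since every output position there depends on the other tokens in the sequence.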

Key idea

The feed-forward network applies the same expand, activate, contract transform to each position independently. It holds a large share of the model's parameters and leaves cross-token mixing to attention.
