Looking only backward
A language model trained to predict the next token must not peek at future tokens. The causal mask enforces this by blocking each position from attending to any position that comes after it.
How the mask works
- Build the full matrix of attention scores.
- For every query, set scores to negative infinity for keys at later positions.
- After softmax, those masked entries receive zero weight.
Why negative infinity
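Softmax exponentiates each score before normalizing. The exponential of negative infinity is exactly zero (and of any very large negative number, effectively zero), so masked positions contribute nothing. Because normalization runs only over what is left, the remaining weights still sum to one over the allowed past and present tokens.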
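
A small worked example may make this concrete. The sketch below uses only the standard library; the helper `softmax` and the score values are made up for illustration. It treats one query row in which the last two keys are in the future.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Two allowed keys (scores 2.0 and 1.0) followed by two masked future keys.
masked_row = [2.0, 1.0, float("-inf"), float("-inf")]
row_weights = softmax(masked_row)
print(row_weights)
```

The two masked entries come out as exactly 0.0, and the two allowed entries split the full weight between them (roughly 0.73 and 0.27), just as if the future keys had never been in the row.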
Training efficiency
The mask lets the model compute predictions for every position in a sequence in one pass, while guaranteeing that each prediction uses only the context at or before its own position. This is what makes next-token training both correct and efficient.
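
One way to check the guarantee is to run a single pass, alter the last token, run again, and confirm that the outputs at earlier positions do not move. The sketch below does this for a single attention head in NumPy. The function `causal_self_attention`, the random projection matrices `Wq`, `Wk`, `Wv`, and the sizes `T` and `d` are all hypothetical stand-ins, not part of any real model.

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention over a (T, d) input, in one pass."""
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Mask keys that come after each query.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax with the usual max subtraction for stability.
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ v  # one output row per position

rng = np.random.default_rng(0)
T, d = 5, 8  # assumed sequence length and width
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
x = rng.normal(size=(T, d))
out = causal_self_attention(x, Wq, Wk, Wv)

# Replace only the final token and recompute everything.
x_changed = x.copy()
x_changed[-1] = rng.normal(size=d)
out_changed = causal_self_attention(x_changed, Wq, Wk, Wv)

# Earlier positions never saw the final token, so their outputs match.
unchanged_prefix = np.allclose(out[:-1], out_changed[:-1])
print(unchanged_prefix)
```

Each call produces an output for all five positions at once, yet changing the final token affects only the final row. That independence is what allows every position to serve as a separate training example within the same pass.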
Key idea
A causal mask sets attention scores to negative infinity for future positions so each token can only attend to itself and the past, letting one forward pass train honest next-token prediction at every position.