Why mask at all
Attention by default lets every token see every other token. Often we must forbid some connections. A mask sets certain scores to a very large negative number before softmax, so those positions receive essentially zero weight.
The padding mask
Batches contain sequences of different lengths, padded to a common size with filler tokens. A padding mask blocks attention to those filler positions so they cannot leak meaningless signal into real tokens.
The causal mask
In a decoder that generates text left to right, a token must not peek at future tokens. A causal mask, shaped like a lower triangle, blocks every position from attending to any later position. This is what makes autoregressive language modeling honest at training time.
Combining masks
- Apply both: block padding and block the future.
- Masks are additive in score space, so combining is just adding the negative entries.
Encoder versus decoder
- Encoders usually use only a padding mask, so tokens see the full sentence.
- Decoders use causal plus padding masks.
Key idea
Masks add large negative values to forbidden scores before softmax: padding masks hide filler tokens while causal masks hide future tokens, and decoders combine both so generation stays left to right.