Types of Attention Masks

Why mask at all

By default, attention lets every token see every other token, but often some connections must be forbidden. A mask pushes the scores of those connections to a very large negative number (or negative infinity) before softmax, so the masked positions receive essentially zero weight.
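
To make the mechanism concrete, here is a minimal sketch of masking a single row of attention scores before softmax. The names masked_softmax, scores, and allowed are illustrative, not taken from any library, and the sketch assumes negative infinity as the "very large negative number" and at least one allowed position per row.

```python
import math

NEG_INF = float("-inf")  # assumption: use -inf as the masking value


def masked_softmax(scores, allowed):
    """Softmax over one row of scores; positions with allowed=False get weight 0.

    Assumes at least one position is allowed, otherwise the row is undefined.
    """
    # Replace forbidden scores with -inf before normalizing.
    masked = [s if ok else NEG_INF for s, ok in zip(scores, allowed)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) is exactly 0.0
    total = sum(exps)
    return [e / total for e in exps]


# The third position has the highest raw score but is masked out.
weights = masked_softmax([2.0, 1.0, 3.0], [True, True, False])
```

The masked position ends up with zero weight, and the remaining weights are renormalized over the allowed positions only.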

The padding mask

Batches contain sequences of different lengths, padded to a common size with filler tokens. A padding mask blocks attention to those filler positions so they cannot leak meaningless signal into real tokens.
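
As an illustration, a padding mask can be derived from the true length of each sequence in the batch. This is a sketch under the assumption that padding sits at the end of each sequence; padding_mask and its parameters are hypothetical names chosen for this example.

```python
def padding_mask(lengths, max_len):
    """Return allowed[b][j]: True when position j of sequence b is a real token.

    Assumes right-padding, i.e. filler tokens come after the real ones.
    """
    return [[j < n for j in range(max_len)] for n in lengths]


# Two sequences of true lengths 3 and 1, padded to a common length of 4.
mask = padding_mask([3, 1], max_len=4)
```

Each row marks which key positions may be attended to; the False entries are the filler positions whose scores would be pushed to a large negative value.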

The causal mask

In a decoder that generates text left to right, a token must not peek at future tokens. A causal mask blocks every position from attending to any later position, so the allowed connections form a lower triangle (including the diagonal). During training the whole target sequence is processed at once, so without this mask each position could simply read the token it is supposed to predict.
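
The lower-triangular shape is easy to see in a small sketch. Here causal_mask is a hypothetical helper, with rows standing for the attending (query) position and columns for the attended (key) position.

```python
def causal_mask(n):
    """Return allowed[i][j]: True when position i may attend to position j.

    Position i sees itself and everything before it (j <= i), never later positions.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


causal = causal_mask(3)
for row in causal:
    # Print 1 for an allowed connection and 0 for a blocked one.
    print(" ".join("1" if ok else "0" for ok in row))
```

The first position can attend only to itself, while the last position can attend to the entire prefix.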

Combining masks

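When a decoder works on padded batches, apply both: block padding and block the future.
Masks are additive in score space: each mask holds zero where attention is allowed and a large negative value where it is forbidden, so combining them is just adding the masks elementwise. A connection stays open only if both masks allow it.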
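
A small sketch of combining the two masks in additive form, for a single sequence. The helper names are hypothetical, and the sketch assumes negative infinity for blocked entries and right-padding; implementations that use a large finite negative value instead may need to guard against overflow when summing.

```python
NEG_INF = float("-inf")  # assumption: blocked entries are -inf, allowed are 0.0


def additive_causal(n):
    """0.0 where position i may attend to position j (j <= i), -inf otherwise."""
    return [[0.0 if j <= i else NEG_INF for j in range(n)] for i in range(n)]


def additive_padding(length, n):
    """0.0 for real key positions (j < length), -inf for padded ones, for every row."""
    return [[0.0 if j < length else NEG_INF for j in range(n)] for _ in range(n)]


def combine(a, b):
    """Elementwise sum: an entry stays 0.0 only if both masks allow it."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# One sequence with 3 real tokens padded to length 4.
combined = combine(additive_causal(4), additive_padding(3, 4))
```

The combined mask is then added to the raw attention scores before softmax, so any entry blocked by either mask receives zero weight.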

Encoder versus decoder

Encoders usually use only a padding mask, so tokens see the full sentence.
Decoders use causal plus padding masks.

Key idea

Masks add large negative values to forbidden scores before softmax: padding masks hide filler tokens while causal masks hide future tokens, and decoders combine both so generation stays left to right.