Machine Learning

Attention Mask Types

Padding masks, causal masks, and how they shape what a token may see.

5 min read · core

Why mask at all

Attention by default lets every token see every other token. Often we must forbid some connections. A mask sets certain scores to a very large negative number before softmax, so those positions receive essentially zero weight.
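As a rough illustration of that step, here is a minimal NumPy sketch of a masked softmax. The function name `masked_softmax`, the boolean `allowed` array, and the constant `-1e9` are assumptions chosen for this example, not part of any particular library.

```python
import numpy as np

def masked_softmax(scores, allowed):
    """Softmax over the last axis; positions where `allowed` is False get ~0 weight."""
    # Replace forbidden scores with a very large negative number (assumed: -1e9).
    masked = np.where(allowed, scores, -1e9)
    # Subtract the row maximum for numerical stability before exponentiating.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

# One query scoring three keys; the third key is forbidden.
scores = np.array([[2.0, 1.0, 0.5]])
allowed = np.array([[True, True, False]])
weights = masked_softmax(scores, allowed)
```

The forbidden position ends up with a weight of essentially zero, and the remaining weights still sum to one across the allowed positions.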

The padding mask

Batches contain sequences of different lengths, padded to a common size with filler tokens. A padding mask blocks attention to those filler positions so they cannot leak meaningless signal into real tokens.
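One common way to build such a mask is from the true length of each sequence. The sketch below assumes a hypothetical batch of two sequences with lengths 3 and 1, padded to length 4, and uses the convention that `True` means "may be attended to".

```python
import numpy as np

lengths = np.array([3, 1])  # hypothetical true lengths of two sequences
max_len = 4                 # common padded length (assumed)

# is_real[b, j] is True when position j of sequence b holds a real token.
is_real = np.arange(max_len)[None, :] < lengths[:, None]

# Add a query axis so the same key mask applies to every query position:
# shape (batch, 1, max_len) broadcasts against scores of shape (batch, max_len, max_len).
padding_mask = is_real[:, None, :]
```

Here the mask only hides padded keys. Rows belonging to padded queries still produce outputs, which are typically ignored later, for example when computing the loss.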

The causal mask

In a decoder that generates text left to right, a token must not peek at future tokens. A causal mask, shaped like a lower triangle, blocks every position from attending to any later position. This is what keeps autoregressive language modeling honest at training time: the model sees the whole sequence at once, and without the mask it could predict each next token simply by reading it.
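The lower-triangular shape can be sketched in a couple of lines of NumPy. The sequence length of 4 is an arbitrary choice for illustration, and `True` again means "may attend".

```python
import numpy as np

seq_len = 4  # assumed sequence length for the example

# causal_mask[i, j] is True when query position i may attend to key position j,
# that is, when j <= i. np.tril keeps the lower triangle including the diagonal.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
```

The first row allows only position 0, while the last row allows every position up to and including itself.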

Combining masks

  • Apply both: block padding and block the future.
  • Masks are additive in score space: each mask holds zero at allowed positions and a large negative value at forbidden ones, so combining masks is just adding them together.
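The two points above can be sketched with additive masks in NumPy. The constant `NEG = -1e9`, the length values, and the all-zero scores are assumptions made to keep the example small.

```python
import numpy as np

NEG = -1e9   # assumed "very large negative" value
seq_len = 4
length = 3   # hypothetical: the last position is padding

# Additive causal mask: 0 where attention is allowed, NEG above the diagonal.
causal_bias = np.where(np.tril(np.ones((seq_len, seq_len), dtype=bool)), 0.0, NEG)
# Additive padding mask over keys, shape (1, seq_len) so it broadcasts over queries.
padding_bias = np.where(np.arange(seq_len) < length, 0.0, NEG)[None, :]

# Combining is a plain sum; a position blocked by either mask stays blocked.
combined_bias = causal_bias + padding_bias

scores = np.zeros((seq_len, seq_len))  # uniform scores make the effect easy to read
masked = scores + combined_bias
masked = masked - masked.max(axis=-1, keepdims=True)
weights = np.exp(masked)
weights = weights / weights.sum(axis=-1, keepdims=True)
```

With uniform scores, each row spreads its weight evenly over the positions that are both real and not in the future. A position blocked by both masks receives twice the negative value, which makes no practical difference after softmax.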

Encoder versus decoder

  • Encoders usually use only a padding mask, so tokens see the full sentence.
  • Decoders use causal plus padding masks.
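The encoder and decoder cases differ only in whether the causal triangle is included. A hypothetical helper, here called `build_mask`, might look like this; its name and signature are assumptions for illustration.

```python
import numpy as np

def build_mask(lengths, max_len, causal):
    """Boolean mask of shape (batch, max_len, max_len); True means 'may attend'."""
    # Padding part: keys beyond each sequence's true length are hidden.
    is_real = np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]
    mask = np.broadcast_to(is_real[:, None, :], (len(lengths), max_len, max_len))
    if causal:
        # Decoder case: additionally hide every key that lies in the future.
        mask = mask & np.tril(np.ones((max_len, max_len), dtype=bool))
    return mask

encoder_mask = build_mask([3], max_len=4, causal=False)  # padding only
decoder_mask = build_mask([3], max_len=4, causal=True)   # causal plus padding
```

In the encoder mask the first token can see all three real tokens, whereas in the decoder mask it can see only itself.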

Key idea

Masks add large negative values to forbidden scores before softmax: padding masks hide filler tokens while causal masks hide future tokens, and decoders combine both so generation stays left to right.

Check yourself

1. What shape does a causal mask have?

2. What does a padding mask prevent?

3. How is a mask applied in score space?