Machine Learning

The Causal Attention Mask

The simple trick that lets a model predict the next token honestly.

4 min read · core

Looking only backward

A language model trained to predict the next token must not peek at future tokens. The causal mask enforces this by blocking each position from attending to any position that comes after it.

How the mask works

  • Build the full matrix of attention scores.
  • For every query position, set the scores of keys at later positions to negative infinity.
  • After softmax, those masked entries receive zero weight.
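
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name causal_attention_weights and the random 4 x 4 score matrix are assumptions made for the example, and real models would compute the scores from learned query and key projections.

```python
import numpy as np

def causal_attention_weights(scores):
    """Apply a causal mask to a square (T, T) score matrix, then softmax each row."""
    T = scores.shape[0]
    # True above the diagonal: key position j comes after query position i.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    # Step 2: masked scores become negative infinity.
    masked = np.where(future, -np.inf, scores)
    # Step 3: softmax over keys. Subtracting the row maximum keeps it numerically
    # stable; the diagonal is never masked, so every row has a finite maximum.
    masked = masked - masked.max(axis=-1, keepdims=True)
    exp = np.exp(masked)
    return exp / exp.sum(axis=-1, keepdims=True)

# Step 1: a stand-in for the full matrix of attention scores (hypothetical values).
rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.normal(size=(4, 4)))
print(np.round(weights, 3))
```

The printed matrix is lower triangular: row i has nonzero weights only in columns 0 through i, and the first row puts all of its weight on the first token.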

Why negative infinity

Softmax exponentiates each score. The exponential of negative infinity is exactly zero, and that of a very large negative number is effectively zero, so masked positions contribute nothing. The remaining weights still sum to one over the allowed past and present tokens.
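
A small check of this claim on a single row of scores, as a sketch. The score values are made up for illustration, and the softmax helper is written out here only so the example is self-contained.

```python
import numpy as np

def softmax(x):
    # Subtract the maximum for numerical stability; exp(-inf) evaluates to 0.0.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

# One query row: two allowed keys, two masked (future) keys.
row = np.array([2.0, 1.0, -np.inf, -np.inf])
w = softmax(row)

# Softmax over the allowed keys alone gives the same nonzero weights.
allowed_only = softmax(row[:2])

print(w)
print(allowed_only)
```

The masked entries come out as exactly zero, and the two surviving weights match a softmax taken over the allowed scores alone, so they sum to one.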

Training efficiency

The mask lets the model compute predictions for every position in a sequence in a single forward pass while guaranteeing that each prediction uses only the current and earlier tokens. This is what makes next-token training both correct and efficient.
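
One way to see the guarantee is to run a single pass, change only the last token, and run again: the outputs at all earlier positions should not move. The sketch below assumes a stripped-down single-head attention where queries, keys, and values are all the raw input x (no learned projections); the function name and the random inputs are hypothetical.

```python
import numpy as np

def causal_self_attention(x):
    """Single-head causal attention with queries = keys = values = x (a simplification)."""
    T, d = x.shape
    scores = x @ x.T / np.sqrt(d)
    # Mask keys that come after each query position.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax, stabilised by subtracting the row maximum.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    # One matrix product yields an output for every position at once.
    return w @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))       # 5 tokens, 8 features each (made-up data)
out = causal_self_attention(x)

# Replace only the final token and recompute.
x_changed = x.copy()
x_changed[-1] = rng.normal(size=8)
out_changed = causal_self_attention(x_changed)

# Outputs at positions 0..3 never saw the last token, so they are unaffected.
unchanged = np.allclose(out[:-1], out_changed[:-1])
print(unchanged)
```

All five outputs come from one call, yet each row depends only on its own token and the ones before it, which is the property next-token training relies on.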

Key idea

A causal mask sets attention scores to negative infinity for future positions so each token can attend only to itself and the past, letting one forward pass train honest next-token prediction at every position.

Check yourself

1. What does the causal mask prevent a token from doing?

2. Why are masked scores set to negative infinity?