Masked Language Modeling

The objective

Masked language modeling, the training objective behind encoder models like BERT, hides a fraction of the tokens in a sentence and asks the model to recover them. Unlike in next token prediction, the model can draw on context on both sides of each blank.
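To make "recover them" concrete, here is a minimal sketch of how such an objective is typically scored: the loss is the average negative log-likelihood of the original token, counted only at hidden positions. The function name `mlm_loss`, the use of plain dictionaries for per-position probabilities, and the convention that `None` marks an unmasked position are all assumptions made for illustration, not part of any particular library.

```python
import math


def mlm_loss(predicted_probs, labels):
    """Average negative log-likelihood over masked positions only.

    predicted_probs: one dict per position mapping token -> probability
                     (hypothetical stand-in for a model's output).
    labels: the original token at each masked position, None elsewhere.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(predicted_probs, labels)
        if label is not None  # unmasked positions contribute nothing
    ]
    # With no masked positions there is nothing to score.
    return sum(losses) / len(losses) if losses else 0.0


# Only the middle position was hidden; the model gave "cat" probability 0.5.
probs = [{"the": 1.0}, {"cat": 0.5, "dog": 0.5}, {"sat": 1.0}]
labels = [None, "cat", None]
loss = mlm_loss(probs, labels)
```

Positions that were never hidden are skipped entirely, so the model is graded only on the blanks it had to fill in.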

How it trains

A common recipe masks about fifteen percent of tokens:

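Replace most of the chosen tokens with a special mask symbol (BERT masks 80 percent of them, swaps 10 percent for random tokens, and leaves 10 percent unchanged)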
Run the full sentence through the transformer encoder
At each masked position, predict the original word from the surrounding context
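The corruption step in the recipe above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not a reference implementation: the function name `mask_tokens`, the `[MASK]` string, and the whole-word "tokens" are hypothetical, and the 80/10/10 split follows the scheme described for BERT. Real pipelines operate on subword IDs and skip special tokens.

```python
import random

MASK = "[MASK]"  # assumed mask symbol; the actual string depends on the tokenizer


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for one sentence.

    labels[i] holds the original token where position i was selected
    for prediction, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected; leave it alone
        labels[i] = tok  # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # 80%: replace with the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return corrupted, labels


sentence = "the cat sat on the mat".split()
vocab = sorted(set(sentence))
corrupted, labels = mask_tokens(sentence, vocab, seed=0)
```

The encoder would then receive `corrupted` as input, and only the positions where `labels` is not `None` would count toward the training loss.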

Because the model sees the entire sentence at once, it builds bidirectional representations that capture how a word relates to everything around it. This makes it well suited to understanding tasks such as classification and entity recognition.

Versus next token prediction

Decoder models predict the next token left to right, which suits generation. Masked modeling instead trains representations aimed at understanding rather than generation, so the two objectives produce complementary kinds of models.
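One way to picture the difference is through which positions each token is allowed to see. The sketch below builds the two visibility patterns as plain boolean grids, where entry `[i][j]` is `True` if position `i` may use position `j`. The helper names `causal_mask` and `bidirectional_mask` are hypothetical, and real models apply such patterns inside attention layers rather than as Python lists.

```python
def causal_mask(n):
    """Left-to-right visibility: position i sees positions 0..i only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Full visibility: every position sees every other position."""
    return [[True] * n for _ in range(n)]


def show(mask):
    """Render a mask as rows of 1s (visible) and 0s (hidden)."""
    return "\n".join(" ".join("1" if v else "0" for v in row) for row in mask)


print("decoder (next token prediction):")
print(show(causal_mask(4)))
print("encoder (masked language modeling):")
print(show(bidirectional_mask(4)))
```

The decoder grid is lower triangular, so each token is predicted from its left context alone, while the encoder grid is fully filled, which is what lets a masked position use words on both sides.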

Key idea

Masked language modeling blanks out tokens and predicts them from both sides of context, producing bidirectional representations strong for understanding tasks.
