Machine Learning

Masked Language Modeling

Teaching a model to fill in blanked-out words using context from both sides.

5 min read · advanced

The objective

Masked language modeling, the training objective behind encoder models like BERT, hides a fraction of the tokens in a sentence and asks the model to recover them. Unlike next-token prediction, the model may look at the words on both sides of each blank.

How it trains

A common recipe selects about 15 percent of the tokens in each sequence as prediction targets:

  • Replace most of the selected tokens with a special mask symbol (BERT swaps a small share for a random token or leaves them unchanged)
  • Run the full sequence through the transformer encoder
  • At each selected position, predict the original token from the surrounding context
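The recipe above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names `mask_tokens` and `masked_loss`, the `[MASK]` string, and the toy probability dictionaries are all assumptions made for the example. The 80/10/10 split between mask symbol, random token, and unchanged token follows the original BERT recipe; other models use different splits.

```python
import math
import random

MASK = "[MASK]"  # assumed mask symbol, as in BERT-style vocabularies


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Corrupt a token list for masked language modeling.

    Returns (corrupted, labels). labels[i] holds the original token at
    positions chosen as prediction targets and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected, no prediction made here
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK  # most selected tokens become the mask symbol
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # some become a random token
        # the remainder stay unchanged but are still predicted
    return corrupted, labels


def masked_loss(pred_probs, labels):
    """Average negative log-likelihood over the selected positions only.

    pred_probs[i] is a dict mapping candidate tokens to probabilities,
    a stand-in for the encoder's output distribution at position i.
    """
    losses = [
        -math.log(probs[label])
        for probs, label in zip(pred_probs, labels)
        if label is not None
    ]
    return sum(losses) / len(losses) if losses else 0.0


tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, vocab=sorted(set(tokens)), seed=0)
print(corrupted)
print(labels)
```

Positions whose label is None contribute nothing to the loss, which is why only a fraction of each sequence produces a training signal per pass.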

Because the model sees the entire sentence at once, it builds bidirectional representations that capture how a word relates to everything around it. That makes it well suited to understanding tasks such as classification and entity recognition.

Versus next-token prediction

Decoder models predict the next token left to right, which suits generation. Masked modeling instead lets every prediction draw on context from both sides, which favors understanding over generation, so the two objectives produce complementary kinds of models.
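In transformer implementations, this difference usually shows up in the attention mask: a decoder lets each position attend only to itself and earlier positions, while an encoder lets each position attend to the whole sequence. The sketch below illustrates that contrast; `visible_positions` is a hypothetical helper written for this lesson, not a library function.

```python
def visible_positions(seq_len, causal):
    """For each position i, list the positions j it may attend to.

    causal=True  -> left-to-right decoder: only j <= i is visible.
    causal=False -> bidirectional encoder: every position is visible.
    """
    return [
        [j for j in range(seq_len) if not causal or j <= i]
        for i in range(seq_len)
    ]


print(visible_positions(3, causal=True))   # [[0], [0, 1], [0, 1, 2]]
print(visible_positions(3, causal=False))  # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]
```

Under the causal mask, the first position sees nothing but itself, so it cannot use any right-hand context to fill a blank. Under the full mask, a blank in the middle of a sentence is informed by words on both sides.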

Key idea

Masked language modeling blanks out tokens and predicts them from context on both sides, producing bidirectional representations that are well suited to understanding tasks.

Check yourself

1. What makes masked language modeling bidirectional?

2. Masked language modeling is best suited for which kind of task?