The objective
Masked language modeling, the training objective behind encoder models like BERT, hides a fraction of the tokens in a sentence and asks the model to recover them. Unlike next token prediction, the model may look at words on both sides of each blank.
How it trains
A common recipe masks about fifteen percent of tokens:
- Replace some chosen tokens with a special mask symbol
- Run the full sentence through the transformer encoder
- At each masked position, predict the original word from the surrounding context
Because the model sees the entire sentence at once, it builds bidirectional representations that capture how a word relates to everything around it, ideal for understanding tasks like classification and entity recognition.
Versus next token prediction
Decoder models predict the next token left to right, which suits generation. Masked modeling instead optimizes for rich understanding, so the two objectives produce complementary kinds of models.
Key idea
Masked language modeling blanks out tokens and predicts them from both sides of context, producing bidirectional representations strong for understanding tasks.