The Pretraining Objective

What pretraining optimizes

A base language model is trained with one deceptively simple objective: predict the next token given all the tokens before it. Over trillions of tokens scraped from the web, books, and code, the model learns to assign high probability to plausible continuations.

Why this works

The loss is the cross-entropy between the model's predicted distribution and the true next token, averaged over every position in the training text.
To predict well, the model must implicitly learn grammar, facts, reasoning patterns, and style.
No human labels are needed because the text itself supplies the targets, so the data can be enormous. This is self-supervised learning.
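
To make the loss concrete, here is a minimal sketch of next-token cross-entropy in plain Python. It assumes the model has already produced one vector of raw scores (logits) per position; the function names, the three-token vocabulary, and the logit values are all made up for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the (hypothetical) model scores over the
    vocabulary after reading token_ids[: t + 1]; the target is token_ids[t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        # Cross-entropy against a one-hot target is -log p(target).
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy example: vocabulary of 3 tokens, sequence [0, 2, 1].
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # model knows nothing
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0]]  # model favors the true tokens

uniform_loss = next_token_loss(uniform_logits, token_ids)
confident_loss = next_token_loss(confident_logits, token_ids)
```

A model that spreads probability evenly over the vocabulary scores a loss of log(3) here, while one that puts most of its mass on the true next token scores much lower. Note that the targets come straight from the sequence itself, which is why no labeling is required.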

What you get and do not get

You get a model that is fluent and knowledgeable but not aligned to follow instructions or refuse harmful requests.
A base model will happily continue a toxic prompt or ramble, because it only models the text distribution it saw.
Alignment steps like fine-tuning come later to shape behavior.
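
The "it only models the text distribution it saw" point can be illustrated with a deliberately tiny stand-in. The sketch below uses a bigram counter rather than a neural network, and the corpus and function names are invented for the example; the only claim is that a pure distribution model continues whatever it is given, with no notion of instructions.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def continue_text(counts, prompt_tokens, max_new_tokens=5):
    """Greedily append the most frequent follower of the last token."""
    out = list(prompt_tokens)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last token never appeared with a follower in training
        out.append(followers.most_common(1)[0][0])
    return out

# Hypothetical toy corpus; a real base model sees trillions of tokens.
corpus = "the cat sat on the mat . the cat sat on the rug .".split()
model = train_bigram(corpus)

continuation = continue_text(model, ["the"], max_new_tokens=3)
unseen = continue_text(model, ["unseen"])
```

Whatever the prompt is, the model extends it with the statistically likeliest next token from its training text. Nothing in the objective distinguishes a request from a sentence to be completed, which is the gap that later alignment steps address.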

The scaling story

Increasing data, parameters, and compute lowers the pretraining loss in a predictable way, with the loss falling roughly as a power law in each of them. These relationships are known as scaling laws.
Lower loss tends to correlate with stronger downstream skills, though not perfectly.
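
As a rough illustration of what "predictable" means here, the sketch below fits a pure power law, loss = a * size^(-alpha), to a few points and extrapolates to a larger model. The data is synthetic and the constants are hypothetical, chosen only to make the example run; published fits often also include a constant floor for irreducible loss, which this sketch leaves out.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    variance = sum((x - x_mean) ** 2 for x in xs)
    slope = covariance / variance
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope  # (a, alpha)

# Hypothetical constants used only to generate synthetic "measurements".
TRUE_A, TRUE_ALPHA = 10.0, 0.1
sizes = [1e6, 1e7, 1e8, 1e9]  # e.g. parameter counts
losses = [TRUE_A * n ** -TRUE_ALPHA for n in sizes]

a, alpha = fit_power_law(sizes, losses)

# Extrapolate one order of magnitude beyond the largest measured size.
predicted_loss = a * 1e10 ** -alpha
```

On a log-log plot a power law is a straight line, so a fit made on small runs can be extended to forecast the loss of a larger one. Real measurements are noisy, so such forecasts are estimates rather than guarantees.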

Key idea

Pretraining uses self-supervised next-token prediction over massive text to build a fluent, knowledgeable base model that is not yet aligned to human intent.
