Machine Learning

The Pretraining Objective

Why next-token prediction over raw text builds a capable base model.

What pretraining optimizes

A base language model is trained with one deceptively simple objective: predict the next token given all the tokens before it. Trained on trillions of tokens drawn from the web, books, and code, the model learns to assign high probability to plausible continuations.
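To make the objective concrete, here is a minimal sketch of how one token sequence turns into training examples. The helper name `next_token_pairs` and the word-level tokens are assumptions for illustration; real systems use subword tokenizers and process all positions in parallel.

```python
def next_token_pairs(tokens):
    """Return (context, target) pairs: every prefix predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Hypothetical word-level tokens, chosen only to keep the example readable.
tokens = ["the", "cat", "sat", "down"]
pairs = next_token_pairs(tokens)

for context, target in pairs:
    print(context, "->", target)
```

A sequence of n tokens yields n - 1 prediction targets, which is why raw text alone supplies so much training signal.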

Why this works

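  • The loss is the cross-entropy between the model's predicted distribution over the vocabulary and the true next token, that is, the negative log-probability the model assigned to the token that actually came next.
  • To predict well, the model must implicitly learn grammar, facts, reasoning patterns, and style.
  • No human labels are needed, because the text itself supplies the targets, so the data can be enormous. This is self-supervised learning.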
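The cross-entropy loss described above can be sketched in a few lines of plain Python. This is an illustrative version under simple assumptions (a tiny vocabulary, raw scores called logits, helper names of our own choosing), not how a training framework implements it.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution over the vocabulary."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    """Negative log-probability assigned to the token that actually came next."""
    probs = softmax(logits)
    return -math.log(probs[target_index])


def mean_loss(all_logits, targets):
    """Average the per-position losses, as a pretraining loss would."""
    losses = [cross_entropy(l, t) for l, t in zip(all_logits, targets)]
    return sum(losses) / len(losses)


# A model with no preference over 4 tokens versus one that favors token 0.
uniform_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], 0)
confident_loss = cross_entropy([5.0, 0.0, 0.0, 0.0], 0)
print(uniform_loss, confident_loss)
```

A uniform guess over a vocabulary of size V costs ln(V), and the loss falls toward zero as the model puts more probability on the correct token.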

What you get and do not get

  • You get a model that is fluent and knowledgeable, but it is not aligned: nothing in the objective trains it to follow instructions or refuse harmful requests.
  • A base model will happily continue a toxic prompt or ramble, because it only models the text distribution it saw.
  • Alignment steps like fine-tuning come later to shape behavior.
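A toy bigram model shows why a base model simply continues whatever it is given. This is a deliberately tiny stand-in, assuming a made-up corpus and greedy decoding, not a real language model, but it follows the same principle: it reproduces the statistics of its training text and has no notion of what it should say.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count which token follows which in the training text."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def continue_text(counts, prompt, max_new_tokens=5):
    """Greedily append the most frequent follower of the last token.

    Assumes a non-empty prompt; stops early on a token never seen in training.
    """
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break
        out.append(followers.most_common(1)[0][0])
    return out


# Hypothetical corpus in which the rude phrasing is more common than the kind one.
corpus = "you are rude . you are rude . you are kind .".split()
model = train_bigram(corpus)
continuation = continue_text(model, ["you"], max_new_tokens=3)
print(" ".join(continuation))
```

Because "rude" follows "are" more often than "kind" does in this corpus, the model picks it. Changing that behavior requires changing the data or adding a later training stage, which is the role alignment plays.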

The scaling story

  • Scaling up data, parameters, and compute lowers the pretraining loss in a predictable way; empirical scaling laws describe the decline as roughly a power law.
  • Lower loss tends to correlate with stronger downstream skills, though not perfectly.
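The shape of such a scaling law can be sketched as a power law plus a floor. The function below is an assumption made for illustration: it models parameter count only, and the constants `irreducible`, `scale`, and `exponent` are placeholders, not values fitted to any real model.

```python
def predicted_loss(n_params, irreducible=2.0, scale=100.0, exponent=0.3):
    """Illustrative power-law form: loss = irreducible + scale / n_params**exponent.

    All three constants are hypothetical placeholders, not measured values.
    """
    return irreducible + scale / (n_params ** exponent)


# Each tenfold increase in parameters lowers the predicted loss, with shrinking gains.
sizes = [1e7, 1e8, 1e9, 1e10]
losses = [predicted_loss(n) for n in sizes]
for n, loss in zip(sizes, losses):
    print(f"{n:.0e} params -> predicted loss {loss:.3f}")
```

Two features of the curve matter for the lesson: the loss keeps dropping as scale grows, and it flattens toward a floor that more scale alone does not remove.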

Key idea

Pretraining uses self-supervised next-token prediction over massive text to build a fluent, knowledgeable base model that is not yet aligned to human intent.

Check yourself

1. What objective does base model pretraining optimize?

2. Why is a freshly pretrained base model not yet aligned?