← Lessons

quiz vs the machine

Gold1430

Machine Learning

The Tokenizer Training

How a tokenizer is fit on a corpus before any model weights exist.

5 min read · core · beat Gold to climb

A separate fitting step

The tokenizer is trained before the model, on its own corpus, and then frozen. Its vocabulary is fixed for the entire life of the model, so the choices made here are permanent.

What you decide upfront

  • The corpus sample, which should mirror the languages and domains you will serve.
  • The target vocabulary size.
  • The algorithm, such as BPE, WordPiece, or unigram.
  • Normalization and pre tokenization rules.
  • Which special tokens to reserve.

Why the corpus matters

A tokenizer trained mostly on English code will tokenize prose or other languages inefficiently. The training sample bakes in biases that the model can never fully overcome, because every later token comes from this fixed vocabulary.

You cannot easily change it later

Swapping the tokenizer after pretraining would invalidate every learned embedding. That is why teams treat tokenizer design as a careful, mostly one way decision.

Key idea

A tokenizer is fit once on a representative corpus and then frozen, so its corpus, size, and algorithm choices permanently shape the model.

Check yourself

Answer to earn rating on the learn ladder.

1. When is the tokenizer trained relative to the model?

2. Why is swapping the tokenizer after pretraining problematic?