A separate fitting step
The tokenizer is trained before the model, on its own corpus, and then frozen. Its vocabulary is fixed for the entire life of the model, so the choices made here are permanent.
What you decide upfront
- The corpus sample, which should mirror the languages and domains you will serve.
- The target vocabulary size.
- The algorithm, such as BPE, WordPiece, or unigram.
- Normalization and pre tokenization rules.
- Which special tokens to reserve.
Why the corpus matters
A tokenizer trained mostly on English code will tokenize prose or other languages inefficiently. The training sample bakes in biases that the model can never fully overcome, because every later token comes from this fixed vocabulary.
You cannot easily change it later
Swapping the tokenizer after pretraining would invalidate every learned embedding. That is why teams treat tokenizer design as a careful, mostly one way decision.
Key idea
A tokenizer is fit once on a representative corpus and then frozen, so its corpus, size, and algorithm choices permanently shape the model.