Why tokenize
A neural network does not see characters or words. It sees integers. Tokenization is the step that maps a string into a sequence of integer ids drawn from a fixed vocabulary, and back again.
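The mapping in both directions can be sketched in a few lines. This is a minimal illustration, not a real tokenizer: the six-entry vocabulary, the "<unk>" placeholder, and the helper names encode and decode are all assumptions made for the example, and splitting on whitespace stands in for a real splitting rule.

```python
# Toy vocabulary (hypothetical, chosen only for illustration).
# A token's id is simply its position in this list.
vocab = ["<unk>", "the", "cat", "sat", "on", "mat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}


def encode(text):
    """Map a string to integer ids; tokens outside the vocabulary get the <unk> id."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(tok, unk) for tok in text.split()]


def decode(ids):
    """Map integer ids back to a string by looking up each token."""
    return " ".join(vocab[i] for i in ids)


ids = encode("the cat sat on the mat")
print(ids)          # [1, 2, 3, 4, 1, 5]
print(decode(ids))  # the cat sat on the mat
```

Note that the round trip is only lossless for text the vocabulary covers: in this sketch, encode("the dog sat") yields [1, 0, 3], and decoding it gives back "the <unk> sat".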
The spectrum of granularity
You can split text at several levels:
- Character level keeps the vocabulary tiny but makes sequences very long.
- Word level keeps sequences short but the vocabulary explodes, and rare or unseen words fall outside it and collapse into an unknown token.
- Subword level is the modern compromise: common words stay whole, rare words break into pieces.
Almost every large model today uses a subword scheme such as byte-pair encoding (BPE), WordPiece, or a unigram language model.
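To make the trade-off concrete, the sketch below splits the same sentence at all three levels. It rests on several assumptions: the word vocabulary and the subword piece set are invented for the example, and the subword splitter uses a simple greedy longest-match rule. That rule resembles how WordPiece picks pieces at inference time, but it omits details such as continuation markers, and it is not how BPE or a unigram model works internally.

```python
def char_tokens(text):
    """Character level: one token per character, spaces included."""
    return list(text)


def word_tokens(text, vocab):
    """Word level: whole words, with out-of-vocabulary words replaced by <unk>."""
    return [w if w in vocab else "<unk>" for w in text.split()]


def subword_tokens(word, pieces):
    """Subword level: greedy longest-match-first split of one word into known pieces."""
    out, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece.
        while end > start and word[start:end] not in pieces:
            end -= 1
        if end == start:  # no known piece starts at this position
            return ["<unk>"]
        out.append(word[start:end])
        start = end
    return out


# Hypothetical vocabularies, for illustration only.
word_vocab = {"the", "model", "is"}
pieces = {"the", "model", "is", "token", "iz", "ing"}

text = "the model is tokenizing"
chars = char_tokens(text)
words = word_tokens(text, word_vocab)
subwords = [p for w in text.split() for p in subword_tokens(w, pieces)]

print(len(chars))  # 23
print(words)       # ['the', 'model', 'is', '<unk>']
print(subwords)    # ['the', 'model', 'is', 'token', 'iz', 'ing']
```

The character split needs 23 tokens, the word split needs 4 but loses "tokenizing" entirely, and the subword split needs 6 while keeping every piece recoverable.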
The pipeline
Text first passes through optional normalization, then a pre-tokenizer splits it into chunks, typically on whitespace and punctuation, and finally the core tokenization model breaks each chunk into vocabulary tokens and assigns their ids.
The chosen scheme shapes sequence length, cost, and how gracefully the model handles typos and new words.
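The three pipeline stages can be sketched as three small functions composed in order. Everything here is an assumption made for illustration: NFKC plus lowercasing is only one possible normalization, the regular expression is one simple way to split on whitespace and punctuation, and the final stage is a plain dictionary lookup standing in for a real subword model. The vocabulary and the function names are hypothetical.

```python
import re
import unicodedata

# Hypothetical vocabulary for the final stage.
token_to_id = {"<unk>": 0, "hello": 1, ",": 2, "world": 3, "!": 4}


def normalize(text):
    """Stage 1 (optional): Unicode NFKC normalization, then lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    """Stage 2: split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def assign_ids(tokens):
    """Stage 3: look up each token's id; a real model would first split into subwords."""
    unk = token_to_id["<unk>"]
    return [token_to_id.get(tok, unk) for tok in tokens]


def tokenize(text):
    """Run the full pipeline: normalize, pre-tokenize, assign ids."""
    return assign_ids(pre_tokenize(normalize(text)))


print(pre_tokenize(normalize("Hello, World!")))  # ['hello', ',', 'world', '!']
print(tokenize("Hello, World!"))                 # [1, 2, 3, 4]
```

Keeping the stages separate makes each choice visible: dropping the lowercasing step, for instance, would send "Hello" to the unknown id in this sketch, since only "hello" is in the vocabulary.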
Key idea
Tokenization turns text into integers via a fixed vocabulary, and subword schemes balance short sequences against a manageable vocabulary.