Machine Learning

Tokenization Overview

How raw text becomes the integer ids a language model actually reads.

Why tokenize

A neural network does not see characters or words. It sees integers. Tokenization is the step that maps a string into a sequence of integer ids drawn from a fixed vocabulary, and back again.
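As a minimal sketch of that round trip, the snippet below maps a string to ids and back. The six-entry vocabulary, the `<unk>` fallback token, and the `encode` and `decode` helper names are all assumptions made for illustration; real vocabularies hold tens of thousands of entries and do not split on spaces alone.

```python
# Hypothetical toy vocabulary; the position in the list is the integer id.
VOCAB = ["<unk>", "the", "cat", "sat", "on", "mat"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}


def encode(text):
    """Map a string to a list of ids, using a plain whitespace split."""
    unk = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(token, unk) for token in text.split()]


def decode(ids):
    """Map a list of ids back to a string."""
    return " ".join(VOCAB[i] for i in ids)


ids = encode("the cat sat on the mat")
print(ids)          # [1, 2, 3, 4, 1, 5]
print(decode(ids))  # the cat sat on the mat
```

Because the vocabulary is fixed, any string outside it (here, for example, "dog") falls back to the `<unk>` id and cannot be recovered by `decode`.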

The spectrum of granularity

You can split text at several levels:

  • Character level keeps the vocabulary tiny but makes sequences very long.
  • Word level keeps sequences short, but the vocabulary explodes and any word outside it collapses to an unknown token.
  • Subword level is the modern compromise: common words stay whole, rare words break into pieces.
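The trade-off above can be sketched by splitting one sentence three ways and counting tokens. The subword vocabulary `SUBWORDS` and the `greedy_subwords` helper are hypothetical, and the greedy longest-match split is a deliberate simplification rather than any particular library's algorithm.

```python
text = "tokenizers handle unbelievable words"

# Character level: every character, spaces included, is a token.
char_tokens = list(text)

# Word level: split on whitespace.
word_tokens = text.split()

# Hypothetical subword vocabulary chosen for this example only.
SUBWORDS = {"token", "izers", "handle", "un", "believ", "able", "words"}


def greedy_subwords(word, vocab):
    """Split a word by repeatedly taking the longest prefix found in vocab.

    A single character is always accepted as a fallback, so the loop
    always makes progress.
    """
    pieces, start = [], 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab or end == start + 1:
                pieces.append(word[start:end])
                start = end
                break
    return pieces


subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, SUBWORDS)]

print(subword_tokens)
# ['token', 'izers', 'handle', 'un', 'believ', 'able', 'words']
print(len(char_tokens), len(word_tokens), len(subword_tokens))  # 36 4 7
```

The common word "handle" stays whole while "unbelievable" breaks into three pieces, so the subword sequence sits between the character and word extremes in length.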

Almost every large model today uses a subword scheme such as byte-pair encoding (BPE), WordPiece, or a unigram language model.
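To give a feel for how one of these schemes builds its vocabulary, here is a rough sketch of the merge-learning loop in byte pair encoding: count adjacent symbol pairs across the corpus, merge the most frequent pair into a new symbol, and repeat. The tiny corpus, its word frequencies, and the helper names `most_frequent_pair` and `merge_pair` are assumptions; production implementations work on bytes and add many refinements.

```python
from collections import Counter


def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, or None if there is none.

    `words` maps a tuple of symbols to how often that word occurs.
    """
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Hypothetical corpus: each word starts as a tuple of single characters.
corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("lowest"): 3}

merges = []
for _ in range(2):  # learn two merges for illustration
    pair = most_frequent_pair(corpus)
    if pair is None:
        break
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # [('l', 'o'), ('lo', 'w')]
print(corpus)
```

After two merges the frequent stem "low" has become a single symbol, while the rarer suffixes in "lower" and "lowest" are still split into characters.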

The pipeline

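Text first passes through optional normalization (for example Unicode normalization or lowercasing). A pre-tokenizer then splits it into pieces, typically on whitespace and punctuation, and finally the core tokenization model breaks each piece into vocabulary tokens and assigns their ids.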
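The three stages can be sketched as three small functions. Everything here is an assumption made for illustration: NFKC plus lowercasing as the normalization, a regular expression as the pre-tokenizer, and a plain dictionary lookup standing in for the core model (a real subword model would split each piece further before looking up ids).

```python
import re
import unicodedata


def normalize(text):
    """Assumed normalization: Unicode NFKC followed by lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()


def pre_tokenize(text):
    """Split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def assign_ids(pieces, vocab, unk_id=0):
    """Stand-in for the core model: look each piece up in the vocabulary."""
    return [vocab.get(piece, unk_id) for piece in pieces]


# Hypothetical vocabulary for this example.
VOCAB = {"<unk>": 0, "hello": 1, ",": 2, "world": 3, "!": 4}

pieces = pre_tokenize(normalize("Hello, World!"))
ids = assign_ids(pieces, VOCAB)

print(pieces)  # ['hello', ',', 'world', '!']
print(ids)     # [1, 2, 3, 4]
```

Keeping the stages separate makes it easy to swap one out, for instance dropping the lowercasing step for a case-sensitive model, without touching the others.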

The chosen scheme shapes sequence length, compute cost, and how gracefully the model handles typos and unseen words.

Key idea

Tokenization turns text into integers via a fixed vocabulary, and subword schemes balance short sequences against a manageable vocabulary.

Check yourself

1. Why do most modern models use subword tokenization?

2. What does a tokenizer ultimately produce for the model?