Tokenization Overview

Why tokenize

A neural network does not see characters or words. It sees integers. Tokenization is the step that maps a string into a sequence of integer ids drawn from a fixed vocabulary, and back again.
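As a minimal sketch of that mapping, the snippet below encodes a string to ids and decodes it back. The vocabulary, the id values, and the `encode`/`decode` helpers are hypothetical and chosen only for illustration; splitting on whitespace stands in for a real tokenizer.

```python
# Hypothetical toy vocabulary: each token string maps to one integer id.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
ID_TO_TOKEN = {idx: tok for tok, idx in VOCAB.items()}

def encode(text):
    """Map a string to a list of ids; tokens outside the vocabulary become <unk>."""
    return [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.split()]

def decode(ids):
    """Map a list of ids back to a string."""
    return " ".join(ID_TO_TOKEN[i] for i in ids)

ids = encode("the cat sat")
print(ids)          # [1, 2, 3]
print(decode(ids))  # the cat sat
```

Note that the round trip is only lossless for text the vocabulary covers: `encode("the dog")` yields `[1, 0]`, and decoding that gives `the <unk>`.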

The spectrum of granularity

You can split text at several levels:

Character level keeps the vocabulary tiny but makes sequences very long.
Word level keeps sequences short, but the vocabulary grows very large and any word outside it has to be mapped to a single unknown token.
Subword level is the modern compromise: common words stay whole, rare words break into pieces.
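The three levels can be compared on one short string. This is a sketch under stated assumptions: the subword pieces in `SUBWORDS` are picked by hand, and `greedy_subwords` is a simple longest-match-first splitter used for illustration, not the algorithm of any particular library.

```python
text = "unbelievable tokenizers"

# Character level: one token per character (including the space).
char_tokens = list(text)

# Word level: one token per whitespace-separated word.
word_tokens = text.split()

# Hypothetical subword inventory, chosen by hand for this example.
SUBWORDS = {"un", "believ", "able", "token", "izer", "s"}

def greedy_subwords(word, pieces):
    """Split a word by repeatedly taking the longest known piece.

    Falls back to a single character when no known piece matches.
    """
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            out.append(word[i])
            i += 1
    return out

subword_tokens = [p for w in word_tokens for p in greedy_subwords(w, SUBWORDS)]

print(len(char_tokens))  # 23
print(len(word_tokens))  # 2
print(subword_tokens)    # ['un', 'believ', 'able', 'token', 'izer', 's']
```

The subword split lands between the two extremes: 6 tokens instead of 23 or 2, built from pieces that can be reused across many other words.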

Most large language models today use a subword scheme such as byte-pair encoding (BPE), WordPiece, or a unigram language model.
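To give a feel for how a subword vocabulary is learned, here is a minimal sketch of the merge loop at the heart of byte-pair encoding: count adjacent symbol pairs, merge the most frequent pair, repeat. The corpus and its word frequencies are made up, and the helper names `count_pairs` and `merge_pair` are hypothetical; production tokenizers add byte-level handling, pre-tokenization, and many optimizations.

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> made-up frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}

merges = []
for _ in range(3):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)
print(list(words))
```

On this corpus the frequent suffix shared by "newest" and "widest" is merged into a single `est` symbol within the first two merges, which is how common fragments end up as vocabulary entries while rare words stay split into pieces.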

The pipeline

Text first passes through optional normalization, then a pre-tokenizer splits it on whitespace and punctuation, and finally the core tokenization model maps the resulting pieces to ids.
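The three stages can be sketched with the standard library alone. Everything here is an assumption made for illustration: NFKC plus lowercasing as the normalization, a regular expression as the pre-tokenizer, and a plain dictionary lookup standing in for the core model (a real subword model would further split pieces it does not know).

```python
import re
import unicodedata

# Hypothetical vocabulary for the final lookup stage.
VOCAB = {"<unk>": 0, "hello": 1, ",": 2, "world": 3, "!": 4}

def normalize(text):
    """Stage 1: Unicode normalization (NFKC) and lowercasing."""
    return unicodedata.normalize("NFKC", text).lower()

def pre_tokenize(text):
    """Stage 2: split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def assign_ids(pieces):
    """Stage 3: stand-in for the core model; unknown pieces map to <unk>."""
    return [VOCAB.get(p, VOCAB["<unk>"]) for p in pieces]

def tokenize(text):
    return assign_ids(pre_tokenize(normalize(text)))

print(pre_tokenize(normalize("Hello, World!")))  # ['hello', ',', 'world', '!']
print(tokenize("Hello, World!"))                 # [1, 2, 3, 4]
```

Keeping the stages as separate functions mirrors the description above: each one can be swapped out, for example by dropping the lowercasing, without touching the others.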

The chosen scheme determines sequence length, and with it compute cost, as well as how gracefully the model handles typos and new words.

Key idea

Tokenization turns text into integers via a fixed vocabulary, and subword schemes balance short sequences against a manageable vocabulary.
