Machine Learning

Subword Regularization

Sampling multiple segmentations to make models robust to tokenization noise.

One string, many splits

A single word can usually be tokenized in several valid ways. A deterministic tokenizer always picks the same one. Subword regularization instead samples among the alternatives during training.
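To make "many splits" concrete, here is a minimal sketch that enumerates every way to cut a word into pieces from a vocabulary. The function name segmentations and the toy vocabulary are made up for illustration; a real tokenizer's vocabulary is learned from data.

```python
def segmentations(word, vocab):
    """Return every way to split `word` into pieces that all belong to `vocab`."""
    if not word:
        return [[]]  # one way to split the empty string: no pieces
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            # Keep this prefix and recurse on the remainder of the word.
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical toy vocabulary, chosen only for illustration.
TOY_VOCAB = {"un", "related", "rel", "ated", "u", "n", "unrel"}

splits = segmentations("unrelated", TOY_VOCAB)
for pieces in splits:
    print(" + ".join(pieces))
```

Every printed line spells the same word, but the model would see a different token sequence for each one.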

Why add noise

Exposing the model to different segmentations of the same text:

  • Acts like data augmentation, improving robustness.
  • Makes the model less brittle to the exact split it sees at inference.
  • Helps with rare words and noisy input such as typos.

How it is done

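The unigram language model tokenizer makes this natural: it assigns a probability to every segmentation, so you can sample from that distribution instead of always taking the most likely split. BPE has a variant called BPE dropout that randomly skips individual merges with some probability while encoding, so the same word can come out split differently on each pass.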
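Both approaches can be sketched in a few lines. This is a simplified illustration, not the implementation used by any particular library: the piece probabilities, the merge list, and the names sample_unigram and bpe_dropout are all assumptions made for the example. The alpha exponent is a smoothing knob (smaller values flatten the distribution), and the brute-force enumeration only works for short words; real implementations sample without listing every split.

```python
import math
import random


def segmentations(word, vocab):
    """Enumerate every split of `word` into pieces found in `vocab`."""
    if not word:
        return [[]]
    out = []
    for end in range(1, len(word) + 1):
        if word[:end] in vocab:
            out.extend([word[:end]] + rest for rest in segmentations(word[end:], vocab))
    return out


def sample_unigram(word, piece_probs, rng, alpha=1.0):
    """Sample one split with probability proportional to
    (product of its piece probabilities) ** alpha."""
    candidates = segmentations(word, piece_probs)
    if not candidates:
        raise ValueError(f"no segmentation of {word!r} with this vocabulary")
    weights = [math.prod(piece_probs[p] for p in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def bpe_dropout(word, merges, p, rng):
    """Apply merge rules in priority order, skipping each applicable
    merge with probability `p` (p=0 gives ordinary deterministic BPE)."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            applicable = (i + 1 < len(symbols)
                          and symbols[i] == left and symbols[i + 1] == right)
            if applicable and rng.random() >= p:
                merged.append(left + right)  # merge survives the dropout
                i += 2
            else:
                merged.append(symbols[i])    # no match, or merge was dropped
                i += 1
        symbols = merged
    return symbols


# Hypothetical toy parameters, for illustration only.
TOY_PROBS = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02,
             "u": 0.01, "n": 0.01, "unrel": 0.001}
TOY_MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

rng = random.Random(0)
for _ in range(3):
    print(sample_unigram("unrelated", TOY_PROBS, rng, alpha=0.5),
          bpe_dropout("lower", TOY_MERGES, 0.3, rng))
```

In training, a call like one of these would replace the deterministic encode step each time an example is drawn, so the model sees a fresh split of the same text across epochs.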

At inference

You typically turn sampling off at inference and use the single best segmentation, though sampling can still help for ensembling or robustness studies.
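For a unigram model, "the single best segmentation" means the split with the highest total probability, which a small dynamic program can find. The sketch below assumes a made-up table of piece probabilities and a hypothetical function name best_segmentation; it illustrates the idea rather than any library's actual code.

```python
import math


def best_segmentation(word, piece_probs):
    """Return the split of `word` with the highest product of piece probabilities."""
    n = len(word)
    # best[i] = (best log-probability of word[:i], start index of its last piece)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_probs and best[start][0] > -math.inf:
                score = best[start][0] + math.log(piece_probs[piece])
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError(f"no segmentation of {word!r} with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        start = best[end][1]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


# Hypothetical piece probabilities, for illustration only.
TOY_PROBS = {"un": 0.1, "related": 0.05, "rel": 0.02, "ated": 0.02,
             "u": 0.01, "n": 0.01, "unrel": 0.001}

print(best_segmentation("unrelated", TOY_PROBS))
```

Unlike sampling, this returns the same split every time for the same input, which is what you usually want when results need to be reproducible.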

Key idea

Subword regularization samples alternative segmentations during training as augmentation, making models more robust to how text gets split.

Check yourself

1. What is the main benefit of subword regularization?

2. What is BPE dropout?