Subword Regularization

One string, many splits

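A single word can usually be tokenized in several valid ways. For example, with a vocabulary that contains un, happy, hap, and py, the word unhappy could be split as un + happy or as un + hap + py. A deterministic tokenizer always picks one of these. Subword regularization deliberately samples among the alternatives during training.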
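To make "several valid ways" concrete, here is a minimal sketch that lists every split of a word whose pieces all belong to a vocabulary. The vocabulary and the function name `segmentations` are made up for illustration and do not come from any particular tokenizer.

```python
# Toy vocabulary, invented for this example.
VOCAB = {"un", "happy", "hap", "py", "u", "n", "h", "a", "p", "y"}


def segmentations(word, vocab=VOCAB):
    """Return every way to split `word` into pieces that are all in `vocab`."""
    if not word:
        return [[]]  # exactly one way to split the empty string: no pieces
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            # Keep this piece and recurse on the remainder of the word.
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


if __name__ == "__main__":
    for seg in segmentations("unhappy"):
        print(" + ".join(seg))
```

Even this tiny vocabulary yields ten distinct splits of "unhappy". A deterministic tokenizer would only ever show the model one of them.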

Why add noise

Exposing the model to different segmentations of the same text:

Acts like data augmentation, improving robustness.
Makes the model less brittle to the exact split it sees at inference.
Helps with rare words and noisy input such as typos.

How it is done

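The unigram language model tokenizer makes this natural: it assigns a probability to every candidate segmentation, so you can sample one according to those probabilities instead of always taking the most likely. Standard BPE is deterministic, but a variant called BPE-dropout skips each applicable merge with some probability during encoding, so the same word can come out split differently on each pass.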
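Both mechanisms can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production tokenizer: the piece probabilities in `PIECE_PROB` and the merge list `MERGES` are invented, the unigram sampler enumerates all splits (workable only for short words), and the `alpha` exponent is one common way to sharpen or flatten the sampling distribution.

```python
import math
import random

# Hypothetical unigram probabilities for a toy vocabulary.
PIECE_PROB = {"un": 0.10, "happy": 0.05, "hap": 0.02, "py": 0.03,
              "u": 0.01, "n": 0.02, "h": 0.01, "a": 0.03, "p": 0.01, "y": 0.01}

# Hypothetical BPE merge rules, highest priority first.
MERGES = [("h", "a"), ("ha", "p"), ("p", "y"), ("hap", "py"), ("u", "n")]


def segmentations(word):
    """All splits of `word` into pieces found in PIECE_PROB."""
    if not word:
        return [[]]
    return [[word[:end]] + rest
            for end in range(1, len(word) + 1)
            if word[:end] in PIECE_PROB
            for rest in segmentations(word[end:])]


def sample_unigram(word, rng, alpha=1.0):
    """Sample a split with probability proportional to P(split) ** alpha."""
    candidates = segmentations(word)  # assumes the word has at least one split
    weights = [math.prod(PIECE_PROB[p] for p in seg) ** alpha
               for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def bpe_dropout(word, rng, p=0.1):
    """Greedy BPE encoding where each applicable merge is skipped with probability p."""
    rank = {pair: i for i, pair in enumerate(MERGES)}
    symbols = list(word)  # start from single characters
    while True:
        # Adjacent pairs that have a merge rule and survive dropout this step.
        candidates = [(rank[(a, b)], i)
                      for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
                      if (a, b) in rank and rng.random() >= p]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print("unigram:", sample_unigram("unhappy", rng))
    for _ in range(3):
        print("bpe-dropout:", bpe_dropout("unhappy", rng, p=0.3))
```

With `p=0.0` the BPE function reduces to ordinary deterministic BPE and returns `['un', 'happy']` for this merge list; with `p=1.0` every merge is dropped and the word falls back to single characters. Values in between produce a mix of splits across passes.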

At inference

You typically turn sampling off at inference and use the single best segmentation, though sampling can still help for ensembling or robustness studies.
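For a unigram-style tokenizer, "the single best segmentation" is the split with the highest total probability, which a Viterbi-style dynamic program can find without enumerating every split. The sketch below assumes the same invented `PIECE_PROB` table as a stand-in for a trained model; `best_segmentation` is a hypothetical helper name.

```python
import math

# Hypothetical unigram probabilities for a toy vocabulary.
PIECE_PROB = {"un": 0.10, "happy": 0.05, "hap": 0.02, "py": 0.03,
              "u": 0.01, "n": 0.02, "h": 0.01, "a": 0.03, "p": 0.01, "y": 0.01}


def best_segmentation(word, piece_prob=PIECE_PROB):
    """Return the highest-probability split of `word` (no sampling involved)."""
    n = len(word)
    best_score = [-math.inf] * (n + 1)  # best log-probability of word[:i]
    back = [0] * (n + 1)                # start index of the last piece in that split
    best_score[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in piece_prob and best_score[start] > -math.inf:
                score = best_score[start] + math.log(piece_prob[piece])
                if score > best_score[end]:
                    best_score[end] = score
                    back[end] = start
    if best_score[n] == -math.inf:
        raise ValueError(f"no segmentation of {word!r} with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces = []
    end = n
    while end > 0:
        start = back[end]
        pieces.append(word[start:end])
        end = start
    return pieces[::-1]


if __name__ == "__main__":
    print(best_segmentation("unhappy"))
```

Because nothing here is random, the same input always produces the same split, which is what you usually want when evaluating or serving a model.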

Key idea

Subword regularization samples alternative segmentations during training as augmentation, making models more robust to how text gets split.
