One string, many splits
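A single word can usually be tokenized in several valid ways. With a typical subword vocabulary, "unhappy" might be split as "un" + "happy", as "un" + "hap" + "py", or character by character. A deterministic tokenizer always picks the same split. Subword regularization instead samples among the alternatives during training.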
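To make "several valid ways" concrete, here is a minimal sketch that lists every split of a word whose pieces all belong to a vocabulary. The vocabulary `VOCAB` and the helper `segmentations` are made up for illustration and do not come from any real tokenizer.

```python
def segmentations(word, vocab):
    """Return every way to split `word` into pieces that are all in `vocab`."""
    if not word:
        return [[]]  # exactly one way to split the empty string: no pieces
    results = []
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in vocab:
            # Keep this prefix and recurse on whatever is left.
            for rest in segmentations(word[end:], vocab):
                results.append([piece] + rest)
    return results


# Hypothetical toy vocabulary (an assumption for this example).
VOCAB = {"u", "n", "h", "a", "p", "y", "un", "hap", "py", "happy", "unhappy"}

splits = segmentations("unhappy", VOCAB)
for pieces in splits:
    print(" + ".join(pieces))
```

A deterministic tokenizer would return one fixed entry of `splits` every time. Subword regularization draws from the whole list.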
Why add noise
Exposing the model to different segmentations of the same text:
- Acts like data augmentation, improving robustness.
- Makes the model less brittle to the exact split it sees at inference.
- Helps with rare words and noisy input such as typos.
How it is done
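The unigram language model tokenizer makes this natural: it assigns every segmentation a probability, so alternatives can be sampled according to those probabilities. BPE has no such probabilities, but a variant called BPE-dropout randomly skips individual merges during encoding, each with some fixed probability, so the same word can come out split differently on each pass.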
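Both mechanisms can be sketched in a few lines. This is a toy illustration under stated assumptions, not how production tokenizers implement them: the merge table `MERGES`, the piece probabilities in `LOGP`, and the function names are all hypothetical, and the unigram sampler enumerates every split by brute force, which is only feasible for short words. The exponent `alpha` flattens the distribution when it is below 1.

```python
import math
import random


def bpe_encode(word, merges, dropout=0.0, rng=random):
    """Greedy BPE where each candidate merge is skipped with probability `dropout`."""
    ranks = {pair: i for i, pair in enumerate(merges)}  # lower rank = higher priority
    pieces = list(word)
    while len(pieces) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(pieces, pieces[1:]))
            if pair in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break
        _, i = min(candidates)  # apply the highest-priority surviving merge
        pieces[i:i + 2] = [pieces[i] + pieces[i + 1]]
    return pieces


def sample_unigram(word, logp, alpha=0.2, rng=random):
    """Sample a split with probability proportional to P(split) ** alpha."""
    def all_splits(rest):
        if not rest:
            return [[]]
        return [
            [rest[:end]] + tail
            for end in range(1, len(rest) + 1)
            if rest[:end] in logp
            for tail in all_splits(rest[end:])
        ]

    splits = all_splits(word)
    weights = [math.exp(alpha * sum(logp[p] for p in s)) for s in splits]
    return rng.choices(splits, weights=weights, k=1)[0]


# Hypothetical merge table and piece probabilities (assumptions for this example).
MERGES = [("u", "n"), ("h", "a"), ("p", "p"), ("ha", "pp"), ("happ", "y")]
PROBS = {"un": 0.1, "happy": 0.1, "hap": 0.05, "py": 0.05,
         "u": 0.02, "n": 0.02, "h": 0.02, "a": 0.02, "p": 0.02, "y": 0.02}
LOGP = {piece: math.log(q) for piece, q in PROBS.items()}

rng = random.Random(0)
print(bpe_encode("unhappy", MERGES))  # no dropout: ['un', 'happy']
for _ in range(3):
    print(bpe_encode("unhappy", MERGES, dropout=0.3, rng=rng))
    print(sample_unigram("unhappy", LOGP, alpha=0.2, rng=rng))
```

With `dropout=0.0` the BPE function is fully deterministic, and with `dropout=1.0` it falls back to single characters. Values in between trade off how often the model sees the standard split against how much variety it sees.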
At inference
You typically turn sampling off at inference and use the single best segmentation, though sampling can still help for ensembling or robustness studies.
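For a unigram model, "the single best segmentation" is the split with the highest total probability, which a Viterbi-style dynamic program can find. The sketch below assumes a hypothetical table of piece log-probabilities, `LOGP`. The function name and the numbers are illustrative only.

```python
import math


def best_segmentation(word, logp):
    """Viterbi search for the highest-probability split under a unigram model."""
    n = len(word)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of word[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last piece on that path
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the word to recover the pieces.
    pieces, end = [], n
    while end > 0:
        pieces.append(word[back[end]:end])
        end = back[end]
    return pieces[::-1]


# Hypothetical piece probabilities (an assumption for this example).
PROBS = {"un": 0.1, "happy": 0.1, "hap": 0.05, "py": 0.05,
         "u": 0.02, "n": 0.02, "h": 0.02, "a": 0.02, "p": 0.02, "y": 0.02}
LOGP = {piece: math.log(q) for piece, q in PROBS.items()}

print(best_segmentation("unhappy", LOGP))  # ['un', 'happy']
```

Because there is no randomness, repeated calls return the same split, which is what you usually want when results must be reproducible. Libraries tend to expose the choice as a switch. SentencePiece's Python `encode`, for example, accepts `enable_sampling`, `alpha`, and `nbest_size` arguments.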
Key idea
Subword regularization samples alternative segmentations during training as augmentation, making models more robust to how text gets split.