Data Augmentation for Text
Text is harder to augment than images because small edits can flip meaning. Still, several techniques expand a text dataset while keeping the label intact.
Common techniques
- Synonym replacement swaps words for close synonyms.
- Back translation translates a sentence to another language and back, producing a paraphrase.
- Random insertion or deletion of minor words adds mild noise.
- Contextual replacement uses a language model to suggest fitting substitutes.
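The two simplest techniques in the list, synonym replacement and random deletion, can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `SYNONYMS` table, the `MINOR_WORDS` set, and the function names are hypothetical stand-ins, and a real project would likely draw synonyms from a thesaurus or a language model instead of a hand-written dictionary.

```python
import random

# Hypothetical toy synonym table (an assumption for illustration only).
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "movie": ["film"],
    "great": ["excellent", "superb"],
}

# Hypothetical set of "minor" words assumed safe to drop.
MINOR_WORDS = {"the", "a", "an", "really", "very", "just"}


def synonym_replace(tokens, rng, p=0.3):
    """Swap each token that has a known synonym with probability p."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out


def random_delete(tokens, rng, p=0.2):
    """Drop minor words with probability p; never return an empty sentence."""
    kept = [t for t in tokens
            if not (t.lower() in MINOR_WORDS and rng.random() < p)]
    return kept if kept else list(tokens)


rng = random.Random(0)  # fixed seed so runs are repeatable
sentence = "the movie was really great".split()
print(" ".join(synonym_replace(sentence, rng)))
print(" ".join(random_delete(sentence, rng)))
```

Keeping the probabilities low and restricting deletion to an explicit list of minor words are both ways of staying conservative: most of the sentence survives, so the label is less likely to drift.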
The meaning hazard
Unlike flipping an image, editing words risks changing sentiment or facts. Replacing "good" with "bad" in a review reverses the label. Deleting the word "not" can invert a sentence entirely. Augmentations must therefore be conservative and, ideally, checked so the label still holds.
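One cheap way to check an augmentation is to reject any output whose label-bearing words differ from the original. The sketch below assumes a small, hand-picked `PROTECTED` set and hypothetical function names. It is a crude filter, not a guarantee: it does not handle contractions such as "isn't" and cannot detect meaning changes that involve words outside the list.

```python
from collections import Counter

# Hypothetical, deliberately small list of words assumed to carry the label.
PROTECTED = {"not", "no", "never", "good", "bad"}


def protected_counts(text):
    """Count protected words after lowercasing and stripping basic punctuation."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return Counter(t for t in tokens if t in PROTECTED)


def label_probably_preserved(original, augmented):
    """Accept an augmentation only if its protected words match the original's."""
    return protected_counts(original) == protected_counts(augmented)


original = "The food was not good."
print(label_probably_preserved(original, "The meal was not good."))
print(label_probably_preserved(original, "The food was good."))
```

A filter like this errs on the side of discarding examples, which fits the conservative stance: losing a few valid paraphrases costs less than training on a mislabeled one.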
When it pays off
Augmentation helps most when labeled text is scarce, such as in a niche classification task. Because large pretrained language models already encode broad knowledge, heavy augmentation matters less than it once did, but it remains a cheap way to add robustness for low-resource problems.
Key idea
Text augmentation creates paraphrases through techniques such as synonym replacement or back translation, but it must stay conservative so the label and meaning survive.