Data Augmentation for Text

Text is harder to augment than images because small edits can flip meaning. Still, several techniques expand a text dataset while keeping the label intact.

Common techniques

Synonym replacement swaps words for close synonyms.
Back translation translates a sentence to another language and back, producing a paraphrase.
Random insertion or deletion of minor words adds mild noise.
Contextual replacement uses a language model to suggest fitting substitutes.
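
The two simplest techniques above, synonym replacement and random deletion, can be sketched in a few lines of Python. This is a minimal illustration and not a production recipe: the SYNONYMS table, the MINOR_WORDS set, and the function names are hypothetical, and the probabilities are arbitrary. Back translation and contextual replacement are left out because they need a translation or language model.

```python
import random

# Hypothetical, hand-written synonym table; a real project would use a
# curated lexicon instead.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "movie": ["film"],
    "great": ["excellent", "superb"],
}

# Hypothetical set of "minor" words assumed safe to drop.
MINOR_WORDS = {"the", "a", "an", "really", "very"}


def synonym_replace(tokens, rng, p=0.5):
    """Swap each token that has a table entry with probability p."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out


def random_delete(tokens, rng, p=0.3):
    """Drop minor words with probability p; never return an empty list."""
    kept = [t for t in tokens
            if t.lower() not in MINOR_WORDS or rng.random() >= p]
    return kept or list(tokens)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    sentence = "the movie was really great".split()
    print(" ".join(synonym_replace(sentence, rng)))
    print(" ".join(random_delete(sentence, rng)))
```

Restricting deletion to an explicit list of minor words is one way to keep the noise mild: content words, and words such as not, are never candidates for removal.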

The meaning hazard

Unlike flipping an image, editing words risks changing sentiment or facts. Replacing "good" with "bad" in a review reverses the label. Deleting the word "not" can invert a sentence entirely. Augmentations must therefore be conservative and, ideally, checked so that the label still holds.
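
One cheap way to check an augmentation is to reject any output whose label-bearing words differ from the original. The sketch below assumes a hypothetical PROTECTED word list and simple whitespace tokenization; it will miss contractions such as "don't" and cannot replace a human or model-based review.

```python
from collections import Counter

# Hypothetical list of words whose presence or absence tends to flip a label.
PROTECTED = {"not", "no", "never", "good", "bad"}


def protected_counts(text):
    """Count protected words in a lowercased, whitespace-split sentence."""
    words = (w.strip(".,!?;:") for w in text.lower().split())
    return Counter(w for w in words if w in PROTECTED)


def is_safe(original, augmented):
    """Accept the augmentation only if protected words are unchanged."""
    return protected_counts(original) == protected_counts(augmented)


if __name__ == "__main__":
    print(is_safe("The food was not good.", "The meal was not good."))
    print(is_safe("The food was not good.", "The food was good."))
```

The first call keeps both "not" and "good", so the augmentation passes; the second drops "not", so it is rejected.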

When it pays off

Augmentation helps most when labeled text is scarce, as in a niche classification task. Because large pretrained language models already encode broad knowledge, heavy augmentation matters less than it once did, but it remains a cheap way to add robustness in low-resource problems.

Key idea

Text augmentation creates paraphrases through synonyms or back translation but must stay conservative so the label and meaning survive.
