Data Augmentation for Text

Growing text data while protecting meaning and labels.

Text is harder to augment than images because small edits can flip meaning. Still, several techniques expand a text dataset while keeping the label intact.

Common techniques

  • Synonym replacement swaps words for close synonyms.
  • Back translation translates a sentence to another language and back, producing a paraphrase.
  • Random insertion or deletion of minor words adds mild noise.
  • Contextual replacement uses a language model to suggest fitting substitutes.
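
The first three techniques above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SYNONYMS` table and the `MINOR_WORDS` set are hypothetical toy data, and `back_translate` takes two caller-supplied translation functions rather than calling any real translation service.

```python
import random

# Hypothetical toy synonym table; a real project would use a proper thesaurus.
SYNONYMS = {
    "movie": ["film"],
    "great": ["excellent", "wonderful"],
    "quick": ["fast"],
}

# Hypothetical set of "minor" words assumed safe to drop.
MINOR_WORDS = {"the", "a", "an", "really", "very"}


def synonym_replace(tokens, rng, prob=0.5):
    """Swap each word that has a known synonym with probability `prob`."""
    out = []
    for tok in tokens:
        options = SYNONYMS.get(tok.lower())
        if options and rng.random() < prob:
            out.append(rng.choice(options))
        else:
            out.append(tok)
    return out


def random_delete(tokens, rng, prob=0.3):
    """Drop minor words with probability `prob`; other words are never removed."""
    kept = [
        t for t in tokens
        if not (t.lower() in MINOR_WORDS and rng.random() < prob)
    ]
    return kept or tokens  # never return an empty sentence


def back_translate(sentence, to_pivot, from_pivot):
    """Round-trip a sentence through a pivot language.

    `to_pivot` and `from_pivot` are placeholders for whatever translation
    functions the caller has available; none are provided here.
    """
    return from_pivot(to_pivot(sentence))


rng = random.Random(0)  # fixed seed so runs are repeatable
tokens = "The movie was really great".split()
print(" ".join(synonym_replace(tokens, rng)))
print(" ".join(random_delete(tokens, rng)))
```

Restricting deletion to a short list of minor words is one way to keep the noise mild. Contextual replacement is left out because it needs a language model.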

The meaning hazard

Unlike flipping an image, editing words risks changing sentiment or facts. Replacing "good" with "bad" in a review reverses the label. Deleting the word "not" can invert a sentence entirely. Augmentations must be conservative and, ideally, checked so that the label still holds.
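
One cheap way to check candidates is to reject any augmented sentence whose label-critical words differ from the original. The sketch below is a crude heuristic, not a full meaning check: the `PROTECTED` word list is a hypothetical example, and the function names are illustrative.

```python
# Hypothetical list of words whose presence or absence can flip a label.
PROTECTED = {"not", "no", "never", "good", "bad"}


def protected_counts(text):
    """Count each protected word in a lowercased, whitespace-split sentence."""
    counts = {}
    for tok in text.lower().split():
        tok = tok.strip(".,!?;:")  # ignore surrounding punctuation
        if tok in PROTECTED:
            counts[tok] = counts.get(tok, 0) + 1
    return counts


def is_label_safe(original, augmented):
    """Accept an augmentation only if its protected words are unchanged."""
    return protected_counts(original) == protected_counts(augmented)


original = "The food was not good"
candidates = [
    "The meal was not good",  # synonym swap, negation kept
    "The food was good",      # "not" deleted, meaning inverted
    "The food was not bad",   # "good" replaced by "bad"
]
safe = [c for c in candidates if is_label_safe(original, c)]
print(safe)
```

Only the first candidate passes this filter. A check like this catches the two failure cases described above, but subtler meaning shifts would still need human review or a stronger check.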

When it pays off

Augmentation helps most when labeled text is scarce, such as in a niche classification task. Because large pretrained language models already encode broad knowledge, heavy augmentation matters less than it once did, but it remains a cheap way to add robustness for low-resource problems.
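
As a rough sketch of how a small labeled set grows, the function below applies each augmenter to each example and gives every new variant the label of its source sentence. The two augmenters and the example data are hypothetical stand-ins chosen for illustration.

```python
def augment_dataset(examples, augmenters):
    """Return the original (text, label) pairs plus label-preserving variants."""
    seen = {text for text, _ in examples}
    expanded = list(examples)
    for text, label in examples:
        for augment in augmenters:
            variant = augment(text)
            if variant not in seen:  # skip no-op or duplicate variants
                seen.add(variant)
                expanded.append((variant, label))
    return expanded


# Hypothetical stand-in augmenters.
def swap_film(text):
    return text.replace("movie", "film")


def drop_really(text):
    return text.replace("really ", "")


small = [
    ("the movie was really great", "pos"),
    ("the plot was dull", "neg"),
]
bigger = augment_dataset(small, [swap_film, drop_really])
print(len(small), "->", len(bigger))
```

Skipping duplicates matters here: an augmenter that leaves a sentence unchanged would otherwise just repeat the original example.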

Key idea

Text augmentation creates paraphrases through synonyms or back translation but must stay conservative so the label and meaning survive.

Check yourself

1. How does back translation augment text?

2. Why must text augmentation be conservative?