Machine Learning

Label-Preserving Data Augmentation

Inventing realistic new training examples to fight overfitting.

More data for free

Models usually generalize better when trained on more varied data, but collecting and labeling that data is expensive. Data augmentation creates new training examples by applying label-preserving transformations to the ones you already have, effectively enlarging the dataset at no labeling cost.

Common transformations

The trick is to change the input in ways that do not change its label:

  • For images, flip, rotate, crop, recolor, or add noise, since a flipped cat is still a cat
  • For text, swap synonyms, back-translate through another language, or delete a few random words
  • For audio, shift pitch, stretch time, or mix in background sound
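As a rough illustration of the text case, the sketch below swaps synonyms and deletes random words while copying the label across unchanged. The `SYNONYMS` table, the function names, and the deletion probability `p` are all hypothetical choices made for this example; a real pipeline would likely draw synonyms from a thesaurus or a language model.

```python
import random

# Hypothetical toy synonym table, invented for this example only.
SYNONYMS = {"happy": ["glad", "cheerful"], "movie": ["film"]}

def swap_synonyms(words, rng):
    """Replace each word that has a known synonym with a randomly chosen one."""
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]

def delete_random_words(words, rng, p=0.1):
    """Drop each word with probability p, always keeping at least one word."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)
sentence = "i was happy with this movie".split()
label = "positive"

# The augmented example reuses the original label unchanged.
augmented = delete_random_words(swap_synonyms(sentence, rng), rng, p=0.2)
augmented_example = (" ".join(augmented), label)
```

Keeping `p` small matters here: deleting too many words (or a single word such as "not") can change the meaning, and then the copied label would be wrong.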

By seeing many variations, the model learns features that are invariant to these changes rather than memorizing exact pixels or words.
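For images, the same idea can be sketched with NumPy. This is a minimal example under stated assumptions, not a production pipeline: the image is taken to be a height × width × channels array with values in [0, 1], and `augment_image`, `noise_std`, and `crop` are names and defaults invented here.

```python
import numpy as np

def augment_image(image, rng, noise_std=0.05, crop=2):
    """Return a randomly flipped, cropped, and noised copy of an HxWxC image in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:
        out = out[:, ::-1, :]  # horizontal flip half of the time
    h, w, _ = out.shape
    top = rng.integers(0, crop + 1)   # random offset in 0..crop
    left = rng.integers(0, crop + 1)
    out = out[top:top + h - crop, left:left + w - crop, :]  # random crop
    out = out + rng.normal(0.0, noise_std, size=out.shape)  # mild Gaussian noise
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))  # stand-in for a real photo
label = "cat"

# Each call yields a different variant; every variant keeps the same label.
batch = [(augment_image(image, rng), label) for _ in range(4)]
```

Because a fresh variant is drawn on every call, the model rarely sees the exact same pixels twice, which is what pushes it toward features that survive flips, shifts, and noise.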

Cautions

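Augmentations must respect the task. Rotating a handwritten six by 180 degrees turns it into a nine, and mirroring most digits horizontally produces a shape that is not a valid digit at all, so the original label no longer fits. Overly aggressive noise can likewise destroy the signal the label depends on. Some stronger modern recipes, such as mixup, go further and blend pairs of examples together along with their labels, so they deliberately relax the strict label-preserving rule.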
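To make the mixup idea concrete, here is a minimal sketch assuming NumPy arrays and one-hot labels. In mixup the labels are blended with the same weight as the inputs, with the weight drawn from a Beta(alpha, alpha) distribution; the value `alpha=0.2` and the two-class "cat"/"dog" setup are illustrative assumptions.

```python
import numpy as np

def mixup(x1, y1, x2, y2, rng, alpha=0.2):
    """Blend two examples and their one-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2  # blended input
    y = lam * y1 + (1.0 - lam) * y2  # blended (soft) label
    return x, y, lam

rng = np.random.default_rng(0)
x_cat, y_cat = rng.random((8, 8)), np.array([1.0, 0.0])  # class 0: "cat"
x_dog, y_dog = rng.random((8, 8)), np.array([0.0, 1.0])  # class 1: "dog"

x_mix, y_mix, lam = mixup(x_cat, y_cat, x_dog, y_dog, rng)
```

The resulting label is a soft target such as "mostly cat, a little dog", so the training loss needs to accept probability vectors rather than a single class index.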

Key idea

Data augmentation applies label-preserving transforms to expand and diversify training data, teaching the model invariances and reducing overfitting.

Check yourself

1. What must a data augmentation preserve?

2. Why can horizontally flipping a digit image be a bad augmentation?