Label-Preserving Data Augmentation

More data for free

Models generalize better with more varied data, but collecting and labeling it is expensive. Data augmentation creates new training examples by applying label-preserving transformations to the ones you already have, effectively enlarging the dataset without any new labeling work.

Common transformations

The trick is to change the input in ways that do not change its label:

For images, flip, rotate, crop, recolor, or add noise, since a flipped cat is still a cat
For text, swap synonyms, back-translate through another language, or delete a few random words
For audio, shift pitch, stretch time, or mix in background sound

By seeing many variations, the model learns features that are invariant to these changes rather than memorizing exact pixels or words.
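
As a rough illustration of the image case, the sketch below applies a random horizontal flip, a small random crop, and Gaussian noise to one image while leaving its label untouched. It assumes NumPy and an image stored as a height x width x channels array with values in [0, 1]. The function name augment_image and the parameters noise_std and crop_pad are made up for this example, and the image is random data standing in for a real photo.

```python
import numpy as np

def augment_image(image, rng, noise_std=0.05, crop_pad=2):
    """Return a randomly flipped, cropped, and noised copy of an H x W x C image in [0, 1]."""
    out = image.copy()

    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]

    # Random crop: pad by reflection, then cut a window of the original size
    # at a random offset, which shifts the content by up to crop_pad pixels.
    h, w, _ = out.shape
    padded = np.pad(out, ((crop_pad, crop_pad), (crop_pad, crop_pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * crop_pad + 1)
    left = rng.integers(0, 2 * crop_pad + 1)
    out = padded[top:top + h, left:left + w, :]

    # Mild Gaussian noise, clipped back into the valid pixel range.
    out = out + rng.normal(0.0, noise_std, size=out.shape)
    return np.clip(out, 0.0, 1.0)

rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))   # placeholder for a real photo
label = "cat"

# Four new training pairs from one original; the label is copied unchanged.
augmented = [(augment_image(image, rng), label) for _ in range(4)]
```

In practice the transform is usually applied on the fly each time an example is drawn, so the model rarely sees exactly the same pixels twice.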

Cautions

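Augmentations must respect the task. Flipping or rotating a photo of a digit can change its label: a six rotated by 180 degrees reads as a nine, and a mirrored six is not a valid digit at all. Overly aggressive noise can likewise destroy the signal the label depends on. Some strong modern recipes go beyond strict label preservation: mixup blends pairs of whole examples together and blends their labels in the same proportion.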
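
To make the mixup idea concrete, here is a minimal sketch, assuming NumPy, one-hot labels, and a single mixing weight per batch drawn from a Beta(alpha, alpha) distribution. The name mixup_batch and the default alpha=0.2 are illustrative choices, and the batch is random placeholder data.

```python
import numpy as np

def mixup_batch(x, y_onehot, rng, alpha=0.2):
    """Blend each example with a randomly chosen partner from the same batch."""
    # One mixing weight for the whole batch, drawn from Beta(alpha, alpha).
    lam = rng.beta(alpha, alpha)

    # Pair every example with another one via a random permutation.
    perm = rng.permutation(len(x))

    # Inputs and labels are blended with the same weight, so the target
    # becomes a soft label rather than a single class.
    x_mixed = lam * x + (1.0 - lam) * x[perm]
    y_mixed = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mixed, y_mixed, lam

rng = np.random.default_rng(0)
x = rng.random((8, 32, 32, 3))                  # placeholder image batch
y = np.eye(10)[rng.integers(0, 10, size=8)]     # one-hot labels, 10 classes

x_mixed, y_mixed, lam = mixup_batch(x, y, rng)
```

Because the labels are mixed along with the inputs, each blended example is trained against a soft target, which is why mixup does not need the transform itself to preserve a single label.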

Key idea

Data augmentation applies label-preserving transforms to expand and diversify training data, teaching the model invariances and reducing overfitting.
