The label bottleneck
Supervised learning needs human labels, which are slow and costly to produce, yet the world is full of unlabeled text, images, and audio. Self-supervised learning turns that raw data into a training signal, typically by hiding part of each example and asking the model to predict it.
Pretext tasks
The model trains on a pretext task whose answer comes from the data itself, with no human annotation needed:
- Predict a masked word from its surrounding context
- Predict the next token in a sequence
- Decide whether two augmented views come from the same image
Solving these tasks forces the model to learn underlying structure such as grammar, object parts, and semantic relationships. The result is a general-purpose representation.
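To make the idea of labels coming from the data itself concrete, here is a minimal sketch of how the three pretext tasks listed above could turn raw examples into (input, label) pairs. It is an illustration under simplifying assumptions, not how any particular system does it: the function names, the `[MASK]` placeholder, and the flip-based augmentations are hypothetical choices. Practical masked-word training usually hides a random subset of positions rather than one position at a time, and images are represented here as small nested lists.

```python
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical placeholder token, chosen for illustration

Image = List[List[int]]  # toy image: a list of pixel rows


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Each prefix is an input; the token that follows it is the label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_word_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Hide one position at a time; the hidden word is the label."""
    pairs = []
    for i, word in enumerate(tokens):
        masked = tokens[:i] + [MASK] + tokens[i + 1:]
        pairs.append((masked, word))
    return pairs


def hflip(image: Image) -> Image:
    """Toy augmentation: mirror each row left to right."""
    return [row[::-1] for row in image]


def vflip(image: Image) -> Image:
    """Toy augmentation: reverse the order of the rows."""
    return image[::-1]


def same_image_pairs(images: List[Image]) -> List[Tuple[Image, Image, bool]]:
    """Pair two augmented views; the label says whether they share a source image."""
    pairs = []
    for i, a in enumerate(images):
        for j, b in enumerate(images):
            pairs.append((hflip(a), vflip(b), i == j))
    return pairs


tokens = "the cat sat on the mat".split()
for context, target in next_token_pairs(tokens)[:2]:
    print(context, "->", target)
for context, target in masked_word_pairs(tokens)[:2]:
    print(context, "->", target)
```

In every case the label is computed from the example itself, so no annotator is involved. Only the choice of what to hide or which views to compare is made by the designer.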
Why it changed the field
After self-supervised pretraining on huge corpora, a model often needs only a comparatively small labeled set to fine-tune for a specific task. This pretrain-then-adapt recipe underlies modern language models and many vision systems.
Key idea
Self-supervised learning manufactures labels from unlabeled data through pretext tasks, yielding strong representations that transfer with little labeled data.