The idea
Self-supervised learning trains a model without human labels by generating the labels from the data itself. You hide part of an input and ask the model to predict it, turning unlabeled data into a supervised task. This is how most large language models, and many vision models, are pretrained.
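As a minimal sketch of "hide part of an input and predict it", the snippet below turns a plain sentence into a training example. The names `mask_tokens` and `MASK`, the 30% masking rate, and the whitespace tokenization are assumptions made for illustration, not part of any particular library or model.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def mask_tokens(tokens, mask_rate=0.3, seed=0):
    """Hide a random subset of tokens and return (masked_input, labels).

    labels maps each hidden position to the original token, so the
    supervision signal comes entirely from the input itself.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Always hide at least one token so there is something to predict.
    n_hidden = max(1, round(len(tokens) * mask_rate))
    hidden_positions = sorted(rng.sample(range(len(tokens)), n_hidden))
    masked = list(tokens)
    labels = {}
    for pos in hidden_positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, labels


tokens = "the cat sat on the mat".split()
masked, labels = mask_tokens(tokens)
print(masked)  # the model's input, with some words hidden
print(labels)  # the targets, recovered from the original sentence
```

No human wrote the labels here: they are simply the words that were hidden.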
Pretext tasks
The made-up task the model solves is called a pretext task. Common examples:
- In language, mask some words and predict them, or predict the next token
- In vision, hide patches of an image and reconstruct them
- In contrastive learning, pull two augmented views of the same image together and push different images apart
The point is not the pretext task itself but the useful representations the model learns while solving it.
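The contrastive case can be made concrete with a small loss function. The sketch below is one common formulation (an InfoNCE-style loss), simplified to plain Python lists. The function names, the temperature of 0.5, and the toy 2-D embeddings are assumptions for illustration; a real system would compute embeddings with a neural network and use a tensor library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(views_a, views_b, temperature=0.5):
    """InfoNCE-style loss; views_a[i] and views_b[i] come from the same image.

    For each anchor in views_a, the matching view in views_b is treated
    as the correct "class" and every other view as a negative.
    """
    total = 0.0
    for i, anchor in enumerate(views_a):
        logits = [cosine(anchor, other) / temperature for other in views_b]
        # Numerically stable log-sum-exp for the softmax denominator.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        # Cross-entropy with index i as the target.
        total += log_sum_exp - logits[i]
    return total / len(views_a)


# Toy embeddings of two images, each seen under two augmentations.
views_a = [[1.0, 0.0], [0.0, 1.0]]
views_b = [[0.9, 0.1], [0.1, 0.9]]

aligned = contrastive_loss(views_a, views_b)
misaligned = contrastive_loss(views_a, views_b[::-1])
print(aligned, misaligned)
```

The loss is lower when views of the same image are close and views of different images are far apart, which is the "pull together, push apart" behavior described in the list.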
Why it matters
Labeled data is scarce and costly, while raw data is abundant. Self-supervised pretraining learns general features from that raw data, and the model is then fine-tuned on a small labeled set for a specific task. This two-stage recipe powers most modern foundation models.
Relation to other paradigms
Self-supervised learning differs from supervised learning, which needs human labels. It also differs from unsupervised methods such as clustering, because it still solves a prediction task, just one whose labels come for free from the data.
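To see what "labels come for free" means in practice, the sketch below builds (input, target) pairs for next-token prediction from raw text alone. The function name `next_token_pairs` and the context size of 3 are assumptions chosen for illustration.

```python
def next_token_pairs(tokens, context_size=3):
    """Build (context, target) pairs where each target is the next token.

    The result has the same shape as a supervised dataset, but every
    target is read directly from the text rather than annotated.
    """
    pairs = []
    for end in range(1, len(tokens)):
        context = tokens[max(0, end - context_size):end]
        pairs.append((context, tokens[end]))
    return pairs


tokens = "labels come for free from the data".split()
pairs = next_token_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

A supervised dataset would need a person to supply each target, and a clustering method would have no targets at all.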
Key idea
Self-supervised learning derives labels from the data itself to pretrain rich representations, which are then fine-tuned with a small number of labels.