Self-Supervised Learning

The idea

Self-supervised learning trains a model without human labels by generating the labels from the data itself. You hide part of an input and ask the model to predict it, which turns unlabeled data into a supervised task. This is how many large language and vision models are pretrained.
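As a rough illustration of "hide part of an input and predict it", the sketch below turns an unlabeled token list into an input/target pair by masking. The names `MASK`, `make_masked_example`, and `mask_prob` are hypothetical choices made for this example and do not come from any particular library.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder token, chosen for this sketch


def make_masked_example(tokens, mask_prob=0.15, seed=0):
    """Turn an unlabeled token list into (inputs, targets).

    targets[i] holds the original token wherever inputs[i] was masked and
    None elsewhere, so a loss would be computed only on hidden positions.
    """
    rng = random.Random(seed)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            targets.append(tok)   # the hidden token becomes the label
        else:
            inputs.append(tok)
            targets.append(None)  # nothing to predict at this position
    return inputs, targets


sentence = "the cat sat on the mat".split()
inputs, targets = make_masked_example(sentence, mask_prob=0.3, seed=1)
```

No human wrote the labels here: every target is copied from the raw sentence, which is what makes the task self-supervised.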

Pretext tasks

The made-up task the model solves is called a pretext task.

In language, mask some words and predict them, or predict the next token
In vision, hide patches of an image and reconstruct them
In contrastive learning, pull two augmented views of the same image together and push different images apart
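The contrastive case is the least obvious of the three, so here is a minimal sketch of one common contrastive objective, often called InfoNCE. The text above does not name a specific loss, so treat this choice, the function names, and the `temperature` value as assumptions. The code also assumes non-zero embedding vectors, and that `views_a[i]` and `views_b[i]` are embeddings of two augmented views of the same image.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(views_a, views_b, temperature=0.1):
    """Average cross-entropy of picking the matching view among all candidates.

    For anchor views_a[i], views_b[i] is the positive and every
    views_b[j] with j != i acts as a negative.
    """
    n = len(views_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability assigned to the true pair.
        total += log_sum_exp - logits[i]
    return total / n


# Toy embeddings: matching views point the same way, different images do not.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is small when each view is most similar to its own partner (`aligned`) and large when it is more similar to a different image (`mismatched`), which is the "pull together, push apart" behavior described above.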

The point is not the pretext task itself but the useful representations the model learns while solving it.

Why it matters

Labeled data is scarce and costly, while raw data is abundant. Self-supervised pretraining learns general features from that raw data, and the model is then fine-tuned on a small labeled set for a specific task. This two-stage recipe powers most modern foundation models.

Relation to other paradigms

Self-supervised learning differs from supervised learning, which needs human labels. It also differs from unsupervised clustering because it still solves a prediction task, just one whose labels come for free from the data.

Key idea