Contrastive Language-Image Pretraining

What it is

Contrastive Language-Image Pretraining (CLIP) learns a shared embedding space for images and text. An image encoder and a text encoder are trained jointly so that the embeddings of a picture and its caption land close together, while the embeddings of mismatched pairs land far apart.

The contrastive objective

Training uses large batches of image-caption pairs.

Each image is encoded to a vector and each caption to a vector.
For a batch, the model computes the similarity (typically cosine similarity) between every image and every caption.
The loss pulls each true image-caption pair together and pushes all other pairs in the batch apart.
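
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes the encoders have already produced the vectors, and it uses one common formulation of the objective, a cross-entropy over the similarity matrix applied in both directions (image to text and text to image). The function names and the temperature value of 0.07 are assumptions made for this example; real training code would use a tensor library and typically learns the temperature.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Cross-entropy of one row of logits against the index of the true match."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss; item i of each list forms a true pair.

    The temperature is an assumed constant here.
    """
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # Similarity between every image and every caption in the batch.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image to text: each row should pick out its own caption (the diagonal).
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text to image: each column should pick out its own image.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = sum(cross_entropy(columns[j], j) for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

With a toy batch such as `[[1, 0], [0, 1]]` for both images and captions, the loss is close to zero; swapping the two captions makes it large, because every true pair now sits off the diagonal.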

Because the negatives come free from other items in the batch, training scales to enormous web datasets without manual labels.

Why it is powerful

Zero-shot classification: to classify an image, compare its embedding to the text embeddings of the candidate labels and pick the closest.
Retrieval: search images with text queries, or text with image queries.
Reusable encoder: it provides the image encoder that many multimodal models build on.
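
Zero-shot classification and retrieval both reduce to a nearest-neighbor lookup in the shared space, which the sketch below illustrates. The embeddings are small hand-written vectors standing in for real encoder outputs, and the helper names, label prompts, and file names are hypothetical, chosen only for this example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, label_vecs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

def retrieve(query_vec, candidates, k=1):
    """Return the names of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (assumed values, not produced by a real model).
label_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_bank = {
    "dog.jpg": [0.8, 0.3, 0.1],
    "car.jpg": [0.1, 0.0, 0.9],
}

# Classification: image embedding against candidate label embeddings.
predicted = zero_shot_classify(image_bank["dog.jpg"], label_vecs)

# Retrieval: text embedding against a bank of image embeddings.
best_match = retrieve(label_vecs["a photo of a car"], image_bank, k=1)
```

No classifier is trained for the label set: changing the candidate labels only means embedding different text.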

Because both modalities share one space, matching is driven by meaning rather than by raw pixels.

Key idea

CLIP trains image and text encoders contrastively so that matched pairs align in one space, enabling zero-shot classification and text-image retrieval.
