Contrastive Language-Image Pretraining

Learning a shared image-text space by pulling matched pairs together.

What it is

Contrastive Language-Image Pretraining (CLIP) learns a shared embedding space for images and text. An image encoder and a text encoder are trained together so that a picture and its caption land close together in that space, while mismatched pairs land far apart.

The contrastive objective

Training uses large batches of image-caption pairs.

  • Each image is encoded to a vector and each caption to a vector.
  • For a batch of N pairs, the model computes the cosine similarity between every image and every caption, giving an N × N matrix.
  • The loss pulls the N true image-caption pairs together and pushes the N² − N mismatched pairs in the batch apart.

Because the negatives come for free from the other items in the batch, training scales to enormous web datasets of image-caption pairs without manual labels.
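
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: it assumes the common formulation of the objective as a symmetric cross-entropy over the similarity matrix, uses a fixed temperature of 0.07 (in CLIP the scale is a learned parameter), and runs on hypothetical 2-D toy embeddings. The function names are made up for this example; a real system would use an array library and trained encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_vecs, text_vecs, temperature=0.07):
    """Cosine similarity of every image with every caption, divided by a temperature."""
    images = [l2_normalize(v) for v in image_vecs]
    texts = [l2_normalize(v) for v in text_vecs]
    return [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
            for img in images]

def cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropy terms."""
    logits = similarity_matrix(image_vecs, text_vecs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(transposed))

# Toy batch of 3 pairs with hypothetical 2-D embeddings.
image_vecs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# Same captions, but each one is now paired with the wrong image.
shuffled_text = [aligned_text[1], aligned_text[2], aligned_text[0]]

loss_aligned = contrastive_loss(image_vecs, aligned_text)
loss_shuffled = contrastive_loss(image_vecs, shuffled_text)
print(loss_aligned < loss_shuffled)  # True
```

Every off-diagonal entry of the matrix acts as a negative, which is why no extra labels are needed: the loss is low when captions sit near their own images and high when the pairing is scrambled.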

Why it is powerful

  • Zero-shot classification: to classify an image, compare its embedding to the text embeddings of the candidate labels and pick the closest one. No training on those labels is needed.
  • Retrieval: search images with text queries, or text with image queries.
  • It provides the image encoder that many multimodal models build on.

Because both modalities live in one space, matching is driven by meaning rather than by raw pixels.
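
As a rough sketch of how zero-shot classification and retrieval reduce to nearest-neighbor search in the shared space, the example below ranks embeddings by cosine similarity. All vectors here are hypothetical stand-ins for encoder outputs, the "a photo of a ..." label phrasing is an assumed prompt style, and the helper names `zero_shot_classify` and `retrieve` are invented for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, label_vecs):
    """Return the label whose text embedding is closest to the image embedding."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

def retrieve(query_vec, gallery, top_k=2):
    """Rank gallery items (name -> embedding) by similarity to the query."""
    ranked = sorted(gallery,
                    key=lambda name: cosine(query_vec, gallery[name]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical text embeddings of candidate labels (would come from the text encoder).
label_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
# Stand-in for the image encoder's output on a cat photo.
image_vec = [0.2, 0.8, 0.1]
predicted = zero_shot_classify(image_vec, label_vecs)
print(predicted)  # a photo of a cat

# Hypothetical image embeddings for a small gallery.
image_gallery = {
    "dog.jpg": [0.8, 0.2, 0.1],
    "cat.jpg": [0.2, 0.8, 0.1],
    "car.jpg": [0.1, 0.0, 0.9],
}
# Stand-in for the text encoder's output on a query about a car.
text_query_vec = [0.0, 0.1, 0.9]
top_matches = retrieve(text_query_vec, image_gallery, top_k=2)
print(top_matches)  # ['car.jpg', 'cat.jpg']
```

The same `retrieve` call works in the other direction: pass an image embedding as the query and a dictionary of caption embeddings as the gallery.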

Key idea

CLIP trains image and text encoders contrastively so that matched pairs align in one space, enabling zero-shot classification and text-image retrieval.

Check yourself

1. How does the contrastive objective in CLIP get its negative examples?

2. How does CLIP perform zero-shot image classification?