Machine Learning

Image Embeddings With CLIP

How images and text learn to live in the same vector space.

6 min read · core

Two encoders, one space

CLIP trains an image encoder and a text encoder together so that their outputs land in a single shared vector space. A photo of a dog and the caption "a photo of a dog" should embed to nearby vectors, while an unrelated caption should land farther away. The image encoder's output then serves as a semantically meaningful image embedding.
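
To make "nearby vectors" concrete, here is a minimal sketch that measures closeness with cosine similarity, a common choice for comparing embeddings. The three-dimensional vectors are made up for illustration; they are not real CLIP outputs, which have hundreds of dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings (assumed values, not produced by CLIP).
dog_image = [0.9, 0.1, 0.2]     # image encoder output for a dog photo
dog_caption = [0.8, 0.2, 0.1]   # text encoder output for "a photo of a dog"
car_caption = [0.1, 0.9, 0.3]   # text encoder output for "a photo of a car"

sim_match = cosine_similarity(dog_image, dog_caption)
sim_mismatch = cosine_similarity(dog_image, car_caption)
print(f"matching caption:   {sim_match:.3f}")
print(f"unrelated caption:  {sim_mismatch:.3f}")
```

In a well-trained shared space, the matching caption scores noticeably higher than the unrelated one, as it does with these toy values.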

Contrastive training on pairs

CLIP learns from hundreds of millions of image-caption pairs collected from the web. For each batch, it computes a similarity matrix between every image and every caption. A contrastive loss then pushes each true image-caption pair to be the most similar entry in its row and column, treating all other combinations in the batch as negatives.
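
The batch-level objective described above can be sketched in plain Python. This is an illustrative version under stated assumptions, not CLIP's actual training code: the function names are hypothetical, the loss is a symmetric cross-entropy over the similarity matrix, and the temperature of 0.07 is just an illustrative scaling value.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_vecs[i], text_vecs[i])."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    # logits[i][j]: scaled similarity between image i and caption j.
    logits = [[sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts]
              for im in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # caption-to-image view
    image_to_text = cross_entropy_diagonal(logits)
    text_to_image = cross_entropy_diagonal(logits_t)
    return 0.5 * (image_to_text + text_to_image)

# Toy batch of two pairs (assumed values).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_captions = [[1.0, 0.0], [0.0, 1.0]]   # each caption matches its image
swapped_captions = [[0.0, 1.0], [1.0, 0.0]]   # captions paired with the wrong image

loss_aligned = contrastive_loss(images, aligned_captions)
loss_swapped = contrastive_loss(images, swapped_captions)
print(loss_aligned, loss_swapped)
```

The loss is near zero when every true pair is the most similar entry in its row and column, and large when the captions are swapped, which is the signal that drives the two encoders toward a shared space.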

Why this is useful

  • Zero-shot classification: compare an image vector to text vectors for prompts like "a photo of a cat" or "a photo of a car" and pick the closest.
  • Cross-modal search: find images from a text query, or find text from an image.
  • Strong features: the image embeddings transfer well to many downstream tasks.
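
The first two uses above reduce to nearest-neighbor lookups in the shared space. The sketch below assumes the embeddings have already been computed; the vectors, file names, and helper functions are hypothetical stand-ins, not output from a real CLIP model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def zero_shot_classify(image_vec, prompt_vecs):
    """Return the prompt whose embedding is closest to the image embedding."""
    return max(prompt_vecs, key=lambda p: cosine(image_vec, prompt_vecs[p]))

def search_images(query_vec, image_vecs, k=2):
    """Return the k image names most similar to a text query embedding."""
    ranked = sorted(image_vecs,
                    key=lambda name: cosine(query_vec, image_vecs[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical text embeddings for the candidate class prompts.
prompts = {
    "a photo of a cat": [0.9, 0.1, 0.1],
    "a photo of a car": [0.1, 0.9, 0.2],
}
# Hypothetical image embeddings for a small collection.
images = {
    "cat.jpg": [0.8, 0.2, 0.1],
    "car.jpg": [0.2, 0.8, 0.3],
    "truck.jpg": [0.3, 0.7, 0.4],
}

predicted = zero_shot_classify(images["cat.jpg"], prompts)
top_hits = search_images(prompts["a photo of a car"], images, k=2)
print(predicted)   # a photo of a cat
print(top_hits)    # ['car.jpg', 'truck.jpg']
```

No classifier is trained here: the class list is just a set of text prompts, so changing the labels only requires embedding new prompts.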

Limitations

CLIP can struggle with fine-grained counting, exact text in images, and concepts that are rare on the web. Its knowledge also reflects the biases of its training data.

Key idea

CLIP uses contrastive training on image-caption pairs to place images and text in one shared space, yielding image embeddings that enable zero-shot classification and cross-modal search.

Check yourself

1. What does CLIP align in a shared vector space?

2. How does CLIP enable zero-shot image classification?