Image Embeddings With CLIP

Two encoders, one space

CLIP trains an image encoder and a text encoder so that their outputs share a single embedding space. A photo of a dog and the caption "a photo of a dog" should embed to nearby vectors. The image encoder's output then serves as a rich, semantically meaningful image embedding.

Contrastive training on pairs

CLIP learns from hundreds of millions of image-caption pairs collected from the web. For each batch, it builds a similarity matrix between every image and every caption, then applies a contrastive loss that pushes each true image-caption pair to be the most similar, treating all other combinations in the batch as negatives.
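
The batch-level loss described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not CLIP's actual training code: the function names are hypothetical, the two-dimensional vectors are made up, and the temperature is fixed at 0.07 for simplicity (in CLIP itself it is a learned parameter). The loss is a cross-entropy over the rows of the similarity matrix (image to text) averaged with the same over its columns (text to image), with the matching pair on the diagonal as the correct answer.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_vecs, text_vecs, temperature=0.07):
    """Entry [i][j] is the scaled cosine similarity of image i and caption j."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row k is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[k]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropies."""
    logits = similarity_matrix(image_vecs, text_vecs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

# Made-up embeddings: image k and caption k are meant to be a true pair.
matched_images = [[1.0, 0.0], [0.0, 1.0]]
matched_texts = [[0.9, 0.1], [0.1, 0.9]]

aligned = contrastive_loss(matched_images, matched_texts)
# Reversing the captions breaks the pairing, so the loss should go up.
shuffled = contrastive_loss(matched_images, matched_texts[::-1])
```

With these toy vectors the correctly paired batch gives a much lower loss than the shuffled one, which is the signal that drives both encoders toward a shared space.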

Why this is useful

Zero-shot classification: compare an image vector to text vectors for prompts like "a photo of a cat" or "a photo of a car" and pick the closest.
Cross-modal search: find images from a text query, or vice versa.
Strong features: the image embeddings transfer well to many downstream tasks.
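
The first two uses come down to nearest-neighbor lookups by cosine similarity. The sketch below assumes the embeddings have already been computed; the three-dimensional vectors, file names, and function names are made up for illustration and do not come from a real CLIP model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, label_vecs):
    """Return the prompt whose text vector is closest to the image vector."""
    return max(label_vecs, key=lambda label: cosine(image_vec, label_vecs[label]))

def text_to_image_search(query_vec, image_vecs, top_k=1):
    """Rank stored images by similarity to a text query vector."""
    ranked = sorted(image_vecs,
                    key=lambda name: cosine(query_vec, image_vecs[name]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical text embeddings for two candidate prompts.
label_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a car": [0.0, 0.2, 0.9],
}

# Hypothetical image embeddings.
image_vecs = {
    "cat.jpg": [0.8, 0.3, 0.1],
    "car.jpg": [0.1, 0.1, 0.95],
}

predicted = zero_shot_classify(image_vecs["cat.jpg"], label_vecs)
hits = text_to_image_search(label_vecs["a photo of a car"], image_vecs)
```

No classifier is trained here: adding a new class only requires embedding one more prompt, which is what makes the approach zero-shot.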

Limitations

CLIP can struggle with fine-grained counting, with reading exact text in images, and with concepts that are rare on the web. Its knowledge also reflects the biases of its training data.

Key idea

CLIP uses contrastive training on image-caption pairs to place images and text in one shared space, yielding image embeddings that enable zero-shot classification and cross-modal search.

Check yourself