Two encoders, one space
CLIP trains an image encoder and a text encoder so that their outputs share a single embedding space. A photo of a dog and the caption "a photo of a dog" should embed to nearby vectors. The image vector thus serves as a rich, semantically meaningful image embedding.
Contrastive training on pairs
CLIP learns from hundreds of millions of image-caption pairs collected from the web. For each batch, it builds a similarity matrix between every image and every caption and applies a contrastive loss that pushes each true image-caption pair to be the most similar, treating all other combinations in the batch as negatives.
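The batch-level objective can be sketched in a few lines of plain Python. This is a minimal illustration, not CLIP's actual training code: it assumes a symmetric cross-entropy over the rows and columns of the similarity matrix, cosine similarity on unit-normalized vectors, and a fixed temperature of 0.07. The function names and the toy 3-dimensional vectors are made up for this example; a real implementation would use a tensor library and learned encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_vecs, text_vecs, temperature=0.07):
    """Entry [i][j] is the temperature-scaled cosine similarity of image i and caption j."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    return [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Average of the image-to-text and text-to-image losses (assumed symmetric form)."""
    logits = similarity_matrix(image_vecs, text_vecs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

# Toy batch: image i and caption i point in roughly the same direction.
image_vecs = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.0, 1.0]]
text_vecs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]

aligned = contrastive_loss(image_vecs, text_vecs)
# Rotate the captions so every image is paired with the wrong caption.
shuffled = contrastive_loss(image_vecs, text_vecs[1:] + text_vecs[:1])
print(aligned, shuffled)
```

With the captions correctly aligned, the diagonal of the matrix dominates and the loss is small. Rotating the captions moves the true pairs off the diagonal and the loss grows, which is the signal training uses to pull matching pairs together.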
Why this is useful
- Zero-shot classification: compare an image vector to text vectors for prompts like "a photo of a cat" or "a photo of a car" and pick the closest.
- Cross-modal search: find images from a text query, or vice versa.
- Strong features: the image embeddings transfer well to many downstream tasks.
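Zero-shot classification reduces to a nearest-neighbor lookup in the shared space. The sketch below assumes the embeddings have already been computed. The vectors are hand-written stand-ins for encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, prompt_vecs):
    """Return the prompt closest to the image embedding, plus all scores."""
    scores = {prompt: cosine(image_vec, vec) for prompt, vec in prompt_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up 3-d vectors standing in for text-encoder outputs.
prompt_vecs = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a car": [0.0, 0.2, 0.9],
}
# Stand-in for the image-encoder output of a cat photo.
image_vec = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_vec, prompt_vecs)
print(label)
```

Cross-modal search uses the same comparison with the roles swapped: embed the text query once, score it against every stored image vector, and return the highest-scoring images.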
Limitations
CLIP can struggle with fine-grained counting, reading exact text in images, and concepts that are rare on the web. Its knowledge also reflects the biases of its training data.
Key idea
CLIP uses contrastive training on image-caption pairs to place images and text in one shared space, yielding image embeddings that enable zero-shot classification and cross-modal search.