The CLIP Contrastive Vision

Learning from captions

CLIP learns visual representations from natural language. It trains on very large sets of image-caption pairs scraped from the web, avoiding hand-labeled categories entirely.

The contrastive objective

CLIP uses two encoders, one for images and one for text, each mapping its input to a vector. Within a batch, the goal is a contrastive match:

The correct image-text pairs should have high similarity.
All mismatched pairs in the batch should have low similarity.

The loss pulls true pairs together and pushes mismatched pairs apart in a shared embedding space.
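
The objective above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes a symmetric cross-entropy over temperature-scaled cosine similarities, which is one common way to write such a loss. The function names, the temperature value of 0.07, and the toy two-dimensional vectors are all assumptions made for the example; real encoders would produce the vectors.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_vecs, text_vecs, temperature):
    """logits[i][j] = cosine(image_i, text_j) / temperature."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Average of the image-to-text and text-to-image cross-entropies."""
    logits = similarity_matrix(image_vecs, text_vecs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits)
                  + cross_entropy_diagonal(transposed))

# Toy batch of two pairs (illustrative vectors, not real encoder outputs).
images = [[1.0, 0.1], [0.1, 1.0]]
matched_texts = [[0.9, 0.0], [0.0, 0.8]]
swapped_texts = [matched_texts[1], matched_texts[0]]

loss_matched = contrastive_loss(images, matched_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
```

In this toy batch the loss is close to zero when each image sits next to its own caption and much larger when the captions are swapped, which is the behavior the two conditions above describe.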

Zero-shot classification

Because images and text live in the same space, you can classify without training a new classification head. Write each class as a sentence, encode each sentence with the text encoder, and pick the class whose text vector is most similar to the image vector. A new label set needs only new sentences, not retraining.
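
The procedure can be sketched as a nearest-neighbor lookup over class text vectors. This is a hedged illustration under stated assumptions: the function names are hypothetical, cosine similarity is assumed as the similarity measure, and the three-dimensional vectors are hand-written stand-ins for what the text encoder might return for sentences such as "a photo of a dog". No real encoder is called.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, class_text_vecs):
    """Return the label whose text vector is most similar to the image vector."""
    return max(class_text_vecs,
               key=lambda label: cosine(image_vec, class_text_vecs[label]))

# Hypothetical text embeddings, one per class sentence (assumed values).
class_text_vecs = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.1, 0.9, 0.0],
    "car": [0.0, 0.1, 0.9],
}

# Hypothetical image embedding.
image_vec = [0.8, 0.3, 0.1]
predicted = zero_shot_classify(image_vec, class_text_vecs)

# Extending the label set only requires one more text vector; nothing is retrained.
class_text_vecs["bird"] = [0.5, 0.5, 0.7]
second_image_vec = [0.5, 0.4, 0.8]
predicted_second = zero_shot_classify(second_image_vec, class_text_vecs)
```

The only step that changes when the label set changes is building `class_text_vecs`, which is why new classes cost one text encoding each rather than a training run.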

Strengths and limits

CLIP transfers broadly and holds up well across many datasets without task-specific training. Its weaknesses include sensitivity to prompt wording, trouble with fine-grained classification and counting tasks, and biases inherited from web data.

Key idea