Learning from captions
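CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language supervision. It trains on a very large set of image-caption pairs scraped from the web (about 400 million in the original paper), so it needs no hand-labeled categories.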
The contrastive objective
CLIP uses two encoders, one for images and one for text, each mapping its input to a vector in a shared embedding space. Within a batch of image-caption pairs, training is set up as a matching problem:
- Each correct image-text pair should have high cosine similarity.
- Every mismatched image-text pair in the batch should have low cosine similarity.
The loss, a symmetric cross-entropy over the batch's pairwise similarities, pulls true pairs together and pushes mismatched pairs apart in the shared space.
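The objective can be sketched in a few lines of NumPy. This is a minimal illustration, not CLIP's actual training code: the function names (l2_normalize, log_softmax, contrastive_loss) are made up for this example, the embeddings are random stand-ins for encoder outputs, and the temperature is fixed at 0.07 here, whereas CLIP learns it during training.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a true pair.
    image_emb = l2_normalize(image_emb)
    text_emb = l2_normalize(text_emb)
    logits = image_emb @ text_emb.T / temperature  # shape (N, N)
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image must pick its own caption among all captions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption must pick its own image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy data: random vectors standing in for encoder outputs (assumption).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
aligned = contrastive_loss(text + 0.01 * rng.normal(size=(4, 8)), text)
shuffled = contrastive_loss(text[::-1], text)  # every pair is mismatched
print(aligned, shuffled)
```

When the image vectors sit close to their own captions, the loss is small; when the pairing is scrambled, the loss is much larger, which is the signal that drives the two encoders toward a shared space.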
Zero-shot classification
Because images and text live in the same space, you can classify without training a new head. Write each class as a short sentence (for example, "a photo of a dog"), encode each sentence with the text encoder, and pick the class whose text vector is most similar to the image vector. A new label set needs only new sentences, not retraining.
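The procedure can be sketched as follows. This is a toy illustration under stated assumptions: zero_shot_classify is a hypothetical helper, and the hard-coded 3-dimensional vectors stand in for what real image and text encoders would produce. The softmax temperature of 0.07 is also an assumption chosen for the example.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, class_names, temperature=0.07):
    # Normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb  # one similarity per class
    # Softmax over classes gives a probability-like score for each label.
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return class_names[int(np.argmax(sims))], probs

class_names = ["cat", "dog", "car"]
# In a real system these prompts would go through the text encoder.
prompts = [f"a photo of a {name}" for name in class_names]

# Toy stand-ins for encoder outputs (assumption, not real CLIP embeddings).
text_embs = np.array([[1.0, 0.1, 0.0],
                      [0.1, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])
image_emb = np.array([0.2, 0.9, 0.1])

label, probs = zero_shot_classify(image_emb, text_embs, class_names)
print(label, probs)
```

Swapping in a different label set only means changing class_names and re-encoding the prompts; the image embedding and the classification code stay the same.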
Strengths and limits
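CLIP transfers to many datasets without task-specific training and tends to hold up better under distribution shift than models trained on a single labeled dataset. Its weaknesses include sensitivity to prompt wording, trouble with fine-grained classification and counting tasks, and biases inherited from its web data.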
Key idea
CLIP trains image and text encoders with a contrastive loss so that matching pairs align in a shared space, enabling zero-shot classification by comparing an image to text descriptions of each class.