Machine Learning

The CLIP Contrastive Vision

Aligning image and text encoders to enable zero-shot recognition.

6 min read · advanced

Learning from captions

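CLIP (Contrastive Language-Image Pre-training) learns visual representations from natural language. It trains on a very large set of image-caption pairs scraped from the web (roughly 400 million for the original model), so it needs no hand-labeled categories.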

The contrastive objective

CLIP uses two encoders, one for images and one for text, that each map their input to a vector. Within a batch, the goal is a contrastive match:

  • Each correct image-text pair should have high similarity.
  • Every mismatched pair in the batch should have low similarity.

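The loss pulls true pairs together and pushes false pairs apart in a shared embedding space. Concretely, it is a symmetric cross-entropy over the batch's similarity matrix: each image must pick out its own caption, and each caption must pick out its own image.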
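The objective above can be sketched in a few lines of plain Python. This is a minimal illustration, not CLIP's actual training code: the function names are hypothetical, the vectors stand in for encoder outputs, and the temperature is fixed at an assumed 0.07 (in practice it is typically a learned parameter).

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    # image_vecs[k] and text_vecs[k] are assumed to be a true pair.
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txts] for img in imgs]
    # Image -> text: each row should put its mass on the diagonal entry.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text -> image: same check along each column.
    loss_t2i = sum(cross_entropy([logits[r][k] for r in range(n)], k)
                   for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Toy batch: two pairs whose vectors already agree, then the same batch
# with the captions swapped so every "true" pair is actually a mismatch.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_contrastive_loss(images, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss(images, [[0.0, 1.0], [1.0, 0.0]])
```

With the aligned batch the loss is close to zero, and with the swapped captions it is large, which is the pull-together, push-apart behaviour described above.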

Zero-shot classification

Because images and text live in the same space, you can classify without training a new head. Write each class as a short sentence (for example, "a photo of a dog"), encode the sentences with the text encoder, and pick the class whose text vector is most similar to the image vector. A new label set needs only new sentences, not retraining.
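The procedure can be sketched as follows. The three-dimensional vectors are made-up stand-ins for real encoder outputs, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_vec, class_text_vecs):
    # Score the image against every class sentence and return the best match.
    scores = {label: cosine(image_vec, vec)
              for label, vec in class_text_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores

# Assumed toy embeddings: in a real system these would come from the
# text encoder (one per class sentence) and the image encoder.
class_text_vecs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_vec = [0.8, 0.3, 0.1]

label, scores = zero_shot_classify(image_vec, class_text_vecs)
print(label)  # prints "a photo of a dog"
```

Changing the label set only means changing the dictionary of sentences; nothing about the image side is retrained.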

Strengths and limits

CLIP transfers to many datasets without task-specific training and holds up comparatively well under distribution shift. Its weaknesses include sensitivity to prompt wording, trouble with fine-grained classification and counting tasks, and biases inherited from web data.

Key idea

CLIP trains image and text encoders with a contrastive loss so that matching pairs align in a shared space, enabling zero-shot classification by comparing an image to text descriptions of each class.

Check yourself

1. What does the CLIP contrastive loss do?

2. How does CLIP do zero-shot classification?