Multimodal Embeddings

One space for many modalities

Multimodal embeddings map different input types, such as text, images, audio, and video, into a shared vector space where related content from different modalities lands close together. CLIP is a well-known example that aligns images and text; broader models extend the same idea to more modalities.

How a shared space is built

Each modality has its own encoder producing a vector of the same size.
Paired data, like an image with its caption or a clip with its transcript, drives a contrastive loss that pulls matched pairs together and pushes mismatched pairs apart.
Some models anchor everything to one hub modality: each other modality is paired only with the hub, and those pairings indirectly align the non-hub modalities with one another.
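
The contrastive step above can be sketched as a symmetric InfoNCE-style loss over a batch of paired embeddings. This is a minimal illustration in plain Python, not the training code of any particular model: the function names, the toy vectors, and the temperature of 0.07 are all assumptions chosen for the example.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss for a batch where emb_a[i] pairs with emb_b[i].

    The temperature value is an assumed hyperparameter for this sketch.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # logits[i][j] = cosine similarity of a[i] and b[j], divided by temperature
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    # Transpose to score the b -> a direction as well
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy batch with made-up numbers: two "image" vectors and two "caption" vectors.
image_emb = [[1.0, 0.1], [0.1, 1.0]]
text_emb = [[0.9, 0.2], [0.2, 0.9]]

aligned = contrastive_loss(image_emb, text_emb)         # correct pairing
shuffled = contrastive_loss(image_emb, text_emb[::-1])  # pairs swapped
```

With the correct pairing the loss is close to zero, and swapping the pairs makes it much larger. In training, that difference is the signal that moves each encoder's outputs toward its matching partner.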

What it unlocks

Cross-modal retrieval: search images with text, or audio with an image.
Any-to-any comparison: measure how well a caption matches a sound or a frame.
Unified downstream models that accept mixed inputs without separate pipelines.
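
Because every modality lands in the same space, cross-modal retrieval reduces to a nearest-neighbor search by cosine similarity. The sketch below assumes the embeddings have already been produced by some pair of encoders; the file names, the vectors, and the `retrieve` helper are hypothetical and exist only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, index, top_k=2):
    """Return the top_k (item_id, score) pairs ranked by cosine similarity."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]

# Hypothetical precomputed image embeddings (made-up values).
image_index = {
    "dog.jpg": [0.9, 0.1, 0.0],
    "car.jpg": [0.0, 0.2, 0.9],
    "cat.jpg": [0.7, 0.6, 0.1],
}

# Stand-in for a text encoder's output for a query such as "a dog".
text_query = [1.0, 0.2, 0.0]

results = retrieve(text_query, image_index)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

The same function works for any pair of modalities, since it only ever sees vectors. A real system would typically replace the linear scan with an approximate nearest-neighbor index.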

Challenges

Aligning very different modalities is hard when paired data is scarce. A modality gap can persist, where each modality clusters in its own region even within the shared space, so careful training and evaluation are needed.
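
One simple way to check for a modality gap is to compare the centroids of the unit-normalized embeddings from each modality. A large distance suggests the modalities occupy separate regions. This is a rough diagnostic under assumed toy data, not a standard metric prescribed by any library, and `modality_gap` is a hypothetical helper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def modality_gap(emb_a, emb_b):
    """Euclidean distance between the centroids of two sets of unit-normalized embeddings."""
    center_a = centroid([l2_normalize(v) for v in emb_a])
    center_b = centroid([l2_normalize(v) for v in emb_b])
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(center_a, center_b)))

# Made-up embeddings in which each modality clusters in its own region.
image_emb = [[1.0, 0.1], [0.9, 0.3]]
text_emb = [[0.1, 1.0], [0.3, 0.9]]

gap = modality_gap(image_emb, text_emb)     # clearly separated clusters
no_gap = modality_gap(image_emb, image_emb) # identical sets, zero distance
```

A centroid distance near zero does not prove good alignment on its own, so it is best read alongside retrieval accuracy on held-out pairs.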

Key idea

Multimodal embeddings align several input types into one shared space through paired data and contrastive training, enabling any-to-any retrieval, though a modality gap can remain when paired data is limited.
