Machine Learning

Multimodal Embeddings

Putting text, images, audio, and more into one comparable space.

6 min read · advanced

One space for many modalities

Multimodal embeddings map different input types, such as text, images, audio, and video, into a shared vector space where related content lands close together regardless of modality. CLIP is a well-known example that aligns images and text; broader models extend the same idea to more modalities.

How a shared space is built

  • Each modality has its own encoder producing a vector of the same size.
  • Paired data, like an image with its caption or a clip with its transcript, drives a contrastive loss that aligns modalities.
  • Some models anchor everything to one hub modality so that pairs through the hub indirectly align all modalities.
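
To make the contrastive step concrete, here is a minimal NumPy sketch of a symmetric contrastive (InfoNCE-style) loss over a batch of paired vectors. It is an illustration under assumptions, not any particular model's training code: the function names, the temperature of 0.07, and the random stand-in "embeddings" are all hypothetical, and a real system would compute this on encoder outputs inside an autodiff framework.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    # Row i of image_vecs is assumed to be paired with row i of text_vecs.
    img = l2_normalize(image_vecs)
    txt = l2_normalize(text_vecs)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    diag = np.arange(logits.shape[0])
    # Image-to-text: each image should pick out its own caption in its row.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each caption should pick out its own image in its column.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Hypothetical stand-ins for encoder outputs (not real embeddings).
rng = np.random.default_rng(0)
text_vecs = rng.normal(size=(8, 16))
aligned_image_vecs = text_vecs + 0.01 * rng.normal(size=(8, 16))
random_image_vecs = rng.normal(size=(8, 16))

aligned_loss = symmetric_contrastive_loss(aligned_image_vecs, text_vecs)
random_loss = symmetric_contrastive_loss(random_image_vecs, text_vecs)
print(f"aligned pairs: {aligned_loss:.4f}  random pairs: {random_loss:.4f}")
```

The loss is low when each vector is most similar to its own partner and high when pairs are unrelated, which is the pressure that pulls matching content from different modalities together.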

What it unlocks

  • Cross-modal retrieval: search images with text, or audio with an image.
  • Any-to-any comparison: measure how well a caption matches a sound or a frame.
  • Unified downstream models that accept mixed inputs without separate pipelines.
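
Once everything lives in one space, cross-modal retrieval reduces to a nearest-neighbor search. The sketch below ranks a small set of image vectors against a text query by cosine similarity. The vectors are hand-written toy values standing in for encoder outputs, and retrieve_top_k is a hypothetical helper name, not a library function.

```python
import numpy as np

def cosine_similarity_matrix(queries, items):
    # Normalize rows, then a matrix product gives all pairwise cosines.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    it = items / np.linalg.norm(items, axis=1, keepdims=True)
    return q @ it.T

def retrieve_top_k(query_vec, item_vecs, k=3):
    # Score every item against one query and return the k best indices.
    scores = cosine_similarity_matrix(query_vec[None, :], item_vecs)[0]
    order = np.argsort(-scores)[:k]
    return order, scores[order]

# Toy 3-dimensional "image embeddings" (assumed, for illustration only).
image_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
    [0.0, 0.0, 1.0],
])
# Toy "text embedding" for a query that points mostly along the first axis.
text_query_vec = np.array([0.9, 0.1, 0.0])

top_idx, top_scores = retrieve_top_k(text_query_vec, image_vecs, k=2)
print(top_idx, top_scores)
```

The same function works for any pair of modalities, because the only requirement is that both sides produce vectors of the same size in the same space.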

Challenges

Aligning very different modalities is hard when paired data is scarce. Even after training, a modality gap can persist: embeddings from each modality cluster in their own region of the shared space instead of mixing, so both training and evaluation need to check for it.
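
One simple way to check for a modality gap is to compare the centroids of the normalized embeddings from each modality: a large distance suggests the modalities occupy separate regions. The sketch below uses synthetic vectors with an artificial offset to imitate that situation. The centroid-distance measure is one diagnostic among several, and modality_gap is a hypothetical helper name.

```python
import numpy as np

def modality_gap(vecs_a, vecs_b):
    # Distance between the mean unit vectors of two modalities.
    a = vecs_a / np.linalg.norm(vecs_a, axis=1, keepdims=True)
    b = vecs_b / np.linalg.norm(vecs_b, axis=1, keepdims=True)
    return float(np.linalg.norm(a.mean(axis=0) - b.mean(axis=0)))

# Synthetic data (assumed): both modalities share the same underlying content.
rng = np.random.default_rng(0)
shared = rng.normal(size=(200, 32))

# Case 1: the two modalities are well mixed, differing only by small noise.
mixed_gap = modality_gap(shared, shared + 0.01 * rng.normal(size=(200, 32)))

# Case 2: each modality is shifted in an opposite direction along one axis,
# imitating embeddings that cluster in separate regions of the space.
offset = np.zeros(32)
offset[0] = 3.0
separated_gap = modality_gap(shared + offset, shared - offset)

print(f"mixed: {mixed_gap:.3f}  separated: {separated_gap:.3f}")
```

A gap near zero does not prove the modalities are well aligned pair by pair, so a check like this is best read alongside retrieval accuracy rather than instead of it.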

Key idea

Multimodal embeddings align several input types into one shared space through paired data and contrastive training, enabling any-to-any retrieval, though a modality gap can remain when paired data is limited.

Check yourself

1. What defines a multimodal embedding space?

2. What is the modality gap?