Machine Learning

Multimodal Models

Models that take in and reason over several data types at once.

5 min read · advanced

What it is

A multimodal model processes more than one kind of input, such as text and images together, and sometimes audio or video. It maps each modality into a shared representation so the model can reason across them in one context.

How modalities meet

The usual recipe joins specialized encoders to a language model.

  • An image encoder, often a vision transformer, turns a picture into a sequence of vectors.
  • A projection layer maps those vectors into the language model's token embedding space, so each image vector has the same width as a text token embedding.
  • The language model then attends over text tokens and image tokens together.

Because images become tokens, the same transformer machinery handles both, and the model can answer questions about a chart or describe a photo.
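The recipe above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how any particular model is built: the weights are random, the "encoder" is a single linear map over patches (a real vision transformer would also run attention layers), and every name here (patchify, encoder_w, projection_w, the tiny vocabulary) is hypothetical.

```python
import random

def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch vectors."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches

def linear(vectors, weight):
    """Multiply each vector by a weight matrix of shape (in_dim, out_dim)."""
    return [[sum(v[i] * weight[i][j] for i in range(len(v)))
             for j in range(len(weight[0]))]
            for v in vectors]

def random_matrix(rows, cols, rng):
    """Random weights standing in for trained parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
PATCH, VISION_DIM, MODEL_DIM = 2, 6, 8  # illustrative sizes only

# Stand-in image encoder: one linear map per patch.
encoder_w = random_matrix(PATCH * PATCH, VISION_DIM, rng)
# Projection layer: vision width -> language model embedding width.
projection_w = random_matrix(VISION_DIM, MODEL_DIM, rng)
# Hypothetical text embedding table for a three-word vocabulary.
vocab = {"what": 0, "is": 1, "shown": 2}
embedding_table = random_matrix(len(vocab), MODEL_DIM, rng)

image = [[rng.random() for _ in range(4)] for _ in range(4)]  # 4 x 4 toy image

image_features = linear(patchify(image, PATCH), encoder_w)   # one vector per patch
image_tokens = linear(image_features, projection_w)          # now MODEL_DIM wide
text_tokens = [embedding_table[vocab[w]] for w in ["what", "is", "shown"]]

# The language model would attend over this single mixed sequence.
sequence = image_tokens + text_tokens
```

The point of the sketch is the shapes: after projection, image vectors and text embeddings share one width, so they can sit in the same sequence.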

What it enables

  • Visual question answering: ask about the content of an image.
  • Document understanding: read text and layout from a scanned page.
  • Grounded generation: write captions or instructions tied to what is shown.

The hard part is alignment: training so that an image region and the words describing it land near each other in the shared space.
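One common way to push matching pairs together is a contrastive objective; the lesson does not say which objective is used, so treat this as an assumed example. The sketch below works on whole image and caption vectors rather than image regions, uses hand-picked 2-D vectors, and all function names are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(image_vecs, text_vecs, temperature):
    """Cosine similarity of every image with every text, divided by temperature."""
    imgs = [normalize(v) for v in image_vecs]
    txts = [normalize(v) for v in text_vecs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Symmetric loss: pair k's image should pick text k, and text k should pick image k."""
    sims = similarity_matrix(image_vecs, text_vecs, temperature)
    n = len(sims)
    image_to_text = sum(cross_entropy(sims[k], k) for k in range(n)) / n
    text_to_image = sum(
        cross_entropy([sims[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2

images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]     # each caption near its own image
misaligned_texts = [[0.1, 0.9], [0.9, 0.1]]  # captions swapped

loss_aligned = contrastive_loss(images, aligned_texts)
loss_misaligned = contrastive_loss(images, misaligned_texts)
```

When captions sit near their own images the loss is close to zero; swapping the captions makes it large, which is the signal training would use to pull matching pairs together.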

Key idea

A multimodal model encodes each input type into a shared token space so one transformer can reason jointly over images and text.

Check yourself

1. How do images typically enter a multimodal language model?

2. What is the central challenge in training multimodal models?