Multimodal Models

What it is

A multimodal model processes more than one kind of input, such as text and images together, and sometimes audio or video. Each modality is mapped into a shared representation, so the model can reason across all of them in a single context.

How modalities meet

The usual recipe joins specialized encoders to a language model.

An image encoder, often a vision transformer, turns a picture into a sequence of vectors.
A projection layer maps those vectors into the language model's token embedding space.
The language model then attends over text tokens and image tokens together.

Because the projected image vectors sit in the same sequence as the text tokens, the same transformer machinery handles both, so the model can answer questions about a chart or describe a photo.
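
The three steps above can be sketched in a few lines of plain Python. This is a toy illustration, not a real model: the widths D_VISION and D_MODEL, the helper names, and the random weights are all hypothetical stand-ins for learned components. It only shows how the shapes line up so that image vectors and text vectors end up in one sequence.

```python
import random

random.seed(0)

D_VISION = 8   # hypothetical width of the image encoder's output vectors
D_MODEL = 4    # hypothetical width of the language model's embeddings


def random_matrix(rows, cols):
    """Stand-in for learned weights: a rows x cols matrix of small random values."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def encode_image(num_patches):
    """Stand-in for a vision encoder: one D_VISION vector per image patch."""
    return random_matrix(num_patches, D_VISION)


def project(vectors, weights):
    """Linear projection: map each D_VISION vector to a D_MODEL vector."""
    return [
        [sum(v[i] * weights[i][j] for i in range(len(v))) for j in range(len(weights[0]))]
        for v in vectors
    ]


def embed_text(token_ids, table):
    """Look up one D_MODEL vector per text token id."""
    return [table[t] for t in token_ids]


projection = random_matrix(D_VISION, D_MODEL)
embedding_table = random_matrix(10, D_MODEL)   # toy vocabulary of 10 tokens

image_tokens = project(encode_image(num_patches=6), projection)
text_tokens = embed_text([3, 1, 4], embedding_table)

# The language model would attend over this single combined sequence.
sequence = image_tokens + text_tokens
print(len(sequence), len(sequence[0]))   # prints: 9 4
```

Six image patches and three text tokens give a sequence of nine vectors, each of width D_MODEL. In a real system the encoder and projection are trained, and the order in which image and text vectors are interleaved is a design choice this sketch does not address.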

What it enables

Visual question answering: ask about the content of an image.
Document understanding: read text and layout from a scanned page.
Grounded generation: write captions or instructions tied to what is shown.

The hard part is alignment: training so that an image region and the words describing it land near each other in the shared space.
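
One common way to train for this kind of alignment is a contrastive objective, which rewards matching image and text vectors for being more similar than mismatched ones. The text above does not name a specific objective, so treat the following as an assumed example: the function names, the temperature value, and the two-dimensional toy vectors are all hypothetical, and for brevity the loss runs in the image-to-text direction only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Mean cross-entropy over images, where image i should match text i."""
    total = 0.0
    for i, img in enumerate(image_vecs):
        logits = [cosine(img, txt) / temperature for txt in text_vecs]
        # Numerically stable log-sum-exp over all candidate texts.
        max_logit = max(logits)
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(image_vecs)


texts = [[0.9, 0.1], [0.1, 0.9]]
aligned_images = [[1.0, 0.0], [0.0, 1.0]]   # each image points toward its own text
swapped_images = [[0.0, 1.0], [1.0, 0.0]]   # each image points toward the wrong text

print(contrastive_loss(aligned_images, texts) < contrastive_loss(swapped_images, texts))   # prints: True
```

The loss is low when each image vector lies closest to its own description and high when the pairs are swapped, which is the "land near each other" property stated above expressed as a number that training can minimize.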

Key idea

A multimodal model encodes each input type into a shared token space so one transformer can reason jointly over images and text.
