What it is
A multimodal model processes more than one kind of input, such as text and images together, and sometimes audio or video. It maps each modality into a shared representation so the model can reason across them in one context.
How modalities meet
The usual recipe joins specialized encoders to a language model.
- An image encoder, often a vision transformer, turns a picture into a sequence of vectors.
- A projection layer maps those vectors into the language model's token embedding space, so each one has the same width as a text token embedding.
- The language model then attends over text tokens and image tokens together.
Because each image becomes a sequence of token-like vectors, the same transformer machinery handles both, and the model can answer questions about a chart or describe a photo.
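The recipe above can be sketched in a few lines of plain Python. This is a toy illustration, not how any particular model is built. The weights are random rather than trained, a single linear map over 2x2 patches stands in for a real vision transformer, and the attention step uses the embeddings directly as queries, keys, and values. All names and sizes (PATCH, D_VISION, D_MODEL, W_ENC, W_PROJ, TEXT_EMBED) are hypothetical and chosen only for the example.

```python
import math
import random

random.seed(0)  # fixed seed so the random stand-in weights are reproducible

PATCH = 2      # hypothetical patch side length
D_VISION = 6   # hypothetical width of the image encoder output
D_MODEL = 4    # hypothetical width of the language model's token embeddings


def rand_matrix(rows, cols):
    """Random weights standing in for trained parameters."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def apply_linear(vectors, weight):
    """Multiply each input vector by a (d_in x d_out) weight matrix."""
    d_out = len(weight[0])
    return [[sum(x * weight[i][j] for i, x in enumerate(v)) for j in range(d_out)]
            for v in vectors]


W_ENC = rand_matrix(PATCH * PATCH, D_VISION)  # stand-in for an image encoder
W_PROJ = rand_matrix(D_VISION, D_MODEL)       # projection into token embedding space
TEXT_EMBED = {w: rand_matrix(1, D_MODEL)[0] for w in ["what", "is", "shown"]}


def encode_image(image):
    """Cut the image into PATCH x PATCH blocks and map each block to one vector."""
    patches = [
        [image[r + dr][c + dc] for dr in range(PATCH) for dc in range(PATCH)]
        for r in range(0, len(image), PATCH)
        for c in range(0, len(image[0]), PATCH)
    ]
    return apply_linear(patches, W_ENC)


def self_attention(seq):
    """Single-head attention using the embeddings as queries, keys, and values."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


# Toy 4x4 grayscale image: four 2x2 patches, so four image tokens.
image = [[(r * 4 + c) / 16.0 for c in range(4)] for r in range(4)]
image_tokens = apply_linear(encode_image(image), W_PROJ)
text_tokens = [TEXT_EMBED[w] for w in "what is shown".split()]

joint = image_tokens + text_tokens  # one sequence, both modalities
attended = self_attention(joint)    # every position can attend to every other

print(len(image_tokens), len(text_tokens), len(attended), len(attended[0]))
# prints: 4 3 7 4
```

The point of the sketch is the shape bookkeeping. Once the projection gives image vectors the same width as text embeddings, concatenating the two lists is all it takes for attention to mix them.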
What it enables
- Visual question answering: ask about the content of an image.
- Document understanding: read text and layout from a scanned page.
- Grounded generation: write captions or instructions tied to what is shown.
The hard part is alignment: training the model so that an image region and the words describing it land near each other in the shared space.
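One common way to train for this kind of alignment is a contrastive objective, which pulls matching image and text vectors together and pushes mismatched pairs apart. The text above does not say which objective is used, so treat the sketch below as an assumption. The function names, the temperature value, and the two-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(image_vecs, text_vecs, temperature=0.1):
    """Symmetric contrastive loss; image_vecs[i] is assumed to match text_vecs[i]."""
    n = len(image_vecs)
    sims = [[cosine(iv, tv) / temperature for tv in text_vecs] for iv in image_vecs]

    def cross_entropy(rows):
        # For row i the correct "class" is column i (the matching pair).
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)
            log_z = top + math.log(sum(math.exp(s - top) for s in row))
            total += log_z - row[i]
        return total / n

    sims_t = [[sims[i][j] for i in range(n)] for j in range(n)]  # text-to-image view
    return 0.5 * (cross_entropy(sims) + cross_entropy(sims_t))


# Toy vectors: in the aligned case each text vector sits close to its image vector.
image_vecs = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
misaligned_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(image_vecs, aligned_texts)
loss_misaligned = contrastive_loss(image_vecs, misaligned_texts)
print(loss_aligned < loss_misaligned)
# prints: True
```

A lower loss means matching pairs sit closer in the shared space than mismatched ones, which is the property the alignment step is after.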
Key idea
A multimodal model encodes each input type into a shared token space so one transformer can reason jointly over images and text.