Two sequences meet
In cross attention, the queries come from one sequence while the keys and values come from a different one. This is how a decoder reads an encoder, or how a text model conditions on an image.
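As a rough sketch of what this means in symbols, assume the standard scaled dot-product form of attention. The names here are illustrative assumptions, not taken from the text above: X holds the states of the sequence that asks (n rows), Z holds the states of the sequence being read (m rows), W_Q, W_K, W_V are learned projection matrices, and d_k is the key dimension.

```latex
\begin{aligned}
Q &= X W_Q, \qquad K = Z W_K, \qquad V = Z W_V \\
\mathrm{CrossAttn}(X, Z) &= \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
```

The only difference from self attention under this form is that K and V are computed from Z instead of X, so the attention weights form an n by m matrix rather than an n by n one.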
The classic use
In an encoder-decoder translator, the decoder generates the target sentence one token at a time. In each decoder layer it runs cross attention, where:
- Queries come from the partial target it is building.
- Keys and values come from the encoded source sentence.
So each target token can attend to the most relevant source words, which lets the model learn a soft alignment between the two languages.
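The query and key/value roles described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a full decoder layer: it assumes scaled dot-product scoring, leaves out the learned projections and multiple heads, and uses made-up toy vectors. The names cross_attention, target_queries, source_keys, and source_values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross attention on plain lists.

    queries: n_target rows of length d   (from the partial target)
    keys:    n_source rows of length d   (from the encoded source)
    values:  n_source rows of length d_v (from the encoded source)
    Returns (outputs, weights); weights has shape n_target x n_source.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    weights = []
    for q in queries:
        # One score per source position for this target position.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    d_v = len(values[0])
    # Each output row is a weighted average of the source value rows.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(d_v)]
        for row in weights
    ]
    return outputs, weights


# Toy data (invented for illustration): 3 source tokens, 2 target tokens.
source_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
source_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
target_queries = [[4.0, 0.0], [0.0, 4.0]]

outputs, weights = cross_attention(target_queries, source_keys, source_values)

for t, row in enumerate(weights):
    best = row.index(max(row))
    print(f"target token {t} attends most to source token {best}")
```

Each row of weights sums to 1 and acts as the soft alignment for one target token: it spreads that token's attention over all source positions instead of picking a single one.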
Contrast with self attention
Self attention mixes a sequence with itself. Cross attention mixes one sequence with another, which lets information flow between modalities or stages rather than within one stream.
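One way to see the contrast is through the shape of the attention weights. The sketch below is illustrative only: it assumes scaled dot-product scoring without learned projections, and the text and image vectors are invented stand-ins for token and patch features. The helper attention_weights is a hypothetical name.

```python
import math


def attention_weights(queries, keys):
    """Softmax of scaled dot products: one row per query, one column per key."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    rows = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Invented toy features: 3 text tokens and 5 image patches, both of width 2.
text = [[0.1, 0.3], [0.5, -0.2], [0.0, 0.7]]
image = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-0.3, 0.2], [0.2, 0.9]]

# Self attention: queries and keys come from the same sequence.
self_w = attention_weights(text, text)
# Cross attention: queries come from text, keys come from the image.
cross_w = attention_weights(text, image)

self_shape = (len(self_w), len(self_w[0]))     # (3, 3): text over text
cross_shape = (len(cross_w), len(cross_w[0]))  # (3, 5): text over image
print(self_shape, cross_shape)
```

The scoring function is the same in both calls; only the source of the keys changes, and with it the set of positions each query can read from.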
Where it appears
- Encoder-decoder transformers for translation and summarization.
- Multimodal models where text queries attend to image features.
- Retrieval-augmented setups where a question attends to fetched passages.
Key idea
Cross attention draws queries from one sequence and keys and values from another, letting a decoder or one modality read relevant content from a separate source and build soft alignments across streams.