The Cross Attention Deep

Two sequences meet

In cross attention, the queries come from one sequence while the keys and values come from a different one. This is how a decoder reads an encoder, or how a text model conditions on an image.
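As a minimal sketch of that split, assume the common scaled dot-product form of attention with a single head. The function name cross_attention, the projection matrices w_q, w_k and w_v, and all the sizes are illustrative choices, not something the text above specifies.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(x, context, w_q, w_k, w_v):
    """Single-head cross attention (illustrative sketch).

    x:       (n, d_x) sequence that supplies the queries
    context: (m, d_c) sequence that supplies the keys and values
    """
    q = x @ w_q                                # (n, d_k)
    k = context @ w_k                          # (m, d_k)
    v = context @ w_v                          # (m, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (n, m)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

# Random stand-ins for real activations and learned weights (assumed sizes).
rng = np.random.default_rng(0)
n, m, d_x, d_c, d_k, d_v = 3, 5, 8, 6, 4, 4
x = rng.normal(size=(n, d_x))
context = rng.normal(size=(m, d_c))
w_q = rng.normal(size=(d_x, d_k))
w_k = rng.normal(size=(d_c, d_k))
w_v = rng.normal(size=(d_c, d_v))

out, weights = cross_attention(x, context, w_q, w_k, w_v)
print(out.shape, weights.shape)  # (3, 4) (3, 5)
```

The output has one row per query position, while the weight matrix has one column per position in the other sequence. The two sequences may differ in both length and feature width, because the projections map them into a shared space.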

The classic use

In an encoder-decoder translation model, the decoder generates the target sentence one token at a time. Each decoder layer runs cross attention, where:

Queries come from the partial target it is building.
Keys and values come from the encoded source sentence.

So each target token can look back at the most relevant source words, learning a soft alignment between languages.
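The soft alignment can be made visible with a toy example. The vectors below are hand-picked for illustration, not learned, and the French and English words are an assumed example pair. In a trained model the queries and keys would come from learned projections.

```python
import numpy as np

source = ["le", "chat", "noir"]
target = ["the", "black", "cat"]

# Hand-picked toy vectors (NOT learned): each source key lies along its own axis.
keys = 4.0 * np.eye(3)
# Each target query points along the axis of the source word it should match.
queries = np.array([
    [1.0, 0.0, 0.0],   # "the"   -> "le"
    [0.0, 0.0, 1.0],   # "black" -> "noir"
    [0.0, 1.0, 0.0],   # "cat"   -> "chat"
])

# Scaled dot-product scores, then a row-wise softmax over source positions.
scores = queries @ keys.T / np.sqrt(keys.shape[-1])
scores -= scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

# The largest weight in each row is the most strongly aligned source word.
alignment = weights.argmax(axis=-1)
for t, row, s in zip(target, weights, alignment):
    print(f"{t:>5} -> {source[s]:<4}  weights={np.round(row, 2)}")
```

Each row of weights is a distribution over source words rather than a hard pick, which is what makes the alignment soft. Here the word order differs between the two languages, and the weights still follow the matching word.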

Contrast with self attention

Self attention mixes a sequence with itself. Cross attention mixes one sequence with another, which lets information flow between modalities or stages rather than within one stream.
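One way to see the contrast is that the same routine covers both cases, and only the source of the keys and values changes. This is a simplified sketch: attend is a hypothetical helper, it skips the learned projections, and it assumes both sequences already share a feature width.

```python
import numpy as np

def attend(query_seq, memory_seq):
    """Unprojected scaled dot-product attention (a simplification).

    Queries come from query_seq; keys and values come from memory_seq.
    """
    scores = query_seq @ memory_seq.T / np.sqrt(query_seq.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ memory_seq, weights

# Random stand-ins for real features (assumed sizes).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 16))    # 4 text tokens
image = rng.normal(size=(9, 16))   # 9 image patches, same width for simplicity

self_out, self_w = attend(text, text)      # self attention: text reads text
cross_out, cross_w = attend(text, image)   # cross attention: text reads image
print(self_w.shape, cross_w.shape)  # (4, 4) (4, 9)
```

The self attention weights form a square matrix over one sequence. The cross attention weights are rectangular, with one row per text token and one column per image patch, so every output row is a mixture of the other sequence.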

Where it appears

Encoder-decoder transformers for translation and summarization.
Multimodal models where text queries attend to image features.
Retrieval-augmented setups where a question attends to fetched passages.

Key idea