The Cross Attention

Two sequences, one attention

In self attention the queries, keys, and values all come from the same sequence. In cross attention the queries come from one sequence while the keys and values come from another. This is how a decoder consults an encoder.
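A minimal single-head sketch in NumPy may make the split concrete. The function name `cross_attention`, the weight matrices `w_q`, `w_k`, `w_v`, and the random inputs are illustrative assumptions, not part of any particular library; real models add multiple heads, biases, and learned weights.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(queries_from, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross attention (illustrative sketch).

    queries_from: (n_q, d) sequence that forms the queries
    context:      (n_c, d) sequence that supplies keys and values
    """
    q = queries_from @ w_q                      # (n_q, d_k)
    k = context @ w_k                           # (n_c, d_k)
    v = context @ w_v                           # (n_c, d_v)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_q, n_c)
    weights = softmax(scores, axis=-1)          # each query row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
decoder_states = rng.normal(size=(3, 8))   # 3 target positions (made-up data)
encoder_states = rng.normal(size=(5, 8))   # 5 source positions (made-up data)
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 4))

out, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(out.shape, weights.shape)   # (3, 4) (3, 5)
```

Passing the same array for both `queries_from` and `context` would recover self attention, so the only difference between the two is where the keys and values come from.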

A translation example

The encoder reads a source sentence and produces one vector per source token, from which the key and value vectors are computed.
The decoder generates the target sentence, position by position.
Each decoder position forms a query and attends to the encoder keys and values.
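The three steps above can be sketched as a loop. This is a toy illustration under stated assumptions: the source tokens, the random "encoder states", and the random stand-in for each decoder state are all made up, and a real decoder would compute its state from the tokens generated so far. The point it shows is that the keys and values depend only on the source, so they can be computed once and reused at every target position.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
d_model = 6
source_tokens = ["the", "cat", "sleeps"]     # hypothetical source sentence
encoder_states = rng.normal(size=(len(source_tokens), d_model))

# Hypothetical projection weights (random here, learned in a real model).
w_q = rng.normal(size=(d_model, d_model))
w_k = rng.normal(size=(d_model, d_model))
w_v = rng.normal(size=(d_model, d_model))

# Keys and values depend only on the source, so compute them once.
keys = encoder_states @ w_k
values = encoder_states @ w_v

attention_rows = []
for step in range(4):                        # 4 target positions
    # Stand-in for the decoder's hidden state at this position.
    decoder_state = rng.normal(size=d_model)
    query = decoder_state @ w_q
    # One weight per source token, summing to 1.
    weights = softmax(keys @ query / np.sqrt(d_model))
    # Weighted summary of the source for this target position.
    context_vector = weights @ values
    attention_rows.append(weights)

attention = np.stack(attention_rows)         # (target_len, source_len)
print(attention.shape)                       # (4, 3)
```

Each row of `attention` is one target position's distribution over the source tokens.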

Why it is powerful

Cross attention is a flexible bridge between modalities or sequences. The same pattern lets a caption model attend to image features, or a retrieval model attend to fetched documents. The query side decides what to look for and the other side supplies the content.

Distinct from self attention

A decoder block often contains both: self attention over its own past tokens, then cross attention into the encoder output. Keeping them as separate sublayers lets the model reason about its own generation and the source independently.
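A rough sketch of that ordering, assuming single-head attention with residual connections and omitting the layer normalization and feed-forward sublayer that full decoder blocks usually include. The names `attend`, `decoder_block`, and `make_params` are hypothetical helpers for this illustration only.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q_src, kv_src, w_q, w_k, w_v, mask=None):
    # Queries come from q_src; keys and values come from kv_src.
    q, k, v = q_src @ w_q, kv_src @ w_k, kv_src @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        # Masked-out positions get a very negative score, so ~0 weight.
        scores = np.where(mask, scores, -1e9)
    weights = softmax(scores)
    return weights @ v, weights

def decoder_block(target, encoder_out, self_params, cross_params):
    n = target.shape[0]
    # Causal mask: position i may attend only to positions 0..i.
    causal = np.tril(np.ones((n, n), dtype=bool))
    self_out, self_w = attend(target, target, *self_params, mask=causal)
    h = target + self_out                       # residual connection
    # Cross attention: no mask, every target position sees the whole source.
    cross_out, cross_w = attend(h, encoder_out, *cross_params)
    return h + cross_out, self_w, cross_w

rng = np.random.default_rng(2)
d = 8

def make_params():
    # Three random projection matrices (w_q, w_k, w_v); learned in practice.
    return tuple(rng.normal(size=(d, d)) for _ in range(3))

target = rng.normal(size=(4, d))        # 4 target positions (made-up data)
encoder_out = rng.normal(size=(6, d))   # 6 source positions (made-up data)
out, self_w, cross_w = decoder_block(target, encoder_out, make_params(), make_params())
print(out.shape, self_w.shape, cross_w.shape)   # (4, 8) (4, 4) (4, 6)
```

The two sublayers use separate weights and different masks: `self_w` is square and lower triangular, while `cross_w` has one column per source position.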

Key idea

Cross attention draws queries from one sequence and keys and values from another, letting a decoder pull relevant content from an encoder and acting as a general bridge between two sequences or modalities.
