Machine Learning

Cross Attention

How a decoder reads from a separate encoded sequence.

5 min read · core

Two sequences, one attention

In self attention the queries, keys, and values all come from the same sequence. In cross attention the queries come from one sequence while the keys and values come from another. This is how a decoder consults an encoder.
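Written out, the difference is only in which sequence feeds each projection. The notation here is an assumption, not taken from the lesson: X_dec and X_enc are the decoder and encoder states, W_Q, W_K, W_V are learned projection matrices, d_k is the key dimension, and the scoring rule is the common scaled dot-product form.

```latex
\begin{aligned}
Q &= X_{\text{dec}} W_Q, \qquad K = X_{\text{enc}} W_K, \qquad V = X_{\text{enc}} W_V, \\
\operatorname{CrossAttention}(X_{\text{dec}}, X_{\text{enc}})
  &= \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\end{aligned}
```

Setting X_enc = X_dec in the same expression gives self attention.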

A translation example

  • The encoder reads a source sentence and produces one output vector per source token; the keys and values are computed from these outputs.
  • The decoder generates the target sentence, position by position.
  • Each decoder position forms a query and attends to the encoder keys and values.
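The three steps above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under assumed names and sizes (cross_attention, w_q, w_k, w_v, 5 source tokens, 3 target positions, random weights), not the layout of any particular translation model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_states, w_q, w_k, w_v):
    """Single-head cross attention (illustrative sketch)."""
    q = decoder_states @ w_q                  # queries from the decoder, (T, d_k)
    k = encoder_states @ w_k                  # keys from the encoder,    (S, d_k)
    v = encoder_states @ w_v                  # values from the encoder,  (S, d_k)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # one score per (target, source) pair
    weights = softmax(scores, axis=-1)        # each target row sums to 1 over the source
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model, d_k = 8, 4                              # hypothetical sizes
encoder_states = rng.normal(size=(5, d_model))   # 5 source tokens
decoder_states = rng.normal(size=(3, d_model))   # 3 target positions
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

output, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(output.shape)   # (3, 4): one output vector per decoder position
print(weights.shape)  # (3, 5): each decoder position weighs all 5 source tokens
```

The output has as many rows as there are queries, while the attention weights span the source length, so the two sequences are free to differ in length.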

Why it is powerful

Cross attention is a flexible bridge between modalities or sequences. The same pattern lets a caption model attend to image features, or a retrieval model attend to fetched documents. The query side decides what to look for and the other side supplies the content.

Distinct from self attention

A decoder block often contains both: self attention over its own past tokens, then cross attention into the encoder output. Keeping them as separate sublayers lets the model first relate the current position to what it has already generated, and then look up the relevant part of the source.
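A rough sketch of that ordering follows. It is deliberately simplified: single head, residual connections only, and no layer normalization or feed-forward sublayer, which full decoder blocks usually include. The names decoder_block, self_params, and cross_params are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(queries_from, keys_values_from, w_q, w_k, w_v, mask=None):
    # Queries come from one sequence, keys and values from another
    # (pass the same sequence twice to get self attention).
    q = queries_from @ w_q
    k = keys_values_from @ w_k
    v = keys_values_from @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)  # blocked pairs get zero weight
    return softmax(scores) @ v

def decoder_block(decoder_states, encoder_states, self_params, cross_params):
    t = decoder_states.shape[0]
    causal_mask = np.tril(np.ones((t, t), dtype=bool))  # position i sees positions <= i
    # 1) Self attention over the decoder's own past tokens.
    x = decoder_states + attention(decoder_states, decoder_states,
                                   *self_params, mask=causal_mask)
    # 2) Cross attention into the encoder output (no mask over the source).
    x = x + attention(x, encoder_states, *cross_params)
    return x

rng = np.random.default_rng(0)
d = 8                                       # hypothetical model width
encoder_states = rng.normal(size=(5, d))    # 5 source tokens
decoder_states = rng.normal(size=(3, d))    # 3 target positions so far
self_params = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]   # w_q, w_k, w_v
cross_params = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]  # separate weights

out = decoder_block(decoder_states, encoder_states, self_params, cross_params)
print(out.shape)  # (3, 8)
```

The two sublayers use separate weights, and only the self attention step is causally masked: every target position may read the entire source.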

Key idea

Cross attention draws queries from one sequence and keys and values from another, letting a decoder pull relevant content from an encoder and acting as a general bridge between two sequences or modalities.

Check yourself

1. How does cross attention differ from self attention?

2. In a translation model, where do the cross attention keys and values come from?