Machine Learning

The Cross Attention Deep

When queries come from one sequence and keys and values from another.

Two sequences meet

In cross attention, the queries come from one sequence while the keys and values come from a different one. This is how a decoder reads an encoder, or how a text model conditions on an image.
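The definition above can be sketched in a few lines of NumPy. This is a minimal single-head version that assumes the standard scaled dot-product form; the names `cross_attention`, `decoder_states`, `encoder_states`, and the random projection matrices `w_q`, `w_k`, `w_v` are illustrative choices, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(x_q, x_kv, w_q, w_k, w_v):
    """Single-head cross attention (illustrative sketch).

    x_q  : (n_q, d_model)  sequence that supplies the queries
    x_kv : (n_kv, d_model) sequence that supplies the keys and values
    """
    q = x_q @ w_q                              # (n_q, d_k)
    k = x_kv @ w_k                             # (n_kv, d_k)
    v = x_kv @ w_v                             # (n_kv, d_v)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (n_q, n_kv)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v, weights

# Random stand-ins for real model states and learned weights (assumption).
rng = np.random.default_rng(0)
d_model, d_k = 8, 4
decoder_states = rng.normal(size=(3, d_model))   # 3 query positions
encoder_states = rng.normal(size=(5, d_model))   # 5 key/value positions
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))
w_v = rng.normal(size=(d_model, d_k))

output, weights = cross_attention(decoder_states, encoder_states, w_q, w_k, w_v)
print(output.shape)    # (3, 4): one output row per query position
print(weights.shape)   # (3, 5): one weight per (query, key) pair
```

Note that the output has as many rows as the query sequence, while each row of `weights` spans the other sequence. The two sequences may have different lengths.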

The classic use

In an encoder-decoder translation model, the decoder generates the target sentence token by token. At each layer it runs cross attention where:

  • Queries come from the partial target it is building.
  • Keys and values come from the encoded source sentence.

So each target token can look back at the most relevant source words, learning a soft alignment between languages.
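To make the soft alignment concrete, here is a toy sketch. The word lists and vectors are hand-made assumptions chosen so that each target query points at one source key; in a real translator these vectors would be learned, and the weights would usually be less sharply peaked.

```python
import numpy as np

source_words = ["le", "chat", "noir"]
target_words = ["the", "black", "cat"]

# Hand-made toy vectors (an assumption for illustration, not learned embeddings).
keys = np.eye(3)                       # one key per source word
queries = 4.0 * np.eye(3)[[0, 2, 1]]   # "the"->le, "black"->noir, "cat"->chat

# Scaled dot-product scores: rows are target tokens, columns are source tokens.
scores = queries @ keys.T / np.sqrt(keys.shape[1])
scores -= scores.max(axis=1, keepdims=True)   # numerical stability
weights = np.exp(scores)
weights /= weights.sum(axis=1, keepdims=True)

# The alignment is "soft": every source word gets some weight,
# and the largest weight marks the most relevant one.
aligned = [source_words[j] for j in weights.argmax(axis=1)]
for t, s, row in zip(target_words, aligned, weights):
    print(f"{t:>5} -> {s:<5} weights={np.round(row, 2)}")
```

Each row of `weights` is a probability distribution over source words, so the model never commits to a hard one-to-one mapping. Taking the argmax here is only a way to read off the strongest link.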

Contrast with self attention

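In self attention, the queries, keys, and values all come from the same sequence, so tokens exchange information within one stream. In cross attention, the queries come from one sequence and the keys and values from another, which lets information flow between modalities or stages.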
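Written out, the contrast comes down to which sequence feeds each projection. The notation below is an assumption introduced for illustration: X is the sequence supplying the queries (n rows), Y is the other sequence (m rows), and W_Q, W_K, W_V are learned projection matrices, using the standard scaled dot-product form.

```latex
\[
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
\text{self attention:}\quad Q = X W_Q,\quad K = X W_K,\quad V = X W_V
\]
\[
\text{cross attention:}\quad Q = X W_Q,\quad K = Y W_K,\quad V = Y W_V
\]
```

The formula is the same in both cases. Only the inputs differ, so the weight matrix is n by n for self attention and n by m for cross attention.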

Where it appears

  • Encoder-decoder transformers for translation and summarization.
  • Multimodal models where text queries attend to image features.
  • Retrieval-augmented setups where a question attends to fetched passages.

Key idea

Cross attention draws queries from one sequence and keys and values from another, letting a decoder or one modality read relevant content from a separate source and build soft alignments across streams.

Check yourself

1. In cross attention, where do keys and values come from?

2. What does cross attention enable in a translator?