Machine Learning

The Query, Key, and Value Projections

How one token vector becomes three different roles in attention.

4 min read · intro

Three views of a token

Attention does not operate on the raw token vector directly. Instead, the model learns three linear projections that map each token vector to a query, a key, and a value. Each projection is a matrix multiplication with its own learned weight matrix.
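As a minimal sketch of the three projections, the snippet below multiplies one token vector by three separate weight matrices. The sizes d_model and d_k, the random initialization, and the helper names project and random_matrix are illustrative assumptions, not anything the lesson specifies. Plain Python lists keep the sketch dependency-free; real models would use an array library.

```python
import random

def project(x, W):
    """Return the row-vector product x @ W, where W has shape len(x) x d_out."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def random_matrix(rows, cols, rng):
    """Small random weights standing in for learned parameters (assumption)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, d_k = 4, 3  # hypothetical sizes chosen for illustration

# One weight matrix per role; in a trained model these would be learned.
W_q = random_matrix(d_model, d_k, rng)
W_k = random_matrix(d_model, d_k, rng)
W_v = random_matrix(d_model, d_k, rng)

x = [1.0, 0.5, -0.5, 2.0]  # a made-up token vector

q = project(x, W_q)  # what this token is looking for
k = project(x, W_k)  # what this token advertises
v = project(x, W_v)  # the content this token delivers

print("query:", q)
print("key:  ", k)
print("value:", v)
```

The same input produces three different vectors only because the three matrices differ; the operation itself is identical in each case.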

What each role means

  • The query asks what this token is looking for.
  • The key advertises what this token offers to others.
  • The value is the content delivered when a query matches a key.
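The three roles can be seen working together in a small sketch. It assumes the standard scaled dot-product form of attention (dot-product scores divided by the square root of the key dimension, then a softmax), which this lesson does not spell out. The query, keys, and values are hand-picked numbers, and attend is a hypothetical helper name.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Score one query against every key, then mix the values by those weights."""
    d_k = len(query)
    scores = [
        sum(qi * ki for qi, ki in zip(query, key)) / math.sqrt(d_k)
        for key in keys
    ]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * val[j] for w, val in zip(weights, values)) for j in range(dim)]
    return weights, output

# Hand-picked toy vectors: the query points the same way as the first key.
query = [4.0, 0.0]
keys = [[4.0, 0.0], [0.0, 4.0], [-4.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]

weights, output = attend(query, keys, values)
print("weights:", weights)
print("output: ", output)
```

Because the query aligns with the first key, almost all of the weight lands on the first token, and the output is close to that token's value.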

Why separate them

If a token used a single vector for all three roles, what it looks for, what it advertises, and what it delivers would be forced to be the same thing. Separate projections let the model learn, for example, that a verb queries for its subject while offering different information as its value.
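One concrete consequence can be checked numerically. If queries and keys share a single projection, the raw score from token i to token j always equals the score from j to i, so a one-directional relation such as a verb looking for its subject cannot be expressed in the scores. The sketch below uses made-up 2-dimensional tokens and weight matrices; score_matrix is a hypothetical helper.

```python
def project(x, W):
    """Return the row-vector product x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def score_matrix(tokens, W_q, W_k):
    """Raw dot-product scores: entry [i][j] is query i matched against key j."""
    queries = [project(t, W_q) for t in tokens]
    keys = [project(t, W_k) for t in tokens]
    return [[dot(q, k) for k in keys] for q in queries]

tokens = [[1.0, 0.0], [0.0, 1.0]]  # two toy token vectors

# Case 1: one shared matrix used for both queries and keys.
W_shared = [[1.0, 2.0], [3.0, 4.0]]
shared = score_matrix(tokens, W_shared, W_shared)

# Case 2: separate matrices, so asking and being matched can differ.
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.0, 1.0], [0.0, 0.0]]
separate = score_matrix(tokens, W_q, W_k)

print("shared:  ", shared)    # symmetric: [i][j] equals [j][i]
print("separate:", separate)  # token 1 scores token 0, but not the reverse
```

With separate matrices, token 1 can score token 0 highly while token 0 gives token 1 a score of zero.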

Learned, not fixed

The three weight matrices are trained by gradient descent. Over the course of training, the projections specialize so that useful queries align with useful keys, which shapes which tokens attend to which.
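The toy sketch below shows gradient descent pulling one query into alignment with one key. It is a deliberate simplification: a real model backpropagates a task loss through the whole network, whereas here the loss is just the squared error between a single query-key score and a target of 1. The tokens, initial weights, learning rate, and the helper train_step are all assumptions made for illustration.

```python
def project(x, W):
    """Return the row-vector product x @ W."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train_step(x_a, x_b, W_q, W_k, target, lr):
    """One gradient-descent step on loss = (score - target)**2."""
    q = project(x_a, W_q)
    k = project(x_b, W_k)
    score = dot(q, k)
    dloss_dscore = 2.0 * (score - target)
    # Chain rule: d(score)/dW_q[i][j] = x_a[i] * k[j],
    #             d(score)/dW_k[i][j] = x_b[i] * q[j].
    new_W_q = [[W_q[i][j] - lr * dloss_dscore * x_a[i] * k[j]
                for j in range(len(k))] for i in range(len(x_a))]
    new_W_k = [[W_k[i][j] - lr * dloss_dscore * x_b[i] * q[j]
                for j in range(len(q))] for i in range(len(x_b))]
    return new_W_q, new_W_k, score

x_a = [1.0, 0.0]  # token that asks (toy vector)
x_b = [0.0, 1.0]  # token that should be matched (toy vector)
W_q = [[0.1, 0.2], [0.0, 0.1]]   # arbitrary starting weights
W_k = [[0.1, 0.0], [0.2, -0.1]]

initial_score = dot(project(x_a, W_q), project(x_b, W_k))
for _ in range(200):
    W_q, W_k, _ = train_step(x_a, x_b, W_q, W_k, target=1.0, lr=0.1)
final_score = dot(project(x_a, W_q), project(x_b, W_k))

print("score before training:", initial_score)
print("score after training: ", final_score)
```

Both matrices are updated in every step, so the query and the key move toward each other rather than one chasing a fixed target.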

Key idea

Each token is linearly projected into a query, key, and value, separating the act of asking, the act of being matched, and the content delivered, all with learned weight matrices.

Check yourself


1. What does the query projection represent for a token?

2. How are queries, keys, and values produced from a token vector?