The Query Key Value Projections

Three views of a token

Attention does not use the raw token vector directly. Instead it learns three linear projections that turn each token into a query, a key, and a value. These are just matrix multiplications with learned weight matrices.

What each role means

The query asks what this token is looking for.
The key advertises what this token offers to others.
The value is the content delivered when a query matches a key.

Why separate them

If a token used the same vector for asking and answering, the model could not distinguish the question from the content. Separate projections let the model learn that a verb might query for its subject while offering different information as a value.

Learned, not fixed

The three weight matrices are trained by gradient descent. Over training the projections specialize so that useful queries align with useful keys, shaping which tokens attend to which.

Key idea

Each token is linearly projected into a query, key, and value, separating the act of asking, the act of being matched, and the content delivered, all with learned weight matrices.

The Query Key Value Projections

Three views of a token

What each role means

Why separate them

Learned, not fixed

Key idea

Check yourself