Three views of a token
Attention does not use the raw token vector directly. Instead, it learns three linear projections that turn each token into a query, a key, and a value. Each projection is a single matrix multiplication with its own learned weight matrix.
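To make the three projections concrete, here is a minimal sketch in plain Python. The names X, W_q, W_k, and W_v, the toy sizes (two tokens, 3-dimensional embeddings projected down to 2 dimensions), and all the numbers are assumptions made up for illustration; in a real model the weight matrices would be learned.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

# Two tokens, one per row, each a 3-dimensional embedding (toy values).
X = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0]]

# Hypothetical 3x2 weight matrices; a trained model would learn these.
W_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_v = [[2.0, 0.0], [0.0, 2.0], [0.0, 1.0]]

# The same token matrix is projected three different ways.
Q = matmul(X, W_q)  # [[3.0, 2.0], [1.0, 2.0]]
K = matmul(X, W_k)  # [[2.0, 1.0], [2.0, 0.0]]
V = matmul(X, W_v)  # [[2.0, 2.0], [0.0, 3.0]]
```

Each row of Q, K, and V belongs to the same token as the matching row of X; only the weight matrix differs between the three views.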
What each role means
- The query asks what this token is looking for.
- The key advertises what this token offers to others.
- The value is the content delivered when a query matches a key.
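The three roles can be sketched as a small computation. The example below assumes the common scaled dot-product form of attention (a dot product between query and key, divided by the square root of the key dimension, then a softmax), which the text above does not spell out. The function names and the toy vectors are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(Q, K, V):
    """For each query, score every key, then mix the values by those scores."""
    d = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # How well does this query match each key?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Deliver a weighted mix of the values.
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# One query that lines up with the first key far better than the second.
Q = [[4.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

outputs, weights = attend(Q, K, V)
```

Because the query matches the first key, most of the weight lands on the first token, and the output is dominated by the first token's value.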
Why separate them
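If a token used one vector for every role, what it asks for, what it advertises, and what it delivers would all be tied together. With the same vector serving as both query and key, for example, the dot-product match score from token A to token B would always equal the score from B to A, so a token could not look for something different from what it offers. Separate projections let the model learn that a verb might query for its subject while offering different information as its value.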
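One concrete cost of sharing can be checked numerically: when the same vector acts as both query and key, the dot-product score from token i to token j equals the score from j to i. The sketch below compares that case with separate projections. The matrices W_q and W_k and the token values are made-up assumptions, and only raw dot-product scores are shown.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

# Two tokens with 2-dimensional embeddings (toy values).
X = [[1.0, 2.0],
     [3.0, 1.0]]

# Shared vector: every token's query and key are both just its embedding.
shared_scores = matmul(X, transpose(X))

# Separate, hypothetical projections for queries and keys.
W_q = [[1.0, 0.0], [0.0, 0.0]]
W_k = [[0.0, 0.0], [1.0, 0.0]]
Q = matmul(X, W_q)
K = matmul(X, W_k)
separate_scores = matmul(Q, transpose(K))

# shared_scores[i][j] is always equal to shared_scores[j][i];
# separate_scores is free to differ in the two directions.
```

With separate projections, token 0 can score token 1 weakly while token 1 scores token 0 strongly, which the shared-vector version cannot express.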
Learned, not fixed
The three weight matrices are trained by gradient descent. Over the course of training, the projections specialize so that useful queries align with useful keys, which determines which tokens attend to which.
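As a rough illustration of this training process, the sketch below runs gradient descent on a query matrix alone, with the keys and values held fixed. Everything here is an assumption for demonstration: the toy task (make token 0 attend to token 1), the squared-error loss, the learning rate, and the finite-difference gradient, which stands in for the backpropagation a real framework would use.

```python
import math

def attention_loss(W_q):
    """Squared error between token 0's attention output and a target."""
    x = [1.0, 0.0]                    # embedding of the querying token
    K = [[1.0, 0.0], [0.0, 1.0]]      # fixed keys for two tokens
    V = [[1.0, 0.0], [0.0, 1.0]]      # fixed values for two tokens
    target = [0.0, 1.0]               # we want token 1's value delivered
    q = [sum(x[i] * W_q[i][j] for i in range(2)) for j in range(2)]
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(2) for k in K]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    weights = [e / sum(exps) for e in exps]
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(2)]
    return sum((o - t) ** 2 for o, t in zip(out, target))

def numerical_grad(f, W, eps=1e-5):
    """Central-difference estimate of the gradient of f with respect to W."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for i in range(len(W)):
        for j in range(len(W[0])):
            old = W[i][j]
            W[i][j] = old + eps
            up = f(W)
            W[i][j] = old - eps
            down = f(W)
            W[i][j] = old  # restore the original entry
            grad[i][j] = (up - down) / (2 * eps)
    return grad

W_q = [[0.1, 0.0], [0.0, 0.1]]        # arbitrary starting weights
initial_loss = attention_loss(W_q)

lr = 0.5                              # hypothetical learning rate
for _ in range(50):
    g = numerical_grad(attention_loss, W_q)
    W_q = [[w - lr * gw for w, gw in zip(w_row, g_row)]
           for w_row, g_row in zip(W_q, g)]

final_loss = attention_loss(W_q)
```

After the updates, the query produced for token 0 points toward token 1's key rather than its own, so the loss is lower than at the start. No rule for this was written by hand; the alignment came from following the gradient.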
Key idea
Each token is linearly projected into a query, key, and value, separating the act of asking, the act of being matched, and the content delivered, all with learned weight matrices.