From id to vector
A token id is just an index. The first thing the model does is look that index up in an embedding matrix, retrieving a learned dense vector for the token.
The embedding matrix
The matrix has one row per vocabulary entry and one column per model dimension. With a vocabulary of fifty thousand and a width of one thousand, that is fifty million parameters before any other layer exists.
- Each row starts random and is learned during training.
- The lookup is a simple row selection, not a matrix multiply.
- Similar tokens drift toward similar vectors as training proceeds.
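As a minimal sketch of the points above, the following NumPy snippet checks the parameter count from the text and shows that lookup by row selection gives the same result as a one-hot matrix multiply. The small sizes, the initialization scale of 0.02, and all variable names are assumptions chosen for illustration, not taken from any particular model.

```python
import numpy as np

# Parameter count for the sizes quoted in the text.
param_count = 50_000 * 1_000  # 50,000,000

# Smaller, hypothetical sizes so the demo runs instantly.
vocab_size, d_model = 1_000, 16
rng = np.random.default_rng(0)

# One row per vocabulary entry, one column per model dimension.
# Rows start random; training would update them later.
embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))

token_ids = np.array([5, 42, 5, 999])

# Lookup by row selection (integer indexing), no multiplication involved.
vectors = embedding[token_ids]  # shape (4, 16)

# Equivalent one-hot matrix multiply, shown only for comparison.
one_hot = np.zeros((len(token_ids), vocab_size))
one_hot[np.arange(len(token_ids)), token_ids] = 1.0
vectors_via_matmul = one_hot @ embedding

same = np.allclose(vectors, vectors_via_matmul)  # True
```

The indexing form touches only the selected rows, while the one-hot form multiplies through the entire matrix, which is why lookups are done by row selection in practice.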
Tied weights
Many models reuse the same matrix for the output projection that turns final hidden states back into token scores. This weight tying removes a second vocabulary-by-width matrix (another fifty million parameters in the example above) and often helps quality.
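A rough sketch of weight tying, assuming the output projection is simply the transpose of the embedding matrix with no bias term. The sizes and the random `hidden` array are hypothetical stand-ins for real final hidden states.

```python
import numpy as np

# Hypothetical small sizes for illustration.
vocab_size, d_model = 1_000, 16
rng = np.random.default_rng(0)
embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))

# Stand-in for final hidden states: one vector per position.
hidden = rng.normal(size=(3, d_model))

# Tied output projection: reuse the embedding matrix, transposed.
# Each score is the dot product of a hidden state with a token's row.
logits = hidden @ embedding.T  # shape (3, vocab_size)

# Parameters that a separate, untied output matrix would have added.
params_saved = vocab_size * d_model  # 16,000
```

Under this assumption, a token scores highly when the final hidden state points in the same direction as that token's embedding row.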
Why it matters for tokenization
Rare tokens appear in few training examples, so their rows receive few gradient updates and remain poorly trained. Vocabulary choices therefore directly affect how well individual tokens are represented.
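The effect can be sketched with a toy batch. The snippet below assumes plain gradient accumulation (ignoring weight decay and optimizer state) and uses made-up token ids and a constant stand-in gradient; it only shows which rows of the matrix receive any gradient.

```python
import numpy as np

# Hypothetical tiny vocabulary and batch: id 1 is common, ids 2 and 7 are rare.
vocab_size, d_model = 10, 4
token_ids = np.array([1, 1, 1, 2, 1, 7])

# Stand-in upstream gradient, one vector per position.
grad_vectors = np.ones((len(token_ids), d_model))

# Gradient with respect to the embedding matrix: each position adds its
# gradient to the row it looked up. np.add.at accumulates repeated ids.
grad_embedding = np.zeros((vocab_size, d_model))
np.add.at(grad_embedding, token_ids, grad_vectors)

# How many positions contributed to each row.
updates_per_row = np.bincount(token_ids, minlength=vocab_size)
# -> [0 4 1 0 0 0 0 1 0 0]
```

Rows for ids that never appear in the batch get a zero gradient, so a token that is rare across the whole corpus accumulates far less training signal than a common one.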
Key idea
Each token id selects a learned row from the embedding matrix, turning integers into dense vectors, and rare tokens get poorly trained rows.