The Embedding Lookup

From id to vector

A token id is just an index. The first thing the model does is look that index up in an embedding matrix, retrieving a learned dense vector for the token.

The embedding matrix

The matrix has one row per vocabulary entry and one column per model dimension. With a vocabulary of fifty thousand and a width of one thousand, that is fifty million parameters before any other layer exists.
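The arithmetic above can be checked in a few lines. This is only a sketch; the names `vocab_size` and `d_model` are illustrative and not taken from any particular library.

```python
# Sizes taken from the example in the text; the variable names are assumptions.
vocab_size = 50_000   # one row per vocabulary entry
d_model = 1_000       # one column per model dimension

# The embedding matrix holds one number per (row, column) pair.
embedding_params = vocab_size * d_model
print(f"{embedding_params:,} parameters")  # 50,000,000 parameters
```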

Each row starts random and is learned during training.
The lookup is a simple row selection. It gives the same result as multiplying a one-hot vector by the matrix, but no multiply is actually performed.
Similar tokens drift toward similar vectors as training proceeds.
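A minimal sketch of the lookup, using plain Python lists and toy sizes, may make the row-selection point concrete. The helper names `lookup` and `one_hot_matmul` are hypothetical, and the initialization scale of 0.02 is an assumption chosen for illustration; real models store the matrix as a tensor and use their framework's embedding layer.

```python
import random

random.seed(0)
vocab_size, d_model = 6, 4  # tiny toy sizes, not realistic ones

# Each row starts as small random numbers; training would later adjust them.
embedding = [[random.gauss(0.0, 0.02) for _ in range(d_model)]
             for _ in range(vocab_size)]

def lookup(token_id):
    """Row selection: return the embedding row for one token id."""
    return embedding[token_id]

def one_hot_matmul(token_id):
    """Same result via a one-hot vector times the matrix, with far more work."""
    one_hot = [1.0 if i == token_id else 0.0 for i in range(vocab_size)]
    return [sum(one_hot[i] * embedding[i][j] for i in range(vocab_size))
            for j in range(d_model)]

# A sequence of token ids becomes a sequence of dense vectors.
token_ids = [3, 1, 3]
vectors = [lookup(t) for t in token_ids]
```

Both functions return the same vector for a given id; the lookup simply skips the multiplications by zero.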

Tied weights

Many models reuse the same matrix, transposed, for the output projection that turns final hidden states back into token scores. This weight tying removes a second vocabulary-by-width matrix from the parameter count and often helps quality.
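As a rough sketch of weight tying, the output score for each token can be computed as the dot product of the final hidden state with that token's embedding row. The matrix values, the hidden state, and the function name `output_logits` below are all made up for illustration.

```python
# Toy embedding matrix: 3 tokens, 2 dimensions (values are made up).
embedding = [
    [1.0, 0.0],   # token 0
    [0.0, 1.0],   # token 1
    [0.5, 0.5],   # token 2
]

def output_logits(hidden):
    """Score each token as the dot product of the hidden state with its embedding row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding]

hidden = [2.0, 1.0]              # hypothetical final hidden state
logits = output_logits(hidden)   # one score per vocabulary entry
print(logits)                    # [2.0, 1.0, 1.5]
```

Because the same `embedding` list serves both the input lookup and the output scores, no separate output matrix is stored.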

Why it matters for tokenization

Rare tokens appear in few training examples, so their rows receive few gradient updates and stay poorly trained. This is a direct link between vocabulary choices and how well individual tokens are represented.
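One way to see the imbalance is to count how often each token id occurs in a corpus. The sketch below uses a hypothetical toy corpus and a simplified view in which each occurrence gives the corresponding input embedding row one gradient contribution; it ignores tied weights, where the output projection also sends gradients to the rows.

```python
from collections import Counter

# Hypothetical tokenized corpus as a flat list of token ids.
corpus_ids = [0, 1, 0, 2, 0, 1, 0, 0, 3, 0, 1, 0]
vocab_size = 5

counts = Counter(corpus_ids)

# Simplified view: one gradient contribution per occurrence of the token.
updates_per_row = [counts.get(t, 0) for t in range(vocab_size)]
print(updates_per_row)  # [7, 3, 1, 1, 0]
```

Token 0 dominates, tokens 2 and 3 are touched once each, and token 4 never appears, so its row would stay at its random initialization.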

Key idea

Each token id selects a learned row from the embedding matrix, turning integers into dense vectors, and rare tokens get poorly trained rows.

