The Embedding And Unembedding

Both ends of the model

A transformer works on continuous vectors, but text is discrete tokens. The embedding maps each input token to a vector, and the unembedding maps the final vector back to scores over the vocabulary.

The embedding lookup

The vocabulary is a list of token ids.
An embedding matrix has one row per token.
Looking up a token id selects its row, the token vector.

The unembedding projection

At the output, the model has a vector per position. The unembedding multiplies it by a matrix with one column per token, producing a logit for every vocabulary entry. Softmax then turns logits into a probability distribution over the next token.

A symmetric pair

The embedding turns ids into vectors at the bottom and the unembedding turns vectors into id scores at the top. Everything between them, attention and feed forward, operates purely in vector space.

Key idea

The embedding maps discrete token ids to vectors at the input, and the unembedding projects final vectors to a logit per vocabulary token at the output, bracketing a model that works entirely in continuous vector space.

The Embedding And Unembedding

Both ends of the model

The embedding lookup

The unembedding projection

A symmetric pair

Key idea

Check yourself