From symbols to vectors
An embedding layer maps each discrete token, such as a word or category, to a dense learnable vector.
- It is implemented as a lookup table indexed by token id.
- Each row is a vector trained by gradient descent like any other parameter.
- Similar tokens tend to end up with similar vectors.
Why not one hot
A one hot vector is huge and sparse, and it treats every token as equally distant. Embeddings are compact, dense, and place related tokens near each other in space.
In transformers the token embeddings are combined with positional encodings before entering the first block. Output embeddings are sometimes tied to the input table to save parameters.
Key idea
Embedding layers are trainable lookup tables that convert discrete tokens into dense vectors, giving the model a compact space where similar tokens sit close together.