Beyond one-hot
One-hot encoding treats every category as equally distant from every other, so it cannot express that some categories are more alike than others. A learned embedding instead maps each category to a short dense vector, placing similar categories near each other in that space.
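The contrast can be checked with a few lines of plain Python. This is only a sketch: the category names and the dense vectors are made-up illustrative values, not learned ones.

```python
import math

def one_hot(index, size):
    """Return a vector of zeros with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

categories = ["apple", "pear", "hammer"]  # hypothetical categories
sparse = {name: one_hot(i, len(categories)) for i, name in enumerate(categories)}

# Every pair of distinct one-hot vectors is exactly sqrt(2) apart.
sparse_near = distance(sparse["apple"], sparse["pear"])
sparse_far = distance(sparse["apple"], sparse["hammer"])

# Hand-picked dense vectors (assumed values, chosen only for illustration).
dense = {"apple": [0.9, 0.1], "pear": [0.8, 0.2], "hammer": [-0.7, 0.6]}
dense_near = distance(dense["apple"], dense["pear"])
dense_far = distance(dense["apple"], dense["hammer"])

print(sparse_near == sparse_far)  # one-hot: no notion of "more similar"
print(dense_near < dense_far)     # dense: apple sits closer to pear
```

With one-hot vectors the two distances are identical, while the dense vectors can place apple nearer to pear than to hammer.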
How it learns
An embedding is a lookup table of vectors, one row per category, trained alongside the rest of the network.
- The vector for each category starts random.
- Gradients flowing back from the loss nudge the vectors so they help prediction.
- Categories used in similar contexts drift toward similar vectors.
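The three steps above can be sketched in plain Python. This is a toy under stated assumptions, not how a framework implements it: the data is made up, a single linear layer stands in for the rest of the network, and the learning rate and epoch count are arbitrary choices. In practice a framework layer such as `torch.nn.Embedding` holds the table and computes the gradients.

```python
import math
import random

rng = random.Random(0)
NUM_CATEGORIES, DIM = 4, 2

# Lookup table: one row per category, starting from small random values.
table = [[rng.gauss(0.0, 0.01) for _ in range(DIM)] for _ in range(NUM_CATEGORIES)]
# Stand-in for the rest of the network: one linear layer (an assumption).
weights = [rng.gauss(0.0, 0.01) for _ in range(DIM)]

# Toy data (made up): categories 0 and 1 behave alike, as do 2 and 3.
examples = [(0, 1.0), (1, 1.0), (2, -1.0), (3, -1.0)]

def predict(category):
    row = table[category]  # the embedding lookup is plain indexing
    return sum(r * w for r, w in zip(row, weights))

def mean_squared_error():
    return sum((predict(c) - y) ** 2 for c, y in examples) / len(examples)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

loss_before = mean_squared_error()
LEARNING_RATE = 0.05
for _ in range(500):
    for category, target in examples:
        error = predict(category) - target
        row = table[category]
        # Gradients of error**2 with respect to the row and the weights.
        grad_row = [2 * error * w for w in weights]
        grad_weights = [2 * error * r for r in row]
        # Only the row that was looked up is updated.
        table[category] = [r - LEARNING_RATE * g for r, g in zip(row, grad_row)]
        weights = [w - LEARNING_RATE * g for w, g in zip(weights, grad_weights)]
loss_after = mean_squared_error()

same_context = cosine(table[0], table[1])       # categories with the same target
different_context = cosine(table[0], table[2])  # categories with opposite targets
print(loss_before, loss_after, same_context, different_context)
```

After training, the rows for categories 0 and 1 point in nearly the same direction, while the rows for 0 and 2 point in opposite directions, even though nothing told the table which categories were alike.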
Why it helps
- It captures relationships, so two similar products end up close together.
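- It keeps the vector width small, typically tens to a few hundred dimensions, even for millions of categories, whereas a one-hot vector needs one dimension per category.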
- The learned vectors can be reused in other models or for similarity search.
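Reusing the vectors for similarity search could look like the sketch below. The product names and vector values are hypothetical stand-ins for a trained table, and the brute-force scan is only reasonable for small tables; large ones usually call for an approximate nearest-neighbour index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, k=1):
    """Return the k categories most similar to `query`, excluding itself."""
    scored = [
        (cosine(vectors[query], vec), name)
        for name, vec in vectors.items()
        if name != query
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical vectors, as if exported from a trained embedding table.
product_vectors = {
    "espresso_machine": [0.9, 0.1, 0.0],
    "coffee_grinder": [0.8, 0.3, 0.1],
    "garden_hose": [-0.2, 0.9, 0.1],
    "lawn_sprinkler": [-0.1, 0.8, 0.3],
}

print(nearest("espresso_machine", product_vectors))  # ['coffee_grinder']
print(nearest("garden_hose", product_vectors))       # ['lawn_sprinkler']
```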
Practical choices
- A common rule of thumb sets the vector size near the fourth root of the category count, or at a small fraction of that count; treat either as a starting point to tune.
- Reserve a slot for unknown categories seen only at prediction time.
- Embeddings shine when there are many categories and plenty of training data.
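The first two choices can be sketched as small helpers. The function names, the default exponent of 0.25, and the lower and upper bounds are assumptions made for illustration, not fixed conventions. Index 0 is reserved here for unknown categories.

```python
UNKNOWN_INDEX = 0  # reserved row for categories never seen in training

def embedding_dim(num_categories, exponent=0.25, min_dim=2, max_dim=50):
    """Root-based size heuristic; exponent and bounds are assumed defaults."""
    return max(min_dim, min(max_dim, round(num_categories ** exponent)))

def build_vocab(training_values):
    """Map each distinct training category to an index starting at 1."""
    vocab = {}
    for value in training_values:
        if value not in vocab:
            vocab[value] = len(vocab) + 1  # skip 0, the unknown slot
    return vocab

def encode(value, vocab):
    """Return the table row for `value`, falling back to the unknown slot."""
    return vocab.get(value, UNKNOWN_INDEX)

vocab = build_vocab(["red", "green", "red", "blue"])
table_rows = len(vocab) + 1  # one extra row for the unknown slot

print(encode("green", vocab))   # 2
print(encode("purple", vocab))  # 0, unseen at training time
print(table_rows)               # 4
print(embedding_dim(10_000))    # 10
```

The unknown row only learns something useful if some training examples are routed to it, for instance by mapping rare categories to index 0 during training.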
Key idea
Categorical embeddings learn dense vectors that place similar categories near each other, scaling to huge category counts while capturing relationships that one-hot encoding ignores.