Matryoshka Embeddings

The nesting idea

Matryoshka representation learning trains a single embedding model so that each prefix of its output vector is itself a usable embedding. Like nested dolls, the first 64 dimensions, the first 256, and the full 1024 each work on their own. You can truncate to a shorter length at query time without retraining, provided the query and the stored vectors are cut to the same length.
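As a minimal sketch of truncation, the snippet below keeps a prefix of a vector and rescales it to unit length. The rescaling step is an assumption that matters only if downstream code expects unit-length vectors (for example, dot-product search). The helper names and the toy 8-dimensional vectors are hypothetical stand-ins for real model output.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the prefix to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # an all-zero prefix cannot be rescaled
    return [x / norm for x in prefix]

def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical); a real model would produce these.
query = [0.9, 0.3, 0.1, 0.05, 0.02, 0.01, 0.01, 0.0]
doc = [0.8, 0.4, 0.0, 0.10, 0.01, 0.03, 0.00, 0.01]

# Both sides are cut to the same prefix length before comparing.
short_q = truncate_embedding(query, 4)
short_d = truncate_embedding(doc, 4)
short_score = cosine(short_q, short_d)
full_score = cosine(query, doc)
```

With a model trained in the Matryoshka style, short_score would be expected to track full_score closely; with an ordinary embedding model there is no such guarantee.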

How it is trained

Instead of optimizing only the full vector, training applies the loss to several nested prefix lengths at once and combines the results into a single objective. This pushes the model to pack the most important information into the early dimensions, with later dimensions adding refinement.
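The idea of applying the loss at several nested lengths can be sketched as a sum of per-prefix losses. Everything specific here is an assumption for illustration: the base loss (squared error between cosine similarity and a target), the uniform default weights, and the function names. A real training setup would use the model's usual loss and a gradient-based optimizer.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def pair_loss(a, b, target):
    """Stand-in base loss: squared gap between similarity and target."""
    return (cosine(a, b) - target) ** 2

def matryoshka_loss(a, b, target, dims, weights=None):
    """Sum the base loss over each nested prefix length in `dims`."""
    if weights is None:
        weights = [1.0] * len(dims)  # uniform weighting is an assumption
    return sum(
        w * pair_loss(a[:d], b[:d], target)
        for d, w in zip(dims, weights)
    )

# Two toy vectors that agree on the first 2 dimensions but differ later.
a = [1.0, 0.0, 0.0, 0.0]
b = [1.0, 0.0, 1.0, 0.0]
loss = matryoshka_loss(a, b, target=1.0, dims=[2, 4])
```

Because every prefix contributes its own term, a model cannot score well by relying only on the late dimensions; the short prefixes have to carry useful signal too.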

Why this is useful

Adaptive cost: use short prefixes for cheap coarse retrieval, then longer prefixes only where precision matters.
One model, many budgets: a single stored vector serves devices and indexes with different memory limits.
Cheaper search: a smaller dimension cuts storage and speeds up nearest-neighbor lookups, since both the stored bytes and the work per distance computation grow with vector length.
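To make the storage saving concrete, here is a back-of-the-envelope calculation. It assumes raw float32 storage (4 bytes per value), a hypothetical collection of one million vectors, and no index overhead or compression.

```python
BYTES_PER_FLOAT32 = 4

def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage for `num_vectors` dense vectors of length `dim`."""
    return num_vectors * dim * bytes_per_value

num_vectors = 1_000_000  # hypothetical collection size
sizes = {dim: index_size_bytes(num_vectors, dim) for dim in (64, 256, 1024)}

for dim, size in sizes.items():
    print(f"{dim:>4} dims: {size / 1e6:,.0f} MB")
```

Under these assumptions the 64-dimensional prefixes take 256 MB against 4,096 MB for the full 1024 dimensions, a 16x reduction.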

A typical flow

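Retrieve a candidate set using a short truncated vector, then rerank the survivors using the full-length vector. The cheap short-vector pass covers the whole collection, while the expensive full-length comparison runs only on the small candidate set, which keeps final accuracy high.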
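The two-stage flow above can be sketched as a brute-force search in plain Python. The function names, parameters, and toy 4-dimensional vectors are hypothetical. A production system would likely precompute the truncated document vectors and use a nearest-neighbor index for the first stage rather than scanning every document.

```python
import math

def normalize(vec):
    """Rescale to unit length; all-zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, docs, short_dim, num_candidates, top_k):
    """Coarse retrieval on a short prefix, then rerank with full vectors."""
    # Stage 1: score every document using only the first `short_dim` dims.
    q_short = normalize(query[:short_dim])
    coarse = sorted(
        range(len(docs)),
        key=lambda i: dot(q_short, normalize(docs[i][:short_dim])),
        reverse=True,
    )[:num_candidates]

    # Stage 2: rescore only the surviving candidates with full vectors.
    q_full = normalize(query)
    reranked = sorted(
        coarse,
        key=lambda i: dot(q_full, normalize(docs[i])),
        reverse=True,
    )
    return reranked[:top_k]

query = [1.0, 0.0, 0.5, 0.0]
docs = [
    [1.0, 0.0, 0.5, 0.0],   # 0: same direction as the query
    [1.0, 0.1, -0.5, 0.0],  # 1: matches the prefix, diverges later
    [0.0, 1.0, 0.0, 0.0],   # 2: dissimilar even in the prefix
    [0.9, 0.1, 0.4, 0.1],   # 3: close to the query overall
]
result = two_stage_search(query, docs, short_dim=2, num_candidates=3, top_k=2)
```

In this toy data, document 1 survives the coarse pass because its first two dimensions match the query, but the full-length rerank pushes it below document 3, so result holds the indices of documents 0 and 3.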

Key idea

Matryoshka embeddings front-load importance so that truncating the vector to a shorter prefix still yields a usable embedding, letting one model trade dimension for cost without retraining.
