What drift means
Embedding drift is a change over time in the distribution of the vectors a model produces, usually because the incoming data has shifted. Even if the model is unchanged, new topics, slang, or image styles can move embeddings into regions the system was not tuned for, degrading search and classification.
What to monitor
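- Distribution shift: compare recent embeddings to a reference window using a divergence or distance between distributions, such as maximum mean discrepancy, or a simpler summary such as the change in mean and covariance.
- Centroid movement: track how far each category's cluster center moves between the reference window and the recent window.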
- Downstream metrics: watch retrieval relevance, recall, or classifier accuracy on a labeled sample.
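As one illustration of the centroid check, the sketch below measures the cosine distance between each category's reference centroid and its recent centroid. It assumes NumPy and embeddings grouped by category label; the function name `centroid_shift`, the category names, and the synthetic data are hypothetical and stand in for real logged vectors.

```python
import numpy as np

def centroid_shift(reference, recent):
    """Cosine distance between reference and recent centroids, per category.

    Both arguments map a category name to an (n, d) array of embeddings.
    Categories missing from either side are skipped. A centroid with zero
    norm is not handled here and would need a guard in real use.
    """
    shifts = {}
    for category in reference.keys() & recent.keys():
        ref_c = np.asarray(reference[category], dtype=float).mean(axis=0)
        new_c = np.asarray(recent[category], dtype=float).mean(axis=0)
        cos = ref_c @ new_c / (np.linalg.norm(ref_c) * np.linalg.norm(new_c))
        shifts[category] = float(1.0 - cos)
    return shifts

# Synthetic data: "news" stays put, "sports" drifts in half of its dimensions.
rng = np.random.default_rng(0)
reference = {
    "sports": rng.normal(1.0, 0.1, (200, 8)),
    "news": rng.normal(-1.0, 0.1, (200, 8)),
}
drifted = rng.normal(1.0, 0.1, (200, 8))
drifted[:, :4] -= 2.0
recent = {
    "sports": drifted,
    "news": rng.normal(-1.0, 0.1, (200, 8)),
}

shifts = centroid_shift(reference, recent)
```

Here the shift for the drifted category comes out much larger than for the stable one. Cosine distance is only one possible choice; Euclidean distance between centroids may suit encoders whose vector norms carry meaning.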
How to detect it
A common approach computes a distance between a baseline batch of embeddings and a current batch, and raises an alert when that distance is large or keeps growing. Tracking nearest-neighbor recall on a fixed probe set is a practical, task-aligned signal.
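The two signals above can be sketched as follows. The distance used here is a biased estimate of squared maximum mean discrepancy (MMD) with an RBF kernel, which is one option among several rather than a prescribed method. The names `mmd_rbf`, `recall_at_k`, and `ALERT_THRESHOLD`, the default bandwidth, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def mmd_rbf(baseline, current, gamma=None):
    """Biased estimate of squared MMD between two batches, RBF kernel."""
    x = np.asarray(baseline, dtype=float)
    y = np.asarray(current, dtype=float)
    if gamma is None:
        gamma = 1.0 / x.shape[1]  # assumed default bandwidth: 1 / dimension

    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at 0 for rounding error.
        sq_dist = (a ** 2).sum(axis=1)[:, None] + (b ** 2).sum(axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-gamma * np.maximum(sq_dist, 0.0))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())

def recall_at_k(query_vecs, corpus_vecs, relevant_ids, k=5):
    """Mean fraction of each probe's relevant ids found in its top-k cosine neighbors."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    top_k = np.argsort(-(q @ c.T), axis=1)[:, :k]
    hits = [
        len(set(row.tolist()) & set(rel)) / len(rel)
        for row, rel in zip(top_k, relevant_ids)
    ]
    return float(np.mean(hits))

ALERT_THRESHOLD = 0.05  # hypothetical value; tune it on historical batches

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, (200, 16))
same = rng.normal(0.0, 1.0, (200, 16))     # drawn from the baseline distribution
shifted = rng.normal(1.0, 1.0, (200, 16))  # mean moved in every dimension

stable_score = mmd_rbf(baseline, same)
drift_score = mmd_rbf(baseline, shifted)
drift_alert = drift_score > ALERT_THRESHOLD

# Fixed probe set: each probe is a lightly perturbed copy of one corpus item,
# so that item is its single known relevant result.
corpus = rng.normal(0.0, 1.0, (50, 16))
probes = corpus[:10] + rng.normal(0.0, 0.01, (10, 16))
relevant = [[i] for i in range(10)]
probe_recall = recall_at_k(probes, corpus, relevant, k=5)
```

In practice the probe queries and their relevant ids would come from a labeled sample, and both scores would be logged per time window so that a growing trend is visible as well as a single threshold crossing.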
How to respond
- Re-embed the corpus if the encoder was updated, since vectors from different encoder versions are generally not comparable and must not be mixed in one index.
- Retrain or fine-tune the encoder on fresh data when the domain has moved.
- Refresh references so the monitor reflects the current normal.
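One way to enforce the rule against mixing old and new vectors is to tag every index with the encoder version that produced it and to rebuild the index from scratch when the encoder changes. The sketch below is a minimal illustration under that assumption; `VersionedIndex`, `reembed`, the version strings, and the toy encoder are hypothetical names, not part of any particular library.

```python
from dataclasses import dataclass, field

@dataclass
class VersionedIndex:
    """Holds embeddings produced by exactly one encoder version."""
    encoder_version: str
    vectors: dict = field(default_factory=dict)  # doc id -> embedding

    def add(self, doc_id, vector, encoder_version):
        # Refuse vectors from any other encoder version.
        if encoder_version != self.encoder_version:
            raise ValueError(
                f"index holds {self.encoder_version} vectors, got {encoder_version}"
            )
        self.vectors[doc_id] = vector

def reembed(documents, encode, new_version):
    """Build a fresh index by re-encoding every document with the new encoder."""
    fresh = VersionedIndex(new_version)
    for doc_id, text in documents.items():
        fresh.add(doc_id, encode(text), new_version)
    return fresh

# Toy encoder standing in for a real model: text length and word count.
def toy_encode(text):
    return [float(len(text)), float(len(text.split()))]

documents = {"a": "embedding drift", "b": "reference window refresh"}
old_index = VersionedIndex("v1")
new_index = reembed(documents, toy_encode, "v2")

try:
    old_index.add("c", toy_encode("late arrival"), "v2")
    mixed_rejected = False
except ValueError:
    mixed_rejected = True
```

After a re-embed, the monitoring reference would also need to be rebuilt from the new vectors, since a reference computed with the old encoder would report drift that is only a version mismatch.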
Key idea
Embedding drift monitoring compares live vectors against a reference to catch distribution shifts that quietly hurt quality, prompting re-embedding, retraining, or a reference refresh before users feel the decline.