A shared semantic space
Multilingual embeddings map text from many languages into one common vector space, so that a sentence and its translation land near each other. This enables cross-lingual search, where a query in one language retrieves matching documents in another.
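As a rough illustration of cross-lingual search, the sketch below ranks documents by cosine similarity to a query vector. The 3-dimensional vectors are hand-made stand-ins for what a multilingual encoder might produce; no real model is called, and the names `doc_vectors`, `query_vector`, and `search` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# ASSUMPTION: toy embeddings written by hand, not the output of any real encoder.
doc_vectors = {
    "de: Die Katze liegt auf dem Sofa": [0.9, 0.1, 0.0],
    "fr: Le cours de la bourse a chute": [0.0, 0.2, 0.9],
    "es: Receta de sopa de tomate": [0.1, 0.9, 0.1],
}

# Stand-in for the embedding of the English query "the cat is on the couch".
query_vector = [0.8, 0.2, 0.1]

def search(query, docs, k=1):
    """Return the k document keys whose vectors are closest to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
    return ranked[:k]

top = search(query_vector, doc_vectors)
print(top)
```

Because the query and the documents live in the same space, the search code never needs to know which language either side is written in.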
How alignment is learned
- A shared subword vocabulary lets one model tokenize all languages; overlap on numbers, names, and cognates seeds the alignment.
- Parallel pairs (sentences paired with their translations) serve as positives in a contrastive objective, so translations are pulled together while unrelated sentences are pushed apart.
- Knowledge distillation can teach a multilingual student to mimic a strong single-language teacher: the student is trained so that a sentence and its translation both map close to the teacher's vector for the original sentence.
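The two training signals in the list above can be sketched as loss functions. This is a minimal illustration under stated assumptions, not a definitive recipe: the contrastive loss uses in-batch negatives with a softmax (one common formulation), the distillation loss is a plain mean squared error, and the temperature value and all function names are hypothetical choices.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

def contrastive_loss(src, tgt, temperature=0.1):
    """Mean cross-entropy where the correct match for src[i] is tgt[i].

    Every other target in the batch acts as a negative.
    """
    src = [normalize(v) for v in src]
    tgt = [normalize(v) for v in tgt]
    total = 0.0
    for i, s in enumerate(src):
        logits = [dot(s, t) / temperature for t in tgt]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(src)

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def distillation_loss(teacher_src, student_src, student_tgt):
    """ASSUMPTION: the student should match the teacher's source-sentence
    vector for both the source sentence and its translation."""
    triples = list(zip(teacher_src, student_src, student_tgt))
    return sum(mse(t, s) + mse(t, g) for t, s, g in triples) / len(triples)

# Toy batch: row i of `english` is the translation of row i of `german`.
english = [[1.0, 0.0], [0.0, 1.0]]
german = [[0.9, 0.1], [0.1, 0.9]]

aligned = contrastive_loss(english, german)
shuffled = contrastive_loss(english, list(reversed(german)))
print(aligned < shuffled)
```

The loss is lower when each sentence sits nearest its own translation, which is the pressure that pulls translations together during training.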
Why it is powerful
You can build one index covering many languages, train a classifier in a resource-rich language and apply it to low-resource ones, and serve global users with a single model. Applying a model to a language for which it saw no labeled examples is called zero-shot cross-lingual transfer.
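Zero-shot transfer can be sketched with a nearest-centroid classifier: it is fit on labeled English vectors only and then applied unchanged to vectors from another language. The 2-dimensional embeddings, the sentiment labels, and the choice of Swahili as the target language are all hypothetical and chosen for illustration.

```python
def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train(labeled):
    """Fit one centroid per label from (vector, label) pairs."""
    by_label = {}
    for vec, label in labeled:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(model, vec):
    """Return the label whose centroid is closest to vec."""
    return min(model, key=lambda label: sq_dist(model[label], vec))

# ASSUMPTION: toy embeddings; a real system would get these from an encoder.
english_train = [
    ([0.9, 0.1], "positive"),
    ([0.8, 0.2], "positive"),
    ([0.1, 0.9], "negative"),
    ([0.2, 0.8], "negative"),
]
# No Swahili examples are used for training, only for prediction.
swahili_test = [([0.7, 0.3], "positive"), ([0.3, 0.6], "negative")]

model = train(english_train)
predictions = [predict(model, vec) for vec, _ in swahili_test]
print(predictions)
```

The classifier only ever sees vectors, so it works on the new language to the extent that the embedding space aligns that language with English.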
Pitfalls
Languages with little training data may align poorly. High-resource languages can dominate the space, and scripts or domains far from the training data degrade quality, so per-language evaluation still matters.
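One way to make per-language evaluation concrete is to report a retrieval metric separately for each query language instead of a single average. The sketch below computes recall@1 on hand-made vectors; the data, the language codes, and the function name `recall_at_1` are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def recall_at_1(queries, docs):
    """Fraction of queries whose nearest document is the gold document.

    queries: list of (vector, gold_doc_id); docs: dict of doc_id -> vector.
    """
    hits = 0
    for vec, gold in queries:
        best = max(docs, key=lambda d: cosine(vec, docs[d]))
        hits += best == gold
    return hits / len(queries)

# ASSUMPTION: toy data in which the "sw" queries are deliberately less aligned.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
queries_by_language = {
    "de": [([0.9, 0.1], "d1"), ([0.2, 0.8], "d2")],
    "sw": [([0.6, 0.4], "d1"), ([0.7, 0.3], "d2")],  # second query lands near d1
}

scores = {lang: recall_at_1(qs, docs) for lang, qs in queries_by_language.items()}
for lang, score in scores.items():
    print(lang, score)
```

In this toy data the average over both languages is 0.75, which would hide the fact that one language retrieves perfectly and the other only half the time.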
Key idea
Multilingual embeddings align many languages into one space using a shared vocabulary and parallel data, enabling cross-lingual search and zero-shot transfer, though low-resource languages can still align weakly.