Multilingual Embeddings

A shared semantic space

Multilingual embeddings map text from many languages into one common vector space so that a sentence and its translation land near each other. This enables cross-lingual search, where a query in one language retrieves matching documents in another.
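As a rough illustration of cross-lingual search, the sketch below ranks documents in several languages against an English query by cosine similarity. The three-dimensional vectors are hand-made stand-ins for what a multilingual encoder might produce, and the names `DOCS`, `QUERY_VEC`, and `search` are hypothetical; no real model is called.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Assumed toy embeddings: in practice these would come from a multilingual encoder.
DOCS = {
    "de: Die Katze schläft auf dem Sofa.": [0.9, 0.1, 0.0],
    "es: El mercado de valores cayó hoy.": [0.0, 0.2, 0.9],
    "fr: La recette demande deux œufs.": [0.1, 0.9, 0.1],
}

def search(query_vec, docs, top_k=1):
    """Return the top_k document texts ranked by similarity to the query."""
    ranked = sorted(docs, key=lambda text: cosine(query_vec, docs[text]), reverse=True)
    return ranked[:top_k]

# Assumed embedding of the English query "a cat sleeping on the couch".
QUERY_VEC = [0.8, 0.2, 0.1]
best = search(QUERY_VEC, DOCS)
```

Because the query and the documents share one space, the search code never needs to know which language anything is written in. With these toy vectors the German sentence about the cat ranks first.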

How alignment is learned

A shared subword vocabulary lets one model tokenize all languages; overlapping tokens such as numbers, names, and cognates act as anchors that seed alignment.
Parallel pairs (sentences and their translations) serve as positives in a contrastive objective, so translations are pulled together while non-matching sentences are pushed apart.
Knowledge distillation can train a multilingual student to reproduce a strong monolingual teacher's embedding for both a sentence and its translation.
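The two training signals above can be sketched as loss functions. This is a minimal illustration under stated assumptions, not any specific model's training code: the contrastive loss is written as a softmax cross-entropy with in-batch negatives, the distillation loss as a squared error against the teacher's source-sentence embedding, and the function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(src, tgt, temperature=0.1):
    """Cross-entropy where tgt[i] is the positive for src[i]
    and every other tgt[j] in the batch is a negative."""
    total = 0.0
    for i, s in enumerate(src):
        logits = [dot(s, t) / temperature for t in tgt]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(src)

def distillation_loss(teacher_src, student_src, student_tgt):
    """Squared error pulling the student's embeddings of a sentence and of
    its translation toward the teacher's embedding of the source sentence."""
    total = 0.0
    for t, s_src, s_tgt in zip(teacher_src, student_src, student_tgt):
        total += sum((a - b) ** 2 for a, b in zip(s_src, t))
        total += sum((a - b) ** 2 for a, b in zip(s_tgt, t))
    return total / len(teacher_src)

# Toy batch of two parallel pairs (assumed unit-length embeddings).
SRC = [[1.0, 0.0], [0.0, 1.0]]
ALIGNED_TGT = [[1.0, 0.0], [0.0, 1.0]]   # translations sit on their sources
SWAPPED_TGT = [[0.0, 1.0], [1.0, 0.0]]   # translations sit on the wrong source

loss_aligned = contrastive_loss(SRC, ALIGNED_TGT)
loss_swapped = contrastive_loss(SRC, SWAPPED_TGT)
```

The aligned batch gives a loss near zero and the swapped batch a large one, which is the gradient signal that pulls translations together during training.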

Why it is powerful

You can build one index covering many languages, serve global users with a single model, and train a classifier in a resource-rich language and then apply it to low-resource ones. That last capability is zero-shot cross-lingual transfer: the classifier never sees labeled examples in the target language.
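A small sketch of zero-shot transfer, assuming embeddings are already computed: a nearest-centroid classifier is fit on labeled English vectors only and then applied to vectors from another language. The two-dimensional vectors, the labels, and the names `EN_TRAIN`, `fit`, and `predict` are hypothetical illustrations, and nearest-centroid is chosen here only for brevity.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def fit(train):
    """train maps each label to a list of embeddings; returns one centroid per label."""
    return {label: centroid(vectors) for label, vectors in train.items()}

def predict(model, vec):
    """Pick the label whose centroid is most similar to vec."""
    return max(model, key=lambda label: cosine(model[label], vec))

# Labeled training data exists only in English (assumed toy embeddings).
EN_TRAIN = {
    "positive": [[0.9, 0.1], [0.8, 0.3]],
    "negative": [[0.1, 0.9], [0.2, 0.8]],
}
model = fit(EN_TRAIN)

# Assumed embeddings of two reviews written in a language with no labels.
TARGET_POSITIVE = [0.85, 0.2]
TARGET_NEGATIVE = [0.15, 0.7]
predictions = [predict(model, TARGET_POSITIVE), predict(model, TARGET_NEGATIVE)]
```

The classifier works on the target language only to the extent that the two languages are well aligned in the embedding space, which is why alignment quality matters so much.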

Pitfalls

Languages with little training data may align poorly. High-resource languages can dominate the space, and scripts or domains far from the training data degrade quality, so per-language evaluation still matters.
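One way to evaluate per language is translation retrieval: for each test sentence, check whether its nearest English embedding is its own translation, and report the hit rate separately for each language. The sketch below assumes unit-length English vectors so the dot product can serve as the score; the data, the language code `xx` (a placeholder for a poorly aligned language), and the function name are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def accuracy_at_1(queries, targets):
    """Fraction of queries whose best-scoring target is the one at the
    same index, i.e. the query's own translation."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [dot(query, target) for target in targets]
        if scores.index(max(scores)) == i:
            hits += 1
    return hits / len(queries)

# Assumed English embeddings; EN[i] is the translation of each test set's item i.
EN = [[1.0, 0.0], [0.0, 1.0]]

# Assumed embeddings of the same two sentences in other languages.
TEST_SETS = {
    "fr": [[0.9, 0.1], [0.1, 0.9]],  # well aligned
    "xx": [[0.6, 0.5], [0.7, 0.4]],  # collapsed toward one region of the space
}

report = {lang: accuracy_at_1(vectors, EN) for lang, vectors in TEST_SETS.items()}
```

An aggregate score over both languages would hide the gap that the per-language report exposes.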

Key idea

Multilingual embeddings align many languages into one space using shared vocabulary and parallel data, enabling cross-lingual search and zero-shot transfer, though low-resource languages can still align weakly.
