Machine Learning

Multilingual Embeddings

One shared space where the same meaning lands together across languages.

6 min read · advanced

A shared semantic space

Multilingual embeddings map text from many languages into one common vector space, so that a sentence and its translation land near each other. This enables cross-lingual search, where a query in one language retrieves matching documents in another.
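As a rough illustration of cross-lingual search, the sketch below ranks documents by cosine similarity to a query vector. The 3-dimensional vectors are made up for the example and stand in for the output of a real multilingual encoder, and the names `cosine`, `search`, and `DOCS` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical embeddings (assumption): a real encoder would produce
# hundreds of dimensions, not three.
DOCS = {
    "de: Die Katze schläft auf dem Sofa": [0.9, 0.1, 0.0],
    "fr: Le train part à huit heures": [0.0, 0.2, 0.9],
    "es: La receta lleva dos huevos": [0.1, 0.9, 0.1],
}

def search(query_vec, docs, k=1):
    """Return the k document keys most similar to the query vector."""
    ranked = sorted(docs, key=lambda name: cosine(query_vec, docs[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical embedding of the English query
# "the cat is sleeping on the couch".
query = [0.85, 0.15, 0.05]
top = search(query, DOCS)
print(top)
```

Because every language shares one space, the query needs no translation step: the English vector is compared directly against German, French, and Spanish document vectors.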

How alignment is learned

  • A shared subword vocabulary lets one model tokenize every language; overlapping tokens such as numbers, names, and cognates give alignment an initial seed.
  • Parallel pairs (sentences with their translations) serve as positives in a contrastive objective, so translations are pulled together while unrelated sentences are pushed apart.
  • Knowledge distillation can train a multilingual student to reproduce a strong monolingual teacher's embeddings, both for the original sentence and for its translations.
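The two training signals in the list above can be sketched as loss functions. This is a minimal illustration, assuming an InfoNCE-style contrastive loss with in-batch negatives and a mean-squared-error distillation loss; the function names and the temperature value are assumptions for the example, not a specific library's API.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products become cosines."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def info_nce(src, tgt, temperature=0.1):
    """Contrastive loss over a batch of parallel pairs.

    src[i] and tgt[i] embed a sentence and its translation (the positive);
    every other tgt[j] in the batch acts as a negative.
    """
    src = [normalize(v) for v in src]
    tgt = [normalize(v) for v in tgt]
    total = 0.0
    for i, s in enumerate(src):
        logits = [dot(s, t) / temperature for t in tgt]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]  # cross-entropy with target index i
    return total / len(src)

def distill_mse(student_vecs, teacher_vecs):
    """Mean squared error between student and teacher embeddings.

    The student embeds a sentence or its translation; the teacher
    embeds the original sentence.
    """
    total, count = 0.0, 0
    for s, t in zip(student_vecs, teacher_vecs):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
        count += len(s)
    return total / count

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
misaligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < misaligned)
```

The contrastive loss is low when each sentence is closest to its own translation and high when it is closer to another sentence in the batch, which is the "pull together" pressure described above.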

Why it is powerful

You can build one index covering many languages, serve global users with a single model, and train a classifier in a high-resource language and then apply it to low-resource ones. That last case is zero-shot cross-lingual transfer: the classifier never sees labeled data in the target language.
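A toy version of zero-shot transfer might look like the following: a nearest-centroid classifier is fit on labeled English embeddings only, then applied to an embedding from another language. The 2-dimensional vectors and the Swahili example are invented for illustration, and the helper names are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit(labelled):
    """labelled maps each label to a list of embeddings; returns label -> centroid."""
    return {label: centroid(vecs) for label, vecs in labelled.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

# Hypothetical embeddings of labeled English reviews (assumption).
english_train = {
    "positive": [[0.9, 0.1], [0.8, 0.2]],
    "negative": [[0.1, 0.9], [0.2, 0.7]],
}
model = fit(english_train)

# Hypothetical embedding of a positive Swahili review; no Swahili
# example was used to fit the classifier.
swahili_review = [0.75, 0.3]
label = predict(model, swahili_review)
print(label)
```

The classifier works on the target language only to the extent that the encoder places equivalent meanings close together, which is why alignment quality bounds transfer quality.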

Pitfalls

Languages with little training data may align poorly. High-resource languages can dominate the space, and scripts or domains far from the training data degrade quality, so per-language evaluation still matters.
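One way to make the per-language point concrete is to break a retrieval metric down by language instead of reporting a single average. The sketch below assumes each evaluation result is a `(language, is_correct)` pair; the data is invented to show how an aggregate can hide a weak language.

```python
from collections import defaultdict

def per_language_accuracy(results):
    """Compute accuracy separately for each language code.

    results is an iterable of (language, is_correct) pairs.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lang, ok in results:
        totals[lang] += 1
        hits[lang] += int(ok)
    return {lang: hits[lang] / totals[lang] for lang in totals}

# Made-up results (assumption): whether the top retrieved document
# was the correct one for each test query.
results = [
    ("de", True), ("de", True), ("de", True), ("de", False),
    ("sw", True), ("sw", False), ("sw", False), ("sw", False),
]

scores = per_language_accuracy(results)
overall = sum(ok for _, ok in results) / len(results)
print(scores, overall)
```

In this invented data the overall accuracy sits halfway between the two languages, so a single number would not reveal that one language performs far worse than the other.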

Key idea

Multilingual embeddings align many languages into one space using a shared vocabulary and parallel data, enabling cross-lingual search and zero-shot transfer, though low-resource languages can still align weakly.

Check yourself

1. What is the goal of a multilingual embedding space?

2. Which signal most directly teaches cross-lingual alignment?