Machine Learning

Language Detection

Guessing which natural language a piece of text is written in.

The task

Language detection takes a string and returns its language, such as English, French, or Japanese. It runs before translation, search, and content filtering so each pipeline applies the right model.

A simple strong baseline

Languages have distinctive character n-gram patterns. Counting short letter sequences captures the fingerprint of a language.

  • Build a profile of common three- and four-character sequences per language.
  • Compare a new text profile to each language profile.
  • Pick the closest match.

This works even on short snippets and needs no grammar rules.
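
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the three one-sentence training samples, the cosine similarity measure, and the names `ngram_profile`, `cosine`, `SAMPLES`, and `detect` are all assumptions made for the example. A real system would build its profiles from much larger corpora.

```python
from collections import Counter
import math


def ngram_profile(text, sizes=(3, 4)):
    """Count character n-grams of the given sizes in normalized text."""
    # Lowercase and collapse whitespace so spacing does not skew counts.
    text = " ".join(text.lower().split())
    counts = Counter()
    for n in sizes:
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # text shorter than the smallest n-gram size
    return dot / (norm_a * norm_b)


# Toy training data (assumption): one sentence per language.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog "
               "and the cat sleeps in the house",
    "french": "le renard brun saute par-dessus le chien paresseux "
              "et le chat dort dans la maison",
    "german": "der schnelle braune fuchs springt über den faulen hund "
              "und die katze schläft im haus",
}
PROFILES = {lang: ngram_profile(text) for lang, text in SAMPLES.items()}


def detect(text):
    """Return the closest language and the similarity score per language."""
    profile = ngram_profile(text)
    scores = {lang: cosine(profile, p) for lang, p in PROFILES.items()}
    return max(scores, key=scores.get), scores


label, scores = detect("le chat dort dans la maison")
print(label, scores)
```

Cosine similarity is only one way to compare profiles; rank-based distances over the most frequent n-grams are another common choice.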

Practical challenges

  • Very short text, such as a single word, gives weak evidence and raises the error rate.
  • Code-switching mixes two languages in one message, so no single label is correct.
  • Closely related languages that share a script, such as Spanish and Portuguese, have overlapping n-gram profiles and are easy to confuse.

Confidence and fallbacks

Good detectors return a probability, not just a label. A low confidence score can trigger a fallback such as asking the user or defaulting to a safe language.
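
A fallback rule of this kind might look like the sketch below. It assumes the detector supplies a non-negative score per language; dividing the top score by the total is only a rough stand-in for a calibrated probability. The function name `pick_language`, the 0.6 threshold, and the `"en"` default are illustrative choices, not values from any particular library.

```python
def pick_language(scores, threshold=0.6, fallback="en"):
    """Return (label, confidence), using the fallback when confidence is low.

    `scores` maps language codes to non-negative scores. The confidence is
    the top score's share of the total, a crude proxy for a probability.
    """
    total = sum(scores.values())
    if total <= 0:
        return fallback, 0.0  # no evidence at all
    best = max(scores, key=scores.get)
    confidence = scores[best] / total
    if confidence < threshold:
        return fallback, confidence  # too uncertain: use the safe default
    return best, confidence


print(pick_language({"en": 0.9, "fr": 0.1}))   # confident: keeps "en"
print(pick_language({"es": 0.5, "pt": 0.5}))   # ambiguous: falls back
```

In an interactive product, the low-confidence branch could prompt the user instead of returning a default.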

Key idea

Language detection compares a text's character n-gram fingerprint to per-language profiles. It works on short snippets without grammar rules but struggles with very short input, mixed languages, and closely related languages that share a script.

Check yourself

1. What signal makes a strong baseline for language detection?

2. Which case most often breaks single-label detection?