Machine Learning

N-Gram Language Models

Predicting the next word from the previous few.

An n-gram language model estimates the probability of a word from the previous few words. Rather than conditioning on the whole history, it makes a simplifying bet called the Markov assumption: only the last n - 1 words matter.
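
Written as a formula, the Markov assumption looks roughly like this. The notation is an assumption for illustration and not part of the lesson itself: w_i stands for the i-th word and n for the model order.

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

The left side conditions on the whole history; the right side keeps only the last n - 1 words.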

A bigram model uses one word of context, a trigram uses two, and so on. To estimate the probability of the next word, you count how often a given n-gram appeared in a training corpus and divide by the count of its prefix, meaning its first n - 1 words.
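
The counting recipe can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function names and the toy corpus are hypothetical, and the prefix is counted only where a next word follows it, which is one common convention.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def mle_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    context = tuple(context)
    grams = ngrams(tokens, len(context) + 1)
    ngram_counts = Counter(grams)
    # Count each prefix only where it is followed by a word (assumed convention).
    prefix_counts = Counter(gram[:-1] for gram in grams)
    if prefix_counts[context] == 0:
        return 0.0  # context never seen in training; this sketch returns 0
    return ngram_counts[context + (word,)] / prefix_counts[context]


# Hypothetical toy corpus, already split into words.
corpus = "the cat sat on the mat the cat ran".split()

bigram_estimate = mle_probability(corpus, ("the",), "cat")
trigram_estimate = mle_probability(corpus, ("the", "cat"), "sat")
unseen_estimate = mle_probability(corpus, ("the",), "dog")
```

In this toy corpus "the" is followed by "cat" in 2 of its 3 occurrences, so the bigram estimate is 2/3. The trigram estimate for "sat" after "the cat" is 1/2, and the unseen pair "the dog" gets exactly zero.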

This gives a fast, interpretable model. It powered early autocomplete, spelling correction, and speech recognition for decades.

Two problems push back. The first is sparsity. As n grows, the number of possible n-grams grows exponentially (V^n for a vocabulary of V words), so many never appear in training and receive a probability of zero. The cure is smoothing, which shaves probability from seen events and hands it to unseen ones so nothing is impossible.
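
One of the simplest smoothing schemes is add-one (Laplace) smoothing, sketched here for bigrams. It is only one option among several, picked for brevity; the function name, the parameter k, and the toy corpus are assumptions for illustration.

```python
from collections import Counter


def add_k_bigram_probability(tokens, prev_word, word, k=1.0):
    """Add-k estimate: (count(prev, word) + k) / (count(prev) + k * V)."""
    vocab_size = len(set(tokens))
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count prev_word only where a next word follows it.
    context_counts = Counter(tokens[:-1])
    numerator = bigram_counts[(prev_word, word)] + k
    denominator = context_counts[prev_word] + k * vocab_size
    return numerator / denominator


# Hypothetical toy corpus with a vocabulary of 6 distinct words.
smoothing_corpus = "the cat sat on the mat the cat ran".split()

seen = add_k_bigram_probability(smoothing_corpus, "the", "cat")
unseen = add_k_bigram_probability(smoothing_corpus, "the", "ran")
total = sum(
    add_k_bigram_probability(smoothing_corpus, "the", w)
    for w in set(smoothing_corpus)
)
```

Here the seen pair "the cat" drops from 2/3 to 3/9, while the unseen pair "the ran" rises from zero to 1/9. The probabilities over the vocabulary still sum to 1.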

The other limit is short memory. A trigram model sees only the two preceding words, so it cannot connect a word to context many sentences earlier, which is exactly where neural models later won.
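
A small sketch makes the short-memory limit concrete. The helper name and both example sentences are hypothetical. The two histories differ in their subject ("keys" versus "key"), but a trigram model sees the same two-word context for both.

```python
def trigram_context(history):
    """Return the only part of the history a trigram model sees: the last two words."""
    return tuple(history[-2:])


plural_history = "the keys to the cabinet that she bought".split()
singular_history = "the key to the cabinet that she bought".split()

plural_context = trigram_context(plural_history)
singular_context = trigram_context(singular_history)
same_context = plural_context == singular_context
```

Both contexts come out as ("she", "bought"), so the model must assign the same next-word probabilities in both cases. It has no way to prefer "are" after the first history and "is" after the second.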

Still, n-grams are a clean introduction to the core idea that language has statistical structure you can learn by counting.

Key idea

N-gram models predict the next word from the previous n - 1 words using corpus counts, with smoothing to handle unseen sequences.

Check yourself

1. What does the Markov assumption claim?

2. Why is smoothing needed?

3. A trigram model conditions on how many previous words?