N-Gram Language Models

An n-gram language model estimates the probability of a word from the few words that precede it. Rather than condition on the whole history, it makes a simplifying bet called the Markov assumption: only the last n - 1 words matter.
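In symbols, the assumption can be sketched as follows. The notation is introduced here for illustration and is not from the text above: w_i stands for the i-th word of a sequence.

```latex
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
```

For a bigram model (n = 2) the right-hand side reduces to P(w_i | w_{i-1}).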

A bigram model uses one word of context, a trigram model uses two, and so on. To estimate the probability of the next word given its context, count how often the full n-gram (context plus next word) appears in a training corpus and divide by the count of its (n - 1)-word prefix.
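The counting recipe can be illustrated with a minimal bigram sketch. The function name train_bigram_mle and the toy corpus are hypothetical, chosen only for this example; a real system would also need tokenization and sentence-boundary handling, which are left out here.

```python
from collections import Counter


def train_bigram_mle(tokens):
    """Return prob(word, prev), the count-based bigram estimate.

    Hypothetical helper for illustration: count(prev, word) / count(prev).
    """
    # Count every adjacent pair, and every word that starts a pair.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    prefix_counts = Counter(tokens[:-1])

    def prob(word, prev):
        if prefix_counts[prev] == 0:
            # The prefix was never seen, so the ratio is undefined.
            # Returning 0.0 is a choice made for this sketch.
            return 0.0
        return bigram_counts[(prev, word)] / prefix_counts[prev]

    return prob


# Toy corpus, assumed already tokenized on whitespace.
corpus = "the cat sat on the mat the cat ran".split()
p = train_bigram_mle(corpus)

print(p("cat", "the"))  # "the cat" occurs 2 times, "the" is a prefix 3 times
print(p("dog", "the"))  # "the dog" never occurs, so the estimate is zero
```

The second call shows how an unseen pair gets probability zero even though it is perfectly plausible English.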

This gives a fast, interpretable model. It powered early autocomplete, spelling correction, and speech recognition for decades.

Two problems push back. The first is sparsity: as n grows, the number of possible n-grams explodes, so many valid sequences never appear in training and receive a probability of zero. The cure is smoothing, which shaves probability from seen events and hands it to unseen ones so that nothing is impossible.
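One simple way to do this is add-one (Laplace) smoothing, used here only as an illustration; the text does not commit to a particular smoothing method, and stronger ones exist. The sketch below assumes a bigram model over a toy corpus, and the name train_bigram_add_one is hypothetical.

```python
from collections import Counter


def train_bigram_add_one(tokens):
    """Return (prob, vocab) for an add-one smoothed bigram model.

    Hypothetical helper for illustration. Every bigram count is raised
    by 1, and the denominator grows by the vocabulary size so that the
    probabilities for each prefix still sum to 1 over the vocabulary.
    """
    vocab = set(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    prefix_counts = Counter(tokens[:-1])
    vocab_size = len(vocab)

    def prob(word, prev):
        # Assumes `word` is in the training vocabulary.
        return (bigram_counts[(prev, word)] + 1) / (prefix_counts[prev] + vocab_size)

    return prob, vocab


# Toy corpus, assumed already tokenized on whitespace.
corpus = "the cat sat on the mat the cat ran".split()
p_smooth, vocab = train_bigram_add_one(corpus)

print(p_smooth("cat", "the"))  # seen pair: (2 + 1) / (3 + 6)
print(p_smooth("ran", "the"))  # unseen pair: (0 + 1) / (3 + 6), no longer zero
```

The seen pair "the cat" drops from 2/3 to 1/3, and that mass is spread across the pairs that never appeared in training.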

The second is short memory. A trigram model cannot connect a word to context many sentences earlier, which is exactly where neural models later won.

Still, n-grams are a clean introduction to the core idea that language has statistical structure you can learn by counting.

Key idea

N-gram models predict the next word from the previous n - 1 words using corpus counts, with smoothing to handle unseen sequences.
