Machine Learning

Topic Modeling with LDA

Discovering latent themes as distributions over words and documents.

5 min read · advanced

What topic modeling does

Topic modeling discovers hidden themes in a collection of documents without labels. Latent Dirichlet allocation, or LDA, is the classic probabilistic method.

The generative story

LDA imagines each document was written by a simple process.

  • Each topic is a distribution over words, like a soft cluster of co-occurring words.
  • Each document is a mixture of topics with its own proportions.
  • Each word is drawn by first picking a topic for that slot, then a word from that topic.
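
The three steps above can be sketched as a small sampler. This is a minimal illustration using only the Python standard library, not a reference implementation: the vocabulary, the two-topic setup, the document length, and the Dirichlet concentration of 0.5 are all assumptions chosen for the example.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw one probability vector from a symmetric Dirichlet prior."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [g / total for g in draws]


def generate_document(rng, vocab, topic_word, doc_topic, length):
    """Fill each word slot by picking a topic, then a word from that topic."""
    words = []
    for _ in range(length):
        topic = rng.choices(range(len(doc_topic)), weights=doc_topic)[0]
        word = rng.choices(vocab, weights=topic_word[topic])[0]
        words.append(word)
    return words


rng = random.Random(0)
vocab = ["goal", "team", "match", "vote", "party", "election"]  # toy vocabulary
n_topics = 2  # assumed for the example

# Each topic is a distribution over the vocabulary.
topic_word = [sample_dirichlet(rng, 0.5, len(vocab)) for _ in range(n_topics)]
# The document has its own topic proportions.
doc_topic = sample_dirichlet(rng, 0.5, n_topics)

document = generate_document(rng, vocab, topic_word, doc_topic, length=8)
print(doc_topic)
print(document)
```

A small concentration value tends to produce peaked distributions, so a document leans on few topics and a topic leans on few words.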

Learning runs this story in reverse to infer the hidden topics and mixtures that best explain the observed words.
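
The lesson does not say which inference method is used. One common choice is collapsed Gibbs sampling, sketched below under that assumption. Documents are lists of integer word ids, and the function name `gibbs_lda`, the priors `alpha` and `beta`, and the iteration count are illustrative choices, not values from the lesson.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over documents of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]               # per-document topic counts
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # per-topic word counts
    topic_total = [0] * n_topics                             # words assigned to each topic

    # Start from a random topic assignment for every word slot.
    assignments = []
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample in proportion to how much the document uses each
                # topic and how much each topic uses this word.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + beta * vocab_size)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Smoothed estimates of the document mixtures and topic distributions.
    theta = [
        [(c + alpha) / (len(doc) + alpha * n_topics) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(c + beta) / (topic_total[k] + beta * vocab_size) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi


# Toy corpus: word ids 0-2 and 3-5 never share a document.
docs = [[0, 1, 2, 0, 1], [1, 2, 0, 2], [3, 4, 5, 3], [4, 5, 3, 5, 4]]
theta, phi = gibbs_lda(docs, n_topics=2, vocab_size=6)
```

Here `theta` holds one topic mixture per document and `phi` one word distribution per topic. Variational inference is another widely used option and would return the same kinds of quantities.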

What you get out

  • A list of topics, each shown by its highest-probability words.
  • A topic mixture per document, useful for clustering, search, and trend tracking.
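
To make the two outputs concrete, the sketch below reads them off a pair of matrices. The numbers are hypothetical, written by hand for illustration rather than produced by a fitted model, and the helper name `top_words` is an assumption.

```python
vocab = ["goal", "team", "match", "vote", "party", "election"]

# Hypothetical fitted values, for illustration only.
# One row per topic: a probability for every vocabulary word.
topic_word = [
    [0.30, 0.28, 0.25, 0.07, 0.05, 0.05],
    [0.05, 0.06, 0.04, 0.28, 0.25, 0.32],
]
# One row per document: that document's topic proportions.
doc_topic = [
    [0.90, 0.10],
    [0.20, 0.80],
    [0.55, 0.45],
]


def top_words(word_probs, vocab, n=3):
    """Return the n highest-probability words for one topic."""
    ranked = sorted(range(len(vocab)), key=lambda i: word_probs[i], reverse=True)
    return [vocab[i] for i in ranked[:n]]


topics = [top_words(row, vocab) for row in topic_word]
# Index of the heaviest topic per document, a simple basis for clustering.
dominant = [max(range(len(mix)), key=lambda k: mix[k]) for mix in doc_topic]

print(topics)    # [['goal', 'team', 'match'], ['election', 'vote', 'party']]
print(dominant)  # [0, 1, 0]
```

The third document is close to an even split, which a hard assignment to its dominant topic hides. Keeping the full mixture is usually more informative for search and trend tracking.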

Practical notes

  • You must choose the number of topics, a key hyperparameter that shapes the result.
  • Topics are unlabeled, so a human reads the top words to name each one.
  • Coherence scores measure how often a topic's top words co-occur in the corpus, which helps when picking a topic count.
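
Several coherence measures exist, and the lesson does not name one. As an illustration, the sketch below computes a UMass-style score from document co-occurrence counts. The function name `umass_coherence` and the toy corpus are assumptions, and the code assumes every top word appears in at least one document.

```python
import math
from itertools import combinations


def umass_coherence(top_words, docs):
    """UMass-style coherence: log co-occurrence ratios over top-word pairs.

    top_words must be ordered from most to least probable, and each word
    must occur in at least one document.
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents containing all the given words.
        return sum(all(w in s for w in words) for s in doc_sets)

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


docs = [
    ["goal", "team", "match"],
    ["team", "match", "goal"],
    ["vote", "party", "election"],
    ["election", "vote", "party"],
]

coherent = umass_coherence(["goal", "team", "match"], docs)
mixed = umass_coherence(["goal", "vote", "team"], docs)
print(coherent > mixed)  # True
```

To compare candidate topic counts, one option is to fit a model for each count, average this score over its topics, and prefer counts where the average is higher and the top words still read sensibly to a human.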

Key idea

LDA models each document as a mixture of topics and each topic as a distribution over words, inferring both from raw text, while you choose the topic count and judge results with coherence.

Check yourself

1. In LDA, what is a topic?

2. How is a document represented in LDA?

3. What key hyperparameter must you set for LDA?