What topic modeling does
Topic modeling discovers hidden themes in a collection of unlabeled documents. Latent Dirichlet allocation, or LDA, is the classic probabilistic method.
The generative story
LDA imagines each document was written by a simple process.
- Each topic is a probability distribution over words, like a soft cluster of co-occurring words.
- Each document is a mixture of topics with its own proportions.
- Each word is drawn by first picking a topic for that slot, then a word from that topic.
Learning runs this story in reverse to infer the hidden topics and mixtures that best explain the observed words.
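The forward direction of this story can be sketched in a few lines of Python. This is a minimal illustration, not a full LDA implementation: the vocabulary, the two hand-written topics, the `alpha` value, and the helper names `sample_dirichlet` and `generate_document` are all assumptions made for the example. In full LDA the topics themselves are also drawn from a Dirichlet prior rather than fixed by hand.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(topics, vocab, length, alpha, rng):
    """Generate one document by following the LDA generative story."""
    num_topics = len(topics)
    # Document level: this document's own topic proportions.
    theta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(length):
        # Word level, step 1: pick a topic for this slot.
        z = rng.choices(range(num_topics), weights=theta)[0]
        # Word level, step 2: pick a word from that topic.
        words.append(rng.choices(vocab, weights=topics[z])[0])
    return theta, words

rng = random.Random(0)
# Hypothetical vocabulary and topics, chosen only for illustration.
vocab = ["goal", "match", "team", "vote", "party", "election"]
topics = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # a "sports"-like topic
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # a "politics"-like topic
]
theta, words = generate_document(topics, vocab, length=8, alpha=0.5, rng=rng)
print("topic proportions:", theta)
print("generated words:", words)
```

Inference has to recover `theta` and `topics` when only `words` is observed.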
What you get out
- A list of topics, each summarized by its highest-probability words.
- A topic mixture per document, useful for clustering, search, and trend tracking.
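To make these two outputs concrete, here is a small sketch that fits LDA with collapsed Gibbs sampling, one common inference approach. The text above does not say which inference method is used, so treat this choice, the function name `fit_lda_gibbs`, the prior values `alpha` and `beta`, and the toy corpus as assumptions. It is meant for reading, not for real workloads.

```python
import random

def fit_lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # Count tables: topics per document, words per topic, tokens per topic.
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    # Start from a random topic assignment for every token.
    assignments = []
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        assignments.append(zs)
        for w, z in zip(doc, zs):
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove this token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample: (topic weight in doc) * (word weight in topic).
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1

    # Output 1: the highest-count words of each topic.
    top_words = [
        sorted(vocab, key=lambda w: topic_word[k][w], reverse=True)[:3]
        for k in range(num_topics)
    ]
    # Output 2: smoothed topic proportions for each document.
    mixtures = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in row]
        for row, doc in zip(doc_topic, docs)
    ]
    return top_words, mixtures

# Hypothetical toy corpus, already tokenized.
docs = [
    ["goal", "match", "team", "goal"],
    ["team", "match", "goal"],
    ["vote", "party", "election", "vote"],
    ["election", "party", "vote"],
]
top_words, mixtures = fit_lda_gibbs(docs, num_topics=2)
for k, words in enumerate(top_words):
    print(f"topic {k}: {words}")
for d, mix in enumerate(mixtures):
    print(f"doc {d}: {[round(p, 2) for p in mix]}")
```

Each row of `mixtures` sums to 1, so rows can be compared directly for clustering or tracked over time. Topic indices are arbitrary and may change with the random seed.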
Practical notes
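- You must choose the number of topics in advance. This hyperparameter strongly shapes the result: too few topics merge distinct themes, while too many split one theme into near-duplicates.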
- Topics are unlabeled, so a human reads the top words to name each one.
- Coherence scores measure whether a topic's top words actually co-occur in documents, which helps when comparing candidate topic counts.
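As an illustration of the coherence idea, the sketch below scores a topic by how often pairs of its top words appear in the same document, in the style of the UMass coherence measure. Several coherence measures exist and the text above does not name one, so this particular formula, the function name `umass_coherence`, and the toy corpus are assumptions. Higher (less negative) scores indicate top words that co-occur more often.

```python
import math

def umass_coherence(top_words, docs):
    """UMass-style coherence: sum of log co-document frequencies over word pairs.

    top_words should be ordered from most to least probable, and every
    word must appear in at least one document (otherwise the ratio
    divides by zero).
    """
    doc_sets = [set(doc) for doc in docs]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for i in range(1, len(top_words)):
        for j in range(i):
            later, earlier = top_words[i], top_words[j]
            # The +1 keeps the logarithm defined when a pair never co-occurs.
            score += math.log((doc_freq(later, earlier) + 1) / doc_freq(earlier))
    return score

# Hypothetical toy corpus, already tokenized.
docs = [
    ["goal", "match", "team"],
    ["team", "match", "goal", "goal"],
    ["vote", "party", "election"],
    ["election", "vote", "party"],
]
coherent_topic = ["goal", "match", "team"]  # words that share documents
mixed_topic = ["goal", "vote", "team"]      # words from unrelated themes

print(umass_coherence(coherent_topic, docs))
print(umass_coherence(mixed_topic, docs))
```

To use a score like this for choosing the topic count, one would fit models at several counts, average the coherence over each model's topics, and compare the averages alongside a human read of the top words.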
Key idea
LDA models each document as a mixture of topics and each topic as a distribution over words, and it infers both from raw text. You choose the topic count and judge the results with coherence scores.