Machine Learning

Keyword Extraction

Pulling the most representative words and phrases from a document.

4 min read · core

The task

Keyword extraction finds the words and phrases that best capture what a document is about. The output feeds search indexing, tagging, and quick previews.

Statistical methods

  • TF-IDF rewards terms that are frequent in this document but rare across the collection, so common words score low and distinctive ones score high.
  • TF-IDF needs a background corpus to know which terms are rare, and it scores single terms well but phrases less reliably.
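
To make the scoring concrete, here is a minimal sketch of TF-IDF in plain Python. It assumes the document is one of the documents in the background corpus (so every term has a document frequency of at least 1) and uses the unsmoothed form tf × log(N / df). The function name `tf_idf` and the toy corpus are illustrative only, and real libraries typically apply smoothing and normalization that this sketch leaves out.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each term of doc_tokens against corpus (a list of token lists).

    Assumes doc_tokens is itself one of the documents in corpus,
    so every term appears in at least one document.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in counts.items():
        tf = count / len(doc_tokens)                   # frequency in this document
        df = sum(1 for d in corpus if term in d)       # documents containing the term
        scores[term] = tf * math.log(n_docs / df)      # rare across corpus -> larger idf
    return scores

# Toy background corpus (illustrative only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the neural network learned the task".split(),
]
scores = tf_idf(corpus[2], corpus)
top = sorted(scores, key=scores.get, reverse=True)
print(top)
```

In this toy corpus "the" appears in every document, so its idf is log(1) = 0 and it drops to the bottom of the ranking even though it is the most frequent word in the document.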

Graph-based methods

  • TextRank builds a graph in which words are linked when they appear near each other, then ranks the words by centrality, much as PageRank ranks web pages.
  • It needs no background corpus, so it works on a single document in isolation.
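
The graph idea can be sketched in a few lines. This is a simplified, unweighted variant rather than a faithful TextRank implementation: it assumes stop words were already filtered out, links words that fall within a fixed window of each other, and runs a fixed number of PageRank-style updates. The name `textrank` and the parameters `window`, `damping`, and `iterations` are choices made for this sketch.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by centrality in an undirected co-occurrence graph."""
    # Link each word to the distinct words that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    # PageRank-style updates: each word shares its score equally among its neighbors.
    score = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        score = {
            word: (1 - damping)
            + damping * sum(score[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return score

# Content words only; "graph" co-occurs with every other word.
tokens = ["graph", "ranking", "graph", "centrality",
          "graph", "keyword", "graph", "phrase"]
scores = textrank(tokens, window=1)
best = max(scores, key=scores.get)
print(best)
```

No second document is involved anywhere: the ranking comes entirely from how words sit next to each other in the one token list.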

Phrases over single words

Good keywords are often multi-word, such as "machine translation". Systems merge adjacent high-scoring words into candidate phrases and rank those, since a phrase carries more meaning than its parts.
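
One way to sketch the merging step: keep the top-scoring words, join runs of kept words that are adjacent in the text, and score each phrase. The word scores below are made-up numbers for illustration, and scoring a phrase by the sum of its word scores is an assumption of this sketch (it favors longer phrases; other systems normalize differently). `merge_phrases` and `top_k` are hypothetical names.

```python
def merge_phrases(tokens, scores, top_k=3):
    """Merge runs of adjacent top-scoring words into ranked candidate phrases."""
    keep = set(sorted(scores, key=scores.get, reverse=True)[:top_k])
    phrases = {}
    run = []
    for tok in list(tokens) + [None]:   # None is a sentinel that flushes the last run
        if tok in keep:
            run.append(tok)
        else:
            if run:
                # Assumption: a phrase scores the sum of its word scores.
                phrases[" ".join(run)] = sum(scores[w] for w in run)
            run = []
    return sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)

tokens = ("statistical machine translation improves when "
          "machine translation systems use more data").split()
# Illustrative word scores, e.g. from TF-IDF or graph centrality.
word_scores = {"machine": 0.9, "translation": 0.8, "data": 0.5,
               "systems": 0.4, "statistical": 0.3, "improves": 0.1}
ranked = merge_phrases(tokens, word_scores)
print(ranked)
```

Here "machine" and "translation" are both kept and always appear side by side, so they surface as the single candidate "machine translation" rather than as two separate keywords.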

Evaluation

Compare extracted keywords to human-assigned ones using precision and recall. Disagreement is common because keywording is subjective, so loose matching (for example, accepting partial or stemmed matches) is often used.
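
A small sketch of the comparison, with exact and loose matching side by side. The loose rule used here (two keywords match if they share at least one lowercased word) is just one possible relaxation chosen for illustration. The keyword lists are invented, and `precision_recall` and `loose` are hypothetical helper names.

```python
def precision_recall(predicted, gold, match=None):
    """Precision and recall of predicted keywords against a gold set."""
    if match is None:
        match = lambda p, g: p == g      # exact string match by default
    hit_pred = sum(1 for p in predicted if any(match(p, g) for g in gold))
    hit_gold = sum(1 for g in gold if any(match(p, g) for p in predicted))
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    return precision, recall

def loose(p, g):
    """Assumed loose rule: the two keywords share at least one word."""
    return bool(set(p.lower().split()) & set(g.lower().split()))

predicted = ["machine translation", "neural network", "data"]
gold = ["machine translation", "neural networks", "evaluation"]

exact = precision_recall(predicted, gold)
relaxed = precision_recall(predicted, gold, loose)
print(exact, relaxed)
```

Under exact matching "neural network" and "neural networks" count as a miss on both sides. Under the loose rule they match, which raises both precision and recall for the same system output.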

Key idea

Keyword extraction ranks terms by distinctiveness using TF-IDF or graph centrality, merges adjacent winners into phrases, and is judged against subjective human keyword sets.

Check yourself


1. What does TF-IDF reward?

2. What advantage does TextRank have over TF-IDF?