The task
Keyword extraction finds the words and phrases that best capture what a document is about. The output feeds search indexing, tagging, and quick previews.
Statistical methods
- TF-IDF rewards terms that are frequent in this document but rare across the collection, so common words score low and distinctive ones score high.
- It needs a background corpus to establish which terms are rare, and it scores single terms better than phrases.
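The scoring described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name tf_idf, the toy corpus, and the unsmoothed log(N / df) weighting are assumptions chosen for clarity, and libraries typically apply smoothing and normalization on top.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, corpus):
    """Score each term of one document against a background corpus.

    corpus is a list of token lists and is assumed to include the document.
    """
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in corpus:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    scores = {}
    for term, count in tf.items():
        # Unsmoothed IDF (an assumption); max() guards against unseen terms.
        idf = math.log(n_docs / max(df[term], 1))
        scores[term] = (count / total) * idf
    return scores

# Toy background corpus, made up for this example.
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the neural model translated the sentence".split(),
]
scores = tf_idf(corpus[2], corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Here "the" appears in every document, so its IDF is log(1) = 0 and it drops to the bottom even though it is the most frequent token in the document.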
Graph-based methods
- TextRank builds a graph in which words are connected when they appear near each other, then ranks them by centrality, much as PageRank ranks web pages.
- It needs no background corpus, so it works on a single document.
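A rough sketch of the graph-ranking idea follows. It is a simplified illustration under stated assumptions: the function name textrank, the window size, the damping factor of 0.85, and the fixed iteration count are illustrative choices, and the candidate filtering (for example by part of speech or stopword list) that practical systems apply is omitted.

```python
from collections import defaultdict

def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank words by PageRank-style centrality on a co-occurrence graph."""
    # Undirected graph: link two different words if they occur
    # within `window` positions of each other.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        new_scores = {}
        for w in neighbors:
            incoming = sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            new_scores[w] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores

tokens = "graph ranking uses graph centrality and graph structure".split()
word_scores = textrank(tokens)
top_word = max(word_scores, key=word_scores.get)
```

In this toy input "graph" co-occurs with every other word, so it ends up the most central node. No second document is consulted at any point.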
Phrases over single words
Good keywords are often multi-word, such as "machine translation". Systems merge adjacent high-scoring words into candidate phrases and rank those, since a phrase carries more meaning than its parts.
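One way to picture the merging step is below. This is a sketch only: the function name merge_phrases, the top_k cutoff, the hand-written word scores, and scoring a phrase as the sum of its word scores are all assumptions for illustration. Summing favors longer phrases, and other choices such as averaging are equally plausible.

```python
def merge_phrases(tokens, scores, top_k=3):
    """Merge runs of adjacent high-scoring words into ranked candidate phrases."""
    # Keep only the top_k highest-scoring words as phrase building blocks.
    keep = set(sorted(scores, key=scores.get, reverse=True)[:top_k])
    phrases = {}
    current = []
    # The trailing None is a sentinel that flushes the final run.
    for tok in list(tokens) + [None]:
        if tok in keep:
            current.append(tok)
        elif current:
            phrase = " ".join(current)
            # Phrase score = sum of member word scores (an assumption).
            phrases[phrase] = sum(scores[w] for w in current)
            current = []
    return sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical word scores, as if produced by an earlier ranking step.
tokens = "machine translation systems use machine learning".split()
scores = {"machine": 0.9, "translation": 0.8, "systems": 0.3,
          "use": 0.1, "learning": 0.7}
candidates = merge_phrases(tokens, scores, top_k=3)
phrase_names = [phrase for phrase, _ in candidates]
```

With these scores the kept words are "machine", "translation", and "learning", so the two adjacent runs become the candidates "machine translation" and "machine learning", in that order.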
Evaluation
Compare extracted keywords to human-assigned ones using precision and recall. Annotators often disagree because keywording is subjective, so loose matching is often used in place of exact string equality.
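The comparison can be sketched as follows. The helper names, the example keyword lists, and the definition of a loose match as "shares at least one word" are assumptions made for this illustration. Evaluations in practice may instead match on stemmed forms or other criteria.

```python
def normalize(phrase):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(phrase.lower().split())

def precision_recall(extracted, gold, loose=False):
    """Compare extracted keywords to a human-assigned (gold) set."""
    ext = [normalize(p) for p in extracted]
    ref = [normalize(p) for p in gold]

    def match(a, b):
        if a == b:
            return True
        # Loose match (an assumption): the phrases share at least one word.
        return loose and bool(set(a.split()) & set(b.split()))

    # Precision: share of extracted keywords that match some gold keyword.
    hits_ext = sum(any(match(e, g) for g in ref) for e in ext)
    # Recall: share of gold keywords matched by some extracted keyword.
    hits_ref = sum(any(match(e, g) for e in ext) for g in ref)
    precision = hits_ext / len(ext) if ext else 0.0
    recall = hits_ref / len(ref) if ref else 0.0
    return precision, recall

# Made-up keyword lists for illustration.
extracted = ["machine translation", "neural network", "corpus"]
gold = ["Machine Translation", "neural networks", "evaluation"]
exact = precision_recall(extracted, gold)
relaxed = precision_recall(extracted, gold, loose=True)
```

Under exact matching only "machine translation" counts, giving precision and recall of 1/3 each. Loose matching also accepts "neural network" against "neural networks", raising both to 2/3.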
Key idea
Keyword extraction ranks terms by distinctiveness using TF-IDF or graph centrality, merges adjacent winners into phrases, and is judged against subjective human keyword sets.