Text Classification Basics
Text classification assigns a document to one or more predefined categories, such as spam versus not spam or a news topic. It is one of the most common and useful NLP tasks.
The basic recipe
- Represent the text as numeric features using bag-of-words counts, TF-IDF weights, or embeddings.
- Train a classifier such as naive Bayes, logistic regression, or a neural network on labeled examples.
- Predict the category for new documents.
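The three steps above can be sketched end to end with a tiny multinomial naive Bayes classifier over bag-of-words counts. This is a minimal illustration using only the standard library, not a production implementation: the toy spam/ham messages, the whitespace tokenizer, the add-one smoothing, and the function names (tokenize, train_naive_bayes, predict) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Step 1 (represent): lowercase and split on whitespace, a crude bag of words."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Step 2 (train): count documents per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for doc, label in zip(docs, labels):
        tokens = tokenize(doc)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, doc):
    """Step 3 (predict): return the class with the highest log-probability score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(doc):
            if token not in vocab:
                continue  # words never seen in training are ignored
            # Add-one smoothing keeps unseen (word, class) pairs from zeroing the score.
            score += math.log((word_counts[label][token] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set, invented for illustration only.
train_docs = [
    "win a free prize now",
    "free money claim your prize",
    "cheap loans win cash now",
    "meeting agenda for monday",
    "please review the project report",
    "lunch with the team on monday",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

nb_model = train_naive_bayes(train_docs, train_labels)
print(predict(nb_model, "free prize now"))
print(predict(nb_model, "project meeting on monday"))
```

Working in log space avoids numeric underflow when many small probabilities are multiplied together, which matters once documents are longer than a few words.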
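A strong baseline is TF-IDF features with a linear classifier such as logistic regression, which is fast to train and surprisingly hard to beat on many tasks. Naive Bayes is even simpler. It assumes that features are independent given the class, and it often works well when that assumption holds only roughly.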
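One way to build such a baseline is with scikit-learn, assuming that library is installed. The sketch below chains TfidfVectorizer and LogisticRegression with their default settings. The toy messages and the variable names are made up for illustration, and a real task would need far more labeled data and a held-out test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set, invented for illustration only.
train_texts = [
    "win a free prize now",
    "free money claim your prize",
    "cheap loans win cash now",
    "meeting agenda for monday",
    "please review the project report",
    "lunch with the team on monday",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# The pipeline turns raw text into TF-IDF vectors, then fits a linear classifier on them.
baseline = make_pipeline(TfidfVectorizer(), LogisticRegression())
baseline.fit(train_texts, train_labels)

# New documents pass through the same vectorizer before classification.
new_docs = ["claim your free cash prize", "project meeting on monday"]
predictions = baseline.predict(new_docs)
for doc, label in zip(new_docs, predictions):
    print(f"{label}: {doc}")
```

Keeping the vectorizer and the classifier in one pipeline means the vocabulary and IDF weights are learned only from the training texts and reused unchanged at prediction time.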
Evaluation
Accuracy alone can mislead when classes are imbalanced: a classifier that always guesses the majority class can score high accuracy while never finding a single minority-class example. Better measures are precision, which asks what fraction of predicted positives are actually positive, and recall, which asks what fraction of actual positives the classifier found. The F1 score is their harmonic mean.
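A small worked example makes the gap between accuracy and these measures concrete. The sketch below uses an invented dataset of 95 "ham" and 5 "spam" messages. The helper name precision_recall_f1, and the convention of returning 0.0 when a ratio is undefined, are assumptions of this example rather than a fixed standard.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Invented imbalanced data: 95 ham followed by 5 spam.
y_true = ["ham"] * 95 + ["spam"] * 5

# Classifier A always guesses the majority class.
always_ham = ["ham"] * 100

# Classifier B raises 2 false alarms and catches 4 of the 5 spam messages.
some_spam = ["ham"] * 93 + ["spam"] * 2 + ["spam"] * 4 + ["ham"] * 1

print(accuracy(y_true, always_ham))                      # 0.95
print(precision_recall_f1(y_true, always_ham, "spam"))   # (0.0, 0.0, 0.0)
print(accuracy(y_true, some_spam))                       # 0.97
print(precision_recall_f1(y_true, some_spam, "spam"))    # roughly (0.667, 0.8, 0.727)
```

Classifier A reaches 95% accuracy yet has zero recall on spam, while classifier B is only two points more accurate but actually finds most of the spam, which is the difference precision, recall, and F1 expose.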
Key idea
Text classification turns documents into features and trains a classifier on them. Linear models on TF-IDF features make strong baselines, and precision, recall, and F1 give a more honest picture than accuracy when classes are imbalanced.