Text Classification Basics

Assign a category to a document using features and a trained classifier.

Text classification assigns a document to one or more predefined categories, such as spam versus not spam or a news topic. It is one of the most common and useful NLP tasks.

The basic recipe

  • Represent the text as features using bag-of-words counts, TF-IDF weights, or embeddings.
  • Train a classifier such as naive Bayes, logistic regression, or a neural network on labeled examples.
  • Predict the category for new documents.
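The three steps above can be sketched end to end with nothing but the standard library. This is a minimal illustration, not a production implementation: it assumes a crude whitespace tokenizer, bag-of-words counts as features, and a multinomial naive Bayes classifier with add-one smoothing. The function names and the tiny spam/ham dataset are hypothetical and exist only for this example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Step 1 and 2: turn documents into word counts and tally them per class."""
    class_counts = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)      # word frequencies per class
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(text, model):
    """Step 3: return the class with the highest log posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue                                 # skip words never seen in training
            count = word_counts[label][token]
            # Add-one (Laplace) smoothing avoids log(0) for unseen class/word pairs.
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training set.
train_docs = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(train_docs, train_labels)
prediction_spam = predict("claim your free money", model)
prediction_ham = predict("agenda for the team meeting", model)
print(prediction_spam, prediction_ham)
```

With only four training documents the model is useful purely as an illustration of the recipe. A real task would need far more labeled data and a held-out test set.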

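A strong baseline is TF-IDF features with a linear classifier such as logistic regression, which is fast to train and surprisingly hard to beat on many tasks. Naive Bayes is even simpler: it assumes that features are conditionally independent given the class, and it often works well in practice even when that assumption holds only roughly.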
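To make the TF-IDF plus linear classifier baseline concrete, here is a small pure-Python sketch. It assumes one common smoothed IDF variant, L2-normalised vectors, and a binary logistic regression fitted with plain stochastic gradient descent. The helper names, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration. In practice a library implementation would normally be used instead.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every word in the training set."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs:
        doc_freq.update(set(tokenize(text)))   # count each word once per document
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector as a dict, L2-normalised; unseen words are dropped."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {w: c * idf[w] for w, c in counts.items()}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {w: v / norm for w, v in vec.items()} if norm else vec


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(vectors, targets, epochs=200, lr=0.5):
    """Binary logistic regression trained with stochastic gradient descent."""
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, y in zip(vectors, targets):
            z = bias + sum(weights.get(w, 0.0) * v for w, v in vec.items())
            error = sigmoid(z) - y             # gradient of the log loss w.r.t. z
            for w, v in vec.items():
                weights[w] = weights.get(w, 0.0) - lr * error * v
            bias -= lr * error
    return weights, bias


def predict_proba(vec, weights, bias):
    """Probability of the positive class for one TF-IDF vector."""
    z = bias + sum(weights.get(w, 0.0) * v for w, v in vec.items())
    return sigmoid(z)


# Hypothetical toy data: 1 = spam, 0 = not spam.
docs = [
    "win free money now",
    "free prize claim now",
    "meeting agenda for monday",
    "lunch with the project team",
]
targets = [1, 1, 0, 0]

idf = fit_idf(docs)
vectors = [tfidf_vector(d, idf) for d in docs]
weights, bias = train_logistic(vectors, targets)

spam_prob = predict_proba(tfidf_vector("free money now", idf), weights, bias)
ham_prob = predict_proba(tfidf_vector("project meeting monday", idf), weights, bias)
print(round(spam_prob, 3), round(ham_prob, 3))
```

The model is linear because the score is just a weighted sum of TF-IDF values plus a bias. That is also why it trains quickly and why each learned weight can be read as evidence for or against the positive class.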

Evaluation

Accuracy alone can mislead when classes are imbalanced, since always guessing the majority class can look accurate: if 95% of emails are not spam, a classifier that never flags spam scores 95% accuracy yet catches nothing. Better measures are precision, the fraction of predicted positives that are actually positive, and recall, the fraction of actual positives that the classifier found. Their harmonic mean is the F1 score.
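A short worked example shows the gap between accuracy and the other measures. The sketch below assumes binary labels with 1 as the positive class and a hypothetical dataset of 100 documents, 5 of them positive. The function names are illustrative, and undefined ratios (zero denominators) are reported as 0.0 by convention here.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical imbalanced data: 5 positives followed by 95 negatives.
y_true = [1] * 5 + [0] * 95

# Classifier A always predicts the majority class.
pred_majority = [0] * 100

# Classifier B finds 4 of the 5 positives and raises 2 false alarms.
pred_useful = [1, 1, 1, 1, 0] + [1, 1] + [0] * 93

acc_majority = accuracy(y_true, pred_majority)        # 0.95
prf_majority = precision_recall_f1(y_true, pred_majority)  # (0.0, 0.0, 0.0)

acc_useful = accuracy(y_true, pred_useful)            # 0.97
prf_useful = precision_recall_f1(y_true, pred_useful)  # precision 4/6, recall 4/5, F1 8/11

print(acc_majority, prf_majority)
print(acc_useful, prf_useful)
```

The two classifiers differ by only two points of accuracy, while F1 separates them clearly: 0.0 for the majority-class guesser against roughly 0.73 for the classifier that actually finds positives.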

Key idea

Text classification represents documents as features and trains a classifier on labeled examples. TF-IDF with a linear model is a strong baseline, and precision, recall, and F1 are more informative than accuracy when classes are imbalanced.

Check yourself

1. Why can accuracy mislead on imbalanced classes?

2. What is a common strong baseline for text classification?