Text Feature Extraction

Turn raw text into numeric features through cleaning, tokenization, and vectorization.

Models cannot read raw text, so it must become numbers. Text feature extraction is a pipeline of cleaning, splitting, and vectorizing that produces a numeric representation.

Preprocessing steps

  • Normalize by lowercasing and stripping punctuation.
  • Tokenize into words or subword pieces.
  • Remove stopwords and optionally stem or lemmatize to merge word forms.
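
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the `preprocess` function and the tiny `STOPWORDS` set are hypothetical names made up for this example, tokenization is a simple whitespace split, and stemming or lemmatization is left out.

```python
import re

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "an", "is", "are", "and", "of", "to", "in", "on"}

def preprocess(text):
    """Lowercase, strip punctuation, split into word tokens, drop stopwords."""
    text = text.lower()                    # normalize case
    text = re.sub(r"[^\w\s]", " ", text)   # replace punctuation with spaces
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOPWORDS]

print(preprocess("The cat sat on the mat, and the dog barked!"))
# ['cat', 'sat', 'mat', 'dog', 'barked']
```

Subword tokenizers and lemmatizers usually come from a dedicated NLP library; the whitespace split here only stands in for that step.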

Vectorization choices

  • Bag of words counts each vocabulary word, ignoring order.
  • TF-IDF weights each count by how rare the word is across documents, so words that appear in nearly every document are downweighted.
  • N-grams capture short phrases by treating sequences of n adjacent tokens as features.
  • Embeddings map words or sentences to dense vectors that encode meaning.
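
As a rough sketch of the count-based options above, the following uses only the standard library on a made-up three-document corpus. The function names are hypothetical, and the TF-IDF formula shown (raw count times log(N / document frequency)) is one common textbook variant; libraries often apply smoothing or normalization that gives different numbers.

```python
import math
from collections import Counter

# Toy corpus of already-tokenized documents (made up for illustration).
docs = [
    ["cat", "sat", "mat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]

def bag_of_words(tokens):
    """Count each token, ignoring order."""
    return Counter(tokens)

def ngrams(tokens, n=2):
    """Join each run of n adjacent tokens into one feature string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs):
    """Weight each count by log(N / df), where df is the number of docs containing the term."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(n_docs / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]

weights = tf_idf(docs)
print(bag_of_words(docs[0]))
print(ngrams(docs[0]))  # ['cat sat', 'sat mat']
print(weights[0])
```

In the first document, "mat" occurs in only one of the three documents, so it receives a larger weight than "cat" or "sat", which each occur in two.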

Count and TF-IDF features are sparse and high-dimensional: each document vector has one slot per vocabulary word, and most slots are zero. Embeddings are dense and low-dimensional, and they place related words close together. The right choice depends on dataset size, the task, and how much context matters.
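
The difference in how the two representations treat related words can be illustrated with cosine similarity. In this sketch the 2-dimensional "embedding" vectors are hand-picked for illustration only; real embeddings are learned from data and have many more dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["cat", "kitten", "car"]

# Sparse view: one slot per vocabulary word, so distinct words never overlap.
one_hot = {
    word: [1.0 if i == j else 0.0 for j in range(len(vocab))]
    for i, word in enumerate(vocab)
}

# Dense view: hypothetical hand-made vectors, NOT output from a trained model.
toy_embedding = {
    "cat": [0.9, 0.1],
    "kitten": [0.8, 0.2],
    "car": [0.1, 0.9],
}

print(cosine(one_hot["cat"], one_hot["kitten"]))              # 0.0
print(cosine(toy_embedding["cat"], toy_embedding["kitten"]))  # close to 1
print(cosine(toy_embedding["cat"], toy_embedding["car"]))     # much lower
```

With one-hot or count features, "cat" and "kitten" are as unrelated as "cat" and "car". A dense representation can encode that the first pair is closer.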

Key idea

Text feature extraction cleans and tokenizes text, then vectorizes it with bag of words, TF-IDF, n-grams, or dense embeddings, depending on the task.

Check yourself

1. What does TF-IDF do that plain counts do not?

2. How do embeddings differ from bag of words features?