← Lessons

quiz vs the machine

Gold1440

Machine Learning

Target Encoding

Replace a category with the average outcome it tends to produce.

5 min read · core · beat Gold to climb

The idea

Target encoding replaces each category with a number derived from the target, usually the average target value for rows in that category. A city becomes its average sale price, a product becomes its average click rate.

Why use it

  • It handles high cardinality categories that would explode one hot encoding into thousands of columns.
  • It packs predictive signal into a single compact column.
  • It works smoothly with tree and linear models alike.

The overfitting trap

The danger is that the encoding leaks the target. If a category appears once, its encoding is just that row answer, which the model memorizes.

  • Use smoothing that pulls rare category averages toward the global mean.
  • Compute encodings inside cross validation folds, so a row never sees its own target.
  • This out of fold scheme is the standard safe recipe.

Practical notes

  • Add a tiny amount of noise to further reduce leakage.
  • Handle unseen categories at prediction time by falling back to the global mean.

Key idea

Target encoding maps a category to its average target, which is compact and powerful but must be smoothed and computed out of fold to avoid leaking the answer.

Check yourself

Answer to earn rating on the learn ladder.

1. When is target encoding especially useful?

2. How do you prevent target encoding from leaking?