Machine Learning

One Hot Encoding

Turning categories into numbers without inventing a fake order.

The problem

Most models expect numeric input, but many features are categorical, such as color or country. If you simply encode red as 1 and blue as 2, a model that treats the column as a number will read blue as greater than red, and that ordering means nothing.
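The problem can be made concrete with a small sketch. The mapping `label_codes` is a hypothetical, arbitrary assignment, not something any particular library produces; the point is only that arithmetic on the codes yields comparisons nobody intended.

```python
# Hypothetical integer coding of a color feature (the assignment is arbitrary).
label_codes = {"red": 1, "blue": 2, "green": 3}

# Once the categories are numbers, comparisons and distances "work",
# even though they say nothing true about the colors.
blue_greater_than_red = label_codes["blue"] > label_codes["red"]
red_to_blue = abs(label_codes["blue"] - label_codes["red"])
red_to_green = abs(label_codes["green"] - label_codes["red"])

print(blue_greater_than_red)      # True
print(red_to_blue, red_to_green)  # 1 2
```

A model that uses the column numerically would treat green as twice as far from red as blue is, purely because of how the codes were handed out.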

The solution

One hot encoding creates one new binary column per category. Each row gets a 1 in the column for its category and a 0 in every other column. No false ordering is introduced because every category has its own column, so no category is numerically larger than, or closer to, another.
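As a minimal sketch of the idea in plain Python: the helper name `one_hot_encode` and the choice to sort the categories for a stable column order are assumptions made for illustration, not a reference implementation.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per distinct category."""
    # Sorting is an illustrative choice that keeps the column order reproducible.
    categories = sorted(set(values))
    column_of = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # zeros everywhere...
        row[column_of[value]] = 1     # ...except the column for this category
        rows.append(row)
    return categories, rows


colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red']
for color, row in zip(colors, encoded):
    print(color, row)
# red [0, 0, 1]
# blue [1, 0, 0]
# green [0, 1, 0]
# blue [1, 0, 0]
```

In practice a library routine such as `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would normally be used instead of hand-written code.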

Tradeoffs

  • It is simple and works well for low-cardinality features
  • The number of columns grows with the number of distinct categories, so a feature with thousands of categories adds thousands of mostly zero columns
  • For very high cardinality, alternatives such as embeddings or target encoding scale better
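To illustrate one of the alternatives named above, here is a rough sketch of target encoding, which replaces each category with the mean of the target observed for it, so the feature stays a single column however many categories exist. The function name `target_encode` and the toy `countries` and `clicked` data are made up for this example. The sketch is deliberately naive: it omits the smoothing and the train-only fitting that are usually needed to limit target leakage.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Replace each category with the mean target value seen for that category."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        totals[category] += target
        counts[category] += 1

    means = {category: totals[category] / counts[category] for category in totals}
    return [means[category] for category in categories], means


# Toy data, invented for illustration only.
countries = ["FR", "DE", "FR", "JP", "DE", "FR"]
clicked = [1, 0, 0, 1, 0, 1]

encoded, means = target_encode(countries, clicked)

one_hot_width = len(set(countries))  # one hot needs one column per category
target_width = 1                     # target encoding always yields one column
print(one_hot_width, target_width)   # 3 1
```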

A practical note

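The one hot columns for a feature sum to 1 in every row, so together with an intercept term they are perfectly collinear. To avoid this in linear models that are sensitive to it, such as unregularized linear regression, people often drop one column. The dropped category becomes the reference category, and the remaining coefficients are read relative to it. Tree-based models and most neural networks do not require this.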
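A small sketch of why the full set of columns is redundant and what dropping one looks like. The helper `one_hot_drop_first` is a hypothetical name, and treating the first category in sorted order as the reference is an arbitrary choice for illustration.

```python
def one_hot_drop_first(values):
    """One hot encode, omitting the first (sorted) category as the reference."""
    categories = sorted(set(values))
    reference, kept = categories[0], categories[1:]
    rows = [[1 if value == category else 0 for category in kept] for value in values]
    return reference, kept, rows


colors = ["red", "blue", "green", "blue"]

# Full encoding: every row sums to 1, which duplicates an intercept column.
full_columns = sorted(set(colors))
full_rows = [[1 if value == column else 0 for column in full_columns] for value in colors]
row_sums = [sum(row) for row in full_rows]
print(row_sums)  # [1, 1, 1, 1]

# Reduced encoding: the reference category is the all-zeros row.
reference, kept, reduced = one_hot_drop_first(colors)
print(reference, kept)  # blue ['green', 'red']
print(reduced)          # [[0, 1], [0, 0], [1, 0], [0, 0]]
```

No information is lost: a row of all zeros can only mean the reference category.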

Key idea

One hot encoding represents categories as independent binary columns so models do not assume a false numeric ordering.

Check yourself

1. Why not just number the categories one, two, three?

2. What is a downside of one hot encoding?