One Hot Encoding

The problem

Most models expect numeric input, but many features are categorical, such as color or country. If you simply encode red as 1 and blue as 2, the model can treat blue as greater than red, an ordering that has no meaning for these categories.
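To make the false ordering concrete, here is a minimal sketch in plain Python. The integer codes and the extra green category are assumptions chosen for illustration, as is the helper name gap.

```python
# Hypothetical integer codes for a color feature (illustrative only).
codes = {"red": 1, "blue": 2, "green": 3}


def gap(a, b):
    """Numeric distance between two categories under the integer coding."""
    return abs(codes[a] - codes[b])


# The coding implies an order: blue compares as greater than red.
blue_exceeds_red = codes["blue"] > codes["red"]

# It also implies distances: red looks closer to blue than to green.
red_blue = gap("red", "blue")
red_green = gap("red", "green")

print(blue_exceeds_red, red_blue, red_green)
```

Neither the order nor the distances say anything real about colors; they come only from the arbitrary choice of codes.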

The solution

One hot encoding creates one new binary column per category. A row gets a one in the column for its category and zeros everywhere else. No false ordering is introduced because each category is independent.
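The idea can be sketched in a few lines of plain Python. The function name one_hot and the sample colors are assumptions for illustration; in practice a library routine would normally do this work.

```python
def one_hot(values):
    """Return (categories, rows), where each row is a list of 0/1 flags."""
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # exactly one "hot" entry per row
        rows.append(row)
    return categories, rows


colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Each row contains a single one, in the column that matches its category, so no category is numerically larger or smaller than another.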

Tradeoffs

It is simple and works well for low-cardinality features.
It can explode the number of columns when a feature has thousands of categories.
For very high cardinality, alternatives like embeddings or target encoding scale better.
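A rough back-of-the-envelope sketch shows why the column count matters. The row and category counts below are hypothetical, and the helper name dense_cells is an assumption for illustration.

```python
def dense_cells(n_rows, n_categories):
    """Number of cells in a dense one hot matrix for a single feature."""
    return n_rows * n_categories


n_rows = 1_000_000  # hypothetical dataset size

small = dense_cells(n_rows, 5)       # low cardinality feature
large = dense_cells(n_rows, 10_000)  # high cardinality feature

# Each row holds exactly one 1, so the number of informative entries
# stays at n_rows no matter how many columns are added.
ones = n_rows

print(small, large, ones)
```

With 10,000 categories the dense matrix has ten billion cells, yet only one million of them are ones; the rest are zeros that carry no extra information.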

A practical note

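In linear models that include an intercept, the full set of one hot columns is perfectly collinear, because the columns in each row always sum to one. To avoid this, people sometimes drop one column; the category it stood for becomes the reference category. Tree-based and most neural models do not require this.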
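The following sketch illustrates the collinearity and the effect of dropping one column. The sample colors, and the choice of the first category in sorted order as the reference, are assumptions for illustration.

```python
colors = ["red", "blue", "green", "blue"]
categories = sorted(set(colors))

# Full encoding: one column per category.
full = [[1 if c == cat else 0 for cat in categories] for c in colors]

# Every full row sums to 1, which duplicates an intercept column of ones.
row_sums = [sum(row) for row in full]

# Drop the first column; its category becomes the reference category.
reference = categories[0]
kept = categories[1:]
reduced = [row[1:] for row in full]

print(reference, kept)
for color, row in zip(colors, reduced):
    print(color, row)
```

Rows belonging to the reference category become all zeros, so no information is lost: the reference is identified by the absence of a one in every remaining column.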

Key idea

One hot encoding represents categories as independent binary columns so models do not assume a false numeric ordering.
