The problem
Most models expect numeric input, but many features are categorical, such as color or country. If you simply encode red as 1 and blue as 2, a model can treat blue as greater than red and the gap between them as a real distance. Neither means anything for unordered categories.
The solution
One-hot encoding creates one new binary column per category. Each row gets a 1 in the column for its category and a 0 in every other column. No false ordering is introduced because every category has its own column, so no category is numerically larger than another.
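As a minimal sketch of the idea, the snippet below builds the binary columns by hand using only the Python standard library. The function name `one_hot` and the sample `colors` list are illustrative assumptions, and sorting the categories is just one way to get a stable column order.

```python
def one_hot(values):
    """Return (categories, rows), where each row is a list of 0/1 flags."""
    # Sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # a single 1 marks this row's category
        rows.append(row)
    return categories, rows


# Hypothetical sample data for illustration.
colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot(colors)

print(categories)  # ['blue', 'green', 'red']
for row in encoded:
    print(row)
# [0, 0, 1]
# [1, 0, 0]
# [0, 1, 0]
# [1, 0, 0]
```

Every row contains exactly one 1, so the encoding records which category a row belongs to without ranking the categories.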
Tradeoffs
- It is simple and works well for low-cardinality features.
- It can explode the number of columns when a feature has thousands of categories, and the resulting matrix is mostly zeros.
- For very high cardinality, alternatives such as embeddings or target encoding scale better.
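To make the last tradeoff concrete, here is a rough sketch of target encoding, which replaces each category with the mean of the target observed for that category and so yields a single numeric column regardless of cardinality. The function name `target_encode` and the sample data are assumptions for illustration. This naive version applies no smoothing and computes the means on the same rows it encodes, which can leak the target into the feature; a practical setup would usually compute the means on training folds only.

```python
from collections import defaultdict


def target_encode(categories, targets):
    """Map each category to the mean target value observed for it."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Naive per-category mean: no smoothing, no held-out folds.
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    return [means[cat] for cat in categories], means


# Hypothetical sample data for illustration.
countries = ["US", "FR", "US", "DE", "FR", "US"]
clicked = [1, 0, 1, 0, 1, 0]
encoded_column, means = target_encode(countries, clicked)

one_hot_width = len(set(countries))  # one column per distinct category
target_width = 1                     # always a single numeric column
print(one_hot_width, target_width)   # 3 1
```

With three countries the difference is small, but the one-hot width grows with the number of distinct categories while the target-encoded width stays at one.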
A practical note
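The full set of one-hot columns sums to 1 in every row, which makes the columns perfectly collinear with the intercept in some linear models. To avoid this, people sometimes drop one column; the dropped category becomes the reference category, and the remaining coefficients are read relative to it. Tree-based and most neural models do not require this.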
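Dropping a reference category can be sketched as follows. The function name `one_hot_drop_first` is hypothetical, and choosing the first category in sorted order as the reference is an arbitrary assumption; any category can serve as the reference.

```python
def one_hot_drop_first(values):
    """One-hot encode, omitting the column for the reference category."""
    categories = sorted(set(values))
    # Assumption: the first category in sorted order is the reference.
    reference, kept = categories[0], categories[1:]
    rows = [[1 if value == cat else 0 for cat in kept] for value in values]
    return reference, kept, rows


# Hypothetical sample data for illustration.
reference, kept, rows = one_hot_drop_first(["red", "blue", "green", "blue"])

print(reference)  # blue
print(kept)       # ['green', 'red']
for row in rows:
    print(row)
# [0, 1]
# [0, 0]
# [1, 0]
# [0, 0]
```

Rows belonging to the reference category are all zeros, so no information is lost: the category is still identifiable, and the columns no longer sum to a constant.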
Key idea
One-hot encoding represents each category as its own binary column, so models do not assume a false numeric ordering.