Encoding Categorical Variables
Most models do arithmetic, so they cannot consume raw text categories like red, green, or blue. We must encode categorical variables into numbers, and the method we choose matters.
A naive idea is to assign each category an integer, such as 1 for red and 2 for green. This is label encoding. The danger is that it implies an order and a distance that the categories may not have. A model might wrongly conclude that green is greater than red, or that some colors are closer together than others.
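To make the problem concrete, here is a minimal sketch of label encoding in plain Python. The color list and the starting code of 1 are illustrative assumptions, not a fixed convention.

```python
# Hypothetical category list; the integer codes are arbitrary.
colors = ["red", "green", "blue"]
label_codes = {color: index + 1 for index, color in enumerate(colors)}

# The codes imply distances that mean nothing for colors:
# blue ends up "twice as far" from red as green is.
dist_red_green = abs(label_codes["red"] - label_codes["green"])
dist_red_blue = abs(label_codes["red"] - label_codes["blue"])

print(label_codes)
print(dist_red_green, dist_red_blue)
```

Reordering the list changes which colors look close together, which shows that the distances come from the encoding and not from the data.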
The safer default for unordered categories is one-hot encoding. Each category becomes its own binary column that is one when the category is present and zero otherwise. No false ordering is introduced because every pair of distinct categories is the same distance apart.
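The following sketch shows one way to build these binary columns by hand and check that the categories are equidistant. The helper name `one_hot` and the color list are assumptions made for illustration; in practice a library encoder would usually do this work.

```python
import math

def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Hypothetical category list, one output column per entry.
categories = ["red", "green", "blue"]
encoded = {color: one_hot(color, categories) for color in categories}

# Every pair of distinct categories is the same Euclidean distance apart.
dist_red_green = math.dist(encoded["red"], encoded["green"])
dist_red_blue = math.dist(encoded["red"], encoded["blue"])

print(encoded)
print(math.isclose(dist_red_green, dist_red_blue))
```

Raising an error on an unknown category is one design choice among several; another common option is to encode unseen values as all zeros.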
Key distinctions:
- Ordinal variables have a real order, such as small, medium, large, so an integer encoding can be appropriate
- Nominal variables have no order, such as colors, so one-hot encoding fits better
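For an ordinal variable, the integer codes should follow the real order rather than whatever order the values happen to appear in. A minimal sketch, assuming the order small, medium, large and a made-up list of observations:

```python
# The order is stated explicitly so the codes reflect it.
size_order = ["small", "medium", "large"]
size_codes = {size: rank for rank, size in enumerate(size_order)}

# Hypothetical observations to encode.
sizes = ["medium", "small", "large", "small"]
encoded_sizes = [size_codes[size] for size in sizes]

print(encoded_sizes)
```

Note that this still treats the steps between sizes as equal, which is an assumption the data may or may not support.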
One-hot encoding can explode the number of columns when a variable has many distinct categories, a problem called high cardinality. In that case practitioners reach for techniques like grouping rare values into a single category or learned embeddings that map categories into a compact numeric space.
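Grouping rare values can be sketched in a few lines. The function name `group_rare`, the `min_count` threshold of 2, and the "other" label are all assumptions chosen for this example; a sensible threshold depends on the dataset.

```python
from collections import Counter

def group_rare(values, min_count=2, other_label="other"):
    """Replace categories seen fewer than `min_count` times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical column with two categories that each appear only once.
cities = ["paris", "paris", "lima", "oslo", "paris", "lima", "quito"]
grouped = group_rare(cities)

print(grouped)
print(len(set(cities)), "categories before,", len(set(grouped)), "after")
```

After grouping, one-hot encoding needs one column per remaining category, including the shared "other" column.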
Key idea
Encode categories into numbers carefully; one-hot encoding suits unordered values while integer encoding fits truly ordered ones.