The Sources of Bias in Data

Bias starts upstream

A model can only learn from the data it is given. If that data already reflects skewed history or flawed collection, the model will faithfully reproduce those flaws. Most fairness problems begin before any line of training code runs.

Common sources

Historical bias: the world the data describes is itself unequal, so accurate data still encodes unfair patterns.
Representation bias: some groups appear too rarely to be modeled well.
Measurement bias: the features or labels are noisy proxies that work better for some groups than others.
Aggregation bias: one model is forced onto groups that actually behave differently.
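
Representation bias is the easiest of these to check directly. The sketch below counts how often each group appears and flags groups below a chosen share. The field name "region", the toy records, and the 10% threshold are all assumptions made up for illustration; a sensible threshold depends on the task.

```python
from collections import Counter

def group_shares(records, group_key):
    """Return each group's share of the dataset as a fraction of all rows."""
    counts = Counter(row[group_key] for row in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}

def underrepresented(shares, min_share):
    """List groups whose share falls below a chosen (hypothetical) threshold."""
    return sorted(group for group, share in shares.items() if share < min_share)

# Toy data, invented for illustration only.
records = (
    [{"region": "north"}] * 80
    + [{"region": "south"}] * 15
    + [{"region": "east"}] * 5
)

shares = group_shares(records, "region")
flagged = underrepresented(shares, min_share=0.10)
print(shares)
print(flagged)
```

A low share is only a warning sign. Whether a group is modeled well also depends on how varied that group is and how hard the task is for it.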

Why naming the source matters

Each source needs a different fix. Representation bias calls for better sampling, while measurement bias calls for better features or labels. Treating every bias problem as the same problem leads to the wrong remedy.
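
To see why the diagnosis matters, consider aggregation bias. The sketch below fits one pooled line to two groups that follow different relationships, then fits each group separately. The data and the helper names fit_line and mse are invented for illustration. Both groups are equally represented and their labels are exact, so neither more sampling nor better labels would reduce the pooled error here.

```python
def fit_line(points):
    """Ordinary least squares fit of y = a + b*x for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def mse(points, model):
    """Mean squared error of a fitted (intercept, slope) model on the points."""
    intercept, slope = model
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points) / len(points)

# Toy data, invented for illustration: the groups behave differently.
group_a = [(x, 2.0 * x) for x in range(1, 6)]    # y rises with x
group_b = [(x, 10.0 - x) for x in range(1, 6)]   # y falls with x

# One model forced onto both groups.
pooled = fit_line(group_a + group_b)
pooled_errors = {"A": mse(group_a, pooled), "B": mse(group_b, pooled)}

# One model per group.
separate_errors = {
    "A": mse(group_a, fit_line(group_a)),
    "B": mse(group_b, fit_line(group_b)),
}

print(pooled_errors)
print(separate_errors)
```

In this toy case the pooled model is wrong for both groups while the per-group fits are essentially exact, which points at the shared model rather than at the data collection.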

A useful habit

Before blaming the algorithm, audit how the data was gathered, who is present, and what the labels truly measure. Bias is usually inherited, not invented by the model.
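
One part of such an audit can be sketched in a few lines: checking what the labels truly measure. The example assumes a small subset of rows has been re-checked by a trusted process, and compares the recorded label with that re-check for each group. The field names "group", "recorded_label", and "audited_label" and the rows themselves are hypothetical.

```python
from collections import defaultdict

def label_agreement_by_group(rows):
    """Fraction of rows per group where the recorded label matches the re-check."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        totals[row["group"]] += 1
        if row["recorded_label"] == row["audited_label"]:
            matches[row["group"]] += 1
    return {group: matches[group] / totals[group] for group in totals}

# Toy audit subset, invented for illustration only.
rows = (
    [{"group": "A", "recorded_label": 1, "audited_label": 1}] * 9
    + [{"group": "A", "recorded_label": 0, "audited_label": 1}] * 1
    + [{"group": "B", "recorded_label": 1, "audited_label": 1}] * 6
    + [{"group": "B", "recorded_label": 0, "audited_label": 1}] * 4
)

agreement = label_agreement_by_group(rows)
print(agreement)
```

A clear gap in agreement between groups suggests the label is a weaker proxy for one of them, which is a measurement problem and not something a different algorithm would repair.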

Key idea

Most model bias is inherited from data through historical, representation, measurement, or aggregation effects, so diagnosing the source comes before choosing a fix.
