Machine Learning

The Sources of Bias in Data

Where unfairness sneaks into a model before training even starts.

Bias starts upstream

A model can only learn from the data it is given. If that data already reflects skewed history or flawed collection, the model will faithfully reproduce those flaws. Most fairness problems begin before any line of training code runs.

Common sources

  • Historical bias: the world the data describes is itself unequal, so accurate data still encodes unfair patterns.
  • Representation bias: some groups appear too rarely to be modeled well.
  • Measurement bias: the features or labels are noisy proxies that work better for some groups than others.
  • Aggregation bias: one model is forced onto groups that actually behave differently.
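
Aggregation bias is the easiest of the four to see in a toy example. The sketch below uses made-up data: two hypothetical groups whose outcome moves in opposite directions with the same feature. It compares one pooled line fit against a separate fit per group. The helper names (fit_line, mean_abs_error) and all numbers are illustrative assumptions, not part of any library or real dataset.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(points, slope, intercept):
    """Average absolute gap between the fitted line and the observed y values."""
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)


# Hypothetical toy data: the same feature x predicts y in opposite directions.
group_a = [(x, 2.0 * x) for x in range(5)]        # y rises with x
group_b = [(x, 8.0 - 2.0 * x) for x in range(5)]  # y falls with x

# One model forced onto both groups.
pooled = fit_line(group_a + group_b)
pooled_error = mean_abs_error(group_a + group_b, *pooled)

# One model per group; report the worse of the two errors.
per_group_error = max(
    mean_abs_error(group, *fit_line(group)) for group in (group_a, group_b)
)

print(pooled, pooled_error, per_group_error)
```

On this toy data the pooled fit finds no relationship at all (slope 0) and is off by 2.4 on average, while each per-group fit is exact. Real data is rarely this clean, but the pattern is the same: a single model can average away structure that matters within each group.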

Why naming the source matters

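Each source needs a different fix. Representation bias calls for better sampling, while measurement bias calls for better features or labels. Treating every fairness problem as the same problem leads to the wrong remedy: collecting more samples, for example, does nothing to repair a label that measures the wrong thing.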
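
When better sampling is not possible right away, one common stopgap for representation bias is to reweight the rows already collected so that each group contributes equally during training. This is a swapped-in technique, not a substitute for gathering more data, and it does nothing for measurement bias. A minimal sketch, assuming a hypothetical list of group labels and a helper named balancing_weights:

```python
from collections import Counter


def balancing_weights(groups):
    """Return one weight per row so every group has the same total weight.

    Each row gets total_rows / (number_of_groups * rows_in_its_group).
    """
    counts = Counter(groups)
    n_groups = len(counts)
    total = len(groups)
    return [total / (n_groups * counts[g]) for g in groups]


# Hypothetical sample: group "b" is badly under-represented.
groups = ["a"] * 8 + ["b"] * 2
weights = balancing_weights(groups)

# Total weight carried by each group after reweighting.
weight_by_group = Counter()
for g, w in zip(groups, weights):
    weight_by_group[g] += w

print(dict(weight_by_group))
```

Each row in group "a" gets weight 0.625 and each row in group "b" gets 2.5, so both groups carry a total weight of 5. The weights only rebalance influence. Two rows still tell the model very little about group "b", which is why sampling remains the real fix.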

A useful habit

Before blaming the algorithm, audit how the data was gathered, who is present, and what the labels truly measure. Bias is usually inherited, not invented by the model.
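
The "who is present" part of that audit can start as a simple count. The sketch below assumes the data is a list of dictionaries with a group field and a binary label field. The field names, the audit_by_group helper, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def audit_by_group(records, group_key, label_key):
    """Summarise how many rows each group has and how often its label is positive."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        positives[group] += int(row[label_key])

    n = len(records)
    return {
        group: {
            "count": totals[group],
            "share": totals[group] / n,
            "positive_rate": positives[group] / totals[group],
        }
        for group in totals
    }


# Hypothetical records: 8 rows for group "a" (4 positive), 2 rows for group "b" (0 positive).
records = (
    [{"group": "a", "label": 1}] * 4
    + [{"group": "a", "label": 0}] * 4
    + [{"group": "b", "label": 0}] * 2
)

report = audit_by_group(records, "group", "label")
for group, stats in report.items():
    print(group, stats)
```

A small share flags possible representation bias. A large gap in positive rates is only a prompt to ask why: it may reflect historical bias, a label that measures different things for different groups, or a real difference. The count alone cannot tell these apart.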

Key idea

Most model bias is inherited from data through historical, representation, measurement, or aggregation effects, so diagnosing the source comes before choosing a fix.

Check yourself

1. When does most model bias originate?

2. What is representation bias?