Machine Learning

Sampling Bias

When the data you collected does not match the world you serve.

4 min read · intro

A skewed window

Sampling bias happens when the examples you collect are not representative of the population the model will actually face. The sample is a window onto the world, and a crooked window distorts everything seen through it.

How it creeps in

  • Convenience sampling: you use the data that was easy to grab, not the data you need.
  • Self-selection: the people who choose to take part differ systematically from those who do not.
  • Survivorship: you only see cases that made it past some filter, missing the failures.
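
To make self-selection concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: a hypothetical satisfaction survey, scores spread uniformly between 0 and 1, and a response rule where happier people are more likely to answer. Under those assumptions the population mean is about 0.5 in expectation, while the mean among respondents drifts toward about 0.67.

```python
import random

def simulate_self_selection(n=100_000, seed=0):
    """Return (population_mean, respondent_mean) for a toy satisfaction survey."""
    rng = random.Random(seed)
    # Hypothetical population: satisfaction scores uniform on [0, 1].
    population = [rng.random() for _ in range(n)]
    # Assumed response rule: a person with score s answers with probability s,
    # so satisfied people are over-represented among respondents.
    respondents = [s for s in population if rng.random() < s]
    pop_mean = sum(population) / len(population)
    resp_mean = sum(respondents) / len(respondents)
    return pop_mean, resp_mean

pop_mean, resp_mean = simulate_self_selection()
print(f"population mean: {pop_mean:.3f}")
print(f"respondent mean: {resp_mean:.3f}")
```

Nothing in the respondent data alone reveals the gap. It only shows up when the sample is compared with something known about the whole population.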

Why it is dangerous

A model trained on a skewed sample can score well on its own test set yet fail badly in production, because the test set is usually split from the same sample and so shares the same skew. The error stays invisible until the model meets the real distribution.
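
The sketch below illustrates this failure mode with a toy setup. The ground-truth rule, the value ranges, and the one-threshold "model" are all assumptions made up for the example. Data is collected only where x is non-negative, the held-out test set comes from that same collection, and production covers the full range.

```python
import random

def true_label(x):
    # Assumed ground truth for this toy world: large values on EITHER side are positive.
    return 1 if abs(x) > 1.0 else 0

def accuracy(threshold, xs):
    """Share of points where the rule 'predict 1 if x > threshold' matches the truth."""
    hits = sum((1 if x > threshold else 0) == true_label(x) for x in xs)
    return hits / len(xs)

def fit_threshold(xs):
    """Pick the grid threshold with the best accuracy on the training points."""
    candidates = [i / 20 for i in range(-60, 61)]  # -3.00, -2.95, ..., 3.00
    return max(candidates, key=lambda t: accuracy(t, xs))

rng = random.Random(0)

# Skewed collection: only non-negative x was easy to gather.
skewed = [rng.uniform(0.0, 3.0) for _ in range(2000)]
train, test = skewed[:1500], skewed[1500:]  # the test split inherits the skew

# Production traffic covers the whole range, including negative x.
production = [rng.uniform(-3.0, 3.0) for _ in range(2000)]

threshold = fit_threshold(train)
test_acc = accuracy(threshold, test)
prod_acc = accuracy(threshold, production)
print(f"held-out accuracy (same skew): {test_acc:.3f}")
print(f"production accuracy:           {prod_acc:.3f}")
```

The held-out score looks close to perfect because the test points never include the region the model gets wrong. On the full range, the rule misses every positive case with x below -1, which is about a third of production traffic in this setup.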

Reducing it

  • Define the target population explicitly before collecting.
  • Compare sample demographics against known population statistics.
  • Reweight or resample groups that are under- or overrepresented.
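
The last two steps can be sketched together. This is one simple reweighting scheme (often called post-stratification), not the only option. The group names, counts, and population shares below are hypothetical, and the helper names are made up for the example. Each group's weight is its assumed population share divided by its share of the sample.

```python
from collections import Counter

def group_weights(sample_groups, population_share):
    """Weight per group = population share / sample share."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_share[g] / (counts[g] / n) for g in counts}

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical sample of (group, outcome) rows: 80 urban, 20 rural.
sample = (
    [("urban", 1)] * 60 + [("urban", 0)] * 20
    + [("rural", 1)] * 5 + [("rural", 0)] * 15
)
# Assumed known from an external source such as a census: a 50/50 split.
population_share = {"urban": 0.5, "rural": 0.5}

groups = [g for g, _ in sample]
outcomes = [y for _, y in sample]

weights_by_group = group_weights(groups, population_share)
row_weights = [weights_by_group[g] for g in groups]

raw_mean = sum(outcomes) / len(outcomes)
adjusted_mean = weighted_mean(outcomes, row_weights)
print(f"raw mean:      {raw_mean:.3f}")       # 0.650
print(f"adjusted mean: {adjusted_mean:.3f}")  # 0.500
```

Reweighting only corrects for the groups you measure, and it assumes each group's rows are representative of that group. A group that is missing from the sample entirely cannot be recovered by weights.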

Key idea

Sampling bias arises when collected data does not mirror the real population, and it hides inside test sets that share the same skew. Define the target population early and check your sample against it.

Check yourself

1. Why can sampling bias stay hidden during evaluation?

2. Which is an example of sampling bias?