A skewed window
Sampling bias happens when the examples you collect are not representative of the population the model will actually face. The sample is a window onto the world, and a crooked window distorts everything seen through it.
How it creeps in
- Convenience sampling: you use the data that was easy to grab, not the data you need.
- Self-selection: the people who show up differ systematically from those who do not.
- Survivorship: you see only the cases that made it past some filter and miss the ones that failed.
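To make the survivorship case concrete, here is a minimal sketch using only the Python standard library. The "fund returns" framing, the cutoff of -0.5, and all variable names are assumptions chosen for illustration; they are not taken from any real dataset.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 outcomes centered on zero.
population = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Assumed filter: anything at or below -0.5 drops out and is never recorded.
survivors = [value for value in population if value > -0.5]

pop_mean = statistics.mean(population)
surv_mean = statistics.mean(survivors)

print(f"population mean: {pop_mean:.3f}")
print(f"survivor mean:   {surv_mean:.3f}")
print(f"share observed:  {len(survivors) / len(population):.1%}")
```

The survivor mean comes out noticeably higher than the population mean, even though nothing about the underlying process changed. Only the filter did.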
Why it is dangerous
A model trained on a skewed sample can score well on its own test set yet fail badly in production, because the test set shares the same skew. The error is invisible until the model meets the real distribution.
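A small simulation can show how a shared skew hides the error. This is a sketch under invented assumptions: two hypothetical groups, "A" and "B", in which the feature-label relationship is reversed, a collected sample that is 95% group A, and a "model" that is just the better of two candidate rules. None of these details come from a real system.

```python
import random

def make_example(rng, group):
    x = rng.uniform(-1.0, 1.0)
    # Assumed setup: the label rule flips between the two groups.
    label = (x > 0) if group == "A" else (x < 0)
    return x, label

def draw(rng, n, share_a):
    # Draw n examples where each one belongs to group A with probability share_a.
    return [
        make_example(rng, "A" if rng.random() < share_a else "B")
        for _ in range(n)
    ]

def accuracy(rule, data):
    return sum(rule(x) == y for x, y in data) / len(data)

rng = random.Random(0)

# The collected sample over-represents group A; train and test share that skew.
skewed = draw(rng, 4000, share_a=0.95)
train, test = skewed[:3000], skewed[3000:]

# The population the model actually faces is split evenly.
production = draw(rng, 4000, share_a=0.50)

# "Training": keep whichever candidate rule scores best on the training data.
candidates = [lambda x: x > 0, lambda x: x < 0]
model = max(candidates, key=lambda rule: accuracy(rule, train))

test_acc = accuracy(model, test)
prod_acc = accuracy(model, production)
print(f"accuracy on held-out test set: {test_acc:.2f}")
print(f"accuracy on production data:   {prod_acc:.2f}")
```

The held-out score looks strong because the test split was cut from the same skewed sample. The drop only appears on data drawn with the real group mix.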
Reducing it
- Define the target population explicitly before collecting.
- Compare sample demographics against known population statistics.
- Reweight or resample groups that are under- or over-represented.
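The comparison and reweighting steps above can be sketched as follows. The group names, the population shares, and the sample counts are all made-up values for illustration; in practice the population shares would come from a trusted external source such as census figures.

```python
from collections import Counter

# Assumed population shares (hypothetical values that sum to 1).
population_share = {"urban": 0.55, "suburban": 0.30, "rural": 0.15}

# Hypothetical collected sample: one group label per example.
sample_groups = ["urban"] * 80 + ["suburban"] * 15 + ["rural"] * 5

counts = Counter(sample_groups)
n = len(sample_groups)

# Step 1: compare the sample's shares against the population's.
sample_share = {group: counts[group] / n for group in population_share}
for group in population_share:
    gap = sample_share[group] - population_share[group]
    print(f"{group:9s} sample={sample_share[group]:.2f} "
          f"population={population_share[group]:.2f} gap={gap:+.2f}")

# Step 2: weight each group by population share / sample share,
# so over-counted groups get weights below 1 and under-counted groups above 1.
weights = {}
for group in population_share:
    if counts[group] == 0:
        # A group missing from the sample cannot be reweighted into existence.
        raise ValueError(f"no examples collected for group {group!r}")
    weights[group] = population_share[group] / sample_share[group]

# Per-example weights, e.g. to pass to a weighted loss or a weighted mean.
example_weights = [weights[group] for group in sample_groups]
```

After weighting, each group's share of the total weight matches its population share. Weights far above 1 are a warning sign in their own right, since a handful of examples then stands in for a large part of the population.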
Key idea
Sampling bias arises when collected data does not mirror the real population. It hides inside test sets that share the same skew, so define the target population early and check the sample against it.