Sampling Techniques
When data is too large or too skewed, you work with a sample. How you draw that sample shapes everything the model learns, so the method matters.
Common methods
- Simple random sampling picks every example with equal probability. It is easy to implement, but rare groups may vanish from a small sample.
- Stratified sampling splits the population into groups (strata) and samples from each, typically in proportion to its size, so that group proportions are preserved and no group is dropped by chance.
- Reservoir sampling draws a fixed-size uniform sample from a stream of unknown length in a single pass, without storing the whole stream.
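The stratified and reservoir methods above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation. The function names, the `fraction` parameter, and the toy `records` data are assumptions made for the example. The stratified version also assumes you want at least one example per group, which slightly over-represents very small groups. The reservoir version is the classic "Algorithm R".

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, rng):
    """Sample roughly `fraction` of each group, keeping at least one per group."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)
    sample = []
    for members in groups.values():
        # Assumption: never drop a group entirely, even a tiny one.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of k items from an iterable, in one pass."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


rng = random.Random(0)  # fixed seed so the draw is reproducible

# Toy population: 990 "common" records and 10 "rare" ones.
records = [("common", i) for i in range(990)] + [("rare", i) for i in range(10)]
strat = stratified_sample(records, key=lambda r: r[0], fraction=0.1, rng=rng)

# The stream is consumed once; its length is never needed up front.
res = reservoir_sample(iter(range(10_000)), k=5, rng=rng)
```

With a 10% fraction the stratified sample always keeps one of the ten rare records, whereas a uniform 10% draw of the same data would miss them all roughly a third of the time.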
Avoiding bias
A sample is only useful if it is representative of the population the model will face. Convenience samples, such as only the most recent or easiest-to-reach records, quietly bias the model. If you sample only daytime traffic, the model never sees nighttime behavior and cannot learn it.
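One simple way to spot this kind of bias is to compare group proportions in the sample against the full population. The sketch below is illustrative only: `proportion_gaps`, the `hour` field, and the 08:00 to 20:00 definition of "day" are assumptions chosen to mirror the daytime-traffic example.

```python
from collections import Counter


def proportion_gaps(population, sample, key):
    """Return, per group, the sample share minus the population share."""
    if not population or not sample:
        raise ValueError("population and sample must be non-empty")
    pop = Counter(key(r) for r in population)
    smp = Counter(key(r) for r in sample)
    n_pop, n_smp = len(population), len(sample)
    # Groups missing from the sample show up with a negative gap.
    return {g: smp.get(g, 0) / n_smp - pop[g] / n_pop for g in pop}


def period(record):
    """Hypothetical grouping: 08:00-19:59 counts as day, the rest as night."""
    return "day" if 8 <= record["hour"] < 20 else "night"


# Toy population: 10 records for every hour of the day.
population = [{"hour": h} for h in range(24) for _ in range(10)]

# A convenience sample that only contains daytime traffic.
daytime_only = [r for r in population if period(r) == "day"]

gaps = proportion_gaps(population, daytime_only, period)
# gaps == {"day": 0.5, "night": -0.5}
```

A gap near zero for every group suggests the sample is at least proportionally representative on that attribute. It says nothing about attributes you did not check.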
Sampling and evaluation
Sampling also matters at evaluation time. A test set drawn from a different time period or population than production gives a misleading, often optimistic, read on quality. Keeping the sampling strategy explicit and documented lets others judge whether the conclusions generalize.
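One way to keep the sampling strategy explicit is to store it as a small record next to the dataset. The sketch below is only a suggestion: `SamplingSpec` and all of its fields are hypothetical, and the values shown are placeholders rather than real settings.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SamplingSpec:
    """Hypothetical record of how a sample was drawn."""
    method: str                # e.g. "random", "stratified", "reservoir"
    seed: int                  # random seed, so the draw can be reproduced
    fraction: float            # share of the population that was sampled
    strata_key: Optional[str]  # grouping field, if the method is stratified
    period_start: str          # first date covered by the source data
    period_end: str            # last date covered by the source data


# Placeholder values for illustration only.
spec = SamplingSpec(
    method="stratified",
    seed=0,
    fraction=0.1,
    strata_key="region",
    period_start="2024-01-01",
    period_end="2024-03-31",
)

# Serialize the spec so it can be saved alongside the sampled data.
spec_json = json.dumps(asdict(spec), indent=2, sort_keys=True)
```

Recording the covered period makes it easy for a reader to check whether the test set and production data come from the same time window.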
Key idea
Sampling must produce a representative subset. Stratified sampling helps when uniform random sampling would drop rare groups, and reservoir sampling helps when the data arrives as a stream of unknown length.