Data Sampling Strategies

Why not use everything

You rarely train on every available row. Data is huge, imbalanced, and uneven in quality, so sampling chooses a subset that trains a good model efficiently. The strategy you pick changes the model you get.

Common strategies

Random sampling draws rows uniformly. It preserves the natural distribution in expectation but can leave rare classes with too few examples to learn from.
Stratified sampling draws from each group separately and keeps each group's proportion fixed, so small groups are not lost to chance.
Importance sampling overdraws hard or rare examples, then reweights each one by the ratio of its natural probability to its sampling probability so the estimate stays unbiased.
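
The three strategies above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the toy rows, the 10% fraction, and the 50x draw weight for rare rows are all assumptions chosen for the example.

```python
import random
from collections import Counter, defaultdict


def random_sample(rows, n, seed=0):
    """Draw n rows uniformly, without replacement."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


def stratified_sample(rows, key, fraction, seed=0):
    """Draw the same fraction from every group, keeping at least one row per group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def importance_weighted_mean(rows, value, draw_weight, n, seed=0):
    """Estimate the mean of value(row) over all rows from a non-uniform sample.

    Rows are drawn with replacement, with probability proportional to
    draw_weight(row). Each drawn value is multiplied by
    (natural probability / sampling probability) to undo the skew.
    """
    rng = random.Random(seed)
    weights = [draw_weight(row) for row in rows]
    total = sum(weights)
    natural_p = 1 / len(rows)
    drawn = rng.choices(range(len(rows)), weights=weights, k=n)
    acc = 0.0
    for i in drawn:
        sampling_p = weights[i] / total
        acc += value(rows[i]) * natural_p / sampling_p
    return acc / n


# Hypothetical toy data: 990 common rows and 10 rare rows.
rows = [{"label": "common"} for _ in range(990)] + [{"label": "rare"} for _ in range(10)]

uniform = random_sample(rows, 100)
uniform_counts = Counter(r["label"] for r in uniform)  # rare count varies with the seed

strat = stratified_sample(rows, key=lambda r: r["label"], fraction=0.1)
strat_counts = Counter(r["label"] for r in strat)  # 99 common, 1 rare

# Draw rare rows 50x more often, then correct with weights.
rare_rate = importance_weighted_mean(
    rows,
    value=lambda r: 1.0 if r["label"] == "rare" else 0.0,
    draw_weight=lambda r: 50.0 if r["label"] == "rare" else 1.0,
    n=20_000,
)
```

In this sketch the stratified sample always contains the rare group, while the uniform sample may or may not. The importance-weighted estimate of the rare rate should land close to the true 1% even though roughly a third of the draws are rare rows.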

Trade-offs

More data lowers variance but costs compute, so sampling trades accuracy against budget.
Biased sampling that drops a group from the data teaches the model to ignore that group, harming fairness.
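
One way to put numbers on the variance side of this trade-off is the standard error of a proportion, such as accuracy measured on a sample. The sketch below assumes independent rows and a hypothetical true accuracy of 0.9; the helper name is illustrative.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p estimated from n independent rows."""
    return math.sqrt(p * (1 - p) / n)


# Each 4x increase in sample size halves the standard error.
sizes = (1_000, 4_000, 16_000)
errors = [standard_error(0.9, n) for n in sizes]
for n, se in zip(sizes, errors):
    print(f"n={n:>6}  standard error={se:.4f}")
```

Because the error shrinks with the square root of the sample size, each further gain in precision costs four times the rows, which is why a budget-limited sample is often good enough.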

Practical guidance

Decide the unit you sample, such as rows, users, or sessions, to avoid leaking related examples across train and test.
Match the sampling to the question. To measure overall accuracy, preserve the real distribution. To learn rare events, oversample them.
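
Both pieces of guidance can be combined in one sketch: split by user so related rows never straddle train and test, leave the test set at its natural distribution, and oversample rare rows in the training set only. The field names (`user_id`, `rare`), the 20% test fraction, the hash-based assignment, and the oversampling factor are all assumptions made for illustration.

```python
import hashlib
import random


def assign_split(unit_id, test_fraction=0.2):
    """Deterministically map a sampling unit (here, a user) to 'train' or 'test'."""
    digest = hashlib.sha256(str(unit_id).encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def oversample(rows, is_rare, factor, seed=0):
    """Return rows plus extra copies of rare rows, drawn with replacement,
    so the rare rows appear `factor` times as often as before."""
    rng = random.Random(seed)
    rare = [r for r in rows if is_rare(r)]
    extra = rng.choices(rare, k=len(rare) * (factor - 1)) if rare else []
    return rows + extra


# Hypothetical data: 200 users with 5 sessions each; a few sessions are rare events.
rows = [
    {"user_id": u, "session": s, "rare": (u % 20 == 0 and s == 0)}
    for u in range(200)
    for s in range(5)
]

# Split on the user, not the row, so one user's sessions stay together.
train = [r for r in rows if assign_split(r["user_id"]) == "train"]
test = [r for r in rows if assign_split(r["user_id"]) == "test"]

# Oversample rare events for training only; the test set keeps the real distribution.
train_balanced = oversample(train, is_rare=lambda r: r["rare"], factor=10)
```

Hashing the user id keeps the assignment stable across runs and across new data, so a user who lands in the test set stays there. Any metric computed on `test` still reflects the real event rate because only `train_balanced` was resampled.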

Key idea

Sampling strategy decides which rows train your model, trading compute against accuracy and shaping fairness, so it must match the question you are answering.
