Machine Learning

Data Sampling Strategies

How choosing which rows to train on shapes accuracy, cost, and fairness.

Why not use everything

You rarely train on every available row. Data is huge, imbalanced, and uneven in quality, so sampling chooses a subset that trains a good model efficiently. The strategy you pick changes the model you get.

Common strategies

  • Random sampling draws rows uniformly. It preserves the natural distribution but may starve rare classes.
  • Stratified sampling keeps the proportion of each group fixed, so small groups are not lost to chance.
  • Importance sampling overdraws hard or rare examples, then weights each drawn example by the inverse of its sampling probability so the estimate stays unbiased.
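
The three strategies above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function names, the toy data (990 common rows, 10 rare rows), and the 50x draw weight for rare rows are all assumptions made up for the example.

```python
import random
from collections import Counter, defaultdict

def random_sample(rows, k, rng):
    """Uniform draw without replacement; rare classes may be missed by chance."""
    return rng.sample(rows, k)

def stratified_sample(rows, k, key, rng):
    """Give each group roughly the same share of the sample that it has in the data."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so rounding cannot drop a small group.
        # Because of rounding, the total can differ slightly from k.
        n = max(1, round(k * len(members) / len(rows)))
        sample.extend(rng.sample(members, min(n, len(members))))
    return sample

def importance_sample(rows, k, draw_weight, rng):
    """Draw with replacement in proportion to draw_weight(row).

    Returns (row, correction) pairs, where correction is the uniform
    probability divided by the probability actually used for the draw.
    """
    weights = [draw_weight(row) for row in rows]
    total = sum(weights)
    picked = rng.choices(rows, weights=weights, k=k)
    return [(row, total / (len(rows) * draw_weight(row))) for row in picked]

# Toy data (made up): rows 0..9 belong to the rare class 1, the rest to class 0.
rows = [{"id": i, "label": int(i < 10)} for i in range(1000)]
rng = random.Random(0)

uniform = random_sample(rows, 100, rng)
stratified = stratified_sample(rows, 100, lambda r: r["label"], rng)
# Assumed draw weights: rare rows are 50 times as likely to be drawn.
weighted = importance_sample(rows, 20000, lambda r: 50 if r["label"] else 1, rng)

# The correction-weighted average of the label estimates the true rare-class
# rate (0.01), even though rare rows were drawn far more often than that.
estimate = sum(w * r["label"] for r, w in weighted) / len(weighted)

print(Counter(r["label"] for r in uniform))
print(Counter(r["label"] for r in stratified))
print(round(estimate, 3))
```

With this toy data the stratified sample always holds 99 common rows and 1 rare row, while the uniform sample's rare count varies with the seed. The importance-sampled estimate should land close to 0.01 once the correction weights are applied.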

Trade-offs

  • More data lowers variance but costs compute, so sampling trades accuracy against budget.
  • Biased sampling that drops a group from the data teaches the model to ignore that group, harming fairness.
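
One way to see the accuracy-versus-budget trade is the standard error of an estimated proportion, which shrinks with the square root of the sample size. The sketch below assumes independent rows and a true rate of 0.5, chosen only for illustration. It describes the noise in a measured rate, not the full behavior of a trained model.

```python
import math

def standard_error(p, n):
    """Standard error of a proportion estimated from n independent rows."""
    return math.sqrt(p * (1 - p) / n)

# Each step uses four times as many rows as the one before.
sizes = [100, 400, 1600]
errors = [standard_error(0.5, n) for n in sizes]

for n, se in zip(sizes, errors):
    print(n, round(se, 4))
```

Quadrupling the rows only halves the error, so each extra unit of compute buys less precision than the last.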

Practical guidance

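  • Decide the unit you sample, such as rows, users, or sessions. If one user contributes many rows and you sample by row, that user's data can land in both train and test, leaking related examples and inflating test scores.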
  • Match the sampling to the question. To measure overall accuracy, preserve the real distribution. To learn rare events, oversample them.
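
Choosing the sampling unit can be sketched as a split that assigns whole units, here users, to one side only. The helper name `split_by_unit`, the `user` and `session` fields, and the toy data are assumptions for illustration.

```python
import random

def split_by_unit(rows, unit_key, test_fraction, seed=0):
    """Split rows so every unit (e.g. a user) ends up entirely in train or test."""
    units = sorted({unit_key(row) for row in rows})
    rng = random.Random(seed)
    rng.shuffle(units)
    n_test = max(1, round(len(units) * test_fraction))
    test_units = set(units[:n_test])
    train = [row for row in rows if unit_key(row) not in test_units]
    test = [row for row in rows if unit_key(row) in test_units]
    return train, test

# Toy data (made up): 10 users with 3 sessions each.
rows = [{"user": f"u{u}", "session": s} for u in range(10) for s in range(3)]
train, test = split_by_unit(rows, lambda r: r["user"], test_fraction=0.2)

train_users = {r["user"] for r in train}
test_users = {r["user"] for r in test}
print(len(train), len(test), train_users & test_users)
```

Because users rather than rows are shuffled, no user appears on both sides, so the test set measures performance on people the model has not seen.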

Key idea

Sampling strategy decides which rows train your model, trading compute against accuracy and shaping fairness, so it must match the question you are answering.

Check yourself

1. What does importance sampling do?

2. Why choose the sampling unit carefully?