Data beats cleverness
In most production systems, more and cleaner data outperforms a fancier model. Plan data deliberately rather than treating it as given.
Sources of labels
- Natural labels: the product produces them, such as clicks or purchases
- Human annotation: explicit labeling, accurate but slow and costly
- Weak supervision: heuristics or rules generate noisy labels at scale
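As a rough illustration of weak supervision, the sketch below combines a few hand-written rules by majority vote. The rule names, keywords, and the "complaint"/"praise" label set are all hypothetical and chosen only for the example; production systems often weight rules by estimated accuracy instead of counting votes equally.

```python
from collections import Counter

ABSTAIN = None  # a rule returns this when it has no opinion

# Hypothetical labeling rules for support messages (illustrative only).
def lf_mentions_refund(text):
    return "complaint" if "refund" in text.lower() else ABSTAIN

def lf_mentions_thanks(text):
    return "praise" if "thank" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return "complaint" if text.count("!") >= 3 else ABSTAIN

LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_thanks, lf_many_exclamations]

def weak_label(text, labeling_functions=LABELING_FUNCTIONS):
    """Return the majority label among non-abstaining rules, or ABSTAIN."""
    votes = [lf(text) for lf in labeling_functions]
    votes = [v for v in votes if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    # A tie between the top two labels is treated as "no label".
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]
```

Examples that end up with ABSTAIN can be dropped or routed to human annotation, which keeps the noisy labels from covering cases the rules cannot decide.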
Watch for bias and leakage
- Selection bias: training data does not match production traffic
- Label leakage: a feature secretly encodes the answer
- Feedback loops: the model shapes the very data it later trains on
A recommender that only logs items it showed never learns about items it hid. Add exploration or randomization to break this loop.
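One common way to add exploration is an epsilon-greedy policy: most of the time show the top-scored item, and occasionally show a uniformly random one. The sketch below is a minimal version under that assumption; `choose_item`, `scores`, and `epsilon` are hypothetical names, not part of any particular recommender. It also returns the probability with which the chosen item was shown, since logging that value allows later reweighting of the collected data.

```python
import random

def choose_item(scores, epsilon=0.1, rng=random):
    """Pick an item to show and return (item, propensity).

    scores: dict mapping item id -> model score (assumed non-empty).
    epsilon: probability of showing a uniformly random item instead of the best.
    """
    items = list(scores)
    best = max(items, key=scores.get)
    if rng.random() < epsilon:
        chosen = rng.choice(items)  # explore: any item, including the best
    else:
        chosen = best               # exploit: the top-scored item
    # Probability that this policy shows `chosen`:
    # epsilon / n from exploration, plus (1 - epsilon) if it is the best item.
    propensity = epsilon / len(items) + (1 - epsilon if chosen == best else 0.0)
    return chosen, propensity
```

With `epsilon=0.0` the policy never explores and low-scored items are never logged, which is exactly the loop described above. A small positive `epsilon` trades a little short-term quality for data about items the model would otherwise hide.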
Freshness and volume
Decide how recent data must be and how much you need. Slow-moving domains tolerate stale data; fast-moving ones need streaming pipelines.
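The two decisions can be written down as explicit parameters. The sketch below is a minimal example, assuming each record carries a `timestamp` field; the function name, `max_age`, and `min_rows` are hypothetical and the thresholds are placeholders, not recommendations.

```python
from datetime import datetime, timedelta, timezone

def select_training_window(records, now, max_age, min_rows):
    """Keep records no older than max_age and report whether volume suffices.

    records: list of dicts with a timezone-aware "timestamp" (assumed schema).
    Returns (fresh_records, has_enough_rows).
    """
    cutoff = now - max_age
    fresh = [r for r in records if r["timestamp"] >= cutoff]
    return fresh, len(fresh) >= min_rows

# Example data: three records of different ages.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)
records = [
    {"id": 1, "timestamp": now - timedelta(days=1)},
    {"id": 2, "timestamp": now - timedelta(days=10)},
    {"id": 3, "timestamp": now - timedelta(days=45)},
]
fresh, enough = select_training_window(
    records, now, max_age=timedelta(days=30), min_rows=2
)
```

If the freshness window leaves too few rows, the choice is between widening the window and accepting staler data, or collecting data faster.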
Key idea
Treat data collection as a design problem: control bias, prevent leakage, and break feedback loops before they poison the model.