Synthetic Data Generation
Sometimes real data is missing, too sensitive, or too rare. Synthetic data is artificially generated data meant to stand in for or supplement real examples.
How it is made
- Simulators render scenes or events from rules, an approach common in robotics and self-driving.
- Generative models learn the distribution of real data and sample new examples from it.
- Rule-based generators fill structured tables with plausible records.
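As a rough illustration of the last two approaches, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for the example: the order schema, the value ranges, and the function names generate_records and fit_and_sample are hypothetical, and a single fitted normal distribution stands in for what would usually be a far richer generative model.

```python
import random
import statistics


def generate_records(n, seed=0):
    """Rule-based generator: fill a table with plausible order records.

    The schema and value ranges are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    countries = ["DE", "FR", "US"]
    records = []
    for i in range(n):
        quantity = rng.randint(1, 5)
        unit_price = round(rng.uniform(5.0, 50.0), 2)
        records.append({
            "order_id": i + 1,
            "country": rng.choice(countries),
            "quantity": quantity,
            # Rule: the total must be consistent with quantity and price.
            "total": round(quantity * unit_price, 2),
        })
    return records


def fit_and_sample(real_values, n, seed=0):
    """Toy generative model: fit a normal distribution to real values,
    then draw n new values from the fitted distribution."""
    rng = random.Random(seed)
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


if __name__ == "__main__":
    for row in generate_records(3):
        print(row)
    print(fit_and_sample([9.8, 10.1, 10.4, 9.9, 10.2], n=3))
```

The rule-based function encodes domain knowledge directly, while the sampler only reproduces what it measured in the real values, so it inherits both their patterns and their blind spots.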
Where it shines
Synthetic data helps when:
- Rare events, like specific failure modes, almost never appear in logs.
- Privacy rules forbid sharing real records, so a synthetic stand-in is safer.
- A new product has no history yet to learn from.
The fidelity gap
The central risk is the reality gap: the mismatch between what a generator produces and what the real world looks like. A model trained only on synthetic data may exploit artifacts that do not exist in the real world and then fail in deployment. Teams mitigate this by mixing synthetic with real data, by making generators more realistic, and by always validating on real held-out examples. Synthetic data is a supplement, rarely a full replacement.
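The mixing and validation steps can be sketched in a few lines of Python. This is one possible arrangement under stated assumptions, not a prescribed recipe: the function name split_and_mix, the 20 percent holdout, and the cap on the synthetic share are all hypothetical choices for the example.

```python
import random


def split_and_mix(real, synthetic, holdout_fraction=0.2,
                  max_synthetic_ratio=1.0, seed=0):
    """Reserve a real-only holdout set, then mix the remaining real
    examples with a capped number of synthetic ones for training.

    max_synthetic_ratio is the largest allowed ratio of synthetic to
    real training examples (a hypothetical knob for this sketch).
    """
    rng = random.Random(seed)
    real = list(real)
    rng.shuffle(real)

    # The holdout set is drawn from real data only and never used in training.
    n_holdout = max(1, int(len(real) * holdout_fraction))
    holdout = real[:n_holdout]
    real_train = real[n_holdout:]

    # Cap the synthetic share so it supplements rather than replaces real data.
    n_synth = min(len(synthetic), int(len(real_train) * max_synthetic_ratio))
    synth_sample = rng.sample(list(synthetic), n_synth)

    train = real_train + synth_sample
    rng.shuffle(train)
    return train, holdout


if __name__ == "__main__":
    real = [("real", i) for i in range(100)]
    synthetic = [("synthetic", i) for i in range(500)]
    train, holdout = split_and_mix(real, synthetic, max_synthetic_ratio=0.5)
    print(len(train), len(holdout))
```

Keeping the holdout purely real is the point of the design: a model that has learned generator artifacts will still score well on synthetic validation data, so only real examples can expose the gap.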
Key idea
Synthetic data fills gaps where real data is rare, private, or not yet available, but the reality gap means it should supplement real data and be validated against it.