When real data is missing
Synthetic data generation creates artificial examples to cover cases that real data lacks. It can fill in rare scenarios, balance classes, or sidestep privacy limits on real records.
How it is produced
- Simulation, where a physics or game engine renders labeled scenes, common in robotics and self-driving.
- Generative models, where a learned model samples new examples that resemble the real distribution.
- Rule-based synthesis, where templates produce structured records like transactions or forms.
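As a rough illustration of rule-based synthesis, the sketch below fills a transaction template from a seeded random generator. The field names, merchant categories, amount ranges, and the `fraud_rate` parameter are all assumptions made up for this example, not rules from any real system.

```python
import random

# Hypothetical merchant categories; a real template would encode domain rules.
MERCHANTS = ["grocery", "fuel", "online", "travel"]

def make_transaction(rng, fraud_rate=0.3):
    """Build one synthetic transaction from a simple template."""
    is_fraud = rng.random() < fraud_rate
    # Assumed rule: fraudulent transactions skew toward larger amounts.
    amount = rng.uniform(500, 5000) if is_fraud else rng.uniform(5, 300)
    return {
        "merchant": rng.choice(MERCHANTS),
        "amount": round(amount, 2),
        "is_fraud": is_fraud,
    }

def make_dataset(n, seed=0, fraud_rate=0.3):
    """Generate n records; the seed makes the output reproducible."""
    rng = random.Random(seed)
    return [make_transaction(rng, fraud_rate) for _ in range(n)]

records = make_dataset(1000)
```

Raising `fraud_rate` above the real-world rate is one way such a template can balance a rare class, at the cost of producing records that only reflect the rules written into it.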
The realism gap
- Synthetic data often differs subtly from real data, a mismatch called the domain gap or sim-to-real gap.
- A model that trains only on synthetic data may rely on artifacts absent in reality and fail when deployed.
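The failure mode above can be sketched with a deliberately tiny toy. Here the "model" is a one-feature decision stump, and the feature names `signal` and `artifact` are hypothetical: `artifact` stands in for a generation quirk that leaks the label in synthetic data but is absent in real data.

```python
def make_examples(n, with_artifact):
    """Build toy binary examples; with_artifact=True mimics synthetic data."""
    examples = []
    for i in range(n):
        label = i % 2
        # "signal" is a genuine but imperfect cue: wrong on every 10th example.
        signal = label if i % 10 != 0 else 1 - label
        # "artifact" leaks the label in synthetic data and is constant in real data.
        artifact = label if with_artifact else 0
        examples.append({"signal": signal, "artifact": artifact, "label": label})
    return examples

def accuracy(examples, feature):
    """Accuracy of predicting the label directly from one feature."""
    return sum(ex[feature] == ex["label"] for ex in examples) / len(examples)

def fit_stump(train):
    """Pick whichever single feature best predicts the label on the training set."""
    return max(["signal", "artifact"], key=lambda f: accuracy(train, f))

synthetic = make_examples(100, with_artifact=True)
real = make_examples(100, with_artifact=False)

chosen = fit_stump(synthetic)                # picks "artifact"
synthetic_acc = accuracy(synthetic, chosen)  # 1.0 on synthetic data
real_acc = accuracy(real, chosen)            # 0.5 on real data, no better than chance
```

In this toy the genuine `signal` feature would have reached 0.9 on real data, but training on synthetic data alone gave the stump no reason to prefer it.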
Using it safely
- Mix synthetic with real data rather than replacing it.
- Apply domain randomization so the model cannot lean on any single synthetic quirk.
- Always validate on real held-out data, since metrics computed on synthetic data can be misleadingly optimistic.
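The three practices above can be combined in a small data pipeline. This is only a sketch under stated assumptions: the randomized parameter names and ranges (`brightness`, `noise_std`, `texture_id`), the 20% holdout, and the 50% synthetic share are illustrative choices, and the records are placeholders rather than real samples.

```python
import random

def split_real(real, holdout_fraction=0.2, seed=0):
    """Set aside real examples for validation before any mixing happens."""
    rng = random.Random(seed)
    shuffled = real[:]
    rng.shuffle(shuffled)
    n_holdout = max(1, int(len(shuffled) * holdout_fraction))
    return shuffled[n_holdout:], shuffled[:n_holdout]

def randomized_params(rng):
    """Domain randomization: draw nuisance settings fresh for every sample."""
    return {
        "brightness": rng.uniform(0.5, 1.5),
        "noise_std": rng.uniform(0.0, 0.1),
        "texture_id": rng.randrange(100),
    }

def make_synthetic(n, seed=1):
    """Placeholder generator: each record only carries its randomized settings."""
    rng = random.Random(seed)
    return [{"source": "synthetic", "params": randomized_params(rng)} for _ in range(n)]

def mix(real_train, synthetic, synth_fraction=0.5, seed=2):
    """Add synthetic examples so they form about synth_fraction of the training set."""
    if not 0 <= synth_fraction < 1:
        raise ValueError("synth_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve s / (r + s) = f for s, then cap at what is available.
    n_synth = int(len(real_train) * synth_fraction / (1 - synth_fraction))
    n_synth = min(n_synth, len(synthetic))
    train = real_train + rng.sample(synthetic, n_synth)
    rng.shuffle(train)
    return train

real = [{"source": "real", "id": i} for i in range(100)]
real_train, real_holdout = split_real(real)
train = mix(real_train, make_synthetic(500))
```

Splitting off `real_holdout` first keeps the validation set purely real and untouched by the mixing step, so any metric reported on it reflects real-world behavior rather than synthetic quirks.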
A warning on feedback loops
- Repeatedly training generative models on their own synthetic output can degrade quality, an effect called model collapse.
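A toy simulation can give some intuition for this effect. The sketch below assumes the simplest possible "generative model", a Gaussian fitted by mean and population standard deviation, and refits it each generation using only samples drawn from the previous fit. The sample size, generation count, and starting distribution are arbitrary choices for illustration, and real generative models degrade in more complicated ways.

```python
import random
import statistics

def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Refit a Gaussian to its own samples and record the fitted spread each time."""
    rng = random.Random(seed)
    # Generation 0: "real" data from a standard normal (an assumed toy distribution).
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    stds = []
    for _ in range(generations + 1):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.fmean(data)
        sigma = statistics.pstdev(data)
        stds.append(sigma)
        # The next generation sees only samples from the fitted model.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return stds

stds = simulate_collapse()
initial_spread, final_spread = stds[0], stds[-1]
```

With small samples, estimation error compounds across generations and the fitted spread tends to shrink toward zero, so later generations cover less and less of the original distribution. Feeding fresh real data into each generation is the usual way to interrupt this loop.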
Key idea
Synthetic data fills gaps real data cannot, but the domain gap means it should be mixed with real data and always validated on real held-out examples.