Making data instead of collecting it
High-quality labeled data is scarce and expensive. Synthetic data is generated rather than collected, often by prompting a strong model to produce instructions, answers, or labeled examples that are then used to fine-tune another model.
Common patterns
- Self-Instruct-style generation asks a model to invent diverse tasks and solutions.
- Distillation has a stronger teacher model produce targets for a smaller student.
- Augmentation paraphrases or perturbs real examples to expand coverage.
These patterns can produce large datasets quickly and cover rare cases on demand.
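As a rough illustration of the distillation pattern, the sketch below pairs each prompt with a teacher's output to form training targets for a student. The names `build_distillation_set` and `toy_teacher` are hypothetical, and `toy_teacher` is only a stand-in for a call to a stronger model.

```python
from typing import Callable, Dict, List


def build_distillation_set(
    prompts: List[str],
    teacher: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Pair each prompt with the teacher's output as the training target."""
    return [{"prompt": p, "target": teacher(p)} for p in prompts]


def toy_teacher(prompt: str) -> str:
    # Hypothetical stand-in: a real teacher would be a call to a strong model.
    return prompt.upper()


pairs = build_distillation_set(["add 2 and 3", "name a prime"], toy_teacher)
```

Passing the teacher in as a plain callable keeps the dataset-building step independent of whichever model or client actually produces the targets.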
The pipeline
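One minimal way to arrange such a pipeline, assuming three stages (generate candidates, verify each one, drop duplicates), is sketched below. The stage breakdown and the names `run_pipeline`, `normalize`, and `arithmetic_verifier` are assumptions made for illustration. The toy verifier only checks simple addition questions, and a real pipeline would substitute a task-appropriate check.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (question, answer)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def run_pipeline(
    candidates: Iterable[Example],
    verify: Callable[[Example], bool],
) -> List[Example]:
    """Keep candidates that pass verification and are not duplicates."""
    seen = set()
    kept: List[Example] = []
    for question, answer in candidates:
        if not verify((question, answer)):
            continue  # drop examples the verifier rejects
        key = normalize(question)
        if key in seen:
            continue  # drop duplicates of an already kept question
        seen.add(key)
        kept.append((question, answer))
    return kept


def arithmetic_verifier(example: Example) -> bool:
    # Toy verifier (assumption): the question is "a + b" and the answer must be the sum.
    question, answer = example
    left, _, right = question.partition("+")
    try:
        return int(left) + int(right) == int(answer)
    except ValueError:
        return False


# Stand-in for generator output: one duplicate and one wrong answer.
candidates = [("2 + 3", "5"), ("2 +  3", "5"), ("4 + 4", "9"), ("10 + 1", "11")]
clean = run_pipeline(candidates, arithmetic_verifier)
```

Here the duplicate question and the incorrect answer are both discarded, leaving two verified examples.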
Risks to manage
Synthetic data inherits the biases and errors of its generator and often lacks diversity. When a model is trained repeatedly on its own outputs, these problems compound and can lead to model collapse. Strong filtering, verification, and mixing with real data are essential. The generator should usually be at least as capable as the target model on the skills being taught.
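Mixing with real data can be sketched as follows, assuming all real examples are kept and synthetic ones are sampled until they make up a chosen fraction of the final set. The function name `mix_datasets` and the `synthetic_fraction` parameter are hypothetical, and the right fraction depends on the task.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_datasets(
    real: Sequence[T],
    synthetic: Sequence[T],
    synthetic_fraction: float,
    seed: int = 0,
) -> List[T]:
    """Keep every real example and add sampled synthetic ones.

    synthetic_fraction is the target share of synthetic examples in the
    result; it is capped by how many synthetic examples are available.
    """
    if not 0.0 <= synthetic_fraction < 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1)")
    rng = random.Random(seed)
    # Solve n_synth / (len(real) + n_synth) = synthetic_fraction for n_synth.
    n_synth = round(len(real) * synthetic_fraction / (1.0 - synthetic_fraction))
    n_synth = min(n_synth, len(synthetic))
    mixed = list(real) + rng.sample(list(synthetic), n_synth)
    rng.shuffle(mixed)
    return mixed


real = [f"real-{i}" for i in range(6)]
synthetic = [f"synth-{i}" for i in range(10)]
mixed = mix_datasets(real, synthetic, synthetic_fraction=0.5)
```

Keeping the real data whole and capping the synthetic share is one simple way to limit how much generator error and lost diversity reaches the fine-tuned model.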
Key idea
Synthetic data is model-generated training data that scales fine-tuning cheaply, but it requires filtering and mixing with real data to avoid inheriting generator errors and collapsing diversity.