At some point, models stop learning — and start echoing.
TL;DR
- Synthetic data is powerful, but recursive synthetic data creates feedback loops
- Errors don't disappear — they compound
- Research now shows this can lead to model collapse
- The future isn't all synthetic or all human, but hybrid systems anchored in reality
The core idea (in one sentence)
When AI systems are trained at scale on AI-generated content, they risk becoming self-referential ecosystems where errors, biases, and assumptions are reinforced rather than corrected.
This is no longer just a theoretical concern — it is now empirically demonstrated in the literature.
Why synthetic data is so appealing
Synthetic data is often presented as a silver bullet:
- Cheaper than real-world data collection
- Easier and faster to label
- Scales almost infinitely
- Useful for rare or imbalanced classes
- Sometimes safer for privacy
In narrow, controlled domains, synthetic data can be extremely effective.
Recent work suggests it can meaningfully improve coverage when paired with real data rather than replacing it outright (Goyal et al., 2024).
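To make the "paired with real data" point concrete, here is a minimal sketch in Python (NumPy + scikit-learn). It is a hypothetical illustration, not a recipe from the cited work: the class locations, sample counts, and the choice of a plain Gaussian generator are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Real data: a common class (1,000 points) and a rare class (30 points).
X_common = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_rare = rng.normal(loc=3.0, scale=1.0, size=(30, 2))

# Fit a simple Gaussian to the rare class and sample synthetic points from it.
mu, cov = X_rare.mean(axis=0), np.cov(X_rare, rowvar=False)
X_synth = rng.multivariate_normal(mu, cov, size=300)

# Hybrid training set: real data anchors the distribution,
# synthetic samples fill in the under-represented class.
X = np.vstack([X_common, X_rare, X_synth])
y = np.concatenate([np.zeros(1000), np.ones(30), np.ones(300)])

clf = LogisticRegression().fit(X, y)

# Accuracy on a fresh, all-rare-class sample = recall on the rare class.
X_test = rng.normal(loc=3.0, scale=1.0, size=(200, 2))
print("rare-class recall on fresh real data:", clf.score(X_test, np.ones(200)))
```

The point of the sketch is the mix: the real samples keep the synthetic ones honest, while the synthetic ones would, on their own, only ever reflect the fitted model's assumptions.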
The hidden problem: feedback loops
The issue isn't synthetic data itself.
It's recursive synthetic data — where model outputs are repeatedly reused as training inputs for future models.
Recent work published in Nature demonstrates that this process can cause progressive degradation in model quality, a phenomenon the authors refer to as model collapse (Shumailov et al., 2024).
┌─────────────────┐
│  Human Reality  │
└────────┬────────┘
         ▼
┌─────────────────┐
│ Data Collection │
└────────┬────────┘
         ▼
┌─────────────────┐
│ Model Training  │◄──────────┐
└────────┬────────┘           │
         ▼                    │
┌─────────────────┐           │
│  Model Outputs  │───────────┘
└─────────────────┘   (Reused as training data)
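To see why the loop in the diagram degrades quality, here is a toy simulation. It is a sketch of the general mechanism, not the experimental setup from Shumailov et al.: the "model" is just a fitted Gaussian, and the sample size and number of generations are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "human reality" is a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=100)

# Each generation fits a Gaussian to the previous generation's output,
# then produces its own training data by sampling from that fit.
for gen in range(1, 201):
    mu, sigma = data.mean(), data.std()      # "train" the model
    data = rng.normal(mu, sigma, size=100)   # reuse its outputs as the next dataset
    if gen % 50 == 0:
        print(f"generation {gen:3d}: mean = {mu:+.3f}, std = {sigma:.3f}")

# In most runs the fitted std decays toward zero while the mean wanders away
# from 0: the tails of "reality" are lost and small estimation errors compound
# instead of averaging out.
```

Real pipelines are vastly more complex, but the mechanism is the same: each generation inherits the previous generation's estimation error and adds its own.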
Applied perspective: my own synthetic modeling work
I've explored these dynamics directly through a series of synthetic data and modeling experiments published on my site:
👉 Synthetic Models & Experiments
These projects examine how models behave when trained on combinations of real and algorithmically generated data, including effects on calibration, distributional fidelity, and downstream performance.
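As one concrete example of a distributional-fidelity check, the sketch below (a hypothetical illustration, not the actual code behind those projects) compares real data against a faithful synthetic generator and a "collapsed" one using the 1-D Wasserstein distance from SciPy. The distributions and the clipping threshold are invented for the example.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(7)

# Real data: a skewed, heavy-tailed quantity (e.g., a cost or duration).
real = rng.lognormal(mean=0.0, sigma=0.5, size=5000)

# A faithful generator (same parameters) vs. a collapsed one (narrower, tails clipped).
synth_good = rng.lognormal(mean=0.0, sigma=0.5, size=5000)
synth_collapsed = np.clip(rng.lognormal(mean=0.0, sigma=0.25, size=5000), None, 2.0)

# 1-D Wasserstein distance as a simple proxy for distributional fidelity:
# larger distance = the synthetic data has drifted further from reality.
print("faithful generator :", wasserstein_distance(real, synth_good))
print("collapsed generator:", wasserstein_distance(real, synth_collapsed))
```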
Final takeaway
Synthetic data will absolutely shape the future of AI.
But if we treat AI output as a replacement for human knowledge rather than a multiplier, we risk building systems that are highly articulate — yet increasingly detached from reality.
The future isn't AI training on AI.
It's AI training on reality, with synthetic data as scaffolding.
References
- Shumailov, I., et al. (2024). AI models collapse when trained on recursively generated data. Nature.
- El Amine Seddik, M., et al. (2024). How Bad is Training on Synthetic Data? arXiv:2404.05090.
- Wyllie, S., Shumailov, I., & Papernot, N. (2024). Fairness feedback loops: Training on synthetic data amplifies bias. FAccT '24.
- Goyal, M., et al. (2024). Synthetic data generation: Generative AI techniques. Electronics, 13(17).
- Koul, A., Duran, D., & Hernandez-Boussard, T. (2025). Synthetic data, synthetic trust. Patterns.
- Zhang, Z., et al. (2022). Keeping synthetic patients on track. NPJ Digital Medicine.
