Something strange is happening in the world of artificial intelligence, and most people aren’t talking about it.
Over the last couple of years, the internet has been quietly flooded with AI-generated content. Blog posts written by ChatGPT. Product descriptions churned out by Claude. News summaries, social media captions, Stack Overflow answers — all generated by models, published by humans, indexed by Google, and now sitting right there in the training data for the next generation of AI models.
And that’s where things start to go wrong.
When AI Eats Itself
Researchers have a term for this: model collapse. It sounds dramatic, but the concept is surprisingly simple. When a language model is trained on data that was itself generated by a language model, it starts to lose touch with the messy, unpredictable richness of real human language. It gets blander. More repetitive. Confidently wrong in eerily similar ways across different systems.
Think of it like photocopying a photocopy. Each generation loses a little sharpness. Do it enough times, and you can barely read the original.
A 2023 paper from researchers at Oxford and elsewhere put it bluntly: models trained on AI-generated text show a gradual degradation in output quality, eventually collapsing into repetitive, low-diversity responses. The diversity of language (slang, regional phrases, weird human tangents) starts disappearing from outputs because it was never in the training data to begin with.
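To make the photocopy analogy concrete, here is a toy sketch in Python. It is not the setup from the Oxford paper, just an illustration under simple assumptions: a "model" is nothing more than a table of token frequencies, "training" means counting tokens in a finite corpus, and each new corpus is sampled entirely from the previous model. The function name `simulate_collapse` and every parameter are made up for this example.

```python
import random
from collections import Counter


def simulate_collapse(vocab_size=200, sample_size=300, generations=20, seed=0):
    """Toy model-collapse loop: return the number of distinct tokens
    that appear in each generation's corpus."""
    rng = random.Random(seed)

    # Generation 0 stands in for "human" text: a Zipf-like distribution
    # with a few common tokens and a long tail of rare ones (an assumption).
    tokens = list(range(vocab_size))
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]

    distinct_per_generation = []
    for _ in range(generations):
        # The current model "writes" a finite corpus.
        corpus = rng.choices(tokens, weights=weights, k=sample_size)
        counts = Counter(corpus)
        distinct_per_generation.append(len(counts))

        # The next model is "trained" only on that corpus: its distribution
        # is the empirical frequency table, so unseen tokens are gone for good.
        tokens = list(counts.keys())
        weights = [counts[t] for t in tokens]

    return distinct_per_generation


if __name__ == "__main__":
    print(simulate_collapse())
```

The list it prints can only stay flat or shrink, because a rare token that misses one generation's corpus has no way back in. That is the long tail (the slang and the tangents) wearing away one copy at a time.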
The Web Is Already Saturated
Here’s the uncomfortable reality. As of 2024, estimates suggest that somewhere between 15% and 60% of text on the open web may have some AI involvement, depending on how you count. That number is only going up.
AI content farms are producing thousands of articles a day. SEO agencies are using generative tools at scale. Even well-meaning writers are using AI to “polish” drafts before publishing. All of that ends up in the corpus that future models scrape and learn from.
The next GPT-5 or Gemini 3 won’t just be learning from Shakespeare and Reddit arguments and New York Times op-eds. It will be learning from the AI-written blog post that ranked #1 for “best air fryer 2025.”
That should worry us.
It’s Not Just About Quality
The degradation isn’t only about writing getting duller, though that’s real. The deeper problem is that models trained heavily on AI output start inheriting the biases and blind spots of earlier models and amplifying them.
If GPT-3 slightly overrepresented certain political framings, or had gaps in how it discussed certain cultures, those patterns get baked into the web content it inspired. Then GPT-4 trains on that content. Then GPT-5 trains on GPT-4’s internet. The original skew doesn’t just persist; it compounds.
You’re not building on human knowledge anymore. You’re building on a reflection of a reflection of a reflection.
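The compounding is easy to see with some back-of-the-envelope arithmetic. The sketch below is a toy, not a measurement of any real model: it assumes a single framing that humans use 50% of the time, a model that nudges the odds of that framing up by a fixed factor, and a web corpus that is some mix of human and AI text. The names `tilt`, `skew_over_generations`, `bias`, and `ai_fraction` are all invented for illustration.

```python
def tilt(share, bias):
    """Multiply the odds of a framing by `bias` and return the new share."""
    odds = share / (1.0 - share) * bias
    return odds / (1.0 + odds)


def skew_over_generations(human_share=0.5, bias=1.2, ai_fraction=1.0, generations=5):
    """Return the share of the favored framing in each model generation's output.

    ai_fraction is the (assumed) portion of the next training corpus
    that was written by the previous model rather than by humans.
    """
    shares = []
    corpus_share = human_share
    for _ in range(generations):
        # The model slightly overrepresents the framing it was trained on.
        model_share = tilt(corpus_share, bias)
        shares.append(model_share)
        # The next corpus blends fresh human text with the model's output.
        corpus_share = (1.0 - ai_fraction) * human_share + ai_fraction * model_share
    return shares


if __name__ == "__main__":
    print(skew_over_generations(ai_fraction=1.0))  # fully AI-written web
    print(skew_over_generations(ai_fraction=0.0))  # fully human-written web
```

With `ai_fraction=1.0` the share climbs every generation, because each model applies its nudge on top of the last one's. With `ai_fraction=0.0` it stays put: the same small bias, but nothing for it to stack on.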
Is There a Way Out?
Some labs are already thinking about this. Synthetic data pipelines, where companies deliberately generate and curate their own training data rather than scraping the open web, are one response. Others are investing heavily in “data provenance,” trying to tag and track whether a piece of text was human-written or machine-generated.
But those are partial solutions to a structural problem. The open web, which has been the backbone of LLM training for years, is changing in ways that can’t easily be reversed. There’s no button to press that separates the human-written internet from the AI-written one.
Some researchers argue the answer is getting back to fundamentals: books, academic papers, verified journalism, primary sources. The kind of content that takes time and expertise to produce, and that AI hasn’t yet drowned out completely.
The Bigger Picture
There’s something almost poetic about this situation. We built AI to learn from humanity. Now, increasingly, it’s learning from itself and slowly forgetting what humans actually sound like.
It’s not an apocalypse. Models won’t suddenly stop working. But the gradual drift toward sameness, toward confident mediocrity, toward outputs that feel vaguely familiar but somehow hollow: that’s already happening. Most people just haven’t noticed yet.
The question isn’t whether this is a problem. It clearly is.
The question is whether we’ll take it seriously before the next generation of AI is already trained on the last generation’s mistakes.
What do you think: is model collapse a real threat, or overhyped? Drop your thoughts in the comments.