The most valuable commodity in artificial intelligence is not compute, not algorithms, not even talent. It is data — specifically, the vast corpus of human-generated text, images, and code that taught large language models to mimic human intelligence in the first place. And that commodity is becoming contaminated.

The problem is simple to state: AI systems trained on internet data are now producing content that floods back onto the internet, where it is likely to be scraped to train the next generation of AI systems. Researchers call the resulting failure "model collapse": a gradual degradation that occurs when synthetic data recursively pollutes training sets. Each generation of models trained on AI-generated content drifts slightly further from the statistical distribution of genuine human expression, like a photocopy of a photocopy slowly losing fidelity.
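
The photocopy effect can be reproduced in miniature. The sketch below is a toy illustration, not a model of how any real lab trains: it assumes the "model" is nothing more than a normal distribution fitted to a handful of samples, and that each generation trains only on the previous generation's output. The function name and its parameters (`simulate_recursive_fitting`, `generations`, `sample_size`, `seed`) are hypothetical choices made for this example.

```python
import random
import statistics


def simulate_recursive_fitting(generations=500, sample_size=10, seed=0):
    """Toy stand-in for recursive training on synthetic data.

    Each generation fits a normal distribution to its training set, then
    produces the next training set by sampling from that fit alone.
    Returns the fitted standard deviation recorded at every generation.
    """
    rng = random.Random(seed)

    # Generation 0 trains on "human" data: samples from a standard normal.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]

    spreads = []
    for _ in range(generations + 1):
        mu = statistics.fmean(data)      # fitted mean
        sigma = statistics.stdev(data)   # fitted spread (sample std dev)
        spreads.append(sigma)
        # The next generation never sees the original data, only samples
        # drawn from the model that was just fitted.
        data = [rng.gauss(mu, sigma) for _ in range(sample_size)]
    return spreads


if __name__ == "__main__":
    spreads = simulate_recursive_fitting()
    for generation in (0, 100, 250, 500):
        print(f"generation {generation:3d}: std = {spreads[generation]:.3e}")
```

Small estimation errors compound from one generation to the next, and with samples this small the fitted spread tends to shrink toward zero rather than recover. Real training pipelines are vastly more complex, but the toy captures the direction of the drift: variety is what gets lost.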

The contamination timeline

The scale of synthetic content online has grown faster than most observers anticipated. By conservative estimates, a substantial portion of new English-language text published on the open web now originates from language models. This includes everything from product descriptions and news summaries to social media posts and academic paper drafts. The watermarking techniques that some AI companies have implemented cover only a fraction of this output, and they can often be circumvented by something as simple as paraphrasing the text.

For AI labs, this creates a genuine strategic dilemma. The largest models require training datasets measured in trillions of tokens, and high-quality human-generated text is finite. The obvious sources (Wikipedia, books, academic papers, quality journalism) have already been heavily mined. The frontier of available training data increasingly consists of content whose provenance is uncertain at best.

Why human data is irreplaceable

Synthetic data is not inherently useless. For certain narrow tasks — mathematical reasoning, code generation, structured problem-solving — carefully curated AI-generated examples can improve model performance. But for the broader capabilities that make language models useful, human data remains essential in ways that are poorly understood.

Human text encodes the full messiness of human cognition: the contradictions, the cultural context, the implicit knowledge that comes from living in physical bodies in a social world. When models train primarily on other models' output, they converge toward a kind of average — grammatically correct, superficially coherent, but lacking the surprising edges and genuine novelty that characterize human thought. The result is prose that reads like it was written by a very competent committee.

Some researchers have begun treating pre-2022 internet archives as a kind of digital fossil record — pristine datasets from before the synthetic flood. The commercial value of verified human-generated content is rising accordingly, creating new economic incentives around data provenance that did not exist five years ago.

The race for authentic signal

Major AI labs are responding with varied strategies. Some are licensing proprietary datasets directly from publishers and platforms, paying for guaranteed human authorship. Others are investing in human feedback at unprecedented scale, using paid annotators to provide the authentic signal that raw internet scraping no longer reliably delivers. A few are exploring techniques to detect and filter synthetic content from training pipelines, though this remains an imperfect science.

The irony is acute: the technology that promised to democratize content creation may ultimately make authentic human expression more valuable, not less. The scarcity is not of text, but of text that carries genuine information about how humans actually think and communicate.

Our take

The synthetic data problem is not a crisis that will halt AI progress overnight. It is something subtler and potentially more consequential: a slow tax on improvement, a ceiling that may prove harder to raise than the compute and capital ceilings that have defined AI scaling so far. The companies that solve data provenance — that find reliable ways to distinguish human signal from synthetic noise — will hold a durable advantage. Everyone else will be training on an internet that increasingly reflects only what AI already knows.