The fundamental assumption behind the current generation of artificial intelligence is that bigger is better. More parameters, more training data, more compute — the recipe has worked spectacularly well for a decade. But there is a problem that the industry's cheerleaders rarely discuss in public: the internet, vast as it seems, is not infinite, and we are approaching the point where there simply is not enough quality text left to train the next generation of models.

This is not a theoretical concern for some distant future. It is a constraint that leading AI laboratories are grappling with right now, and it helps explain some of the stranger behaviors emerging from the industry — from partnerships with media companies to the sudden interest in synthetic data.

The scale of the appetite

To understand the problem, consider the numbers. Large language models such as GPT-4 and Claude are trained on datasets measured in trillions of tokens, the word fragments a model actually reads. The entire English Wikipedia contains perhaps four billion tokens. The Common Crawl, a nonprofit archive that has been scraping the web since 2008, contains tens of trillions of tokens, but the vast majority is low-quality: spam, duplicated content, machine-generated filler, and text so poorly written it teaches bad habits rather than good ones.

The pool of genuinely high-quality, human-written text — books, academic papers, well-edited journalism, thoughtful forum discussions — is far smaller than the raw numbers suggest. Researchers at Epoch AI, a nonprofit that tracks compute trends, have estimated that high-quality text data could be effectively exhausted within the next few years if current scaling trends continue.
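
To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python. Only the four-billion-token Wikipedia figure comes from the text above. The training-set size, the stock of quality text, and the annual growth factor are hypothetical placeholders chosen for illustration; they are not Epoch AI's estimates or any laboratory's disclosed numbers.

```python
import math

# From the article: English Wikipedia is perhaps four billion tokens.
WIKIPEDIA_TOKENS = 4e9

# Hypothetical placeholders for illustration only, not measured values.
ASSUMED_TRAINING_TOKENS = 10e12   # one training run, "trillions of tokens"
ASSUMED_QUALITY_STOCK = 300e12    # total usable high-quality text
ASSUMED_ANNUAL_GROWTH = 2.5       # dataset size multiplier per year


def wikipedias_consumed(training_tokens: float) -> float:
    """How many Wikipedia-sized corpora fit into one training set."""
    return training_tokens / WIKIPEDIA_TOKENS


def years_until_exhausted(current: float, stock: float, growth: float) -> int:
    """Whole years of compounding growth until dataset size reaches the stock."""
    if current >= stock:
        return 0
    if growth <= 1.0:
        raise ValueError("growth must exceed 1 for the stock to run out")
    # Solve current * growth**n >= stock for the smallest whole n.
    return math.ceil(math.log(stock / current) / math.log(growth))
```

Under these assumed inputs, `wikipedias_consumed(ASSUMED_TRAINING_TOKENS)` comes to 2,500 Wikipedias per training run, and `years_until_exhausted(ASSUMED_TRAINING_TOKENS, ASSUMED_QUALITY_STOCK, ASSUMED_ANNUAL_GROWTH)` returns 4. Change any placeholder and the answer moves, but compounding growth against a fixed stock always ends the same way.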

The scramble for alternatives

This scarcity explains much of the industry's recent behavior. The licensing deals with publishers, the lawsuits over training data, the sudden corporate enthusiasm for "synthetic data" — text generated by AI models themselves, then used to train the next generation — all reflect a dawning realization that the easy abundance is ending.

Synthetic data is particularly interesting because it sounds like a free lunch: just have the model generate its own training examples. But early research suggests this approach has limits. Models trained heavily on AI-generated text tend to degrade in subtle ways, losing the diversity and unpredictability that make human language useful, a failure mode researchers have dubbed "model collapse." It is the intellectual equivalent of inbreeding.
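
One mechanism behind this degradation can be illustrated with a toy simulation. The sketch below is a deliberately crude stand-in, not a real training pipeline: the "model" is just a table of word frequencies, and each generation is refit on a finite sample drawn from the previous one. The function names, vocabulary size, and sample size are all hypothetical choices made for illustration.

```python
import random
from collections import Counter


def next_generation(weights: dict[str, int], sample_size: int,
                    rng: random.Random) -> dict[str, int]:
    """Sample text from the current 'model', then refit on that sample."""
    words = list(weights)
    sample = rng.choices(words, weights=[weights[w] for w in words],
                         k=sample_size)
    # The refit model only knows the words that appeared in the sample.
    return dict(Counter(sample))


def simulate(vocab_size: int = 1000, sample_size: int = 2000,
             generations: int = 20, seed: int = 0) -> list[int]:
    """Return the number of distinct words alive at each generation."""
    rng = random.Random(seed)
    # Generation 0 stands in for human text: a long-tailed distribution
    # in which a few words are common and most are rare.
    weights = {f"w{i}": max(1, vocab_size // (i + 1))
               for i in range(vocab_size)}
    history = [len(weights)]
    for _ in range(generations):
        weights = next_generation(weights, sample_size, rng)
        history.append(len(weights))
    return history
```

Calling `simulate()` returns the count of surviving words at each generation. Because a word that fails to appear in one sample can never return, that count never rises, and the rare words in the tail are the first to go. Real models are vastly more complicated, but the one-way loss of rare material is the intuition behind the inbreeding analogy.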

Other approaches include training on video transcripts, audio, and code — domains where more data remains untapped. Multimodal training, which combines text with images and video, may also help. But none of these solutions fully resolves the underlying tension between the industry's appetite for data and the finite amount of quality human expression available.

Our take

The data ceiling is not the end of AI progress, but it may be the end of a particular kind of progress — the brute-force scaling that has defined the field since 2017. What comes next will likely require more cleverness and less sheer volume: better architectures, more efficient training methods, perhaps a fundamental rethinking of what these models are optimizing for. The companies that figure this out will define the next era. The ones that keep drilling for data that no longer exists will find themselves stuck. The era of easy gains is closing, and the hard work is just beginning.