The data that built AI is a legal and ethical minefield. Nobody wants to talk about what happens when the lawsuits arrive.

Every major language model running today was trained on text its creators did not have explicit permission to use. This is not a secret, nor is it contested — it is simply the foundational bargain of modern artificial intelligence, and it remains profoundly unresolved.

The scale is staggering. Training datasets for frontier models contain hundreds of billions of words harvested from books, news articles, academic papers, forum posts, and personal blogs. The creators of this content — novelists, journalists, researchers, ordinary people who wrote something on the internet — were never asked, never compensated, and in most cases never informed. The companies that built these systems argue this constitutes fair use, a legal doctrine designed for commentary and criticism, not for building commercial products worth hundreds of billions of dollars.

The uncomfortable economics

Licensing the training data properly would have been ruinously expensive, possibly impossible. Imagine negotiating individual rights with millions of copyright holders, many of whom are deceased or untraceable. The practical reality is that large language models exist because their creators decided to ask forgiveness rather than permission — or more precisely, to ask neither and hope the legal system would eventually ratify their choices.

This gamble has worked so far. Courts move slowly, and the technology has become so economically and strategically important that unwinding it seems unthinkable. But the lawsuits are mounting. Authors, news organizations, and visual artists have filed claims arguing that their work was stolen to train systems that now compete with them. The legal outcomes remain uncertain, but the moral question is already answered: the people who created the raw material of AI were cut out of the value it generated.

What 'fair use' was never meant to do

The fair use doctrine in American copyright law was designed to protect activities like parody, education, and journalism — uses that add new meaning or purpose to existing work. Training a commercial AI system on copyrighted text and then selling access to that system fits awkwardly, at best, into this framework. The transformation is real — a language model is not a copy of any particular book — but the commercial purpose is undeniable, and the harm to original creators is increasingly measurable.

Other jurisdictions take different approaches. The European Union has created specific exceptions for text and data mining, but with opt-out provisions that many publishers are now exercising. Japan has been more permissive. The result is a patchwork of rules that multinational AI companies navigate by training their models in the most favorable legal environments and deploying them globally.

Our take

The AI industry built something remarkable on a foundation it did not own. That was probably inevitable given the economics and the pace of development, but it does not make the underlying appropriation just. The writers, artists, and thinkers whose work became training data deserved better than to be strip-mined for someone else's trillion-dollar valuation. Whatever legal settlements or licensing regimes eventually emerge, they will arrive too late to change the fundamental distribution of value. The models are trained, the companies are capitalized, and the original creators hold nothing but grievances. This is not a bug in the system — it is the system.

The Joni Times

The data that built AI is a legal and ethical minefield. Nobody wants to talk about what happens when the lawsuits arrive.

The uncomfortable economics

What 'fair use' was never meant to do

Our take

More in AI

The architect's pencil is learning to think. And the profession is quietly rewriting itself around that fact.

The Patent Examiner's Quiet Revolution. Artificial Intelligence Is Rewriting How We Decide What Counts as Original.

Together AI just raised $800 million at an $8.3 billion valuation. The neocloud era has officially arrived.

Meta wants to rent you its spare AI brainpower. The pivot from social network to compute utility is more radical than it sounds.