The dirty secret of AI safety is that models cannot actually forget. The industry's proposed solution is mostly theater.

When regulators and ethicists demand that AI companies remove harmful capabilities from their models—whether that means erasing instructions for synthesizing pathogens or forgetting copyrighted training data—the industry nods reassuringly and points to a technique called machine unlearning. The promise is elegant: just as you can teach a model new skills, you can make it forget specific ones. The reality is considerably messier, and understanding why exposes something fundamental about the nature of artificial intelligence that most discussions conveniently elide.

The core problem is architectural. Large language models do not store knowledge the way databases do, with discrete entries that can be cleanly deleted. Instead, information is distributed across billions of parameters in ways that researchers still do not fully understand. Teaching a model to refuse certain queries is not the same as excising the underlying capability—it is closer to teaching someone to lie convincingly about what they know.

The superposition problem

Neural networks encode multiple concepts in overlapping patterns of activation, packing more concepts into a layer than it has neurons, a phenomenon researchers call superposition. A single neuron might contribute to the model's understanding of chemistry, poetry, and French cuisine simultaneously. This efficiency is what allows relatively compact models to exhibit such broad competence, but it also means that removing one capability risks degrading others in unpredictable ways.
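To make the idea concrete, here is a minimal sketch of superposition in a toy setting. It is not how any production model is built: the feature count, the three-dimensional "layer", and the random directions are all assumptions chosen for illustration. It squeezes six features into three dimensions, shows that reading one feature back picks up interference from the others, and shows that projecting one feature's direction out of the space also shrinks every other feature.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features = 6   # hypothetical concepts the toy layer must represent
n_dims = 3       # available "neurons", deliberately fewer than features

# Give each feature a random unit direction in the 3-dimensional space.
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

def encode(feature_strengths):
    """Superpose features: sum each direction scaled by its strength."""
    return feature_strengths @ directions

def decode(activation):
    """Read every feature back by projecting onto its direction."""
    return directions @ activation

# Switch on feature 0 only, then read all features back.
strengths = np.zeros(n_features)
strengths[0] = 1.0
readout = decode(encode(strengths))

# Interference: the other features read as nonzero even though they are off.
interference = np.abs(readout[1:]).max()

# Crude "removal" of feature 0: project its direction out of every feature.
d0 = directions[0]
ablated = directions - np.outer(directions @ d0, d0)

# How much of each remaining feature's direction survived the removal.
surviving_norms = np.linalg.norm(ablated[1:], axis=1)

print("readout for feature 0:", readout[0])
print("largest spurious readout:", interference)
print("surviving strength of other features:", surviving_norms)
```

Because six directions cannot all be orthogonal in three dimensions, the removal step necessarily takes a bite out of features it was never aimed at. Real models have vastly more dimensions, but the same geometric constraint is the reason clean deletion is hard.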

Attempts at surgical unlearning typically involve fine-tuning the model to refuse or fumble specific types of requests. Studies have repeatedly demonstrated that these interventions are brittle. With modest prompt engineering, the supposedly forgotten knowledge often resurfaces. The model has not forgotten; it has learned a new behavior layered atop the old one, and that layer can be peeled back.
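A loose analogy in code may help, with heavy caveats. The sketch below does not fine-tune anything. It swaps in a much simpler stand-in: a similarity check placed in front of a toy associative memory whose weights are never changed. All names (W, FORBIDDEN, guarded_model) and the 0.99 threshold are hypothetical. The point is only that a refusal behavior added on top of intact weights blocks the exact query while a slightly altered one still retrieves the stored answer.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_facts = 16, 4

# Toy "model": a single weight matrix W that maps each key to its value.
keys = rng.normal(size=(n_facts, dim))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(n_facts, dim))
W = np.linalg.pinv(keys) @ values   # chosen so that keys @ W == values

FORBIDDEN = keys[0]                 # the fact we want "forgotten"

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def guarded_model(query, threshold=0.99):
    """Refuse queries that closely match the forbidden key; W is untouched."""
    if cosine(query, FORBIDDEN) > threshold:
        return None                 # refusal
    return query @ W

# A "rephrased" query: the forbidden key plus an extra component.
# Assumption for a clean demo: the extra part is orthogonal to all stored keys,
# so it changes how the query looks to the guard but not what W retrieves.
extra = rng.normal(size=dim)
extra -= np.linalg.pinv(keys) @ (keys @ extra)
extra /= np.linalg.norm(extra)
rephrased = FORBIDDEN + 0.3 * extra

direct_answer = guarded_model(FORBIDDEN)   # blocked by the guard
leaked_answer = guarded_model(rephrased)   # slips past, hits the same weights

print("direct query refused:", direct_answer is None)
print("rephrased query recovers the value:", np.allclose(leaked_answer, values[0]))
```

The guard here is far cruder than refusal training, and real attacks are more involved than adding a vector. The structural lesson is the same one the studies point to: as long as the association lives in W, anything that gets a query past the refusal behavior gets the knowledge back.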

Why this matters beyond safety

The unlearning problem extends far beyond biosecurity concerns. The European Union's right to be forgotten, codified in Article 17 of the GDPR as the right to erasure, allows individuals to demand removal of their personal data. It becomes genuinely puzzling when applied to AI systems. If a model was trained on information about you, can that information ever be truly removed without retraining from scratch? The honest answer is probably not, and retraining a frontier model can cost tens of millions of dollars or more.

This creates an uncomfortable regulatory gap. Companies can demonstrate compliance with unlearning procedures that satisfy auditors without actually solving the underlying problem. The knowledge persists in the weights, accessible to sufficiently motivated adversaries, while everyone involved maintains plausible deniability.

Our take

The machine unlearning discourse reveals a pattern common in AI governance: the industry offers technical-sounding solutions to political problems, and policymakers accept them because the alternative—admitting that some genies cannot be re-bottled—is too uncomfortable. Honest engagement with AI safety requires acknowledging that models trained on dangerous information will always carry traces of it. The real question is not how to make AI forget, but how to build systems and institutions that account for the permanence of machine knowledge. That conversation is harder, less reassuring, and far more necessary.

The Joni Times

