AI cannot reliably tell you what isn't. That single flaw reveals more than a thousand benchmarks.

Ask a large language model to list animals that are not mammals and you will likely receive a competent answer. Ask it to name a famous scientist who never won a Nobel Prize, and the confident reply may well be wrong. Request a summary of what a legal contract does not permit, and watch the hedging begin. Negation—the simple act of reasoning about what is absent, forbidden, or false—remains one of the most persistent weaknesses in today's most celebrated AI systems.

This is not a bug awaiting a patch. It is a structural consequence of how these models learn. Transformers, the architecture underpinning systems from GPT to Claude to Gemini, are trained to predict the next token in a sequence based on statistical patterns in vast corpora. They learn that "Paris" often follows "the capital of France is," and that "bark" frequently accompanies "dog." But absence leaves no statistical trace. There is no pattern for what did not happen, no token frequency for the deals that fell through, the wars that never started, the features a product lacks.
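A toy illustration of that point, and not a picture of how production models are trained: the sketch below counts which word follows which in a tiny made-up corpus. The corpus, the function name `next_token_counts`, and the example words are all assumptions made for this sketch. Counts exist only for word pairs that actually appear. A pair that never appears just reads as zero, and a zero cannot distinguish "false" from "never mentioned."

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus (an assumption for illustration only).
CORPUS = [
    "the capital of france is paris",
    "the dog will bark",
    "the dog will bark loudly",
]

def next_token_counts(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

counts = next_token_counts(CORPUS)

print(counts["is"]["paris"])   # a pair that was observed
print(counts["will"]["bark"])  # a pair that was observed more than once
print(counts["will"]["meow"])  # a pair that was never observed
```

Real models learn far richer statistics than adjacent-word counts, but the sketch shows the shape of the problem: the data records what was written, not what was ruled out.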

The asymmetry of presence and absence

Human cognition handles negation through a different mechanism entirely. We maintain mental models of the world and can explicitly mark propositions as false, absent, or counterfactual. When someone says "the meeting is not on Tuesday," we update our calendar by removing Tuesday, not by adding some positive fact. Language models have no such erasure operation. Their output is a probability distribution over next tokens, so an option can be made less likely but is never explicitly marked as false; ruling something out happens only indirectly, as probability shifts toward the alternatives.
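A small sketch of the mechanism at issue, using a plain softmax over three candidate tokens. The scores and the token list are assumed values chosen for illustration, not taken from any real model. Lowering the score for "Tuesday" makes it less likely, but it never removes it: the probability stays above zero and the remainder is redistributed to the other options.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens (assumed values).
tokens = ["Monday", "Tuesday", "Wednesday"]
before = softmax([1.0, 1.0, 1.0])
# Lowering the score for "Tuesday" does not delete it; it rescales everything.
after = softmax([1.0, -2.0, 1.0])

for tok, p0, p1 in zip(tokens, before, after):
    print(f"{tok}: {p0:.3f} -> {p1:.3f}")
```

A calendar entry can be struck out. In this sketch, "Tuesday" can only be made improbable.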

This asymmetry surfaces constantly in practical applications. Legal professionals testing AI document review have found that models reliably extract what a contract requires but struggle to enumerate what it prohibits. Medical AI can flag symptoms present in a patient's history more accurately than it can note symptoms that are conspicuously absent. Customer service bots trained on product manuals often hallucinate features rather than correctly state what a product cannot do.

Why prompting tricks only go so far

Practitioners have developed workarounds. Chain-of-thought prompting, explicit instructions to "list what is NOT included," and retrieval-augmented generation can improve performance on negation tasks. But these are patches on a foundation not designed for the job. The underlying model still lacks a native representation of falsity. It is approximating negation through the same pattern-matching machinery it uses for everything else, which means edge cases and novel phrasings continue to trip it up.

Researchers have proposed architectural modifications—explicit negation tokens, separate reasoning modules, neuro-symbolic hybrids—but none has yet achieved the seamless integration that would make negation as natural for machines as it is for humans. The problem is not computational resources; it is representational. You cannot reliably reason about what is not there if your entire worldview is built from correlations among things that are.

Our take

The negation problem is a useful corrective to the breathless coverage of AI capabilities. These systems are genuinely impressive at interpolation—finding patterns within the distribution of their training data—but they remain weak at the kind of reasoning that requires stepping outside that distribution. Absence, counterfactuals, and negation all demand exactly that. Until models can represent what is not, their understanding will remain a sophisticated mirage: fluent, confident, and subtly unreliable in ways that matter most when the stakes are high.

The Joni Times
