Ask any large language model to write exactly one hundred words on a topic, then count what it produces. The result will rarely be exactly one hundred words. It might be ninety-three, or one hundred and twelve, and if you ask the model to check, it may confidently confirm that the count is correct. This is not a bug that engineers forgot to fix. It is a window into what these systems actually are, and what they are not.
The inability to count words accurately stems from how language models process text. They do not see words the way humans do. Instead, they operate on tokens: fragments that might be whole words, parts of words, or punctuation marks. The word "understanding" might be one token; "misunderstanding" might be two. A target of one hundred words therefore corresponds to no fixed number of tokens. The model generates text by predicting the next token based on everything that came before, but it has no running tally, no internal counter clicking upward with each output. It is, in a meaningful sense, blind to the length of what it creates even as it creates it.
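To make the gap between words and tokens concrete, here is a minimal sketch of a subword tokenizer. The vocabulary and the greedy longest-match rule are assumptions chosen for illustration; production tokenizers use learned vocabularies and different splitting rules, so the exact splits will differ.

```python
# Hypothetical toy vocabulary, invented for this example only.
VOCAB = {"the", "was", "a", "mis", "understanding", "."}

def tokenize(text: str) -> list[str]:
    """Greedy longest-match subword split.

    Characters not covered by the vocabulary become single-character tokens.
    """
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest vocabulary entry that starts at position i.
            for j in range(len(word), i, -1):
                if word[i:j] in VOCAB:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # No vocabulary entry matched: emit one character.
                tokens.append(word[i])
                i += 1
    return tokens

text = "The misunderstanding was a misunderstanding."
words = text.split()      # what a human would count
tokens = tokenize(text)   # what the toy model would operate on

print(len(words), len(tokens))  # 5 8
```

In this toy setting the same sentence is five words but eight tokens, and the ratio changes with the vocabulary and the text, which is why a word target gives the model no fixed token budget to aim for.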
The illusion of competence
This limitation illuminates a broader truth that benchmarks obscure. Language models achieve remarkable results on standardised tests—passing bar exams, scoring well on medical licensing questions, solving competition mathematics problems—yet they fail at tasks a child could perform. They can discuss the philosophy of mathematics while being unable to reliably multiply large numbers. They can explain the rules of chess while making illegal moves when asked to play.
The pattern indicates that fluency and reasoning are far more separable than intuition suggests. Humans who speak eloquently about a subject generally understand it. We have spent millennia using linguistic competence as a proxy for knowledge. Language models break this heuristic in ways that remain genuinely difficult to internalise, even for researchers who study them.
What counting reveals about architecture
The counting problem is not merely amusing; it is diagnostic. A system that could count its outputs would need some form of working memory that persists across generation steps, a scratch pad where intermediate states are tracked and updated. Standard transformer architectures have no such dedicated mechanism. Each token prediction is, in a sense, a fresh inference conditioned on context but without persistent internal state beyond what fits in the context window, so any tally has to be re-derived from the preceding text at every step.
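The shape of that generation loop can be sketched in a few lines. This is a deliberately crude toy, not a transformer: the predictor `toy_next_token` and its bigram table are hypothetical, and real models attend to the whole context and do far better than this. The point is only structural: nothing in the loop keeps a tally of what has been produced.

```python
from typing import Callable, List

# Hypothetical bigram table: the next token depends only on the previous one.
BIGRAMS = {":": "the", "the": "cat", "cat": "sat", "sat": ".", ".": "<eos>"}

def toy_next_token(context: List[str]) -> str:
    """Stand-in for a model call: a pure function of the context it is given."""
    return BIGRAMS.get(context[-1], "<eos>")

def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_steps: int = 50,
) -> List[str]:
    """Autoregressive loop: the growing context is the only thing carried forward."""
    context = list(prompt)
    # max_steps is a safety cap imposed from outside; the predictor never sees it.
    for _ in range(max_steps):
        token = next_token(context)
        if token == "<eos>":
            break
        context.append(token)
    return context[len(prompt):]

three = generate(toy_next_token, ["write", "three", "words", ":"])
ten = generate(toy_next_token, ["write", "ten", "words", ":"])

print(three == ten)  # True
```

Because this toy predictor never represents length, the requested word count has no effect on what comes out. A real model can recover some length information from its context, but it has to do so through the same next-token prediction, not through a dedicated counter.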
Researchers have proposed various solutions—external tool use, chain-of-thought prompting that forces explicit counting, hybrid architectures with dedicated reasoning modules. Some of these improve performance. None eliminate the fundamental issue, which is that the core prediction mechanism was never designed to track cumulative properties of its own output.
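External tool use, the first of these, can be as simple as counting outside the model and asking again. The sketch below assumes a hypothetical `generate` callable that maps a prompt string to a draft string, and it treats whitespace-separated chunks as words; both are illustrative assumptions, not any particular library's API.

```python
from typing import Callable, Optional

def count_words(text: str) -> int:
    """Count whitespace-separated words; the counting happens outside the model."""
    return len(text.split())

def generate_with_word_limit(
    generate: Callable[[str], str],
    prompt: str,
    target: int,
    max_attempts: int = 3,
) -> Optional[str]:
    """Call the model, check the length externally, and retry with feedback."""
    request = prompt
    for _ in range(max_attempts):
        draft = generate(request)
        n = count_words(draft)
        if n == target:
            return draft
        # Feed the measured count back, since the model cannot measure it itself.
        request = f"{prompt}\n\nYour last draft had {n} words; write exactly {target}."
    return None  # no draft met the constraint within the attempt budget

# Stand-in for a real model call: returns canned drafts in order.
drafts = iter(["one two three four", "one two three four five"])
result = generate_with_word_limit(
    lambda request: next(drafts), "Write exactly five words.", target=5
)
print(result)  # one two three four five
```

The wrapper guarantees that anything it returns meets the constraint, but it does so by moving the counting into ordinary code. The model itself is no better at tracking its own length than before.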
Why this matters beyond curiosity
The counting limitation has practical consequences that extend far beyond parlour tricks. Any task requiring precise length constraints—legal documents with word limits, poetry with strict metre, code with character restrictions—becomes unreliable. More subtly, the same architectural features that prevent accurate counting also prevent reliable self-monitoring of other output properties: factual consistency across long documents, adherence to complex multi-part instructions, maintenance of narrative continuity.
When a language model hallucinates a fact, it is exhibiting the same fundamental blindness. It cannot step back and verify its output against reality because it has no mechanism for that kind of reflection. It only predicts the next plausible token.
Our take
The gap between what language models can do and what they cannot do is not shrinking in predictable ways. Scaling has produced extraordinary gains in some capabilities while leaving others stubbornly resistant. The inability to count words is a small thing, trivially worked around in most applications. But it serves as a persistent reminder that these systems are not thinking in any sense we would recognise. They are performing an extraordinarily sophisticated form of pattern completion that happens to produce outputs we find useful. Mistaking fluency for understanding remains the central error in how we talk about artificial intelligence, and the counting problem is the simplest demonstration of why it is an error.




