Large language models cannot count. This reveals more than you might think.

Ask a sophisticated AI to write a sonnet and it will produce something passable, perhaps even moving. Ask it to write a sonnet with exactly fourteen lines and you may get twelve, or sixteen, or fourteen achieved only by accident. This is not a bug that will be patched in the next update. It is a window into what these systems actually are.

The inability of large language models to reliably count — letters in a word, items in a list, syllables in a line — has become a running joke among researchers and a genuine frustration for users. But the joke contains a lesson. These systems do not process text the way humans do. They do not see words at all.

The tokenization problem

Before a language model reads your prompt, it breaks your text into tokens — chunks that might be whole words, parts of words, or single characters, depending on the vocabulary it learned during training. The word "strawberry" might become two or three tokens. The model never sees the individual letters s-t-r-a-w-b-e-r-r-y lined up for inspection. It sees numerical representations of chunks, stripped of their internal structure.

This is why asking how many r's appear in "strawberry" can produce confidently wrong answers. The model is not counting letters; it is pattern-matching against training data in which similar questions appeared. Sometimes the statistical echo yields the right number. Often it does not.
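A minimal sketch can make the point concrete. The vocabulary and token IDs below are invented for illustration, and the greedy longest-match rule is a simplification: production tokenizers typically use learned subword schemes such as byte-pair encoding, and may split "strawberry" differently. The sketch only shows that what reaches the model is a short list of integers, while ordinary code that works on characters counts the letters trivially.

```python
# Toy vocabulary: hypothetical entries and IDs, not taken from any real model.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split: at each position take the longest known chunk."""
    tokens = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            piece = text[i:end]
            if piece in vocab:
                tokens.append(piece)
                i = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens

def encode(text, vocab=TOY_VOCAB):
    """Map text to the integer IDs a model would actually receive."""
    return [vocab[token] for token in tokenize(text, vocab)]

word = "strawberry"
pieces = tokenize(word)          # ['straw', 'berry']
ids = encode(word)               # [101, 102]
letter_count = word.count("r")   # 3, counted on characters, not tokens

print(pieces, ids, letter_count)
```

Nothing in the ID list `[101, 102]` records that three r's are present; that information lives only in the character string the model never receives.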

The same architecture that enables these systems to synthesize vast knowledge and generate fluid prose also blinds them to the granular structure of their own output. They predict the next token based on probability distributions, not by examining what they have already written character by character.
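The forward-only generation loop can be sketched in a few lines. Everything here is a toy: the probability table is made up, and it conditions only on the previous token, whereas a real model conditions on the whole preceding token sequence through a neural network. The sketch is meant to show the shape of the loop, in which each step picks a likely next token and nothing keeps an explicit tally of what has been produced.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "<start>": {"roses": 0.6, "violets": 0.4},
    "roses": {"are": 0.9, "<end>": 0.1},
    "violets": {"are": 0.8, "<end>": 0.2},
    "are": {"red": 0.5, "blue": 0.3, "<end>": 0.2},
    "red": {"<end>": 0.7, "roses": 0.3},
    "blue": {"<end>": 1.0},
}

def generate(model, max_tokens=10):
    """Greedy decoding: always append the most probable next token."""
    output = []
    prev = "<start>"
    for _ in range(max_tokens):
        distribution = model[prev]
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break
        output.append(next_token)  # emitted tokens are never revised
        prev = next_token
    return output

print(generate(TOY_MODEL))  # ['roses', 'are', 'red']
```

The loop stops because an end token became the most probable choice, not because a target length was reached.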

Fluency without grounding

Humans who speak fluently also understand what they are saying, or at least can check their work by rereading it. Language models have no such built-in feedback loop. They generate text forward, token by token, and once a token is emitted they do not go back to verify that a list contains exactly five items or that a haiku has the right syllable count.

This asymmetry between generation and verification illuminates a broader truth. These systems have learned the statistical structure of human language with extraordinary fidelity. They have not learned to think about language the way humans do. The difference matters less when the task is open-ended composition and more when the task requires precision.

Programmers have developed workarounds — asking models to count step by step, or using external tools to verify outputs. But the underlying limitation persists. The architecture optimizes for predicting plausible continuations, not for maintaining accurate internal representations of what has been produced.
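The external-tool workaround might look something like the sketch below. The names `generate_with_check` and `make_stub` are hypothetical, and the stub stands in for a real model call, which the article does not specify. The idea is simply that ordinary code counts the lines and rejects drafts that miss the target.

```python
def count_nonempty_lines(text):
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in text.splitlines() if line.strip())

def generate_with_check(generate_fn, expected_lines, max_attempts=3):
    """Call generate_fn until a draft has the expected line count.

    Returns (draft, attempts_used), or (None, max_attempts) if every draft failed.
    """
    for attempt in range(1, max_attempts + 1):
        draft = generate_fn()
        if count_nonempty_lines(draft) == expected_lines:
            return draft, attempt
    return None, max_attempts

def make_stub(line_counts):
    """Stand-in for a model: each call returns a draft with the next line count."""
    counts = iter(line_counts)
    def stub():
        n = next(counts)
        return "\n".join(f"line {i + 1}" for i in range(n))
    return stub

# The stub "model" produces 12 lines, then 16, then 14.
draft, attempts = generate_with_check(make_stub([12, 16, 14]), expected_lines=14)
print(attempts, count_nonempty_lines(draft))  # 3 14
```

The check lives entirely outside the model, which is the point: the verification the generator lacks is supplied by a few lines of conventional code.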

What the gap teaches us

The counting problem is a specific instance of a general phenomenon: language models excel at tasks where statistical patterns in training data provide reliable guidance and struggle where success requires something else — genuine reasoning, physical intuition, or simple arithmetic performed on novel inputs.

This does not diminish what these systems accomplish. Their ability to retrieve, synthesize, and articulate information across domains remains remarkable. But it should calibrate expectations. A system that cannot reliably count to fourteen is not secretly approaching general intelligence. It is a very sophisticated pattern-matching engine with capabilities and blind spots that follow directly from its design.

Our take

The counting problem is useful precisely because it is so mundane. It resists the tendency to anthropomorphize systems that speak in complete sentences and express uncertainty with appropriate hedging. A human who writes beautifully but cannot count to ten would be a medical curiosity. A language model with the same profile is simply revealing its architecture. The gap between what these systems appear to understand and what they actually process is not closing with scale — it is becoming better defined. That clarity, uncomfortable as it may be for the hype cycle, is progress of a different kind.

The Joni Times
