Ask a frontier AI model to write a sonnet and it will produce something competent, occasionally beautiful. Ask it to count how many times the letter 'r' appears in 'strawberry' and there is a real chance it will answer two instead of three. This is not a bug awaiting a patch. It is a window into how these systems actually think, or rather, how they do something that resembles thinking while operating on principles fundamentally different from those of a human mind.
The counting problem has become a minor internet sport: users delight in catching sophisticated models fumbling elementary arithmetic, miscounting syllables, or losing track of items in a list. The failures seem absurd given that these same systems can explain quantum chromodynamics or generate working Python scripts. But the absurdity is instructive. It reveals that large language models are not general reasoners with occasional blind spots. They are pattern-completion engines of extraordinary power operating in a space where counting is genuinely hard and reasoning is surprisingly easy.
The architecture of approximation
Language models process text as tokens: chunks that might be a word, part of a word, or a punctuation mark. Each token reaches the model as a single opaque unit, so the model does not see individual letters the way you do when you scan a word character by character. When asked to count letters, it must infer the answer from patterns in its training data, essentially guessing based on how similar questions were answered before. This is why 'strawberry' trips them up: the word contains three r's, but they are buried inside multi-letter tokens, and the model's statistical intuitions can settle on two.
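The gap between what a reader sees and what a model receives can be sketched in a few lines of Python. The token split and the integer IDs below are invented purely for illustration; real tokenizers divide words differently from one model to the next.

```python
# Illustrative sketch only: the token split and IDs are hypothetical,
# not taken from any real tokenizer.
WORD = "strawberry"
HYPOTHETICAL_TOKENS = ["str", "aw", "berry"]
HYPOTHETICAL_VOCAB = {"str": 1001, "aw": 1002, "berry": 1003}


def count_letter(text: str, letter: str) -> int:
    """Count a letter by scanning the text one character at a time."""
    return sum(1 for ch in text if ch == letter)


def to_token_ids(tokens: list[str], vocab: dict[str, int]) -> list[int]:
    """Map each token string to its integer ID."""
    return [vocab[t] for t in tokens]


# What a person does: look at every character.
print(count_letter(WORD, "r"))  # 3

# What the model is handed: a short list of integers with no spelling attached.
token_ids = to_token_ids(HYPOTHETICAL_TOKENS, HYPOTHETICAL_VOCAB)
print(token_ids)  # [1001, 1002, 1003]

# The letters are split across the chunks, so no single token "contains" the answer.
r_per_token = [count_letter(t, "r") for t in HYPOTHETICAL_TOKENS]
print(r_per_token)  # [1, 0, 2]
```

Character-level counting is trivial once the characters are visible. The difficulty arises because the model starts from the integer list, where nothing marks how many r's each ID stands for.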
Arithmetic poses similar challenges. Models can perform calculations, but they do so by recognizing patterns in how numbers combine, not by executing the stepwise algorithms humans learn in school. Small numbers work fine because the training data contains abundant examples. Large or unusual numbers produce errors because the model is interpolating in regions where its pattern-matching grows unreliable. It is less a calculator than a savant who has memorized thousands of multiplication tables and guesses at the rest.
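For contrast, the stepwise procedure taught in school can be written out directly. The sketch below is a plain-Python illustration of column-by-column addition with carrying; the function name is ours, and it is meant only to show the kind of explicit algorithm that a model approximates by pattern rather than executes.

```python
def schoolbook_add(a: str, b: str) -> str:
    """Add two non-negative decimal integers given as digit strings."""
    # Pad the shorter number with leading zeros so the columns line up.
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    digits = []
    carry = 0
    # Work from the rightmost column to the leftmost, as on paper.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))  # digit written under the column
        carry = total // 10             # digit carried to the next column
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))


print(schoolbook_add("987654321987", "123456789123"))  # 1111111111110
```

Because every column is handled by the same rule, the procedure is exactly as reliable on twelve-digit numbers as on two-digit ones. Pattern-matching offers no such guarantee once the numbers drift away from familiar examples.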
What comes easily, what comes hard
The inversion of difficulty is striking. Tasks that require years of human education—legal analysis, code generation, literary pastiche—emerge almost automatically from sufficient scale and training data. These are, in a sense, pattern-rich domains where statistical regularities run deep. Meanwhile, tasks that children master effortlessly—counting, tracking objects through time, maintaining consistent spatial reasoning—remain stubbornly difficult because they require something other than pattern completion.
This suggests that what we call intelligence may be less unified than we assume. Human cognition integrates symbolic manipulation, embodied intuition, and statistical learning into a seamless whole. Language models have achieved remarkable capability in one dimension while remaining curiously impoverished in others. They are not early-stage general intelligences. They are a new kind of cognitive tool, powerful in ways we are still learning to exploit and limited in ways we are still learning to map.
Our take
The counting failures are not embarrassing glitches to be patched away. They are honest signals about what these systems are and are not. A technology that can draft a persuasive essay but cannot reliably count syllables is not a flawed human substitute—it is something genuinely novel, with a capability profile unlike anything evolution or engineering has produced before. The sooner we stop measuring AI against human benchmarks and start understanding it on its own terms, the sooner we will learn to use it wisely. The strawberry test is not a gotcha. It is a koan.




