Ask a large language model to count the number of r's in "strawberry" and watch it stumble. This is not a bug to be patched or a limitation to be engineered around. It is a window into the deepest truth about what these systems are and are not.
The failure is instructive precisely because it seems so absurd. A system that can discuss Wittgenstein, draft legal briefs, and explain quantum mechanics cannot reliably perform a task that children accomplish before learning to read. The dissonance is not a flaw in the marketing; it is the marketing's most honest moment.
The architecture explains everything
Large language models do not see letters. They see tokens: subword chunks of text that a tokenizer, built from the statistics of a large text corpus, treats as single units. The word "strawberry" might arrive as a single token or be split into a few multi-letter pieces, and either way the model receives token identifiers rather than a sequence of letters. The model has never counted anything. It has learned statistical associations between sequences of tokens, and when asked to count, it is essentially guessing what a counting answer should look like based on patterns it absorbed during training.
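To make the point about tokens concrete, here is a minimal sketch of what a tokenizer hands to a model. The vocabulary, the `tokenize` and `encode` helpers, and the greedy longest-match rule are all assumptions invented for illustration; production tokenizers use learned vocabularies and different splitting rules, so the actual pieces for "strawberry" may differ.

```python
import string

# Hypothetical toy vocabulary, invented for illustration only.
# Real tokenizers learn tens of thousands of pieces from corpus statistics.
TOY_VOCAB = {"straw": 0, "berry": 1}
for offset, letter in enumerate(string.ascii_lowercase, start=2):
    TOY_VOCAB[letter] = offset  # single-letter fallbacks so any lowercase word can be split


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into vocabulary pieces by greedy longest match."""
    pieces = []
    start = 0
    while start < len(text):
        # Try the longest candidate first, then shrink until a piece matches.
        for end in range(len(text), start, -1):
            if text[start:end] in vocab:
                pieces.append(text[start:end])
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece covers position {start}")
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Map text to the integer IDs a model would actually receive."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


word = "strawberry"
print(tokenize(word))    # ['straw', 'berry']
print(encode(word))      # [0, 1]
print(word.count("r"))   # 3, trivial for code that can see the characters
```

Under these assumptions the model's input for "strawberry" is the pair `[0, 1]`. Nothing in those two integers says how many r's the word contains; that information lives in the characters, which ordinary code can inspect directly and the model cannot.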
This is not a minor technical detail. It is the entire story. These systems are prediction engines of extraordinary sophistication, trained on more text than any human could read in a thousand lifetimes. They have absorbed the statistical shadows of human knowledge — the patterns of how we explain things, argue things, describe things. But they have absorbed the shadows, not the substance.
What the models actually learned
When a language model explains photosynthesis correctly, it is not because it understands chloroplasts. It is because it has seen thousands of photosynthesis explanations and learned what shape such an explanation should take. When it writes a sonnet, it is not feeling the weight of mortality that inspired Shakespeare. It is predicting, with remarkable accuracy, what words should follow other words in something that looks like a sonnet.
This distinction matters enormously for how we should use these tools. They are spectacular at tasks where pattern-matching and fluent synthesis are valuable: drafting, summarizing, brainstorming, translating between registers. They are unreliable at tasks requiring genuine reasoning from first principles, precise factual recall, or any operation that their training data did not already contain in recognizable form.
The counting problem is not being solved
Engineers can and do build workarounds. They can give models access to calculators, code interpreters, external databases. These scaffoldings are useful, but they do not change what the model itself is doing. They change what the system around the model can accomplish. The model remains a prediction engine. It has simply been given better tools to call upon when its predictions suggest that calling a tool would be appropriate.
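The division of labor described above can be sketched in a few lines. Everything here is hypothetical: `fake_model` is a hard-coded stand-in for a model's prediction, and the `TOOLS` registry, the request format, and `run_system` are invented for illustration rather than taken from any vendor's API.

```python
def count_letter(word, letter):
    """Ordinary code: it sees characters, so counting is exact."""
    return word.lower().count(letter.lower())


# Hypothetical registry of tools the surrounding system is willing to run.
TOOLS = {"count_letter": count_letter}


def fake_model(prompt):
    """Stand-in for a language model (an assumption, not a real API).

    A real model would *predict* a structured request like this one
    as its most plausible continuation; here the request is hard-coded.
    """
    return {"tool": "count_letter", "args": {"word": "strawberry", "letter": "r"}}


def run_system(prompt):
    """The scaffolding: ask the model, then execute the tool it requested."""
    request = fake_model(prompt)
    tool = TOOLS[request["tool"]]
    return tool(**request["args"])


print(run_system("How many r's are in strawberry?"))  # 3
```

The correct answer comes from `count_letter`, not from the model. The model's only contribution is predicting that a tool call is the right kind of response, which is exactly the sense in which the system gets more capable while the model stays a prediction engine.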
This is not a criticism. Prediction engines with tool access are genuinely useful, perhaps transformatively so. But the hype cycle has consistently confused what these systems are with what we wish they were. Every few months, a new benchmark is conquered, and observers declare that "real" artificial general intelligence is imminent. Then someone asks the model to count letters, and the illusion cracks.
Our take
The strawberry test is not a gotcha. It is a gift — a simple, repeatable demonstration that cuts through the mystification. Large language models are among the most impressive engineering achievements in human history. They are also not minds. They do not understand, reason, or know in any sense those words have previously carried. Using them well requires holding both truths simultaneously: they are more capable than their critics admit and less capable than their evangelists claim. The companies building them know this. The question is whether the rest of us will learn it before the hype cycle teaches us the hard way.




