Users of GPT-4-era chatbots discovered something odd: ask one how many times the letter "r" appears in "strawberry" and it would often confidently answer two. The correct answer is three. This wasn't a bug to be patched but a window into something fundamental about how these systems work, and how they don't.
The strawberry problem became a minor internet sensation, but its implications run deeper than viral amusement. Large language models don't count. They don't calculate. They don't reason in any way a mathematician would recognize. What they do, with extraordinary sophistication, is predict which word should come next. That this prediction engine produces coherent essays, functional code, and occasionally profound-seeming insights is remarkable. That it cannot reliably count the letters in a ten-letter word is not a malfunction but a direct consequence of its architecture.
The prediction machine
At their core, large language models are probability engines trained on vast oceans of text. When you ask a question, the model isn't retrieving an answer from a database or working through logical steps. It's generating a sequence of tokens (roughly, word fragments), each chosen because it statistically fits what came before. Because the model works with those fragments rather than with individual letters, the spelling of a word is never directly in front of it. The model that produces a sonnet and the model that miscounts letters are the same model doing the same thing: pattern completion at superhuman scale.
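To make "each chosen because it statistically fits what came before" concrete, here is a minimal sketch of next-word prediction using a toy bigram model. It is an illustration only: the corpus is made up, the function names are hypothetical, and a real language model conditions on far more context with a neural network rather than a table of counts.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus; a real model trains on vastly more text.
CORPUS = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."

def train_bigrams(text):
    """Count how often each word follows each other word."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def next_word_probs(follows, prev):
    """Turn raw follow-counts into a probability distribution."""
    counts = follows[prev]
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

def generate(follows, start, length):
    """Greedy generation: always append the most probable next word."""
    out = [start]
    for _ in range(length):
        probs = next_word_probs(follows, out[-1])
        if not probs:  # the last word was never followed by anything
            break
        out.append(max(probs, key=probs.get))
    return " ".join(out)

model = train_bigrams(CORPUS)
print(next_word_probs(model, "the"))  # "cat" is the likeliest continuation
print(generate(model, "the", 5))
```

Nothing in this loop counts, calculates, or checks facts. It only asks which continuation was most common in the text it has seen, which is the sense in which the sonnet and the miscount come from the same mechanism.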
This explains the uncanny valley of AI competence. Ask for a recipe and you'll get something usable. Ask for legal analysis and you'll get something that sounds authoritative. Ask for the square root of 17 and you might get the right answer — not because the model computed it, but because "square root of 17" appeared near "4.123" often enough in training data. The system is interpolating from examples, not deriving from principles.
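The difference between interpolating from examples and deriving from principles can be caricatured in a few lines. In the sketch below, `recall` stands in for association learned from text and `newton_sqrt` for an actual computation. Both names and the lookup table are assumptions made for illustration; a real model generalizes far more flexibly than an exact-match dictionary, so treat this as a cartoon rather than a description of its internals.

```python
# "Memorized" associations, standing in for patterns seen in training text.
# The entries are illustrative, not drawn from any real dataset.
MEMORIZED = {"square root of 17": "4.123", "square root of 16": "4"}

def recall(prompt):
    """Answer by lookup: works only for prompts seen before."""
    return MEMORIZED.get(prompt)

def newton_sqrt(x, tolerance=1e-12):
    """Derive a square root from first principles via Newton's method."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0  # start at or above the true root
    while True:
        better = 0.5 * (guess + x / guess)
        if abs(better - guess) <= tolerance * better:
            return better
        guess = better

print(recall("square root of 17"))  # 4.123
print(recall("square root of 19"))  # None: never seen, nothing to recall
print(round(newton_sqrt(19), 3))    # 4.359
```

The lookup is fast and right whenever the question resembles something it has seen. The derivation is right for any non-negative input, because it follows a rule instead of a resemblance.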
Where the seams show
The counting problem is merely the most legible example of a broader phenomenon. Language models struggle with spatial reasoning, temporal sequences, and anything requiring genuine logical deduction rather than pattern recognition. They hallucinate citations, invent historical events, and occasionally attribute quotes to people who never said them — all with the same confident tone they use when being accurate.
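The counting problem is so legible partly because ordinary code finds it trivial. The sketch below contrasts the character-level view that code has with a token-level view of the same word. The vocabulary, its three fragments, and their ID numbers are invented for illustration; real tokenizers learn their own splits, which may look nothing like this one.

```python
# Hypothetical subword vocabulary with made-up IDs; real tokenizers
# learn their own splits, and these values are purely illustrative.
TOY_VOCAB = {"str": 101, "aw": 102, "berry": 103}

def toy_tokenize(word, vocab):
    """Greedy longest-match split of a word into known fragments."""
    ids, i = [], 0
    while i < len(word):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no fragment matches at position {i}")
    return ids

word = "strawberry"
print(len(word))                      # 10 characters
print(word.count("r"))                # 3 occurrences of "r"
print(toy_tokenize(word, TOY_VOCAB))  # [101, 102, 103]
```

Code sees ten characters and can count them. Under this toy split, a model would see three opaque numbers, and any fact about the letters inside them has to be learned indirectly from text that happens to spell things out.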
The models have improved dramatically. Newer versions handle arithmetic better, often by routing such queries to actual calculators bolted onto the system. But the core architecture remains unchanged: these are language models, not reasoning engines. The improvements come from better training data, more parameters, and clever engineering workarounds, not from closing the fundamental gap between prediction and understanding.
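The "calculator bolted onto the system" idea can be sketched as a simple router: anything that looks like pure arithmetic goes to deterministic code, and everything else goes to the text generator. This is a toy under stated assumptions. `language_model` is a hypothetical stub, the routing rule is a crude regular expression, and production systems typically let the model itself decide when to call a tool.

```python
import ast
import operator
import re

# Operators the toy calculator is allowed to evaluate.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}

def calculate(expression):
    """Evaluate plain arithmetic by walking the parsed syntax tree."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression.strip(), mode="eval").body)

# Digits, whitespace, decimal points, basic operators, and parentheses only.
ARITHMETIC = re.compile(r"[\d\s.+\-*/()]+")

def language_model(query):
    """Placeholder for the text generator; hypothetical, returns a stub."""
    return f"[generated text for: {query}]"

def answer(query):
    """Send pure arithmetic to the calculator, everything else to the model."""
    if ARITHMETIC.fullmatch(query) and any(ch.isdigit() for ch in query):
        return str(calculate(query))
    return language_model(query)

print(answer("17 * 23 + 4"))                        # 395
print(answer("Write a sonnet about strawberries"))  # falls through to the stub
```

The arithmetic is now reliable, but only because the language model was taken out of the loop for that query. The predictor itself is no better at sums than before, which is the sense in which this is a workaround rather than a fix.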
Our take
The strawberry test isn't a gotcha; it's a diagnostic. Every user of AI tools should understand that they're working with a sophisticated autocomplete system, not a thinking machine. This doesn't diminish the technology's utility — autocomplete at this level is genuinely transformative for writing, coding, and creative work. But it does mean treating AI outputs as first drafts requiring verification, not oracles delivering truth. The models are mirrors reflecting the patterns of human language back at us, sometimes brilliantly, sometimes absurdly. Knowing which is which remains, for now, a distinctly human responsibility.




