Ask a chatbot how many times the letter 'r' appears in 'strawberry' and there is a fair chance it will confidently answer two. ChatGPT famously did. The correct answer is three. This parlor trick has become the internet's favorite way to mock artificial intelligence, offered as proof that these systems are somehow fundamentally broken. But the strawberry problem is not evidence of failure. It is a window into how these machines actually think, and why that thinking is both more alien and more interesting than most users realize.
The error has surfaced across models and companies not because engineers are careless, but because it stems from a design decision made before the model reads a single word. Removing it at the root, rather than patching around it, would mean rethinking how text is fed into the system. Large language models do not see words the way humans do. They see tokens.
The tokenization bargain
Before a single word reaches the neural network, it passes through a tokenizer — a preprocessing step that chops text into digestible pieces. The word 'strawberry' might become 'straw' and 'berry,' or 'str,' 'aw,' and 'berry,' depending on the system. The model never encounters the individual letters s-t-r-a-w-b-e-r-r-y as discrete units. It processes compressed chunks optimized for statistical patterns across billions of documents.
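To make the chopping concrete, here is a minimal sketch of a toy tokenizer. It is an illustration under stated assumptions, not how any production system works: the vocabulary is invented for this example, and the greedy longest-match rule stands in for the learned merge procedures that real tokenizers typically use.

```python
# Toy greedy longest-match tokenizer. VOCAB is a hypothetical vocabulary
# invented for this illustration; real tokenizers learn theirs from data.
VOCAB = {"straw": 0, "berry": 1, "str": 2, "aw": 3, "s": 4, "t": 5, "r": 6,
         "a": 7, "w": 8, "b": 9, "e": 10, "y": 11}


def tokenize(text, vocab=VOCAB):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


def encode(text, vocab=VOCAB):
    """Return the integer IDs, which are all the model receives."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("strawberry"))  # ['straw', 'berry']
print(encode("strawberry"))    # [0, 1]
```

The network downstream receives only the two numbers on the last line. Nothing in 0 or 1 says how many r's were inside the pieces they stand for.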
This is not laziness. Tokenization is what makes these models economically viable. Processing text character by character would make every input roughly four to six times longer, and computational costs would rise at least in proportion. A query that costs a fraction of a cent would become far more expensive to serve. The tokens work as a compression scheme: they buy efficiency by hiding fine detail from the model.
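A back-of-the-envelope sketch shows where the multiplier comes from. The function below is hypothetical and rests on two assumptions: the four-to-six characters per token quoted above, and a standard attention mechanism that compares every position with every other, so that part of the work grows with the square of the input length.

```python
def compare_costs(num_chars, chars_per_token=4):
    """Estimate how much longer character-level input is than token-level input."""
    num_tokens = max(1, round(num_chars / chars_per_token))
    length_ratio = num_chars / num_tokens
    return {
        "characters": num_chars,
        "tokens": num_tokens,
        # How many times longer the sequence becomes at character level.
        "length_ratio": length_ratio,
        # Growth in all-pairs comparisons, assuming quadratic attention.
        "pairwise_ratio": length_ratio ** 2,
    }


for chars_per_token in (4, 6):
    print(chars_per_token, compare_costs(1200, chars_per_token))
```

Under these assumptions a 1,200-character prompt becomes four to six times longer at character level, and the pairwise portion of the work grows by a factor of 16 to 36. Real costs depend on the model and hardware, so treat the numbers as an order-of-magnitude guide.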
The bargain works remarkably well for most tasks. Tokens capture semantic meaning, grammatical structure, and contextual relationships with extraordinary sophistication. A model can analyze legal contracts, write poetry, and debug code precisely because it operates at this higher level of abstraction. But the abstraction has a cost: the raw orthographic structure of language — the actual letters — becomes invisible.
Why the fix is harder than it looks
Engineers have developed workarounds. Some systems now route counting questions to separate code-execution modules that can actually iterate through characters. Others fine-tune models specifically on letter-counting tasks. But these are patches on a fundamental architecture, not solutions to it.
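The routing workaround can be sketched in a few lines. This is a simplified illustration, not any vendor's actual implementation: the regular expression, the function names, and the fallback standing in for the language model are all hypothetical.

```python
import re

# Hypothetical pattern for questions shaped like:
#   How many times does the letter 'r' appear in 'strawberry'?
# A real router would need far broader coverage than one regex.
COUNT_QUESTION = re.compile(r"how many.*?'(\w)'.*?'(\w+)'", re.IGNORECASE)


def count_letter(word, letter):
    """Count occurrences by working on the actual characters."""
    return word.lower().count(letter.lower())


def answer(question, fallback):
    """Send letter-counting questions to code; hand everything else to the model."""
    match = COUNT_QUESTION.search(question)
    if match is None:
        return fallback(question)  # fallback is a stand-in for the language model
    letter, word = match.group(1), match.group(2)
    return f"count of '{letter}' in '{word}': {count_letter(word, letter)}"


def model_stub(question):
    """Placeholder for a call to a language model."""
    return "(model-generated answer)"


print(answer("How many times does the letter 'r' appear in 'strawberry'?", model_stub))
# count of 'r' in 'strawberry': 3
```

The counting itself is trivial once code can see the characters. The hard part is recognizing, from free-form text, which questions should leave the model's path at all.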
The deeper issue is that language models are prediction engines trained on next-token probability. They learn that 'straw' frequently precedes 'berry' and that questions about letter frequency typically receive numerical answers. When asked about the r's in strawberry, the model is not counting — it is pattern-matching against similar questions it has seen, then generating a plausible-sounding response. The number it produces is a statistical echo, not a calculation.
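A drastically simplified sketch makes the "statistical echo" visible. The code below is a bigram counter, not a neural network: it predicts each token from the previous one alone, whereas real models condition on the whole context. The corpus is hand-written for illustration and makes no claim about what any real training set contains.

```python
from collections import Counter, defaultdict

# Hypothetical token sequences standing in for training text.
CORPUS = [
    ["how", "many", "r", "in", "straw", "berry", "?", "two"],
    ["how", "many", "r", "in", "straw", "berry", "?", "two"],
    ["how", "many", "r", "in", "straw", "berry", "?", "three"],
    ["straw", "berry", "jam"],
]


def train_bigrams(corpus):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for sequence in corpus:
        for previous, following in zip(sequence, sequence[1:]):
            counts[previous][following] += 1
    return counts


def predict_next(counts, token):
    """Return the most frequent follower of a token, or None if it was never seen."""
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


counts = train_bigrams(CORPUS)
print(predict_next(counts, "straw"))  # berry
print(predict_next(counts, "?"))      # two
```

The second prediction is "two" only because that answer followed the question most often in the invented corpus. At no point does the code inspect a single letter.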
This distinction matters far beyond party tricks. The same combination of chunked input and prediction-driven output that undermines letter-counting also helps explain why models struggle with precise arithmetic, why they occasionally hallucinate citations, and why they can write convincing analysis of a book while misremembering specific plot points. The system is optimized for fluency and coherence at the token level, not for ground-truth accuracy at the character or fact level.
Our take
The strawberry test has become a lazy gotcha, deployed to prove that AI is overhyped and its creators are frauds. The reality is more interesting. Large language models represent a genuine breakthrough in machine cognition — and a genuinely different kind of cognition, one with capabilities and limitations that do not map neatly onto human intelligence. Understanding why the letter-counting fails is more valuable than mocking the failure itself. These systems are not broken; they are simply not what many users assume them to be. The sooner we grasp what they actually are, the better we will use them.




