Ask any major AI assistant how many times the letter 'r' appears in the word 'strawberry,' and there is a reasonable chance it will confidently answer two. The correct answer is three. This is not a bug that engineers forgot to fix; it is a window into the fundamental architecture of systems that otherwise compose poetry, debug code, and pass bar exams.
The failure is instructive precisely because it seems so trivial. A child who can barely read could outperform a trillion-parameter model at this task. The disconnect illuminates a truth that gets lost in breathless coverage of AI capabilities: these systems do not process language the way humans do, and the difference matters more than most users realize.
Tokens are not letters
Large language models do not see individual characters. They perceive language through tokens: chunks of text that might be whole words, word fragments, or punctuation marks. The word 'strawberry' reaches the model as something like 'straw' and 'berry' (the exact split depends on the tokenizer), two separate units whose internal spelling is not directly visible to the system. When asked to count letters, the model must essentially guess based on patterns it learned during training rather than perform the mechanical operation a human would.
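To make the point concrete, here is a minimal sketch of a greedy longest-match tokenizer in Python. The vocabulary, the token IDs, and the function name `toy_tokenize` are all invented for illustration; real tokenizers use learned vocabularies with tens of thousands of entries and may split 'strawberry' differently.

```python
# Hypothetical toy vocabulary: pieces and IDs are made up for illustration.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into (piece, id) pairs."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                pieces.append((piece, vocab[piece]))
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces

tokens = toy_tokenize("strawberry")
token_ids = [token_id for _, token_id in tokens]
print(tokens)     # [('straw', 101), ('berry', 102)]
print(token_ids)  # [101, 102]
```

What the model works with is the list of IDs on the last line. Nothing in `[101, 102]` says how many times 'r' occurs; that information lives only in the spelling of the pieces, which the model never receives directly.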
This tokenization scheme exists for good reason. Processing text character by character would make every input several times longer, and in transformer models the cost of attention grows roughly with the square of input length, which would be computationally ruinous at scale. Tokens compress language into manageable units and capture meaningful semantic relationships. The word 'unhappiness' might become three tokens that individually carry information about negation, emotion, and state. This is elegant engineering for generating coherent prose. It is terrible engineering for counting.
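A back-of-the-envelope sketch of the cost argument, assuming a transformer-style model whose attention scores every position against every other position, so the work grows with the square of sequence length. The sample sentence and the whitespace split standing in for a tokenizer are illustrative assumptions; real tokenizers typically land somewhere between words and characters.

```python
def pairwise_comparisons(sequence_length):
    """Number of position pairs a full self-attention layer scores."""
    return sequence_length ** 2

sentence = "the strawberry problem is a window into the architecture"

char_len = len(sentence)          # one unit per character, spaces included
word_len = len(sentence.split())  # crude stand-in for a token count

char_cost = pairwise_comparisons(char_len)
word_cost = pairwise_comparisons(word_len)

print("units per sentence:", char_len, "vs", word_len)
print("attention pairs:   ", char_cost, "vs", word_cost)
```

The gap widens as inputs get longer, which is why the trade-off favors tokens even though it hides spelling from the model.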
The pattern-matching mirage
What makes the strawberry problem so revealing is that models will often get similar questions right. Ask about the letters in 'banana' and the response may be accurate, not because the model counted anything, but most likely because 'banana' appears frequently in training data alongside discussions of its repeated letters. The model learned an association, not a procedure.
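The contrast between an association and a procedure can be sketched in a few lines of Python. The lookup table is a deliberately crude stand-in for memorized training patterns, with invented entries (one of them wrong on purpose); it is not a claim about how any model actually stores knowledge, and the function names are hypothetical.

```python
# Crude stand-in for "associations": answers are recalled, not computed.
# Entries are invented for illustration; the 'strawberry' one is wrong on purpose.
MEMORIZED_COUNTS = {
    ("banana", "a"): 3,
    ("strawberry", "r"): 2,
}

def recall_count(word, letter):
    """Return a remembered answer, or None if the pair was never seen."""
    return MEMORIZED_COUNTS.get((word, letter))

def compute_count(word, letter):
    """Walk the characters one by one and count the matches."""
    return sum(1 for ch in word if ch == letter)

for word, letter in [("banana", "a"), ("strawberry", "r"), ("raspberry", "r")]:
    print(word, letter, recall_count(word, letter), compute_count(word, letter))
```

Recall is right when the memory happens to be right, wrong when it is not, and empty for anything unseen. The procedure is right every time because it never consults memory at all.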
This distinction between learning patterns and learning processes runs through every limitation of current AI systems. A model that can write a functional sorting algorithm in Python cannot reliably carry out that same algorithm, step by step, on a list of five numbers without running the code. It knows what sorting looks like without knowing what sorting is. The appearance of reasoning emerges from statistical relationships in training data, not from anything resembling cognition.
Where the seams show
The letter-counting failure belongs to a broader category of tasks where large language models struggle despite their apparent sophistication. Multi-step arithmetic, spatial reasoning, calendar calculations, and logical puzzles with novel structures all expose similar gaps. These are domains where correct answers require sequential, verifiable operations rather than plausible-sounding responses.
Engineers have developed workarounds. Chain-of-thought prompting encourages models to show their work, sometimes improving accuracy. Tool use allows models to call calculators or code interpreters for precise computation. Newer architectures experiment with dedicated reasoning modules. But these are patches on a fundamental design, not solutions to it. The underlying technology remains a sophisticated autocomplete engine, predicting what text should come next based on everything it has seen before.
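As a rough illustration of the tool-use workaround, the sketch below routes a letter-counting request to ordinary code instead of letting the model guess. The JSON shape, the `TOOLS` registry, and the function names are assumptions made for this example and do not correspond to any particular vendor's API.

```python
import json

def count_letter(word: str, letter: str) -> int:
    """Deterministic tool: case-insensitive count of one letter in a word."""
    return word.lower().count(letter.lower())

# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"count_letter": count_letter}

def run_tool_call(raw_call: str):
    """Parse a JSON tool request and execute the named tool."""
    call = json.loads(raw_call)
    tool = TOOLS[call["name"]]
    return tool(**call["arguments"])

# Stand-in for what a model might emit instead of answering directly.
model_output = (
    '{"name": "count_letter", '
    '"arguments": {"word": "strawberry", "letter": "r"}}'
)
print(run_tool_call(model_output))  # 3
```

The model's only job here is to recognize that the question calls for a tool and to fill in the arguments; the counting itself is done by code that operates on characters.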
Our take
None of this diminishes what large language models accomplish. They have democratized access to competent first drafts, technical explanations, and creative brainstorming in ways that seemed impossible a decade ago. But the strawberry test should be mandatory orientation for anyone integrating AI into consequential workflows. These systems are not thinking machines that occasionally make mistakes; they are pattern engines that occasionally produce outputs indistinguishable from thought. The difference is subtle until it is catastrophic. Knowing which tasks fall on which side of that line is the only AI literacy that actually matters.




