Ask a large language model how many times the letter 'r' appears in 'strawberry' and it may well stumble. This is not simply a gap in the training data. It is a consequence of the architecture, baked into the very process that allows these systems to function at all.
The culprit is tokenization, the preprocessing step that converts human text into the numerical sequences a neural network can actually process. Before a single calculation occurs, your prompt is chopped into tokens: fragments that might be whole words, partial words, or individual characters, depending on statistical patterns in the corpus the tokenizer was built from. The model never sees 'strawberry' as a string of letters. It sees one token ID, or perhaps two or three, each an integer that indexes into a fixed vocabulary.
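As a rough illustration of that chopping step, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary, the IDs, and the function name toy_tokenize are all made up for this example; production tokenizers typically learn their vocabularies with algorithms such as byte-pair encoding and will split words differently.

```python
# Hypothetical toy vocabulary. Real tokenizers learn tens of thousands of
# entries from corpus statistics; these pieces and IDs are invented.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Segment text into token IDs by greedy longest match (illustrative only)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            # No vocabulary entry covers the character at position i.
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


print(toy_tokenize("strawberry"))  # [0, 1]
print(toy_tokenize("tray"))        # [3, 4, 5, 9]
```

With this toy vocabulary, 'strawberry' becomes two integers, while a word the vocabulary does not cover falls back to one ID per character.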
The translation problem
Tokenization behaves like lossy compression from the model's point of view, even though the token IDs themselves can be decoded back into the exact original text. What gets lost is visibility. When a model encounters 'strawberry', it might process it as a single token whose learned representation correlates with concepts like fruit, red, summer, and jam. Nothing in that input spells out the letter sequence s-t-r-a-w-b-e-r-r-y; whatever the model knows about spelling it has to pick up indirectly from its training data. Asking the model to count letters is like asking someone to count the brushstrokes in a painting they have only seen in a photograph: the detail was flattened before they ever saw it.
This explains why models struggle with tasks that seem trivially easy to humans: reversing words, identifying rhymes, counting syllables, or detecting anagrams. These operations require character-level access, which tokenization trades away in favor of compact, meaning-bearing units. The tradeoff is intentional. Processing text token-by-token rather than character-by-character shortens the input sequence, which reduces computational cost and lets models capture meaning across longer contexts.
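The gap between what a program sees and what a model sees can be sketched in a few lines. The ID table below is hypothetical and matches no real model's vocabulary, and the helper names decode and count_letter are invented for illustration. The point is that counting letters is trivial once the text is decoded, whereas the ID sequence a model actually receives contains no letters to count.

```python
# Hypothetical ID-to-text table for a toy tokenizer (not a real vocabulary).
ID_TO_PIECE = {0: "straw", 1: "berry"}


def decode(ids, table=ID_TO_PIECE):
    """Map token IDs back to text by concatenating their pieces."""
    return "".join(table[i] for i in ids)


def count_letter(ids, letter, table=ID_TO_PIECE):
    """Count a letter by decoding first; the IDs alone show no spelling."""
    return decode(ids, table).count(letter)


token_ids = [0, 1]                    # what the model is given
print(decode(token_ids))              # strawberry
print(count_letter(token_ids, "r"))   # 3
print(token_ids.count("r"))           # 0, the raw IDs contain no letters
print(len(token_ids), len(decode(token_ids)))  # 2 10
```

The last line also hints at the cost side of the tradeoff: under this toy scheme the model handles two positions instead of ten.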
What this reveals about intelligence
The tokenization limitation illuminates something deeper about how large language models work. They are not reasoning engines that happen to communicate in language. They are pattern-completion systems trained on token sequences, predicting what comes next based on statistical regularities in vast corpora. When a model appears to reason, it is retrieving and recombining patterns that correlate with reasoning-like outputs. When it fails at counting letters, the mask slips.
This distinction matters because it clarifies what these systems can and cannot do. Tasks that map well onto token-level pattern matching (summarization, translation, style transfer, code generation) play to the architecture's strengths. Tasks requiring precise symbolic manipulation, mathematical proof, or character-level text analysis fight against the grain. Scaling alone is unlikely to fix this. A larger model with the same tokenization scheme is prone to the same category of errors, just with more confident-sounding explanations.
Our take
The strawberry test has become a parlor trick, a gotcha that skeptics deploy to deflate AI hype. But its real value is pedagogical. Understanding why models fail at letter-counting teaches more about their nature than any benchmark score. These systems are genuinely remarkable at tasks involving semantic pattern-matching, and genuinely incapable of tasks requiring symbolic precision. The sooner we internalize this distinction, the sooner we can deploy them wisely — and stop expecting calculators to write poetry or poets to calculate.




