Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and, in many widely shared examples, it confidently answers two. The correct answer is three. This is not a bug awaiting a patch. It is a window into what large language models fundamentally are, and are not.
The failure is so consistent, so reproducible across models and prompts, that it has become a kind of parlor trick among AI researchers. But the strawberry problem, as it has come to be known, deserves more than amusement. It reveals that the systems we increasingly treat as oracles operate on principles radically different from human cognition.
The tokenization trap
Large language models do not see text the way humans do. Before any processing begins, input is broken into tokens: chunks that might be whole words, word fragments, or individual characters, depending on how the tokenizer's vocabulary was built. The word 'strawberry' might become 'straw' and 'berry', or 'str', 'aw', and 'berry', or some other decomposition entirely. Each token is then mapped to an integer ID, and those IDs are what the model receives. It never encounters the raw sequence of letters.
This matters because the model's entire universe of meaning is built from statistical relationships between tokens, not characters. When asked to count letters, it must somehow reconstruct character-level information from token-level representations — a task for which it was never optimized. The model learned to predict the next plausible token in a sequence, not to perform discrete symbolic operations on sub-token elements.
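To make the token-level view concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary `VOCAB` and the function `tokenize` are invented for illustration and do not correspond to any real model's tokenizer. The point is only that what comes out the other side is a short list of IDs, with no letters in it.

```python
# Toy illustration only: VOCAB is a made-up vocabulary, not a real tokenizer's.
VOCAB = {"straw": 0, "berry": 1, "str": 2, "aw": 3}

# Add single letters as a fallback so any lowercase word can be split.
for ch in "abcdefghijklmnopqrstuvwxyz":
    VOCAB.setdefault(ch, len(VOCAB))


def tokenize(text):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until an entry matches.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                pieces.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces


pieces = tokenize("strawberry")
ids = [VOCAB[p] for p in pieces]
print(pieces)  # ['straw', 'berry']
print(ids)     # [0, 1]
```

Under this toy vocabulary the ten letters collapse into two IDs. Nothing in `[0, 1]` says that three of those letters were 'r'; that information would have to be learned indirectly from training data.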
The result is that counting letters, a task that a few lines of code have handled trivially since the earliest days of computing, becomes genuinely difficult for systems capable of passing bar exams and writing competent poetry.
What fluency obscures
The counting failure exposes a deeper truth: fluency is not understanding. Language models are extraordinarily good at producing text that sounds like it was written by someone who understands the subject matter. They have absorbed the statistical patterns of human expertise across millions of documents. But pattern matching, however sophisticated, is not the same as reasoning.
This distinction matters enormously for how we deploy these systems. A model can generate a persuasive legal brief while having no actual grasp of legal reasoning. It can write code that compiles while having no model of what the code does. It can explain quantum mechanics in lucid prose while possessing no understanding of physics beyond the correlation of words.
The strawberry problem is merely the most visible symptom of this gap. The model fails at counting because counting requires tracking discrete states — a fundamentally different operation from predicting probable continuations.
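As a rough picture of what "tracking discrete states" means here, the sketch below counts a letter by carrying a single integer forward and updating it once per character. The helper name `count_with_trace` is hypothetical, and the recorded trace exists only to make the intermediate states visible.

```python
def count_with_trace(word, letter):
    """Count a letter while recording the running count after each character."""
    count = 0   # the discrete state that must be carried forward exactly
    trace = []
    for ch in word:
        if ch == letter:
            count += 1
        trace.append((ch, count))
    return count, trace


total, trace = count_with_trace("strawberry", "r")
print(total)  # 3
for ch, running in trace:
    print(ch, running)
```

Every step depends on the exact value left by the previous one. An approximate or merely plausible intermediate value yields a wrong final answer, which is what separates this kind of operation from producing a likely continuation.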
The workaround economy
The AI industry has developed an entire infrastructure of workarounds for these limitations. Chain-of-thought prompting encourages models to show their work, which sometimes catches errors. Tool use allows models to call external calculators or code interpreters for tasks requiring precision. Retrieval-augmented generation lets models consult external databases rather than relying on their compressed, lossy training data.
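The tool-use pattern can be sketched in a few lines. Everything below is an assumption made for illustration: the JSON request shape, the `TOOLS` registry, and the `run_tool_call` helper are hypothetical and do not reflect any particular vendor's API. The model's side is simulated with a hard-coded string standing in for text a model might emit.

```python
import json


def count_letter(word: str, letter: str) -> int:
    """Exact character count, done by ordinary code rather than by the model."""
    return word.count(letter)


# Hypothetical registry mapping tool names to plain Python functions.
TOOLS = {"count_letter": count_letter}


def run_tool_call(raw_request: str) -> str:
    """Parse a JSON tool request, run the named tool, and return a JSON result."""
    request = json.loads(raw_request)
    tool = TOOLS[request["tool"]]
    result = tool(**request["arguments"])
    return json.dumps({"tool": request["tool"], "result": result})


# Stand-in for model output; a real system would receive this from the model.
model_output = (
    '{"tool": "count_letter", '
    '"arguments": {"word": "strawberry", "letter": "r"}}'
)
print(run_tool_call(model_output))  # {"tool": "count_letter", "result": 3}
```

In this arrangement the model's job shrinks to recognizing that a question calls for counting and emitting a well-formed request. The counting itself happens in code that operates on characters directly.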
These patches work, often impressively well. But they represent a tacit admission that the core architecture has fundamental blind spots. We are building elaborate scaffolding around systems whose basic operation involves a category error: treating all problems as next-token prediction problems, even when they manifestly are not.
Our take
The strawberry problem is not a failure of scale or training data. It is an architectural inevitability, and recognizing it should reshape how we think about AI capabilities. These systems are not nascent general intelligences temporarily bad at math. They are extraordinarily powerful pattern-completion engines that happen to produce outputs resembling thought. The distinction is not pedantic. It determines whether we use these tools wisely or stumble into failures we should have anticipated. The model cannot count to three. That fact should inform every decision about where we deploy it.




