Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and there is a reasonable chance it will confidently answer two. The correct answer is three. This is not a bug that will be patched in the next update. It is a window into something fundamental about how these systems process language — and why their impressive verbal fluency masks genuine cognitive blind spots.
The strawberry problem, as it has become known in AI circles, illustrates a counterintuitive truth: large language models do not see words the way humans do. They do not read letter by letter. They consume text in chunks called tokens, which are often fragments of words rather than complete units. Depending on the tokenizer, a model may process 'strawberry' not as ten individual letters but as something closer to 'straw' and 'berry': two opaque chunks, each handled as a single symbol, with the letters inside them (including all three 'r's) never presented individually. When asked to count, the model is essentially guessing based on statistical patterns in its training data rather than performing the elementary operation a child would.
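To make the point concrete, here is a minimal sketch in Python. The vocabulary, the token IDs, and the `toy_tokenize` helper are all invented for illustration. Real tokenizers use much larger learned vocabularies and may split 'strawberry' differently. The sketch only shows that once a word has been turned into token IDs, its letters are no longer on view, while counting characters in the raw string is trivial.

```python
# Toy illustration only: this vocabulary and these IDs are invented,
# not taken from any real tokenizer.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into vocabulary pieces."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces

word = "strawberry"
pieces = toy_tokenize(word)
token_ids = [TOY_VOCAB[p] for p in pieces]

print(pieces)           # ['straw', 'berry']
print(token_ids)        # [101, 102]  <- roughly what a model receives
print(word.count("r"))  # 3           <- trivial with access to the characters
```

Nothing in the pair `[101, 102]` says how many 'r's the original word contained. That information has to be inferred from patterns seen in training.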
The illusion of understanding
This matters far beyond parlor tricks. The same tendency to produce plausible text rather than inspect the input creates subtle failures in tasks that seem well within an AI's wheelhouse: verifying citations, checking arithmetic, confirming that a legal document contains specific required clauses. The model generates text that looks correct because it has learned what correct text looks like, not because it has verified the underlying facts. It is an extraordinarily sophisticated autocomplete, not a reasoning engine.
The gap becomes clearer when you consider what these models actually learn during training. They ingest billions of documents and develop statistical associations between sequences of tokens. They learn that 'the capital of France' is overwhelmingly followed by 'Paris' and that 'strawberry' appears in contexts involving red fruit and summer desserts. But they do not build an internal model of France as a country with geography and history, nor do they construct a representation of 'strawberry' as a string of characters that can be inspected and counted.
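The idea of "statistical associations between sequences" can be sketched with a toy next-word counter. This is a drastic simplification and an assumption made for illustration: real language models use neural networks over sub-word tokens, not word-pair counts, and the three-sentence corpus and the `most_likely_next` helper below are invented.

```python
from collections import Counter, defaultdict

# A tiny invented corpus; real training sets are vastly larger.
corpus = [
    "the capital of france is paris",
    "the capital of france is paris",
    "the capital of france is lovely",
]

# Count which word follows each word.
follow_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, nxt in zip(words, words[1:]):
        follow_counts[current][nxt] += 1

def most_likely_next(word):
    """Return the continuation seen most often after `word` in the toy corpus."""
    return follow_counts[word].most_common(1)[0][0]

print(most_likely_next("is"))  # paris
```

The counter "knows" that 'paris' tends to follow 'is' in this corpus without holding any notion of what France or Paris is.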
Why fluency deceives us
Humans are terrible at distinguishing genuine understanding from fluent performance. We evolved to infer intelligence from language — if someone speaks eloquently about quantum physics, we assume they understand quantum physics. Large language models exploit this heuristic ruthlessly. They produce grammatically perfect, contextually appropriate, stylistically sophisticated text, and our brains interpret this as comprehension.
The deception runs deeper because these models are genuinely useful. They can draft competent emails, summarize complex documents, and generate working code. The failure modes are not constant; they are intermittent and unpredictable. A model might correctly count letters in one word and fail on another. It might catch an arithmetic error in one context and propagate a worse one in the next. This inconsistency makes calibration nearly impossible. Users cannot develop reliable intuitions about when to trust the output.
Our take
The strawberry problem is not a gotcha for AI skeptics to deploy at dinner parties. It is a genuine insight into the nature of current systems — and a useful corrective to both hype and dismissal. These models are neither the general intelligences their boosters sometimes imply nor the parlor tricks their critics suggest. They are something genuinely new: systems that have mastered the surface structure of language without grasping its foundations. Understanding that distinction is the first step toward using them wisely.




