Ask ChatGPT how many times the letter 'r' appears in the word 'strawberry' and there is a reasonable chance it will confidently answer two. The correct answer is three. This is not a bug that OpenAI forgot to fix; it is a window into the alien cognition of large language models, and understanding it clarifies both their remarkable capabilities and their hard limits.

The strawberry problem became a minor internet sensation when users discovered that frontier AI systems—capable of passing bar exams and writing functional code—stumbled over a task a six-year-old handles easily. The failure seems absurd until you understand that these models do not see words the way humans do. They see tokens.

The tokenisation layer

Before any large language model processes your input, a tokeniser breaks your text into chunks. These chunks are not letters, not syllables, not even whole words; they are subword fragments chosen by how often they occur in the tokeniser's training data. The word 'strawberry' might become something like 'straw' + 'berry' or 'str' + 'awberry' depending on the tokeniser's vocabulary. Once this split happens, the model receives only a sequence of token IDs, so the spelling inside each token is no longer directly visible to it. The model operates on these tokens the way you might operate on Lego bricks without remembering the plastic molecules inside them.
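To make the split concrete, here is a minimal sketch of a toy tokeniser. It uses greedy longest-match against a tiny vocabulary, which is a simplification and not the byte-pair encoding that production tokenisers typically use. The vocabulary, the token IDs and the function name toy_tokenise are all invented for illustration.

```python
# Hypothetical vocabulary: real tokenisers learn tens of thousands of
# fragments from data. These entries and IDs are invented for illustration.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}


def toy_tokenise(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into (fragment, id) pairs."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(text), i, -1):
            fragment = text[i:j]
            if fragment in vocab:
                pieces.append((fragment, vocab[fragment]))
                i = j
                break
        else:
            # No vocabulary entry starts here: emit the single character
            # with a placeholder ID of -1.
            pieces.append((text[i], -1))
            i += 1
    return pieces


pieces = toy_tokenise("strawberry")
token_ids = [token_id for _, token_id in pieces]
print(pieces)     # [('straw', 101), ('berry', 102)]
print(token_ids)  # [101, 102]
```

The list of IDs on the last line is all that a model in this toy setup would receive. Nothing in [101, 102] says how many r's the original word contained.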

This design is not an oversight. Processing text character-by-character would make every input several times longer, and in standard transformer architectures the cost of processing grows faster than linearly with sequence length, which would be computationally ruinous at scale. Tokenisation compresses language into manageable units, enabling models to handle long documents and complex reasoning. The tradeoff is that certain low-level information (exact letter counts, precise character positions, visual spelling patterns) becomes opaque.
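A back-of-envelope calculation gives a feel for the cost. The sketch below assumes full self-attention, where every position is compared with every other position, and an average of four characters per token. That figure is a common rule of thumb for English text, not a measurement, and the document size is arbitrary.

```python
def attention_pairs(sequence_length):
    """Number of position pairs in full self-attention: n squared."""
    return sequence_length ** 2


CHARS_PER_TOKEN = 4      # assumed rule-of-thumb average for English text
document_chars = 40_000  # arbitrary example document size

char_level_length = document_chars
token_level_length = document_chars // CHARS_PER_TOKEN

ratio = attention_pairs(char_level_length) / attention_pairs(token_level_length)
print(token_level_length)  # 10000
print(ratio)               # 16.0
```

Under these assumptions, a sequence four times longer produces sixteen times as many pairwise comparisons.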

Why this matters beyond party tricks

The strawberry problem is funny, but the underlying architecture has serious implications. Models struggle with tasks that seem trivial to humans but require character-level awareness: generating text that fits exact character limits, reliably producing acronyms, counting syllables for poetry, or verifying that a password meets specific formatting rules. These failures are not random; they are systematic consequences of the tokenisation layer.
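Each of these tasks is trivial for ordinary code, which is why one plausible guardrail is to check model output deterministically instead of trusting the model's own claim. The sketch below is illustrative only: the function names are invented, and the password policy (minimum length, one digit, one uppercase letter) is a hypothetical example, not a recommendation.

```python
def fits_character_limit(text, max_chars):
    """True if text is within the limit. len() counts Unicode code points."""
    return len(text) <= max_chars


def acronym(phrase):
    """Build an acronym from the first letter of each whitespace-separated word."""
    return "".join(word[0].upper() for word in phrase.split())


def meets_password_rules(password, min_length=8):
    """Hypothetical policy: minimum length, one digit, one uppercase letter."""
    return (
        len(password) >= min_length
        and any(ch.isdigit() for ch in password)
        and any(ch.isupper() for ch in password)
    )


print(fits_character_limit("abcd", 3))        # False
print(acronym("large language model"))        # LLM
print(meets_password_rules("Strawberry3"))    # True
```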

More subtly, the gap between human and model perception creates trust problems. When a system writes eloquent prose about quantum mechanics but cannot count to three in a word, users reasonably wonder what else it might be confidently wrong about. The failure mode is invisible—there is no error message, just a plausible-sounding incorrect answer delivered with the same fluency as correct ones.

The workarounds and their limits

AI labs have developed partial solutions. Some systems now use chain-of-thought prompting to spell out words letter-by-letter before counting. Others route certain queries to external code execution, letting Python handle the arithmetic. These patches work, but they are patches—the underlying architecture still does not natively perceive characters.
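Both workarounds amount to a few lines once the text reaches ordinary code. The sketch below is not any lab's actual implementation. It mirrors the two ideas: spelling a word out one character at a time, and letting Python do the counting. The function names are invented for illustration.

```python
def spell_out(word):
    """Break a word into its individual characters."""
    return list(word)


def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    target = letter.lower()
    return sum(1 for ch in word.lower() if ch == target)


print(spell_out("strawberry"))
# ['s', 't', 'r', 'a', 'w', 'b', 'e', 'r', 'r', 'y']
print(count_letter("strawberry", "r"))  # 3
```

Code sees every character, so the count is exact. The hard part for a deployed system is recognising when a query should be routed here in the first place.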

Researchers are exploring byte-level models that process text at a finer grain, but these face their own scaling challenges, chiefly the much longer sequences they must handle. For now, the tokenisation tradeoff remains baked into the major commercial systems.

Our take

The strawberry problem is not evidence that large language models are stupid. It is evidence that they are strange—intelligent in ways that do not map onto human cognition, blind in ways that seem inexplicable until you understand the engineering. This strangeness is worth sitting with. We are building tools that can synthesise medical literature and draft legal contracts but cannot reliably tell you how many r's are in a fruit. The future of AI is not about eliminating these gaps but about knowing where they are, building guardrails around them, and resisting the temptation to treat fluent prose as proof of understanding.