Ask ChatGPT, Claude, or Gemini how many times the letter 'r' appears in the word 'strawberry,' and there is a reasonable chance it will confidently tell you two. The correct answer is three. This is not a bug to be patched; it is a window into the alien architecture of modern AI—and understanding it matters more than any benchmark score.

The strawberry problem became a minor internet meme, but the underlying phenomenon deserves serious attention. Large language models do not see words the way humans do. They see tokens: chunks of text that a tokenizer, fitted to a large corpus before the model itself is trained, found statistically useful. The word 'strawberry' might be split into 'straw' and 'berry,' or 'str,' 'aw,' and 'berry,' depending on the tokenizer. Each chunk reaches the model as a single numeric ID, so the model never encounters the individual letters as discrete objects to count. It is like asking someone to count the brushstrokes in a painting while only showing them a photograph.
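
To make the splitting concrete, here is a minimal sketch of a toy tokenizer. It is an illustration under loud assumptions: the vocabulary and the ID numbers are invented for this example, and the greedy longest-match rule is a simplification that stands in for the merge-based schemes real tokenizers use. The point is only that what comes out the other end is a short list of IDs, with no letters in sight.

```python
# Toy illustration only: this vocabulary and these IDs are invented for the
# example and do not correspond to any real model's tokenizer.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split `word` into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


pieces = toy_tokenize("strawberry")
# Unknown pieces get the placeholder ID -1 (also an assumption of this sketch).
token_ids = [TOY_VOCAB.get(piece, -1) for piece in pieces]

print(pieces)     # ['straw', 'berry']
print(token_ids)  # [101, 102]
```

In this sketch the model-facing input is the list `[101, 102]`. Nothing in those two numbers says that one chunk contains a single 'r' and the other contains two.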

The tokenization trap

Tokenization is the compression that makes language models computationally tractable. Feeding a model individual characters makes every input several times longer, which drives up the cost of training and inference, and has generally produced worse results on most tasks for the same budget. The tradeoff is that certain operations humans find trivial, such as counting letters, reversing words, and detecting anagrams, become genuinely difficult for systems that otherwise produce remarkably fluent prose.

This explains why the same model that can write a competent legal brief may struggle to tell you whether 'listen' and 'silent' are made of exactly the same letters. The model has learned statistical relationships between tokens, not the compositional structure of orthography. It can often guess correctly by pattern-matching against similar examples in its training data, but it is not actually inspecting the letters one by one.
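
For contrast, here is how little work these operations take when the letters themselves are available. This is a short sketch using only Python's standard library; the function names are hypothetical, chosen for this illustration.

```python
from collections import Counter


def count_letter(word, letter):
    """Count occurrences of `letter` in `word`, ignoring case."""
    return word.lower().count(letter.lower())


def are_anagrams(first, second):
    """Return True if both words use exactly the same letters."""
    return Counter(first.lower()) == Counter(second.lower())


def reverse_word(word):
    """Return the characters of `word` in reverse order."""
    return word[::-1]


print(count_letter("strawberry", "r"))   # 3
print(are_anagrams("listen", "silent"))  # True
print(reverse_word("strawberry"))        # yrrebwarts
```

Each function is a line or two because the program operates on characters directly. A language model answering the same questions has only token IDs to work from, which is why it has to lean on remembered patterns instead.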

What the failure illuminates

The strawberry test is useful precisely because it is so easy for humans and so revealing about machines. It demonstrates that fluency is not understanding, that confident delivery is not accuracy, and that impressive performance on complex tasks does not guarantee competence on simple ones. These are not intuitive truths. Humans naturally assume that anything smart enough to explain quantum entanglement can count to three.

The broader lesson applies to every domain where AI is being deployed. A model that drafts excellent marketing copy may hallucinate statistics. A coding assistant that writes elegant functions may miscount array indices. A medical AI that synthesizes research papers may stumble on basic arithmetic. The failure modes are not random; they follow from the architecture. Knowing where the architecture is weak is more valuable than knowing where it is strong.

Our take

The strawberry problem is not an embarrassment for AI companies to engineer around—though they are trying, with mixed results. It is a gift to users willing to receive it. Every technology has characteristic failure modes, and understanding them is the difference between using a tool and being used by one. The models are getting better at faking letter-counting through workarounds, but the underlying lesson remains: these systems process language in ways that are fundamentally unlike human cognition. That is not a flaw in the technology. It is the technology. The sooner we internalize this, the more intelligently we can deploy AI where it excels and protect ourselves where it does not.