Ask ChatGPT, Claude, or Gemini how many times the letter 'r' appears in 'strawberry' and watch a system capable of writing sonnets, debugging code, and summarizing dense legal documents stumble over a task any seven-year-old handles effortlessly. The answer is three. The models often say two. Sometimes they say one. Occasionally they get it right, then get it wrong on the next attempt. This is not a bug awaiting a patch. It is a window into how these systems fundamentally perceive language — and why that perception creates both their remarkable capabilities and their maddening limitations.

The tokenization trap

Large language models do not see text the way humans do. Before any processing occurs, text is broken into tokens — chunks that might be whole words, partial words, or individual characters depending on the tokenizer's training. The word 'strawberry' might become something like 'straw' + 'berry' or 'str' + 'aw' + 'berry', depending on the system. The model never actually encounters the raw sequence of letters s-t-r-a-w-b-e-r-r-y as discrete, countable units. It encounters abstract numerical representations of these chunks, optimized for predicting what comes next in a sequence, not for character-level analysis.
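To make the chunking concrete, here is a minimal sketch of a toy tokenizer. The vocabulary is invented for illustration, and the greedy longest-match rule is a simplification of how production tokenizers work, so real systems may split 'strawberry' differently. The point is only that what reaches the model is a short list of integers, not ten letters.

```python
# Hypothetical vocabulary for illustration only. Real tokenizers learn
# tens of thousands of entries from data, and their splits may differ.
TOY_VOCAB = {"str": 0, "aw": 1, "berry": 2, "straw": 3}


def tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces


pieces = tokenize("strawberry", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]

print(pieces)  # ['straw', 'berry']
print(ids)     # [3, 2]
```

In this toy setup the model's input is the pair of IDs 3 and 2. Nothing in those two numbers says that the first chunk contains one 'r' and the second contains two.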

This is a deliberate engineering choice, not an oversight. Processing text character by character would make every input several times longer, and therefore more expensive to process, and would tend to make the models worse at their primary task: understanding and generating coherent language at the level of meaning. Working with word-sized chunks, and with the representations the model learns for them, is part of why these systems can grasp that 'the cat sat on the mat' and 'a feline rested upon the rug' express similar ideas. It is also why they struggle with anagrams, letter counting, and precise spelling of unusual words.
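A rough back-of-the-envelope sketch shows the cost side of that trade-off. It assumes a standard transformer in which self-attention compares every position with every other position, so work grows with the square of the sequence length, and it uses a whitespace split as a crude stand-in for a real tokenizer. Both are simplifying assumptions, not a description of any particular model.

```python
def attention_pairs(sequence_length):
    """Count the position pairs that full self-attention compares."""
    return sequence_length ** 2


sentence = "the cat sat on the mat"

char_len = len(sentence)          # one position per character, spaces included
word_len = len(sentence.split())  # whitespace split as a crude stand-in for tokens

print(char_len, attention_pairs(char_len))  # 22 484
print(word_len, attention_pairs(word_len))  # 6 36
```

Under these assumptions the character-level version does more than thirteen times the pairwise work for the same sentence, and the gap widens as inputs get longer.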

What fluency obscures

The strawberry problem matters because it illustrates a broader truth about AI systems that their conversational polish tends to obscure: they are pattern-completion engines of extraordinary sophistication, not reasoning machines that happen to communicate in language. When a model explains quantum entanglement or writes a business plan, it is drawing on statistical patterns learned from billions of text examples, not working through problems from first principles.

This distinction becomes critical when these systems are deployed for tasks that require exact symbolic manipulation or exact recall: mathematics, formal logic, precise factual recall. The models can often produce correct-looking outputs because their training data contained many examples of correct mathematical notation or logical arguments. But on their own they lack the underlying computational machinery to verify their work. They cannot count on their fingers.
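For contrast, here is what the strawberry task looks like as an explicit symbolic operation in ordinary code. This is a minimal sketch using only the Python standard library; it is meant to show how trivial the task is once the letters are available as discrete units, not to suggest how any AI product works internally.

```python
from collections import Counter

word = "strawberry"

# Direct count of a single character.
r_count = word.count("r")
print(r_count)  # 3

# Full tally of every letter, computed the same deterministic way each time.
counts = Counter(word)
print(counts["r"])  # 3
```

The program returns the same answer on every run because it operates on the characters themselves rather than on learned patterns about them.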

Our take

The strawberry test should be required disclosure for anyone selling AI solutions. Not because letter-counting matters in itself, but because it punctures the illusion of general intelligence that fluent conversation creates. These systems are genuinely useful — transformatively so for certain applications. But their usefulness depends on understanding what they actually are: statistical mirrors of human language, brilliant at synthesis and generation, unreliable at anything requiring precise symbolic operations. The companies building them know this. Their customers often do not.