Ask a large language model to write a sonnet about climate change and it will produce something serviceable, perhaps even moving. Ask it how many times the letter R appears in the word 'strawberry' and there is a reasonable chance it will get it wrong. This is not a bug to be patched in the next release. It is a window into the fundamental nature of these systems — one that matters enormously as we hand them ever more consequential tasks.

The counting problem became something of an internet parlor trick as chatbots proliferated. Users discovered that systems capable of summarizing dense legal documents and debugging complex code would confidently declare that 'strawberry' contains two R's (it contains three). The failure seemed almost too absurd to be real. How could a machine that appears to reason about quantum physics stumble on a task a seven-year-old handles effortlessly?

The tokenization trap

The answer lies in how these systems perceive text. Large language models do not read words letter by letter the way humans do. They process language through tokens — chunks of text that might be whole words, parts of words, or individual characters, depending on how common the sequence is in their training data. The word 'strawberry' might be split into 'straw' and 'berry,' or broken differently still. The model never sees the individual letters laid out for counting; it sees abstract numerical representations of these chunks.

This is not a design flaw but a foundational choice that enables everything impressive about these systems. Tokenization allows models to process language efficiently, to recognize patterns across vast corpora, to generate fluent text in dozens of languages. But it means the model's relationship to the raw characters of a word is indirect, mediated, and often lossy. Asking it to count letters is like asking someone to count the bricks in a house by looking at a photograph — possible, but requiring inference rather than direct observation.
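To make the point concrete, here is a toy sketch in Python. The vocabulary, the token IDs, and the greedy longest-match rule are all assumptions made for illustration; real tokenizers learn their vocabularies from data and may split 'strawberry' differently.

```python
# Toy vocabulary: hypothetical chunks and IDs, not taken from any real tokenizer.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def toy_tokenize(text: str) -> list[int]:
    """Split text into toy token IDs using greedy longest-match."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate chunk first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in TOY_VOCAB:
                ids.append(TOY_VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

token_ids = toy_tokenize("strawberry")
print(token_ids)  # [101, 102]
```

The two integers are all that reaches the model in this sketch. Nothing in `[101, 102]` says that three R's are hiding inside.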

What prediction actually means

The deeper issue is that large language models are, at their core, prediction engines. Left to themselves, they do not execute an algorithm to reach an answer; they predict which tokens are likely to come next based on patterns absorbed from training data. When a model answers a math question correctly, it is not performing arithmetic the way a calculator does. It is drawing on countless similar problems it has seen and predicting that certain symbols should follow certain others.

This works remarkably well for many tasks because human knowledge is, in large part, pattern-based. The model has seen enough medical case studies to predict plausible diagnoses, enough legal arguments to predict reasonable objections, enough code to predict functional syntax. But counting letters in an arbitrary word is not a pattern-recognition task. It requires procedural execution — step through each character, increment a counter, report the result. The model can only simulate this process by predicting what a human would write if asked to count, which introduces the possibility of the same errors humans make when counting quickly.
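For contrast, the procedure itself is trivial once the characters are directly available. A minimal Python sketch, in which the function name `count_letter` is ours and purely illustrative:

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of letter in word, ignoring case, one character at a time."""
    target = letter.lower()
    count = 0
    for ch in word.lower():  # step through each character
        if ch == target:
            count += 1       # increment a counter
    return count             # report the result

print(count_letter("strawberry", "r"))  # 3
```

Every step here is executed, not predicted, which is why conventional code does not make this mistake.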

The implications for trust

None of this means large language models are useless or that their impressive capabilities are illusory. It means they are a specific kind of tool with a specific failure mode that users must understand. A model that hallucinates a legal citation and a model that miscounts letters are exhibiting the same underlying behavior: confident prediction without verification against ground truth.

The organizations deploying these systems in high-stakes contexts — healthcare, law, finance — are increasingly building verification layers around them, using the model's fluency for drafting while checking its outputs against authoritative sources. This is the correct approach. The error is treating the model as an oracle rather than as an extremely capable but fundamentally alien intelligence that processes information in ways that do not map neatly onto human cognition.
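In its simplest form, a verification layer might look something like the Python sketch below. Everything in it is hypothetical: the `REF-` identifiers are placeholders, `authoritative_index` stands in for whatever trusted source an organization actually uses, and a real system would need fuzzier matching than exact string lookup.

```python
def split_verified(cited: list[str], authoritative: set[str]) -> tuple[list[str], list[str]]:
    """Separate references found in the trusted index from those that are not."""
    verified = [c for c in cited if c in authoritative]
    unverified = [c for c in cited if c not in authoritative]
    return verified, unverified

# Hypothetical stand-in for an authoritative source of record.
authoritative_index = {"REF-001", "REF-002"}

# Hypothetical references pulled from a model-written draft.
draft_citations = ["REF-001", "REF-999"]

verified, unverified = split_verified(draft_citations, authoritative_index)
print(verified)    # ['REF-001']
print(unverified)  # ['REF-999']
```

The model drafts; anything that lands in the unverified list goes to a human before it goes anywhere else.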

Our take

The strawberry test is not a gotcha. It is a gift — a simple, memorable demonstration that these systems, for all their apparent sophistication, are not thinking the way we think. They are doing something else, something genuinely new and useful, but something that requires us to update our intuitions about what intelligence means and what verification requires. The companies building these tools have every incentive to paper over such limitations. Users have every reason to remember them.