Ask ChatGPT how many times the letter 'r' appears in the word 'strawberry' and watch it stumble. This is not a bug awaiting a patch. It is a window into the fundamental architecture of large language models — and the gap between what they do and what we imagine they do.
The failure is instructive. When you type 'strawberry' into a language model, the system does not see individual letters. It sees tokens — chunks of text that the model has learned to treat as atomic units during training. 'Strawberry' might be tokenized as 'straw' and 'berry,' or broken differently depending on the model. The letters themselves become invisible, buried inside these larger chunks like individual bricks hidden within prefabricated wall panels.
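To make the point concrete, here is a minimal sketch of what the model actually receives. The two-entry vocabulary and the ID numbers are invented for illustration; real tokenizers learn their own splits and assign their own IDs, and may not split 'strawberry' this way at all.

```python
# Hypothetical two-entry vocabulary; real tokenizers learn their own splits and IDs.
TOY_VOCAB = {"straw": 101, "berry": 102}
ID_TO_PIECE = {token_id: piece for piece, token_id in TOY_VOCAB.items()}

def encode(pieces):
    """Map text pieces to the integer IDs a model actually receives."""
    return [TOY_VOCAB[p] for p in pieces]

def decode(token_ids):
    """Recover the original string by looking each ID up again."""
    return "".join(ID_TO_PIECE[t] for t in token_ids)

token_ids = encode(["straw", "berry"])
print(token_ids)                     # [101, 102]

# Nothing in [101, 102] says how many r's there are; the letters
# only reappear after decoding back to characters.
print(decode(token_ids).count("r"))  # 3
```

An ordinary program can decode the IDs and count. The model, in this picture, works with the IDs alone.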
Why tokens exist
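Tokenization is not a design flaw but a necessary compromise. Processing text character by character would make every input several times longer, and because the cost of running a model grows with sequence length, that quickly becomes prohibitive at scale. By chunking text into a vocabulary of common subword units, typically tens of thousands to a few hundred thousand entries, models can process language efficiently while still handling novel words by breaking them into familiar pieces. The trade-off works brilliantly for most language tasks. It fails badly for tasks that depend on the characters inside a token, such as counting letters.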
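The compromise can be sketched in a few lines. The example below assumes a tiny hand-picked subword set and a greedy longest-match rule with a single-character fallback; production tokenizers use learned schemes such as byte-pair encoding, so treat this as an illustration of the idea rather than how any particular model does it.

```python
# Hypothetical subword vocabulary, chosen by hand for illustration only.
SUBWORDS = {"straw", "berry", "count", "ing", "un", "able"}

def segment(word, subwords=SUBWORDS):
    """Greedy longest-match segmentation, falling back to single characters."""
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first; a single character always succeeds.
        for j in range(len(word), i, -1):
            if word[i:j] in subwords or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

for word in ["strawberry", "uncountable", "xberry"]:
    pieces = segment(word)
    print(word, pieces, f"{len(word)} chars -> {len(pieces)} tokens")
```

Known words collapse into a handful of tokens, and an unfamiliar string like 'xberry' still gets through as a stray character plus a familiar piece. The shorter sequences are the efficiency gain; the hidden letters are the price.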
The deeper issue is that language models do not execute algorithms. They predict. Given a sequence of tokens, they calculate probability distributions over what token should come next, drawing on statistical patterns absorbed from vast training corpora. When asked to count letters, the model is not counting — it is pattern-matching against examples of counting it has seen before, then generating plausible-sounding output. Sometimes the pattern-match lands correctly. Often it does not.
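The contrast between predicting and executing can be shown side by side. In the sketch below, the scores assigned to the candidate answers are made up for illustration and do not come from any real model; the softmax step is the standard way such scores are turned into probabilities. The counting function underneath is an actual algorithm.

```python
import math
import random

def softmax(logits):
    """Turn raw scores into a probability distribution over candidate tokens."""
    peak = max(logits.values())
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical scores for the next token after "The number of r's is".
NEXT_TOKEN_LOGITS = {"2": 2.0, "3": 1.6, "4": -1.0}

probs = softmax(NEXT_TOKEN_LOGITS)
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
print("model-style answer (sampled):", sampled)

def count_letter(word, letter):
    """Step through every character and tally matches."""
    total = 0
    for ch in word:
        if ch == letter:
            total += 1
    return total

print("algorithmic answer:", count_letter("strawberry", "r"))  # 3
```

With these invented scores the most probable answer is the wrong one, and nothing in the sampling step checks it against the word. The loop cannot get it wrong, because it looks at every letter.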
The reasoning mirage
This reveals something uncomfortable about the 'reasoning' capabilities that AI companies trumpet. When a language model solves a logic puzzle or writes working code, it is not reasoning in any procedural sense. It is recognizing that the input resembles training examples where certain outputs followed, then generating statistically likely continuations. This can produce remarkably sophisticated results — until it encounters a problem that requires genuine step-by-step execution rather than pattern recognition.
The distinction matters enormously for understanding where AI will and will not prove reliable. Language models excel at tasks where good-enough pattern-matching suffices: drafting emails, summarizing documents, generating code that a human will review. They struggle with tasks requiring precise, verifiable computation — the kind of work where being 95 percent right is functionally equivalent to being wrong.
Our take
The counting problem is not a gotcha to embarrass AI companies. It is a useful diagnostic that reveals the true nature of these systems more clearly than any benchmark score. Large language models are extraordinarily powerful pattern-completion engines that have absorbed more human text than any person could read in a thousand lifetimes. They are not, however, minds — and the sooner we internalize that distinction, the better we will be at deploying them where they genuinely help rather than where they merely seem impressive.




