Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and watch it stumble. This is not a bug awaiting a patch. It is a window into the fundamental nature of systems that have otherwise dazzled us with their apparent intelligence.

The failure is instructive precisely because it seems so trivial. A system capable of synthesizing legal arguments, debugging code, and writing serviceable sonnets cannot reliably perform a task that requires nothing more than sequential attention. The disconnect illuminates what large language models actually do — and, more importantly, what they do not.

The tokenization trap

Language models do not see text the way humans do. Before any processing begins, input is sliced into tokens — chunks that might be whole words, word fragments, or individual characters depending on the model's vocabulary. The word 'strawberry' might become 'straw' and 'berry', or 'str', 'aw', 'ber', 'ry'. The model never encounters the raw string of letters that a human sees.
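A toy sketch makes the point concrete. The vocabulary and the greedy longest-match rule below are illustrative assumptions, not the scheme any particular model uses; production tokenizers learn much larger vocabularies and apply their own merge rules. What carries over is the output: a list of integer IDs with no letters in it.

```python
# Hypothetical toy vocabulary, invented for illustration only.
TOY_VOCAB = {"straw": 0, "berry": 1, "str": 2, "aw": 3, "ber": 4, "ry": 5}


def toy_tokenize(text, vocab):
    """Split text into vocabulary pieces by greedy longest match.

    This is a simplification: real tokenizers use learned merge rules.
    """
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shrink.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos:]!r}")
    return ids


# The model receives only these IDs, never the ten individual letters.
print(toy_tokenize("strawberry", TOY_VOCAB))  # prints [0, 1]
```

With this vocabulary, 'strawberry' arrives as two opaque numbers. Nothing in `[0, 1]` says that the first piece contains one 'r' and the second contains two.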

This preprocessing is not incidental. It is foundational. The entire architecture operates on these tokens, predicting which should follow which based on statistical patterns absorbed from training data. When asked to count letters, the model is essentially being asked to reason about entities it has never directly observed. It must infer the letter composition from token patterns — a task for which it has no reliable mechanism.

Prediction versus computation

The deeper issue is that language models are prediction engines, not computation engines. They excel at tasks that can be solved by recognizing and extending patterns: what word typically follows these words, what style matches this prompt, what argument structure fits this context. They struggle with tasks that require discrete, sequential operations performed on specific inputs.

Counting is computation. It requires maintaining a running tally while iterating through a sequence, updating state at each step. Language models have no explicit state management. They produce each token in a single forward pass through the network, and the only working memory they carry from one step to the next is the text they have already generated. When a model appears to count correctly without writing out intermediate steps, it is usually because it has memorized the answer or stumbled upon it through pattern matching, not because it has actually performed the operation.
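For contrast, here is what the operation looks like when it is actually computed. This is a minimal sketch; the function name `count_letter` is ours, chosen for illustration.

```python
def count_letter(word, letter):
    """Count occurrences of letter in word with an explicit running tally."""
    tally = 0                # state carried across iterations
    for ch in word:          # visit every character in order
        if ch == letter:
            tally += 1       # update the state at each match
    return tally


print(count_letter("strawberry", "r"))  # prints 3
```

The loop and the `tally` variable are exactly the two ingredients the paragraph above describes: iteration over a sequence and state updated at each step. The function also sees individual characters, which a token-based model does not.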

The reasoning illusion

This limitation extends far beyond counting. Language models regularly produce confident answers to questions requiring multi-step logical deduction, spatial reasoning, or temporal tracking, and regularly get them wrong in ways that reveal the absence of genuine reasoning. They are extraordinarily good at producing text that looks like reasoning. They are not actually reasoning.

The distinction matters enormously for how we deploy these systems. Tasks involving pattern recognition, style transfer, summarization, and creative generation play to their strengths. Tasks requiring precise calculation, formal verification, or guaranteed logical consistency expose their weaknesses. The most effective applications acknowledge this boundary rather than pretending it does not exist.

Our take

The counting failure is not an embarrassment to be patched over with tool use and calculator plugins. It is a feature to be understood. Language models are genuinely remarkable at what they do — they have compressed an astonishing amount of human knowledge and linguistic pattern into systems that can be queried in natural language. But they are not general intelligences, and treating them as such leads to both overreliance and unfair dismissal. The letter 'r' appears in 'strawberry' three times. That this remains a hard problem for systems that can explain quantum mechanics tells us something important about the difference between fluency and understanding.