Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and it may confidently answer two. The correct answer is three. This is not a bug that engineers forgot to fix; it is a window into the alien cognition at the heart of every large language model, and understanding it changes how you should think about AI entirely.
The failure seems absurd. A system that can write legal briefs and debug code cannot perform a task a seven-year-old handles trivially. But the absurdity dissolves once you understand that large language models do not see words the way humans do. They see tokens: chunks of text that might be whole words, word fragments, or single characters, depending largely on how frequently those character sequences appeared in the tokenizer's training data. The word 'strawberry' might be split into 'straw' and 'berry,' or 'str,' 'aw,' and 'berry,' depending on the tokenizer. Each token reaches the model as a single opaque unit, so the letters inside it are never presented one by one. The model is predicting the next token based on statistical patterns, not parsing characters.
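To make the token view concrete, here is a minimal sketch in Python. The vocabulary, the ID numbers, and the greedy longest-match rule are all assumptions chosen for illustration; production tokenizers learn far larger vocabularies and typically use merge-based schemes such as byte-pair encoding, so a real split of 'strawberry' may differ.

```python
# Hypothetical four-entry vocabulary; the IDs are arbitrary made-up numbers.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}

def toy_tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary starts here: fall back to one character.
            pieces.append(text[i])
            i += 1
    return pieces

pieces = toy_tokenize("strawberry", TOY_VOCAB)
token_ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)     # ['straw', 'berry']
print(token_ids)  # [101, 102]
```

What the model receives is the list of integers on the last line. Nothing in 101 or 102 says how many times 'r' occurs inside the text each one stands for.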
The prediction machine
Large language models are, at their core, extraordinarily sophisticated autocomplete systems. They ingest a sequence of tokens and output a probability distribution over what token should come next. One token is chosen from that distribution and appended to the sequence, and the process repeats until the response is complete. The magic emerges from scale: billions of parameters encoding patterns extracted from trillions of words of human text. The model learns that certain sequences of tokens tend to follow certain other sequences. It learns grammar, facts, reasoning patterns, and style, all as statistical regularities.
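The predict-and-repeat loop described above can be sketched in a few lines of Python. Everything here is a toy assumption: the probability table is invented, it looks only at the previous token (a real model conditions on the whole sequence so far), and it always picks the most likely token, whereas deployed systems often sample from the distribution instead.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
TOY_MODEL = {
    "<start>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.3, "<end>": 0.2},
    "a": {"dog": 0.8, "<end>": 0.2},
    "cat": {"sat": 0.7, "<end>": 0.3},
    "dog": {"ran": 0.6, "<end>": 0.4},
    "sat": {"<end>": 0.9, "the": 0.1},
    "ran": {"<end>": 1.0},
}

def generate(model, max_tokens=10):
    """Repeatedly append the most probable next token until <end>."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        distribution = model[tokens[-1]]
        # Greedy decoding: take the token with the highest probability.
        next_token = max(distribution, key=distribution.get)
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the <start> marker

print(generate(TOY_MODEL))  # ['the', 'cat', 'sat']
```

The loop never consults a grammar rule or a fact table. Whatever fluency the output has comes entirely from the numbers in the distribution, which is the sense in which the article calls these systems autocomplete at scale.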
This architecture explains both the capabilities and the limitations. When you ask a model to write a sonnet, it draws on patterns from countless sonnets in its training data. When you ask it to solve a logic puzzle, it pattern-matches against similar puzzles it has seen. But when you ask it to count letters, you are asking it to perform a fundamentally different operation—one that requires treating text as a sequence of discrete characters rather than as tokens in a statistical model.
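For contrast, this is what treating text as a sequence of discrete characters looks like in ordinary code. The helper name count_letter is made up for this sketch; the point is only that a program operating on characters gets the answer by direct inspection rather than by prediction.

```python
def count_letter(text, letter):
    """Count occurrences of a single character by inspecting each one."""
    return sum(1 for ch in text if ch == letter)

word = "strawberry"
print(count_letter(word, "r"))  # 3
print(word.count("r"))          # 3, using Python's built-in string method
```

This operation is trivial when characters are the unit of processing, and it is exactly the view of the text that a token-based model is not given.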
Why this matters for users
The letter-counting failure is a diagnostic tool. It reveals that large language models are not general-purpose reasoning engines; they are pattern-completion systems that happen to have absorbed enough patterns to simulate reasoning across many domains. They excel when the task resembles something well-represented in training data. They falter when the task requires operations their architecture was not designed to perform.
This has practical implications. Models are unreliable at precise arithmetic, exact character manipulation, and tasks requiring genuine step-by-step logical deduction rather than pattern recognition. They are superb at synthesis, summarization, translation, and creative generation—tasks where statistical fluency matters more than formal correctness. The wise user treats them as brilliant but unreliable collaborators, not as infallible oracles.
Our take
The letter-counting problem is not embarrassing; it is clarifying. It reminds us that these systems, however impressive, are not minds. They do not understand language in the way humans do. They predict plausible continuations of text, and they do it well enough to be transformatively useful. But the gap between predicting plausible text and genuinely reasoning about the world remains vast. Recognizing this gap is not pessimism about AI—it is the foundation for using it wisely.




