Ask ChatGPT how many letters are in the word "strawberry" and it may confidently tell you nine. There are ten. Ask it to count the r's in the same word and it might say two. There are three. These are not random glitches or obscure edge cases. They follow directly from how large language models process text, and understanding why reveals the strange machinery humming beneath every AI conversation.
The error is not one of intelligence but of perception. When you type a word, you see letters. When a language model receives that same word, it sees something entirely different: tokens, the atomic units into which all text is shredded before processing. The word "strawberry" might become "straw" and "berry," or "str" and "awberry," or some other fragmentation, depending on the vocabulary the model's tokenizer learned. The letters themselves, the r's you want counted, are not directly visible in what the model receives.
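A minimal sketch can make the fragmentation concrete. The vocabulary below is hypothetical and tiny, and greedy longest-match is a simplification of how production subword tokenizers work, so treat this as an illustration rather than any real model's behavior.

```python
# Hypothetical toy vocabulary: real tokenizers learn tens of thousands of entries from data.
TOY_VOCAB = {
    "straw": 0, "berry": 1, "str": 2, "aw": 3,
    "s": 4, "t": 5, "r": 6, "a": 7, "w": 8, "b": 9, "e": 10, "y": 11,
}


def tokenize(text, vocab):
    """Split text by greedy longest match (a simplification of real subword tokenizers)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


pieces = tokenize("strawberry", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces)  # ['straw', 'berry']
print(ids)     # [0, 1]
```

What reaches the network is the list of integer IDs, not the spelling. Nothing in `[0, 1]` says how many r's the two pieces contain.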
The tokenizer's invisible hand
Tokenization is the unsung preprocessing step that shapes everything a language model can and cannot do. Before any neural network sees your prompt, a tokenizer breaks your text into chunks chosen for how often they occur in training text, not for human legibility. Common words often become single tokens; rare words get carved into subword pieces; numbers are notoriously fractured in ways that make arithmetic treacherous.
This design is not arbitrary. Training a model on individual characters would mean processing sequences roughly four times longer for typical English text, dramatically increasing computational costs. Tokens compress language into manageable pieces while preserving enough statistical signal for the model to learn patterns. The tradeoff worked brilliantly for generating fluent prose. It worked terribly for tasks requiring character-level awareness.
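The length tradeoff can be checked with simple arithmetic. The segmentation below is hand-written for illustration and is an assumption, not the output of any real tokenizer; the exact ratio varies with the text and the vocabulary.

```python
sentence = "The tokenizer breaks your text into chunks"

# Hypothetical segmentation, hand-written for illustration only.
tokens = ["The", " token", "izer", " breaks", " your", " text", " into", " chunks"]

# Sanity check: the pieces must reassemble into the original sentence.
assert "".join(tokens) == sentence

char_len = len(sentence)   # sequence length if the model read characters
token_len = len(tokens)    # sequence length if the model read these tokens
ratio = char_len / token_len

print(char_len, token_len, ratio)  # 42 8 5.25
```

Here the character-level sequence is about five times longer than the token-level one. Text full of rare words or digits would compress less, which is why any single figure is only a rough average.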
The model, in other words, almost never sees the letter "r" on its own inside a word. It has seen patterns of tokens that statistically tend to follow other patterns of tokens. When you ask it to count letters, you are asking it to reason about a level of granularity that sits below its perceptual floor, like asking someone to count the atoms in an object by looking at a photograph of it.
Fluency without understanding
This architectural quirk illuminates a deeper truth about language models: their competence is statistical, not conceptual. They do not understand language the way humans do, building meaning from sounds and symbols. They predict probable continuations based on vast pattern-matching across training data. When those patterns align with genuine reasoning — which they often do, because human text reflects human thought — the output appears intelligent. When the patterns diverge from the underlying reality, the seams show.
Counting letters requires genuine symbolic manipulation: isolating each character, maintaining a running tally, returning a precise figure. Language models can simulate this process by generating text that looks like counting, but they are not actually performing the operation. They are predicting what a counting answer should look like based on similar examples in their training data. Sometimes the prediction is correct. Often it is not.
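For contrast, here is what the symbolic operation described above looks like when it is actually performed. This is a plain sketch of the three steps (isolate each character, keep a tally, return the figure); the function name `count_letter` is ours, chosen for illustration.

```python
def count_letter(word, letter):
    """Count occurrences of a single letter by walking the word character by character."""
    tally = 0
    for ch in word:        # isolate each character
        if ch == letter:
            tally += 1     # maintain a running tally
    return tally           # return a precise figure


print(len("strawberry"))                # 10
print(count_letter("strawberry", "r"))  # 3
```

The loop has direct access to every character and gives the same answer every time, which is exactly what a token-level predictor lacks.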
This explains why the same model that struggles with "strawberry" can write a sonnet, summarize a legal brief, or explain quantum entanglement in accessible terms. Those tasks reward fluency and pattern completion. Counting rewards precision the model cannot reliably provide.
Our take
The letter-counting failure is not a bug to be patched but a window into the architecture. Language models are magnificent prediction engines wrapped in a conversational interface that invites us to treat them as reasoning minds. They are not. Genuinely internalizing this distinction is the difference between using these tools wisely and being perpetually surprised by their limitations. The strawberry test is not a gotcha; it is a diagnostic. Any task requiring the model to perceive below the token level, manipulate symbols precisely, or maintain exact state across steps is prone to fail in similar ways. Plan accordingly.




