Ask a large language model to write a sonnet about loneliness and it will produce something passable, perhaps even moving. Ask it how many times the letter 'r' appears in the word 'strawberry' and it may confidently tell you two. There are three. This is not a bug that engineers forgot to fix. It is a window into the alien architecture of minds built from statistics rather than symbols.
The discrepancy seems absurd. How can a system sophisticated enough to summarize legal contracts and debug code fail at counting letters? The answer lies in tokenization — the unglamorous preprocessing step that determines what these models actually perceive.
The world through tokens
Before any language model reads your prompt, a tokenizer chops it into pieces. These pieces are not characters, not words, but something in between: subword units optimized for compression. The word 'strawberry' might become 'straw' + 'berry' or 'str' + 'awberry' depending on the tokenizer's training. The model never sees individual letters at all. It sees a sequence of abstract tokens, each represented as a high-dimensional vector.
This design choice helped make large language models practical. Training on individual characters would mean far longer sequences for the same text; training on whole words would mean an enormous vocabulary that still could not cover rare or novel words. Subword tokenization is the compromise that let researchers scale to hundreds of billions of parameters. But it means the model has no native concept of spelling. When you ask it to count letters, it must reconstruct the character sequence from tokens, a task it was never explicitly trained to perform.
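To make the point concrete, here is a minimal sketch of what a tokenizer hands to the model. Everything in it is an assumption for illustration: the vocabulary `TOY_VOCAB` and its ID numbers are invented, and the greedy longest-match rule is a simplification rather than the merge procedure any particular production tokenizer uses.

```python
# Hypothetical vocabulary: the pieces and IDs are made up for this example.
TOY_VOCAB = {
    "straw": 101,
    "berry": 102,
    "str": 103,
    "aw": 104,
    "b": 105,
    "e": 106,
    "r": 107,
    "y": 108,
}


def tokenize(text, vocab):
    """Split text into subword pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece covers {text[i]!r}")
    return pieces


def encode(text, vocab):
    """Map text to the integer IDs that stand in for what the model receives."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("strawberry", TOY_VOCAB))  # ['straw', 'berry']
print(encode("strawberry", TOY_VOCAB))    # [101, 102]
```

In this toy setup the model's input is the pair `[101, 102]`. Nothing in those two numbers says that the word contains three instances of 'r'.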
What the models actually learn
During training, large language models develop one core competency: predicting the next token given all previous tokens. They become extraordinarily good at this. So good that emergent capabilities appear — reasoning, translation, code generation — that seem to exceed mere prediction. But these capabilities arise from pattern recognition at a scale humans struggle to comprehend, not from the symbolic manipulation that characterizes human cognition.
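A toy version of next-token prediction shows the flavor of the objective. This sketch is a deliberate oversimplification: it conditions only on the single previous token and works by counting, whereas a real language model conditions on the whole preceding context through a neural network. The corpus and the function names `train_bigram` and `predict_next` are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each other token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent continuation seen after `prev`, or None."""
    if prev not in counts:
        return None
    return counts[prev].most_common(1)[0][0]


corpus = "the cat sat on the mat the cat ran".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))  # cat ("cat" follows "the" twice, "mat" once)
print(predict_next(model, "dog"))  # None (never seen in the corpus)
```

The prediction is whatever was most common in the data. Nothing in the procedure checks whether the continuation is true.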
A child counting letters in 'strawberry' performs a discrete operation: isolate each character, check if it matches 'r', increment a counter. This is trivial for a computer running traditional code. But a language model approaches the question probabilistically, drawing on patterns in its training data where people discussed letter counts. If those discussions contained errors, or if the tokenization obscures the character sequence, the model's statistical guess will be wrong.
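The child's procedure translates almost line for line into traditional code. The function name `count_letter` is ours, chosen for illustration.

```python
def count_letter(word, letter):
    """Count occurrences of `letter` in `word`, one character at a time."""
    count = 0
    for character in word:       # isolate each character
        if character == letter:  # check whether it matches
            count += 1           # increment the counter
    return count


print(count_letter("strawberry", "r"))  # 3
```

Python's built-in `"strawberry".count("r")` gives the same answer. Either way the result is computed, not guessed.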
The implications are not trivial
This limitation matters beyond parlor tricks. Language models struggle with precise arithmetic, exact string matching, and any task requiring faithful manipulation of discrete symbols. They can approximate these operations — often well enough to be useful — but they cannot guarantee correctness. For applications demanding precision, from financial calculations to database queries, this probabilistic nature requires careful scaffolding.
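One common form of scaffolding is to route questions that have an exact answer to ordinary code and leave the rest to the model. The sketch below is a hypothetical illustration, not a description of how any particular product works: the question pattern is deliberately narrow, and `ask_model` is a placeholder for whatever function would call a real language model.

```python
import re

# Matches one narrow phrasing of the letter-counting question.
LETTER_COUNT = re.compile(
    r"how many times does the letter '(\w)' appear in the word '(\w+)'",
    re.IGNORECASE,
)


def answer(question, ask_model):
    """Answer letter-count questions exactly; defer everything else to the model."""
    match = LETTER_COUNT.search(question)
    if match:
        letter = match.group(1).lower()
        word = match.group(2).lower()
        return str(word.count(letter))  # deterministic, so always correct
    return ask_model(question)  # probabilistic fallback


def unreliable_model(question):
    """Stand-in for a model call; it always gives the famous wrong answer."""
    return "two"


question = "How many times does the letter 'r' appear in the word 'strawberry'?"
print(answer(question, unreliable_model))  # 3
```

The design choice is to keep the model out of the loop wherever a deterministic answer exists, which is the same idea behind giving models access to calculators or code interpreters.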
The letter-counting failure also illustrates why language models hallucinate. They are not retrieving facts from a database; they are generating plausible continuations. When the training data is sparse or contradictory on a topic, plausibility and truth can diverge. The model does not know it is wrong because it has no mechanism for knowing, only for predicting.
Our take
The inability to count letters is not a flaw to be patched but a feature to be understood. Large language models are prediction engines of unprecedented sophistication, and prediction is not the same as comprehension. Recognizing this distinction protects us from both over-reliance and premature dismissal. These systems are genuinely useful, occasionally brilliant, and fundamentally unlike the minds that built them. The strawberry test is a reminder that intelligence comes in stranger varieties than we assumed.