Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and watch a system capable of writing sonnets, debugging code, and summarizing legal briefs stumble over a task that requires nothing more than pointing at letters and counting. The answer is three. The model will often say two. This is not a bug to be patched; it is a window into how large language models actually work.
The failure illuminates a truth that gets lost in breathless coverage of AI capabilities: large language models do not process language the way humans do. They do not see letters. They see tokens: chunks of text defined by a tokenizer, which is built to encode text compactly rather than to preserve spelling or meaning. The word 'strawberry' might be split into 'straw' and 'berry,' or 'str' and 'awberry,' depending on the tokenizer. The model receives only the numeric identifiers of those chunks, so it never encounters the individual letters as discrete, countable objects.
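To make the contrast concrete, here is a minimal Python sketch. Everything in it is an illustrative assumption: the vocabulary, the ID numbers, and the greedy longest-match rule are invented for this example, and real tokenizers typically learn their splits with methods such as byte-pair encoding. The point is only that what reaches the model is a short list of integers, while an ordinary string operation counts the letters directly.

```python
# Hypothetical toy vocabulary; the pieces and ID numbers are made up for illustration.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}


def toy_tokenize(word, vocab=TOY_VOCAB):
    """Split `word` into vocabulary pieces using greedy longest-match.

    This is a simplification, not how any particular production tokenizer works.
    """
    pieces, ids = [], []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                pieces.append(piece)
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[i:]!r}")
    return pieces, ids


pieces, ids = toy_tokenize("strawberry")
print(pieces)  # ['straw', 'berry']
print(ids)     # [101, 102]  <- the kind of input a model receives

# Counting letters is a one-line symbolic operation on the raw string.
letter_count = "strawberry".count("r")
print(letter_count)  # 3
```

Nothing in the two integers says how many times 'r' occurs inside each piece. A model can only answer correctly if it has learned, from its training text, facts about the spelling of each token.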
The prediction machine's blind spot
LLMs are, at their core, autocomplete engines of extraordinary sophistication. They predict the next token based on statistical patterns learned from billions of text examples. When you ask about the letter count in a word, you are asking the model to perform a fundamentally different operation—one that requires symbolic manipulation rather than pattern matching. It is like asking a chess grandmaster to win at checkers by playing chess moves. The skills do not transfer the way intuition suggests they should.
This is why models can write plausible-sounding explanations of quantum mechanics while failing to reliably add three-digit numbers. The training data contains countless examples of well-written physics explanations; the model has learned what such explanations look like. But arithmetic requires executing a procedure correctly every time, and procedures are not what these systems learn. They learn vibes.
What the gap tells us about intelligence
Human cognition integrates multiple systems seamlessly. We can read a sentence, count its words, notice a spelling error, and appreciate its rhythm—all simultaneously, using different cognitive faculties that evolution spent millions of years weaving together. LLMs possess one faculty, developed to superhuman levels: the ability to predict what text should come next given what came before. Everything they appear to do emerges from this single capability, which is both more impressive and more limited than it seems.
The models are getting better at arithmetic and counting through various engineering interventions—chain-of-thought prompting, tool use, retrieval augmentation. But these are workarounds, not solutions. They acknowledge the limitation rather than overcome it. The underlying architecture remains a prediction engine, not a reasoning engine, no matter how many scaffolds we build around it.
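Tool use, in particular, amounts to handing the question to ordinary code. The sketch below is a deliberately simplified assumption of what that looks like: the function names and the regular expression are hypothetical, and in a real system the model itself would decide to call the tool and supply its arguments rather than relying on a hand-written pattern.

```python
import re


def count_letter(word: str, letter: str) -> int:
    """Deterministic tool: count occurrences of `letter` in `word`, ignoring case."""
    return word.lower().count(letter.lower())


# Hypothetical pattern for questions shaped like:
#   How many times does the letter 'r' appear in 'strawberry'?
QUESTION = re.compile(r"letter\s+'(\w)'.*?in\s+'(\w+)'", re.IGNORECASE)


def answer(question: str):
    """Route letter-counting questions to the tool; return None for anything else."""
    match = QUESTION.search(question)
    if match is None:
        return None  # a real system would fall back to the language model here
    letter, word = match.groups()
    return count_letter(word, letter)


print(answer("How many times does the letter 'r' appear in 'strawberry'?"))  # 3
print(answer("Write me a sonnet about strawberries."))  # None
```

The counting is done by `str.count`, not by prediction. The scaffold gets the right answer every time precisely because the language model is not the component doing the work.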
Our take
The counting problem is useful precisely because it is trivial. It strips away the mystique and forces a clear-eyed assessment of what these systems are. They are not thinking machines that occasionally make mistakes. They are statistical mirrors that reflect patterns in human text with uncanny fidelity—and sometimes that reflection produces something that looks exactly like reasoning. The difference matters enormously for how we deploy these tools, what we trust them with, and how we think about the intelligence they may or may not possess. A system that cannot count letters can still be transformatively useful. It just is not what the marketing implies.




