Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and it may confidently tell you two. There are three. This is not a glitch awaiting a patch; it is a window into how these systems fundamentally operate, and why the gap between linguistic fluency and basic reasoning may be harder to close than the industry admits.

The error illuminates a counterintuitive truth: large language models do not process text the way humans do. They do not see words as strings of individual letters. Instead, they work with 'tokens': chunks of text that might be whole words, fragments of words, or common character sequences. The word 'strawberry' might arrive as two or three tokens depending on the model's tokenizer. The letters themselves exist only implicitly, buried in statistical patterns the model learned by predicting which token comes next across billions of sentences.
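To make the idea concrete, here is a minimal sketch of what a model actually receives. The vocabulary and the greedy longest-match rule are assumptions chosen for illustration; real tokenizers learn far larger vocabularies and may split 'strawberry' differently. The names TOY_VOCAB and toy_tokenize are hypothetical.

```python
# Hypothetical toy vocabulary, invented for this illustration only.
# Real tokenizers learn tens of thousands of entries from data.
TOY_VOCAB = {
    "str": 0, "aw": 1, "berry": 2,
    "s": 3, "t": 4, "r": 5, "a": 6, "w": 7, "b": 8, "e": 9, "y": 10,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into vocabulary entries using greedy longest match.

    This is a simplification; production tokenizers build their
    splits with more elaborate learned rules.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


tokens = toy_tokenize("strawberry")
token_ids = [TOY_VOCAB[t] for t in tokens]
print(tokens)     # ['str', 'aw', 'berry']
print(token_ids)  # [0, 1, 2]
```

Under these assumptions the model is handed the integers 0, 1 and 2. Nothing in that input states that the three chunks together contain three copies of 'r'.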

The tokenization trap

Tokenization was an engineering compromise, not a philosophical choice. Processing text character by character would make every input several times longer, which is computationally expensive, and would force the model to rebuild word-level meaning from individual letters. By chunking text into larger units, models can capture meaning more efficiently. A model that sees 'un-' and '-happy' as separate tokens can learn that the prefix often negates what follows. This is genuinely useful.

But the tradeoff is severe. When you ask a model to count letters, you are asking it to perform a task its architecture was never designed for. It must infer letter counts from token-level patterns, essentially guessing from whatever spelling information it absorbed in training and from how similar questions were answered there. Sometimes it gets lucky. Often it does not. The model has no built-in mechanism to iterate through characters the way a child learning to spell would.
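For contrast, this is roughly what iterating through characters looks like in ordinary code, which has direct access to each letter. The function name count_letter is hypothetical, and the sketch illustrates the missing operation rather than anything a language model runs internally.

```python
def count_letter(word, letter):
    """Count occurrences of letter by visiting each character in turn."""
    count = 0
    for ch in word:
        if ch == letter:
            count += 1
    return count


print(count_letter("strawberry", "r"))  # 3
```

The loop is trivial because each character is directly visible to the program. A model working from token IDs has no equivalent step to fall back on.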

Why this matters beyond party tricks

The letter-counting failure is a specific instance of a broader limitation: large language models are pattern-matching engines masquerading as reasoning systems. They excel at tasks where statistical regularities in language correlate with correct answers. They struggle when the answer requires stepping outside learned patterns to perform genuine computation.

This distinction matters for anyone deploying these systems in high-stakes contexts. A model that cannot reliably count letters should not be trusted, on its own, to verify checksums, validate structured data, or catch subtle errors in code where a single character matters. The fluency that makes these tools so compelling can obscure their brittleness. A system that writes eloquent explanations of quantum mechanics may still hallucinate a citation or miscalculate a dosage.

Our take

The industry has a terminology problem. When we call these systems 'intelligent' or describe their outputs as 'reasoning,' we import expectations they cannot meet. A more honest framing would acknowledge that LLMs are extraordinarily sophisticated autocomplete — brilliant at interpolating between patterns in their training data, hopeless at tasks requiring even trivial computation their architecture cannot express. The letter-counting failure is not embarrassing; it is clarifying. It reminds us that fluency and understanding remain, for now, entirely different things.