Large language models cannot reliably count. This single flaw reveals a great deal about how they actually work.

Ask a large language model to count the letter 'r' in the word 'strawberry' and there is a reasonable chance it will get it wrong. This is not a bug that engineers forgot to fix. It is a window into the architecture of these systems — and a useful corrective to both the hype and the fear surrounding them.

The error seems absurd. A technology capable of passing bar exams, writing functional code, and explaining quantum mechanics in the voice of a Victorian gentleman cannot perform a task that a six-year-old manages without effort. But the absurdity dissolves once you understand what these systems actually do, which is not what most people assume.

Prediction machines, not thinking machines

Large language models do not read text the way humans do. They process language as tokens: fragments of words, fixed by a tokenizer before training, that the model learns to associate with statistical patterns. The word 'strawberry' might be broken into 'straw' and 'berry', or into different chunks entirely depending on the tokenizer. The model never sees individual letters as discrete objects to be counted. It sees a sequence of token identifiers and produces a probability distribution over which token should come next.
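To make the point concrete, here is a minimal sketch of tokenization in Python. The vocabulary and the greedy longest-match rule are assumptions chosen for illustration; production tokenizers learn far larger vocabularies and may split 'strawberry' differently. What matters is the final output: a short list of integers in which no letter 'r' is visible.

```python
# Toy vocabulary: hypothetical, for illustration only. Real tokenizers
# learn tens of thousands of fragments and may split words differently.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into vocabulary fragments using greedy longest match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate fragment starting at position i first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces


def encode(text, vocab=TOY_VOCAB):
    """Map text to the integer IDs that a model would actually receive."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("strawberry"))  # ['straw', 'berry']
print(encode("strawberry"))    # [0, 1]
```

Under this toy vocabulary the model's input for 'strawberry' is just the pair [0, 1]. Any knowledge that token 1 contains two copies of 'r' has to be inferred from training data, because it is not present in the input itself.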

This is why the same model that struggles with letter-counting can write a sonnet about strawberries. Poetry emerges from pattern completion across vast training data. Counting requires a fundamentally different operation: holding discrete symbols in memory and iterating through them systematically. The architecture was not designed for this. Asking it to count letters is like asking a jazz musician to perform surgery — the skills do not transfer, no matter how impressive each is in its own domain.

The training shortcut

When models do get counting questions right, they are often retrieving memorized answers rather than computing them. If a particular counting puzzle appeared frequently in training data, the model learns the association. This creates an illusion of capability that breaks down the moment you vary the question slightly. Change 'strawberry' to 'straaaawberry' and the memorized answer no longer applies, yet the model may deliver a wrong count with exactly the same confidence.

This pattern, memorization masquerading as reasoning, extends far beyond counting. Models can appear to solve logic puzzles they have seen before while failing on structurally identical problems with different surface features. The implications for anyone using these tools professionally are significant. Reliability requires understanding which tasks fall within the model's genuine competence and which are statistical parlor tricks.

What this means for the future

Researchers are actively working on hybrid systems that route certain tasks to traditional computational tools. A model that recognizes a counting question and hands it off to a simple algorithm can achieve perfect accuracy on that question, provided it recognizes the question in the first place. This modular approach, with statistical prediction for language and deterministic computation for math, may prove more robust than trying to train pure neural networks to do everything.
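The routing idea can be sketched in a few lines of Python. Everything here is a simplifying assumption: the regular expression recognizes only one phrasing of the question, and `call_language_model` is a hypothetical placeholder rather than a real API. In deployed systems the model itself usually decides when to call a tool; the fixed pattern below only stands in for that decision.

```python
import re

# Hypothetical pattern covering a single phrasing of the question.
COUNT_PATTERN = re.compile(
    r"count the letter '(\w)' in the word '(\w+)'",
    re.IGNORECASE,
)


def count_letter(letter, word):
    """Deterministic tool: count occurrences of a letter, ignoring case."""
    return word.lower().count(letter.lower())


def call_language_model(prompt):
    """Placeholder for a real model call (an assumption, not a real API)."""
    return "[model-generated text]"


def answer(prompt):
    """Send counting questions to the tool and everything else to the model."""
    match = COUNT_PATTERN.search(prompt)
    if match:
        letter, word = match.groups()
        return str(count_letter(letter, word))
    return call_language_model(prompt)


print(answer("Count the letter 'r' in the word 'strawberry'"))     # 3
print(answer("Count the letter 'a' in the word 'straaaawberry'"))  # 4
print(answer("Write a sonnet about strawberries"))
```

The deterministic path handles 'straaaawberry' as easily as 'strawberry', because it iterates over characters instead of recalling an answer. The weak point moves to the router: a question phrased in a way the pattern does not anticipate still falls through to the model.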

But the counting problem also suggests inherent limits. Some cognitive tasks may simply be incompatible with the transformer architecture that underlies today's large language models. Genuine reasoning, as philosophers and cognitive scientists define it, may require something these systems do not possess and cannot acquire through scale alone.

Our take

The letter-counting failure is a gift to anyone trying to think clearly about AI. It demonstrates that these systems are sophisticated pattern-matchers, not nascent minds. This makes them extraordinarily useful for tasks that benefit from pattern recognition — drafting, summarizing, translating, brainstorming — and unreliable for tasks that require systematic logical operations. The companies building these tools have every incentive to blur this distinction. Users have every reason to keep it sharp.