Large language models cannot count. This explains more than you think.

Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and, in the widely shared version of this test, it confidently answers two. The correct answer is three. This is not a bug awaiting a patch. It is a window into what large language models fundamentally are, and are not.

The failure is so consistent, so reproducible across models and prompts, that it has become a kind of parlor trick among AI researchers. But the strawberry problem, as it has come to be known, deserves more than amusement. It reveals that the systems we increasingly treat as oracles operate on principles radically different from human cognition.

The tokenization trap

Large language models do not see text the way humans do. Before any processing begins, input is broken into tokens: chunks that might be whole words, word fragments, or individual characters, depending on the vocabulary the tokenizer learned. The word 'strawberry' might become 'straw' and 'berry', or 'str', 'aw', and 'berry', or some other decomposition entirely. Each token is then replaced by a numeric ID, so the model never encounters the raw sequence of letters.
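To make the splitting concrete, here is a minimal sketch of a toy tokenizer. The vocabulary, the token IDs, and the greedy longest-match rule are all assumptions made for illustration; production tokenizers learn far larger vocabularies and typically use merge-based schemes such as byte-pair encoding, so their actual splits may differ.

```python
# Hypothetical toy vocabulary: the pieces and IDs are invented for illustration.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}

# Single letters act as a fallback so any lowercase word can be tokenized.
for offset, ch in enumerate("abcdefghijklmnopqrstuvwxyz"):
    TOY_VOCAB.setdefault(ch, 1 + offset)


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until one is in the vocabulary.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                tokens.append(piece)
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[pos]!r}")
    return tokens


def encode(text, vocab=TOY_VOCAB):
    """Return the numeric IDs, which is all a model would receive."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("strawberry"))  # ['straw', 'berry']
print(encode("strawberry"))    # [101, 102]
```

In this sketch the ten letters of 'strawberry' reach the model as two integers. Nothing in those integers says how either piece is spelled.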

This matters because the model's entire universe of meaning is built from statistical relationships between tokens, not characters. When asked to count letters, it must somehow reconstruct character-level information from token-level representations — a task for which it was never optimized. The model learned to predict the next plausible token in a sequence, not to perform discrete symbolic operations on sub-token elements.
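The gap between the two levels can be sketched in a few lines. With the characters in hand, counting is a single call. With only token IDs, the count is recoverable only by looking up how each token is spelled. The IDs and the lookup table below are hypothetical, and a real model has no such explicit table; it would have to have absorbed those spellings indirectly from its training text.

```python
# Hypothetical token IDs and spellings, invented for illustration only.
ID_TO_PIECE = {101: "straw", 102: "berry"}


def count_in_text(text, letter):
    """Character-level view: the letters are directly available."""
    return text.count(letter)


def count_in_token_ids(token_ids, letter, id_to_piece=ID_TO_PIECE):
    """Token-level view: each ID must first be mapped back to its spelling."""
    return sum(id_to_piece[token_id].count(letter) for token_id in token_ids)


print(count_in_text("strawberry", "r"))     # 3
print(count_in_token_ids([101, 102], "r"))  # 3
```

Both functions return three, but the second only works because the spelling table was supplied from outside.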

The result is that counting, a task trivial for a pocket calculator from the 1970s, becomes genuinely difficult for systems capable of passing bar exams and writing competent poetry.

What fluency obscures

The counting failure exposes a deeper truth: fluency is not understanding. Language models are extraordinarily good at producing text that sounds like it was written by someone who understands the subject matter. They have absorbed the statistical patterns of human expertise across millions of documents. But pattern matching, however sophisticated, is not the same as reasoning.

This distinction matters enormously for how we deploy these systems. A model can generate a persuasive legal brief while having no actual grasp of legal reasoning. It can write code that compiles while having no model of what the code does. It can explain quantum mechanics in lucid prose while possessing no understanding of physics beyond the correlation of words.

The strawberry problem is merely the most visible symptom of this gap. The model fails at counting because counting requires tracking discrete states — a fundamentally different operation from predicting probable continuations.

The workaround economy

The AI industry has developed an entire infrastructure of workarounds for these limitations. Chain-of-thought prompting encourages models to show their work, which sometimes catches errors. Tool use allows models to call external calculators or code interpreters for tasks requiring precision. Retrieval-augmented generation lets models consult external databases rather than relying on their compressed, lossy training data.
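The tool-use pattern can be sketched as a small router that hands letter-counting questions to ordinary code. Everything here is an assumption made for illustration: the question pattern, the function names, and the `fallback_model` callable are hypothetical, and in real systems the model itself usually decides when to call a tool rather than a regular expression deciding for it.

```python
import re

# Hypothetical pattern for one narrow phrasing of the question.
_COUNT_QUESTION = re.compile(
    r"how many times does the letter '(.)' appear in '([^']+)'",
    re.IGNORECASE,
)


def count_letter_tool(letter, word):
    """Deterministic tool: count occurrences of a letter, ignoring case."""
    return word.lower().count(letter.lower())


def answer(question, fallback_model=None):
    """Route counting questions to the tool; send everything else to the model."""
    match = _COUNT_QUESTION.search(question)
    if match:
        letter, word = match.groups()
        return str(count_letter_tool(letter, word))
    if fallback_model is not None:
        # fallback_model is a hypothetical callable standing in for a language model.
        return fallback_model(question)
    return "no tool matched"


print(answer("How many times does the letter 'r' appear in 'strawberry'?"))  # 3
```

The counting is done by code that operates on characters, so the answer does not depend on how the word was tokenized.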

These patches work, often impressively well. But they represent a tacit admission that the core architecture has fundamental blind spots. We are building elaborate scaffolding around systems whose basic operation involves a category error: treating all problems as next-token prediction problems, even when they manifestly are not.

Our take

The strawberry problem is not a failure of scale or training data. It is an architectural inevitability, and recognizing it should reshape how we think about AI capabilities. These systems are not nascent general intelligences temporarily bad at math. They are extraordinarily powerful pattern-completion engines that happen to produce outputs resembling thought. The distinction is not pedantic. It determines whether we use these tools wisely or stumble into failures we should have anticipated. The model cannot count to three. That fact should inform every decision about where we deploy it.

The Joni Times
