Large language models cannot reliably count. This reveals more about intelligence than any benchmark.

Ask a large language model to count the number of times the letter 'r' appears in the word 'strawberry' and you will likely receive a confident, incorrect answer. This is not a bug awaiting a patch. It is a window into what these systems actually are — and what they are not.

The failure is instructive because counting feels elementary. Humans do it without conscious effort. Yet for an LLM, the task is surprisingly alien. These models do not see 'strawberry' as a sequence of ten characters. They see it as one or two tokens, chunks of text that the tokenizer learned to treat as units. The internal representation has no direct access to the letter-by-letter structure we take for granted. Asking an LLM to count letters is like asking someone to describe a painting they have only heard described in poetry.

The token problem

Tokenization is the preprocessing step that converts raw text into the numerical inputs a neural network can process. Different models use different tokenizers, but all of them break language into subword units optimized for compression and statistical regularity, not for human intuition. The word 'strawberry' might become one token or two, depending on the system. Either way, the model never encounters the individual letters as discrete objects during inference. It predicts the next token based on patterns learned during training, not by inspecting the internal structure of the current one.
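A toy sketch may make the point concrete. The vocabulary, the token IDs, and the `toy_tokenize` helper below are all invented for illustration; production tokenizers learn far larger vocabularies from data and split text differently. The sketch only shows the general shape of the problem: once the word becomes a short list of integers, the letters are no longer visible in the input.

```python
# Toy illustration only: this vocabulary and its IDs are invented.
# Real tokenizers learn their subword units from large text corpora.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into token IDs by greedy longest match."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids

word = "strawberry"
print(toy_tokenize(word))  # [101, 102]  <- what the model receives
print(len(word))           # 10          <- characters a human sees
print(word.count("r"))     # 3           <- trivial at the character level
```

Ordinary string code answers the question at once because it works on characters. The model, in this simplified picture, works on `[101, 102]`, and nothing in those two numbers says how many times 'r' occurs.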

This design choice was pragmatic. Training on individual characters would be computationally expensive and would sacrifice the contextual richness that makes LLMs useful. But the tradeoff means that certain tasks humans find trivial — counting, precise arithmetic, character-level manipulation — require the model to rely on memorized patterns rather than genuine computation. When those patterns are incomplete or misleading, the model fails with serene confidence.

Why confidence without competence persists

LLMs do not know what they do not know. They generate text by predicting statistically plausible continuations, and nothing in that process flags uncertainty about letter counts. The model has seen countless examples of humans answering counting questions correctly, so it produces answers that look correct. The appearance of competence is itself a learned pattern.

This dynamic extends far beyond counting. LLMs struggle with multi-step reasoning, temporal ordering, and any task requiring them to maintain and update a precise internal state. They excel at interpolation, producing outputs that resemble their training data, but falter at extrapolation, especially when the task demands something their architecture was never designed to do. The gap between fluent prose and reliable reasoning is not simply a matter of scale. A larger model that still reads tokens rather than letters inherits the same blind spot.

What this means for deployment

The counting failure is a useful heuristic for evaluating AI claims. When a vendor promises that an LLM can handle tasks requiring precision, ask whether the task is fundamentally about pattern matching or about genuine computation. Summarization, translation, and stylistic rewriting are pattern-matching tasks where LLMs shine. Inventory management, financial reconciliation, and anything involving exact counts or arithmetic are not. The distinction matters for anyone integrating these tools into workflows where errors carry consequences.

Researchers are exploring workarounds — tool use, code execution, chain-of-thought prompting — that offload precise computation to systems designed for it. These hybrid approaches are promising, but they are patches on a foundation that was never built for the tasks they address.
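The tool-use idea can be sketched in a few lines. Everything here is hypothetical: the `count_letter` function, the `TOOLS` registry, and the dictionary format of the request are assumptions made for illustration, not any vendor's actual interface. The point is only that the model's job shrinks to naming the right tool and arguments, while ordinary code does the counting.

```python
# Hypothetical sketch: the tool name, registry, and request format are
# assumptions for illustration, not a real vendor API.
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a single letter, ignoring case."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())

# Registry mapping tool names to the functions that implement them.
TOOLS = {"count_letter": count_letter}

def run_tool_call(call: dict):
    """Execute a request shaped like {'name': ..., 'arguments': {...}}."""
    func = TOOLS[call["name"]]
    return func(**call["arguments"])

# In a real system the model would emit this request; here it is hard-coded.
request = {
    "name": "count_letter",
    "arguments": {"word": "strawberry", "letter": "r"},
}
print(run_tool_call(request))  # 3
```

The answer is exact because it comes from string code rather than from next-token prediction. The arrangement is still only as reliable as the model's decision to call the tool, and to call it with the right arguments.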

Our take

The letter-counting problem is trivial in isolation and profound in implication. It reminds us that fluency is not understanding, that confidence is not accuracy, and that the most impressive language models remain, at their core, sophisticated autocomplete engines. This is not a criticism — autocomplete at scale turns out to be remarkably useful. But mistaking it for general intelligence leads to misallocated trust and, eventually, to expensive surprises. The models will keep improving. The question is whether our expectations will keep pace with what they actually are.

The Joni Times

