Large language models are, at their core, prediction engines. They do not think, calculate, or reason in any way a human would recognize. They predict the next token — a chunk of text, typically a word or word-fragment — based on statistical patterns learned from billions of documents. This architecture produces results so impressive that it is easy to forget what it cannot do. And what it cannot do is surprisingly basic.
Ask GPT-4 or Claude to count the words in a paragraph, and the model will often get the total wrong. Ask it to multiply two five-digit numbers unaided, and it may produce confident nonsense. These are not bugs to be patched. They are fundamental limitations baked into how these systems process information.
The tokenization trap
When you type a sentence, the model does not see letters or words. It sees tokens: chunks chosen by a compression-style algorithm (commonly byte-pair encoding) that optimizes for how often character sequences occur in the training text, not for meaning. The word "tokenization" might be split into "token" and "ization." The number 847,293 might become three or four separate tokens with no inherent mathematical relationship. The model has no built-in number line, no accumulator, no concept of cardinality. It has seen many examples of arithmetic in its training data and learned to pattern-match toward plausible-looking answers. Sometimes this works. Often it does not.
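To make the splitting concrete, here is a toy sketch in Python. It is not byte-pair encoding or any production tokenizer: it uses greedy longest-match over a tiny vocabulary (TOY_VOCAB) that we invented for illustration, and the function name toy_tokenize is likewise hypothetical. It only shows how a word or a number can end up as pieces that carry no meaning on their own.

```python
# Hypothetical vocabulary, made up for this example. Real tokenizers learn
# tens of thousands of pieces from data.
TOY_VOCAB = {"token", "ization", "847", ",", "293"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into the longest vocabulary pieces, scanning left to right."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


print(toy_tokenize("tokenization"))  # ['token', 'ization']
print(toy_tokenize("847,293"))       # ['847', ',', '293']
```

Nothing in the pieces '847', ',' and '293' says that together they name a single quantity. Whatever the model knows about that quantity, it has to infer from patterns across such fragments.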
This is why models struggle with tasks requiring precise sequential operations: counting characters, tracking nested parentheses, maintaining exact inventories across a long document. Within a single pass, the architecture processes all the tokens in parallel, attending to relationships between them, but it has no internal step-by-step scratch pad of the kind that even rudimentary computation requires. Any running state has to be written out as more text.
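For contrast, this is roughly what a scratch pad looks like in ordinary code. The sketch below is ours (the function name max_nesting_depth is an illustrative choice, not anything from a model's internals): it tracks nested parentheses by keeping one explicit counter and updating it one character at a time.

```python
def max_nesting_depth(text):
    """Return the deepest parenthesis nesting, or raise if unbalanced."""
    depth = 0    # running state: the explicit accumulator
    deepest = 0
    for ch in text:
        if ch == "(":
            depth += 1
            deepest = max(deepest, depth)
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without an opener")
    if depth != 0:
        raise ValueError("unclosed parenthesis")
    return deepest


print(max_nesting_depth("f(a, g(b, h(c)))"))  # 3
print("strawberry".count("r"))                # 3
```

The loop is trivial, but it is exact: the counter is always right because every step reads the previous value and writes the next one. That guarantee is precisely what pattern-matching over tokens does not provide.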
Why this matters beyond math
The counting problem is a window into deeper limitations. Language models excel at interpolation: producing outputs that resemble patterns in their training data. They struggle with extrapolation: handling genuinely novel situations that require reasoning from first principles. They can write a convincing legal brief because they have seen thousands of legal briefs. They cannot reliably determine whether a contract clause actually contradicts another clause three pages earlier, because that requires precise logical tracking, not pattern completion.
This distinction explains the uneven reliability that frustrates users. The model that writes beautiful prose may hallucinate a citation. The model that summarizes a research paper may invent a statistic. These failures are not random glitches. They emerge from the same statistical machinery that enables the impressive outputs.
The tool-use workaround
The industry's response has been to give models access to external tools: calculators, code interpreters, search engines. When a model recognizes it needs to perform arithmetic, it can offload the task to a reliable system. This works, but it requires the model to correctly identify when it needs help — a metacognitive judgment it often fails to make. A model confident in a wrong answer will not reach for a calculator.
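A minimal sketch of the offloading idea, under loud assumptions: calculator_tool, answer, and the ARITHMETIC pattern are hypothetical names of our own, and the regular expression is only a stand-in for the model's judgment about when it needs help. In real systems the model itself decides whether to call the tool, which is exactly the step that can fail.

```python
import ast
import operator
import re

# Operators the hypothetical calculator tool supports.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul}

# Stand-in for "the model recognizes it needs arithmetic" (an assumption).
ARITHMETIC = re.compile(r"\d+(?:\s*[-+*]\s*\d+)+")


def calculator_tool(expression):
    """Exactly evaluate integer arithmetic such as '48213 * 77019'."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and type(node.value) is int:
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)


def answer(prompt, model_guess):
    """Use the calculator if the prompt contains arithmetic, else the guess."""
    match = ARITHMETIC.search(prompt)
    if match:
        return str(calculator_tool(match.group()))
    return model_guess


print(answer("What is 48213 * 77019?", "about 3.7 billion"))  # 3713317047
print(answer("Write a haiku about rain.", "Soft rain on the roof"))
```

Note where the reliability lives. Once the calculator is called, the result is exact. Everything depends on the routing decision, and a prompt phrased as "forty-eight thousand times seventy-seven thousand" would slip straight past this particular pattern.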
More sophisticated approaches involve training models to always use tools for certain task types, or building verification layers that catch obvious errors. These are engineering solutions to an architectural limitation, and they help. But they do not change the fundamental nature of what these systems are.
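A verification layer can be sketched in the same spirit. The example below is an illustration under our own assumptions, not a description of any vendor's system: find_arithmetic_errors and the CLAIM pattern are hypothetical, and the checker only understands simple integer equations of the form "a + b = c" that appear verbatim in the output.

```python
import re

# Matches simple claims like "12 * 12 = 154" (pattern is an assumption).
CLAIM = re.compile(r"(\d+)\s*([+*-])\s*(\d+)\s*=\s*(\d+)")


def find_arithmetic_errors(model_output):
    """Return (claim, correct_value) pairs for every wrong equation found."""
    errors = []
    for a, op, b, claimed in CLAIM.findall(model_output):
        a, b = int(a), int(b)
        if op == "+":
            actual = a + b
        elif op == "-":
            actual = a - b
        else:
            actual = a * b
        # Recompute exactly and compare with what the text asserted.
        if actual != int(claimed):
            errors.append((f"{a} {op} {b} = {claimed}", actual))
    return errors


draft = "We shipped 120 + 45 = 165 units, and 12 * 12 = 154 pallets."
print(find_arithmetic_errors(draft))  # [('12 * 12 = 154', 144)]
```

A checker like this catches the obvious errors and nothing else. A wrong figure stated without its working, or an invented statistic, passes through untouched.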
Our take
The gap between what language models appear to do and what they actually do is the central tension of the current AI moment. These systems are genuinely useful — often remarkably so — for tasks that reward fluent pattern-matching and tolerate occasional errors. They are genuinely dangerous when deployed in contexts requiring precision, consistency, or genuine reasoning. Understanding the counting problem is not pedantry. It is the beginning of AI literacy: knowing what the tool is, so you can know what it is for.




