Ask a large language model to write a sonnet about quantum physics and it will produce something passable, perhaps even elegant. Ask it how many r's appear in the word "strawberry" and there is a reasonable chance it will get it wrong. This is not a bug awaiting a patch. It is a fundamental feature of how these systems process information, and grasping why illuminates both the genuine power and the irreducible limits of the technology reshaping white-collar work.
The confusion stems from a category error that even sophisticated users make: assuming that because language models produce coherent text about mathematics, they must be doing mathematics. They are not. They are doing something else entirely — something impressive in its own right, but categorically different from computation.
The tokenization trap
Language models do not see words the way humans do. They see tokens — chunks of text that the system has learned to treat as atomic units. The word "strawberry" might be split into "straw" and "berry," or into different fragments depending on the tokenizer's training. When you ask the model to count letters, it is not examining the orthographic structure of a word. It is pattern-matching against similar requests it encountered during training, attempting to predict what a helpful response would look like.
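To make the point concrete, here is a minimal sketch of what a tokenizer does to a word. The vocabulary, the `toy_tokenize` helper, and the greedy longest-match rule are all assumptions invented for illustration. Production tokenizers learn far larger vocabularies and may split "strawberry" differently.

```python
# Hypothetical toy vocabulary; real tokenizers learn tens of thousands of entries.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split of text into vocabulary pieces."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return pieces


word = "strawberry"
pieces = toy_tokenize(word)
token_ids = [TOY_VOCAB[p] for p in pieces]  # what the model actually receives
letter_count = word.count("r")              # trivial when the letters are visible

print(pieces)        # ['straw', 'berry']
print(token_ids)     # [0, 1]
print(letter_count)  # 3
```

In this toy setup the model's input is the pair of IDs `[0, 1]`. Nothing in those two numbers says how many r's they contain, whereas a program that can see the characters answers with a single call.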
This is why the same model that fails at letter-counting can correctly state that the square root of 144 is 12. It has seen that particular fact thousands of times in its training data. It is not computing the square root; it is retrieving a statistical association. The distinction matters enormously. Computation scales to novel problems; retrieval of familiar associations does not.
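The gap between retrieving and computing can be caricatured in a few lines. The `MEMORIZED_ROOTS` table and both function names below are hypothetical, and a lookup table is a deliberately crude stand-in for statistical association, not a description of how a model stores anything. The only library call is `math.isqrt`, Python's standard integer square root.

```python
import math

# Hypothetical "memorized" facts, standing in for associations seen often in training.
MEMORIZED_ROOTS = {4: 2, 9: 3, 16: 4, 25: 5, 100: 10, 144: 12}


def recall_root(n):
    """Retrieval: answers only what was stored, and returns None otherwise."""
    return MEMORIZED_ROOTS.get(n)


def compute_root(n):
    """Computation: the integer square root of any non-negative integer."""
    return math.isqrt(n)


print(recall_root(144), compute_root(144))      # 12 12
print(recall_root(15129), compute_root(15129))  # None 123
```

The caricature is kinder than reality in one respect: the lookup table admits it has nothing stored for 15129. A language model in the same position would still produce a number.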
What prediction cannot replace
The architecture underlying modern language models is the transformer, and the models built on it are trained to predict the next token in a sequence. They do this extraordinarily well. But prediction and calculation are different operations. When a human adds 47 and 38, they execute a procedure: add the ones column, carry the one, sum the tens column. When a language model produces "85," it is selecting the token most statistically likely to follow the prompt, based on patterns absorbed from billions of text examples.
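The schoolbook procedure described above is short enough to write out in full. This is a sketch of the pencil-and-paper method, with `add_by_columns` as an illustrative name, assuming non-negative whole numbers.

```python
def add_by_columns(a, b):
    """Add two non-negative integers digit by digit, right to left, carrying as needed."""
    # Reverse the digits so index 0 is the ones column.
    digits_a = [int(d) for d in str(a)][::-1]
    digits_b = [int(d) for d in str(b)][::-1]

    result = []
    carry = 0
    for i in range(max(len(digits_a), len(digits_b))):
        da = digits_a[i] if i < len(digits_a) else 0
        db = digits_b[i] if i < len(digits_b) else 0
        total = da + db + carry
        result.append(total % 10)  # digit written in this column
        carry = total // 10        # digit carried to the next column

    if carry:
        result.append(carry)

    return int("".join(str(d) for d in reversed(result)))


print(add_by_columns(47, 38))  # 85
```

The same dozen lines handle 9,999,999 plus 1 as reliably as 47 plus 38. A procedure does not care whether it has seen the numbers before, which is exactly the property that pattern-matching lacks.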
For common arithmetic, this works tolerably well because the training data contains countless instances of simple sums. For unusual calculations — multiplying two seven-digit numbers, say — the model enters territory where pattern-matching fails and it has no procedural fallback. It will still produce an answer, delivered with the same confident tone. This is perhaps the most dangerous property of these systems: they do not know what they do not know.
Our take
The arithmetic weakness is not a temporary embarrassment that scaling will solve. It reflects something essential about what language models are: sophisticated prediction engines, not reasoning machines. This is not a dismissal of their utility — prediction turns out to be extraordinarily useful for summarization, translation, code generation, and a hundred other tasks. But the sooner users internalize that fluent prose about a subject is not the same as understanding that subject, the sooner we can deploy these tools where they genuinely excel and stop expecting them to be something they were never built to be.




