Ask a large language model to write a sonnet about quantum entanglement and it will produce something passable, perhaps even elegant. Ask it to multiply 47 by 89 and there is a reasonable chance it will confidently deliver the wrong answer. This asymmetry is not a bug to be patched in the next release. It is a window into the fundamental architecture of how these systems process information — and why the gap between linguistic virtuosity and mathematical competence is not merely wide but categorical.

The confusion arises because humans conflate two very different cognitive operations under the umbrella of "intelligence." When we multiply numbers, we execute a deterministic algorithm: carry the one, shift the column, sum the products. When we construct a sentence, we draw on pattern recognition, context, and probabilistic inference about what word should come next. Large language models do only the second thing. They are, at their core, extraordinarily sophisticated prediction engines trained on oceans of text to forecast the most plausible continuation of any given sequence.
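
To make the contrast concrete, here is a minimal sketch of the schoolbook procedure described above, written in Python. The function name long_multiply is ours, chosen for illustration, and the sketch handles only non-negative whole numbers. Every step is fixed in advance, and the same inputs always produce the same output.

```python
def long_multiply(a: int, b: int) -> int:
    """Schoolbook multiplication on decimal digits (non-negative integers only)."""
    # Store digits least-significant first so index 0 is the ones column.
    a_digits = [int(d) for d in reversed(str(a))]
    b_digits = [int(d) for d in reversed(str(b))]
    result = [0] * (len(a_digits) + len(b_digits))

    for i, b_digit in enumerate(b_digits):       # each row is shifted by i columns
        carry = 0
        for j, a_digit in enumerate(a_digits):
            total = result[i + j] + a_digit * b_digit + carry  # sum the products
            result[i + j] = total % 10           # keep one digit in this column
            carry = total // 10                  # carry the rest to the next column
        result[i + len(a_digits)] += carry       # leftover carry for this row

    return int("".join(str(d) for d in reversed(result)))


print(long_multiply(47, 89))  # 4183
```

Nothing in this procedure depends on how often 47 and 89 happen to appear together in print, which is exactly the property a text predictor lacks.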

The tokenization problem

The trouble begins before the model even starts thinking. Text enters these systems not as words but as tokens: fragments that might be whole words, pieces of words, or individual characters, depending on how often they appear in the training data. The number 4,729 might be tokenized as "4" and "729", or as "47" and "29", or in some other arbitrary split. The model has no internal representation of four thousand seven hundred twenty-nine as a quantity. It sees only a sequence of symbols that, in its training data, appeared in certain contexts and were followed by certain other symbols.
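
A toy example shows how arbitrary the split can be. The sketch below is not how any particular production tokenizer works (real ones typically learn their vocabularies from data), and the names greedy_tokenize, VOCAB_A, and VOCAB_B are invented for illustration. It simply grabs the longest piece it recognizes at each position, using the digits of the number with the comma left out.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches.

    A deliberately simplified stand-in for a real subword tokenizer.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:
                tokens.append(piece)
                i += length
                break
    return tokens


# Two hypothetical vocabularies that differ only in which digit chunks they contain.
VOCAB_A = {"4", "7", "2", "9", "729"}
VOCAB_B = {"4", "7", "2", "9", "47", "29"}

print(greedy_tokenize("4729", VOCAB_A))  # ['4', '729']
print(greedy_tokenize("4729", VOCAB_B))  # ['47', '29']
```

The same number comes out as different sequences of symbols depending on the vocabulary, and neither sequence lines up with thousands, hundreds, tens, and ones.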

This means that when a language model appears to perform arithmetic, it is not calculating. It is pattern-matching. It has seen enough examples of "2 + 2 = 4" that it can reliably reproduce the answer. It has seen enough examples of simple multiplication that it can often get those right too. But as numbers grow larger or problems grow more complex, the model is essentially guessing based on statistical regularities in how mathematical expressions appeared in its training corpus. There is no internal calculator, no symbolic manipulation, no understanding of place value.

Why language is different

The same architecture that fails at arithmetic succeeds brilliantly at language precisely because language is fundamentally probabilistic. There is no single correct next word in a sentence, only words that are more or less appropriate given context, tone, and meaning. A language model trained on billions of sentences develops an implicit grasp of grammar, idiom, and even rhetorical structure — not because it understands these things in any conscious sense, but because the statistical patterns of well-formed English are deeply embedded in the data.
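
The idea that a word has many plausible successors, each with its own likelihood, can be illustrated with a far cruder tool than a neural network. The sketch below counts which word follows which in a three-sentence corpus we made up for the purpose. The function name bigram_probabilities and the corpus are assumptions for illustration only, and a real language model conditions on far more context than a single preceding word.

```python
from collections import Counter, defaultdict


def bigram_probabilities(corpus: list[str]) -> dict[str, dict[str, float]]:
    """Estimate P(next word | current word) from raw counts in a tiny corpus."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair each word with the word that follows it.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1

    return {
        word: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for word, followers in counts.items()
    }


# Invented example sentences, not taken from any real training set.
CORPUS = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]

probs = bigram_probabilities(CORPUS)
print(probs["the"])  # several candidates, with "cat" the most likely
print(probs["sat"])  # only one continuation was ever observed: "on"
```

After "the" there is no right answer, only a spread of more and less likely options, and that is the kind of question this style of model is built to answer.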

This is why these systems can produce text that feels intelligent. They have absorbed the surface structure of human reasoning as expressed in language. They know that arguments typically have premises and conclusions, that essays have introductions and summaries, that formal writing differs from casual speech. What they lack is any grounding in the world those words describe. They are mirrors reflecting the shape of human thought without the substance.

Our take

The arithmetic problem is not a flaw in current implementations awaiting a clever fix. It is a fundamental consequence of building intelligence from prediction rather than logic. This does not diminish what language models can do — their fluency remains genuinely useful for drafting, summarizing, and exploring ideas. But it should permanently retire the notion that these systems are a few iterations away from general intelligence. They are something new: statistical engines of plausibility, powerful within their domain, brittle outside it. Understanding the difference is the beginning of using them wisely.