Ask a large language model to multiply 47 by 83 and it will often get it right. Ask it to multiply 4,729 by 8,317 and watch it confidently produce a number that is wrong. This is not a bug to be patched; it is a window into the fundamental nature of these systems, and into why they will never be the general intelligences their most fervent boosters imagine.

The disconnect is jarring. These same systems can explain quantum mechanics, write functional Python scripts, and produce passable legal briefs. They appear, by any reasonable surface measure, to be intelligent. Yet they fail at tasks that require nothing more than following a deterministic algorithm — the kind of procedure a pocket calculator from 1975 executes flawlessly.

The pattern-matching illusion

Large language models do not calculate. They predict. When a model encounters "47 × 83 =", it is not performing multiplication; it is recognizing a pattern and predicting what tokens are statistically likely to follow based on the vast corpus of text it ingested during training. For common arithmetic problems that appeared frequently in training data, this pattern-matching often produces correct answers. For uncommon problems, the model is essentially guessing — educated guessing, but guessing nonetheless.
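To make the distinction concrete, here is a deliberately crude caricature of prediction by association. It is not how a real language model works internally (real models generalize far beyond a lookup table); the corpus, the function names, and the use of string similarity as a stand-in for "statistical likelihood" are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from difflib import get_close_matches

# Hypothetical toy "training corpus": prompts paired with the text that followed them.
CORPUS = [
    ("47 x 83 =", "3901"),
    ("47 x 83 =", "3901"),
    ("47 x 83 =", "3891"),  # a noisy example; the majority continuation still wins
    ("12 x 12 =", "144"),
]


def train(corpus):
    """Count which continuation followed each prompt in the corpus."""
    counts = defaultdict(Counter)
    for prompt, continuation in corpus:
        counts[prompt][continuation] += 1
    return counts


def predict(counts, prompt):
    """Return the most frequent continuation for the prompt.

    For an unseen prompt, fall back to the most similar seen prompt
    (an assumption standing in for "educated guessing").
    Returns None if nothing similar was ever seen.
    """
    if prompt not in counts:
        nearest = get_close_matches(prompt, list(counts), n=1)
        if not nearest:
            return None
        prompt = nearest[0]
    return counts[prompt].most_common(1)[0][0]


if __name__ == "__main__":
    model = train(CORPUS)
    print(predict(model, "47 x 83 ="))      # a frequently seen problem
    print(predict(model, "4729 x 8317 ="))  # an unseen problem
    print(4729 * 8317)                      # what an actual calculation gives
```

The toy predictor answers the familiar problem correctly because it has seen it, and answers the unfamiliar one by borrowing from whatever looks closest. At no point does it multiply anything.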

This distinction matters enormously. A calculator applies a fixed algorithm that will produce the correct answer regardless of whether anyone has ever computed that particular problem before. A language model has no algorithm for multiplication. It has statistical associations between sequences of characters. The appearance of calculation is a mirage.
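For contrast, the fixed algorithm in question can be written out in a few lines. What follows is a minimal sketch of schoolbook long multiplication on decimal digits; the function name `long_multiply` is hypothetical, and an actual calculator works in binary hardware rather than like this. The point is only that the procedure never consults past examples.

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers digit by digit, as taught in school."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative integers only")

    # Least significant digit first, so index i is the 10**i place.
    xs = [int(d) for d in reversed(str(a))]
    ys = [int(d) for d in reversed(str(b))]
    result = [0] * (len(xs) + len(ys))

    for i, x in enumerate(xs):
        carry = 0
        for j, y in enumerate(ys):
            total = result[i + j] + x * y + carry
            result[i + j] = total % 10   # digit that stays in this column
            carry = total // 10          # amount carried to the next column
        result[i + len(ys)] += carry     # leftover carry from this row

    return int("".join(str(d) for d in reversed(result)))


if __name__ == "__main__":
    print(long_multiply(47, 83))      # 3901
    print(long_multiply(4729, 8317))  # 39331093
```

The same loop that handles 47 × 83 handles 4,729 × 8,317, or two hundred-digit numbers, with no change and no dependence on whether anyone has computed that product before.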

Why this matters beyond arithmetic

The arithmetic failure is not merely a curiosity; it is diagnostic of a deeper architectural reality. Language models are compression engines for human-generated text. They have learned to approximate the outputs of human reasoning without implementing anything resembling the underlying process. When a model "explains" physics, it is reproducing patterns from physics explanations it has seen, not deriving conclusions from first principles.

This works remarkably well for many tasks precisely because human knowledge is highly redundant and interconnected. A model that has absorbed millions of physics discussions has, in effect, memorized the shape of physical reasoning. But it has not learned physics. The distinction becomes apparent at the boundaries — novel problems, edge cases, anything requiring genuine deduction rather than sophisticated interpolation.

The ceiling nobody wants to discuss

The AI industry has spent years implying that scale solves everything: add more parameters, more training data, and more compute, and eventually these systems will transcend their limitations. The arithmetic problem suggests otherwise. Models have grown dramatically larger and more capable, yet their mathematical reasoning remains fundamentally brittle. They have not learned to calculate; they have merely memorized more examples.

This does not mean language models are useless. They are extraordinarily useful for tasks that align with their actual capabilities: synthesis, summarization, translation, code completion, creative generation. They are poor at tasks requiring reliable logical deduction, precise calculation, or reasoning about novel situations that differ structurally from training examples.

Our take

The inability of language models to perform reliable arithmetic is the most clarifying fact about them. It reveals that these systems, for all their apparent sophistication, are fundamentally doing something different from thinking. They are mirrors that reflect human knowledge back at us in recombined forms — useful mirrors, often brilliant mirrors, but mirrors nonetheless. The companies building these systems have every incentive to obscure this distinction. Users have every reason to understand it. A tool you comprehend is far more valuable than a magic box you do not.