Ask a large language model to multiply 47 by 89 and it may stumble. Not because the math is hard (a pocket calculator from 1975 handles it instantly) but because the model is doing something fundamentally different from calculation. It is predicting what the answer should look like, token by token, based on patterns absorbed from billions of text examples. This distinction between prediction and computation sits at the heart of what these systems can and cannot do.

The confusion is understandable. When an AI passes a medical licensing exam or writes competent legal briefs, it is natural to assume that simpler tasks must be trivial. They are not. Language models excel at tasks requiring pattern recognition, contextual inference, and the kind of fuzzy reasoning humans do naturally. They struggle with tasks requiring precise symbolic manipulation, the domain where traditional computers have always been unbeatable.

The tokenization trap

The root issue begins before a model sees a single word. Text enters these systems through tokenization, a process that breaks language into digestible chunks. Depending on the tokenizer, the number 47,892 might become three or four separate tokens, and the split typically follows how often character sequences appeared in training text, not the number's structure. The model never "sees" the number as a unified quantity with mathematical properties. It sees a sequence of symbols, much as you might perceive an unfamiliar script: recognizable patterns without inherent meaning.
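
A toy tokenizer makes the splitting concrete. The sketch below uses greedy longest-match against a made-up vocabulary; the vocabulary and the function name are assumptions for illustration, and real tokenizers learn their vocabularies from data and will split these strings differently.

```python
# Hypothetical vocabulary: every single digit, the comma, and a few
# multi-digit chunks that happened to be "common" in imaginary training text.
VOCAB = set("0123456789,") | {"47", "478", "892"}

def tokenize(text: str, vocab: set[str] = VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry matches: emit the single character as-is.
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("47,892"))  # ['47', ',', '892']
print(tokenize("47892"))   # ['478', '9', '2']
```

The same quantity, written with and without a comma, comes out as two unrelated token sequences. Nothing in either sequence encodes that both mean forty-seven thousand eight hundred ninety-two.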

This design makes sense for language. When a split such as "un" + "comfort" + "able" follows a word's morphology, the model can relate "uncomfortable" to its components, recognizing the negation prefix and the derivational suffix. Applied to numbers, the same chunking works against the model. The digit 4 in 47 has a different positional value than the 4 in 4,700, but the model has no built-in mechanism to represent this. It must learn place value entirely from statistical patterns in training data.
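
For contrast, here is how little code it takes to state place value explicitly. This is a minimal sketch with a hypothetical helper name; it shows the rule a conventional program is simply given, which a language model has to infer from examples.

```python
def place_values(number_text: str) -> list[tuple[str, int]]:
    """Pair each digit with the quantity it contributes to the whole number."""
    digits = [ch for ch in number_text if ch.isdigit()]  # drop separators like ","
    n = len(digits)
    # The digit at index i (from the left) is worth digit * 10^(n - 1 - i).
    return [(d, int(d) * 10 ** (n - 1 - i)) for i, d in enumerate(digits)]

print(place_values("47"))     # [('4', 40), ('7', 7)]
print(place_values("4,700"))  # [('4', 4000), ('7', 700), ('0', 0), ('0', 0)]
```

The symbol "4" is identical in both inputs. Only its position turns it into 40 in one case and 4,000 in the other.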

What prediction actually means

Traditional computers execute algorithms. Given inputs, they follow deterministic steps to produce outputs. A calculator receiving 47 × 89 applies a multiplication algorithm directly to its internal representation of those numbers. The answer emerges from the operation itself.
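
As an illustration of "deterministic steps," here is schoolbook long multiplication on decimal digit strings. It is a sketch of one such algorithm, not a claim about what any particular calculator runs internally; the function name is made up for this example.

```python
def long_multiply(a: str, b: str) -> str:
    """Multiply two non-negative decimal strings digit by digit."""
    # result[k] accumulates the column for 10^k (least significant first).
    result = [0] * (len(a) + len(b))
    for i, digit_a in enumerate(reversed(a)):
        for j, digit_b in enumerate(reversed(b)):
            result[i + j] += int(digit_a) * int(digit_b)

    # Propagate carries from the lowest column upward.
    carry = 0
    for k in range(len(result)):
        total = result[k] + carry
        result[k] = total % 10
        carry = total // 10

    text = "".join(str(d) for d in reversed(result)).lstrip("0")
    return text or "0"

print(long_multiply("47", "89"))  # 4183
```

The same steps produce the right answer for two-digit and seven-digit inputs alike, because the procedure does not depend on having seen similar numbers before.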

Language models work through conditional probability. Given a prompt, they estimate a probability for each possible next token, pick one, and repeat. When you ask for 47 × 89, the model is in effect asking: "In the training data, when text resembling this question appeared, what text typically followed?" If the training data contained many examples of similar multiplications with correct answers, the model can pattern-match its way to 4,183. But it is not multiplying. It is remembering or, more precisely, interpolating between memories.
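
A deliberately crude sketch of prediction-by-frequency follows. It counts which answer followed each prompt in a tiny made-up corpus and returns the most common one. The corpus and function names are assumptions for illustration, and the lookup table is a large simplification: a real model generalizes across similar contexts, whereas this one knows only exact matches.

```python
from collections import Counter, defaultdict
from typing import Optional

# Made-up "training data". Note the one wrong example: the data need not be right.
CORPUS = [
    "2 x 3 = 6",
    "2 x 3 = 6",
    "2 x 3 = 5",
    "4 x 5 = 20",
]

def train(corpus: list[str]) -> dict[str, Counter]:
    """Count which final token followed each context in the corpus."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for line in corpus:
        context, _, answer = line.rpartition(" ")
        counts[context][answer] += 1
    return counts

def predict(counts: dict[str, Counter], prompt: str) -> Optional[tuple[str, float]]:
    """Return the most frequent continuation and its estimated probability."""
    seen = counts.get(prompt)
    if not seen:
        return None  # nothing resembling this prompt was ever observed
    token, n = seen.most_common(1)[0]
    return token, n / sum(seen.values())

model = train(CORPUS)
print(predict(model, "2 x 3 ="))    # "6" with probability 2/3
print(predict(model, "47 x 89 ="))  # None
```

No multiplication happens anywhere in this code. The answer "6" wins because it appeared most often, and the unseen prompt gets no answer at all.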

This helps explain why larger models tend to perform better at arithmetic without any architectural changes. More parameters mean more capacity to memorize patterns. A sufficiently large model has seen enough multiplication examples to generalize reasonably well, at least for numbers within common ranges. Ask it to multiply two seven-digit numbers and accuracy drops sharply, because such examples are sparse in training data.

The deeper lesson

This limitation is not a bug to be fixed but a window into what these systems fundamentally are: extraordinarily sophisticated pattern-completion engines. They have learned, from exposure to human text, to simulate many forms of reasoning. The simulation is often indistinguishable from the real thing—until you find the edge cases where the underlying mechanism shows through.

The practical implications matter. Language models augmented with calculators, code interpreters, and retrieval systems perform dramatically better on tasks requiring precision. The model handles what it does well—understanding intent, structuring problems, communicating results—while external tools handle symbolic computation. This hybrid approach acknowledges rather than obscures the technology's actual nature.
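
The division of labor can be sketched in a few lines. Everything here is hypothetical: `stub_model` stands in for a real model call, and a regular expression stands in for the model's own decision to call a tool, which in production systems is usually made by the model itself.

```python
import re

def calculator_tool(a: int, b: int) -> int:
    """Exact integer multiplication, done by the machine's arithmetic."""
    return a * b

def stub_model(prompt: str) -> str:
    """Placeholder for a language model call; returns canned text."""
    return "Here is my best attempt at a written answer."

def answer(prompt: str) -> str:
    """Send multiplication to the calculator and everything else to the model."""
    match = re.search(r"(\d+)\s*(?:x|\*|×|times)\s*(\d+)", prompt)
    if match:
        a, b = (int(group) for group in match.groups())
        return f"{a} x {b} = {calculator_tool(a, b)}"
    return stub_model(prompt)

print(answer("What is 47 x 89?"))          # 47 x 89 = 4183
print(answer("Summarize this contract."))  # falls through to the stub model
```

The routing logic is the interesting part of the design: the language side only has to recognize that a question is arithmetic, which is a pattern-recognition task, and the exact result comes from code that actually computes it.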

Our take

The arithmetic problem is oddly reassuring. It demonstrates that these systems, however impressive, remain comprehensible. They are not mysterious oracles but statistical engines with knowable properties and predictable failure modes. The companies building them understand this, which is why tool use and retrieval augmentation have become central to product development. The hype cycle benefits from mystification; actual utility requires clear-eyed assessment of what the technology can and cannot do. A system that writes better than most humans but counts worse than a 1975 pocket calculator is not a contradiction. It is simply a different kind of machine than the ones we are accustomed to.