When a large language model confidently tells you that 7 × 8 equals 54, it is not making a careless error. It is doing exactly what it was designed to do: predicting the most statistically likely next token based on patterns in its training data. The fact that this prediction happens to be mathematically wrong illuminates something essential about the nature of these systems that their marketing materials tend to obscure.

The architecture underlying modern AI assistants, the transformer model that powers everything from ChatGPT to Claude to Gemini, processes language as sequences of tokens: whole words, fragments of words, or individual characters. When you ask such a system to multiply two numbers, it does not run a multiplication algorithm. It pattern-matches against similar-looking arithmetic problems it encountered during training and predicts what answer typically followed. For common calculations that appeared frequently in its training corpus, this works remarkably well. For anything slightly unusual (longer numbers, less common operations, or problems requiring multiple steps) the illusion shatters.
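As a loose analogy, and not a description of what happens inside a transformer, the difference between recalling an answer and computing one can be sketched in a few lines of Python. The `seen_examples` table and both function names are hypothetical, invented purely for this illustration.

```python
# Toy analogy only: a real language model is not a lookup table, but the
# contrast between recalling an answer and computing one is similar in spirit.

# Hypothetical "training data": single-digit products seen often enough to memorize.
seen_examples = {(a, b): a * b for a in range(1, 10) for b in range(1, 10)}


def recall_product(a, b):
    """Return a memorized answer, or None when the pair was never seen."""
    return seen_examples.get((a, b))


def compute_product(a, b):
    """Actually perform the multiplication."""
    return a * b


print(recall_product(7, 8))          # common problem: a memorized answer exists
print(recall_product(4127, 8391))    # unusual problem: nothing to recall
print(compute_product(4127, 8391))   # computation works for any input
```

The analogy is generous to the model in one respect: the lookup table admits it has nothing to recall, whereas a language model would still emit a plausible-looking number.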

The tokenization trap

The problem runs deeper than mere memorization gaps. These models do not see numbers the way humans do. The number 1,247 might be tokenized as "1," and "247", or as "12" and "47", or in some other arbitrary way, depending on the specific tokenizer. The model has no built-in representation of quantity, no number line, no guaranteed sense that 1,247 is larger than 1,246 by exactly one. It sees symbols that happened to co-occur with certain other symbols during training. Asking it to perform arithmetic is like asking someone to do mathematics using only the shapes of the numerals, with no understanding of what those shapes represent.
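To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. The vocabulary is made up for this example and does not correspond to any real model's tokenizer; production tokenizers learn their vocabularies from data and will split numbers differently.

```python
# Hypothetical vocabulary, invented for illustration only.
VOCAB = {"1", "2", "4", "6", "7", "12", "47"}


def tokenize(text, vocab=VOCAB):
    """Greedy longest-match split of `text` into pieces found in `vocab`."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate starting at position i, then shrink.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {text[i:]!r}")
    return tokens


print(tokenize("1247"))  # ['12', '47']
print(tokenize("1246"))  # ['12', '4', '6']
```

Two numbers that differ by one come out as sequences of different lengths with only the first piece in common. Nothing in the token sequence itself signals that the two values are neighbors.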

This is why models can write surprisingly sophisticated code — they have seen millions of examples of well-structured programs — while failing to execute that same code mentally. They can explain the concept of a prime number eloquently while being unable to reliably identify whether a given number is prime. The gap between linguistic competence and computational competence is not a temporary limitation awaiting the next model update. It is baked into the fundamental architecture.
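The prime-number case shows the gap clearly. A primality check of the kind a model can readily write is only a few lines long, yet the answer is reliable only when an interpreter actually runs it. The sketch below uses plain trial division, one simple method among many, and the function name `is_prime` is our own choice.

```python
import math


def is_prime(n):
    """Trial division: test odd divisors up to the integer square root of n."""
    if n < 2:
        return False
    if n < 4:
        return True  # 2 and 3 are prime
    if n % 2 == 0:
        return False
    for d in range(3, math.isqrt(n) + 1, 2):
        if n % d == 0:
            return False
    return True


print(is_prime(7919))  # True
print(is_prime(7917))  # False, since 7917 = 3 * 7 * 13 * 29
```

Writing this function is a language task. Knowing what it returns for a particular input is a computation task, and that part is best left to the machine that executes it.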

What prediction cannot achieve

The implications extend far beyond arithmetic. Any task requiring strict logical consistency, precise multi-step reasoning, or verification against ground truth will eventually expose the same structural limitation. These systems generate plausible-sounding outputs by predicting what text would typically come next in a given context. Plausibility and truth are correlated often enough to be useful, but they are not the same thing.
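Verification against ground truth is cheap when the claim can be checked mechanically. The following sketch scans a piece of text for simple multiplication claims and recomputes each one. The regular expression and the `check_products` helper are assumptions made for this example; they handle only the narrow "a × b = c" pattern and are not a general-purpose fact-checker.

```python
import re

# Matches simple claims such as "7 × 8 equals 54" or "12 x 12 = 144".
CLAIM = re.compile(r"(\d+)\s*[×x*]\s*(\d+)\s*(?:=|equals)\s*(\d+)")


def check_products(text):
    """Return (a, b, claimed, actual) for each multiplication claim that is wrong."""
    wrong = []
    for a, b, claimed in CLAIM.findall(text):
        a, b, claimed = int(a), int(b), int(claimed)
        if a * b != claimed:
            wrong.append((a, b, claimed, a * b))
    return wrong


print(check_products("We know 7 × 8 equals 54 and 12 x 12 = 144."))
# [(7, 8, 54, 56)]
```

The checker knows nothing about how fluent or confident the surrounding prose is. It simply recomputes the product, which is exactly the step a pure text predictor does not perform.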

This is why AI assistants can produce beautifully written legal briefs that cite nonexistent cases, or medical summaries that blend accurate information with fabricated statistics. The model has no internal fact-checker, no mechanism for distinguishing between what it has seen and what it has hallucinated. It is optimizing for fluency, not accuracy. The confident tone is itself a prediction about how authoritative text typically sounds, not a reflection of actual certainty.

Our take

None of this means large language models are useless — they are genuinely transformative tools for drafting, brainstorming, translation, and countless other applications where approximate correctness and human oversight combine productively. But the arithmetic failure is a gift to anyone trying to understand what these systems actually are. They are not nascent intelligences gradually acquiring human-like reasoning. They are extraordinarily sophisticated prediction engines that have learned to mimic the surface patterns of intelligence without possessing its underlying machinery. The sooner we internalize this distinction, the better we will be at using these tools for what they do well and recognizing when they are confidently, fluently, inevitably wrong.