Large language models are trained to predict the next word or, more precisely, the next token. This single fact, properly understood, explains both their astonishing capabilities and their maddening failures.
When you ask a model to write a melancholy poem about autumn, it draws on millions of examples of melancholy poems, autumn imagery, and the statistical relationships between words that evoke wistfulness. The result can be genuinely moving. When you ask it to multiply 47 by 83, it has no multiplication algorithm to execute. It has only seen many examples of multiplication problems and their answers, and it attempts to pattern-match its way to a plausible response. Sometimes it succeeds. Often it confidently produces nonsense.
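For contrast, here is a minimal Python sketch of what "a multiplication algorithm to execute" looks like: the grade-school method, worked digit by digit with carries. The function name `long_multiply` is an illustrative choice. The sketch shows the kind of deterministic procedure a calculator runs; it says nothing about any model's internals.

```python
def long_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers with the grade-school method."""
    # Store digits least-significant first so index i means the 10**i place.
    a_digits = [int(d) for d in reversed(str(a))]
    b_digits = [int(d) for d in reversed(str(b))]
    result = [0] * (len(a_digits) + len(b_digits))

    for i, da in enumerate(a_digits):
        carry = 0
        for j, db in enumerate(b_digits):
            total = result[i + j] + da * db + carry
            result[i + j] = total % 10   # digit that stays in this place
            carry = total // 10          # overflow moves one place left
        result[i + len(b_digits)] += carry

    return int("".join(str(d) for d in reversed(result)))


print(long_multiply(47, 83))  # 3901
```

Every step follows from a rule, so the answer is the same on every run. A system that only pattern-matches on previously seen problems has no such guarantee.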
The tokenization problem
The issue begins before the model even processes your question. Language models do not see numbers the way humans do. They see tokens: chunks of text that the tokenizer's training process deemed statistically useful. The number "1,247" might be split into three tokens ("1," then "24" then "7") or handled as a single token, depending on the model. The number "1,248" might be tokenized completely differently. This means the model has no inherent understanding that these two numbers are adjacent on a number line. They are simply different sequences of symbols with different statistical associations.
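A toy sketch can make the splitting concrete. The vocabulary below is invented for illustration, and the greedy longest-match rule is a simplification; production tokenizers typically use learned merge rules such as byte-pair encoding. The names `TOY_VOCAB` and `toy_tokenize` are hypothetical.

```python
# Hypothetical vocabulary, invented for this example only.
TOY_VOCAB = {"1,", "24", "248", "7", "1", ",", "2", "4", "8"}


def toy_tokenize(text: str, vocab: set[str] = TOY_VOCAB) -> list[str]:
    """Split text by repeatedly taking the longest vocabulary piece that fits."""
    tokens = []
    pos = 0
    max_len = max(len(piece) for piece in vocab)
    while pos < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in vocab:
                tokens.append(piece)
                pos += length
                break
        else:
            # No vocabulary piece matched: emit the single character as-is.
            tokens.append(text[pos])
            pos += 1
    return tokens


print(toy_tokenize("1,247"))  # ['1,', '24', '7']
print(toy_tokenize("1,248"))  # ['1,', '248']
```

Two numbers that differ by one become sequences of different lengths that share only their first piece. Nothing in the token sequences themselves marks the numbers as neighbors.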
This is why models can correctly answer that Paris is the capital of France with near-perfect reliability but stumble when asked how many r's appear in "strawberry." The first is a pattern burned into the training data millions of times. The second requires character-level counting that the tokenization scheme actively obscures.
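The letter-counting case can be sketched the same way. Counting characters is trivial for a program that sees characters. The token split and the integer IDs below are invented for illustration (real tokenizers may split the word differently), and `count_letter`, `TOY_SPLIT`, and `TOY_IDS` are hypothetical names.

```python
def count_letter(text: str, letter: str) -> int:
    """Count occurrences of a single character by scanning every character."""
    return sum(1 for ch in text if ch == letter)


word = "strawberry"
print(count_letter(word, "r"))  # 3

# Hypothetical split and IDs, assumed for illustration only.
TOY_SPLIT = ["str", "aw", "berry"]
TOY_IDS = {"str": 101, "aw": 202, "berry": 303}

# What a model would receive under this toy scheme: opaque integers.
token_ids = [TOY_IDS[piece] for piece in TOY_SPLIT]
print(token_ids)  # [101, 202, 303]

# The letters are recoverable only by someone who can look inside each piece.
per_piece = {piece: count_letter(piece, "r") for piece in TOY_SPLIT}
print(per_piece)  # {'str': 1, 'aw': 0, 'berry': 2}
```

The count is easy when the characters are visible. Under this toy scheme the model's input is the list of IDs, in which the spelling of each piece is not directly present.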
What prediction actually means
The phrase "next-token prediction" sounds simple, but the scale of the prediction task creates emergent behaviors that genuinely surprised the researchers who built these systems. A model trained on enough text begins to exhibit something that looks like reasoning, not because it was designed to reason, but because predicting text accurately requires modeling the processes that generated that text.
A model trained on millions of legal documents develops something that resembles legal reasoning. Trained on scientific papers, it develops something resembling scientific inference. But this is mimicry of reasoning's outputs, not reasoning itself. The model has learned that certain patterns of words follow other patterns of words. It has not learned the underlying logical structures that make those patterns valid.
The confidence illusion
Perhaps the most dangerous feature of language models is that they have no internal uncertainty signal that maps to human intuitions about confidence. A model produces tokens with associated probabilities, but these probabilities reflect how likely a token is given the training distribution, not how likely the resulting statement is to be true.
This is why models can state obvious facts and complete fabrications with identical apparent certainty. They are not lying—lying requires knowing the truth and choosing to contradict it. They are simply producing statistically plausible text, and plausible-sounding falsehoods are, by definition, statistically plausible.
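A short sketch shows where token probabilities come from and why they are not truth scores. A model assigns each candidate token a raw score (a logit), and a softmax turns the scores into probabilities. The logits below are invented for illustration and do not come from any real model. They are chosen so that a frequently mentioned but wrong city outscores the correct capital.

```python
import math


def softmax(logits: dict[str, float]) -> dict[str, float]:
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(score - peak) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


# Invented scores for the prompt "The capital of Australia is".
toy_logits = {"Sydney": 2.0, "Canberra": 1.5, "Melbourne": 0.5}

probs = softmax(toy_logits)
best = max(probs, key=probs.get)

print(best)  # Sydney
print(probs[best] > probs["Canberra"])  # True
```

Under these assumed scores the most probable continuation is "Sydney", although the capital is Canberra. The probability measures how well a token fits the learned distribution of text, and nothing in the computation checks the resulting sentence against the world.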
Our take
None of this diminishes the genuine utility of language models. A tool that can draft emails, summarize documents, and brainstorm ideas at superhuman speed is valuable even if it cannot reliably count letters. But the current discourse oscillates between treating these systems as nascent gods and dismissing them as parlor tricks. The reality is more interesting and more specific: they are extraordinarily powerful pattern-completion engines operating on text, with all the capabilities and limitations that architecture implies. Understanding the architecture is not pessimism. It is the prerequisite for using these tools wisely.




