Ask a large language model to write a sonnet and it will produce something passable, perhaps even lovely. Ask it how many times the letter 'r' appears in the word 'strawberry' and there is a reasonable chance it will confidently tell you two. The correct answer is three. This is not a bug awaiting a patch; it is a window into the fundamental nature of what these systems are and, more importantly, what they are not.
The disconnect feels almost absurd. A technology capable of summarizing dense legal contracts, translating between dozens of languages, and generating plausible code cannot reliably perform a counting task that a five-year-old handles without effort. Yet the absurdity dissolves once you understand that language models do not process text the way humans do. In most cases they never see individual letters at all.
The tokenization problem
Before any text reaches the neural network, it passes through a tokenizer, a preprocessing step that breaks language into chunks called tokens. These chunks are not letters, and not always whole words. They are statistical conveniences, fragments that appeared frequently enough in the training data to earn their own numerical identity. The word 'strawberry' might become two or three tokens depending on the model's vocabulary. Once the text is tokenized, the model no longer sees the original spelling; it sees abstract identifiers that carry no inherent information about character composition.
This means that when you ask a model to count letters, you are asking it to reverse-engineer information that was discarded before it ever began thinking. It can guess, and it often guesses well enough to seem competent, but it is reconstructing from statistical memory rather than inspecting the actual string. The illusion of understanding is powerful precisely because language models are so good at pattern-matching that they can fake competence in domains where they have no direct access to the underlying data.
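The gap between what a program sees in a raw string and what a model sees after tokenization can be sketched in a few lines of Python. This is a toy illustration only: the vocabulary, its fragments, and its ID numbers are invented for the example, and the greedy longest-match rule is a simplification, not how any particular production tokenizer works.

```python
# Hypothetical toy vocabulary. The fragments and IDs are made up for
# illustration and do not correspond to any real model's tokenizer.
TOY_VOCAB = {
    "str": 101, "aw": 102, "berry": 103,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}


def tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match: at each position, take the longest known fragment."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


word = "strawberry"

# With the raw string in hand, counting letters is a direct inspection.
letter_count = word.count("r")

# After tokenization, only opaque integer IDs remain.
token_ids = tokenize(word)

print(letter_count)  # 3
print(token_ids)     # [101, 102, 103]
```

Nothing in the list [101, 102, 103] says how many times 'r' occurs. A system that receives only those numbers has to have learned, from other text, what letters each fragment contains.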
Fluency is not comprehension
The broader lesson extends well beyond counting. Language models are trained to predict the next token in a sequence, a task that rewards fluency, coherence, and stylistic mimicry. They learn that certain words tend to follow others, that paragraphs have structure, that confident assertions sound more authoritative than hedged ones. None of this requires understanding in any meaningful sense. A model can produce a grammatically perfect sentence about quantum entanglement without possessing any internal representation of physics.
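To make "predict the next token" concrete, here is a deliberately tiny stand-in: a bigram counter that picks whichever word most often followed the previous one in its training text. Real language models use neural networks over long contexts, so treat this as a sketch of the training objective only; the function names and the sample corpus are invented for the example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count, for each token, which tokens followed it in the training text."""
    follows = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        follows[prev][nxt] += 1
    return follows


def predict_next(follows, token):
    """Return the most frequent follower of `token`, or None if it was never followed."""
    candidates = follows.get(token)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Hypothetical ten-word training corpus.
corpus = "the cat sat on the mat and the cat slept".split()
model = train_bigram(corpus)

print(predict_next(model, "the"))    # cat
print(predict_next(model, "slept"))  # None
```

The predictor answers "cat" after "the" because that pairing was most frequent, not because it knows anything about cats. Scaling up the statistics makes the output far more fluent, but the objective is still to continue the sequence plausibly.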
This is why hallucinations—confident assertions of false information—are not occasional glitches but structural features. The system is optimized to sound right, not to be right. When it lacks relevant training data, it fills the gap with plausible-sounding fabrication, because plausibility is precisely what it was trained to maximize. The same mechanism that makes these tools so useful for drafting emails and brainstorming ideas makes them unreliable for tasks requiring factual precision.
Our take
None of this diminishes the genuine utility of large language models, but it should temper the grander claims made on their behalf. These are extraordinary pattern-matching engines, capable of accelerating certain kinds of intellectual labor in ways that would have seemed magical a decade ago. They are not, however, reasoning machines, and treating them as such invites disappointment at best and serious error at worst. The inability to count letters is not a trivial limitation to be engineered away in the next version; it is a symptom of an architecture that processes language as statistical texture rather than symbolic meaning. Understanding that distinction is the first step toward using these tools wisely—and the first defense against the hype that obscures what they actually do.




