Large language models process language the way a foreign tourist navigates Tokyo: by recognizing patterns and making educated guesses, not by understanding the underlying structure. This is why ChatGPT can write a sonnet about quantum mechanics but struggles to tell you how many r's appear in "strawberry."
The explanation lies in tokenization, the preprocessing step that converts text into digestible chunks before a model ever sees it. When you type a word, the system does not perceive individual letters. It perceives tokens: subword fragments that might split "strawberry" into "straw" and "berry," or "str," "aw," and "berry," depending on the tokenizer's training. The model has no direct view of the letters you are asking about; it receives a sequence of token IDs and can only know a token's spelling to the extent it learned that indirectly from training data.
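To make the idea concrete, here is a minimal sketch of a subword tokenizer. It uses greedy longest-match against a tiny made-up vocabulary (TOY_VOCAB and tokenize are hypothetical names for illustration), which is a simplification: production tokenizers learn their vocabularies from data and use more involved algorithms. The point is only that what reaches the model is a list of integer IDs, not letters.

```python
# Hypothetical toy vocabulary; real tokenizers learn far larger ones from data.
TOY_VOCAB = {"straw", "berry", "str", "aw", "the"}

def tokenize(text, vocab):
    """Greedy longest-match split; unknown text falls back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

tokens = tokenize("strawberry", TOY_VOCAB)
# Map each token to an integer ID (here: its position in the sorted vocabulary).
token_ids = [sorted(TOY_VOCAB).index(t) for t in tokens]

print(tokens)     # ['straw', 'berry']
print(token_ids)  # [3, 1]
```

In this sketch the model-facing input for "strawberry" is just two integers, and nothing in them says how many r's the word contains.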
The vocabulary problem
Modern language models work with vocabularies ranging from tens of thousands to a few hundred thousand tokens. These are not words in any human sense. They are statistical artifacts: chunks of text that appeared frequently enough in the tokenizer's training data to earn their own entry. Common words like "the" get single tokens. Rare words get split. Numbers are particularly chaotic: "1000" might be one token, while "1001" becomes two.
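The way frequency decides what becomes a token can be sketched with a stripped-down version of byte-pair encoding, one common family of tokenizer training algorithms. The corpus, the merge count, and the function names below are invented for illustration; real tokenizers train on enormous corpora and handle bytes, whitespace, and special cases that this sketch ignores.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(corpus_words, num_merges):
    """Repeatedly merge the most frequent adjacent pair of symbols."""
    words = Counter(tuple(w) for w in corpus_words)  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Highest count wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)

# Invented corpus: "1000" is common, "1001" is rare.
corpus = ["1000"] * 8 + ["1001"]
merges = learn_merges(corpus, num_merges=3)

print(encode("1000", merges))  # ['1000']
print(encode("1001", merges))  # ['100', '1']
```

Because "1000" dominates this toy corpus, it earns a token of its own, while the rarer "1001" is left as two pieces.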
This creates predictable blind spots. Ask a model to reverse a word and it must first infer the word's spelling from token boundaries it was never designed to expose. Ask it to count syllables and it must reconstruct phonetic structure from orthographic fragments. The model is not stupid; it is simply working with the wrong tools, like asking a chess grandmaster to win at checkers using chess rules.
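A small sketch shows why token boundaries get in the way of spelling tasks. It assumes the hypothetical two-token split of "strawberry" mentioned earlier; an actual tokenizer may split the word differently.

```python
# Hypothetical token split; a real tokenizer may choose different boundaries.
tokens = ["straw", "berry"]

# Reordering the units the model actually receives does not reverse the word.
token_level_reversal = "".join(reversed(tokens))

# Reversing or counting correctly requires access to the characters themselves.
word = "".join(tokens)
char_level_reversal = word[::-1]
r_count = word.count("r")

print(token_level_reversal)  # berrystraw
print(char_level_reversal)   # yrrebwarts
print(r_count)               # 3
```

Ordinary code gets the right answer trivially because it operates on characters. A model has to recover the same information from whatever it has learned about the spelling of each token.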
Why this matters beyond party tricks
The counting problem is a window into something deeper about how these systems succeed and fail. Language models are pattern-completion engines trained on vast text corpora. They learn that certain token sequences predict other token sequences. This makes them superb at tasks that humans find difficult—synthesizing information across documents, maintaining consistent tone across thousands of words, recognizing subtle rhetorical patterns—while failing at tasks that humans find trivial.
The mismatch helps explain why AI-generated code often contains off-by-one errors, why models hallucinate plausible-sounding statistics, and why they struggle with precise date arithmetic. These tasks require exact symbol manipulation, but next-token training rewards plausible continuations, not verified ones. Nothing in that objective, taken by itself, separates a response that sounds right from one that is right.
The engineering workarounds
The industry's response has been to bolt on external tools. Modern AI assistants route mathematical queries to calculators, code execution to interpreters, and factual lookups to search engines. This hybrid approach—neural intuition plus symbolic computation—works remarkably well in practice. But it also concedes that language models alone are not general reasoning engines. They are extraordinarily sophisticated autocomplete systems that happen to exhibit emergent capabilities no one fully predicted.
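The routing idea can be sketched in a few lines. Everything here is hypothetical: fake_model stands in for a call to a language model, and the rule "if the query parses as arithmetic, send it to a calculator" is far cruder than what production assistants do, where the model itself typically decides when to call a tool. The sketch only illustrates the division of labor.

```python
import ast
import operator

# Arithmetic operators the toy calculator is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def safe_eval(expr):
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def fake_model(query):
    """Placeholder for a language model call (hypothetical)."""
    return f"[model-generated text for: {query}]"

def route(query):
    """Send exact arithmetic to the calculator and everything else to the model."""
    try:
        return ("calculator", safe_eval(query))
    except (ValueError, SyntaxError):
        return ("model", fake_model(query))

print(route("1234 * 5678"))
# ('calculator', 7006652)
print(route("Write a sonnet about quantum mechanics")[0])
# model
```

The symbolic path returns an exact answer every time, while the neural path handles the open-ended request that no calculator could.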
Understanding this distinction matters for anyone deploying these tools professionally. A language model is an excellent first-draft generator, brainstorming partner, and pattern recognizer. It is a poor calculator, fact-checker, or source of ground truth. The technology's genuine strengths are impressive enough without pretending it possesses capabilities it structurally cannot have.
Our take
The counting problem is not a bug to be patched; it follows from how these models represent their input, as fundamental as the fact that a hammer cannot turn screws. The companies building these systems know this, which is why they quietly route your arithmetic questions to Python interpreters running in the background. The real question is whether users understand they are interacting with a centaur (part neural network, part traditional software) rather than the unified intelligence the marketing implies. Clarity about what these tools actually are would serve everyone better than the current fog of anthropomorphic branding.




