Ask ChatGPT to write a sonnet about quantum mechanics and it will deliver something passable in seconds. Ask it how many letters are in the word "strawberry" and there is a reasonable chance it will confidently answer "eight." This is not a bug to be patched. It is a window into the fundamental architecture of the technology reshaping white-collar work.
The disconnect baffles users because it violates our intuitions about intelligence. Surely a system that can explain the Treaty of Westphalia or generate working code can count letters? The assumption reveals how deeply we anthropomorphize these tools. We imagine a mind behind the curtain, reasoning through problems the way we do. There is no mind. There is a prediction engine of staggering sophistication, trained to guess the next word in a sequence based on patterns absorbed from billions of text examples.
Tokens are not letters
The key to understanding the counting problem lies in tokenization, the process by which language models break text into digestible chunks. When you type "strawberry," the model does not see ten individual letters. It sees tokens: subword units that might split the word into "straw" and "berry," or "str," "aw," and "berry," depending on the tokenizer's vocabulary. The model has no direct access to the character-level structure of words. It is like asking someone to count the bricks in a house while showing them only a blurry photograph of the neighborhood.
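To make the splitting concrete, here is a minimal sketch of a greedy longest-match tokenizer. It is a toy, not how any production tokenizer is implemented: the vocabulary, the token IDs, and the function names `tokenize` and `encode` are all invented for illustration, and real systems learn vocabularies of tens of thousands of subword units from data.

```python
# Hypothetical toy vocabulary. The entries and IDs are made up for illustration.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}


def tokenize(word, vocab):
    """Greedy longest-match split of `word` into subword pieces from `vocab`."""
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest remaining piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(word[i])
            i += 1
    return tokens


def encode(word, vocab):
    """Map each token to its integer ID (-1 marks a fallback character)."""
    return [vocab.get(token, -1) for token in tokenize(word, vocab)]


tokens = tokenize("strawberry", TOY_VOCAB)
ids = encode("strawberry", TOY_VOCAB)

print(tokens)             # ['straw', 'berry']
print(ids)                # [101, 102]
print(len("strawberry"))  # 10
```

In this sketch, what flows onward is the pair of IDs, not ten characters. Nothing in `[101, 102]` says how many letters, or which ones, each token contains; a model can only learn that indirectly from its training text.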
This design choice was not arbitrary. Processing text character by character would be computationally expensive, because every passage becomes a far longer sequence, and would make it harder to capture the meaningful patterns that make language work. The prefix "un-" carries semantic weight; treating it as two separate letters loses that signal. Tokenization was an engineering triumph that enabled models to understand context across long passages. The trade-off was a fundamental blindness to the granular structure of text.
The statistical mirage
When a language model answers a question, it is not retrieving facts from a database or executing logical operations. It is generating the sequence of tokens most statistically likely to follow the input, given everything it learned during training. If the training data contained many examples of people correctly stating that "seven" has five letters, the model might reproduce that pattern. If it contained errors, or if the question is phrased in an unusual way that does not match familiar patterns, the model may hallucinate with complete confidence.
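A toy version of that idea, offered only as a sketch: the "model" below is nothing more than a table of which word followed which in a three-line, made-up corpus. Real language models use neural networks over tokens rather than word counts, and the names `train` and `predict_next` are hypothetical, but the failure mode is the same in miniature: the output tracks whatever pattern dominated the training text.

```python
from collections import Counter, defaultdict


def train(lines):
    """Count, for every word, which words followed it in the training lines."""
    counts = defaultdict(Counter)
    for line in lines:
        words = line.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, prev):
    """Return the most frequent follower of `prev` and its relative frequency."""
    followers = counts.get(prev)
    if not followers:
        return None, 0.0
    word, n = followers.most_common(1)[0]
    return word, n / sum(followers.values())


# Made-up training text in which the correct statement is the majority.
mostly_right = [
    "seven has five letters",
    "seven has five letters",
    "seven has six letters",
]
# The same sentences with the error in the majority.
mostly_wrong = [
    "seven has six letters",
    "seven has six letters",
    "seven has five letters",
]

word, p = predict_next(train(mostly_right), "has")
print(word, f"{p:.2f}")  # five 0.67

word_bad, p_bad = predict_next(train(mostly_wrong), "has")
print(word_bad, f"{p_bad:.2f}")  # six 0.67
```

Both runs report the same relative frequency. The toy predictor is exactly as sure of "six" as it was of "five," because all it measures is how often a pattern appeared, not whether the pattern is true.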
This explains the unnerving quality of AI mistakes. A calculator that fails does so obviously: it crashes, or returns an error, or produces gibberish. A language model that fails does so fluently. It wraps its wrong answer in the same confident prose as its correct ones, because nothing in its architecture ties the confidence of its prose to the correctness of its content. The model has no internal fact-checker, no mechanism for distinguishing what it knows from what it is guessing. It is always guessing. Sometimes the guesses are extraordinarily good.
Why this matters beyond trivia
The counting problem is trivial. The underlying issue is not. Every limitation that prevents accurate letter-counting also affects tasks with real consequences: verifying citations, performing multi-step reasoning, maintaining consistency across long documents, distinguishing established fact from plausible-sounding fabrication. Users who understand the architecture can work with it — breaking complex tasks into smaller steps, verifying outputs independently, treating the model as a drafting assistant rather than an oracle. Users who do not understand it are flying blind.
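Verifying outputs independently can be as simple as recomputing a checkable claim with ordinary code. The sketch below assumes a model has claimed that "strawberry" has eight letters; the claimed value and the helper names `check_length_claim` and `count_letter` are hypothetical stand-ins, not part of any vendor's API.

```python
def check_length_claim(word, claimed_length):
    """Compare a claimed letter count with the count computed directly."""
    actual = sum(ch.isalpha() for ch in word)
    return actual == claimed_length, actual


def count_letter(word, letter):
    """Count occurrences of one letter, ignoring case."""
    return word.lower().count(letter.lower())


# Hypothetical model output: the claim that "strawberry" has eight letters.
claimed = 8
ok, actual = check_length_claim("strawberry", claimed)

print(ok, actual)                      # False 10
print(count_letter("strawberry", "r"))  # 3
```

The check is deterministic and takes no notice of how confident the claim sounded. The same habit carries over to citations and arithmetic: wherever a claim can be recomputed or looked up, it should be.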
The companies building these systems know this. Recent models have added workarounds — external tools for calculation, retrieval systems for fact-checking, chain-of-thought prompting to simulate reasoning. These are scaffolding around a core architecture that remains fundamentally unchanged: a next-token predictor of extraordinary power and inherent limitation.
Our take
The strawberry test has become a minor internet meme, a gotcha for deflating AI hype. But the lesson it teaches is more interesting than the joke. These systems are genuinely remarkable — not because they think, but because they have revealed how much of what we call thinking is pattern completion. The question is not whether AI will eventually count letters correctly; that is an engineering problem with engineering solutions. The question is whether we can hold two truths simultaneously: that these tools are transformatively useful, and that they are not what they appear to be. The companies selling AI have obvious incentives to blur that distinction. The rest of us cannot afford to.




