Ask a large language model to write a sonnet and it will produce something passable, occasionally lovely. Ask it how many r's appear in the word "strawberry" and many models will confidently tell you two. The correct answer is three. This is not a simple bug awaiting a patch in the next update. It is a window into how these systems actually work, and into why the gap between their apparent intelligence and their actual capabilities remains so persistently strange.
The explanation lies in tokenization, the preprocessing step that occurs before any language model sees a single word you type. Models do not read text letter by letter, or even word by word, as humans do. They consume language in chunks called tokens, whose boundaries are set by statistical patterns in the data the tokenizer was trained on. The word "strawberry" might be split into "straw" and "berry," or "str" and "awberry," depending on the tokenizer. The model receives only the numeric IDs of those chunks, never the individual letters inside them. It cannot reliably count what it cannot directly perceive.
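To make the point concrete, here is a minimal sketch of a tokenizer. The vocabulary `TOY_VOCAB` and the function `tokenize` are hypothetical, invented for illustration, and the greedy longest-match rule is a simplification: production tokenizers typically learn their vocabulary with methods such as byte-pair encoding and may split the word differently.

```python
# Hypothetical toy vocabulary. Real tokenizers learn tens of thousands of
# entries from data, and their splits of "strawberry" may differ.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into token IDs by greedy longest match (a simplification)."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


token_ids = tokenize("strawberry")
print(token_ids)                  # [0, 1]  <- this is all the model receives
print("strawberry".count("r"))    # 3      <- visible only at the character level
```

Nothing in `[0, 1]` records how many r's the original string contained. A model can only answer the question if it has learned, from other text, how each token happens to be spelled.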
The compression trade-off
Tokenization exists because processing text character by character would be computationally ruinous. The cost of a transformer's attention mechanism grows roughly with the square of the sequence length, so every extra position is expensive. English words average roughly four to five characters, and a common word is often a single token, so working at the token level rather than the character level shortens the sequence a model must handle several times over. This compression lets the transformer architecture maintain attention across longer passages, so that it can still draw on what was said three paragraphs ago when generating the current sentence. The trade-off is that the model's fundamental unit of perception is neither the letter nor the word but something in between, optimized for statistical efficiency rather than human intuition.
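A rough back-of-the-envelope calculation shows the size of the saving. Standard self-attention compares every position in a sequence with every other position, so its work grows with the square of the length. The figure of four characters per token below is an assumed rule of thumb for English, not a measured property of any particular tokenizer, and `attention_pairs` is a hypothetical helper name.

```python
def attention_pairs(sequence_length):
    """Query-key pairs in full self-attention: every position attends to every position."""
    return sequence_length * sequence_length


# Sample passage: a 44-character sentence (including trailing space), repeated.
text = "the quick brown fox jumps over the lazy dog " * 100
char_len = len(text)

# ASSUMPTION: about 4 characters per token, a rough rule of thumb for English.
ASSUMED_CHARS_PER_TOKEN = 4
token_len = char_len // ASSUMED_CHARS_PER_TOKEN

ratio = attention_pairs(char_len) / attention_pairs(token_len)

print(char_len)   # 4400 positions at the character level
print(token_len)  # 1100 positions at the token level
print(ratio)      # 16.0 times more attention pairs for characters
```

Under that assumption, a fourfold reduction in sequence length yields a sixteenfold reduction in attention work, which is why the compression matters so much for long passages.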
This explains a constellation of peculiar behaviors. Models struggle with anagrams, palindromes, and spelling tasks because these require letter-level manipulation. They falter at precise arithmetic in part because numbers are tokenized inconsistently: "1000" might be one token while "1,000" becomes three. They cannot reliably count syllables in their own output. The very architecture that enables fluid, contextual language generation makes these seemingly simple tasks genuinely difficult.
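The number example can be reproduced with a small sketch. The vocabulary `TOY_NUMBER_VOCAB` is hypothetical and chosen only to mirror the case described above; how a real tokenizer splits digits depends entirely on its learned vocabulary.

```python
import re

# Hypothetical vocabulary, picked to mirror the "1000" vs "1,000" example.
TOY_NUMBER_VOCAB = ["1000", "000", "1", ","] + list("023456789")

# Longest pieces come first so the regex alternation prefers them.
_PATTERN = re.compile(
    "|".join(re.escape(p) for p in sorted(TOY_NUMBER_VOCAB, key=len, reverse=True))
)


def split_number(text):
    """Split a numeric string into vocabulary pieces, longest match first."""
    pieces = _PATTERN.findall(text)
    if "".join(pieces) != text:
        raise ValueError(f"cannot tokenize {text!r} with this vocabulary")
    return pieces


print(split_number("1000"))    # ['1000']
print(split_number("1,000"))   # ['1', ',', '000']
```

The same quantity arrives as one unit in the first case and three in the second, so the model has no stable, digit-aligned representation to compute on.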
What fluency actually means
The tokenization problem reveals something deeper about what language models are doing when they appear to understand. They are not comprehending text in any human sense; they are predicting token sequences based on statistical relationships learned from vast corpora. When a model writes a coherent paragraph about quantum mechanics, it is not reasoning about physics—it is generating tokens that statistically follow other tokens in patterns consistent with its training data. This can produce remarkably useful output, but it is a fundamentally different process from human cognition.
This distinction matters because it shapes what these tools can and cannot reliably do. Tasks that align with token-level pattern matching—summarization, translation, style transfer, code generation within familiar patterns—play to the architecture's strengths. Tasks requiring precise symbolic manipulation, consistent logical chains, or operations on sub-token elements remain fragile regardless of scale. More parameters and more training data improve many capabilities but do not resolve limitations baked into the representational scheme itself.
Our take
The strawberry problem is not a trivial gotcha; it is a diagnostic. A system that cannot count letters but can write plausible legal briefs is telling us something important about the nature of its intelligence—and about the difference between fluency and understanding. The most sophisticated users of these tools have learned to work with this grain rather than against it, treating language models as powerful pattern engines rather than digital minds. The persistent confusion between these categories, in both directions, remains the central obstacle to using AI well.




