Large language models can draft legal contracts, explain quantum mechanics, and write passable poetry in the style of Yeats. Ask one to count the number of r's in "strawberry" and watch it stumble. This is not a bug awaiting a patch. It is a window into how these systems fundamentally work—and why the gap between their apparent intelligence and their actual capabilities remains so persistently strange.

The counting problem emerges from tokenization, the process by which language models break text into digestible pieces before processing. When you type a word, the model does not see individual letters. It sees tokens: chunks that might be whole words, word fragments, or in some cases single characters, depending on frequency patterns in training data. The word "strawberry" might arrive as two or three tokens, none of which maps cleanly onto the letter boundaries a human would recognize. The model never directly sees the letters you are asking it to count.
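A minimal sketch makes the point concrete. The three-piece split of "strawberry" and the ID numbers below are invented for illustration; a real tokenizer would produce its own pieces and its own numbers.

```python
# Hypothetical split and IDs, invented for illustration only.
# A real tokenizer learns its vocabulary from data and assigns different numbers.
TOY_VOCAB = {"str": 1042, "aw": 675, "berry": 15717}


def encode(pieces):
    """Map each text piece to its integer ID, the last step of tokenization."""
    return [TOY_VOCAB[piece] for piece in pieces]


pieces = ["str", "aw", "berry"]
token_ids = encode(pieces)

# The model receives only this list of integers.
print(token_ids)  # [1042, 675, 15717]

# Counting letters requires the characters themselves, which the IDs do not expose.
print("".join(pieces).count("r"))  # 3
```

Nothing in the three integers says that the first piece holds one r and the last piece holds two. Any letter-level knowledge has to be learned indirectly, as an association attached to each ID.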

The compression trade-off

Tokenization exists for good reason. Processing text character by character would make every input several times longer, and the cost of running a model grows with the length of its input. By chunking common patterns such as "ing," "tion," and "the," models achieve large efficiency gains. A typical model uses a vocabulary of tens of thousands to a few hundred thousand tokens, allowing it to represent most English text with far fewer computational steps than a character-level approach would require. This compression is a large part of what makes real-time conversation practical. But compression loses information. The internal representation that allows a model to understand that "running" relates to "run" is the same representation that obscures how many n's the word contains.
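The size of the saving can be sketched with a toy tokenizer. The vocabulary below is hypothetical, and the greedy longest-match rule is a simple stand-in for the learned procedures that production tokenizers use, so the numbers are illustrative rather than representative.

```python
# Hypothetical vocabulary of common chunks. Anything not listed here
# falls back to single characters. Real vocabularies are learned and far larger.
CHUNKS = {"the", "ing", "tion", "run"}


def tokenize(text, chunks=CHUNKS, max_len=4):
    """Greedy longest-match split with a single-character fallback."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in chunks:
                tokens.append(piece)
                i += size
                break
    return tokens


text = "the running nation"
tokens = tokenize(text)

print(tokens)                  # ['the', ' ', 'run', 'n', 'ing', ' ', 'n', 'a', 'tion']
print(len(text), len(tokens))  # 18 9
```

Even this crude vocabulary halves the sequence length. It also shows the cost: the three n's in "running" end up spread across the pieces "run", "n", and "ing", so no single unit the model handles carries the count.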

This trade-off ripples through surprising domains. Models struggle with anagrams, rhyme schemes, and precise syllable counts. They falter at tasks requiring exact character manipulation—generating text of precisely specified length, for instance, or reliably detecting typos. These are not failures of reasoning but failures of perception. The model is working with a lossy encoding of the text you provided.

Why this matters beyond party tricks

The counting limitation illustrates a broader principle: language models are not general-purpose reasoning engines operating on raw reality. They are pattern-completion systems operating on transformed representations of text. Their impressive capabilities emerge from those representations; their limitations are baked into the same foundation. When a model appears to "understand" a complex argument, it is recognizing and extending patterns in token sequences. When it fails at counting letters, it is revealing that those patterns do not preserve character-level information.

This has practical implications. Tasks requiring precise symbolic manipulation (certain mathematical operations, code debugging involving exact character positions, cryptographic applications) remain areas where language models require external tools or verification layers. The models can learn to call calculators or code interpreters, but they do not reliably internalize the counting itself, because the tokenized input never exposes the characters directly.
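What such an external tool or verification layer might look like can be sketched in a few lines. The function names `count_letter` and `verify_claim` are hypothetical, chosen for illustration, and do not come from any particular tool-calling framework.

```python
def count_letter(word: str, letter: str) -> int:
    """Hypothetical tool: count occurrences of one letter, ignoring case."""
    if len(letter) != 1:
        raise ValueError("letter must be a single character")
    return word.lower().count(letter.lower())


def verify_claim(word: str, letter: str, claimed: int) -> bool:
    """Hypothetical verification layer: check a model's claimed count exactly."""
    return count_letter(word, letter) == claimed


print(count_letter("strawberry", "r"))     # 3
print(verify_claim("strawberry", "r", 2))  # False
```

Ordinary code operates on the characters themselves, so the count is exact. The model's job shrinks to recognizing that the question calls for the tool and passing along the right arguments.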

Our take

The strawberry test has become a minor internet meme, but it deserves more serious attention. It demonstrates that apparent intelligence and actual capability can diverge in counterintuitive ways. A system that can explain the French Revolution in nuanced detail cannot reliably tell you how many letters are in "revolution." This is not a contradiction—it is a feature of how statistical learning over compressed representations differs from the symbolic manipulation humans perform effortlessly. Understanding this distinction is essential for anyone hoping to deploy these systems wisely, and for anyone trying to predict which tasks will yield to AI pressure and which will resist it longer than expected.