Ask any large language model how many letters are in the word "strawberry" and watch it stumble. The answer is ten, but models routinely say nine, sometimes eight. This is not a bug in the traditional sense. It is a window into the alien way these systems perceive language — and understanding it changes how you should use them.
The confusion stems from tokenization, the preprocessing step that converts human text into the numerical sequences neural networks actually process. Before a model sees your prompt, a tokenizer chops it into chunks called tokens. These chunks are not letters, not words, not syllables — they are statistical artifacts derived from training data, optimized for compression efficiency rather than human intuition.
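To see how chunks can fall out of statistics rather than linguistics, here is a minimal sketch of byte-pair-style merge learning, one common family of tokenizer training. The toy corpus and the names learn_merges, apply_merge, and tokenize are illustrative assumptions, not the procedure of any particular model.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one fused symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a toy list of words."""
    # Start with every word split into single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the corpus with the winning pair fused.
        rewritten = Counter()
        for symbols, freq in corpus.items():
            rewritten[apply_merge(symbols, best)] += freq
        corpus = rewritten
    return merges


def tokenize(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus; real tokenizers train on vastly more text.
toy_corpus = ["berry"] * 5 + ["straw"] * 3 + ["strawberry"] * 2 + ["blueberry"] * 2
toy_merges = learn_merges(toy_corpus, 8)
print(tokenize("strawberry", toy_merges))
```

Which chunks emerge depends entirely on what is frequent in the corpus, which is why the resulting boundaries need not line up with letters, syllables, or words.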
The invisible surgery
When you type "strawberry," the model might receive it as "straw" + "berry" or "str" + "aw" + "berry" depending on its tokenizer. The model never sees the individual letters s-t-r-a-w-b-e-r-r-y lined up for counting. It sees abstract numerical IDs representing these chunks, with no inherent awareness that "strawberry" contains three r's or that "berry" has five letters.
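A small sketch makes the point concrete. The vocabulary, the IDs, and the greedy longest-match strategy below are all hypothetical simplifications; production tokenizers typically replay learned merge rules over much larger vocabularies.

```python
# Hypothetical vocabulary mapping text chunks to integer IDs.
TOY_VOCAB = {"straw": 1001, "berry": 1002, "str": 1003, "aw": 1004}


def encode(text, vocab):
    """Greedy longest-match encoding of `text` into integer IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining chunk first, then shrink.
        for end in range(len(text), i, -1):
            chunk = text[i:end]
            if chunk in vocab:
                ids.append(vocab[chunk])
                i = end
                break
        else:
            # Toy fallback: use the character's code point as its ID.
            # (May collide with vocabulary IDs; acceptable for a sketch only.)
            ids.append(ord(text[i]))
            i += 1
    return ids


print(encode("strawberry", TOY_VOCAB))  # [1001, 1002]
```

The list of integers is all the model receives. Nothing in 1001 or 1002 records how many letters the original chunks contained, or which ones.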
This explains a constellation of quirks. Models struggle with anagrams, palindrome detection, and rhyme identification. They cannot reliably tell you if two words have the same number of characters. They hallucinate when asked to reverse strings letter by letter. The failure mode is consistent: anything requiring character-level manipulation runs headlong into the token boundary problem.
Why this architecture exists
Tokenization is not laziness; it is necessity. Processing text character by character would make context windows far shorter in practice and training far more expensive. A typical model might handle a context of 128,000 tokens. At the common rule of thumb of about four characters per English token, that covers roughly 500,000 characters; if each character were a token, the same window would hold only 128,000 characters, about a quarter as much text. By chunking common sequences together, tokenizers achieve compression ratios that make modern context lengths feasible.
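The arithmetic behind this tradeoff is easy to check. The sketch below assumes the common rule of thumb of roughly four characters per English token; the real ratio varies by tokenizer and language, so treat the constant as an assumption rather than a measured value.

```python
CONTEXT_TOKENS = 128_000   # context size used as the example in the text
CHARS_PER_TOKEN = 4        # assumed rule of thumb for English, not a measurement


def chars_covered(context_tokens, chars_per_token):
    """Approximate number of characters that fit in a context window."""
    return context_tokens * chars_per_token


subword_chars = chars_covered(CONTEXT_TOKENS, CHARS_PER_TOKEN)
char_level_chars = chars_covered(CONTEXT_TOKENS, 1)  # one character per token

print(subword_chars)                     # 512000
print(char_level_chars)                  # 128000
print(subword_chars / char_level_chars)  # 4.0
```

Under that assumption, a character-level model with the same token budget would see about a quarter of the text.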
The tradeoff is profound. Models gain the ability to reason across entire documents but lose fine-grained access to the orthographic structure of individual words. They understand language at the level of meaning-bearing chunks, not at the level of ink on paper.
Practical implications
This architecture shapes what tasks you should and should not delegate to language models. They excel at summarization, translation, code generation, and reasoning about concepts, all tasks where token-level understanding suffices. They are unreliable at crossword puzzles, precise character manipulation, and any task requiring them to "see" text the way a human eye does.
The gap is not closing easily. Some researchers have experimented with character-level models, but the computational costs remain steep. Others have tried hybrid approaches, teaching models to call external tools for string manipulation. The fundamental tension between compression efficiency and orthographic awareness persists.
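The tool-calling approach works because character-level operations are trivial for ordinary code. Here is a minimal sketch of what such helpers might look like; the function names, the TOOLS registry, and call_tool are hypothetical and do not correspond to any particular model's tool interface.

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter."""
    return word.lower().count(letter.lower())


def reverse_text(text):
    """Reverse a string character by character."""
    return text[::-1]


def is_palindrome(text):
    """Check for a palindrome, ignoring case and non-alphanumeric characters."""
    cleaned = [c.lower() for c in text if c.isalnum()]
    return cleaned == cleaned[::-1]


def are_anagrams(a, b):
    """Check whether two words use exactly the same letters."""
    return sorted(a.lower()) == sorted(b.lower())


# Hypothetical registry a model-facing layer could dispatch into.
TOOLS = {
    "count_letter": count_letter,
    "reverse_text": reverse_text,
    "is_palindrome": is_palindrome,
    "are_anagrams": are_anagrams,
    "length": len,
}


def call_tool(name, *args):
    """Look up a tool by name and run it on the given arguments."""
    return TOOLS[name](*args)


print(call_tool("length", "strawberry"))             # 10
print(call_tool("count_letter", "strawberry", "r"))  # 3
print(call_tool("reverse_text", "strawberry"))       # yrrebwarts
```

The model only has to decide which tool to call and with what arguments; the code, which does see individual characters, does the counting.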
Our take
The strawberry test has become a parlor trick for exposing AI limitations, but its real value is pedagogical. It demonstrates that large language models are not general intelligences with occasional blind spots — they are specialized systems with a particular, inhuman relationship to text. Understanding tokenization does not make these tools less useful; it makes you better at using them. The machines read differently than we do. Knowing how differently is the first step toward productive collaboration.




