Ask an AI chatbot how many r's appear in the word "strawberry" and you may well receive a confident, incorrect answer. This is not a bug that engineers forgot to fix. It is a window into one of the deepest architectural choices underlying large language models: these systems do not see letters, words, or numbers the way humans do. They see tokens, statistically derived chunks of text that bear no necessary relationship to the units humans consider meaningful.

The strawberry problem has become a minor internet meme, but its implications are profound. When you type a word into an AI system, that word is first converted into tokens before any processing occurs. The word "strawberry" might become two or three tokens depending on the system's vocabulary. The model never sees the individual letters s-t-r-a-w-b-e-r-r-y as discrete objects to be counted. It sees something more like "straw" + "berry" — and from those chunks, it must infer what you are asking about.
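To make the split concrete, here is a minimal sketch in Python. It assumes a hypothetical ten-entry vocabulary (`TOY_VOCAB`) and uses a greedy longest-match rule, which is a simplification rather than how any production tokenizer works. The function name `tokenize` is likewise illustrative.

```python
# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of entries.
TOY_VOCAB = {
    "straw": 0, "berry": 1,
    "s": 2, "t": 3, "r": 4, "a": 5, "w": 6, "b": 7, "e": 8, "y": 9,
}


def tokenize(text, vocab):
    """Split text by greedy longest match (a simplification of real tokenizers)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i]!r}")
    return tokens


pieces = tokenize("strawberry", TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]

print(pieces)                        # the chunks the text was split into
print(ids)                           # the integer IDs a model would receive
print("strawberry".count("r"))       # counting letters needs the raw string
```

Under these assumptions the model-facing input is a short list of integers. Nothing in those integers records how many times the letter r occurs; that information lives only in the original string.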

The vocabulary that shapes thought

Tokenization emerged as a practical solution to a genuine problem. Human languages contain effectively infinite possible words when you account for conjugations, compounds, misspellings, and neologisms. Early natural language systems tried to maintain dictionaries of complete words, but this proved unwieldy. The breakthrough came from treating text as sequences of subword units — common fragments that could be combined to represent any string.

The dominant approach, byte-pair encoding and its variants, builds a vocabulary by repeatedly merging the most frequent pair of adjacent symbols in a training corpus, starting from individual characters or bytes and treating each merged pair as a new symbol. The result is a vocabulary where common words like "the" get single tokens while rare words are broken into pieces. "Tokenization" itself might become "token" + "ization." The number "1847" might become "18" + "47" or "1" + "847" depending on what patterns appeared most often during training.
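The merge procedure described above can be sketched in a few lines of Python. This is a toy version under stated assumptions: it works on a small list of words rather than a real corpus, starts from characters rather than bytes, and ignores details such as word-boundary markers. The name `train_bpe` and the sample words are illustrative only.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to num_merges merge rules from a list of words (toy BPE)."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus, fusing each occurrence of the winning pair.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


# Illustrative sample: "the" is frequent, "rarely" appears once.
sample = ["the"] * 5 + ["then"] * 2 + ["this"] + ["rarely"]
merges, corpus = train_bpe(sample, num_merges=2)
for symbols, freq in corpus.items():
    print(freq, symbols)
```

With this sample, two merges are enough for the frequent word to collapse into one symbol while the rare word stays split into single characters, which is the same frequency-driven asymmetry the paragraph describes.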

This is elegant engineering, but it creates a fundamental mismatch between how humans conceptualize text and how models process it. When a human sees a number, they understand it as a quantity with mathematical properties. When a model sees a number, it sees a token or sequence of tokens that happened to co-occur with certain contexts during training. The model has no inherent understanding that "47" is larger than "18" — it must learn this statistically from patterns in text.

Why this matters beyond party tricks

The tokenization layer helps explain many of the behaviors that make AI systems feel simultaneously brilliant and obtuse. They can write sophisticated essays about quantum mechanics but stumble on basic arithmetic. They can analyze complex legal documents but fail to reliably count items in a list. They can generate poetry in the style of any author, yet when they tell you whether "receive" or "recieve" is correctly spelled, they are drawing on training statistics rather than inspecting the letters.

For professionals using these tools, understanding tokenization is not academic. It helps explain why AI systems perform inconsistently on tasks involving precise character manipulation, why they struggle with certain programming operations, and why they sometimes produce mathematically nonsensical outputs with complete confidence. The model is not being lazy or careless. It does not directly see the units you are asking about.

This may also help explain why multimodal models, systems that process images alongside text, sometimes perform better on visual counting tasks. When shown an image of letters, the visual processing pathway can potentially count discrete objects in a way the text pathway cannot.

Our take

The strawberry problem is often presented as an amusing failure, evidence that AI is not as smart as the hype suggests. The reality is more interesting. Tokenization is not a flaw to be patched but a foundational trade-off that enabled these systems to exist at all. By sacrificing character-level awareness, models gained the ability to process language at scale and develop the emergent capabilities that make them useful. The lesson is not that AI is stupid but that intelligence can be profoundly alien — capable of reasoning that humans find difficult while blind to things any child can see. Understanding these architectural realities is essential for anyone hoping to use these tools effectively rather than being perpetually surprised by their limitations.