Ask a large language model how many times the letter 'r' appears in 'strawberry' and there is a good chance it will confidently announce two. The correct answer is three. This isn't a bug that engineers forgot to fix; it's a window into the fundamental architecture of systems that can write sonnets but cannot reliably count the letters in a word.
The disconnect stems from tokenization, the process by which language models break text into digestible chunks before processing. These chunks, called tokens, are neither words nor characters but something stranger: statistical fragments derived from patterns in training data. The word 'strawberry' might become 'straw' + 'berry' or 'str' + 'awberry' depending on the model's vocabulary. Once text is tokenized, the model receives only a sequence of token IDs, and the original letters are no longer directly visible to it. It's like asking someone to count the bricks in a house while showing them only the floor plan.
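To make the floor-plan analogy concrete, here is a minimal sketch of a tokenizer. It assumes a tiny invented vocabulary (TOY_VOCAB and its IDs are hypothetical) and a simple greedy longest-match rule, which is not how any particular production tokenizer works. The point is only that the model's input ends up as a short list of IDs, while the letter count lives in the original string.

```python
# Toy vocabulary: entries and IDs are invented for illustration only.
TOY_VOCAB = {"straw": 0, "berry": 1, "str": 2, "aw": 3,
             "s": 4, "t": 5, "r": 6, "a": 7, "w": 8, "b": 9, "e": 10, "y": 11}

def tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocab.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return pieces

word = "strawberry"
pieces = tokenize(word, TOY_VOCAB)
ids = [TOY_VOCAB[p] for p in pieces]

print(pieces)           # ['straw', 'berry']
print(ids)              # [0, 1]  <- all the model would receive
print(word.count("r"))  # 3      <- trivial on the raw string
```

Nothing in the list [0, 1] says how many times 'r' occurs. A model can only answer correctly if it has learned, from training data, facts about the spelling hidden behind each ID.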
The vocabulary that shapes thought
Every major language model maintains a fixed vocabulary of tokens, typically ranging from tens of thousands to a few hundred thousand entries. Common words like 'the' and 'and' get their own tokens. Rarer words get split into pieces. The word 'tokenization' itself might become 'token' + 'ization' or 'tok' + 'en' + 'ization' depending on the system.
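One common way such vocabularies are built is byte-pair encoding (BPE), which repeatedly merges the most frequent adjacent pair of symbols. The article does not say which method any given model uses, so treat the following as a sketch of the general idea. The corpus and its word counts are invented, and real tokenizers add many details omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the adjacent symbol pair with the highest frequency-weighted count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical corpus: word -> number of occurrences.
corpus = {"the": 50, "that": 20, "token": 5}
words = {tuple(w): f for w, f in corpus.items()}  # start from single characters

merges = []
for _ in range(2):
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)       # [('t', 'h'), ('th', 'e')]
print(list(words))  # [('the',), ('th', 'a', 't'), ('t', 'o', 'k', 'e', 'n')]
```

After only two merges the frequent word 'the' has become a single token, while the rare word 'token' is still spelled out piece by piece. Run for tens of thousands of merges over a large corpus, the same process yields the kind of vocabulary described above.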
This creates fascinating asymmetries. Models handle common English with remarkable fluency because frequent words map cleanly to single tokens. But proper nouns, technical jargon, and non-English text often fragment into multiple pieces, forcing the model to work harder to maintain coherence. A Welsh place name might consume a dozen tokens while conveying less semantic information than the single token 'London.'
The token vocabulary also explains why models struggle with tasks that seem trivially easy to humans. Reversing a word requires knowing its constituent letters, but the model only sees tokens. Counting syllables demands phonetic awareness that tokenization obscures. Even basic arithmetic can falter because numbers tokenize inconsistently: '100' might be one token while '1000' becomes '100' + '0'.
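Both effects are easy to reproduce with a toy tokenizer. The sketch below assumes a small invented vocabulary (TOY_VOCAB is hypothetical) and a greedy longest-match rule. Real vocabularies differ, but the mismatch between token-level and character-level operations is the same in kind.

```python
def tokenize(text, vocab):
    """Split text into vocabulary pieces using greedy longest-match."""
    pieces, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return pieces

# Invented vocabulary: note that '100' is an entry but '1000' is not.
TOY_VOCAB = {"100", "0", "1", "straw", "berry"}

print(tokenize("100", TOY_VOCAB))   # ['100']
print(tokenize("1000", TOY_VOCAB))  # ['100', '0']

# Reversing at the token level is not the same as reversing the letters.
tokens = tokenize("strawberry", TOY_VOCAB)
token_reversed = "".join(reversed(tokens))
char_reversed = "strawberry"[::-1]

print(token_reversed)  # berrystraw
print(char_reversed)   # yrrebwarts
```

The two numbers differ by a factor of ten, yet their token sequences share no obvious structure that encodes place value. Likewise, shuffling the units the model actually sees produces 'berrystraw', not the letter-by-letter reversal a human would write.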
What tokens reveal about intelligence
The tokenization architecture reflects a deeper truth about how these systems achieve their capabilities. Language models don't understand text the way humans do; they predict the next token from statistical relationships between tokens observed during training. When a model writes a coherent paragraph, it's not composing sentences so much as generating sequences of tokens that have high probability given the preceding context.
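A drastically simplified illustration of "high probability given the preceding context" is a bigram model, which looks only at the single previous token. Real language models use neural networks over much longer contexts, so this is a sketch of the principle rather than of any actual system. The training text is invented and assumed to be already tokenized into words.

```python
from collections import Counter, defaultdict

# Hypothetical training text, already split into tokens.
corpus = "the cat sat on the mat . the cat ate .".split()

# Count how often each token follows each other token.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_probs(prev):
    """Estimate P(next token | previous token) from the raw counts."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

def generate(start, steps):
    """Extend a sequence by always taking the most probable next token."""
    out = [start]
    for _ in range(steps):
        probs = next_token_probs(out[-1])
        if not probs:  # token never seen with a successor
            break
        out.append(max(probs, key=probs.get))
    return out

print(next_token_probs("the"))  # 'cat' gets 2/3, 'mat' gets 1/3
print(generate("on", 2))        # ['on', 'the', 'cat']
```

The output 'on the cat' is locally plausible because each step follows the counts, yet no step involves checking whether the phrase means anything. Scaled up enormously, the same prediction-driven process produces fluent text.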
This explains the peculiar texture of AI-generated text: grammatically impeccable, semantically plausible, yet sometimes missing the thread of actual reasoning. The model isn't thinking through an argument; it's producing tokens that look like thinking. The distinction matters less for creative writing than for tasks requiring genuine logical inference, which is why the same system that drafts elegant prose can stumble over simple word puzzles.
Our take
The strawberry problem isn't a failure of artificial intelligence so much as a reminder that these systems are genuinely alien. They process language through a lens that makes certain tasks effortless and others inexplicably hard. Understanding tokenization won't make you trust AI less, but it should make you trust it differently—appreciating what it actually does rather than what it appears to do. The models are remarkable, just not in the ways their confident prose suggests.




