Ask any frontier AI model to write a persuasive essay on climate policy, and it will produce something publishable in minutes. Ask it to count the letters in 'strawberry', and there is a reasonable chance it will confidently answer eight rather than ten. This is not a glitch awaiting a patch. It is a window into the architecture that makes these systems simultaneously brilliant and baffling.
The disconnect stems from a design choice made years ago that now shapes every interaction you have with ChatGPT, Claude, or Gemini: these models do not see letters. They see tokens.
The tokenisation layer
Before any large language model processes your prompt, a tokeniser breaks your text into chunks — not characters, not words, but something in between. The word 'strawberry' might become two tokens: 'straw' and 'berry'. The model never encounters the individual letters s-t-r-a-w-b-e-r-r-y as discrete units. It works with these pre-chunked pieces the way a chef works with pre-cut vegetables: efficiently, but without direct access to the raw ingredients.
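To make the chunking concrete, here is a minimal sketch in Python. It assumes a made-up ten-entry vocabulary (TOY_VOCAB and its ID numbers are hypothetical) and uses a simple greedy longest-match rule, which is a simplification of what production tokenisers do. The point is only to show what the model receives: a short list of integers, with the letters hidden inside them.

```python
# Hypothetical toy vocabulary. Real tokenisers hold tens of thousands of
# entries with different IDs; the 'straw' + 'berry' split follows the
# example in the text and is not taken from any real model.
TOY_VOCAB = {
    "straw": 101, "berry": 102,
    "s": 1, "t": 2, "r": 3, "a": 4, "w": 5, "b": 6, "e": 7, "y": 8,
}

def encode(text, vocab):
    """Greedy longest-match: at each position, take the longest entry that fits."""
    ids = []
    pos = 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                ids.append(vocab[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts at this position.
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids

def decode(ids, vocab):
    """Map IDs back to text by reversing the vocabulary."""
    id_to_piece = {token_id: piece for piece, token_id in vocab.items()}
    return "".join(id_to_piece[token_id] for token_id in ids)

ids = encode("strawberry", TOY_VOCAB)
print(ids)                                 # [101, 102]: all the model receives
print(decode(ids, TOY_VOCAB).count("r"))   # 3: trivial once the letters are visible
```

Counting characters takes one line once the text is decoded back into letters. The model, by contrast, works from the two integers alone.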
This tokenisation is not arbitrary. It grew out of compression research: byte-pair encoding, originally a data-compression technique, repeatedly merges the most frequent pair of adjacent symbols into a single new unit. The result is a vocabulary that typically runs from tens of thousands to a few hundred thousand tokens and can represent any text in far fewer units than character-by-character processing would need. Training becomes faster. Context windows stretch further. The economics work.
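The merging idea behind byte-pair encoding can be sketched in a few lines of Python. This is a toy version under stated assumptions: it works on characters rather than bytes, splits the corpus on whitespace, and uses illustrative function names (pair_counts, merge_pair, learn_merges) that do not come from any particular library. Production tokenisers add many refinements on top.

```python
from collections import Counter

def pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            counts[(left, right)] += freq
    return counts

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into one symbol, in every word."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_merges(corpus, num_merges):
    """Start from single characters and apply the most frequent merge each round."""
    words = {tuple(word): count for word, count in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        counts = pair_counts(words)
        if not counts:
            break  # every word is already a single symbol
        best = max(counts, key=counts.get)
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

# Illustrative corpus: frequent sequences are fused first.
merges, words = learn_merges("straw straw berry berry berry strawberry", 6)
print(merges)
print(list(words))
```

Each round shortens the most common sequences, which is why frequent words end up as one or two tokens while rare ones are split into many pieces.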
But the trade-off is that the model's fundamental unit of meaning is divorced from the visual and phonetic building blocks humans use when they learn language. A child sounds out 'cat' letter by letter. GPT-4 encounters it as a single indivisible token.
Why this matters beyond party tricks
The letter-counting problem is amusing, but the tokenisation layer creates subtler issues that affect real applications. Rhyme detection becomes unreliable because the model has no direct view of the letters and sounds that end a word. Anagram solving is error-prone for the same reason. Code generation occasionally produces syntax errors because the model cannot directly verify character-level constraints. Spelling in languages with complex morphology, such as Finnish, Turkish and Arabic, degrades unpredictably.
More consequentially, tokenisation varies between models and even between versions of the same model. A prompt optimised for one system may tokenise differently on another, producing inconsistent results. Developers building applications atop these models must treat the tokenisation layer as a hidden variable that shapes behaviour in ways the model itself cannot explain.
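A small Python sketch shows how the same text can split differently under two vocabularies. Both vocabularies here (VOCAB_A and VOCAB_B) are invented for illustration and do not correspond to any real model, and the greedy longest-match rule with a single-character fallback is an assumption made to keep the example short.

```python
def split(text, vocab):
    """Split text into the longest matching vocabulary entries, left to right."""
    pieces, pos = [], 0
    while pos < len(text):
        # Longest entry starting at pos; fall back to a single character.
        end = next(
            (e for e in range(len(text), pos, -1) if text[pos:e] in vocab),
            pos + 1,
        )
        pieces.append(text[pos:end])
        pos = end
    return pieces

# Two hypothetical vocabularies standing in for two different models.
VOCAB_A = {"straw", "berry"}
VOCAB_B = {"str", "aw", "berry"}

print(split("strawberry", VOCAB_A))  # ['straw', 'berry']
print(split("strawberry", VOCAB_B))  # ['str', 'aw', 'berry']
```

The text is identical in both cases, but the boundaries and the token count differ, so anything tuned to one split may behave differently under the other.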
Our take
The letter-counting failure is not evidence that AI is 'dumb' or that the technology is overhyped. It is a reminder that these systems are not digital humans but a genuinely alien form of intelligence, one that processes language through a lens we designed for efficiency rather than fidelity to human cognition. Understanding the tokenisation layer will not make you better at prompting, but it will make you a more honest observer of what these tools can and cannot do — which, in an era of breathless AI discourse, is worth more than another benchmark score.




