Ask a large language model how many times the letter 'r' appears in 'strawberry' and there is a good chance it will confidently answer two. The correct answer is three. This is not a bug to be patched in the next release; it is a window into how these systems fundamentally process language, and why they will continue to surprise us in ways both impressive and absurd.

The counting failure reveals something essential: language models do not see text the way humans do. They see tokens.

The tokenisation trap

Before a single neural network weight is consulted, every piece of text fed into a language model is chopped into tokens — chunks that might be whole words, word fragments, or individual characters, depending on the tokeniser's vocabulary. The word 'strawberry' might become 'straw' + 'berry' or 'str' + 'aw' + 'berry' or something else entirely. The model never encounters the raw sequence of letters; it encounters a sequence of abstract symbols that bear only an indirect relationship to spelling.
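To make the splitting concrete, here is a minimal sketch of a tokeniser. It is a toy under stated assumptions: the four-entry vocabulary `TOY_VOCAB` and the greedy longest-match rule are invented for illustration, whereas production tokenisers learn vocabularies of tens of thousands of entries from data and typically use other algorithms such as byte-pair encoding.

```python
# Hypothetical toy vocabulary, invented for illustration only.
TOY_VOCAB = {"straw": 0, "berry": 1, "str": 2, "aw": 3}


def tokenise(text, vocab):
    """Split text by greedy longest match against the vocabulary.

    Characters that match no vocabulary entry become one-character tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            # No vocabulary entry starts here: emit a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def encode(text, vocab):
    """Map each token to its integer id (-1 for out-of-vocabulary pieces)."""
    return [vocab.get(token, -1) for token in tokenise(text, vocab)]


print(tokenise("strawberry", TOY_VOCAB))  # ['straw', 'berry']
print(encode("strawberry", TOY_VOCAB))    # [0, 1]
```

The second line of output is the point: what reaches the network is closer to `[0, 1]` than to ten letters, and nothing in those two integers says how many times 'r' occurs.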

This is not sloppiness. Tokenisation is what makes these systems tractable. A model that processed text character by character would face sequences several times longer, and because the cost of standard self-attention grows with the square of sequence length, it would need far more computation to capture the same linguistic patterns. By operating on meaningful chunks, the architecture can learn grammar, idiom, and even reasoning-like behaviour with remarkable efficiency. The trade-off is that the model has no native concept of individual characters within a token. When asked to count letters, it must infer the answer from statistical patterns in its training data, and those patterns are surprisingly sparse for this particular task, because humans rarely write out letter-by-letter analyses of common words.
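A rough sketch of that trade-off follows. It assumes standard self-attention, in which every position is compared with every other, so the work grows with the square of sequence length; the two-token split of 'strawberry' is a hypothetical example, not the output of any particular tokeniser.

```python
def attention_pairs(sequence_length):
    """Number of position pairs that standard self-attention compares."""
    return sequence_length ** 2


word = "strawberry"
char_sequence = list(word)           # what a character-level model would see
token_sequence = ["straw", "berry"]  # hypothetical token-level split

print(len(char_sequence), attention_pairs(len(char_sequence)))    # 10 100
print(len(token_sequence), attention_pairs(len(token_sequence)))  # 2 4

# With characters in hand, counting is a one-liner.
print(word.count("r"))  # 3

# Counting per token also works, but only if the token strings are visible.
# A model sees integer ids instead, so it cannot perform this step directly.
print(sum(token.count("r") for token in token_sequence))  # 3
```

On this single word the character-level view costs 25 times as many comparisons, which is the efficiency argument in miniature. The price is that the letters are no longer there to count.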

Why eloquence and innumeracy coexist

The same tokenisation that blinds models to letter-level detail is what enables their fluency. Subword vocabularies are built from frequency statistics rather than linguistic rules, but the resulting tokens often line up with morphemes and common word pieces, so the model's attention mechanism can focus on semantic relationships rather than orthographic ones. It learns that 'unhappiness' relates to 'happy' not by laboriously comparing letter sequences but by recognising token-level patterns that encode meaning.

This is why a language model can write a sonnet about quantum mechanics but stumble when asked how many words are in the sonnet it just wrote. The architecture is optimised for meaning, not measurement. Counting requires exact enumeration; meaning tolerates — even benefits from — fuzzy pattern matching.

The implications extend beyond parlour tricks. When a model summarises a document, it is not counting sentences or tracking word limits with precision; it is generating text that statistically resembles good summaries. When it performs arithmetic, it is pattern-matching against examples in its training data, not executing a calculator subroutine. This is why language models can solve many maths problems correctly while occasionally insisting that 7 × 8 equals 54 rather than 56.

The engineering workarounds

AI developers are well aware of these limitations. Modern systems increasingly route certain queries to external tools: calculators for arithmetic, code interpreters for counting, search engines for factual lookup. The language model becomes an orchestrator, deciding when to delegate rather than attempting everything itself. This hybrid approach papers over many weaknesses, but it also underscores that the core model remains a pattern-completion engine, not a general reasoning machine.
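The delegation idea can be sketched in a few lines. Everything here is hypothetical: the function names, the two regular expressions, and the rule-based `route` function are stand-ins invented for illustration. In a real system the model itself usually decides when to call a tool, and the mechanics differ from vendor to vendor.

```python
import re


def count_letter(word, letter):
    """Exact tool: count occurrences of a letter, ignoring case."""
    return word.lower().count(letter.lower())


def multiply(a, b):
    """Exact tool: integer multiplication."""
    return a * b


# Hypothetical patterns standing in for the model's own delegation decision.
LETTER_QUERY = re.compile(r"how many '(\w)'s? (?:are )?in '(\w+)'", re.IGNORECASE)
# Accepts x, the multiplication sign (U+00D7), or * between two integers.
PRODUCT_QUERY = re.compile(r"(\d+)\s*[x\u00d7*]\s*(\d+)")


def route(query):
    """Return (handler_name, result); result is None when no tool applies."""
    match = LETTER_QUERY.search(query)
    if match:
        letter, word = match.groups()
        return ("letter_counter", count_letter(word, letter))
    match = PRODUCT_QUERY.search(query)
    if match:
        return ("calculator", multiply(int(match.group(1)), int(match.group(2))))
    # Anything else is left to the language model's own generation.
    return ("language_model", None)


print(route("How many 'r's are in 'strawberry'?"))      # ('letter_counter', 3)
print(route("What is 7 x 8?"))                          # ('calculator', 56)
print(route("Write a sonnet about quantum mechanics"))  # ('language_model', None)
```

The exact answers come from ordinary code, not from the model. The model's contribution in such a design is recognising which questions to hand off, and that is itself a pattern-matching judgement that can fail.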

Some researchers are exploring architectures that might give models more direct access to character-level information, or that separate linguistic processing from symbolic computation more cleanly. Progress is real but incremental. The fundamental tension between efficiency and granularity is not easily resolved.

Our take

The strawberry test is useful precisely because it is trivial. Anyone who has spent five minutes with a child knows that counting letters is not hard. That a system capable of passing bar exams cannot reliably perform this task tells us something important: capability is not uniform, and impressive performance in one domain does not imply competence in another. The companies building these systems know this. The question is whether the rest of us — investors, regulators, enthusiastic adopters — will remember it when the next dazzling demo arrives.