Ask a large language model to write a sonnet, and it will produce something passable. Ask it to write a sonnet with exactly seven syllables per line, and watch it fail spectacularly. The machine that can discuss Wittgenstein and draft legal contracts cannot reliably perform a task that any eight-year-old manages without effort.

This is not a bug to be patched in the next release. It is a window into what these systems actually are — and what they are not.

The tokenization problem

Language models do not see words the way humans do. They process text as tokens: chunks of characters that the tokenizer's training process found statistically useful. The word "strawberry" might be split into "straw" and "berry," or into "str," "aw," and "berry," depending on the model's vocabulary. When you ask how many R's appear in "strawberry," the model never sees the individual letters. It sees a numerical ID for each token and must infer letter-level information from patterns learned during training.

This is roughly equivalent to asking someone to count the bricks in a house while only showing them a photograph taken from a moving car. They might guess correctly sometimes, especially for simple cases, but the fundamental information is not directly available to them.
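To make the tokenization point concrete, here is a minimal sketch in Python. The vocabulary, the token IDs, and the greedy longest-match splitter are all invented for illustration; real tokenizers are trained on large corpora and behave differently in detail. The only point is that what reaches the model is a short list of integers, while the letter count is trivial for ordinary code that can see the characters.

```python
# Hypothetical vocabulary and IDs, invented for this example.
# They are not taken from any real model.
TOY_VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}


def toy_tokenize(word, vocab):
    """Split `word` greedily into the longest pieces found in `vocab`."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink it.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


def to_ids(pieces, vocab):
    """Map pieces to integer IDs; unknown pieces get the placeholder ID 0."""
    return [vocab.get(piece, 0) for piece in pieces]


pieces = toy_tokenize("strawberry", TOY_VOCAB)
ids = to_ids(pieces, TOY_VOCAB)

# What the model receives: two integers, with no letters in sight.
print(pieces, ids)

# What ordinary code can do because it sees the characters directly.
letter_count = "strawberry".count("r")
print(letter_count)
```

Nothing in the list of IDs says how many R's the word contains. A model has to have learned that fact about each token indirectly, which is why it sometimes gets the count right and sometimes does not.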

Pattern-matching versus computation

The deeper issue is that language models are prediction engines, not calculators. They excel at determining what text should come next based on statistical patterns in their training data. When asked "What is 2 + 2?" they are not performing addition — they are recognizing that, in the vast corpus of text they consumed, the sequence "2 + 2 =" is overwhelmingly followed by "4."

This works remarkably well for common arithmetic. It breaks down on unusual problems. Ask for the product of two six-digit numbers, and the model will often produce a wrong answer with complete confidence, because it has almost certainly never seen that specific calculation and cannot actually multiply. It can only guess what a multiplication result should look like.
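The contrast between recalling and computing can be caricatured in a few lines of Python. This is a deliberately crude toy, not a description of how any real model works internally: the `MEMORIZED` table, the function names, and the "right shape" fallback are all assumptions made up for this sketch. It only illustrates the difference between returning an answer that has been seen before and carrying out the operation.

```python
# Hypothetical "training data": a few products the toy has seen before.
MEMORIZED = {"3 * 4": "12", "10 * 10": "100", "12 * 12": "144"}


def recall_or_guess(a, b):
    """Return a memorized answer if available, else a plausible-looking guess."""
    prompt = f"{a} * {b}"
    if prompt in MEMORIZED:
        return int(MEMORIZED[prompt])
    # Unseen prompt: build a number with a believable shape by multiplying
    # the leading digits and padding with zeros. No real multiplication
    # of the full operands happens here.
    lead = int(str(a)[0]) * int(str(b)[0])
    zeros = (len(str(a)) - 1) + (len(str(b)) - 1)
    return lead * 10 ** zeros


def compute(a, b):
    """Actually perform the multiplication."""
    return a * b


seen = recall_or_guess(3, 4)               # found in the table
guess = recall_or_guess(123456, 654321)    # never seen, so guessed
exact = compute(123456, 654321)            # computed

print(seen, guess, exact)
```

For the six-digit case the guess has the right number of digits and the wrong value, which is roughly the failure described above: an answer that looks like a product without being one.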

The implications extend far beyond arithmetic. Any task requiring precise, step-by-step logical operations — counting syllables, verifying legal citations, checking whether code actually compiles — runs into the same fundamental limitation. The model is always approximating, never computing.

Why this matters for deployment

The gap between perceived capability and actual capability creates genuine risks. A language model can produce text that reads like expert legal analysis while containing fabricated case citations. It can generate code that looks syntactically correct but fails to execute. It can write scientific explanations that sound authoritative but contain subtle factual errors.

The fluency is real. The reliability is not. And because these systems cannot distinguish between what they know and what they are guessing, they present both with identical confidence.
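One practical response is to hand the checkable parts of a model's output to something that really does compute. As a minimal sketch, assuming the generated code is Python, the built-in `compile()` and `exec()` functions can separate code that merely looks right from code that parses and runs. The function name `check_generated_code` and the sample snippets are hypothetical, and in a real deployment untrusted code should only ever be executed inside a sandbox.

```python
def check_generated_code(source):
    """Classify a Python snippet as 'syntax error', 'runtime error', or 'ok'."""
    try:
        # Stage 1: does it even parse?
        code = compile(source, "<generated>", "exec")
    except SyntaxError:
        return "syntax error"
    try:
        # Stage 2: does it run? A fresh dict keeps it out of our namespace.
        # WARNING: only do this in a sandbox for untrusted code.
        exec(code, {})
    except Exception:
        return "runtime error"
    return "ok"


# Three hypothetical model outputs that look almost identical on the page.
good = "total = sum([1, 2, 3])\n"
misspelled = "total = summ([1, 2, 3])\n"   # parses, but summ is undefined
unclosed = "total = sum([1, 2, 3]\n"       # missing closing parenthesis

results = [check_generated_code(s) for s in (good, misspelled, unclosed)]
print(results)
```

A check like this says nothing about whether the code does what was asked, only whether it parses and runs. It still catches a class of errors that fluent output hides, and it does so without relying on the model's own confidence.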

Our take

The counting problem is not evidence that AI is useless — these systems remain genuinely transformative for many applications. But it is a permanent reminder that we have built very sophisticated prediction machines, not thinking machines. The organizations that thrive with AI will be those that understand this distinction viscerally: that they are working with tools of extraordinary fluency and limited reliability, and that the human in the loop is not a temporary necessity but a permanent requirement. The hype cycle wants you to believe general intelligence is around the corner. The letter-counting problem suggests otherwise.