Ask ChatGPT how many times the letter 'r' appears in 'strawberry' and there is a good chance it will confidently announce two. The correct answer is three. This is not a bug to be patched; it is a window into the fundamental architecture of large language models, and understanding it clarifies what these systems can and cannot do far better than any marketing material.
The error persists because LLMs do not see letters. They see tokens: chunks of text that their training process determined were statistically useful units. The word 'strawberry' might be tokenized as 'straw' and 'berry,' or as 'str,' 'aw,' and 'berry,' depending on the model. The model receives a sequence of token IDs, not a string of letters, so it has no direct representation of the individual characters it is asked to count. It is like asking someone to count the bricks in a photograph of a house: they can estimate based on what houses usually look like, but they cannot access the actual bricks.
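To make the token-versus-letter gap concrete, here is a minimal sketch in Python. It assumes a tiny hypothetical vocabulary (`VOCAB`) and a greedy longest-match splitter (`tokenize`); neither is taken from any real model, and production tokenizers learn far larger vocabularies with different algorithms. The point is only to show what reaches the model (a short list of integer IDs) next to what ordinary string code can see (the characters themselves).

```python
# Hypothetical vocabulary for illustration only; real tokenizers learn
# tens of thousands of entries from training data.
VOCAB = {"straw": 101, "berry": 102, "str": 103, "aw": 104}


def tokenize(text, vocab):
    """Split text into token IDs by greedy longest match.

    This is a simplification of real subword tokenizers, used here
    only to show that the output contains no letters.
    """
    ids = []
    i = 0
    while i < len(text):
        # Try the longest remaining piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i:]!r}")
    return ids


word = "strawberry"
token_ids = tokenize(word, VOCAB)   # what the model is given
letter_count = word.count("r")      # what a character-level program can do

print(token_ids)     # [101, 102]
print(letter_count)  # 3
```

Nothing in `[101, 102]` says how many times 'r' occurs. A model can only answer the question if it has learned, from text about spelling, something about which letters sit inside each token.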
Pattern matching, not reasoning
Large language models are, at their core, extraordinarily sophisticated autocomplete engines. They predict the next token in a sequence based on statistical patterns learned from training data. When GPT-4 writes a sonnet or explains quantum mechanics, it is not 'understanding' in any human sense — it is generating sequences that pattern-match against the vast corpus of human text it absorbed during training.
This explains both the magic and the failures. The models excel at tasks where the right answer looks like something that frequently appeared in training data: well-structured prose, common coding patterns, standard explanations of established concepts. They struggle when the task requires operations their architecture was never designed to perform: precise counting, multi-step logical reasoning, or anything requiring genuine novelty rather than sophisticated recombination.
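The autocomplete idea can be sketched with a deliberately crude stand-in: a bigram counter that predicts the next word from whichever word most often followed the current one in a small made-up corpus. The names `CORPUS`, `train_bigrams`, and `predict_next` are illustrative assumptions. Real LLMs use neural networks over tokens with long contexts, not word-pair counts, so treat this as a toy showing the shape of the task rather than how any actual model works.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for training data.
CORPUS = (
    "the cat sat on the mat . "
    "the cat ate the fish . "
    "the cat sat on the sofa ."
)


def train_bigrams(text):
    """Count how often each word follows each other word."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


model = train_bigrams(CORPUS)
print(predict_next(model, "the"))  # cat
print(predict_next(model, "cat"))  # sat
print(predict_next(model, "dog"))  # None
```

The toy does well on continuations it has seen often and has nothing to offer for 'dog', which never appeared in its corpus. It never checks whether a cat actually sat anywhere; it only reports what usually came next.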
The illusion of competence
The danger lies in the fluency. Because LLMs produce grammatically perfect, confidently stated outputs, humans naturally attribute understanding where none exists. A model that generates a plausible-sounding legal brief may have no grasp of whether its citations exist. A model that writes elegant code may not 'know' whether the code actually works.
This creates a peculiar failure mode: the systems are most dangerous precisely when they seem most competent. A hallucinated fact delivered in perfect prose is harder to catch than one delivered with obvious uncertainty. The architecture optimizes for plausibility, not truth.
What this means for users
None of this makes LLMs useless; far from it. They are remarkable tools for drafting, brainstorming, translation, and any task where human review catches errors. The key is understanding them as pattern-completion engines rather than reasoning systems. Use them for first drafts, not final answers. Trust them on common knowledge; verify them on specifics. Treat their confidence as a stylistic choice, not an indicator of accuracy.
Our take
The 'strawberry' problem is not a temporary limitation awaiting a software update. It reflects a fundamental architectural choice: these systems were built to predict text, not to reason about the world. That choice enabled their remarkable capabilities and ensures their characteristic failures. Anyone using these tools seriously needs to internalize this distinction. The models are not thinking; they are pattern-matching at superhuman scale. That is genuinely useful, but it is not intelligence, and conflating the two leads to predictable disasters.




