Large language models do not think. They predict. This distinction sounds philosophical until you ask one to count the letters in the word "strawberry" and watch it confidently announce there are two r's. The error is not a bug to be patched but a window into what these systems actually are—and what they can never become without fundamental redesign.
The confusion stems from a reasonable intuition: anything that can write poetry, summarize legal briefs, and explain quantum mechanics must surely be able to count to three. But LLMs process language as tokens, not characters or concepts. The word "strawberry" enters the system as a single token or a small cluster of them, depending on the model's vocabulary. The internal representation contains no explicit record of which letters appear or how many times. When asked to count, the model does not examine the word; it predicts what a plausible answer would look like based on patterns in its training data.
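The gap between tokens and letters can be made concrete with a small sketch. The vocabulary and token IDs below are invented for illustration and do not correspond to any real model's tokenizer; the point is only that the IDs a model receives carry no letter-level information, while ordinary code can simply inspect the string.

```python
# Toy illustration: TOY_VOCAB and its IDs are hypothetical, not a real model's vocabulary.
TOY_VOCAB = {"straw": 101, "berry": 102, "st": 103, "raw": 104}


def toy_tokenize(word, vocab):
    """Greedy longest-match split of `word` into token IDs from `vocab`."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece first, shrinking until one matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return ids


def count_letter(word, letter):
    """Deterministic character count: look at every character in the string."""
    return sum(1 for ch in word if ch == letter)


token_ids = toy_tokenize("strawberry", TOY_VOCAB)
print(token_ids)                         # [101, 102]
print(count_letter("strawberry", "r"))   # 3
```

Nothing in `[101, 102]` says how many r's the word contains. A model working from those IDs has to have learned that fact indirectly, whereas `count_letter` reads it off the characters.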
Prediction versus computation
Traditional software performs operations. A calculator receives two numbers and an operator, then executes a deterministic procedure that guarantees the correct result. An LLM receives a prompt and generates a statistically likely continuation, one token at a time. For simple arithmetic, these approaches happen to converge: the training data contains enough correct examples that the model learns to mimic the right answers for small sums. But as numbers grow larger or problems grow more compositional, the statistical approach breaks down. The model has probably never seen the specific multiplication you are asking about, and its learned heuristics fail.
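A deliberately crude caricature shows the difference in kind. The `memorized_multiply` function below is not how an LLM works internally, and a real model generalizes far better than a lookup table; `SEEN_EXAMPLES` is a hypothetical stand-in for training data. The sketch only illustrates that answering from examples can match a deterministic procedure on familiar inputs and still fail on unfamiliar ones.

```python
# Caricature only: SEEN_EXAMPLES is a made-up stand-in for training data.
SEEN_EXAMPLES = {(2, 3): 6, (7, 8): 56, (12, 12): 144}


def calculator_multiply(a, b):
    """Deterministic procedure: correct for any pair of integers."""
    return a * b


def memorized_multiply(a, b):
    """Answer from memory, or reuse the answer of the closest seen example."""
    if (a, b) in SEEN_EXAMPLES:
        return SEEN_EXAMPLES[(a, b)]
    nearest = min(SEEN_EXAMPLES, key=lambda k: abs(k[0] - a) + abs(k[1] - b))
    return SEEN_EXAMPLES[nearest]


# On a familiar input the two approaches agree.
print(calculator_multiply(7, 8), memorized_multiply(7, 8))
# On an unfamiliar input only the procedure is right.
print(calculator_multiply(4821, 9377), memorized_multiply(4821, 9377))
```

Both functions return an answer with equal confidence in every case. Only one of them is executing a procedure that makes the answer correct.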
This is why chain-of-thought prompting helps: each intermediate step the model generates becomes part of the context for the next prediction, giving it more opportunities to pattern-match against training examples of correct reasoning. The model is not suddenly doing math; it is predicting what a math-doing human would write at each step. The improvement is real but fragile, because one wrong intermediate step carries forward into everything predicted after it.
Why fluency misleads us
Humans conflate linguistic competence with general intelligence because, for us, they are deeply intertwined. A person who can articulate a nuanced argument about monetary policy almost certainly understands basic arithmetic. LLMs sever this connection. Their fluency emerges from exposure to trillions of words, while their reasoning emerges from—well, nothing structural. They have no working memory in the computational sense, no ability to loop through a sequence and increment a counter, no mechanism for verifying their own outputs against ground truth.
The result is a system that sounds authoritative regardless of whether it is correct. This is not deception; the model has no concept of correctness. It is simply generating plausible text, and plausible text about numbers often contains wrong numbers delivered with perfect confidence.
What this means for deployment
The counting problem is a microcosm of a larger challenge. Any task requiring precise, verifiable computation—financial calculations, code execution, logical proof—sits outside the native competence of transformer-based language models. The industry response has been to bolt on external tools: calculators, code interpreters, retrieval systems. This works, but it transforms the LLM from a reasoning engine into a sophisticated router that decides which tool to invoke. The intelligence, such as it is, lies in the orchestration rather than the core model.
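The bolt-on pattern can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any vendor's tool-calling API: `calculator_tool` and `route` are hypothetical names, `model` is any callable standing in for an LLM, and a simple parse attempt stands in for the routing decision that a deployed system would typically leave to the model itself.

```python
import ast
import operator

# Arithmetic operators the hypothetical calculator tool is willing to execute.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator_tool(expression):
    """Evaluate +, -, *, / over numeric literals without calling eval()."""
    def _eval(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval").body)


def route(request, model):
    """Send parseable arithmetic to the calculator, everything else to the model."""
    try:
        return ("calculator", calculator_tool(request))
    except (ValueError, SyntaxError):
        return ("model", model(request))


def stub_model(prompt):
    """Placeholder for an LLM call; returns a canned string."""
    return f"[generated text for: {prompt}]"


print(route("4821 * 9377", stub_model))
print(route("Summarize this legal brief", stub_model))
```

The arithmetic result comes from executing a procedure, not from prediction. The language model's contribution in such a design is recognizing that a calculator is the right thing to call.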
Our take
The strawberry test is not a gotcha; it is a diagnostic. Anyone deploying LLMs in consequential domains should internalize what it reveals: these systems are unreliable narrators dressed in the costume of omniscient assistants. They will improve, and tool integration will paper over many gaps, but the fundamental architecture predicts rather than computes. Until that changes—if it ever does—the wise approach is to treat LLMs as brilliant but innumerate interns who require supervision on anything that matters.




