Ask ChatGPT how many times the letter 'r' appears in the word 'strawberry' and there is a reasonable chance it will get it wrong. This is not a cherry-picked gotcha but a symptom of something fundamental: large language models do not process language the way humans do, and the mismatch between their apparent fluency and their actual mechanics is the single most important thing to understand about the technology reshaping white-collar work.

The counting problem stems from tokenization, the process by which models break text into digestible chunks. When you type 'strawberry,' the model might see it as 'straw' and 'berry,' or some other subdivision that bears no relationship to individual letters. It never sees three discrete r's lined up for counting. It sees statistical patterns learned from billions of text fragments, and it predicts what a helpful response to your question would look like based on those patterns. Sometimes the prediction lands on the right answer. Often it does not.
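To make the point concrete, here is a minimal sketch of subword tokenization. It assumes a tiny hand-made vocabulary (`VOCAB`) and a greedy longest-match rule, both invented for illustration. Production tokenizers learn far larger vocabularies from data and may split 'strawberry' differently, but the effect is similar: the model receives a short list of integer IDs, not ten letters.

```python
# Hypothetical subword vocabulary, invented for this example.
# Real tokenizers learn tens of thousands of entries from data.
VOCAB = {"straw": 0, "berry": 1, "s": 2, "t": 3, "r": 4,
         "a": 5, "w": 6, "b": 7, "e": 8, "y": 9}


def tokenize(word, vocab=VOCAB):
    """Split `word` into vocabulary pieces using greedy longest-match."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until a piece matches.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


pieces = tokenize("strawberry")
ids = [VOCAB[p] for p in pieces]

print(pieces)                    # ['straw', 'berry']
print(ids)                       # [0, 1]  <- what the model is actually given
print("strawberry".count("r"))   # 3      <- trivial when you can see letters
```

Counting letters is a one-line operation on the raw string, but the IDs `[0, 1]` carry no direct record of how many r's the original word contained.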

The prediction machine

Every response from a large language model is, at its core, a sophisticated autocomplete. The system has ingested vast quantities of human writing and learned which words tend to follow which other words in which contexts. When you ask it to explain quantum entanglement, it is not retrieving information from a database or reasoning through physics principles. It is generating text that statistically resembles how humans have written about quantum entanglement in its training data.
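The autocomplete idea can be sketched in a few lines. The example below is a deliberately crude stand-in, not how any production model is built: it counts which word follows which in a tiny made-up corpus (`CORPUS`) and always picks the most frequent follower. Real systems use neural networks over tokens, condition on far more than one preceding word, and sample from probabilities, but the underlying move of predicting the next piece of text from learned patterns is the same.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus standing in for training data.
CORPUS = ("the cat sat on the mat . the cat sat on the rug . "
          "the dog ate the fish .")


def train_bigrams(text):
    """Count how often each word follows each other word."""
    follows = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]


def complete(follows, start, length=4):
    """Extend `start` by repeatedly appending the predicted next word."""
    words = [start]
    for _ in range(length):
        nxt = predict_next(follows, words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


model = train_bigrams(CORPUS)
print(predict_next(model, "the"))     # cat
print(complete(model, "the"))         # the cat sat on the
print(predict_next(model, "salmon"))  # None (never seen in the corpus)
```

The output reads like a sentence, yet nothing in the program knows what a cat or a mat is. It only knows which words tended to appear next to each other.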

This is why the same model can write a serviceable essay on Kantian ethics and then confidently assert that a salmon is a mammal. Both outputs are predictions about what text should come next. One happens to align with reality because humans have written accurately about Kant. The other fails because the model's learned patterns around animal classification are weaker than its confident tone suggests, and nothing in the prediction step checks the claim against reality.

The fluency trap

Human brains are wired to associate fluent language with understanding. When someone speaks articulately about a subject, we assume they comprehend it. Large language models exploit this heuristic ruthlessly, not through any intent but through their design. They are optimized to produce text that sounds right, which is a very different objective from producing text that is right.

This creates a peculiar danger: the better these systems become at mimicking human expression, the harder it becomes to identify their failures. A clumsy chatbot that mangles grammar is easy to distrust. A system that writes with the polish of a professional editor but occasionally fabricates citations requires a more vigilant form of skepticism.

What they actually do well

None of this means large language models are useless. They excel at tasks where pattern-matching and stylistic fluency matter more than factual precision: drafting emails, brainstorming ideas, summarizing documents, translating between languages, generating code scaffolding. They are genuinely useful tools for anyone who writes, programs, or processes large volumes of text.

The trouble arises when users treat them as oracles rather than assistants. A model that helps you draft a legal brief is valuable. A model that you trust to get the case citations right without checking is a liability. The technology works best when humans remain in the loop, applying the judgment and verification that the systems themselves cannot provide.

Our take

The counting problem is not a bug to be fixed in the next release. It is a window into the architecture of systems that are being deployed with remarkable speed and remarkable credulity. Large language models are extraordinary pattern-matching engines dressed in the clothing of conversational partners. Understanding what they actually are, rather than what they appear to be, is the only responsible way to use them. The letter r appears three times in strawberry. That this remains a difficult problem for billion-dollar AI systems should give everyone pause.