Ask a frontier AI model to count the windows on a moderately complex building facade, and it will almost certainly get it wrong. Not because the task is hard for humans—a child can do it—but because counting requires something these systems fundamentally lack: the ability to hold discrete objects in working memory and iterate through them systematically.

This is not a bug to be patched in the next release. It is a window into what large language models actually are, and more importantly, what they are not.

The prediction machine beneath the magic

Large language models work by predicting the next token in a sequence. A token is roughly a word or word fragment, and the core of the model's training consists of learning statistical patterns about which tokens tend to follow which others, given vast amounts of text. When you ask a question, the model generates an answer one token at a time, each one drawn from a probability distribution over possible next tokens that the model computes from everything in the sequence so far.
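To make "predicting the next token" concrete, here is a deliberately tiny sketch: a bigram model that only looks at the previous word. Real language models use neural networks conditioned on the whole preceding sequence, so treat this as an illustration of the idea rather than of any production system. The function names, the whitespace tokenizer, and the toy corpus are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each token, which tokens follow it in the training text."""
    tokens = text.split()  # toy tokenizer (assumption): whitespace-separated words
    follows = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        follows[current][nxt] += 1
    return follows

def next_token_distribution(follows, token):
    """Turn raw follow-counts into a probability distribution."""
    counts = follows[token]
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}

def generate(follows, start, length, seed=0):
    """Generate up to `length` further tokens, one at a time, by sampling."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        dist = next_token_distribution(follows, out[-1])
        if not dist:  # the last token was never followed by anything in training
            break
        tokens, probs = zip(*dist.items())
        out.append(rng.choices(tokens, weights=probs, k=1)[0])
    return out

corpus = "the cat sat on the mat and the cat slept on the sofa"
model = train_bigram(corpus)
print(next_token_distribution(model, "the"))
print(" ".join(generate(model, "the", 5)))
```

Nothing in this loop counts, checks, or reasons about cats and mats. It only asks which word tended to come next, which is the point the rest of this piece builds on.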

This architecture produces something that feels like reasoning. The model can explain quantum mechanics, write poetry, and debug code. But it achieves these feats through pattern completion, not through the kind of step-by-step logical processing that humans use when counting objects or solving arithmetic problems.

The distinction matters enormously. A calculator follows explicit rules: take these digits, apply this operation, return the result. A language model instead asks: given everything I've seen about how humans discuss arithmetic, what answer would plausibly come next? For simple problems, the training data contains enough examples that the model's pattern-matching produces correct answers. For novel or complex calculations, it can hallucinate, confidently producing text that looks like a correct answer but isn't.
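The contrast can be caricatured in a few lines. The sketch below sets an explicit rule beside a crude "answer by resemblance" function that returns the result of the most similar problem it has seen. It is not how a language model computes anything internally; the lookup table, the similarity measure, and the function names are invented purely to show why resemblance works on familiar problems and fails on unfamiliar ones.

```python
def calculator_add(a, b):
    """Explicit rule: apply the operation to the inputs."""
    return a + b

# Hypothetical "training data": a handful of sums seen before.
SEEN = {(2, 2): 4, (3, 5): 8, (10, 10): 20, (12, 7): 19}

def pattern_add(a, b):
    """Answer by resemblance: reuse the result of the closest seen problem."""
    if (a, b) in SEEN:
        return SEEN[(a, b)]
    nearest = min(SEEN, key=lambda k: abs(k[0] - a) + abs(k[1] - b))
    return SEEN[nearest]

for a, b in [(3, 5), (123, 456)]:
    print(a, b, "rule:", calculator_add(a, b), "resemblance:", pattern_add(a, b))
```

On the familiar problem the two functions agree. On the unfamiliar one the resemblance-based answer is still a number and still looks like an answer, but it is wrong.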

Why the illusion is so convincing

The sophistication of modern language models creates a powerful cognitive trap. When a system can discuss Wittgenstein's philosophy of language, summarize legal precedents, and write functional software, the assumption that it must therefore be able to count windows feels reasonable. Humans who can do the former can certainly do the latter.

But language models are not general intelligences with uneven abilities. They are extremely sophisticated autocomplete systems that have absorbed enough human text to simulate many forms of expertise. The simulation breaks down precisely where pattern-matching diverges from actual computation: arithmetic, spatial reasoning, logical deduction over many steps, and anything requiring genuine working memory.

This explains why AI systems can pass bar exams but struggle with tasks that seem trivial. The bar exam tests pattern recognition over legal concepts—exactly what language models excel at. Counting windows requires holding a mental register and incrementing it systematically—exactly what they cannot do without external scaffolding.

The scaffolding workaround and its limits

The industry has developed clever patches. Modern AI systems can call calculators for arithmetic, execute code for complex logic, and use retrieval systems to ground their outputs in real data. These tool-using capabilities genuinely extend what AI can accomplish.
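A minimal sketch of what that scaffolding looks like, under stated assumptions: the model is imagined to emit either plain text or a structured request naming a tool, and surrounding code does the actual arithmetic. The request format, the `answer` function, and the `calculator_tool` name are hypothetical and do not correspond to any particular vendor's API.

```python
import ast
import operator

# Arithmetic operators the toy calculator is willing to evaluate.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculator_tool(expression):
    """Evaluate a basic arithmetic expression by walking its syntax tree."""
    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)

def answer(request):
    """Route a (hypothetical) model output: run the tool if one was requested."""
    if request.get("tool") == "calculator":
        return str(calculator_tool(request["input"]))
    return request["text"]

# The model's only job here is to recognize that it should ask for help.
print(answer({"tool": "calculator", "input": "48213 * 7719"}))
print(answer({"text": "Paris is the capital of France."}))
```

The arithmetic is done entirely by ordinary code. The model's contribution is deciding that a tool call is needed and phrasing the request.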

But the patches also reveal the core limitation. A system that needs to call a calculator to add large numbers is not a system that understands mathematics in any meaningful sense. It is a system that has learned when to admit it needs help and how to ask for it—useful, but categorically different from actual numerical cognition.

The implications ripple outward. Every application that requires reliable computation, precise reasoning, or guaranteed accuracy must build extensive verification infrastructure around the language model. The model becomes a sophisticated interface layer, not a reasoning engine.
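What verification infrastructure means in practice can be as simple as recomputing anything that must be exact and treating the model's figure as a claim to be checked. The sketch below assumes a hypothetical scenario in which a model has summarized an invoice and stated a total; the `Checked` type and `checked_total` function are illustrative names, not part of any real library.

```python
from dataclasses import dataclass

@dataclass
class Checked:
    value: int              # the independently computed total
    model_was_right: bool   # whether the model's claim matched it

def checked_total(line_items, model_claimed_total):
    """Recompute the total with ordinary code and compare it to the claim."""
    actual = sum(line_items)
    return Checked(value=actual, model_was_right=(actual == model_claimed_total))

result = checked_total([19, 24, 7], model_claimed_total=51)
print(result.value, result.model_was_right)
```

Downstream code uses `value`, never the claimed total, and the mismatch flag can be logged or surfaced for review.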

Our take

None of this diminishes what language models genuinely achieve. They have made human knowledge more accessible, automated tedious communication tasks, and created entirely new forms of human-computer interaction. But the window-counting problem is a useful corrective to the breathless discourse that treats these systems as nascent general intelligences. They are something stranger and more specific: machines that have learned to speak human without learning to think human. Understanding that distinction is essential for anyone building with these tools, investing in these companies, or simply trying to make sense of a technology that will shape the coming decades. The magic is real. It just isn't the magic we think it is.