The most sophisticated artificial intelligence systems in the world can pass bar exams, generate working code, and compose sonnets in the style of Shakespeare. They cannot, however, reliably tell you whether a glass of water placed upside-down on a table will spill. This is not a bug awaiting a fix. It is a window into the fundamental architecture of what we have built and what we have not.
The disconnect between linguistic fluency and basic physical reasoning has become a central puzzle of contemporary AI research. Systems trained on billions of words have absorbed remarkable statistical patterns about how language works, but they have learned far less about how the world works. They are, in a precise sense, all talk.
The training data problem
Large language models learn by predicting the next token (roughly, the next word or word fragment) in a sequence, over and over, across vast corpora of text. This approach produces systems that are extraordinarily good at mimicking human communication patterns. But text is a lossy compression of reality. When you read "she picked up the heavy box," your brain automatically simulates weight, grip, balance, and effort. The model sees only tokens.
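To make "the model sees only tokens" concrete, here is a deliberately tiny sketch of next-word prediction. It is not how a real language model is built: production systems use neural networks over subword tokens, whereas this toy only counts which word follows which. The corpus and the function names `train_bigram` and `predict_next` are hypothetical, chosen for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    follows = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Pair each word with the word that comes immediately after it.
        for current, nxt in zip(words, words[1:]):
            follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Hypothetical toy corpus (an assumption for illustration only).
corpus = [
    "she picked up the heavy box",
    "he picked up the heavy bag",
    "they picked up the heavy box",
]

model = train_bigram(corpus)
print(predict_next(model, "heavy"))       # box
print(predict_next(model, "watermelon"))  # None
```

The toy predicts "box" after "heavy" purely because that pairing is the most frequent one in its text. Nothing in it represents weight, grip, or effort, and a word it has never seen yields no prediction at all.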
This explains a curious phenomenon: AI systems often fail at questions that seem trivially easy while succeeding at questions that seem impossibly hard. Solving a differential equation requires following explicit rules that appear frequently in training data. Knowing that you cannot fit a watermelon in a coffee cup requires intuitions about three-dimensional space that humans acquire through years of physical interaction with objects.
Researchers have documented this gap extensively. When prompted with scenarios involving spatial relationships, temporal sequences, or causal chains that deviate even slightly from common textual patterns, models frequently produce confident nonsense. They have memorised the surface structure of reasoning without acquiring the underlying machinery.
Why more data will not close the gap
The instinctive response to any AI limitation is to suggest more training data or larger models. This has worked remarkably well for many capabilities. It is unlikely to solve the common-sense problem.
The issue is not that language models lack information about physics or causality. The issue is that their architecture is optimised for a different task entirely. Predicting tokens rewards statistical co-occurrence, not causal understanding. A model that has seen millions of sentences about gravity still does not know what gravity feels like, and that experiential knowledge turns out to matter enormously for robust reasoning.
Some researchers are pursuing hybrid approaches: combining language models with physics simulators, robotic embodiment, or structured knowledge graphs. These efforts have produced incremental gains but nothing resembling the fluid, automatic common sense that humans deploy constantly without conscious effort. The gap between "can discuss physics" and "understands physics" remains vast.
What this means for AI deployment
The practical implications are significant. Current AI systems are most reliable in domains where success can be verified through text alone: coding, legal research, translation, summarisation. They become progressively less reliable as tasks require implicit physical reasoning, genuine causal inference, or predictions about novel situations.
This helps explain why autonomous vehicles remain stubbornly difficult despite billions in investment. Driving requires exactly the kind of embodied, contextual, split-second reasoning that language models lack, and that has proved hard to build into any system trained on data alone. It is also why AI systems that seem brilliant in controlled demonstrations sometimes fail catastrophically in deployment, when the real world presents situations that deviate from their training distributions.
Our take
The AI industry has a communication problem. Marketing materials emphasise capabilities that genuinely impress while glossing over limitations that genuinely matter. The result is a public discourse oscillating between utopian hype and dystopian panic, neither of which captures the more interesting reality: we have built systems that are simultaneously more capable and more limited than most people understand.
The honest assessment is that large language models represent a genuine breakthrough in one narrow dimension of intelligence while leaving most other dimensions essentially untouched. Recognising this is not pessimism. It is the prerequisite for using these tools wisely and for understanding what remains to be built.




