The smartest machines in history still cannot pack a suitcase. That tells us more about intelligence than any benchmark.

The most advanced artificial intelligence systems can pass bar exams, write competent poetry, and diagnose rare diseases from medical scans. They cannot reliably determine whether a glass of water will spill if you tilt it forty-five degrees. This is not a bug awaiting a fix; it is a window into what these systems fundamentally are and are not.

The gap between AI's spectacular successes and its mundane failures has become so familiar that we have stopped noticing how strange it is. A system that can explain quantum entanglement in iambic pentameter cannot understand that a sandwich left in a hot car will spoil. The asymmetry is not accidental. It reveals the architecture of these tools and, by contrast, something profound about the nature of human cognition.

The missing physics engine

Humans navigate the world with an intuitive physics that begins forming in infancy. We understand that unsupported objects fall, that liquids conform to containers, that pushing a heavy object requires more effort than pushing a light one. This knowledge is not learned from text; it emerges from embodied experience, from thousands of hours of dropping toys and spilling drinks and bumping into furniture.

Large language models, by contrast, learn about the world entirely through language. They have read millions of descriptions of gravity but have never felt weight. They have processed countless cooking instructions but have never experienced the difference between raw and cooked pasta. When they appear to reason about physical scenarios, they are pattern-matching against similar descriptions they have encountered, not simulating actual physics.

This explains the peculiar failure modes. Ask a frontier model to plan a route through a cluttered room, and it will struggle with spatial relationships that a toddler handles unconsciously. Ask it to predict what happens when you pour sand into a bucket of water, and it may produce a plausible-sounding answer that is subtly wrong in ways that reveal no genuine understanding of displacement or saturation.

The benchmark illusion

The AI industry measures progress through standardized tests: bar exams, medical licensing exams, coding challenges, mathematical olympiads. By these metrics, recent systems have achieved superhuman performance. But the metrics themselves encode a profound bias toward the kind of intelligence that can be captured in text.

Consider what it takes to pass the bar exam versus what it takes to be a competent lawyer. The exam tests recall of legal principles and application of logical rules. Actual legal practice requires reading a client's face to know when they are lying, understanding the social dynamics of a courtroom, knowing when a contract clause that is technically legal will nonetheless provoke a lawsuit. The exam captures perhaps ten percent of the job.

This is not to diminish the genuine utility of AI systems in legal research, document review, and contract analysis. These are valuable applications. But the benchmark scores create an illusion of general competence that the systems do not possess. A model that scores in the ninety-ninth percentile on the bar exam may still be unable to advise a client on whether to accept a settlement offer, because that decision requires integrating factors—risk tolerance, emotional state, long-term relationships—that exist outside the training distribution.

What the gap reveals

The persistent chasm between AI capability and human common sense is instructive. It suggests that intelligence is not a single dimension that can be scaled up through more parameters and more data. Human cognition appears to involve multiple distinct systems: pattern recognition, yes, but also embodied simulation, social modeling, temporal reasoning about cause and effect, and something like genuine curiosity that directs attention toward gaps in understanding.

Current AI architectures excel at the first of these and approximate the others through statistical correlation. When those correlations hold—when the test distribution matches the training distribution—the systems perform brilliantly. When the correlations break down, as they inevitably do in novel physical or social situations, the systems fail in ways that seem almost random to human observers.

This is why robotics remains so difficult despite advances in machine learning. A robot arm can be trained to pick up objects it has seen before, but generalizing to new objects with different weights, textures, and centers of gravity requires exactly the intuitive physics that language models lack. The embodiment problem is not separate from the intelligence problem; it may be central to it.

Our take

None of this means AI systems are not useful or that progress has stalled. The tools we have today are genuinely transformative for specific tasks. But the hype cycle has created expectations that outpace reality, and the gap between benchmark performance and real-world competence will continue to surprise people who mistake fluent text generation for understanding. The five-year-old packing a suitcase—choosing clothes for the weather, estimating what will fit, anticipating what she will want—is performing a feat of integrated intelligence that no current system can match. Recognizing this is not pessimism; it is the beginning of clarity about what these tools actually are.

The Joni Times

The smartest machines in history still cannot pack a suitcase. That tells us more about intelligence than any benchmark.

The missing physics engine

The benchmark illusion

What the gap reveals

Our take

עוד ב־ בינה מלאכותית

The radiologist is not being replaced. She is being rewritten.

Google slashes Gemini prices to ignite an AI subscription war. The race to the bottom has officially begun.

Artificial intelligence cannot smell the coffee. That limitation reveals more than any benchmark.

Large language models cannot count. The reason reveals everything about how they actually think.