Every user of a modern AI assistant has experienced the moment: you have been working through a complex document, asking questions, refining ideas, and suddenly the model responds as if the previous hour never happened. It contradicts itself, forgets the name you mentioned three times, or asks you to re-explain something you covered in detail. This is not a bug in the software. It is a fundamental constraint baked into the mathematics of how these systems work.
The constraint is called the context window, and it is the single most important limitation that separates current AI from the seamless digital assistant of science fiction. Understanding it does not require a computer science degree, but it does require abandoning the intuition that talking to an AI is like talking to a person with a memory.
The architecture of forgetting
A large language model processes text by splitting it into tokens, sub-word chunks that each average roughly three-quarters of an English word, and mapping each token to a number. These tokens flow through the model's neural network together, and the model can only attend to tokens that are present in its current input. The context window is simply the maximum number of tokens the model can accept at once.
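To make the arithmetic concrete, here is a minimal Python sketch that estimates token counts from the three-quarters-of-a-word rule of thumb. The names `estimate_tokens` and `fits_in_window` are hypothetical, and the ratio is only an approximation for English prose; a real tokenizer will produce different counts.

```python
import math

# Rule of thumb from the text: one token is about 0.75 of a word.
# This is an assumption for illustration, not a real tokenizer.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Estimate the token count of `text` from its whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_in_window(text: str, window_tokens: int) -> bool:
    """Return True if the estimated token count is within the window."""
    return estimate_tokens(text) <= window_tokens
```

By this estimate, a 750-word passage comes to about 1,000 tokens, which would just fill a window of that size.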
Early models like GPT-2 had context windows of about 1,000 tokens, perhaps 750 words. GPT-4 launched with windows of 8,192 and 32,768 tokens, and GPT-4 Turbo later expanded this to 128,000 tokens, roughly 96,000 words, or about the length of a full novel. Some newer models claim windows exceeding a million tokens. But bigger is not always better in practice. Research has shown that models often perform worse on information placed in the middle of long contexts, a phenomenon researchers call the "lost in the middle" problem. The model tends to use information most reliably when it appears near the beginning or end of its input.
Crucially, the context window is not memory in any human sense. It is more like a whiteboard that gets erased and rewritten with each response. When you start a new conversation, the whiteboard is blank. When you continue an old one, the application pastes the prior exchanges back onto the whiteboard until they no longer fit, at which point the oldest material is typically dropped or compressed.
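The whiteboard behavior can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: `build_prompt` and `rough_token_count` are hypothetical names, the token count is a crude word-based estimate, and the policy shown (drop the oldest turns first) is only one of several an application might use.

```python
import math
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude estimate: about 0.75 words per token (an assumption)."""
    return math.ceil(len(text.split()) / 0.75)


def build_prompt(
    history: List[str],
    new_message: str,
    max_tokens: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Return the most recent turns that fit in the window, plus the new message.

    Walks backward from the newest turn and stops at the first turn that
    would overflow the remaining budget, so older turns are forgotten first.
    """
    budget = max_tokens - count_tokens(new_message)
    kept: List[str] = []
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break  # this turn and everything older no longer fits
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept + [new_message]
```

Anything the function leaves out is simply absent from the model's input on the next turn, which is why details from early in a long conversation can seem to vanish.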
Why this matters for real work
The context window explains why AI assistants excel at self-contained tasks and struggle with extended projects. Ask a model to summarize a single article, and it performs admirably. Ask it to help you revise a manuscript over several sessions, and it will lose the thread. The model has no persistent understanding of your project; it only knows what fits on the whiteboard right now.
This limitation shapes how sophisticated users interact with AI. They learn to front-load critical information, to paste relevant context into each query, to treat the model as a brilliant but amnesiac collaborator. Some applications attempt to work around the constraint by automatically summarizing prior conversation and injecting that summary into new prompts, but summaries lose detail. Others use retrieval systems to pull relevant documents into the context on demand, a technique that helps but introduces its own errors, because the retriever can surface the wrong passages or miss the right ones.
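A toy version of the retrieval approach, written in Python, shows both the idea and its weakness. Everything here is a simplified assumption: `retrieve` and `build_query_prompt` are hypothetical names, and scoring by shared words stands in for the more sophisticated similarity search that real retrieval systems generally use.

```python
from typing import List


def retrieve(query: str, documents: List[str], k: int = 2) -> List[str]:
    """Return up to k documents that share the most words with the query."""
    query_words = set(query.lower().split())
    scored = []
    for doc in documents:
        overlap = len(query_words & set(doc.lower().split()))
        scored.append((overlap, doc))
    # Highest overlap first; ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for overlap, doc in scored[:k] if overlap > 0]


def build_query_prompt(query: str, documents: List[str], k: int = 2) -> str:
    """Front-load the retrieved context, then append the question."""
    context = retrieve(query, documents, k)
    return "\n".join(["Context:"] + context + ["Question: " + query])
```

Because the match is purely lexical, a question phrased in different words from the stored document retrieves nothing at all, a small-scale version of the errors that retrieval can introduce.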
The constraint also explains why AI cannot yet replace knowledge workers whose value lies in accumulated understanding of a client, a codebase, or a regulatory environment. The model knows nothing it was not told in the current session or trained on before deployment.
Our take
The context window is not a temporary limitation awaiting a software update. It reflects deep tradeoffs in computational cost and hardware: in standard transformer attention, every token is compared with every other token, so compute and memory grow roughly with the square of the context length. Future models will expand the window, and clever engineering will mitigate its effects, but the fundamental architecture means AI assistants will remain goldfish for the foreseeable future, brilliant within their bowl and oblivious to anything outside it. Users who internalize this constraint will get far more value from these tools than those who keep expecting them to remember.




