Last week a friend messaged me: "I've uploaded a document to ChatGPT and asked it to do an analysis. It did good. Then I uploaded the next one and suddenly it forgot half of my instructions and spouted total nonsense". She'd hit 80% of the context window.
I've had dozens of similar conversations over the last few months. A person starts using LLMs and at first everything is great: the code does what it's supposed to, the summaries are on point. But as days pass the ugly truth shows its face. The code works as long as the task is straightforward and concise. Otherwise, it's a minefield. Summaries that were great for a single document fail on two and lose important details.
If you're using LLMs or agents, you experience this daily. The root cause is something well known but commonly misunderstood: the context window. On the surface it sounds like a size limit, the number of tokens a model can hold at once. But the context window isn't a container that holds your data intact until it overflows. It's a compression stage. Lossy from the first token.
That makes every LLM a probabilistic semantic compressor, at least it's not "just a monoid in the category of endofunctors", thank god for that.
It's probabilistic because every token choice depends on random and semi-random conditions, semantic because it operates on meaning, and a compressor because it always loses information.
LLM Basics (skip ahead if you already know how LLMs work)
Large Language Models are trained on massive amounts of text. During training they learn statistical patterns: what words tend to follow what other words, in what contexts, across billions of documents. The result is a compressed representation of language patterns.
When you send a prompt, the model breaks your text into tokens. Tokens are not words. They're chunks of words. "serendipity" becomes [ser][end][ip][ity]. Each token gets converted into a vector, a list of numbers that represents its meaning in context.
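To make the splitting concrete, here's a toy tokenizer that reproduces the split above. Real tokenizers (BPE, SentencePiece) learn their vocabulary from data and work differently under the hood; this is just a greedy longest-match over a hand-picked vocabulary, for illustration only.

```python
# Toy subword tokenizer: greedy longest-match over a hand-picked vocabulary.
# Real tokenizers learn their merges from data; this only illustrates why a
# word gets split into sub-word chunks.
VOCAB = {"ser", "end", "ip", "ity", "token", "s"}

def tokenize(word: str) -> list[str]:
    tokens, i = [], 0
    while i < len(word):
        # try the longest vocabulary piece that matches at position i
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # unknown character: fall back to 1 char
            i += 1
    return tokens

print(tokenize("serendipity"))  # → ['ser', 'end', 'ip', 'ity']
```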
Generation works one token at a time. The model looks at all tokens so far (your prompt plus anything it already generated) and predicts the most likely next token. Then it appends that token and repeats. This is called autoregressive generation.
The key word is "predicts." The model doesn't "know" the answer. It calculates a probability distribution over all possible next tokens and samples from it. Sometimes the most probable token is correct. Sometimes it isn't. This is why the same prompt can produce different outputs, and why LLMs can be confidently wrong.
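The predict-sample-append loop can be sketched in a few lines. The "model" here is a hypothetical hand-written bigram table; a real LLM computes the distribution with a neural network conditioned on the entire preceding sequence, not just the last token.

```python
import random

# Hypothetical bigram "model": next-token probabilities, hand-written for
# illustration. A real LLM conditions on the whole preceding sequence.
NEXT_TOKEN_PROBS = {
    "the":  {"cat": 0.5, "dog": 0.3, "bank": 0.2},
    "cat":  {"sat": 0.7, "slept": 0.3},
    "dog":  {"barked": 0.6, "sat": 0.4},
    "bank": {"closed": 0.5, "flooded": 0.5},
}

def generate(prompt, max_new_tokens=3):
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        dist = NEXT_TOKEN_PROBS.get(tokens[-1])
        if dist is None:  # no known continuation: stop
            break
        words, probs = zip(*dist.items())
        # sample from the distribution instead of always taking the argmax --
        # this is why the same prompt can yield different outputs
        tokens.append(random.choices(words, weights=probs)[0])
    return tokens

print(generate(["the"]))  # e.g. ['the', 'cat', 'sat'] -- varies run to run
```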
Everything else (chat, code generation, summarization, agents) is built on top of this token-by-token prediction loop.
Attention is all you need and you can't get enough of it
The paper "Attention Is All You Need" gave rise to modern LLMs. The core idea is concrete: LLMs generate tokens one by one, and for every token they ask one question: which other tokens are related to the current one? In the simplest case this helps distinguish "a bank" as a financial institution from "a river bank". In more complex cases it connects semantically similar entities across the entire prompt.
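The scoring step can be sketched as scaled dot-product attention for a single query token. The 4-dimensional "embeddings" below are made up for illustration; real models use thousands of dimensions and learned projections.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention for one query: score every key by
    similarity to the query, then normalize the scores into weights."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    return softmax(scores)

# Hypothetical 4-dim embeddings: the token "bank" attends to its neighbors.
query = [0.9, 0.1, 0.0, 0.2]        # "bank"
keys = [[0.8, 0.2, 0.1, 0.1],       # "river"  -- points in a similar direction
        [0.1, 0.9, 0.0, 0.0],       # "the"
        [0.0, 0.1, 0.9, 0.1]]       # "walked"
weights = attention_weights(query, keys)
print(weights)  # "river" gets the largest weight, disambiguating "bank"
```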
Attention is a memory hog. For a frontier model, each token can consume over a megabyte of memory in attention state. A 128K context window means north of 100GB just for the prompt. This is what limits the context window: the number of tokens an LLM can hold at once.
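The arithmetic behind those numbers is the KV cache: every generated token's key and value vectors are kept in memory for every layer and every attention head. The dimensions below are made up but plausible (real frontier configs vary and are often undisclosed).

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    # Each layer stores one key vector and one value vector per KV head;
    # fp16/bf16 takes 2 bytes per value.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Hypothetical dimensions: 64 layers, 32 KV heads, head_dim 128, bf16.
per_token = kv_cache_bytes_per_token(64, 32, 128)
print(per_token / 2**20)            # → 1.0 MiB per token
print(per_token * 128_000 / 2**30)  # → 125.0 GiB for a full 128K prompt
```

Techniques like grouped-query attention shrink `n_kv_heads` to cut this cost, which is exactly the kind of efficiency/quality tradeoff mentioned next.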
People are working on making this more efficient. Sometimes those optimizations trade off quality, but that's a different story.
What matters for users: attention is finite. A model can't match every token to every other token in practice; the cost of comparing all pairs grows quadratically with prompt length. It's a computational nightmare.
The consequence: every single token in a prompt consumes attention. You asked for 2+2, but also included a bunch of random numbers? The model will likely fail, distracted by irrelevant content.
The bigger the model, the less often it fails on simple cases. But the failures become more subtle and harder to spot. Every irrelevant token still affects overall performance.
Compression starts at token one
If distractors in the prompt were the only problem we might have an easier life. But no. Even if your prompt is 100% relevant, every token still consumes attention. As the prompt grows, a model attends less and less, and information gets lost: compressed or forgotten.
Contrary to common belief, the loss is immediate: it starts with the very first tokens and only grows with prompt size. Not after 50% of context, not after 20%. By the time context reaches 70%, the model is severely degraded. Popular evals like needle-in-the-haystack test idealized conditions: a single needle, not the dozens or hundreds that real prompts contain.
The compressor's two tradeoffs
Two tensions you can't escape.
More detail vs more noise
The model needs more details to answer your question better. At the same time, every word you add competes for attention and steers the model off target. Every subtask of your main task is a distractor. The more the compressor has to "keep in mind," the worse it represents any single part.
All the data vs fidelity
You have a document where all the details of a system are described with great precision, but there's a tiny little catch: the document is enormous by LLM standards, bigger than the context window.
You need all the details to build the system, but the model can't do the work because performance degrades long before the document is even processed. I would love to tell you there is good news, but alas, no. This is a fundamental problem at the intersection of LLM design, resource requirements, and information theory.
The problem is not lack of intelligence. Even with the best intelligence you have to compress the document to feed the LLM. But you can't know what to compress until you've understood the whole thing.
Your coding agent is a compressor in a trench coat
Here's how Claude Code, Codex and OpenCode work:
- Start a main agent
- Preload skill descriptions, general context, MCP servers
- The main agent does all the work, with the option of spawning subagents
- You're encouraged to treat the main agent as a capable entity and ask it to do complex tasks directly
- You're encouraged to save workflows as text in Skills
Now consider a typical coding session. The system prompt eats 4-8K tokens. Skill descriptions and MCP definitions add another 2-5K. Project context (your CLAUDE.md, file trees, conventions): 3-8K more. That's 10-20K tokens consumed before you type your first message. On a 200K context model, you've already spent 10% on overhead. And that is an extremely conservative figure.
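The overhead math above, as back-of-envelope arithmetic (midpoint figures, purely illustrative):

```python
# Back-of-envelope context budget, using midpoints of the ranges above.
CONTEXT_WINDOW = 200_000

overhead = {
    "system prompt": 6_000,            # midpoint of 4-8K
    "skills + MCP definitions": 3_500, # midpoint of 2-5K
    "project context": 5_500,          # midpoint of 3-8K
}

spent = sum(overhead.values())
print(spent, 100 * spent / CONTEXT_WINDOW)  # → 15000 7.5

# Quality falls off a cliff well before the window is full, so the
# practical budget (taking ~70% as the cliff) is smaller still:
practical = CONTEXT_WINDOW * 7 // 10 - spent
print(practical)  # → 125000 usable tokens, before you type a word
```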
It's not even about percentages: a seemingly innocent phrase from one of the Skill files can alter the trajectory of a session in a meaningful way. Sometimes a single word is enough to anchor the model to a specific mode.
You write a message, the agent reads some files, generates code, reads test output. By your third interaction you're at 40-60%. By your fifth, the compressor is operating fully in the degraded zone, carrying the full weight of every file it read, every failed test, every correction you made. And the system prompt, the skills, the MCPs are still there, still consuming attention, now completely irrelevant to the function you're debugging.
The fact that you call a "probabilistic semantic compressor" an "agent" doesn't change its nature.
Every single thing in this design promotes errors and performance degradation. The context is always full of things irrelevant to the immediate task. Instead of code-defined steps you have an LLM keeping the workflow in the context window.
Frontier models are improving, and error rates have dropped. Agents produce seemingly OK results for simple tasks. For non-trivial tasks, reliability hasn't moved. It's just hidden behind tons and tons of slop, creating an illusion of robustness that disappears the moment you need precision.
Building a proper coding agent is hard and much more expensive. The architecture has to be right, and you need to spawn a fresh agent for every subtask. That means you can't cache the prompt. Currently, agentic coders rely heavily on caches: they keep your conversation preprocessed for the duration of your session, saving a lot of money on inference. Without caches the economics become impossible to sustain. I believe this is one of the reasons why Anthropic fought against nonstandard harnesses recently.
Working with a compressor
Once you see the compressor, you can work with it instead of against it.
Treat every token as a cost. Each token in your context doesn't just take up space. It actively degrades the representation of everything else. That AGENTS.md your harness generated with /init is competing with the actual task at hand. Cut what isn't pulling its weight.
Know your compression ratio. A 200K context window doesn't give you 200K tokens of usable capacity. Depending on your task you may start seeing quality drops as early as 10%. It falls off a cliff past 70%. Your real token budget is much smaller.
Reset the compressor. The worst thing you can do is continue a long conversation hoping the model "remembers." It doesn't remember. It drags compressed residue of every previous exchange through attention. Start fresh.
Let code do bookkeeping. Every time you ask a model to "keep track" of something across steps, you're asking to preserve a specific piece of information while processing everything else around it. That's exactly what lossy compression drops first. Use code for tracking state, for iteration, for validation. Write intermediate results to files. The model should think, not manage.
Don't ask a model to do what code can solve. File operations, string manipulation, data transformation: use deterministic solutions. Running them through an LLM introduces error probability for zero benefit. A script that parses JSON gets it right every time. Running the same operation through an LLM gets it right most of the time, and "most of the time" compounds across steps.
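As a trivial illustration, extracting a number from a JSON payload with the standard library is exact every single time; the payload and field names below are hypothetical.

```python
import json

# Hypothetical payload -- the point is that parsing it is deterministic:
# same input, same output, every run, at zero inference cost.
raw = '{"user": "ada", "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}'

def total_quantity(payload: str) -> int:
    return sum(item["qty"] for item in json.loads(payload)["items"])

print(total_quantity(raw))  # → 3
```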
Build pipelines, not generic executors. A pipeline knows its steps: read spec, generate code, run tests, check output. Each step has a small targeted input and specific output. A generic executor keeps everything in one ballooning context window and hopes for the best.
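A minimal sketch of the pipeline idea, assuming a stubbed `llm()` call (the function names and the stub are hypothetical): each step is code-defined, takes a small targeted input, and only one step actually needs a model.

```python
def llm(prompt: str) -> str:
    """Stand-in for a real model call -- hypothetical, for illustration."""
    return "def add(a, b):\n    return a + b"

def read_spec(spec: str) -> str:
    # deterministic: validate the input instead of trusting the model to
    assert spec.strip(), "empty spec"
    return spec.strip()

def generate_code(spec: str) -> str:
    # the only step that needs reasoning: one small, targeted prompt
    return llm(f"Implement this spec:\n{spec}")

def check_output(code: str) -> bool:
    # deterministic validation: run the candidate against known cases
    namespace: dict = {}
    exec(code, namespace)
    return namespace["add"](2, 2) == 4

def pipeline(spec: str) -> str:
    clean = read_spec(spec)      # deterministic
    code = generate_code(clean)  # LLM, small context
    if not check_output(code):   # deterministic check
        raise RuntimeError("generated code failed validation")
    return code

print(pipeline("add two numbers"))
```

Each step's context is bounded by its input, so no single call carries the residue of the whole session.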
What can be deterministic must be deterministic. That's my favorite approach. Just don't put an LLM in unless you really need reasoning. For any work that demands reliability there is no other way.
The hype and the reality
I don't want to discourage anyone from using these tools. But in the hype economy we should understand the difference between a story and harsh reality.
My friend's ChatGPT didn't forget her instructions because it was stupid. It forgot because every token from the first document was still sitting in the attention window, competing for the model's focus. The instructions lost the competition. The compressor compressed them away.
That's the mechanism, whether you're uploading documents to ChatGPT or running a coding agent on a 50-file codebase.
Seeing the compressor won't fix the fundamental tradeoffs. But it changes the questions you ask. When the model produces garbage, you can trace what it was forced to compress. That's a question with an answer, and the answer tells you what to fix.
P.S. Attention eval
Want to see the degradation with your own eyes? I built the attention-eval project. It runs a battery of tests against any LLM and shows you where attention falls off. Fair warning: it makes a lot of requests and will cost you money. But seeing the curve for your specific model is worth it.