SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

Agentic AI

Context Engineering vs Prompt Engineering

2025-10-21

"Prompt engineering" is the wrong mental model for where agents are going.

It made sense when the product was a chat box.

It breaks the moment you ask an agent to persist, to improve over time, and to operate across channels, projects, and weeks.

The field already sensed this and quietly changed the word.

Prompt engineering is syntax. Context engineering is governance.

The center of gravity moved

This is not one company's idea. In June 2025, Shopify's CEO Tobi Lutke publicly argued that "context engineering" was a better description of the core skill than "prompt engineering," and Andrej Karpathy amplified the point: in any serious LLM application, the real work is filling the context window with the right information for the next step, not wording a clever prompt. Within the same two weeks, LangChain framed it as "building dynamic systems to provide the right information and tools in the right format," and Cognition, the team behind Devin, called it "the next level of prompt engineering." By September 2025, Anthropic had brought the framing into its own engineering guidance, reframing the question from wording to "what configuration of context is most likely to generate our model's desired behavior."

That is a real shift in the center of gravity. The serious agent builders adopted a better name for the skill.

But much of the popular framing stops at assembly: fill the window with the right information for the next step. The production problem goes further, into ownership, policy, and lifecycle: governing what persists across many steps, sessions, and weeks. The first problem is about assembly. The second is about state.

This post is about the second.

Problem: durable agents fail for boring reasons

Long-lived agents do not fail because the model cannot reason.

They fail because state is unmanaged.

Conversations grow until context windows blow up.
Old instructions linger and conflict with new instructions.
Preferences drift, or get overwritten by noise.
Retrieval becomes a slot machine.
Tool outputs bloat the prompt and dominate the budget.
Nobody can answer: "Why did the agent believe that?"

When the system fails, people reach for a better prompt.

That is like debugging a distributed system by rewriting the README.

You might get a short-term improvement.

You will not get reliability.

Blind spot: most "memory" features are not memory

Start with the thing everyone already calls memory.

Most platforms treat memory like a feature:

a /remember command
a vector store
"top-k memories" injected into the prompt

That is not learning.

That is a write API plus retrieval.

Learning requires a lifecycle:

deciding what counts as state
consolidating experience into durable representations
governing what is allowed to persist
assembling the right context for the next action under a budget

In other words, learning requires context engineering in the second, harder sense.

Architectural truth: context is an operating system

If you want durable agents, treat context like an operating system.

Not a longer prompt.

An OS has:

state
policies
permissions
scheduling
isolation boundaries
observability
and a way to recover when things crash

Agent context needs the same.

The useful artifact is not a prompt. It is a contract: a declaration of what state means, rules for how state is modified, and constraints on what is allowed.

Concretely, each class of memory in that contract has a shape. A user preference, for instance:

Type: durable preference
Owner: the profile subsystem, not the main loop
Write policy: updated only on explicit, repeated signal, never from a single offhand remark
Evidence: the interaction that justified the change
Retention: persists until contradicted, then superseded
Rollback: revertible to the prior value, with the change logged

Write that for every class of state you let an agent keep, and you have a context contract. Skip it, and you have a text file that grows.
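As a minimal sketch of what one entry in such a contract could look like in code, the example below encodes the preference fields above as a record plus a write policy. Every name here (PreferenceRecord, observe, rollback, REQUIRED_SIGNALS, the "profile_subsystem" owner string) is hypothetical, and reading "repeated" as "at least two explicit signals" is my assumption, not a rule from any particular framework.

```python
from dataclasses import dataclass, field
from typing import Optional

REQUIRED_SIGNALS = 2  # assumption: "repeated" means at least two explicit signals


@dataclass
class PreferenceRecord:
    """One durable preference, shaped like the contract fields above."""
    key: str
    value: Optional[str] = None
    owner: str = "profile_subsystem"                    # hypothetical owner name
    evidence: list[str] = field(default_factory=list)   # interactions that justified the value
    previous: Optional[str] = None                      # prior value, kept for rollback
    log: list[str] = field(default_factory=list)        # human-readable change log
    pending: dict[str, list[str]] = field(default_factory=dict)  # signals not yet promoted


def observe(record: PreferenceRecord, writer: str, candidate: str,
            interaction_id: str, explicit: bool) -> bool:
    """Apply the write policy. Returns True only if the stored value changed."""
    if writer != record.owner:      # only the owning subsystem may write
        return False
    if not explicit:                # offhand remarks never count as signal
        return False
    seen = record.pending.setdefault(candidate, [])
    seen.append(interaction_id)
    if len(seen) < REQUIRED_SIGNALS or candidate == record.value:
        return False
    # Supersede the old value, keeping it available for rollback.
    record.previous, record.value = record.value, candidate
    record.evidence = list(seen)
    record.log.append(f"set {record.key}={candidate!r} from {seen}")
    record.pending.clear()
    return True


def rollback(record: PreferenceRecord) -> None:
    """Revert to the prior value and log the change."""
    record.log.append(f"rollback {record.key} to {record.previous!r}")
    record.value, record.previous = record.previous, None
    record.evidence = []
```

A single explicit request leaves the stored value untouched; the second one promotes it and records both interactions as evidence. A real system would persist the record and the log rather than keep them in memory.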

That is a control plane move.

This maps directly to two principles I keep coming back to.

The first: context is an operating system. Identity, continuity, and policy live in context. Treat it as governed state, not a longer prompt.

The second: the harness is the product. Models commoditize. The moat shifts to the runtime: policy, approvals, audit trails, artifacts, sandboxes, memory lifecycle.

A contract without the harness is philosophy. A harness without the contract is chaos.

Durable agents require both.

What context ownership actually implies

Context ownership has concrete implications. None of them are exotic. They are the same controls any stateful system needs.

1) Versioned state, not invisible state

If memory is valuable, it must be inspectable.

A minimum bar:

you can view it
you can diff it
you can revert it
you can attribute where it came from

If you cannot do those things, the agent is not "learning."

It is accumulating a pile of ungoverned text.
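The minimum bar above fits in a few dozen lines. This is a sketch under simple assumptions: memory is a single text artifact, history lives in an in-process list, and the names (VersionedMemory, Revision, the source strings) are illustrative rather than taken from any existing library. Only difflib from the standard library is used for the diff.

```python
import difflib
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    text: str
    source: str  # where the change came from, e.g. a session or job id


class VersionedMemory:
    """An append-only history of one memory artifact."""

    def __init__(self, initial: str, source: str):
        self._revisions = [Revision(initial, source)]

    def view(self) -> str:
        """Return the current text."""
        return self._revisions[-1].text

    def write(self, text: str, source: str) -> int:
        """Append a new revision and return its index."""
        self._revisions.append(Revision(text, source))
        return len(self._revisions) - 1

    def diff(self, old: int, new: int) -> list[str]:
        """Unified diff between two revisions, as a list of lines."""
        a = self._revisions[old].text.splitlines()
        b = self._revisions[new].text.splitlines()
        return list(difflib.unified_diff(a, b, f"rev{old}", f"rev{new}", lineterm=""))

    def revert(self, to: int, source: str) -> int:
        # Reverting appends a new revision, so history is never rewritten.
        return self.write(self._revisions[to].text, source)

    def attribute(self, rev: int) -> str:
        """Return the recorded source of a revision."""
        return self._revisions[rev].source
```

Making revert an append rather than a truncation is a deliberate choice in this sketch: the revert itself becomes an attributable event.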

2) Scheduled consolidation, not ad hoc notes

Durable context cannot be written only at interaction time.

The main agent is busy:

responding
calling tools
making decisions under latency and budget constraints

The consolidation step belongs in the background.

Call it background reflection, sleep-time compute, or maintenance jobs.

Either way, the system needs a periodic process that:

reviews recent activity
promotes the signal
compresses the noise
and rewrites memory artifacts so the next session starts clean
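One way to picture that periodic process is as a pure function from (current memory, recent activity) to a rewritten memory. The sketch below assumes activity has already been reduced to (key, value) observations and treats "signal" as anything seen at least min_count times. Both are simplifications I am introducing for illustration. Scheduling the job (cron, a queue, an idle hook) is left out.

```python
from collections import Counter


def consolidate(memory: dict[str, str],
                recent_events: list[tuple[str, str]],
                min_count: int = 2) -> dict[str, str]:
    """Return a rewritten memory artifact; the input memory is not mutated."""
    counts = Counter(recent_events)          # review recent activity
    rewritten = dict(memory)
    # Walk events in order so a later repeated observation supersedes an earlier one.
    for key, value in recent_events:
        if counts[(key, value)] >= min_count:
            rewritten[key] = value           # promote the signal
        # Observations seen fewer than min_count times are dropped as noise.
    return rewritten
```

Because the function returns a new dict, the output can be written through whatever versioning or approval step the system already has instead of overwriting memory in place.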

3) Budgets and drop policies, not "stuff it all in there"

Context windows are getting bigger, but that does not remove the need for budgets.

This is measured, not theoretical.

Chroma's "context rot" study found that the 18 models it tested grew less reliable as input length grew, even on simple retrieval, and that distractors resembling the answer amplify the decline. Treat that vendor framing as directional.

It lines up with peer-reviewed work. NoLiMa, accepted at ICML 2025, reported many models falling below half their short-context accuracy by 32K tokens. Even GPT-4o, one of the stronger performers, dropped from 99.3 percent to 69.7 percent.

The "lost in the middle" effect (Liu et al., 2023) is the named version of the same problem: models use the start and end of a long context far better than the middle.

Bigger context simply postpones the crash, while making it more expensive.

So context engineering means:

hard token budgets per memory class
recency and scope rules
intent-scoped retrieval
and explicit drop policies
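A rough sketch of per-class budgets with an explicit drop policy follows. The assumptions are loud ones: whitespace word counts stand in for a real tokenizer, "recency" is a simple age integer, and the class names and budget numbers are made up. Intent-scoped retrieval is not shown. The point is only that what gets dropped is decided by a rule and returned to the caller, not lost silently.

```python
from dataclasses import dataclass


@dataclass
class Item:
    memory_class: str
    text: str
    age: int  # steps since the item was written; smaller is more recent


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words approximate tokens.
    return len(text.split())


def assemble(items: list[Item], budgets: dict[str, int]) -> tuple[list[Item], list[Item]]:
    """Keep the most recent items per class within budget; return (kept, dropped)."""
    kept, dropped = [], []
    spent = {name: 0 for name in budgets}
    for item in sorted(items, key=lambda i: i.age):      # most recent first
        limit = budgets.get(item.memory_class, 0)         # unknown classes get no budget
        cost = count_tokens(item.text)
        if spent.get(item.memory_class, 0) + cost <= limit:
            spent[item.memory_class] = spent.get(item.memory_class, 0) + cost
            kept.append(item)
        else:
            dropped.append(item)                          # explicit, inspectable drop
    return kept, dropped
```

With budgets of {"preferences": 10, "tool_output": 20}, a 50-word tool dump is dropped while a two-word status line and a five-word preference survive, and the dropped list says exactly what was left out.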

4) Governance at the point of execution

Every memory write is an irreversible action unless you make it reversible.

So treat memory writes like any other sensitive tool:

Intent: what are we changing, and why?
Authorization: is this class of memory allowed, or does it require approval?
Evidence: what experience supports the update?
Reversibility: can we revert cleanly?

This is not bureaucracy.

This is how you keep a long-lived system from drifting into contradiction.
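The four questions can be enforced as a gate in front of the store. This sketch is illustrative only: the POLICY table, its class names, the status strings, and the gate and undo functions are hypothetical, and a real harness would persist the undo log and route "pending" writes to an actual approval flow.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical policy table: which memory classes may be written automatically.
POLICY = {"scratch": "allow", "preference": "needs_approval", "credentials": "deny"}


@dataclass
class WriteRequest:
    memory_class: str
    key: str
    value: str
    intent: str               # what is changing, and why
    evidence: Optional[str]   # reference to the supporting interaction


def gate(store: dict, undo_log: list, req: WriteRequest, approved: bool = False) -> str:
    """Check a memory write against intent, authorization, evidence, reversibility."""
    if not req.intent or not req.evidence:
        return "rejected: missing intent or evidence"
    rule = POLICY.get(req.memory_class, "deny")   # unknown classes are denied
    if rule == "deny":
        return "rejected: class not allowed"
    if rule == "needs_approval" and not approved:
        return "pending: approval required"
    # Record the prior value first so the write can be reverted cleanly.
    undo_log.append((req.key, store.get(req.key)))
    store[req.key] = req.value
    return "applied"


def undo(store: dict, undo_log: list) -> None:
    """Revert the most recent applied write."""
    key, previous = undo_log.pop()
    if previous is None:
        store.pop(key, None)   # the key did not exist before the write
    else:
        store[key] = previous
```

A preference write without approval returns "pending: approval required" and leaves the store untouched; the same request with approved=True is applied and can be undone.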

Why labs optimizing for coding benchmarks is a real problem

There is a structural reason memory lags.

Coding benchmarks reward short-lived excellence. A SWE-bench-style run scores whether a model resolves an issue inside a single session, under a fixed context.

They measure:

local problem solving
within one session
under a fixed context

They do not measure whether an agent can:

remain coherent over months
improve from experience
keep a stable identity across model swaps
or avoid accumulating false beliefs in memory

A capability the scoreboard ignores tends to lag, and memory formation is that capability.

The market corrects this only when people run agents continuously and discover the failure modes are state failures.

That is when prompt engineering, and even context engineering in the narrow assembly sense, stops being the lever.

The takeaway

The field found a better name in 2025. The work now is to mean the harder thing by it.

Context engineering is not only assembling the right window for the next step. It is owning context as durable, governed state: principles and tooling for what persists, how it changes, and who is allowed to change it.

That is why the next generation of moats will look less like model choice and more like:

memory lifecycle
governance primitives
and control planes for state

State is the foundation, not an afterthought. It is the first of the four questions I ask about every agent, and the reason is simple.

Reasoning depth cannot exceed state stability.

That's architecture.

Sources

Tobi Lutke, on "context engineering" (X, June 2025): https://x.com/tobi/status/1935533422589399127
Andrej Karpathy, on "context engineering" (X, June 2025): https://x.com/karpathy/status/1937902205765607626
Harrison Chase / LangChain, "The rise of context engineering" (June 23, 2025): https://blog.langchain.com/the-rise-of-context-engineering/
Walden Yan / Cognition, "Don't Build Multi-Agents" (June 2025): https://cognition.com/blog/dont-build-multi-agents
Anthropic, "Effective context engineering for AI agents" (September 2025): https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Chroma, "Context Rot: How Increasing Input Tokens Impacts LLM Performance" (July 2025): https://research.trychroma.com/context-rot
Modarressi et al., "NoLiMa: Long-Context Evaluation Beyond Literal Matching" (ICML 2025): https://arxiv.org/abs/2502.05167
Liu et al., "Lost in the Middle: How Language Models Use Long Contexts" (TACL 2024): https://arxiv.org/abs/2307.03172
Jimenez et al., "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" (arXiv 2310.06770): https://arxiv.org/abs/2310.06770