SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

Model Escalation Is a Tool Call

June 11, 2026 #Agentic AI

On June 10, 2026, OpenRouter announced Advisor, a server-side tool that lets one model consult another model mid-generation.

The obvious reading is cost optimization.

Run the cheap model most of the time. Let it call a more expensive model when it gets stuck. OpenRouter's example points at the spread directly: a low-cost executor can handle routine work, while a frontier advisor is reserved for the moments that need stronger reasoning.

That reading is true.

But it is not the interesting part.

The interesting part is that escalation is becoming part of the agent's internal control flow.

The expensive model should not always be the worker. It should often be the escalation path.

That is a different architecture from routing per request.

The old routing model is too coarse

Routing between a cheap model and an expensive one is not new, and it is worth being precise about what already exists.

There are two established families. The first is the reactive cascade: call a cheap model, score its answer, and escalate to a stronger model only if the score is low. FrugalGPT introduced this for LLMs in 2023 and reported matching a top model's quality at up to 98 percent lower cost on its own benchmark. AutoMix added a self-verification step before deferring upward. The second family is the predictive router: decide which model should handle the query before generating anything. RouteLLM trained that decision on preference data, and a market of routers followed, from Martian to Not Diamond to OpenRouter's own auto-router.

Both families are mature. And both operate on the same unit: the query. A reactive cascade may call the cheap model first and escalate after scoring it; a predictive router decides up front. But in both cases the decision is made between whole model calls, about a single query, not inside one running generation of a long agent task.
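The two families can be sketched in a few lines. This is a minimal illustration, not FrugalGPT's or RouteLLM's actual method: the model functions, the scorer, the difficulty predictor, and the 0.7 threshold are all hypothetical stand-ins.

```python
from typing import Callable

# Hypothetical stand-ins: a "model" is any function from query to answer.
Model = Callable[[str], str]


def reactive_cascade(query: str, cheap: Model, strong: Model,
                     score: Callable[[str, str], float],
                     threshold: float = 0.7) -> tuple[str, str]:
    """Call the cheap model, score its answer, escalate only if the score is low."""
    answer = cheap(query)
    if score(query, answer) >= threshold:
        return "cheap", answer
    return "strong", strong(query)


def predictive_router(query: str, cheap: Model, strong: Model,
                      predict_hard: Callable[[str], bool]) -> tuple[str, str]:
    """Pick a model before anything is generated."""
    if predict_hard(query):
        return "strong", strong(query)
    return "cheap", cheap(query)


# Toy models and toy signals, for illustration only.
def cheap_model(q: str) -> str:
    return f"cheap:{q}"


def strong_model(q: str) -> str:
    return f"strong:{q}"


def toy_score(q: str, a: str) -> float:
    # Pretend short queries get confident answers.
    return 0.9 if len(q) < 20 else 0.3


def toy_predict_hard(q: str) -> bool:
    return "architecture" in q
```

Both functions take one query and return one answer. Neither has anywhere to express a decision made halfway through a long run.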

Treating the query as the unit works for short, single-turn work. If the user asks for a trivial rewrite, do not invoke the most expensive model in the stack. If the user asks for a subtle architecture decision, use a stronger model.

But agentic workloads do not reveal all of their difficulty up front.

An agent may begin with a normal request and discover the hard part halfway through:

a code edit touches a billing path
a migration has no rollback plan
a dependency changes the security boundary
a product decision depends on fresh market data
a customer-support answer requires policy interpretation
a proposed action is irreversible

At the request boundary, the task looked cheap.

Inside the run, it became expensive.

That is where request-level routing breaks down. It forces the system to either overpay for the whole run or under-provision the moments that actually need judgment.

Advisor-style escalation changes the unit of routing.

The unit is no longer the request.

The unit is the decision point.

Advisor is a runtime primitive, not a feature

OpenRouter's Advisor design has several details that matter architecturally.

The executor model calls Advisor like a tool. It passes a prompt describing what it needs help with. The advisor model returns guidance as the tool result. The executor keeps ownership of the final answer.

That split matters.

The advisor is not a ghostwriter. It is a consultant inside the loop.
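A toy loop shows the shape of that split. It does not use OpenRouter's API: `executor_model`, `advisor_model`, and `AdvisorCall` are hypothetical local stand-ins, and the "risky task" check is invented for the example.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class AdvisorCall:
    """A tool call the executor emits instead of a final answer."""
    prompt: str


def advisor_model(prompt: str) -> str:
    # Stand-in for the stronger model. It returns guidance, not the final answer.
    return f"Guidance: require a rollback plan first. (asked: {prompt})"


def executor_model(task: str, advice: Optional[str]) -> Union[AdvisorCall, str]:
    # Toy executor: asks for help once when the task looks risky, then answers itself.
    if "migration" in task and advice is None:
        return AdvisorCall(prompt="This migration has no rollback plan. Is it safe to ship?")
    if advice is None:
        return f"Done: {task}"
    return f"Done: {task} [advice applied: {advice}]"


def run(task: str) -> tuple[str, int]:
    """Drive the loop: advisor output returns to the executor as a tool result."""
    advice: Optional[str] = None
    consultations = 0
    while True:
        step = executor_model(task, advice)
        if isinstance(step, AdvisorCall):
            advice = advisor_model(step.prompt)
            consultations += 1
            continue
        # Only the executor produces the final answer.
        return step, consultations
```

In this sketch the advisor's text only ever reaches the caller through the executor, which is the ownership property the design depends on.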

The idea of escalating to a stronger model is not new. The cascade literature has done cheap-first-then-escalate since 2023, and vendors already frame model choice as a capability and cost tradeoff. What Advisor changes is the control point. The escalation is no longer a wrapper grading separate calls, or a route chosen before generation, or a usage pattern a developer wires by hand. It is a tool the executing model invokes itself, mid-generation, while it keeps writing.

OpenRouter also makes the advisor configurable. It can be any model in the catalog. It can have its own instructions. It can have its own tools, such as web search. A workflow can expose multiple named advisors: a security reviewer, an architect, a cost reviewer, a policy reviewer.

The system also has constraints. Recursion is blocked. Consultations are capped per request. Advisor tokens bill at advisor-model rates, separate from the executor. Advisor memory is reconstructed when prior advisor tool calls and results are replayed, and that memory is scoped per advisor.

Those details are not implementation trivia.

They are the shape of a production agent runtime.

Once a model can call a stronger model, the escalation path itself becomes something you have to govern.

Who can the executor ask? When is it allowed to ask? What context can it forward? How many times can it ask? Can the advisor use tools? Can the advisor see the full transcript, or only a compact packet? What does the run log: prompt content, cost metadata, advisor identity, final advice, or only routing telemetry?

Those are control-plane questions, the same ones I work through in The Harness Is the Control Plane.

Cost control becomes exception handling

The mature pattern is not "always use the cheap model."

The mature pattern is "use the cheap model until uncertainty crosses a threshold."

That is how engineering systems usually mature. We do not make every request take the slow path. We create fast paths, slow paths, retries, circuit breakers, fallbacks, approval gates, and escalation rules.

Agent runtimes need the same idea.

Advisor turns frontier reasoning into an exception path.

Routine work stays on the executor. Uncertain work gets escalated. Dangerous work may require a human. Repeated failure should stop the run.
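Those four outcomes can be written as one small dispatch function. This is a sketch under assumptions: the uncertainty score, the `irreversible` flag as a proxy for "dangerous", and both thresholds are hypothetical, and a real runtime would need its own way of producing them.

```python
from enum import Enum


class Path(Enum):
    EXECUTOR = "executor"   # routine work stays on the cheap model
    ADVISOR = "advisor"     # uncertain work gets a second opinion
    HUMAN = "human"         # dangerous work needs approval
    STOP = "stop"           # repeated failure ends the run


def choose_path(uncertainty: float, irreversible: bool, failures: int,
                escalate_at: float = 0.6, max_failures: int = 3) -> Path:
    """Pick the path for one decision point. Thresholds are illustrative."""
    if failures >= max_failures:
        return Path.STOP
    if irreversible:
        return Path.HUMAN
    if uncertainty >= escalate_at:
        return Path.ADVISOR
    return Path.EXECUTOR
```

The order of the checks is itself a policy choice: here a run that keeps failing stops before it can ask anyone for anything.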

That sounds simple until you build it. The hard part is not adding a tool definition. The hard part is deciding what earns the escalation.

In a coding agent, the triggers might be schema changes, missing tests, security-sensitive files, billing paths, rollback gaps, unfamiliar ownership, or a large diff.

In a payments workflow, the triggers might be payment authorization, refunds, subscription changes, or any irreversible charge.

In an enterprise workflow agent, the triggers might be customer data access, policy interpretation, permission changes, regulatory claims, or irreversible operations.

The cheap model can still do the work.

But the runtime should know when the cheap model needs a second opinion.
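For the coding-agent case, the triggers listed above can start as plain rules. A minimal sketch, assuming a hypothetical repository layout (`billing/`, `auth/`, `secrets/`, `migrations/`) and an arbitrary 400-line cutoff for a "large" diff:

```python
from dataclasses import dataclass

# Hypothetical path conventions; adjust to the repository in question.
SENSITIVE_PREFIXES = ("billing/", "auth/", "secrets/")
MIGRATION_PREFIX = "migrations/"


@dataclass
class Change:
    files: list[str]
    lines_changed: int
    has_tests: bool
    has_rollback: bool


def escalation_triggers(change: Change, large_diff: int = 400) -> list[str]:
    """Return the reasons this change earns a second opinion (empty means none)."""
    reasons = []
    if any(f.startswith(SENSITIVE_PREFIXES) for f in change.files):
        reasons.append("sensitive_path")
    if any(f.startswith(MIGRATION_PREFIX) for f in change.files):
        reasons.append("schema_change")
        if not change.has_rollback:
            reasons.append("rollback_gap")
    if not change.has_tests:
        reasons.append("missing_tests")
    if change.lines_changed > large_diff:
        reasons.append("large_diff")
    return reasons
```

Returning the reasons rather than a boolean keeps the "why" available for both the advisor prompt and the run log.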

The escalation packet matters

OpenRouter's cookbook makes a practical point that is easy to miss: the advisor should not automatically receive the full transcript.

That is the right default.

If every advisor call forwards the entire conversation, raw logs, full diffs, private customer data, and irrelevant history, the pattern stops being cost control. It becomes another way to move too much context into an expensive model.

The useful unit is a compact escalation packet.

For a code-review advisor, that packet might contain the decision being reviewed, the changed files or affected modules, the uncertainty signals, the shortest useful diff summary, and the specific question the executor cannot resolve.

For a policy advisor, it might contain the policy clause, the proposed action, the user's authorization state, and the reason the action is ambiguous.
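A code-review packet along those lines might look like the sketch below. The field names are my own, and the budget is counted in characters as a crude stand-in for tokens. Only the diff summary is trimmed, so the fixed fields must fit the budget on their own.

```python
from dataclasses import dataclass


@dataclass
class EscalationPacket:
    decision: str        # the decision being reviewed
    affected: list[str]  # changed files or affected modules
    signals: list[str]   # uncertainty signals that triggered the escalation
    diff_summary: str    # shortest useful diff summary
    question: str        # what the executor cannot resolve

    def render(self, max_chars: int = 1200) -> str:
        """Build the advisor prompt, trimming only the diff summary to fit."""
        fixed = (
            f"Decision: {self.decision}\n"
            f"Affected: {', '.join(self.affected)}\n"
            f"Signals: {', '.join(self.signals)}\n"
            f"Question: {self.question}\n"
        )
        label = "Diff summary: "
        room = max(0, max_chars - len(fixed) - len(label))
        return fixed + label + self.diff_summary[:room]


packet = EscalationPacket(
    decision="retry failed charges automatically",
    affected=["billing/charge.py"],
    signals=["billing_path", "no_tests"],
    diff_summary="x" * 5000,  # stands in for an oversized diff
    question="Is an automatic retry safe without an idempotency key?",
)
prompt = packet.render(max_chars=300)
```

The question and the decision are never truncated in this version. If something has to be cut, it is the evidence, not the ask.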

This is where agent architecture gets real.

Good escalation is not just "ask a better model."

Good escalation is context assembly under a budget.

The advisor should get enough state to improve the decision, but not so much state that it becomes an uncontrolled transcript sink.

Specialist advisors are interfaces

Named advisors are also more important than they look.

A security reviewer is not the same interface as an architect. A cost reviewer is not the same interface as a legal-risk reviewer.

Each advisor should have a narrow contract: what it reviews, what evidence it needs, what it is allowed to inspect, what kind of output it should return, and what decisions it cannot make.

This is where multi-agent systems often go wrong. Teams create a vague panel of agents and hope the discussion produces quality. That is usually theater.

A better design is boring:

one executor owns the task
advisors are bounded tools
each advisor has a defined role
each advisor receives a compact packet
each advisor returns a narrow answer
the executor remains accountable for the final output

That is less glamorous than a swarm.

It is also much closer to production.

The analogy is not a committee. It is an escalation tree.
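The narrow contract can be made explicit as data. A sketch with hypothetical field names and a made-up security reviewer, not anything OpenRouter defines:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdvisorContract:
    name: str
    reviews: str                          # what it reviews
    required_evidence: tuple[str, ...]    # packet fields it needs
    allowed_verdicts: tuple[str, ...]     # the only outputs it may return
    forbidden_decisions: tuple[str, ...]  # decisions it cannot make

    def missing_evidence(self, packet: dict[str, str]) -> list[str]:
        """List required packet fields that are absent or empty."""
        return [k for k in self.required_evidence if not packet.get(k)]

    def accepts(self, verdict: str) -> bool:
        """True if the advisor's verdict is within its contract."""
        return verdict in self.allowed_verdicts


SECURITY_REVIEWER = AdvisorContract(
    name="security_reviewer",
    reviews="changes that touch authentication or secrets",
    required_evidence=("diff_summary", "question"),
    allowed_verdicts=("approve", "needs_changes", "escalate_to_human"),
    forbidden_decisions=("merge", "deploy"),
)
```

The executor can refuse to call an advisor whose evidence is missing, and can reject a verdict that falls outside the contract instead of improvising around it.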

Recursion limits are circuit breakers

The recursion protection matters too.

If an advisor can call itself, or call another advisor that calls it back, cost control can turn into a cost incident. A system designed to improve quality becomes a loop generator.

This is not hypothetical paranoia.

Agents already fail through loops: retry loops, tool loops, search loops, clarification loops, self-critique loops, and repair loops. When each loop can invoke a frontier model, the blast radius becomes financial as well as operational.

So the advisor path needs circuit breakers: max consultations per request, max advisor tool calls, max advisor tokens, max wall-clock time, explicit failure behavior, and routing telemetry for every escalation.
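A per-run breaker covering three of those limits might look like this. It is a sketch with arbitrary default limits. The class and exception names are hypothetical, and failure behavior is left to the caller, which catches the exception and decides whether to fail closed.

```python
import time
from dataclasses import dataclass, field


class EscalationBudgetExceeded(RuntimeError):
    """Raised when the advisor path has spent one of its limits."""


@dataclass
class AdvisorBreaker:
    max_consultations: int = 3
    max_tokens: int = 20_000
    max_seconds: float = 60.0
    consultations: int = 0
    tokens: int = 0
    started: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        """Call before each consultation; raises if any limit is already spent."""
        if self.consultations >= self.max_consultations:
            raise EscalationBudgetExceeded("consultation cap reached")
        if self.tokens >= self.max_tokens:
            raise EscalationBudgetExceeded("advisor token budget spent")
        if time.monotonic() - self.started >= self.max_seconds:
            raise EscalationBudgetExceeded("wall-clock budget spent")

    def record(self, tokens_used: int) -> None:
        """Call after each consultation with the advisor tokens it consumed."""
        self.consultations += 1
        self.tokens += tokens_used
```

Checking before the call rather than after means the last allowed consultation can still overshoot the token budget. A stricter version would reserve tokens up front.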

The point is not to make escalation rare for its own sake.

The point is to make escalation bounded.

A second opinion is useful when it changes the decision. It is waste when it becomes ritual.

Billing follows control flow

Separate billing is the final clue.

If the executor and advisor bill independently, then cost is no longer just a property of the outer model. Cost follows the internal path the run took.

Four requests can start with the same executor and have different economics: one never escalates, one calls a security advisor once, one calls an architect and a cost reviewer, one hits the consultation cap and fails closed.

From the outside, all four may look like the same agent. Operationally, they are different executions.
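The arithmetic is simple once each call is priced at its own model's rate. The model names and per-million-token prices below are invented for illustration. They are not OpenRouter's or any vendor's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    model: str
    input_tokens: int
    output_tokens: int


# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {
    "executor-small": (0.10, 0.40),
    "advisor-frontier": (5.00, 25.00),
}


def run_cost(calls: list[Usage]) -> float:
    """Sum the cost of every call in a run, each billed at its own model's rate."""
    total = 0.0
    for call in calls:
        price_in, price_out = PRICES[call.model]
        total += (call.input_tokens * price_in + call.output_tokens * price_out) / 1_000_000
    return total


executor_work = Usage("executor-small", input_tokens=20_000, output_tokens=4_000)
one_consultation = Usage("advisor-frontier", input_tokens=1_500, output_tokens=500)

no_escalation = run_cost([executor_work])
one_escalation = run_cost([executor_work, one_consultation])
```

With these made-up numbers, the single compact advisor call costs more than all of the executor's work in the run. That is the reason the path a run took has to be visible, not just its outer model.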

That means agent observability has to include routing and escalation telemetry. Not raw private content by default, but the operational facts: executor model, advisor name, advisor model, whether the advisor was offered, whether it was invoked, why, token usage, returned cost, latency impact, and the final decision path.

Without that, teams will not know whether cost increases came from harder tasks, worse routing, larger prompts, unnecessary advisor use, provider pricing changes, or model behavior drift.
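One way to capture those operational facts is a flat event record per escalation opportunity. The field names are my own choice, and the record deliberately has no field for prompt or advice content.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EscalationEvent:
    executor_model: str
    advisor_name: str
    advisor_model: str
    offered: bool          # was the advisor available to the executor?
    invoked: bool          # did the executor actually call it?
    reason: str            # trigger that led to the call, if any
    advisor_tokens: int
    advisor_cost_usd: float
    latency_ms: int
    decision_path: str     # e.g. "executor>advisor>executor"


def to_log_line(event: EscalationEvent) -> str:
    """Serialize one event as a JSON line with stable key order."""
    return json.dumps(asdict(event), sort_keys=True)


event = EscalationEvent(
    executor_model="executor-small",      # hypothetical model names
    advisor_name="security_reviewer",
    advisor_model="advisor-frontier",
    offered=True,
    invoked=True,
    reason="sensitive_path",
    advisor_tokens=2_000,
    advisor_cost_usd=0.02,
    latency_ms=1_800,
    decision_path="executor>advisor>executor",
)
line = to_log_line(event)
```

Logging `offered` separately from `invoked` is what lets you later distinguish "the advisor was never needed" from "the advisor was never available."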

This is why agent FinOps cannot live only in a dashboard after the invoice arrives.

It has to live in the runtime.

The real shift

OpenRouter Advisor is still a beta server tool. The exact API will change. The product may evolve.

The architectural pattern is the part worth paying attention to.

The cascade concept is old. What is new is the control point. We are moving from routing the request to escalating the decision.

Routing asks: which model should handle this request?

Escalation asks: which decision inside this run deserves a stronger model, with what evidence, under what budget, and with what audit trail?

That is the more useful question for production agents.

Because long-running agents are not uniformly hard. They are mostly routine with occasional moments of risk, ambiguity, or irreversible consequence.

The right architecture should reflect that shape. Cheap execution by default. Frontier advice when earned. Human approval when authority is required. Circuit breakers when uncertainty compounds. Telemetry everywhere cost or authority changes.

That is how model routing becomes an operating-system concern, the runtime layer I described in The Agent Runtime Era.

Not a dropdown. Not a pricing trick. A control-flow primitive.

Sources

OpenRouter, "Advisor: Give Any Model a Lifeline to a Smarter One" (June 10, 2026): https://openrouter.ai/blog/announcements/advisor-server-tool/
OpenRouter documentation, "Advisor Server Tool" (accessed June 2026): https://openrouter.ai/docs/guides/features/server-tools/advisor
OpenRouter cookbook, "Build a Token-Efficient Review Agent": https://openrouter.ai/docs/cookbook/building-agents/advisor-server-tool
Chen, Zaharia, Zou, "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance" (arXiv 2305.05176, 2023): https://arxiv.org/abs/2305.05176
Ong et al., "RouteLLM: Learning to Route LLMs with Preference Data" (arXiv 2406.18665, 2024; ICLR 2025): https://arxiv.org/abs/2406.18665
Aggarwal, Madaan et al., "AutoMix: Automatically Mixing Language Models" (arXiv 2310.12963, 2023): https://arxiv.org/abs/2310.12963
Martian (model router): https://withmartian.com/
Not Diamond (model routing): https://www.notdiamond.ai/
OpenRouter Auto Router documentation: https://openrouter.ai/docs/guides/routing/routers/auto-router
Anthropic, "Choosing a model" (frames model choice as a capability and cost tradeoff): https://platform.claude.com/docs/en/about-claude/models/choosing-a-model

The figures here are reported by the vendors and papers cited, not independently verified. Read them as directional.