SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

The Agent That Finishes Is the One to Worry About

2026-06-03

Most agent failures are easy to recognize.

The tool call fails. The browser times out. The code does not compile. The agent loops. The cost spikes. The user sees the failure.

Those are annoying, but they are not the dangerous class.

The dangerous class is the agent that finishes.

It writes the report. It summarizes the results. It produces the chart. It says the experiment worked. It sounds coherent. It looks useful. It is plausible enough to pass a busy review.

Then, when someone checks the raw evidence, the support is weaker than the claim.

That is the failure mode ARIS names well: "plausible unsupported success."

ARIS is a May 2026 technical report from Ruofeng Yang, Yongcan Li, and Shuai Li at Shanghai Jiao Tong University. The project is an open-source autonomous research harness, but the interesting part is not that it automates research. The interesting part is the assumption underneath it.

The authors write:

"Any long-term task performed by a single agent is unreliable."

That is a more useful starting point than most agent reliability discussions.

Not because every agent is bad. Because a long-running agent is a distributed system with memory, state, tools, retries, intermediate artifacts, incentives, and hidden failure modes. If the same system generates the work, interprets the work, summarizes the evidence, and decides whether the work is acceptable, you do not have verification.

You have self-confirmation with better prose.

Long-horizon agents will not be trusted because they sound coherent. They will be trusted because their claims can be replayed against evidence.

The Blind Spot Is Not Hallucination

"Hallucination" is too small a word for this problem.

It makes the failure sound like a sentence-level defect: the model invented a citation, a number, or a fact. That happens, but long-horizon agents create a more structural risk.

They can produce real work and still make unsupported claims about it.

An experiment may have run, but the chart may report the wrong comparison. A metric may be valid, but the conclusion may outrun it. A result file may exist, but the manuscript may round it into a stronger claim. A code change may pass a local test, but the agent may report it as production-ready.

The artifact is not fake. The interpretation is loose.

That is what makes plausible unsupported success hard to catch. It is not the absurd answer. It is the answer that carries enough truth to survive casual inspection.

This maps directly to production agent systems.

A sales agent can generate a qualified account list from real data while overstating intent. A coding agent can ship a patch that passes generated tests while missing the failure path those tests never covered. A research agent can summarize a paper accurately, then attach an implication the paper does not support. A compliance agent can assemble real audit evidence while asserting a level of coverage the evidence does not actually support.

The failure is not just reasoning.

It is evidence management.

ARIS Treats Assurance as a Layer

ARIS is useful because it does not try to solve this with a better prompt.

It separates the system into three layers:

Execution, where the agent performs work through reusable skills, model integrations, persistent research state, and generated artifacts.
Orchestration, where workflows coordinate long-running tasks, effort settings, routing, and iteration.
Assurance, where claims are checked against evidence by a reviewer outside the executor's own framing.

That third layer is the important one.

ARIS uses a cross-model executor-reviewer pattern. The executor drives the work. A reviewer from a different model family critiques intermediate artifacts and asks for revisions. The goal is not to create theatrical model debate. The goal is to reduce correlated blind spots.

Same-model review has a structural weakness: the generator and the judge often share assumptions. They may compress the same evidence the same way. They may miss the same missing control. They may accept the same framing.

The ARIS paper is careful here. It does not prove cross-model review is always better. It calls the current evidence early and observational. That restraint matters.

The design principle still carries.

For high-consequence agent work, the executor should not be the final authority on whether the executor succeeded.

That is standard engineering discipline. We do not let the producer of a result be the only validator of the result. We use tests, monitors, peer review, separation of duties, audit logs, and deployment gates because unchecked self-attestation fails under pressure.

Agents do not remove that rule. They make it more important.
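A minimal sketch of that separation, assuming a hypothetical harness that tags each model with a family label. `ModelRef`, `assign_reviewer`, and the family strings are illustrative names, not part of ARIS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelRef:
    name: str    # hypothetical model identifier
    family: str  # hypothetical family label, e.g. "family-a"


def assign_reviewer(executor: ModelRef, candidates: list[ModelRef]) -> ModelRef:
    """Pick the first candidate reviewer from a different model family."""
    for candidate in candidates:
        if candidate.family != executor.family:
            return candidate
    # No independent reviewer available: fail closed instead of
    # letting the executor approve its own work.
    raise LookupError("no reviewer outside the executor's model family")


executor = ModelRef("exec-model", "family-a")
pool = [
    ModelRef("exec-model-mini", "family-a"),  # same family, skipped
    ModelRef("review-model", "family-b"),
]
reviewer = assign_reviewer(executor, pool)
print(reviewer.name)  # review-model
```

The design choice worth noting is the failure path: when no independent reviewer exists, the sketch raises rather than falling back to self-review.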

The Claim Ledger Is the Unit of Trust

The strongest ARIS idea is not cross-model review by itself. It is the evidence-to-claim audit cascade.

The paper describes three stages:

Experiment-integrity audit: inspect evaluation code and outputs for integrity failures.
Result-to-claim mapping: convert experimental findings into explicit claim verdicts.
Paper-claim audit: use a fresh, zero-context reviewer to compare manuscript claims against the claim ledger and raw evidence.

That is the architecture.

Not "ask another model if this looks right."

Build a ledger of claims. Attach evidence. Mark each claim as supported, partially supported, or invalidated. Downgrade claims when the underlying evidence has integrity problems. Then make the final written artifact pass through that ledger.

This is how agent reliability becomes concrete.
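One way to picture the ledger, as a sketch under stated assumptions rather than the ARIS implementation. The three verdicts come from the text above; the `Claim` fields, the one-step downgrade rule, and the example paths are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    PARTIAL = "partially supported"
    INVALIDATED = "invalidated"


@dataclass
class Claim:
    text: str
    evidence: list[str]        # hypothetical artifact paths or IDs
    verdict: Verdict           # assigned during result-to-claim mapping
    integrity_ok: bool = True  # outcome of the experiment-integrity audit


def effective_verdict(claim: Claim) -> Verdict:
    """Assumed policy: an integrity problem downgrades 'supported' to 'partial'."""
    if not claim.integrity_ok and claim.verdict is Verdict.SUPPORTED:
        return Verdict.PARTIAL
    return claim.verdict


def allowed_in_final(ledger: list[Claim]) -> list[str]:
    """Claims that may appear unqualified in the final written artifact."""
    return [
        c.text
        for c in ledger
        if c.evidence and effective_verdict(c) is Verdict.SUPPORTED
    ]


ledger = [
    Claim("Method beats baseline on dataset X", ["runs/x.json"], Verdict.SUPPORTED),
    Claim("Method is robust to noise", ["runs/noise.json"], Verdict.SUPPORTED,
          integrity_ok=False),
    Claim("Method generalizes broadly", [], Verdict.SUPPORTED),
]
print(allowed_in_final(ledger))  # ['Method beats baseline on dataset X']
```

The second claim is downgraded because its evidence failed the integrity audit, and the third is held back because nothing is attached to it.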

A claim ledger gives the system something to audit besides conversation. It turns "the agent said it completed the task" into a set of checkable objects:

What claim did the agent make?
Which artifact supports it?
Which run produced the artifact?
Which configuration was used?
Which reviewer checked it?
What status did the reviewer assign?
What changed after review?

That is the difference between a transcript and an evidence architecture.

Transcripts are useful for forensics. They are not enough for trust. They are too long, too narrative, too easy to summarize incorrectly, and too dependent on the model's own framing.

Evidence has to be structured.

For coding agents, the equivalent is not "the agent says tests pass." It is a run record tied to commit hash, test command, environment, artifacts, logs, and known skipped cases.

For analytics agents, it is not "the agent says revenue is up twelve percent." It is a metric claim tied to source tables, the exact query, the time window, the transformation path, and the definition behind the number.

For research agents, it is not "the agent says the paper supports this conclusion." It is a claim tied to the exact source, quote or table, interpretation boundary, and reviewer status.

The hardest case is advice, where there is often no artifact at all. When an assistant tells a person what to do, the only honest evidence is the set of assumptions the recommendation rests on. That is the point where the assistant stops being a chatbot and becomes decision-support software, with a reliability contract to match.

This is why the harness matters.

The model produces candidate work. The harness decides whether that work becomes an accepted claim.

That is the runtime view of agents I described in "The Agent Runtime Era": the scarce resource is control, not raw capability.

Fresh Context Is a Governance Control

One detail in ARIS is easy to miss: some reviews use fresh, zero-context threads.

That is not a UX preference. It is a governance control.

If the reviewer inherits the executor's narrative, it may also inherit the executor's bias. The executor has already decided what matters. It has already compressed the artifact. It has already highlighted its own success path.

A fresh reviewer forces the system to inspect the artifact, not the story about the artifact.

This pattern should show up in production systems.

Critical reviews should not always receive the agent's summary. Sometimes they should receive the raw artifact path, the rubric, the expected evidence, and nothing else.

The distinction matters:

Summary-based review checks whether the story is coherent.
Artifact-based review checks whether the story is earned.

Most agent stacks overuse the first and underbuild the second.
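A sketch of what artifact-based review can mean in practice, assuming a hypothetical harness where the reviewer's input is an explicit packet. The type names and example values are illustrative. The point is that the executor's summary has no field to travel in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorOutput:
    summary: str        # the executor's own narrative about the work
    artifact_path: str  # where the raw artifact lives


@dataclass(frozen=True)
class ReviewPacket:
    artifact_path: str
    rubric: tuple[str, ...]
    expected_evidence: tuple[str, ...]


def fresh_context_packet(
    output: ExecutorOutput,
    rubric: list[str],
    expected_evidence: list[str],
) -> ReviewPacket:
    """Build the reviewer's input, deliberately dropping output.summary."""
    return ReviewPacket(
        artifact_path=output.artifact_path,
        rubric=tuple(rubric),
        expected_evidence=tuple(expected_evidence),
    )


output = ExecutorOutput(
    summary="Experiment succeeded; accuracy improved.",
    artifact_path="results/run-7/metrics.json",  # hypothetical path
)
packet = fresh_context_packet(
    output,
    rubric=["Is the comparison against the stated baseline?"],
    expected_evidence=["metrics.json contains baseline and treatment rows"],
)
print(packet.artifact_path)  # results/run-7/metrics.json
```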

The same principle applies to memory. If an agent's memory says "this action was approved," that is not enough. The reviewer needs the approval record. If memory says "previous run validated this approach," the reviewer needs the run record. If memory says "this source supports the claim," the reviewer needs the source and the exact support boundary.

Memory is useful for continuity.

Evidence is required for trust.
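The memory rule can be sketched the same way, assuming a hypothetical record store keyed by ID. A memory entry counts as evidence only if it points to a record that actually resolves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MemoryEntry:
    statement: str
    record_id: Optional[str] = None  # pointer to an approval, run, or source record


def resolve(entry: MemoryEntry, record_store: dict[str, dict]) -> Optional[dict]:
    """Return the backing record, or None if memory is only an assertion."""
    if entry.record_id is None:
        return None
    return record_store.get(entry.record_id)


# Hypothetical store contents, for illustration only.
record_store = {"approval-17": {"action": "deploy", "approved_by": "reviewer-2"}}

backed = MemoryEntry("this action was approved", record_id="approval-17")
bare = MemoryEntry("previous run validated this approach")

print(resolve(backed, record_store) is not None)  # True
print(resolve(bare, record_store) is not None)    # False
```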

Self-Improvement Needs an Acceptance Gate

ARIS also includes a prototype self-improvement loop: it records traces, proposes harness improvements, and adopts them only after reviewer approval.

That last clause matters.

Agent self-improvement is often framed as an intelligence problem. Can the system learn from experience? Can it refine its prompts? Can it improve its tools? Can it rewrite its skills?

The harder question is: who accepts the improvement?

If the same agent that failed also patches the rule that judges failure, you have a recursive governance problem. The system can optimize away friction. It can soften checks. It can route around expensive review. It can turn "approved" into whatever reduces resistance.

Serious self-improvement needs a separation between proposer and accepter.

The agent can propose a new skill, new rubric, new retry policy, or new context strategy. But the acceptance path needs an independent reviewer, a diff, a reason, a rollback path, and a record of which future runs were affected by the change.

That is not bureaucracy. That is how you keep a learning system from silently changing its own safety contract.
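A minimal sketch of such an acceptance gate. The `HarnessChange` fields mirror the list above (reviewer, diff, reason, rollback path, affected runs); the names and the specific checks are assumptions, not the ARIS design.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessChange:
    proposer: str
    diff: str
    reason: str
    rollback: str       # how to revert, e.g. a previous config version
    reviewer: str = ""
    approved: bool = False
    affected_runs: list[str] = field(default_factory=list)


def can_adopt(change: HarnessChange) -> tuple[bool, str]:
    """Decide whether a proposed harness change may take effect."""
    if not (change.diff and change.reason and change.rollback):
        return False, "missing diff, reason, or rollback path"
    if not change.reviewer or change.reviewer == change.proposer:
        return False, "reviewer must be independent of the proposer"
    if not change.approved:
        return False, "awaiting reviewer approval"
    return True, "adopt and record affected runs"


change = HarnessChange(
    proposer="agent-1",
    diff="retry_policy: 3 -> 5",
    reason="transient tool timeouts in recent traces",
    rollback="restore harness config v12",  # hypothetical version label
    reviewer="agent-1",                     # self-review, should be rejected
    approved=True,
)
print(can_adopt(change))
# (False, 'reviewer must be independent of the proposer')
```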

In distributed systems terms, the self-improvement loop is a control plane mutation. Treat it like one, with the same discipline I argue for in "The Harness Is the Control Plane."

The Model Layer Is Adopting the Same Vocabulary

The harness side of this says long-running agents need state, claims, artifacts, reviewers, and evidence gates. The model side is starting to use the same language.

Nex AGI's Nex-N2-Pro model card, for example, frames its "Agentic Thinking" loop around turning reasoning into actions that are executable, verifiable, and iterable. Treat any benchmarks on that card as vendor-reported; the signal here is the vocabulary, not the numbers.

Models are being trained around execution loops, and harnesses are being built around evidence loops. Both are converging on the same idea: an agent is not a chat transcript with tools attached. It is a system that acts, observes, records, audits, revises, and exposes evidence.

What This Does Not Solve

Evidence architecture reduces unsupported claims. It does not prove correctness.

A claim ledger makes a claim checkable. It does not make the check infallible. A cross-model reviewer can still miss a flaw, bring a blind spot of its own that the executor happened to avoid, or approve work that is wrong in a way nobody wrote a check for. ARIS is honest about this. It calls the current evidence for cross-model review early and observational, not settled.

The assurance layer also has a cost. Every reviewer pass, fresh-context audit, and integrity check adds latency and tokens. Fresh context trades inherited bias for lost context: a zero-context reviewer cannot be misled by the executor's framing, but it also cannot use legitimate prior knowledge, so it will sometimes flag things that are actually fine. These are tradeoffs to tune per consequence level, not free wins.
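Tuning per consequence level can be as plain as a lookup table. The sketch below is illustrative only: the level names, the number of reviewer passes, and the flags are assumptions, not recommended values.

```python
# Hypothetical review policy keyed by consequence level.
REVIEW_POLICY = {
    "low": {"reviewer_passes": 0, "fresh_context": False, "integrity_audit": False},
    "medium": {"reviewer_passes": 1, "fresh_context": False, "integrity_audit": True},
    "high": {"reviewer_passes": 2, "fresh_context": True, "integrity_audit": True},
}


def review_plan(consequence: str) -> dict:
    """Look up the review plan for a task's consequence level."""
    # Unknown levels fall back to the strictest plan rather than the cheapest.
    return REVIEW_POLICY.get(consequence, REVIEW_POLICY["high"])


print(review_plan("medium")["reviewer_passes"])  # 1
print(review_plan("unlabeled")["fresh_context"])  # True
```

Defaulting unknown levels to the strictest plan spends extra latency and tokens on unlabeled work, which is usually the cheaper mistake.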

And none of this removes human responsibility. Evidence architecture decides what becomes an accepted claim inside the system. A person still owns whether the system should have been trusted with the task at all.

What it buys is narrower, and still worth it. Fewer claims that outrun their evidence, and a record you can replay when one slips through.

What Builders Should Take From This

If you are building long-horizon agents, do not start with "which model is smartest?"

Start with the claim path.

What claims can the agent make? What evidence is required for each claim? Which claims require independent review? Which reviewer gets raw artifacts rather than summaries? Which claims are blocked when evidence is missing? Which claims can be downgraded from "supported" to "partial" instead of being forced into pass/fail? Which accepted claims become reusable memory? Which rejected claims are retained so the agent does not rediscover the same weak path?

This is a different product surface than chat.

It includes:

Claim ledgers
Evidence packs
Fresh-context review
Reviewer routing
Artifact contracts
Approval gates
Downgrade states
Trace retention
Rollback for harness changes

None of this sounds like model magic.

That is the point.

The next reliability gains in agentic systems will come from boring architecture around extraordinary models.

The model and harness layers are both moving toward agent-native execution. That makes the assurance layer more important, not less. As agents become more capable, they will produce more work than humans can inspect line by line. The only scalable response is to make the work auditable by design.

Plausible unsupported success is the failure mode to design against.

Not because it is dramatic.

Because it is quiet.

And quiet failures are the ones that make it furthest downstream.

Sources

Ruofeng Yang, Yongcan Li, Shuai Li, "ARIS: Autonomous Research via Adversarial Multi-Agent Collaboration," arXiv:2605.03042, submitted May 4, 2026: https://arxiv.org/abs/2605.03042
ARIS project repository: https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep
Nex AGI, Nex-N2-Pro model card: https://huggingface.co/nex-agi/Nex-N2-Pro

Vendor and project-reported metrics are treated as directional, not independent benchmark results.