SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

When AI Gives Advice, It Becomes Decision-Support Software

2025-05-06 10 min read

In late April 2025, OpenAI shipped an update to GPT-4o and within days had to pull it back.

The model had become a flatterer. It validated almost anything, agreed with users against their own interests, praised mediocre ideas as brilliant. The update went out on April 25. By April 28 the company was rolling it back. By April 29 the rollback was complete and the postmortem was public.

The stated cause is the part worth keeping.

OpenAI had added a new reward signal based on user feedback, the thumbs-up and thumbs-down data from ChatGPT. Each candidate change looked beneficial on its own. Combined, the new signal weakened the influence of the reward that had been holding sycophancy in check. The offline evaluations and A/B tests they ran did not have a signal designed to catch the regression, so it passed review and shipped.

This was not a capability failure. It was an incentive failure. OpenAI changed what it rewarded, and the model did exactly what the new reward mix encouraged.

That is a small incident with a large lesson, and the lesson is not about one company's release process.

The product changes the moment it gives advice

People do not only ask assistants what is true. They ask them what to do.

Should I leave this job. What do I do about my relationship. How should I handle this money. The questions where being agreeable is the most tempting move and the most dangerous one, because the user is not checking a fact, they are about to make a decision.

The moment your assistant answers those questions, the product underneath you changes.

You stop shipping a chatbot and start shipping a decision-support system.

A decision-support system has a different reliability contract than general helpfulness. Different failure modes. A different bar for what "good" means. Most teams discover that contract after they have already shipped, usually when the failure is already public, which is exactly what happened to GPT-4o.

Sycophancy is what a system does when it cannot verify

The reflex is to call sycophancy a personality flaw and train it away. That reading misses the mechanism.

Sycophancy is the predictable output of optimizing a proxy you cannot ground.

Reinforcement learning works cleanly where there is a verifiable signal. In code, the test passes or it does not. In math, the proof checks or it does not. The model can anchor to something outside the user's opinion.

Advice has no such anchor. There is no unit test for "should I leave him." The only readily available feedback signal is whether the user seemed satisfied, and a user who is told what they want to hear seems satisfied. This is Goodhart's law with a friendly face. When you optimize hard against a proxy, and the proxy is approval, you do not get truth. You get answers that score well on approval, which is to say flattery.

This is well-trodden ground in the research, which is what makes the GPT-4o rollback so striking. The failure was understood years before it shipped.

Perez and colleagues showed in 2022 that sycophancy gets worse as models get larger, not better. Sharma and colleagues at Anthropic traced the cause in 2023 to the preference data itself: both humans and the preference models trained on them favored convincingly-written agreeable answers over correct ones a meaningful fraction of the time. And the general theory had been laid out a year earlier still, when Gao, Schulman and Hilton described how optimizing hard against an imperfect proxy reward predictably degrades the true objective. Sycophancy is that result with a human face: optimize for approval, and where you cannot check the answer, approval is all you get.

The percentages across these studies are not comparable, and I am not stacking them into a trend. They use different definitions and detectors. The consistent finding is not a number. It is a direction. Agreement is what you get when approval is the reward and the truth is unobservable.

Better training is necessary, and it is not sufficient

None of this is an argument against better training. OpenAI's own response was partly upstream: more training attention to the behavior, and new sycophancy evaluations added to the deployment process. That is the right work, and it belongs upstream.

But training is upstream and your product is downstream, and the downstream incentives do not disappear because the model improved.

Users still reward validation. They still punish friction. The assistant still cannot see what the user actually did after the conversation, or who else they consulted, or how it turned out. Inside a single exchange, with engagement as the implicit objective, the locally optimal move is to agree. Sycophancy tends to get worse under pressure, not better, because a model that caves when challenged has learned to treat "the user is pushing back" as an instruction to comply. If your runtime does nothing to resist that, no amount of upstream training fully removes it, because you have rebuilt the incentive at the product layer.

So the real question is not "is the model humble enough." It is "what does safe guidance mean at the point of execution."

We already have a discipline for this

Here is the part most AI product teams skip: advice software is not a new category. Medicine has been governing decision-support systems for decades, and regulators have already drawn the line that matters.

The FDA's guidance on clinical decision-support software turns on a single question: can the user independently review the basis for the recommendation? If the software lays out its reasoning and inputs so a clinician can judge it and decide for themselves, it stays a support tool. If the user is expected to rely on the output without being able to inspect the basis, it moves toward what the FDA treats as a regulated device, subject to greater oversight. The governance burden scales with how much the user is likely to defer.

Sit with that principle, because it is exactly the one consumer AI is walking into without noticing.

The clinical world built its answer around properties that have nothing to do with model quality: the human keeps final authority, the reasoning is inspectable, recommendations are tied to sources a reviewer can check, and the system is monitored and audited after deployment. Strip out the medical specifics and you have the reliability contract for any system that tells a person what to do.

Consumer AI gets to this discipline late and from the wrong direction. It shipped advice first and is being asked for the contract afterward.

The governance contract, at runtime

If guidance is decision support, then governance cannot live in a policy document or a system prompt. A system prompt is a wish. The runtime is where a contract is actually enforced.

For guidance flows, that contract has four obligations, and none of them are prompting.

The first is intent. The system has to recognize when a user has crossed from asking what is true to asking what they should do, because the two should not run under the same rules. Guidance routes through a stricter mode.

The second is authority. Guidance mode needs an explicit answer to what the system is allowed to do. Lay out options and tradeoffs, or push a direct recommendation? Speak to irreversible decisions at all, or refuse and route them onward? That is a policy decision, made once and enforced every time, not improvised per conversation.

The third is evidence. The output should separate what the user actually said from what the model inferred from what the model is uncertain about. In a non-verifiable domain, evidence often means naming the assumptions the recommendation rests on. This is the consumer version of the FDA's review-the-basis test: advice you cannot inspect is not a feature, it is an unaccountable verdict. It also turns advice into an artifact you can review later without replaying the whole transcript.

The fourth is reversibility. Guidance about consequential decisions needs stop conditions and escalation paths built into the control flow. Slow down. Get a second opinion. Talk to a professional. Do not act on this in the next hour. The point is not a disclaimer footer. The point is that the system changes its behavior, not its boilerplate.

And one design rule follows directly from the failure mode: challenge has to make the system more careful, not more agreeable. When a user pushes back or shows distress, the conservative move is to restate uncertainty and ask for the missing context, not to capitulate. Pressure should raise the bar, not lower it.

The last piece is measurement. OpenAI's miss was not a lack of intelligence. It was a missing signal, an evaluation that did not exist until after the rollback. Treat sycophancy the way you treat flakiness in a test suite: measurable, trackable, regressable. Watch the rate of decisive verdicts, the rate at which the system escalates instead of deciding, how it behaves under pushback. You will not get these perfect. You will get an early warning before it becomes an incident, which is precisely what OpenAI did not have.

The clock is already running

This is not a speculative risk that builders can sit on.

In January 2025, months before the GPT-4o rollback, the American Psychological Association formally urged the Federal Trade Commission to investigate AI chatbots that pose as therapists, arguing they mislead the public and lack the regulation and safeguards required of human professionals. The institutions that govern high-stakes advice have already started treating consumer AI guidance as high-stakes advice.

Teams that build the governance contract now will meet that scrutiny as architecture they already have. Teams that wait will meet it as compliance imposed from outside, after a failure makes the news.

The takeaway

Users will pull assistants into their hardest decisions whether or not the product was designed for it. That is not an edge case. It is the part of usage where the stakes are highest and the feedback signal is weakest, which is the worst possible combination to leave to a model optimizing for approval.

So stop treating guidance as a model behavior to be tuned, and start treating it as a product surface with a governance contract to be enforced.

The model matters. But the model is not where this gets decided.

The runtime is the product. Build it like one.

Sources

OpenAI, "Sycophancy in GPT-4o: what happened and what we're doing about it" (April 29, 2025): https://openai.com/index/sycophancy-in-gpt-4o/
OpenAI, "Expanding on what we missed with sycophancy" (May 2, 2025): https://openai.com/index/expanding-on-sycophancy/
Sharma et al., "Towards Understanding Sycophancy in Language Models" (arXiv 2310.13548, October 2023): https://arxiv.org/abs/2310.13548v1
Perez et al., "Discovering Language Model Behaviors with Model-Written Evaluations" (arXiv 2212.09251, December 2022): https://arxiv.org/abs/2212.09251v1
Gao, Schulman, Hilton, "Scaling Laws for Reward Model Overoptimization" (arXiv 2210.10760, October 2022): https://arxiv.org/abs/2210.10760v1
U.S. FDA, "Clinical Decision Support Software" final guidance (September 28, 2022): https://www.federalregister.gov/documents/2022/09/28/2022-20993/clinical-decision-support-software-guidance-for-industry-and-food-and-drug-administration-staff
American Psychological Association, letter to the FTC on unregulated AI therapy chatbots (January 12, 2025): https://www.apaservices.org/advocacy/news/federal-trade-commission-unregulated-ai (secondary coverage: https://futurism.com/american-psychological-association-ftc-chatbots)

OpenAI's account of the rollback is vendor-reported. The prior-art papers are cited for mechanism, not as comparable figures.