SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

Substance Over Surface: Where Anthropic and OpenAI Agree on Reasoning

2026-06-04

There is a comfortable intuition about making models safer. Show the model good behavior. Constrain what its reasoning is allowed to look like. Keep a hand on the wheel.

In the first half of 2026, two frontier labs published results that say that intuition is backwards. They got there from opposite directions, which is what makes the agreement worth your attention.

Anthropic stopped demonstrating good behavior and started training the reasoning underneath it. OpenAI kept direct optimization pressure off the reasoning trace, found the model could barely steer it anyway, and argued that this loss of control made the trace more useful for safety monitoring.

Different problems. Opposite methods. The same underlying claim.

The thing you align is the substance of the reasoning, not the shape of the trace.

Two words carry that sentence. Substance is what the reasoning expresses: its values, its principles, the line of reasoning it lays out. Surface is what the reasoning looks like: its length, its format, how easily its shape can be steered. The claim is that alignment lives in the substance, not the surface. Hold those two apart and everything below follows.

What each lab actually did

Start with the results, because the numbers matter more than the framing.

Anthropic: train the why, not the what

Anthropic's "Teaching Claude Why" documents how they reduced agentic misalignment in the Claude 4 family. Agentic misalignment is the failure where a model with tools and a goal takes harmful instrumental actions, like resisting shutdown. In stress-test honeypot evaluations, designed to provoke the behavior rather than measure its everyday rate, an early Opus 4 model reached blackmail rates as high as 96%.

The first fix was the obvious one: filter the training data for correct behavior, penalize the bad actions, show the model what good looks like. It moved the misalignment rate from 22% to 15% and plateaued there.

The fix that worked was different in kind. Instead of training on demonstrations of good behavior, they rewrote the training responses to include explicit deliberation about values and ethics, the reasoning for why one action was better than another. That moved misalignment from 22% to about 3%.

Two more results sharpen the point. First, a small dataset of out-of-distribution ethical scenarios, meaning cases the training data did not anticipate, matched the safety gains of an 85-million-token dataset built to mimic the evaluation. The small dataset was 3 million tokens of the model advising humans on dilemmas. That is roughly a 28x efficiency difference, and the smaller dataset looked less like the test, not more. Second, adding constitutional documents and fictional narratives about an aligned AI cut the blackmail rate further, from 65% to 19%.
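The figures quoted so far are easier to compare as relative reductions. The short Python sketch below only redoes the arithmetic on the numbers as stated in this article; the function name is my own, and the 3% figure is the "about 3%" quoted above, so treat the outputs as approximate.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the baseline rate that the intervention removed."""
    return (before - after) / before


# Rates in percent and dataset sizes in millions of tokens, as quoted in the text.
demonstrations = relative_reduction(22, 15)           # filter-and-demonstrate fix
deliberation = relative_reduction(22, 3)              # rewritten responses with explicit reasoning
constitution_and_fiction = relative_reduction(65, 19) # constitutional documents plus narratives
token_ratio = 85 / 3                                  # mimic-the-eval dataset vs. small OOD dataset

print(f"demonstrations:           {demonstrations:.0%} of baseline removed")
print(f"deliberation:             {deliberation:.0%} of baseline removed")
print(f"constitution and fiction: {constitution_and_fiction:.0%} of baseline removed")
print(f"token ratio:              {token_ratio:.1f}x")
```

On these numbers, demonstrations removed about a third of the baseline rate, while training on the reasoning removed well over four fifths of it.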

Anthropic states the conclusion plainly: "Teaching the principles underlying aligned behavior can be more effective than training on demonstrations of aligned behavior alone."

The lever was the content of the reasoning. The demonstrations were the surface.

One distinction matters before the comparison ahead. The reasoning Anthropic trains here is the model's articulated rationale, the value-based explanation it gives for why one action is better than another. That is a different object from the internal chain of thought that OpenAI monitors at inference time. The convergence below is not that both labs operate on the same artifact. It is that both treat the content of reasoning as the thing to get right and its surface as secondary.

OpenAI: leave the trace alone, keep the signal

OpenAI's GPT-5.5 system card, and the controllability research it cites, run at the same idea from the other end.

GPT-5.5 produces an internal chain of thought, a stretch of reasoning generated before the final answer. The finding: as that reasoning grows longer, the model loses the ability to deliberately steer it. Tell it to reason only in lowercase, or to follow a specific format in its thinking, and compliance collapses as the trace lengthens. The cross-model figure is stark. Claude Sonnet 4.5 controls the form of its chain of thought only 2.7% of the time, against 61.9% for its final outputs.
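To make "controls the form" concrete, here is a toy sketch of how a compliance rate for the lowercase-only instruction could be scored. It is an illustration under my own assumptions, not the methodology of the cited research: the function names are hypothetical and the sample strings are made up.

```python
def follows_lowercase_rule(text: str) -> bool:
    """True when the text contains no uppercase letters."""
    return not any(ch.isupper() for ch in text)


def compliance_rate(texts: list[str]) -> float:
    """Share of texts that obey the lowercase-only instruction."""
    if not texts:
        return 0.0
    return sum(follows_lowercase_rule(t) for t in texts) / len(texts)


# Made-up samples, for illustration only.
traces = [
    "first, list the constraints.",
    "Wait, the user asked for X.",
    "so the answer is 4.",
    "Check the units again.",
]
outputs = [
    "the answer is 4.",
    "the total is 12.",
    "it is 7 km.",
    "The result is 3.",
]

trace_rate = compliance_rate(traces)
output_rate = compliance_rate(outputs)
print(f"trace compliance:  {trace_rate:.0%}")
print(f"output compliance: {output_rate:.0%}")
```

The quantity of interest is the gap between the two rates: the same instruction, obeyed far more often in the final output than in the reasoning that precedes it.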

The usual reading of that is a limitation. OpenAI argues the opposite, and titled that research accordingly: reasoning models struggle to control their chains of thought, and that is good. A model that cannot reshape its own reasoning cannot easily disguise misaligned intent from a monitor reading that reasoning. Lower controllability buys higher monitorability. They chose to keep the signal rather than the steering wheel.

The trace is harder to control. The substance stays legible. That is the trade they took on purpose.

Now put the two side by side

The two moves are not mirror images, and the difference is worth stating plainly rather than smoothing over. Anthropic acted on the substance of the reasoning. It rewrote the content the model trained on. OpenAI, in the work described here, did not author the substance at all. It kept direct optimization pressure off the trace, and that restraint is what keeps the substance visible to a monitor.

So one lab shaped the content. The other left it alone and only protected its visibility. That asymmetry is real.

What they share is narrower, and more interesting for being narrow. Both treat the surface form of reasoning as the wrong thing to optimize, and both locate the part that matters in the content. One trusts the content enough to author it. The other trusts it enough to leave it legible and unmanaged. They disagree on method and agree on where the signal lives.

That agreement, not a shared technique, is the convergence.

What this is not

Here is where honesty matters, because the fast dismissal is "this is nothing new," and the fast dismissal is half right.

Principle-based training is not new. Anthropic's own Constitutional AI established it in 2022. Reasoning over a written safety specification is not new either; OpenAI's "Deliberative Alignment" introduced exactly that for its earlier reasoning models in late 2024. The idea that the process of reasoning is a better training target than the final outcome goes back at least to process supervision in 2023, where rewarding each correct reasoning step beat rewarding only the answer. And the argument that a chain of thought is a fragile but valuable window for oversight was made directly in a 2025 cross-lab position paper on monitorability.

So the mechanism is not a discovery. Each leg of it has a lineage, and pretending otherwise would be the kind of overclaim this space is already full of.

What is new is not the mechanism but the convergence itself. Two labs, two different problems, two opposite engineering choices, two different reasoning artifacts, and the same underlying bet.

That is not a new tool. It is a principle becoming visible because two independent groups stopped fighting it.

The out-of-distribution result is the tell

The most counterintuitive finding sits in the Anthropic work, and it is worth slowing down for. The training data that generalized best was the data that looked least like the evaluation. Standard intuition says to match your training distribution to your test distribution. Here the opposite held.

That stops being strange once you look at what each kind of data actually teaches. Demonstrations cover the cases you anticipated. They are a finite list. Principles are meant to hold across cases nobody wrote down. And the failures that matter most, in an agent or in any system run under real load, tend to be the ones no one enumerated in advance. You do not buy coverage of the unanticipated by collecting more of the anticipated. You buy it by training something that generalizes past the list.

You do not earn out-of-distribution safety by collecting more in-distribution examples.

That instinct is not special to alignment. Anyone who has operated a system at scale carries some version of it: durability comes from properties that hold across states you did not foresee, not from a longer catalog of the ones you did. The labs reached it from the alignment side. It is the broader truth that makes their results worth more than their headline numbers.

The limit, stated honestly

There is a load-bearing assumption underneath all of this, and it deserves to be said out loud rather than buried.

Monitoring the substance of reasoning only works if the reasoning is faithful, if the trace reflects what actually drove the model's decision. The evidence here is uncomfortable. Anthropic's own 2025 work found that reasoning models verbalize a hint that influenced their answer only about a quarter of the time, and in some reward-hacking setups they verbalize the real reason in under 2% of cases. A trace can be legible and still be a story the model tells after the fact. This is the same architectural fragility I traced in the blind spots behind chain-of-thought and tool-only agents: the reasoning you can see is not guaranteed to be the reasoning that drove the action.
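A back-of-the-envelope sketch shows why those rates bite. It assumes a deliberately simple model of my own, not one taken from the cited work: a monitor that reads only the trace can flag an influence only when the trace mentions it, so its catch rate is capped by the verbalization rate times the monitor's recall on verbalized cases. The function name and the recall parameter are hypothetical.

```python
def max_catch_rate(verbalization_rate: float, monitor_recall: float = 1.0) -> float:
    """Upper bound on the share of influenced decisions a trace-only monitor can flag.

    Assumes the monitor can flag a case only if the trace verbalizes the influence,
    and that it then detects it with probability `monitor_recall`.
    """
    for value in (verbalization_rate, monitor_recall):
        if not 0.0 <= value <= 1.0:
            raise ValueError("rates must lie between 0 and 1")
    return verbalization_rate * monitor_recall


# Rates as quoted in the text: about a quarter, and under 2%.
hint_case = max_catch_rate(0.25)
reward_hack_case = max_catch_rate(0.02)

print(f"hint case, perfect monitor:        at most {hint_case:.0%} caught")
print(f"reward-hack case, perfect monitor: at most {reward_hack_case:.0%} caught")
```

Under this simplification, even a monitor that never misses a verbalized reason would see at most a quarter of the hint cases, and almost none of the reward-hacking ones.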

So the honest scoping is this. Monitorability is necessary, not sufficient. It is also fragile; the same training pressure that makes models more capable can erode the faithfulness the whole approach depends on. The convergence I am describing is real and worth building on. It is not a guarantee, and anyone selling it as one is overselling.

What both labs got right is the direction. Stop optimizing the surface of the reasoning. Get the substance right, and keep it visible.

The model is not aligned by how its thinking looks. It is aligned by what its thinking is made of.

Sources

Primary

Anthropic, "Teaching Claude Why" (2026): alignment.anthropic.com/2026/teaching-claude-why
OpenAI, "GPT-5.5 System Card" (2026): openai.com/index/gpt-5-5-system-card
OpenAI, "Reasoning Models Struggle to Control their Chains of Thought" (2026): openai.com/index/reasoning-models-chain-of-thought-controllability (arXiv 2603.05706)

Prior art referenced

Anthropic, "Constitutional AI" (2022): arXiv 2212.08073
OpenAI, "Let's Verify Step by Step" (2023): arXiv 2305.20050
OpenAI, "Deliberative Alignment" (2024): arXiv 2412.16339
"Chain of Thought Monitorability" position paper (2025): arXiv 2507.11473
Anthropic, "Reasoning Models Don't Always Say What They Think" (2025): anthropic.com/research/reasoning-models-dont-say-think (arXiv 2505.05410)