
Sharan, Aakash

Declared State Is Not Observed State

September 10, 2024 #Distributed Systems

A green status light lit over a visibly seized machine, a system's self-report diverging from its observed state.

The dashboard is all green. Every health check is passing, every instance reports ready, the deploy went clean. And the system is down. Orders are not going through, and the first customer email has already landed.

If you have run anything at scale, you have lived some version of this morning. The instinct that got you there is a reasonable one. We build a health endpoint precisely so we do not have to guess whether a service is alive, and for years that green dot has been a fair proxy for "this thing is doing its job." Trusting it is not naive. It is what the mechanism is for.

The trouble is what the green dot actually measures.

A typical health check confirms that a process is running and can answer an HTTP request. It does not confirm that the service can reach its database, that its connection pool is not exhausted, that the downstream dependency it needs is responding, or that the work it is supposed to be doing is getting done. The service reports "healthy" because, from where it sits, it looks healthy. The report is a statement the component makes about itself.

And a statement a component makes about itself is a claim, not a measurement.

The reasonable-looking trap

Most observability starts by asking components to describe their own condition. Health endpoints. Self-reported metrics. A worker that emits "job complete." A node that announces "I am the leader." It is the natural first move, because it is cheap and it scales: every component already knows the most about itself, so let it tell you.

The catch is structural. The component doing the reporting and the component being reported on are the same thing. That arrangement works fine for the failures a component can see in itself. It fails for exactly the failures that matter most, the ones the component cannot detect from the inside.

Consider a node that has been network-partitioned from its peers but is otherwise running perfectly. Ask it how it is doing and it will tell you, with complete confidence, that it is healthy and still the leader. It is not lying. It genuinely cannot tell that the rest of the cluster has moved on without it. The most dangerous state in the system is the one the actor is most confident about, because its confidence and its blindness come from the same place: it only has its own point of view.

Self-reporting surfaces the failures you were already going to catch. It goes quiet on the ones that take you down.

Observe the effect, not the assertion

The way out is old, and the distributed-systems literature has said it in several voices.

The end-to-end argument, from Saltzer, Reed, and Clark, says that a function such as reliable delivery cannot be completely and correctly implemented at an intermediate layer alone. You confirm it at the endpoints, where the actual outcome is visible. A relay reporting "message sent" is not the same as the destination confirming "message received," and only the second one is worth building on.
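To make the "sent" versus "received" distinction concrete, here is a minimal sketch in the spirit of the careful file transfer example from the end-to-end paper. Everything in it is illustrative: `Destination`, `honest_relay`, `lossy_relay`, and `transfer` are hypothetical names, and the relay is a plain function standing in for a real network hop. The sender trusts only a digest computed by the destination over what actually arrived.

```python
import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest, used here as the end-to-end check value."""
    return hashlib.sha256(data).hexdigest()


class Destination:
    """Toy endpoint that stores a payload and acknowledges what it stored."""

    def __init__(self) -> None:
        self.stored = b""

    def receive(self, data: bytes) -> str:
        self.stored = data
        # The ack is computed from what actually landed, not from what was sent.
        return digest(self.stored)


def honest_relay(data: bytes) -> bytes:
    # Hypothetical well-behaved hop: forwards the payload unchanged.
    return data


def lossy_relay(data: bytes) -> bytes:
    # Hypothetical faulty hop: drops the last byte, yet from its own point
    # of view it forwarded a message successfully.
    return data[:-1]


def transfer(data: bytes, relay, destination: Destination) -> bool:
    """Return True only if the destination's ack matches the sender's digest."""
    ack = destination.receive(relay(data))
    return ack == digest(data)


if __name__ == "__main__":
    payload = b"order #1001: 3 widgets"
    print(transfer(payload, honest_relay, Destination()))  # True
    print(transfer(payload, lossy_relay, Destination()))   # False
```

The relay never gets a vote. Whether the transfer succeeded is decided by comparing two digests computed at the two ends.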

The Byzantine tradition, from Lamport and others, goes further. If a node can be arbitrarily wrong, you cannot take its word for anything that matters. Agreement protocols exist for one reason: to establish a fact about the system without trusting any single participant's self-description. The entire apparatus of consensus is a formal admission that a component's account of itself is not sufficient evidence.
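A much weaker cousin of that idea fits in a few lines. The sketch below is not a Byzantine agreement protocol: it assumes nodes can be stale or partitioned but not lying, and it uses a simple majority. The `beliefs` mapping (node name to the leader that node currently names) and both function names are hypothetical. The point is only that leadership gets decided by what the cluster observes, not by what one node says about itself.

```python
from collections import Counter
from typing import Optional


def claims_leadership(node: str, beliefs: dict[str, str]) -> bool:
    """The claim: does this node name itself as leader?"""
    return beliefs[node] == node


def observed_leader(beliefs: dict[str, str]) -> Optional[str]:
    """The measurement: a leader counts only if a strict majority names it."""
    if not beliefs:
        return None
    quorum = len(beliefs) // 2 + 1
    candidate, votes = Counter(beliefs.values()).most_common(1)[0]
    return candidate if votes >= quorum else None


if __name__ == "__main__":
    # Node "a" was partitioned away and still believes it leads;
    # the other four nodes have elected "b".
    beliefs = {"a": "a", "b": "b", "c": "b", "d": "b", "e": "b"}
    print(claims_leadership("a", beliefs))  # True
    print(observed_leader(beliefs))         # b
```

Node "a" is fully confident and fully wrong, and nothing in its own state can tell it so. Only the tally across the cluster can.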

Put plainly: the trustworthy signal is not what the component says about its state. It is the effect the component has on the world. Did the row actually commit? Did the consumer's offset actually advance? Did a real request travel the real path and come back correct? Those are observations. They are produced by something other than the actor, and they describe what happened rather than what the actor believes.

A component's report of its own state is a claim. The effect it leaves in the world is a measurement. Build on the measurement.

What this looks like in practice

The principle turns into a handful of concrete habits, all of which move the source of truth away from the actor.

Probe the real path, not the process. A health check that returns 200 because the process is up tells you almost nothing. A synthetic transaction that exercises the actual work (writing a canary record, reading it back, hitting the real dependency) tells you whether the service can do its job. The first is white-box and self-reported. The second is black-box and observed. Prefer the observed one for anything you page on.
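Here is a rough sketch of the difference, using a toy in-memory service rather than a real HTTP endpoint or database. `OrderService`, its `db_up` flag, and `synthetic_probe` are all made up for illustration; in a real system the probe would run from outside the process and go through the same path a customer request takes.

```python
import uuid


class OrderService:
    """Toy service; `db_up` stands in for a real database connection."""

    def __init__(self) -> None:
        self.db_up = True
        self._rows: dict[str, str] = {}

    def health(self) -> int:
        # Shallow check: the process is running, so it answers 200.
        return 200

    def _require_db(self) -> None:
        if not self.db_up:
            raise ConnectionError("database unreachable")

    def write(self, key: str, value: str) -> None:
        self._require_db()
        self._rows[key] = value

    def read(self, key: str):
        self._require_db()
        return self._rows.get(key)

    def delete(self, key: str) -> None:
        self._require_db()
        self._rows.pop(key, None)


def synthetic_probe(service: OrderService) -> bool:
    """Write a canary, read it back, clean up; any failure fails the probe."""
    key = f"canary-{uuid.uuid4()}"
    try:
        service.write(key, "ping")
        ok = service.read(key) == "ping"
        service.delete(key)
        return ok
    except Exception:
        # A probe treats every error on the real path as "cannot do its job".
        return False


if __name__ == "__main__":
    svc = OrderService()
    svc.db_up = False  # simulate the outage
    print(svc.health())          # 200
    print(synthetic_probe(svc))  # False
```

With the database gone, the health endpoint is still green and the probe is red. That gap is the opening scene of this post.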

Measure progress, not heartbeats. A heartbeat says "I am still here." It does not say "I am still making progress." A worker can heartbeat happily while wedged on a task it will never finish. Liveness that matters is defined by observed forward motion (work completed, queue drained, offset advanced), not by a periodic "still alive" ping the actor sends about itself.
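As a sketch, assume an external observer that periodically records two things about a worker: the time of its last heartbeat and the consumer offset as read from the queue itself. The `Sample` type, field names, and thresholds below are assumptions for illustration. A production check would also need to tell "stuck" apart from "nothing to do", for example by looking at backlog.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    at: float            # when the observer took this sample (seconds)
    heartbeat_at: float  # last heartbeat the worker sent about itself
    offset: int          # consumer offset as read by the observer


def is_alive(sample: Sample, timeout: float) -> bool:
    """Self-reported liveness: was there a heartbeat recently?"""
    return sample.at - sample.heartbeat_at <= timeout


def is_progressing(samples: list[Sample], window: float) -> bool:
    """Observed liveness: did the offset advance within the last `window` seconds?"""
    if not samples:
        return False
    latest = samples[-1]
    recent = [s for s in samples if latest.at - s.at <= window]
    return recent[-1].offset > recent[0].offset


if __name__ == "__main__":
    # A wedged worker: heartbeats stay fresh, the offset never moves.
    wedged = [Sample(at=t, heartbeat_at=t - 1, offset=42) for t in (0, 30, 60, 90)]
    print(is_alive(wedged[-1], timeout=5))     # True
    print(is_progressing(wedged, window=60))   # False
```

Page on the second function. Keep the first as a cheap hint.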

Reconcile declared against actual. The most reliable systems continuously compare what a component claims is true against a separate view of what is true, and alert on the drift. The ledger says the balance is X; a separate tally of the transactions says Y; the gap is the signal. Reconciliation is just institutionalized distrust of self-report, run on a schedule.
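The ledger example can be sketched directly. The function below assumes two inputs that come from different places: `declared`, the balances a component reports, and `transactions`, a separately kept list of `(account, amount)` entries. Both shapes and the name `reconcile` are illustrative, not taken from any particular system.

```python
from decimal import Decimal


def reconcile(
    declared: dict[str, Decimal],
    transactions: list[tuple[str, Decimal]],
) -> dict[str, Decimal]:
    """Return {account: declared minus recomputed} for every account that drifts."""
    actual: dict[str, Decimal] = {}
    for account, amount in transactions:
        actual[account] = actual.get(account, Decimal("0")) + amount

    drift: dict[str, Decimal] = {}
    # Check accounts seen on either side, so a missing entry also shows up.
    for account in declared.keys() | actual.keys():
        gap = declared.get(account, Decimal("0")) - actual.get(account, Decimal("0"))
        if gap != 0:
            drift[account] = gap
    return drift


if __name__ == "__main__":
    declared = {"alice": Decimal("100.00"), "bob": Decimal("40.00")}
    transactions = [
        ("alice", Decimal("120.00")),
        ("alice", Decimal("-20.00")),
        ("bob", Decimal("50.00")),
        ("bob", Decimal("-15.00")),
    ]
    print(reconcile(declared, transactions))  # {'bob': Decimal('5.00')}
```

An empty result means claim and tally agree. Anything else is the alert.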

Keep the log of what happened, not what is. A component's current-state field is a value it overwrote on its own authority, and one you have to take on faith. An append-only record of the events that actually occurred is an observed history you can replay and check for yourself. This is one of the quieter virtues of event sourcing: you store what happened, not what some component asserts is true now.
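A small sketch of the idea, with an in-memory list standing in for a durable append-only log. The `Event` and `Account` types and the two event kinds are hypothetical. The cached `balance` field is the claim; `replay` recomputes the state from the recorded history so the two can be compared.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrew"
    amount: int  # in cents


def delta(event: Event) -> int:
    """Signed effect of one event on the balance."""
    if event.kind == "deposited":
        return event.amount
    if event.kind == "withdrew":
        return -event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events) -> int:
    """Recompute the balance from the history alone."""
    return sum(delta(e) for e in events)


class Account:
    def __init__(self) -> None:
        self._events: list[Event] = []  # append-only history
        self.balance = 0                # mutable current-state field

    def apply(self, event: Event) -> None:
        change = delta(event)           # validate before recording
        self._events.append(event)
        self.balance += change

    @property
    def events(self) -> tuple:
        return tuple(self._events)      # read-only view of the log


if __name__ == "__main__":
    account = Account()
    account.apply(Event("deposited", 500))
    account.apply(Event("withdrew", 200))
    account.balance = 9999              # a bug overwrites the state field
    print(account.balance)              # 9999
    print(replay(account.events))       # 300
```

The state field can be overwritten by anything with write access. The history can only grow, so the replayed value is the one you can check.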

The through-line in all four is the same. Do not ask the actor how it is doing. Observe what it did.

The deeper version of the problem

Push this far enough and it stops being a monitoring detail and becomes a property of the whole design.

Any system in which the only account of what a component did is the account that component wrote about itself is, structurally, unauditable. There is no separate record to check the claim against, so the claim cannot be wrong in a way anyone can detect. This is why auditors are not the same people as the accountants, why the process under test does not grade its own test, why separation of duties is a principle and not a nicety. It is not about assuming bad faith. A perfectly honest component that is broken in a way it cannot see will report success in perfectly good faith. That separation is what lets the truth surface anyway.

You cannot audit a system whose only record of what it did is the record it wrote about itself.

Where self-reports still earn their place

None of this makes self-reporting useless, and it would be an overcorrection to rip out every health check tomorrow. A self-report is a cheap, fast first signal, and for the class of failures a component can see in itself, it is often enough. Having self-reports is not the failure. The failure is treating a claim as proof, and letting the actor's own account be the only thing standing between you and a bad morning.

The right shape is defense in depth. Let components report themselves, because it is cheap. Then put a separate, external check behind the report for anything you actually care about, because the report will be silent exactly when it matters. Trust, but verify, and be honest about which half of that phrase your architecture is actually doing.

Declared state is a promise. Observed state is a fact. When the two disagree, and eventually they will, only one of them was ever measuring anything.

Observe the effect.

Sources

Saltzer, J. H., Reed, D. P., & Clark, D. D. (1984). End-to-End Arguments in System Design. ACM TOCS. https://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf
Lamport, L., Shostak, R., & Pease, M. (1982). The Byzantine Generals Problem. ACM TOPLAS. https://lamport.azurewebsites.net/pubs/byz.pdf
Kleppmann, M. (2017). Designing Data-Intensive Applications, ch. 8 (unreliable clocks, knowledge, truth, and lies). O'Reilly.
Nygard, M. (2018). Release It! (2nd ed.), on health checks and failure modes. Pragmatic Bookshelf.
Fowler, M. (2005). Event Sourcing. https://martinfowler.com/eaaDev/EventSourcing.html