SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

Tools Don’t Extend an Agent. They Define It.

2025-12-27

A 6.7-billion-parameter model, given a handful of tools, beat a model 26 times its size on a range of zero-shot tasks, from factual lookups to arithmetic to questions about dates. The smaller model did not win because it was smarter. It won because it could route each problem to the right outside capability: a question-answering system for missing facts, a calculator for arithmetic, Wikipedia search where retrieval helped, and a calendar for date questions. The larger model could not.

That result is from Toolformer, a 2023 paper from Meta. Most people filed it under "tool use is useful." That reading misses what it actually shows.

It shows that the tools, not the model, were the agent.

An agent is not a model with tools attached. It is the set of things it can reach.

Problem

We talk about agents as if the model is the agent and the tools are accessories. The question everyone asks first is which model is smartest. Tools come up later, as an integration detail, the plumbing you bolt onto the clever brain once you have picked it.

The framing feels obvious. The model is the expensive part, the part with the billions of parameters and the famous name. The tools are just functions. So the model must be the agent, and the tools must be features.

The framing is backwards. I have argued before that most agent failures are architectural rather than failures of model intelligence (the blind spots behind CoT and tool-only models). This is the same mistake seen from the other side: when you treat the model as the agent, you stop looking at the thing that actually determines what the agent can do.

Blind Spot

Toolformer is the cleanest place to see why, because it isolates the variable. The model learns to use tools from its own feedback. It proposes a tool call, checks whether inserting the tool's result makes the next words easier to predict, and keeps the call only when it does. No human labels which calls are useful. The judge is the model's own perplexity: roughly, how surprised it is by what comes next. A useful tool call lowers that surprise.
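The filter described above can be sketched in a few lines. This is a minimal illustration, not Toolformer's code: the class, the function names, the threshold, and every loss value are invented for the example. The one idea taken from the paper is the shape of the test, which keeps a call only when its result lowers the loss on the following text by some margin, measured against the better of making no call and making the call without its result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateCall:
    """A proposed tool call plus the model's loss on the text that follows.

    All losses are hypothetical numbers (lower means easier to predict).
    """
    text: str
    loss_without_call: float      # no call inserted at all
    loss_call_no_result: float    # call inserted, result withheld
    loss_call_with_result: float  # call inserted together with its result


def keep_call(call: CandidateCall, threshold: float) -> bool:
    """Keep the call only if its result lowers the loss by at least `threshold`."""
    baseline = min(call.loss_without_call, call.loss_call_no_result)
    return baseline - call.loss_call_with_result >= threshold


def filter_calls(calls, threshold=1.0):
    """Return only the calls that pass the usefulness test."""
    return [c for c in calls if keep_call(c, threshold)]


candidates = [
    CandidateCall("Calculator(400 / 1400)", 3.5, 3.25, 1.0),  # result helps a lot
    CandidateCall("Calendar()", 2.0, 2.0, 1.75),              # result barely helps
    CandidateCall("QA(Who wrote Hamlet?)", 1.5, 1.5, 2.0),    # result makes it worse
]

kept = [c.text for c in filter_calls(candidates)]
print(kept)
```

Only the calculator call survives here, because it is the only one whose result removes enough surprise. The calls that remain are the ones that cover what the bare model could not predict on its own.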

Three details from that setup break the model-is-the-agent intuition.

First, on those tasks, the small model with tools beat the large model without them. Capability did not track parameter count. It tracked reach.

Second, because a call survives training only when it makes the continuation easier to predict, tools get attached precisely where the bare model was weakest. Capability flows to the gaps. The tools are not decoration. They are where the model goes to become more than it is.

Third, and most telling, the ability to benefit from tools largely appears only at around 775 million parameters. Below that threshold, tool access brought little average gain, with Wikipedia search the exception. Tool use is not a switch you flip. It is a capability the model has to be large enough to host. The agent's relationship to its tools is structural, not cosmetic.

Put those together and the tools stop looking like extensions of a fixed capability. They look like the thing that decides the capability in the first place.

Architectural Truth

Here is where a distributed systems background makes the point feel obvious, because the field settled this question a long time ago for services.

A service is not defined by its internal logic alone. It is defined by its interfaces and its dependencies, by what it can call and what can call it. A service that can reach the payments ledger is a different thing, with a different role and a different blast radius, than one that cannot, even when the code inside is identical. We do not describe a service by how clever its internal functions are. We describe it by what it touches.

An agent is the same. Two agents running on the same model, with different tools, are different agents. One with a code interpreter and a repository is a developer. One with a search index and a document store is a researcher. They share a brain and share nothing else, because what each can reach is not the same, and what an agent can reach is what it is.
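The same point can be put in code. This is a toy sketch, not any framework's API: the Agent class, the model name, and the tool names are all hypothetical. It shows only that two agents built on one model compare as different once their tool sets differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    model: str        # the shared "brain"
    tools: frozenset  # what this agent can reach

    def can(self, action: str) -> bool:
        """An agent can do exactly what its tools let it reach."""
        return action in self.tools


MODEL = "same-model-v1"  # hypothetical model name

developer = Agent(MODEL, frozenset({"run_code", "read_repo", "write_repo"}))
researcher = Agent(MODEL, frozenset({"search_index", "read_documents"}))

print(developer.model == researcher.model)  # do they share a model?
print(developer == researcher)              # are they the same agent?
print(developer.can("run_code"), researcher.can("run_code"))
```

The model field is identical in both records. Everything that distinguishes the developer from the researcher lives in the tools field.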

Tools do not extend an agent's identity. They constitute it.

This changes what tool selection is. In the SRAL framework I use to evaluate agents, this is the Act dimension, the agent's surface against the world (the four questions I ask about every agent). Choosing an agent's tools is not a configuration step you do after picking the model. It is the act of deciding what the agent is and what it is allowed to become. When you grant a tool, you are not adding a feature. You are admitting a capability into the agent's definition, with everything that follows: new things it can do, new things it can break, a new role it now holds in your system.

The design implication is direct. Stop starting with the model and treating tools as plumbing. Start with the reach the agent should have, the set of actions that define its job, and choose the tool set as deliberately as you would choose a service's dependencies. The model is the substrate. The tools are the constitution.
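One way to make that concrete is to write down the reach first and check a proposed tool set against it, the way you would review a service's dependencies. The sketch below is illustrative only: the catalog, the tool names, and the action names are assumptions, not part of SRAL or of any real library.

```python
# Hypothetical catalog: each tool name maps to the actions it makes possible.
TOOL_CATALOG = {
    "code_interpreter": {"run_code"},
    "repo_client": {"read_repo", "write_repo"},
    "search_index": {"search_web"},
    "payments_api": {"read_ledger", "issue_refund"},
}


def reach_of(tools):
    """Collect every action the given tools make possible."""
    actions = set()
    for name in tools:
        actions |= TOOL_CATALOG[name]
    return actions


def review_tool_set(required_actions, tools):
    """Compare the reach a tool set grants with the reach the job calls for."""
    granted = reach_of(tools)
    return {
        "missing": sorted(required_actions - granted),  # parts of the job it cannot do
        "excess": sorted(granted - required_actions),   # reach the job never asked for
    }


# Start from the job, not the model.
developer_job = {"run_code", "read_repo", "write_repo"}

good = review_tool_set(developer_job, ["code_interpreter", "repo_client"])
bad = review_tool_set(developer_job, ["code_interpreter", "payments_api"])
print(good)
print(bad)
```

The first tool set matches the job exactly. The second leaves the agent unable to touch the repository while letting it issue refunds, which is a different agent from the one the job described.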

What this is not

This is not a claim that learned tool use was invented in one paper, or that capability-through-affordance is a new idea. Toolformer built on earlier self-supervised tool work like TALM, and on tool-use efforts like WebGPT before it. The broader notion that what a system can reach shapes what it is has a long lineage in how we reason about services, and about organisms.

Toolformer is not the proof of a law. It is the cleanest single demonstration that the tools carry the capability, and the cleanest place to watch the model-is-the-agent intuition fail. The narrow results are Toolformer's. The architectural reading, that tools are constitutive rather than additive, is the lens. The lens earns its keep because it changes the first question you ask when you design an agent.

You stop asking which model is smartest.

You start asking what this agent should be able to touch.

Reach is identity.

Sources

Schick et al., "Toolformer: Language Models Can Teach Themselves to Use Tools" (2023): https://arxiv.org/abs/2302.04761
Parisi et al., "TALM: Tool Augmented Language Models" (2022): https://arxiv.org/abs/2205.12255
Nakano et al., "WebGPT: Browser-assisted question-answering with human feedback" (2021): https://arxiv.org/abs/2112.09332