The bill arrived, and it changed the conversation about agents.
EY illustrates the shift with a single line: a chat that cost about four cents in 2023 becomes a $1.20 orchestration once you add tool retrieval, planning, and subagents. Roughly thirty times more, for the same intent. ICONIQ's 2026 survey puts numbers on the consequence: model inference runs around 23 percent of cost at scaling-stage AI companies, and their gross margins sit near 52 percent, well below what mature software is built on. Gartner expects more than 40 percent of agentic AI projects to be cancelled by the end of 2027, and lists escalating costs as the first reason.
The industry's response has been to reach for the finance team. Token dashboards. Cost-per-task metrics. Per-tenant budgets reviewed monthly. A new FinOps vocabulary for agents.
That work is necessary. It is also aimed at the wrong layer.
A dashboard tells you what you spent. It cannot tell you what to refuse.
Cost like this is not a billing problem. It is a scheduling problem, and distributed systems solved the general version of it a long time ago.
Two conversations that are not talking to each other
There are two discourses running in parallel about agent cost, and they rarely meet.
The louder one is economics. CFOs, analysts, and FinOps teams treat token spend as a bill to forecast, attribute, and optimize after the fact. The instruments are spreadsheets and dashboards. The mental model is the monthly invoice.
The quieter one is systems. A smaller thread of engineers has started treating agent load the way an operating system treats contended resources: admission control, backpressure, priority queues, budgets that behave like an out-of-memory killer when they are breached. HiveMind's OS-style scheduler for concurrent agents is a clear recent example. But that work is mostly framed around latency and reliability, with cost as a side effect.
The connection worth making explicit is the one between them. The economics camp measures cost without controlling it. The systems camp controls load without naming cost as the point. Put them together and the conclusion is direct: the unit economics of an agent are decided at request time, under contention, by code, not at month end by a review.
That makes cost a control-plane concern. It belongs in the runtime, next to the other things the harness already governs, not in a report you read after the money is gone.
Routing is not scheduling
The obvious objection is that we already optimize cost. We route.
We do, and routing is good. FrugalGPT showed in 2023 that a cascade of cheaper models could match a frontier model on many tasks at a fraction of the price. RouteLLM learned the routing decision from preference data. I have argued that the next step is escalation as a tool call, where a cheap model consults a stronger one mid-task instead of routing the whole request up front.
All of that answers one question: which model should handle this request?
It does not answer the question that actually governs your margin: given a fixed budget and contended capacity right now, which requests do we admit, which do we queue, which do we serve at reduced quality, and which do we refuse outright?
Routing is selection. Scheduling is allocation under scarcity. Routing picks the lane. Scheduling decides whether the car gets on the road at all. You can have perfect per-request routing and still blow your margin, because nothing in a router says no when a thousand low-value requests arrive at once and each one politely chooses the cheapest adequate model.
Routing picks the model. Scheduling decides whether the request runs at all.
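A minimal sketch of the difference, in Python. Everything here is hypothetical: the prices in cents, the `value` field, and the `route` and `schedule` names are assumptions made for illustration, not taken from any of the systems cited.

```python
from dataclasses import dataclass

# Hypothetical per-request prices in cents; not taken from any vendor.
MODEL_COST_CENTS = {"small": 1, "frontier": 30}


@dataclass
class Request:
    name: str
    value: float          # assumed business value of serving this request
    needs_frontier: bool  # whether a cheap model is adequate


def route(req: Request) -> str:
    """Selection: pick the cheapest adequate model. It never says no."""
    return "frontier" if req.needs_frontier else "small"


def schedule(requests: list[Request], budget_cents: int) -> tuple[list[str], list[str]]:
    """Allocation: admit the highest-value requests until the budget is spent."""
    admitted, refused = [], []
    remaining = budget_cents
    for req in sorted(requests, key=lambda r: r.value, reverse=True):
        cost = MODEL_COST_CENTS[route(req)]
        if cost <= remaining:
            remaining -= cost
            admitted.append(req.name)
        else:
            refused.append(req.name)
    return admitted, refused


# A thousand low-value requests arrive alongside one paying customer.
burst = [Request(f"bulk-{i}", value=0.001, needs_frontier=False) for i in range(1000)]
burst.append(Request("customer", value=5.0, needs_frontier=True))

# Routing alone serves everything, each on its cheapest adequate model.
routed_spend_cents = sum(MODEL_COST_CENTS[route(r)] for r in burst)

# Scheduling holds the same burst to a 200-cent budget.
admitted, refused = schedule(burst, budget_cents=200)
```

With these made-up numbers, routing alone spends 1,030 cents. The scheduler serves the customer first, admits bulk work until the 200 cents run out, and refuses the rest. The router made the same choice in both cases.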
The primitives already exist
The useful part of treating this as a scheduling problem is that the primitives are not hypothetical. They are decades old, well understood, and sitting unused one layer below most agent stacks.
Admission control is the first gate. Not every request deserves the full agent loop, the frontier model, and an unbounded tool budget. A background reconciliation job and a paying customer's live question should not draw from the same pool on a first-come basis. Admission control decides, at intake, what a request is allowed to consume before it has consumed anything.
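One way to picture that intake gate, as a sketch. The request classes, the `Grant` fields, and the pool sizes are all assumptions made for the example. A real policy would come from your own traffic and pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Grant:
    """What a request is allowed to consume, decided before it consumes anything."""
    model: str
    max_tokens: int
    max_tool_calls: int


# Hypothetical policy table: one allowance per request class.
POLICY = {
    "live_customer": Grant(model="frontier", max_tokens=50_000, max_tool_calls=20),
    "background": Grant(model="small", max_tokens=5_000, max_tool_calls=3),
}


class AdmissionController:
    """Admits requests against separate per-class token pools."""

    def __init__(self, pools: dict):
        self.pools = dict(pools)  # remaining tokens per request class

    def admit(self, request_class: str) -> Optional[Grant]:
        grant = POLICY.get(request_class)
        if grant is None:
            return None  # unknown classes get nothing by default
        if self.pools.get(request_class, 0) < grant.max_tokens:
            return None  # this class's pool cannot cover the reservation
        self.pools[request_class] -= grant.max_tokens  # reserve up front
        return grant


controller = AdmissionController({"live_customer": 100_000, "background": 5_000})
first_job = controller.admit("background")      # fits the background pool
second_job = controller.admit("background")     # pool exhausted, not admitted
customer = controller.admit("live_customer")    # unaffected by background load
```

The design choice that matters is the separate pools. The background job can exhaust its own allowance without touching the capacity reserved for the live question.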
Backpressure is the second. When token spend starts to outrun the budget, the system has to push back on intake rather than quietly overspend and reconcile later. This is what a token-bucket rate limiter does. (The collision of names is unfortunate: the token bucket is a traffic-shaping algorithm, not a count of language-model tokens.) The idea transfers exactly. You refill capacity at a sustainable rate, and when the bucket is empty, work waits or is shed. A system without backpressure does not have a budget. It has a hope.
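A minimal token bucket, sketched here with the bucket measured in language-model tokens. The class name, the capacity, and the refill rate are illustrative assumptions. Time is passed in explicitly so the behavior is easy to follow. A real limiter would read a monotonic clock.

```python
class TokenBucket:
    """Traffic-shaping bucket: capacity refills at a steady, sustainable rate."""

    def __init__(self, capacity: float, refill_per_second: float, now: float = 0.0):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.level = capacity  # start full
        self.last = now

    def try_spend(self, amount: float, now: float) -> bool:
        """Spend `amount` if the bucket holds it; otherwise push back."""
        elapsed = max(0.0, now - self.last)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_second)
        self.last = max(self.last, now)
        if amount <= self.level:
            self.level -= amount
            return True
        return False  # caller must wait, queue, or shed the work


# Assumed numbers: a 10,000-token burst allowance, refilled at 100 tokens/second.
bucket = TokenBucket(capacity=10_000, refill_per_second=100, now=0.0)
first = bucket.try_spend(8_000, now=0.0)    # fits the initial allowance
second = bucket.try_spend(4_000, now=0.0)   # only 2,000 left: pushed back
third = bucket.try_spend(4_000, now=20.0)   # 20 seconds of refill covers it
```

The `False` return is the backpressure signal. What the caller does with it (wait, queue, or shed) is policy, and it is where the other primitives come in.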
Quality of service is the third. Under contention, not all work is equal. A revenue-bearing interaction should preempt a speculative batch job for the scarce frontier budget. Without explicit priority, every request competes equally, which means your most expensive capacity is allocated by accident.
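A sketch of explicit priority using a heap from the Python standard library. The class names, their ranks, and the costs are assumptions made for the example. Note that this orders work that has not started yet. It does not interrupt work already running, which true preemption would.

```python
import heapq
import itertools

# Hypothetical request classes; a lower number is served first.
PRIORITY = {"revenue": 0, "internal": 1, "speculative_batch": 2}


class PriorityScheduler:
    """Serves queued work in priority order against a scarce budget."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, name: str, request_class: str, cost: int) -> None:
        heapq.heappush(self._heap, (PRIORITY[request_class], next(self._seq), name, cost))

    def drain(self, frontier_budget: int):
        """Return (served, deferred) names after spending the budget by priority."""
        served, deferred = [], []
        while self._heap:
            _, _, name, cost = heapq.heappop(self._heap)
            if cost <= frontier_budget:
                frontier_budget -= cost
                served.append(name)
            else:
                deferred.append(name)
        return served, deferred


scheduler = PriorityScheduler()
scheduler.submit("batch-1", "speculative_batch", cost=60)
scheduler.submit("batch-2", "speculative_batch", cost=60)
scheduler.submit("checkout", "revenue", cost=60)  # arrives last

served, deferred = scheduler.drain(frontier_budget=100)
```

On a first-come basis, `batch-1` would have taken the budget and the checkout would have waited. With explicit priority the revenue-bearing request is served and both batch jobs are deferred.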
Load shedding is the fourth, and the one teams resist most. When the budget is under pressure, the correct behavior is to degrade on purpose: a cheaper model, a shorter context, a deferred task, a smaller subagent fan-out. Graceful degradation is not failure. It is the difference between a system that protects its margin under load and one that serves every request beautifully until the unit economics collapse.
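Degrading on purpose can be written down as a ladder. The sketch below is one possible shape, not a recommendation: the thresholds, the model names, the context sizes, and the fan-out limits are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ServiceLevel:
    name: str
    model: str
    max_context_tokens: int
    max_subagents: int


# Hypothetical ladder, ordered from full service to most degraded.
# Each threshold is the minimum fraction of budget that must remain.
LADDER = [
    (0.50, ServiceLevel("full", "frontier", 128_000, 8)),
    (0.20, ServiceLevel("reduced", "frontier", 32_000, 2)),
    (0.05, ServiceLevel("minimal", "small", 8_000, 0)),
]


def pick_service_level(budget_remaining: float, budget_total: float) -> Optional[ServiceLevel]:
    """Return the best level the remaining budget supports, or None to defer."""
    if budget_total <= 0:
        raise ValueError("budget_total must be positive")
    fraction = budget_remaining / budget_total
    for threshold, level in LADDER:
        if fraction >= threshold:
            return level
    return None  # below the last rung: defer the task instead of serving it


healthy = pick_service_level(80, 100)
pressured = pick_service_level(30, 100)
strained = pick_service_level(10, 100)
exhausted = pick_service_level(1, 100)
```

Because the ladder is data, it can be reviewed, tested, and versioned before the traffic spike rather than improvised during it.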
None of these are AI ideas. They are how every network, every database, and every operating system survives more demand than it can profitably serve.
Cheaper tokens will not save you
The standard rebuttal is to wait. Goldman Sachs notes that hardware is driving the cost per token down 60 to 70 percent a year. Surely the problem solves itself.
It will not, because that curve is an industry average you do not control, and your bill is set by your own architecture. The same Goldman analysis projects token consumption rising roughly twenty-four times by 2030, and agents are why: a single agentic task expands into planning, retrieval, tool calls, and subagents, each consuming tokens the old single-shot workflow never touched. Whatever headroom the falling cost gives back, an unscheduled agent loop will spend.
Falling unit costs lower the floor. Agent depth raises the ceiling, and it does so per request, in your system, whatever the industry average is doing. The gap between the two is exactly where a scheduler has to live.
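A back-of-the-envelope check on that gap, using only figures already quoted above: EY's four-cent chat versus $1.20 orchestration, and Goldman's 60 to 70 percent annual decline. Combining the two is an assumption of this sketch, as is a constant compounding decline. It is an illustration, not a forecast.

```python
import math


def years_to_offset(expansion: float, annual_price_decline: float) -> float:
    """Years of compounding price decline needed to cancel a cost expansion."""
    return math.log(expansion) / -math.log(1.0 - annual_price_decline)


# EY's illustration: about $0.04 per chat versus $1.20 per orchestration.
expansion = 1.20 / 0.04

years_at_60 = years_to_offset(expansion, 0.60)
years_at_70 = years_to_offset(expansion, 0.70)
```

Under those assumptions, a single move from chat to orchestration consumes about 2.8 years of price decline at 70 percent a year and about 3.7 years at 60 percent. The floor drops every year, but one architectural decision can spend several years of that drop in a single request.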
This is the same shape as every efficiency curve in computing history. Compute got cheaper and we did not do less. We did more, and the systems that won were the ones that scheduled the abundance instead of drowning in it.
Reasoning depth has a budget
There is a deeper principle hiding in the FinOps panic.
An agent's reasoning depth (how many steps it plans, how many tools it calls, how many subagents it spawns, how much context it carries) is not free intelligence. Every increment is purchased with tokens and latency, which means depth is bounded by what the budget can sustain under load, whether or not you ever decided that on purpose.
Reasoning depth cannot exceed what the cost and latency budget can schedule.
Most teams discover that boundary the way you discover any unmanaged limit: at the worst possible time, in production, when traffic spikes and the agent that reasoned brilliantly in the demo starts either timing out or quietly bankrupting the feature. The boundary was always there. The only choice is whether the scheduler enforces it deliberately or the invoice enforces it after the fact.
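The boundary can be computed instead of discovered. The sketch below assumes the simplest possible model: sequential steps, each with the same cost and the same latency. Real agents are messier, and every number here is made up.

```python
def max_schedulable_depth(
    cost_budget_cents: int,
    cost_per_step_cents: int,
    latency_budget_ms: int,
    latency_per_step_ms: int,
) -> int:
    """Largest number of sequential steps that fits both budgets.

    Assumes uniform per-step cost and latency, which is a simplification.
    """
    if cost_per_step_cents <= 0 or latency_per_step_ms <= 0:
        raise ValueError("per-step cost and latency must be positive")
    by_cost = cost_budget_cents // cost_per_step_cents
    by_latency = latency_budget_ms // latency_per_step_ms
    return max(0, min(by_cost, by_latency))


# In the demo: 120 cents and 20 seconds per request, 4 cents and 1.5 s per step.
demo_depth = max_schedulable_depth(120, 4, 20_000, 1_500)

# Under load: same budgets, but per-step latency has doubled.
loaded_depth = max_schedulable_depth(120, 4, 20_000, 3_000)
```

In this example the cost budget would allow 30 steps, but latency caps the demo at 13, and doubled latency under load cuts that to 6. Nothing about the agent changed. The schedulable depth did.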
What this changes
If cost is a scheduling problem, the work moves out of the dashboard and into the runtime.
It means a request carries a budget the way it carries an auth token, and the runtime spends against it the way a long-running service spends against a rate limit. It means admission, priority, and degradation policy are written, tested, and versioned like any other control-plane logic, not improvised per incident. It means cost telemetry feeds a controller that can act in the moment, not just a report that explains the loss next month.
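A request-scoped budget can be as small as the sketch below. The `RequestBudget` and `BudgetExceeded` names, the cent amounts, and the step names are all hypothetical. The point is only that the check happens before the spend, inside the run.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a step would spend past the request's budget."""


@dataclass
class RequestBudget:
    request_id: str
    limit_cents: int
    spent_cents: int = 0

    @property
    def remaining_cents(self) -> int:
        return self.limit_cents - self.spent_cents

    def charge(self, cents: int, step: str) -> None:
        """Record spend for a step, or refuse before the money is gone."""
        if cents > self.remaining_cents:
            raise BudgetExceeded(
                f"{self.request_id}: {step} needs {cents}c, {self.remaining_cents}c left"
            )
        self.spent_cents += cents


# The budget travels with the request, the way an auth token would.
budget = RequestBudget(request_id="req-1", limit_cents=50)
budget.charge(10, "plan")
budget.charge(15, "retrieve")

try:
    budget.charge(40, "subagent fan-out")  # would overshoot the 50-cent limit
    refused = False
except BudgetExceeded:
    refused = True  # the runtime can now degrade, queue, or stop
```

The refused charge leaves the recorded spend untouched at 25 cents, so the controller still has room to choose a cheaper path for the same request.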
FinOps still matters. You need the bill, the attribution, the trend. But the bill is the scoreboard, not the game. The game is played at request time, under contention, by whatever decides what to admit and what to refuse.
Cost is not a number you read after the run.
It is a decision you make during it. Build the scheduler.
Sources
- EY, "Agentic AI: the enterprise token cost" (June 1, 2026): https://www.ey.com/en_us/insights/ai/agentic-ai-token-costs
- ICONIQ, "2026 State of AI: Bi-Annual Snapshot" (January 2026; model inference ~23% of cost at scale, ~52% gross margins): https://www.iconiq.com/growth/reports/2026-state-of-ai-bi-annual-snapshot
- Gartner, "Over 40% of Agentic AI Projects Will Be Canceled by End of 2027" (June 25, 2025): https://www.gartner.com/en/newsroom/press-releases/2025-06-25-gartner-predicts-over-40-percent-of-agentic-ai-projects-will-be-canceled-by-end-of-2027
- Goldman Sachs Research, "AI agents forecast to boost tech cash flow as usage soars" (2026; inference cost/token down 60-70%/yr, consumption ~24x by 2030): https://www.goldmansachs.com/insights/articles/ai-agents-forecast-to-boost-tech-cash-flow-as-usage-soars
- Chen, Zaharia, Zou, "FrugalGPT" (arXiv 2305.05176, 2023): https://arxiv.org/abs/2305.05176
- Ong et al., "RouteLLM" (arXiv 2406.18665, 2024): https://arxiv.org/abs/2406.18665
- HiveMind: OS-inspired scheduling for concurrent LLM agents (arXiv 2604.17111, 2026): https://arxiv.org/abs/2604.17111
- "Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale" (arXiv 2604.06970, 2026; admit/defer/reject on a cost ladder): https://arxiv.org/abs/2604.06970
The figures here are reported by the vendors and analysts cited, not independently verified. Read them as directional.