SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

Backpressure: The Honest Answer to a Fast Producer and a Slow Consumer

2023-06-18 10 min read

A valve metering flow between a fast wide pipe and a slow narrow one, showing backpressure governing producer and consumer.

Every streaming system has a fast end and a slow end.

A producer reads from a socket, a database, a sensor, another service. A consumer writes to disk, calls an API, runs a model, renders a frame. The two are almost never matched in speed, and they have no reason to be. The producer does not know how busy the consumer is. The consumer does not get to choose how fast the world sends it work.

So one of them is always faster than the other. The only question is what happens to the difference.

Backpressure is the answer to that question, and most systems answer it by accident.

The accidental answer is a buffer. You put a queue between the fast end and the slow end, and for a while it works. The producer fills the queue, the consumer drains it, and the mismatch is hidden inside the slack. The trouble is that the queue is hiding the problem, not solving it, and the day the producer stays faster than the consumer for long enough, the slack runs out and the system finds out what it actually decided.

What an unbounded queue actually costs

The seductive fix, when a buffer fills, is to make the buffer bigger. It is the wrong instinct, and it is worth being precise about why.

Start with latency. A queue does not make a slow consumer faster. It makes the slow consumer's backlog longer. Little's Law is the unavoidable statement of this: the average number of items in a system equals the arrival rate times the average time each item spends inside (L = λW), so for a given arrival rate a longer backlog means a longer wait. Hold the arrival rate above the service rate and the system never settles into that steady state at all. The backlog grows by the difference between the two rates for as long as the overload lasts, and an item that joins a backlog of B items waits roughly B divided by the service rate before it is served. A bigger buffer does not reduce latency under sustained overload. It guarantees more of it. You are not absorbing a spike anymore. You are storing a permanent backlog and charging every item for the privilege of sitting in it.
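To put numbers on that, here is a small fluid approximation in Python. It is a sketch under simplifying assumptions (constant arrival and service rates, one step per second, no bursts), and the function names are hypothetical. The first function is Little's Law for a system in steady state; the second shows what happens when there is no steady state because arrivals outpace service.

```python
def littles_law_occupancy(arrival_rate: float, avg_time_in_system: float) -> float:
    """L = lambda * W: average number of items in a stable system."""
    return arrival_rate * avg_time_in_system


def simulate_backlog(arrival_rate: float, service_rate: float, seconds: int):
    """Fluid model of a FIFO queue: each second, `arrival_rate` items arrive
    and up to `service_rate` items are served. Returns a list of
    (second, backlog, wait_for_newest_item_in_seconds)."""
    backlog = 0.0
    history = []
    for t in range(1, seconds + 1):
        # The backlog grows by the rate difference and never goes negative.
        backlog = max(0.0, backlog + arrival_rate - service_rate)
        # The newest item waits behind the whole backlog, drained at service_rate.
        wait = backlog / service_rate
        history.append((t, backlog, wait))
    return history


if __name__ == "__main__":
    # Stable: 100 items/s, each spending 50 ms inside, means about 5 items in the system.
    print(littles_law_occupancy(100, 0.05))
    # Overloaded by 20%: 120 items/s arriving, 100 items/s served.
    for t, backlog, wait in simulate_backlog(120, 100, 60)[9::10]:
        print(f"t={t:>2}s backlog={backlog:>6.0f} items wait={wait:>5.1f}s")
```

With a 20 percent overload the backlog climbs by 200 items every 10 seconds and the wait climbs by 2 seconds with it. A buffer twice as large only lets this continue twice as long before memory runs out.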

Then there is memory. An unbounded queue is an unbounded memory commitment handed to whoever is fastest, which is exactly the party you least want holding that power. Under sustained mismatch the queue does not plateau. It climbs until the process runs out of heap and dies. An out-of-memory crash is not a graceful degradation. It takes the in-flight work with it, including the items that were already safely buffered.

And then the failure spreads. A queue that grows until it kills its process does not fail in isolation. The upstream that was feeding it now blocks or errors or starts queuing on its own. The pattern repeats one hop back, and one more, and a localized slowdown becomes a correlated outage across services that never touched the slow consumer directly. The bigger the buffers, the longer this stays invisible, and the more violent it is when it finally gives.

This is the real cost of "just add a bigger buffer." It converts a visible, recoverable slowdown into a delayed, catastrophic failure. The buffer did not remove the mismatch. It only postponed the reckoning and made it worse.

Push, pull, and who is in control

To see the alternative clearly, look at who decides the rate.

In a push model, the producer is in control. It emits as fast as it can and the consumer copes. This is efficient when the consumer is reliably faster, and it is exactly the model that fails under mismatch, because nothing in the protocol lets the consumer say "slow down." The producer's speed is the system's speed, whether or not the consumer can keep up.

In a pull model, the consumer is in control. It asks for the next item only when it is ready, and the producer waits to be asked. This is naturally safe, because the consumer never receives more than it requested. The classic objection is overhead: a strict request-one, get-one loop spends a round trip per item and starves the pipeline. A consumer that can only ask for one thing at a time leaves the producer idle while the request travels.

The resolution is not to pick a side. It is to let the consumer pull in bulk while the producer pushes within that allowance. The consumer signals how many items it is prepared to handle. The producer is then free to send up to that number, as fast as it likes, without asking again. The consumer is in control of the rate, the producer is in control of the timing, and the channel stays full without ever overflowing.
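The bulk-pull idea fits in a few lines. The sketch below is a hypothetical single-threaded model, not any library's API: the consumer grants credit in batches, the producer sends as fast as it likes while credit remains, and the counters show how batching cuts the number of demand signals without ever letting more than one batch pile up on the consumer's side. For simplicity the consumer only asks again once it has drained everything; a real implementation would usually re-request earlier to keep the channel full.

```python
from collections import deque


class CreditChannel:
    """Hypothetical channel where the consumer grants credit and the
    producer may only send while credit remains."""

    def __init__(self):
        self.credit = 0          # items the producer is still allowed to send
        self.inbox = deque()     # items delivered but not yet consumed
        self.demand_signals = 0  # how many times the consumer asked for more
        self.peak_inbox = 0      # largest backlog the consumer ever held

    def grant(self, n: int) -> None:
        if n <= 0:
            raise ValueError("credit must be positive")
        self.credit += n
        self.demand_signals += 1

    def try_send(self, item) -> bool:
        if self.credit == 0:
            return False         # no demand, no data: the producer must wait
        self.credit -= 1
        self.inbox.append(item)
        self.peak_inbox = max(self.peak_inbox, len(self.inbox))
        return True


def run(total_items: int, batch: int):
    """Move `total_items` through the channel, asking for `batch` at a time."""
    channel = CreditChannel()
    source = iter(range(total_items))
    pending = next(source, None)
    consumed = []
    while pending is not None or channel.inbox:
        if not channel.inbox and channel.credit == 0:
            channel.grant(batch)            # consumer is ready for a batch
        # Producer pushes as fast as it likes, but only within the credit.
        while pending is not None and channel.try_send(pending):
            pending = next(source, None)
        # Consumer drains what it asked for.
        while channel.inbox:
            consumed.append(channel.inbox.popleft())
    return consumed, channel.demand_signals, channel.peak_inbox
```

Moving 1,000 items with `batch=1` costs 1,000 demand signals; with `batch=64` it costs 16, and in both cases the consumer never holds more than one batch.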

That is the whole idea of demand-driven flow control, and it is older than any of the libraries that implement it. TCP has worked this way for decades. The receiver advertises a window: the number of bytes it is currently willing to accept beyond what it has already acknowledged, which tracks the free space in its receive buffer. The sender may have at most that many unacknowledged bytes outstanding and may send no further until the receiver advertises room for more. The fast end is bounded by the slow end's stated capacity, continuously, as a property of the protocol rather than as something the application layer remembers to do. Backpressure is not a new idea. It is the oldest correct idea in flow control, rediscovered every time someone builds a pipeline without it.
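The receive-window mechanism can be modelled in a few lines. This is a toy, not TCP: it assumes zero latency, instant acknowledgements, no loss, no window scaling and no zero-window probing, and the per-tick byte counts are arbitrary. It keeps only the rule that matters here: the sender never sends more than the window the receiver last advertised, and that window is the free space in the receive buffer.

```python
def simulate_receive_window(total_bytes: int, buffer_size: int,
                            send_rate: int, read_rate: int, ticks: int):
    """Toy model of TCP receive-window flow control with zero latency.
    Returns (bytes_sent, bytes_read_by_app, peak_bytes_buffered)."""
    sent = 0                  # bytes transmitted (and instantly acknowledged)
    read = 0                  # bytes the receiving application has consumed
    buffered = 0              # bytes sitting in the receive buffer
    advertised = buffer_size  # window carried by the most recent ACK
    peak = 0
    for _ in range(ticks):
        # The sender is fast, but never exceeds the advertised window.
        chunk = min(send_rate, total_bytes - sent, advertised)
        sent += chunk
        buffered += chunk
        peak = max(peak, buffered)
        # The receiving application is slow and drains at read_rate.
        taken = min(read_rate, buffered)
        buffered -= taken
        read += taken
        # The next ACK advertises whatever buffer space is now free.
        advertised = buffer_size - buffered
    return sent, read, peak


if __name__ == "__main__":
    # The sender could push 50,000 bytes per tick; the app reads only 10,000.
    print(simulate_receive_window(1_000_000, 64_000, 50_000, 10_000, ticks=10))
```

After the first two ticks fill the 64,000-byte buffer, the sender is held to 10,000 bytes per tick, exactly the reader's pace, and the buffer never grows past its stated size.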

The Reactive Streams contract

Reactive Streams took that principle and wrote it down as a contract between a producer and a consumer, independent of any one library. The standard reached a stable 1.0 in 2015, and its center is a single inversion: demand flows upstream before data flows downstream.

A subscriber does not receive elements because a publisher decided to send them. It receives elements because it requested them. It calls request(n) on its Subscription, and that n is a hard ceiling: the publisher may never emit more elements in total than the subscriber has requested in total, and once that outstanding demand is used up it must stop and wait for the next request. Demand is the currency. No demand, no data. The fast end cannot outrun the slow end, because the slow end is the one issuing the permits.
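To make the demand rule concrete, here is a minimal synchronous sketch in Python. It is a hypothetical port that borrows the specification's vocabulary (on_subscribe, on_next, on_complete, on_error, request, cancel) in snake_case; it is not the JVM org.reactivestreams interfaces, and it ignores most of the specification's threading and signalling rules. What it does show is the core invariant: the publisher emits only against outstanding demand, and a re-entrancy guard keeps a subscriber that calls request from inside on_next from recursing without bound.

```python
class IterablePublisher:
    """Hypothetical publisher that emits the items of an iterable on demand."""

    def __init__(self, items):
        self._items = items

    def subscribe(self, subscriber):
        subscriber.on_subscribe(_IterableSubscription(iter(self._items), subscriber))


class _IterableSubscription:
    def __init__(self, iterator, subscriber):
        self._it = iterator
        self._sub = subscriber
        self._demand = 0         # requested but not yet delivered
        self._emitting = False   # re-entrancy guard for request() inside on_next()
        self._done = False

    def request(self, n):
        if self._done:
            return
        if n <= 0:               # the spec treats non-positive demand as an error
            self._done = True
            self._sub.on_error(ValueError("request(n) requires n > 0"))
            return
        self._demand += n
        if self._emitting:
            return               # the running loop below will use the new demand
        self._emitting = True
        try:
            # Emit only while there is outstanding demand: no demand, no data.
            while self._demand > 0 and not self._done:
                try:
                    item = next(self._it)
                except StopIteration:
                    self._done = True
                    self._sub.on_complete()
                    break
                self._demand -= 1
                self._sub.on_next(item)
        finally:
            self._emitting = False

    def cancel(self):
        self._done = True


class BatchingSubscriber:
    """Requests `batch` items up front and again after each full batch."""

    def __init__(self, batch):
        self.batch = batch
        self.requested = 0
        self.received = []
        self.completed = False
        self.error = None

    def on_subscribe(self, subscription):
        self.subscription = subscription
        self._ask()

    def _ask(self):
        self.requested += self.batch
        self.subscription.request(self.batch)

    def on_next(self, item):
        assert len(self.received) < self.requested, "publisher exceeded demand"
        self.received.append(item)
        if len(self.received) % self.batch == 0:
            self._ask()          # batch processed: send more demand upstream

    def on_complete(self):
        self.completed = True

    def on_error(self, error):
        self.error = error


if __name__ == "__main__":
    sub = BatchingSubscriber(batch=2)
    IterablePublisher(range(5)).subscribe(sub)
    print(sub.received, sub.completed)
```

A subscriber that requests three items and never asks again receives exactly three, however many the publisher has available.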

This is the part that matters architecturally, and it is worth stating plainly. Backpressure in this model is not a feature you switch on when a queue fills. It is the default state of the channel. The consumer's capacity is expressed continuously, as demand, and the producer is bounded by it at every step. There is no buffer quietly absorbing the mismatch and no moment where the slack runs out, because there is no slack pretending the mismatch is not there.

In Akka Streams, every stage of a pipeline obeys this contract end to end. A Source does not get to flood a slow Sink. Demand propagates backward through every operator between them, so the slowest stage sets the pace for the whole graph automatically, the way the slowest section of a pipe sets the flow of water through all of it. When a fast source meets a slow sink, you do not get a memory leak. You get a source that runs no faster than bounded demand and configured buffers allow. The mismatch is still there. It is simply being governed instead of buffered.
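Akka Streams itself is Scala and Java, so the following is only an analogy. Plain Python generators are also pull-driven: no stage runs until the stage after it asks for the next item. The sketch assumes a single thread and no asynchronous boundaries, and every name in it is hypothetical. It wires an infinite source through an optional bounded prefetch stage and a map, takes five items at the sink, and counts how many items the source actually produced.

```python
import itertools
from collections import deque


def counting_source(stats):
    """Infinite source; records in stats['produced'] how many items it emitted."""
    for i in itertools.count():
        stats["produced"] += 1
        yield i


def prefetch(upstream, size):
    """Bounded buffer stage: pulls ahead of downstream demand, never past `size`."""
    buffer = deque()
    while True:
        while len(buffer) < size:
            try:
                buffer.append(next(upstream))
            except StopIteration:
                break
        if not buffer:
            return
        yield buffer.popleft()


def run_pipeline(take, buffer_size=None):
    stats = {"produced": 0}
    stream = counting_source(stats)
    if buffer_size is not None:
        stream = prefetch(stream, buffer_size)
    stream = (x * 2 for x in stream)              # a map stage
    taken = [next(stream) for _ in range(take)]   # the sink's demand
    return taken, stats["produced"]


if __name__ == "__main__":
    print(run_pipeline(5))                 # the source runs only as far as the sink pulls
    print(run_pipeline(5, buffer_size=8))  # plus a bounded amount of prefetch
```

Without the buffer, the infinite source produces exactly the five items the sink pulled. With an eight-slot prefetch it produces twelve, seven ahead of demand. Either way the overshoot is bounded by the buffer's size, not by how fast the source could go.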

The same shape shows up far outside the reactive libraries. A Kafka consumer is in control of how fast it reads, because it drives consumption through its poll loop, calling poll only when it is ready for more and shaping how much each fetch returns. The broker does not push records at a pace the consumer cannot sustain. The poll loop, fetch sizing, and pause and resume are the flow-control levers, while the committed offset simply tracks position rather than throttling the read. Lag is then the visible, measurable backlog you would otherwise have hidden inside an unbounded in-memory queue. Different mechanism, same architectural truth: the consumer owns the rate, and the backlog is made observable rather than fatal.
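The pause-and-resume lever can be simulated without a broker. This sketch is a hypothetical model, not the KafkaConsumer API: one partition, one poll per tick, and made-up rates. In the real Java client the analogous calls are pause() and resume() on a collection of TopicPartitions, poll() keeps being called while paused (it returns no records for paused partitions), and max.poll.records caps each batch. The model makes the split visible: the consumer's in-memory backlog stays below the high-water mark plus one poll's worth of records, while the overload shows up as lag, the gap between the log-end offset and the committed offset.

```python
from collections import deque


def simulate_paused_consumer(produce_per_tick, max_poll_records, process_per_tick,
                             high_water, low_water, ticks):
    """Hypothetical one-partition model of a poll loop with pause/resume.
    Returns (peak_local_backlog, final_lag)."""
    log_end = 0       # broker side: offset the next produced record will get
    position = 0      # consumer side: next offset to fetch
    committed = 0     # offset after the last processed record
    local = deque()   # offsets fetched but not yet processed
    paused = False
    peak_local = 0
    for _ in range(ticks):
        log_end += produce_per_tick          # producers keep appending regardless
        # poll() is still called while paused; it just returns nothing.
        if not paused:
            n = min(max_poll_records, log_end - position)
            local.extend(range(position, position + n))
            position += n
        peak_local = max(peak_local, len(local))
        # Slow processing, then commit the offset after the last processed record.
        for _ in range(min(process_per_tick, len(local))):
            committed = local.popleft() + 1
        # Flow control with hysteresis: pause when full, resume when drained.
        if not paused and len(local) >= high_water:
            paused = True
        elif paused and len(local) <= low_water:
            paused = False
    return peak_local, log_end - committed


if __name__ == "__main__":
    peak, lag = simulate_paused_consumer(produce_per_tick=100, max_poll_records=500,
                                         process_per_tick=30, high_water=200,
                                         low_water=50, ticks=100)
    print(f"peak in-memory backlog: {peak} records, lag: {lag} records")
```

With these rates the consumer never holds more than 550 records in memory, while lag climbs to 7,000 records and sits on the broker where it can be measured and alerted on.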

Bounded buffering is the discipline, not the absence of buffering

None of this means buffers are forbidden. A buffer between two stages is genuinely useful. It absorbs short bursts, it decouples timing so a brief hiccup on one side does not immediately stall the other, and it lets both ends run smoothly across small variations in speed.

The discipline is that the buffer must be bounded, and you must decide in advance what happens when it is full. That decision is the actual architecture. Block the producer until there is room, which propagates backpressure upstream and keeps the system honest. Drop the oldest, or the newest, or the least important, when the data is more useful fresh than complete. Fail fast and shed load, when stale work is worth nothing and you would rather reject than rot. Each of these is a legitimate answer, and which one is correct depends entirely on what the data is for.
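Here is what deciding in advance can look like. The sketch is a hypothetical thread-safe bounded buffer in Python with the overflow policy fixed at construction time: block the producer, drop the oldest buffered item, drop the newest (incoming) item, or fail fast. Dropping the least important item would need a priority on each item and is left out. The class, enum and exception names are illustrative, not taken from any library.

```python
import threading
from collections import deque
from enum import Enum


class Overflow(Enum):
    BLOCK = "block"              # make the producer wait: backpressure upstream
    DROP_OLDEST = "drop_oldest"  # keep fresh data, discard the stalest item
    DROP_NEWEST = "drop_newest"  # keep what is queued, discard the incoming item
    FAIL = "fail"                # shed load by rejecting the incoming item


class BufferFull(Exception):
    """Raised when a FAIL buffer is full or a BLOCK put times out."""


class BoundedBuffer:
    def __init__(self, capacity: int, policy: Overflow):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items = deque()
        self._capacity = capacity
        self._policy = policy
        self._cond = threading.Condition()
        self.dropped = 0

    def put(self, item, timeout=None) -> bool:
        """Add item; returns False only when DROP_NEWEST discards it."""
        with self._cond:
            if len(self._items) >= self._capacity:
                if self._policy is Overflow.BLOCK:
                    has_room = self._cond.wait_for(
                        lambda: len(self._items) < self._capacity, timeout)
                    if not has_room:
                        raise BufferFull("timed out waiting for room")
                elif self._policy is Overflow.DROP_OLDEST:
                    self._items.popleft()
                    self.dropped += 1
                elif self._policy is Overflow.DROP_NEWEST:
                    self.dropped += 1
                    return False
                else:
                    raise BufferFull("buffer full; rejecting new work")
            self._items.append(item)
            self._cond.notify_all()   # wake any consumer waiting in get()
            return True

    def get(self, timeout=None):
        """Remove and return the oldest item, waiting up to `timeout` seconds."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                raise TimeoutError("buffer stayed empty")
            item = self._items.popleft()
            self._cond.notify_all()   # wake any producer blocked in put()
            return item
```

The policy is a constructor argument on purpose: it forces whoever creates the buffer to state what happens when it is full, which is exactly the decision an unbounded queue lets you skip.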

What is never a legitimate answer is "the buffer is unbounded, so the question never comes up." That is not a decision. It is the deferral of a decision, paid back later with interest, in production, at the worst possible time. An unbounded buffer is a promise that the future will be kind. It rarely is.

The shift in thinking is from buffering as a hiding place to buffering as a controlled, finite shock absorber with an explicit overflow policy. The bound is not a limitation you tolerate. It is the thing that makes the system's behavior under stress something you chose rather than something you discover.

The takeaway

Backpressure is not an optimization you add when a queue overflows. It is the acknowledgment that a fast producer and a slow consumer are the normal case, and that the difference between them has to go somewhere visible and bounded, not somewhere hidden and infinite.

The system that pretends the mismatch away with a bigger buffer is not more capable. It is more confident, right up until the heap fills. The system that lets the consumer signal its capacity, and bounds the producer to it, is slower to impress and far harder to kill.

A buffer hides the mismatch. Backpressure governs it. Only one of them is still standing when the load does not relent.

Sources

The Reactive Manifesto, v2.0 (reactivemanifesto.org, 2014): https://www.reactivemanifesto.org/
Reactive Streams specification, 1.0.0 for the JVM (reactive-streams.org, 2015): https://www.reactive-streams.org/
Akka Streams documentation, "Buffers and working with rate" and the back-pressure model (Akka core): https://doc.akka.io/libraries/akka-core/current/stream/stream-flows-and-basics.html
J. D. C. Little, "A Proof for the Queuing Formula: L = λW" (Operations Research, 1961).
Jon Postel (ed.), "Transmission Control Protocol," RFC 793 (1981), on the receive window and flow control: https://www.rfc-editor.org/rfc/rfc793
Jay Kreps, "The Log: What every software engineer should know about real-time data's unifying abstraction" (LinkedIn Engineering, 2013): https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
Apache Kafka, KafkaConsumer Javadoc, on the poll loop, position, and pause/resume flow control: https://kafka.apache.org/34/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html