SRAL: A Framework for Evaluating Agentic AI Architectures

Sharan, Aakash

Distributed Vector Search: How Real Vector Databases Scale Beyond One Machine

November 25, 2025 #AI Infrastructure 7 min read

Why dense search becomes a routing, sharding, and distributed-systems problem

Vector search looks simple when everything fits on one machine.

It becomes a different discipline entirely when you need to serve:

millions to billions of vectors,
across multiple nodes,
with predictable latency,
and high recall,
while RAG or agent pipelines depend on you staying under 100ms end-to-end.

Unlike relational data, vectors can’t be sharded by key.

Unlike document search, they don’t map cleanly to inverted indexes.

And unlike hash tables, you can’t just spread them uniformly and hope locality survives.

Nearest neighbors live in geometry, not in keys — and geometry doesn’t respect your shard boundaries.

This post explains how distributed vector systems actually scale.

Once you understand this, you understand the architecture behind Pinecone, Weaviate, Milvus, Qdrant, Vespa, pgvector clusters — and every serious internal enterprise vector engine.

1. Why Distributed Vector Search Is Harder Than It Looks

Every distributed vector database must answer three questions:

1. How do we split vectors across machines? (Sharding)

2. How do we send each query to the right machines? (Routing)

3. How do we merge partial results correctly? (Aggregation)

The Annoying Trade-Off: Achieving the highest Recall usually requires a wider search (more shards/nodes), which directly conflicts with the need for low Latency and low Cost (less fan-out). Distributed vector search is primarily an optimization problem trying to balance this triangle.

If any part breaks:

recall silently drops,
latency becomes unpredictable,
RAG pipelines hallucinate,
agent behavior drifts,
and costs explode.

Everything starts with sharding.

2. Sharding: How to Divide Vectors Without Destroying Locality

In distributed systems you normally choose a key, hash it, and assign it to a shard.

Vector search makes that impossible.

Why?

Because neighbors aren’t defined by a key — they’re defined by position in high-dimensional space.

Random hashing destroys that.

So modern vector DBs rely on three sharding strategies.

2.1 Centroid-Based Sharding (IVF): The Scalable Default

This is the dominant approach in the industry.

The idea:

Cluster the entire dataset into K coarse clusters.
Assign each cluster to a node or shard.
Put each vector in the shard corresponding to its nearest centroid.

This preserves locality almost perfectly:

vectors that live near each other in space end up on the same node.

Why it works so well:

Locality is preserved → fewer shards per query.
Routing is simple → “send the query to the clusters closest to the query centroid.”
Scaling is linear → want more throughput? increase the number of clusters.

This is why IVF underpins Pinecone pods, Milvus distributed mode, Weaviate’s sharding, and several proprietary internal search engines.

It’s simple, elegant, and incredibly effective.

2.2 Graph-Aware Sharding (HNSW): The Hard Path

HNSW is a global graph — a network of vectors connected by distance-based edges.

Graphs don’t shard cleanly.

If you cut the graph across machines:

long-range links break
neighbor connectivity weakens
navigation steps increase
recall collapses

To avoid that, HNSW systems typically:

use fewer but larger machines,
rely heavily on replication rather than partitioning,
reindex periodically in large batches,
avoid high-rate streaming writes.

That’s the tradeoff: you get microsecond-level search but pay for it in distribution complexity.

Emerging Approaches: Some advanced systems use Hierarchical Sharding, employing IVF for the coarse, top-level routing, and then running HNSW within each resulting shard. This combines the sharding efficiency of IVF with the high-speed local search of HNSW.

2.3 Hybrid Sharding (Metadata + Geometry)

Real products have constraints that pure geometry can’t solve:

tenant separation
compliance + PII boundaries
billing isolation
organizational ownership
domain-based access controls

So enterprises combine keys:

(cluster_id, tenant_id)
(centroid_id, time_bucket)
(domain, cluster_range)

This balances locality and organizational constraints.

Sharding is no longer an IR decision.

It becomes a business + cost + compliance architecture decision.

3. Routing: How Queries Find the Right Machines

Once the data is split, the next problem appears:

Given a query embedding, which shards are worth searching?

You cannot search all shards.

Broadcasting destroys latency and cost.

Three routing strategies dominate.

3.1 Centroid-Based Routing (Most Common and Most Efficient)

The workflow:

Embed the query.
Compare it to the centroid set.
Pick the closest clusters (nprobe).
Route to only those shards.

Benefits:

tiny and predictable fan-out
stable latency
near-linear scalability
excellent recall when tuned correctly

The tunable nprobe value is the primary control knob for balancing latency and recall in IVF-based systems.

3.2 Full Broadcast (The Anti-Pattern)

When systems lack routing metadata — or when developers don’t know any better — they do:

send query → hit all shards → merge everything

This is only tolerable for:

tiny datasets
hobby work
toy deployments

For real workloads, broadcast guarantees:

inflated CPU usage
runaway latency
unpredictable tail behavior

Avoid unless you enjoy on-call rotations.

3.3 Learned Routing (Emerging Trend)

This is the frontier:

ML models that predict the best shard
neural networks that learn routing boundaries
secondary vector indexes over centroids
adaptive fan-out controllers

Early research shows:

lower fan-out
more stable latency
better load distribution

Expect this to reach production over the next 1–2 years, especially in multi-tenant workloads.

4. Replication: Keeping the Cluster Alive Under Failure

Vector databases replicate for:

high availability
fault tolerance
higher read throughput
isolation between workloads

But the replication cost is very different depending on the index.

4.1 HNSW Replication: Heavy and RAM-Hungry

Replicating a graph means copying:

vectors
edges
multi-layer structures

Snapshotting and restoring HNSW can be slow.

Keeping replicas consistent under writes is harder.

Most systems compromise:

active–passive pairs, or
active–active with diluted consistency.

4.2 IVF Replication: Smooth and Cheap

IVF shards are just lists of cluster partitions.

Replication is simply:

copy partition → refresh metadata → done.

Fast and reliable.

This is why IVF is a favorite for distributed, HA deployments.

4.3 PQ Replication: Storage-Heavy, IO-Sensitive

PQ compresses vectors into blocks stored on SSD.

Replication requires copying:

compressed codes
codebooks
SSD block layouts

Less RAM cost, more IO burden.

PQ is not a graph challenge; it’s a storage + wear-leveling challenge.

5. The Write Path: Sharding Vectors as They Arrive

The challenge of distributed search isn't just serving queries; it's efficiently inserting new data. The write path introduces a critical bottleneck:

Global Lookup Requirement: For Centroid-Based (IVF) systems, a newly ingested vector must first be compared against the global centroid index (the small, centralized structure) to determine its correct shard. This global lookup structure must be extremely fast, highly available, and consistent, as it acts as the central routing layer for all incoming writes.
Consistency and Latency: Once the shard is identified, the write must be propagated to all its replicas. For graph-based indices like HNSW, maintaining read/write consistency across replicas during high-volume streaming writes is a significant source of operational complexity and write-side latency.

6. Multi-Tier Storage: RAM → SSD → Object Store

Scaling vector DBs requires understanding that not all vectors need to live in RAM.

Modern systems use tiered storage:

RAM (hot path)

centroids
routing tables
top-level HNSW layers
PQ codebooks
hot vectors

SSD (warm path)

IVF partitions
compressed PQ blocks
overflow segments

Object Storage (cold path)

historical embeddings
archive partitions
old index versions
rebuild artifacts

The key idea:

RAM is for navigation. SSD is for capacity. Object storage is for history.

7. Latency Budgets: The 100ms Constraint

In RAG and agent pipelines, vector search is just one stage.

The entire system must fit inside a 100–150ms window.

Typical budget:

embedding model → 10–50ms
vector search → 5–30ms
reranking → 5–15ms
LLM → 30–80ms

If vector search burns 40ms, the pipeline breaks.

This is why distributed search must:

keep fan-out tiny
minimize inter-node chatter
ensure shard latency consistency
avoid broadcast
use caching and prefetch for PQ

Latency is not an optimization problem — it’s a survival constraint.

Final Takeaway

Distributed vector search is not “just ANN at scale.”

It is:

geometry
sharding strategy
routing correctness
replication mechanics
tiered storage
workload isolation
latency budgeting
operational discipline

The deeper truth: Dense search is a distributed systems problem disguised as a retrieval problem. And ultimately, sharding and routing are your primary defense against spiraling infrastructure costs.

Once you internalize that, index tuning stops being the bottleneck — and architecture takes center stage.