
Query Fan-Out

Send one user query to many shards or services, merge the partial results, and return a ranked response within a latency budget.

intermediate · 3 min read · Capacity · Reliability · Operations · Tradeoffs
Sharding · Tail Latency · Partial Results · Ranking · Backpressure

Concepts Covered

  • Query coordinators
  • Shard requests
  • Partial result merging
  • Tail latency
  • Timeouts
  • Degraded responses
  • Scatter-gather tradeoffs

Definition

Query fan-out is the pattern where one user request is sent to multiple shards or backend services, and the results are merged into one response.

Search systems use query fan-out when the full index is too large for one machine. Each shard owns part of the searchable corpus. A query coordinator sends the query to relevant shards, collects partial matches, merges them, ranks them, and returns the final result set.

The Pain That Forces Query Fan-Out

A single search node can only hold so much index data and serve so many queries. Eventually the corpus size, query traffic, or freshness requirements exceed what one machine can handle.

The system shards the index:

shard A -> documents 0-99M
shard B -> documents 100M-199M
shard C -> documents 200M-299M

Now one query may need results from all three shards. If the coordinator asks only one shard, it misses documents. If it asks every shard without limits, every query becomes expensive.
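The routing side of this can be sketched as a range lookup. This is a minimal illustration of the shard map above; the shard names, boundaries, and function names are hypothetical, not from any real system:

```python
# Range-based shard map matching the layout above (boundaries illustrative).
SHARD_RANGES = [
    ("shard_a", 0, 100_000_000),
    ("shard_b", 100_000_000, 200_000_000),
    ("shard_c", 200_000_000, 300_000_000),
]

def shard_for_doc(doc_id: int) -> str:
    """Route one document id to the shard that owns its range."""
    for name, lo, hi in SHARD_RANGES:
        if lo <= doc_id < hi:
            return name
    raise ValueError(f"doc id {doc_id} is outside all shard ranges")

def shards_for_query() -> list[str]:
    """A term query may match documents anywhere, so it touches every shard."""
    return [name for name, _, _ in SHARD_RANGES]
```

Note the asymmetry: a single document lives on exactly one shard, but a query cannot know in advance which shards hold its matches, which is why it fans out to all of them.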

Query fan-out exists because distributed search must balance completeness, latency, and cost.

Mental Model

Fan-out is scatter-gather:

user query
  -> coordinator
    -> shard 1 returns top 100
    -> shard 2 returns top 100
    -> shard 3 returns top 100
  -> merge and rank
  -> top 20 results to user

Each shard returns its best local candidates. The coordinator combines those candidates into global results.
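The scatter-gather flow above can be sketched with `asyncio`. This is a toy coordinator, not a production implementation: shard work is simulated with sleeps, and the scores are random placeholders:

```python
import asyncio
import random

async def query_shard(shard_id: int, query: str, k: int = 100):
    """Simulated shard: returns its top-k local (doc_id, score) candidates."""
    await asyncio.sleep(random.uniform(0.001, 0.005))  # stand-in for index work
    return [(shard_id * 1000 + i, random.random()) for i in range(k)]

async def fan_out(query: str, num_shards: int = 3, top_n: int = 20):
    # Scatter: one request per shard, issued concurrently.
    partials = await asyncio.gather(
        *(query_shard(s, query) for s in range(num_shards))
    )
    # Gather: flatten local candidates, then rank globally by score.
    merged = [hit for shard_hits in partials for hit in shard_hits]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:top_n]

results = asyncio.run(fan_out("example query"))
```

The key property is that each shard ranks only its own slice; the coordinator's merge step is what turns 3 × 100 local candidates into a global top 20.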

Tail Latency

Fan-out increases sensitivity to slow shards.

If a query fans out to 50 shards, the user waits for many backend calls. Even if most shards are fast, one slow shard can delay the response.

This is tail latency pressure:

overall latency ~= slowest required shard + merge time

Production systems use timeouts, hedged requests, replicas, partial results, and careful shard selection to keep the user experience inside a latency budget.
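One of those techniques, a per-query deadline that accepts whatever has arrived when the budget expires, can be sketched like this. The delays and the budget are artificial, chosen so that one shard models a straggler:

```python
import asyncio

async def query_shard(shard_id: int):
    """Simulated shard; shard 2 is artificially slow to model a straggler."""
    await asyncio.sleep(0.5 if shard_id == 2 else 0.01)
    return (shard_id, [f"doc-{shard_id}-{i}" for i in range(3)])

async def fan_out_with_budget(num_shards: int = 3, budget: float = 0.1):
    tasks = [asyncio.create_task(query_shard(s)) for s in range(num_shards)]
    # Wait until the budget expires, not until the slowest shard answers.
    done, pending = await asyncio.wait(tasks, timeout=budget)
    for task in pending:
        task.cancel()  # give up on shards that missed the deadline
    results = [t.result() for t in done]
    return results, bool(pending)  # (partial results, was-anything-dropped flag)

results, partial = asyncio.run(fan_out_with_budget())
```

With this structure, overall latency is capped by the budget instead of by the slowest shard, at the cost of sometimes returning partial results, which is exactly the trade the next section discusses.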

Partial Results

Sometimes the system must choose between a complete slow answer and a partial fast answer.

For a social search product, it may be acceptable to return high-quality results from most shards if one shard misses a tight deadline. For legal, financial, or compliance search, that may not be acceptable.

The product decides the tolerance. The infrastructure must make the choice explicit.
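Making the choice explicit can be as simple as carrying a completeness flag through the response and letting a per-product policy decide what to do with it. A minimal sketch; `allow_partial`, `finalize`, and the error type are illustrative names, not a real API:

```python
class PartialResultError(Exception):
    """Raised when a product that requires completeness got a partial answer."""

def finalize(results, shards_answered: int, shards_asked: int,
             allow_partial: bool) -> dict:
    complete = shards_answered == shards_asked
    if not complete and not allow_partial:
        # Compliance-style products: refuse rather than silently omit matches.
        raise PartialResultError(
            f"only {shards_answered}/{shards_asked} shards answered"
        )
    # Social-search-style products: serve what we have, but say so.
    return {"results": results, "complete": complete}
```

The point is not the mechanism but the contract: callers can always tell whether they got a complete answer, and the tolerance for incompleteness is a declared setting rather than an accident of timeouts.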

Operational Reality

Important signals:

  • shard latency distribution
  • number of shards touched per query
  • timeout rate
  • partial result rate
  • coordinator CPU
  • merge latency
  • hot shard frequency
  • cache hit rate for common queries
