Sharding

Split data across multiple partitions or databases so a system can scale beyond one machine or one storage node.


Concepts Covered

  • Shards
  • Partition keys
  • Horizontal scaling
  • Query routing
  • Hot partitions
  • Cross-shard queries
  • Resharding
  • Operational complexity

Definition

Sharding means splitting data across multiple storage partitions so each partition owns only part of the dataset.

Instead of one database storing every row, the system divides ownership:

shard 1: users 0 - 9,999,999
shard 2: users 10,000,000 - 19,999,999
shard 3: users 20,000,000 - 29,999,999

Real systems often use hashing, ranges, tenant IDs, geography, or custom routing rules. The goal is the same: no single database node has to store or serve everything.
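The range scheme above can be sketched as a simple lookup. The shard names and boundaries here mirror the example and are illustrative, not taken from any real deployment:

```python
# Illustrative range-based shard lookup; boundaries mirror the
# example above, not a real deployment.
RANGES = [
    (0, 9_999_999, "shard_1"),
    (10_000_000, 19_999_999, "shard_2"),
    (20_000_000, 29_999_999, "shard_3"),
]

def shard_for_user(user_id: int) -> str:
    """Return the shard that owns this user_id's range."""
    for low, high, shard in RANGES:
        if low <= user_id <= high:
            return shard
    raise KeyError(f"no shard owns user_id {user_id}")
```

Hash-based schemes replace the range table with a hash function, but the contract is the same: every key maps to exactly one owner.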

The Pain That Forces Sharding

Before sharding, teams usually try simpler tools:

  • better indexes
  • query tuning
  • caching
  • read replicas
  • bigger machines
  • moving analytics away from the primary database

Sharding appears when one storage node is no longer enough.

The pressure can come from several directions:

  • table too large
  • write throughput too high
  • indexes too big
  • single-node storage exhausted
  • connection limits reached
  • maintenance windows too risky
  • blast radius too large

At that point, the system has a physical problem: one machine can no longer carry the dataset and the traffic hitting it.

Sharding turns one large storage problem into many smaller storage problems.

Mental Model

A shard is an ownership boundary.

The system needs a rule that answers:

Which shard owns this piece of data?

That rule is usually based on a partition key.

For a URL shortener, short_code can be a natural partition key because redirect requests look up by short code.

For a messaging system, possible partition keys include:

  • conversation_id
  • user_id
  • tenant_id
  • region

The right key depends on the access pattern. A shard key is not just a storage decision. It determines which queries are easy, which queries are expensive, and where hot spots appear.

Partition Key Example

Suppose chat messages are sharded by conversation_id.

hash(conversation_id) -> shard

This is good for loading one conversation because all messages for that conversation can live together.

But if one group chat becomes extremely active, that one conversation can overload one shard. The total system may have spare capacity, but the hot conversation still maps to one place.
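A minimal sketch of why this happens, assuming a four-shard layout (the shard count and conversation name are made up): every message keyed by the same conversation_id maps to the same shard, no matter how busy it gets.

```python
import hashlib

NUM_SHARDS = 4  # assumed shard count, purely illustrative

def shard_for(conversation_id: str) -> int:
    # Use a stable hash; Python's built-in hash() is salted per
    # process, so two app servers would disagree on routing.
    digest = hashlib.sha256(conversation_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Every message in the hot conversation lands on one shard,
# no matter how much spare capacity the other shards have.
hot_shard = {shard_for("group-chat-party") for _ in range(1_000)}
assert len(hot_shard) == 1
```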

Now compare sharding by message_id.

That spreads writes better, but reading one conversation may require querying many shards.

This is the core sharding tradeoff:

spread load evenly
versus
keep related data together

Query Routing

Once data is sharded, the application must route requests to the right shard.

request for user_42
  -> compute shard from user_id
  -> connect to shard_7
  -> run query

This routing may live in application code, a database proxy, a routing service, or the database system itself.
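The routing steps above can be sketched as application-side code. The shard count, DSN format, and function names here are hypothetical:

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count

# Hypothetical connection strings, one per shard.
SHARD_DSNS = {
    i: f"postgres://db-shard-{i}.internal/app" for i in range(NUM_SHARDS)
}

def shard_id(user_id: str) -> int:
    """Compute the owning shard from the partition key."""
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

def route(user_id: str) -> str:
    """Return the DSN to connect to for this user's queries."""
    return SHARD_DSNS[shard_id(user_id)]
```

Note that every caller must agree on the hash function and shard count; a single server computing routing differently produces exactly the wrong-shard bugs described below.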

Routing bugs are serious. A request sent to the wrong shard may look like missing data. A write sent to the wrong shard can create corrupted ownership.

Cross-Shard Operations

Sharding makes single-shard operations faster and more scalable, but cross-shard operations become harder.

Examples:

  • joining data across shards
  • counting all rows globally
  • moving users between shards
  • enforcing uniqueness across the whole system
  • running transactions that touch multiple shards

A query that used to be one database operation may become a fan-out query:

query shard_1
query shard_2
query shard_3
merge results

This adds latency, partial failure, and more application logic.
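The fan-out above can be sketched like this. The per-shard query is faked with hard-coded counts to keep the sketch self-contained; a real version would run SQL against each shard and also handle per-shard failures:

```python
from concurrent.futures import ThreadPoolExecutor

def count_rows(shard: str) -> int:
    # Stand-in for a real per-shard COUNT query; hard-coded
    # results keep the sketch self-contained.
    return {"shard_1": 120, "shard_2": 95, "shard_3": 143}[shard]

def global_count(shards: list[str]) -> int:
    """Fan the same query out to every shard, then merge the partials."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(count_rows, shards))
    return sum(partials)  # the merge step the application now owns

total = global_count(["shard_1", "shard_2", "shard_3"])  # 120 + 95 + 143 = 358
```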

Resharding

Resharding means changing how data is distributed.

This becomes necessary when:

  • shards grow unevenly
  • traffic distribution changes
  • a partition key was chosen poorly
  • the system needs more shards
  • customers or regions need to move

Resharding is painful because live data must move while the product continues serving reads and writes. The system may need dual writes, backfills, routing cutovers, consistency checks, and rollback plans.
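One concrete reason resharding moves so much data: with naive hash(key) % N placement, changing the shard count remaps most keys. A small illustrative sketch (the key set and shard counts are arbitrary):

```python
import hashlib

def shard(key: str, num_shards: int) -> int:
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

keys = [f"user_{i}" for i in range(10_000)]
# Growing from 8 to 9 shards with modulo placement remaps roughly
# 8/9 of all keys; consistent hashing would move only about 1/9.
moved = sum(1 for k in keys if shard(k, 8) != shard(k, 9))
print(f"{moved / len(keys):.0%} of keys change shards")
```

This is one reason many systems use consistent hashing or a fixed number of virtual partitions that can be moved between nodes without rehashing keys.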

This is why teams delay sharding until they need it. It solves real scale problems, but it raises the operational floor permanently.

Operational Reality

Important signals:

  • per-shard storage size
  • per-shard CPU and memory
  • per-shard p95 and p99 latency
  • hot partitions
  • cross-shard query volume
  • failed routing attempts
  • data movement progress
  • replication lag per shard

Sharding gives horizontal scale, but the cost is complexity. The database is no longer one place. It becomes a fleet of ownership boundaries that the application must understand.
