Concepts
Sharding
Split data across multiple partitions or databases so a system can scale beyond one machine or one storage node.
Concepts Covered
- Shards
- Partition keys
- Horizontal scaling
- Query routing
- Hot partitions
- Cross-shard queries
- Resharding
- Operational complexity
Definition
Sharding means splitting data across multiple storage partitions so each partition owns only part of the dataset.
Instead of one database storing every row, the system divides ownership:
shard 1: users 0 - 9,999,999
shard 2: users 10,000,000 - 19,999,999
shard 3: users 20,000,000 - 29,999,999
Real systems often use hashing, ranges, tenant IDs, geography, or custom routing rules. The goal is the same: no single database node has to store or serve everything.
The Pain That Forces Sharding
Before sharding, teams usually try simpler tools:
- better indexes
- query tuning
- caching
- read replicas
- bigger machines
- moving analytics away from the primary database
Sharding appears when one storage node is no longer enough.
The pressure can come from several directions:
table too large
write throughput too high
indexes too big
single-node storage exhausted
connection limits reached
maintenance windows too risky
blast radius too large
At that point, the system has a physical problem: one machine cannot keep carrying the dataset and traffic shape.
Sharding turns one large storage problem into many smaller storage problems.
Mental Model
A shard is an ownership boundary.
The system needs a rule that answers:
Which shard owns this piece of data?
That rule is usually based on a partition key.
For a URL shortener, short_code can be a natural partition key because redirect requests look up by short code.
For a messaging system, possible partition keys include:
conversation_iduser_idtenant_idregion
The right key depends on the access pattern. A shard key is not just a storage decision. It determines which queries are easy, which queries are expensive, and where hot spots appear.
Partition Key Example
Suppose chat messages are sharded by conversation_id.
hash(conversation_id) -> shard
This is good for loading one conversation because all messages for that conversation can live together.
But if one group chat becomes extremely active, that one conversation can overload one shard. The total system may have spare capacity, but the hot conversation still maps to one place.
Now compare sharding by message_id.
That spreads writes better, but reading one conversation may require querying many shards.
This is the core sharding tradeoff:
spread load evenly
versus
keep related data together
Query Routing
Once data is sharded, the application must route requests to the right shard.
request for user_42
-> compute shard from user_id
-> connect to shard_7
-> run query
This routing may live in application code, a database proxy, a routing service, or the database system itself.
Routing bugs are serious. A request sent to the wrong shard may look like missing data. A write sent to the wrong shard can create corrupted ownership.
Cross-Shard Operations
Sharding makes single-shard operations faster and more scalable, but cross-shard operations become harder.
Examples:
- joining data across shards
- counting all rows globally
- moving users between shards
- enforcing uniqueness across the whole system
- running transactions that touch multiple shards
A query that used to be one database operation may become a fan-out query:
query shard_1
query shard_2
query shard_3
merge results
This adds latency, partial failure, and more application logic.
Resharding
Resharding means changing how data is distributed.
This becomes necessary when:
- shards grow unevenly
- traffic distribution changes
- a partition key was chosen poorly
- the system needs more shards
- customers or regions need to move
Resharding is painful because live data must move while the product continues serving reads and writes. The system may need dual writes, backfills, routing cutovers, consistency checks, and rollback plans.
This is why teams delay sharding until they need it. It solves real scale problems, but it raises the operational floor permanently.
Operational Reality
Important signals:
- per-shard storage size
- per-shard CPU and memory
- per-shard p95 and p99 latency
- hot partitions
- cross-shard query volume
- failed routing attempts
- data movement progress
- replication lag per shard
Sharding gives horizontal scale, but the cost is complexity. The database is no longer one place. It becomes a fleet of ownership boundaries that the application must understand.
Related Topics
Knowledge links
Use these links to understand what to know first, where this idea appears, and what to study next.
Prerequisites
Read these first if this topic feels unfamiliar.
Used In Systems
System studies where this idea appears in context.
Related Concepts
Core ideas that connect to this topic.