Hot Key Mitigation
Techniques for handling keys or records that receive disproportionate traffic and threaten to overload a partition, cache node, or database row.
Concepts Covered
- Hot keys
- Hot partitions
- Celebrity problem
- Load imbalance
- Sharded counters
- Cache pressure
- Request coalescing
- Adaptive mitigation
Definition
A hot key is a key that receives far more traffic than most other keys.
Examples:
- a viral short URL
- a celebrity post receiving likes
- a product page during a launch
- a globally shared configuration key
- a cache key used by almost every request
Hot key mitigation means changing the system so one extremely popular key does not overload one cache node, database row, shard, queue partition, or worker group.
The Pain That Forces Hot Key Mitigation
Distributed systems often scale by spreading work across many keys.
That assumption breaks when real product traffic becomes uneven.
Imagine an Instagram-like system where every like on a viral post updates one counter:
post_id = 42
like_count = like_count + 1
At small scale, this is fine. At viral scale, thousands of writes per second may target the same post_id.
The system may have hundreds of database nodes, but the hot post still maps to one row, one partition, or one lock. Average traffic looks healthy, yet one part of the system is overloaded.
This is the hot key problem: the cluster has capacity, but the key placement concentrates pressure.
Mental Model
Horizontal scaling helps when load is spread.
Hot keys are dangerous because they collapse horizontal scale back into a single point of pressure.
normal traffic:
many keys -> many partitions
hot key traffic:
one key -> one partition -> overload
A hot key is not always a bug. It can be a product success signal: a celebrity post, a viral link, a launch event, or breaking news. The system must be designed for the fact that attention is not evenly distributed.
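The collapse described in this mental model can be made concrete with a small simulation. The sketch below is illustrative only: the partition count, the traffic mix, and the helper names `partition_for` and `load_per_partition` are assumptions, not part of any particular system.

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical cluster size


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def load_per_partition(requests: list) -> Counter:
    """Count how many requests land on each partition."""
    return Counter(partition_for(key) for key in requests)


# Even traffic: 8,000 requests spread over 1,000 distinct keys.
even = [f"post_{i % 1000}" for i in range(8000)]
# Skewed traffic: the same volume, but 90% of it targets post_42.
skewed = ["post_42"] * 7200 + [f"post_{i % 1000}" for i in range(800)]

even_load = load_per_partition(even)
skewed_load = load_per_partition(skewed)
print("even   max share:", max(even_load.values()) / len(even))
print("skewed max share:", max(skewed_load.values()) / len(skewed))
```

In the skewed run, the busiest partition carries at least 90% of all requests no matter how many partitions exist, because every request for `post_42` hashes to the same place.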
Example: Hot Counter
Naive counter update:
likes:post_42 -> 9,001,212
Every like updates the same counter. This can create:
- row lock contention
- write conflicts
- replication lag
- cache invalidation storms
- queue partition imbalance
A common mitigation is a sharded counter:
likes:post_42:shard_0 -> 120,001
likes:post_42:shard_1 -> 119,992
likes:post_42:shard_2 -> 120,087
...
Writes are spread across counter shards. Reads sum the partial counters or read a periodically refreshed projection.
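This improves write scalability but makes reads more expensive and freshness harder to reason about: a summed read touches every shard, and a refreshed projection lags behind the latest writes.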
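A minimal in-memory sketch of the idea follows. It assumes a hypothetical `ShardedCounter` class with random shard selection; a real system would keep each shard in a separate row or cache key rather than in one process.

```python
import random
import threading


class ShardedCounter:
    """In-memory stand-in for one logical counter split across N shards."""

    def __init__(self, num_shards: int = 16) -> None:
        self._shards = [0] * num_shards
        # One lock per shard, so writers only contend within a shard.
        self._locks = [threading.Lock() for _ in range(num_shards)]

    def increment(self, amount: int = 1) -> None:
        # Pick a shard at random to spread writes.
        index = random.randrange(len(self._shards))
        with self._locks[index]:
            self._shards[index] += amount

    def total(self) -> int:
        # Reads pay the cost: every shard must be visited.
        # Without a global lock, the sum is a near-current snapshot.
        return sum(self._shards)


likes = ShardedCounter(num_shards=4)
for _ in range(1000):
    likes.increment()
print(likes.total())  # 1000
```

The shard count is the main tuning knob in this sketch: more shards reduce write contention but make every summed read touch more state.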
Common Mitigations
| Technique | What it does | Tradeoff |
|---|---|---|
| Replication | Serve hot reads from many copies | Harder invalidation |
| Sharded counters | Split one logical counter into many physical counters | Reads become more complex |
| Request coalescing | Combine duplicate work for the same key | Adds coordination |
| Cache prewarming | Load hot data before demand arrives | Requires prediction |
| Adaptive sharding | Dynamically split hot keys | Operationally harder |
| Rate limiting | Slow abusive or extreme traffic | Product impact |
| Async aggregation | Buffer writes and update derived state later | Eventual consistency |
The right answer depends on whether the hot pressure is read-heavy, write-heavy, or both.
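Request coalescing is the least self-explanatory row in the table, so here is a hedged sketch of one way to do it within a single process. The `SingleFlight` class, the `short:abc123` key, and the `slow_lookup` loader are hypothetical names for illustration. The loader's wait loop exists only to keep the demo deterministic.

```python
import threading
import time
from concurrent.futures import Future
from typing import Any, Callable, Dict, List


class SingleFlight:
    """Coalesce concurrent loads of one key into a single loader call."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight: Dict[str, Future] = {}
        self.coalesced = 0  # callers that reused an in-flight load

    def do(self, key: str, loader: Callable[[], Any]) -> Any:
        with self._lock:
            future = self._in_flight.get(key)
            is_leader = future is None
            if is_leader:
                future = Future()
                self._in_flight[key] = future
            else:
                self.coalesced += 1
        if is_leader:
            try:
                future.set_result(loader())
            except BaseException as exc:
                future.set_exception(exc)  # followers see the same error
            finally:
                with self._lock:
                    del self._in_flight[key]
        # Followers block here until the leader publishes a result.
        return future.result()


flight = SingleFlight()
backend_calls = 0


def slow_lookup() -> str:
    global backend_calls
    backend_calls += 1
    # Demo only: keep the load in flight until the other callers have joined.
    while flight.coalesced < 4:
        time.sleep(0.001)
    return "https://example.com/landing"


def worker() -> None:
    results.append(flight.do("short:abc123", slow_lookup))


results: List[str] = []
threads = [threading.Thread(target=worker) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(backend_calls, len(results))  # 1 5
```

Five callers receive the value, but the backend is asked once. The coordination cost named in the table shows up as the shared lock and the in-flight map.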
Hot Reads Versus Hot Writes
Hot reads are often easier to mitigate. The same value can be replicated into multiple caches or CDN locations.
Hot writes are harder because every write changes state.
A viral short URL is mostly a hot read:
short_code -> destination_url
The destination usually does not change, so caching works well.
A viral post like count is a hot write:
post_id -> like_count
Each like changes the aggregate. The system may need sharded counters, event streams, async aggregation, or approximate counts.
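For the hot-write case, async aggregation can be sketched as a buffer that turns many likes into one store update. This is a single-threaded illustration under stated assumptions: `LikeBuffer` is a hypothetical name, a plain dict stands in for the database, and a production version would also need a durable log so buffered likes survive a crash.

```python
from collections import defaultdict
from typing import Dict


class LikeBuffer:
    """Buffer like events and apply them to the store as one delta per post."""

    def __init__(self, store: Dict[str, int]) -> None:
        self._store = store  # stand-in for the database
        self._pending: Dict[str, int] = defaultdict(int)
        self.store_writes = 0

    def record_like(self, post_id: str) -> None:
        # Cheap in-memory increment; the store is not touched here.
        self._pending[post_id] += 1

    def flush(self) -> None:
        # One store write per post with pending likes, however many likes arrived.
        pending, self._pending = self._pending, defaultdict(int)
        for post_id, delta in pending.items():
            self._store[post_id] = self._store.get(post_id, 0) + delta
            self.store_writes += 1


store: Dict[str, int] = {"post_42": 9_001_212}
buffer = LikeBuffer(store)
for _ in range(5000):
    buffer.record_like("post_42")
print(store["post_42"])  # 9001212: reads are stale until the flush
buffer.flush()
print(store["post_42"], buffer.store_writes)  # 9006212 1
```

The flush interval sets the tradeoff: a longer interval means fewer writes to the hot row and a staler visible count.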
Operational Reality
You cannot manage hot keys if you only watch service averages, because an average across all partitions hides the one partition that is overloaded.
Important signals:
- top keys by request rate
- per-partition CPU and latency
- cache node imbalance
- row lock contention
- queue partition lag
- retry spikes grouped by key
- p99 latency for hot-key requests
- eviction churn for hot cache entries
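The first signal in this list, top keys by request rate, can be approximated from a window of sampled request keys. The sketch below is one simple way to do that; the `hot_keys` function, the 10% share threshold, and the sample window are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Tuple


def hot_keys(
    keys: List[str], top_n: int = 3, min_share: float = 0.10
) -> List[Tuple[str, float]]:
    """Return up to top_n keys whose share of requests is at least min_share."""
    if not keys:
        return []
    counts = Counter(keys)
    total = len(keys)
    return [
        (key, count / total)
        for key, count in counts.most_common(top_n)
        if count / total >= min_share
    ]


# A sampled window of 1,000 request keys: two popular posts plus a long tail.
window = ["post_42"] * 700 + ["post_7"] * 150 + [f"post_{i}" for i in range(100, 250)]
print(hot_keys(window))  # [('post_42', 0.7), ('post_7', 0.15)]
```

Counting exact keys works for a sampled window like this one. At very high cardinality, an approximate heavy-hitter structure would be a more memory-friendly swap.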
The fix should match the pressure. If the problem is cache miss storms, use coalescing or stale-while-revalidate. If the problem is one counter row, shard the counter. If the problem is abusive traffic, rate limit. If the problem is one queue partition, revisit partitioning.
Hot key mitigation is not one technique. It is a way of designing for uneven reality.