Hot Key Mitigation

Techniques for handling keys or records that receive disproportionate traffic and threaten to overload a partition, cache node, or database row.


Concepts Covered

  • Hot keys
  • Hot partitions
  • Celebrity problem
  • Load imbalance
  • Sharded counters
  • Cache pressure
  • Request coalescing
  • Adaptive mitigation

Definition

A hot key is a key that receives far more traffic than most other keys.

Examples:

  • a viral short URL
  • a celebrity post receiving likes
  • a product page during a launch
  • a globally shared configuration key
  • a cache key used by almost every request

Hot key mitigation means changing the system so one extremely popular key does not overload one cache node, database row, shard, queue partition, or worker group.

The Pain That Forces Hot Key Mitigation

Distributed systems often scale by spreading work across many keys, on the implicit assumption that traffic is roughly even across those keys.

That assumption breaks when real product traffic becomes uneven.

Imagine an Instagram-like system where every like on a viral post updates one counter:

post_id = 42
like_count = like_count + 1

At small scale, this is fine. At viral scale, thousands of writes per second may target the same post_id.

The system may have hundreds of database nodes, but the hot post still maps to one row, one partition, or one lock. Average traffic looks healthy, yet one part of the system is overloaded.

This is the hot key problem: the cluster has capacity, but the key placement concentrates pressure.
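To make that concentration concrete, here is a minimal Python sketch (the hash function, partition count, and key names are illustrative, not tied to any particular store): keys hash evenly across partitions, but a single hot key still lands on exactly one of them.

```python
import hashlib
from collections import Counter

def partition_for(key: str, num_partitions: int = 8) -> int:
    """Map a key to a partition by hashing, as most sharded stores do."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Even traffic: 10,000 requests spread across 1,000 distinct keys.
even = Counter(partition_for(f"post_{i % 1000}") for i in range(10_000))

# Hot-key traffic: 90% of requests hit post_42.
hot = Counter(
    partition_for("post_42" if i % 10 < 9 else f"post_{i}")
    for i in range(10_000)
)

print(sorted(even.values()))  # roughly balanced request counts
print(sorted(hot.values()))   # one partition absorbs ~90% of traffic
```

Adding more partitions does not help here: the hot key still maps to exactly one of them.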

Mental Model

Horizontal scaling helps when load is spread.

Hot keys are dangerous because they collapse horizontal scale back into a single point of pressure.

normal traffic:
many keys -> many partitions

hot key traffic:
one key -> one partition -> overload

A hot key is not always a bug. It can be a product success signal: a celebrity post, a viral link, a launch event, or breaking news. The system must be designed for the fact that attention is not evenly distributed.

Example: Hot Counter

Naive counter update:

likes:post_42 -> 9,001,212

Every like updates the same counter. This can create:

  • row lock contention
  • write conflicts
  • replication lag
  • cache invalidation storms
  • queue partition imbalance

A common mitigation is a sharded counter:

likes:post_42:shard_0 -> 120,001
likes:post_42:shard_1 -> 119,992
likes:post_42:shard_2 -> 120,087
...

Writes are spread across counter shards. Reads sum the partial counters or read a periodically refreshed projection.

This improves write scalability but makes reads and freshness more complex.
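A minimal in-memory sketch of the sharded counter above (the shard count and key names are illustrative; a real deployment would back each shard with its own row or cache slot):

```python
import random

class ShardedCounter:
    """Spread increments for one logical counter across N shard slots."""

    def __init__(self, key: str, num_shards: int = 16):
        self.key = key
        # Shard keys like likes:post_42:shard_3 are illustrative names.
        self.shards = {f"{key}:shard_{i}": 0 for i in range(num_shards)}

    def increment(self, amount: int = 1) -> None:
        # Pick a random shard so concurrent writers rarely collide
        # on the same row or cache slot.
        shard = random.choice(list(self.shards))
        self.shards[shard] += amount

    def read(self) -> int:
        # Reads must sum every shard (or read a periodically
        # refreshed projection of this sum).
        return sum(self.shards.values())

counter = ShardedCounter("likes:post_42")
for _ in range(1000):
    counter.increment()
print(counter.read())  # 1000
```

Note the tradeoff in code form: `increment` touches one shard, but `read` touches all of them.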

Common Mitigations

Technique          | What it does                                          | Tradeoff
Replication        | Serve hot reads from many copies                      | Harder invalidation
Sharded counters   | Split one logical counter into many physical counters | Reads become more complex
Request coalescing | Combine duplicate work for the same key               | Adds coordination
Cache prewarming   | Load hot data before demand arrives                   | Requires prediction
Adaptive sharding  | Dynamically split hot keys                            | Operationally harder
Rate limiting      | Slow abusive or extreme traffic                       | Product impact
Async aggregation  | Buffer writes and update derived state later          | Eventual consistency

The right answer depends on whether the hot pressure is read-heavy, write-heavy, or both.
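As one concrete instance, request coalescing can be sketched in a single-flight style, where concurrent callers for the same key share one backend call instead of issuing duplicates (all names here are illustrative):

```python
import threading
import time

class Coalescer:
    """Single-flight request coalescing: concurrent callers for the
    same key share one backend call instead of issuing duplicates."""

    def __init__(self, load_fn):
        self.load_fn = load_fn            # the expensive backend call
        self.lock = threading.Lock()
        self.in_flight = {}               # key -> {"event", "result"}

    def get(self, key):
        with self.lock:
            entry = self.in_flight.get(key)
            leader = entry is None
            if leader:
                entry = {"event": threading.Event(), "result": None}
                self.in_flight[key] = entry
        if leader:
            entry["result"] = self.load_fn(key)  # only the leader hits the backend
            with self.lock:
                del self.in_flight[key]
            entry["event"].set()                 # wake every waiting follower
        else:
            entry["event"].wait()
        return entry["result"]

calls = []
def slow_load(key):
    calls.append(key)                # record each real backend hit
    time.sleep(0.05)                 # simulate a slow database read
    return f"value_for_{key}"

c = Coalescer(slow_load)
results = []
threads = [
    threading.Thread(target=lambda: results.append(c.get("post_42")))
    for _ in range(20)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls))  # far fewer than 20 backend calls
```

This is the "adds coordination" tradeoff from the table: the lock and in-flight map are extra machinery that exists only to deduplicate work.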

Hot Reads Versus Hot Writes

Hot reads are often easier to mitigate. The same value can be replicated into multiple caches or CDN locations.

Hot writes are harder because every write changes state.

A viral short URL is mostly a hot read:

short_code -> destination_url

The destination usually does not change, so caching works well.
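A minimal single-node cache-aside sketch of that hot-read pattern (the TTL and names are illustrative; production systems typically layer this across many cache nodes or a CDN):

```python
import time

class HotReadCache:
    """Cache-aside sketch for a hot, rarely changing read,
    such as a short URL mapping."""

    def __init__(self, backend_fn, ttl_seconds: float = 60.0):
        self.backend_fn = backend_fn    # e.g. a database lookup
        self.ttl = ttl_seconds
        self.entries = {}               # key -> (value, expires_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry and entry[1] > time.monotonic():
            return entry[0]             # hot path: no backend hit
        value = self.backend_fn(key)
        self.entries[key] = (value, time.monotonic() + self.ttl)
        return value

lookups = []
def db_lookup(code):
    lookups.append(code)               # record each real backend read
    return f"https://example.com/{code}"

cache = HotReadCache(db_lookup)
for _ in range(10_000):
    cache.get("abc123")
print(len(lookups))  # 1 - one backend read serves every hot read
```

Because the value rarely changes, staleness is cheap here; the same trick does not transfer to hot writes.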

A viral post like count is a hot write:

post_id -> like_count

Each like changes the aggregate. The system may need sharded counters, event streams, async aggregation, or approximate counts.
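One of those options, async aggregation, can be sketched as an in-memory buffer that collapses many likes into one periodic write per post (a real system would use a durable log or stream rather than a process-local dict):

```python
from collections import defaultdict

class LikeBuffer:
    """Async-aggregation sketch: buffer per-post like increments in
    memory and flush one combined update per post per interval."""

    def __init__(self, flush_fn):
        self.flush_fn = flush_fn        # e.g. a batched database UPDATE
        self.pending = defaultdict(int)

    def like(self, post_id: int) -> None:
        self.pending[post_id] += 1      # cheap in-memory increment

    def flush(self) -> None:
        # One write per hot post instead of one write per like.
        for post_id, delta in self.pending.items():
            self.flush_fn(post_id, delta)
        self.pending.clear()

writes = []
buf = LikeBuffer(lambda post_id, delta: writes.append((post_id, delta)))
for _ in range(5000):
    buf.like(42)
buf.flush()
print(writes)  # [(42, 5000)] - 5000 likes collapsed into one write
```

The cost is eventual consistency: counts read between flushes lag behind the true total.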

Operational Reality

You cannot manage hot keys if you only watch service averages.

Important signals:

  • top keys by request rate
  • per-partition CPU and latency
  • cache node imbalance
  • row lock contention
  • queue partition lag
  • retry spikes grouped by key
  • p99 latency for hot-key requests
  • eviction churn for hot cache entries

The fix should match the pressure. If the problem is cache miss storms, use coalescing or stale-while-revalidate. If the problem is one counter row, shard the counter. If the problem is abusive traffic, rate limit. If the problem is one queue partition, revisit partitioning.

Hot key mitigation is not one technique. It is a way of designing for uneven reality.
