System Design

Instagram Likes System

Design a production-grade like system that handles retry-safe mutations, hot posts, distributed counters, fan-out, derived projections, and eventually consistent read models.

Intermediate · 19 min read

Topics: Idempotency, Distributed Counters, Eventual Consistency, Fan-Out, Hot Key Mitigation, Transactional Outbox, Derived Projections, Backpressure, Event Streams, Idempotent Consumers, Projection Drift

Study path

Read these in order

Start with the mechanics, then move into the patterns that explain why the system is shaped this way.

  1. Idempotency (Concept)
  2. Distributed Counters (Concept)
  3. Eventual Consistency (Concept)
  4. Event Streams (Concept)
  5. Transactional Outbox (Pattern)
  6. Fan-Out (Concept)
  7. Hot Key Mitigation (Concept)
  8. Sharded Counter (Pattern)
  9. Idempotent Consumers (Concept)
  10. Projection Drift (Concept)
  11. Reconciliation Job (Pattern)

Concepts Covered

1. Introduction

An Instagram-style like system lets a user express lightweight engagement with a post. The visible product behavior is simple: the user taps a heart, the button changes state, and a like count eventually updates.

The backend problem is more subtle. A like is not just count = count + 1. A like has at least two meanings:

  • a durable relationship: user U likes post P
  • a derived aggregate: post P has N likes

Those two meanings should not be collapsed into one piece of state too early. The relationship is the source of truth. The aggregate count is a derived projection that can be rebuilt, cached, sharded, delayed, or corrected.

At small scale, a single table and a counter column can work. At large scale, likes create retry problems, duplicate write problems, hot-counter problems, fan-out problems, and read-model consistency problems. Popular posts concentrate write traffic. Mobile clients retry during unreliable network conditions. Feed reads need the current viewer's like state and an aggregate count. Notifications, ranking systems, activity feeds, and analytics may all want to react to the same like event.

This module uses "Instagram Likes" as a familiar product shape, not as a claim about Instagram's private implementation.

2. Product Requirements

Functional Requirements

  • A user can like a post.
  • A user can unlike a post.
  • A user cannot create multiple active likes on the same post.
  • The UI can show whether the current viewer has liked a post.
  • The UI can show an aggregate like count.
  • The system can emit like/unlike events for downstream features.
  • Notifications, ranking, analytics, and feed systems can consume those events.
  • Operators can repair or rebuild derived counts if they drift from the source of truth.

Non-Functional Requirements

  • Like and unlike actions should feel fast to the user.
  • The current viewer's own like state should be strongly consistent after mutation.
  • Aggregate like counts can be eventually consistent if lag is bounded and understood.
  • Duplicate client retries should not create duplicate likes.
  • Popular posts should not overload a single database row, cache key, or partition.
  • Downstream consumer failures should not block the core like/unlike mutation.
  • Event publication should be reliable enough that derived projections do not silently miss committed changes.
  • The system should expose enough observability to detect lag, drift, hot keys, and consumer failures.

3. Core Engineering Challenges

The hard part is not storing a boolean. The hard part is preserving the right product guarantees while the feature becomes a high-volume distributed workflow.

Challenge | Why it matters
Duplicate requests | Mobile clients retry after timeouts. Without idempotency, retries can create duplicate side effects.
Hot posts | Viral posts concentrate writes on the same aggregate count.
Viewer state vs aggregate count | The button state should be accurate for the current user, while the total count can lag.
Fan-out | One like may affect notifications, activity feeds, ranking, recommendations, analytics, and abuse signals.
Event reliability | If a like commits but the event is lost, derived systems drift.
Ordering | Like and unlike events for the same user/post pair can arrive out of order downstream.
Cost | A tiny user action can create many derived writes if the architecture is careless.
Repairability | Derived counts will eventually drift; the system needs a way to detect and fix them.

The naive version fails when it treats the counter as the only truth. If the system simply increments and decrements a like_count column, it becomes hard to answer whether a specific user liked a post, hard to deduplicate retries, hard to rebuild corrupted counts, and easy to overload hot rows.

4. High-Level Architecture

flowchart LR
  Client[Client] --> LikeAPI[Like API]
  LikeAPI --> EdgeStore[(Like edge store)]
  LikeAPI --> Outbox[(Transactional outbox)]
  LikeAPI --> ViewerCache[Viewer state cache]

  Outbox --> Publisher[Outbox publisher]
  Publisher --> Stream[Like event stream]

  Stream --> CounterWorkers[Counter workers]
  Stream --> NotificationWorkers[Notification workers]
  Stream --> RankingWorkers[Ranking workers]
  Stream --> AnalyticsWorkers[Analytics workers]

  CounterWorkers --> CountShards[(Counter shards)]
  CounterWorkers --> CountProjection[(Compacted count projection)]
  NotificationWorkers --> NotificationStore[(Notifications)]
  RankingWorkers --> RankingSignals[(Ranking signals)]
  AnalyticsWorkers --> AnalyticsStore[(Analytics store)]

  FeedRead[Feed or post read] --> CountProjection
  FeedRead --> ViewerCache
  FeedRead --> EdgeStore

The core idea is separation of concerns:

  • The Like API owns the user-facing mutation.
  • The edge store owns the durable relationship.
  • The outbox protects event publication.
  • The event stream decouples downstream work.
  • Counter workers maintain derived count projections.
  • Read paths combine aggregate count and viewer-specific state.

The API should not synchronously update every downstream feature. That would make the like button depend on notification systems, ranking systems, analytics systems, and queue health. Instead, the API commits the relationship and publishes an event reliably.

5. Core Components

Like API

The Like API owns the command path for like and unlike actions. It authenticates the user, validates the post, applies rate or abuse controls if needed, performs an idempotent state transition, writes an outbox event when the state changes, and returns the current viewer state.

The most important boundary is that the API should commit the source-of-truth relationship, not all derived product effects. It should not wait for counters, notifications, ranking features, and analytics to finish before responding. Those systems can lag; the user's mutation should not depend on them being healthy.

This API needs to distinguish "request received twice" from "user intentionally changed state twice." A stable idempotency key, a unique (post_id, user_id) constraint, or a state-machine transition can make duplicate requests safe. For example, if the current state is already liked, another like request should return success without incrementing the count again.
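
As one concrete sketch, the whole transition can be a single conditional upsert against the post_likes table defined below. This assumes a PostgreSQL-style dialect; the :post_id and :user_id placeholders stand in for bound parameters.

-- Hedged sketch: transition to 'liked' only if not already liked.
-- RETURNING yields a row only when a real state change happened.
INSERT INTO post_likes (post_id, user_id, state, created_at, updated_at, version)
VALUES (:post_id, :user_id, 'liked', now(), now(), 1)
ON CONFLICT (post_id, user_id) DO UPDATE
  SET state = 'liked',
      updated_at = now(),
      version = post_likes.version + 1
  WHERE post_likes.state <> 'liked'
RETURNING version;

If no row comes back, the edge was already liked: the API can return success without writing an outbox event.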

Operationally, teams would watch mutation latency, duplicate request rate, state-transition conflicts, database write errors, outbox write failures, and the ratio of real state changes to no-op retries.

Like Edge Store

The edge store is the source of truth for the relationship between users and posts. A row answers the question: "Does this user currently like this post?"

A simplified model is:

CREATE TABLE post_likes (
  post_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  state VARCHAR(16) NOT NULL,
  created_at TIMESTAMP NOT NULL,
  updated_at TIMESTAMP NOT NULL,
  version BIGINT NOT NULL,
  PRIMARY KEY (post_id, user_id)
);

The primary key prevents duplicate active relationships. Depending on product needs, unlikes can either delete the row or transition state to unliked. Keeping a state transition can make auditing, idempotency, and replay easier. Deleting rows can save storage and make "users who liked this post" queries cleaner.

The edge store should be designed around the main access patterns:

  • check whether viewer U liked post P
  • list users who liked post P
  • list posts liked by user U
  • apply a like/unlike transition safely

Those access patterns may require different indexes or even different denormalized stores at very large scale. The key point is that the edge store is truth; counters are summaries.

Transactional Outbox

The transactional outbox solves a classic distributed systems problem: how do we update the database and publish an event without losing one of those actions?

If the API writes the like edge and then publishes to a broker, it can crash after the database commit but before the broker publish. The user sees success, but downstream counters and notifications never hear about the like. The outbox avoids this by writing the event into an outbox table in the same transaction as the edge update.

transaction:
  update post_likes
  insert like_outbox event
commit

A separate publisher reads unpublished outbox rows and sends them to the event stream. This means event publication can be retried even if the API process dies.

The outbox does not guarantee no duplicates. It primarily protects against lost events. Consumers still need idempotency because the publisher may send the same event more than once.

Like Event Stream

The event stream carries state changes such as LikeCreated and LikeRemoved to downstream systems. It decouples the write path from slower or less reliable consumers.

The stream should preserve enough information for consumers to make safe decisions:

event_id
post_id
user_id
action
occurred_at
edge_version
idempotency_key

edge_version or another monotonic sequence is useful because like/unlike can happen quickly. If consumers receive stale events out of order, they need a way to avoid applying an old state after a newer one.

Consumer lag is one of the most important operational signals. If counter workers lag, counts become stale. If notification workers lag, notifications arrive late. If ranking workers lag, engagement signals are delayed. The write path may still be healthy, but the product experience is drifting.

Counter Projection

The counter projection answers "how many likes does this post have?" quickly. It is derived from edge changes or events.

The simplest counter is a single row:

post_id -> like_count

That is easy to read but dangerous for hot posts because every like and unlike updates the same row. A viral post can turn that row into a bottleneck.

A more scalable pattern is a sharded counter:

post_id, shard_id -> partial_count

Each like updates one shard, often selected by hashing user_id, event_id, or another stable value. Reads either sum the shards or read from a periodically compacted total.
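
As a sketch of both paths, assuming the post_like_count_shards table from the data modeling section and a PostgreSQL-style dialect:

-- Write path: bump one shard, chosen e.g. by hash(user_id) mod shard count.
INSERT INTO post_like_count_shards (post_id, shard_id, count, updated_at)
VALUES (:post_id, :shard_id, 1, now())
ON CONFLICT (post_id, shard_id)
DO UPDATE SET count = post_like_count_shards.count + 1, updated_at = now();

-- Read path: sum the partials (or read the compacted projection instead).
SELECT COALESCE(SUM(count), 0) AS like_count
FROM post_like_count_shards
WHERE post_id = :post_id;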

This projection can be eventually consistent because the edge store remains the truth. If the counter becomes corrupted, repair jobs can recompute it from edges or events.

Viewer State Cache

Feed and post-detail pages need to answer a viewer-specific question: "Did I like this post?"

This is not the same as the aggregate count. A feed showing 20 posts may need 20 viewer-state checks for the current user. At high traffic, constantly hitting the edge store for those checks can be expensive.

A viewer state cache can store recent (user_id, post_id) -> liked/unliked answers, often with a cache-aside access pattern. It improves read latency but introduces invalidation questions. When a user likes or unlikes a post, the cache entry should be updated or invalidated immediately enough that the UI does not contradict the user's action.

The cache should never be the only truth. If it is missing or stale, the system can fall back to the edge store. Operators should watch cache hit ratio, stale-state reports, fallback latency, and invalidation failures.

Notification Workers

Notification workers consume like events and decide whether another user should be notified. This sounds simple, but product rules matter. The post owner may not need a notification for every like if a post is receiving thousands of likes. Notifications may be batched, rate-limited, muted, filtered for privacy, or suppressed for abuse.

This work is an example of fan-out, and it should be asynchronous. A like should not fail just because the notification system is temporarily slow.

The notification worker should follow the Idempotent Consumer pattern: if it receives the same LikeCreated event twice, it should not create duplicate notifications. A unique key such as (event_id) or (recipient_id, actor_id, post_id, notification_type) can enforce that.

Ranking and Feed Signal Workers

Likes are often inputs into ranking systems. A new like may increase a post's engagement score, affect recommendation features, or update activity signals used by feeds.

These systems usually do not need strict per-event synchronous freshness. They need reliable event streams, bounded lag, and aggregation semantics. For example, ranking might care about likes per minute, likes from close connections, or decayed engagement over time.

The risk is write amplification. If every like triggers many ranking writes immediately, the system can become expensive and fragile. Stream processing, batching, and windowed aggregation help keep this work manageable.

Analytics Workers

Analytics pipeline workers turn like events into product and business metrics: likes per post, likes per author, engagement rate, time-series dashboards, experiments, and anomaly detection.

Analytics should be separated from the transactional like system. It is usually append-heavy, aggregation-heavy, and tolerant of delay. Mixing analytics writes into the core mutation path would make the like button depend on reporting infrastructure.

The operational question is freshness. Product dashboards might tolerate minutes of lag; abuse detection might require faster signals. That freshness requirement determines queue priority, worker scaling, and storage choices.

6. Data Modeling

A practical model separates durable edges, event publication, counter projections, and analytics.

Source Of Truth: Like Edge

CREATE TABLE post_likes (
  post_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  state VARCHAR(16) NOT NULL,
  created_at TIMESTAMP NOT NULL,
  updated_at TIMESTAMP NOT NULL,
  version BIGINT NOT NULL,
  PRIMARY KEY (post_id, user_id)
);

Possible secondary indexes:

Index | Purpose
(user_id, updated_at) | List recent posts liked by a user.
(post_id, updated_at) | List recent users who liked a post.
(post_id, state) | Support active-like scans or repair jobs.

The primary key is central to idempotency. It makes duplicate active likes structurally impossible.

Outbox

CREATE TABLE like_outbox (
  event_id UUID PRIMARY KEY,
  post_id BIGINT NOT NULL,
  user_id BIGINT NOT NULL,
  action VARCHAR(32) NOT NULL,
  edge_version BIGINT NOT NULL,
  created_at TIMESTAMP NOT NULL,
  published_at TIMESTAMP
);

The outbox needs an index that makes unpublished rows cheap to find:

CREATE INDEX like_outbox_unpublished_idx
ON like_outbox(published_at, created_at);

In practice, publishers need a safe way to claim rows without multiple workers fighting over the same records.
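
One common approach, sketched here in PostgreSQL terms, is to claim a batch with FOR UPDATE SKIP LOCKED so concurrent publishers never pick up the same rows; the batch size and :published_ids array parameter are illustrative assumptions.

-- Run inside a single transaction so the claimed row locks hold until commit.
-- Claim up to 100 unpublished events; concurrent publishers skip locked rows.
SELECT event_id, post_id, user_id, action, edge_version
FROM like_outbox
WHERE published_at IS NULL
ORDER BY created_at
LIMIT 100
FOR UPDATE SKIP LOCKED;

-- After a successful publish, mark the rows so they are not re-sent.
UPDATE like_outbox SET published_at = now() WHERE event_id = ANY(:published_ids);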

Counter Shards

CREATE TABLE post_like_count_shards (
  post_id BIGINT NOT NULL,
  shard_id INT NOT NULL,
  count BIGINT NOT NULL,
  updated_at TIMESTAMP NOT NULL,
  PRIMARY KEY (post_id, shard_id)
);

Shards distribute writes. A compacted projection can make reads cheaper:

CREATE TABLE post_like_counts (
  post_id BIGINT PRIMARY KEY,
  count BIGINT NOT NULL,
  updated_at TIMESTAMP NOT NULL
);

The compacted count can lag slightly. The tradeoff is lower read cost in exchange for eventual consistency.
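
A periodic compaction job might fold the shards into the read-optimized row. This is a hedged PostgreSQL-style sketch; a real job would usually scope the recompute to posts whose shards changed since the last run rather than the whole table:

-- Recompute compacted totals from the shards; runs on a schedule, not per like.
INSERT INTO post_like_counts (post_id, count, updated_at)
SELECT post_id, SUM(count), now()
FROM post_like_count_shards
GROUP BY post_id
ON CONFLICT (post_id)
DO UPDATE SET count = EXCLUDED.count, updated_at = EXCLUDED.updated_at;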

7. Request Lifecycle

Like Flow

  1. The client sends POST /posts/{postId}/like with an idempotency key or stable request identity.
  2. The Like API authenticates the user and validates that the post can be liked.
  3. The API opens a transaction.
  4. The API upserts the (post_id, user_id) edge into the liked state.
  5. If the previous state was already liked, the operation is a no-op success.
  6. If the state changed, the API inserts a LikeCreated event into the outbox.
  7. The transaction commits.
  8. The API updates or invalidates viewer-state cache.
  9. The API returns the viewer's current liked state.
  10. The outbox publisher eventually publishes the event to the stream.
  11. Consumers update counters, notifications, ranking, and analytics.

sequenceDiagram
  participant Client
  participant API as Like API
  participant DB as Edge DB
  participant Outbox
  participant Stream
  participant Counter as Counter Worker

  Client->>API: Like post
  API->>DB: Upsert edge to liked
  API->>Outbox: Insert LikeCreated if state changed
  API-->>Client: Return liked=true
  Outbox->>Stream: Publish event
  Stream->>Counter: Consume LikeCreated
  Counter->>Counter: Update counter projection

Unlike Flow

Unlike is not simply "decrement the counter." It should transition the source-of-truth relationship first.

If the current state is already unliked or no active edge exists, the operation should usually be a no-op success. If the state changes from liked to unliked, the system emits LikeRemoved.

The counter worker should decrement only when it processes a valid state-change event, not every duplicate unlike request.
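
Mirroring the like path, the unlike transition can be a guarded update (the same hedged PostgreSQL-style sketch): it touches the row only when an active like exists, and the API emits LikeRemoved only when a row actually changed.

-- No-op if the edge is already unliked or missing.
-- RETURNING yields a row only on a real liked -> unliked transition.
UPDATE post_likes
SET state = 'unliked', updated_at = now(), version = version + 1
WHERE post_id = :post_id
  AND user_id = :user_id
  AND state = 'liked'
RETURNING version;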

Read Flow

A feed read usually composes multiple pieces of data:

  • post content
  • aggregate like count
  • current viewer's liked state
  • author metadata
  • ranking or feed position

The like system usually owns only the count and viewer-state pieces. The read path can fetch aggregate counts from a compacted projection and viewer state from cache or the edge store.

If the aggregate count says 1050 but the viewer just liked the post and still sees 1050 for a moment, that may be acceptable. If the viewer taps like and the button flips back incorrectly, that is much worse. The system should prioritize correctness of the viewer's own state over immediate freshness of every derived projection.

8. Scaling Problems

Hot Posts

A viral post can receive many likes in a short period. If each like updates one post_like_counts row, that row becomes a write hotspot. Sharded counters spread writes across multiple rows or partitions.

The number of shards is a tradeoff. More shards reduce write pressure but make reads and compaction more expensive. For ordinary posts, a single counter may be fine. For hot posts, adaptive sharding may be useful.

Celebrity Accounts

Large accounts can repeatedly create hot posts. The system may need to identify posts likely to become hot and use more aggressive counter sharding, batching, or delayed compaction.

This is a product-created distribution problem. Traffic is not uniform. A small number of posts can dominate engagement.

Duplicate and Out-Of-Order Events

Event systems commonly provide at-least-once delivery. That means consumers may see duplicates. Like/unlike events can also arrive out of order.

Consumers should use event IDs, edge versions, or idempotent state transitions. A counter worker that blindly increments for every LikeCreated message can overcount if duplicate events arrive.
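
One way to make the increment idempotent, sketched in PostgreSQL style with a hypothetical processed_like_events dedup table, is to claim the event_id and increment in a single statement: a duplicate delivery claims nothing, so it increments nothing.

-- The shard increment runs only if this event_id was newly claimed.
WITH claimed AS (
  INSERT INTO processed_like_events (event_id)
  VALUES (:event_id)
  ON CONFLICT (event_id) DO NOTHING
  RETURNING event_id
)
INSERT INTO post_like_count_shards (post_id, shard_id, count, updated_at)
SELECT :post_id, :shard_id, 1, now() FROM claimed
ON CONFLICT (post_id, shard_id)
DO UPDATE SET count = post_like_count_shards.count + 1, updated_at = now();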

Read Amplification

A feed with 30 posts may need 30 counts and 30 viewer-state checks. If each check becomes an independent database query, read amplification becomes severe.

Batch APIs, caches, compacted projections, and denormalized feed payloads can reduce this pressure. The right answer depends on freshness requirements and feed architecture.
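
For viewer state, a single batched query can replace dozens of point lookups. A sketch against the post_likes table, with :page_post_ids as an illustrative array parameter:

-- One round trip for every post on the page; missing rows mean "not liked".
SELECT post_id
FROM post_likes
WHERE user_id = :viewer_id
  AND post_id = ANY(:page_post_ids)
  AND state = 'liked';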

Write Amplification

One like may produce many downstream writes:

  • edge update
  • outbox row
  • stream event
  • counter update
  • notification candidate
  • ranking feature
  • analytics event
  • cache invalidation

That amplification is not automatically bad, but it must be controlled. The system should distinguish critical writes from optional derived writes.

Projection Drift

Derived counters can drift because of duplicate events, missed events, bugs, manual operations, or partial outages. A serious system needs reconciliation jobs that compare edge truth against projections and correct differences.
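
A reconciliation job can recompute truth from the edge store and flag projections that disagree. A hedged PostgreSQL-style sketch:

-- Surface drifted posts by comparing edge truth against the compacted projection.
SELECT c.post_id, c.count AS projected, COALESCE(t.true_count, 0) AS actual
FROM post_like_counts c
LEFT JOIN (
  SELECT post_id, COUNT(*) AS true_count
  FROM post_likes
  WHERE state = 'liked'
  GROUP BY post_id
) t ON t.post_id = c.post_id
WHERE c.count <> COALESCE(t.true_count, 0);

Drifted rows can then be corrected in place, with each correction logged so drift rates stay observable.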

9. Distributed Systems Concepts

Idempotency

Idempotency means repeated requests produce the same intended result. For likes, sending the same like request twice should leave the post liked once, not liked twice.

The unique (post_id, user_id) key is one form of idempotency at the data model level. API idempotency keys can add another layer when clients retry the exact same mutation.
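
If the API also accepts client-supplied idempotency keys, a small table can absorb exact retries before they reach the edge store. This is a hypothetical sketch; the table and column names are illustrative:

-- A retry with the same key hits the conflict and can be answered
-- from the stored response instead of re-running the mutation.
CREATE TABLE like_idempotency_keys (
  idempotency_key UUID PRIMARY KEY,
  user_id BIGINT NOT NULL,
  response JSONB NOT NULL,
  created_at TIMESTAMP NOT NULL
);

INSERT INTO like_idempotency_keys (idempotency_key, user_id, response, created_at)
VALUES (:key, :user_id, :response, now())
ON CONFLICT (idempotency_key) DO NOTHING;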

Eventual Consistency

The aggregate like count can be eventually consistent. This is a product decision, not a technical excuse. The system should define acceptable lag and monitor it.

The viewer's own like state should usually be more consistent than the aggregate count. Users notice when their own action appears to be lost.

Derived Projections

Counters, ranking signals, notifications, and analytics are derived projections. They are useful because they make reads and product experiences faster, but they should be repairable from source events or edge truth.

Backpressure

If consumers cannot keep up, queues grow. Backpressure is the system's way of handling that pressure without collapsing. Possible responses include scaling workers, batching, prioritizing critical consumers, shedding optional analytics pipeline work, or temporarily increasing projection lag.

Ordering

Like and unlike operations for the same pair are order-sensitive. If a user likes, unlikes, and likes again quickly, consumers need enough metadata to avoid applying an older unlike after a newer like. This is the same pressure that makes event IDs, versions, and idempotent consumers necessary.
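
Carrying edge_version in each event lets consumers guard against reordering. A sketch with a hypothetical viewer_like_projection table: the update applies only when the incoming version is newer, so a late-arriving older event changes nothing.

-- A stale or reordered event carries an older edge_version and updates no rows.
UPDATE viewer_like_projection
SET state = :incoming_state,
    edge_version = :incoming_version,
    updated_at = now()
WHERE post_id = :post_id
  AND user_id = :user_id
  AND edge_version < :incoming_version;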

10. Reliability & Failure Handling

Failure | Impact | Mitigation
Client retries after timeout | Duplicate requests | Idempotency key and unique edge constraint
API crashes after edge write | Event might be lost without protection | Transactional outbox
Outbox publisher duplicates events | Counters or notifications can duplicate | Idempotent consumers and event IDs
Counter worker lags | Counts become stale | Lag monitoring, worker scaling, replayable stream
Counter projection corrupted | Visible counts become wrong | Reconciliation jobs from edge store or event log
Viewer cache stale | UI shows wrong liked state | Cache invalidation on mutation and fallback to edge store
Notification worker fails | Notifications delayed or missing | Retry queues and dead-letter queues
Database hotspot | Mutations slow or fail | Partitioning, sharded counters, adaptive hot-key handling

Reliability is not just uptime. For this system, reliability also means not silently corrupting counts, not losing events, and not showing the user contradictory state after their own action.

Monitoring should include:

  • like/unlike mutation latency
  • mutation error rate
  • no-op retry rate
  • outbox unpublished row age
  • event publish failure rate
  • consumer lag by consumer group
  • counter drift detected by reconciliation
  • hot post write concentration
  • viewer-state cache hit ratio

11. Real-World Company Approaches

For a company at Instagram-like scale, it is reasonable to expect separation between the user-facing mutation path and downstream derived systems. That is a general production architecture pattern, not a claim about Instagram's private implementation.

At large social-network scale, a company might:

  • keep the like relationship in a durable store
  • maintain separate read-optimized count projections
  • use asynchronous event streams for counters, notifications, ranking, and analytics
  • apply special handling for viral posts or high-follower accounts
  • use repair jobs to reconcile derived counts
  • prioritize the viewer's own action state over immediate global count accuracy

Different companies may choose different storage engines, queue systems, cache layers, and consistency contracts. The reusable lesson is the separation of truth, projection, and product experience.

12. Tradeoffs & Alternatives

Counter As Source Of Truth

This is simple but fragile. It cannot answer who liked the post, cannot deduplicate retries well, and is hard to repair when wrong.

Edge Store As Source Of Truth

This is more robust. It supports uniqueness, viewer state, audits, and repair. The cost is more storage and a more complex read path.

Single Counter Row

Good for small scale and simple reads. Bad for viral posts because every update hits the same row.

Sharded Counters

Better for write distribution. More complex for reads, compaction, and repair.

Synchronous Downstream Updates

Keeps projections fresher, but couples the like button to downstream systems and increases latency.

Asynchronous Downstream Updates

Improves availability and latency, but introduces eventual consistency, queue lag, duplicate delivery, and operational complexity.

Delete On Unlike

Saves storage and makes active-like scans clean. It may lose useful history unless events or audit logs preserve it.

State Transition On Unlike

Preserves history and simplifies idempotency in some designs. It uses more storage and requires queries to filter active state.

13. Evolution Path

  1. Store likes in post_likes with a unique (post_id, user_id) key.
  2. Add a simple counter column or counter table for fast reads.
  3. Add viewer-state cache for feed and post-detail reads.
  4. Move counter updates out of the API and into asynchronous workers.
  5. Add a transactional outbox so committed likes reliably produce events.
  6. Add sharded counters for hot posts.
  7. Add event-stream consumers for notifications, ranking, and analytics.
  8. Add reconciliation jobs to detect and repair projection drift.
  9. Add adaptive hot-post handling and consumer prioritization.
  10. Add deeper observability around lag, drift, and write concentration.

This evolution matters because the final architecture is not the right starting point for every product. Complexity should be introduced when it solves a real bottleneck or reliability risk.

14. Key Engineering Lessons

  • A like is a relationship before it is a counter.
  • The user-to-post edge should be the source of truth.
  • Aggregate counts are derived projections and should be repairable.
  • Idempotency is required because clients retry.
  • Hot posts create concentrated write pressure.
  • Event-driven architecture decouples downstream systems but requires idempotent consumers and lag monitoring.
  • The viewer's own like state deserves stronger consistency than the global aggregate count.
  • Production systems need repair paths, not just happy paths.
  • The best architecture separates what must happen synchronously from what can happen asynchronously.
