Event Streams

Durable ordered logs of events that let systems publish state changes and let independent consumers process them asynchronously.


Concepts Covered

  • Events
  • Producers
  • Consumers
  • Consumer groups
  • Durable logs
  • Replay
  • Ordering boundaries
  • Consumer lag
  • Delivery guarantees

Definition

An event stream is a durable ordered sequence of records that describe things that happened in a system.

Example:

LikeCreated(post_id=42, user_id=7, event_id=evt_1)
LikeRemoved(post_id=42, user_id=7, event_id=evt_2)

Services publish events to the stream. Independent consumers read those events and build downstream behavior such as counters, notifications, analytics, ranking features, inboxes, search indexes, and projections.

The Pain That Forces Event Streams

At small scale, one service can do everything directly:

Like API:
  insert like
  update counter
  send notification
  update ranking
  write analytics
  return response

This looks simple, but it makes the user request depend on every downstream feature.

If analytics is slow, the like is slow. If notifications are down, the like might fail. If ranking logic grows expensive, the core user action becomes fragile.

The product problem is that one committed state change often needs to trigger many pieces of work, but the user-facing write should not wait for all of them.

The Naive Failure

Imagine a like request that synchronously calls five downstream systems.

1. Insert like succeeds.
2. Counter update succeeds.
3. Notification call times out.
4. Analytics call succeeds.
5. API returns failure because one downstream call failed.

What should the client do? Retry the like? If it retries, the counter or analytics may duplicate. If it does not retry, the user sees failure even though the like exists.

The naive design mixes the source-of-truth command with secondary workflows.
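The failure above can be sketched in a few lines. This is a toy with hypothetical function names, not a real service: the source-of-truth write succeeds, a later downstream call times out, and the client is told the whole request failed.

```python
class Timeout(Exception):
    pass

def insert_like(post_id, user_id):
    return "like_1"  # source-of-truth write: succeeds and is committed

def update_counter(post_id):
    pass  # succeeds

def send_notification(post_id, user_id):
    # Simulated downstream outage.
    raise Timeout("notification service timed out")

def handle_like_naive(post_id, user_id):
    insert_like(post_id, user_id)        # committed: the like now exists
    update_counter(post_id)              # committed
    send_notification(post_id, user_id)  # raises -> whole request fails
    return {"ok": True}

try:
    result = handle_like_naive(42, 7)
except Timeout:
    # Client sees failure even though the like was committed.
    result = {"ok": False}
```

The like exists in the database, yet `result` reports failure, which is exactly the retry dilemma described above.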

Mental Model

An event stream lets the source system say:

This happened. Whoever cares can process it independently.

The Like API should not need to know every current and future feature that reacts to a like. It should commit the source state and publish an event. Consumers can then process the event at their own pace.

Event streams turn direct coupling into asynchronous coordination.
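A minimal sketch of the decoupled version, using a plain list as a stand-in for a durable log. In a real system the publish would go to a broker, and the gap between commit and publish needs its own handling so events always match committed state.

```python
stream = []       # stand-in for a durable, ordered event log
likes_table = []  # stand-in for the source-of-truth database

def handle_like(post_id, user_id):
    # 1. Commit the source-of-truth state change.
    likes_table.append({"post_id": post_id, "user_id": user_id})
    # 2. Publish a fact. The API does not know who consumes it.
    stream.append({"type": "LikeCreated",
                   "post_id": post_id, "user_id": user_id})
    # 3. Return without waiting for counters, notifications, or analytics.
    return {"ok": True}

response = handle_like(42, 7)
```

The Like API now lists zero downstream features; counters and notifications become someone else's consumer.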

How Event Streams Work

Core pieces:

  • Producer: writes events.
  • Broker or stream: stores ordered records.
  • Consumer: reads events and performs work.
  • Consumer group: allows multiple workers to share processing.
  • Offset or cursor: remembers how far a consumer has read.

Example flow:

Like API -> LikeCreated event -> event stream
                                -> counter consumer
                                -> notification consumer
                                -> analytics consumer
                                -> ranking consumer

Each consumer can fail, retry, lag, or deploy independently without blocking the original Like API request.
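The offset mechanics can be shown with a toy in-memory consumer (illustrative names only). Each consumer keeps its own cursor into the shared log, so a lagging consumer never blocks a fast one.

```python
stream = [
    {"type": "LikeCreated", "post_id": 42, "user_id": 7, "event_id": "evt_1"},
    {"type": "LikeCreated", "post_id": 42, "user_id": 9, "event_id": "evt_2"},
]

class Consumer:
    """Reads the shared log at its own pace via a private offset."""
    def __init__(self, handler):
        self.offset = 0
        self.handler = handler

    def poll(self):
        while self.offset < len(stream):
            self.handler(stream[self.offset])
            self.offset += 1  # advance the cursor after processing

like_counts = {}
def count_likes(event):
    like_counts[event["post_id"]] = like_counts.get(event["post_id"], 0) + 1

counter = Consumer(count_likes)
analytics = Consumer(lambda e: None)

counter.poll()
# analytics has not polled yet: it sits at offset 0 while counter is at 2.
```

Because the offset belongs to the consumer, not the stream, the analytics consumer can catch up later without any coordination with the counter.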

Ordering And Partitioning

Streams are usually ordered inside a partition, not globally across the whole platform.

If all events for post_42 go to the same partition, consumers can process that post's events in order:

partition_key = post_id

This helps correctness, but it can create hot partitions when one key receives huge traffic. Partitioning is always a tradeoff between ordering and load distribution.
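A sketch of stable key-to-partition routing (the hash choice here is an assumption; brokers use their own scheme). The property that matters is determinism: the same key always lands on the same partition, so per-key ordering holds.

```python
import hashlib

NUM_PARTITIONS = 8

def partition_for(key, n=NUM_PARTITIONS):
    # Stable hash: the same key always maps to the same partition,
    # so all events for one post stay in a single ordered sequence.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % n

p1 = partition_for("post_42")
p2 = partition_for("post_42")
assert p1 == p2  # per-key ordering preserved across publishes
```

Note the tradeoff in the text: if `post_42` is viral, every one of its events hits this one partition, and that partition becomes hot.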

Replay

Replay is one of the most powerful properties of event streams.

If a counter projection is wrong, the system can reset the counter consumer and process old events again.

If a new analytics system launches, it can read historical events instead of waiting for new ones.

Replay is only safe when consumers are idempotent: replaying the same events through a non-idempotent consumer duplicates its side effects.
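One common way to make a consumer replay-safe is to deduplicate on event_id before applying anything. A toy sketch: processing the same history twice leaves the projection unchanged.

```python
like_count = 0
applied = set()  # event_ids already applied to the projection

def apply(event):
    global like_count
    if event["event_id"] in applied:
        return  # idempotent: replays and duplicate deliveries are no-ops
    applied.add(event["event_id"])
    if event["type"] == "LikeCreated":
        like_count += 1
    elif event["type"] == "LikeRemoved":
        like_count -= 1

history = [
    {"type": "LikeCreated", "event_id": "evt_1"},
    {"type": "LikeRemoved", "event_id": "evt_2"},
    {"type": "LikeCreated", "event_id": "evt_3"},
]

for e in history:   # first pass builds the projection
    apply(e)
for e in history:   # full replay: same result, no double-counting
    apply(e)
```

In practice the `applied` set would live in durable storage alongside the projection, but the principle is the same: record what you have already applied, and skip it.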

What Event Streams Guarantee

Event streams can provide:

  • durable storage of events
  • ordered processing within a partition
  • independent consumers
  • replay from retained history
  • buffering during downstream outages

They do not automatically provide:

  • exactly-once business effects
  • correct producer behavior
  • infinite retention
  • no lag
  • no duplicate delivery
  • schema compatibility

Operational Reality

Operators must watch:

  • producer publish failures
  • consumer lag
  • oldest unprocessed event age
  • partition skew
  • retry rate
  • dead-lettered events
  • schema compatibility failures
  • event retention windows
  • replay duration
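Two of the metrics above, consumer lag and oldest unprocessed event age, reduce to simple arithmetic over offsets and publish timestamps. A sketch, assuming offsets count events from zero:

```python
import time

def consumer_lag(latest_offset, committed_offset):
    # How many events the consumer still has to process.
    return latest_offset - committed_offset

def oldest_unprocessed_age(publish_times, committed_offset, now=None):
    # Age in seconds of the first event the consumer has not processed.
    now = time.time() if now is None else now
    if committed_offset >= len(publish_times):
        return 0.0  # fully caught up
    return now - publish_times[committed_offset]

now = 1_000_000.0
publish_times = [now - 300, now - 120, now - 5]  # events 0..2
assert consumer_lag(latest_offset=3, committed_offset=1) == 2
assert oldest_unprocessed_age(publish_times, committed_offset=1, now=now) == 120
```

Lag in events can look small while age is large (a trickle of old events), which is why both belong on the dashboard.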

Failure modes:

  • Consumers fall behind and projections become stale.
  • Duplicate delivery causes duplicate side effects.
  • Bad partition keys create hot partitions.
  • Producers publish events that do not match committed database state.
  • Event schemas change in ways old consumers cannot parse.
  • Retention expires before a consumer can replay.
