System Design
WhatsApp-Style Messaging System
Design a WhatsApp-style chat system that supports real-time messaging, delivery receipts, offline users, media sharing, group chats, retries, and multi-device sync.
Study path
Read these in order
Start with the mechanics, then move into the patterns that explain why the system is shaped this way.
1. Realtime Gateways (Concept)
2. Delivery Guarantees (Concept)
3. Message Ordering (Concept)
4. Offline Delivery (Concept)
5. Presence (Concept)
6. Push Notification Handoff (Concept)
7. Message Receipts (Concept)
8. Multi-Device Sync (Concept)
9. Group Message Fan-Out (Concept)
10. Media Message Pipeline (Concept)
11. Connection Registry (Pattern)
12. Cursor-Based Sync (Pattern)
13. Large Group Fan-Out Isolation (Pattern)
14. Upload-Then-Reference Media (Pattern)
Concepts Covered
- Realtime gateways and long-lived client connections
- One-to-one chats, group chats, and multi-device sync
- Message IDs, idempotency keys, retries, and duplicate prevention
- Message ordering inside a conversation
- Delivery guarantees, acknowledgements, and retry boundaries
- Online delivery, offline delivery, and push notification handoff
- Presence as an ephemeral routing and UX signal
- Message receipts, read cursors, and derived unread counters
- Event streams for delivery pipelines
- Group message fan-out and device-level delivery
- Media message pipelines for uploads, thumbnails, and object storage
- Eventual consistency for receipts, unread counts, and projections
- Derived projections for inboxes, unread counts, and conversation lists
- Backpressure when queues, devices, or workers fall behind
- Rate limiting for spam, abuse, and reconnect storms
- Sharding by user, conversation, or region
- Hot key mitigation for large groups and viral message bursts
- Outbox pattern, idempotent consumers, and dead-letter queues
- Chat-specific patterns such as connection registries, cursor-based sync, read cursor receipts, large group fan-out isolation, and upload-then-reference media
1. Introduction
A WhatsApp-style messaging product lets people send private messages, create groups, share media, see delivery states, receive push notifications, and continue conversations across devices. The product feels simple to the user: type a message, press send, and expect it to arrive quickly.
The system behind that experience is not simple. Chat combines long-lived connections, durable storage, asynchronous delivery, offline users, retry storms, ordering rules, mobile network failures, push notification integrations, media storage, privacy constraints, and heavy fan-out for group conversations.
This is not a reverse-engineering of WhatsApp's private internals. We are designing a production-grade system for a WhatsApp-style messaging product. The goal is to understand the forces that push chat systems toward distribution: connection scale, delivery guarantees, message ordering, offline behavior, and operation over unreliable mobile networks.
The most useful mental model is to separate three concerns:
- The message must be accepted durably.
- The message must be delivered to the right recipients and devices.
- The product must show useful state, such as sent, delivered, read, unread count, and conversation order.
If those concerns are mixed into one synchronous request path, the system becomes fragile. A slow push provider, offline recipient, lagging group fan-out worker, or overloaded receipt pipeline should not make the sender lose the message.
2. Product Requirements
Functional Requirements
- Users can send and receive one-to-one text messages.
- Users can create group conversations and send messages to the group.
- Users can use multiple devices for the same account.
- Users can see message states such as sent, delivered, and read.
- Users can receive messages after being offline.
- Users can receive push notifications when not actively connected.
- Users can share media attachments such as images, videos, and documents.
- Users can see recent conversations ordered by latest activity.
- Users can see unread counts per conversation.
- Users can reconnect after mobile network changes without losing accepted messages.
- Operators can rate limit abusive senders and suspicious automation.
Non-Functional Requirements
- Message send should feel fast to the sender.
- Accepted messages should not be lost.
- Duplicate sends caused by retries should not create duplicate messages.
- Conversation ordering should be understandable and stable.
- Online delivery should be low latency.
- Offline delivery should be reliable.
- The system should tolerate reconnect storms.
- Group fan-out should not overload the rest of the platform.
- Push notification failures should not block message durability.
- The design should support horizontal scaling by user, conversation, or region.
Out Of Scope For This Study
- Exact cryptographic protocol design.
- Voice and video calls.
- Payments, business messaging, and ads.
- Full contact discovery and phone-number verification flows.
- Exact WhatsApp implementation details.
End-to-end encryption is discussed as an architectural constraint, but not as a cryptography tutorial. In a real product, cryptography must be designed and reviewed by specialists.
3. Core Engineering Challenges
| Challenge | Why it matters |
|---|---|
| Long-lived connections | Millions of mobile clients may keep WebSocket-like connections open. |
| Mobile unreliability | Phones switch networks, sleep, reconnect, duplicate requests, and lose connectivity. |
| Durable acceptance | Once the server acknowledges a send, the message should not disappear. |
| Duplicate prevention | Clients retry aggressively, so send requests must be idempotent. |
| Message ordering | Users expect a conversation to have a coherent order even when requests race. |
| Offline delivery | Recipients may be disconnected for minutes, days, or longer. |
| Multi-device sync | One account may have several devices with different connection states. |
| Group fan-out | One message can become thousands or millions of recipient-device deliveries. |
| Receipts | Delivered/read states are derived from device acknowledgements and can lag. |
| Push notifications | Push providers are external dependencies and should not block core messaging. |
| Abuse prevention | Public messaging systems attract spam, scraping, and automated harassment. |
A naive design might keep every conversation in memory on one server and broadcast messages directly over open sockets. That works for a toy app, but it fails when users reconnect to different servers, when recipients are offline, when messages need durability, when group chats are large, and when delivery state must be rebuilt after a failure.
The production design needs a durable message log, connection routing, asynchronous delivery workers, derived read models, and explicit failure policies.
4. High-Level Architecture
```mermaid
flowchart LR
  Sender[Sender App] --> Gateway[Realtime Gateway]
  Gateway --> MessageAPI[Message API]
  MessageAPI --> Idempotency[Idempotency Check]
  Idempotency --> MessageStore[(Message Store)]
  MessageStore --> Outbox[(Message Outbox)]
  Outbox --> Stream[Message Event Stream]
  Stream --> DeliveryWorkers[Delivery Workers]
  DeliveryWorkers --> ConnectionRouter[Connection Router]
  ConnectionRouter --> RecipientGateway[Recipient Gateway]
  RecipientGateway --> Recipient[Recipient App]
  DeliveryWorkers --> OfflineQueue[(Offline Delivery Queue)]
  DeliveryWorkers --> PushService[Push Notification Service]
  Recipient --> ReceiptAPI[Receipt API]
  ReceiptAPI --> ReceiptStream[Receipt Stream]
  ReceiptStream --> ProjectionWorkers[Projection Workers]
  ProjectionWorkers --> InboxStore[(Inbox And Unread Projections)]
  Sender --> MediaService[Media Upload Service]
  MediaService --> ObjectStore[(Object Storage)]
```
The most important architecture decision is that message acceptance and message delivery are separate.
The sender talks to a gateway. The gateway calls the message write path. The write path validates the request, deduplicates it, persists the message, and publishes a delivery event. Delivery workers then decide which recipient devices need the message, where those devices are connected, whether push notification is needed, and which derived projections must be updated.
This separation lets the system handle slow recipients, offline users, push provider problems, and group fan-out without blocking the sender's durable write.
5. Core Components
Client Application
The client application owns the user's local messaging experience. It generates a client message ID before sending, stores pending messages locally, retries failed sends, reconnects after network changes, and reconciles server acknowledgements with local state.
The client should not assume one request equals one message. Mobile networks are unreliable. A send request may time out even if the server accepted the message. The client may retry the same logical send. That is why every send should carry a stable client-generated idempotency key, often scoped to the sending user or device.
The client also sends delivery and read acknowledgements. Those acknowledgements can be delayed, batched, retried, or dropped under network pressure. The server should treat them as events that improve derived state, not as perfectly synchronized truth.
Realtime Gateway
The Realtime Gateway manages long-lived client connections. In many chat systems this is a WebSocket-like service, although the exact transport can vary. Its job is to authenticate the connection, track the connected user and device, receive inbound messages, push outbound messages, and send lightweight heartbeats.
The gateway should not be the durable source of truth for messages. Gateways are connection-oriented and can restart, drain, or lose clients during deployments. If a gateway dies after receiving a send but before persisting it, the client must retry. If a gateway dies after the message is persisted, delivery should continue through another worker or another gateway.
At scale, gateways are horizontally replicated. A connection registry records which user device is connected to which gateway instance. The registry can live in a fast store with TTLs, because presence and connection location are naturally ephemeral.
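The connection registry described above can be sketched as a TTL'd map from (user, device) to gateway instance. This is an in-memory illustration with assumed names and an illustrative TTL, not a real API; production systems typically use a fast shared store with native expiry.

```python
import time

# Ephemeral connection registry sketch: maps (user_id, device_id) to the
# gateway holding the connection. Entries expire via TTL, so a dead gateway's
# stale entries eventually read as "disconnected". TTL value is illustrative.
_registry = {}  # (user_id, device_id) -> (gateway_id, expires_at)
TTL_SECONDS = 60

def register(user_id, device_id, gateway_id, now=None):
    """Record (or refresh via heartbeat) where a device is connected."""
    now = time.time() if now is None else now
    _registry[(user_id, device_id)] = (gateway_id, now + TTL_SECONDS)

def lookup(user_id, device_id, now=None):
    """Return the gateway for a device, or None if unknown or expired."""
    now = time.time() if now is None else now
    entry = _registry.get((user_id, device_id))
    if entry is None or entry[1] < now:
        return None  # expired entries are treated as disconnected
    return entry[0]
```

Heartbeats refresh the TTL; a crashed gateway simply stops refreshing, and its entries age out without explicit cleanup.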
Important gateway metrics include active connections, reconnect rate, authentication failures, heartbeat failures, send latency, outbound queue depth, and dropped connection count.
Message API
The Message API owns the durable send path. It validates the sender, verifies conversation membership, checks rate limits, validates payload size, applies idempotency, persists the message, and writes a delivery event.
This component decides whether a message has been accepted. Once accepted, the system should be able to deliver it eventually or at least expose a clear failure state. That means the write path should persist the message before acknowledging success.
A practical API response might include:
```json
{
  "server_message_id": "m_93f2",
  "conversation_id": "c_100",
  "server_sequence": 84211,
  "state": "accepted"
}
```
The sender can then replace the local pending message ID with the server message ID while preserving the user's local UI state.
Idempotency Store
The Idempotency Store prevents duplicate messages when the client retries. A send request includes a stable key such as:
sender_id + device_id + client_message_id
If the server has already processed that key, it returns the original server message ID instead of creating a second message.
This is not optional for mobile chat. Timeouts, retries, network switches, app restarts, and background execution can all cause duplicate sends. Without idempotency, a user may see the same message sent multiple times even though they only intended one send.
The idempotency record does not need to live forever. It only needs to cover the retry window. The message itself remains durable in the message store.
Message Store
The Message Store is the durable source of truth for messages. It stores the message ID, conversation ID, sender ID, server sequence, payload metadata, creation time, and optional media references.
For privacy-sensitive systems, the server may store encrypted payloads and only understand metadata needed for routing and abuse controls. Even then, the system still needs durable metadata to deliver, order, and sync messages.
The store is usually partitioned. Common partitioning choices include:
- Partition by conversation ID to keep conversation ordering local.
- Partition by user ID to optimize inbox sync.
- Partition by region to reduce latency and data movement.
Partitioning by conversation ID is attractive for ordering because one conversation can receive a monotonically increasing sequence. But very large groups can become hot partitions. Partitioning by user ID can make user sync easier, but group fan-out writes may touch many partitions.
Message Outbox
The Message Outbox bridges message persistence and asynchronous delivery. The key reliability problem is this: writing the message and publishing the delivery event must not fall out of sync.
If the API writes the message but fails to publish the event, the message exists but delivery may never start. If it publishes the event but fails to write the message, workers may try to deliver a message that does not exist.
The outbox pattern handles this by storing a delivery event in the same durable transaction or write unit as the message. A relay then publishes outbox rows to the message event stream. If the relay crashes, it resumes from the outbox.
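The outbox mechanics can be sketched with in-memory lists standing in for the database and the stream; in a real system the two writes in `accept_message` would be one database transaction (names are illustrative):

```python
messages = {}  # message_id -> message record
outbox = []    # rows: {"message_id": ..., "published": bool}
stream = []    # stand-in for the message event stream

def accept_message(message_id, conversation_id, body):
    # In a real database these two writes happen in a single transaction,
    # so the message and its delivery event cannot fall out of sync.
    messages[message_id] = {"conversation_id": conversation_id, "body": body}
    outbox.append({"message_id": message_id, "published": False})

def run_outbox_relay():
    # Safe to re-run after a crash: only unpublished rows are sent, so a
    # restarted relay resumes from where it left off.
    for row in outbox:
        if not row["published"]:
            stream.append({"type": "message_accepted", "message_id": row["message_id"]})
            row["published"] = True
```

Note that a relay crash between the `stream.append` and the `published` flag can still publish an event twice, which is exactly why downstream consumers must be idempotent.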
Message Event Stream
The Message Event Stream carries accepted message events to delivery workers. It decouples the write path from delivery work.
The stream lets the system absorb bursts, retry failed delivery attempts, and scale consumers independently. It also provides an ordering boundary. For example, if events for a conversation are routed to the same partition, delivery workers can process that conversation in order. If events are partitioned by user, per-conversation ordering must be enforced elsewhere.
Streams introduce lag. Lag is not automatically bad, but it must be visible. If stream lag grows, users may see messages accepted but not delivered quickly. That is a backpressure signal.
Delivery Workers
Delivery Workers consume message events and decide who should receive the message.
For a one-to-one chat, the worker identifies the recipient's active devices, sends to connected devices through the connection router, and queues offline delivery for disconnected devices. For a group chat, the worker performs group message fan-out: it expands the group membership into recipients and then devices.
This component must be idempotent. A message event may be delivered to the worker more than once. The worker should not create duplicate per-device delivery records or send duplicate pushes without a guard. Idempotent consumer logic is especially important during retries, worker crashes, and stream rebalances.
Delivery workers also need backpressure behavior. If a group with many members generates too much fan-out, workers should not starve all other conversations. Bulkheads, quotas, and separate worker pools can isolate large group delivery from normal one-to-one delivery.
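The idempotent-consumer requirement can be sketched as follows, using a processed-event set and first-write-wins delivery records (an in-memory illustration with assumed names):

```python
processed_events = set()  # event IDs already handled by this consumer group
delivery_records = {}     # (message_id, device_id) -> delivery state

def handle_message_event(event_id, message_id, recipient_devices):
    """Process a message event; reprocessing the same event is a no-op."""
    if event_id in processed_events:
        return  # duplicate delivery from the stream: already handled
    for device_id in recipient_devices:
        # setdefault keeps the first record if one already exists, so a
        # partially processed event can be retried without duplicates.
        delivery_records.setdefault((message_id, device_id), "pending")
    processed_events.add(event_id)
```

In production the dedupe state would live in a durable store keyed per consumer group, not process memory, so it survives worker restarts and stream rebalances.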
Connection Router
The Connection Router knows where connected devices are currently attached. If device d_42 is connected to gateway gw_7, the router sends the outbound message to gw_7, which pushes it to the device.
The connection router depends on ephemeral presence data. If the registry says a device is connected but the gateway has already lost it, delivery fails and the system falls back to the offline queue or a retry. That is normal: presence is a useful hint, not perfect truth.
Offline Delivery Queue
The Offline Delivery Queue tracks messages that could not be delivered to a device in real time. When a device reconnects, it can ask for messages after its last acknowledged server sequence or device checkpoint.
There are two common approaches:
- Store per-device pending delivery records.
- Store the canonical message log and let the device sync from a checkpoint.
Per-device queues can make delivery state explicit, but they grow with the number of devices and recipients. Checkpoint-based sync can be cheaper, but it requires efficient queries over the message log and membership history.
Most serious designs use a combination: durable message history plus derived per-device or per-user delivery state where needed.
Receipt Service
The Receipt Service processes message receipts, including delivery receipts and read receipts.
A delivered receipt usually means a recipient device received the message. A read receipt usually means the user opened the conversation or the client marked messages as read. These are not the same thing.
Receipts should be treated as events:
message_delivered(user_id, device_id, message_id, timestamp)
message_read(user_id, conversation_id, up_to_sequence, timestamp)
The service writes receipt events to a stream. Projection workers update read models such as message state, unread counters, and conversation summaries. These projections can be eventually consistent. It is acceptable if a read receipt appears a moment after the user opens the chat, as long as the system converges.
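A read-cursor projection can be sketched as a monotonic map: receipts may arrive late, batched, or out of order, so the cursor only ever moves forward. Names are illustrative:

```python
read_cursors = {}  # (user_id, conversation_id) -> highest read server_sequence

def apply_read_receipt(user_id, conversation_id, up_to_sequence):
    """Advance the user's read cursor; stale or reordered receipts are absorbed."""
    key = (user_id, conversation_id)
    # max() makes the projection idempotent and order-insensitive: replaying
    # or reordering receipt events converges to the same cursor.
    read_cursors[key] = max(read_cursors.get(key, 0), up_to_sequence)
```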
Inbox And Unread Projection Store
The Inbox Store powers fast UI reads: conversation list, latest message preview, unread counts, pinned conversations, archived state, and mute state.
This store is a derived projection. It should be rebuildable from durable source events: messages, receipts, membership changes, and user settings. Treating the inbox as derived keeps the write model simpler and makes repair possible when counters drift.
Unread counts are a classic place where projection drift happens. A retry, delayed read receipt, or missed event can make a count wrong. The system should support reconciliation jobs that recompute counts from source data for affected users or conversations.
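A reconciliation job for drifted counters can be as simple as recounting from the durable message log against the user's read cursor (a sketch; the real query would run against the partitioned store):

```python
def recompute_unread(conversation_messages, read_up_to_sequence):
    """Recompute an unread count from source data: messages in the durable
    log whose server_sequence is past the user's read cursor."""
    return sum(
        1 for m in conversation_messages
        if m["server_sequence"] > read_up_to_sequence
    )
```

The repaired value overwrites the drifted projection row; because the projection is derived, this is always safe.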
Push Notification Service
Push notification handoff is for users who are not actively connected or whose device OS requires notification delivery through a platform provider.
Push should not be part of the critical durable send transaction. External push providers can be slow, rate limited, unavailable, or return transient errors. The system should enqueue push tasks after message acceptance and retry them separately.
Push payloads may need to be privacy-preserving. Depending on encryption and product policy, the push notification might contain only "New message" rather than plaintext message content.
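The separate push retry path typically uses exponential backoff with jitter. A sketch that computes the delay schedule (parameter values are illustrative defaults, not provider requirements):

```python
import random

def backoff_delays(max_attempts=5, base=1.0, cap=60.0, seed=None):
    """Exponential backoff with full jitter: each retry waits a random time
    between 0 and min(cap, base * 2^attempt) seconds."""
    rng = random.Random(seed)
    delays = []
    for attempt in range(max_attempts):
        delays.append(rng.uniform(0, min(cap, base * (2 ** attempt))))
    return delays
```

Jitter matters here: when a push provider recovers, thousands of queued retries without jitter would hit it in synchronized waves.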
Media Service
A media message pipeline keeps large binary files out of the core delivery path. A common flow is:
- The sender requests upload authorization.
- The client uploads media to object storage.
- The media service scans or processes the object if product policy requires it.
- The sender sends a message referencing the uploaded media object.
- Recipients download media through authorized URLs or media proxies.
This keeps chat message delivery small and metadata-oriented. It also allows media processing, thumbnail generation, virus scanning, retention policies, and download authorization to evolve separately from core text messaging.
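The upload-then-reference flow, including orphan cleanup, can be sketched with an in-memory object store (all function and variable names are illustrative assumptions):

```python
import uuid

object_store = {}  # media_object_id -> blob
media_refs = set() # media objects referenced by an accepted message

def authorize_upload():
    """Issue an upload slot; returns the media object ID the client uploads to."""
    return "media_" + uuid.uuid4().hex[:8]

def upload(media_object_id, blob):
    object_store[media_object_id] = blob

def send_media_message(media_object_id):
    """The message carries only a reference to the already-uploaded object."""
    if media_object_id not in object_store:
        raise ValueError("upload must complete before the message references it")
    media_refs.add(media_object_id)
    return {"type": "media", "media_object_id": media_object_id}

def garbage_collect_orphans():
    # Uploads that were never referenced by an accepted message can be removed
    # (a real GC would also apply a grace period for in-flight sends).
    for mid in list(object_store):
        if mid not in media_refs:
            del object_store[mid]
```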
6. Data Modeling
Core Entities
| Entity | Purpose |
|---|---|
| User | Account identity and profile metadata. |
| Device | A registered device for a user. |
| Conversation | One-to-one or group conversation. |
| ConversationMember | User membership, role, join time, and leave time. |
| Message | Durable message record. |
| MessageDelivery | Per-user or per-device delivery state. |
| ReceiptEvent | Delivered/read acknowledgement event. |
| InboxProjection | Fast conversation list and unread counters. |
| MediaObject | Metadata for uploaded media. |
Message Record
messages
- message_id
- conversation_id
- server_sequence
- sender_user_id
- sender_device_id
- client_message_id
- message_type
- encrypted_payload
- media_object_id
- created_at
- edit_state
- delete_state
The server_sequence is important. It gives each conversation a stable message ordering reference assigned by the server. Clients can still show local pending messages instantly, but server sequence becomes the shared ordering reference once the message is accepted.
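Per-conversation sequence assignment can be sketched as a simple monotonic counter per conversation ID; in practice the allocation must happen inside whatever owns the conversation's partition so it stays strictly increasing under concurrency (this sketch assumes single-threaded access):

```python
_sequences = {}  # conversation_id -> last assigned server_sequence

def next_sequence(conversation_id):
    """Assign the next server_sequence for a conversation. Sequences are
    scoped per conversation, not global, which keeps allocation local to
    the conversation's partition."""
    _sequences[conversation_id] = _sequences.get(conversation_id, 0) + 1
    return _sequences[conversation_id]
```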
Idempotency Record
message_idempotency
- sender_user_id
- sender_device_id
- client_message_id
- server_message_id
- created_at
- expires_at
The unique key is usually (sender_user_id, sender_device_id, client_message_id). If the same request arrives again, the API returns the existing server_message_id.
Conversation Membership
conversation_members
- conversation_id
- user_id
- role
- joined_at_sequence
- left_at_sequence
- muted_until
- archived_at
Membership history matters. If a user joins a group today, they may not be allowed to see old messages. If a user leaves, they should not receive future messages. The delivery worker needs membership state at the time of the message.
Inbox Projection
user_conversation_inbox
- user_id
- conversation_id
- latest_message_id
- latest_server_sequence
- latest_activity_at
- unread_count
- read_up_to_sequence
- delivery_cursor
- pinned_rank
- archived
This table exists because rendering the conversation list from raw messages every time would be expensive. It is derived state optimized for the UI.
7. Request Lifecycle
Sending A One-To-One Message
- The client creates `client_message_id`.
- The client sends the message to a realtime gateway.
- The gateway authenticates the user and forwards the request to the Message API.
- The Message API verifies conversation membership and rate limits.
- The API checks idempotency.
- The API assigns `server_message_id` and `server_sequence`.
- The API writes the message and outbox event.
- The API returns `accepted` to the sender.
- The outbox relay publishes a message event.
- Delivery workers identify recipient devices.
- Connected devices receive the message through their gateways.
- Offline devices sync later or receive push notifications.
- Recipient devices send delivery/read receipts.
- Projection workers update message state, inbox rows, and unread counts.
The sender should not wait for all recipient devices to acknowledge before seeing "sent." This is the product-facing version of delivery guarantees: each label should mean a different backend boundary. The user-visible states are staged:
- Pending: local client has not received server acceptance.
- Sent: server accepted the message durably.
- Delivered: at least one recipient device acknowledged receipt, depending on product rules.
- Read: recipient user read the message, depending on privacy settings.
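The staged labels above can be sketched as a pure mapping from backend facts, assuming one simple product rule ("delivered" means at least one device acknowledged); real products vary this per privacy settings:

```python
def message_state(accepted, delivered_device_count, read):
    """Map backend boundaries to the user-visible label.
    Each label corresponds to a distinct backend fact."""
    if read:
        return "read"       # recipient read cursor passed this message
    if delivered_device_count > 0:
        return "delivered"  # at least one recipient device acknowledged
    if accepted:
        return "sent"       # server persisted the message durably
    return "pending"        # no durable server acknowledgement yet
```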
Sending A Group Message
Group messaging adds fan-out. The message is still written once to the conversation log, but delivery work expands to all eligible members and devices.
For small groups, the worker can fan out immediately. For large groups, the system may batch recipients, use separate worker pools, or create per-recipient delivery tasks. The goal is to prevent one large group from delaying normal one-to-one delivery.
Offline Recipient Sync
When a device reconnects, it uses cursor-based sync and sends its last known cursor:
conversation_id -> last_received_sequence
The server returns messages after that cursor, subject to membership and retention rules. If the device has been offline for a long time, the server may paginate the sync or ask the client to refresh conversation summaries first.
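The paginated cursor sync can be sketched as follows, assuming the device's message log view is sorted by `server_sequence` (names and page size are illustrative):

```python
def sync_after(message_log, last_received_sequence, page_size=2):
    """Return the next page of messages past the device's cursor, plus the
    advanced cursor. Repeated calls walk forward until the page is empty."""
    page = [
        m for m in message_log
        if m["server_sequence"] > last_received_sequence
    ][:page_size]
    new_cursor = page[-1]["server_sequence"] if page else last_received_sequence
    return page, new_cursor
```

The client persists `new_cursor` after each page, so a crash mid-sync simply re-fetches the last page, and idempotent client-side apply absorbs the duplicates.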
Media Message Flow
Media should follow the upload-then-reference pattern: upload through the media pipeline before the message is sent, or as part of a resumable upload flow. The message itself should contain metadata and a media reference, not the raw file.
If upload succeeds but message send fails, the media object may become orphaned and can be garbage collected. If message send succeeds but media processing is delayed, recipients may see a placeholder until the media becomes available.
8. Scaling Problems
Connection Scale
Realtime gateways handle long-lived connections. Ten million connected users does not mean ten million active sends per second, but it does mean the system must manage many sockets, heartbeats, connection metadata, and reconnects.
Gateway scaling is usually horizontal. Clients connect to any healthy gateway, and the connection registry records where each device is attached. The system needs careful load balancing so reconnect storms do not overload a few gateway instances.
Message Write Scale
Message writes are partitioned by conversation, user, or region. The partition key affects ordering and hot spots.
Conversation-based partitioning simplifies per-conversation order, but large groups can become hot. User-based partitioning helps inbox reads, but group writes scatter across many users. Region-based partitioning reduces latency, but cross-region conversations require routing and consistency decisions.
Group Fan-Out
Group message fan-out is where chat systems become expensive. A single message to a group of 100,000 members can create 100,000 recipient-level delivery tasks, plus device-level tasks, push tasks, and unread projection updates.
Common mitigation strategies:
- Use separate worker pools for large groups.
- Batch fan-out tasks.
- Store one canonical group message and derive per-user visibility lazily.
- Apply rate limits to very large groups.
- Use backpressure to slow non-critical projections before core message acceptance.
Fan-out is a tradeoff between write cost and read cost. Fan-out-on-write makes recipient reads fast but creates heavy write bursts. Fan-out-on-read makes writes cheaper but makes each recipient's inbox computation more complex.
For very large groups, large group fan-out isolation keeps this workload from starving normal one-to-one delivery.
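The bulkhead routing behind fan-out isolation can be sketched as a size-based split into separate queues backed by separate worker pools; the threshold is an illustrative assumption, not a recommendation:

```python
LARGE_GROUP_THRESHOLD = 10_000  # illustrative cutoff, tuned per system

normal_queue = []       # consumed by the general delivery worker pool
large_group_queue = []  # consumed by a dedicated, separately-scaled pool

def route_fanout_task(conversation_id, member_count, message_id):
    """Bulkhead: very large groups go to a dedicated queue so their fan-out
    bursts cannot starve normal one-to-one and small-group delivery."""
    queue = (
        large_group_queue
        if member_count >= LARGE_GROUP_THRESHOLD
        else normal_queue
    )
    queue.append((conversation_id, message_id))
```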
Receipt Scale
Message receipts can outnumber messages. A group message to many users may generate many delivery acknowledgements and read acknowledgements.
The system should batch receipts where possible. For read receipts, the read cursor receipts pattern is usually better than sending a separate read event for every message.
Hot Conversations
A large active group can become a hot key. The message sequence allocator, message partition, delivery workers, and projection store may all concentrate load around one conversation ID.
Mitigations include sharding large group delivery tasks, separating storage of canonical messages from per-recipient projections, and isolating large groups into dedicated worker pools.
9. Distributed Systems Concepts
Idempotency
Idempotency prevents duplicate messages when clients retry. The server must treat repeated sends with the same client message ID as the same logical operation.
Without idempotency, timeouts become user-visible duplicates. With idempotency, the client can retry safely until it receives a server response.
Ordering
Message ordering is usually scoped to a conversation. The system does not need one global order for all messages across all users. A server sequence per conversation is enough for most chat UI.
Ordering becomes harder when multiple senders send concurrently, when messages cross regions, or when group membership changes. The product should define the rule: messages appear in server acceptance order within a conversation.
Eventual Consistency
Delivered states, read states, unread counts, and conversation previews are often eventually consistent. The source message is durable first; projections catch up.
This is acceptable if the UI is designed for it. Users tolerate a receipt arriving slightly late. They do not tolerate losing the message itself.
Backpressure
Backpressure appears when delivery workers, push queues, receipt streams, or projection stores fall behind.
The system should degrade in the right order. Core message acceptance is more important than immediate unread count accuracy. Push notifications are useful, but they should not block durable sends. Non-critical analytics can lag before delivery does.
Derived Projections
Conversation lists and unread counts are derived from messages and receipts. Treating them as rebuildable projections makes the system easier to repair.
If a projection drifts, reconciliation jobs can recompute it from durable events.
10. Reliability & Failure Handling
Message API Fails Before Persisting
If the API fails before persisting the message, the client does not receive a durable acknowledgement. The client should retry with the same idempotency key.
Message Persists But Response Is Lost
If the server persists the message but the response times out, the client retries. The idempotency store returns the already-created message ID. This is one of the most important chat reliability paths.
Outbox Relay Fails
If the outbox relay fails, messages remain in the outbox. When the relay recovers, it publishes the pending events. Delivery is delayed, but messages are not lost.
Delivery Worker Crashes
If a worker crashes while processing a message event, the stream should redeliver or another worker should resume. Delivery workers must be idempotent so reprocessing does not duplicate delivery records.
Gateway Fails
If a gateway fails, clients reconnect to another gateway. Connected presence records should expire through TTLs. Undelivered messages are recovered through sync from the message log or offline queue.
Push Provider Is Unavailable
Push failures should be retried with backoff. They should not block message acceptance. If push is down, connected users still receive realtime messages, and offline users receive messages when they reconnect.
Projection Drift
Unread counters and conversation previews can drift because of missed events, retries, or delayed receipts. Reconciliation jobs should compare projections against source events and repair affected rows.
Regional Failure
For a global messaging product, regional failure strategy is a major design decision. Some products prefer a user's home region for data locality. Others replicate conversations across regions. Cross-region messaging adds latency and conflict handling.
At the foundation level, keep the design honest: start with regional routing and clear ownership, then add replication once the product needs it.
11. Real-World Company Approaches
Real messaging companies usually optimize for three things: reliability, mobile behavior, and operational cost.
Public engineering discussions across the industry commonly show the following patterns:
- Long-lived connection gateways are separated from durable storage.
- Message acceptance is separated from asynchronous delivery.
- Delivery pipelines use queues or streams.
- Mobile clients use retries and local pending state.
- Read models such as inboxes and unread counters are derived projections.
- Large fan-out workloads are isolated from normal traffic.
- Push notification providers are treated as external dependencies, not as the source of truth.
For a WhatsApp-style product specifically, we should avoid claiming exact internal architecture. The safe and useful lesson is that the user-facing simplicity of messaging requires a careful split between durable message storage, realtime delivery, offline sync, receipts, and device state.
12. Tradeoffs & Alternatives
WebSocket Push vs Polling
WebSocket-like realtime connections reduce latency and server polling overhead, but they require connection management, heartbeats, load balancing, and reconnect handling. Polling is simpler but wasteful and slower for real-time messaging.
Fan-Out-On-Write vs Fan-Out-On-Read
Fan-out-on-write creates recipient delivery records when a message is sent. This makes reads and sync easier, but large groups create heavy write bursts.
Fan-out-on-read stores the message once and computes recipient visibility when users read or sync. This lowers write amplification but increases read complexity.
Most systems mix both approaches. Small groups can fan out eagerly. Very large groups may use more lazy or batched delivery.
Per-Device Delivery vs Per-User Delivery
Per-device delivery is precise for multi-device sync, but it creates more state. Per-user delivery is cheaper, but it can miss details like one device receiving a message while another is offline.
If multi-device correctness matters, the system usually needs device-level cursors or acknowledgements.
Strong Consistency vs Eventual Consistency
Strong consistency for every receipt, unread count, and conversation preview would be expensive and fragile. Eventual consistency is usually the right tradeoff for secondary state.
The message itself needs stronger durability than the projections around it.
Store Messages Once vs Copy To Every Inbox
Storing messages once is storage-efficient and makes edits/deletes easier. Copying to every inbox makes user reads fast but increases write amplification and repair complexity.
The right choice depends on group size, read patterns, retention rules, and privacy constraints.
13. Evolution Path
Phase 1: Simple Durable Chat
Start with one-to-one messages, a durable message table, server message IDs, basic conversation membership, and a realtime gateway. Use client retries with idempotency from the beginning.
Phase 2: Offline Sync And Push
Add per-device cursors, offline sync, and push notification tasks. Separate push retries from message acceptance.
Phase 3: Receipts And Inbox Projections
Add delivered/read receipts, unread counters, and conversation list projections. Treat these as eventually consistent derived state.
Phase 4: Group Messaging
Add group membership, group delivery workers, batching, and large-group safeguards. Separate large-group fan-out from normal one-to-one traffic.
Phase 5: Media And Multi-Device Polish
Add media upload, thumbnails, resumable uploads, device-specific sync state, and better reconciliation.
Phase 6: Regional Scaling
Add regional routing, cross-region replication, disaster recovery, and careful ownership rules for conversations and users.
14. Key Engineering Lessons
- Chat is a durable event system with realtime delivery, not just a socket server.
- The sender acknowledgement should mean the message was accepted durably.
- Idempotency is mandatory because mobile clients retry.
- Message ordering should be scoped to the conversation, not the whole world.
- Delivery, read receipts, inboxes, and unread counts are derived state.
- Offline users are normal, not an edge case.
- Push notifications are a secondary delivery path, not the source of truth.
- Group fan-out is the main scaling pressure.
- Large groups and reconnect storms need isolation and backpressure.
- The system should prefer delayed secondary state over lost messages.
15. Related Topics
- Realtime Gateways
- Message Ordering
- Delivery Guarantees
- Offline Delivery
- Presence
- Push Notification Handoff
- Message Receipts
- Multi-Device Sync
- Group Message Fan-Out
- Media Message Pipeline
- Idempotency
- Event Streams
- Fan-Out
- Eventual Consistency
- Derived Projections
- Backpressure
- Rate Limiting
- Sharding
- Hot Key Mitigation
- Outbox Pattern
- Idempotent Consumer
- Retry With Backoff And Jitter
- Dead-Letter Queue
- Bulkhead Isolation
- Connection Registry
- Cursor-Based Sync
- Read Cursor Receipts
- Large Group Fan-Out Isolation
- Upload-Then-Reference Media