Realtime Gateways

Connection-facing services that keep clients reachable in real time while leaving durable message truth to backend systems.

Intermediate | 4 min read | Reliability, Operations, Tradeoffs
Topics: Long-Lived Connections, Connection Routing, Heartbeats, Gateway State, Reconnect Storms

Concepts Covered

  • Long-lived client connections
  • WebSocket-style gateway services
  • Connection registries
  • Heartbeats and liveness
  • Gateway-local state
  • Slow clients
  • Reconnect storms
  • Realtime delivery vs durable storage

Definition

A realtime gateway is a service that manages long-lived client connections and gives the backend a way to push events to connected users quickly.

In a chat system, the gateway is the component that keeps a phone, browser, or desktop app reachable while the user is online.

It accepts an authenticated connection, records which user and device are attached to that gateway instance, forwards inbound client commands, and pushes outbound events such as messages, receipts, typing indicators, and presence changes.

The gateway is not the durable source of truth. That boundary matters.

Gateways are optimized for connection management, not permanent storage. They can restart during deployments, lose sockets when mobile networks change, and shed load during incidents. If a gateway disappears, accepted messages should still be recoverable from the message store or event log.

The Pain That Forces Realtime Gateways

Without realtime gateways, clients usually poll the server:

Do I have new messages?
Do I have new messages?
Do I have new messages?

Polling is simple, but it wastes resources and increases latency: most polls return nothing, and a new message waits until the next poll to be seen. A messaging app feels instant because the server can push events to the client as soon as they are available.

At small scale, one server can keep open sockets and store everything in memory.

At large scale, users connect to many gateway instances. A delivery worker trying to deliver a message to device d_42 needs to know where that device is currently connected.

That is why production systems usually need a connection registry:

device_id -> gateway_instance_id
d_42      -> gw_17

This mapping is naturally temporary. It should have a TTL and be refreshed by heartbeats, because a phone can vanish without sending a clean disconnect.
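
A minimal sketch of such a registry, assuming an in-memory dictionary stands in for whatever shared store a real deployment would use. The class name ConnectionRegistry, its method names, and the injectable clock are illustrative assumptions, not a specific product's API:

```python
import time
from typing import Callable, Dict, Optional, Tuple


class ConnectionRegistry:
    """Hypothetical in-memory registry: device_id -> gateway_instance_id with a TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # device_id -> (gateway_instance_id, expires_at)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def heartbeat(self, device_id: str, gateway_id: str) -> None:
        """Create or refresh the mapping; called on connect and on every heartbeat."""
        self._entries[device_id] = (gateway_id, self._clock() + self._ttl)

    def lookup(self, device_id: str) -> Optional[str]:
        """Return the owning gateway, or None if the entry is missing or expired."""
        entry = self._entries.get(device_id)
        if entry is None:
            return None
        gateway_id, expires_at = entry
        if self._clock() >= expires_at:
            # The device stopped heartbeating without a clean disconnect.
            del self._entries[device_id]
            return None
        return gateway_id

    def disconnect(self, device_id: str, gateway_id: str) -> None:
        """Remove the mapping on clean disconnect, but only if this gateway still owns it."""
        entry = self._entries.get(device_id)
        if entry is not None and entry[0] == gateway_id:
            del self._entries[device_id]
```

The ownership check in disconnect is a design choice in this sketch: if a device has already reconnected to another gateway, a late disconnect from the old gateway should not erase the newer mapping.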

Mental Model

The gateway is the live edge of the system.

It knows:

  • who is connected right now
  • which socket belongs to which device
  • which events can be pushed immediately
  • which clients are too slow

It should not be the only place that knows:

  • which messages exist
  • which messages were accepted
  • what the final conversation order is
  • what the durable unread count is

The gateway makes the product feel live. The backend makes the product recoverable.

What Belongs In A Gateway

Good gateway responsibilities:

  • authenticate the connection
  • track user ID, device ID, connection ID, and gateway instance
  • maintain heartbeats or ping/pong checks
  • forward inbound commands to durable backend services
  • push outbound events to connected clients
  • apply local connection limits and basic abuse controls
  • drain connections safely during deployment

Responsibilities that should usually not live only in the gateway:

  • canonical message history
  • permanent delivery state
  • conversation membership truth
  • final message ordering
  • long-term unread counts

The gateway can cache small pieces of state for speed, but the system must survive losing that cache.

How Delivery Uses Gateways

A common delivery flow looks like this:

1. The message API durably accepts the message.
2. The message event reaches the delivery workers.
3. Workers identify the recipient devices.
4. Workers query the connection registry.
5. Each online device is routed to its gateway.
6. An offline device or a failed route falls back to sync.

This is why realtime delivery and offline delivery must work together. The gateway path is the fast path. The durable sync path is the safety net.
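
Steps 3 to 6 of the flow can be sketched as a single worker function. This is an illustration under assumptions: the function deliver and its callback parameters (lookup_gateway, push_to_gateway, fall_back_to_sync) are hypothetical names, and the message is assumed to be durably stored before the function runs.

```python
from typing import Any, Callable, Dict, Iterable, Optional


def deliver(
    message: Any,
    recipient_devices: Iterable[str],
    lookup_gateway: Callable[[str], Optional[str]],
    push_to_gateway: Callable[[str, str, Any], bool],
    fall_back_to_sync: Callable[[str, Any], None],
) -> Dict[str, str]:
    """Try the fast gateway path per device; use the sync path otherwise.

    Returns device_id -> "pushed" or "sync" so callers can record metrics.
    """
    outcomes: Dict[str, str] = {}
    for device_id in recipient_devices:
        gateway_id = lookup_gateway(device_id)  # None means no live route is known
        if gateway_id is not None and push_to_gateway(gateway_id, device_id, message):
            outcomes[device_id] = "pushed"
        else:
            # Offline device, stale registry entry, or failed push: the message
            # is already durable, so the device picks it up on its next sync.
            fall_back_to_sync(device_id, message)
            outcomes[device_id] = "sync"
    return outcomes
```

Because a failed push and an offline device take the same branch, a stale registry entry costs latency rather than a lost message.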

Operational Reality

Important signals:

  • active connections per gateway
  • new connections per second
  • reconnect rate
  • authentication failures
  • heartbeat timeout rate
  • outbound queue depth
  • message push latency
  • slow-client disconnects
  • gateway CPU, memory, and file descriptor usage
  • failed delivery attempts caused by stale connection routing

Failure modes:

  • Gateway crashes and connected clients must reconnect.
  • Stale registry entries route work to the wrong gateway.
  • Slow clients cause outbound buffers to grow.
  • Reconnect storms overload authentication and sync.
  • Deployments drop many connections at once.
  • Gateway-local state is mistaken for durable truth.
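
One common way to contain the slow-client failure mode is a bounded outbound buffer per connection. The sketch below is an assumption about how that might look, not a prescribed design: the class OutboundBuffer, the max_pending limit, and the policy of closing the connection on overflow are all illustrative.

```python
from collections import deque
from typing import Any, Deque, List


class OutboundBuffer:
    """Hypothetical per-connection buffer that refuses to grow without bound."""

    def __init__(self, max_pending: int) -> None:
        self._max_pending = max_pending
        self._pending: Deque[Any] = deque()
        self.closed = False

    def enqueue(self, event: Any) -> bool:
        """Queue an event for the socket writer. Returns False if not queued."""
        if self.closed:
            return False
        if len(self._pending) >= self._max_pending:
            # The client is not draining fast enough. Drop the connection and
            # free its memory; the client recovers via reconnect and sync.
            self.closed = True
            self._pending.clear()
            return False
        self._pending.append(event)
        return True

    def take(self, limit: int) -> List[Any]:
        """Remove and return up to `limit` events for the writer to send."""
        batch: List[Any] = []
        while self._pending and len(batch) < limit:
            batch.append(self._pending.popleft())
        return batch
```

Disconnecting is only safe here because the gateway is not the durable source of truth: anything dropped from the buffer is still recoverable from the backend.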

Reconnect storms are especially important. Mobile operating systems, regional network issues, app releases, or gateway deployments can cause huge numbers of clients to reconnect together. If every reconnect immediately performs expensive sync, the recovery path can become the incident.
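
A widely used client-side mitigation is exponential backoff with random jitter, so that clients disconnected at the same moment do not all return at the same moment. The following is a minimal sketch; the function name reconnect_delay and the default base and cap values are assumptions chosen for illustration.

```python
import random
from typing import Callable


def reconnect_delay(
    attempt: int,
    base_seconds: float = 1.0,
    cap_seconds: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return how long to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with each attempt up to cap_seconds, and the actual
    delay is drawn uniformly between 0 and that ceiling ("full jitter").
    """
    exponent = min(attempt, 30)  # keep 2 ** exponent small for very long outages
    ceiling = min(cap_seconds, base_seconds * (2 ** exponent))
    return rng() * ceiling
```

Jitter spreads reconnects over time but does not reduce the total work, so it is usually paired with server-side limits on how much sync each reconnect may trigger.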
