Large Group Fan-Out Isolation

Keep massive group delivery workloads from starving normal messaging by separating queues, worker pools, limits, and backpressure policies.

Level: advanced | 4 min read
Topics: Capacity, Reliability, Operations, Tradeoffs, Group Message Fan-Out, Backpressure, Bulkheads, Hot Key Mitigation, Worker Isolation

Concepts Covered

  • Large group fan-out
  • Worker pool isolation
  • Queue priorities
  • Per-conversation limits
  • Backpressure
  • Hot group mitigation
  • Bulkhead boundaries
  • Delivery lag by workload class

1. Intent

Large Group Fan-Out Isolation prevents huge group conversations from consuming all delivery capacity.

In chat systems, one group message can expand into thousands or millions of delivery tasks. If that work shares the same queues and workers as normal one-to-one messages, one active group can delay the whole product.

This pattern treats large group delivery as a special workload with separate limits and operational controls.

2. The Problem Without This Pattern

Imagine a group with 200,000 members. One message might create:

  • 200,000 recipient delivery tasks
  • additional device-level tasks when recipients have more than one device
  • push notification tasks
  • unread projection updates
  • receipt events

If the group becomes active, workers may spend all capacity expanding and delivering group messages.

A normal one-to-one message between two users should not wait behind a massive public group backlog.

Without isolation, fan-out becomes a platform-wide reliability risk.

3. How The Pattern Works

The system classifies delivery work into separate queues by workload:

one_to_one_delivery_queue
small_group_delivery_queue
large_group_delivery_queue
push_queue
receipt_queue
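
As a rough illustration of the classification step, the sketch below routes a message to one of the delivery queues named above based on member count. The threshold value and the function name are assumptions made for this example; the pattern itself does not prescribe them, and a real system might classify on measured fan-out cost instead.

```python
# Hypothetical threshold; tune it from observed fan-out cost, not from this example.
SMALL_GROUP_MAX_MEMBERS = 500


def delivery_queue_for(member_count: int) -> str:
    """Pick a delivery queue name from the size of the conversation."""
    if member_count < 1:
        raise ValueError("member_count must be at least 1")
    if member_count <= 2:
        return "one_to_one_delivery_queue"
    if member_count <= SMALL_GROUP_MAX_MEMBERS:
        return "small_group_delivery_queue"
    return "large_group_delivery_queue"
```

Keeping the decision in one small function means the threshold can change without touching the producers that enqueue delivery work.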

Large groups can have:

  • dedicated worker pools
  • per-group delivery rate limits
  • batched recipient expansion
  • lower priority for non-critical projections
  • lazy inbox updates
  • separate dashboards and alerts
  • separate retry budgets
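
A minimal sketch of dedicated worker pools, assuming an in-process design built on Python's standard `concurrent.futures.ThreadPoolExecutor`. The `DeliveryPools` class and the pool sizes are hypothetical; a production system would more likely run separate consumer deployments per queue, but the bulkhead idea is the same: each workload class has its own bounded set of workers.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict

# Hypothetical pool sizes, one bounded pool per workload class.
POOL_SIZES: Dict[str, int] = {
    "one_to_one_delivery_queue": 8,
    "small_group_delivery_queue": 4,
    "large_group_delivery_queue": 2,
}


class DeliveryPools:
    """Bulkhead: work submitted to one queue can only occupy that queue's workers."""

    def __init__(self, pool_sizes: Dict[str, int]) -> None:
        self._pools = {
            name: ThreadPoolExecutor(max_workers=size, thread_name_prefix=name)
            for name, size in pool_sizes.items()
        }

    def submit(self, queue_name: str, task: Callable[..., Any], *args: Any) -> Future:
        # A KeyError here means the caller used an unknown workload class.
        return self._pools[queue_name].submit(task, *args)

    def shutdown(self) -> None:
        for pool in self._pools.values():
            pool.shutdown(wait=True)
```

With this layout, a backlog in `large_group_delivery_queue` waits behind its own two workers and cannot occupy the workers that serve one-to-one delivery.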

The goal is not to make large group delivery instant at any cost. The goal is to keep the rest of the platform healthy while large group work progresses predictably.
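
Two of the controls listed above, batched recipient expansion and per-group delivery rate limits, can be sketched together. This is an illustration under stated assumptions: members are addressed by index, ranges are half-open, and the limiter is a plain token bucket with an injected clock. The names `recipient_ranges` and `GroupRateLimiter` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


def recipient_ranges(member_count: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, end) index ranges that cover every member once."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, member_count, batch_size):
        yield (start, min(start + batch_size, member_count))


@dataclass
class GroupRateLimiter:
    """Token bucket for one group: refills at rate_per_sec, holds at most burst tokens."""

    rate_per_sec: float
    burst: float
    tokens: float
    last_refill: float

    def try_acquire(self, cost: float, now: float) -> bool:
        """Spend `cost` tokens if available at time `now`; otherwise leave them untouched."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate_per_sec)
        self.last_refill = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

A worker would call `try_acquire(end - start, now)` before delivering a range and requeue the range when it returns `False`. The batch size must not exceed `burst`, or a batch can never be admitted, which is one way per-group limits end up too strict for delivery to catch up.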

4. When To Use It

Use this pattern when:

  • group size can become very large
  • one message creates many delivery tasks
  • normal messages must remain low latency
  • group fan-out lag is acceptable within limits
  • worker pools can be separated by workload
  • hot conversations are possible
  • push providers or projection stores can be overwhelmed by group traffic

It applies to chat groups, broadcast channels, large notification audiences, social fan-out, and activity feed generation.

5. When Not To Use It

It may be premature when:

  • groups are small
  • fan-out volume is low
  • the product has no large audience messaging
  • one worker pool has plenty of headroom
  • operational complexity is a bigger risk than fan-out load

Start simple, but design the boundaries so large groups can be isolated later.

6. Data And Operational Model

Useful data:

conversation_profile
- conversation_id
- member_count
- fanout_class
- delivery_priority

fanout_task
- message_id
- conversation_id
- recipient_range
- attempt_count
- status
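
The two records above could be expressed as typed structures along the following lines. The field names come from the listing; the field types, the enum values, and the defaults are assumptions for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class FanoutClass(Enum):
    # Assumed classes, mirroring the delivery queues described earlier.
    ONE_TO_ONE = "one_to_one"
    SMALL_GROUP = "small_group"
    LARGE_GROUP = "large_group"


class TaskStatus(Enum):
    # Assumed lifecycle states; the source only names a `status` field.
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    FAILED = "failed"


@dataclass
class ConversationProfile:
    conversation_id: str
    member_count: int
    fanout_class: FanoutClass
    delivery_priority: int


@dataclass
class FanoutTask:
    message_id: str
    conversation_id: str
    recipient_range: Tuple[int, int]  # assumed half-open [start, end) member indexes
    attempt_count: int = 0
    status: TaskStatus = TaskStatus.PENDING
```

Storing a `recipient_range` rather than a full recipient list keeps each task small, so one message to a very large group becomes many compact tasks instead of one oversized payload.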

Operators should watch:

  • fan-out lag by conversation class
  • age of oldest large-group task
  • one-to-one delivery latency
  • worker utilization by pool
  • retry rate
  • per-group queue depth
  • projection lag caused by group traffic
  • push provider throttling by group workload

Large group fan-out needs its own SLO. It may be acceptable for a public channel to take longer to fan out than a one-to-one chat, but that delay should be deliberate and visible.
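
To make the per-class view concrete, here is a small sketch that computes the age of the oldest pending task for each workload class. It assumes pending tasks are available as `(fanout_class, enqueued_at)` pairs with timestamps in seconds; both the input shape and the function name are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def oldest_task_age_by_class(
    pending: Iterable[Tuple[str, float]], now: float
) -> Dict[str, float]:
    """Return, per fan-out class, the age in seconds of its oldest pending task."""
    oldest: Dict[str, float] = {}
    for fanout_class, enqueued_at in pending:
        age = max(0.0, now - enqueued_at)  # clamp clock skew to zero
        oldest[fanout_class] = max(oldest.get(fanout_class, 0.0), age)
    return oldest
```

Reporting this value per class, rather than one global number, lets an alert fire on a stalled large-group backlog while one-to-one delivery still looks healthy.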

7. Failure Modes

  • Large groups share workers with one-to-one delivery and starve it.
  • Recipient expansion creates a hot key.
  • Fan-out tasks retry without deduplication.
  • Per-group limits are too strict and delivery never catches up.
  • Operators monitor global queue depth but miss one massive group backlog.
  • Large-group push tasks exceed provider limits.
  • Lazy fan-out leaves unread projections temporarily out of date, which confuses users.
  • Membership snapshots are stale or wrong, so messages are delivered to users who should not receive them.
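
One way to guard against retries without deduplication is to record a key per (message, recipient) pair and skip pairs that were already delivered. The sketch below keeps that record in an in-memory set purely for illustration; a real system would need a durable, shared store. The `deliver_range` function and its `send` callback are hypothetical.

```python
from typing import Callable, Iterable, Set, Tuple

DeliveryKey = Tuple[str, str]  # (message_id, recipient_id)


def deliver_range(
    message_id: str,
    recipient_ids: Iterable[str],
    delivered: Set[DeliveryKey],
    send: Callable[[str, str], None],
) -> int:
    """Deliver to each recipient at most once; return how many sends happened."""
    sent = 0
    for recipient_id in recipient_ids:
        key = (message_id, recipient_id)
        if key in delivered:
            continue  # already delivered by an earlier attempt
        send(message_id, recipient_id)  # may raise; the key is then not recorded
        delivered.add(key)
        sent += 1
    return sent
```

If `send` fails partway through a range, retrying the same task only sends to the recipients that were not yet recorded. A crash between `send` and recording the key can still produce a duplicate, so recipients should also tolerate receiving the same `message_id` twice.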

8. Tradeoffs

Benefit | Cost
Protects normal messaging latency | Adds queues and worker pools
Makes large group lag visible | More operational tuning
Reduces blast radius of hot groups | Delivery may be less immediate
Enables workload-specific limits | Requires classification logic
Supports backpressure by workload | More complex observability

Large group isolation is a product reliability decision: normal conversations should stay healthy even when one group creates enormous work.
