Large Group Fan-Out Isolation
Keep massive group delivery workloads from starving normal messaging by separating queues, worker pools, limits, and backpressure policies.
Concepts Covered
- Large group fan-out
- Worker pool isolation
- Queue priorities
- Per-conversation limits
- Backpressure
- Hot group mitigation
- Bulkhead boundaries
- Delivery lag by workload class
1. Intent
Large Group Fan-Out Isolation prevents huge group conversations from consuming all delivery capacity.
In chat systems, one group message can expand into thousands or millions of delivery tasks. If that work shares the same queues and workers as normal one-to-one messages, one active group can delay the whole product.
This pattern treats large group delivery as a special workload with separate limits and operational controls.
2. The Problem Without This Pattern
Imagine a group with 200,000 members. One message might create:
- 200,000 recipient delivery tasks
- additional device-level tasks for recipients with more than one device
- push notification tasks
- unread projection updates
- receipt events
If the group becomes active, workers may spend all of their capacity expanding and delivering group messages.
A normal one-to-one message between two users should not wait behind a massive public group backlog.
Without isolation, fan-out becomes a platform-wide reliability risk.
3. How The Pattern Works
The system classifies delivery work by workload and routes each class to its own queue:
one_to_one_delivery_queue
small_group_delivery_queue
large_group_delivery_queue
push_queue
receipt_queue
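The routing step for recipient delivery can be sketched as a small classifier. This is a minimal illustration, not a prescribed implementation: the function name `classify_delivery_queue` and the threshold `SMALL_GROUP_MAX_MEMBERS` are assumptions, and a real cutoff would come from measured capacity rather than a fixed constant.

```python
# Hypothetical cutoff between "small" and "large" groups.
# The right value depends on measured worker and store capacity.
SMALL_GROUP_MAX_MEMBERS = 500


def classify_delivery_queue(member_count: int) -> str:
    """Pick the delivery queue for a message based on conversation size."""
    if member_count <= 2:
        return "one_to_one_delivery_queue"
    if member_count <= SMALL_GROUP_MAX_MEMBERS:
        return "small_group_delivery_queue"
    return "large_group_delivery_queue"


print(classify_delivery_queue(2))
print(classify_delivery_queue(200_000))
```

Push and receipt work would be routed to `push_queue` and `receipt_queue` by task type rather than by group size, so they are left out of this sketch.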
Large groups can have:
- dedicated worker pools
- per-group delivery rate limits
- batched recipient expansion
- lower priority for non-critical projections
- lazy inbox updates
- separate dashboards and alerts
- separate retry budgets
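One common way to express a per-group delivery rate limit is a token bucket keyed by conversation. The sketch below assumes that approach; the class names, the rate and burst parameters, and the injectable `clock` are illustrative choices, and the state is in-memory only, whereas a production limiter would typically need shared state across workers.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at rate_per_sec up to burst; each delivery spends tokens."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock
        self.tokens = burst
        self.updated_at = clock()

    def try_acquire(self, cost: float = 1.0) -> bool:
        now = self.clock()
        elapsed = now - self.updated_at
        self.updated_at = now
        # Refill for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerGroupLimiter:
    """Keeps one bucket per conversation so one hot group cannot spend
    another group's delivery budget."""

    def __init__(self, rate_per_sec: float, burst: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate_per_sec = rate_per_sec
        self.burst = burst
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, conversation_id: str, deliveries: int = 1) -> bool:
        bucket = self._buckets.get(conversation_id)
        if bucket is None:
            bucket = TokenBucket(self.rate_per_sec, self.burst, self.clock)
            self._buckets[conversation_id] = bucket
        return bucket.try_acquire(deliveries)
```

A worker would call `allow(conversation_id, batch_size)` before sending a batch and requeue the batch with a delay when it returns `False`, rather than dropping it.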
The goal is not to make large group delivery instant at any cost. The goal is to keep the rest of the platform healthy while large group work progresses predictably.
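The bulkhead idea behind this goal can be illustrated with bounded queues, one per workload class. In this sketch, which assumes in-process `queue.Queue` objects purely for illustration, a full large-group queue rejects new work (backpressure) while the one-to-one queue keeps accepting. The class name `DeliveryBulkheads` and the capacities are hypothetical.

```python
import queue
from typing import Dict


class DeliveryBulkheads:
    """One bounded queue per workload class. A full queue pushes back on
    its own producers without affecting the other classes."""

    def __init__(self, capacities: Dict[str, int]):
        self._queues = {
            name: queue.Queue(maxsize=capacity)
            for name, capacity in capacities.items()
        }

    def try_enqueue(self, queue_name: str, task: object) -> bool:
        """Return False instead of blocking when the queue is full, so the
        caller can defer, slow down, or shed the work."""
        try:
            self._queues[queue_name].put_nowait(task)
            return True
        except queue.Full:
            return False

    def depth(self, queue_name: str) -> int:
        return self._queues[queue_name].qsize()


bulkheads = DeliveryBulkheads({
    "one_to_one_delivery_queue": 100,
    "large_group_delivery_queue": 2,  # tiny capacity to show the effect
})
for batch in range(3):
    accepted = bulkheads.try_enqueue("large_group_delivery_queue", batch)
    print("large group batch", batch, "accepted:", accepted)
print("one-to-one accepted:",
      bulkheads.try_enqueue("one_to_one_delivery_queue", "dm-1"))
```

In a real deployment each queue would also be drained by its own worker pool, so that slow large-group tasks cannot occupy workers needed for one-to-one delivery.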
4. When To Use It
Use this pattern when:
- group size can become very large
- one message creates many delivery tasks
- normal messages must remain low latency
- group fan-out lag is acceptable within limits
- worker pools can be separated by workload
- hot conversations are possible
- push providers or projection stores can be overwhelmed by group traffic
It applies to chat groups, broadcast channels, large notification audiences, social fan-out, and activity feed generation.
5. When Not To Use It
It may be premature when:
- groups are small
- fan-out volume is low
- the product has no large audience messaging
- one worker pool has plenty of headroom
- operational complexity is a bigger risk than fan-out load
Start simple, but design the boundaries so large groups can be isolated later.
6. Data And Operational Model
Useful data:
conversation_profile
- conversation_id
- member_count
- fanout_class
- delivery_priority
fanout_task
- message_id
- conversation_id
- recipient_range
- attempt_count
- status
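The two records above, together with batched recipient expansion, might look like the following sketch. The field names come from the lists above; the field types, the half-open `(start, end)` interpretation of `recipient_range`, the default `status` value, and the `plan_fanout` helper are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple


@dataclass
class ConversationProfile:
    conversation_id: str
    member_count: int
    fanout_class: str
    delivery_priority: int


@dataclass
class FanoutTask:
    message_id: str
    conversation_id: str
    # Assumed to be a half-open [start, end) offset range into an
    # ordered member list.
    recipient_range: Tuple[int, int]
    attempt_count: int = 0
    status: str = "pending"


def plan_fanout(message_id: str, profile: ConversationProfile,
                batch_size: int = 1000) -> Iterator[FanoutTask]:
    """Split one message into range-based tasks instead of creating one
    task per recipient up front."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, profile.member_count, batch_size):
        end = min(start + batch_size, profile.member_count)
        yield FanoutTask(message_id, profile.conversation_id, (start, end))


profile = ConversationProfile("conv-1", 200_000, "large_group", 1)
tasks = list(plan_fanout("msg-1", profile))
print(len(tasks), tasks[0].recipient_range, tasks[-1].recipient_range)
```

Range-based tasks keep the queue small and let each retry cover only the batch that failed.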
Operators should watch:
- fan-out lag by conversation class
- age of oldest large-group task
- one-to-one delivery latency
- worker utilization by pool
- retry rate
- per-group queue depth
- projection lag caused by group traffic
- push provider throttling by group workload
Large group fan-out needs its own SLO. It may be acceptable for a public channel to take longer to fan out than a one-to-one chat, but that delay should be deliberate and visible.
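A per-class lag check could be sketched as follows. The SLO numbers are placeholders rather than recommendations, and the function names and the `(fanout_class, enqueued_at)` input shape are assumptions made for this example.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical targets, in seconds, for the age of the oldest pending task.
FANOUT_LAG_SLO_SECONDS = {
    "one_to_one": 1.0,
    "small_group": 5.0,
    "large_group": 60.0,
}


def oldest_task_age_by_class(pending: Iterable[Tuple[str, float]],
                             now: float) -> Dict[str, float]:
    """pending holds (fanout_class, enqueued_at) pairs; returns the age of
    the oldest pending task for each class."""
    oldest: Dict[str, float] = {}
    for fanout_class, enqueued_at in pending:
        age = now - enqueued_at
        oldest[fanout_class] = max(age, oldest.get(fanout_class, age))
    return oldest


def slo_breaches(ages: Dict[str, float],
                 slo: Dict[str, float] = FANOUT_LAG_SLO_SECONDS) -> List[str]:
    """Return the classes whose oldest task is older than its target."""
    return sorted(c for c, age in ages.items() if c in slo and age > slo[c])


pending = [("one_to_one", 99.5), ("large_group", 10.0), ("large_group", 70.0)]
ages = oldest_task_age_by_class(pending, now=100.0)
print(ages)
print(slo_breaches(ages))
```

Tracking the age per class rather than globally is what keeps a single large-group backlog from hiding inside a healthy-looking aggregate.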
7. Failure Modes
- Large groups share workers with one-to-one delivery and starve it.
- Recipient expansion creates a hot key.
- Fan-out tasks retry without deduplication, so recipients receive duplicate deliveries.
- Per-group limits are too strict and delivery never catches up.
- Operators monitor global queue depth but miss one massive group backlog.
- Large-group push tasks exceed provider limits.
- Lazy fan-out leaves unread projections stale or inconsistent, which confuses users.
- Membership snapshots are stale or wrong, so messages are delivered to users who should not receive them.
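The retry-without-deduplication failure can be guarded against with an idempotency key per delivery. This sketch assumes the key is the `(message_id, recipient_id)` pair and keeps it in an in-memory set; a real system would need a durable, shared store. `DedupingDeliverer` and the `send` callback are hypothetical names.

```python
from typing import Callable, Set, Tuple


class DedupingDeliverer:
    """Skips deliveries already completed, so a retried fan-out batch does
    not send the same message to the same recipient twice."""

    def __init__(self, send: Callable[[str, str], None]):
        self._send = send
        self._delivered: Set[Tuple[str, str]] = set()

    def deliver(self, message_id: str, recipient_id: str) -> bool:
        """Return True if a send happened, False if it was a duplicate."""
        key = (message_id, recipient_id)
        if key in self._delivered:
            return False
        self._send(message_id, recipient_id)
        # Recorded only after send returns, so a failed send can be retried.
        self._delivered.add(key)
        return True


sent = []
deliverer = DedupingDeliverer(lambda m, r: sent.append((m, r)))
deliverer.deliver("m1", "u1")
deliverer.deliver("m1", "u1")  # retry of the same task: skipped
print(sent)
```

Because the key is recorded after the send, a crash between the two steps can still produce one duplicate; this gives at-least-once delivery with best-effort deduplication, not exactly-once.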
8. Tradeoffs
| Benefit | Cost |
|---|---|
| Protects normal messaging latency | Adds queues and worker pools |
| Makes large group lag visible | More operational tuning |
| Reduces blast radius of hot groups | Delivery may be less immediate |
| Enables workload-specific limits | Requires classification logic |
| Supports backpressure by workload | More complex observability |
Large group isolation is a product reliability decision: normal conversations should stay healthy even when one group creates enormous work.