System Design: Notification Platform, 100M Notifications/Second
Goal: Build a notification platform that pushes 100 million notifications per second across web (WebSocket/SSE), Android (FCM), and iOS (APNs). Support scheduling, broadcast, per-user preferences, and survive failures with at-least-once delivery.
Here's how to architect it, from ingestion to delivery, with scale math and failure scenarios.
1. Problem Statement
Design a notification system that can send 100 million notifications per second across web (real-time push via SSE/WebSocket), Android (FCM), and iOS (APNs), supports scheduled notifications, and handles failures gracefully.
A few clarifications before diving in:
Scale clarification: 100M/sec is massive, on par with the biggest platforms in production today. I'll design for this as a sustained peak, not a burst. Every layer must be horizontally partitioned.
Assumptions I'm making:
- This is a multi-tenant platform (like OneSignal, Pusher, or Firebase itself, serving many apps).
- "Send" means accepted + dispatched. Actual delivery to the user's eyeball depends on device state, OS, and network.
- At-least-once delivery semantics. Exactly-once is impractical at this scale, so deduplicate where possible.
- Users can be on multiple devices simultaneously (phone + laptop + tablet).
- The system is a backend platform. It doesn't build client apps, it provides SDKs and APIs.
What NOT to do:
- Poll clients asking "any new notifications?" (this is push, not pull)
- Store notifications in a single PostgreSQL table and SELECT by user_id (dies at 1M/sec)
- Use Redis as the sole data store (volatile memory, no durability)
- Send FCM/APNs synchronously in the request path (one slow Apple response blocks everything)
- Broadcast every notification to every WebSocket pod (20B internal messages/sec of waste)
- Build one monolith that does ingestion + routing + delivery + scheduling
The trickiest problem in this design? A notification can arrive at any server, but the user's WebSocket lives on one specific pod out of hundreds. Section 9.3 covers how to solve this.
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Send real-time push notifications to web browsers via persistent connection (WebSocket/SSE) | P0 |
| FR-02 | Send push notifications to Android devices via FCM | P0 |
| FR-03 | Send push notifications to iOS devices via APNs | P0 |
| FR-04 | Support scheduled notifications (deliver at a future timestamp, recurring via cron) | P0 |
| FR-05 | Support broadcast notifications (send to all users, segments, topics) | P0 |
| FR-06 | Support targeted notifications (single user, device, or channel) | P0 |
| FR-07 | Track delivery status: sent, delivered, read, failed, retried | P1 |
| FR-08 | Support notification templates with variable interpolation | P1 |
| FR-09 | User-level preference management (opt-in/out per channel, quiet hours) | P1 |
| FR-10 | Multi-tenant isolation | P0 |
| FR-11 | Idempotent delivery (deduplication) | P0 |
| FR-12 | Notification cancellation before delivery | P1 |
| FR-13 | Priority levels (critical, high, normal, low) | P1 |
| FR-14 | Rate limiting per tenant, per user, per channel | P0 |
| FR-15 | Dead letter queue for permanently failed notifications | P1 |
3. Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | Throughput | 100M notifications/sec sustained |
| NFR-02 | Latency (p50 / p99) | < 100ms / < 500ms for real-time channels |
| NFR-03 | Availability | 99.99% (52 min downtime/year) |
| NFR-04 | Durability | No notification loss once accepted (at-least-once) |
| NFR-05 | Horizontal scalability | Linear scale-out, no single bottleneck |
| NFR-06 | Data retention | Hot: 7 days, Warm: 90 days, Cold: 1 year |
| NFR-07 | Concurrent WebSocket/SSE connections | 100M peak (designed for 5x growth to 500M) |
| NFR-08 | Scheduled notification precision | ≤ 1 second drift |
| NFR-09 | Recovery Time Objective (RTO) | < 30 seconds |
| NFR-10 | Recovery Point Objective (RPO) | 0 (no data loss) |
| NFR-11 | Geographic distribution | Multi-region active-active |
| NFR-12 | Tenant isolation | Noisy neighbor protection at every layer |
4. High-Level Approach & Technology Selection
4.1 What Kind of System Is This?
100M notifications per second across three channels: WebSocket for real-time web push, FCM for Android, and APNs for iOS. This is a push pipeline, not a request/response system. A client sends "notify user X" and walks away -- the platform takes responsibility for delivery across every channel, every device, potentially minutes or hours later for scheduled notifications.
At this scale, nothing can be synchronous. The ingestion path must accept and persist the notification immediately (202 Accepted), and the delivery path runs asynchronously -- routing to the right devices, checking preferences, dispatching through channel-specific protocols, tracking status, and retrying failures.
4.2 What Components Are Needed?
The system does four things: accepts notification requests, routes them to users and devices, dispatches through channel-specific protocols, and tracks delivery status. Each step has different scaling requirements:
- Durable message bus -- Decouple ingestion from delivery. If a delivery channel falls behind (FCM rate-limited, APNs down), events must buffer and replay. The bus also enables independent scaling per channel via separate topics.
- Per-channel delivery dispatchers -- Each target (WebSocket, FCM, APNs) speaks a different protocol with different constraints. FCM limits ~1,000 msg/sec per HTTP/2 connection. APNs allows ~4,000 with multiplexing. WebSocket requires finding the exact gateway pod holding the user's connection.
- Connection registry -- With 100M concurrent WebSocket connections spread across 200 gateway pods, the system needs a fast lookup: "which pod holds user X's connection?" This must handle 100K+ connects/disconnects per second (100M connections with even 0.1% churn per second). (Deep dive: Section 9.3)
- High-throughput hot store -- Device tokens, user preferences, notification status -- all at 100M+ writes/sec. The store must handle wide-column access patterns (user → all their devices) with tunable consistency.
- Scheduler -- Future and recurring notifications need a time-indexed store that fires at the right second, backed by a durable store for crash recovery.
4.3 Store Selection
| Store | Technology | Rationale |
|---|---|---|
| Device tokens, preferences, status | ScyllaDB | 100M+ writes/sec, wide-column model for user→device mappings, tunable consistency, linear scale-out, TTL support |
| Connection registry, rate limits, dedup | Redis Cluster | Sub-ms lookups, Pub/Sub for cross-server message relay, ephemeral data |
| Scheduled notifications index | Redis sorted sets | O(log N) range queries by fire time |
| Scheduled notifications store | ScyllaDB | Durability (source of truth for scheduled payloads) |
| Analytics | ClickHouse | Columnar, trillion-row aggregations, SQL |
| Cold archive | S3 + Parquet | Cost-effective, queryable via Athena/Spark |
| Templates | PostgreSQL | Low volume, transactional, relational |
4.4 Why ScyllaDB Over Alternatives
At 100M+ writes/sec with wide-column access patterns (user → devices, user → notification status), the system needs a database built for this workload. Here's how the candidates stack up:
| Dimension | ScyllaDB | Cassandra | DynamoDB | MongoDB |
|---|---|---|---|---|
| Throughput per node | ~1M ops/sec (shard-per-core, C++) | ~100K ops/sec (JVM, shared threads) | Provisioned or on-demand (pay per RCU/WCU) | ~50K ops/sec (WiredTiger) |
| p99 latency under load | < 5ms (no GC pauses) | 10-50ms (GC spikes under pressure) | < 10ms (single-digit) | 10-100ms (variable) |
| Data model fit | Wide-column, perfect for user→device mappings | Same CQL model | Key-value / document, no native wide-column clustering | Document, schema-flexible but poor for wide-column |
| TTL support | Native per-row TTL, zero overhead | Native TTL but tombstone storms cause latency spikes | Native TTL on items | TTL index, separate background deletion |
| Multi-DC replication | Built-in, tunable per-keyspace | Built-in, same model | Global Tables (managed) | Atlas Global Clusters (managed) |
| Cost at scale (PB-range) | ~10x fewer nodes than Cassandra | ~2,000 nodes (JVM overhead) | $$$$ at this write volume | Not viable at this scale |
| Operational complexity | Medium (fewer nodes, less tuning) | High (JVM tuning, compaction tuning, GC tuning) | Low (fully managed) | High at this scale |
Why ScyllaDB wins here:
ScyllaDB's shard-per-core architecture assigns each CPU core its own partition range, memory, and I/O queue. No locks, no GC, no thread contention. That's how a single ScyllaDB node matches 10 Cassandra nodes in throughput. Since it speaks CQL (Cassandra Query Language), the data model, queries, and driver ecosystem are identical. Same proven data model, without Cassandra's operational pain.
DynamoDB would work functionally, but at 100M writes/sec the cost is prohibitive. At $1.25 per million WCUs, that's $125/sec or ~$10.8M/day. ScyllaDB on bare metal costs a fraction of that.
MongoDB can't sustain this write volume with wide-column access patterns. WiredTiger's document-level locking becomes a bottleneck well before 1M ops/sec per node.
5. High-Level Architecture
Here's the architecture, layer by layer.
Bird's-Eye View
Layer Responsibilities
Layer 1, Ingestion: Stateless. Accepts requests, validates, deduplicates, writes to Kafka. Returns 202 only after Kafka confirms persistence. Every pod is identical, so the LB can route to any.
Layer 2, Kafka: The central nervous system. Decouples everything. Provides durability (RF=3), ordering within partitions, replay capability. Separate topics per delivery channel enable independent scaling.
Layer 3A, Routing: The brain. Resolves "send to user X" into "send to device D1 via FCM, device D2 via APNs, and WebSocket connection C1 on server WS-042." Checks preferences, quiet hours, renders templates, fans out to channel topics.
Layer 3B, Scheduler: Leader-elected service. Fires notifications at their scheduled time. Uses Redis sorted sets backed by ScyllaDB for durability.
Layer 4, Delivery: Channel-specific. Each dispatcher speaks the native protocol of its delivery target. WS Gateway holds persistent connections. FCM/APNs dispatchers maintain HTTP/2 connection pools. The hardest part is finding which of 200 pods holds a user's WebSocket connection -- see Section 9.3 for the full deep dive.
Layer 5, Data: ScyllaDB for high-throughput writes. Redis for ephemeral state. ClickHouse for analytics. S3 for cold storage.
6. Back-of-the-Envelope Estimation
Here's the math that shapes the architecture.
6.1 Throughput Breakdown
Target: 100,000,000 notifications/sec
Channel split (assumed):
Web push (real-time): 30% → 30M/sec
Android (FCM): 40% → 40M/sec
iOS (APNs): 30% → 30M/sec
6.2 Connection Math
Registered users: 500M
Peak concurrency (20%): 100M simultaneous WebSocket/SSE connections
Memory per connection: ~20 KB (buffers, session state, TLS)
Total connection memory: 100M × 20 KB = 2 TB
Connections per server: 500K (aggressive but achievable with epoll/io_uring,
ulimit -n 1M+, net.core.somaxconn tuned)
WS Gateway servers needed: 100M / 500K = 200 servers
6.3 Bandwidth
Average notification payload: 1 KB (JSON: title, body, metadata, routing)
Ingestion: 100M/sec × 1 KB = 100 GB/sec inbound
Delivery: ~120M/sec × 1 KB = 120 GB/sec outbound (1.2x fan-out for multi-device)
Aggregate: ~1 Tbps across the cluster
6.4 Storage
Daily volume (at 20% average of peak): 1.73 trillion notifications/day
Hot metadata (200 bytes/notification, 7 days): ~2.4 PB
Warm (90 days, compressed columnar): ~6 PB
Cold (1 year, Parquet on S3): ~25 PB
Practical: Store only metadata + status in hot tier.
Purge full payloads 24h after delivery.
6.5 Kafka Sizing
Ingestion rate: 100 GB/sec
Per-broker write rate: ~200 MB/sec (NVMe + 10GbE)
Brokers for ingestion: 100,000 MB/sec / 200 MB/sec = 500 brokers
Replication factor 3: ~1,500 broker-equivalents of storage
Partitions: 100,000+ (for consumer parallelism)
→ Multi-cluster Kafka deployment, partitioned by tenant hash
6.6 FCM/APNs Dispatcher Sizing
FCM:
Practical limit: ~1,000 msg/sec per HTTP/2 connection
40M/sec → 40,000 concurrent connections to Google
At 100 connections/pod → 400 FCM dispatcher pods
APNs:
Practical limit: ~4,000 msg/sec per HTTP/2 connection (multiplexed)
30M/sec → 7,500 concurrent connections to Apple
At 100 connections/pod → 75 APNs dispatcher pods
6.7 Routing & API Gateway Sizing
API Gateway: stateless request handlers.
100M notifications/sec
Each pod handles ~50K requests/sec (gRPC, validation, Kafka produce)
100M / 50K = 2,000 pods minimum
Routing workers: resolve users → devices, check preferences, fan out.
100M notifications/sec, each requires 2-3 ScyllaDB reads + preference check
Each pod handles ~50K routing decisions/sec
100M / 50K = 2,000 pods minimum
Both are stateless and horizontally scalable. HPA scales them based on CPU and Kafka consumer lag.
6.8 Summary
| Resource | Estimate |
|---|---|
| WS Gateway servers | 200 (500K connections each) |
| Kafka brokers | 500+ (multi-cluster) |
| FCM dispatchers | 400 pods |
| APNs dispatchers | 75 pods |
| API Gateway pods | 2,000+ |
| Routing workers | 2,000+ pods |
| Hot storage (ScyllaDB) | 2.4 PB |
| Connection RAM | 2 TB |
| Network | ~1 Tbps |
Bottom line: every layer scales independently. Shared-nothing architecture. Partition everything.
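The arithmetic in this section can be sanity-checked with a few lines of Python. Every input below is an assumption stated above (rate, payload size, connections per server, per-broker throughput), not a measured number:

```python
# Back-of-the-envelope check for the Section 6 estimates.
RATE = 100_000_000            # notifications/sec (NFR-01)
CONN = 100_000_000            # peak concurrent WebSocket/SSE connections
CONN_PER_SERVER = 500_000     # aggressive epoll/io_uring target
PAYLOAD_B = 1_024             # ~1 KB average payload
BROKER_MB_S = 200             # per-broker sustained write rate (NVMe + 10GbE)

ws_servers = CONN // CONN_PER_SERVER                        # 200 gateway servers
conn_ram_tb = CONN * 20 * 1024 / 1024**4                    # 20 KB each -> ~1.9 TiB (the "2 TB" figure)
ingest_gb_s = RATE * PAYLOAD_B / 1024**3                    # ~95 GiB/s inbound
kafka_brokers = RATE * PAYLOAD_B / 1024**2 / BROKER_MB_S    # ~490 (rounded to 500 in the text)
fcm_pods = (40_000_000 // 1_000) // 100                     # 40M/s at 1K/conn, 100 conns/pod
apns_pods = (30_000_000 // 4_000) // 100                    # 30M/s at 4K/conn, 100 conns/pod

print(ws_servers, fcm_pods, apns_pods)
```

The small gaps against the text (488 vs 500 brokers, 1.9 vs 2 TB) come from GiB-vs-GB rounding; the text rounds up, which is the right direction for capacity planning.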
7. Data Model
7.1 ScyllaDB Schema
Device Tokens, partitioned by user_id (1-10 devices per user = small partitions):
Notification Status, partitioned by user_id, clustered newest-first, 7-day TTL:
Scheduled Notifications, partitioned by minute-level time buckets:
User Preferences:
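A sketch of how these four tables might look in CQL. Column names and types here are illustrative assumptions consistent with the access patterns described above, not a production schema:

```sql
-- Device tokens: partitioned by user_id, 1-10 rows per partition
CREATE TABLE device_tokens (
    user_id       text,
    device_id     text,
    channel       text,          -- WEB_PUSH | FCM | APNS
    token         text,
    active        boolean,
    registered_at timestamp,
    PRIMARY KEY ((user_id), device_id)
);

-- Notification status: newest-first, expired by TTL after 7 days
CREATE TABLE notification_status (
    user_id         text,
    notification_id timeuuid,
    channel         text,
    status          text,        -- QUEUED | SENT | DELIVERED | READ | FAILED
    updated_at      timestamp,
    PRIMARY KEY ((user_id), notification_id, channel)
) WITH CLUSTERING ORDER BY (notification_id DESC)
  AND default_time_to_live = 604800;   -- 7 days

-- Scheduled notifications: partitioned by minute-level time bucket
CREATE TABLE scheduled_notifications (
    bucket          text,        -- e.g. '2026-02-05T10:00'
    fire_at         timestamp,
    notification_id uuid,
    payload         blob,
    status          text,        -- PENDING | FIRED | CANCELLED
    PRIMARY KEY ((bucket), fire_at, notification_id)
);

-- User preferences: one row per user per channel
CREATE TABLE user_preferences (
    user_id     text,
    channel     text,
    opted_in    boolean,
    quiet_start time,
    quiet_end   time,
    PRIMARY KEY ((user_id), channel)
);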
7.2 Redis Data Structures
# ── Connection Registry ──
# "User X is connected on device D, held by server S"
HSET conn:registry:{user_id} {device_id} "{server_id}:{conn_id}"
# "Server S holds these connections" (for drain/cleanup)
SADD conn:server:{server_id} "{user_id}:{device_id}"
# ── Tenant Rate Limiting (sliding window) ──
ZADD ratelimit:tenant:{tenant_id} {timestamp_ms} {notification_id}
ZREMRANGEBYSCORE ratelimit:tenant:{tenant_id} -inf {window_start_ms}
ZCARD ratelimit:tenant:{tenant_id}    # reject if count >= tenant quota
# ── Per-User Rate Limiting (sliding window) ──
ZADD ratelimit:user:{user_id} {timestamp_ms} {notification_id}
# ── Per-Channel Rate Limiting (hourly counter) ──
INCR ratelimit:user:{user_id}:{channel}
EXPIRE ratelimit:user:{user_id}:{channel} 3600
# ── Deduplication ──
BF.ADD dedup:{date} {idempotency_key}
# ── Scheduled Index ──
ZADD scheduled:fire_index:{minute_bucket} {fire_epoch} {notification_id}
# ── Topic Subscriptions ──
SADD channel:subscribers:{topic} {user_id_1} {user_id_2} ...
One cluster or many? One Redis Cluster with 50+ shards handles all workloads: connection registry, rate limits, dedup sets, scheduled notification indexes, and Pub/Sub channels. Redis Cluster hashes keys across shards automatically. Different key prefixes (conn:registry:*, ratelimit:*, scheduled:*) naturally spread across shards. Splitting into separate clusters adds operational overhead without meaningful isolation benefit. If a connection registry shard fails, it doesn't affect rate limit shards because they're already on different shards within the same cluster.
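The sliding-window pattern behind the `ratelimit:*` keys is worth spelling out. A minimal sketch, with the equivalent Redis command noted at each step (in-memory here for illustration; in production the trim + count + add sequence would run as a single Lua script so it stays atomic):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window rate limiter; each line maps to a Redis sorted-set command."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.events = deque()   # stands in for the sorted set, scored by timestamp

    def allow(self, now=None):
        now = time.time() if now is None else now
        # ZREMRANGEBYSCORE ratelimit:{key} -inf {now - window}
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()
        # ZCARD ratelimit:{key}
        if len(self.events) >= self.limit:
            return False
        # ZADD ratelimit:{key} {now} {notification_id}
        self.events.append(now)
        return True

limiter = SlidingWindowLimiter(limit=3, window_s=1.0)
results = [limiter.allow(now=t) for t in (0.0, 0.1, 0.2, 0.3, 1.15)]
# -> [True, True, True, False, True]: the fourth call exceeds the window,
#    the fifth passes after old entries age out
```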
7.3 ClickHouse (Analytics)
8. API Design
8.1 Send Notification
POST /v1/notifications
X-Tenant-Id: tenant_abc
X-Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Authorization: Bearer <jwt>
{
"notification_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "TARGETED", // TARGETED | BROADCAST | TOPIC
"priority": "HIGH", // CRITICAL | HIGH | NORMAL | LOW
"channels": ["WEB_PUSH", "FCM", "APNS"], // or ["ALL"]
"recipients": {
"user_ids": ["user_123", "user_456"],
"topic": null,
"segment_id": null
},
"content": {
"title": "Order Shipped",
"body": "Your order #{{order_id}} has been shipped.",
"image_url": "https://cdn.example.com/img/shipped.png",
"deep_link": "app://orders/{{order_id}}",
"template_id": "tmpl_order_shipped",
"template_vars": { "order_id": "ORD-98765" },
"data": { "order_id": "ORD-98765", "carrier": "FedEx" }
},
"schedule": {
"deliver_at": "2026-02-05T10:00:00Z",
"timezone": "America/New_York",
"recurrence": null // or cron: "0 9 * * 1"
},
"options": {
"ttl_seconds": 86400,
"collapse_key": "order_update",
"sound": "default",
"badge_count": 3,
"android": { "channel_id": "orders", "icon": "ic_shipping" },
"ios": {
"category": "ORDER_UPDATE",
"thread_id": "order-98765",
"interruption_level": "active"
},
"web": {
"require_interaction": false,
"actions": [
{"action": "track", "title": "Track Order"},
{"action": "dismiss", "title": "Dismiss"}
]
}
}
}
Response (202 Accepted):
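The response body might look like the following. The field names are illustrative assumptions, mirroring the request schema above:

```json
{
  "notification_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "ACCEPTED",
  "accepted_at": "2026-02-05T09:12:33Z",
  "channels": ["WEB_PUSH", "FCM", "APNS"],
  "status_url": "/v1/notifications/550e8400-e29b-41d4-a716-446655440000/status"
}
```

The key contract: 202 is returned only after Kafka confirms persistence, so `ACCEPTED` means "durably queued," not "delivered."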
8.2 Other Endpoints
GET /v1/notifications/{id}/status → delivery status per channel
DELETE /v1/notifications/{id} → cancel (if not yet delivered)
POST /v1/notifications/batch → up to 1000 per batch
POST /v1/devices → register device token
PUT /v1/users/{id}/preferences → channel opt-in/out, quiet hours
8.3 Internal gRPC (Service-to-Service)
8.4 WebSocket Wire Protocol
Client → Server: CONNECT {"user_id":"user_123","token":"jwt...","channels":["orders"]}
Server → Client: CONNECTED {"connection_id":"conn_abc","server":"ws-pod-042"}
Server → Client: NOTIFICATION {"id":"n_789","title":"...","body":"...","data":{...}}
Client → Server: ACK {"id":"n_789"}
Server → Client: PING
Client → Server: PONG
9. Deep Dives
9.1 Web Push: SSE vs WebSocket
This choice matters. Here's the comparison:
| Dimension | WebSocket | SSE |
|---|---|---|
| Direction | Full-duplex (bidirectional) | Server → Client only |
| Client ACK | Native (client sends frames) | Requires separate HTTP POST |
| Protocol | ws:// (upgrade from HTTP) | Standard HTTP |
| Auto-reconnect | Manual (must implement) | Built into EventSource API |
| HTTP/2 multiplexing | No (dedicated TCP per connection) | Yes |
| Binary support | Yes | Text only |
| Proxy traversal | Often blocked by corporate firewalls | Passes through any HTTP proxy |
| Memory per connection | ~20 KB | ~15 KB |
| Load balancer compat | Needs L4 or WS-aware L7 | Any HTTP LB works |
My Recommendation: WebSocket Primary, SSE Fallback
Why WebSocket as primary: At 100M/sec, inline client ACK is essential. With SSE, every delivery confirmation needs a separate HTTP round-trip. That's potentially 30M extra HTTP requests/sec just for acknowledgments. With WebSocket, the ACK is a tiny frame on the existing connection.
Why SSE as fallback: Corporate firewalls and restrictive proxies block WebSocket upgrades. SSE works over plain HTTP and degrades gracefully. The client SDK auto-detects: try WebSocket first → fall back to SSE if upgrade fails.
Connection Lifecycle (WebSocket)
9.2 Connection Management & Message Routing
The Connection Registry
This is the lookup table that answers: "User X has a WebSocket open. Which of the 200 servers is holding it?"
Targeted Notification Routing
Why not go through the load balancer? The external NLB is only for initial client connections. It routes to a random pod, which is useless when a specific one is needed. Once connected, the pod registers itself in Redis. From that point, the routing layer needs a way to reach that exact pod. Three options:
- Redis Pub/Sub (chosen approach): Each WebSocket pod subscribes to a channel named after itself (e.g., ws-pod-042:deliver). The routing worker publishes to that channel. Simple, with no direct network calls between services.
- Direct pod IP via Kubernetes headless service: A headless Service resolves to individual pod IPs instead of a single VIP. The routing worker looks up the pod IP from Redis and calls it directly over gRPC or HTTP. Faster (no pub/sub hop), but it couples the routing layer to pod networking and breaks if pods restart and get new IPs before Redis is updated.
- Dedicated internal load balancer with sticky routing: A separate internal LB with consistent hashing on user_id. Routes to the same pod every time. No Redis lookup needed for routing, but flexibility is lost: adding or removing pods reshuffles connections.
Option 1 wins. Redis Pub/Sub decouples the routing worker from pod networking entirely. The routing worker doesn't need to know IPs, open connections, or handle retries. It publishes a message and moves on.
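The chosen flow, sketched with in-memory stand-ins for the Redis hash and Pub/Sub (the real implementation would use HGETALL and PUBLISH; all names here are illustrative):

```python
import json
from collections import defaultdict

# In-memory stand-ins for the Redis structures.
registry = {}                    # HSET conn:registry:{user_id} {device_id} "{server}:{conn}"
channels = defaultdict(list)     # Pub/Sub: one channel per WS pod, e.g. "ws-pod-042:deliver"

def register(user_id, device_id, server_id, conn_id):
    # Runs on the WS pod when the client completes the handshake.
    registry.setdefault(user_id, {})[device_id] = f"{server_id}:{conn_id}"

def route(user_id, payload):
    """Routing worker: look up which pod holds each device, publish to it."""
    devices = registry.get(user_id, {})          # HGETALL conn:registry:{user_id}
    if not devices:
        return "QUEUED"                          # user offline -> persist for reconnect
    for device_id, location in devices.items():
        server_id, conn_id = location.split(":")
        # PUBLISH {server_id}:deliver {message}
        channels[f"{server_id}:deliver"].append(
            json.dumps({"conn": conn_id, "device": device_id, "payload": payload}))
    return "PUBLISHED"

register("user_123", "device_web_1", "ws-pod-042", "conn_abc")
print(route("user_123", {"title": "Order Shipped"}))   # PUBLISHED
print(route("user_999", {"title": "Hi"}))              # QUEUED
```

The routing worker never learns a pod IP; it only writes to a named channel, which is exactly the decoupling argued for above.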
Broadcast / Topic Routing (Staged Fan-out)
For a topic with millions of subscribers, direct fan-out from one worker would OOM. Staged fan-out fixes this:
Total broadcast latency for 10M users: < 5 seconds
9.3 How the Right WebSocket Node Is Selected
The hard part: an event can fire from any server, but the user's WebSocket lives on one specific pod. How does it get routed?
The Problem Visualized
Four approaches. Most don't work at this scale.
Approach 1: Sticky Hashing (Route User to Deterministic Pod)
Idea: hash(user_id) % num_pods → always routes to the same pod
user_123 → hash("user_123") % 200 → pod 42 (always)
Event producer knows: "user_123 is on pod 42"
→ Send notification directly to pod 42
Why this DOESN'T work at this scale:
- Pod count changes (scaling, crashes). Rehashing redistributes users → mass disconnects.
- Consistent hashing helps but still moves ~1/N users per pod change.
- Hot users (celebrities) create hotspot pods.
- Multi-device gets brittle: a user's phone and laptop hash to the same pod only if the pod set hasn't changed between their connections; after any scale event, the deterministic lookup points at the wrong pod for connections made under the old ring.
Verdict: Rejected. Too fragile, too many edge cases.
Approach 2: Broadcast to All Pods
Idea: Send every notification to ALL 200 pods.
Each pod checks: "Do I have this user?" If yes, deliver. If no, discard.
Event → broadcast to 200 pods → 199 discard, 1 delivers
Why this DOESN'T work at this scale:
- 100M notifications/sec x 200 pods = 20 BILLION message deliveries/sec internally.
- 99.5% of messages are wasted (delivered to pods that don't have the user).
- Network bandwidth explodes: 100 GB/sec x 200 = 20 TB/sec internal traffic.
Verdict: Rejected. Catastrophically wasteful.
Approach 3: Centralized Registry Lookup (CHOSEN)
Idea: Maintain a fast lookup table in Redis.
When event fires → lookup user → find exact pod → route directly.
Here is the complete end-to-end flow showing how an event on any server reaches the right WebSocket node:
PHASE 1: CONNECTION TIME (user connects, happens ONCE)
PHASE 2: EVENT TIME (notification fires, can happen from ANYWHERE)
PHASE 3: WEB DELIVERY (finding the RIGHT node)
PHASE 4: USER OFFLINE (no active connection)
If HGETALL conn:registry:user_123 returns EMPTY:
- Store in ScyllaDB: status = 'QUEUED'
- When user reconnects to ANY pod (e.g., ws-pod-099):
- ws-pod-099 fetches pending:
SELECT * FROM notification_status WHERE user_id = 'user_123' AND status = 'QUEUED'
- Delivers all pending, updates status to DELIVERED
Total hops: Producer → Kafka → Router → Redis Lookup → Pub/Sub → Pod
Total time: ~50-150ms end-to-end (p99)
Why this works at scale:
- Redis HGETALL on a user key is O(N) where N = devices per user (1-5). Sub-millisecond.
- Redis Pub/Sub delivers to the subscribed pod in ~0.1ms.
- Each WS pod subscribes to exactly ONE channel (its own name). No topic explosion.
- 100M lookups/sec across a Redis Cluster with 50+ shards (each shard handles ~2M ops/sec, so 100M / 2M = 50 shards) is well within capacity.
- Registration and deregistration are both O(1).
Approach 4: Gossip Protocol (Pod-to-Pod)
Idea: Each pod gossips its connection list to neighbors.
Eventually every pod knows the full mapping.
Why this doesn't scale:
- 100M connections x ~100 bytes = 10 GB of state to replicate to every pod.
- Convergence delay: O(log N) gossip rounds = ~8 rounds for 200 pods.
- Connection churn (100K+ connects/disconnects per second) overwhelms gossip bandwidth.
- This would be reinventing a worse Redis.
Verdict: Rejected.
Handling Stale Registry Entries
A connection might drop without clean deregistration (network partition, pod crash, OOM kill). Now Redis says "user_123 is on ws-pod-042" but the connection is dead.
Three recovery mechanisms run in parallel:
MECHANISM 1: HEARTBEAT (proactive, every 30 seconds)
Each WS pod sends PING to all its local connections. No PONG within 10s → connection declared dead:
HDEL conn:registry:user_123 device_web_1
SREM conn:server:ws-pod-042 user_123:device_web_1
Catches: silent TCP drops, client-side crashes, network glitches
Detection time: 30-40 seconds
MECHANISM 2: POD HEALTH REAPER (reactive, every 30 seconds)
Background service checks health of all pods in conn:server:* keys.
For crashed pods → bulk deregister all connections:
SMEMBERS conn:server:ws-pod-042
→ for each: HDEL conn:registry:{user_id} {device_id}
DEL conn:server:ws-pod-042
Catches: pod OOM kills, hardware failures, AZ outages
Detection time: 10-30 seconds
MECHANISM 3: TTL SAFETY NET (passive)
All Redis keys have 24h TTL. Connections refresh TTL on each PONG. Even if mechanisms 1 and 2 miss it, stale entries expire.
Catches: everything else
Detection time: up to 24 hours (worst-case safety net)
Recovery after stale entry is cleared:
- Next delivery attempt sees empty registry
- Notification stored in ScyllaDB as QUEUED
- When user reconnects → pending notifications delivered
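Mechanism 2 amounts to a sweep over the reverse index. A sketch against in-memory stand-ins for the two Redis structures (the health-probe trigger and key names are assumptions):

```python
# In-memory stand-ins for the two Redis structures.
conn_registry = {                      # HSET conn:registry:{user_id} {device_id} location
    "user_123": {"device_web_1": "ws-pod-042:conn_a"},
    "user_456": {"device_web_9": "ws-pod-007:conn_b"},
}
server_index = {                       # SADD conn:server:{pod} "{user_id}:{device_id}"
    "ws-pod-042": {"user_123:device_web_1"},
    "ws-pod-007": {"user_456:device_web_9"},
}

def reap_dead_pod(pod_id):
    """Bulk-deregister every connection a crashed pod was holding."""
    # SMEMBERS conn:server:{pod}, then DEL conn:server:{pod}
    for member in server_index.pop(pod_id, set()):
        user_id, device_id = member.split(":")
        # HDEL conn:registry:{user_id} {device_id}
        conn_registry.get(user_id, {}).pop(device_id, None)

reap_dead_pod("ws-pod-042")   # e.g. the health probe to ws-pod-042 timed out
```

The reverse index is what makes this O(connections-on-that-pod) instead of a scan over all 100M registry entries.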
9.4 Load Balancer & Persistent Connections
The Core Problem
Standard HTTP LBs distribute per-request. WebSocket connections last minutes to hours. Three tensions:
- Connection affinity: Once upgraded, all frames must go to the same backend. LB can't re-route mid-connection.
- Rolling deploys: New pods replace old ones. Old pods must drain 500K connections gracefully.
- Uneven distribution: Long-lived connections cause accumulation. A pod started 2 hours ago has way more connections than one started 2 minutes ago.
Architecture: L4 NLB + Envoy Sidecar
Graceful Drain on Deploy / Scale-Down
Timeline:
T+0s Kubernetes sends SIGTERM (preStop hook fires)
│
├── Mark self "draining" in Redis:
│ SET ws-pod-042:status "draining" EX 600
│
├── Deregister from NLB target group
│ (NLB stops sending NEW TCP connections)
│
├── Send WebSocket CLOSE frame (code 1001: Going Away)
│ to ALL 500K connected clients:
│ "Server shutting down, please reconnect"
│
T+0-5s Clients receive CLOSE frame
│
├── Well-behaved clients ACK and start reconnecting
│ to NLB → routed to different healthy pods
│
T+30s Send PING to remaining connections
│
├── No PONG → force-close
│
T+300s terminationGracePeriodSeconds expires
│
├── Kubernetes sends SIGKILL
│
├── Cleanup: bulk deregister from Redis
│ SMEMBERS conn:server:ws-pod-042 → HDEL all
│ DEL conn:server:ws-pod-042
Recovery: Users who reconnected to new pods already have
their registrations updated. Pending undelivered notifications
are fetched from ScyllaDB on reconnect.
Client Reconnection: Avoiding the Thundering Herd
When 500K clients disconnect simultaneously, they must NOT all reconnect at the same instant.
With 500K clients and 0-2s jitter, reconnections spread to ~250K/sec, well within NLB capacity.
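A common client-side policy for this is exponential backoff with full jitter. A sketch (the base and cap values are assumptions, tuned per SDK):

```python
import random

def reconnect_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: the delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], so disconnected clients spread
    their reconnects instead of hammering the NLB in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [reconnect_delay(a) for a in range(6)]
# attempt 0 -> up to 0.5s, attempt 1 -> up to 1s, ..., later attempts capped at 30s
```

Full jitter (rather than adding a small random offset to a fixed backoff) is what flattens the reconnect wave: every client samples the whole interval, so arrival rate stays roughly uniform.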
9.5 Mobile Push: FCM & APNs
FCM (Android)
Error handling:
200 OK → SENT, commit offset
UNREGISTERED → Deactivate token in ScyllaDB
INVALID_ARGUMENT → DLQ (bad payload)
429 → Backoff, pause partition consumer
500/503 → Retry 3x with exponential backoff, then DLQ
APNs (iOS)
Error handling:
200 OK → SENT, commit offset
BadDeviceToken → Deactivate token
Unregistered → Deactivate (check timestamp > last registration)
ExpiredProvider → Regenerate JWT, retry
429 → Backoff
500/503 → Retry 3x, then DLQ
Key APNs detail:
410 Unregistered includes a timestamp. Only deactivate if
Apple's timestamp is AFTER the last token registration.
Otherwise the user may have re-registered on a new device.
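That timestamp guard reduces to a small pure function. A sketch (the exact response shape from Apple's API is abstracted into two millisecond timestamps here):

```python
def handle_apns_unregistered(apns_timestamp_ms, last_registered_ms):
    """APNs 410 Unregistered carries the time Apple last saw the token as
    invalid. Deactivate only if that is AFTER our last registration --
    otherwise the token was re-registered since and is still live."""
    if apns_timestamp_ms > last_registered_ms:
        return "DEACTIVATE_TOKEN"
    return "KEEP_TOKEN"

# Apple invalidated the token after we last registered it -> deactivate.
assert handle_apns_unregistered(1_700_000_500_000, 1_700_000_000_000) == "DEACTIVATE_TOKEN"
# The user re-registered after Apple's invalidation -> keep the token.
assert handle_apns_unregistered(1_700_000_000_000, 1_700_000_500_000) == "KEEP_TOKEN"
```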
9.6 Deduplication
Ingestion-Layer Deduplication (FR-11)
Every notification carries an X-Idempotency-Key header. The API gateway checks this key against a Bloom filter in Redis before writing to Kafka.
API Gateway receives notification:
1. key = X-Idempotency-Key (or notification_id if no key provided)
2. BF.ADD dedup:{today} key (a single atomic call; avoids a check-then-set race)
→ returns 0 (key probably already seen): reject with 409 Conflict
→ returns 1 (key newly added): proceed to Kafka
3. Bloom filter rotates daily (dedup:{yesterday} deleted at midnight)
Why a Bloom filter and not a Redis SET? Idempotency keys attach to API requests, not to fan-out deliveries (a broadcast to 10M users carries one key), but even at ~1B unique requests/day, storing every key as a SET member would consume ~100 GB/day (100 bytes per key x 1B daily keys). A Bloom filter with a 0.1% false-positive rate needs ~14.4 bits per element, roughly 1.8 GB for the same volume. The tradeoff: 0.1% of unique notifications are falsely rejected as duplicates. At this scale, that's acceptable. The alternative (100 GB of Redis memory per day) is not.
False positives mean a tiny fraction of legitimate notifications get a 409. The client retries with a new idempotency key. False negatives (letting a duplicate through) are impossible with Bloom filters. That's the right direction for at-least-once semantics: better to occasionally reject a valid notification than to deliver a duplicate.
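The memory comparison follows from the standard Bloom filter sizing formula m = -n·ln(p) / (ln 2)². A quick check, assuming 1B unique keys/day at ~100 bytes each and a 0.1% false-positive target:

```python
import math

n = 1_000_000_000          # unique idempotency keys per day (assumption)
p = 0.001                  # target false-positive rate

set_bytes = n * 100                                 # SET: ~100 bytes per stored key
bloom_bits = -n * math.log(p) / math.log(2) ** 2    # ~14.4 bits per element
bloom_bytes = bloom_bits / 8

print(f"SET:   {set_bytes / 1e9:.0f} GB/day")       # 100 GB
print(f"Bloom: {bloom_bytes / 1e9:.1f} GB/day")     # 1.8 GB
```

A roughly 50x memory reduction, bought with a bounded, tunable false-positive rate.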
Delivery-Layer Deduplication
The ingestion Bloom filter catches most duplicates, but Kafka producer retries can still produce the same message twice (the Bloom filter saw it once, marked it, but Kafka didn't ACK in time, so the API gateway retried). Delivery workers guard against this: before dispatching, they check notification_status in ScyllaDB. If the (user_id, notification_id, channel) combination already has status SENT or DELIVERED, the worker skips it. This is a cheap read (partition key lookup) and prevents the user from seeing the same notification twice.
9.7 Multi-Channel Delivery Decision
9.8 Scheduled Notifications
Architecture
Handling 1 Billion Scheduled Notifications
One giant Redis sorted set won't fit in memory (1B x 100 bytes = 100 GB). Solution: partition by minute.
scheduled:fire_index:2026-02-05T10:00 ← only notifications for this minute
scheduled:fire_index:2026-02-05T10:01
...
Average per bucket: 1B / 525,600 min/year ≈ 1,900 entries
Peak (100x): ~190,000 entries per bucket ← trivially fits in memory
Scheduler only reads the CURRENT minute's bucket.
Background job pre-populates Redis buckets 10 minutes ahead from ScyllaDB.
Cancellation
1. API: DELETE /v1/notifications/{id}
2. ScyllaDB: UPDATE status = 'CANCELLED'
3. Redis: ZREM scheduled:fire_index:{bucket} {id}
(only if fire_at is within 10 min and already pre-loaded into Redis.
For far-future cancellations, step 2 is sufficient — the pre-loader
skips CANCELLED entries.)
4. If already in Kafka but not delivered:
→ Delivery workers check: SISMEMBER cancelled:notifications {id}
→ If found, skip delivery
Scheduling Lifecycle (End-to-End)
Here's what happens from the moment a client sends a scheduled notification to the moment it fires:
If fire_at is within 10 minutes, the API writes to both ScyllaDB and Redis. The notification is ready to fire almost immediately. For far-future notifications (hours, days, months), the API writes to ScyllaDB alone. A background pre-loader continuously scans ScyllaDB and populates Redis buckets 10 minutes before fire time. This keeps Redis lean. A notification scheduled 30 days out doesn't sit in Redis for a month. If Redis loses an entry, the consistency sweep catches it. If ScyllaDB is slow, the API retries. The notification is only acknowledged to the client after the ScyllaDB write succeeds.
Time Wheel Deep Dive
The scheduler leader runs a tight loop that ticks every second:
EVERY 1 SECOND:
1. current_bucket = format(now(), "YYYY-MM-DDTHH:mm")
2. entries = ZRANGEBYSCORE scheduled:fire_index:{current_bucket} -inf {now_epoch} LIMIT 0 1000
3. For each entry:
a. Fetch full payload from ScyllaDB
b. Produce to Kafka (notifications.ingest topic)
c. ZREM from Redis
d. UPDATE status = 'FIRED' in ScyllaDB
4. If entries.length == 1000, immediately loop again (drain the bucket)
The LIMIT of 1000 per iteration keeps individual batches small without slowing the drain: step 4 loops back-to-back, so a minute bucket with 50K entries (flash sale, Black Friday campaign) empties in ~50 consecutive batches within a few seconds. That's fine, because the precision target is ≤ 1 second drift for normal load and a few seconds for extreme spikes.
Why a leader-elected single writer? Multiple schedulers reading the same bucket would fire duplicate notifications. Leader election via Redis lock (SETNX with TTL) is simple and the failover time (< 5 seconds) is acceptable. The hot standby watches the lock and takes over immediately on expiry.
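A minimal in-memory stand-in for the SETNX-with-TTL lock, with an injectable clock so the expiry path is easy to exercise. `LeaderLock` and its method names are hypothetical; real code would issue SET scheduler:leader {node_id} NX PX {ttl_ms} against Redis and renew before expiry.

```python
import time

class LeaderLock:
    """In-memory sketch of a Redis SETNX+TTL leader lock."""

    def __init__(self, ttl=5.0, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self.holder, self.expires_at = None, 0.0

    def try_acquire(self, node_id):
        # Succeeds if the lock is free, expired, or already held by this node
        # (re-acquiring renews the TTL, like the leader's heartbeat).
        now = self.clock()
        if self.holder is None or now >= self.expires_at or self.holder == node_id:
            self.holder, self.expires_at = node_id, now + self.ttl
            return True
        return False
```

The hot standby simply calls `try_acquire` in a loop; it wins only once the leader stops renewing and the TTL lapses.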
Recurring Notifications (Cron)
For recurring schedules like "every Monday at 9 AM," the recurrence field holds a cron expression:
recurrence: "0 9 * * 1" → every Monday at 09:00 UTC
recurrence: "0 */6 * * *" → every 6 hours
recurrence: null → one-shot, no recurrence
After the scheduler fires a recurring notification:
1. Fire the current occurrence (produce to Kafka)
2. Compute next fire time from cron expression
3. INSERT new row in ScyllaDB with next fire_at, status = PENDING
4. ZADD the new fire time into the appropriate minute bucket in Redis
5. UPDATE current row status = 'FIRED'
Each occurrence is a separate ScyllaDB row. This gives a complete audit trail (every past firing is queryable) and avoids mutation of a single row that multiple processes might read.
To stop a recurring notification, the client sends DELETE /v1/notifications/{id}. The scheduler checks status before firing. If CANCELLED, it skips and does NOT compute the next occurrence.
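For the specific "every Monday at 9 AM" case, the next-fire computation (step 2 above) can be sketched with the stdlib alone; a production scheduler would delegate to a cron library. `next_weekly_fire` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def next_weekly_fire(after: datetime, weekday: int, hour: int, minute: int = 0) -> datetime:
    """Next occurrence of 'every <weekday> at HH:MM' strictly after `after`.

    weekday follows datetime.weekday(): 0 = Monday ... 6 = Sunday.
    """
    candidate = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - after.weekday()) % 7)
    if candidate <= after:          # same-day time already passed
        candidate += timedelta(days=7)
    return candidate
```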
Timezone Handling
Scheduled times are stored in UTC internally. The API converts on ingestion:
Client sends: deliver_at: "2026-03-10T09:00:00", timezone: "America/New_York"
API converts: fire_at: "2026-03-10T13:00:00Z" (March 10 falls after the March 8 spring-forward, so EDT is UTC-4)
Stored in: ScyllaDB and Redis as UTC epoch
DST transitions matter. "Every day at 9 AM Eastern" shifts between UTC-5 (EST) and UTC-4 (EDT). For recurring notifications, the scheduler recomputes the UTC fire time for each occurrence using the stored timezone. It doesn't just add 24 hours blindly.
March 7, 2026 (before spring forward): 9 AM ET = 14:00 UTC
March 8, 2026 (after the 2 AM spring forward): 9 AM ET = 13:00 UTC ← 1 hour earlier in UTC
The cron library handles this by resolving the cron expression in the user's timezone first, then converting to UTC.
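The conversion is easy to verify with Python's stdlib zoneinfo (3.9+; it needs a system tz database or the tzdata package). A sketch showing the one-hour UTC shift across the 2026 spring-forward:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

def to_utc_fire_at(local_naive: datetime, tz: ZoneInfo) -> datetime:
    # Attach the tenant's timezone, then convert to UTC for storage
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc)
```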
Missed Fire Recovery
Three things can cause a missed fire: scheduler leader crash, Redis data loss, or a bug that skips an entry.
The consistency sweep catches all of them:
EVERY 5 MINUTES (runs on leader):
1. Compute buckets to scan:
past_buckets = last 10 minute buckets (now - 10min to now - 1min)
2. For each bucket:
SELECT * FROM scheduled_notifications
WHERE fire_time_bucket = '{bucket}'
AND fire_at < now() - 60s
AND status = 'PENDING'
LIMIT 10000
3. For each missed notification:
a. Re-index in Redis: ZADD scheduled:fire_index:{bucket} {fire_epoch} {id}
b. The normal time wheel picks it up on the next tick
4. Log metric: scheduled.missed.recovered.count
The query scans by fire_time_bucket (partition key) so it's efficient. Without the partition key, ScyllaDB would do a full table scan across all nodes, which is catastrophic at this data volume.
The 60-second grace period avoids racing with the time wheel (which might be about to fire the notification). Worst case, a missed notification fires roughly one sweep interval (5 minutes) plus the grace period after its target time. For a leader crash with < 5 second failover, most scheduled notifications fire within 10 seconds of their target time.
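The bucket computation in step 1 of the sweep is simple enough to pin down in code; `sweep_buckets` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def sweep_buckets(now: datetime, lookback_minutes: int = 10):
    """Minute buckets to scan: from now-10min up to now-1min, oldest first."""
    base = now.replace(second=0, microsecond=0)
    return [(base - timedelta(minutes=m)).strftime("%Y-%m-%dT%H:%M")
            for m in range(lookback_minutes, 0, -1)]
```

Each returned key is a ScyllaDB partition key (fire_time_bucket), so every scan stays on one partition.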
9.9 Notification Templates
Templates live in PostgreSQL. Low volume, transactional, relational queries like "all templates for tenant X." Out of 100M notifications/sec, maybe 5% use templates. That's 5M template renders per second, but spread across only a few hundred distinct templates per tenant.
Where Rendering Happens
Template rendering runs in the routing layer, not at ingestion. The routing worker already fetches user preferences and device tokens from ScyllaDB for every notification. Adding a template lookup is one more read, and it's almost always a cache hit.
Routing worker receives notification with template_id + template_vars
→ Cache lookup: template:{tenant_id}:{template_id}:{version}
→ Cache hit (99%+): render immediately
→ Cache miss: fetch from PostgreSQL, populate cache (TTL 5 min)
→ Mustache-style interpolation: "Your order #{{order_id}}" → "Your order #ORD-98765"
→ Result becomes the notification payload for downstream delivery
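A minimal sketch of the mustache-style interpolation, including the default_vars fallback described under Error Handling below. `render` and its regex are illustrative assumptions, not the platform's actual template engine.

```python
import re

_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template, template_vars, default_vars=None):
    """Mustache-style interpolation with fallback to template defaults."""
    defaults = default_vars or {}

    def sub(match):
        key = match.group(1)
        if key in template_vars:
            return str(template_vars[key])
        if key in defaults:
            return str(defaults[key])
        raise KeyError(key)  # caller falls back to the raw payload + warning

    return _VAR.sub(sub, template)
```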
Why not render at ingestion time? Because the routing worker already has the user context (locale, device type, preferences). A template might produce different output per device: shorter title for APNs (limited to 178 bytes), longer body for web push. Rendering at routing time allows tailoring the output per channel.
Template Schema
Versioned. When a tenant updates a template, a new version row is inserted rather than mutating the old one. Notifications already in flight keep using the version they were created with. No race between "template updated" and "notification being rendered right now."
Caching Strategy
Templates are a classic read-heavy, write-rare workload. A few hundred templates per tenant, updated maybe a few times a day, read millions of times per second.
Each routing worker keeps a local in-memory LRU cache (bounded at 10K entries, roughly 50 MB). Cache key: {tenant_id}:{template_id}:{version}. TTL: 5 minutes. On a miss, the worker fetches from PostgreSQL and populates the cache.
Why not Redis for template caching? PostgreSQL handles the read volume fine here. Templates are small, the indexes are tight, and the total dataset fits comfortably in PostgreSQL's buffer cache. Adding Redis as an intermediate layer for a low-volume dataset adds a network hop and operational complexity for no meaningful latency win. The local LRU is faster (zero network) and sufficient given how few distinct templates exist per tenant.
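The local LRU with TTL described above can be sketched with an OrderedDict; `TTLLRUCache` is a hypothetical name, and the clock is injectable for testing.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Bounded LRU cache with per-entry TTL (sketch of the routing worker's local cache)."""

    def __init__(self, max_entries=10_000, ttl=300.0, clock=time.monotonic):
        self.max_entries, self.ttl, self.clock = max_entries, ttl, clock
        self._d = OrderedDict()  # key -> (expires_at, value); insertion order = LRU order

    def get(self, key):
        item = self._d.get(key)
        if item is None:
            return None
        expires, value = item
        if self.clock() >= expires:      # lazy TTL expiry on read
            del self._d[key]
            return None
        self._d.move_to_end(key)         # mark as most recently used
        return value

    def put(self, key, value):
        self._d[key] = (self.clock() + self.ttl, value)
        self._d.move_to_end(key)
        if len(self._d) > self.max_entries:
            self._d.popitem(last=False)  # evict least recently used
```

Cache keys would follow the {tenant_id}:{template_id}:{version} scheme from above.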
Error Handling
What happens when a template fails to render? Missing variable, bad syntax, PostgreSQL unreachable.
Template render fails:
1. Missing variable in template_vars:
→ Fall back to default_vars from template definition
→ If still missing: send raw payload without template, log warning
2. template_id doesn't exist:
→ DLQ immediately (bad request, won't fix itself on retry)
3. PostgreSQL unreachable:
→ Circuit breaker opens (see Section 11.12)
→ Serve from local LRU cache (stale template > no notification)
→ If cache also empty: send raw payload from notification content
A template problem should never cause a dropped notification. Worst case, the user gets an unformatted notification with the raw content fields. That's better than silence.
9.10 Priority Queues
The API accepts four priority levels: CRITICAL, HIGH, NORMAL, LOW. But accepting them in a JSON field and actually enforcing them through the pipeline are different things. Without separate processing paths, a flood of LOW priority marketing blasts would delay CRITICAL alerts like fraud detection, OTP codes, or security warnings.
Separate Kafka Topics Per Priority
notifications.routed.web.critical → dedicated consumer group (always has capacity)
notifications.routed.web.high → shared consumer group
notifications.routed.web.normal → shared consumer group
notifications.routed.web.low → shared consumer group (first to shed)
Same pattern for .fcm.* and .apns.*
The routing worker reads the notification's priority field and produces to the matching topic. CRITICAL gets its own topic with dedicated consumers that never share resources with lower priorities.
Why not one topic with consumer-side priority? Kafka doesn't support per-message priority within a partition. Messages process in offset order. A consumer can't skip past 10K LOW messages to reach a CRITICAL message stuck behind them. Separate topics give real priority isolation at the Kafka level.
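The routing worker's topic selection is a pure string mapping; a sketch under the topic-naming scheme above (`routed_topic` is a hypothetical helper, and the unknown-priority fallback is an assumption):

```python
PRIORITIES = {"CRITICAL", "HIGH", "NORMAL", "LOW"}

def routed_topic(channel: str, priority: str) -> str:
    # channel is one of: web, fcm, apns; unknown priorities fall back to NORMAL
    p = priority if priority in PRIORITIES else "NORMAL"
    return f"notifications.routed.{channel}.{p.lower()}"
```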
Consumer Allocation
| Priority | Consumer Pods | Kafka Partitions | Notes |
|---|---|---|---|
| CRITICAL | 200 (dedicated) | 200 | Never shared, never scaled down |
| HIGH | 600 (shared) | 1000 | Scales with load |
| NORMAL | 800 (shared) | 2000 | Bulk of traffic |
| LOW | 400 (shared) | 1000 | Shed first under pressure |
CRITICAL consumers are over-provisioned on purpose. They sit mostly idle. That's the point. When a bank sends a fraud alert, it goes through a pipeline with plenty of headroom, not one competing with a retailer's daily deal blast.
Backpressure and Load Shedding
When the system is under pressure (Kafka consumer lag growing, delivery channels rate-limited), LOW priority notifications slow down first.
Backpressure response:
1. Consumer lag > 100K on LOW topic for 2+ minutes:
→ Pause 50% of LOW consumers, redistribute pods to HIGH
2. Consumer lag > 500K on NORMAL topic:
→ Pause all LOW consumers
→ Alert: "LOW priority delivery paused, backpressure active"
3. Consumer lag > 100K on CRITICAL topic:
→ Page on-call (this should never happen)
→ Auto-scale CRITICAL consumers (already over-provisioned, so investigate root cause)
LOW priority notifications are not dropped during backpressure. They stay in Kafka (24-hour retention) and get processed once the pressure eases. The real-world effect: marketing emails arrive a few minutes late during a traffic spike. Nobody notices.
9.11 Rate Limiting
Three levels, each checked at a different point in the pipeline.
Where Each Check Happens
API Gateway (ingestion): tenant-level rate limit
Routing Worker: per-user rate limit + per-channel rate limit
Tenant limits are enforced early, at the API gateway, to reject excessive traffic before it touches Kafka. Per-user and per-channel limits run later, at routing, because the routing worker is the one that knows the target user and their channels.
Redis Data Structures
# Tenant-level: sliding window (already defined in Section 7.2)
ZADD ratelimit:tenant:{tenant_id} {timestamp_ms} {notification_id}
# Per-user: sliding window
ZADD ratelimit:user:{user_id} {timestamp_ms} {notification_id}
# Per-channel per-user: simple counter with TTL
INCR ratelimit:user:{user_id}:fcm
EXPIRE ratelimit:user:{user_id}:fcm 3600 NX # NX (Redis 7+): set TTL only if none exists, so the 1h window is fixed, not rolling
Default Limits
| Level | Default Limit | Window | Configurable? |
|---|---|---|---|
| Tenant | 10,000/sec | Sliding 1s | Yes, per contract |
| Per-user | 100/hour | Sliding 1h | Yes, per tenant config |
| Per-channel (FCM) | 50/hour | Fixed 1h | Yes |
| Per-channel (APNs) | 50/hour | Fixed 1h | Yes |
| Per-channel (Web) | 200/hour | Fixed 1h | Yes |
Web gets a higher default because web push is lightweight. No third-party API call, just a WebSocket frame on an existing connection. Users on the web also tend to tolerate higher frequency since they can dismiss notifications instantly.
Sliding Window Implementation
Tenant-level rate limiting uses a sliding window log in Redis. Four commands, pipelined into one round trip:
function checkTenantRateLimit(tenantId, notificationId, limit, windowMs):
key = "ratelimit:tenant:{tenantId}"
now = currentTimeMs()
windowStart = now - windowMs
pipeline:
ZREMRANGEBYSCORE key -inf windowStart // prune expired entries
ZADD key now notificationId // add current request
ZCARD key // count entries in window
EXPIRE key (windowMs / 1000 + 60) // TTL safety net
count = pipeline.execute()[2] // ZCARD result
if count > limit:
ZREM key notificationId // remove the one just added
return RATE_LIMITED
return OK
At 100M notifications/sec across 50+ Redis shards, each shard handles roughly 2M rate limit ops/sec. Comfortable.
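The same sliding-window algorithm can be exercised without Redis against an in-memory sorted list. This is a unit-testing aid for the limit logic, not a production substitute (`SlidingWindowLimiter` is a hypothetical name):

```python
import bisect

class SlidingWindowLimiter:
    """In-memory equivalent of the Redis sliding-window log."""

    def __init__(self, limit: int, window_ms: int):
        self.limit, self.window_ms = limit, window_ms
        self.events = []  # sorted event timestamps in ms

    def check(self, now_ms: int) -> str:
        # Prune entries older than the window (ZREMRANGEBYSCORE)
        cutoff = now_ms - self.window_ms
        del self.events[:bisect.bisect_right(self.events, cutoff)]
        # Count entries in the window (ZCARD); reject if already at limit
        if len(self.events) >= self.limit:
            return "RATE_LIMITED"
        bisect.insort(self.events, now_ms)  # admit this request (ZADD)
        return "OK"
```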
What Happens When Rate Limited
Tenant rate limited at API Gateway:
→ 429 Too Many Requests
→ Response includes Retry-After header
→ Logged for billing and abuse monitoring
User rate limited at Routing Worker:
→ Notification silently dropped (not queued, not retried)
→ Status set in ScyllaDB: RATE_LIMITED
→ Visible in tenant's analytics dashboard
→ Does NOT count against the tenant's quota (already passed that check)
User-level rate limiting protects the end user from notification spam, not the system from overload. If a buggy integration fires 1,000 notifications at the same person in an hour, only the first 100 get through. The rest are dropped with status RATE_LIMITED so the tenant can see what happened in their dashboard and fix their integration.
Priority Overrides
CRITICAL notifications bypass per-user and per-channel rate limits. A fraud alert or OTP code should never be blocked because the user already got too many marketing notifications that hour. Tenant-level limits still apply though. Even CRITICAL traffic can't exceed the tenant's contracted throughput.
Routing Worker rate limit check:
if notification.priority == CRITICAL:
skip per-user rate limit
skip per-channel rate limit
else:
check per-user rate limit → RATE_LIMITED? drop + log
check per-channel rate limit → RATE_LIMITED? drop + log
9.12 Zero-Downtime Deploys for WebSocket Gateways
Every team deploying WebSocket servers eventually asks this question: can new code ship without disconnecting users? With 200 gateway pods holding 100M connections, the stakes are high. A bad rollout could cascade through the entire fleet.
Can Zero Downtime Be Achieved?
Let's define it precisely. Zero downtime means: no connected user ever observes a disconnection during a deploy. Not a brief spinner, not a reconnect, nothing.
The answer is no. A WebSocket connection is a stateful TCP session pinned to a specific OS process on a specific pod. Replacing the process (which is what a deploy does) kills the TCP session. There is no HTTP-level redirect or retry built into the WebSocket protocol. Unlike stateless HTTP requests that the load balancer can transparently reroute to a new backend, a WebSocket connection is a long-lived socket held by one process. Kill the process, kill the connection.
Three "clever" approaches come up in every design review. None of them work at this scale:
Kernel fd passing (SCM_RIGHTS). A file descriptor can be transferred between processes on the same host. But gateway pods are separate containers, often on different nodes. fd passing doesn't work across nodes. Even on the same node, the new process would need to reconstruct all application-level state (user_id, subscriptions, auth context, in-flight message buffers) for each of 500K connections. Nobody does this in production at this scale.
Proxy-level connection holding. Envoy can hold the client-side TCP connection open while swapping the backend for HTTP/2 (the GOAWAY frame tells clients to stop sending new streams). But WebSocket traffic is bidirectional frames on a single upgraded connection. There is no "stop using this connection for new work" concept. The proxy would need to buffer frames in both directions indefinitely, which defeats the purpose of a real-time system.
Blue-green with connection serialization. Serialize the entire connection state (TLS session, WebSocket frame state, application state), transfer it to a new pod, and resume. Theoretically possible. In practice, that means building a distributed state migration system that handles 500K connections per pod, all to avoid a 0-2 second reconnect delay. The complexity far outweighs the benefit.
What IS achievable: near-zero-downtime. Each user sees at most a 0-2 second reconnect blip. Zero notifications are lost (ScyllaDB queues anything that arrives during the gap). At any given moment during the rollout, less than 0.5% of connections are cycling. This is what Slack, Discord, and Pusher actually do in production.
Rolling Deploy Strategy for 200 Pods
The key insight: drain one pod at a time, and make sure the replacement pod is fully ready BEFORE the old pod starts draining.
In the Kubernetes Deployment config, maxSurge: 1 is what makes this work. Kubernetes starts the new pod first, waits for it to pass readiness checks for 30 seconds (minReadySeconds), and only then sends SIGTERM to the old pod. At no point does capacity drop below 199 pods.
The pacing math:
Each pod takes roughly 35 seconds to cycle: 5 seconds for clients to receive the CLOSE frame + 30 seconds of minReadySeconds on the replacement. With 200 pods going one at a time, the full rollout takes about 117 minutes. At any given moment, only 500K out of 100M connections are in the reconnect window. That's 0.5% of users.
How fast can the rollout go?
| Parallelism | Users impacted at once | Rollout time | Redis write spike | Risk |
|---|---|---|---|---|
| 1 pod | 500K (0.5%) | ~2 hours | Low | Minimal |
| 5 pods | 2.5M (2.5%) | ~25 min | Moderate | Low |
| 10 pods | 5M (5%) | ~12 min | High | Medium |
| 20 pods | 10M (10%) | ~6 min | Very high | Thundering herd risk |
At 10 pods in parallel, Redis has to handle 5M HSET + SADD commands in a few seconds from reconnections. A 50-shard Redis Cluster can handle that, but it eats into the headroom that protects against other spikes. 5 pods in parallel is the sweet spot for most rollouts: 25 minutes total, 2.5% user impact, comfortable Redis load.
The Full Deploy Lifecycle
Here's the complete timeline for one pod cycling through a deploy. The critical difference from a crash (Section 11.1): the new pod is already running and accepting connections BEFORE the old pod starts draining.
T-30s Kubernetes creates ws-pod-042-v2 (new version)
Pod starts, subscribes to Redis Pub/Sub, joins NLB
New connections from fresh clients start landing here
T+0s minReadySeconds (30s) passes. New pod confirmed stable.
Kubernetes sends SIGTERM to ws-pod-042-v1
Graceful drain begins (see Section 9.4 for full drain lifecycle)
T+0-2s Clients reconnect with jitter to healthy pods (including v2)
T+5s ~95% reconnected
T+300s SIGKILL. Bulk Redis cleanup. Kubernetes moves to next pod.
The 5-minute grace period (T+0 to T+300s) is generous. Most connections drain in under 5 seconds. The long tail exists for clients on flaky mobile networks that take 30+ seconds to notice the CLOSE frame. The SIGKILL at T+300s is a safety net, not the normal path.
Notification Delivery During the Deploy Window
During the drain, a user can be in one of three states. Each one has a clear delivery path.
State 1: Already reconnected to a new pod. The Redis registry has been updated with the new pod mapping. Notifications route normally through Pub/Sub to the new pod. No issue.
State 2: Mid-reconnect (0-2 second gap). The user disconnected from the old pod but hasn't connected to a new one yet. The Redis registry still points to the old pod. What happens:
Routing Worker → Redis lookup → "user_123 is on ws-pod-042-v1"
→ PUBLISH ws-pod-042-v1:deliver
→ Old pod receives Pub/Sub message
→ Tries to send to WebSocket session → session is closed
→ Writes notification to ScyllaDB: status = QUEUED
→ User reconnects to ws-pod-099-v2
→ ws-pod-099-v2 fetches pending:
SELECT * FROM notification_status
WHERE user_id = 'user_123' AND status = 'QUEUED'
→ Delivers all queued notifications
→ Updates status to DELIVERED
This is the same queuing mechanism described in Section 9.3 (Phase 4). No new infrastructure needed.
State 3: Still connected to the old pod. The draining pod hasn't sent the CLOSE frame yet, or the client hasn't processed it. The WebSocket session is still alive. Notifications deliver over the existing connection as normal. The drain process doesn't terminate connections instantly. It sends CLOSE and waits for the client to acknowledge.
Worst-case delivery delay: 2-3 seconds. User disconnects at T+0, reconnects at T+2s (jitter), queued notifications drain from ScyllaDB in the next second. For CRITICAL notifications like OTP codes or fraud alerts, a 2-3 second delay during a deploy is acceptable. The alternative (never deploying) is worse.
| User state | % of pod's users | Notification path | Delay |
|---|---|---|---|
| Already reconnected | ~90% (by T+2s) | Normal Pub/Sub | None |
| Mid-reconnect | ~8% (0-2s window) | ScyllaDB queue | 2-3s |
| Still on old pod | ~2% (slow clients) | Existing WebSocket | None |
Canary Deploys for WebSocket Gateways
Canary deploys work for WebSocket gateways, but they work differently than for stateless services.
Blue-green is impractical. It requires 200 extra pods (the "green" fleet) running alongside the existing 200 (the "blue" fleet). Each pod allocates ~10 GB for 500K connections. That's 2 TB of extra memory just for the deploy window. And 100M connections still need to migrate. Expensive, and it doesn't avoid reconnections.
Canary with new-connection routing is practical:
- Deploy the new version to 10 pods (out of 200)
- These canary pods join the NLB target group alongside the 190 stable pods
- Only new connections (from reconnects or fresh users) land on canary pods. Existing connections on the 190 stable pods are untouched
- Monitor canary pods for 10-15 minutes: error rates, connection stability, memory usage, WebSocket frame errors
- If healthy: proceed with rolling deploy for the remaining 190 pods
- If unhealthy: remove canary pods from NLB. Their users (~5M connections) reconnect to the 190 stable pods. Rollback complete in under 30 seconds
This pairs well with ArgoCD Rollouts (already in the tech stack, Section 12.2). The setWeight step controls what percentage of the NLB target group is canary pods. The analysis step monitors metrics from Prometheus before promoting.
The canary phase doesn't test the drain path. It only validates that the new code handles WebSocket connections correctly (no memory leaks, no frame errors, no auth failures). The rolling deploy that follows tests the drain mechanics, but by that point the new code is already validated.
Deploy Frequency at This Scale
Deploy frequency is limited by rollout time, not risk. At 1 pod at a time, a full rollout takes about 2 hours, and starting a new rollout while the previous one is mid-flight just churns connections on pods that already cycled. So there's a practical ceiling.
Separate cadences for different services:
| Service | Deploy frequency | Reason |
|---|---|---|
| Stateless (routing, FCM, API gateway) | 10+/day | No connection state. Instant replacement. |
| WebSocket gateways | 1-2/day | 2-hour rollout. Batch changes. |
| Infrastructure (Kafka, Redis, ScyllaDB) | Weekly | Planned maintenance windows. |
For WebSocket gateways, deploying every commit is impractical. Batch changes and ship once or twice a day. For faster iteration, use feature flags. Deploy the code with the flag off, validate with a canary, then flip the flag to activate the feature. This decouples code deployment from feature activation. The safety of a slow rollout with the agility of instant feature toggling.
One exception: security patches or critical bug fixes. For those, accept the 2-hour rollout or increase parallelism to 5-10 pods. A 25-minute rollout with 2.5% user impact is worth it for a security fix.
10. Identify Bottlenecks
10.1 WebSocket Node Selection Problem
Broadcasting every notification to all gateway pods would amplify 100M/sec into 20B internal messages (Section 9.3, Approach 2). The fix is the centralized Redis registry with Pub/Sub relay described in Section 9.3, Approach 3.
10.2 Broadcast Fan-Out Explosion
When a single notification must reach 10M+ subscribers (flash sale, trending topic), direct fan-out from one routing worker causes OOM. One worker can't hold 10M user IDs in memory and produce 10M Kafka messages. Staged fan-out fixes this: scan subscriber sets with SSCAN in micro-batches of 10K users, push those to Kafka, and let hundreds of consumers process them in parallel. One O(N) operation becomes N/10K micro-batches of 10K users each, processed in parallel.
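The micro-batching itself is trivial to sketch; `micro_batches` is a hypothetical helper mirroring SSCAN-style paging over a subscriber set:

```python
def micro_batches(user_ids, size=10_000):
    """Yield successive batches of `size` ids; each batch becomes one Kafka message."""
    batch = []
    for uid in user_ids:
        batch.append(uid)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                 # final partial batch
        yield batch
```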
10.3 Connection Registry Scalability
100M WebSocket mappings in Redis means ~10 GB of state (100 bytes per connection). The bigger challenge is the write rate: 100K+ connects/disconnects per second, each requiring an HSET and SADD. A Redis Cluster with 50+ shards handles this by hashing on user_id, keeping each shard under 2M ops/sec. Stale entries from crashed connections get cleaned up three ways: heartbeat checks (30s), a pod health reaper (10-30s), and TTL expiry as a safety net (24h worst case).
10.4 Scheduled Notification Hot Partitions
A single Redis sorted set for 1 billion scheduled notifications needs ~100 GB. That won't work. Partition by minute instead: scheduled:fire_index:2026-02-05T10:00. With 525,600 minutes per year, each bucket averages ~1,900 entries. Even at 100x peaks, that's ~190,000 per bucket. Trivial. The scheduler only loads the current minute, and a background job pre-populates buckets 10 minutes ahead from ScyllaDB.
10.5 Cache Stampede on Redis Failure
When a Redis shard fails, all routing workers that depended on it for connection lookups fail simultaneously. That's potentially hundreds of workers hitting the same error at once.
Partial failure (one shard out of 50):
Each routing worker maintains a local LRU cache of recent Redis lookup results (60-second TTL). When a shard goes down, the cache absorbs hits for recently active users on that shard. Only cache misses result in notifications written to ScyllaDB as QUEUED. With a 50-shard cluster, one shard failure affects ~2% of lookups. Most of those are cache hits. Real impact is minimal.
Full Redis outage:
Both lookup and Pub/Sub delivery are gone. Every notification falls back to ScyllaDB as QUEUED. Real-time delivery stops. Notifications are delayed until either Redis recovers (replica promotion takes 10-15 seconds) or users reconnect.
This is an accepted tradeoff. Redis failover is fast (10-15 seconds), and the 60-second gateway re-registration sweep rebuilds the registry quickly after. Maintaining a durable fallback registry (dual-writing every connection to ScyllaDB) would cost ~110K extra writes/sec permanently to avoid a 10-15 second delay during a rare event. Not worth it. The simpler design: accept the brief delay, queue in ScyllaDB, deliver on recovery.
10.6 Thundering Herd on Gateway Pod Restart
When a gateway pod dies, 500K clients disconnect at once. The server sends CLOSE frame (code 1001) before shutdown. Clients reconnect with jittered backoff (Section 9.4), spreading 500K reconnections to ~250K/sec.
10.7 Cross-Region Notification Latency
With ingestion-layer routing (Section 12.1), a notification for an EU user submitted to a US API Gateway must produce to the EU Kafka cluster across the Atlantic. That adds ~60-120ms of network round-trip. Combined with the normal pipeline latency (~50-150ms), cross-region notifications land at ~110-270ms end-to-end. Still well within the p99 target of 500ms. For the 80-90% of notifications that are local (US user → US delivery), there's zero cross-region overhead.
11. Failure Scenarios
11.1 WebSocket Gateway Pod Crash
SCENARIO: A gateway pod serving 500K concurrent WebSocket connections dies
This is the most interesting failure in the entire system. One pod crash instantly disconnects 500,000 users. If all 500K clients reconnect at the same instant to remaining pods, the result is a thundering herd that cascades through the whole gateway fleet.
Timeline:
| Time | Event |
|---|---|
| T=0 | Gateway pod-17 crashes (OOM, hardware fault, or deployment) |
| T=0 | 500K WebSocket connections drop. Clients get TCP RST or timeout |
| T=+3s | LB health check fails, stops routing to pod-17 |
| T=+10s | Pod health reaper detects pod-17 is gone, begins cleaning Redis |
| T=+10-30s | Reaper removes ~500K stale entries from connection registry |
| T=+5-30s | Clients begin reconnecting with exponential backoff + jitter |
| T=+30s | 90% of clients reconnected to other pods, new registry entries in Redis |
| T=+30s | Queued notifications in ScyllaDB delivered on reconnect |
The hidden problem: Thundering herd on reconnect
500K clients all lost their connection at the same instant. If they all reconnect immediately, the remaining 199 gateway pods each receive ~2,500 new connections simultaneously on top of their existing 500K. That's manageable per-pod, but the Redis registry writes (500K HSET commands in 5 seconds) can spike Redis CPU.
The client SDK prevents this with jittered exponential backoff (see Section 9.4 for the full reconnection implementation). With jitter, 500K reconnections spread over a 30-second window instead of a 1-second spike.
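A full-jitter exponential backoff in the client SDK might look like this sketch (`backoff_delay` and its parameters are illustrative, not the SDK's actual API):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: a uniform draw over [0, min(cap, base * 2^attempt)] seconds.

    The randomness is what spreads 500K simultaneous reconnects across the
    whole window instead of producing a synchronized spike.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```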
What happens to notifications during the 30-second gap?
Two outcomes, both safe:
- Registry already cleaned: Routing layer can't find the user, queues the notification in ScyllaDB (Section 9.3, Phase 4). Delivered on reconnect.
- Stale entry still in Redis: Routing layer finds the stale entry, publishes to ws-pod-017:deliver. Redis PUBLISH returns 0 (no subscriber, the pod is dead). The routing layer checks this return value: if 0, it queues the notification in ScyllaDB as QUEUED. Delivered on reconnect.
Either way, no notification is lost.
| Metric | Value |
|---|---|
| Data loss | ZERO -- notifications queued in ScyllaDB |
| User impact | 5-30s reconnect delay, then queued notifications delivered |
| Capacity impact | <0.5% (1 of 200 pods) |
11.2 Redis Cluster Failure
SCENARIO: Redis cluster node fails, partial connection registry lost
Redis holds two critical pieces of state: the connection registry (user_id → gateway pod mapping) and the rate limiter counters. When a Redis node fails, a fraction of both are temporarily unavailable.
Impact analysis:
With a 50+ shard Redis Cluster (50 masters, each with a replica), one master failure affects ~2% of hash slots. The replica promotes to master in 10-15 seconds (Redis Cluster failover). During those 10-15 seconds, reads/writes to affected slots fail.
What breaks:
- Connection registry lookups fail for ~2% of users: The routing layer can't find which gateway pod the user is on. It falls back to writing the notification to ScyllaDB as QUEUED. The user's gateway pod will deliver it on the next heartbeat check (30-40 seconds).
- Rate limiter unavailable for ~2% of tenants: Circuit breaker kicks in. Fall back to an in-memory approximate rate limiter. Slightly over-permissive, but delivering a few extra notifications is better than dropping valid ones.
The hidden problem: Registry rebuild after failover
When the replica promotes, it has all the data (assuming no replication lag). But if the master died before replicating recent writes, some connection entries from the last few seconds are lost. These users are technically connected but invisible to the routing layer.
Recovery: Every gateway pod re-registers all its connections every 60 seconds as a background sweep. Within 60 seconds of failover, the registry is fully accurate again.
| Metric | Value |
|---|---|
| Data loss | ZERO -- notifications queued in ScyllaDB |
| User impact | 10-60s delayed delivery for ~2% of users |
| Full recovery | <60s after replica promotion |
What if 10-15 seconds of delayed delivery is unacceptable?
The alternative: dual-write the connection registry to both Redis and ScyllaDB on every connect/disconnect. With 100M concurrent connections and ~30 minute average session length, that's ~55K connects/sec + ~55K disconnects/sec = ~110K extra ScyllaDB writes/sec. When Redis fails, the routing layer falls back to ScyllaDB for lookups and delivers via direct gRPC to pods (bypassing Redis Pub/Sub). Real-time delivery continues at ~10-20ms instead of ~2-5ms.
The tradeoff: 110K extra writes/sec running permanently to avoid a 10-15 second delay that happens a few times a year. For most systems, not worth it. For systems where every second of real-time delivery matters (trading alerts, emergency notifications), it might be.
11.3 ScyllaDB Node Failure
SCENARIO: One ScyllaDB node in a 9-node cluster goes down (simplified for illustration; production sizing per Section 6 requires hundreds of nodes, but the failure mechanics are identical)
ScyllaDB is the persistent backbone -- it stores device tokens, user preferences, notification status, and the scheduled notification queue. Losing a node sounds scary but is the most boring failure in this system.
Why it's boring:
With RF=3 and consistency level QUORUM (read and write), every piece of data exists on 3 nodes. A QUORUM operation needs 2 of 3 replicas. One node dying means 2 are still available. Reads and writes continue without any application-level change.
Timeline:
| Time | Event |
|---|---|
| T=0 | Node-5 goes down |
| T=0 | All queries to data owned by node-5 still succeed (2 of 3 replicas alive) |
| T=+10s | Gossip protocol detects node-5 as DOWN |
| T=+10s | Coordinator nodes stop including node-5 in query plans |
| T=hours | Replacement node joins, streams data from surviving replicas |
What ScyllaDB does better than Cassandra here: No GC pauses during the recovery streaming. Cassandra nodes sometimes pause for 5-10 seconds during large compactions or GC, causing cascading timeouts. ScyllaDB's shard-per-core design means streaming happens on dedicated cores without affecting live queries.
The only real risk: If a second node fails before the first recovers AND both nodes hold replicas of the same partition, QUORUM is lost for that partition (1 of 3 replicas alive). Even then, only the token ranges whose replica sets include both downed nodes are affected; reads and writes everywhere else continue at full consistency, and the affected ranges recover as soon as either node is replaced.
| Metric | Value |
|---|---|
| Data loss | ZERO |
| User impact | NONE -- completely transparent |
| Recovery | Hours (streaming), but service never degraded |
11.4 Kafka Broker Failure
SCENARIO: Kafka broker-7 fails (hardware fault, disk corruption)
Kafka is the durable backbone. Every notification passes through it after the API returns 202 Accepted. Losing a broker is a non-event for data safety, but causes a brief processing blip.
Timeline:
| Time | Event |
|---|---|
| T=0 | Broker-7 dies |
| T=0 | Partitions where broker-7 was leader become temporarily leaderless |
| T=+5-15s | Controller elects new leaders from ISR (in-sync replicas) |
| T=+5-15s | Producers get NOT_LEADER errors, retry with backoff, succeed after election |
| T=+5-15s | Consumers pause on affected partitions, resume after new leaders |
| T=+15s | Fully operational, no data lost |
Why no data is lost:
- RF=3: Every message exists on 3 brokers
- min.insync.replicas=2: A write is only ACKed after 2 replicas confirm
- acks=all: Producer waits for all in-sync replicas before confirming
- Broker-7 dying means 2 replicas survive. New leader elected from those 2
Per-channel topic design helps here: Each delivery channel (web-push, fcm, apns) has its own Kafka topic. If broker-7 hosted leaders for some FCM partitions, only FCM delivery blips. Web push and APNs continue unaffected. Independent scaling, independent failure domains.
Producer behavior during election:
Producer config:
acks=all
retries=5
retry.backoff.ms=100
buffer.memory=64MB // ~5 seconds of buffering at peak
// During 5-15s leader election:
// - Producer buffers messages in 64MB memory
// - Retries on NOT_LEADER_FOR_PARTITION
// - Succeeds after new leader elected
// - If buffer fills: block for max.block.ms (60s), then throw
At 100M notifications/sec with an average payload of 500 bytes, total produce rate is ~50 GB/sec. Only traffic headed for broker-7's leader partitions stalls during the election; in a cluster of, say, ~100 brokers that's roughly 1% of leadership, or ~500 MB/sec. A 15-second election therefore strands ~7.5 GB across all producers. With 2000+ API gateway pods, each buffers ~3.75 MB. Well within the 64 MB buffer.
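The back-of-envelope above can be checked directly. The ~100-broker cluster size is an assumption of this sketch (it is what makes one dead leader roughly 1% of partition leadership); the other figures come from the design itself:

```python
# Buffering math during a 15s leader election (100-broker cluster assumed).
total_rate_bps   = 100_000_000 * 500   # 100M msgs/sec * 500 B = 50 GB/sec
blocked_fraction = 0.01                # broker-7's share of partition leaders
election_secs    = 15
stranded = total_rate_bps * blocked_fraction * election_secs
per_pod  = stranded / 2000             # spread across 2000+ gateway pods

assert stranded == 7_500_000_000       # ~7.5 GB cluster-wide in flight
assert per_pod == 3_750_000            # ~3.75 MB per pod, well under 64 MB
```

The key sensitivity is cluster size: with fewer brokers, a single dead leader blocks a larger fraction of traffic and the per-pod buffer fills proportionally faster.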
| Metric | Value |
|---|---|
| Data loss | ZERO |
| User impact | 5-15 second delivery delay |
| Recovery | Automatic, no manual intervention |
11.5 FCM/APNs Full Service Outage
SCENARIO: Google FCM goes completely down for 2 hours
This actually happens. FCM has had multiple multi-hour outages. When it does, Android delivery stops entirely. The question isn't "how to prevent it" -- that's impossible. The question is "how to not lose a single notification and drain the backlog cleanly when it comes back."
What happens step by step:
- FCM dispatcher workers start getting 500/503 errors on every request
- Error rate crosses 50% threshold within 60 seconds
- Circuit breaker opens. FCM workers stop calling FCM entirely
- Kafka consumer for the notifications.fcm topic pauses its partitions
- Messages accumulate in Kafka. At 30M FCM notifications/sec, that's ~216B messages over 2 hours
- Kafka retention is 24 hours. 2-hour outage is well within buffer
The critical detail: Controlled drain on recovery
When FCM comes back, blasting 216B buffered notifications at once would be a mistake. FCM has its own rate limits, and sending a 2-hour backlog at full speed would just trigger 429s.
The recovery drain works like this:
Circuit breaker: OPEN → (probe every 10s) → HALF_OPEN → (5 successes) → CLOSED
After CLOSED:
1. Resume Kafka consumer with rate limiting
2. Start at 10% of normal throughput
3. Ramp up 10% every 30s if error rate < 5%
4. Full throughput restored in ~5 minutes
5. Backlog drained once stable, using dispatch headroom above the live 30M/sec rate
6. Total drain time for a 2h outage (~216B messages): a few hours, depending on how much dispatch capacity exceeds the live rate
What about notifications with TTL? Some notifications have a time-to-live (e.g., "Your food is arriving in 5 minutes" is useless 2 hours later). During drain, the FCM worker checks TTL before dispatching. Expired notifications are marked EXPIRED in ScyllaDB and skipped. This actually speeds up drain by reducing the effective backlog.
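The ramp schedule and the TTL check can be sketched together. This is an illustrative model under the numbers above (10% start, +10% per healthy 30-second window, FCM at 30M/sec normal throughput); the function and field names are hypothetical:

```python
import time

def drain_plan(normal_rate: float, start_pct: int = 10,
               step_pct: int = 10) -> list:
    """Ramp schedule: start at 10% of normal throughput, add 10% for each
    30s window where the error rate stays under 5%. Ten steps to full."""
    return [normal_rate * pct / 100 for pct in range(start_pct, 101, step_pct)]

def should_dispatch(notification: dict, now: float) -> bool:
    """TTL check during drain: expired notifications are skipped
    (and would be marked EXPIRED in ScyllaDB)."""
    ttl = notification.get("ttl_seconds")
    return ttl is None or now - notification["created_at"] <= ttl

ramp = drain_plan(30_000_000)   # ten 30s steps -> full throughput in ~5 min

now = time.time()
stale = {"created_at": now - 7200, "ttl_seconds": 300}  # 2h-old food ETA
fresh = {"created_at": now - 60, "ttl_seconds": 300}
```

A message with no `ttl_seconds` field is treated as never expiring, which matches the default of delivering everything that was ACKed.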
| Metric | Value |
|---|---|
| Data loss | ZERO -- Kafka buffers everything |
| User impact | No Android notifications for duration of FCM outage |
| Recovery time | A few hours to drain a 2-hour backlog |
| Expired notifications | Skipped during drain (TTL check) |
The exact same pattern applies to APNs outages. Separate Kafka topic, separate circuit breaker, separate drain -- one channel going down doesn't affect the others.
11.6 Scheduler Leader Crash
SCENARIO: The scheduler leader crashes mid-tick while firing scheduled notifications
The scheduler service uses leader election via Redis lock. One leader ticks every second, scanning Redis sorted sets for notifications whose fire time has passed. If it crashes, scheduled notifications stop firing until a new leader takes over.
Timeline:
| Time | Event |
|---|---|
| T=0 | Leader crashes mid-tick (was processing minute-bucket 14:35) |
| T=0 | Redis lock TTL starts counting down (TTL=10s) |
| T=+5s | Hot standby notices leader heartbeat missing, starts attempting lock acquisition |
| T=+10s | Redis lock expires. Standby acquires lock, becomes new leader |
| T=+10s | New leader starts ticking from current time |
| T=+10-15s | Notifications scheduled between T=0 and T=+10s are unfired |
| T=+5min | Consistency sweep runs, catches everything the crash missed |
The gap: What about T=0 to T=+10s?
During the 10-second leadership gap, no scheduled notifications fire. The consistency sweep (runs every 5 minutes) scans ScyllaDB for any PENDING notification with fire_at < now() - 60s. This catches the gap. Worst case: a scheduled notification fires 5 minutes late.
What if the leader crashed partway through a batch? Some notifications from the 14:35 bucket were already produced to Kafka, others weren't. The new leader doesn't know which ones were sent. Solution: idempotency keys. Every scheduled notification has a deterministic idempotency key (notification_id + fire_time). If Kafka already has it, the duplicate is dropped by the dedup layer.
| Metric | Value |
|---|---|
| Data loss | ZERO -- ScyllaDB is source of truth |
| User impact | Up to 5-minute delay for notifications in the gap window |
| Duplicate risk | None -- idempotency keys on every scheduled notification |
11.7 Full Region Failure
SCENARIO: US-EAST region goes completely offline (AWS outage, network partition)
This is the big one. An entire region dying means: all US-EAST Kafka brokers, all Redis nodes, all gateway pods, all ScyllaDB nodes in that DC. Millions of active WebSocket connections drop.
What survives:
- ScyllaDB data: Multi-DC replication means EU-WEST has a full copy of all device tokens, user preferences, notification status. Zero data loss.
- Kafka messages: NOT replicated cross-region (by design). In-flight messages in US-EAST Kafka are unavailable until the region recovers. But: the ingestion layer already ACKed these to the client as 202 Accepted. They're durable in US-EAST Kafka (on disk, RF=3 within the region). When the region comes back, they'll be processed.
What breaks and how it recovers:
| Component | Impact | Recovery |
|---|---|---|
| WebSocket connections | All US-EAST WS connections drop | Clients reconnect to EU-WEST via DNS failover (30-60s) |
| Kafka | US-EAST topics unavailable | New notifications for US users produced to EU-WEST Kafka |
| Redis | US-EAST registry lost | Rebuilt as clients reconnect to EU-WEST gateways |
| Scheduled notifications | US-EAST scheduler stops | EU-WEST scheduler picks up (ScyllaDB has the schedule) |
| In-flight notifications | Stuck in US-EAST Kafka | Delivered when region recovers, or re-fired by sweep |
DNS failover timeline:
| Time | Event |
|---|---|
| T=0 | US-EAST goes dark |
| T=+30s | Health checks fail, DNS TTL starts expiring |
| T=+30-60s | DNS resolves to EU-WEST for US users |
| T=+60-90s | Clients reconnect to EU-WEST gateway pods |
| T=+90s | Service restored with higher latency (cross-Atlantic) |
| T=+5min | EU-WEST auto-scales to handle doubled load |
The cross-Atlantic latency penalty: US users connecting to EU-WEST gateways add ~80-100ms round trip. For real-time notifications this is noticeable but acceptable. WebSocket PING/PONG keep the connection alive despite higher latency. When US-EAST recovers, DNS shifts back and clients reconnect to their local region.
| Metric | Value |
|---|---|
| Data loss | ZERO -- ScyllaDB multi-DC has everything |
| In-flight notifications | Delivered when US-EAST recovers |
| User impact | 30-90s outage during failover, +80ms latency after |
| Full recovery | When US-EAST comes back online |
11.8 Compound Failure: Redis + WebSocket Gateway
SCENARIO: Network partition isolates the Redis cluster from gateway pods, but both are still running
This is worse than either failure alone. The gateway pods are alive and holding WebSocket connections, but can't read or write to the connection registry. The routing layer can't find any user's gateway pod.
What happens:
- Gateway pods detect Redis is unreachable. They keep existing WebSocket connections alive (connections don't depend on Redis).
- New connections can't be registered in Redis. Users who connect during the partition are invisible to routing.
- Routing layer: every Redis lookup fails. Every notification goes to ScyllaDB as QUEUED.
- Real-time delivery stops entirely. All notifications are delayed until either Redis recovers or users reconnect.
Recovery depends on which heals first:
If Redis recovers: Gateway pods re-register all connections in the next 60-second sweep. Routing layer starts finding users again. ScyllaDB queue drains as users are discovered online.
If the partition persists > 5 minutes: Gateway pods start proactively draining ScyllaDB queues for their connected users. Each pod knows which users it's serving (local connection map), so it can pull queued notifications from ScyllaDB and deliver them directly, bypassing the routing layer entirely.
| Metric | Value |
|---|---|
| Data loss | ZERO -- ScyllaDB catches everything |
| User impact | Delayed delivery (seconds to minutes depending on partition duration) |
| Recovery | Automatic once Redis is reachable or via direct ScyllaDB drain |
11.9 When Is Data ACTUALLY Lost?
When is a notification ACTUALLY lost?
Only if Kafka loses committed data. Once the API returns 202 Accepted, the notification is persisted in Kafka with RF=3 and acks=all. From that point, it WILL be delivered eventually, no matter what else fails.
For Kafka to lose a committed message, it takes:
- 3 brokers holding replicas of the same partition all fail simultaneously WITH unrecoverable disk loss
- OR:
acks=allis misconfigured (not the case here) - OR:
min.insync.replicasis set to 1 (set to 2 in this design)
Probability: astronomically low with proper operations.
Mitigations:
- RF=3 across different racks/AZs
min.insync.replicas=2- Rack-aware replica placement
- Regular disk health monitoring
- Kafka tiered storage to S3 (independent durability layer)
Everything downstream of Kafka is derived:
- ScyllaDB notification status: Rebuilt from Kafka replay
- Redis connection registry: Rebuilt from gateway pod registration sweeps
- Redis rate limiter counters: Rebuilt from incoming traffic patterns
- Scheduled notification state: Source of truth is ScyllaDB, backed by Kafka events
The system's durability guarantee comes down to one thing: Kafka is the write-ahead log for every notification. As long as Kafka is intact, every notification is recoverable.
11.10 Failure Scenario Summary
Every scenario above comes back to the same architecture principle: Kafka is the durable backbone, ScyllaDB is the persistent queue, Redis is ephemeral and rebuildable.
- Gateway pod crash? Clients reconnect, ScyllaDB holds queued notifications
- Redis down? Routing falls back to ScyllaDB queuing, registry rebuilds in 60s
- ScyllaDB node down? RF=3 means transparent failover
- Kafka broker down? Leader election in 15s, no data loss
- FCM/APNs outage? Kafka buffers for 24h, controlled drain on recovery
- Scheduler crash? Standby takes over in 10s, sweep catches gaps in 5min
- Full region down? DNS failover to other region, ScyllaDB has all data
At-least-once delivery is guaranteed as long as Kafka is intact. Every delivery path has a fallback. No single component failure causes notification loss.
11.11 Retry Strategy Matrix
| Error Type | First Retry | Max Retries | Backoff | Final Action |
|---|---|---|---|---|
| Kafka produce | 100ms | 5 | Exponential | 503 to client |
| ScyllaDB timeout | 200ms | 3 | Exponential | Serve from cache |
| Redis timeout | 50ms | 3 | Linear | In-memory fallback |
| FCM 429 | 1s | 10 | Exp + jitter | DLQ |
| FCM 500/503 | 500ms | 3 | Exponential | DLQ |
| APNs 429 | 1s | 10 | Exp + jitter | DLQ |
| APNs 500/503 | 500ms | 3 | Exponential | DLQ |
| WS send fail | Immediate | 1 | N/A | Queue in ScyllaDB |
| Template error | N/A | 0 | N/A | DLQ |
| Invalid token | N/A | 0 | N/A | Deactivate token |
11.12 Circuit Breaker Configuration
| Dependency | Error% to Open | Window | Half-Open Probes | Fallback |
|---|---|---|---|---|
| FCM API | 50% | 60s | 5 requests | Buffer in Kafka |
| APNs API | 50% | 60s | 5 requests | Buffer in Kafka |
| ScyllaDB | 30% | 30s | 10 requests | Local cache |
| Redis | 30% | 30s | 10 requests | In-memory |
| Template Service | 50% | 60s | 3 requests | Raw payload |
State machine:
CLOSED ──(error% > threshold)──▶ OPEN
OPEN ──(cooldown 60s)────────▶ HALF_OPEN
HALF_OPEN ──(N successes)──────▶ CLOSED
HALF_OPEN ──(any failure)──────▶ OPEN
11.13 Dead Letter Queue (DLQ)
Topic: notifications.dlq
Each message carries:
• Original payload
• Error code + message
• Attempt count
• Last attempt timestamp
• Source topic + partition
Processing:
1. Auto-retry every 15 min for transient errors (5xx, timeout)
2. Manual triage dashboard for permanent errors (bad payload, bad template)
3. Alert if DLQ depth > 10,000
4. Retain 7 days → archive to S3
12. Deployment Strategy
12.1 Multi-Region Active-Active
Cross-region data strategy:
- Kafka: NOT replicated cross-region. Each region has independent Kafka clusters with independent topics. No MirrorMaker.
- ScyllaDB: Built-in multi-DC replication. Device tokens, user preferences, and notification status are available in every region.
- Redis: NOT replicated (ephemeral, rebuilt locally per region).
- WebSocket: User connects to nearest region, stays there.
Routing: The ingestion layer handles cross-region delivery. When the US API Gateway receives a notification for an EU user, it looks up the user's device registry in ScyllaDB (which is replicated cross-region). The registry tells it the user's home region. The API Gateway then produces directly to the EU Kafka cluster via a cross-region Kafka producer.
Notification for EU user sent from US: US API Gateway → ScyllaDB lookup (user is EU) → produce to EU Kafka → EU routing pipeline → EU delivery → user's WS in EU
Why not MirrorMaker 2? MirrorMaker replicates entire Kafka topics across regions. At 100 GB/sec ingestion, that means copying most of that data cross-region even though 80-90% of notifications are local (US user → US delivery). The cross-region bandwidth cost alone would be enormous. Ingestion-layer routing only sends cross-region traffic for the small percentage of notifications that actually need it. For broadcast notifications (flash sale to all users), the ingestion layer fans out to each region's Kafka independently.
12.2 Technology Stack
| Component | Technology |
|---|---|
| API Gateway | Envoy + gRPC services |
| Message Bus | Apache Kafka 3.7+ |
| Hot Database | ScyllaDB Enterprise |
| Cache / Registry | Redis Cluster 7.2+ |
| Analytics | ClickHouse |
| Cold Storage | S3 + Parquet |
| Templates | PostgreSQL 16 |
| Orchestration | Kubernetes (EKS) |
| Monitoring | Prometheus + Grafana |
| Tracing | Jaeger / Tempo |
| CI/CD | ArgoCD + GitHub Actions |
| Load Balancer | AWS NLB (L4) + Envoy sidecar (L7) |
12.3 Key Trade-off Decisions
| Decision | Chosen | Why |
|---|---|---|
| Hot DB | ScyllaDB over Cassandra/DynamoDB | 10x throughput per node, compatible model, lower tail latency |
| Message Bus | Kafka over Pulsar/NATS | Battle-tested at scale, strongest ecosystem, per-region clusters with ingestion-layer routing |
| Connection Registry | Redis over etcd/ZooKeeper | Sub-ms lookups, built-in pub/sub, good enough for ephemeral data |
| WS vs SSE | Hybrid (WS primary, SSE fallback) | WS for bidirectional ACK, SSE for firewall-hostile environments |
| Scheduling | Redis sorted set + ScyllaDB | O(log N) retrieval, ScyllaDB for durability |
| LB for WS | L4 NLB + Envoy sidecar | L4 for minimal connection overhead, Envoy for per-pod L7 |
| Analytics | ClickHouse over Druid/ES | Better compression, SQL, trillion-row queries, simpler ops |
| WS Node Selection | Centralized registry (Redis) | Sub-ms exact routing, no broadcast waste, handles multi-device |
13. Observability
13.1 Key Metrics (SLIs)
Ingestion: notification.ingested.rate, rejected.rate, ingestion.latency.p99
Routing: notification.routed.rate, routing.latency.p99, fanout.factor
Delivery: notification.sent/delivered/failed.rate (per channel), delivery.latency.p50/p99
Connections: websocket.connections.active (per pod), dropped.rate
Kafka: consumer.lag (per topic/group), produce.error.rate
Scheduler: scheduled.pending, fired.rate, missed.rate, drift.seconds, preloader.behind.seconds, preloader.loaded.rate
DLQ: dlq.depth, ingestion.rate, retry.success.rate
13.2 Critical Alerts
13.3 Distributed Tracing
Every notification carries X-Trace-Id through the entire pipeline. Sampled at 0.1% (= 100K traces/sec). 100% sampling for CRITICAL priority and DLQ messages. Backend: Jaeger or Grafana Tempo.
Trace span chain:
[Ingest] → [Kafka] → [Route] → [Fan-out] → [Deliver:WEB]
→ [Deliver:FCM]
→ [Deliver:APNs]
14. Security
Authentication: Tenant API key + JWT (15-min TTL, refresh tokens). WebSocket authenticates on HTTP Upgrade, and the JWT is validated before the upgrade completes. Mid-session token expiry handled via REAUTH frame without dropping the connection.
Encryption: TLS 1.3 everywhere. mTLS for service-to-service. Kafka and ScyllaDB encrypted at rest. Optional per-tenant payload encryption via AWS KMS.
Tenant Isolation: Every message, every DB row, every Redis key includes tenant_id. Per-tenant Kafka quotas. Per-tenant rate limits. ScyllaDB workload-aware scheduling prevents one tenant's scans from affecting another's latency.
Input Validation: JSON schema validation. Max payload 4 KB. Template variables sanitized. Deep links validated against tenant-registered URL scheme allowlist.
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This System | Learn More |
|---|---|---|
| Kafka | Central event bus, 2M+ messages/sec, topic-per-channel routing | Kafka Deep Dive |
| ScyllaDB | Primary hot store, 2.4 PB notification storage, shard-per-core | ScyllaDB Deep Dive |
| Redis | Connection registry for 100M WebSocket connections, Pub/Sub relay | Redis Deep Dive |
| ClickHouse | Real-time analytics on delivery rates, latency percentiles | ClickHouse Deep Dive |
| PostgreSQL | Tenant configuration, template management, audit logs | PostgreSQL Deep Dive |
| gRPC | Service-to-service communication with protobuf serialization | gRPC Deep Dive |
| Envoy | Service mesh sidecar, mTLS, load balancing, circuit breaking | Envoy Deep Dive |
| Prometheus | Metrics collection for delivery SLOs and pipeline health | Prometheus Deep Dive |
| Grafana | Dashboards for real-time delivery monitoring and alerting | Grafana Deep Dive |
Infrastructure Patterns
| Pattern | Role in This System | Learn More |
|---|---|---|
| WebSocket & Real-Time Communication | 100M concurrent connections, SSE fallback, connection draining | WebSocket & Real-Time |
| Circuit Breaker & Resilience Patterns | FCM/APNs circuit breakers, retry policies, fallback chains | Circuit Breaker & Resilience |
| Load Balancer | L4 load balancing for persistent WebSocket connections | Load Balancer |
| Message Queues & Event Streaming | Kafka topic design, priority queues, dead-letter handling | Event Streaming |
| Rate Limiting & Throttling | Per-tenant rate limits, provider rate management | Rate Limiting |
| Kubernetes | Pod scheduling, connection-aware autoscaling, rolling deploys | Kubernetes Architecture |
| Distributed Tracing | End-to-end notification delivery tracing across 12+ services | Distributed Tracing |
| Secrets Management | FCM/APNs credentials, tenant API keys, mTLS certificates | Secrets Management |
| Service Mesh | Envoy sidecar for mTLS, observability, traffic management | Service Mesh |
| Caching Strategies | Template caching, preference caching, connection state | Caching Strategies |
Practice this design: Design a Notification System -- interview question with progressive hints.
At 100M notifications/sec, every layer is partitioned independently. Kafka is the central bus. Redis handles real-time connection routing. ScyllaDB stores everything durable. The WebSocket node selection problem ("event on any server, connection on one specific pod") is solved with a centralized Redis registry and Pub/Sub relay: O(1) lookup, O(1) delivery, zero broadcast waste.