System Design: Notification Platform, 100M Notifications/Second
Goal: Build a notification platform that pushes 100 million notifications per second across web (WebSocket/SSE), Android (FCM), and iOS (APNs). Support scheduling, broadcast, per-user preferences, and survive failures with at-least-once delivery.
Here's how to architect it, from ingestion to delivery, with scale math and failure scenarios.
1. Problem Statement
Design a notification system that can send 100 million notifications per second across web (real-time push via SSE/WebSocket), Android (FCM), and iOS (APNs), supports scheduled notifications, and handles failures gracefully.
A few clarifications before diving in:
Scale clarification: 100M/sec is massive, on par with the biggest platforms in production today. I'll design for this as a sustained peak, not a burst. Every layer must be horizontally partitioned.
Assumptions I'm making:
- This is a multi-tenant platform (like OneSignal, Pusher, or Firebase itself, serving many apps).
- "Send" means accepted + dispatched. Actual delivery to the user's eyeball depends on device state, OS, and network.
- At-least-once delivery semantics. Exactly-once is impractical at this scale, so deduplicate where possible.
- Users can be on multiple devices simultaneously (phone + laptop + tablet).
- The system is a backend platform. It doesn't build client apps, it provides SDKs and APIs.
What NOT to do:
- Poll clients asking "any new notifications?" (this is push, not pull)
- Store notifications in a single PostgreSQL table and SELECT by user_id (dies at 1M/sec)
- Use Redis as the sole data store (volatile memory, no durability)
- Send FCM/APNs synchronously in the request path (one slow Apple response blocks everything)
- Broadcast every notification to every WebSocket pod (20B internal messages/sec of waste)
- Build one monolith that does ingestion + routing + delivery + scheduling
The trickiest problem in this design? A notification can arrive at any server, but the user's WebSocket lives on one specific pod out of hundreds. Section 9.3 covers how to solve this.
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Send real-time push notifications to web browsers via persistent connection (WebSocket/SSE) | P0 |
| FR-02 | Send push notifications to Android devices via FCM | P0 |
| FR-03 | Send push notifications to iOS devices via APNs | P0 |
| FR-04 | Support scheduled notifications (deliver at a future timestamp, recurring via cron) | P0 |
| FR-05 | Support broadcast notifications (send to all users, segments, topics) | P0 |
| FR-06 | Support targeted notifications (single user, device, or channel) | P0 |
| FR-07 | Track delivery status: sent, delivered, read, failed, retried | P1 |
| FR-08 | Support notification templates with variable interpolation | P1 |
| FR-09 | User-level preference management (opt-in/out per channel, quiet hours) | P1 |
| FR-10 | Multi-tenant isolation | P0 |
| FR-11 | Idempotent delivery (deduplication) | P0 |
| FR-12 | Notification cancellation before delivery | P1 |
| FR-13 | Priority levels (critical, high, normal, low) | P1 |
| FR-14 | Rate limiting per tenant, per user, per channel | P0 |
| FR-15 | Dead letter queue for permanently failed notifications | P1 |
3. Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | Throughput | 100M notifications/sec sustained |
| NFR-02 | Latency (p50 / p99) | < 100ms / < 500ms for real-time channels |
| NFR-03 | Availability | 99.99% (52 min downtime/year) |
| NFR-04 | Durability | No notification loss once accepted (at-least-once) |
| NFR-05 | Horizontal scalability | Linear scale-out, no single bottleneck |
| NFR-06 | Data retention | Hot: 7 days, Warm: 90 days, Cold: 1 year |
| NFR-07 | Concurrent WebSocket/SSE connections | 100M peak (designed for 5x growth to 500M) |
| NFR-08 | Scheduled notification precision | ≤ 1 second drift |
| NFR-09 | Recovery Time Objective (RTO) | < 30 seconds |
| NFR-10 | Recovery Point Objective (RPO) | 0 (no data loss) |
| NFR-11 | Geographic distribution | Multi-region active-active |
| NFR-12 | Tenant isolation | Noisy neighbor protection at every layer |
4. High-Level Approach & Technology Selection
4.1 What Kind of System Is This?
100M notifications per second across three channels: WebSocket for real-time web push, FCM for Android, and APNs for iOS. This is a push pipeline, not a request/response system. A client sends "notify user X" and walks away -- the platform takes responsibility for delivery across every channel, every device, potentially minutes or hours later for scheduled notifications.
At this scale, nothing can be synchronous. The ingestion path must accept and persist the notification immediately (202 Accepted), and the delivery path runs asynchronously -- routing to the right devices, checking preferences, dispatching through channel-specific protocols, tracking status, and retrying failures.
4.2 What Components Are Needed?
The system does four things: accepts notification requests, routes them to users and devices, dispatches through channel-specific protocols, and tracks delivery status. Each step has different scaling requirements:
- Durable message bus -- Decouple ingestion from delivery. If a delivery channel falls behind (FCM rate-limited, APNs down), events must buffer and replay. The bus also enables independent scaling per channel via separate topics.
- Per-channel delivery dispatchers -- Each target (WebSocket, FCM, APNs) speaks a different protocol with different constraints. FCM limits ~1,000 msg/sec per HTTP/2 connection. APNs allows ~4,000 with multiplexing. WebSocket requires finding the exact gateway pod holding the user's connection.
- Connection registry -- With 100M concurrent WebSocket connections spread across 200 gateway pods, the system needs a fast lookup: "which pod holds user X's connection?" This must handle 100K+ connects/disconnects per second (100M connections with even 0.1% churn per second). (Deep dive: Section 9.3)
- High-throughput hot store -- Device tokens, user preferences, notification status -- all at 100M+ writes/sec. The store must handle wide-column access patterns (user → all their devices) with tunable consistency.
- Scheduler -- Future and recurring notifications need a time-indexed store that fires at the right second, backed by a durable store for crash recovery.
4.3 Store Selection
| Store | Technology | Rationale |
|---|---|---|
| Device tokens, preferences, status | ScyllaDB | 100M+ writes/sec, wide-column model for user→device mappings, tunable consistency, linear scale-out, TTL support |
| Connection registry, rate limits, dedup | Redis Cluster | Sub-ms lookups, Pub/Sub for cross-server message relay, ephemeral data |
| Scheduled notifications index | Redis sorted sets | O(log N) range queries by fire time |
| Scheduled notifications store | ScyllaDB | Durability (source of truth for scheduled payloads) |
| Analytics | ClickHouse | Columnar, trillion-row aggregations, SQL |
| Cold archive | S3 + Parquet | Cost-effective, queryable via Athena/Spark |
| Templates | PostgreSQL | Low volume, transactional, relational |
4.4 Why ScyllaDB Over Alternatives
At 100M+ writes/sec with wide-column access patterns (user → devices, user → notification status), the system needs a database built for this workload. Here's how the candidates stack up:
| Dimension | ScyllaDB | Cassandra | DynamoDB | MongoDB |
|---|---|---|---|---|
| Throughput per node | ~1M ops/sec (shard-per-core, C++) | ~100K ops/sec (JVM, shared threads) | Provisioned or on-demand (pay per RCU/WCU) | ~50K ops/sec (WiredTiger) |
| p99 latency under load | < 5ms (no GC pauses) | 10-50ms (GC spikes under pressure) | < 10ms (single-digit) | 10-100ms (variable) |
| Data model fit | Wide-column, perfect for user→device mappings | Same CQL model | Key-value / document, no native wide-column clustering | Document, schema-flexible but poor for wide-column |
| TTL support | Native per-row TTL, zero overhead | Native TTL but tombstone storms cause latency spikes | Native TTL on items | TTL index, separate background deletion |
| Multi-DC replication | Built-in, tunable per-keyspace | Built-in, same model | Global Tables (managed) | Atlas Global Clusters (managed) |
| Cost at scale (PB-range) | ~10x fewer nodes than Cassandra | ~2,000 nodes (JVM overhead) | $$$$ at this write volume | Not viable at this scale |
| Operational complexity | Medium (fewer nodes, less tuning) | High (JVM tuning, compaction tuning, GC tuning) | Low (fully managed) | High at this scale |
Why ScyllaDB wins here:
ScyllaDB's shard-per-core architecture assigns each CPU core its own partition range, memory, and I/O queue. No locks, no GC, no thread contention. That's how a single ScyllaDB node matches 10 Cassandra nodes in throughput. Since it speaks CQL (Cassandra Query Language), the data model, queries, and driver ecosystem are identical. Same proven data model, without Cassandra's operational pain.
DynamoDB would work functionally, but at 100M writes/sec the cost is prohibitive. At $1.25 per million WCUs, that's $125/sec or ~$10.8M/day. ScyllaDB on bare metal costs a fraction of that.
MongoDB can't sustain this write volume with wide-column access patterns. WiredTiger's document-level locking becomes a bottleneck well before 1M ops/sec per node.
5. High-Level Architecture
Here's the architecture, layer by layer.
Bird's-Eye View
Layer Responsibilities
Layer 1, Ingestion: Stateless. Accepts requests, validates, deduplicates, writes to Kafka. Returns 202 only after Kafka confirms persistence. Every pod is identical, so the LB can route to any.
Layer 2, Kafka: The central nervous system. Decouples everything. Provides durability (RF=3), ordering within partitions, replay capability. Separate topics per delivery channel enable independent scaling.
Layer 3A, Routing: The brain. Resolves "send to user X" into "send to device D1 via FCM, device D2 via APNs, and WebSocket connection C1 on server WS-042." Checks preferences, quiet hours, renders templates, fans out to channel topics.
Layer 3B, Scheduler: Leader-elected service. Fires notifications at their scheduled time. Uses Redis sorted sets backed by ScyllaDB for durability.
Layer 4, Delivery: Channel-specific. Each dispatcher speaks the native protocol of its delivery target. WS Gateway holds persistent connections. FCM/APNs dispatchers maintain HTTP/2 connection pools. The hardest part is finding which of 200 pods holds a user's WebSocket connection -- see Section 9.3 for the full deep dive.
Layer 5, Data: ScyllaDB for high-throughput writes. Redis for ephemeral state. ClickHouse for analytics. S3 for cold storage.
6. Back-of-the-Envelope Estimation
Here's the math that shapes the architecture.
6.1 Throughput Breakdown
Target: 100,000,000 notifications/sec
Channel split (assumed):
Web push (real-time): 30% → 30M/sec
Android (FCM): 40% → 40M/sec
iOS (APNs): 30% → 30M/sec
6.2 Connection Math
Registered users: 500M
Peak concurrency (20%): 100M simultaneous WebSocket/SSE connections
Memory per connection: ~20 KB (buffers, session state, TLS)
Total connection memory: 100M × 20 KB = 2 TB
Connections per server: 500K (aggressive but achievable with epoll/io_uring,
ulimit -n 1M+, net.core.somaxconn tuned)
WS Gateway servers needed: 100M / 500K = 200 servers
6.3 Bandwidth
Average notification payload: 1 KB (JSON: title, body, metadata, routing)
Ingestion: 100M/sec × 1 KB = 100 GB/sec inbound
Delivery: ~120M/sec × 1 KB = 120 GB/sec outbound (1.2x fan-out for multi-device)
Aggregate: ~1 Tbps across the cluster
6.4 Storage
Daily volume (at 20% average of peak): 1.73 trillion notifications/day
Hot metadata (200 bytes/notification, 7 days): ~2.4 PB
Warm (90 days, compressed columnar): ~6 PB
Cold (1 year, Parquet on S3): ~25 PB
Practical: Store only metadata + status in hot tier.
Purge full payloads 24h after delivery.
6.5 Kafka Sizing
Ingestion rate: 100 GB/sec
Per-broker write rate: ~200 MB/sec (NVMe + 10GbE)
Brokers for ingestion: 100,000 MB/sec / 200 MB/sec = 500 brokers
Replication factor 3: ~1,500 broker-equivalents of storage
Partitions: 100,000+ (for consumer parallelism)
→ Multi-cluster Kafka deployment, partitioned by tenant hash
6.6 FCM/APNs Dispatcher Sizing
FCM:
Practical limit: ~1,000 msg/sec per HTTP/2 connection
40M/sec → 40,000 concurrent connections to Google
At 100 connections/pod → 400 FCM dispatcher pods
APNs:
Practical limit: ~4,000 msg/sec per HTTP/2 connection (multiplexed)
30M/sec → 7,500 concurrent connections to Apple
At 100 connections/pod → 75 APNs dispatcher pods
6.7 Routing & API Gateway Sizing
API Gateway: stateless request handlers.
100M notifications/sec
Each pod handles ~50K requests/sec (gRPC, validation, Kafka produce)
100M / 50K = 2,000 pods minimum
Routing workers: resolve users → devices, check preferences, fan out.
100M notifications/sec, each requires 2-3 ScyllaDB reads + preference check
Each pod handles ~50K routing decisions/sec
100M / 50K = 2,000 pods minimum
Both are stateless and horizontally scalable. HPA scales them based on CPU and Kafka consumer lag.
6.8 Summary
| Resource | Estimate |
|---|---|
| WS Gateway servers | 200 (500K connections each) |
| Kafka brokers | 500+ (multi-cluster) |
| FCM dispatchers | 400 pods |
| APNs dispatchers | 75 pods |
| API Gateway pods | 2,000+ |
| Routing workers | 2,000+ pods |
| Hot storage (ScyllaDB) | 2.4 PB |
| Connection RAM | 2 TB |
| Network | ~1 Tbps |
Bottom line: every layer scales independently. Shared-nothing architecture. Partition everything.
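The arithmetic in this section can be sanity-checked with a few lines of Python. Every input below is an assumption stated above (rate, payload size, connections per server, per-broker throughput), not a measured number:

```python
# Back-of-the-envelope check for the Section 6 estimates.
RATE = 100_000_000            # notifications/sec (NFR-01)
CONN = 100_000_000            # peak concurrent WebSocket/SSE connections
CONN_PER_SERVER = 500_000     # aggressive epoll/io_uring target
PAYLOAD_B = 1_024             # ~1 KB average payload
BROKER_MB_S = 200             # per-broker sustained write rate (NVMe + 10GbE)

ws_servers = CONN // CONN_PER_SERVER                        # 200 gateway servers
conn_ram_tb = CONN * 20 * 1024 / 1024**4                    # 20 KB each -> ~1.9 TiB (the "2 TB" figure)
ingest_gb_s = RATE * PAYLOAD_B / 1024**3                    # ~95 GiB/s inbound
kafka_brokers = RATE * PAYLOAD_B / 1024**2 / BROKER_MB_S    # ~490 (rounded to 500 in the text)
fcm_pods = (40_000_000 // 1_000) // 100                     # 40M/s at 1K/conn, 100 conns/pod
apns_pods = (30_000_000 // 4_000) // 100                    # 30M/s at 4K/conn, 100 conns/pod

print(ws_servers, fcm_pods, apns_pods)
```

The small gaps against the text (488 vs 500 brokers, 1.9 vs 2 TB) come from GiB-vs-GB rounding; the text rounds up, which is the right direction for capacity planning.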
7. Data Model
7.1 ScyllaDB Schema
Device Tokens, partitioned by user_id (1-10 devices per user = small partitions):
Notification Status, partitioned by user_id, clustered newest-first, 7-day TTL:
Scheduled Notifications, partitioned by minute-level time buckets:
User Preferences:
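A sketch of how these four tables might look in CQL. Column names and types here are illustrative assumptions consistent with the access patterns described above, not a production schema:

```sql
-- Device tokens: partitioned by user_id, 1-10 rows per partition
CREATE TABLE device_tokens (
    user_id       text,
    device_id     text,
    channel       text,          -- WEB_PUSH | FCM | APNS
    token         text,
    active        boolean,
    registered_at timestamp,
    PRIMARY KEY ((user_id), device_id)
);

-- Notification status: newest-first, expired by TTL after 7 days
CREATE TABLE notification_status (
    user_id         text,
    notification_id timeuuid,
    channel         text,
    status          text,        -- QUEUED | SENT | DELIVERED | READ | FAILED
    updated_at      timestamp,
    PRIMARY KEY ((user_id), notification_id, channel)
) WITH CLUSTERING ORDER BY (notification_id DESC)
  AND default_time_to_live = 604800;   -- 7 days

-- Scheduled notifications: partitioned by minute-level time bucket
CREATE TABLE scheduled_notifications (
    bucket          text,        -- e.g. '2026-02-05T10:00'
    fire_at         timestamp,
    notification_id uuid,
    payload         blob,
    status          text,        -- PENDING | FIRED | CANCELLED
    PRIMARY KEY ((bucket), fire_at, notification_id)
);

-- User preferences: one row per user per channel
CREATE TABLE user_preferences (
    user_id     text,
    channel     text,
    opted_in    boolean,
    quiet_start time,
    quiet_end   time,
    PRIMARY KEY ((user_id), channel)
);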
7.2 Redis Data Structures
# ── Connection Registry ──
# "User X is connected on device D, held by server S"
HSET conn:registry:{user_id} {device_id} "{server_id}:{conn_id}"
# "Server S holds these connections" (for drain/cleanup)
SADD conn:server:{server_id} "{user_id}:{device_id}"
# ── Tenant Rate Limiting (sliding window) ──
ZADD ratelimit:tenant:{tenant_id} {timestamp_ms} {notification_id}
ZREMRANGEBYSCORE ratelimit:tenant:{tenant_id} -inf {window_start_ms}
ZCARD ratelimit:tenant:{tenant_id}    # reject if count >= tenant quota
# ── Per-User Rate Limiting (sliding window) ──
ZADD ratelimit:user:{user_id} {timestamp_ms} {notification_id}
# ── Per-Channel Rate Limiting (hourly counter) ──
INCR ratelimit:user:{user_id}:{channel}
EXPIRE ratelimit:user:{user_id}:{channel} 3600
# ── Deduplication ──
BF.ADD dedup:{date} {idempotency_key}
# ── Scheduled Index ──
ZADD scheduled:fire_index:{minute_bucket} {fire_epoch} {notification_id}
# ── Topic Subscriptions ──
SADD channel:subscribers:{topic} {user_id_1} {user_id_2} ...
One cluster or many? One Redis Cluster with 50+ shards handles all workloads: connection registry, rate limits, dedup sets, scheduled notification indexes, and Pub/Sub channels. Redis Cluster hashes keys across shards automatically. Different key prefixes (conn:registry:*, ratelimit:*, scheduled:*) naturally spread across shards. Splitting into separate clusters adds operational overhead without meaningful isolation benefit. If a connection registry shard fails, it doesn't affect rate limit shards because they're already on different shards within the same cluster.
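The sliding-window pattern behind the `ratelimit:*` keys is worth spelling out. A minimal sketch, with the equivalent Redis command noted at each step (in-memory here for illustration; in production the trim + count + add sequence would run as a single Lua script so it stays atomic):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window rate limiter; each line maps to a Redis sorted-set command."""

    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.events = deque()   # stands in for the sorted set, scored by timestamp

    def allow(self, now=None):
        now = time.time() if now is None else now
        # ZREMRANGEBYSCORE ratelimit:{key} -inf {now - window}
        while self.events and self.events[0] <= now - self.window_s:
            self.events.popleft()
        # ZCARD ratelimit:{key}
        if len(self.events) >= self.limit:
            return False
        # ZADD ratelimit:{key} {now} {notification_id}
        self.events.append(now)
        return True

limiter = SlidingWindowLimiter(limit=3, window_s=1.0)
results = [limiter.allow(now=t) for t in (0.0, 0.1, 0.2, 0.3, 1.15)]
# -> [True, True, True, False, True]: the fourth call exceeds the window,
#    the fifth passes after old entries age out
```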
7.3 ClickHouse (Analytics)
8. API Design
8.1 Send Notification
POST /v1/notifications
X-Tenant-Id: tenant_abc
X-Idempotency-Key: 550e8400-e29b-41d4-a716-446655440000
Authorization: Bearer <jwt>
{
"notification_id": "550e8400-e29b-41d4-a716-446655440000",
"type": "TARGETED", // TARGETED | BROADCAST | TOPIC
"priority": "HIGH", // CRITICAL | HIGH | NORMAL | LOW
"channels": ["WEB_PUSH", "FCM", "APNS"], // or ["ALL"]
"recipients": {
"user_ids": ["user_123", "user_456"],
"topic": null,
"segment_id": null
},
"content": {
"title": "Order Shipped",
"body": "Your order #{{order_id}} has been shipped.",
"image_url": "https://cdn.example.com/img/shipped.png",
"deep_link": "app://orders/{{order_id}}",
"template_id": "tmpl_order_shipped",
"template_vars": { "order_id": "ORD-98765" },
"data": { "order_id": "ORD-98765", "carrier": "FedEx" }
},
"schedule": {
"deliver_at": "2026-02-05T10:00:00Z",
"timezone": "America/New_York",
"recurrence": null // or cron: "0 9 * * 1"
},
"options": {
"ttl_seconds": 86400,
"collapse_key": "order_update",
"sound": "default",
"badge_count": 3,
"android": { "channel_id": "orders", "icon": "ic_shipping" },
"ios": {
"category": "ORDER_UPDATE",
"thread_id": "order-98765",
"interruption_level": "active"
},
"web": {
"require_interaction": false,
"actions": [
{"action": "track", "title": "Track Order"},
{"action": "dismiss", "title": "Dismiss"}
]
}
}
}
Response (202 Accepted):
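The response body might look like the following. The field names are illustrative assumptions, mirroring the request schema above:

```json
{
  "notification_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "ACCEPTED",
  "accepted_at": "2026-02-05T09:12:33Z",
  "channels": ["WEB_PUSH", "FCM", "APNS"],
  "status_url": "/v1/notifications/550e8400-e29b-41d4-a716-446655440000/status"
}
```

The key contract: 202 is returned only after Kafka confirms persistence, so `ACCEPTED` means "durably queued," not "delivered."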
8.2 Other Endpoints
GET /v1/notifications/{id}/status → delivery status per channel
DELETE /v1/notifications/{id} → cancel (if not yet delivered)
POST /v1/notifications/batch → up to 1000 per batch
POST /v1/devices → register device token
PUT /v1/users/{id}/preferences → channel opt-in/out, quiet hours
8.3 Internal gRPC (Service-to-Service)
8.4 WebSocket Wire Protocol
Client → Server: CONNECT {"user_id":"user_123","token":"jwt...","channels":["orders"]}
Server → Client: CONNECTED {"connection_id":"conn_abc","server":"ws-pod-042"}
Server → Client: NOTIFICATION {"id":"n_789","title":"...","body":"...","data":{...}}
Client → Server: ACK {"id":"n_789"}
Server → Client: PING
Client → Server: PONG
9. Deep Dives
9.1 Web Push: SSE vs WebSocket
This choice matters. Here's the comparison:
| Dimension | WebSocket | SSE |
|---|---|---|
| Direction | Full-duplex (bidirectional) | Server → Client only |
| Client ACK | Native (client sends frames) | Requires separate HTTP POST |
| Protocol | ws:// (upgrade from HTTP) | Standard HTTP |
| Auto-reconnect | Manual (must implement) | Built into EventSource API |
| HTTP/2 multiplexing | No (dedicated TCP per connection) | Yes |
| Binary support | Yes | Text only |
| Proxy traversal | Often blocked by corporate firewalls | Passes through any HTTP proxy |
| Memory per connection | ~20 KB | ~15 KB |
| Load balancer compat | Needs L4 or WS-aware L7 | Any HTTP LB works |
My Recommendation: WebSocket Primary, SSE Fallback
Why WebSocket as primary: At 100M/sec, inline client ACK is essential. With SSE, every delivery confirmation needs a separate HTTP round-trip. That's potentially 30M extra HTTP requests/sec just for acknowledgments. With WebSocket, the ACK is a tiny frame on the existing connection.
Why SSE as fallback: Corporate firewalls and restrictive proxies block WebSocket upgrades. SSE works over plain HTTP and degrades gracefully. The client SDK auto-detects: try WebSocket first → fall back to SSE if upgrade fails.
Connection Lifecycle (WebSocket)
9.2 Connection Management & Message Routing
The Connection Registry
This is the lookup table that answers: "User X has a WebSocket open. Which of the 200 servers is holding it?"
Targeted Notification Routing
Why not go through the load balancer? The external NLB is only for initial client connections. It routes to a random pod, which is useless when a specific one is needed. Once connected, the pod registers itself in Redis. From that point, the routing layer needs a way to reach that exact pod. Three options:
- Redis Pub/Sub (chosen approach): Each WebSocket pod subscribes to a channel named after itself (e.g., ws-pod-042:deliver). The routing worker publishes to that channel. Simple, with no direct network calls between services.
- Direct pod IP via Kubernetes headless service: A headless Service resolves to individual pod IPs instead of a single VIP. The routing worker looks up the pod IP from Redis and calls it directly over gRPC or HTTP. Faster (no pub/sub hop), but it couples the routing layer to pod networking and breaks if pods restart and get new IPs before Redis is updated.
- Dedicated internal load balancer with sticky routing: A separate internal LB with consistent hashing on user_id. Routes to the same pod every time. No Redis lookup needed for routing, but flexibility is lost: adding or removing pods reshuffles connections.
Option 1 wins. Redis Pub/Sub decouples the routing worker from pod networking entirely. The routing worker doesn't need to know IPs, open connections, or handle retries. It publishes a message and moves on.
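The chosen flow, sketched with in-memory stand-ins for the Redis hash and Pub/Sub (the real implementation would use HGETALL and PUBLISH; all names here are illustrative):

```python
import json
from collections import defaultdict

# In-memory stand-ins for the Redis structures.
registry = {}                    # HSET conn:registry:{user_id} {device_id} "{server}:{conn}"
channels = defaultdict(list)     # Pub/Sub: one channel per WS pod, e.g. "ws-pod-042:deliver"

def register(user_id, device_id, server_id, conn_id):
    # Runs on the WS pod when the client completes the handshake.
    registry.setdefault(user_id, {})[device_id] = f"{server_id}:{conn_id}"

def route(user_id, payload):
    """Routing worker: look up which pod holds each device, publish to it."""
    devices = registry.get(user_id, {})          # HGETALL conn:registry:{user_id}
    if not devices:
        return "QUEUED"                          # user offline -> persist for reconnect
    for device_id, location in devices.items():
        server_id, conn_id = location.split(":")
        # PUBLISH {server_id}:deliver {message}
        channels[f"{server_id}:deliver"].append(
            json.dumps({"conn": conn_id, "device": device_id, "payload": payload}))
    return "PUBLISHED"

register("user_123", "device_web_1", "ws-pod-042", "conn_abc")
print(route("user_123", {"title": "Order Shipped"}))   # PUBLISHED
print(route("user_999", {"title": "Hi"}))              # QUEUED
```

The routing worker never learns a pod IP; it only writes to a named channel, which is exactly the decoupling argued for above.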
Broadcast / Topic Routing (Staged Fan-out)
For a topic with millions of subscribers, direct fan-out from one worker would OOM. Staged fan-out fixes this:
Total broadcast latency for 10M users: < 5 seconds
9.3 How the Right WebSocket Node Is Selected
The hard part: an event can fire from any server, but the user's WebSocket lives on one specific pod. How does it get routed?
The Problem Visualized
Four approaches. Most don't work at this scale.
Approach 1: Sticky Hashing (Route User to Deterministic Pod)
Idea: hash(user_id) % num_pods → always routes to the same pod
user_123 → hash("user_123") % 200 → pod 42 (always)
Event producer knows: "user_123 is on pod 42"
→ Send notification directly to pod 42
Why this DOESN'T work at this scale:
- Pod count changes (scaling, crashes). Rehashing redistributes users → mass disconnects.
- Consistent hashing helps but still moves ~1/N users per pod change.
- Hot users (celebrities) create hotspot pods.
- Multi-device gets brittle: a user's phone and laptop hash to the same pod only if the pod set hasn't changed between their connections; after any scale event, the deterministic lookup points at the wrong pod for connections made under the old ring.
Verdict: Rejected. Too fragile, too many edge cases.
Approach 2: Broadcast to All Pods
Idea: Send every notification to ALL 200 pods.
Each pod checks: "Do I have this user?" If yes, deliver. If no, discard.
Event → broadcast to 200 pods → 199 discard, 1 delivers
Why this DOESN'T work at this scale:
- 100M notifications/sec x 200 pods = 20 BILLION message deliveries/sec internally.
- 99.5% of messages are wasted (delivered to pods that don't have the user).
- Network bandwidth explodes: 100 GB/sec x 200 = 20 TB/sec internal traffic.
Verdict: Rejected. Catastrophically wasteful.
Approach 3: Centralized Registry Lookup (CHOSEN)
Idea: Maintain a fast lookup table in Redis.
When event fires → lookup user → find exact pod → route directly.
Here is the complete end-to-end flow showing how an event on any server reaches the right WebSocket node:
PHASE 1: CONNECTION TIME (user connects, happens ONCE)
PHASE 2: EVENT TIME (notification fires, can happen from ANYWHERE)
PHASE 3: WEB DELIVERY (finding the RIGHT node)
PHASE 4: USER OFFLINE (no active connection)
If HGETALL conn:registry:user_123 returns EMPTY:
- Store in ScyllaDB: status = 'QUEUED'
- When user reconnects to ANY pod (e.g., ws-pod-099):
- ws-pod-099 fetches pending:
SELECT * FROM notification_status WHERE user_id = 'user_123' AND status = 'QUEUED'
- Delivers all pending, updates status to DELIVERED
Total hops: Producer → Kafka → Router → Redis Lookup → Pub/Sub → Pod
Total time: ~50-150ms end-to-end (p99)
Why this works at scale:
- Redis HGETALL on a user key is O(N) where N = devices per user (1-5). Sub-millisecond.
- Redis Pub/Sub delivers to the subscribed pod in ~0.1ms.
- Each WS pod subscribes to exactly ONE channel (its own name). No topic explosion.
- 100M lookups/sec across a Redis Cluster with 50+ shards (each shard handles ~2M ops/sec, so 100M / 2M = 50 shards) is well within capacity.
- Registration and deregistration are both O(1).
Approach 4: Gossip Protocol (Pod-to-Pod)
Idea: Each pod gossips its connection list to neighbors.
Eventually every pod knows the full mapping.
Why this doesn't scale:
- 100M connections x ~100 bytes = 10 GB of state to replicate to every pod.
- Convergence delay: O(log N) gossip rounds = ~8 rounds for 200 pods.
- Connection churn (100K+ connects/disconnects per second) overwhelms gossip bandwidth.
- This would be reinventing a worse Redis.
Verdict: Rejected.
Handling Stale Registry Entries
A connection might drop without clean deregistration (network partition, pod crash, OOM kill). Now Redis says "user_123 is on ws-pod-042" but the connection is dead.
Three recovery mechanisms run in parallel:
MECHANISM 1: HEARTBEAT (proactive, every 30 seconds)
Each WS pod sends PING to all its local connections. No PONG within 10s → connection declared dead:
HDEL conn:registry:user_123 device_web_1
SREM conn:server:ws-pod-042 user_123:device_web_1
Catches: silent TCP drops, client-side crashes, network glitches
Detection time: 30-40 seconds
MECHANISM 2: POD HEALTH REAPER (reactive, every 30 seconds)
Background service checks health of all pods in conn:server:* keys.
For crashed pods → bulk deregister all connections:
SMEMBERS conn:server:ws-pod-042
→ for each: HDEL conn:registry:{user_id} {device_id}
DEL conn:server:ws-pod-042
Catches: pod OOM kills, hardware failures, AZ outages
Detection time: 10-30 seconds
MECHANISM 3: TTL SAFETY NET (passive)
All Redis keys have 24h TTL. Connections refresh TTL on each PONG. Even if mechanisms 1 and 2 miss it, stale entries expire.
Catches: everything else
Detection time: up to 24 hours (worst-case safety net)
Recovery after stale entry is cleared:
- Next delivery attempt sees empty registry
- Notification stored in ScyllaDB as QUEUED
- When user reconnects → pending notifications delivered
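Mechanism 2 amounts to a sweep over the reverse index. A sketch against in-memory stand-ins for the two Redis structures (the health-probe trigger and key names are assumptions):

```python
# In-memory stand-ins for the two Redis structures.
conn_registry = {                      # HSET conn:registry:{user_id} {device_id} location
    "user_123": {"device_web_1": "ws-pod-042:conn_a"},
    "user_456": {"device_web_9": "ws-pod-007:conn_b"},
}
server_index = {                       # SADD conn:server:{pod} "{user_id}:{device_id}"
    "ws-pod-042": {"user_123:device_web_1"},
    "ws-pod-007": {"user_456:device_web_9"},
}

def reap_dead_pod(pod_id):
    """Bulk-deregister every connection a crashed pod was holding."""
    # SMEMBERS conn:server:{pod}, then DEL conn:server:{pod}
    for member in server_index.pop(pod_id, set()):
        user_id, device_id = member.split(":")
        # HDEL conn:registry:{user_id} {device_id}
        conn_registry.get(user_id, {}).pop(device_id, None)

reap_dead_pod("ws-pod-042")   # e.g. the health probe to ws-pod-042 timed out
```

The reverse index is what makes this O(connections-on-that-pod) instead of a scan over all 100M registry entries.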
9.4 Load Balancer & Persistent Connections
The Core Problem
Standard HTTP LBs distribute per-request. WebSocket connections last minutes to hours. Three tensions:
- Connection affinity: Once upgraded, all frames must go to the same backend. LB can't re-route mid-connection.
- Rolling deploys: New pods replace old ones. Old pods must drain 500K connections gracefully.
- Uneven distribution: Long-lived connections cause accumulation. A pod started 2 hours ago has way more connections than one started 2 minutes ago.
Architecture: L4 NLB + Envoy Sidecar
Graceful Drain on Deploy / Scale-Down
Timeline:
T+0s Kubernetes sends SIGTERM (preStop hook fires)
│
├── Mark self "draining" in Redis:
│ SET ws-pod-042:status "draining" EX 600
│
├── Deregister from NLB target group
│ (NLB stops sending NEW TCP connections)
│
├── Send WebSocket CLOSE frame (code 1001: Going Away)
│ to ALL 500K connected clients:
│ "Server shutting down, please reconnect"
│
T+0-5s Clients receive CLOSE frame
│
├── Well-behaved clients ACK and start reconnecting
│ to NLB → routed to different healthy pods
│
T+30s Send PING to remaining connections
│
├── No PONG → force-close
│
T+300s terminationGracePeriodSeconds expires
│
├── Kubernetes sends SIGKILL
│
├── Cleanup: bulk deregister from Redis
│ SMEMBERS conn:server:ws-pod-042 → HDEL all
│ DEL conn:server:ws-pod-042
Recovery: Users who reconnected to new pods already have
their registrations updated. Pending undelivered notifications
are fetched from ScyllaDB on reconnect.
Client Reconnection: Avoiding the Thundering Herd
When 500K clients disconnect simultaneously, they must NOT all reconnect at the same instant.
With 500K clients and 0-2s jitter, reconnections spread to ~250K/sec, well within NLB capacity.
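A common client-side policy for this is exponential backoff with full jitter. A sketch (the base and cap values are assumptions, tuned per SDK):

```python
import random

def reconnect_delay(attempt, base=0.5, cap=30.0):
    """Exponential backoff with full jitter: the delay is drawn uniformly
    from [0, min(cap, base * 2**attempt)], so disconnected clients spread
    their reconnects instead of hammering the NLB in lockstep."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

delays = [reconnect_delay(a) for a in range(6)]
# attempt 0 -> up to 0.5s, attempt 1 -> up to 1s, ..., later attempts capped at 30s
```

Full jitter (rather than adding a small random offset to a fixed backoff) is what flattens the reconnect wave: every client samples the whole interval, so arrival rate stays roughly uniform.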
9.5 Mobile Push: FCM & APNs
FCM (Android)
Error handling:
200 OK → SENT, commit offset
UNREGISTERED → Deactivate token in ScyllaDB
INVALID_ARGUMENT → DLQ (bad payload)
429 → Backoff, pause partition consumer
500/503 → Retry 3x with exponential backoff, then DLQ
APNs (iOS)
Error handling:
200 OK → SENT, commit offset
BadDeviceToken → Deactivate token
Unregistered → Deactivate (check timestamp > last registration)
ExpiredProvider → Regenerate JWT, retry
429 → Backoff
500/503 → Retry 3x, then DLQ
Key APNs detail:
410 Unregistered includes a timestamp. Only deactivate if
Apple's timestamp is AFTER the last token registration.
Otherwise the user may have re-registered on a new device.
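That timestamp guard reduces to a small pure function. A sketch (the exact response shape from Apple's API is abstracted into two millisecond timestamps here):

```python
def handle_apns_unregistered(apns_timestamp_ms, last_registered_ms):
    """APNs 410 Unregistered carries the time Apple last saw the token as
    invalid. Deactivate only if that is AFTER our last registration --
    otherwise the token was re-registered since and is still live."""
    if apns_timestamp_ms > last_registered_ms:
        return "DEACTIVATE_TOKEN"
    return "KEEP_TOKEN"

# Apple invalidated the token after we last registered it -> deactivate.
assert handle_apns_unregistered(1_700_000_500_000, 1_700_000_000_000) == "DEACTIVATE_TOKEN"
# The user re-registered after Apple's invalidation -> keep the token.
assert handle_apns_unregistered(1_700_000_000_000, 1_700_000_500_000) == "KEEP_TOKEN"
```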
9.6 Deduplication
Ingestion-Layer Deduplication (FR-11)
Every notification carries an X-Idempotency-Key header. The API gateway checks this key against a Bloom filter in Redis before writing to Kafka.
API Gateway receives notification:
1. key = X-Idempotency-Key (or notification_id if no key provided)
2. BF.ADD dedup:{today} key (a single atomic call; avoids a check-then-set race)
→ returns 0 (key probably already seen): reject with 409 Conflict
→ returns 1 (key newly added): proceed to Kafka
3. Bloom filter rotates daily (dedup:{yesterday} deleted at midnight)
Why a Bloom filter and not a Redis SET? Idempotency keys attach to API requests, not to fan-out deliveries (a broadcast to 10M users carries one key), but even at ~1B unique requests/day, storing every key as a SET member would consume ~100 GB/day (100 bytes per key x 1B daily keys). A Bloom filter with a 0.1% false-positive rate needs ~14.4 bits per element, roughly 1.8 GB for the same volume. The tradeoff: 0.1% of unique notifications are falsely rejected as duplicates. At this scale, that's acceptable. The alternative (100 GB of Redis memory per day) is not.
False positives mean a tiny fraction of legitimate notifications get a 409. The client retries with a new idempotency key. False negatives (letting a duplicate through) are impossible with Bloom filters. That's the right direction for at-least-once semantics: better to occasionally reject a valid notification than to deliver a duplicate.
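The memory comparison follows from the standard Bloom filter sizing formula m = -n·ln(p) / (ln 2)². A quick check, assuming 1B unique keys/day at ~100 bytes each and a 0.1% false-positive target:

```python
import math

n = 1_000_000_000          # unique idempotency keys per day (assumption)
p = 0.001                  # target false-positive rate

set_bytes = n * 100                                 # SET: ~100 bytes per stored key
bloom_bits = -n * math.log(p) / math.log(2) ** 2    # ~14.4 bits per element
bloom_bytes = bloom_bits / 8

print(f"SET:   {set_bytes / 1e9:.0f} GB/day")       # 100 GB
print(f"Bloom: {bloom_bytes / 1e9:.1f} GB/day")     # 1.8 GB
```

A roughly 50x memory reduction, bought with a bounded, tunable false-positive rate.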
Delivery-Layer Deduplication
The ingestion Bloom filter catches most duplicates, but Kafka producer retries can still produce the same message twice (the Bloom filter saw it once, marked it, but Kafka didn't ACK in time, so the API gateway retried). Delivery workers guard against this: before dispatching, they check notification_status in ScyllaDB. If the (user_id, notification_id, channel) combination already has status SENT or DELIVERED, the worker skips it. This is a cheap read (partition key lookup) and prevents the user from seeing the same notification twice.
9.7 Multi-Channel Delivery Decision
9.8 Scheduled Notifications
Architecture
Handling 1 Billion Scheduled Notifications
One giant Redis sorted set won't fit in memory (1B x 100 bytes = 100 GB). Solution: partition by minute.
scheduled:fire_index:2026-02-05T10:00 ← only notifications for this minute
scheduled:fire_index:2026-02-05T10:01
...
Average per bucket: 1B / 525,600 min/year ≈ 1,900 entries
Peak (100x): ~190,000 entries per bucket ← trivially fits in memory
Scheduler only reads the CURRENT minute's bucket.
Background job pre-populates Redis buckets 10 minutes ahead from ScyllaDB.
Cancellation
1. API: DELETE /v1/notifications/{id}
2. ScyllaDB: UPDATE status = 'CANCELLED'
3. Redis: ZREM scheduled:fire_index:{bucket} {id}
(only if fire_at is within 10 min and already pre-loaded into Redis.
For far-future cancellations, step 2 is sufficient — the pre-loader
skips CANCELLED entries.)
4. If already in Kafka but not delivered:
→ Delivery workers check: SISMEMBER cancelled:notifications {id}
→ If found, skip delivery
Scheduling Lifecycle (End-to-End)
Here's what happens from the moment a client sends a scheduled notification to the moment it fires:
If fire_at is within 10 minutes, the API writes to both ScyllaDB and Redis. The notification is ready to fire almost immediately. For far-future notifications (hours, days, months), the API writes to ScyllaDB alone. A background pre-loader continuously scans ScyllaDB and populates Redis buckets 10 minutes before fire time. This keeps Redis lean. A notification scheduled 30 days out doesn't sit in Redis for a month. If Redis loses an entry, the consistency sweep catches it. If ScyllaDB is slow, the API retries. The notification is only acknowledged to the client after the ScyllaDB write succeeds.
Time Wheel Deep Dive
The scheduler leader runs a tight loop that ticks every second:
EVERY 1 SECOND:
1. current_bucket = format(now(), "YYYY-MM-DDTHH:mm")
2. entries = ZRANGEBYSCORE scheduled:fire_index:{current_bucket} -inf {now_epoch} LIMIT 0 1000
3. For each entry:
a. Fetch full payload from ScyllaDB
b. Produce to Kafka (notifications.ingest topic)
c. ZREM from Redis
d. UPDATE status = 'FIRED' in ScyllaDB
4. If entries.length == 1000, immediately loop again (drain the bucket)
The LIMIT of 1000 per iteration keeps individual batches small without slowing the drain: step 4 loops back-to-back, so a minute bucket with 50K entries (flash sale, Black Friday campaign) empties in ~50 consecutive batches within a few seconds. That's fine, because the precision target is ≤ 1 second drift for normal load and a few seconds for extreme spikes.
Why a leader-elected single writer? Multiple schedulers reading the same bucket would fire duplicate notifications. Leader election via Redis lock (SETNX with TTL) is simple and the failover time (< 5 seconds) is acceptable. The hot standby watches the lock and takes over immediately on expiry.
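A minimal in-memory stand-in for the SETNX-with-TTL lock, with an injectable clock so the expiry path is easy to exercise. `LeaderLock` and its method names are hypothetical; real code would issue SET scheduler:leader {node_id} NX PX {ttl_ms} against Redis and renew before expiry.

```python
import time

class LeaderLock:
    """In-memory sketch of a Redis SETNX+TTL leader lock."""

    def __init__(self, ttl=5.0, clock=time.monotonic):
        self.ttl, self.clock = ttl, clock
        self.holder, self.expires_at = None, 0.0

    def try_acquire(self, node_id):
        # Succeeds if the lock is free, expired, or already held by this node
        # (re-acquiring renews the TTL, like the leader's heartbeat).
        now = self.clock()
        if self.holder is None or now >= self.expires_at or self.holder == node_id:
            self.holder, self.expires_at = node_id, now + self.ttl
            return True
        return False
```

The hot standby simply calls `try_acquire` in a loop; it wins only once the leader stops renewing and the TTL lapses.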
Recurring Notifications (Cron)
For recurring schedules like "every Monday at 9 AM," the recurrence field holds a cron expression:
recurrence: "0 9 * * 1" → every Monday at 09:00 UTC
recurrence: "0 */6 * * *" → every 6 hours
recurrence: null → one-shot, no recurrence
After the scheduler fires a recurring notification:
1. Fire the current occurrence (produce to Kafka)
2. Compute next fire time from cron expression
3. INSERT new row in ScyllaDB with next fire_at, status = PENDING
4. ZADD the new fire time into the appropriate minute bucket in Redis
5. UPDATE current row status = 'FIRED'
Each occurrence is a separate ScyllaDB row. This gives a complete audit trail (every past firing is queryable) and avoids mutation of a single row that multiple processes might read.
To stop a recurring notification, the client sends DELETE /v1/notifications/{id}. The scheduler checks status before firing. If CANCELLED, it skips and does NOT compute the next occurrence.
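For the specific "every Monday at 9 AM" case, the next-fire computation (step 2 above) can be sketched with the stdlib alone; a production scheduler would delegate to a cron library. `next_weekly_fire` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def next_weekly_fire(after: datetime, weekday: int, hour: int, minute: int = 0) -> datetime:
    """Next occurrence of 'every <weekday> at HH:MM' strictly after `after`.

    weekday follows datetime.weekday(): 0 = Monday ... 6 = Sunday.
    """
    candidate = after.replace(hour=hour, minute=minute, second=0, microsecond=0)
    candidate += timedelta(days=(weekday - after.weekday()) % 7)
    if candidate <= after:          # same-day time already passed
        candidate += timedelta(days=7)
    return candidate
```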
Timezone Handling
Scheduled times are stored in UTC internally. The API converts on ingestion:
Client sends: deliver_at: "2026-03-10T09:00:00", timezone: "America/New_York"
API converts: fire_at: "2026-03-10T13:00:00Z" (March 10 falls after the March 8 spring-forward, so EDT is UTC-4)
Stored in: ScyllaDB and Redis as UTC epoch
DST transitions matter. "Every day at 9 AM Eastern" shifts between UTC-5 (EST) and UTC-4 (EDT). For recurring notifications, the scheduler recomputes the UTC fire time for each occurrence using the stored timezone. It doesn't just add 24 hours blindly.
March 7, 2026 (before spring forward): 9 AM ET = 14:00 UTC
March 8, 2026 (after the 2 AM spring forward): 9 AM ET = 13:00 UTC ← 1 hour earlier in UTC
The cron library handles this by resolving the cron expression in the user's timezone first, then converting to UTC.
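The conversion is easy to verify with Python's stdlib zoneinfo (3.9+; it needs a system tz database or the tzdata package). A sketch showing the one-hour UTC shift across the 2026 spring-forward:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

NY = ZoneInfo("America/New_York")

def to_utc_fire_at(local_naive: datetime, tz: ZoneInfo) -> datetime:
    # Attach the tenant's timezone, then convert to UTC for storage
    return local_naive.replace(tzinfo=tz).astimezone(timezone.utc)
```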
Missed Fire Recovery
Three things can cause a missed fire: scheduler leader crash, Redis data loss, or a bug that skips an entry.
The consistency sweep catches all of them:
EVERY 5 MINUTES (runs on leader):
1. Compute buckets to scan:
past_buckets = last 10 minute buckets (now - 10min to now - 1min)
2. For each bucket:
SELECT * FROM scheduled_notifications
WHERE fire_time_bucket = '{bucket}'
AND fire_at < now() - 60s
AND status = 'PENDING'
LIMIT 10000
3. For each missed notification:
a. Re-index in Redis: ZADD scheduled:fire_index:{bucket} {fire_epoch} {id}
b. The normal time wheel picks it up on the next tick
4. Log metric: scheduled.missed.recovered.count
The query scans by fire_time_bucket (partition key) so it's efficient. Without the partition key, ScyllaDB would do a full table scan across all nodes, which is catastrophic at this data volume.
The 60-second grace period avoids racing with the time wheel (which might be about to fire the notification). Worst case, a missed notification fires roughly one sweep interval (5 minutes) plus the grace period after its target time. For a leader crash with < 5 second failover, most scheduled notifications fire within 10 seconds of their target time.
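The bucket computation in step 1 of the sweep is simple enough to pin down in code; `sweep_buckets` is a hypothetical helper name.

```python
from datetime import datetime, timedelta

def sweep_buckets(now: datetime, lookback_minutes: int = 10):
    """Minute buckets to scan: from now-10min up to now-1min, oldest first."""
    base = now.replace(second=0, microsecond=0)
    return [(base - timedelta(minutes=m)).strftime("%Y-%m-%dT%H:%M")
            for m in range(lookback_minutes, 0, -1)]
```

Each returned key is a ScyllaDB partition key (fire_time_bucket), so every scan stays on one partition.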
9.9 Notification Templates
Templates live in PostgreSQL. Low volume, transactional, relational queries like "all templates for tenant X." Out of 100M notifications/sec, maybe 5% use templates. That's 5M template renders per second, but spread across only a few hundred distinct templates per tenant.
Where Rendering Happens
Template rendering runs in the routing layer, not at ingestion. The routing worker already fetches user preferences and device tokens from ScyllaDB for every notification. Adding a template lookup is one more read, and it's almost always a cache hit.
Routing worker receives notification with template_id + template_vars
→ Cache lookup: template:{tenant_id}:{template_id}:{version}
→ Cache hit (99%+): render immediately
→ Cache miss: fetch from PostgreSQL, populate cache (TTL 5 min)
→ Mustache-style interpolation: "Your order #{{order_id}}" → "Your order #ORD-98765"
→ Result becomes the notification payload for downstream delivery
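A minimal sketch of the mustache-style interpolation, including the default_vars fallback described under Error Handling below. `render` and its regex are illustrative assumptions, not the platform's actual template engine.

```python
import re

_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template, template_vars, default_vars=None):
    """Mustache-style interpolation with fallback to template defaults."""
    defaults = default_vars or {}

    def sub(match):
        key = match.group(1)
        if key in template_vars:
            return str(template_vars[key])
        if key in defaults:
            return str(defaults[key])
        raise KeyError(key)  # caller falls back to the raw payload + warning

    return _VAR.sub(sub, template)
```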
Why not render at ingestion time? Because the routing worker already has the user context (locale, device type, preferences). A template might produce different output per device: shorter title for APNs (limited to 178 bytes), longer body for web push. Rendering at routing time allows tailoring the output per channel.
Template Schema
Versioned. When a tenant updates a template, a new version row is inserted rather than mutating the old one. Notifications already in flight keep using the version they were created with. No race between "template updated" and "notification being rendered right now."
Caching Strategy
Templates are a classic read-heavy, write-rare workload. A few hundred templates per tenant, updated maybe a few times a day, read millions of times per second.
Each routing worker keeps a local in-memory LRU cache (bounded at 10K entries, roughly 50 MB). Cache key: {tenant_id}:{template_id}:{version}. TTL: 5 minutes. On a miss, the worker fetches from PostgreSQL and populates the cache.
Why not Redis for template caching? PostgreSQL handles the read volume fine here. Templates are small, the indexes are tight, and the total dataset fits comfortably in PostgreSQL's buffer cache. Adding Redis as an intermediate layer for a low-volume dataset adds a network hop and operational complexity for no meaningful latency win. The local LRU is faster (zero network) and sufficient given how few distinct templates exist per tenant.
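The local LRU with TTL described above can be sketched with an OrderedDict; `TTLLRUCache` is a hypothetical name, and the clock is injectable for testing.

```python
import time
from collections import OrderedDict

class TTLLRUCache:
    """Bounded LRU cache with per-entry TTL (sketch of the routing worker's local cache)."""

    def __init__(self, max_entries=10_000, ttl=300.0, clock=time.monotonic):
        self.max_entries, self.ttl, self.clock = max_entries, ttl, clock
        self._d = OrderedDict()  # key -> (expires_at, value); insertion order = LRU order

    def get(self, key):
        item = self._d.get(key)
        if item is None:
            return None
        expires, value = item
        if self.clock() >= expires:      # lazy TTL expiry on read
            del self._d[key]
            return None
        self._d.move_to_end(key)         # mark as most recently used
        return value

    def put(self, key, value):
        self._d[key] = (self.clock() + self.ttl, value)
        self._d.move_to_end(key)
        if len(self._d) > self.max_entries:
            self._d.popitem(last=False)  # evict least recently used
```

Cache keys would follow the {tenant_id}:{template_id}:{version} scheme from above.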
Error Handling
What happens when a template fails to render? Missing variable, bad syntax, PostgreSQL unreachable.
Template render fails:
1. Missing variable in template_vars:
→ Fall back to default_vars from template definition
→ If still missing: send raw payload without template, log warning
2. template_id doesn't exist:
→ DLQ immediately (bad request, won't fix itself on retry)
3. PostgreSQL unreachable:
→ Circuit breaker opens (see Section 11.12)
→ Serve from local LRU cache (stale template > no notification)
→ If cache also empty: send raw payload from notification content
A template problem should never cause a dropped notification. Worst case, the user gets an unformatted notification with the raw content fields. That's better than silence.
9.10 Priority Queues
The API accepts four priority levels: CRITICAL, HIGH, NORMAL, LOW. But accepting them in a JSON field and actually enforcing them through the pipeline are different things. Without separate processing paths, a flood of LOW priority marketing blasts would delay CRITICAL alerts like fraud detection, OTP codes, or security warnings.
Separate Kafka Topics Per Priority
notifications.routed.web.critical → dedicated consumer group (always has capacity)
notifications.routed.web.high → shared consumer group
notifications.routed.web.normal → shared consumer group
notifications.routed.web.low → shared consumer group (first to shed)
Same pattern for .fcm.* and .apns.*
The routing worker reads the notification's priority field and produces to the matching topic. CRITICAL gets its own topic with dedicated consumers that never share resources with lower priorities.
Why not one topic with consumer-side priority? Kafka doesn't support per-message priority within a partition. Messages process in offset order. A consumer can't skip past 10K LOW messages to reach a CRITICAL message stuck behind them. Separate topics give real priority isolation at the Kafka level.
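The routing worker's topic selection is a pure string mapping; a sketch under the topic-naming scheme above (`routed_topic` is a hypothetical helper, and the unknown-priority fallback is an assumption):

```python
PRIORITIES = {"CRITICAL", "HIGH", "NORMAL", "LOW"}

def routed_topic(channel: str, priority: str) -> str:
    # channel is one of: web, fcm, apns; unknown priorities fall back to NORMAL
    p = priority if priority in PRIORITIES else "NORMAL"
    return f"notifications.routed.{channel}.{p.lower()}"
```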
Consumer Allocation
| Priority | Consumer Pods | Kafka Partitions | Notes |
|---|---|---|---|
| CRITICAL | 200 (dedicated) | 200 | Never shared, never scaled down |
| HIGH | 600 (shared) | 1000 | Scales with load |
| NORMAL | 800 (shared) | 2000 | Bulk of traffic |
| LOW | 400 (shared) | 1000 | Shed first under pressure |
CRITICAL consumers are over-provisioned on purpose. They sit mostly idle. That's the point. When a bank sends a fraud alert, it goes through a pipeline with plenty of headroom, not one competing with a retailer's daily deal blast.
Backpressure and Load Shedding
When the system is under pressure (Kafka consumer lag growing, delivery channels rate-limited), LOW priority notifications slow down first.
Backpressure response:
1. Consumer lag > 100K on LOW topic for 2+ minutes:
→ Pause 50% of LOW consumers, redistribute pods to HIGH
2. Consumer lag > 500K on NORMAL topic:
→ Pause all LOW consumers
→ Alert: "LOW priority delivery paused, backpressure active"
3. Consumer lag > 100K on CRITICAL topic:
→ Page on-call (this should never happen)
→ Auto-scale CRITICAL consumers (already over-provisioned, so investigate root cause)
LOW priority notifications are not dropped during backpressure. They stay in Kafka (24-hour retention) and get processed once the pressure eases. The real-world effect: marketing emails arrive a few minutes late during a traffic spike. Nobody notices.
9.11 Rate Limiting
Three levels, each checked at a different point in the pipeline.
Where Each Check Happens
API Gateway (ingestion): tenant-level rate limit
Routing Worker: per-user rate limit + per-channel rate limit
Tenant limits are enforced early, at the API gateway, to reject excessive traffic before it touches Kafka. Per-user and per-channel limits run later, at routing, because the routing worker is the one that knows the target user and their channels.
Redis Data Structures
# Tenant-level: sliding window (already defined in Section 7.2)
ZADD ratelimit:tenant:{tenant_id} {timestamp_ms} {notification_id}
# Per-user: sliding window
ZADD ratelimit:user:{user_id} {timestamp_ms} {notification_id}
# Per-channel per-user: simple counter with TTL
INCR ratelimit:user:{user_id}:fcm
EXPIRE ratelimit:user:{user_id}:fcm 3600 NX # NX (Redis 7+): set TTL only if none exists, so the 1h window is fixed, not rolling
Default Limits
| Level | Default Limit | Window | Configurable? |
|---|---|---|---|
| Tenant | 10,000/sec | Sliding 1s | Yes, per contract |
| Per-user | 100/hour | Sliding 1h | Yes, per tenant config |
| Per-channel (FCM) | 50/hour | Fixed 1h | Yes |
| Per-channel (APNs) | 50/hour | Fixed 1h | Yes |
| Per-channel (Web) | 200/hour | Fixed 1h | Yes |
Web gets a higher default because web push is lightweight. No third-party API call, just a WebSocket frame on an existing connection. Users on the web also tend to tolerate higher frequency since they can dismiss notifications instantly.
Sliding Window Implementation
Tenant-level rate limiting uses a sliding window log in Redis. Four commands, pipelined into one round trip:
function checkTenantRateLimit(tenantId, notificationId, limit, windowMs):
key = "ratelimit:tenant:{tenantId}"
now = currentTimeMs()
windowStart = now - windowMs
pipeline:
ZREMRANGEBYSCORE key -inf windowStart // prune expired entries
ZADD key now notificationId // add current request
ZCARD key // count entries in window
EXPIRE key (windowMs / 1000 + 60) // TTL safety net
count = pipeline.execute()[2] // ZCARD result
if count > limit:
ZREM key notificationId // remove the one just added
return RATE_LIMITED
return OK
At 100M notifications/sec across 50+ Redis shards, each shard handles roughly 2M rate limit ops/sec. Comfortable.
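The same sliding-window algorithm can be exercised without Redis against an in-memory sorted list. This is a unit-testing aid for the limit logic, not a production substitute (`SlidingWindowLimiter` is a hypothetical name):

```python
import bisect

class SlidingWindowLimiter:
    """In-memory equivalent of the Redis sliding-window log."""

    def __init__(self, limit: int, window_ms: int):
        self.limit, self.window_ms = limit, window_ms
        self.events = []  # sorted event timestamps in ms

    def check(self, now_ms: int) -> str:
        # Prune entries older than the window (ZREMRANGEBYSCORE)
        cutoff = now_ms - self.window_ms
        del self.events[:bisect.bisect_right(self.events, cutoff)]
        # Count entries in the window (ZCARD); reject if already at limit
        if len(self.events) >= self.limit:
            return "RATE_LIMITED"
        bisect.insort(self.events, now_ms)  # admit this request (ZADD)
        return "OK"
```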
What Happens When Rate Limited
Tenant rate limited at API Gateway:
→ 429 Too Many Requests
→ Response includes Retry-After header
→ Logged for billing and abuse monitoring
User rate limited at Routing Worker:
→ Notification silently dropped (not queued, not retried)
→ Status set in ScyllaDB: RATE_LIMITED
→ Visible in tenant's analytics dashboard
→ Does NOT count against the tenant's quota (already passed that check)
User-level rate limiting protects the end user from notification spam, not the system from overload. If a buggy integration fires 1,000 notifications at the same person in an hour, only the first 100 get through. The rest are dropped with status RATE_LIMITED so the tenant can see what happened in their dashboard and fix their integration.
Priority Overrides
CRITICAL notifications bypass per-user and per-channel rate limits. A fraud alert or OTP code should never be blocked because the user already got too many marketing notifications that hour. Tenant-level limits still apply though. Even CRITICAL traffic can't exceed the tenant's contracted throughput.
Routing Worker rate limit check:
if notification.priority == CRITICAL:
skip per-user rate limit
skip per-channel rate limit
else:
check per-user rate limit → RATE_LIMITED? drop + log
check per-channel rate limit → RATE_LIMITED? drop + log
9.12 Zero-Downtime Deploys for WebSocket Gateways
Every team deploying WebSocket servers eventually asks this question: can new code ship without disconnecting users? With 200 gateway pods holding 100M connections, the stakes are high. A bad rollout could cascade through the entire fleet.
Can Zero Downtime Be Achieved?
Let's define it precisely. Zero downtime means: no connected user ever observes a disconnection during a deploy. Not a brief spinner, not a reconnect, nothing.
The answer is no. A WebSocket connection is a stateful TCP session pinned to a specific OS process on a specific pod. Replacing the process (which is what a deploy does) kills the TCP session. There is no HTTP-level redirect or retry built into the WebSocket protocol. Unlike stateless HTTP requests that the load balancer can transparently reroute to a new backend, a WebSocket connection is a long-lived socket held by one process. Kill the process, kill the connection.
Three "clever" approaches come up in every design review. None of them work at this scale:
Kernel fd passing (SCM_RIGHTS). A file descriptor can be transferred between processes on the same host. But gateway pods are separate containers, often on different nodes. fd passing doesn't work across nodes. Even on the same node, the new process would need to reconstruct all application-level state (user_id, subscriptions, auth context, in-flight message buffers) for each of 500K connections. Nobody does this in production at this scale.
Proxy-level connection holding. Envoy can hold the client-side TCP connection open while swapping the backend for HTTP/2 (the GOAWAY frame tells clients to stop sending new streams). But WebSocket traffic is bidirectional frames on a single upgraded connection. There is no "stop using this connection for new work" concept. The proxy would need to buffer frames in both directions indefinitely, which defeats the purpose of a real-time system.
Blue-green with connection serialization. Serialize the entire connection state (TLS session, WebSocket frame state, application state), transfer it to a new pod, and resume. Theoretically possible. In practice, that means building a distributed state migration system that handles 500K connections per pod, all to avoid a 0-2 second reconnect delay. The complexity far outweighs the benefit.
What IS achievable: near-zero-downtime. Each user sees at most a 0-2 second reconnect blip. Zero notifications are lost (ScyllaDB queues anything that arrives during the gap). At any given moment during the rollout, less than 0.5% of connections are cycling. This is what Slack, Discord, and Pusher actually do in production.
Rolling Deploy Strategy for 200 Pods
The key insight: drain one pod at a time, and make sure the replacement pod is fully ready BEFORE the old pod starts draining.
In the Kubernetes Deployment config, maxSurge: 1 is what makes this work. Kubernetes starts the new pod first, waits for it to pass readiness checks for 30 seconds (minReadySeconds), and only then sends SIGTERM to the old pod. At no point does capacity drop below 199 pods.
The pacing math:
Each pod takes roughly 35 seconds to cycle: 5 seconds for clients to receive the CLOSE frame + 30 seconds of minReadySeconds on the replacement. With 200 pods going one at a time, the full rollout takes about 117 minutes. At any given moment, only 500K out of 100M connections are in the reconnect window. That's 0.5% of users.
How fast can the rollout go?
| Parallelism | Users impacted at once | Rollout time | Redis write spike | Risk |
|---|---|---|---|---|
| 1 pod | 500K (0.5%) | ~2 hours | Low | Minimal |
| 5 pods | 2.5M (2.5%) | ~25 min | Moderate | Low |
| 10 pods | 5M (5%) | ~12 min | High | Medium |
| 20 pods | 10M (10%) | ~6 min | Very high | Thundering herd risk |
At 10 pods in parallel, Redis has to handle 5M HSET + SADD commands in a few seconds from reconnections. A 50-shard Redis Cluster can handle that, but it eats into the headroom that protects against other spikes. 5 pods in parallel is the sweet spot for most rollouts: 25 minutes total, 2.5% user impact, comfortable Redis load.
The Full Deploy Lifecycle
Here's the complete timeline for one pod cycling through a deploy. The critical difference from a crash (Section 11.1): the new pod is already running and accepting connections BEFORE the old pod starts draining.
T-30s Kubernetes creates ws-pod-042-v2 (new version)
Pod starts, subscribes to Redis Pub/Sub, joins NLB
New connections from fresh clients start landing here
T+0s minReadySeconds (30s) passes. New pod confirmed stable.
Kubernetes sends SIGTERM to ws-pod-042-v1
Graceful drain begins (see Section 9.4 for full drain lifecycle)
T+0-2s Clients reconnect with jitter to healthy pods (including v2)
T+5s ~95% reconnected
T+300s SIGKILL. Bulk Redis cleanup. Kubernetes moves to next pod.
The 5-minute grace period (T+0 to T+300s) is generous. Most connections drain in under 5 seconds. The long tail exists for clients on flaky mobile networks that take 30+ seconds to notice the CLOSE frame. The SIGKILL at T+300s is a safety net, not the normal path.
Notification Delivery During the Deploy Window
During the drain, a user can be in one of three states. Each one has a clear delivery path.
State 1: Already reconnected to a new pod. The Redis registry has been updated with the new pod mapping. Notifications route normally through Pub/Sub to the new pod. No issue.
State 2: Mid-reconnect (0-2 second gap). The user disconnected from the old pod but hasn't connected to a new one yet. The Redis registry still points to the old pod. What happens:
Routing Worker → Redis lookup → "user_123 is on ws-pod-042-v1"
→ PUBLISH ws-pod-042-v1:deliver
→ Old pod receives Pub/Sub message
→ Tries to send to WebSocket session → session is closed
→ Writes notification to ScyllaDB: status = QUEUED
→ User reconnects to ws-pod-099-v2
→ ws-pod-099-v2 fetches pending:
SELECT * FROM notification_status
WHERE user_id = 'user_123' AND status = 'QUEUED'
→ Delivers all queued notifications
→ Updates status to DELIVERED
This is the same queuing mechanism described in Section 9.3 (Phase 4). No new infrastructure needed.
State 3: Still connected to the old pod. The draining pod hasn't sent the CLOSE frame yet, or the client hasn't processed it. The WebSocket session is still alive. Notifications deliver over the existing connection as normal. The drain process doesn't terminate connections instantly. It sends CLOSE and waits for the client to acknowledge.
Worst-case delivery delay: 2-3 seconds. User disconnects at T+0, reconnects at T+2s (jitter), queued notifications drain from ScyllaDB in the next second. For CRITICAL notifications like OTP codes or fraud alerts, a 2-3 second delay during a deploy is acceptable. The alternative (never deploying) is worse.
| User state | % of pod's users | Notification path | Delay |
|---|---|---|---|
| Already reconnected | ~90% (by T+2s) | Normal Pub/Sub | None |
| Mid-reconnect | ~8% (0-2s window) | ScyllaDB queue | 2-3s |
| Still on old pod | ~2% (slow clients) | Existing WebSocket | None |
Canary Deploys for WebSocket Gateways
Canary deploys work for WebSocket gateways, but they work differently than for stateless services.
Blue-green is impractical. It requires 200 extra pods (the "green" fleet) running alongside the existing 200 (the "blue" fleet). Each pod allocates ~10 GB for 500K connections. That's 2 TB of extra memory just for the deploy window. And 100M connections still need to migrate. Expensive, and it doesn't avoid reconnections.
Canary with new-connection routing is practical:
- Deploy the new version to 10 pods (out of 200)
- These canary pods join the NLB target group alongside the 190 stable pods
- Only new connections (from reconnects or fresh users) land on canary pods. Existing connections on the 190 stable pods are untouched
- Monitor canary pods for 10-15 minutes: error rates, connection stability, memory usage, WebSocket frame errors
- If healthy: proceed with rolling deploy for the remaining 190 pods
- If unhealthy: remove canary pods from NLB. Their users (~5M connections) reconnect to the 190 stable pods. Rollback complete in under 30 seconds
This pairs well with ArgoCD Rollouts (already in the tech stack, Section 12.2). The setWeight step controls what percentage of the NLB target group is canary pods. The analysis step monitors metrics from Prometheus before promoting.
The canary phase doesn't test the drain path. It only validates that the new code handles WebSocket connections correctly (no memory leaks, no frame errors, no auth failures). The rolling deploy that follows tests the drain mechanics, but by that point the new code is already validated.
Deploy Frequency at This Scale
Deploy frequency is limited by rollout time, not risk. At 1 pod at a time, a full rollout takes about 2 hours, and starting a new rollout while the previous one is mid-flight just churns connections on pods that already cycled. So there's a practical ceiling.
Separate cadences for different services:
| Service | Deploy frequency | Reason |
|---|---|---|
| Stateless (routing, FCM, API gateway) | 10+/day | No connection state. Instant replacement. |
| WebSocket gateways | 1-2/day | 2-hour rollout. Batch changes. |
| Infrastructure (Kafka, Redis, ScyllaDB) | Weekly | Planned maintenance windows. |
For WebSocket gateways, deploying every commit is impractical. Batch changes and ship once or twice a day. For faster iteration, use feature flags. Deploy the code with the flag off, validate with a canary, then flip the flag to activate the feature. This decouples code deployment from feature activation. The safety of a slow rollout with the agility of instant feature toggling.
One exception: security patches or critical bug fixes. For those, accept the 2-hour rollout or increase parallelism to 5-10 pods. A 25-minute rollout with 2.5% user impact is worth it for a security fix.
10. Identify Bottlenecks
10.1 WebSocket Node Selection Problem
Broadcasting every notification to all gateway pods would amplify 100M/sec into 20B internal messages (Section 9.3, Approach 2). The fix is the centralized Redis registry with Pub/Sub relay described in Section 9.3, Approach 3.
10.2 Broadcast Fan-Out Explosion
When a single notification must reach 10M+ subscribers (flash sale, trending topic), direct fan-out from one routing worker causes OOM. One worker can't hold 10M user IDs in memory and produce 10M Kafka messages. Staged fan-out fixes this: scan subscriber sets with SSCAN in micro-batches of 10K users, push those to Kafka, and let hundreds of consumers process them in parallel. One O(N) operation becomes N/10K micro-batches of 10K users each, processed in parallel.
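The micro-batching itself is trivial to sketch; `micro_batches` is a hypothetical helper mirroring SSCAN-style paging over a subscriber set:

```python
def micro_batches(user_ids, size=10_000):
    """Yield successive batches of `size` ids; each batch becomes one Kafka message."""
    batch = []
    for uid in user_ids:
        batch.append(uid)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                 # final partial batch
        yield batch
```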
10.3 Connection Registry Scalability
100M WebSocket mappings in Redis means ~10 GB of state (100 bytes per connection). The bigger challenge is the write rate: 100K+ connects/disconnects per second, each requiring an HSET and SADD. A Redis Cluster with 50+ shards handles this by hashing on user_id, keeping each shard under 2M ops/sec. Stale entries from crashed connections get cleaned up three ways: heartbeat checks (30s), a pod health reaper (10-30s), and TTL expiry as a safety net (24h worst case).
10.4 Scheduled Notification Hot Partitions
A single Redis sorted set for 1 billion scheduled notifications needs ~100 GB. That won't work. Partition by minute instead: scheduled:fire_index:2026-02-05T10:00. With 525,600 minutes per year, each bucket averages ~1,900 entries. Even at 100x peaks, that's ~190,000 per bucket. Trivial. The scheduler only loads the current minute, and a background job pre-populates buckets 10 minutes ahead from ScyllaDB.
10.5 Cache Stampede on Redis Failure
When a Redis shard fails, all routing workers that depended on it for connection lookups fail simultaneously. That's potentially hundreds of workers hitting the same error at once.
Partial failure (one shard out of 50):
Each routing worker maintains a local LRU cache of recent Redis lookup results (60-second TTL). When a shard goes down, the cache absorbs hits for recently active users on that shard. Only cache misses result in notifications written to ScyllaDB as QUEUED. With a 50-shard cluster, one shard failure affects ~2% of lookups. Most of those are cache hits. Real impact is minimal.
Full Redis outage:
Both lookup and Pub/Sub delivery are gone. Every notification falls back to ScyllaDB as QUEUED. Real-time delivery stops. Notifications are delayed until either Redis recovers (replica promotion takes 10-15 seconds) or users reconnect.
This is an accepted tradeoff. Redis failover is fast (10-15 seconds), and the 60-second gateway re-registration sweep rebuilds the registry quickly after. Maintaining a durable fallback registry (dual-writing every connection to ScyllaDB) would cost ~110K extra writes/sec permanently to avoid a 10-15 second delay during a rare event. Not worth it. The simpler design: accept the brief delay, queue in ScyllaDB, deliver on recovery.
10.6 Thundering Herd on Gateway Pod Restart
When a gateway pod dies, 500K clients disconnect at once. The server sends CLOSE frame (code 1001) before shutdown. Clients reconnect with jittered backoff (Section 9.4), spreading 500K reconnections to ~250K/sec.
10.7 Cross-Region Notification Latency
With ingestion-layer routing (Section 12.1), a notification for an EU user submitted to a US API Gateway must produce to the EU Kafka cluster across the Atlantic. That adds ~60-120ms of network round-trip. Combined with the normal pipeline latency (~50-150ms), cross-region notifications land at ~110-270ms end-to-end. Still well within the p99 target of 500ms. For the 80-90% of notifications that are local (US user → US delivery), there's zero cross-region overhead.
11. Failure Scenarios
11.1 WebSocket Gateway Pod Crash
SCENARIO: A gateway pod serving 500K concurrent WebSocket connections dies
This is the most interesting failure in the entire system. One pod crash instantly disconnects 500,000 users. If all 500K clients reconnect at the same instant to remaining pods, the result is a thundering herd that cascades through the whole gateway fleet.
Timeline:
| Time | Event |
|---|---|
| T=0 | Gateway pod-17 crashes (OOM, hardware fault, or deployment) |
| T=0 | 500K WebSocket connections drop. Clients get TCP RST or timeout |
| T=+3s | LB health check fails, stops routing to pod-17 |
| T=+10s | Pod health reaper detects pod-17 is gone, begins cleaning Redis |
| T=+10-30s | Reaper removes ~500K stale entries from connection registry |
| T=+5-30s | Clients begin reconnecting with exponential backoff + jitter |
| T=+30s | 90% of clients reconnected to other pods, new registry entries in Redis |
| T=+30s | Queued notifications in ScyllaDB delivered on reconnect |
The hidden problem: Thundering herd on reconnect
500K clients all lost their connection at the same instant. If they all reconnect immediately, the remaining 199 gateway pods each receive ~2,500 new connections simultaneously on top of their existing 500K. That's manageable per-pod, but the Redis registry writes (500K HSET commands in 5 seconds) can spike Redis CPU.
The client SDK prevents this with jittered exponential backoff (see Section 9.4 for the full reconnection implementation). With jitter, 500K reconnections spread over a 30-second window instead of a 1-second spike.
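A full-jitter exponential backoff in the client SDK might look like this sketch (`backoff_delay` and its parameters are illustrative, not the SDK's actual API):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full jitter: a uniform draw over [0, min(cap, base * 2^attempt)] seconds.

    The randomness is what spreads 500K simultaneous reconnects across the
    whole window instead of producing a synchronized spike.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```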
What happens to notifications during the 30-second gap?
Two outcomes, both safe:
- Registry already cleaned: Routing layer can't find the user, queues the notification in ScyllaDB (Section 9.3, Phase 4). Delivered on reconnect.
- Stale entry still in Redis: Routing layer finds the stale entry, publishes to ws-pod-017:deliver. Redis PUBLISH returns 0 (no subscriber, the pod is dead). The routing layer checks this return value: if 0, it queues the notification in ScyllaDB as QUEUED. Delivered on reconnect.
Either way, no notification is lost.
| Metric | Value |
|---|---|
| Data loss | ZERO -- notifications queued in ScyllaDB |
| User impact | 5-30s reconnect delay, then queued notifications delivered |
| Capacity impact | <0.5% (1 of 200 pods) |
11.2 Redis Cluster Failure
SCENARIO: Redis cluster node fails, partial connection registry lost
Redis holds two critical pieces of state: the connection registry (user_id → gateway pod mapping) and the rate limiter counters. When a Redis node fails, a fraction of both are temporarily unavailable.
Impact analysis:
With a 50+ shard Redis Cluster (50 masters, each with a replica), one master failure affects ~2% of hash slots. The replica promotes to master in 10-15 seconds (Redis Cluster failover). During those 10-15 seconds, reads/writes to affected slots fail.
What breaks:
- Connection registry lookups fail for ~2% of users: The routing layer can't find which gateway pod the user is on. It falls back to writing the notification to ScyllaDB as QUEUED. The user's gateway pod will deliver it on the next heartbeat check (30-40 seconds).
- Rate limiter unavailable for ~2% of tenants: Circuit breaker kicks in. Fall back to an in-memory approximate rate limiter. Slightly over-permissive, but delivering a few extra notifications is better than dropping valid ones.
The hidden problem: Registry rebuild after failover
When the replica promotes, it has all the data (assuming no replication lag). But if the master died before replicating recent writes, some connection entries from the last few seconds are lost. These users are technically connected but invisible to the routing layer.
Recovery: Every gateway pod re-registers all its connections every 60 seconds as a background sweep. Within 60 seconds of failover, the registry is fully accurate again.
| Metric | Value |
|---|---|
| Data loss | ZERO -- notifications queued in ScyllaDB |
| User impact | 10-60s delayed delivery for ~2% of users |
| Full recovery | <60s after replica promotion |
What if 10-15 seconds of delayed delivery is unacceptable?
The alternative: dual-write the connection registry to both Redis and ScyllaDB on every connect/disconnect. With 100M concurrent connections and ~30 minute average session length, that's ~55K connects/sec + ~55K disconnects/sec = ~110K extra ScyllaDB writes/sec. When Redis fails, the routing layer falls back to ScyllaDB for lookups and delivers via direct gRPC to pods (bypassing Redis Pub/Sub). Real-time delivery continues at ~10-20ms instead of ~2-5ms.
The tradeoff: 110K extra writes/sec running permanently to avoid a 10-15 second delay that happens a few times a year. For most systems, not worth it. For systems where every second of real-time delivery matters (trading alerts, emergency notifications), it might be.
11.3 ScyllaDB Node Failure
SCENARIO: One ScyllaDB node in a 9-node cluster goes down (simplified for illustration; production sizing per Section 6 requires hundreds of nodes, but the failure mechanics are identical)
ScyllaDB is the persistent backbone -- it stores device tokens, user preferences, notification status, and the scheduled notification queue. Losing a node sounds scary but is the most boring failure in this system.
Why it's boring:
With RF=3 and consistency level QUORUM (read and write), every piece of data exists on 3 nodes. A QUORUM operation needs 2 of 3 replicas. One node dying means 2 are still available. Reads and writes continue without any application-level change.
Timeline:
| Time | Event |
|---|---|
| T=0 | Node-5 goes down |
| T=0 | All queries to data owned by node-5 still succeed (2 of 3 replicas alive) |
| T=+10s | Gossip protocol detects node-5 as DOWN |
| T=+10s | Coordinator nodes stop including node-5 in query plans |
| T=hours | Replacement node joins, streams data from surviving replicas |
What ScyllaDB does better than Cassandra here: No GC pauses during the recovery streaming. Cassandra nodes sometimes pause for 5-10 seconds during large compactions or GC, causing cascading timeouts. ScyllaDB's shard-per-core design means streaming happens on dedicated cores without affecting live queries.
The only real risk: If a second node fails before the first recovers AND both nodes hold replicas of the same partition, QUORUM is lost for that partition (1 of 3 replicas alive). Even then, only the token ranges whose replica sets include both downed nodes are affected; reads and writes everywhere else continue at full consistency, and the affected ranges recover as soon as either node is replaced.
| Metric | Value |
|---|---|
| Data loss | ZERO |
| User impact | NONE -- completely transparent |
| Recovery | Hours (streaming), but service never degraded |
11.4 Kafka Broker Failure
SCENARIO: Kafka broker-7 fails (hardware fault, disk corruption)
Kafka is the durable backbone. Every notification passes through it after the API returns 202 Accepted. Losing a broker is a non-event for data safety, but causes a brief processing blip.
Timeline:
| Time | Event |
|---|---|
| T=0 | Broker-7 dies |
| T=0 | Partitions where broker-7 was leader become temporarily leaderless |
| T=+5-15s | Controller elects new leaders from ISR (in-sync replicas) |
| T=+5-15s | Producers get NOT_LEADER errors, retry with backoff, succeed after election |
| T=+5-15s | Consumers pause on affected partitions, resume after new leaders |
| T=+15s | Fully operational, no data lost |
Why no data is lost:
- RF=3: Every message exists on 3 brokers
- min.insync.replicas=2: A write is only ACKed after 2 replicas confirm
- acks=all: Producer waits for all in-sync replicas before confirming
- Broker-7 dying means 2 replicas survive. New leader elected from those 2
Per-channel topic design helps here: Each delivery channel (web-push, fcm, apns) has its own Kafka topic. If broker-7 hosted leaders for some FCM partitions, only FCM delivery blips. Web push and APNs continue unaffected. Independent scaling, independent failure domains.
Producer behavior during election:
Producer config:
acks=all
retries=5
retry.backoff.ms=100
buffer.memory=64MB // ~5 seconds of buffering at peak
// During 5-15s leader election:
// - Producer buffers messages in 64MB memory
// - Retries on NOT_LEADER_FOR_PARTITION
// - Succeeds after new leader elected
// - If buffer fills: block for max.block.ms (60s), then throw
At 100M notifications/sec with an average payload of 500 bytes, total produce rate is ~50 GB/sec. Only traffic headed for broker-7's leader partitions stalls during the election; in a cluster of, say, ~100 brokers that's roughly 1% of leadership, or ~500 MB/sec. A 15-second election therefore strands ~7.5 GB across all producers. With 2000+ API gateway pods, each buffers ~3.75 MB. Well within the 64 MB buffer.
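The back-of-envelope above can be checked directly. The ~100-broker cluster size is an assumption of this sketch (it is what makes one dead leader roughly 1% of partition leadership); the other figures come from the design itself:

```python
# Buffering math during a 15s leader election (100-broker cluster assumed).
total_rate_bps   = 100_000_000 * 500   # 100M msgs/sec * 500 B = 50 GB/sec
blocked_fraction = 0.01                # broker-7's share of partition leaders
election_secs    = 15
stranded = total_rate_bps * blocked_fraction * election_secs
per_pod  = stranded / 2000             # spread across 2000+ gateway pods

assert stranded == 7_500_000_000       # ~7.5 GB cluster-wide in flight
assert per_pod == 3_750_000            # ~3.75 MB per pod, well under 64 MB
```

The key sensitivity is cluster size: with fewer brokers, a single dead leader blocks a larger fraction of traffic and the per-pod buffer fills proportionally faster.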
| Metric | Value |
|---|---|
| Data loss | ZERO |
| User impact | 5-15 second delivery delay |
| Recovery | Automatic, no manual intervention |
11.5 FCM/APNs Full Service Outage
SCENARIO: Google FCM goes completely down for 2 hours
This actually happens. FCM has had multiple multi-hour outages. When it does, Android delivery stops entirely. The question isn't "how to prevent it" -- that's impossible. The question is "how to not lose a single notification and drain the backlog cleanly when it comes back."
What happens step by step:
- FCM dispatcher workers start getting 500/503 errors on every request
- Error rate crosses 50% threshold within 60 seconds
- Circuit breaker opens. FCM workers stop calling FCM entirely
- Kafka consumer for the notifications.fcm topic pauses its partitions
- Messages accumulate in Kafka. At 30M FCM notifications/sec, that's ~216B messages over 2 hours
- Kafka retention is 24 hours. 2-hour outage is well within buffer
The critical detail: Controlled drain on recovery
When FCM comes back, blasting 216B buffered notifications at once would be a mistake. FCM has its own rate limits, and sending a 2-hour backlog at full speed would just trigger 429s.
The recovery drain works like this:
Circuit breaker: OPEN → (probe every 10s) → HALF_OPEN → (5 successes) → CLOSED
After CLOSED:
1. Resume Kafka consumer with rate limiting
2. Start at 10% of normal throughput
3. Ramp up 10% every 30s if error rate < 5%
4. Full throughput restored in ~5 minutes
5. Backlog drained once stable, using dispatch headroom above the live 30M/sec rate
6. Total drain time for a 2h outage (~216B messages): a few hours, depending on how much dispatch capacity exceeds the live rate
What about notifications with TTL? Some notifications have a time-to-live (e.g., "Your food is arriving in 5 minutes" is useless 2 hours later). During drain, the FCM worker checks TTL before dispatching. Expired notifications are marked EXPIRED in ScyllaDB and skipped. This actually speeds up drain by reducing the effective backlog.
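The ramp schedule and the TTL check can be sketched together. This is an illustrative model under the numbers above (10% start, +10% per healthy 30-second window, FCM at 30M/sec normal throughput); the function and field names are hypothetical:

```python
import time

def drain_plan(normal_rate: float, start_pct: int = 10,
               step_pct: int = 10) -> list:
    """Ramp schedule: start at 10% of normal throughput, add 10% for each
    30s window where the error rate stays under 5%. Ten steps to full."""
    return [normal_rate * pct / 100 for pct in range(start_pct, 101, step_pct)]

def should_dispatch(notification: dict, now: float) -> bool:
    """TTL check during drain: expired notifications are skipped
    (and would be marked EXPIRED in ScyllaDB)."""
    ttl = notification.get("ttl_seconds")
    return ttl is None or now - notification["created_at"] <= ttl

ramp = drain_plan(30_000_000)   # ten 30s steps -> full throughput in ~5 min

now = time.time()
stale = {"created_at": now - 7200, "ttl_seconds": 300}  # 2h-old food ETA
fresh = {"created_at": now - 60, "ttl_seconds": 300}
```

A message with no `ttl_seconds` field is treated as never expiring, which matches the default of delivering everything that was ACKed.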
| Metric | Value |
|---|---|
| Data loss | ZERO -- Kafka buffers everything |
| User impact | No Android notifications for duration of FCM outage |
| Recovery time | A few hours to drain a 2-hour backlog |
| Expired notifications | Skipped during drain (TTL check) |
The exact same pattern applies to APNs outages. Separate Kafka topic, separate circuit breaker, separate drain -- one channel going down doesn't affect the others.
11.6 Scheduler Leader Crash
SCENARIO: The scheduler leader crashes mid-tick while firing scheduled notifications
The scheduler service uses leader election via Redis lock. One leader ticks every second, scanning Redis sorted sets for notifications whose fire time has passed. If it crashes, scheduled notifications stop firing until a new leader takes over.
Timeline:
| Time | Event |
|---|---|
| T=0 | Leader crashes mid-tick (was processing minute-bucket 14:35) |
| T=0 | Redis lock TTL starts counting down (TTL=10s) |
| T=+5s | Hot standby notices leader heartbeat missing, starts attempting lock acquisition |
| T=+10s | Redis lock expires. Standby acquires lock, becomes new leader |
| T=+10s | New leader starts ticking from current time |
| T=+10-15s | Notifications scheduled between T=0 and T=+10s are unfired |
| T=+5min | Consistency sweep runs, catches everything the crash missed |
The gap: What about T=0 to T=+10s?
During the 10-second leadership gap, no scheduled notifications fire. The consistency sweep (runs every 5 minutes) scans ScyllaDB for any PENDING notification with fire_at < now() - 60s. This catches the gap. Worst case: a scheduled notification fires 5 minutes late.
What if the leader crashed partway through a batch? Some notifications from the 14:35 bucket were already produced to Kafka, others weren't. The new leader doesn't know which ones were sent. Solution: idempotency keys. Every scheduled notification has a deterministic idempotency key (notification_id + fire_time). If Kafka already has it, the duplicate is dropped by the dedup layer.
| Metric | Value |
|---|---|
| Data loss | ZERO -- ScyllaDB is source of truth |
| User impact | Up to 5-minute delay for notifications in the gap window |
| Duplicate risk | None -- idempotency keys on every scheduled notification |
11.7 Full Region Failure
SCENARIO: US-EAST region goes completely offline (AWS outage, network partition)
This is the big one. An entire region dying means: all US-EAST Kafka brokers, all Redis nodes, all gateway pods, all ScyllaDB nodes in that DC. Millions of active WebSocket connections drop.
What survives:
- ScyllaDB data: Multi-DC replication means EU-WEST has a full copy of all device tokens, user preferences, notification status. Zero data loss.
- Kafka messages: NOT replicated cross-region (by design). In-flight messages in US-EAST Kafka are unavailable until the region recovers. But: the ingestion layer already ACKed these to the client as 202 Accepted. They're durable in US-EAST Kafka (on disk, RF=3 within the region). When the region comes back, they'll be processed.
What breaks and how it recovers:
| Component | Impact | Recovery |
|---|---|---|
| WebSocket connections | All US-EAST WS connections drop | Clients reconnect to EU-WEST via DNS failover (30-60s) |
| Kafka | US-EAST topics unavailable | New notifications for US users produced to EU-WEST Kafka |
| Redis | US-EAST registry lost | Rebuilt as clients reconnect to EU-WEST gateways |
| Scheduled notifications | US-EAST scheduler stops | EU-WEST scheduler picks up (ScyllaDB has the schedule) |
| In-flight notifications | Stuck in US-EAST Kafka | Delivered when region recovers, or re-fired by sweep |
DNS failover timeline:
| Time | Event |
|---|---|
| T=0 | US-EAST goes dark |
| T=+30s | Health checks fail, DNS TTL starts expiring |
| T=+30-60s | DNS resolves to EU-WEST for US users |
| T=+60-90s | Clients reconnect to EU-WEST gateway pods |
| T=+90s | Service restored with higher latency (cross-Atlantic) |
| T=+5min | EU-WEST auto-scales to handle doubled load |
The cross-Atlantic latency penalty: US users connecting to EU-WEST gateways add ~80-100ms round trip. For real-time notifications this is noticeable but acceptable. WebSocket PING/PONG keep the connection alive despite higher latency. When US-EAST recovers, DNS shifts back and clients reconnect to their local region.
| Metric | Value |
|---|---|
| Data loss | ZERO -- ScyllaDB multi-DC has everything |
| In-flight notifications | Delivered when US-EAST recovers |
| User impact | 30-90s outage during failover, +80ms latency after |
| Full recovery | When US-EAST comes back online |
11.8 Compound Failure: Redis + WebSocket Gateway
SCENARIO: Network partition isolates the Redis cluster from gateway pods, but both are still running
This is worse than either failure alone. The gateway pods are alive and holding WebSocket connections, but can't read or write to the connection registry. The routing layer can't find any user's gateway pod.
What happens:
- Gateway pods detect Redis is unreachable. They keep existing WebSocket connections alive (connections don't depend on Redis).
- New connections can't be registered in Redis. Users who connect during the partition are invisible to routing.
- Routing layer: every Redis lookup fails. Every notification goes to ScyllaDB as QUEUED.
- Real-time delivery stops entirely. All notifications are delayed until either Redis recovers or users reconnect.
Recovery depends on which heals first:
If Redis recovers: Gateway pods re-register all connections in the next 60-second sweep. Routing layer starts finding users again. ScyllaDB queue drains as users are discovered online.
If the partition persists > 5 minutes: Gateway pods start proactively draining ScyllaDB queues for their connected users. Each pod knows which users it's serving (local connection map), so it can pull queued notifications from ScyllaDB and deliver them directly, bypassing the routing layer entirely.
| Metric | Value |
|---|---|
| Data loss | ZERO -- ScyllaDB catches everything |
| User impact | Delayed delivery (seconds to minutes depending on partition duration) |
| Recovery | Automatic once Redis is reachable or via direct ScyllaDB drain |
11.9 When Is Data ACTUALLY Lost?
When is a notification ACTUALLY lost?
Only if Kafka loses committed data. Once the API returns 202 Accepted, the notification is persisted in Kafka with RF=3 and acks=all. From that point, it WILL be delivered eventually, no matter what else fails.
For Kafka to lose a committed message, it takes:
- 3 brokers holding replicas of the same partition all fail simultaneously WITH unrecoverable disk loss
- OR:
acks=allis misconfigured (not the case here) - OR:
min.insync.replicasis set to 1 (set to 2 in this design)
Probability: astronomically low with proper operations.
Mitigations:
- RF=3 across different racks/AZs
min.insync.replicas=2- Rack-aware replica placement
- Regular disk health monitoring
- Kafka tiered storage to S3 (independent durability layer)
Everything downstream of Kafka is derived:
- ScyllaDB notification status: Rebuilt from Kafka replay
- Redis connection registry: Rebuilt from gateway pod registration sweeps
- Redis rate limiter counters: Rebuilt from incoming traffic patterns
- Scheduled notification state: Source of truth is ScyllaDB, backed by Kafka events
The system's durability guarantee comes down to one thing: Kafka is the write-ahead log for every notification. As long as Kafka is intact, every notification is recoverable.
11.10 Failure Scenario Summary
Every scenario above comes back to the same architecture principle: Kafka is the durable backbone, ScyllaDB is the persistent queue, Redis is ephemeral and rebuildable.
- Gateway pod crash? Clients reconnect, ScyllaDB holds queued notifications
- Redis down? Routing falls back to ScyllaDB queuing, registry rebuilds in 60s
- ScyllaDB node down? RF=3 means transparent failover
- Kafka broker down? Leader election in 15s, no data loss
- FCM/APNs outage? Kafka buffers for 24h, controlled drain on recovery
- Scheduler crash? Standby takes over in 10s, sweep catches gaps in 5min
- Full region down? DNS failover to other region, ScyllaDB has all data
At-least-once delivery is guaranteed as long as Kafka is intact. Every delivery path has a fallback. No single component failure causes notification loss.
11.11 Retry Strategy Matrix
| Error Type | First Retry | Max Retries | Backoff | Final Action |
|---|---|---|---|---|
| Kafka produce | 100ms | 5 | Exponential | 503 to client |
| ScyllaDB timeout | 200ms | 3 | Exponential | Serve from cache |
| Redis timeout | 50ms | 3 | Linear | In-memory fallback |
| FCM 429 | 1s | 10 | Exp + jitter | DLQ |
| FCM 500/503 | 500ms | 3 | Exponential | DLQ |
| APNs 429 | 1s | 10 | Exp + jitter | DLQ |
| APNs 500/503 | 500ms | 3 | Exponential | DLQ |
| WS send fail | Immediate | 1 | N/A | Queue in ScyllaDB |
| Template error | N/A | 0 | N/A | DLQ |
| Invalid token | N/A | 0 | N/A | Deactivate token |
11.12 Circuit Breaker Configuration
| Dependency | Error% to Open | Window | Half-Open Probes | Fallback |
|---|---|---|---|---|
| FCM API | 50% | 60s | 5 requests | Buffer in Kafka |
| APNs API | 50% | 60s | 5 requests | Buffer in Kafka |
| ScyllaDB | 30% | 30s | 10 requests | Local cache |
| Redis | 30% | 30s | 10 requests | In-memory |
| Template Service | 50% | 60s | 3 requests | Raw payload |
State machine:
CLOSED ──(error% > threshold)──▶ OPEN
OPEN ──(cooldown 60s)────────▶ HALF_OPEN
HALF_OPEN ──(N successes)──────▶ CLOSED
HALF_OPEN ──(any failure)──────▶ OPEN
11.13 Dead Letter Queue (DLQ)
Topic: notifications.dlq
Each message carries:
• Original payload
• Error code + message
• Attempt count
• Last attempt timestamp
• Source topic + partition
Processing:
1. Auto-retry every 15 min for transient errors (5xx, timeout)
2. Manual triage dashboard for permanent errors (bad payload, bad template)
3. Alert if DLQ depth > 10,000
4. Retain 7 days → archive to S3
12. Deployment Strategy
12.1 Multi-Region Active-Active
Cross-region data strategy:
- Kafka: NOT replicated cross-region. Each region has independent Kafka clusters with independent topics. No MirrorMaker.
- ScyllaDB: Built-in multi-DC replication. Device tokens, user preferences, and notification status are available in every region.
- Redis: NOT replicated (ephemeral, rebuilt locally per region).
- WebSocket: User connects to nearest region, stays there.
Routing: The ingestion layer handles cross-region delivery. When the US API Gateway receives a notification for an EU user, it looks up the user's device registry in ScyllaDB (which is replicated cross-region). The registry tells it the user's home region. The API Gateway then produces directly to the EU Kafka cluster via a cross-region Kafka producer.
Notification for EU user sent from US: US API Gateway → ScyllaDB lookup (user is EU) → produce to EU Kafka → EU routing pipeline → EU delivery → user's WS in EU
Why not MirrorMaker 2? MirrorMaker replicates entire Kafka topics across regions. At 100 GB/sec ingestion, that means copying most of that data cross-region even though 80-90% of notifications are local (US user → US delivery). The cross-region bandwidth cost alone would be enormous. Ingestion-layer routing only sends cross-region traffic for the small percentage of notifications that actually need it. For broadcast notifications (flash sale to all users), the ingestion layer fans out to each region's Kafka independently.
12.2 Technology Stack
| Component | Technology |
|---|---|
| API Gateway | Envoy + gRPC services |
| Message Bus | Apache Kafka 3.7+ |
| Hot Database | ScyllaDB Enterprise |
| Cache / Registry | Redis Cluster 7.2+ |
| Analytics | ClickHouse |
| Cold Storage | S3 + Parquet |
| Templates | PostgreSQL 16 |
| Orchestration | Kubernetes (EKS) |
| Monitoring | Prometheus + Grafana |
| Tracing | Jaeger / Tempo |
| CI/CD | ArgoCD + GitHub Actions |
| Load Balancer | AWS NLB (L4) + Envoy sidecar (L7) |
12.3 Key Trade-off Decisions
| Decision | Chosen | Why |
|---|---|---|
| Hot DB | ScyllaDB over Cassandra/DynamoDB | 10x throughput per node, compatible model, lower tail latency |
| Message Bus | Kafka over Pulsar/NATS | Battle-tested at scale, strongest ecosystem, per-region clusters with ingestion-layer routing |
| Connection Registry | Redis over etcd/ZooKeeper | Sub-ms lookups, built-in pub/sub, good enough for ephemeral data |
| WS vs SSE | Hybrid (WS primary, SSE fallback) | WS for bidirectional ACK, SSE for firewall-hostile environments |
| Scheduling | Redis sorted set + ScyllaDB | O(log N) retrieval, ScyllaDB for durability |
| LB for WS | L4 NLB + Envoy sidecar | L4 for minimal connection overhead, Envoy for per-pod L7 |
| Analytics | ClickHouse over Druid/ES | Better compression, SQL, trillion-row queries, simpler ops |
| WS Node Selection | Centralized registry (Redis) | Sub-ms exact routing, no broadcast waste, handles multi-device |
13. Observability
13.1 Key Metrics (SLIs)
Ingestion: notification.ingested.rate, rejected.rate, ingestion.latency.p99
Routing: notification.routed.rate, routing.latency.p99, fanout.factor
Delivery: notification.sent/delivered/failed.rate (per channel), delivery.latency.p50/p99
Connections: websocket.connections.active (per pod), dropped.rate
Kafka: consumer.lag (per topic/group), produce.error.rate
Scheduler: scheduled.pending, fired.rate, missed.rate, drift.seconds, preloader.behind.seconds, preloader.loaded.rate
DLQ: dlq.depth, ingestion.rate, retry.success.rate
13.2 Critical Alerts
13.3 Distributed Tracing
Every notification carries X-Trace-Id through the entire pipeline. Sampled at 0.1% (= 100K traces/sec). 100% sampling for CRITICAL priority and DLQ messages. Backend: Jaeger or Grafana Tempo.
Trace span chain:
[Ingest] → [Kafka] → [Route] → [Fan-out] → [Deliver:WEB]
→ [Deliver:FCM]
→ [Deliver:APNs]
14. Security
Authentication: Tenant API key + JWT (15-min TTL, refresh tokens). WebSocket authenticates on HTTP Upgrade, and the JWT is validated before the upgrade completes. Mid-session token expiry handled via REAUTH frame without dropping the connection.
Encryption: TLS 1.3 everywhere. mTLS for service-to-service. Kafka and ScyllaDB encrypted at rest. Optional per-tenant payload encryption via AWS KMS.
Tenant Isolation: Every message, every DB row, every Redis key includes tenant_id. Per-tenant Kafka quotas. Per-tenant rate limits. ScyllaDB workload-aware scheduling prevents one tenant's scans from affecting another's latency.
Input Validation: JSON schema validation. Max payload 4 KB. Template variables sanitized. Deep links validated against tenant-registered URL scheme allowlist.
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This System | Learn More |
|---|---|---|
| Kafka | Central event bus, 2M+ messages/sec, topic-per-channel routing | Kafka Deep Dive |
| ScyllaDB | Primary hot store, 2.4 PB notification storage, shard-per-core | ScyllaDB Deep Dive |
| Redis | Connection registry for 100M WebSocket connections, Pub/Sub relay | Redis Deep Dive |
| ClickHouse | Real-time analytics on delivery rates, latency percentiles | ClickHouse Deep Dive |
| PostgreSQL | Tenant configuration, template management, audit logs | PostgreSQL Deep Dive |
| gRPC | Service-to-service communication with protobuf serialization | gRPC Deep Dive |
| Envoy | Service mesh sidecar, mTLS, load balancing, circuit breaking | Envoy Deep Dive |
| Prometheus | Metrics collection for delivery SLOs and pipeline health | Prometheus Deep Dive |
| Grafana | Dashboards for real-time delivery monitoring and alerting | Grafana Deep Dive |
Infrastructure Patterns
| Pattern | Role in This System | Learn More |
|---|---|---|
| WebSocket & Real-Time Communication | 100M concurrent connections, SSE fallback, connection draining | WebSocket & Real-Time |
| Circuit Breaker & Resilience Patterns | FCM/APNs circuit breakers, retry policies, fallback chains | Circuit Breaker & Resilience |
| Load Balancer | L4 load balancing for persistent WebSocket connections | Load Balancer |
| Message Queues & Event Streaming | Kafka topic design, priority queues, dead-letter handling | Event Streaming |
| Rate Limiting & Throttling | Per-tenant rate limits, provider rate management | Rate Limiting |
| Kubernetes | Pod scheduling, connection-aware autoscaling, rolling deploys | Kubernetes Architecture |
| Distributed Tracing | End-to-end notification delivery tracing across 12+ services | Distributed Tracing |
| Secrets Management | FCM/APNs credentials, tenant API keys, mTLS certificates | Secrets Management |
| Service Mesh | Envoy sidecar for mTLS, observability, traffic management | Service Mesh |
| Caching Strategies | Template caching, preference caching, connection state | Caching Strategies |
Practice this design: Design a Notification System -- interview question with progressive hints.
At 100M notifications/sec, every layer is partitioned independently. Kafka is the central bus. Redis handles real-time connection routing. ScyllaDB stores everything durable. The WebSocket node selection problem ("event on any server, connection on one specific pod") is solved with a centralized Redis registry and Pub/Sub relay: O(1) lookup, O(1) delivery, zero broadcast waste.