WebSocket & Real-Time Communication
Architecture Diagram
Why It Exists
HTTP makes the client ask for everything. Every single interaction starts with a request. That works fine for loading pages, but it falls apart for real-time features: notifications, chat, live dashboards, collaborative editing, multiplayer games. The server has data right now and needs to push it immediately.
Polling wastes bandwidth and adds noticeable latency. Long-polling ties up server threads waiting around. WebSocket takes a different approach. After a single HTTP Upgrade handshake, the result is a persistent, full-duplex TCP connection. Both sides can send frames whenever they want. The framing overhead is 2-14 bytes per frame versus hundreds of bytes in HTTP headers. That difference matters a lot when sending thousands of small messages per second.
How It Works
Protocol Comparison
| Protocol | Direction | Connection | Overhead | Browser Support | Use Case |
|---|---|---|---|---|---|
| HTTP Polling | Client-pull | New connection per poll | High (headers every request) | Universal | Simple, infrequent updates |
| Long-Polling | Client-pull (held) | Held open, re-established | Medium | Universal | Fallback when WS unavailable |
| SSE | Server-push | Persistent HTTP/2 stream | Low (text/event-stream) | All modern browsers | Notifications, feeds, dashboards |
| WebSocket | Bidirectional | Persistent TCP | Minimal (2-14 byte frame) | All modern browsers | Chat, gaming, collaboration |
A quick opinion: SSE is underrated. If the data flow is one-directional (server to client), SSE is simpler to operate, works through HTTP/2 without special proxy config, and handles reconnection automatically. Reach for WebSocket when bidirectional communication is actually needed.
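For the one-directional case, the browser side of SSE is only a few lines. A minimal sketch, assuming a hypothetical /events endpoint that serves text/event-stream:

```typescript
// Minimal browser-side SSE consumer. EventSource reconnects automatically
// on network errors, so there is no hand-rolled retry loop here.
// The /events endpoint and the "notification" event name are illustrative.
const source = new EventSource("/events");

// Unnamed events (plain `data:` lines) arrive via onmessage.
source.onmessage = (e: MessageEvent) => {
  console.log("event:", e.data);
};

// Named events (stream lines like `event: notification`).
source.addEventListener("notification", (e) => {
  const payload = JSON.parse((e as MessageEvent).data);
  console.log("notification:", payload);
});

source.onerror = () => {
  // EventSource goes back to CONNECTING and retries on its own.
  console.warn("SSE connection lost; the browser will retry");
};
```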
WebSocket Lifecycle
- HTTP Upgrade Handshake. The client sends a `GET` with `Upgrade: websocket` and a `Sec-WebSocket-Key`. The server responds with `101 Switching Protocols` and a `Sec-WebSocket-Accept` header (a SHA-1 hash proving it read the key). After this, the TCP connection switches from HTTP framing to WebSocket framing. The HTTP path is over.
- Data Transfer. Both sides exchange frames: text frames (UTF-8), binary frames, ping/pong for keepalive, and close frames. Framing is lightweight. A small message adds only 2 bytes of overhead. Messages can also be fragmented across multiple frames to stream large payloads without buffering the whole thing in memory.
- Keepalive. The server sends `ping` frames at a regular interval, typically every 30-60 seconds. The client must respond with `pong`. No pong within the timeout? The server closes the connection. This is how silently-dropped TCP connections from NAT timeouts, mobile network switches, or client crashes get caught. Without this, ghost connections accumulate and metrics become unreliable. (A keepalive sketch follows this list.)
- Close Handshake. Either side sends a close frame with a status code (1000 = normal, 1001 = going away, 1008 = policy violation). The other side responds with its own close frame, then both tear down the TCP socket. Clean and orderly.
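A minimal sketch of the keepalive and termination logic using the ws library for Node.js. The interval constant and the echo handler are illustrative; a real server would hang routing, auth, and registry updates off the same events.

```typescript
import { WebSocketServer, WebSocket } from "ws";

const KEEPALIVE_INTERVAL_MS = 30_000; // below typical NAT idle timeouts

// Track liveness per socket without monkey-patching the ws types.
const alive = new WeakMap<WebSocket, boolean>();

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  alive.set(ws, true);
  ws.on("pong", () => alive.set(ws, true)); // client answered our ping

  ws.on("message", (data) => {
    // Echo as a text frame; a real handler would route by message type.
    ws.send(data.toString());
  });
});

// Keepalive sweep: anything that missed the previous ping is presumed dead.
const sweep = setInterval(() => {
  for (const ws of wss.clients) {
    if (alive.get(ws) === false) {
      ws.terminate(); // no close handshake; the peer is already gone
      continue;
    }
    alive.set(ws, false);
    ws.ping();
  }
}, KEEPALIVE_INTERVAL_MS);

wss.on("close", () => clearInterval(sweep));
```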
The Connection Routing Problem
This is where most teams hit a wall when scaling past a handful of servers.
The core challenge: an event can arrive at any server, but the target client is connected to one specific pod. A notification for user X hits the ingestion API, but user X's WebSocket lives on Pod 47 out of 200 pods. How does it get there?
Solution: Redis Connection Registry + Pub/Sub
- When a client connects, the WebSocket pod writes `{user_id -> pod_id}` into Redis with a TTL matching the keepalive interval
- When an event targets user X, the producer looks up `user_id -> pod_id` in Redis. O(1) lookup.
- The producer publishes the event to a Redis Pub/Sub channel for that specific pod (e.g., `ws:pod:47`)
- Pod 47 picks up the event and pushes it down user X's WebSocket from its local connection map
This provides O(1) lookup + O(1) delivery. No broadcast waste. I have seen teams skip this step and just fan out every event to every pod, which technically works but burns CPU and bandwidth at O(N) per event. It becomes painful fast past 20-30 pods.
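A minimal sketch of this registry-plus-pub/sub path, assuming the ioredis client. The key pattern ws:user:*, the channel pattern ws:pod:*, and POD_ID are illustrative names, not a prescribed schema.

```typescript
import Redis from "ioredis";

const POD_ID = process.env.POD_ID ?? "pod-47"; // illustrative
const REGISTRY_TTL_SECONDS = 60;               // match the keepalive interval

const redis = new Redis();      // commands: SET / GET / PUBLISH
const subscriber = new Redis(); // dedicated connection for SUBSCRIBE

// On connect: register user -> pod with a TTL, refreshed on every pong.
async function registerConnection(userId: string): Promise<void> {
  await redis.set(`ws:user:${userId}`, POD_ID, "EX", REGISTRY_TTL_SECONDS);
}

// Producer side: O(1) lookup, then publish to the owning pod's channel.
async function routeEvent(userId: string, event: object): Promise<void> {
  const podId = await redis.get(`ws:user:${userId}`);
  if (!podId) return; // user offline, or the registry entry expired
  await redis.publish(`ws:pod:${podId}`, JSON.stringify({ userId, event }));
}

// Pod side: subscribe to our own channel and push to the local socket.
const localConnections = new Map<string, { send(data: string): void }>();

subscriber.subscribe(`ws:pod:${POD_ID}`).catch(console.error);
subscriber.on("message", (_channel, message) => {
  const { userId, event } = JSON.parse(message);
  localConnections.get(userId)?.send(JSON.stringify(event));
});
```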
Production Considerations
Load Balancing for Persistent Connections
WebSocket connections are long-lived, and that breaks a fundamental assumption of round-robin HTTP load balancing. Adding a new pod during scale-out means it gets zero existing connections while old pods stay overloaded. The balancer looks like it is distributing evenly, but the actual connection distribution is wildly skewed.
- Use L4 (TCP) load balancing for WebSocket traffic. L7 HTTP load balancers often buffer the Upgrade request or enforce HTTP timeouts that kill idle connections. AWS NLB, HAProxy in TCP mode, or Envoy with `upgrade_configs` all handle this correctly.
- Least-connections algorithm. This routes new connections to whichever pod has the fewest active connections, so things balance out naturally over time. Round-robin is the wrong choice here.
- Connection limits per pod. Set a hard max (e.g., 50K per pod). When a pod hits the cap, it fails its health check so the load balancer stops sending new connections. Without this, a single pod will accumulate connections until it runs out of memory. (One way to wire this up is sketched below.)
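A sketch of wiring the per-pod cap into the health check, using Node's http module and the ws library. The 50K cap and the /healthz path echo the numbers above and are assumptions, not requirements.

```typescript
import http from "node:http";
import { WebSocketServer } from "ws";

const MAX_CONNECTIONS = 50_000; // hard per-pod cap from capacity planning
const wss = new WebSocketServer({ noServer: true });

const server = http.createServer((req, res) => {
  if (req.url === "/healthz") {
    // Fail the health check at the cap so the balancer stops sending new
    // connections here; existing connections are left alone.
    const healthy = wss.clients.size < MAX_CONNECTIONS;
    res.writeHead(healthy ? 200 : 503).end(healthy ? "ok" : "at capacity");
    return;
  }
  res.writeHead(404).end();
});

server.on("upgrade", (req, socket, head) => {
  if (wss.clients.size >= MAX_CONNECTIONS) {
    socket.destroy(); // refuse the handshake outright once over the cap
    return;
  }
  wss.handleUpgrade(req, socket, head, (ws) => wss.emit("connection", ws, req));
});

server.listen(8080);
```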
Graceful Connection Draining
During a rolling deployment, killing a pod drops all its WebSocket connections at once. If every client reconnects simultaneously, the remaining pods see a connection spike (thundering herd) and can cascade-fail. I have watched this take down a production cluster.
Drain protocol:
- Mark the pod as draining (stop accepting new connections via health check)
- Send a custom `RECONNECT` frame to connected clients with a random delay hint (0-30s); see the sketch after this list
- Clients disconnect and reconnect to a new pod at a staggered time
- Wait for the drain timeout (e.g., 60s), then terminate remaining connections
- Pod shuts down
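A sketch of that drain sequence against a ws server. The RECONNECT message shape, the delay hint, and the timeout values are illustrative; the important part is staggering client departures before force-closing stragglers with status 1001.

```typescript
import { WebSocketServer } from "ws";

const DRAIN_TIMEOUT_MS = 60_000;  // matches the termination grace period
const MAX_DELAY_HINT_MS = 30_000; // spread reconnects over 0-30s

let draining = false; // the health check should return 503 once this flips

async function drain(wss: WebSocketServer): Promise<void> {
  draining = true; // step 1: stop accepting new connections

  // Step 2: tell every client to reconnect elsewhere at a staggered time.
  for (const ws of wss.clients) {
    const delayMs = Math.floor(Math.random() * MAX_DELAY_HINT_MS);
    ws.send(JSON.stringify({ type: "RECONNECT", delayMs }));
  }

  // Step 3: give clients the drain window to move on their own.
  await new Promise((resolve) => setTimeout(resolve, DRAIN_TIMEOUT_MS));

  // Step 4: close whatever is left with a "going away" status code.
  for (const ws of wss.clients) {
    ws.close(1001, "server draining");
  }
}
```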
In Kubernetes, set terminationGracePeriodSeconds to match the drain timeout. Use a preStop hook to trigger the drain logic before the pod receives SIGTERM. This is one of those things that seems optional until it causes an outage at 2 AM.
Scaling to 100M Concurrent Connections
At notification-system scale (100M concurrent connections), the architecture looks like this:
- 2,000-4,000 WebSocket pods at 25K-50K connections each
- Redis Cluster for the connection registry (100M keys at ~100 bytes each = ~10 GB, which is manageable)
- Sharded Redis Pub/Sub. A single Redis Pub/Sub instance tops out around 500K messages/sec. Shard by pod ID hash to spread the load.
- Kernel tuning: bump `net.core.somaxconn`, `fs.file-max`, `net.ipv4.tcp_tw_reuse`. Every connection uses a file descriptor, and the kernel default of 1024 is laughably low for this.
- Memory budget: each idle WebSocket connection eats ~10-50 KB (buffers, TLS state, application state). At 50K connections per pod, that's 500 MB to 2.5 GB just on connection overhead before the application does anything useful.
Most teams will never hit this scale. But knowing the ceiling matters because it shows where the architecture starts to buckle and what needs investment ahead of time.
Failure Scenarios
Scenario 1: NAT/Proxy Timeout Causes Silent Connection Drops. Corporate firewalls and cloud NAT gateways silently kill idle TCP connections after 60-300 seconds. Without application-level keepalive, the server thinks the connection is alive while packets are being blackholed. The connection registry says the user is online, but events sent to them vanish. Active-user metrics quietly inflate with ghost connections.
Detection: monitor the ratio of pong responses to ping sends and alert when the pong rate drops below 95%. Track connections_cleaned_up_by_timeout as a separate metric.
Recovery: run server-side pings every 30 seconds (below most NAT timeouts). On timeout, remove the connection from both the local map and the Redis registry immediately. Client-side, implement reconnection with exponential backoff (1s, 2s, 4s, max 30s) plus jitter to avoid creating a mini thundering herd.
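A browser-side sketch of the backoff-plus-jitter reconnect described above; the endpoint URL and constants are illustrative.

```typescript
const BASE_DELAY_MS = 1_000;
const MAX_DELAY_MS = 30_000;

let attempt = 0;

function connect(): void {
  const ws = new WebSocket("wss://example.com/ws"); // illustrative endpoint

  ws.onopen = () => {
    attempt = 0; // reset backoff once a connection actually succeeds
  };

  ws.onclose = () => {
    // Exponential backoff capped at 30s, plus up to 1s of jitter so a fleet
    // of clients does not reconnect in lockstep.
    const delay =
      Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_DELAY_MS) +
      Math.random() * 1_000;
    attempt += 1;
    setTimeout(connect, delay);
  };
}

connect();
```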
Scenario 2: Redis Registry Failure Breaks Connection Routing. The Redis cluster used for connection routing goes through a failover. During the 10-30 second failover window, new connections cannot be registered and event routing lookups fail. Events queue up but cannot be delivered. After failover, the new Redis primary has an empty or stale dataset. All existing connections look unroutable.
Detection: monitor redis_connection_registry_errors and event delivery failure rate. Alert when delivery drops below 99%.
Recovery: each WebSocket pod keeps a local connection map as the source of truth. After Redis comes back, pods re-register all their active connections in a background sweep (rate-limit this to avoid hammering the fresh Redis instance). During the outage, fall back to broadcasting events to all pods. It is O(N) and wasteful, but it keeps events flowing. For extra safety, set up dual-write to a secondary Redis cluster so failover is instant.
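A sketch of the post-failover re-registration sweep, reusing the ioredis registry layout from earlier. The batch size and pause are illustrative rate limits, not tuned values.

```typescript
import Redis from "ioredis";

const redis = new Redis();
const POD_ID = process.env.POD_ID ?? "pod-47"; // illustrative
const REGISTRY_TTL_SECONDS = 60;
const BATCH_SIZE = 500;      // cap the write rate against a fresh primary
const BATCH_PAUSE_MS = 100;

// localConnections is this pod's in-memory source of truth: userId -> socket.
async function reregisterAll(localConnections: Map<string, unknown>): Promise<void> {
  const userIds = [...localConnections.keys()];
  for (let i = 0; i < userIds.length; i += BATCH_SIZE) {
    const batch = userIds.slice(i, i + BATCH_SIZE);
    const pipeline = redis.pipeline();
    for (const userId of batch) {
      pipeline.set(`ws:user:${userId}`, POD_ID, "EX", REGISTRY_TTL_SECONDS);
    }
    await pipeline.exec();
    // Pause between batches so a few thousand pods sweeping at once do not
    // hammer the recovering Redis primary.
    await new Promise((resolve) => setTimeout(resolve, BATCH_PAUSE_MS));
  }
}
```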
Scenario 3: Thundering Herd After Pod Crash. A WebSocket pod handling 50K connections dies unexpectedly (OOM, node failure). All 50K clients detect the disconnect and try to reconnect at once. The load balancer distributes these across remaining pods, creating a 50K-connection spike. Pods that absorb too many reconnections may OOM themselves, creating a cascading failure.
Detection: monitor connections_per_second and alert when it exceeds 5x the normal rate. Track per-pod connection count variance.
Recovery: the real fix is client-side reconnection with jitter: reconnect_delay = min(base * 2^attempt + random(0, 1000ms), 30s). Server-side, add connection rate limiting per pod (e.g., max 1,000 new connections/second). In Kubernetes, use pod disruption budgets to prevent multiple pods from going down at the same time during voluntary operations.
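A sketch of per-pod connection rate limiting as a one-second token bucket; the 1,000/second figure mirrors the example above and is not a recommendation.

```typescript
const MAX_NEW_CONNECTIONS_PER_SECOND = 1_000;

let tokens = MAX_NEW_CONNECTIONS_PER_SECOND;

// Refill the bucket once a second; unused capacity does not accumulate.
setInterval(() => {
  tokens = MAX_NEW_CONNECTIONS_PER_SECOND;
}, 1_000);

// Call this in the HTTP upgrade handler before completing the handshake.
// Returning false should translate into destroying the socket (or a 503),
// which pushes the client into its backoff-and-retry path.
function allowNewConnection(): boolean {
  if (tokens <= 0) return false;
  tokens -= 1;
  return true;
}
```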
Capacity Planning
Connection density formula: pods_needed = total_connections / connections_per_pod. The right connections-per-pod number depends on the message rate, message size, and available memory. Profile this with realistic traffic before picking a number.
| Scale Tier | Concurrent Connections | Pods (50K/pod) | Redis Registry | Bandwidth | Reference |
|---|---|---|---|---|---|
| Startup | 10K | 1-2 | Single Redis | 10 Mbps | Early-stage app |
| Mid-scale | 500K | 10-15 | Redis Sentinel | 500 Mbps | Chat/collab platform |
| Large-scale | 10M | 200-400 | Redis Cluster (3 shards) | 10 Gbps | Gaming/trading platform |
| Hyper-scale | 100M+ | 2,000-4,000 | Redis Cluster (10+ shards) | 100+ Gbps | Notification system |
Key thresholds to watch: The kernel file descriptor limit defaults to 1024. Set it to 1M+ for WebSocket servers or the limit will be hit fast. Each TLS connection adds ~30 KB of memory overhead, so at 50K connections, that is 1.5 GB just for TLS state. Redis Pub/Sub maxes out around 500K messages/sec per shard, so partition by pod ID hash. And keep an eye on TIME_WAIT socket accumulation during connection churn. Enable tcp_tw_reuse to reclaim ports faster, otherwise ephemeral port exhaustion will occur under load.
Key Points
- WebSocket provides full-duplex, persistent connections over a single TCP socket, killing HTTP polling overhead entirely
- Connection lifecycle: HTTP Upgrade handshake, bidirectional frames, ping/pong keepalive, close handshake
- SSE (Server-Sent Events) is the simpler option for server-to-client push. It works with HTTP/2 and auto-reconnects out of the box
- A connection registry (Redis) solves the routing problem: an event can land on any server, but the connection lives on one specific pod
- Graceful connection draining during deploys prevents mass reconnect thundering herds
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Socket.IO | Open Source | Auto-reconnect, fallback transports, rooms/namespaces | Small-Medium |
| ws (Node.js) | Open Source | Minimal WebSocket server, high performance, no abstraction | Medium-Enterprise |
| Envoy/NGINX | Open Source | L4/L7 proxy with WebSocket support, connection draining | Medium-Enterprise |
| AWS API Gateway WebSocket | Managed | Serverless WebSocket with Lambda integration, connection management | Small-Large |
Common Mistakes
- Using L7 HTTP load balancers that buffer requests. WebSocket needs L4 (TCP) pass-through or L7 with explicit upgrade support.
- Broadcasting events to all pods instead of routing to the right one. That is O(N) fan-out when O(1) targeted delivery is possible.
- Skipping ping/pong keepalive. Silent TCP connection drops go undetected and stale connections pile up.
- Deploying without graceful drain. A rolling restart drops all connections at once and triggers a thundering herd reconnect storm.
- Storing connection state only in local memory. When a pod dies, all connection metadata is gone with no recovery path.