Load Balancer
Why It Exists
Every server has a ceiling. Throwing more CPU and RAM at it works for a while, but vertical scaling eventually hits a hard limit. Load balancers sit in front of the fleet and spread requests across multiple servers. That buys more throughput and, just as importantly, fault tolerance when individual machines go down.
Anyone who has manually restarted a single overloaded box at 2 AM already understands the problem this solves.
How It Works
L4 vs L7
Layer 4 (Transport) works at the TCP/UDP level. It sees IP addresses and ports, nothing more. No HTTP headers, no URLs. That makes it fast. Really fast. It just shuffles packets using NAT or DSR (Direct Server Return) with minimal inspection.
Layer 7 (Application) works at the HTTP/HTTPS level. It can look inside requests and route based on URL path, headers, cookies, even the request body. This provides content-based routing, SSL termination, and the ability to modify requests before they hit backends. The tradeoff is higher latency because the load balancer has to parse the full protocol.
My rule of thumb: start with L7 unless latency is the top constraint. The visibility alone is worth it.
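To make the L7 side concrete, here is a minimal sketch of content-based routing using Go's standard-library reverse proxy. The backend addresses and the path split are assumptions for illustration; a real deployment would express this as an HAProxy, Envoy, NGINX, or ALB rule rather than hand-rolled code.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Hypothetical backend pools. An L4 balancer could never make this
	// distinction, because it forwards packets without parsing HTTP.
	apiBackend, _ := url.Parse("http://10.0.1.10:8080")
	webBackend, _ := url.Parse("http://10.0.2.10:8080")

	mux := http.NewServeMux()
	mux.Handle("/api/", httputil.NewSingleHostReverseProxy(apiBackend))
	mux.Handle("/", httputil.NewSingleHostReverseProxy(webBackend))

	// Parsing the full request before routing is exactly where the
	// extra L7 latency comes from.
	log.Fatal(http.ListenAndServe(":8080", mux))
}
```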
Algorithms
- Round Robin rotates through backends one by one. Works great when servers are identical and requests take roughly the same time. Falls apart fast when either assumption breaks.
- Weighted Round Robin assigns weights. A server with weight 3 gets 3x the traffic of one with weight 1. Useful for mixed hardware, but the weights need to stay accurate as the fleet changes.
- Least Connections sends traffic to the server with the fewest active connections. Pick this when request durations vary a lot. It is the most forgiving algorithm in practice.
- Consistent Hashing hashes a request attribute (like user ID) to a specific server. This provides session affinity without sticky sessions, and only about 1/N of the keys need to remap when a server is added or removed. Great for caching layers.
- Random with Two Choices (P2C) picks two servers at random and routes to whichever has fewer connections. This sounds too simple to work well, but it actually avoids the herding behavior that least-connections can cause. Envoy uses this as its default, and for good reason.
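Here is a minimal P2C sketch in Go. The `Backend` type and connection counts are invented for illustration; a production balancer would also track health state and update counts atomically under concurrency.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Backend is a hypothetical stand-in for a real upstream server.
type Backend struct {
	Addr        string
	ActiveConns int
}

// pickP2C samples two distinct backends at random and routes to the one
// with fewer active connections. Comparing only two candidates avoids the
// herding that global least-connections causes, where every new request
// piles onto the single emptiest server at once. Requires len(pool) >= 2.
func pickP2C(pool []Backend) *Backend {
	i := rand.Intn(len(pool))
	j := rand.Intn(len(pool) - 1)
	if j >= i {
		j++ // shift to guarantee two distinct candidates
	}
	if pool[i].ActiveConns <= pool[j].ActiveConns {
		return &pool[i]
	}
	return &pool[j]
}

func main() {
	pool := []Backend{
		{Addr: "10.0.0.1:80", ActiveConns: 12},
		{Addr: "10.0.0.2:80", ActiveConns: 3},
		{Addr: "10.0.0.3:80", ActiveConns: 7},
	}
	fmt.Println("route to:", pickP2C(pool).Addr)
}
```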
Production Considerations
- Connection draining. Before pulling a backend out of rotation (deploy, scale-down, whatever), stop sending it new connections but let existing ones finish. 30 to 60 seconds is a typical drain timeout. Skip this and requests will drop on every single deploy. A backend-side sketch follows this list.
- Health checks. Use both active checks (the LB probes the backend) and passive checks (the LB watches response codes from real traffic). Active catches the case where a server is technically up but returning garbage. Passive catches issues faster since it piggybacks on real requests.
- Cross-zone balancing. In multi-AZ setups, make sure traffic distributes evenly across zones. AWS ALB does this by default. When running a self-managed load balancer, configure this explicitly or hotspots will develop.
- TLS termination. Terminate TLS at the load balancer to take the crypto workload off backends. If compliance requires encryption in transit all the way through, re-encrypt with mTLS between the LB and backends.
- Global Server Load Balancing (GSLB) uses DNS to route users to the nearest region. Pair it with local load balancers for a proper multi-region setup.
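Draining has two halves: the balancer stops handing a backend new connections, and the backend finishes its in-flight work before exiting. Here is a minimal sketch of the backend half in Go, using the standard library's graceful shutdown; the 60-second deadline is an assumption chosen to match the drain window above.

```go
package main

import (
	"context"
	"errors"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8080"}

	go func() {
		if err := srv.ListenAndServe(); !errors.Is(err, http.ErrServerClosed) {
			log.Fatal(err)
		}
	}()

	// Wait for the deploy tooling to signal shutdown (SIGTERM).
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Stop accepting new connections but let in-flight requests finish,
	// up to a deadline that mirrors the load balancer's drain timeout.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("drain deadline exceeded: %v", err)
	}
}
```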
Failure Scenarios
Scenario 1: Health Check False Positives During GC Pauses. A Java backend hits a 4-second stop-the-world GC pause. The load balancer's health check (2s timeout, 3 consecutive failures) marks the instance unhealthy and pulls it from the pool. The JVM recovers, but the LB takes 30 seconds to re-add it (3 passing checks at 10s intervals). During peak traffic, losing even one backend overloads the rest, triggering a cascade. Detection: correlate backend_removed_total with JVM GC metrics and alert on healthy_backend_count < expected_count for more than 15 seconds. Fix: use a longer timeout (5s) or require more consecutive failures (5) for services with known GC pauses. Separate liveness probes from readiness probes. I have seen this exact scenario take down a checkout flow on Black Friday.
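The fix in that scenario is pure parameter tuning. Here is a sketch of the active-check state machine those parameters drive; the field names are invented for illustration, though HAProxy exposes the same knobs as `inter`, `fall`, and `rise`.

```go
package main

import (
	"net/http"
	"time"
)

// HealthChecker marks a backend unhealthy after Fall consecutive probe
// failures and healthy again after Rise consecutive successes.
type HealthChecker struct {
	URL      string        // probe endpoint, e.g. http://10.0.0.1:8080/healthz
	Timeout  time.Duration // per-probe timeout; should exceed worst-case GC pause
	Interval time.Duration // time between probes
	Fall     int           // consecutive failures before removal
	Rise     int           // consecutive successes before re-adding
	Healthy  bool

	successes, failures int
}

func (hc *HealthChecker) probe() bool {
	client := &http.Client{Timeout: hc.Timeout}
	resp, err := client.Get(hc.URL)
	if err != nil {
		return false
	}
	resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

// Run loops forever. With Timeout=5s and Fall=5, a 4-second GC pause
// costs at most one failed probe instead of removal from the pool.
func (hc *HealthChecker) Run() {
	for range time.Tick(hc.Interval) {
		if hc.probe() {
			hc.failures = 0
			if hc.successes++; hc.successes >= hc.Rise {
				hc.Healthy = true
			}
		} else {
			hc.successes = 0
			if hc.failures++; hc.failures >= hc.Fall {
				hc.Healthy = false
			}
		}
	}
}

func main() {
	hc := &HealthChecker{
		URL:     "http://10.0.0.1:8080/healthz",
		Timeout: 5 * time.Second, Interval: 10 * time.Second,
		Fall: 5, Rise: 3, Healthy: true,
	}
	hc.Run() // blocks; a real balancer runs one checker per backend
}
```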
Scenario 2: Connection Draining Failure During Deploy. A rolling deploy yanks backends without draining connections. WebSocket connections and long-polling requests drop immediately. For an e-commerce site mid-checkout, that means 2 to 5 percent of active transactions just fail. Detection: look for a spike in connection_reset errors correlated with deploy timestamps and track in_flight_requests_at_deregister. Fix: enforce a mandatory 60-second drain period in the deployment pipeline, set deregistration_delay on ALB target groups, and use pre-stop hooks in Kubernetes (sleep 15 before the container shuts down). This should be non-negotiable in any production deployment config.
Scenario 3: Asymmetric Load After AZ Failure. One of three availability zones goes down. Cross-zone balancing is off (common with NLB to save a bit of latency). The two surviving zones now take all the traffic, but the backends were provisioned per-zone, so each zone suddenly handles 50% instead of 33%. CPU spikes to 95%, latency degrades, and auto-scaling needs 3 to 5 minutes to spin up new instances. Detection: track per-AZ CPU, connection count, and latency. Alert on per_az_cpu > 80%. Fix: turn on cross-zone balancing for anything critical, even with the small latency hit (around 0.5ms). Provision each AZ to handle at least 50% of total traffic. This is N+1 AZ capacity planning, and it is not optional.
Capacity Planning
A single HAProxy instance can handle roughly 2M concurrent TCP connections and 300K HTTP RPS on modern hardware (8 vCPU, 16GB RAM). AWS NLB scales to millions of requests per second with no warm-up required. AWS ALB handles about 100K concurrent connections per node and auto-scales, but pre-warming is needed for sudden traffic spikes (request this through AWS support).
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Active connections | > 70% of max | > 85% of max |
| New connections/sec | > 60% of tested max | > 80% of tested max |
| Backend response time (P99) | > 500ms | > 2s |
| Error rate (5xx from backends) | > 0.5% | > 2% |
| Spillover count (ALB) | > 0 per minute | > 100 per minute |
Real-world numbers worth knowing: GitHub runs HAProxy at roughly 800K RPS with under 1ms added latency. Cloudflare runs custom L4 load balancers handling 40M RPS globally using Maglev-style consistent hashing. Netflix keeps around 700 backend servers per ELB cluster in peak regions. For capacity planning, use this formula: required_lb_capacity = peak_connections * 3 * avg_connection_duration / drain_timeout. And always load-test with realistic keep-alive ratios. In production, about 80% of connections are keep-alive, and that completely changes the concurrency model compared to short-lived benchmark connections.
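Plugging assumed numbers into that formula, as a quick sanity check. The inputs here are illustrative, not measurements, and the 3x multiplier reads as a headroom factor for failover and deploy churn.

```go
package main

import "fmt"

func main() {
	// Illustrative inputs; substitute values measured from real traffic.
	peakConnections := 200_000.0 // concurrent connections at peak
	avgConnDuration := 30.0      // seconds, dominated by keep-alive
	drainTimeout := 60.0         // seconds

	// required_lb_capacity = peak_connections * 3 * avg_connection_duration / drain_timeout
	required := peakConnections * 3 * avgConnDuration / drainTimeout
	fmt.Printf("provision for %.0f concurrent connections\n", required) // 300000
}
```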
Architecture Decision Record
ADR: L4 vs L7 Load Balancing
Context: Picking between transport-layer and application-layer load balancing affects latency, visibility, cost, and operational complexity.
| Criteria (Weight) | L4 (NLB / HAProxy TCP) | L7 (ALB / HAProxy HTTP) | Both (Tiered) |
|---|---|---|---|
| Latency (25%) | ~0.1ms added | ~1-5ms added | ~1-5ms total |
| Visibility (25%) | IP/port only | Full HTTP context | Full HTTP context |
| Cost per 1M requests (20%) | ~$0.006 (NLB) | ~$0.008 (ALB) | Combined |
| TLS termination (15%) | Pass-through or terminate | Terminate + re-encrypt | Flexible |
| Deployment features (15%) | None | Canary, blue-green, header routing | Full |
Decision framework:
- gRPC/HTTP2 services with identical backends AND no content routing needed. Go with L4 (NLB). Lowest latency, lowest cost, no HTTP parsing overhead. This is the right call for internal service-to-service traffic where request inspection isn't needed.
- Web applications AND API services that need path routing, canary deploys, or WAF. Use L7 (ALB or HAProxy HTTP mode). The routing flexibility and observability are worth the extra latency. Most teams should start here.
- More than 50 engineers OR traffic above 500K RPS OR multi-region. Use a tiered approach: L4 at the edge for DDoS absorption and TLS pass-through, L7 behind it for content routing. Google's architecture does exactly this with Maglev (L4) in front of GFE (L7). Each layer scales independently.
- Ultra-low-latency requirements (under 1ms added) AND more than 1M RPS. Look at DPDK-based or kernel-bypass load balancers like Katran (Meta) or Maglev (Google). These run at L4 with software-defined networking and hit line-rate packet forwarding. But be realistic: unless the operation is at hyperscaler scale with a dedicated networking team, this is more complexity than necessary.
Key Points
- Distributes incoming traffic across multiple backend servers to prevent overload
- L4 (transport) vs L7 (application): different layers, different tradeoffs
- Provides horizontal scaling, fault tolerance, and zero-downtime deployments
- Health checks automatically pull unhealthy backends out of the pool
- Session affinity (sticky sessions) vs stateless backends is an architectural decision that needs to be made early
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| AWS ALB/NLB | Managed | Cloud-native, auto-scaling integration | Small-Enterprise |
| HAProxy | Open Source | High-performance L4/L7, battle-tested | Medium-Enterprise |
| Envoy | Open Source | Service mesh, advanced observability | Large-Enterprise |
| NGINX | Open Source | Web server + reverse proxy + LB | Small-Enterprise |
Common Mistakes
- Using sticky sessions without thinking through the failure mode. When that backend dies, session data goes with it.
- Getting health check intervals wrong. Too fast and backends get overwhelmed, too slow and failover takes forever.
- Skipping connection draining during deployments, which drops in-flight requests on the floor.
- Running round robin across servers with different hardware specs. They all get equal load regardless of capacity.
- Forgetting about keep-alive connection imbalance. Long-lived connections skew distribution in surprising ways.