Load Balancing Algorithms
Load balancing algorithms determine how traffic is distributed across backends — from simple round robin to consistent hashing and P2C, the choice depends on workload characteristics and failure modes.
The Problem
Distributing traffic evenly sounds simple, but the wrong algorithm causes hotspots, cache misses, and cascading failures under real-world conditions.
Mental Model
Round robin is dealing cards — fair but blind. Least connections is joining the shortest checkout line. Consistent hashing is assigning each customer a personal cashier — stable mapping with minimal disruption when cashiers change.
How It Works
A load balancer's job sounds trivial: distribute incoming requests across a pool of backend servers. In practice, the algorithm selection determines whether the system handles failures gracefully or cascades into a full outage. The wrong algorithm turns a 10% capacity reduction into 100% unavailability.
Round Robin
The simplest algorithm. Requests are assigned to backends in order: A, B, C, A, B, C. No state is maintained. No decisions are made. The load balancer cycles through the list.
When it works: All backends are identical (same hardware, same capacity) and all requests are roughly equal cost. A fleet of stateless API servers handling uniform GET requests is the ideal case.
When it fails: If backend A is a 2-core VM and backend C is a 16-core VM, both get the same traffic. A is overwhelmed while C is idle. Or if some requests take 10ms and others take 10 seconds, round robin sends an equal number of slow requests to each backend, regardless of current load.
Weighted round robin partially addresses the capacity problem. Assign weights proportional to capacity: A=1, B=1, C=4 means C gets 4x the traffic. This works for static capacity differences but does not adapt to runtime load.
```
# HAProxy weighted round robin
backend api_servers
    balance roundrobin
    server api1 10.0.1.1:8080 weight 1
    server api2 10.0.1.2:8080 weight 1
    server api3 10.0.1.3:8080 weight 4
```
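The selection logic itself is only a few lines. Below is a minimal Python sketch of the "smooth" weighted round robin variant (the interleaving approach NGINX uses), so high-weight backends win more often without receiving bursts; the server names and weights mirror the HAProxy example above.

```python
def smooth_wrr(weights):
    """Smooth weighted round robin: each turn, every backend earns
    its weight in credit; the highest-credit backend is picked and
    pays back the total. Picks interleave instead of bursting."""
    credit = {name: 0 for name in weights}
    total = sum(weights.values())
    while True:
        for name, weight in weights.items():
            credit[name] += weight
        best = max(credit, key=credit.get)
        credit[best] -= total
        yield best

# Weights mirror the HAProxy config above.
picker = smooth_wrr({"api1": 1, "api2": 1, "api3": 4})
print([next(picker) for _ in range(6)])
# -> ['api3', 'api1', 'api3', 'api2', 'api3', 'api3']
```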
Least Connections
Instead of cycling blindly, the load balancer tracks the number of active connections to each backend and sends the next request to the backend with the fewest connections. This naturally adapts to backends with different processing speeds — a fast backend completes requests quickly, its connection count drops, and it receives more traffic.
When it works: Request processing time varies significantly (some requests are expensive, others are cheap). The algorithm automatically sends fewer requests to backends that are processing slowly.
When it fails: All requests are equal cost and backends are identical. The connection tracking overhead provides zero benefit over round robin. Also fails when backends use long-lived connections (WebSocket, gRPC streams) — the connection count does not reflect actual load.
Weighted least connections combines weights with connection tracking: the load balancer picks the backend with the lowest connections / weight ratio.
```nginx
# NGINX least connections
upstream api_servers {
    least_conn;
    server 10.0.1.1:8080;
    server 10.0.1.2:8080;
    server 10.0.1.3:8080;
}
```
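The weighted variant reduces to a one-line comparison. A minimal sketch, using a hypothetical snapshot of per-backend state:

```python
# Hypothetical snapshot of per-backend connection counts and weights.
backends = [
    {"name": "api1", "weight": 1, "active": 3},
    {"name": "api2", "weight": 1, "active": 5},
    {"name": "api3", "weight": 4, "active": 10},
]

# Pick the lowest connections/weight ratio: api3 wins despite having
# the most raw connections, because 10/4 = 2.5 is less than 3/1.
choice = min(backends, key=lambda b: b["active"] / b["weight"])
print(choice["name"])  # api3
```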
Least Response Time
A refinement of least connections. Instead of counting connections, the load balancer tracks the average response time of each backend and routes to the fastest one. This directly optimizes for latency.
NGINX Plus and HAProxy Enterprise support this. The algorithm picks the backend with the lowest active_connections * average_response_time product, balancing both load and speed.
The downside: response time measurement introduces a feedback loop. A backend that receives fewer requests (because it was briefly slow) has less data to measure, and its average may not reflect current performance. Implementations typically use exponentially weighted moving averages to handle this.
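A minimal sketch of that scoring logic, assuming a hypothetical Backend class that records latency samples; the EWMA update is the standard one, and the +1 keeps idle backends from scoring zero:

```python
class Backend:
    def __init__(self, name, alpha=0.2):
        self.name = name
        self.alpha = alpha     # EWMA smoothing factor: higher reacts faster
        self.ewma_ms = 1.0     # smoothed response time, seeded at 1ms
        self.active = 0        # in-flight requests

    def observe(self, latency_ms):
        # EWMA update: blend each new sample into the running average,
        # so a briefly slow backend recovers as fast samples arrive.
        self.ewma_ms = self.alpha * latency_ms + (1 - self.alpha) * self.ewma_ms

def pick(backends):
    # Score = (active + 1) * smoothed latency; lowest wins.
    return min(backends, key=lambda b: (b.active + 1) * b.ewma_ms)

pool = [Backend("api1"), Backend("api2")]
pool[0].observe(40.0)
pool[1].observe(8.0)
print(pick(pool).name)  # api2: the faster EWMA wins at equal load
```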
Consistent Hashing
Consistent hashing maps both backends and request keys to positions on a circular hash space (the "ring"). A request hashes to a position on the ring and routes to the nearest backend clockwise.
The critical property: when a backend is added or removed, only K/N keys move (K = total keys, N = number of backends). In contrast, modulo-based hashing (hash(key) % N) reshuffles nearly all keys when N changes.
```
Ring:     [0] ---- [Backend A at 0.25] ---- [Backend B at 0.50] ---- [Backend C at 0.75] ---- [1.0/0]
Request:  hash("user-123") = 0.33 → routes to Backend B (nearest clockwise)
Remove B: hash("user-123") = 0.33 → now routes to Backend C
          (only keys between A and B moved; everything else stays)
```
Virtual nodes solve the distribution problem. With 3 physical backends on a ring, the distribution is uneven (some backends cover a larger arc). By placing 100-200 virtual nodes per backend, the hash space distribution becomes nearly uniform.
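A minimal ring with virtual nodes fits in a few lines of Python; the hash function and vnode count below are illustrative choices:

```python
import bisect
import hashlib

def position(key: str) -> int:
    # Stable position on a 32-bit ring, derived from MD5.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

class HashRing:
    def __init__(self, backends, vnodes=150):
        # Each backend appears at `vnodes` positions, smoothing the arcs.
        self.ring = sorted(
            (position(f"{b}#{i}"), b) for b in backends for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    def route(self, key: str) -> str:
        # First vnode at or clockwise of the key, wrapping past the end.
        i = bisect.bisect_left(self.points, position(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["A", "B", "C"])
print(ring.route("user-123"))  # the same backend on every call
```

Removing a backend simply deletes its vnodes from the sorted list; keys in every other arc keep their mapping.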
When it works: Cache-heavy workloads (CDN, distributed cache). Consistent hashing ensures that the same user/key always hits the same backend, maximizing cache hit rates. When a backend fails, only its keys move — the other backends' caches remain warm.
When it fails: Without virtual nodes, distribution is wildly uneven. With hot keys (a single key receiving 10% of all traffic), consistent hashing concentrates that load on one backend. The solution is to add a "bounded load" variant that redistributes traffic when a backend exceeds a threshold.
```yaml
# Envoy ring hash configuration
clusters:
- name: cache_cluster
  lb_policy: RING_HASH
  ring_hash_lb_config:
    minimum_ring_size: 1024
    maximum_ring_size: 8388608
```
Power of Two Choices (P2C)
P2C is elegantly simple: for each request, pick two backends at random and send the request to whichever has fewer active connections (or lower load). That is the entire algorithm.
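It translates to Python almost verbatim; the in-flight counts here are a hypothetical load signal:

```python
import random

def p2c(backends, in_flight):
    # Sample two distinct backends, route to the less loaded one.
    a, b = random.sample(backends, 2)
    return a if in_flight[a] <= in_flight[b] else b

# Hypothetical in-flight counts per backend.
in_flight = {"api1": 12, "api2": 3, "api3": 7}
print(p2c(list(in_flight), in_flight))  # api2 wins in 2 of the 3 possible pairs
```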
The mathematical insight (from Mitzenmacher's thesis) is profound: choosing the least loaded of two random choices produces an exponential improvement in maximum load compared to random assignment. With pure random selection, the most loaded backend has O(log N / log log N) connections. With two choices, it drops to O(log log N) — a massive reduction.
Why Envoy chose P2C as the default: It requires almost no state (no global view of all backends), handles heterogeneous backends well, avoids thundering herd problems (unlike least-connections, which can send many concurrent requests to the same "least loaded" backend), and performs well even with stale load information.
```yaml
# Envoy least_request (P2C) — the default
clusters:
- name: api_cluster
  lb_policy: LEAST_REQUEST
  least_request_lb_config:
    choice_count: 2  # P2C
```
Maglev (Google)
Maglev is Google's consistent hashing algorithm designed for L4 load balancing. It uses a lookup table (default size: 65,537 entries, a prime number) that provides:
- Consistent mapping: Adding or removing a backend changes minimal table entries
- Uniform distribution: Each backend gets an equal share of the table (within 1%)
- O(1) lookup: Request routing is a single table lookup, no ring traversal
The lookup table is built by having each backend generate a permutation of table positions based on its hash. Backends fill the table in round-robin order according to their permutations. The result is a table where each position maps to exactly one backend, with near-perfect uniformity.
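The construction is compact enough to sketch directly from that description; the hash function below is an illustrative stand-in for the paper's two independent hashes:

```python
import hashlib

def h(name: str, salt: str) -> int:
    # Illustrative stand-in for Maglev's two independent hash functions.
    return int(hashlib.md5(f"{salt}:{name}".encode()).hexdigest(), 16)

def build_table(backends, m=65537):  # m must be prime
    n = len(backends)
    # Each backend's permutation of slots is (offset + j*skip) mod m;
    # because m is prime, every skip value visits all m slots.
    offset = [h(b, "offset") % m for b in backends]
    skip = [h(b, "skip") % (m - 1) + 1 for b in backends]
    nxt = [0] * n              # each backend's progress in its permutation
    table = [None] * m
    filled = 0
    while True:
        for i in range(n):     # backends claim slots in round-robin order
            slot = (offset[i] + nxt[i] * skip[i]) % m
            while table[slot] is not None:
                nxt[i] += 1    # skip slots another backend already claimed
                slot = (offset[i] + nxt[i] * skip[i]) % m
            table[slot] = backends[i]
            nxt[i] += 1
            filled += 1
            if filled == m:
                return table

table = build_table(["A", "B", "C"])
backend = table[h("user-123", "route") % 65537]  # O(1) lookup per request
```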
Maglev is specifically designed for environments where backends change frequently (rolling deployments, autoscaling) and consistent mapping is critical (maintaining TCP connections through backend changes).
Algorithm Performance Comparison
| Algorithm | Distribution | State Required | Backend Change Impact | Best For |
|---|---|---|---|---|
| Round Robin | Perfect (uniform) | None | Full redistribution | Identical backends, uniform requests |
| Weighted RR | Proportional to weights | Static weights | Full redistribution | Mixed-capacity backends |
| Least Connections | Adaptive to load | Per-backend connection count | Gradual rebalance | Variable request cost |
| Consistent Hash | Near-uniform (with vnodes) | Hash ring | Minimal (K/N keys move) | Cache locality, session affinity |
| P2C | Near-optimal | Minimal (sampled) | Graceful | General purpose, service mesh |
| Maglev | Uniform (within 1%) | Lookup table | Minimal | L4, high-throughput, Google-scale |
Failure Modes and Cascading
The most dangerous scenario for load balancing is a partial outage. One backend becomes slow (not dead — slow). Round robin continues sending 33% of traffic to the degraded backend. That traffic queues up, connections pool, and the load balancer's health check (which runs every 10 seconds) has not detected the issue yet.
Meanwhile, the two healthy backends are receiving traffic normally, but the 33% of traffic hitting the slow backend is timing out and being retried. Those retries add roughly a third more traffic to the system, and if they land on the two healthy backends, each is now carrying about 1.5x its normal load. If retries are aggressive, the healthy backends overload too, and a 33% capacity reduction becomes a 100% outage.
Mitigation strategies:
- Outlier detection. Track per-backend error rates and latency. Eject backends that exceed thresholds (Envoy does this automatically).
- Circuit breakers. Cap the maximum number of concurrent connections to any single backend. Once the limit is hit, fail fast instead of queueing.
- Retry budgets. Limit retries to a percentage of total traffic (e.g., 20%). This prevents retry storms from amplifying partial failures.
- Slow-start. When a new or recovered backend enters the pool, ramp up its traffic gradually over 30-60 seconds instead of sending a full share immediately.
```yaml
# Envoy outlier detection — eject bad backends automatically
clusters:
- name: api_cluster
  outlier_detection:
    consecutive_5xx: 5
    interval: 10s
    base_ejection_time: 30s
    max_ejection_percent: 50
```
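Outlier detection is configuration; a retry budget is simple enough to sketch directly. A minimal ratio-based version, using a hypothetical RetryBudget class and omitting the sliding time window real implementations use:

```python
class RetryBudget:
    """Allow retries only while they stay under a fixed fraction of
    request volume. Real implementations track a sliding window;
    this sketch uses running totals for clarity."""
    def __init__(self, ratio=0.2):
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def on_request(self):
        self.requests += 1

    def try_retry(self) -> bool:
        # Deny once retries would exceed 20% of traffic: failing fast
        # here is what stops a partial outage from amplifying.
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True
```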
Choosing the Right Algorithm
Start with P2C. It handles most workloads well, requires minimal configuration, avoids thundering herd, and degrades gracefully. This is why Envoy, gRPC, and many modern load balancers default to it.
Switch to consistent hashing when cache locality or session affinity is required. CDNs, distributed caches (Redis, Memcached), and stateful services benefit from routing the same key to the same backend.
Use round robin only when backends are identical and requests are uniform. It is the simplest to reason about and debug.
Use least connections when request processing time varies by orders of magnitude (e.g., a mix of fast reads and slow aggregation queries).
Use Maglev at L4 (TCP/UDP) where consistent hashing is needed at wire speed with near-uniform distribution. Most teams will never implement Maglev directly — it is embedded in Google's infrastructure and Envoy's Maglev LB implementation.
The algorithm matters less than the failure handling around it. A round-robin load balancer with good health checks, circuit breakers, and outlier detection will outperform a sophisticated algorithm with no failure handling every single time.
Key Points
- Round robin is the simplest algorithm and works well when all backends are identical and requests are roughly equal cost — it fails when backends differ in capacity or requests vary in weight
- Consistent hashing minimizes cache disruption when backends are added or removed — only K/N keys move (K = total keys, N = backends) instead of rehashing everything
- Power of Two Choices (P2C) picks two random backends and routes to the one with fewer connections — this simple approach produces near-optimal distribution and is Envoy's default
- Maglev (Google) uses a lookup table that provides consistent hashing with near-uniform distribution (within 1%) and O(1) lookup time — designed for L4 load balancing at scale
- The wrong algorithm causes cascading failures: round robin during a partial outage keeps sending traffic to slow backends, turning a degradation into a full outage
Key Components
| Component | Role |
|---|---|
| Backend Pool | The set of healthy server instances that the load balancer distributes traffic across — managed via health checks and service discovery |
| Health Check | Active probes (HTTP 200, TCP connect, gRPC health) that detect unhealthy backends and remove them from rotation before clients notice |
| Selection Algorithm | The core logic that picks which backend receives each request — round robin, least connections, consistent hashing, P2C, or Maglev |
| Hash Ring (Consistent Hashing) | A circular address space where backends and request keys are mapped to positions — requests route to the nearest backend clockwise, minimizing redistribution when backends change |
| Connection Tracking | State maintained per active connection for algorithms like least-connections — tracks how many open connections each backend has to make informed decisions |
When to Use
Use round robin for homogeneous backends with uniform request costs. Use least connections when request processing time varies significantly. Use consistent hashing when session affinity or cache locality matters. Use P2C as the default when unsure — it handles most workloads well with minimal overhead.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| HAProxy | Open Source | High-performance L4/L7 load balancing with round robin, least-connections, source hashing, and URI hashing built in | Millions of connections, single-node |
| Envoy | Open Source | Service mesh sidecar with P2C, ring hash, Maglev, and zone-aware load balancing — the Istio and AWS App Mesh default | Cloud-native, per-pod sidecar |
| NGINX | Open Source | L7 reverse proxy with round robin, least-connections, IP hash, and generic hash — the most deployed web server | Small to Enterprise |
| AWS ALB | Managed | Managed L7 load balancing with least outstanding requests, round robin, and tight integration with ECS/EKS target groups | Enterprise cloud |
Debug Checklist
- Check backend distribution: monitor requests-per-second per backend — skew beyond 20% indicates an algorithm or weight misconfiguration
- Monitor backend health: verify health checks are running and failing backends are removed within the expected interval (typically 10-30 seconds)
- Watch for hot backends: track CPU and latency per backend — consistent hashing can create hotspots if popular keys cluster on one node
- Test failover behavior: kill one backend and verify traffic redistributes without error spikes — slow redistribution suggests health check intervals are too long
- Profile connection reuse: check if keep-alive connections are causing least-connections to undercount — long-lived connections on one backend bias the algorithm
Common Mistakes
- Using round robin with backends that have different CPU or memory capacities. A 2-core instance gets the same traffic as a 16-core instance, overloading the smaller one.
- Implementing consistent hashing without virtual nodes. With N backends and no virtual nodes, the hash space distribution is wildly uneven — some backends get 3x the traffic.
- Choosing least-connections for stateless HTTP APIs where all requests are equal cost. The connection tracking overhead provides no benefit — round robin is simpler and equivalent.
- Not implementing slow-start for new backends. A freshly started instance that receives its full share of traffic immediately may overwhelm cold caches and connection pools.
- Ignoring request cost variation. Least-connections assumes all connections are equal — a backend with 10 lightweight GETs appears busier than one with 2 heavy report-generation queries.
Real World Usage
- Google designed Maglev for its L4 load balancers — every packet entering Google's network hits a Maglev instance that routes to a backend using a consistent hash lookup table
- Envoy (default for Istio and Lyft) uses P2C as its default algorithm, providing near-optimal distribution without the overhead of tracking all backend states
- Netflix uses weighted round robin with zone-aware routing to distribute traffic across AWS availability zones while respecting per-zone capacity limits
- Cloudflare uses consistent hashing at its edge to route requests to the same origin server, maximizing cache hit rates across its global CDN