Failure Detection Algorithms
The Impossible Distinction
A node has not responded in 5 seconds. Is it dead? Maybe. It could have crashed. The disk could have failed. The power supply could have blown. Or the node could be perfectly healthy but stuck in a 6-second garbage collection pause. Or the network switch between you and the node could be dropping packets. Or the node's CPU could be pegged at 100% processing a backlog and it simply has not gotten around to responding yet.
From the outside, all of these look identical: silence. And this is the fundamental problem that makes failure detection in distributed systems genuinely hard. There is no way to distinguish "dead" from "very slow" using only network messages. Any timeout you pick is a guess, and every guess can be wrong in both directions.
Binary Failure Detectors: Simple and Fragile
The simplest approach: every node sends a heartbeat every T seconds. If you have not heard from a node in 2T seconds (or 3T, or whatever multiple you pick), declare it dead.
This works fine on a local network with dedicated hardware where latency is sub-millisecond and jitter is negligible. Set T to 1 second, declare dead after 3 seconds of silence, and you will catch real failures quickly with very few false positives.
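In code, a binary detector is little more than a timestamp per node and a comparison. Here is a minimal sketch; the 3T multiple and the names are illustrative, not from any particular system:

```python
import time

HEARTBEAT_PERIOD_S = 1.0                  # T: how often each node sends a heartbeat
DEAD_AFTER_S = 3 * HEARTBEAT_PERIOD_S     # declare dead after 3T of silence

class FixedTimeoutDetector:
    def __init__(self):
        self.last_seen = {}  # node id -> timestamp of most recent heartbeat

    def record_heartbeat(self, node_id):
        self.last_seen[node_id] = time.monotonic()

    def is_alive(self, node_id):
        last = self.last_seen.get(node_id)
        # A node we have never heard from is treated as dead here; real systems
        # usually give newly joined nodes a grace period instead.
        return last is not None and (time.monotonic() - last) < DEAD_AFTER_S
```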
It breaks in the real world.
Cloud VMs share physical hosts. Network virtualization adds latency and jitter. A noisy neighbor running a CPU-intensive job can delay your heartbeat processing. Cross-availability-zone traffic goes through more switches and has higher variance. Cross-region traffic has baseline latencies of 50-200ms with occasional spikes.
A fixed 3-second timeout that works in us-east-1a will cause constant false positives when a node in us-west-2 is 200ms away and occasionally spikes to 2 seconds during congestion. You can increase the timeout to 10 seconds, but then actual failures take 10 seconds to detect, and during that 10 seconds, every request routed to the dead node fails.
What you really want is a failure detector that adapts to the network conditions it observes.
The Phi Accrual Failure Detector
The phi accrual failure detector (Hayashibara et al., 2004) takes a fundamentally different approach. Instead of outputting a binary alive/dead, it outputs a continuous value called phi that represents how suspicious the current silence is, given what we have seen before.
Here is how it works.
Track every heartbeat arrival. Specifically, track the interval between consecutive heartbeats. Over time, you build a distribution of these intervals. On a local network, the distribution might be centered around 1000ms (the heartbeat period) with a standard deviation of 5ms. On a cross-region link, the center might be 1000ms but the standard deviation might be 100ms.
When a heartbeat is overdue, compute how unusual the current delay is given the observed distribution. If heartbeats typically arrive every 1000ms with a standard deviation of 5ms, and the current heartbeat is 50ms late, that is 10 standard deviations above the mean, which is extremely unusual. If heartbeats arrive every 1000ms with a standard deviation of 100ms, a 50ms delay is totally normal: less than 1 standard deviation from the mean.
The phi value is defined as -log10(P(delay >= current_delay)). If the probability of this much delay (or more) is 1 in 1000, then phi = 3. If it is 1 in 1,000,000, phi = 6. Higher phi means more suspicious.
The application sets a threshold. Cassandra uses phi = 8, which means declaring a node dead when the probability of the observed delay under normal conditions is less than 1 in 10^8 (one in a hundred million). That is conservative enough to avoid false positives from network jitter while still detecting actual failures within seconds.
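A minimal sketch of the mechanism, assuming a normal model over a sliding window of observed intervals; the class and parameter names are illustrative, not Cassandra's implementation:

```python
import math
import time
from collections import deque

class PhiAccrualDetector:
    def __init__(self, window=1000, threshold=8.0):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times (seconds)
        self.last_arrival = None
        self.threshold = threshold

    def heartbeat(self):
        now = time.monotonic()
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now=None):
        if self.last_arrival is None or len(self.intervals) < 2:
            return 0.0  # warm-up: not enough samples, report "not suspicious"
        now = time.monotonic() if now is None else now
        elapsed = now - self.last_arrival
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        stddev = max(math.sqrt(var), 1e-6)  # avoid division by zero on a perfectly regular link
        # P(interval >= elapsed) under a normal model of the observed intervals
        p_later = 0.5 * math.erfc((elapsed - mean) / (stddev * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))  # clamp to avoid log(0)

    def suspect(self):
        return self.phi() > self.threshold
```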
Why Phi Accrual Adapts Automatically
The magic of phi accrual is that the same threshold works across radically different network conditions. On a low-latency local network with tight heartbeat intervals, a 200ms delay produces a very high phi because it is far from the observed mean. On a high-latency cross-region link with loose heartbeat intervals, a 200ms delay produces a low phi because that kind of variation is normal.
You do not need to tune timeouts per network path. The failure detector learns the baseline automatically from the heartbeat history and measures deviations from that baseline. Deploy the same code in a data center and across continents, and it adapts.
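To make the adaptation concrete, here is the same 200ms-overdue heartbeat evaluated against two hypothetical interval distributions using the phi formula above:

```python
import math

def phi(elapsed_ms, mean_ms, stddev_ms):
    # phi = -log10(P(interval >= elapsed)) under a normal model
    p_later = 0.5 * math.erfc((elapsed_ms - mean_ms) / (stddev_ms * math.sqrt(2)))
    return -math.log10(max(p_later, 1e-300))

# Heartbeat is 200ms overdue: 1200ms since the last one, with a 1000ms period.
print(phi(1200, 1000, 5))    # tight local link (stddev 5ms): phi clamps at 300 -> clearly suspect
print(phi(1200, 1000, 100))  # jittery cross-region link (stddev 100ms): phi ~ 1.6, well below 8
```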
There are practical considerations. The distribution estimate needs a warm-up period. The first few heartbeats do not give you a reliable mean and variance, so most implementations use a default window of 100-1000 samples and fall back to a fixed timeout during startup. The distribution is typically modeled as a normal distribution, which works well enough for heartbeat intervals. Some implementations use an exponential distribution instead, which is a better fit for inter-arrival times but gives similar results in practice.
Cassandra's implementation maintains a sliding window of the most recent 1000 heartbeat intervals. This allows the detector to adapt to changing network conditions: if cross-region latency increases due to a network change, the distribution updates and the detector adjusts its sensitivity.
SWIM: Scalable Failure Detection
The approaches above assume some form of all-to-all heartbeating: every node monitors every other node. With N nodes, that is O(N^2) heartbeat messages per interval. At 20 nodes, that is 380 heartbeats per second (assuming 1-second intervals), which is fine. At 200 nodes, it is 39,800. At 2000 nodes, it is nearly 4 million. The network overhead becomes the bottleneck.
SWIM (Scalable Weakly-consistent Infection-style Membership) takes a different approach. Each node, during each protocol period, picks one random peer and sends it a direct probe (ping). If the peer responds, great. If not, the node picks k other random peers and asks them to probe the unresponsive node on its behalf (indirect probing).
If neither the direct probe nor any of the k indirect probes get a response, the node marks the target as suspected. After a configurable suspicion timeout, the suspected node is declared dead.
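A sketch of one SWIM protocol period from a single node's perspective; the transport callbacks and state map are hypothetical placeholders, and real implementations layer piggybacked gossip and refutation on top of this core:

```python
import random

K_INDIRECT = 3                          # number of peers asked to probe indirectly
ALIVE, SUSPECT, DEAD = "alive", "suspect", "dead"

def protocol_period(self_id, peers, state, send_ping, send_ping_req):
    """Run one SWIM period: probe one random peer, fall back to indirect probes.

    send_ping(target) -> bool            : direct probe, True if an ack arrived in time
    send_ping_req(via, target) -> bool    : ask `via` to probe `target` on our behalf
    """
    candidates = [p for p in peers if p != self_id and state.get(p) != DEAD]
    if not candidates:
        return
    target = random.choice(candidates)

    if send_ping(target):
        state[target] = ALIVE
        return

    # Direct probe failed: ask k other random peers to probe the target for us.
    helpers = [p for p in candidates if p != target]
    for via in random.sample(helpers, min(K_INDIRECT, len(helpers))):
        if send_ping_req(via, target):
            state[target] = ALIVE   # someone else can reach it; our own path is the problem
            return

    # No ack from any path: mark suspected. A separate suspicion timeout later
    # promotes SUSPECT to DEAD unless the node refutes the suspicion.
    state[target] = SUSPECT
```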
Total messages per period: each of the N nodes sends 1 direct probe and at most k indirect probe requests. That is O(N) total messages. The constant factor is small: with k = 3, each node originates at most 4 messages per period. At 2000 nodes, that is about 8000 messages per period instead of nearly 4 million.
SWIM has another property: failure detection time is bounded and independent of cluster size. If the protocol period is 1 second and the suspicion timeout is 5 seconds, a failed node is detected within about 6 seconds regardless of whether the cluster has 10 nodes or 10,000.
HashiCorp's memberlist library implements SWIM and is used by Consul, Nomad, and Serf. It adds some optimizations to the basic SWIM protocol: piggybacking membership updates on ping/ack messages (reducing the number of dedicated gossip messages), suspicion subperiods (giving suspected nodes a chance to refute before being declared dead), and compound messages (batching multiple protocol messages into a single UDP packet).
Indirect Probing: Why It Matters
Direct probing between two specific nodes is vulnerable to asymmetric network issues. Maybe the link between Node A and Node B is congested, but Node B is perfectly reachable from Node C. If A can only probe B directly, A will falsely suspect B.
Indirect probing routes around this. A asks C to check on B. C pings B successfully and reports back to A. Now A knows B is alive despite A's own inability to reach B. This significantly reduces false positives from localized network issues.
The value of k (the number of indirect probers) is a tradeoff. Higher k means more redundant checks and fewer false positives, but more network messages. In practice, k = 3 works well. The probability that all 3 indirect probers also cannot reach a healthy node is very low unless there is a genuine partition isolating that node.
Adaptive Timeouts: The TCP Approach
TCP has been solving a version of this problem for decades. The retransmission timeout (RTO) adapts to network conditions using an exponentially weighted moving average (EWMA) of round-trip times plus a safety margin.
The Jacobson/Karels algorithm (which underpins TCP timeout estimation) tracks:
- SRTT (smoothed round-trip time): an EWMA of observed RTTs
- RTTVAR (RTT variance): an EWMA of RTT deviations from the mean
- RTO = SRTT + 4 * RTTVAR
The 4 * RTTVAR term is the safety margin. On a stable network with low variance, the safety margin is small and timeouts are tight. On a jittery network, the margin grows automatically.
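A sketch of the estimator using the standard gains from RFC 6298 (alpha = 1/8, beta = 1/4); the class name is illustrative:

```python
class RtoEstimator:
    """Adaptive timeout via EWMAs of RTT and RTT deviation (Jacobson/Karels)."""
    ALPHA = 1 / 8   # gain for the smoothed RTT
    BETA = 1 / 4    # gain for the RTT deviation estimate

    def __init__(self):
        self.srtt = None
        self.rttvar = None

    def observe(self, rtt):
        if self.srtt is None:
            # First sample seeds the estimators, per RFC 6298.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

    def rto(self):
        # RFC 6298 also applies a minimum RTO and a clock-granularity clamp, omitted here.
        return self.srtt + 4 * self.rttvar  # the safety margin grows with jitter
```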
This approach works well for point-to-point connections where you have frequent round-trip measurements. It is less suitable for cluster-wide failure detection because each node pair would need its own EWMA state, which gets back to the O(N^2) problem.
Some systems (like gRPC's health checking) use adaptive timeouts for individual connections while relying on a separate mechanism (like SWIM or a gossip protocol) for cluster-wide membership.
The GC Pause Problem
Garbage-collected runtimes (JVM, .NET, Go to a lesser extent) introduce a class of failure that no network-level detector can distinguish from actual death. During a stop-the-world GC pause, the application cannot process heartbeats, respond to probes, or send its own heartbeats. To every other node, it looks exactly dead.
JVM-based systems like Cassandra, Elasticsearch, and Kafka can experience GC pauses from hundreds of milliseconds to tens of seconds, depending on heap size, GC algorithm, and allocation rate. G1GC on a 32GB heap might pause for 200ms routinely and 2-5 seconds under pressure. CMS or ZGC reduce pause times but do not eliminate them.
The failure detector timeout must accommodate worst-case GC pauses. If your JVM can pause for 5 seconds, your failure detection timeout must be at least 5 seconds, plus some margin for network delay. This means actual crashes take at least 5+ seconds to detect.
Some systems work around this with separate heartbeat processes that run outside the JVM. The heartbeat daemon is a lightweight process (or a native thread in the JVM using off-heap memory) that continues sending heartbeats even during GC. If the heartbeat daemon is alive but the application is paused, the node is not declared dead, and requests are queued until the application resumes.
Kubernetes takes a different approach entirely. The kubelet sends heartbeats to the API server via node leases (stored in etcd). The default node monitor grace period is 40 seconds, which is deliberately long to avoid false positives from transient issues. For applications that need faster failure detection, liveness probes run on a per-pod basis with configurable timeouts.
Designing for False Positives
No matter how good your failure detector is, false positives will happen. A node declared dead that is actually alive will cause problems: its work might be reassigned to another node (duplicate processing), it might be evicted from the cluster (unnecessary rebalancing), and when it comes back, there is confusion about who owns what.
The best systems design for graceful false positives. Fencing tokens prevent a "zombie" node (declared dead but still alive) from making conflicting writes. Idempotent operations ensure that duplicate processing is harmless. Suspicion states (SWIM's approach) give nodes a grace period to refute their own death notice before the cluster acts on it.
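A minimal sketch of resource-side fencing, assuming a hypothetical store that records the highest token it has seen and rejects anything older:

```python
class FencedStore:
    """Resource-side fencing: reject writes from stale ('zombie') owners."""
    def __init__(self):
        self.highest_token = -1
        self.data = {}

    def write(self, key, value, fencing_token):
        if fencing_token < self.highest_token:
            # A newer owner has already written. This writer was declared dead and
            # its work reassigned, so its late writes must not be applied.
            raise PermissionError("stale fencing token")
        self.highest_token = fencing_token
        self.data[key] = value
```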
Cassandra handles this particularly well. When the phi accrual detector suspects a node, the coordinator stops routing new requests to it but does not immediately remove it from the ring. If the node recovers and sends a heartbeat, it is reinstated without any data movement. Only after an extended absence (or explicit operator action) is the node fully removed and its data redistributed. The distinction between "suspected" and "confirmed dead" is crucial in production.
Choosing the Right Approach
For small clusters (under 20 nodes) with stable network conditions, simple heartbeats with a generous timeout work fine. The implementation is trivial, debugging is straightforward, and the O(N^2) message overhead is negligible.
For medium clusters (20-200 nodes), phi accrual failure detection gives you adaptive behavior without the complexity of SWIM. Cassandra and Akka both prove this works well at this scale.
For large clusters (200+ nodes), SWIM or a SWIM-based protocol is the way to go. The O(N) message overhead and bounded detection time make it viable at scales where all-to-all heartbeating breaks down.
For cross-region deployments, phi accrual is almost mandatory. Fixed timeouts simply cannot handle the latency variance between a 1ms local link and a 200ms cross-region hop. The phi accrual detector handles both with the same threshold.
Whatever you choose, instrument it. Track false positive rates. Track detection latency for real failures. Track the phi values (or equivalent) across your cluster to understand the baseline and spot network degradation before it causes false positives.
Key Points
- •You cannot distinguish a crashed node from a slow one. Every failure detector is a tradeoff between detection speed and false positive rate. Faster detection means more healthy nodes incorrectly declared dead
- •The phi accrual failure detector outputs a continuous suspicion level instead of a binary alive/dead. It adapts to network conditions automatically by tracking heartbeat interval distribution. Cassandra uses it with phi threshold of 8
- •SWIM failure detection achieves O(N) total messages per round instead of O(N^2) all-to-all heartbeats. Each node probes one random peer, then uses indirect probing through k other nodes if the direct probe fails
- •Fixed-timeout heartbeat detectors break in cloud environments where latency varies. A 200ms timeout that works in us-east-1 causes constant false positives on cross-region links with 150ms baseline latency
- •JVM garbage collection pauses of 5-30 seconds will trigger any aggressive failure detector. Systems running on the JVM need failure detection timeouts that accommodate worst-case GC pauses
Common Mistakes
- ✗Using fixed heartbeat timeouts in cloud environments where network latency varies significantly between zones and regions
- ✗Not accounting for GC pauses in JVM-based systems. A 10-second stop-the-world GC in Cassandra or Elasticsearch will look exactly like a dead node to the failure detector
- ✗Having every node heartbeat every other node. With N nodes sending heartbeats to N-1 others, you get O(N^2) network messages. This does not scale past a few dozen nodes
- ✗Setting the phi accrual threshold too low. With phi defined as -log10 of the false positive probability, phi=4 tolerates roughly a 1 in 10,000 chance per evaluation, while phi=8 pushes it to roughly 1 in 100 million. Across thousands of evaluations per node per hour, that is the difference between regular false alarms and rare ones