Heartbeat & Leader Election
Heartbeat: each instance periodically writes 'I'm alive' to shared storage. Leader election: instances compete to become the single coordinator; loser instances watch the leader's heartbeat. When the leader's heartbeat stops, a new election runs. Backed by ZooKeeper, etcd, Consul, or a simpler Redis lease.
What it is
Heartbeat: each instance periodically writes a "still alive" signal to shared storage. Other instances (or operators) read these signals to know who's running.
Leader election: a special case of heartbeat for coordination. Multiple instances compete to become the single "leader" responsible for some task (running a scheduled job, owning a partition, managing failover). The winner holds a leader key with a TTL and renews it periodically; if the leader dies, the key expires and another instance takes over.
These patterns appear in: schedulers (only one runs a job), partitioned services (one owner per partition), database failover (one primary writer), service discovery (which instances are healthy).
The mechanics
Election is built on a single primitive: atomic compare-and-set with TTL. Every instance tries to set the leader key with NX (fail if exists). Whoever wins is the leader. Losers retry periodically.
The leader must renew the key before TTL expires. If renewal fails (because the lease was stolen or the storage went down), leadership is lost; the leader stops doing leader work; some other instance will eventually win the next election.
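As a concrete illustration, here is a minimal sketch of that acquire/renew cycle using the Jedis client against a single Redis instance; the LeaderLease class, key name, and 15-second TTL are illustrative, not part of any library.

// Acquire/renew sketch with Jedis. Key name, value, and TTL are illustrative.
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;
import java.util.List;

class LeaderLease {
    private static final String KEY = "leader:report-job";  // illustrative key name
    private static final int TTL_SECONDS = 15;

    private final Jedis redis;
    private final String instanceId;  // unique per instance

    LeaderLease(Jedis redis, String instanceId) {
        this.redis = redis;
        this.instanceId = instanceId;
    }

    // Election: SET key value NX EX 15 succeeds only if the key does not exist yet.
    boolean tryAcquire() {
        String reply = redis.set(KEY, instanceId, SetParams.setParams().nx().ex(TTL_SECONDS));
        return "OK".equals(reply);
    }

    // Renewal: extend the TTL only if we still own the key (check-and-extend via Lua),
    // so a lease taken over by another instance is never silently re-extended.
    boolean renew() {
        String script =
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "  return redis.call('expire', KEYS[1], ARGV[2]) " +
            "else return 0 end";
        Object reply = redis.eval(script, List.of(KEY),
                                  List.of(instanceId, String.valueOf(TTL_SECONDS)));
        return Long.valueOf(1L).equals(reply);
    }
}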
Lease sizing
Two clocks: the lease TTL (how long the leader has before its claim expires) and the renewal interval (how often the leader extends the lease).
Rule of thumb: renew interval × 3 ≤ lease TTL. With renewal every 5s and lease of 15s, the lease tolerates two missed renewals before expiring. This rides through transient network blips without leadership flapping.
Shorter lease = faster failover when the leader dies, but more flapping under instability. Typical production values: 15-30s lease, 5-10s renewal interval.
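If these two values come from configuration, it is cheap to enforce the rule of thumb at startup; a minimal sketch (variable names are illustrative):

import java.time.Duration;

// Illustrative defaults: 15s lease, 5s renewal -> two missed renewals are tolerated.
Duration leaseTtl = Duration.ofSeconds(15);
Duration renewInterval = Duration.ofSeconds(5);

if (leaseTtl.compareTo(renewInterval.multipliedBy(3)) < 0) {
    throw new IllegalArgumentException(
        "lease TTL should be at least 3x the renewal interval to ride out transient blips");
}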
Split-brain
The hard problem.
Two instances both think they're leader. Happens when:
- Network partition: leader can't reach the coordination service for renewal but is otherwise fine. Lease expires; new leader takes over. Network heals; old leader is still running, still thinks it's leader.
- GC pause / VM pause: leader is paused for longer than the lease. Same outcome.
- Clock drift: leader's local clock disagrees with the coordination service, so it believes its lease is still valid after it has actually expired.
Detecting split-brain after the fact is hard. Preventing the resulting damage is the right goal: fencing tokens.
Fencing tokens
Each leadership acquisition gets a monotonically increasing token (the etcd revision, the ZK sequence number, an INCR'd counter in Redis). The leader passes the token on every protected operation. Downstream resources track the largest token they've seen and reject operations with smaller tokens.
The stale leader's writes have an old token; rejected. The new leader's writes have a larger token; accepted. Even if both leaders are simultaneously writing, only the latest one wins.
This requires support from the downstream resources. Databases can enforce via row-level fencing columns. Object stores can enforce via conditional puts. Bare files cannot.
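As a sketch of what enforcement can look like on the database side, using plain JDBC; the job_state table and fencing_token column are illustrative, not a standard schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// The row remembers the largest token it has accepted; a write only applies when the
// incoming token is at least that large, so a stale leader's write updates zero rows.
boolean fencedWrite(Connection db, long jobId, String payload, long fencingToken) throws SQLException {
    String sql = "UPDATE job_state "
               + "   SET payload = ?, fencing_token = ? "
               + " WHERE job_id = ? AND fencing_token <= ?";
    try (PreparedStatement stmt = db.prepareStatement(sql)) {
        stmt.setString(1, payload);
        stmt.setLong(2, fencingToken);
        stmt.setLong(3, jobId);
        stmt.setLong(4, fencingToken);
        return stmt.executeUpdate() == 1;  // 0 rows means a newer leader got there first
    }
}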
What every implementation should have
Three guarantees:
- Lease renewal is on the critical path of being the leader. If renewal fails, the leader treats leadership as lost immediately and cancels its work.
- Work checks the leadership context. A long-running operation that started while this instance was leader checks "am I still leader?" before each step; if not, it stops (sketched after this list).
- Downstream operations carry a fencing token. Resources reject stale-leader writes.
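A minimal sketch of the second guarantee; isStillLeader would typically be backed by the election library's state, for example an AtomicBoolean flipped in onStartedLeading/onStoppedLeading:

import java.util.List;
import java.util.function.BooleanSupplier;

// Re-check leadership before every step of a long-running job; stop as soon as it is lost.
void runAsLeader(BooleanSupplier isStillLeader, List<Runnable> steps) {
    for (Runnable step : steps) {
        if (!isStillLeader.getAsBoolean()) {
            return;  // leadership lost mid-run: the new leader owns the remaining steps
        }
        step.run();
    }
}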
Without these, leadership is just a hint. Bug reports come back as "we ran the job twice" or "the data is corrupted by a zombie process".
When to use which backing store
For Kubernetes-native services: use the Lease API. Lightweight, integrated, the right primitive for the platform.
For other services in a coordinator-managed environment: ZK, etcd, or Consul. They give correct primitives (linearisable writes; ephemeral nodes or leases) and natural fencing tokens (sequence numbers, revisions).
For ad-hoc cases where adding a coordinator is overkill: Redis with NX-EX, knowing the limitations. Add fencing tokens if correctness matters.
For high-stakes leadership (primary failover, write coordination): the coordinator approach. The operational cost is justified by avoiding split-brain incidents.
Implementations
Kubernetes provides a Lease resource (coordination.k8s.io) designed for this. The object is stored in etcd behind the API server; client libraries handle the acquire, renewal, and watch logic. Standard practice for Kubernetes-native controllers.
// Using the official Java client for Kubernetes. Assumes instanceId is a unique
// per-instance string and apiClient is an already-configured
// io.kubernetes.client.openapi.ApiClient.
import io.kubernetes.client.extended.leaderelection.LeaderElectionConfig;
import io.kubernetes.client.extended.leaderelection.LeaderElector;
import io.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;
import java.time.Duration;

// The Lease object "my-controller-leader" in namespace "default" is the lock.
LeaseLock lock = new LeaseLock("default", "my-controller-leader", instanceId, apiClient);

LeaderElectionConfig config = new LeaderElectionConfig(
    lock,
    Duration.ofSeconds(30),   // lease duration
    Duration.ofSeconds(20),   // renew deadline
    Duration.ofSeconds(5)     // retry period
);

LeaderElector elector = new LeaderElector(config);
elector.run(
    () -> runAsLeader(),      // onStartedLeading
    () -> stopWork()          // onStoppedLeading
);

Key points
- Heartbeat: leader writes a key with TTL every N seconds; if it stops writing, the key expires.
- Election: candidates try to atomically claim the leader key. The winner is the leader. Losers watch.
- Lease TTL ≥ heartbeat interval × 3: tolerates transient network blips without flapping.
- Always design for split-brain: two instances that both think they are leader. Fencing tokens prevent corruption.
- ZooKeeper, etcd, Consul, Kubernetes leases give correct primitives. Redis with TTL is simpler but has known edge cases.
Follow-up questions
- What is split-brain and how does it happen?
- ZooKeeper / etcd / Consul vs Redis for election?
- How do I size the lease TTL?
- Does the leader need to know it's still the leader?
Gotchas
- TTL too short: harmless network blip causes leadership flap
- Renewing on a separate connection: connection drops independently, lease lost while leader is fine
- Doing work after losing leadership: the new leader is also writing; conflict
- No fencing token: stale leader corrupts data after lease expiry
- Single-region election with no fencing: cross-region partition can have two regional leaders