Heartbeat & Leader Election
Heartbeat: each instance periodically writes 'I'm alive' to shared storage. Leader election: instances compete to become the single coordinator; loser instances watch the leader's heartbeat. When the leader's heartbeat stops, a new election runs. Backed by ZooKeeper, etcd, Consul, or a simpler Redis lease.
What it is
Heartbeat: each instance periodically writes a "still alive" signal to shared storage. Other instances (or operators) read these signals to know who's running.
Leader election: a special case of heartbeat for coordination. Multiple instances compete to become the single "leader" responsible for some task (running a scheduled job, owning a partition, managing failover). The winner holds a leader key with a TTL and renews it periodically; if the leader dies, the key expires and another instance takes over.
These patterns appear in: schedulers (only one runs a job), partitioned services (one owner per partition), database failover (one primary writer), service discovery (which instances are healthy).
The mechanics
Election is built on a single primitive: atomic compare-and-set with TTL. Every instance tries to set the leader key with NX (fail if exists). Whoever wins is the leader. Losers retry periodically.
The leader must renew the key before TTL expires. If renewal fails (because the lease was stolen or the storage went down), leadership is lost; the leader stops doing leader work; some other instance will eventually win the next election.
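As a concrete illustration, here is a minimal sketch of that acquire/renew cycle using the Jedis client against a single Redis instance; the LeaderLease class, key name, and 15-second TTL are illustrative, not part of any library.

// Acquire/renew sketch with Jedis. Key name, value, and TTL are illustrative.
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;
import java.util.List;

class LeaderLease {
    private static final String KEY = "leader:report-job";  // illustrative key name
    private static final int TTL_SECONDS = 15;

    private final Jedis redis;
    private final String instanceId;  // unique per instance

    LeaderLease(Jedis redis, String instanceId) {
        this.redis = redis;
        this.instanceId = instanceId;
    }

    // Election: SET key value NX EX 15 succeeds only if the key does not exist yet.
    boolean tryAcquire() {
        String reply = redis.set(KEY, instanceId, SetParams.setParams().nx().ex(TTL_SECONDS));
        return "OK".equals(reply);
    }

    // Renewal: extend the TTL only if we still own the key (check-and-extend via Lua),
    // so a lease taken over by another instance is never silently re-extended.
    boolean renew() {
        String script =
            "if redis.call('get', KEYS[1]) == ARGV[1] then " +
            "  return redis.call('expire', KEYS[1], ARGV[2]) " +
            "else return 0 end";
        Object reply = redis.eval(script, List.of(KEY),
                                  List.of(instanceId, String.valueOf(TTL_SECONDS)));
        return Long.valueOf(1L).equals(reply);
    }
}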
Lease sizing
Two clocks: the lease TTL (how long the leader has before its claim expires) and the renewal interval (how often the leader extends the lease).
Rule of thumb: renew interval × 3 ≤ lease TTL. With renewal every 5s and lease of 15s, the lease tolerates two missed renewals before expiring. This rides through transient network blips without leadership flapping.
Shorter lease = faster failover when the leader dies, but more flapping under instability. Typical production values: 15-30s lease, 5-10s renewal interval.
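If these two values come from configuration, it is cheap to enforce the rule of thumb at startup; a minimal sketch (variable names are illustrative):

import java.time.Duration;

// Illustrative defaults: 15s lease, 5s renewal -> two missed renewals are tolerated.
Duration leaseTtl = Duration.ofSeconds(15);
Duration renewInterval = Duration.ofSeconds(5);

if (leaseTtl.compareTo(renewInterval.multipliedBy(3)) < 0) {
    throw new IllegalArgumentException(
        "lease TTL should be at least 3x the renewal interval to ride out transient blips");
}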
Split-brain
The hard problem.
Two instances both think they're leader. Happens when:
- Network partition: leader can't reach the coordination service for renewal but is otherwise fine. Lease expires; new leader takes over. Network heals; old leader is still running, still thinks it's leader.
- GC pause / VM pause: leader is paused for longer than the lease. Same outcome.
- Clock drift: leader's local clock disagrees with the coordination service, so it believes its lease is still valid after it has actually expired.
Detecting split-brain after the fact is hard. Preventing the resulting damage is the right goal: fencing tokens.
Fencing tokens
Each leadership acquisition gets a monotonically increasing token (the etcd revision, the ZK sequence number, an INCR'd counter in Redis). The leader passes the token on every protected operation. Downstream resources track the largest token they've seen and reject operations with smaller tokens.
The stale leader's writes have an old token; rejected. The new leader's writes have a larger token; accepted. Even if both leaders are simultaneously writing, only the latest one wins.
This requires support from the downstream resources. Databases can enforce via row-level fencing columns. Object stores can enforce via conditional puts. Bare files cannot.
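As a sketch of what enforcement can look like on the database side, using plain JDBC; the job_state table and fencing_token column are illustrative, not a standard schema:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// The row remembers the largest token it has accepted; a write only applies when the
// incoming token is at least that large, so a stale leader's write updates zero rows.
boolean fencedWrite(Connection db, long jobId, String payload, long fencingToken) throws SQLException {
    String sql = "UPDATE job_state "
               + "   SET payload = ?, fencing_token = ? "
               + " WHERE job_id = ? AND fencing_token <= ?";
    try (PreparedStatement stmt = db.prepareStatement(sql)) {
        stmt.setString(1, payload);
        stmt.setLong(2, fencingToken);
        stmt.setLong(3, jobId);
        stmt.setLong(4, fencingToken);
        return stmt.executeUpdate() == 1;  // 0 rows means a newer leader got there first
    }
}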
What every implementation should have
Three guarantees:
- Lease renewal is on the critical path of being the leader. If renewal fails, the leader treats leadership as lost immediately and cancels its work.
- Work checks the leadership context. A long-running operation that started while this instance was leader checks "am I still leader?" before each step; if not, it stops (sketched after this list).
- Downstream operations carry a fencing token. Resources reject stale-leader writes.
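A minimal sketch of the second guarantee; isStillLeader would typically be backed by the election library's state, for example an AtomicBoolean flipped in onStartedLeading/onStoppedLeading:

import java.util.List;
import java.util.function.BooleanSupplier;

// Re-check leadership before every step of a long-running job; stop as soon as it is lost.
void runAsLeader(BooleanSupplier isStillLeader, List<Runnable> steps) {
    for (Runnable step : steps) {
        if (!isStillLeader.getAsBoolean()) {
            return;  // leadership lost mid-run: the new leader owns the remaining steps
        }
        step.run();
    }
}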
Without these, leadership is just a hint. Bug reports come back as "we ran the job twice" or "the data is corrupted by a zombie process".
When to use which backing store
For Kubernetes-native services: use the Lease API. Lightweight, integrated, the right primitive for the platform.
For other services in a coordinator-managed environment: ZK, etcd, or Consul. They give correct primitives (linearisable writes; ephemeral nodes or leases) and natural fencing tokens (sequence numbers, revisions).
For ad-hoc cases where adding a coordinator is overkill: Redis with NX-EX, knowing the limitations. Add fencing tokens if correctness matters.
For high-stakes leadership (primary failover, write coordination): the coordinator approach. The operational cost is justified by avoiding split-brain incidents.
Implementations
Kubernetes provides a Lease resource (coordination.k8s.io) designed for this. The object is stored in etcd behind the API server; client libraries handle the acquire, renewal, and watch logic. Standard practice for Kubernetes-native controllers.
// Using the official Java client for Kubernetes. Assumes instanceId is a unique
// per-instance string and apiClient is an already-configured
// io.kubernetes.client.openapi.ApiClient.
import io.kubernetes.client.extended.leaderelection.LeaderElectionConfig;
import io.kubernetes.client.extended.leaderelection.LeaderElector;
import io.kubernetes.client.extended.leaderelection.resourcelock.LeaseLock;
import java.time.Duration;

// The Lease object "my-controller-leader" in namespace "default" is the lock.
LeaseLock lock = new LeaseLock("default", "my-controller-leader", instanceId, apiClient);

LeaderElectionConfig config = new LeaderElectionConfig(
    lock,
    Duration.ofSeconds(30),   // lease duration
    Duration.ofSeconds(20),   // renew deadline
    Duration.ofSeconds(5)     // retry period
);

LeaderElector elector = new LeaderElector(config);
elector.run(
    () -> runAsLeader(),      // onStartedLeading
    () -> stopWork()          // onStoppedLeading
);

Key points
- Heartbeat: leader writes a key with TTL every N seconds; if it stops writing, the key expires.
- Election: candidates try to atomically claim the leader key. The winner is the leader. Losers watch.
- Lease TTL ≥ heartbeat interval × 3: tolerates transient network blips without flapping.
- Always design for split-brain: two instances that both think they are leader. Fencing tokens prevent corruption.
- ZooKeeper, etcd, Consul, Kubernetes leases give correct primitives. Redis with TTL is simpler but has known edge cases.
Follow-up questions
- What is split-brain and how does it happen?
- ZooKeeper / etcd / Consul vs Redis for election?
- How do I size the lease TTL?
- Does the leader need to know it's still the leader?
Gotchas
- TTL too short: harmless network blip causes leadership flap
- Renewing on a separate connection: connection drops independently, lease lost while leader is fine
- Doing work after losing leadership: the new leader is also writing; conflict
- No fencing token: stale leader corrupts data after lease expiry
- Single-region election with no fencing: cross-region partition can have two regional leaders