etcd
The consensus store that Kubernetes literally cannot run without
Why It Exists
Every distributed system eventually needs a single source of truth for coordination metadata. Which node is the leader? What services are running where? What configuration is active right now? Without a strongly consistent store backing these answers, the result is split-brain scenarios, data corruption, duplicate processing, and cascading failures.
etcd solves this with a linearizable key-value store built on Raft consensus. Every read returns the most recent committed write, even during partial network failures. That's the whole promise, and it delivers.
CoreOS built etcd because ZooKeeper was a bad fit for the Go-based cloud-native world. ZooKeeper's Java runtime, session-based ephemeral nodes, and hierarchical namespace all added complexity that didn't map well to what people actually needed. etcd shipped as a single Go binary with a flat keyspace and a gRPC API. Kubernetes picked it as its sole persistence layer, and that decision made etcd the most widely deployed consensus system in modern infrastructure.
How It Works Internally
Every write follows a strict path. The client sends a gRPC request to any cluster member, which forwards it to the current Raft leader. The leader appends the entry to its Write-Ahead Log (WAL), replicates it to the followers, waits for acknowledgment from a quorum (a majority of members, counting itself), then applies it to the state machine. The state machine is a bbolt B+ tree database that stores all key-value pairs using Multi-Version Concurrency Control (MVCC). Each mutation creates a new revision number. Both current and historical values get indexed.
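The client-visible end of this path is small. Here is a minimal sketch using the official Go client, clientv3 (go.etcd.io/etcd/client/v3), assuming a member reachable at localhost:2379 and a hypothetical key; the revision in the response header is the cluster-wide MVCC revision assigned to the committed write.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to any cluster member; the client routes writes to the leader.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// A successful Put is a committed Raft entry; the response header
	// carries the new cluster-wide revision.
	resp, err := cli.Put(ctx, "/demo/key", "v1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("committed at revision:", resp.Header.Revision)
}
```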
The MVCC model is the critical piece. etcd maintains a keyIndex, an in-memory B-tree that maps each key to its revision history, plus the backend B+ tree (bbolt) that maps revisions to values. For a range query on the key /registry/pods, etcd first checks the keyIndex for the latest revision, then fetches the value from bbolt. This dual-index design is what makes watches efficient: when a client subscribes to changes starting from revision N, etcd replays all mutations since that revision straight from the backend store.
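Those revisions are directly addressable by clients. A small sketch of a time-travel read using clientv3's WithRev option, assuming the same local endpoint and a hypothetical key: write twice, then read the older value back at its original revision.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Write twice so the key has two MVCC revisions.
	first, err := cli.Put(ctx, "/demo/key", "old")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/demo/key", "new"); err != nil {
		log.Fatal(err)
	}

	// Default Get returns the latest value; WithRev pins the read to a
	// historical revision that compaction has not yet discarded.
	old, err := cli.Get(ctx, "/demo/key", clientv3.WithRev(first.Header.Revision))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("value at revision %d: %s\n", first.Header.Revision, old.Kvs[0].Value)
}
```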
The Watch API runs over gRPC server-side streaming. Internally, the watch hub splits watchers into two groups: synced watchers (caught up to current revision) and unsynced watchers (still replaying history). Synced watchers get events directly from the apply loop with near-zero overhead. This design lets a single etcd cluster handle tens of thousands of concurrent watches. Kubernetes depends on this heavily for its controller reconciliation loops.
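A minimal watch sketch with clientv3, assuming a local endpoint and a hypothetical prefix. Starting at revision 1 forces the replay path (an unsynced watcher) before the stream goes live.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch every key under the prefix, starting from revision 1 so the
	// server replays history before streaming live events. The stream
	// runs until the process exits.
	ch := cli.Watch(context.Background(), "/demo/",
		clientv3.WithPrefix(), clientv3.WithRev(1))
	for wresp := range ch {
		if err := wresp.Err(); err != nil {
			// e.g. the start revision was already compacted away
			log.Fatal(err)
		}
		for _, ev := range wresp.Events {
			fmt.Printf("%s %s -> %s (revision %d)\n",
				ev.Type, ev.Kv.Key, ev.Kv.Value, ev.Kv.ModRevision)
		}
	}
}
```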
Leases provide TTL semantics. Grant a lease with a TTL, attach it to keys, then periodically send KeepAlive RPCs. If the lease expires because heartbeats stop arriving, etcd automatically deletes all attached keys. This is the building block for service registration: a service registers itself with a 10-second lease, heartbeats every 3 seconds, and gets automatically deregistered on crash. Simple and reliable.
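A sketch of that registration pattern in Go, assuming a local endpoint and a hypothetical service key. clientv3's KeepAlive refreshes the lease in the background at roughly TTL/3, which matches the 10-second lease, 3-second heartbeat cadence above.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 10-second lease and attach the registration key to it.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/services/api/instance-1", "10.0.0.5:8080",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// KeepAlive heartbeats in the background; if the process dies, the
	// lease expires and etcd deletes the key automatically.
	ka, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ka {
		log.Printf("lease %x refreshed, TTL=%ds", resp.ID, resp.TTL)
	}
}
```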
Production Architecture
Run 3 or 5 members across failure domains (availability zones or racks). The write path needs quorum agreement: a 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2. Going beyond 5 nodes is counterproductive. Raft replication cost grows linearly with cluster size, and a 7-node cluster doing ~10,000 writes/sec generates a lot of cross-AZ network traffic for marginal fault tolerance gains.
Storage must be local SSD. I cannot stress this enough. The Raft protocol requires durable WAL writes on every commit, and the 99th percentile fsync latency directly determines the write throughput ceiling. On AWS, use i3/i3en instances with NVMe or io2 EBS with provisioned IOPS (minimum 3,000 IOPS). Network-attached storage with variable latency (gp2/gp3) will cause leader election storms during IO spikes. I have seen this kill production clusters more than once.
For Kubernetes at scale, the standard pattern is a dedicated etcd cluster separate from worker nodes. The kube-apiserver connects over mTLS. At very large scale (5,000+ nodes), teams run separate etcd clusters for events vs. core state using the --etcd-servers-overrides flag, because event volume alone can overwhelm a single cluster.
For backups, use etcdctl snapshot save on a periodic cron (every 15-30 minutes) and upload to object storage. Snapshots are point-in-time consistent and include the full bbolt database. Restoration replaces the entire cluster state, so test the restore process before it is needed.
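etcdctl snapshot save is the standard tool, but the same snapshot stream is also exposed through the client's Maintenance API. A minimal Go sketch, assuming a local endpoint and a hypothetical output filename:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Stream a point-in-time snapshot of the full bbolt database.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	f, err := os.Create("etcd-backup.db") // upload this file to object storage
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	n, err := io.Copy(f, rc)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("snapshot saved: %d bytes", n)
}
```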
Decision Criteria
| Criteria | etcd | ZooKeeper | Consul |
|---|---|---|---|
| Consensus | Raft | ZAB (Paxos-like) | Raft |
| Data model | Flat key-value with MVCC | Hierarchical znodes | KV + service catalog |
| Language | Go | Java | Go |
| Max recommended DB size | 8 GB | ~500 MB total (1 MB per znode) | No hard limit (uses Raft log) |
| Watch mechanism | gRPC streaming, revision-based | Session-based, one-shot triggers | Blocking long-poll queries |
| Write throughput | ~10,000-30,000 writes/sec | ~10,000-15,000 writes/sec | ~5,000-10,000 writes/sec |
| Kubernetes integration | Native (backing store) | None | Service mesh focus |
| Service discovery | Via watches on key prefixes | Via ephemeral znodes | Native with health checks |
| Operational overhead | Low (single Go binary) | High (JVM tuning, GC pauses) | Medium (agent-based) |
Capacity Planning
Cluster sizing: 3 nodes for most workloads. 5 nodes for production clusters that need higher fault tolerance. Never exceed 7. Each node needs 2-4 dedicated CPU cores, 8 GB RAM, and local SSD with consistent sub-10ms fsync latency.
Throughput: A well-tuned 3-node cluster on local SSDs handles ~10,000 linearizable writes/sec and ~30,000 serializable reads/sec. Each write generates WAL entries replicated to all followers, so network bandwidth is roughly write_rate * avg_value_size * (N-1), where N is cluster size. At 10,000 writes/sec with 1 KB values in a 3-node cluster, that works out to about 20 MB/s (160 Mbps) of replication traffic.
Storage: Total DB size = num_keys * avg_value_size * avg_revisions_per_key. With auto-compaction retaining 1 hour of revisions, a cluster with 100,000 keys averaging 1 KB values and 10 revisions/hour uses about 1 GB. The hard recommendation is to stay under 8 GB. Exceeding this triggers expensive defragmentation cycles and risks OOM.
Watch load: A single etcd node can sustain ~10,000 concurrent watch streams. Kubernetes clusters with 2,000 nodes and aggressive controller reconciliation can generate 50,000+ watches. This is a common scaling bottleneck solved with watch coalescing or the apiserver watch cache.
Network: Raft heartbeats run every 100ms by default with a 1,000ms election timeout. Cross-AZ round-trip latency must stay under 50ms for stable leadership. Budget 10 Mbps baseline for a moderately active cluster, and scale with write volume.
Failure Scenarios
Scenario 1: Leader Election Storm from Disk Latency Spike
Trigger: The underlying storage on the leader node hits an IO latency spike. Maybe a noisy neighbor on shared storage, or EBS burst credits ran out. WAL fsync exceeds the Raft election timeout (default 1,000ms).
Impact: Followers stop receiving heartbeats and trigger an election. The new leader inherits the same workload. If the IO issue is systemic (all nodes on the same storage tier), the cluster enters a leader election loop. No writes get accepted, the kube-apiserver returns 5xx errors, and all Kubernetes control plane operations halt. Pods keep running, but new scheduling, scaling, and deployments stop.
Detection: Monitor etcd_disk_wal_fsync_duration_seconds (99th percentile > 10ms is a warning, > 100ms is critical). Alert on etcd_server_leader_changes_seen_total increasing more than once per hour. The apiserver metric etcd_request_duration_seconds will spike at the same time.
Recovery: Find the IO-bound node and either move it to faster storage or remove it from the cluster with etcdctl member remove. If the entire cluster is affected, reduce write volume (scale down non-critical controllers) and migrate to provisioned IOPS storage. Never restart all members at once.
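The member-removal step is also scriptable. A hedged Go sketch of the equivalent of etcdctl member remove, using the client's cluster API with hypothetical endpoints and member names:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Connect via the healthy members; the sick node may not answer.
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379"}, // hypothetical names
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members.Members {
		if m.Name == "etcd-3" { // hypothetical name of the IO-bound member
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
			log.Printf("removed member %s (%x)", m.Name, m.ID)
		}
	}
}
```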
Scenario 2: MVCC Revision Exhaustion and Database Size Breach
Trigger: Auto-compaction is disabled or misconfigured, and the cluster has been running for months. Every key mutation piles up historical revisions. The bbolt database grows past 8 GB, and defragmentation can't reclaim space because the live dataset itself is too large.
Impact: The etcd cluster enters a read-only alarm state (NOSPACE alarm). All writes return mvcc: database space exceeded. In Kubernetes terms, this means no new pods, no ConfigMap updates, no Secret rotations. The cluster is dead for writes.
Detection: Monitor etcd_mvcc_db_total_size_in_bytes and alert at 6 GB (75% of the 8 GB limit). Track etcd_debugging_mvcc_keys_total for unexpected key count growth. Dashboard the compaction backlog via etcd_debugging_mvcc_db_compaction_keys_total.
Recovery: Run manual compaction with etcdctl compaction <revision> to drop old revisions, then etcdctl defrag --cluster to reclaim physical disk space. Clear the alarm with etcdctl alarm disarm. Enable auto-compaction: --auto-compaction-mode=periodic --auto-compaction-retention=1h. If the live dataset exceeds 8 GB, the fix is to reduce key count or move large values to an external store. There is no other fix.
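The same recovery sequence can be driven through the Go client's Maintenance API. A sketch mirroring the etcdctl commands above, with hypothetical endpoints:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"} // hypothetical
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// 1. Compact away all revisions older than the current one.
	status, err := cli.Status(ctx, endpoints[0])
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
		log.Fatal(err)
	}

	// 2. Defragment each member to return freed pages to the filesystem.
	//    Defrag blocks the member, so go one endpoint at a time.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			log.Fatal(err)
		}
	}

	// 3. Clear the NOSPACE alarm; an empty AlarmMember disarms all alarms.
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		log.Fatal(err)
	}
	log.Println("compaction, defrag, and alarm disarm complete")
}
```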
Scenario 3: Split-Brain During Network Partition
Trigger: A network partition isolates the leader from two followers. The two followers elect a new leader (forming a quorum of 2 out of 3). The old leader is now partitioned with no quorum.
Impact: Because etcd uses Raft, only the partition with quorum (the two followers plus the new leader) can accept writes. The old leader detects it has lost quorum within one election timeout and steps down, rejecting all client requests. Clients connected to the old leader get errors and must reconnect to the new leader. If clients lack retry logic or DNS updates are slow, a chunk of traffic will see errors for the duration of the partition plus the DNS/discovery TTL.
Detection: Monitor etcd_network_peer_round_trip_time_seconds for peer latency spikes. Alert on etcd_server_has_leader dropping to 0 on any member. The metric etcd_server_proposals_failed_total will increase on the partitioned leader.
Recovery: Raft guarantees no split-brain writes. Only the majority partition accepts mutations. Once the network heals, the partitioned node catches up via Raft log replay. Make sure clients use all member endpoints (not just the leader) and implement retry with backoff. For Kubernetes, the kube-apiserver's --etcd-servers flag should list all members, and its built-in retry handles transient failures.
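A minimal sketch of a partition-tolerant client setup in Go, with hypothetical endpoints: every member listed, plus a simple exponential-backoff retry for transient leader-loss errors.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// List every member, not just the current leader; the client's
	// balancer fails over to a reachable member during a partition.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"}, // hypothetical
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Retry with exponential backoff to ride out an election window.
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, lastErr = cli.Put(ctx, "/demo/key", "value")
		cancel()
		if lastErr == nil {
			return
		}
		time.Sleep(time.Duration(100<<attempt) * time.Millisecond)
	}
	log.Fatal(lastErr)
}
```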
Pros
- • Strong consistency via Raft consensus
- • Simple HTTP/gRPC API
- • Watch API for real-time change notifications
- • Lease-based TTL for ephemeral keys
- • Foundation of Kubernetes control plane
Cons
- • Not designed for large data volumes (recommended < 8 GB)
- • Write latency depends on cluster size and network
- • All data must fit in memory
- • Limited query capabilities (prefix-based only)
- • Compaction needed to prevent unbounded growth
When to use
- • Kubernetes or cloud-native infrastructure
- • Need strongly consistent configuration store
- • Service discovery with health checking
- • Distributed coordination in Go-based systems
When NOT to use
- • General-purpose data storage
- • Large datasets (> 8 GB)
- • High write throughput requirements
- • Complex query patterns
Key Points
- • All writes go through the Raft leader. Reads can be linearizable (confirmed through the leader) or serializable (served by any node), trading consistency for latency; see the sketch after this list.
- • bbolt's B+ tree stores MVCC revisions. Every key mutation creates a new revision, which is what makes transactional watches and time-travel reads possible.
- • Cluster size should be 3 or 5 nodes. Never even numbers, never more than 7, because Raft quorum write amplification becomes prohibitive.
- • The default 1.5 MB value size limit and 8 GB recommended DB size are hard operational constraints. Exceeding them degrades the entire cluster.
- • The Watch API uses gRPC server-side streaming with revision-based resumption, so clients can reconnect after a network partition without missing events.
- • etcd's MVCC model means deleted keys still eat space until compaction runs. Auto-compaction is non-negotiable for long-lived clusters.
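A small sketch of the read-consistency trade-off from the first key point, assuming a local endpoint and a hypothetical key:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Default: linearizable read, confirmed through the leader's quorum.
	if _, err := cli.Get(ctx, "/demo/key"); err != nil {
		log.Fatal(err)
	}

	// Serializable: served from the contacted member's local state.
	// Lower latency, but possibly slightly stale.
	if _, err := cli.Get(ctx, "/demo/key", clientv3.WithSerializable()); err != nil {
		log.Fatal(err)
	}
}
```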
Common Mistakes
- ✗ Running etcd on network-attached storage instead of local SSDs. Raft consensus is extremely latency-sensitive to fsync.
- ✗ Not setting --auto-compaction-retention, leading to unbounded MVCC revision growth and an eventual DB size limit breach.
- ✗ Storing large blobs (images, binaries) in etcd. Keep values small (ideally under 1 KB) and put bulk data in an object store.
- ✗ Using an even number of nodes (e.g., 4), which gives no additional fault tolerance over 3 but increases write latency.
- ✗ Neglecting to monitor etcd disk WAL fsync duration. Values above 10ms signal impending cluster instability.