etcd
The consensus store that Kubernetes literally cannot run without
Why It Exists
Every distributed system eventually needs a single source of truth for coordination metadata. Which node is the leader? What services are running where? What configuration is active right now? Without a strongly consistent store backing these answers, the result is split-brain scenarios, data corruption, duplicate processing, and cascading failures.
etcd solves this with a linearizable key-value store built on Raft consensus. Every read returns the most recent committed write, even during partial network failures. That's the whole promise, and it delivers.
CoreOS built etcd because ZooKeeper was a bad fit for the Go-based cloud-native world. ZooKeeper's Java runtime, session-based ephemeral nodes, and hierarchical namespace all added complexity that didn't map well to what people actually needed. etcd shipped as a single Go binary with a flat keyspace and a gRPC API. Kubernetes picked it as its sole persistence layer, and that decision made etcd the most widely deployed consensus system in modern infrastructure.
How It Works Internally
Every write follows a strict path. The client sends a gRPC request to any cluster member, which forwards it to the current Raft leader. The leader appends the entry to its Write-Ahead Log (WAL), replicates it to the followers, waits for acknowledgment from a quorum (a majority of members, counting itself), then applies it to the state machine. The state machine is a bbolt B+ tree database that stores all key-value pairs using Multi-Version Concurrency Control (MVCC). Each mutation creates a new revision number. Both current and historical values get indexed.
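The client-visible end of this path is small. Here is a minimal sketch using the official Go client, clientv3 (go.etcd.io/etcd/client/v3), assuming a member reachable at localhost:2379 and a hypothetical key; the revision in the response header is the cluster-wide MVCC revision assigned to the committed write.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to any cluster member; the client routes writes to the leader.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// A successful Put is a committed Raft entry; the response header
	// carries the new cluster-wide revision.
	resp, err := cli.Put(ctx, "/demo/key", "v1")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("committed at revision:", resp.Header.Revision)
}
```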
The MVCC model is the critical piece. etcd maintains a keyIndex, an in-memory B-tree that maps each key to its revision history, plus the backend B+ tree (bbolt) that maps revisions to values. For a range query on the key /registry/pods, etcd first checks the keyIndex for the latest revision, then fetches the value from bbolt. This dual-index design is what makes watches efficient: when a client subscribes to changes starting from revision N, etcd replays all mutations since that revision straight from the backend store.
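Those revisions are directly addressable by clients. A small sketch of a time-travel read using clientv3's WithRev option, assuming the same local endpoint and a hypothetical key: write twice, then read the older value back at its original revision.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Write twice so the key has two MVCC revisions.
	first, err := cli.Put(ctx, "/demo/key", "old")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/demo/key", "new"); err != nil {
		log.Fatal(err)
	}

	// Default Get returns the latest value; WithRev pins the read to a
	// historical revision that compaction has not yet discarded.
	old, err := cli.Get(ctx, "/demo/key", clientv3.WithRev(first.Header.Revision))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("value at revision %d: %s\n", first.Header.Revision, old.Kvs[0].Value)
}
```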
The Watch API runs over gRPC server-side streaming. Internally, the watch hub splits watchers into two groups: synced watchers (caught up to current revision) and unsynced watchers (still replaying history). Synced watchers get events directly from the apply loop with near-zero overhead. This design lets a single etcd cluster handle tens of thousands of concurrent watches. Kubernetes depends on this heavily for its controller reconciliation loops.
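A minimal watch sketch with clientv3, assuming a local endpoint and a hypothetical prefix. Starting at revision 1 forces the replay path (an unsynced watcher) before the stream goes live.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Watch every key under the prefix, starting from revision 1 so the
	// server replays history before streaming live events. The stream
	// runs until the process exits.
	ch := cli.Watch(context.Background(), "/demo/",
		clientv3.WithPrefix(), clientv3.WithRev(1))
	for wresp := range ch {
		if err := wresp.Err(); err != nil {
			// e.g. the start revision was already compacted away
			log.Fatal(err)
		}
		for _, ev := range wresp.Events {
			fmt.Printf("%s %s -> %s (revision %d)\n",
				ev.Type, ev.Kv.Key, ev.Kv.Value, ev.Kv.ModRevision)
		}
	}
}
```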
Leases provide TTL semantics. Grant a lease with a TTL, attach it to keys, then periodically send KeepAlive RPCs. If the lease expires because heartbeats stop arriving, etcd automatically deletes all attached keys. This is the building block for service registration: a service registers itself with a 10-second lease, heartbeats every 3 seconds, and gets automatically deregistered on crash. Simple and reliable.
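A sketch of that registration pattern in Go, assuming a local endpoint and a hypothetical service key. clientv3's KeepAlive refreshes the lease in the background at roughly TTL/3, which matches the 10-second lease, 3-second heartbeat cadence above.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 10-second lease and attach the registration key to it.
	lease, err := cli.Grant(ctx, 10)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "/services/api/instance-1", "10.0.0.5:8080",
		clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}

	// KeepAlive heartbeats in the background; if the process dies, the
	// lease expires and etcd deletes the key automatically.
	ka, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for resp := range ka {
		log.Printf("lease %x refreshed, TTL=%ds", resp.ID, resp.TTL)
	}
}
```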
Production Architecture
Run 3 or 5 members across failure domains (availability zones or racks). The write path needs quorum agreement: a 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2. Going beyond 5 nodes is counterproductive. Raft replication cost grows linearly with cluster size, and a 7-node cluster doing ~10,000 writes/sec generates a lot of cross-AZ network traffic for marginal fault tolerance gains.
Storage must be local SSD. I cannot stress this enough. The Raft protocol requires durable WAL writes on every commit, and the 99th percentile fsync latency directly determines the write throughput ceiling. On AWS, use i3/i3en instances with NVMe or io2 EBS with provisioned IOPS (minimum 3,000 IOPS). Network-attached storage with variable latency (gp2/gp3) will cause leader election storms during IO spikes. I have seen this kill production clusters more than once.
For Kubernetes at scale, the standard pattern is a dedicated etcd cluster separate from worker nodes. The kube-apiserver connects over mTLS. At very large scale (5,000+ nodes), teams run separate etcd clusters for events vs. core state using the --etcd-servers-overrides flag, because event volume alone can overwhelm a single cluster.
For backups, use etcdctl snapshot save on a periodic cron (every 15-30 minutes) and upload to object storage. Snapshots are point-in-time consistent and include the full bbolt database. Restoration replaces the entire cluster state, so test the restore process before it is needed.
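etcdctl snapshot save is the standard tool, but the same snapshot stream is also exposed through the client's Maintenance API. A minimal Go sketch, assuming a local endpoint and a hypothetical output filename:

```go
package main

import (
	"context"
	"io"
	"log"
	"os"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// Stream a point-in-time snapshot of the full bbolt database.
	rc, err := cli.Snapshot(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer rc.Close()

	f, err := os.Create("etcd-backup.db") // upload this file to object storage
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	n, err := io.Copy(f, rc)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("snapshot saved: %d bytes", n)
}
```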
Decision Criteria
| Criteria | etcd | ZooKeeper | Consul |
|---|---|---|---|
| Consensus | Raft | ZAB (Paxos-like) | Raft |
| Data model | Flat key-value with MVCC | Hierarchical znodes | KV + service catalog |
| Language | Go | Java | Go |
| Max recommended DB size | 8 GB | ~500 MB total (1 MB per znode) | No hard limit (uses Raft log) |
| Watch mechanism | gRPC streaming, revision-based | Session-based, one-shot triggers | Blocking long-poll queries |
| Write throughput | ~10,000-30,000 writes/sec | ~10,000-15,000 writes/sec | ~5,000-10,000 writes/sec |
| Kubernetes integration | Native (backing store) | None | Service mesh focus |
| Service discovery | Via watches on key prefixes | Via ephemeral znodes | Native with health checks |
| Operational overhead | Low (single Go binary) | High (JVM tuning, GC pauses) | Medium (agent-based) |
Capacity Planning
Cluster sizing: 3 nodes for most workloads. 5 nodes for production clusters that need higher fault tolerance. Never exceed 7. Each node needs 2-4 dedicated CPU cores, 8 GB RAM, and local SSD with consistent sub-10ms fsync latency.
Throughput: A well-tuned 3-node cluster on local SSDs handles ~10,000 linearizable writes/sec and ~30,000 serializable reads/sec. Each write generates WAL entries replicated to all followers, so network bandwidth is roughly write_rate * avg_value_size * (N-1), where N is cluster size. At 10,000 writes/sec with 1 KB values in a 3-node cluster, that works out to about 20 MB/s (160 Mbps) of replication traffic.
Storage: Total DB size = num_keys * avg_value_size * avg_revisions_per_key. With auto-compaction retaining 1 hour of revisions, a cluster with 100,000 keys averaging 1 KB values and 10 revisions/hour uses about 1 GB. The hard recommendation is to stay under 8 GB. Exceeding this triggers expensive defragmentation cycles and risks OOM.
Watch load: A single etcd node can sustain ~10,000 concurrent watch streams. Kubernetes clusters with 2,000 nodes and aggressive controller reconciliation can generate 50,000+ watches. This is a common scaling bottleneck solved with watch coalescing or the apiserver watch cache.
Network: Raft heartbeats run every 100ms by default with a 1,000ms election timeout. Cross-AZ round-trip latency must stay under 50ms for stable leadership. Budget 10 Mbps baseline for a moderately active cluster, and scale with write volume.
Failure Scenarios
Scenario 1: Leader Election Storm from Disk Latency Spike
Trigger: The underlying storage on the leader node hits an IO latency spike. Maybe a noisy neighbor on shared storage, or EBS burst credits ran out. WAL fsync exceeds the Raft election timeout (default 1,000ms).
Impact: Followers stop receiving heartbeats and trigger an election. The new leader inherits the same workload. If the IO issue is systemic (all nodes on the same storage tier), the cluster enters a leader election loop. No writes get accepted, the kube-apiserver returns 5xx errors, and all Kubernetes control plane operations halt. Pods keep running, but new scheduling, scaling, and deployments stop.
Detection: Monitor etcd_disk_wal_fsync_duration_seconds (99th percentile > 10ms is a warning, > 100ms is critical). Alert on etcd_server_leader_changes_seen_total increasing more than once per hour. The apiserver metric etcd_request_duration_seconds will spike at the same time.
Recovery: Find the IO-bound node and either move it to faster storage or remove it from the cluster with etcdctl member remove. If the entire cluster is affected, reduce write volume (scale down non-critical controllers) and migrate to provisioned IOPS storage. Never restart all members at once.
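The member-removal step is also scriptable. A hedged Go sketch of the equivalent of etcdctl member remove, using the client's cluster API with hypothetical endpoints and member names:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		// Connect via the healthy members; the sick node may not answer.
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379"}, // hypothetical names
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	members, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range members.Members {
		if m.Name == "etcd-3" { // hypothetical name of the IO-bound member
			if _, err := cli.MemberRemove(ctx, m.ID); err != nil {
				log.Fatal(err)
			}
			log.Printf("removed member %s (%x)", m.Name, m.ID)
		}
	}
}
```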
Scenario 2: MVCC Revision Exhaustion and Database Size Breach
Trigger: Auto-compaction is disabled or misconfigured, and the cluster has been running for months. Every key mutation piles up historical revisions. The bbolt database grows past 8 GB, and defragmentation can't reclaim space because the live dataset itself is too large.
Impact: The etcd cluster enters a read-only alarm state (NOSPACE alarm). All writes return mvcc: database space exceeded. In Kubernetes terms, this means no new pods, no ConfigMap updates, no Secret rotations. The cluster is dead for writes.
Detection: Monitor etcd_mvcc_db_total_size_in_bytes and alert at 6 GB (75% of the 8 GB limit). Track etcd_debugging_mvcc_keys_total for unexpected key count growth. Dashboard the compaction backlog via etcd_debugging_mvcc_db_compaction_keys_total.
Recovery: Run manual compaction with etcdctl compaction <revision> to drop old revisions, then etcdctl defrag --cluster to reclaim physical disk space. Clear the alarm with etcdctl alarm disarm. Enable auto-compaction: --auto-compaction-mode=periodic --auto-compaction-retention=1h. If the live dataset exceeds 8 GB, the fix is to reduce key count or move large values to an external store. There is no other fix.
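The same recovery sequence can be driven through the Go client's Maintenance API. A sketch mirroring the etcdctl commands above, with hypothetical endpoints:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	endpoints := []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"} // hypothetical
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Minute)
	defer cancel()

	// 1. Compact away all revisions older than the current one.
	status, err := cli.Status(ctx, endpoints[0])
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
		log.Fatal(err)
	}

	// 2. Defragment each member to return freed pages to the filesystem.
	//    Defrag blocks the member, so go one endpoint at a time.
	for _, ep := range endpoints {
		if _, err := cli.Defragment(ctx, ep); err != nil {
			log.Fatal(err)
		}
	}

	// 3. Clear the NOSPACE alarm; an empty AlarmMember disarms all alarms.
	if _, err := cli.AlarmDisarm(ctx, &clientv3.AlarmMember{}); err != nil {
		log.Fatal(err)
	}
	log.Println("compaction, defrag, and alarm disarm complete")
}
```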
Scenario 3: Split-Brain During Network Partition
Trigger: A network partition isolates the leader from two followers. The two followers elect a new leader (forming a quorum of 2 out of 3). The old leader is now partitioned with no quorum.
Impact: Because etcd uses Raft, only the partition with quorum (the two followers plus the new leader) can accept writes. The old leader detects it has lost quorum within one election timeout and steps down, rejecting all client requests. Clients connected to the old leader get errors and must reconnect to the new leader. If clients lack retry logic or DNS updates are slow, a chunk of traffic will see errors for the duration of the partition plus the DNS/discovery TTL.
Detection: Monitor etcd_network_peer_round_trip_time_seconds for peer latency spikes. Alert on etcd_server_has_leader dropping to 0 on any member. The metric etcd_server_proposals_failed_total will increase on the partitioned leader.
Recovery: Raft guarantees no split-brain writes. Only the majority partition accepts mutations. Once the network heals, the partitioned node catches up via Raft log replay. Make sure clients use all member endpoints (not just the leader) and implement retry with backoff. For Kubernetes, the kube-apiserver's --etcd-servers flag should list all members, and its built-in retry handles transient failures.
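A minimal sketch of a partition-tolerant client setup in Go, with hypothetical endpoints: every member listed, plus a simple exponential-backoff retry for transient leader-loss errors.

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// List every member, not just the current leader; the client's
	// balancer fails over to a reachable member during a partition.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"etcd-1:2379", "etcd-2:2379", "etcd-3:2379"}, // hypothetical
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Retry with exponential backoff to ride out an election window.
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		_, lastErr = cli.Put(ctx, "/demo/key", "value")
		cancel()
		if lastErr == nil {
			return
		}
		time.Sleep(time.Duration(100<<attempt) * time.Millisecond)
	}
	log.Fatal(lastErr)
}
```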
Pros
- • Strong consistency via Raft consensus
- • Simple HTTP/gRPC API
- • Watch API for real-time change notifications
- • Lease-based TTL for ephemeral keys
- • Foundation of Kubernetes control plane
Cons
- • Not designed for large data volumes (recommended < 8 GB)
- • Write latency depends on cluster size and network
- • All data must fit in memory
- • Limited query capabilities (prefix-based only)
- • Compaction needed to prevent unbounded growth
When to use
- • Kubernetes or cloud-native infrastructure
- • Need strongly consistent configuration store
- • Service discovery with health checking
- • Distributed coordination in Go-based systems
When NOT to use
- • General-purpose data storage
- • Large datasets (> 8 GB)
- • High write throughput requirements
- • Complex query patterns
Key Points
- • All writes go through the Raft leader. Reads can be linearizable (confirmed through the leader) or serializable (served by any node), trading consistency for latency; see the sketch after this list.
- • bbolt's B+ tree stores MVCC revisions. Every key mutation creates a new revision, which is what makes transactional watches and time-travel reads possible.
- • Cluster size should be 3 or 5 nodes. Never even numbers, never more than 7, because Raft quorum write amplification becomes prohibitive.
- • The default 1.5 MB value size limit and 8 GB recommended DB size are hard operational constraints. Exceeding them degrades the entire cluster.
- • The Watch API uses gRPC server-side streaming with revision-based resumption, so clients can reconnect after a network partition without missing events.
- • etcd's MVCC model means deleted keys still eat space until compaction runs. Auto-compaction is non-negotiable for long-lived clusters.
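A small sketch of the read-consistency trade-off from the first key point, assuming a local endpoint and a hypothetical key:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"}, // hypothetical local endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// Default: linearizable read, confirmed through the leader's quorum.
	if _, err := cli.Get(ctx, "/demo/key"); err != nil {
		log.Fatal(err)
	}

	// Serializable: served from the contacted member's local state.
	// Lower latency, but possibly slightly stale.
	if _, err := cli.Get(ctx, "/demo/key", clientv3.WithSerializable()); err != nil {
		log.Fatal(err)
	}
}
```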
Common Mistakes
- ✗ Running etcd on network-attached storage instead of local SSDs. Raft consensus is extremely latency-sensitive to fsync.
- ✗ Not setting --auto-compaction-retention, leading to unbounded MVCC revision growth and an eventual DB size limit breach.
- ✗ Storing large blobs (images, binaries) in etcd. Keep values small (ideally under 1 KB) and put bulk data in an object store.
- ✗ Using an even number of nodes (e.g., 4), which gives no additional fault tolerance over 3 but increases write latency.
- ✗ Neglecting to monitor etcd disk WAL fsync duration. Values above 10ms signal impending cluster instability.