Raft Consensus Protocol
Why Consensus Is Hard
Distributed consensus looks deceptively simple: get a group of machines to agree on a sequence of values. But the environment these machines operate in is adversarial in ways that single-machine programming never prepares you for.
A network partition splits your five-node cluster into a group of three and a group of two. Both groups might try to elect a leader. If both succeed, you have split brain, and the two leaders will accept conflicting writes that cannot be reconciled later. This is not a theoretical concern. It happens in production, particularly in cloud environments where an entire availability zone can become unreachable while every node inside it thinks it is fine.
Nodes crash and restart. When a node comes back, its state might be stale by seconds or hours. It might have missed committed entries. It might have uncommitted entries from a previous leader that was deposed. The protocol has to handle all of this without human intervention.
And the fundamental problem underneath everything: there is no reliable way for one node to distinguish between "that node crashed" and "the network between us is down." Fischer, Lynch, and Paterson proved in 1985 that deterministic consensus is impossible in an asynchronous system with even one crash failure. Every practical consensus protocol works around this impossibility result by introducing some form of timing assumption, usually through timeouts.
The Three Sub-Problems
Diego Ongaro and John Ousterhout designed Raft specifically to be understandable. The key insight was decomposition. Instead of presenting consensus as one monolithic problem (the Paxos approach), Raft breaks it into three pieces that you can reason about independently.
Leader election determines which node is in charge. Only the leader accepts client requests and replicates log entries. There is at most one leader per term.
Log replication is how the leader gets its log entries onto follower nodes. The leader sends AppendEntries RPCs, followers append the entries, and once a majority have acknowledged, the entry is committed.
Safety guarantees that once an entry is committed, no future leader can overwrite it. This is the hardest part to get right, and where Paxos papers tend to lose people.
Leader Election In Detail
Time in Raft is divided into terms, which are consecutively numbered. Each term begins with an election. If the election produces a leader, that leader serves for the rest of the term. If the election fails (a split vote), the term ends with no leader, and after another randomized timeout a new term begins with a fresh election.
Every node starts as a follower. Followers listen for AppendEntries RPCs from the leader, which also serve as heartbeats. If a follower does not hear from a leader within its election timeout, it assumes the leader has failed.
The follower then transitions to candidate, increments its current term, votes for itself, and sends RequestVote RPCs to all other nodes. A node grants its vote if (a) it has not already voted in this term and (b) the candidate's log is at least as up-to-date as the voter's log. That second condition is critical and we will come back to it.
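In code, the vote-granting rule is compact. The sketch below uses illustrative types and field names (Node, RequestVoteArgs, and so on), not any particular implementation's API; the logUpToDate check referenced here is shown later, in the safety section.

```go
package raft

import "sync"

type LogEntry struct {
	Term    int
	Command []byte
}

type RequestVoteArgs struct {
	Term         int
	CandidateID  string
	LastLogIndex int
	LastLogTerm  int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

type Node struct {
	mu          sync.Mutex
	id          string
	currentTerm int
	votedFor    string // "" means no vote cast this term
	log         []LogEntry
	commitIndex int
	lastApplied int
	matchIndex  map[string]int // per-peer highest replicated index (leader only)
}

// HandleRequestVote applies the two vote-granting conditions from the text.
func (n *Node) HandleRequestVote(args RequestVoteArgs) RequestVoteReply {
	n.mu.Lock()
	defer n.mu.Unlock()

	// A higher term always makes the receiver adopt it and clears any
	// stale vote from an earlier term.
	if args.Term > n.currentTerm {
		n.currentTerm, n.votedFor = args.Term, ""
	}

	grant := args.Term == n.currentTerm &&
		(n.votedFor == "" || n.votedFor == args.CandidateID) && // condition (a)
		n.logUpToDate(args.LastLogIndex, args.LastLogTerm)      // condition (b)
	if grant {
		n.votedFor = args.CandidateID
	}
	return RequestVoteReply{Term: n.currentTerm, VoteGranted: grant}
}
```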
Three things can happen to a candidate. It receives votes from a majority of nodes and becomes the leader. It receives an AppendEntries RPC from another node claiming to be leader with a term at least as large, in which case it steps down to follower. Or the election times out with no winner, and a new term begins.
The randomized election timeout prevents elections from cycling indefinitely. Each node picks a random timeout between 150ms and 300ms. The node whose timeout fires first almost always wins the election before other nodes even start one. Split votes still happen occasionally, but consecutive split votes are extremely unlikely because each round picks new random timeouts.
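The timeout helper itself is one line; the 150-300ms range is the paper's suggestion, and followers re-arm the timer with a fresh random value every time a valid AppendEntries arrives.

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout picks a fresh timeout in the paper's suggested
// 150-300ms range. Re-randomizing on every reset is what breaks ties
// between candidates after a split vote.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}
```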
In practice, leader elections take about one to two round trips. The cluster is unavailable for writes during the election, which typically lasts a few hundred milliseconds. This is why election timeout tuning matters: too short and network jitter causes unnecessary elections, too long and legitimate leader failures take too long to detect.
Log Replication
Once a leader is elected, it begins servicing client requests. Each client request becomes a new entry in the leader's log. The leader then sends AppendEntries RPCs to every follower in parallel, containing the new log entries.
The AppendEntries RPC includes more than just the new entries. It includes the index and term of the entry immediately preceding the new entries (prevLogIndex and prevLogTerm). The follower checks whether its log matches at that point. If it does, the follower appends the new entries. If it does not, the follower rejects the RPC, and the leader backs up one entry and retries.
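Sketched in the same illustrative style, the follower side of that check might look like this. Indices are 1-based as in the paper, with a dummy entry at index 0. One caveat worth a comment: a careful implementation truncates only at the first genuine conflict, so that reordered or stale RPCs cannot discard good entries.

```go
type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int // index of the entry immediately preceding Entries
	PrevLogTerm  int // term of that entry
	Entries      []LogEntry
	LeaderCommit int // leader's commit index, so followers can advance
}

// HandleAppendEntries performs the consistency check, then appends.
func (n *Node) HandleAppendEntries(args AppendEntriesArgs) bool {
	n.mu.Lock()
	defer n.mu.Unlock()

	if args.Term < n.currentTerm {
		return false // stale leader; our reply's term will make it step down
	}
	// The consistency check: our log must contain an entry at PrevLogIndex
	// whose term matches PrevLogTerm. Otherwise reject; the leader backs
	// up and retries.
	if args.PrevLogIndex >= len(n.log) ||
		n.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return false
	}
	// Shown as a blunt truncate-and-append for brevity; real
	// implementations truncate only on an actual conflict to stay safe
	// under reordered RPCs.
	n.log = append(n.log[:args.PrevLogIndex+1], args.Entries...)
	if args.LeaderCommit > n.commitIndex {
		n.commitIndex = min(args.LeaderCommit, len(n.log)-1)
	}
	return true
}
```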
This consistency check is what keeps logs aligned across the cluster. If a follower missed entries or has stale entries from a previous term, the leader will walk backward through the log until it finds the point of agreement, then overwrite everything after that point. The leader never overwrites or deletes entries in its own log; it only appends. Followers always converge toward the leader's log.
Once the leader knows that a majority of followers have replicated an entry, it commits that entry by advancing its commit index. The leader includes the commit index in subsequent AppendEntries RPCs so followers know to apply committed entries to their state machines. This is where linearizability comes from: clients see committed entries in a consistent order because all nodes apply the same log in the same sequence.
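Advancing the commit index reduces to a counting rule over matchIndex (the highest index each follower is known to have replicated). One subtlety is worth encoding: as the paper's figure 8 shows, a leader may only commit entries from its own term by counting replicas; entries from earlier terms become committed indirectly. A sketch, continuing with the types above:

```go
// maybeAdvanceCommit finds the highest index replicated on a majority.
func (n *Node) maybeAdvanceCommit() {
	for idx := len(n.log) - 1; idx > n.commitIndex; idx-- {
		// Only entries from the leader's own term may be committed by
		// counting replicas (Raft paper, section 5.4.2). Log terms are
		// nondecreasing, so we can stop once we see an older term.
		if n.log[idx].Term != n.currentTerm {
			break
		}
		count := 1 // the leader itself holds the entry
		for _, m := range n.matchIndex {
			if m >= idx {
				count++
			}
		}
		if count > (len(n.matchIndex)+1)/2 { // strict majority of the cluster
			n.commitIndex = idx
			return
		}
	}
}
```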
The leader sends heartbeats (empty AppendEntries RPCs) to all followers at a regular interval, typically every 50 to 100 milliseconds. These heartbeats prevent followers from starting unnecessary elections and also carry updated commit indices.
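The heartbeat loop itself is tiny. Here broadcastAppendEntries is an assumed helper that sends AppendEntries (empty, or carrying pending entries) to every follower:

```go
// heartbeatLoop runs on the leader. The interval must sit comfortably
// below the election timeout, or followers will start spurious elections.
func (n *Node) heartbeatLoop(interval time.Duration) {
	t := time.NewTicker(interval) // e.g. 50-100ms
	defer t.Stop()
	for range t.C {
		n.broadcastAppendEntries() // also carries the latest commit index
	}
}
```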
The Election Restriction: Why Safety Works
Here is the subtle part. Say a leader commits entry X at log index 100, then crashes. A new leader is elected. Can the new leader overwrite index 100 with something different?
Raft says no, and the mechanism is elegantly simple. When a candidate sends a RequestVote, it includes the index and term of its last log entry. A voter rejects the vote if the candidate's log is less up-to-date. "Up-to-date" means: compare the term of the last entry first. If the terms differ, the higher term wins. If the terms are the same, the longer log wins.
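The comparison is small enough to write out exactly, and it completes the vote handler sketched in the election section:

```go
// logUpToDate reports whether a candidate's log (described by the index
// and term of its last entry) is at least as up-to-date as ours.
func (n *Node) logUpToDate(lastIndex, lastTerm int) bool {
	myLastIndex := len(n.log) - 1
	myLastTerm := n.log[myLastIndex].Term
	if lastTerm != myLastTerm {
		return lastTerm > myLastTerm // higher last term wins
	}
	return lastIndex >= myLastIndex // equal terms: longer log wins
}
```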
Since entry X was committed, a majority of nodes have it. For any candidate to win an election, it needs votes from a majority. These two majorities must overlap in at least one node. That overlapping node has entry X and will refuse to vote for any candidate that does not also have it.
This means every future leader must have all previously committed entries. No special recovery protocol is needed. No two-phase coordination. The election itself filters out candidates that would lose committed data.
This property, combined with the fact that leaders never overwrite their own logs, gives you the key safety guarantee: if a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher terms.
Log Compaction and Snapshotting
Without compaction, the Raft log grows forever. Every write ever accepted by the cluster stays in the log. After months of operation, the log might be gigabytes or terabytes, and a new node joining the cluster would need to replay all of it.
Snapshotting is the standard solution. Each node independently takes a snapshot of its state machine at some committed log index, then discards all log entries up to that index. The snapshot includes the state machine data plus the index and term of the last included entry.
When a leader needs to bring a far-behind follower up to date, and the required log entries have been discarded, the leader sends its snapshot via an InstallSnapshot RPC. The follower replaces its state machine with the snapshot and starts receiving AppendEntries from that point forward.
Snapshot frequency is a tuning parameter. Frequent snapshots keep the log small and speed up new node joins, but the snapshot process itself consumes I/O and CPU. Most production systems snapshot when the log exceeds a configurable size threshold, like 100MB or 1GB.
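A threshold-triggered snapshot might be sketched as follows. Here logSizeBytes, serializeStateMachine, and persistSnapshot are assumed helpers, and a real implementation also maintains a logical-to-physical index mapping once the log no longer starts at index zero.

```go
type Snapshot struct {
	LastIncludedIndex int    // last log index the snapshot covers
	LastIncludedTerm  int    // term of that entry
	StateMachine      []byte // serialized application state
}

// maybeSnapshot compacts the log once it crosses a size threshold.
func (n *Node) maybeSnapshot(thresholdBytes int) {
	if n.logSizeBytes() < thresholdBytes {
		return
	}
	snap := Snapshot{
		LastIncludedIndex: n.lastApplied,
		LastIncludedTerm:  n.log[n.lastApplied].Term,
		StateMachine:      n.serializeStateMachine(),
	}
	n.persistSnapshot(snap) // must be durable before truncating the log
	// Keep the last included entry as the new head so the consistency
	// check at the boundary still has a prevLogIndex/prevLogTerm to match.
	n.log = n.log[n.lastApplied:]
}
```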
Membership Changes
Changing the cluster membership (adding or removing nodes) is one of the trickier parts of Raft. The danger is that during a membership change, the old configuration and the new configuration might each independently elect a leader, causing split brain.
Raft's original paper proposed joint consensus: a transitional configuration where entries must be committed by majorities of both the old and new configurations simultaneously. This is correct but complex to implement.
The simpler alternative, which most production implementations use, is single-server membership changes: add or remove one server at a time. With single-server changes, any majority of the old configuration and any majority of the new configuration share at least one node, so two independent leaders cannot be elected. This is what etcd uses.
The practical implication: if you need to go from a three-node cluster to a five-node cluster, you add one node, wait for it to catch up and join the quorum, then add the second node. Never try to change multiple members simultaneously unless your implementation explicitly supports joint consensus.
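In code, the discipline is a loop, not a single call. The sketch below uses a hypothetical Cluster client whose AddLearner, CaughtUp, and Promote methods are illustrative stand-ins; etcd exposes a similar learner-then-promote flow through its member API.

```go
// growCluster adds peers one at a time: each joins as a non-voting
// learner, catches up, and only then is promoted into the voting set.
func growCluster(c *Cluster, newPeers []string) error {
	for _, peerURL := range newPeers {
		id, err := c.AddLearner(peerURL) // receives the log, no vote yet
		if err != nil {
			return err
		}
		for !c.CaughtUp(id) { // let it replay snapshot + log first
			time.Sleep(time.Second)
		}
		if err := c.Promote(id); err != nil { // now it counts in quorums
			return err
		}
	}
	return nil
}
```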
Pre-Vote Extension
Consider a node that gets partitioned from the cluster. It stops receiving heartbeats, so its election timeout fires. It increments its term and sends RequestVote RPCs, but they never reach anyone. Its timeout fires again. It increments its term again. This continues until the partition heals.
When the partition heals, this node has a much higher term than the rest of the cluster. It sends a RequestVote with that high term, and every node that receives it steps down from whatever it was doing because they see a higher term. The current leader steps down. A new election happens. This is disruptive and unnecessary.
The Pre-Vote extension (described in Ongaro's thesis) fixes this. Before incrementing its term and starting a real election, a candidate first sends Pre-Vote RPCs asking "would you vote for me?" at the prospective next term, without actually changing its current term. If a majority says yes, the candidate proceeds with the real election. If not, it remains a follower.
A partitioned node's Pre-Vote RPCs will fail because it cannot reach a majority. It never increments its term. When the partition heals, it rejoins quietly. etcd and CockroachDB both implement Pre-Vote.
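A sketch of the candidate side, continuing the illustrative types from above. A real implementation fans the RPCs out concurrently with timeouts, and voters additionally refuse a Pre-Vote if they have heard from a live leader within the minimum election timeout.

```go
type Peer interface {
	SendPreVote(RequestVoteArgs) RequestVoteReply
}

// preVote asks "would you vote for me at term+1?" without actually
// incrementing currentTerm. Only a majority of yeses starts a real election.
func (n *Node) preVote(peers []Peer) bool {
	args := RequestVoteArgs{
		Term:         n.currentTerm + 1, // prospective term, not adopted
		CandidateID:  n.id,
		LastLogIndex: len(n.log) - 1,
		LastLogTerm:  n.log[len(n.log)-1].Term,
	}
	grants := 1 // our own prospective vote
	for _, p := range peers {
		if p.SendPreVote(args).VoteGranted {
			grants++
		}
	}
	return grants > (len(peers)+1)/2
}
```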
Raft vs Multi-Paxos
Multi-Paxos predates Raft by about 25 years and is provably correct. So why did Raft take over?
Understandability. Paxos was described in terms of a "part-time parliament" metaphor that confused most readers. The formal proofs are correct but dense. Raft's paper was written with diagrams, with concrete RPC definitions, with pseudocode you could almost copy-paste into an implementation. In a user study with students at Stanford and Berkeley, participants who learned both algorithms scored significantly higher on quiz questions about Raft than on equivalent questions about Paxos.
The single-leader design also simplifies reasoning. Multi-Paxos can operate with multiple proposers, which creates additional complexity around ordering. Raft says there is one leader, period. The leader serializes all operations. This is less flexible but much easier to build correctly.
In practice, the performance characteristics are similar. Both require a majority quorum for commits. Both have similar latency profiles. Raft's leader-based approach can be a bottleneck for write-heavy workloads, but the same is true for practical Multi-Paxos deployments that typically use a stable leader anyway.
The places where you still see Paxos variants are systems that predate Raft (like Chubby, which uses Multi-Paxos) and systems that need leaderless operation or flexible quorums (like some academic Paxos variants). For new systems, Raft has become the default choice.
Production Considerations
Read scalability. Linearizable reads must go through the leader, and naively each one costs a quorum round trip so the leader can confirm it has not been silently deposed, making the leader a read bottleneck too. Two common workarounds: read leases (the leader assumes its lease is valid for a short period and serves reads without contacting followers) and follower reads (direct reads to followers with a "read index" protocol that confirms the follower is sufficiently up-to-date). Both have subtleties around clock skew and stale reads.
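A sketch of the read-index path on the leader, with confirmLeadership (one heartbeat round acknowledged by a majority), waitApplied, and the stateMachine field as assumed helpers:

```go
import "errors"

var errNotLeader = errors.New("not the leader")

// linearizableRead serves a read without appending a no-op to the log.
func (n *Node) linearizableRead(key string) ([]byte, error) {
	readIndex := n.commitIndex  // capture the commit point first
	if !n.confirmLeadership() { // majority heartbeat: are we still leader?
		return nil, errNotLeader
	}
	n.waitApplied(readIndex) // block until lastApplied >= readIndex
	return n.stateMachine.Get(key), nil // cannot miss a committed write
}
```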
Batching and pipelining. Production Raft implementations batch multiple client requests into a single AppendEntries RPC and pipeline multiple batches without waiting for acknowledgment. This dramatically improves throughput. etcd's Raft implementation achieves tens of thousands of writes per second with batching.
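The batching side can be as simple as draining a channel, sketched below with broadcastAppendEntries as before. Pipelining additionally means not waiting for one batch's acknowledgments before sending the next.

```go
// proposeLoop turns queued client commands into batched log appends.
func (n *Node) proposeLoop(proposals <-chan []byte, maxBatch int) {
	for cmd := range proposals {
		batch := [][]byte{cmd}
	drain: // opportunistically absorb whatever else is already queued
		for len(batch) < maxBatch {
			select {
			case c := <-proposals:
				batch = append(batch, c)
			default:
				break drain
			}
		}
		for _, c := range batch {
			n.log = append(n.log, LogEntry{Term: n.currentTerm, Command: c})
		}
		n.broadcastAppendEntries() // one RPC round carries the whole batch
	}
}
```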
Disk performance. Raft requires that log entries be persisted to disk before a node can acknowledge them. If the leader and a majority of followers all wait for fsync on every entry, commit latency is bounded by disk speed. This is why Raft clusters on SSDs perform significantly better than on spinning disks. Some implementations use group commit (batching multiple fsyncs) to amortize the cost.
Witness replicas. In a five-node cluster, you need five machines each storing the full dataset. A witness is a node that participates in voting but does not store the full state machine, only enough log to participate in elections. Some production systems offer witness or similar vote-only members to reduce storage costs in multi-region deployments.
Key Points
- Raft decomposes consensus into three relatively independent sub-problems: leader election, log replication, and safety. Each one is tractable on its own, which is the whole reason Raft succeeded where Paxos confused everyone
- Every committed entry survives any future leader election because candidates must prove they have all committed entries before a majority will vote for them. This is the election restriction, and it is the single most important safety property in the protocol
- Leader election uses randomized timeouts between 150ms and 300ms. The randomization prevents split votes from repeating indefinitely, and in practice elections settle in one or two rounds
- etcd, CockroachDB, Consul, TiKV, and RabbitMQ quorum queues all run Raft in production. It is the dominant consensus algorithm for strongly consistent replicated state machines
Common Mistakes
- Setting election timeout too low in cloud environments. Network jitter on shared infrastructure can easily exceed 150ms, causing unnecessary leader elections that stall writes for the duration of the election
- Not implementing log compaction or snapshotting. Without it, the Raft log grows unbounded on disk and new followers joining the cluster must replay the entire history from the beginning
- Ignoring learner (non-voting) members when scaling reads. Adding voting members increases the cost of every commit because the leader must wait for a majority. Learners receive the log but do not participate in elections or commit quorums
- Assuming Raft handles Byzantine failures. It does not. Raft trusts that nodes follow the protocol honestly. A malicious node that sends fabricated log entries will corrupt the cluster. For Byzantine tolerance, you need BFT protocols like PBFT or HotStuff