Paxos Consensus
The Consensus Problem
Imagine three servers that need to agree on a single value. Maybe it is the next entry in a replicated log, or who holds a distributed lock, or what the current cluster configuration is. Each server can propose a value, and they need to reach agreement even if some servers crash or messages are delayed.
This sounds straightforward until you think about failures. Server A proposes "x" and sends it to B and C. But A crashes after sending to B and before sending to C. Now B has seen "x" but C has not. Meanwhile, C proposes "y." How do you ensure everyone ends up with the same value without blocking forever?
Paxos solves this problem. It guarantees two things: (1) only a single value is chosen (safety), and (2) if a majority of servers are alive and can communicate, a value will eventually be chosen (liveness, modulo the FLP impossibility result).
Single-Decree Paxos: Agreeing on One Value
The protocol has three roles: proposers (who suggest values), acceptors (who vote on values), and learners (who learn the chosen value). In practice, every server plays all three roles.
Phase 1: Prepare and Promise
A proposer picks a unique ballot number n (think of it as a priority or version number, monotonically increasing and globally unique) and sends Prepare(n) to a majority of acceptors.
Each acceptor that receives Prepare(n) checks: is n higher than any ballot I have previously promised? If yes, the acceptor sends back Promise(n), pledging not to accept any proposal with a ballot number less than n. If the acceptor has previously accepted a proposal in some earlier ballot, it includes that ballot and value in the promise so the proposer knows about it.
If no, the acceptor ignores the prepare message (or sends a rejection).
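As a concrete illustration, here is a minimal, in-memory sketch of the acceptor's Phase 1 logic in Python. The class and field names are invented for this example, and a real acceptor must persist this state before replying (see the implementation notes later in this section):

```python
class Acceptor:
    def __init__(self):
        self.promised_ballot = None   # highest ballot promised so far
        self.accepted_ballot = None   # ballot of the last accepted proposal, if any
        self.accepted_value = None    # value of the last accepted proposal, if any

    def on_prepare(self, ballot):
        """Handle Prepare(n): promise only if n beats every earlier promise."""
        if self.promised_ballot is None or ballot > self.promised_ballot:
            self.promised_ballot = ballot
            # Report any previously accepted proposal so the proposer
            # can adopt its value in Phase 2.
            return ("Promise", ballot, self.accepted_ballot, self.accepted_value)
        return ("Reject", self.promised_ballot)
```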
Phase 2: Accept and Accepted
Once the proposer receives promises from a majority of acceptors, it moves to Phase 2. Here is the critical rule: if any of those promises report a previously accepted value, the proposer must adopt the value from the highest-numbered accepted proposal among them. It cannot substitute its own. This is what ensures that a value, once chosen, stays chosen. Only if no promise reports a previously accepted value is the proposer free to use its own.
The proposer sends Accept(n, v) to a majority of acceptors, where v is the value (either its own or the one it was forced to adopt from the promises).
Each acceptor that receives Accept(n, v) checks: have I already promised a ballot number higher than n? If not, it accepts the proposal and stores (n, v). If so, it rejects.
When a majority of acceptors have accepted the same ballot (n, v), the value v is chosen. Learners can be notified by the acceptors or by the proposer.
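Continuing the same illustrative sketch, Phase 2 adds one handler on the acceptor side plus the proposer's value-selection rule. The names `Phase2Acceptor` and `choose_value` are invented for this example, and the code assumes the `Acceptor` class sketched above:

```python
class Phase2Acceptor(Acceptor):          # builds on the Phase 1 sketch above
    def on_accept(self, ballot, value):
        """Handle Accept(n, v): accept unless a higher ballot was promised."""
        if self.promised_ballot is None or ballot >= self.promised_ballot:
            self.promised_ballot = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return ("Accepted", ballot)
        return ("Reject", self.promised_ballot)

def choose_value(promises, my_value):
    """Proposer rule: adopt the value of the highest-ballot accepted proposal
    reported in the promises; only if none exists may it use its own value."""
    accepted = [(b, v) for (_, _, b, v) in promises if b is not None]
    return max(accepted)[1] if accepted else my_value

# One complete round against three acceptors:
acceptors = [Phase2Acceptor() for _ in range(3)]
promises = [a.on_prepare(1) for a in acceptors]
value = choose_value(promises, "x")                # nothing accepted yet, so "x"
acks = [a.on_accept(1, value) for a in acceptors]
assert acks.count(("Accepted", 1)) >= 2            # majority accepted: "x" is chosen
```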
Why This Works
The beauty of Paxos is in that Phase 2 constraint. If a value has already been accepted by a majority in some earlier ballot, any new proposer running Phase 1 will discover it (because any majority overlaps with the previous majority in at least one acceptor). The new proposer is then forced to re-propose that same value. A chosen value cannot be overwritten.
If no value has been chosen yet, the proposer is free to propose whatever it wants. If two proposers compete simultaneously, one of them will fail (its ballot gets superseded) and retry with a higher ballot. Progress is not guaranteed in the face of contention (this is the livelock problem), but safety always holds.
Multi-Paxos: Making It Practical
Single-decree Paxos agrees on one value. Building a replicated log (a sequence of values, like a sequence of database operations) means running Paxos once per log entry. That is a lot of Prepare/Promise round trips.
Multi-Paxos optimizes this by electing a stable leader. Once a proposer wins Phase 1, it can skip Phase 1 for subsequent log entries and go straight to Phase 2 as long as no other proposer challenges it. The leader sends Accept messages directly, and acceptors accept them immediately. This reduces consensus to a single round trip per log entry in the steady state.
If the leader fails, a new proposer runs Phase 1 with a higher ballot number, discovers any in-flight values from the old leader, and takes over. The leadership transition is just Paxos doing its thing with a new ballot.
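A hypothetical sketch of that steady state: the leader's ballot was established by one earlier Phase 1, so each new log entry needs only an Accept round. All names here (`SlotAcceptor`, `Leader`, `replicate`) are illustrative, not taken from any real system:

```python
class SlotAcceptor:
    def __init__(self):
        self.promised = 0          # highest ballot promised (from some earlier Phase 1)
        self.log = {}              # slot index -> (ballot, value)

    def on_accept(self, slot, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.log[slot] = (ballot, value)
            return "Accepted"
        return "Reject"            # a higher ballot exists: another proposer took over

class Leader:
    def __init__(self, ballot, acceptors):
        self.ballot = ballot       # won in an earlier Phase 1
        self.acceptors = acceptors
        self.next_slot = 0

    def replicate(self, value):
        slot, self.next_slot = self.next_slot, self.next_slot + 1
        # No Prepare needed while our ballot is still the highest one.
        acks = [a.on_accept(slot, self.ballot, value) for a in self.acceptors]
        if acks.count("Accepted") > len(self.acceptors) // 2:
            return ("chosen", slot)
        return ("lost leadership", slot)   # must rerun Phase 1 with a higher ballot

leader = Leader(ballot=7, acceptors=[SlotAcceptor() for _ in range(3)])
print(leader.replicate("op-1"))    # ('chosen', 0): one round trip, no Prepare
print(leader.replicate("op-2"))    # ('chosen', 1)
```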
This is where Paxos goes from elegant algorithm to production system, and where the real complexity lives. The Paxos paper says almost nothing about:
- Snapshotting: the log grows forever. You need to periodically snapshot the state machine and truncate old log entries.
- Membership changes: adding or removing servers from the Paxos group while it is running requires careful protocol extensions (joint consensus or similar).
- Read leases: serving reads from followers without going through consensus requires a lease mechanism to ensure the leader is still actually the leader.
- Client interaction: handling duplicate client requests, ensuring exactly-once semantics, routing clients to the leader.
This is the gap that Raft was designed to fill. Raft specifies all of these details explicitly. Paxos leaves them as exercises for the implementer.
Paxos vs. Raft
The question everyone asks: when should you use Paxos instead of Raft?
Raft was designed to be understandable. It decomposes consensus into leader election, log replication, and safety, with each piece specified in detail. Paxos derives from first principles and is more general but harder to implement correctly.
In practice, most new systems choose Raft because the specification is complete enough to build from. etcd, CockroachDB, TiKV, and Consul all use Raft. Systems that predate Raft (Chubby, ZooKeeper, Spanner) use Paxos variants or close relatives because Raft did not exist when they were built.
There are some genuine differences. Raft requires a leader for all operations, which simplifies reasoning but creates a bottleneck. Paxos can operate in a leaderless mode (though it is less efficient). Flexible Paxos allows using different quorum sizes for Phase 1 and Phase 2, which enables interesting trade-offs that Raft's fixed-quorum model does not support.
For new projects? Use Raft unless you have a specific reason not to. The reference implementations are better, the testing tools are more mature, and the specification leaves fewer gaps to fill.
Flexible Paxos: Rethinking Quorums
Classic Paxos requires a majority for both phases. With 5 servers, you need 3 for Phase 1 (Prepare) and 3 for Phase 2 (Accept). Howard, Malkhi, and Spiegelman showed in 2016 that this is more restrictive than necessary.
The actual requirement is that every Phase 1 quorum overlaps every Phase 2 quorum. With 5 servers, you could use a Phase 1 quorum of 4 and a Phase 2 quorum of 2: since 4 + 2 > 5, any two such quorums must share at least one acceptor, so the overlap guarantee holds.
Why does this matter? Phase 2 runs on every write. Phase 1 runs only during leader changes. By shrinking the Phase 2 quorum (the common case) at the cost of a larger Phase 1 quorum (the rare case), you can reduce write latency. In a geo-distributed cluster, this means writes can succeed with fewer cross-datacenter round trips.
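The requirement reduces to a one-line check, sketched here in Python:

```python
def quorums_intersect(n, q1, q2):
    """Flexible Paxos requirement: every Phase 1 quorum of size q1 must share
    at least one acceptor with every Phase 2 quorum of size q2."""
    return q1 + q2 > n

assert quorums_intersect(5, 3, 3)        # classic majority quorums
assert quorums_intersect(5, 4, 2)        # bigger Phase 1 quorum, cheaper writes
assert not quorums_intersect(5, 2, 2)    # unsafe: the quorums can miss each other
```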
The FLP Impossibility Result
Fischer, Lynch, and Paterson proved in 1985 that no deterministic protocol can guarantee consensus in an asynchronous system if even one process can crash. "Asynchronous" means there is no upper bound on message delay. A slow message and a lost message are indistinguishable.
This does not mean consensus is unsolvable. It means any solution must use some form of non-determinism (randomization or timeouts) to guarantee progress. Paxos uses timeouts for leader election. If the leader has not sent a heartbeat in T seconds, assume it is dead and start a new election. This can lead to unnecessary elections if the network is slow, but it never violates safety. The worst case is temporary unavailability, not incorrect results.
Understanding FLP matters because it sets expectations correctly. No consensus protocol can always make progress. If someone claims their protocol guarantees both safety and liveness in an asynchronous system with failures, they have either changed the model or made a mistake.
Production Lessons from Google
Google has been running Paxos in production since at least 2006 (Chubby). Their experience reports reveal patterns that the academic papers do not cover.
Chubby uses Multi-Paxos for a distributed lock service. The key lesson: Paxos-based systems need disk writes on the critical path (acceptors must persist their state before responding), and disk write latency dominates end-to-end latency. Chubby went through several iterations of optimizing disk access patterns.
Spanner uses Paxos groups for replication, with each group's replicas spread across datacenters, and TrueTime for externally consistent timestamps. Each Paxos group runs Multi-Paxos independently. The interesting piece: Spanner layers 2PC on top of Paxos. Each participant in a distributed transaction is a Paxos group, so the 2PC coordinator is itself fault-tolerant.
Megastore attempted to use Paxos across wide-area networks for Google App Engine. The latency was painful (hundreds of milliseconds per write), and the system eventually evolved into Spanner. The lesson: Paxos works across datacenters, but the latency implications are severe, and you need to architect around them.
Implementing Paxos: Where People Get Stuck
The single-decree protocol is simple. The implementation traps are:
Ballot number uniqueness. Two proposers must never use the same ballot number. The common approach: each server has an ID, and ballot numbers are (round_number, server_id) pairs, compared lexicographically.
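A small sketch of that scheme, with illustrative names:

```python
from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class Ballot:
    round: int        # compared first
    server_id: int    # tie-breaker, unique per server

def next_ballot(highest_seen: Ballot, my_id: int) -> Ballot:
    """A ballot guaranteed to be higher than anything this server has seen."""
    return Ballot(highest_seen.round + 1, my_id)

b1 = Ballot(5, 7)                        # server 7's ballot in round 5
b2 = next_ballot(b1, my_id=2)            # server 2 outbids it: Ballot(6, 2)
assert b2 > b1 and b1 != Ballot(5, 2)    # never equal across servers
```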
Persistence before responding. An acceptor must write its promise or acceptance to durable storage before sending the response. If it crashes and recovers, it must honor its last promise. Responding from memory and writing to disk asynchronously violates the protocol.
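A hedged sketch of what "persist before responding" looks like, assuming a simple JSON file as the durable store. The file layout and function name are invented for this example; a production implementation would also need to handle directory fsyncs and corruption detection:

```python
import json
import os

def persist_then_promise(state_path, promised_ballot, accepted_ballot, accepted_value):
    """Write the acceptor's new state durably, then return the reply to send."""
    record = {
        "promised": promised_ballot,
        "accepted_ballot": accepted_ballot,
        "accepted_value": accepted_value,
    }
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())              # state is on disk before we reply
    os.replace(tmp, state_path)           # atomic swap into place
    return ("Promise", promised_ballot, accepted_ballot, accepted_value)
```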
Garbage collection. Which log entries can you safely delete? Only entries that have been applied to the state machine on all servers. This requires a protocol for communicating "I have applied up to entry X" and is often the trickiest part of a production implementation.
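A minimal illustration, assuming each server periodically reports its applied index:

```python
def truncation_point(applied_indices):
    """Log entries at or below this index have been applied on every server
    and can be safely deleted (typically after taking a snapshot)."""
    return min(applied_indices.values())

# Server C lags, so only entries up to 812 may be garbage-collected:
print(truncation_point({"A": 930, "B": 925, "C": 812}))   # 812
```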
Dueling proposers. Proposer A sends Prepare(5). Proposer B sends Prepare(6) before A's Accept arrives. A's Accept gets rejected. A retries with Prepare(7). B's Accept gets rejected. This ping-pong can continue indefinitely. The fix: exponential backoff with jitter on leadership attempts, or a simple lease mechanism where the current leader holds a time-based lease and others do not attempt to lead until the lease expires.
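An illustrative backoff generator for that retry loop; the `run_phase1_with_higher_ballot` call in the usage comment is hypothetical:

```python
import random
import time

def backoff_delays(base=0.05, cap=2.0):
    """Yield exponentially growing, fully jittered delays between attempts."""
    attempt = 0
    while True:
        yield random.uniform(0, min(cap, base * (2 ** attempt)))
        attempt += 1

# Usage sketch: retry Phase 1 with a higher ballot, backing off between attempts.
# for delay in backoff_delays():
#     if run_phase1_with_higher_ballot():   # hypothetical
#         break
#     time.sleep(delay)
```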
Key Points
- Paxos is the original solution to distributed consensus. Leslie Lamport published it in 1998 (written around 1990), and nearly every consensus algorithm since, including Raft and Zab, is either a variant of Paxos or heavily influenced by it; Viewstamped Replication, developed independently around the same time, arrives at an essentially equivalent design
- Single-decree Paxos agrees on exactly one value through two phases: Prepare/Promise (proposer claims a ballot number, acceptors promise not to accept older ballots) and Accept/Accepted (proposer sends the value, acceptors accept if they have not promised a higher ballot)
- Multi-Paxos optimizes for the common case by electing a stable leader. Once a leader is established, it skips the Prepare phase entirely and runs only the Accept phase for each new value. This cuts consensus down to one round trip
- The FLP impossibility result proves that no deterministic consensus protocol can guarantee progress in an asynchronous system with even one crash failure. Paxos handles this by using timeouts and leader election, which can livelock but never violate safety
- Google Spanner, Google Chubby, Azure Storage, and Apache ZooKeeper (via ZAB, a closely related protocol) all run Paxos-family protocols in production at massive scale
Common Mistakes
- ✗ Conflating Paxos with leader election. Paxos does not require a leader. Single-decree Paxos works with multiple concurrent proposers. The leader optimization in Multi-Paxos is a performance improvement, not a correctness requirement
- ✗ Assuming that understanding single-decree Paxos means you can implement Multi-Paxos. The gap between agreeing on one value and building a replicated log involves dozens of engineering decisions that the original paper does not cover: snapshotting, membership changes, read leases, client interaction
- ✗ Ignoring the livelock scenario. Two proposers alternating Prepare messages with increasing ballot numbers can prevent either from completing the Accept phase. A randomized backoff or leader election breaks the livelock, but you have to implement it
- ✗ Thinking that Paxos is too complex to be practical. The single-decree protocol is about 20 lines of pseudocode. What is genuinely complex is building a full replicated state machine on top of it, but that complexity exists regardless of which consensus protocol you choose