Two-Phase Commit (2PC)
Architecture
The Atomic Commit Problem
You are transferring $100 from account A (on server 1) to account B (on server 2). You debit A and credit B. Both operations must succeed, or neither should. If A is debited but B is not credited, $100 vanishes. If B is credited but A is not debited, $100 appears from nowhere.
On a single database, this is easy. Wrap it in a transaction, and the database guarantees atomicity. But when the data lives on different servers, different databases, or different shards, there is no single transaction manager. You need a protocol that coordinates the commit across multiple independent participants. That protocol is two-phase commit.
How 2PC Works
The protocol involves a coordinator (the node driving the transaction) and multiple participants (the nodes holding the data).
Phase 1: Prepare
The coordinator sends a PREPARE message to every participant. Each participant does everything needed to commit the transaction except actually committing: it acquires locks, validates constraints, writes a prepare record to its write-ahead log. If everything checks out, it responds YES. If anything goes wrong (constraint violation, disk full, timeout), it responds NO.
A YES vote is a serious promise. The participant is saying: "I can commit this transaction, I have all the locks and resources, and I will not unilaterally decide to abort. I will wait for your decision." This promise must survive crashes, which is why the prepare record goes to the WAL before responding.
A NO vote is unilateral. The participant has already decided to abort its local portion and does not need to hear back from the coordinator.
Phase 2: Commit or Abort
The coordinator collects all votes. If every participant voted YES, the coordinator writes a commit record to its own WAL and sends COMMIT to all participants. If any participant voted NO (or did not respond within the timeout), the coordinator writes an abort record and sends ABORT to all participants.
Each participant receives the decision, applies or rolls back, releases its locks, and acknowledges. Once the coordinator receives all acknowledgments, the transaction is fully complete.
The Decision Point
There is a critical moment in 2PC: the instant when the coordinator writes its commit decision to durable storage. Before that point, the transaction can still be aborted (if any participant has not voted yes or the coordinator decides to abort). After that point, the decision is final. Participants who have voted yes will eventually be told to commit.
This decision point is the source of both the power and the weakness of 2PC.
The Blocking Problem
Here is the scenario that makes 2PC controversial.
Participant A votes YES. Participant B votes YES. The coordinator writes "commit" to its log and then crashes before sending the commit message to anyone.
Now both participants are stuck. They voted YES and promised not to abort unilaterally. They are holding locks on the data. They cannot commit because they have not been told to. They cannot abort because they promised not to. They are blocked, waiting for a coordinator that might not come back for minutes, hours, or ever (until someone manually intervenes).
Meanwhile, other transactions that need those locked rows are also blocked. The blast radius can be enormous.
This is the fundamental problem with 2PC. In a system with unreliable nodes, any participant can get stuck in an uncertain state between voting YES and receiving the final decision. During that uncertainty window, it holds locks and blocks other work.
Dealing with Coordinator Failure
Several strategies address the blocking problem:
Timeout and abort. The simplest approach. If a participant votes YES and does not hear back within T seconds, abort. This violates the "promise to commit" guarantee but is sometimes acceptable in practice. The risk: the coordinator eventually recovers and sends COMMIT, but the participant has already aborted. Now you have an inconsistency. Recovery protocols exist to detect and repair this, but they add complexity.
Cooperative recovery. When the coordinator is down, participants ask each other: "Did you get a commit or abort message?" If any participant received COMMIT, everyone commits. If any participant received ABORT, or if any participant has not yet voted (meaning it voted NO or never got the PREPARE), everyone aborts. The only irrecoverable case is when all participants voted YES and none received the final decision.
Replicated coordinator. The approach used by Google Spanner and CockroachDB. The coordinator is not a single node but a Paxos or Raft group. If the coordinator leader fails, another node in the group takes over, reads the commit decision from the replicated log, and finishes the protocol. This eliminates the blocking problem at the cost of running a consensus protocol for the coordinator itself.
Three-Phase Commit (3PC): The Theoretical Fix
Skeen proposed 3PC in 1981 as a non-blocking alternative to 2PC. It adds a third phase: pre-commit.
Phase 1: Prepare (same as 2PC). Phase 2: Pre-commit. If all voted yes, the coordinator sends PRE-COMMIT. Participants acknowledge. Phase 3: Commit. The coordinator sends COMMIT.
The idea is that before the coordinator commits, everyone knows that everyone else voted yes (via the pre-commit acknowledgment). If the coordinator crashes, participants can coordinate among themselves because they all know the vote outcome.
In theory, 3PC is non-blocking. In practice, it has problems. It assumes a synchronous network with bounded message delays. In an asynchronous network (the real world), 3PC can actually violate safety: a network partition can cause one side to commit and the other to abort. This is why you rarely see 3PC in production systems. 2PC with a replicated coordinator is strictly better.
Presumed Abort Optimization
In vanilla 2PC, the coordinator must persist its decision (commit or abort) before sending it to participants. This means a disk write on the critical path for every transaction, even aborted ones.
The presumed-abort optimization says: do not bother logging abort decisions. If a participant asks the coordinator about a transaction and the coordinator has no record of it, the answer is "abort." This is safe because the absence of a commit record means the transaction was never committed.
For commit decisions, the coordinator still writes to its log. But for the common case of aborted transactions (which are frequent in busy systems due to lock conflicts), you save a disk write.
The presumed-commit optimization flips this: do not log commit decisions, and assume committed if no record exists. This is useful when most transactions commit. In practice, presumed-abort is more common because it has simpler recovery semantics.
2PC in the Wild
MySQL XA Transactions. MySQL supports the XA transaction standard, which is essentially 2PC. You can run XA PREPARE to enter the prepared state and XA COMMIT or XA ROLLBACK to complete. The practical problem: MySQL's XA implementation has historically been buggy. Prepared transactions can survive a server restart but sometimes get lost during crash recovery. This has improved in recent versions, but many teams avoid MySQL XA and use application-level compensation instead.
PostgreSQL PREPARE TRANSACTION. PostgreSQL's implementation is more robust. PREPARE TRANSACTION 'txid' persists the prepared state. COMMIT PREPARED 'txid' or ROLLBACK PREPARED 'txid' finishes it. Prepared transactions survive crashes and restarts. The catch: prepared transactions hold locks until resolved. If your application crashes without committing or rolling back prepared transactions, those locks persist until someone manually cleans them up. PostgreSQL intentionally makes PREPARE TRANSACTION visible in pg_prepared_xacts for exactly this reason.
Google Spanner. This is the gold standard for 2PC done right. Every participant in a Spanner transaction is a Paxos group, and the coordinator is also a Paxos group. When the coordinator leader crashes, another member takes over and completes the protocol. When a participant leader crashes, its Paxos group elects a new leader that has the prepare record in its replicated log. There is no window of uncertainty because every decision is replicated before it is acted on. The price: two consensus rounds per transaction (one for the prepare, one for the commit), which means higher latency.
CockroachDB follows a similar design. Distributed transactions use a parallel commit optimization that pipelines the prepare and commit phases: the coordinator writes the commit record to its own transaction record and lets participants commit as soon as they see the record, without waiting for the coordinator to explicitly send COMMIT. This cuts one round trip from the happy path.
Kafka Transactions. Kafka's exactly-once semantics use a variant of 2PC. The transaction coordinator is a special Kafka broker. Producers send messages to multiple partitions within a transaction. The coordinator decides commit or abort and writes a commit marker to each partition. Consumers that read "read committed" skip messages in uncommitted transactions. This is not classic 2PC (there is no prepare phase per partition), but the coordination pattern is the same.
When to Use 2PC (and When Not To)
Use 2PC when you need atomic commits across a small number of tightly coupled data stores. Cross-shard transactions within a single database system (Spanner, CockroachDB, Vitess) are the sweet spot. The participants trust each other, the coordinator can be replicated, and the latency budget accommodates two round trips.
Do not use 2PC across loosely coupled microservices. If your checkout flow needs to update inventory, charge payment, and send a confirmation email, wrapping all three in a 2PC transaction means that an email service outage blocks checkout. Instead, use sagas (a sequence of local transactions with compensating actions) or outbox patterns (write to a local outbox table and process asynchronously).
The rule of thumb: 2PC works within a trust boundary where all participants are managed by the same team and have similar availability targets. Across trust boundaries, use eventual consistency patterns instead. The world learned this lesson the hard way with CORBA and XA transactions in the early 2000s, and it remains true today.
Key Points
- •2PC solves the atomic commit problem: either all participants in a distributed transaction commit, or all abort. There is no state where participant A committed and participant B aborted. This is the guarantee that makes distributed transactions possible
- •Phase 1 (Prepare): the coordinator asks each participant 'can you commit?' Each participant acquires locks, writes a prepare record to its WAL, and votes yes or no. A yes vote is a binding promise to commit if asked. Phase 2 (Commit/Abort): the coordinator collects votes. If all voted yes, send commit. If any voted no, send abort
- •The blocking problem is the biggest weakness. If the coordinator crashes after sending prepare but before sending the commit decision, participants are stuck holding locks and cannot proceed. They voted yes and promised to commit if asked, but nobody is asking. 3PC was invented to address this, but it trades blocking for other problems
- •Google Spanner solves the blocking problem by making the coordinator itself a Paxos group. If the coordinator leader crashes, another member of the Paxos group takes over and drives the commit decision. This is 2PC-over-Paxos, and it is the standard approach for production distributed databases
- •MySQL XA transactions, PostgreSQL PREPARE TRANSACTION, CockroachDB, and Google Spanner all use 2PC or close variants for cross-shard or cross-database transactions
Used By
Common Mistakes
- ✗Not planning for coordinator failure. Running 2PC with a single coordinator is asking for trouble in production. Either make the coordinator replicated (Spanner approach) or implement a recovery protocol where participants can query each other to learn the outcome after a coordinator crash
- ✗Holding locks during the prepare phase without a timeout. A participant that voted yes must hold its locks until it hears commit or abort. If the coordinator is slow, those locks can block other transactions for a long time. Set a timeout and abort if no decision arrives
- ✗Using 2PC across services with different availability requirements. If your payment service is in 2PC with your notification service, a notification service outage blocks payments. Keep 2PC boundaries tight: within a database or between closely coupled shards, not across loosely coupled microservices
- ✗Confusing 2PC with consensus protocols like Paxos or Raft. 2PC decides whether to commit a specific transaction. Paxos decides on a value in a replicated log. They solve different problems and are often used together (Spanner uses 2PC across Paxos groups)