PBFT — Practical Byzantine Fault Tolerance
The Byzantine Generals Problem
Crash fault tolerance handles the case where machines die. A dead machine does not send messages. It does not lie about its state. It just stops. Raft and Paxos work perfectly in this model.
But what if a machine does not just crash? What if it sends different values to different peers? What if it claims to have data it does not have, or corrupts messages intentionally? This is the Byzantine fault model, named after Lamport's 1982 thought experiment about Byzantine generals who might be traitors.
Byzantine faults sound theoretical, but they happen in real systems. Hardware errors can flip bits in memory, causing a node to send corrupted data it genuinely believes is correct. A compromised machine in a multi-tenant environment might try to manipulate consensus outcomes. In blockchain networks, validators have financial incentives to cheat. Any system where participants do not fully trust each other needs Byzantine fault tolerance.
How PBFT Works
Castro and Liskov published PBFT in 1999 as the first practical algorithm for Byzantine fault tolerance. Previous Byzantine consensus protocols existed but were too slow for real use. PBFT brought the message complexity down to O(n²), which made it viable for small clusters.
The system has n = 3f + 1 replicas, where f is the maximum number of Byzantine faults to tolerate. One replica is the primary (leader), and the rest are backups.
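To make the arithmetic concrete, here is a minimal sketch; the function and key names are illustrative, not part of the protocol:

```python
def pbft_sizes(f: int) -> dict:
    """Cluster and quorum sizes needed to tolerate f Byzantine replicas."""
    return {
        "replicas": 3 * f + 1,       # n = 3f + 1
        "quorum": 2 * f + 1,         # prepare/commit quorum (counting a replica's own message)
        "client_replies": f + 1,     # matching replies a client must collect
    }

for f in (1, 2, 3):
    print(f, pbft_sizes(f))
# f=1 -> {'replicas': 4, 'quorum': 3, 'client_replies': 2}
```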
Pre-Prepare Phase
A client sends a request to the primary. The primary assigns a sequence number to the request and broadcasts a PRE-PREPARE message to all backups. This message says: "I, the primary, am proposing that request R gets sequence number S in view V."
A backup accepts the pre-prepare if: (1) it is in view V, (2) it has not accepted a different pre-prepare for the same view and sequence number, and (3) the sequence number is within the acceptable range (a "water mark" window).
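A minimal sketch of that acceptance check, assuming an in-memory `BackupState` record; the field names and water-mark bounds are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class BackupState:
    """Illustrative in-memory state for one backup; field names are assumptions."""
    view: int
    low_mark: int                                   # water-mark window (low, high]
    high_mark: int
    accepted: dict = field(default_factory=dict)    # (view, seq) -> request digest

def accept_pre_prepare(state: BackupState, view: int, seq: int, digest: str) -> bool:
    """Apply the three acceptance checks described above."""
    if view != state.view:                              # (1) must be in the current view
        return False
    prior = state.accepted.get((view, seq))
    if prior is not None and prior != digest:           # (2) no conflicting pre-prepare
        return False
    if not (state.low_mark < seq <= state.high_mark):   # (3) within the water marks
        return False
    state.accepted[(view, seq)] = digest
    return True
```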
Prepare Phase
Each backup that accepts the pre-prepare broadcasts a PREPARE message to all other replicas (including the primary). A replica collects PREPARE messages until it has 2f prepares from different replicas that match the pre-prepare. Together with the pre-prepare, that makes 2f+1 replicas in agreement (counting itself), and the request is "prepared."
Why 2f? The 2f matching prepares plus the pre-prepare form a quorum of 2f+1 replicas, and at most f of those can be Byzantine, so at least f+1 honest replicas have agreed on the same ordering. Any two quorums of size 2f+1 out of 3f+1 replicas overlap in at least f+1 replicas, hence in at least one honest replica, so no conflicting ordering for the same sequence number can also gather a quorum.
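A sketch of how a replica might track prepares, assuming a `PrepareTracker` helper (names are illustrative) and that the matching pre-prepare has already been accepted:

```python
from collections import defaultdict

class PrepareTracker:
    """Counts matching PREPAREs per (view, seq, digest); a sketch, names are illustrative."""
    def __init__(self, f: int):
        self.f = f
        self.prepares = defaultdict(set)            # (view, seq, digest) -> sender ids

    def on_prepare(self, view: int, seq: int, digest: str, sender: int) -> bool:
        """Record one PREPARE; return True once 2f matching prepares have arrived
        (the accepted pre-prepare is the 2f+1-th agreeing replica)."""
        self.prepares[(view, seq, digest)].add(sender)
        return len(self.prepares[(view, seq, digest)]) >= 2 * self.f
```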
Commit Phase
Once a replica has collected enough prepares, it broadcasts a COMMIT message. When a replica collects 2f+1 commit messages (from different replicas), the request is committed. The replica executes the request and sends a REPLY to the client.
The client waits for f+1 matching replies from different replicas. Since at most f replicas are Byzantine, f+1 matching replies guarantee that at least one honest replica confirmed the result.
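A sketch of the client-side check, assuming replies arrive as (replica id, result) pairs; the function name is illustrative:

```python
def client_accepts(replies: list[tuple[int, str]], f: int) -> str | None:
    """Return a result once f+1 distinct replicas have reported the same value,
    otherwise None."""
    supporters: dict[str, set[int]] = {}
    for replica_id, result in replies:
        supporters.setdefault(result, set()).add(replica_id)
        if len(supporters[result]) >= f + 1:
            return result
    return None

# With f=1, two matching replies suffice even if a third replica lies:
print(client_accepts([(0, "ok"), (3, "bogus"), (2, "ok")], f=1))   # -> "ok"
```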
Why Three Phases?
Two phases are not enough because the prepare phase only ensures ordering agreement within the current view. If the primary is faulty and a view change happens after prepares but before commits, some replicas might have prepared different requests for the same sequence number. The commit phase ensures that enough honest replicas have committed to the ordering so that it survives view changes.
View Changes: Dealing with a Bad Leader
If the primary is Byzantine (sending inconsistent pre-prepares, staying silent, or sending garbage), the system must replace it. Replicas monitor the primary and trigger a view change if they suspect it is faulty.
A view change works like this:
- A backup that suspects the primary broadcasts a VIEW-CHANGE message containing its current state: what requests it has prepared and committed.
- The new primary (replica (v+1) mod n for view v+1) collects 2f VIEW-CHANGE messages from other replicas (2f+1 counting its own).
- The new primary computes the correct state from the view change messages, ensuring no committed request is lost.
- It broadcasts a NEW-VIEW message with the merged state and resumes normal operation.
The view change protocol is the most complex part of PBFT. Getting it right means ensuring that any request committed in the old view remains committed in the new view, even if the old primary was Byzantine and tried to create conflicting states.
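A heavily simplified sketch of the merge step, assuming each VIEW-CHANGE message is modeled as a plain dictionary of prepared requests; real PBFT also carries checkpoint proofs, signed prepare certificates, and null requests to fill gaps:

```python
def merge_view_changes(view_changes: list[dict], f: int) -> dict:
    """Pick, for each sequence number, the prepared request from the highest view.
    Each view-change entry is a hypothetical dict: {seq: (view, digest), ...}."""
    assert len(view_changes) >= 2 * f + 1, "need a view-change quorum (counting the new primary's own)"
    chosen: dict[int, tuple[int, str]] = {}
    for vc in view_changes:
        for seq, (view, digest) in vc.items():
            if seq not in chosen or view > chosen[seq][0]:
                chosen[seq] = (view, digest)   # keep the prepared request from the highest view
    # These (seq, digest) pairs are re-proposed in the NEW-VIEW message.
    return {seq: digest for seq, (view, digest) in chosen.items()}
```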
The Scalability Problem
Each prepare and commit message goes from every replica to every other replica. With n replicas, that is O(n²) messages per consensus round. For n=4, that is manageable (16 messages). For n=100, that is 10,000 messages per round. At high throughput (thousands of requests per second), the network becomes saturated.
This quadratic cost is why PBFT does not scale beyond a few dozen nodes in practice. Blockchain systems that use PBFT variants typically limit the validator set: Cosmos chains run with 100-175 validators, Hyperledger Fabric uses small orderer sets, and Tendermint recommends no more than a few hundred validators.
Several optimizations reduce the constant factor:
Batching. Instead of running PBFT for every individual request, batch hundreds or thousands of requests into a single consensus round. This amortizes the O(n²) message cost across many requests.
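A back-of-envelope sketch of that amortization (the function name and counting are illustrative approximations):

```python
def amortized_messages_per_request(n: int, batch_size: int) -> float:
    """Rough all-to-all cost per client request: prepare and commit are each
    about n*(n-1) messages, and one round orders a whole batch."""
    messages_per_round = 2 * n * (n - 1)
    return messages_per_round / batch_size

print(amortized_messages_per_request(100, 1))      # ~19,800 messages to order a single request
print(amortized_messages_per_request(100, 1000))   # ~20 messages per request with batches of 1,000
```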
Speculative execution. Zyzzyva (2007) skips the commit phase when all replicas agree. If the client receives 3f+1 matching speculative replies, it accepts immediately. The commit phase only runs when there is disagreement. This saves one round trip in the happy case.
Threshold signatures. Instead of n replicas each sending messages to n-1 others, replicas contribute shares to a threshold signature. Once enough shares are collected, a single combined signature proves agreement. This reduces message complexity to O(n).
PBFT vs. Raft: When to Use Each
The decision is simple in most cases:
Use Raft when all participants are under your control (your own servers in your own datacenter). Crash faults are the main concern. Raft is simpler, faster (O(n) messages), and scales to larger clusters.
Use PBFT (or a variant) when participants do not fully trust each other. Blockchain validator networks, multi-organization consortiums, or systems where compromised nodes are a realistic threat.
The hybrid approach is common in practice. CockroachDB uses Raft internally (nodes trust each other) but adds checksums and authentication at the storage layer to catch hardware-level Byzantine faults like bit flips. This catches the most common real-world Byzantine faults, accidental corruption, without the performance cost of full BFT.
Modern BFT Protocols
PBFT was the starting point, but research has produced better options:
HotStuff (2019, the basis of Facebook's Diem/Libra consensus). Achieves O(n) message complexity per consensus round by routing all votes through the leader (a star communication pattern) and combining them with threshold signatures. Chained HotStuff pipelines the phases: a new block is proposed every round and is committed after a three-block chain of quorum certificates, so in steady state one block is finalized per round. It is widely regarded as the state of the art for BFT consensus.
Tendermint/CometBFT. A simplified BFT protocol used by Cosmos and many proof-of-stake chains. It adds explicit locking rules so validators cannot commit conflicting blocks across rounds, and uses gossip-based message propagation instead of all-to-all broadcast.
Istanbul BFT (IBFT). Used in Hyperledger Besu and Quorum (JP Morgan's Ethereum fork). A PBFT variant adapted for blockchain with simpler view change logic.
Production Considerations
Key management. Every PBFT replica needs a signing key, and messages are authenticated. Rotating keys, revoking compromised keys, and distributing keys securely adds operational complexity that crash-fault protocols do not have.
Checkpoint and garbage collection. PBFT replicas periodically create checkpoints (a stable snapshot of the state) and garbage-collect old messages. Without checkpoints, message logs grow unbounded. A checkpoint requires 2f+1 matching checkpoint messages, adding another protocol on top of the consensus protocol.
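A sketch of the stable-checkpoint bookkeeping, assuming a `CheckpointTracker` helper; signatures and proof retention are omitted:

```python
from collections import defaultdict

class CheckpointTracker:
    """Marks a checkpoint stable once 2f+1 replicas report the same state digest
    at the same sequence number (a sketch; names are illustrative)."""
    def __init__(self, f: int):
        self.f = f
        self.votes = defaultdict(set)               # (seq, state_digest) -> replica ids
        self.stable_seq = 0

    def on_checkpoint(self, seq: int, state_digest: str, replica_id: int) -> int:
        self.votes[(seq, state_digest)].add(replica_id)
        if len(self.votes[(seq, state_digest)]) >= 2 * self.f + 1 and seq > self.stable_seq:
            self.stable_seq = seq                   # messages up to seq can now be garbage-collected
        return self.stable_seq
```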
Client retries and exactly-once semantics. A client that does not receive f+1 matching replies must retry. But the request might have already been executed. PBFT handles this with client-stamped request IDs: replicas detect duplicate requests and return the cached reply without re-executing. This is straightforward but adds state management per client.
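A sketch of that duplicate detection, assuming a per-client reply cache keyed by the client's timestamp; the names and the `execute()` callback are illustrative:

```python
class ReplyCache:
    """Per-client last-reply cache; a sketch of how replicas avoid re-execution."""
    def __init__(self, execute):
        self.execute = execute
        self.last = {}                              # client_id -> (timestamp, cached_reply)

    def handle(self, client_id: str, timestamp: int, request) -> object:
        seen = self.last.get(client_id)
        if seen is not None and timestamp <= seen[0]:
            return seen[1]                          # duplicate: resend cached reply, no re-execution
        reply = self.execute(request)
        self.last[client_id] = (timestamp, reply)
        return reply
```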
Performance numbers. Typical PBFT implementations achieve 1,000-10,000 transactions per second with 4-7 nodes, dropping to hundreds with 20+ nodes. For comparison, Raft easily handles 10,000-100,000 operations per second. The gap is the price of Byzantine tolerance.
Key Points
- PBFT handles Byzantine faults: nodes that can lie, send conflicting messages, or behave arbitrarily. Raft and Paxos only handle crash faults, where nodes either work correctly or stop completely. If you need consensus in an adversarial environment, you need something like PBFT.
- The protocol tolerates up to f Byzantine nodes out of 3f+1 total. With 4 nodes you tolerate 1 Byzantine failure; with 7 you tolerate 2. The 3f+1 bound is fundamental for consensus under partial synchrony and cannot be improved.
- Three phases drive consensus: pre-prepare (the leader proposes), prepare (replicas echo the proposal), and commit (replicas confirm they saw enough prepares). A client waits for f+1 matching replies from different replicas before accepting the result.
- View changes handle a Byzantine leader. If the leader is faulty or slow, replicas trigger a view change that rotates leadership to the next replica. The protocol ensures no committed request is lost during the transition.
- The O(n²) message complexity is the main scalability bottleneck. Every replica sends messages to every other replica in the prepare and commit phases. Beyond about 20 nodes, the network overhead becomes prohibitive for high-throughput systems.
Common Mistakes
- ✗ Assuming PBFT scales like Raft. Raft has O(n) message complexity per consensus round; PBFT has O(n²). A 100-node PBFT cluster sends roughly 10,000 messages per round. This is why most blockchain systems use PBFT only for small validator sets.
- ✗ Using PBFT when crash fault tolerance is sufficient. If your threat model only includes machines dying or losing network connectivity, Raft or Paxos is simpler, faster, and scales better. PBFT adds significant overhead that is only justified when nodes might actively misbehave.
- ✗ Forgetting that PBFT requires a known, fixed membership. Unlike permissionless blockchain consensus (Nakamoto), PBFT requires every node to know every other node upfront. Adding or removing nodes requires a reconfiguration protocol.
- ✗ Ignoring the authentication overhead. Every PBFT message must be authenticated (with a MAC or digital signature) to prevent forgery. In high-throughput systems, the CPU cost of signing and verifying thousands of messages per second is substantial.