PBFT — Practical Byzantine Fault Tolerance
The Byzantine Generals Problem
Crash fault tolerance handles the case where machines die. A dead machine does not send messages. It does not lie about its state. It just stops. Raft and Paxos work perfectly in this model.
But what if a machine does not just crash? What if it sends different values to different peers? What if it claims to have data it does not have, or corrupts messages intentionally? This is the Byzantine fault model, named after Lamport's 1982 thought experiment about Byzantine generals who might be traitors.
Byzantine faults sound theoretical, but they happen in real systems. Hardware errors can flip bits in memory, causing a node to send corrupted data it genuinely believes is correct. A compromised machine in a multi-tenant environment might try to manipulate consensus outcomes. In blockchain networks, validators have financial incentives to cheat. Any system where participants do not fully trust each other needs Byzantine fault tolerance.
How PBFT Works
Castro and Liskov published PBFT in 1999 as the first practical algorithm for Byzantine fault tolerance. Previous Byzantine consensus protocols existed but were too slow for real use. PBFT brought the message complexity down to O(n²), which made it viable for small clusters.
The system has n = 3f + 1 replicas, where f is the maximum number of Byzantine faults to tolerate. One replica is the primary (leader), and the rest are backups.
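To make the arithmetic concrete, here is a minimal sketch; the function and key names are illustrative, not part of the protocol:

```python
def pbft_sizes(f: int) -> dict:
    """Cluster and quorum sizes needed to tolerate f Byzantine replicas."""
    return {
        "replicas": 3 * f + 1,       # n = 3f + 1
        "quorum": 2 * f + 1,         # prepare/commit quorum (counting a replica's own message)
        "client_replies": f + 1,     # matching replies a client must collect
    }

for f in (1, 2, 3):
    print(f, pbft_sizes(f))
# f=1 -> {'replicas': 4, 'quorum': 3, 'client_replies': 2}
```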
Pre-Prepare Phase
A client sends a request to the primary. The primary assigns a sequence number to the request and broadcasts a PRE-PREPARE message to all backups. This message says: "I, the primary, am proposing that request R gets sequence number S in view V."
A backup accepts the pre-prepare if: (1) it is in view V, (2) it has not accepted a different pre-prepare for the same view and sequence number, and (3) the sequence number is within the acceptable range (a "water mark" window).
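A minimal sketch of that acceptance check, assuming an in-memory `BackupState` record; the field names and water-mark bounds are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class BackupState:
    """Illustrative in-memory state for one backup; field names are assumptions."""
    view: int
    low_mark: int                                   # water-mark window (low, high]
    high_mark: int
    accepted: dict = field(default_factory=dict)    # (view, seq) -> request digest

def accept_pre_prepare(state: BackupState, view: int, seq: int, digest: str) -> bool:
    """Apply the three acceptance checks described above."""
    if view != state.view:                              # (1) must be in the current view
        return False
    prior = state.accepted.get((view, seq))
    if prior is not None and prior != digest:           # (2) no conflicting pre-prepare
        return False
    if not (state.low_mark < seq <= state.high_mark):   # (3) within the water marks
        return False
    state.accepted[(view, seq)] = digest
    return True
```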
Prepare Phase
Each backup that accepts the pre-prepare broadcasts a PREPARE message to all other replicas (including the primary). A replica collects PREPARE messages until it has 2f prepares from different replicas that match the pre-prepare. Together with the pre-prepare, that makes 2f+1 replicas in agreement (counting itself), and the request is "prepared."
Why 2f? The 2f matching prepares plus the pre-prepare form a quorum of 2f+1 replicas, and at most f of those can be Byzantine, so at least f+1 honest replicas have agreed on the same ordering. Any two quorums of size 2f+1 out of 3f+1 replicas overlap in at least f+1 replicas, hence in at least one honest replica, so no conflicting ordering for the same sequence number can also gather a quorum.
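A sketch of how a replica might track prepares, assuming a `PrepareTracker` helper (names are illustrative) and that the matching pre-prepare has already been accepted:

```python
from collections import defaultdict

class PrepareTracker:
    """Counts matching PREPAREs per (view, seq, digest); a sketch, names are illustrative."""
    def __init__(self, f: int):
        self.f = f
        self.prepares = defaultdict(set)            # (view, seq, digest) -> sender ids

    def on_prepare(self, view: int, seq: int, digest: str, sender: int) -> bool:
        """Record one PREPARE; return True once 2f matching prepares have arrived
        (the accepted pre-prepare is the 2f+1-th agreeing replica)."""
        self.prepares[(view, seq, digest)].add(sender)
        return len(self.prepares[(view, seq, digest)]) >= 2 * self.f
```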
Commit Phase
Once a replica has collected enough prepares, it broadcasts a COMMIT message. When a replica collects 2f+1 commit messages (from different replicas), the request is committed. The replica executes the request and sends a REPLY to the client.
The client waits for f+1 matching replies from different replicas. Since at most f replicas are Byzantine, f+1 matching replies guarantee that at least one honest replica confirmed the result.
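A sketch of the client-side check, assuming replies arrive as (replica id, result) pairs; the function name is illustrative:

```python
def client_accepts(replies: list[tuple[int, str]], f: int) -> str | None:
    """Return a result once f+1 distinct replicas have reported the same value,
    otherwise None."""
    supporters: dict[str, set[int]] = {}
    for replica_id, result in replies:
        supporters.setdefault(result, set()).add(replica_id)
        if len(supporters[result]) >= f + 1:
            return result
    return None

# With f=1, two matching replies suffice even if a third replica lies:
print(client_accepts([(0, "ok"), (3, "bogus"), (2, "ok")], f=1))   # -> "ok"
```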
Why Three Phases?
Two phases are not enough because the prepare phase only ensures ordering agreement within the current view. If the primary is faulty and a view change happens after prepares but before commits, some replicas might have prepared different requests for the same sequence number. The commit phase ensures that enough honest replicas have committed to the ordering so that it survives view changes.
View Changes: Dealing with a Bad Leader
If the primary is Byzantine (sending inconsistent pre-prepares, staying silent, or sending garbage), the system must replace it. Replicas monitor the primary and trigger a view change if they suspect it is faulty.
A view change works like this:
- A backup that suspects the primary broadcasts a VIEW-CHANGE message containing its current state: what requests it has prepared and committed.
- The new primary (replica (v+1) mod n for view v+1) collects 2f VIEW-CHANGE messages from other replicas (2f+1 counting its own).
- The new primary computes the correct state from the view change messages, ensuring no committed request is lost.
- It broadcasts a NEW-VIEW message with the merged state and resumes normal operation.
The view change protocol is the most complex part of PBFT. Getting it right means ensuring that any request committed in the old view remains committed in the new view, even if the old primary was Byzantine and tried to create conflicting states.
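A heavily simplified sketch of the merge step, assuming each VIEW-CHANGE message is modeled as a plain dictionary of prepared requests; real PBFT also carries checkpoint proofs, signed prepare certificates, and null requests to fill gaps:

```python
def merge_view_changes(view_changes: list[dict], f: int) -> dict:
    """Pick, for each sequence number, the prepared request from the highest view.
    Each view-change entry is a hypothetical dict: {seq: (view, digest), ...}."""
    assert len(view_changes) >= 2 * f + 1, "need a view-change quorum (counting the new primary's own)"
    chosen: dict[int, tuple[int, str]] = {}
    for vc in view_changes:
        for seq, (view, digest) in vc.items():
            if seq not in chosen or view > chosen[seq][0]:
                chosen[seq] = (view, digest)   # keep the prepared request from the highest view
    # These (seq, digest) pairs are re-proposed in the NEW-VIEW message.
    return {seq: digest for seq, (view, digest) in chosen.items()}
```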
The Scalability Problem
Each prepare and commit message goes from every replica to every other replica. With n replicas, that is O(n²) messages per consensus round. For n=4, that is manageable (16 messages). For n=100, that is 10,000 messages per round. At high throughput (thousands of requests per second), the network becomes saturated.
This quadratic cost is why PBFT does not scale beyond a few dozen nodes in practice. Blockchain systems that use PBFT variants typically limit the validator set: Cosmos chains run with 100-175 validators, Hyperledger Fabric uses small orderer sets, and Tendermint recommends no more than a few hundred validators.
Several optimizations reduce the constant factor:
Batching. Instead of running PBFT for every individual request, batch hundreds or thousands of requests into a single consensus round. This amortizes the O(n²) message cost across many requests.
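A back-of-envelope sketch of that amortization (the function name and counting are illustrative approximations):

```python
def amortized_messages_per_request(n: int, batch_size: int) -> float:
    """Rough all-to-all cost per client request: prepare and commit are each
    about n*(n-1) messages, and one round orders a whole batch."""
    messages_per_round = 2 * n * (n - 1)
    return messages_per_round / batch_size

print(amortized_messages_per_request(100, 1))      # ~19,800 messages to order a single request
print(amortized_messages_per_request(100, 1000))   # ~20 messages per request with batches of 1,000
```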
Speculative execution. Zyzzyva (2007) skips the commit phase when all replicas agree. If the client receives 3f+1 matching speculative replies, it accepts immediately. The commit phase only runs when there is disagreement. This saves one round trip in the happy case.
Threshold signatures. Instead of n replicas each sending messages to n-1 others, replicas contribute shares to a threshold signature. Once enough shares are collected, a single combined signature proves agreement. This reduces message complexity to O(n).
PBFT vs. Raft: When to Use Each
The decision is simple in most cases:
Use Raft when all participants are under your control (your own servers in your own datacenter). Crash faults are the main concern. Raft is simpler, faster (O(n) messages), and scales to larger clusters.
Use PBFT (or a variant) when participants do not fully trust each other. Blockchain validator networks, multi-organization consortiums, or systems where compromised nodes are a realistic threat.
The hybrid approach is common in practice. CockroachDB uses Raft internally (nodes trust each other) but adds checksums and authentication at the storage layer to catch hardware-level Byzantine faults like bit flips. This catches the most common real-world Byzantine faults, accidental corruption, without the performance cost of full BFT.
Modern BFT Protocols
PBFT was the starting point, but research has produced better options:
HotStuff (2019, the basis of Facebook's Diem/Libra consensus). Achieves O(n) message complexity per consensus round by routing all votes through the leader (a star communication pattern) and combining them with threshold signatures. Chained HotStuff pipelines the phases: a new block is proposed every round and is committed after a three-block chain of quorum certificates, so in steady state one block is finalized per round. It is widely regarded as the state of the art for BFT consensus.
Tendermint/CometBFT. A simplified BFT protocol used by Cosmos and many proof-of-stake chains. It adds explicit locking rules so validators cannot commit conflicting blocks across rounds, and uses gossip-based message propagation instead of all-to-all broadcast.
Istanbul BFT (IBFT). Used in Hyperledger Besu and Quorum (JP Morgan's Ethereum fork). A PBFT variant adapted for blockchain with simpler view change logic.
Production Considerations
Key management. Every PBFT replica needs a signing key, and messages are authenticated. Rotating keys, revoking compromised keys, and distributing keys securely adds operational complexity that crash-fault protocols do not have.
Checkpoint and garbage collection. PBFT replicas periodically create checkpoints (a stable snapshot of the state) and garbage-collect old messages. Without checkpoints, message logs grow unbounded. A checkpoint requires 2f+1 matching checkpoint messages, adding another protocol on top of the consensus protocol.
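A sketch of the stable-checkpoint bookkeeping, assuming a `CheckpointTracker` helper; signatures and proof retention are omitted:

```python
from collections import defaultdict

class CheckpointTracker:
    """Marks a checkpoint stable once 2f+1 replicas report the same state digest
    at the same sequence number (a sketch; names are illustrative)."""
    def __init__(self, f: int):
        self.f = f
        self.votes = defaultdict(set)               # (seq, state_digest) -> replica ids
        self.stable_seq = 0

    def on_checkpoint(self, seq: int, state_digest: str, replica_id: int) -> int:
        self.votes[(seq, state_digest)].add(replica_id)
        if len(self.votes[(seq, state_digest)]) >= 2 * self.f + 1 and seq > self.stable_seq:
            self.stable_seq = seq                   # messages up to seq can now be garbage-collected
        return self.stable_seq
```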
Client retries and exactly-once semantics. A client that does not receive f+1 matching replies must retry. But the request might have already been executed. PBFT handles this with client-stamped request IDs: replicas detect duplicate requests and return the cached reply without re-executing. This is straightforward but adds state management per client.
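A sketch of that duplicate detection, assuming a per-client reply cache keyed by the client's timestamp; the names and the `execute()` callback are illustrative:

```python
class ReplyCache:
    """Per-client last-reply cache; a sketch of how replicas avoid re-execution."""
    def __init__(self, execute):
        self.execute = execute
        self.last = {}                              # client_id -> (timestamp, cached_reply)

    def handle(self, client_id: str, timestamp: int, request) -> object:
        seen = self.last.get(client_id)
        if seen is not None and timestamp <= seen[0]:
            return seen[1]                          # duplicate: resend cached reply, no re-execution
        reply = self.execute(request)
        self.last[client_id] = (timestamp, reply)
        return reply
```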
Performance numbers. Typical PBFT implementations achieve 1,000-10,000 transactions per second with 4-7 nodes, dropping to hundreds with 20+ nodes. For comparison, Raft easily handles 10,000-100,000 operations per second. The gap is the price of Byzantine tolerance.
Key Points
- PBFT handles Byzantine faults: nodes that can lie, send conflicting messages, or behave arbitrarily. Raft and Paxos only handle crash faults, where nodes either work correctly or stop completely. If you need consensus in an adversarial environment, you need something like PBFT.
- The protocol tolerates up to f Byzantine nodes out of 3f+1 total. With 4 nodes you tolerate 1 Byzantine failure; with 7 you tolerate 2. The 3f+1 bound is fundamental for consensus under partial synchrony and cannot be improved.
- Three phases drive consensus: pre-prepare (the leader proposes), prepare (replicas echo the proposal), and commit (replicas confirm they saw enough prepares). A client waits for f+1 matching replies from different replicas before accepting the result.
- View changes handle a Byzantine leader. If the leader is faulty or slow, replicas trigger a view change that rotates leadership to the next replica. The protocol ensures no committed request is lost during the transition.
- The O(n²) message complexity is the main scalability bottleneck. Every replica sends messages to every other replica in the prepare and commit phases. Beyond about 20 nodes, the network overhead becomes prohibitive for high-throughput systems.
Common Mistakes
- ✗ Assuming PBFT scales like Raft. Raft has O(n) message complexity per consensus round; PBFT has O(n²). A 100-node PBFT cluster sends roughly 10,000 messages per round. This is why most blockchain systems use PBFT only for small validator sets.
- ✗ Using PBFT when crash fault tolerance is sufficient. If your threat model only includes machines dying or losing network connectivity, Raft or Paxos is simpler, faster, and scales better. PBFT adds significant overhead that is only justified when nodes might actively misbehave.
- ✗ Forgetting that PBFT requires a known, fixed membership. Unlike permissionless blockchain consensus (Nakamoto), PBFT requires every node to know every other node upfront. Adding or removing nodes requires a reconfiguration protocol.
- ✗ Ignoring the authentication overhead. Every PBFT message must be authenticated (with a MAC or digital signature) to prevent forgery. In high-throughput systems, the CPU cost of signing and verifying thousands of messages per second is substantial.