Merkle Trees & Anti-Entropy Repair
Two Copies of Your Data Just Diverged
Replication keeps your system alive when servers fail. But replicas drift apart. A write reaches replica A but the network blips and replica B misses it. A disk corruption flips a bit on replica C. An operator restores replica D from a backup that is six hours stale.
Now you need to bring them back into sync. The brute-force approach: compare every record on replica A with every record on replica B. With a billion records, that means transferring a billion keys and values over the network, diffing them, and sending corrections. Even if only 100 records actually differ, you paid the cost of examining all one billion.
Merkle trees make this problem tractable.
The Structure
A Merkle tree is a binary tree built from the bottom up. Each leaf node is the hash of one data block (or one key-value pair, or one row). Each internal node is the hash of its two children's hashes concatenated together. The root hash is a single fingerprint that represents the entire dataset.
If two replicas have the same root hash, their datasets are identical. Full stop. One hash comparison, and you know whether repair is needed.
If the roots differ, compare the children. The root has two children: left and right. If the left children match, the divergence is somewhere in the right subtree. Recurse into the right subtree and repeat. At each level, you eliminate half the data from consideration.
With a tree of depth D over N data blocks, you have N leaves and D levels. Finding all divergent blocks requires at most O(D * K) hash comparisons, where K is the number of divergent blocks. Since D = log2(N), that is O(K * log N) comparisons instead of O(N). For a billion records with 100 divergent keys, that is about 100 * 30 = 3,000 hash comparisons instead of 1,000,000,000 full comparisons.
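The recursion above can be sketched in a few lines of Python (a toy in-memory version with a power-of-two leaf count; real systems bucket leaves and stream trees between nodes):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(blocks):
    """Bottom-up Merkle tree over a power-of-two number of blocks.
    Returns levels: levels[0] = leaf hashes, levels[-1] = [root]."""
    level = [h(b) for b in blocks]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(a, b):
    """Leaf indices where two same-shape trees disagree. Prunes any
    subtree whose hashes match, so the work is O(K * depth)."""
    out = []
    def walk(lvl, idx):
        if a[lvl][idx] == b[lvl][idx]:
            return                      # identical subtree: eliminated
        if lvl == 0:
            out.append(idx)             # divergent leaf found
        else:
            walk(lvl - 1, 2 * idx)
            walk(lvl - 1, 2 * idx + 1)
    walk(len(a) - 1, 0)                 # start at the root
    return out

# Two replicas of 8 blocks that differ only at index 5:
replica_a = [b"block-%d" % i for i in range(8)]
replica_b = list(replica_a)
replica_b[5] = b"corrupted"
print(diff(build_tree(replica_a), build_tree(replica_b)))  # → [5]
```

Note that the comparison never touches leaves 0 through 3: their shared parent hash matches, so that whole half of the dataset is eliminated in one comparison.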
Anti-Entropy Repair in Cassandra
Cassandra is probably the best-known production user of Merkle trees for repair. Here is how it works.
Each replica owns a range of the token ring (the keyspace is divided into ranges by consistent hashing). When you run nodetool repair, the coordinator asks each replica that owns a given token range to build a Merkle tree over its data in that range.
Building the tree means reading every row in the range, hashing it, and constructing the tree bottom-up. Each replica sends its tree to the coordinator. The coordinator compares trees pairwise, identifies the divergent ranges, and tells the replicas to stream just those ranges to each other.
The beauty: if a token range has 50 million rows but only 200 diverged, the repair transfers about 200 rows worth of data instead of 50 million. On a 10-node cluster with terabytes of data, this is the difference between a repair that takes minutes and one that takes days.
The ugly: building the tree requires a full scan of the data range. On a large dataset, that full scan itself is expensive. It reads from disk, consumes CPU for hashing, and competes with production traffic. Before Cassandra 4.0, every repair cycle rebuilt the tree from scratch.
Incremental Merkle Trees
Cassandra 4.0 introduced incremental repair, which changes the economics significantly. Instead of rebuilding the entire tree every time, the system tracks which data has been repaired and which has not. Only unrepaired data needs a new Merkle tree. After repair completes, that data is marked as repaired and excluded from future tree builds.
This means the first repair is still expensive (full scan), but subsequent repairs only process new writes since the last repair. If your cluster ingests 1% new data per day and repairs daily, each repair scans 1% of the data instead of 100%.
The implementation works by maintaining separate SSTables for repaired and unrepaired data. Compaction keeps them separated. The Merkle tree build only reads unrepaired SSTables.
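A toy model of that repaired/unrepaired split (the names here are illustrative, not Cassandra's actual classes, and real SSTables are immutable files rather than dicts):

```python
import hashlib

class Replica:
    """Sketch of a replica that tracks which rows a past repair covered."""
    def __init__(self):
        self.repaired = {}      # rows already covered by a completed repair
        self.unrepaired = {}    # rows written since the last repair

    def write(self, key, value):
        self.unrepaired[key] = value

    def build_repair_hashes(self):
        # Only unrepaired rows are read and hashed; the repaired set is
        # skipped entirely, which is where the savings come from.
        return {k: hashlib.sha256(v).hexdigest()
                for k, v in sorted(self.unrepaired.items())}

    def mark_repaired(self):
        # After a successful repair, promote the rows so the next cycle
        # does not rehash them (Cassandra does this per-SSTable).
        self.repaired.update(self.unrepaired)
        self.unrepaired.clear()

r = Replica()
for i in range(3):
    r.write(f"row-{i}", b"value")
print(len(r.build_repair_hashes()))  # 3: first repair scans everything
r.mark_repaired()
r.write("row-3", b"value")
print(len(r.build_repair_hashes()))  # 1: next repair scans only new writes
```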
Practical Tree Depth
A binary Merkle tree over 1 billion records has 30 levels. That means 2^30 leaves, each storing a hash. The tree itself, with all internal nodes, contains about 2 billion hashes. If each hash is 32 bytes (SHA-256), the tree is about 64GB. That is far too large to hold in memory or transfer between replicas.
Practical implementations use a bounded depth. Cassandra defaults to a depth of 15, which gives 2^15 = 32768 leaf buckets. Each bucket covers a range of tokens, and the bucket's hash covers all rows that fall into that range. This means the tree fits in about 1MB and transfers quickly between replicas.
The tradeoff: if a bucket contains 30,000 rows and one of them diverges, the tree identifies the bucket but not the specific row. The repair process then streams all 30,000 rows in that bucket to resolve the single divergent row. With deeper trees (more buckets), you get finer granularity but larger trees. The depth-15 default works for most clusters, but very large datasets with low divergence rates benefit from deeper trees.
Cassandra 4.0+ allows configuring the tree depth. Setting it too high wastes memory; too low wastes bandwidth during repair. A useful heuristic: each leaf bucket should cover roughly the amount of data you are comfortable streaming during repair. If streaming 10MB per bucket is acceptable and your total data per range is 100GB, you want about 10,000 buckets, which is a depth of about 14.
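That heuristic is easy to turn into a calculation (a sketch; `repair_tree_depth` is a made-up helper name, not a Cassandra setting):

```python
import math

def repair_tree_depth(total_bytes: int, max_stream_bytes_per_bucket: int) -> int:
    """Pick a Merkle tree depth so that each leaf bucket covers no more
    data than you are willing to stream when that bucket diverges."""
    buckets_needed = total_bytes / max_stream_bytes_per_bucket
    return math.ceil(math.log2(buckets_needed))

# The example above: 100GB per range, 10MB acceptable per bucket.
depth = repair_tree_depth(100 * 1024**3, 10 * 1024**2)
print(depth)  # → 14 (2^14 = 16384 buckets, the power of two above 10,240)
```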
Merkle DAGs: Beyond Binary Trees
Git, IPFS, and several blockchain systems use a generalization called Merkle DAGs (directed acyclic graphs). Instead of a strict binary tree, any node can reference any number of children by their content hash.
In Git, a commit object contains the hash of a tree object. That tree object contains hashes of blob objects (file contents) and other tree objects (subdirectories). Every object is addressed by its content hash. If two commits share the same file, they reference the same blob hash without duplication.
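Git's blob addressing is simple enough to reproduce directly: the object id is a SHA-1 over a `blob <size>\0` header followed by the raw contents, matching what `git hash-object` computes:

```python
import hashlib

def git_blob_hash(content: bytes) -> str:
    """Git's content address for a blob object: SHA-1 over the header
    'blob <size>\\0' plus the raw bytes."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

# Identical content always produces the identical object id, which is
# exactly why Git deduplicates unchanged files across commits.
print(git_blob_hash(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

That final hash is the well-known id of the empty blob, the same in every Git repository on earth.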
This gives Git its efficiency: cloning a repository with 100,000 commits does not transfer 100,000 copies of files that did not change between commits. The deduplication is automatic because identical content produces identical hashes.
IPFS takes this further. Every block of data stored in IPFS is addressed by its content hash. A large file is split into chunks, each chunk is hashed, and a Merkle DAG links the chunks together. Fetching a file means walking the DAG from the root, and you can verify every chunk against its expected hash. Nobody can tamper with the data without changing the root hash.
The connection to anti-entropy repair is direct. If two IPFS nodes claim to have the same content, their root hashes must match. If the hashes differ, the DAG structure lets you pinpoint exactly which chunks diverge.
Patricia Merkle Tries in Ethereum
Ethereum uses a specialized variant called a Patricia Merkle Trie (or Modified Merkle Patricia Trie) for its state database. This combines the properties of a trie (prefix tree for efficient key lookups), a Merkle tree (every node is hashed, root hash represents the entire state), and path compression (Patricia optimization to skip common prefixes).
The result is a data structure where you can prove that a given account has a certain balance by providing a path from the state root to the leaf, along with all sibling hashes. The verifier checks that each hash along the path is correct and that the root matches the known state root. This proof is about 1-2KB regardless of how many accounts exist in the state.
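The verification step can be sketched for a plain binary Merkle tree (a simplification: Ethereum's Patricia trie proofs carry trie nodes rather than sibling hashes, but the recompute-and-compare principle is the same):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify_proof(leaf: bytes, proof, root: bytes) -> bool:
    """Verify a Merkle inclusion proof. `proof` lists the sibling hashes
    from the leaf up to the root, each tagged with which side it is on.
    The verifier recomputes the path and compares against the known root."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

# Build a tiny 4-leaf tree by hand and prove leaf index 2 is included:
leaves = [h(b"acct-%d" % i) for i in range(4)]
n01, n23 = h(leaves[0] + leaves[1]), h(leaves[2] + leaves[3])
root = h(n01 + n23)
proof = [(leaves[3], False), (n01, True)]  # sibling leaf, then sibling subtree
print(verify_proof(b"acct-2", proof, root))  # → True
```

The proof size grows with tree depth, not with the number of leaves, which is why the real proofs stay around 1-2KB no matter how large the state is.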
These proofs are what make light clients possible. A light client does not store the full Ethereum state. It trusts the state root from a block header (which is secured by consensus) and then requests Merkle proofs from full nodes for any specific data it needs. The proof is verifiable without trusting the full node.
The Hash Function Matters
The choice of hash function affects both security and performance. For anti-entropy repair between trusted replicas in the same data center, you do not need a cryptographically secure hash. MurmurHash3 or xxHash runs 10-20x faster than SHA-256 and provides excellent collision resistance for non-adversarial use cases. Cassandra uses MurmurHash3 for its partitioner and repair trees.
For content-addressable storage where untrusted parties might supply data (IPFS, Git, blockchains), you need cryptographic hashes. SHA-256 is the standard. Git historically used SHA-1, and after the SHAttered collision attack in 2017, it has been migrating to SHA-256.
The performance difference is not trivial. Hashing a terabyte of data with SHA-256 takes roughly 10x longer than with xxHash. For a Cassandra repair that scans the entire dataset, that is the difference between a 20-minute tree build and a 3-minute one. Use the weakest hash that meets your security requirements.
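A rough way to feel the gap on your own hardware (a sketch; `zlib.crc32` stands in for a fast non-cryptographic hash here, since MurmurHash3 and xxHash are third-party packages in Python, and absolute numbers will vary by machine):

```python
import time
import zlib
import hashlib

def throughput(fn, payload: bytes, rounds: int = 50) -> float:
    """Measured throughput of `fn` over `payload` in MB/s."""
    start = time.perf_counter()
    for _ in range(rounds):
        fn(payload)
    elapsed = time.perf_counter() - start
    return rounds * len(payload) / elapsed / 1e6

payload = bytes(4 * 1024 * 1024)  # 4MB buffer
print("sha256:", round(throughput(hashlib.sha256, payload)), "MB/s")
print("crc32: ", round(throughput(zlib.crc32, payload)), "MB/s")
```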
Bandwidth Estimation
Here is a practical formula for estimating repair bandwidth savings.
Without Merkle trees: to find K divergent keys out of N total, you transfer N key hashes (to compare) plus K actual values (to repair). If each key hash is 32 bytes and N is 1 billion, that is 32GB of hashes alone.
With a Merkle tree of depth D: you transfer O(D * K) internal node hashes to locate the divergent subtrees, plus the K actual values. With D = 15 and K = 1000, that is about 15,000 hashes (480KB) plus 1000 values.
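The formula as a quick calculator (the 1KB value size is an assumption for illustration):

```python
def repair_bytes(n: int, k: int, depth: int,
                 hash_size: int = 32, value_size: int = 1024):
    """Rough bytes on the wire, with and without a Merkle tree, to find
    and fix k divergent keys out of n total."""
    without_tree = n * hash_size + k * value_size      # ship every key hash
    with_tree = depth * k * hash_size + k * value_size  # ship only the paths
    return without_tree, with_tree

# The numbers above: 1 billion keys, 1000 divergent, depth-15 tree.
full, merkle = repair_bytes(n=10**9, k=1000, depth=15)
print(f"naive: {full / 1e9:.1f} GB, merkle: {merkle / 1e6:.2f} MB")
# → naive: 32.0 GB, merkle: 1.50 MB
```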
When K is small relative to N (which it usually is, since replicas mostly agree), the savings are orders of magnitude. But when K approaches N (replicas are completely divergent), the Merkle tree does not help much and you should just do a full sync.
A useful rule of thumb: Merkle tree repair is worth it when fewer than 10% of records have diverged. Beyond that, full streaming is faster because you skip the tree build and comparison overhead.
Combining with Other Techniques
Merkle trees are one piece of the anti-entropy puzzle. Most production systems layer multiple approaches:
Read repair catches divergence at read time. When a coordinator reads from multiple replicas and gets different values, it sends the latest value to the stale replicas. This is cheap and opportunistic but only fixes keys that are actively read.
Hinted handoff prevents divergence during transient failures. When a write cannot reach a replica, another node stores a "hint" and replays it when the replica recovers. This handles the common case of brief outages.
Full anti-entropy repair with Merkle trees catches everything else: data corruption, missed hints, long outages, operator errors. It is the backstop that guarantees eventual convergence.
Running all three together gives you fast repair of hot data (read repair), quick recovery from transient failures (hinted handoff), and comprehensive repair for everything else (Merkle tree repair). Skipping any one layer leaves gaps that grow over time.
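Read repair, the first of those layers, is small enough to sketch end to end (illustrative only; real coordinators compare digests first and resolve conflicts per cell, not per whole value):

```python
class Replica:
    """Toy replica: a dict mapping key -> (timestamp, payload)."""
    def __init__(self, data=None):
        self.data = dict(data or {})

def read_with_repair(key, replicas):
    """Read from every replica, return the value with the newest
    timestamp, and push it back to any replica holding a stale copy."""
    answers = {r: r.data.get(key) for r in replicas}
    newest = max((v for v in answers.values() if v is not None),
                 key=lambda v: v[0])
    for replica, value in answers.items():
        if value != newest:
            replica.data[key] = newest   # the opportunistic repair step
    return newest[1]

# Replica b missed a write; a read of "k" heals it in passing:
a = Replica({"k": (2, "new")})
b = Replica({"k": (1, "old")})
print(read_with_repair("k", [a, b]))  # → new
print(b.data["k"])                    # → (2, 'new')
```

The limitation the text mentions is visible in the sketch: a key that is never read is never repaired, which is exactly the gap Merkle tree repair closes.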
Key Points
- Two replicas can find their differences in O(log N) hash comparisons instead of comparing every record. Compare root hashes, recurse into the subtree that differs, and narrow down to exactly the divergent blocks
- Cassandra builds Merkle trees during repair to identify divergent data ranges. Only the differing data transfers between replicas, saving enormous bandwidth compared to full sync
- Merkle DAGs generalize the concept beyond binary trees. Git uses Merkle DAGs for commits, trees, and blobs. IPFS uses them for content-addressable storage where the hash IS the address
- Repairing K divergent keys out of N total with a depth-D tree transfers O(K * D) hashes plus K values. Without the tree, you transfer N hashes. When K is much smaller than N, the savings are massive
- Incremental Merkle trees (Cassandra 4.0+) avoid the cost of rebuilding the entire tree from scratch on every repair cycle
Common Mistakes
- Rebuilding the entire Merkle tree on every sync cycle. Use incremental updates that only rehash the subtrees where data changed
- Choosing tree depth too shallow. If each leaf covers thousands of keys, you detect that something diverged in a large range but still have to compare every key in that range
- Ignoring the CPU cost of hashing large datasets. SHA-256 on terabytes of data is not cheap, and running repair during peak traffic can starve application workloads
- Not scheduling repair during low-traffic periods. Anti-entropy repair is IO and CPU intensive. Running it at peak hours competes with production reads and writes