Bloom Filters & Probabilistic Membership — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(k) per lookup, O(m) space
Key Points for Bloom Filters & Probabilistic Membership
- Answers 'is this element in the set?' with zero false negatives and a tunable false positive rate. If a Bloom filter says no, the element is definitely not there. If it says yes, the element is probably there
- A bit array of m bits with k independent hash functions. Insertion hashes the element k times and sets those k bit positions to 1. Lookup hashes the element k times and checks whether all k positions are 1
- False positive rate is approximately (1 - e^(-kn/m))^k where n is the number of inserted elements, m is the bit array size, and k is the number of hash functions. Tuning m and k for a target false positive rate is straightforward given expected n
- Standard Bloom filters do not support deletion. Removing an element would require clearing bits that might be shared with other elements. Counting Bloom filters replace each bit with a counter to enable deletion at the cost of 3-4x more memory
- Cassandra, Chrome Safe Browsing, RocksDB, PostgreSQL, and Bitcoin SPV nodes all use Bloom filters in production to avoid expensive lookups against data that is almost certainly not present
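The mechanics above fit in a few lines. A minimal sketch, assuming SHA-256 double hashing to derive the k positions and the standard sizing formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2 (class and method names are illustrative, not from any particular library):

```python
import hashlib
import math

class BloomFilter:
    """Minimal sketch: m bits, k positions per item via double hashing."""
    def __init__(self, expected_n, fp_rate):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
        self.m = max(1, int(-expected_n * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / expected_n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Two 64-bit halves of one SHA-256 digest simulate k independent hashes
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << pos % 8

    def might_contain(self, item):
        return all(self.bits[pos // 8] >> pos % 8 & 1
                   for pos in self._positions(item))

bf = BloomFilter(expected_n=1000, fp_rate=0.01)
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))   # True: a member is never reported absent
print(bf.might_contain("carol"))   # almost certainly False
```

The double-hashing trick (h1 + i·h2) is a common shortcut for simulating k independent hash functions from one digest.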
Production Systems Using Bloom Filters & Probabilistic Membership
- Cassandra SSTable lookup
- Chrome Safe Browsing
- RocksDB block-based filter
- PostgreSQL bloom index
- Akamai cache lookup
- Bitcoin SPV nodes
Common Mistakes with Bloom Filters & Probabilistic Membership
- Not sizing the filter correctly for the expected number of elements. An undersized Bloom filter degrades gracefully in theory but catastrophically in practice: once the bit array fills past the optimal ~50% of bits set, false positive rates spike well above your design target
- Assuming you can delete elements from a standard Bloom filter. Clearing a bit that belongs to multiple elements corrupts the filter silently. If you need deletion, use a Counting Bloom filter or a Cuckoo filter from the start
- Using a Bloom filter where false positives affect correctness, not just performance. A Bloom filter as a negative cache is fine (false positive means an unnecessary lookup). A Bloom filter deciding whether to charge a customer is not fine
- Ignoring hash function quality. Bloom filters need hash functions that produce independent, uniformly distributed outputs. Using a weak hash function creates correlated bit positions and inflates your actual false positive rate well beyond the theoretical bound
Related to Bloom Filters & Probabilistic Membership
HyperLogLog & Cardinality Estimation, Count-Min Sketch & Frequency Estimation
BM25 — Best Matching 25 — Search & Retrieval
Difficulty: Advanced. Complexity: O(q × avgPostingLen)
Key Points for BM25 — Best Matching 25
- Ranks documents by combining term frequency saturation, inverse document frequency, and document length normalization. It fixes the biggest problem with raw TF-IDF: long documents no longer dominate results, and repeating a word 50 times does not give you 50x the score
- Two parameters control everything. k1 (typically 1.2-2.0) sets how quickly term frequency saturates. b (typically 0.75) controls how much document length penalizes the score. Setting b=0 turns off length normalization entirely
- Elasticsearch switched from TF-IDF to BM25 as the default scoring function in version 5.0 (Lucene 6.0). The reason was simple: BM25 consistently beat TF-IDF across every standard information retrieval benchmark they tested
- BM25 scores are decomposable per query term, which enables aggressive query-time optimizations like WAND and Block-Max WAND. These skip large chunks of posting lists that cannot possibly produce top-k results
- BM25 is a lexical matching algorithm. It cannot match 'automobile' to 'car' unless you add synonyms explicitly. Modern search systems combine BM25 with vector embeddings in a hybrid approach to get the best of both worlds
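The scoring formula from the first two bullets can be sketched directly. A minimal version using the smoothed Robertson/Lucene IDF form (function name and argument layout are illustrative):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """Score one document for a query.

    doc_tf: term -> frequency within this document
    df:     term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        # IDF with +0.5 smoothing (the Robertson/Lucene form, always >= 0)
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Saturating TF, penalized by relative document length (b controls it)
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

df, n_docs = {"search": 2}, 3
short = bm25_score(["search"], {"search": 2}, 10, 50, df, n_docs)
longer = bm25_score(["search"], {"search": 2}, 200, 50, df, n_docs)
print(short > longer)   # True: same tf, but the longer document is penalized
```

Setting b=0 makes `norm` collapse to k1, so document length drops out of the score entirely, matching the second bullet.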
Production Systems Using BM25 — Best Matching 25
- Elasticsearch / OpenSearch
- Apache Lucene / Solr
- SQLite FTS5
- PostgreSQL tsvector ranking
- Typesense
- Tantivy (Rust search engine)
Common Mistakes with BM25 — Best Matching 25
- Running with default k1 and b without tuning for your actual data. Short product titles need very different parameters than long-form blog posts. Always run relevance evaluation on a representative query set before shipping
- Ignoring how tokenization and stemming affect BM25 scores. If your indexing analyzer stems 'running' to 'run' but your query analyzer does not, you get zero matches on a query that should return thousands of results
- Treating BM25 scores as absolute relevance measures. A score of 12.5 means nothing on its own. BM25 scores are only meaningful for ranking documents against each other within the same query
- Not accounting for multi-field scoring. A title match should almost always score higher than a body match, but BM25 does not know about fields by itself. Elasticsearch applies BM25 per field and combines them, and those field boost weights need careful thought
Related to BM25 — Best Matching 25
Inverted Index, HyperLogLog & Cardinality Estimation, Bloom Filters & Probabilistic Membership
Chord & Distributed Hash Tables — Data Distribution
Difficulty: Advanced. Complexity: O(log n) lookup with finger table
Key Points for Chord & Distributed Hash Tables
- A Distributed Hash Table (DHT) maps keys to nodes in a decentralized network with no central coordinator. Any node can look up any key by routing through O(log n) intermediate nodes, even in a network with millions of participants
- Chord places both keys and nodes on a circular ID space (ring) of size 2^m. Each key is stored on its successor node, the first node whose ID is equal to or follows the key's ID clockwise on the ring
- Finger tables make lookups fast. Each node maintains a table of O(log n) entries pointing to nodes at exponentially increasing distances around the ring. Instead of hopping one node at a time, lookups jump halfway to the target, then a quarter, then an eighth
- Kademlia (used in BitTorrent and IPFS) uses XOR distance instead of clockwise distance. This makes routing symmetric (if A is close to B, then B is close to A) and enables parallel lookups where a node queries multiple peers simultaneously
- Node joins and departures require updating finger tables and transferring keys. Chord's stabilization protocol runs periodically to fix inconsistencies, but there is a window where lookups can fail during churn. Production systems add replication on top to handle this
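The successor rule and finger table from the bullets above, sketched on a toy 16-bit ring. A real Chord node knows only its fingers, not the full membership; this sketch cheats by sorting all node IDs to keep the routing logic visible:

```python
import hashlib

M = 16            # bits in the ID space
RING = 2 ** M     # ring size 2^M

def node_id(name):
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % RING

def successor(key, nodes):
    """First node whose ID is >= key, wrapping around the ring."""
    for nid in sorted(nodes):
        if nid >= key:
            return nid
    return min(nodes)   # wrapped past the highest ID

def finger_table(n, nodes):
    """Finger i points at successor(n + 2^i): exponentially spaced shortcuts."""
    return [successor((n + 2 ** i) % RING, nodes) for i in range(M)]

nodes = {node_id(f"node-{i}") for i in range(8)}
key = node_id("my-key")
print(f"key {key} is stored on node {successor(key, nodes)}")
```

Each lookup hop uses the highest finger that does not overshoot the key, which is what halves the remaining distance and yields O(log n) hops.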
Production Systems Using Chord & Distributed Hash Tables
- BitTorrent (Mainline DHT / Kademlia)
- IPFS (Kademlia variant)
- Ethereum (Kademlia-based discovery)
- Coral CDN (DSHT)
- Amazon Dynamo (influenced by Chord)
- Apache Cassandra (ring concept)
Common Mistakes with Chord & Distributed Hash Tables
- Confusing DHTs with consistent hashing for server-side load balancing. Consistent hashing (like what Cassandra uses) assumes a central view of the membership. DHTs work when no single node knows the full membership. They solve different problems
- Assuming DHT lookups are fast in absolute terms. O(log n) hops across the internet means O(log n) network round trips, each adding 50-200ms of latency. A lookup in a 10,000-node DHT might take 500-2000ms. This is fine for file sharing, terrible for database queries
- Not planning for churn. In peer-to-peer networks, nodes join and leave constantly. Each departure can orphan keys if replication is insufficient. Production DHTs replicate keys across multiple successor nodes and run background repair processes
- Ignoring Sybil attacks. In an open DHT, an attacker can create thousands of nodes and position them strategically to intercept or censor specific keys. Defending against this requires proof-of-work, social trust graphs, or other identity mechanisms
Related to Chord & Distributed Hash Tables
Consistent Hashing & Ring Algorithms, Gossip Protocols & SWIM, Merkle Trees & Anti-Entropy Repair
Circuit Breakers & Bulkheads — Failure Detection & Quorums
Difficulty: Intermediate. Complexity: O(1) per request
Key Points for Circuit Breakers & Bulkheads
- A circuit breaker monitors calls to a downstream service and trips open when failures exceed a threshold. Once open, requests fail immediately without attempting the call. This prevents a failing service from dragging down its callers and cascading the failure across the entire system
- Three states: Closed (normal, requests flow through), Open (tripped, requests fail fast), Half-Open (testing, a limited number of requests probe the downstream to check if it has recovered). The transition from open to half-open happens after a configurable timeout
- Bulkheads isolate components so that a failure in one does not exhaust shared resources and take down others. Named after ship bulkheads that contain flooding to one compartment. In practice, this means separate thread pools, connection pools, or rate limits per downstream dependency
- Netflix Hystrix popularized both patterns but is now in maintenance mode. Modern alternatives include Resilience4j (Java), Polly (.NET), Envoy proxy (sidecar-based), and Istio (service mesh-level). The concepts are the same regardless of the implementation
- Circuit breakers and bulkheads are defense mechanisms, not solutions. They buy you time to recover. The actual fix is always upstream: retry with backoff, degrade gracefully, serve cached data, or alert an operator. A circuit breaker that trips and stays open forever is just a permanent outage with extra steps
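A toy version of the three-state machine described above, using a percentage-over-window trip condition (all parameter names are illustrative; libraries like Resilience4j expose equivalent knobs):

```python
import collections
import time

class CircuitBreaker:
    """Toy three-state breaker: closed -> open -> half-open -> closed."""
    def __init__(self, failure_rate=0.5, window=20, open_seconds=30.0,
                 probes=3, clock=time.monotonic):
        self.failure_rate = failure_rate           # trip at this failure fraction
        self.window = collections.deque(maxlen=window)
        self.open_seconds = open_seconds           # how long to stay open
        self.probes = probes                       # test calls allowed half-open
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_results = []

    def allow(self):
        """Should the next call be attempted?"""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "half-open"           # timeout elapsed: start probing
                self.probe_results = []
            else:
                return False                       # fail fast
        if self.state == "half-open":
            return len(self.probe_results) < self.probes
        return True

    def record(self, success):
        """Report the outcome of an attempted call."""
        if self.state == "half-open":
            self.probe_results.append(success)
            if not success:
                self._trip()                       # probe failed: back to open
            elif len(self.probe_results) >= self.probes:
                self.state = "closed"              # all probes passed
                self.window.clear()
            return
        self.window.append(success)
        full = len(self.window) == self.window.maxlen
        if full and self.window.count(False) / len(self.window) >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()

# Simulated clock so the open timeout is observable without sleeping
now = [0.0]
cb = CircuitBreaker(window=10, open_seconds=30.0, clock=lambda: now[0])
for _ in range(10):
    cb.record(False)       # downstream failing
print(cb.state)            # open
now[0] = 31.0              # wait out the open timeout
print(cb.allow())          # True: a half-open probe is permitted
```

Injecting the clock keeps the state machine testable; production libraries do the same internally.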
Production Systems Using Circuit Breakers & Bulkheads
- Netflix (Hystrix, now Resilience4j)
- Envoy Proxy (outlier detection)
- Istio service mesh
- AWS App Mesh
- Spring Cloud Circuit Breaker
- Polly (.NET resilience library)
Common Mistakes with Circuit Breakers & Bulkheads
- Setting the failure threshold too low. If your circuit breaker trips after 3 failures in 60 seconds, a single burst of network timeouts will open it even though the downstream service is fundamentally healthy. Use a percentage-based threshold (e.g., 50% failure rate over the last 100 calls) rather than an absolute count
- Not distinguishing between failure types. A 500 error from the downstream usually means the service is struggling. A 400 error means your request was bad. Counting 400s toward the circuit breaker threshold will trip it when your own requests are malformed, not when the downstream is failing
- Forgetting the half-open state. A circuit breaker that goes from open to closed without testing first floods the recovering service with full traffic. The half-open state lets through a small number of test requests. Only if they succeed does the breaker close
- Using circuit breakers for services that must not be skipped. If your payment processor is down and you cannot complete the checkout, a circuit breaker that fails fast is the right behavior. But if you silently skip the payment step and ship the order for free, that is worse than being slow
Related to Circuit Breakers & Bulkheads
Failure Detection Algorithms, Gossip Protocols & SWIM, Quorum Systems & NRW Notation
Consistent Hashing & Ring Algorithms — Data Distribution
Difficulty: Advanced. Complexity: O(log N) lookup with ring
Key Points for Consistent Hashing & Ring Algorithms
- Adding or removing a server with modulo hashing (hash(key) % N) remaps nearly every key. With 100 servers, adding one server moves about 99% of your data. Consistent hashing remaps only 1/N of keys on average
- Virtual nodes smooth out load distribution by giving each physical server 100-200 positions on the hash ring. Without them, servers can own wildly unequal segments due to hash clustering
- Jump Hash from Google achieves perfect uniformity with zero memory overhead, but only supports append-only scaling. You can add bucket N+1 but you cannot remove bucket 5
- Rendezvous hashing (highest random weight) needs no ring data structure at all. For each key, hash against every server and pick the highest. Clean and elegant, but O(N) per lookup limits it to smaller clusters
- Maglev hashing builds a lookup table for O(1) queries after construction. Google uses it in their network load balancers where per-packet latency matters
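A minimal ring with virtual nodes, demonstrating the 1/N remapping property from the first bullet (MD5 and 150 vnodes are illustrative choices, not a recommendation):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes per physical server."""
    def __init__(self, servers, vnodes=150):
        self.ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping past the end
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring_a = HashRing(["s1", "s2", "s3", "s4"])
ring_b = HashRing(["s1", "s2", "s3", "s4", "s5"])   # one server added
keys = [f"key-{i}" for i in range(10_000)]
moved = sum(ring_a.lookup(k) != ring_b.lookup(k) for k in keys)
print(f"{moved / len(keys):.1%} moved")   # ~1/5 of keys, not ~100% as with modulo
```

Every key that moves, moves to the new server: adding points on the ring only steals keys for those points, which is exactly the 1/N property.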
Production Systems Using Consistent Hashing & Ring Algorithms
- Cassandra (token ring with virtual nodes)
- DynamoDB (consistent hashing with partition map)
- Memcached clients (ketama algorithm)
- Akamai CDN (original consistent hashing paper)
- Nginx upstream consistent hashing
- HAProxy consistent hashing
Common Mistakes with Consistent Hashing & Ring Algorithms
- Using too few virtual nodes. With 10 virtual nodes per server, load imbalance can exceed 2x between the most-loaded and least-loaded servers. At 150+ virtual nodes, variance drops below 10%
- Giving identical virtual node counts to servers with different capacities. A 64GB machine and a 16GB machine should not own the same fraction of the ring
- Reaching for consistent hashing when the server set is small and stable. With 5 servers that never change, modulo hashing is faster, simpler, and perfectly uniform
- Not monitoring actual key distribution after deployment. Hash functions can cluster on real-world key patterns even when they look uniform on random inputs
Related to Consistent Hashing & Ring Algorithms
Merkle Trees & Anti-Entropy Repair, Quorum Systems & NRW Notation
Contraction Hierarchies — Geospatial Algorithms
Difficulty: Advanced. Complexity: Preprocessing: hours (500M nodes). Query: < 1ms (~500 nodes explored)
Key Points for Contraction Hierarchies
- CH works by contracting nodes in order of increasing importance. Contracting a node means removing it from the graph and adding shortcut edges between its neighbors if the shortest path between those neighbors went through the removed node. The 'witness path' check determines whether a shortcut is needed: if an alternative path exists that is equally short or shorter, no shortcut is added.
- After preprocessing, every shortest path in the graph can be decomposed into an upward segment (source to a high-importance node) followed by a downward segment (high-importance node to destination). This is the hierarchical property. A query runs bidirectional search: forward from source going only upward, backward from destination going only upward. They meet at the highest-importance node on the path.
- The query explores approximately 500 nodes total, regardless of whether the route is 5 km or 5,000 km. This is because both searches climb the hierarchy quickly: residential streets to arterials to highways to motorways. The hierarchy converges at a small number of high-importance nodes (major highway interchanges) that form the backbone of long-distance routing.
- Customizable Contraction Hierarchies (CCH) solve the dynamic weight problem. CCH separates the hierarchy structure (node ordering, which shortcuts exist) from the edge weights. The structure is preprocessed once based on graph topology. When traffic changes edge weights, only the shortcut weights need recalculation. A bottom-up propagation updates all affected shortcuts in ~2 seconds for the full graph, or ~100ms for localized changes. This is how Google Maps reflects live traffic without re-running 4-6 hour preprocessing.
- Node importance ordering is the most critical preprocessing decision. The standard heuristic considers: edge difference (how many shortcuts would be added vs edges removed), contracted neighbors (avoid creating shortcut chains), and original edge count. Google's variant also incorporates traffic volume: high-traffic corridors get higher importance, which improves query performance for popular routes.
- The preprocessed graph is deployed as a memory-mapped binary file. Each routing server loads the file via mmap. The OS handles paging: frequently queried regions (urban areas) stay in RAM, rarely queried regions (rural) stay on disk until accessed. No deserialization step on server startup. A new graph version is deployed by writing a new file to S3 and swapping the mmap pointer, achieving zero-downtime deployment.
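The contraction step with its witness-path check can be sketched on a toy graph. This ignores node ordering, shortcut unpacking, and the bidirectional upward query; it only shows when a shortcut is or is not added (function names are illustrative):

```python
import heapq

def dijkstra(graph, src, targets, exclude, cutoff):
    """Shortest distances from src to `targets`, never visiting `exclude`,
    abandoning paths longer than `cutoff` (the witness search)."""
    dist, found = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq and len(found) < len(targets):
        d, node = heapq.heappop(pq)
        if d > cutoff or d > dist.get(node, float("inf")):
            continue
        if node in targets:
            found[node] = d
        for nbr, w in graph.get(node, {}).items():
            if nbr == exclude:
                continue
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr))
    return found

def contract(graph, v):
    """Contract v: for each in-neighbor u and out-neighbor w, add shortcut
    (u, w) only if no witness path of equal or shorter length avoids v."""
    ins = [(u, graph[u][v]) for u in graph if v in graph.get(u, {})]
    outs = list(graph.get(v, {}).items())
    shortcuts = []
    for u, w_in in ins:
        for w, w_out in outs:
            if u == w:
                continue
            via_v = w_in + w_out
            witness = dijkstra(graph, u, {w}, exclude=v, cutoff=via_v)
            if witness.get(w, float("inf")) > via_v:
                shortcuts.append((u, w, via_v))   # no witness: shortcut needed
    graph.pop(v, None)                            # remove v from the graph...
    for u in graph:
        graph[u].pop(v, None)
    for u, w, d in shortcuts:                     # ...and install the shortcuts
        cur = graph.setdefault(u, {}).get(w, float("inf"))
        graph[u][w] = min(cur, d)
    return shortcuts

g = {"a": {"v": 1.0}, "v": {"b": 1.0}, "b": {}}
print(contract(g, "v"))   # [('a', 'b', 2.0)]: the only a->b path ran through v
```

If a direct edge a→b of weight 1.5 existed, the witness search would find it and no shortcut would be created, per the equal-or-shorter rule above.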
Production Systems Using Contraction Hierarchies
- Google Maps
- Apple Maps
- OSRM
- Valhalla (Mapbox)
- Uber (ETA computation)
Common Mistakes with Contraction Hierarchies
- Running Dijkstra or A* on the full graph at scale. On a 500M-node road network, Dijkstra explores ~2 million nodes per query (~2 seconds). At 58,000 queries/sec, that is 116 billion node visits per second. CH reduces this to ~500 nodes per query, making the same workload feasible on 6-18 servers.
- Treating the CH graph as immutable between weekly rebuilds. Traffic changes edge weights constantly. Without CCH weight updates, the routing engine uses stale weights and produces suboptimal routes. The fix: run CCH weight propagation every 30 seconds from live traffic data in Redis.
- Underestimating shortcut memory. Contracting a 500M-node graph can produce 1.5-2B shortcut edges on top of 1.2B original edges. Each shortcut stores a target node (8 bytes), weight (4 bytes), and middle node for unpacking (8 bytes). That is 30-40 GB of shortcuts alone. Regional sharding (5-12 GB per shard) keeps per-server memory manageable.
- Reprocessing the full graph for every change. A single new road does not require 4-6 hours of CH preprocessing. Minor topology changes can be handled by temporarily marking edges as blocked or open via CCH weight overrides (set weight to infinity for closed roads, normal weight for opened roads). Reserve full reprocessing for when cumulative topology changes exceed a threshold.
- Ignoring cross-region routing. When the graph is sharded by geographic region, routes that cross shard boundaries need a stitching mechanism. The transit node overlay pattern precomputes shortest-path distances between border nodes across regions. A cross-region route becomes 3 CH queries (origin shard + overlay + destination shard), adding ~3-5ms total.
Related to Contraction Hierarchies
Google S2 Geometry, Consistent Hashing & Ring Algorithms
Count-Min Sketch & Frequency Estimation — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(d) per update, O(w*d) space
Key Points for Count-Min Sketch & Frequency Estimation
- Estimates how often an element appears in a data stream using a fixed amount of memory regardless of stream length. The structure never needs to know the universe of possible elements in advance
- Uses d independent hash functions mapping elements to d rows of w counters each. Insertion increments d counters, query returns the minimum of d counters. The minimum is the least-overcounted estimate
- Overestimates frequency by at most epsilon * N with probability at least 1 - delta, where w = ceil(e/epsilon) and d = ceil(ln(1/delta)). There are no underestimates
- Conservative update optimization only increments counters that equal the current minimum, significantly reducing overestimation for frequent items at the cost of losing the strict mergeability property
- Used in production at Cisco, Apache Flink, Redis (RedisBloom Top-K), AT&T, and Twitter for network monitoring, streaming analytics, fraud detection, and trending topic detection
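A minimal sketch of the structure described above, with w and d derived from the epsilon/delta formulas (row-salted SHA-256 is an illustrative stand-in for d independent hash functions):

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon=0.001, delta=0.01):
        # Guarantee: estimate <= true + epsilon*N with probability >= 1 - delta
        self.w = math.ceil(math.e / epsilon)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, item):
        # One column per row, derived from a row-salted digest
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum across rows: the least-overcounted counter, never an undercount
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch(epsilon=0.001, delta=0.01)   # w = 2719, d = 5
for i in range(10_000):
    cms.add(f"user-{i % 100}")    # 100 items, 100 occurrences each
cms.add("whale", 5_000)           # one heavy hitter
print(cms.estimate("whale"))      # >= 5000, very likely exactly 5000 here
```

With N = 15,000 total events, the error bound is epsilon·N = 15, which illustrates the point below about thinking in absolute rather than relative terms.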
Production Systems Using Count-Min Sketch & Frequency Estimation
- Cisco network monitoring
- Apache Flink streaming
- Redis Top-K (RedisBloom)
- AT&T fraud detection
- Google Sawzall
- Twitter trending topics
Common Mistakes with Count-Min Sketch & Frequency Estimation
- Using Count-Min Sketch for low-frequency items where the noise from hash collisions dominates the actual count. If an item appears 3 times and the overcounting noise is 50, your estimate is useless. CMS works best for heavy hitters
- Not tuning width and depth for the actual workload. Default parameters from a library might target epsilon=0.01 and delta=0.01 (width 272, depth 5), but your stream might have 10 billion events, making the absolute error bound 100 million. Think in absolute terms, not just relative
- Treating frequency estimates as exact for billing or compliance. CMS always overcounts. An estimate of 1,247 means the true count is at most 1,247 and at least something lower. If you are charging per event, that overcount becomes a financial dispute
- Forgetting that conservative update breaks mergeability. If you need to merge sketches from different stream partitions, you must use standard update, not conservative update. Choose one or the other based on your architecture
Related to Count-Min Sketch & Frequency Estimation
HyperLogLog & Cardinality Estimation, Bloom Filters & Probabilistic Membership, Streaming Quantiles: T-Digest & DDSketch
CRDTs: Conflict-Free Replicated Data Types — Consensus & Coordination
Difficulty: Expert. Complexity: O(1) per op, varies by type
Key Points for CRDTs: Conflict-Free Replicated Data Types
- CRDTs let multiple replicas accept writes concurrently without any coordination and merge them deterministically. The merge function must be commutative, associative, and idempotent, which guarantees convergence regardless of message ordering or duplication
- State-based CRDTs ship the full state and merge with a join operation. Operation-based CRDTs ship individual operations that must be commutative. State-based is simpler to implement, op-based is more bandwidth-efficient at scale
- The OR-Set (observed-remove set) is the most practically useful CRDT for application developers. It handles concurrent add and remove by tagging each add with a unique identifier, so concurrent add-wins semantics are unambiguous
- Redis Enterprise, Riak, Automerge, and Figma all use CRDTs in production for active-active geo-replication or real-time collaborative editing
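A toy state-based OR-Set showing the add-wins behavior described above, merged by unioning tag sets (the uuid tag scheme is illustrative):

```python
import copy
import uuid

class ORSet:
    """Observed-remove set: every add gets a unique tag, and a remove
    tombstones only the tags it has observed, so a concurrent add wins."""
    def __init__(self):
        self.adds = {}      # element -> set of unique add tags
        self.removes = {}   # element -> set of tombstoned tags

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        """Join: union both tag maps. Commutative, associative, idempotent."""
        for el, tags in other.adds.items():
            self.adds.setdefault(el, set()).update(tags)
        for el, tags in other.removes.items():
            self.removes.setdefault(el, set()).update(tags)

a = ORSet()
a.add("milk")
b = copy.deepcopy(a)       # replicate the synced state
a.remove("milk")           # replica A removes...
b.add("milk")              # ...while replica B concurrently re-adds
a.merge(b)
print(a.contains("milk"))  # True: B's new tag was never seen by A's remove
```

Note the tombstone sets only ever grow, which is exactly the metadata-accumulation problem called out in the Common Mistakes below.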
Production Systems Using CRDTs: Conflict-Free Replicated Data Types
- Redis Enterprise CRDT (active-active geo-distribution)
- Riak (OR-Set, counters)
- Automerge (collaborative editing)
- Figma (multiplayer design)
- SoundCloud Roshi (LWW-Set)
- Apple Notes (collaborative editing)
Common Mistakes with CRDTs: Conflict-Free Replicated Data Types
- Choosing LWW (last-writer-wins) when concurrent updates carry semantic meaning. If two users update a shopping cart simultaneously, LWW silently drops one update. An OR-Set or counter would preserve both
- Ignoring the metadata overhead of OR-Sets. Each element carries a set of unique tags (one per concurrent add), and removed elements leave tombstones. Without garbage collection, metadata grows without bound
- Assuming CRDTs solve all conflict problems. CRDTs solve a specific mathematical class of conflicts where a commutative merge exists. Many business logic conflicts (like double-booking a hotel room) have no meaningful automatic resolution
- Not implementing garbage collection for tombstones. Deleted elements in OR-Sets leave metadata behind. Over weeks or months of operation, tombstone accumulation can consume more memory than the live data
Related to CRDTs: Conflict-Free Replicated Data Types
Operational Transformation, Logical Clocks & Causality Tracking, Gossip Protocols & SWIM, Quorum Systems & NRW Notation
CRUSH: Topology-Aware Data Placement — Data Distribution
Difficulty: Advanced. Complexity: O(log N) per placement, no central lookup
Key Points for CRUSH: Topology-Aware Data Placement
- CRUSH replaces the question 'where IS my data?' with 'where SHOULD my data be?' Any node can compute the answer locally using the same algorithm and cluster map, with no central placement database and no network call
- The algorithm walks a hierarchy (region → AZ → rack → node) selecting one node at each level using a pseudo-random function seeded by the placement group ID. This makes it deterministic: same PG ID + same cluster map = same node selection every time
- Failure domain awareness is built in. CRUSH rules enforce constraints like 'no two shards on the same rack' or 'max 5 shards per AZ'. This is what prevents a single rack power failure from losing too many shards of the same object
- When a node is added or removed, CRUSH recomputes placement and only the placement groups that were on the changed node need to move. With 100K PGs and 20K nodes, adding one node moves roughly 5 PGs worth of data, not a full reshuffle
- Weights control how much data each node gets. A node with 36TB drives gets 3x the data of a node with 12TB drives. This handles heterogeneous hardware without manual balancing
- CRUSH is not a storage system, not a replication protocol, not a metadata store. It is purely a placement function: input = PG ID + cluster topology, output = list of N nodes
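A toy placement function in the spirit of CRUSH: a deterministic hierarchy walk with rack-level failure-domain exclusion. The weighted-draw formula here is a simplification for illustration, not Ceph's actual straw2 bucket math:

```python
import hashlib

def bucket_choose(pg_id, replica, children, weights):
    """Deterministically pick one child: each child gets a hash-derived draw,
    biased by its weight, and the highest draw wins (simplified straw-style)."""
    best, best_draw = None, -1.0
    for child in children:
        h = hashlib.sha256(f"{pg_id}:{replica}:{child}".encode()).digest()
        u = int.from_bytes(h[:8], "big") / 2**64        # uniform in [0, 1)
        draw = u ** (1.0 / weights.get(child, 1.0))     # higher weight -> higher draw
        if draw > best_draw:
            best, best_draw = child, draw
    return best

topology = {
    "root":  ["rack1", "rack2", "rack3"],
    "rack1": ["node-a", "node-b"],
    "rack2": ["node-c", "node-d"],
    "rack3": ["node-e", "node-f"],
}

def place(pg_id, replicas, topology, weights=None):
    """Walk the hierarchy per replica, never reusing a rack (failure domain)."""
    weights = weights or {}
    chosen_racks, nodes = [], []
    for r in range(replicas):
        candidates = [x for x in topology["root"] if x not in chosen_racks]
        rack = bucket_choose(pg_id, r, candidates, weights)
        chosen_racks.append(rack)
        nodes.append(bucket_choose(pg_id, r, topology[rack], weights))
    return nodes

print(place(42, 2, topology))   # every node computes the same answer, no lookup
```

Same PG ID plus same topology always yields the same nodes, which is the "where SHOULD my data be" property from the first bullet.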
Production Systems Using CRUSH: Topology-Aware Data Placement
- Ceph (RADOS, the original CRUSH implementation)
- Object storage systems with placement groups
- Distributed filesystems that need rack-aware placement
- Any system that needs topology-aware shard distribution without a central lookup
Common Mistakes with CRUSH: Topology-Aware Data Placement
- Confusing CRUSH with consistent hashing. Consistent hashing maps keys to positions on a ring. CRUSH walks a topology tree with failure domain awareness. Consistent hashing has no concept of racks or AZs
- Not defining CRUSH rules for your failure domains. Default rules spread data across hosts but not across racks. If two replicas land on the same rack and the rack loses power, both copies are gone
- Ignoring weight rebalancing when adding nodes with different disk sizes. A new node with 20TB drives gets the same default weight as an old node with 8TB drives, leading to uneven utilization
- Making the topology tree too deep. Every level in the hierarchy adds a random selection step. Region → AZ → rack → node is fine. Adding sub-rack and sub-node levels increases computation and makes debugging placement harder
Related to CRUSH: Topology-Aware Data Placement
Consistent Hashing & Ring Algorithms, Erasure Coding & Reed-Solomon, Quorum Systems & NRW Notation
Cuckoo Hashing & Cuckoo Filters — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(1) lookup, O(1) amortized insert
Key Points for Cuckoo Hashing & Cuckoo Filters
- Cuckoo filters store fingerprints of elements in a cuckoo hash table, enabling both membership queries and deletion. This is the key advantage over Bloom filters, which cannot delete elements without switching to a counting variant that uses 3-4x more memory
- Each element maps to two candidate buckets via two hash functions. Lookup checks both buckets for a matching fingerprint. If found in either, the element is probably present. If not found in either, it is definitely not present
- Cuckoo filters become more space-efficient than Bloom filters as the target false positive rate drops; the original paper puts the crossover around 3% (with its semi-sorting optimization). A Bloom filter needs roughly 1.44 * log2(1/FPR) bits per element, while a 4-way-bucket cuckoo filter needs roughly (log2(1/FPR) + 3)/0.95, so the cuckoo filter's advantage widens at lower FPRs
- The maximum load factor is about 95% with 4-way buckets. Beyond that, insertions start failing because displacement chains cannot find empty slots. Bloom filters degrade gracefully when overfull; cuckoo filters hit a wall
- DPDK uses cuckoo hashing for high-performance packet lookup tables. RocksDB explored cuckoo table format as an alternative to block-based tables for in-memory workloads with pure point lookups
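A toy partial-key cuckoo filter with 8-bit fingerprints and 4-slot buckets. The bucket count must be a power of two so the XOR trick for the alternate index is an involution; hash choices and names are illustrative:

```python
import hashlib
import random

class CuckooFilter:
    """Sketch: fingerprints in a 2-choice cuckoo table with eviction."""
    def __init__(self, n_buckets=1024, slots=4, max_kicks=500):
        self.n = n_buckets                 # must be a power of two
        self.slots = slots
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fingerprint(self, item):
        fp = hashlib.sha256(item.encode()).digest()[0]
        return fp or 1                     # reserve 0 (never stored)

    def _index(self, item):
        h = hashlib.sha256(b"idx" + item.encode()).digest()
        return int.from_bytes(h[:4], "big") % self.n

    def _alt_index(self, i, fp):
        # Partial-key cuckoo: the other bucket is derivable from (index, fp)
        h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big") % self.n
        return i ^ h                       # involution when n is a power of two

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))        # both full: start evicting
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)     # relocate the evicted fingerprint
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        return False                       # table is effectively full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("alice")
print(cf.contains("alice"))   # True
cf.delete("alice")
print(cf.contains("alice"))   # False: deletion works, unlike a plain Bloom filter
```

The `insert` returning False is the capacity wall described below: unlike a Bloom filter, a full cuckoo filter refuses new elements rather than degrading.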
Production Systems Using Cuckoo Hashing & Cuckoo Filters
- DPDK packet classification
- RocksDB cuckoo table format
- Intel storage libraries
- Network flow tracking systems
- Edge caching systems
- DNS filtering appliances
Common Mistakes with Cuckoo Hashing & Cuckoo Filters
- Not planning for the capacity limit. Once a cuckoo filter is about 95% full, insertions fail. You cannot just keep adding elements like you can with a Bloom filter (where the false positive rate rises but inserts never fail). Size your filter for the maximum expected element count, not the initial count
- Using cuckoo filters for append-only workloads where deletion is not needed. If you never delete, Bloom filters are simpler, well-understood, and have extensive production track records. The whole point of cuckoo filters is deletion support
- Storing too-short fingerprints. Shorter fingerprints save space but increase false positive rates. With f-bit fingerprints and b entries per bucket, a lookup compares against up to 2b stored fingerprints, so the FPR is approximately 2b/2^f. Use at least 8-bit fingerprints for reasonable FPRs
- Inserting the same element multiple times. Each insertion stores a new fingerprint. If you insert element X three times, it occupies three slots. The third insertion might even evict unrelated elements. Always check for membership before inserting if duplicates are possible
Related to Cuckoo Hashing & Cuckoo Filters
Bloom Filters & Probabilistic Membership, Count-Min Sketch & Frequency Estimation, HyperLogLog & Cardinality Estimation
Distributed Snapshots & Chandy-Lamport — Consensus & Coordination
Difficulty: Advanced. Complexity: O(E) messages, O(N) state
Key Points for Distributed Snapshots & Chandy-Lamport
- A consistent distributed snapshot captures the state of every node and every in-flight message without stopping the system. The snapshot represents a state the system could have been in, even if no single moment in real time matched it exactly
- Chandy-Lamport (1985) uses marker messages as logical scissors that cut message flows into before-snapshot and after-snapshot. Any node can initiate. The algorithm requires only FIFO channels, no global clock
- Apache Flink adapted Chandy-Lamport for stream processing using checkpoint barriers injected into data streams. Aligned barriers give exactly-once guarantees but can cause backpressure. Unaligned barriers (Flink 1.11+) reduce latency at the cost of larger checkpoint sizes
- Incremental snapshots capture only the state delta since the last checkpoint. For Flink operators backed by RocksDB, this can shrink checkpoint size from gigabytes to megabytes, making frequent checkpoints practical
- The consistent cut property guarantees that for every message received in the snapshot, the corresponding send is also captured. This is what makes the snapshot usable for recovery without duplicating or losing events
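The marker rules can be simulated in a few dozen lines. A toy two-process run, assuming FIFO channels as the algorithm requires, where a money transfer is in flight when the snapshot starts; the consistent cut captures it in the channel state, so no money is lost or duplicated:

```python
from collections import deque

class Net:
    """FIFO channels keyed by (src, dst), as Chandy-Lamport requires."""
    def __init__(self, channels):
        self.ch = {c: deque() for c in channels}
    def outgoing(self, name):
        return [c for c in self.ch if c[0] == name]
    def incoming(self, name):
        return [c for c in self.ch if c[1] == name]
    def send(self, c, msg):
        self.ch[c].append(msg)
    def deliver_all(self, procs):
        progress = True
        while progress:
            progress = False
            for (src, dst), q in self.ch.items():
                if q:
                    procs[dst].receive(self, (src, dst), q.popleft())
                    progress = True

class Process:
    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.recorded_state = None    # own state at snapshot time
        self.recording = {}           # channel -> messages seen since snapshot
        self.channel_state = {}       # channel -> final recorded channel state

    def start_snapshot(self, net):
        self._record_and_broadcast(net)

    def _record_and_broadcast(self, net):
        self.recorded_state = self.balance
        for ch in net.incoming(self.name):
            self.recording[ch] = []               # start recording all inbound
        for ch in net.outgoing(self.name):
            net.send(ch, "MARKER")                # markers cut the message flow

    def receive(self, net, ch, msg):
        if msg == "MARKER":
            if self.recorded_state is None:
                self._record_and_broadcast(net)   # first marker: record state
            # Marker closes ch: its state is what arrived since recording began
            self.channel_state[ch] = self.recording.pop(ch, [])
        else:
            self.balance += msg
            if ch in self.recording:
                self.recording[ch].append(msg)    # in-flight at snapshot time

p, q = Process("P", 100), Process("Q", 50)
net = Net([("P", "Q"), ("Q", "P")])
q.balance -= 20
net.send(("Q", "P"), 20)        # Q's transfer to P is in flight
p.start_snapshot(net)
net.deliver_all({"P": p, "Q": q})
print(p.recorded_state, q.recorded_state, p.channel_state[("Q", "P")])
# 100 30 [20] -- the in-flight 20 lands in the channel state, total conserved
```

The recorded states plus channel states sum to 150 even though no real-time instant had that exact configuration, which is the consistent-cut property.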
Production Systems Using Distributed Snapshots & Chandy-Lamport
- Apache Flink checkpoints (barrier-based)
- Spark Structured Streaming
- Kafka Streams state store snapshots
- CockroachDB MVCC snapshots
- Debezium initial snapshot + CDC
- Apache Samza checkpointing
Common Mistakes with Distributed Snapshots & Chandy-Lamport
- Using aligned checkpoints on high-throughput Flink pipelines without benchmarking backpressure. Aligned barriers force operators to buffer data from fast channels while waiting for slow ones, which can cascade into severe backpressure spikes
- Not enabling incremental checkpoints for large operator state. Full state snapshots of a 50GB RocksDB backend can take minutes and saturate network bandwidth to the checkpoint store
- Setting the checkpoint interval too aggressively. Each checkpoint has coordination overhead (barrier propagation, state serialization, storage writes). An interval of 100ms on a pipeline doing 10ms checkpoints means 10% of capacity goes to checkpointing
- Ignoring checkpoint storage costs. Retained checkpoints on S3 or HDFS accumulate fast, especially with incremental checkpoints that keep SST file references alive. Without a retention policy, storage bills grow quietly
Related to Distributed Snapshots & Chandy-Lamport
Write-Ahead Log, Raft Consensus Protocol, Logical Clocks & Causality Tracking
Erasure Coding & Reed-Solomon — Data Distribution
Difficulty: Advanced. Complexity: O(n·k) encode, O(k²) decode (Galois field ops)
Key Points for Erasure Coding & Reed-Solomon
- Erasure coding splits data into k data shards and computes m parity shards such that any k of the k+m total shards can reconstruct the original data. This tolerates m simultaneous failures, the same failure tolerance as (m+1)-way replication, at a storage overhead of only (k+m)/k instead of m+1
- Reed-Solomon codes operate over Galois fields (finite field arithmetic) to generate parity shards. RS(6,3) stores 9 shards total and tolerates any 3 failures, using 1.5x storage instead of the 3x that triple replication would require for equivalent durability
- The hidden cost is reconstruction. When a shard is lost, the system must read k surviving shards and perform matrix inversion to recover it. This amplifies read latency and network traffic during degraded operation, which is why hot data often still uses replication
- Every major storage system uses erasure coding for cold and warm data: HDFS switched from 3x replication to RS(6,3) and cut storage costs nearly in half, Ceph supports erasure-coded pools, and S3 uses a proprietary scheme across availability zones
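The shard-and-parity idea can be sketched with the m=1 special case, where the single parity shard is just the XOR of the data shards (RAID-5 style). Real Reed-Solomon generalizes this to m parity shards using Galois-field matrix arithmetic; the function names here are illustrative.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)                     # ceil division
    padded = data.ljust(k * shard_len, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    shards.append(reduce(xor_bytes, shards))           # parity = XOR of all data shards
    return shards

def reconstruct(shards: list) -> list:
    """Recover at most one missing shard (None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("a single parity shard tolerates only one lost shard")
    if missing:
        survivors = [s for s in shards if s is not None]
        shards[missing[0]] = reduce(xor_bytes, survivors)
    return shards
```

Losing any one of the k+1 shards is recoverable because XOR-ing the survivors cancels everything except the missing shard; tolerating m > 1 losses is exactly what the Galois-field construction adds.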
Production Systems Using Erasure Coding & Reed-Solomon
- HDFS (RS(6,3) and RS(10,4) for cold data)
- Ceph (erasure-coded pools with configurable k,m)
- AWS S3 (proprietary erasure coding across AZs)
- Azure Storage (Local Reconstruction Codes)
- MinIO (Reed-Solomon per object)
- Backblaze Vaults (RS(17,3) across 20 drives)
- Google Colossus (successor to GFS, erasure coded)
Common Mistakes with Erasure Coding & Reed-Solomon
- Using erasure coding for small objects. An RS(6,3) scheme turns a 1KB object into 9 separate shards of ~170 bytes each, and every read requires fetching 6 of them from 6 different nodes. The per-request overhead dominates. Most systems set a minimum object size (typically 1-4MB) below which they fall back to replication
- Ignoring tail latency during degraded reads. Normal reads touch k nodes and take the slowest of those k responses. During reconstruction, you still read k nodes but the data also passes through a decode step. If even one of those k nodes is slow, the entire read stalls. Systems like Azure add extra read requests to hedge against stragglers
- Placing all shards in the same failure domain. If your RS(6,3) shards all land on the same rack and that rack loses power, you lose all 9 shards at once. Erasure coding only works when shards are spread across independent failure domains: different racks, different power circuits, or different availability zones
- Not accounting for reconstruction bandwidth during recovery. When a node dies holding thousands of erasure-coded shards, recovering each one requires reading k shards from k different nodes. At scale this can saturate network links and trigger cascading slowdowns. Smart systems throttle recovery bandwidth and prioritize shards that are closest to losing durability
Related to Erasure Coding & Reed-Solomon
Consistent Hashing & Ring Algorithms, Quorum Systems & NRW Notation, Failure Detection Algorithms
Event Sourcing — Data Patterns
Difficulty: Advanced. Complexity: O(1) append, O(n) rebuild from log
Key Points for Event Sourcing
- Store every state change as an immutable event in an append-only log. Current state is derived by replaying events, not stored directly. You get a complete, auditable history of how the system reached any given state
- Snapshots bound rebuild time. Without them, reconstructing state means replaying every event since the beginning. Periodic snapshots let you start from a recent checkpoint and replay only the events after it. The tradeoff: snapshot storage cost vs rebuild speed
- CQRS (Command Query Responsibility Segregation) falls out of this design almost automatically. Writes append events to the log. Reads query pre-built projections optimized for specific access patterns. Separating the two lets each side scale independently
- Event replay is the killer debugging tool. When something goes wrong, replay the event log up to the point of failure and inspect the exact state. No guessing, no log correlation, no 'works on my machine.' The events ARE the truth
- Schema evolution is the hardest operational problem. Once events are persisted, their schema is frozen. Adding fields is easy (default values). Removing or renaming fields requires versioned event handlers that can process both old and new formats indefinitely
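The append/replay/snapshot loop above can be sketched in a few lines. The bank-account domain, event names, and snapshot interval are illustrative, not from the source:

```python
class EventStore:
    def __init__(self, snapshot_every: int = 100):
        self.log: list[dict] = []                   # immutable, append-only
        self.snapshot = (0, 0)                      # (events applied, balance)
        self.snapshot_every = snapshot_every

    def append(self, event: dict) -> None:
        self.log.append(event)
        if len(self.log) % self.snapshot_every == 0:
            # Periodic snapshot bounds future rebuild time.
            self.snapshot = (len(self.log), self.current_balance())

    @staticmethod
    def apply(balance: int, event: dict) -> int:
        # Events record raw facts; this reducer derives state from them.
        if event["type"] == "Deposited":
            return balance + event["amount"]
        if event["type"] == "Withdrawn":
            return balance - event["amount"]
        return balance

    def current_balance(self) -> int:
        # Replay only the events after the last snapshot.
        applied, balance = self.snapshot
        for event in self.log[applied:]:
            balance = self.apply(balance, event)
        return balance
```

Replaying `self.log[:k]` instead of the full log gives the state as of any past point, which is the temporal-query and debugging capability described above.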
Production Systems Using Event Sourcing
- Kafka (commit log as distributed event store)
- EventStoreDB (purpose-built event sourcing database)
- Axon Framework (Java CQRS + event sourcing)
- Yjs collaborative editors (operation log + document snapshots)
- Banking and financial systems (immutable transaction ledgers)
- Git (commit history as event log, working tree as projection)
- Redux (action log with reducer-based state derivation)
Common Mistakes with Event Sourcing
- Not implementing snapshots from the start. A collaborative document with 50,000 edits takes seconds to rebuild from raw events. Without snapshots, every document open becomes a full replay. Add snapshot logic early, even if the initial event count is small
- Using event sourcing for everything. CRUD-heavy domains with simple read patterns gain nothing from event sourcing and pay the full complexity cost. Reserve it for domains where audit history, temporal queries, or replay capability are actual requirements
- Storing derived data in events instead of raw facts. Events should capture what happened (UserClickedButton, DocumentEdited), not what the system decided to do about it. Derived state belongs in projections, not in the event log
- Ignoring event log growth. An active collaborative document generates thousands of events per hour. Without compaction, archival, or retention policies, the event log grows unboundedly. PostgreSQL performance degrades as tables exceed hundreds of millions of rows without partitioning
Related to Event Sourcing
Write-Ahead Log, CRDTs, Saga Pattern
Failure Detection Algorithms — Failure Detection & Quorums
Difficulty: Advanced. Complexity: O(1) per check
Key Points for Failure Detection Algorithms
- You cannot distinguish a crashed node from a slow one. Every failure detector is a tradeoff between detection speed and false positive rate. Faster detection means more healthy nodes incorrectly declared dead
- The phi accrual failure detector outputs a continuous suspicion level instead of a binary alive/dead. It adapts to network conditions automatically by tracking heartbeat interval distribution. Cassandra uses it with phi threshold of 8
- SWIM failure detection achieves O(N) total messages per round instead of O(N^2) all-to-all heartbeats. Each node probes one random peer, then uses indirect probing through k other nodes if the direct probe fails
- Fixed-timeout heartbeat detectors break in cloud environments where latency varies. A 200ms timeout that works in us-east-1 causes constant false positives on cross-region links with 150ms baseline latency
- JVM garbage collection pauses of 5-30 seconds will trigger any aggressive failure detector. Systems running on the JVM need failure detection timeouts that accommodate worst-case GC pauses
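A minimal phi accrual sketch, assuming heartbeat intervals are roughly normally distributed. The Hayashibara paper uses a windowed estimate like this; the window size and fallback stdev below are arbitrary choices for illustration:

```python
import math
import statistics

class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals: list[float] = []
        self.window = window
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]   # sliding window
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """phi = -log10(P(interval > elapsed)); higher means more suspicious."""
        mean = statistics.mean(self.intervals)
        std = statistics.stdev(self.intervals) or 0.1        # floor for constant intervals
        elapsed = now - self.last_heartbeat
        # P(X > elapsed) under a normal distribution, via the complementary error function
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-30))
```

Because phi grows continuously as silence lengthens, a system can take graduated actions (stop routing reads at phi=5, evict at phi=8) instead of a single binary dead/alive decision.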
Production Systems Using Failure Detection Algorithms
- Cassandra (phi accrual, phi threshold = 8)
- Akka Cluster (phi accrual failure detector)
- Consul (SWIM via HashiCorp memberlist)
- Kubernetes (kubelet heartbeat with node lease)
- MongoDB (heartbeat with election timeout)
- Hazelcast (phi accrual)
Common Mistakes with Failure Detection Algorithms
- Using fixed heartbeat timeouts in cloud environments where network latency varies significantly between zones and regions
- Not accounting for GC pauses in JVM-based systems. A 10-second stop-the-world GC in Cassandra or Elasticsearch will look exactly like a dead node to the failure detector
- Having every node heartbeat every other node. With N nodes sending heartbeats to N-1 others, you get O(N^2) network messages. This does not scale past a few dozen nodes
- Setting the phi accrual threshold too low. The expected false positive rate is roughly 10^-phi, so phi=4 means about one mistaken suspicion per 10,000 assessments while phi=8 means about one in 100 million. The difference between regular false alarms and essentially none
Related to Failure Detection Algorithms
Raft Consensus Protocol, Quorum Systems & NRW Notation
Gorilla Compression (Delta-of-Delta + XOR) — Storage Internals
Difficulty: Intermediate. Complexity: O(1) per sample encode/decode, ~1.37 bytes/sample
Key Points for Gorilla Compression (Delta-of-Delta + XOR)
- Gorilla exploits two properties unique to time-series data: timestamps arrive at regular intervals (delta-of-delta is usually zero = 1 bit), and consecutive float values are often identical or change slowly (XOR produces mostly zeros = 1 bit). Generic compressors like LZ4 or ZSTD cannot see this structure
- Delta-of-delta encoding stores each timestamp as the difference between consecutive deltas. With a regular 15-second scrape interval, every delta is 15s, every delta-of-delta is 0, and every timestamp after the second costs exactly 1 bit
- XOR encoding compresses float64 values by XOR-ing consecutive samples. According to Facebook's Gorilla paper, 51% of all production metric values XOR to exactly zero with their predecessor (1 bit). 30% compress to ~26.6 bits. 19% need ~36.9 bits. Average: ~9.5 bits per value
- The combined result is ~11 bits per sample (1.37 bytes) compared to 128 bits uncompressed (16 bytes for a timestamp-value pair). That is roughly a 12x compression ratio. At 500M samples/sec, this is the difference between about 690 TB/day and 59 TB/day of storage
- Every modern time-series database uses Gorilla or a variant: Prometheus TSDB, VictoriaMetrics, M3DB (Uber), InfluxDB IOx, and TimescaleDB. The algorithm is the foundation of practical TSDB storage
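The two observations, zero delta-of-deltas for regular timestamps and zero XOR for repeated float values, can be demonstrated directly. This sketch shows only the classification step, not Gorilla's variable-length bit packing:

```python
import struct

def delta_of_deltas(timestamps: list[int]) -> list[int]:
    """Second differences of a timestamp sequence; all zero for a regular interval."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_floats(a: float, b: float) -> int:
    """Bitwise XOR of two float64s via their IEEE-754 bit patterns."""
    (ia,) = struct.unpack(">Q", struct.pack(">d", a))
    (ib,) = struct.unpack(">Q", struct.pack(">d", b))
    return ia ^ ib
```

A 15-second scrape produces `delta_of_deltas == [0, 0, 0, ...]`, each encodable as a single bit; an unchanged gauge value XORs to 0, also a single bit, which is where the 51% figure above comes from.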
Production Systems Using Gorilla Compression (Delta-of-Delta + XOR)
- Facebook Gorilla (original in-memory TSDB, 2015)
- Prometheus TSDB (2-hour block chunks)
- VictoriaMetrics (vmselect/vmstorage hot tier)
- M3DB / M3TSZ (Uber, Gorilla variant)
- InfluxDB IOx (Apache Arrow + Gorilla for floats)
- TimescaleDB (gorilla compression option)
- QuestDB (column-level Gorilla for doubles)
Common Mistakes with Gorilla Compression (Delta-of-Delta + XOR)
- Assuming Gorilla compression works equally well on irregular data. Delta-of-delta encoding relies on consistent intervals. If timestamps arrive at random intervals, delta-of-deltas are rarely zero and timestamps cost 9-36 bits each instead of 1 bit
- Confusing the theoretical 1.37 bytes/sample with production reality. Chunk headers, label storage overhead, index structures, and compression boundary effects push the effective rate to 1.5-2 bytes per sample in real systems
- Applying Gorilla to non-numeric time-series data. The algorithm compresses float64 timestamp-value pairs only. Labels, string fields, and metadata require separate compression (inverted index for labels, dictionary encoding for strings)
- Using excessively short chunk durations. Each chunk has fixed overhead (header, block boundary markers). Very short chunks (under 30 seconds of data) waste a higher percentage of space on overhead, reducing the effective compression ratio
- Not layering a general-purpose compressor on top for cold storage. Gorilla is optimized for fast encode/decode in the hot path. For cold data read infrequently, applying ZSTD on top of Gorilla-compressed blocks can squeeze out another 2-3x
Related to Gorilla Compression (Delta-of-Delta + XOR)
Distributed Snapshots & Chandy-Lamport, Write-Ahead Log, LSM Tree, Bloom Filters
Gossip Protocols & SWIM — Consensus & Coordination
Difficulty: Advanced. Complexity: O(log N) convergence, O(N) state
Key Points for Gossip Protocols & SWIM
- Information spreads through a gossip cluster the way a disease spreads through a population. Each node periodically picks a random peer and exchanges state. After O(log N) rounds, the entire cluster has the information with high probability
- SWIM separates failure detection from dissemination. It detects failures through direct and indirect probing, then piggybacks membership updates onto probe messages instead of broadcasting them separately
- Gossip provides eventual consistency, not strong consistency. Two nodes might temporarily disagree about cluster state. This is fine for membership and health but wrong for operations that need linearizability
- Cassandra, Consul, Redis Cluster, and ScyllaDB all use gossip for cluster membership and state dissemination in production
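The O(log N) convergence claim can be checked with a small push-gossip simulation; the cluster size and seed below are arbitrary:

```python
import random

def rounds_to_converge(n: int, seed: int = 42) -> int:
    """Rounds until a rumor starting at node 0 reaches all n nodes."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        for _node in list(informed):
            informed.add(rng.randrange(n))   # each informed node pushes to one random peer
    return rounds
```

The informed set roughly doubles each round until it covers half the cluster, then the stragglers take a coupon-collector tail, giving total rounds on the order of log2(N) + ln(N).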
Production Systems Using Gossip Protocols & SWIM
- Cassandra cluster membership
- HashiCorp Serf
- HashiCorp Memberlist (used by Consul, Nomad)
- Redis Cluster gossip bus
- Riak cluster coordination
- ScyllaDB
Common Mistakes with Gossip Protocols & SWIM
- Setting gossip interval too aggressively in large clusters. With 1000 nodes gossiping every 200ms, each node handles 5 incoming gossip messages per second, which creates non-trivial CPU and network overhead
- Not tuning suspicion timeouts for cloud environments where instances restart with the same IP address. A node that crashed and restarted quickly might still be marked as suspected, causing unnecessary failover
- Forgetting that gossip provides eventual consistency, not strong consistency. Using gossip state for decisions that require agreement across the cluster (like leader election) will produce split-brain scenarios
- Not limiting the size of piggybacked payloads. If you attach large metadata to gossip messages, the protocol degrades from lightweight chatter to bulk data transfer
Related to Gossip Protocols & SWIM
Raft Consensus Protocol, Failure Detection Algorithms, CRDTs: Conflict-Free Replicated Data Types
Hidden Markov Map Matching — Geospatial Algorithms
Difficulty: Advanced. Complexity: O(nm) per trajectory. n = GPS points, m = candidate segments per point
Key Points for Hidden Markov Map Matching
- Map matching converts raw GPS coordinates (lat, lng) into road segment IDs. GPS says 40.7512, -73.9876. The road graph has segment_42 and segment_43. These are different systems. Map matching bridges them. Every downstream operation (traffic aggregation, routing, ETA) depends on this conversion being correct.
- The Hidden Markov Model treats each GPS point as an observation and each candidate road segment as a hidden state. The emission probability measures how close the GPS point is to the segment (Gaussian, closer = higher). The transition probability measures how realistic the movement is between two segments (routing distance vs GPS distance, similar = higher).
- The Viterbi algorithm finds the most likely sequence of road segments given the full trajectory. It uses dynamic programming: for each GPS point, compute the best probability of reaching each candidate segment considering all previous points. Then backtrack to recover the path. This is what makes HMM better than simple nearest-road snapping.
- Simple nearest-road matching fails at overpasses, parallel streets, and intersections. A GPS point 5m from a highway overpass and 8m from the road below would snap to the highway. But if the vehicle was on the surface street for the last 10 points, HMM correctly keeps it on the surface street because the transition probability from surface→highway is low.
- At 50M GPS points per second, HMM at 0.5ms per point would need 25,000 CPU cores. The optimization: use simple geometric snapping (nearest road within 10m) for unambiguous straight segments. Reserve HMM for complex areas (overpasses, parallel roads, dense intersections). This reduces HMM calls by ~80%, bringing compute down to ~5,000 cores.
- In production, map matching runs inside Apache Flink on GPS streams partitioned by S2 cell. All GPS points from the same geographic area land on the same partition, so the map matcher has the full local road network in memory. Output is (segment_id, speed, timestamp) fed to downstream traffic aggregation.
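A generic Viterbi sketch over candidate segments. The emission and transition scores in the usage below are toy log-probabilities standing in for the Gaussian-distance and routing-distance terms described above:

```python
def viterbi(observations, states, emit_logp, trans_logp):
    """Most likely hidden-state path for a sequence of observations."""
    # best[s] = (log-prob of the best path ending in s, that path)
    best = {s: (emit_logp(observations[0], s), [s]) for s in states}
    for obs in observations[1:]:
        nxt = {}
        for s in states:
            # Pick the predecessor that maximizes path prob + transition prob.
            prev, (lp, path) = max(
                ((p, best[p]) for p in states),
                key=lambda kv: kv[1][0] + trans_logp(kv[0], s),
            )
            nxt[s] = (lp + trans_logp(prev, s) + emit_logp(obs, s), path + [s])
        best = nxt
    # Backtracked path with the highest total log-probability.
    return max(best.values(), key=lambda v: v[0])[1]
```

With a toy overpass scenario (one GPS point slightly closer to the highway, but switching roads penalized), the full-trajectory decoding keeps the vehicle on the surface street, exactly the behavior nearest-road snapping gets wrong.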
Production Systems Using Hidden Markov Map Matching
- Google Maps
- Apple Maps
- Uber
- Lyft
- OSRM
- Valhalla
Common Mistakes with Hidden Markov Map Matching
- Using nearest-road snapping instead of HMM. Nearest-road works 90% of the time but fails catastrophically at overpasses, parallel roads, and complex intersections. These are exactly the high-traffic areas where accuracy matters most. A vehicle incorrectly matched to a highway overpass instead of the congested road below produces wrong traffic data for both roads.
- Not handling GPS dropouts in tunnels and parking garages. When GPS signal is lost, the vehicle continues moving but produces no points. Naive implementations leave a gap. The fix: dead reckoning from last known position + speed + heading until GPS resumes, then re-anchor with HMM on the first reliable point.
- Setting the GPS accuracy sigma too tight. If sigma is 3m but actual GPS accuracy is 8m in urban canyons, the emission probabilities become too peaked and the model rejects correct matches. Use the device-reported accuracy or a conservative default (5-10m).
- Running HMM on every single GPS point at scale. At 50M points/sec, this is 25,000 CPU cores. The geometric fast path (snap to nearest for unambiguous segments, HMM only for complex areas) reduces this by 80%. Without this optimization, the map matching pipeline becomes the most expensive component in the entire traffic system.
- Ignoring one-way streets and turn restrictions in transition probabilities. If the routing distance from segment A to segment B via legal turns is 500m, but the GPS distance is only 50m, the transition probability should be very low. Without turn restrictions, the model may match vehicles to segments they could not legally reach.
Related to Hidden Markov Map Matching
Contraction Hierarchies, Google S2 Geometry
Hybrid Logical Clocks (HLC) — Clocks & Ordering
Difficulty: Advanced. Complexity: O(1) per event, constant-size timestamp
Key Points for Hybrid Logical Clocks (HLC)
- HLC combines a physical timestamp with a logical counter to get the best of both worlds. Events get timestamps that are close to real wall-clock time (useful for humans and TTLs) while still maintaining the causal ordering guarantees of logical clocks
- The physical component tracks the maximum physical time seen across all messages. The logical component breaks ties when multiple events have the same physical time. Together, they form a timestamp that is always within clock skew of the actual physical time
- CockroachDB uses HLC timestamps for MVCC versioning and serializable isolation. Every transaction gets an HLC timestamp, and CockroachDB uses the bounded clock skew to define uncertainty intervals for reads
- Unlike vector clocks, HLC timestamps have a fixed size (one physical counter + one logical counter) regardless of the number of nodes. This makes them practical for systems with thousands of nodes where vector clock metadata would be prohibitively large
- The key assumption is bounded clock skew. NTP typically keeps clocks within a few milliseconds of each other. HLC uses this bound to reason about whether two events at nearby timestamps might be causally related or are definitely concurrent
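The HLC update rules can be sketched as follows. This follows the send/receive algorithm from the Kulkarni et al. paper, with the wall clock injected as a function so the tie-breaking behavior is testable; the class shape is a simplified sketch:

```python
class HLC:
    def __init__(self, now_fn):
        self.now_fn = now_fn      # wall-clock source, e.g. time.time_ns
        self.l = 0                # max physical time seen so far
        self.c = 0                # logical counter, breaks ties at equal l

    def tick(self):
        """Local or send event: advance to wall clock or bump the counter."""
        pt = self.now_fn()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, ml: int, mc: int):
        """Merge a remote timestamp (ml, mc) into the local clock."""
        pt = self.now_fn()
        new_l = max(self.l, ml, pt)
        if new_l == self.l == ml:
            self.c = max(self.c, mc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == ml:
            self.c = mc + 1
        else:
            self.c = 0            # wall clock moved past everything seen
        self.l = new_l
        return (self.l, self.c)
```

Note that `l` never runs ahead of the maximum physical time observed anywhere, so timestamps stay within clock skew of real time while `(l, c)` pairs remain monotonic across sends and receives.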
Production Systems Using Hybrid Logical Clocks (HLC)
- CockroachDB (MVCC timestamps)
- YugabyteDB (safe timestamps)
- MongoDB (WiredTiger timestamps)
- TiDB (TSO with HLC concepts)
- Couchbase (hybrid clocks)
Common Mistakes with Hybrid Logical Clocks (HLC)
- Treating HLC timestamps as exact wall-clock times. The physical component is always >= real time but can be ahead of it (because it tracks the maximum physical time seen in messages). Using HLC timestamps for user-facing time display works most of the time but can show slightly future times
- Ignoring the clock skew bound. HLC correctness depends on NTP keeping clocks reasonably synchronized. If NTP fails or clock skew exceeds the configured bound, the uncertainty intervals in systems like CockroachDB grow, increasing transaction retry rates
- Assuming HLC gives you a total order like Google TrueTime. HLC tells you 'A happened before B' or 'A and B might be concurrent.' TrueTime tells you 'A definitely happened before B' with confidence intervals. TrueTime provides a total order; HLC provides a partial order with physical time hints
- Not persisting the HLC state across restarts. If a node restarts with its HLC reset to current physical time, it might issue timestamps lower than timestamps it issued before the restart. This can break causal ordering. Persist the logical counter and last known physical time
Related to Hybrid Logical Clocks (HLC)
Logical Clocks & Causality Tracking, Vector Clocks & Version Vectors, Two-Phase Commit (2PC)
HyperLogLog & Cardinality Estimation — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(1) per add, O(m) space
Key Points for HyperLogLog & Cardinality Estimation
- Estimates the count of distinct elements in a stream using roughly 12KB of memory regardless of cardinality. That is 12KB for 1 million uniques or 1 billion uniques
- Standard error of 1.04/sqrt(m) where m is the number of registers. With 16384 registers (the Redis default), you get about 0.81% error
- The core insight: the longest run of leading zeros in a hash tells you something about how many distinct values you have seen. More distinct values means longer runs become more likely
- HyperLogLog structures are fully mergeable. You can compute cardinality of a union by merging HLL sketches from different servers without touching the original data
- Redis PFCOUNT, ClickHouse uniqHLL12, BigQuery APPROX_COUNT_DISTINCT, and Presto approx_distinct all use HyperLogLog under the hood
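A minimal HyperLogLog sketch, assuming a 64-bit hash split into a register index and a leading-zero rank. Only the small-range correction is included, and the register count is illustrative:

```python
import hashlib
import math

class HLL:
    def __init__(self, p: int = 10):
        self.p, self.m = p, 1 << p               # m = 2^p registers
        self.reg = [0] * self.m

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - self.p)                   # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining bits
        rho = (64 - self.p) - rest.bit_length() + 1   # leading-zero run + 1
        self.reg[j] = max(self.reg[j], rho)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        # Harmonic mean of 2^register across all registers.
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if raw <= 2.5 * self.m and zeros:        # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw
```

With p=10 (1024 registers, ~1KB of state) the expected error is 1.04/sqrt(1024) = 3.25%; bumping to p=14 gives the 16384-register, 0.81% configuration cited above.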
Production Systems Using HyperLogLog & Cardinality Estimation
- Redis PFADD/PFCOUNT
- ClickHouse uniqHLL12
- BigQuery APPROX_COUNT_DISTINCT
- Presto/Trino approx_distinct
- Flink DataStream approximate distinct
- Druid HyperUnique aggregator
- Elasticsearch cardinality aggregation
Common Mistakes with HyperLogLog & Cardinality Estimation
- Using exact COUNT(DISTINCT) on datasets above 100M rows when approximate is acceptable. A query that takes 45 seconds with exact counting takes 200ms with HLL
- Not understanding that HLL cannot tell you WHICH elements are in the set, only how many. If you need membership testing, you want a Bloom filter instead
- Forgetting that the error is relative to the true cardinality. 0.81% error on 1 billion uniques means you could be off by 8 million. For dashboards that is fine. For billing, it is not
- Using too few registers to save memory. Going from 16384 to 256 registers saves 12KB but increases error from 0.81% to 6.5%. The memory savings is almost never worth the accuracy loss
Related to HyperLogLog & Cardinality Estimation
Bloom Filters & Probabilistic Membership, Count-Min Sketch & Frequency Estimation, Streaming Quantiles: T-Digest & DDSketch
Inverted Index — Search & Retrieval
Difficulty: Advanced. Complexity: O(1) term lookup, O(n) posting traversal
Key Points for Inverted Index
- An inverted index flips the natural document-to-words relationship. Instead of asking 'what words are in this document?' you can ask 'which documents contain this word?' and get an answer in microseconds, even across billions of documents
- Posting lists are the core data structure: sorted arrays of document IDs paired with term frequencies, positions, and offsets. Compression techniques like PForDelta and Variable Byte encoding shrink these lists to 1-2 bits per posting
- Query execution boils down to set operations on posting lists. AND queries intersect sorted lists using skip pointers. OR queries merge them. The sorted order makes these operations efficient even when individual lists contain millions of entries
- Every Elasticsearch shard is a Lucene index, and every Lucene index is a collection of immutable segments. Each segment has its own inverted index. Search fans out to all segments, each one queries its local index, and results merge back
- Building the index requires choosing an analyzer chain: tokenizer, lowercasing, stemming, stop word removal. These choices are permanent. You cannot retroactively change how documents were tokenized without a full reindex
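The posting-list model can be sketched with a toy analyzer (lowercase plus whitespace split; real chains add stemming, stop words, and more) and shortest-first AND intersection:

```python
class InvertedIndex:
    def __init__(self):
        self.postings: dict[str, list[int]] = {}   # term -> sorted doc IDs

    def add(self, doc_id: int, text: str) -> None:
        # Assumes doc IDs are added in increasing order, keeping lists sorted.
        for term in set(text.lower().split()):
            self.postings.setdefault(term, []).append(doc_id)

    def search_and(self, query: str) -> list[int]:
        """Docs containing every query term: intersect posting lists, shortest first."""
        lists = sorted((self.postings.get(t, []) for t in query.lower().split()), key=len)
        if not lists:
            return []
        result = lists[0]
        for plist in lists[1:]:
            members = set(plist)
            result = [d for d in result if d in members]
        return result
```

Intersecting shortest-first bounds the work by the rarest term's list length, the same reason production engines drive intersection from the shortest posting list (with skip pointers instead of a hash set).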
Production Systems Using Inverted Index
- Elasticsearch / Lucene segments
- Apache Solr
- PostgreSQL GIN index
- SQLite FTS5
- Bleve (Go search library)
- Tantivy (Rust search engine)
Common Mistakes with Inverted Index
- Changing the analyzer after indexing data and expecting old documents to match new queries. The analyzer must be identical at index time and query time, or you get silent mismatches where documents exist but queries cannot find them
- Indexing everything as a single giant field. Separate title, body, tags, and metadata into distinct fields so you can boost and query them independently. A match in the title is worth more than a match buried in paragraph 47
- Ignoring stop words in phrase queries. If you remove stop words during indexing, the phrase 'to be or not to be' becomes an empty query. Either keep stop words for fields that need phrase matching, or use a shingle-based approach
- Not planning for updates. Inverted indexes are append-optimized. Updating a document means marking the old version as deleted and inserting a new one. High-update workloads cause segment bloat and need aggressive merge policies
Related to Inverted Index
BM25 — Best Matching 25, LSM Trees vs B-Trees, Bloom Filters & Probabilistic Membership
Logical Clocks & Causality Tracking — Clocks & Ordering
Difficulty: Advanced. Complexity: O(1) Lamport, O(N) vector
Key Points for Logical Clocks & Causality Tracking
- Physical clocks cannot reliably order events across machines. NTP drift, leap seconds, and clock skew mean that wall-clock timestamps from two different servers are not comparable for ordering purposes
- Lamport timestamps give you a total order that is consistent with causality in one direction: if A happened before B, then L(A) < L(B). But the converse is not true. Two events with different timestamps might be concurrent
- Vector clocks can detect concurrency: if neither vector dominates the other element-wise, the events are concurrent. This is what you need for conflict detection in multi-master replication
- DynamoDB uses version vectors, CockroachDB uses Hybrid Logical Clocks, and Google Spanner uses GPS and atomic clocks to bound uncertainty. Each makes a different trade-off between metadata size, accuracy, and hardware requirements
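Concurrency detection with vector clocks reduces to an element-wise dominance check. A minimal sketch, with node names as dict keys for illustration:

```python
def compare(a: dict, b: dict) -> str:
    """'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"          # a causally precedes b
    if b_le_a:
        return "after"
    return "concurrent"          # neither dominates: a genuine conflict

def merge(a: dict, b: dict) -> dict:
    """Element-wise max, applied when a node receives a remote clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

The "concurrent" result is exactly what Lamport timestamps cannot express, and it is the signal a multi-master store needs before invoking conflict resolution.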
Production Systems Using Logical Clocks & Causality Tracking
- DynamoDB (version vectors)
- CockroachDB (Hybrid Logical Clocks)
- Google Spanner (TrueTime)
- Riak (dotted version vectors)
- Cassandra (Lamport timestamps for conflict resolution)
- CouchDB (revision trees)
Common Mistakes with Logical Clocks & Causality Tracking
- Relying on System.currentTimeMillis() or time.Now() to order events across machines. Clock skew between machines is typically 1-10ms with NTP, but can spike to seconds during NTP corrections or leap second events
- Using Lamport timestamps when you need to detect concurrency. Lamport timestamps cannot distinguish between 'A happened before B' and 'A and B are concurrent.' You need vector clocks or version vectors for that
- Implementing full vector clocks at scale. With 1000 nodes, every message carries 1000 integers. Version vectors, which track versions per data item rather than per node, are almost always sufficient and much smaller
- Ignoring clock skew when using LWW registers. If node A's clock is 5 seconds ahead of node B, node A's writes always win concurrent conflicts regardless of actual ordering
Related to Logical Clocks & Causality Tracking
CRDTs: Conflict-Free Replicated Data Types, Raft Consensus Protocol, Quorum Systems & NRW Notation
LSM Trees vs B-Trees — Storage Internals
Difficulty: Advanced. Complexity: LSM: O(1) write, B-Tree: O(log N) write
Key Points for LSM Trees vs B-Trees
- B-Trees store data in fixed-size pages organized as a balanced tree. Reads follow root-to-leaf in O(log N) page reads. Writes update pages in place with a write-ahead log for crash safety. PostgreSQL, MySQL InnoDB, and SQLite all use B-Trees
- LSM Trees buffer all writes in memory (memtable), flush to immutable sorted files (SSTables) when full, and merge SSTables in the background via compaction. RocksDB, LevelDB, Cassandra, and ScyllaDB all use LSM trees
- Write amplification in LSM trees comes from compaction rewriting data multiple times across levels. Leveled compaction in RocksDB rewrites data about 10-30x. B-Trees rewrite entire pages for small updates plus WAL entries
- Bloom filters on SSTables are what make LSM read performance tolerable. Without them, a point query might check SSTables across every level. With them, you skip levels that definitely do not contain the key
- The choice between them almost always comes down to workload: write-heavy and append-heavy favors LSM, read-heavy with random updates favors B-Trees
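The LSM write/read path can be sketched without compaction or Bloom filters: O(1) writes into a memtable, an immutable sorted run on each flush, and reads that check newest-first. The memtable limit here is illustrative:

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []   # newest run first
        self.limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value                        # O(1) in-memory write
        if len(self.memtable) >= self.limit:
            # Flush to an immutable sorted run; never updated in place.
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:                          # freshest data first
            return self.memtable[key]
        for run in self.sstables:                         # then runs, newest to oldest
            i = bisect.bisect_left(run, (key, ""))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

The read path scanning every run is precisely the cost that per-SSTable Bloom filters eliminate, and the growing run count is what compaction keeps bounded.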
Production Systems Using LSM Trees vs B-Trees
- RocksDB (LSM, embedded in CockroachDB/TiDB/Flink)
- PostgreSQL (B-Tree indexes)
- MySQL InnoDB (B+Tree)
- Cassandra (LSM with configurable compaction)
- LevelDB (LSM)
- ClickHouse (MergeTree, LSM variant)
Common Mistakes with LSM Trees vs B-Trees
- Choosing an LSM-based engine for read-heavy OLTP with lots of random point queries. B-Trees will outperform because they do not need to check multiple levels
- Using default leveled compaction for a write-heavy workload without benchmarking tiered compaction. Leveled has the worst write amplification, and tiered can improve write throughput 3-5x
- Ignoring write amplification on SSDs. SSDs have a limited number of write cycles per cell. A 10x write amplification means your drive wears out 10x faster than the application-level write volume suggests
- Not monitoring compaction lag. When incoming writes outpace compaction, unmerged SSTables pile up, reads degrade because they check more files, and space usage balloons
Related to LSM Trees vs B-Trees
Bloom Filters & Probabilistic Membership, Merkle Trees & Anti-Entropy Repair
Merkle Trees & Anti-Entropy Repair — Data Distribution
Difficulty: Advanced. Complexity: O(log N) verification
Key Points for Merkle Trees & Anti-Entropy Repair
- Two replicas can find their differences in O(log N) hash comparisons instead of comparing every record. Compare root hashes, recurse into the subtree that differs, and narrow down to exactly the divergent blocks
- Cassandra builds Merkle trees during repair to identify divergent data ranges. Only the differing data transfers between replicas, saving enormous bandwidth compared to full sync
- Merkle DAGs generalize the concept beyond binary trees. Git uses Merkle DAGs for commits, trees, and blobs. IPFS uses them for content-addressable storage where the hash IS the address
- Repairing K divergent keys out of N total with a depth-D tree transfers O(K * D) hashes plus K values. Without the tree, you transfer N hashes. When K is much smaller than N, the savings are massive
- Incremental Merkle trees (Cassandra 4.0+) avoid the cost of rebuilding the entire tree from scratch on every repair cycle
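The O(log N) divergence search can be sketched directly: build a binary hash tree over blocks, compare roots, and recurse only where hashes differ. One block per list element keeps the sketch short; real systems hash key ranges:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(blocks: list, lo: int = 0, hi=None) -> dict:
    """Leaf: {'hash', 'lo', 'hi'}; internal node adds 'left' and 'right'."""
    if hi is None:
        hi = len(blocks)
    if hi - lo == 1:
        return {"hash": h(blocks[lo]), "lo": lo, "hi": hi}
    mid = (lo + hi + 1) // 2
    left, right = build(blocks, lo, mid), build(blocks, mid, hi)
    return {"hash": h(left["hash"] + right["hash"]), "lo": lo, "hi": hi,
            "left": left, "right": right}

def diff(a: dict, b: dict) -> list:
    """Block indices where two same-shape trees disagree."""
    if a["hash"] == b["hash"]:
        return []                # identical subtree: prune the whole branch
    if "left" not in a:
        return [a["lo"]]         # divergent leaf found
    return diff(a["left"], b["left"]) + diff(a["right"], b["right"])
```

Two replicas only exchange the hashes along the divergent path, then transfer the data for the handful of blocks `diff` reports, rather than comparing every record.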
Production Systems Using Merkle Trees & Anti-Entropy Repair
- Cassandra nodetool repair
- IPFS content addressing
- Git (commits, trees, blobs)
- ZFS data integrity (block checksumming)
- Amazon DynamoDB anti-entropy
- Ethereum state trie (Patricia Merkle Trie)
Common Mistakes with Merkle Trees & Anti-Entropy Repair
- Rebuilding the entire Merkle tree on every sync cycle. Use incremental updates that only rehash the subtrees where data changed
- Choosing tree depth too shallow. If each leaf covers thousands of keys, you detect that something diverged in a large range but still have to compare every key in that range
- Ignoring the CPU cost of hashing large datasets. SHA-256 on terabytes of data is not cheap, and running repair during peak traffic can starve application workloads
- Not scheduling repair during low-traffic periods. Anti-entropy repair is IO and CPU intensive. Running it at peak hours competes with production reads and writes
Related to Merkle Trees & Anti-Entropy Repair
Consistent Hashing & Ring Algorithms, Bloom Filters & Probabilistic Membership
MinHash & Locality-Sensitive Hashing — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(k) per signature, O(n*b) for LSH indexing
Key Points for MinHash & Locality-Sensitive Hashing
- MinHash estimates the Jaccard similarity between two sets without comparing them element by element. Two documents with 80% word overlap will have matching MinHash values about 80% of the time. This turns a potentially expensive set intersection into a fast signature comparison
- An LSH (Locality-Sensitive Hashing) index groups similar items into the same hash bucket with high probability. Unlike regular hash functions that spread similar inputs apart, LSH deliberately creates collisions for similar items. This enables sub-linear similarity search
- The banding technique controls the trade-off between false positives and false negatives. Split the signature into b bands of r rows each. Two items are candidates if they match in at least one band. More bands with fewer rows catch more true pairs but also more false ones
- SimHash (Charikar, 2002) is the cosine similarity counterpart to MinHash. It produces a fixed-size binary fingerprint where the Hamming distance between fingerprints approximates the cosine distance between the original vectors. Google used SimHash for web page deduplication at scale
- The combination of MinHash signatures and LSH indexing reduces near-duplicate detection from O(n²) pairwise comparisons to roughly O(n), making it practical on datasets with billions of items
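A sketch of the signature and the similarity estimate (hedged: real implementations use a proper universal hash family; XOR-ing Python's built-in `hash` with random salts is a simplification that works adequately for illustration):

```python
import random

def minhash_signature(items, k=256, seed=1):
    # Simulate k independent hash functions by XOR-ing hash(x) with k fixed salts.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash(x) ^ s for x in items) for s in salts]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing two signatures of 256 slots replaces an explicit set intersection, and the estimation error shrinks as 1/sqrt(k).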
Production Systems Using MinHash & Locality-Sensitive Hashing
- Google web dedup (SimHash)
- Uber geospatial matching
- Spotify playlist similarity
- Twitter timeline dedup
- Apache Spark MLlib
- Pinecone (vector DB, LSH index)
Common Mistakes with MinHash & Locality-Sensitive Hashing
- Using too few hash functions in the MinHash signature. The estimation error of Jaccard similarity is proportional to 1/sqrt(k) where k is the number of hash functions. With 100 hash functions, your error is about 10%. With 400, it is about 5%. Skimping on k for speed gives unreliable results
- Applying MinHash to weighted or ordered data without adaptation. Standard MinHash works on unweighted sets (is this element present or not). For weighted sets (TF-IDF vectors), you need weighted MinHash. For ordered sequences (sentences), you need shingling first
- Setting LSH bands and rows without understanding the S-curve. The probability of two items becoming candidates is 1-(1-s^r)^b where s is the true similarity. Plotting this curve for your chosen b and r shows you exactly where the threshold falls and how steep it is
- Running all-pairs similarity on a large dataset without LSH. Pairwise comparison of 10 million items is 50 trillion comparisons. No amount of hardware makes this feasible. LSH reduces it to checking only items that fall in the same bucket, which is typically a few orders of magnitude fewer pairs
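The S-curve from the banding discussion is easy to probe directly. A small sketch; the threshold approximation (1/b)^(1/r) is the standard rule of thumb, not an exact inflection point:

```python
def candidate_probability(s, b, r):
    # Probability that two items with true similarity s match in at least
    # one of b bands of r rows: 1 - (1 - s^r)^b.
    return 1 - (1 - s ** r) ** b

def approx_threshold(b, r):
    # Similarity around which the S-curve rises steeply: roughly (1/b)^(1/r).
    return (1 / b) ** (1 / r)
```

With b=20 and r=5, pairs at 0.8 similarity are almost always candidates while pairs at 0.3 almost never are, which is exactly the filtering behavior LSH is chosen for.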
Related to MinHash & Locality-Sensitive Hashing
Bloom Filters & Probabilistic Membership, HyperLogLog & Cardinality Estimation, BM25 — Best Matching 25
Operational Transformation — Consensus & Coordination
Difficulty: Expert. Complexity: O(n²) worst-case transform, O(n) typical
Key Points for Operational Transformation
- OT lets multiple users edit the same document simultaneously by transforming each operation against every concurrent operation before applying it. The transform adjusts character positions so that every client converges to the same final document, even when edits arrive out of order
- OT requires a central server (or at least a total ordering mechanism) to sequence operations. The server is the single source of truth for operation order, which makes correctness tractable but means offline editing and peer-to-peer topologies are not naturally supported
- The correctness of OT depends on transformation functions satisfying TP1: for any two concurrent operations o1 and o2, applying o1 followed by o2 transformed against o1 must produce the same state as applying o2 followed by o1 transformed against o2. TP2 extends this to three or more concurrent operations. Multiple published algorithms were later found to violate TP2
- Google Docs has run OT in production since 2006, making it the most battle-tested approach for real-time collaborative text editing. The Jupiter protocol (client prediction + server canonicalization) handles latency by letting clients apply their own edits immediately and reconcile later
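A minimal sketch of position transformation for character-based OT. Assumptions are loud here: ops are simplified to ('ins', pos, text) and ('del', pos, len) tuples, ties between concurrent inserts at the same position are broken by a priority flag standing in for site IDs, and overlapping deletes are not handled:

```python
def apply_op(doc, op):
    kind, pos, arg = op
    if kind == 'ins':
        return doc[:pos] + arg + doc[pos:]
    return doc[:pos] + doc[pos + arg:]           # 'del' removes arg characters

def transform(op, other, priority=False):
    """Shift op's position so it applies correctly after `other` has been applied."""
    kind, pos, arg = op
    okind, opos, oarg = other
    if okind == 'ins':
        if pos < opos or (pos == opos and priority):
            return op                             # op lands left of other's insert
        return (kind, pos + len(oarg), arg)       # shifted right by the inserted text
    # other is a delete
    if pos <= opos:
        return op
    return (kind, max(opos, pos - oarg), arg)     # collapse into the deleted range if inside it

# Convergence check: both application orders end with the same document.
doc = "abc"
a = ('ins', 1, 'X')                               # site 1 (wins ties)
b = ('ins', 2, 'Y')                               # site 2
site1 = apply_op(apply_op(doc, a), transform(b, a, priority=False))
site2 = apply_op(apply_op(doc, b), transform(a, b, priority=True))
```

Both sites converge on "aXbYc" even though the operations were applied in different orders, which is TP1 in miniature.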
Production Systems Using Operational Transformation
- Google Docs (Jupiter protocol, real-time collaborative editing)
- Google Sheets (structured OT for cell operations)
- Apache Wave (formerly Google Wave)
- CKEditor 5 (rich text OT)
- Firepad (Firebase-backed collaborative editor)
- Codox (collaborative Word/Excel plugins)
Common Mistakes with Operational Transformation
- Writing transform functions that handle insert-insert and insert-delete but forget delete-delete interactions at the same position. Edge cases around overlapping deletions are where most OT bugs hide
- Trying to build OT without a central server. Pure peer-to-peer OT exists in theory, but the TP2 correctness requirement makes it extremely hard to get right. Most peer-to-peer implementations that claim OT are actually doing something closer to CRDTs
- Assuming OT is only for plain text. Google Docs uses OT for rich text with formatting, and Google Sheets extends it to spreadsheet operations. But every new operation type multiplies the number of transform function pairs you need to write and prove correct
- Not implementing operation compression for undo. A naive undo stack stores raw operations, but transformed operations change shape as they pass through the system. You need inverse transforms that account for the operation's history, not just the original
Related to Operational Transformation
CRDTs: Conflict-Free Replicated Data Types, Logical Clocks & Causality Tracking
Paxos Consensus — Consensus & Coordination
Difficulty: Expert. Complexity: O(n) messages per round
Key Points for Paxos Consensus
- Paxos is the original solution to distributed consensus. Leslie Lamport published it in 1998 (written in 1990), and every consensus algorithm since then, including Raft, Zab, and Viewstamped Replication, is either a variant of Paxos or heavily influenced by it
- Single-decree Paxos agrees on exactly one value through two phases: Prepare/Promise (proposer claims a ballot number, acceptors promise not to accept older ballots) and Accept/Accepted (proposer sends the value, acceptors accept if they have not promised a higher ballot)
- Multi-Paxos optimizes for the common case by electing a stable leader. Once a leader is established, it skips the Prepare phase entirely and runs only the Accept phase for each new value. This cuts consensus down to one round trip
- The FLP impossibility result proves that no deterministic consensus protocol can guarantee progress in an asynchronous system with even one crash failure. Paxos handles this by using timeouts and leader election, which can livelock but never violate safety
- Google Spanner, Google Chubby, Azure Storage, and Apache Zookeeper (via the ZAB protocol, a Paxos variant) all run Paxos-family protocols in production at massive scale
Production Systems Using Paxos Consensus
- Google Spanner (Multi-Paxos)
- Google Chubby lock service
- Azure Storage replication
- Apache Zookeeper (ZAB variant)
- WANdisco active-active replication
- Google Megastore
Common Mistakes with Paxos Consensus
- Conflating Paxos with leader election. Paxos does not require a leader. Single-decree Paxos works with multiple concurrent proposers. The leader optimization in Multi-Paxos is a performance improvement, not a correctness requirement
- Assuming that understanding single-decree Paxos means you can implement Multi-Paxos. The gap between agreeing on one value and building a replicated log involves dozens of engineering decisions that the original paper does not cover: snapshotting, membership changes, read leases, client interaction
- Ignoring the livelock scenario. Two proposers alternating Prepare messages with increasing ballot numbers can prevent either from completing the Accept phase. A randomized backoff or leader election breaks the livelock, but you have to implement it
- Thinking that Paxos is too complex to be practical. The single-decree protocol is about 20 lines of pseudocode. What is genuinely complex is building a full replicated state machine on top of it, but that complexity exists regardless of which consensus protocol you choose
Related to Paxos Consensus
Raft Consensus Protocol, Quorum Systems & NRW Notation, Two-Phase Commit (2PC)
PBFT — Practical Byzantine Fault Tolerance — Consensus & Coordination
Difficulty: Expert. Complexity: O(n²) messages per round
Key Points for PBFT — Practical Byzantine Fault Tolerance
- PBFT handles Byzantine faults: nodes that can lie, send conflicting messages, or behave arbitrarily. Raft and Paxos only handle crash faults where nodes either work correctly or stop completely. If you need consensus in an adversarial environment, you need something like PBFT
- The protocol tolerates up to f Byzantine nodes out of 3f+1 total. With 4 nodes you tolerate 1 Byzantine failure. With 7 you tolerate 2. The 3f+1 bound is fundamental and cannot be improved for deterministic protocols
- Three phases drive consensus: pre-prepare (leader proposes), prepare (replicas echo the proposal), commit (replicas confirm they saw enough prepares). A client waits for f+1 matching replies from different replicas before accepting the result
- View changes handle a Byzantine leader. If the leader is faulty or slow, replicas trigger a view change that rotates leadership to the next replica. The protocol ensures no committed request is lost during the transition
- The O(n²) message complexity is the main scalability bottleneck. Every replica sends messages to every other replica in the prepare and commit phases. Beyond about 20 nodes, the network overhead becomes prohibitive for high-throughput systems
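The 3f+1 arithmetic and the all-to-all cost are worth having as executable formulas. A sketch; the message count is a rough tally of the two broadcast phases, ignoring pre-prepare and client replies:

```python
def max_byzantine_faults(n):
    # Largest f satisfying n >= 3f + 1.
    return (n - 1) // 3

def bft_quorum(n):
    # Prepare/commit certificates need 2f + 1 matching messages.
    return 2 * max_byzantine_faults(n) + 1

def broadcast_messages(n):
    # Prepare and commit are each all-to-all: two rounds of n * (n - 1) messages.
    return 2 * n * (n - 1)
```

Growing the cluster 10x grows the per-round message load by roughly 100x, which is why validator sets stay small.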
Production Systems Using PBFT — Practical Byzantine Fault Tolerance
- Hyperledger Fabric (PBFT variant)
- Tendermint / CometBFT
- Zilliqa (pBFT + PoW hybrid)
- IBM blockchain platform
- dYdX (CometBFT-based)
- Cosmos SDK chains
Common Mistakes with PBFT — Practical Byzantine Fault Tolerance
- Assuming PBFT scales like Raft. Raft has O(n) message complexity per consensus round. PBFT has O(n²). A 100-node PBFT cluster sends 10,000 messages per round. This is why most blockchain systems use PBFT only for small validator sets
- Using PBFT when crash fault tolerance is sufficient. If your threat model only includes machines dying or losing network connectivity, Raft or Paxos is simpler, faster, and scales better. PBFT adds significant overhead that is only justified when nodes might actively misbehave
- Forgetting that PBFT requires a known, fixed membership. Unlike permissionless blockchain consensus (Nakamoto), PBFT requires every node to know every other node upfront. Adding or removing nodes requires a reconfiguration protocol
- Ignoring the authentication overhead. Every PBFT message must be authenticated (MAC or digital signature) to prevent forgery. In high-throughput systems, the CPU cost of signing and verifying thousands of messages per second is substantial
Related to PBFT — Practical Byzantine Fault Tolerance
Raft Consensus Protocol, Paxos Consensus, Quorum Systems & NRW Notation
Quorum Systems & NRW Notation — Failure Detection & Quorums
Difficulty: Advanced. Complexity: O(W+R) per operation
Key Points for Quorum Systems & NRW Notation
- With N replicas, W (write quorum) and R (read quorum) define the consistency tradeoff. If W + R > N, every read sees at least one replica with the latest write because the write and read sets must overlap
- N=3, W=2, R=2 is the classic setup. It tolerates 1 node failure for both reads and writes while guaranteeing strong consistency. Cassandra, Riak, and DynamoDB all support this configuration
- Sloppy quorums sacrifice consistency for availability. When designated replicas are down, writes go to any available node. Hinted handoff eventually delivers the data to the correct replicas, but reads may be stale in the meantime
- W + R > N guarantees you will see the latest write, but it does not guarantee linearizability. Concurrent reads during a write can still observe different values. Additional mechanisms like read repair or blocking reads are needed for true linearizability
- Witness replicas participate in quorum votes without storing full data. They confirm they saw a write by storing just the key and version, useful for tie-breaking in even-numbered replica configurations
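The overlap claim is mechanically checkable: enumerate every possible write quorum and read quorum and verify they intersect exactly when W + R > N. A brute-force sketch:

```python
from itertools import combinations

def always_intersect(n, w, r):
    # True iff every size-w write set shares a replica with every size-r read set.
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))
```

Exhaustive checking over small N confirms the pigeonhole argument rather than trusting it.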
Production Systems Using Quorum Systems & NRW Notation
- DynamoDB (sloppy quorum with hinted handoff)
- Cassandra (configurable NRW per query)
- Riak (NRW with read repair)
- ZooKeeper (majority quorum for writes)
- etcd (majority quorum via Raft)
- CockroachDB (Raft-based quorum)
Common Mistakes with Quorum Systems & NRW Notation
- Assuming W + R > N gives you linearizability. It only gives regular register semantics. Without additional mechanisms like Raft or blocking writes, concurrent reads can disagree
- Using the same NRW configuration for every query type. Analytics reads that tolerate staleness should use R=1 for speed. Transaction reads that need consistency should use R=2. Cassandra supports per-query consistency levels for exactly this reason
- Forgetting that sloppy quorums can return stale data even with W + R > N. The N nodes that participated in the write might include non-designated nodes that the read quorum never contacts
- Not monitoring hinted handoff queue depth. If hints pile up faster than they drain, replicas are diverging and the consistency guarantees you think you have are eroding
Related to Quorum Systems & NRW Notation
Raft Consensus Protocol, Failure Detection Algorithms, Consistent Hashing & Ring Algorithms
Raft Consensus Protocol — Consensus & Coordination
Difficulty: Expert. Complexity: O(n) per commit, O(log n) disk
Key Points for Raft Consensus Protocol
- Raft decomposes consensus into three relatively independent sub-problems: leader election, log replication, and safety. Each one is tractable on its own, which is the whole reason Raft succeeded where Paxos confused everyone
- Every committed entry survives any future leader election because candidates must prove they have all committed entries before a majority will vote for them. This is the election restriction, and it is the single most important safety property in the protocol
- Leader election uses randomized timeouts, typically in the 150-300ms range the Raft paper suggests. The randomization prevents split votes from repeating indefinitely, and in practice elections settle in one or two rounds
- etcd, CockroachDB, Consul, TiKV, and RabbitMQ quorum queues all run Raft in production. It is the dominant consensus algorithm for strongly consistent replicated state machines
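The election restriction from the second bullet compares the (term, index) of each log's last entry. A sketch of the voter's check, with RequestVote handling, persistence, and term bookkeeping omitted:

```python
def candidate_log_ok(cand_last_term, cand_last_index, my_last_term, my_last_index):
    # Grant a vote only if the candidate's log is at least as up-to-date as ours:
    # the higher last term wins; equal terms fall back to the longer log.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Because a committed entry is on a majority of logs, any candidate that wins a majority of votes under this check necessarily holds every committed entry.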
Production Systems Using Raft Consensus Protocol
- etcd (Kubernetes backing store)
- CockroachDB
- Consul
- TiKV (TiDB storage)
- HashiCorp Vault
- RabbitMQ Quorum Queues
Common Mistakes with Raft Consensus Protocol
- Setting election timeout too low in cloud environments. Network jitter on shared infrastructure can easily exceed 150ms, causing unnecessary leader elections that stall writes for the duration of the election
- Not implementing log compaction or snapshotting. Without it, the Raft log grows unbounded on disk and new followers joining the cluster must replay the entire history from the beginning
- Ignoring learner (non-voting) members when scaling reads. Adding voting members increases the cost of every commit because the leader must wait for a majority. Learners receive the log but do not participate in elections or commit quorums
- Assuming Raft handles Byzantine failures. It does not. Raft trusts that nodes follow the protocol honestly. A malicious node that sends fabricated log entries will corrupt the cluster. For Byzantine tolerance, you need BFT protocols like PBFT or HotStuff
Related to Raft Consensus Protocol
Gossip Protocols & SWIM, Quorum Systems & NRW Notation, Logical Clocks & Causality Tracking
Saga Pattern — Consensus & Coordination
Difficulty: Advanced. Complexity: O(n) steps, n = transaction participants
Key Points for Saga Pattern
- A saga is a sequence of local transactions where each step has a compensating action. If step 3 fails, steps 2 and 1 are rolled back by running their compensating actions in reverse order. This gives you eventual consistency across services without locking resources
- Two execution styles: choreography (each service listens for events and decides what to do next) and orchestration (a central coordinator tells each service what to do). Choreography is simpler for 2-3 services. Orchestration scales better and is easier to debug
- Sagas replace 2PC in microservice architectures because holding distributed locks across services is impractical. A payment service and a shipping service operated by different teams with different SLAs should not be coupled by a blocking transaction protocol
- Compensating actions are not the same as rollbacks. A database rollback undoes uncommitted changes. A compensating action is a new forward operation that semantically reverses a committed change. Refunding a charge is a compensating action for processing a payment
- Temporal (built by the creators of Uber's Cadence), AWS Step Functions, and Axon Framework are the most common saga orchestration platforms. They handle retries, timeouts, state persistence, and failure tracking so you do not have to build all of that from scratch
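A toy orchestrator makes the compensation ordering concrete. A sketch under assumed shapes: each step is a (action, compensate) pair of callables, and platforms like Temporal add the persistence, retries, and timeouts this omits:

```python
def run_saga(steps):
    # Run actions in order; on failure, undo completed steps in reverse order.
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return 'aborted'
    return 'completed'

log = []
def step(name, fail=False):
    # Hypothetical saga step: the action records its name, the compensation
    # records a semantic undo (a new forward operation, not a rollback).
    def action():
        if fail:
            raise RuntimeError(name + ' failed')
        log.append(name)
    return action, (lambda: log.append('undo-' + name))

outcome = run_saga([step('reserve'), step('charge'), step('ship', fail=True)])
```

When the third step fails, the charge is compensated before the reservation, mirroring the reverse-order rollback described above.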
Production Systems Using Saga Pattern
- Uber (Cadence / Temporal)
- AWS Step Functions
- Axon Framework
- Netflix Conductor
- MassTransit (.NET)
- Eventuate Tram (Java)
Common Mistakes with Saga Pattern
- Assuming every operation has a clean compensating action. Some operations cannot be undone. You cannot un-send an email. You cannot un-ship a package. For these, use techniques like delayed execution (do not send until the saga completes) or accept that some manual cleanup will be needed
- Not making saga steps idempotent. If step 3 fails halfway and gets retried, it might run twice. If it charges a customer without checking whether the charge already happened, you double-charge them. Every step and every compensating action must be safe to retry
- Using choreography with more than 4-5 services. Without a central coordinator, the flow is spread across event handlers in multiple services. Debugging why a saga got stuck means tracing events across service logs. Orchestration keeps the full flow visible in one place
- Ignoring the 'in-flight' state. Between steps, the system is in a partially completed state. A user checking their order might see inventory reserved but no payment processed. Design your UIs and APIs to handle intermediate states gracefully
Related to Saga Pattern
Two-Phase Commit (2PC), Raft Consensus Protocol, Circuit Breakers & Bulkheads
Shard Rebalancing & Virtual Buckets — Data Distribution
Difficulty: Advanced. Complexity: O(data moved / bandwidth)
Key Points for Shard Rebalancing & Virtual Buckets
- Shard rebalancing moves data between nodes when you add capacity, remove nodes, or when load becomes uneven. The core challenge is doing this without downtime and without killing the performance of the live system that is still serving reads and writes
- Range-based sharding splits the key space into contiguous ranges (e.g., keys A-M on shard 1, N-Z on shard 2). Easy to reason about and efficient for range queries, but prone to hotspots if certain key ranges get more traffic
- Hash-based sharding hashes keys and assigns the hash to a shard. This gives even distribution but makes range queries impossible because adjacent keys end up on different shards. DynamoDB and Cassandra use hash-based partitioning; CockroachDB is range-based by default (it rebalances via range splits and merges), with hash-sharded indexes as an opt-in for hotspot-prone keys
- Virtual buckets (or virtual shards) decouple the logical partition from the physical node. Create 1000 virtual buckets, assign them to 10 physical nodes. Adding an 11th node means reassigning ~90 buckets, not rehashing every key
- Online rebalancing while serving traffic requires throttling data movement to avoid saturating network and disk I/O. Every production system has a 'rebalancing rate limit' knob, and getting it right is the difference between a smooth expansion and a cascading outage
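The bucket arithmetic is easy to check directly. A sketch with a hypothetical round-robin initial placement; `add_node` moves only enough buckets to level the load, which is the whole point of virtual buckets:

```python
from collections import Counter

def add_node(assignment, new_node):
    # Steal buckets from over-target nodes until the new node has its fair share.
    counts = Counter(assignment.values())
    target = len(assignment) // (len(counts) + 1)
    out = dict(assignment)
    counts[new_node] = 0
    for bucket, node in assignment.items():
        if counts[new_node] >= target:
            break
        if counts[node] > target:
            out[bucket] = new_node
            counts[node] -= 1
            counts[new_node] += 1
    return out

before = {b: b % 10 for b in range(1000)}      # 1000 buckets on 10 nodes, 100 each
after = add_node(before, 10)                   # add an 11th node
moved = sum(before[b] != after[b] for b in before)
```

Only about 9% of buckets move, and no key-to-bucket mapping changes at all: clients keep hashing keys to buckets while the bucket-to-node map is updated.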
Production Systems Using Shard Rebalancing & Virtual Buckets
- Vitess (MySQL sharding)
- CockroachDB (range splits/merges)
- Apache Kafka (partition reassignment)
- DynamoDB (partition splitting)
- MongoDB (chunk migration)
- Citus (PostgreSQL sharding)
Common Mistakes with Shard Rebalancing & Virtual Buckets
- Rebalancing without rate limiting. Moving terabytes of data at full network speed starves the live workload of bandwidth and I/O. Cassandra, CockroachDB, and Kafka all have rebalancing throttle settings that default to conservative values for good reason
- Splitting shards by key count instead of data size. A shard with 1 million tiny keys and a shard with 1 million large blobs have the same key count but wildly different storage and I/O requirements. Split on bytes, not rows
- Not monitoring shard balance after the initial setup. Data distribution shifts as usage patterns change. The shard that was perfectly sized six months ago might now hold 3x the data of its neighbors. Set up alerts on shard size deviation
- Doing a full rebalance when a partial one would suffice. If one shard is hot, split that shard. Do not reorganize the entire cluster. Minimizing data movement reduces risk and shortens the rebalancing window
Related to Shard Rebalancing & Virtual Buckets
Consistent Hashing & Ring Algorithms, Chord & Distributed Hash Tables, Quorum Systems & NRW Notation
Skip Lists — Storage Internals
Difficulty: Advanced. Complexity: O(log n) search/insert/delete
Key Points for Skip Lists
- A skip list is a layered linked list where higher layers skip over multiple elements, giving you O(log n) search, insert, and delete on a data structure that is conceptually much simpler than a balanced binary tree
- Level assignment is randomized: each new node gets level 1, then flips a coin repeatedly to decide if it should also appear at level 2, level 3, and so on. This probabilistic approach gives expected O(log n) performance without any rebalancing logic
- Redis chose skip lists over balanced trees for sorted sets (ZSET) because skip lists are simpler to implement, naturally support range queries, and are easier to reason about for concurrent access. Antirez has said as much in the Redis documentation
- LevelDB and RocksDB use skip lists for their in-memory memtable because skip lists support concurrent reads without locking and concurrent inserts with minimal synchronization. When the memtable fills up, it flushes to an SSTable on disk
- The space overhead is modest. With p=0.25 (each node has a 25% chance of being promoted to the next level), the expected number of pointers per node is 1.33. With p=0.5, it is 2. In practice, the pointer overhead is small compared to the keys and values stored
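The coin-flip level assignment and the pointer-overhead math fit in a few lines. A sketch, with a seeded RNG for reproducibility; the expected pointers per node is 1/(1-p) because a node at level L carries L forward pointers:

```python
import random

def random_level(p=0.25, max_level=32, coin=random.random):
    # Start at level 1; promote with probability p until a coin flip fails.
    level = 1
    while coin() < p and level < max_level:
        level += 1
    return level

def expected_pointers_per_node(p):
    # E[level] = 1 / (1 - p): p=0.25 gives 1.33 pointers, p=0.5 gives 2.
    return 1 / (1 - p)
```

Sampling many levels and averaging recovers the 1.33 figure quoted above for p=0.25.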
Production Systems Using Skip Lists
- Redis sorted sets (ZSET)
- LevelDB / RocksDB memtable
- MemSQL (SingleStore) indexes
- Apache HBase MemStore
- Java ConcurrentSkipListMap
- Lucene (older versions, doc values)
Common Mistakes with Skip Lists
- Using p=0.5 (50% promotion probability) without thinking about it. p=0.5 gives the best search performance but uses more memory (2 pointers per node on average). p=0.25 uses less memory (1.33 pointers) with only slightly slower search. Redis uses p=0.25
- Setting maxLevel too low. The maximum level should be about log base 1/p of the expected number of elements. With 1 million elements and p=0.25, you need at least 10 levels. Too few levels and the top layers do not skip enough, degrading to linear search
- Forgetting that skip lists are probabilistic. A single skip list can get unlucky and have O(n) performance if the random level assignments are adversarial. In practice with a reasonable random number generator, this essentially never happens, but it is good to know the worst case
- Implementing concurrent skip lists naively. Lock-free skip list insertion requires careful ordering of pointer updates and memory barriers. The Herlihy and Shavit algorithm is the standard reference, but getting it right is harder than it looks
Related to Skip Lists
LSM Trees vs B-Trees, Write-Ahead Log
SSTable & Compaction Strategies — Storage Internals
Difficulty: Advanced. Complexity: Write amp: O(L) leveled, O(T) size-tiered
Key Points for SSTable & Compaction Strategies
- An SSTable (Sorted String Table) is an immutable, sorted file of key-value pairs with an index block for fast lookups. Immutability is the key design decision: writes never modify existing files, they create new ones. This makes writes fast and crash recovery simple
- Compaction merges multiple SSTables into fewer, larger ones, discarding deleted keys (tombstones) and keeping only the latest version of each key. Without compaction, reads would need to check every SSTable, and disk space would grow without bound
- Size-Tiered Compaction (STCS) groups SSTables by size and merges them when enough similar-sized files accumulate. Write-optimized but can temporarily use 2x disk space during compaction and creates wide variation in SSTable sizes
- Leveled Compaction (LCS) organizes SSTables into levels with exponentially increasing size limits. Each level is sorted with no key overlap (except L0). Read-optimized because any key exists in at most one SSTable per level, but write amplification is higher
- Choosing the wrong compaction strategy for your workload is one of the most common performance mistakes in Cassandra and RocksDB. Write-heavy workloads want STCS. Read-heavy workloads with updates want LCS. Time-series data often wants FIFO or a time-window strategy
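The core of any compaction strategy is a newest-wins merge that drops tombstones. A sketch over in-memory lists of (key, value) pairs with None as the tombstone marker; real engines stream sorted files through a heap merge instead of materializing a dict:

```python
def compact(sstables):
    # sstables are ordered newest-first; the first write seen for a key wins.
    merged = {}
    for table in sstables:
        for key, value in table:
            merged.setdefault(key, value)
    # Drop tombstones and emit a new sorted, immutable SSTable.
    return [(k, v) for k, v in sorted(merged.items()) if v is not None]
```

Note that the tombstone for 'b' can only be discarded because the compaction sees every older SSTable that might still contain 'b'; this is exactly the tombstone-accumulation problem described below.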
Production Systems Using SSTable & Compaction Strategies
- Apache Cassandra
- RocksDB / LevelDB
- ScyllaDB
- Apache HBase
- CockroachDB (via RocksDB/Pebble)
- InfluxDB (TSI/TSM)
Common Mistakes with SSTable & Compaction Strategies
- Running leveled compaction on a write-heavy workload without enough disk I/O headroom. LCS rewrites data multiple times as it flows through levels (10-30x write amplification). On spinning disks or underpowered SSDs, compaction falls behind and the system chokes
- Ignoring tombstone accumulation. Deleted keys leave tombstones that persist until compaction removes them. If compaction does not reach old SSTables, tombstones pile up and reads slow down because the system has to process deletes it cannot discard yet
- Not reserving enough disk space for compaction. STCS needs up to 50% free space during a major compaction because it writes new SSTables before deleting old ones. Running at 80% disk utilization with STCS is a ticking time bomb
- Using FIFO compaction for data you query by key. FIFO drops entire SSTables when they exceed a TTL or size limit. It does not merge or sort. If you need to look up individual keys in FIFO-compacted data, reads are slow because there is no compaction to consolidate keys
Related to SSTable & Compaction Strategies
LSM Trees vs B-Trees, Write-Ahead Log, Skip Lists, Bloom Filters & Probabilistic Membership
Streaming Quantiles: T-Digest & DDSketch — Probabilistic Data Structures
Difficulty: Expert. Complexity: O(log n) per add, O(compression) space
Key Points for Streaming Quantiles: T-Digest & DDSketch
- You cannot average percentiles. Averaging the P99 of shard A with the P99 of shard B does not give you the global P99. This is the fundamental reason streaming quantile sketches exist
- T-Digest uses weighted centroids that cluster more tightly at the tails (P1, P99) where accuracy matters most for latency monitoring. The compression parameter (typically 100-300) controls the accuracy-memory trade-off
- DDSketch uses logarithmic bin boundaries to provide a relative error guarantee. If relative error is 1%, your P99 estimate is within 1% of the true value regardless of the underlying distribution shape
- Both T-Digest and DDSketch support merging from multiple servers, which makes them suitable for computing global percentiles across a distributed fleet without centralizing raw data
- Prometheus histogram_quantile, Datadog DDSketch, Elasticsearch percentile aggregation, and Envoy proxy latency histograms all rely on streaming quantile algorithms in production
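The first key point deserves a worked example. A toy demonstration; the `percentile` helper is a simple nearest-rank baseline, not a sketch structure, and exists only to show that averaging per-shard P99s misses a concentrated tail:

```python
def percentile(values, p):
    # Nearest-rank percentile on a sorted copy.
    s = sorted(values)
    rank = max(1, round(p / 100 * len(s)))
    return s[rank - 1]

shard_a = [10] * 100                    # a healthy shard: P99 = 10ms
shard_b = [10] * 50 + [500] * 50        # a shard with a heavy tail: P99 = 500ms

avg_of_p99s = (percentile(shard_a, 99) + percentile(shard_b, 99)) / 2
true_global_p99 = percentile(shard_a + shard_b, 99)
```

The averaged value (255ms) understates the true global P99 (500ms) by nearly half; mergeable sketches like T-Digest and DDSketch exist so the merge happens on the data, not on the quantiles.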
Production Systems Using Streaming Quantiles: T-Digest & DDSketch
- Prometheus histogram_quantile
- Datadog Agent DDSketch
- Elasticsearch percentile aggregation
- Apache Spark approxQuantile
- ClickHouse quantileTDigest
- Envoy proxy latency histograms
Common Mistakes with Streaming Quantiles: T-Digest & DDSketch
- Averaging percentiles across shards or time windows. This is mathematically wrong and produces values that can be arbitrarily far from the true global percentile. A system with 100 shards each reporting P99 = 50ms might actually have a global P99 of 500ms if the tail latency is concentrated on a few shards
- Using fixed-width histogram bins for long-tail latency distributions. Bins of [0-10ms, 10-20ms, 20-30ms, ...] waste resolution on the low end where everything clusters and have no resolution at the tail where the interesting behavior lives. Exponential or logarithmic bins are almost always better
- Setting T-Digest compression too low for tail accuracy. A compression parameter of 25 might seem fine when you look at P50, but the P99.9 estimate can be off by 30% or more. For SLO monitoring, use compression 200+ and verify tail accuracy against known distributions
- Treating Prometheus default histogram buckets as precise quantile estimates. Prometheus histograms use linear interpolation within fixed buckets, and the default buckets (5ms, 10ms, 25ms, 50ms, ...) produce quantile estimates that can be wildly off if your actual latency distribution does not align with the bucket boundaries
Related to Streaming Quantiles: T-Digest & DDSketch
HyperLogLog & Cardinality Estimation, Count-Min Sketch & Frequency Estimation
Two-Phase Commit (2PC) — Consensus & Coordination
Difficulty: Advanced. Complexity: O(n) messages, 2 round trips
Key Points for Two-Phase Commit (2PC)
- 2PC solves the atomic commit problem: either all participants in a distributed transaction commit, or all abort. There is no state where participant A committed and participant B aborted. This is the guarantee that makes distributed transactions possible
- Phase 1 (Prepare): the coordinator asks each participant 'can you commit?' Each participant acquires locks, writes a prepare record to its WAL, and votes yes or no. A yes vote is a binding promise to commit if asked. Phase 2 (Commit/Abort): the coordinator collects votes. If all voted yes, send commit. If any voted no, send abort
- The blocking problem is the biggest weakness. If the coordinator crashes after sending prepare but before sending the commit decision, participants are stuck holding locks and cannot proceed. They voted yes and promised to commit if asked, but nobody is asking. 3PC was invented to address this, but it trades blocking for other problems
- Google Spanner solves the blocking problem by making the coordinator itself a Paxos group. If the coordinator leader crashes, another member of the Paxos group takes over and drives the commit decision. This is 2PC-over-Paxos, and it is the standard approach for production distributed databases
- MySQL XA transactions, PostgreSQL PREPARE TRANSACTION, CockroachDB, and Google Spanner all use 2PC or close variants for cross-shard or cross-database transactions
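The happy path and the abort path of 2PC fit in a short sketch. These are in-process stand-ins: real participants write a prepare record to their WAL and hold locks between the phases, and a real coordinator must persist its decision before broadcasting it:

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = 'init'

    def prepare(self):
        # A yes vote is a binding promise to commit if asked.
        self.state = 'prepared' if self.can_commit else 'abort-voted'
        return self.can_commit

    def commit(self):
        self.state = 'committed'

    def abort(self):
        self.state = 'aborted'

def two_phase_commit(participants):
    # Phase 1: collect votes. Phase 2: broadcast the unanimous decision.
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return 'committed'
    for p in participants:
        p.abort()
    return 'aborted'
```

A single no vote aborts every participant, which is the all-or-nothing guarantee; the blocking problem appears when the coordinator dies between the two loops.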
Production Systems Using Two-Phase Commit (2PC)
- MySQL XA transactions
- PostgreSQL PREPARE TRANSACTION
- Google Spanner (2PC over Paxos)
- CockroachDB distributed transactions
- Microsoft DTC
- Kafka transactions (variant)
Common Mistakes with Two-Phase Commit (2PC)
- Not planning for coordinator failure. Running 2PC with a single coordinator is asking for trouble in production. Either make the coordinator replicated (Spanner approach) or implement a recovery protocol where participants can query each other to learn the outcome after a coordinator crash
- Holding locks during the prepare phase without a timeout. A participant that voted yes must hold its locks until it hears commit or abort. If the coordinator is slow, those locks can block other transactions for a long time. Set a timeout and abort if no decision arrives
- Using 2PC across services with different availability requirements. If your payment service is in 2PC with your notification service, a notification service outage blocks payments. Keep 2PC boundaries tight: within a database or between closely coupled shards, not across loosely coupled microservices
- Confusing 2PC with consensus protocols like Paxos or Raft. 2PC decides whether to commit a specific transaction. Paxos decides on a value in a replicated log. They solve different problems and are often used together (Spanner uses 2PC across Paxos groups)
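The lock-timeout advice can be sketched from the coordinator's side, where it is always safe: before the commit decision is logged, the coordinator is free to abort, so a slow or unresponsive vote simply becomes an abort. (Names are hypothetical; participants here are plain callables returning a bool vote or raising `TimeoutError`.)

```python
import time


def prepare_with_timeout(participants, txn, timeout_s):
    """Phase 1 with a deadline: a missing, slow, or negative vote aborts.

    Hedged sketch -- real coordinators poll participants asynchronously
    rather than iterating sequentially as done here.
    """
    deadline = time.monotonic() + timeout_s
    for p in participants:
        if time.monotonic() > deadline:
            return "abort"  # safe: no commit decision has been made yet
        try:
            if not p(txn):
                return "abort"
        except TimeoutError:
            return "abort"
    return "commit"
```

Note the asymmetry: a *participant* that already voted yes cannot abort unilaterally on timeout; its timeout should trigger a recovery query to the coordinator or its peers instead.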
Related to Two-Phase Commit (2PC)
Paxos Consensus, Raft Consensus Protocol, Write-Ahead Log
Vector Clocks & Version Vectors — Clocks & Ordering
Difficulty: Expert. Complexity: O(n) per comparison, n = nodes
Key Points for Vector Clocks & Version Vectors
- Lamport clocks tell you 'event A happened before event B' but cannot tell you whether two events are truly concurrent. Vector clocks can. If neither vector dominates the other (neither is component-wise less than or equal to the other), the events are concurrent, and you know there might be a conflict
- Each node maintains a vector of counters, one per node in the system. When a node does local work, it increments its own counter. When it sends a message, it attaches the full vector. When it receives a message, it takes the component-wise maximum of its vector and the incoming vector, then increments its own counter
- Amazon's Dynamo (the system described in the 2007 paper, not today's DynamoDB service), Riak, and Voldemort all used vector clocks (or close variants) for conflict detection in their eventually consistent replication models. When two replicas accept concurrent writes to the same key, the vector clocks reveal the conflict so the application can resolve it
- The size problem is real: the vector grows linearly with the number of nodes that have ever touched the data. In a system with thousands of nodes, each key carrying a vector of thousands of counters is expensive. Dotted version vectors and interval tree clocks were invented to address this
- Version vectors are technically different from vector clocks, though the terms are often used interchangeably. Vector clocks track causal ordering of events. Version vectors track causal ordering of object versions. The mechanics are nearly identical, but the semantics differ in subtle ways that matter when designing garbage collection and pruning
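The increment / merge / compare rules above fit in a short sketch; function names are illustrative, and the vector is a plain dict keyed by node ID:

```python
def tick(vc, node):
    """Local event: increment this node's own counter."""
    vc = dict(vc)
    vc[node] = vc.get(node, 0) + 1
    return vc


def receive(local, incoming, node):
    """On message receipt: component-wise max, then tick own counter."""
    merged = {n: max(local.get(n, 0), incoming.get(n, 0))
              for n in set(local) | set(incoming)}
    return tick(merged, node)


def compare(a, b):
    """Partial order: happens-before, happens-after, equal, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "happens-before"
    if b_le_a:
        return "happens-after"
    return "concurrent"
```

Two nodes that each do local work without communicating produce incomparable vectors, which is exactly the conflict signal an eventually consistent store needs.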
Production Systems Using Vector Clocks & Version Vectors
- Amazon Dynamo (vector clocks, per the 2007 paper)
- Riak (dotted version vectors)
- Voldemort (LinkedIn)
- CouchDB conflict detection
Common Mistakes with Vector Clocks & Version Vectors
- Confusing happens-before with wall-clock before. Vector clocks track causality, not real time. Event A can happen-before event B even if A occurred later in wall-clock time, as long as there is a causal chain from A to B
- Growing the vector indefinitely without pruning. In a system where nodes join and leave frequently, the vector accumulates entries for nodes that no longer exist. You need a pruning strategy (remove entries below a threshold, use node IDs that can be reused) or the metadata bloats over time
- Using vector clocks for ordering when you actually need a total order. Vector clocks give a partial order. Many events will be concurrent (incomparable). If your application needs a single definitive ordering of all events, you need something like Lamport timestamps plus a tiebreaker, or a centralized sequencer
- Not handling the conflict resolution at the application level. Vector clocks detect conflicts; they do not resolve them. If two concurrent writes happen, someone has to decide which one wins: last-writer-wins, merge, ask the user, or some domain-specific logic. Punting this decision to 'the system will handle it' leads to data loss
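When a total order is genuinely required, the usual trick mentioned above is a Lamport timestamp with the node ID as tiebreaker; a minimal sketch (event tuples are hypothetical):

```python
# Each event carries (lamport_timestamp, node_id, payload). Sorting on the
# (timestamp, node_id) pair yields one definitive order even for events
# that are causally concurrent.
events = [(3, "B", "write-x"), (3, "A", "write-y"), (1, "C", "write-z")]
ordered = sorted(events, key=lambda e: (e[0], e[1]))
# Ties (same timestamp) are broken arbitrarily but deterministically by
# node ID, so every replica computes the same order.
```

The tiebreaker is arbitrary with respect to causality, which is exactly the trade-off: you gain a total order but lose the ability to see which events were concurrent.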
Related to Vector Clocks & Version Vectors
Logical Clocks & Causality Tracking, CRDTs & Conflict-Free Replicated Data Types, Quorum Systems & NRW Notation
Write-Ahead Log — Storage Internals
Difficulty: Advanced. Complexity: O(1) append, sequential I/O
Key Points for Write-Ahead Log
- Every durable database writes intentions to a sequential log before touching actual data files. If the process crashes mid-write, the log contains everything needed to either finish the operation or roll it back cleanly
- Sequential I/O is the reason WAL works so well. SSDs can sustain 2-4 GB/s sequentially while random 4KB writes drop to 100-400 MB/s. On spinning disks, the gap is even more dramatic
- Group commit batches multiple transactions into a single fsync call, turning the per-write overhead of durable logging into amortized cost. PostgreSQL does this automatically. Throughput improvements of 10x or more are common
- The WAL doubles as a replication transport. PostgreSQL streams WAL segments to replicas. Raft uses the log as its core abstraction. Kafka is, at its heart, a distributed WAL with consumer offsets
- Checkpointing bounds recovery time by periodically flushing dirty pages to disk and advancing the WAL replay start position. Without checkpointing, crash recovery would replay every write since the database was created
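The write-path discipline above can be sketched in a few lines. This is an illustrative toy (class name and newline-delimited JSON format are assumptions, not any real database's format): append and fsync the intention first, apply it afterwards, and replay the log on restart:

```python
import json
import os


class WriteAheadLog:
    """Toy WAL: append-only file, one JSON record per line."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, record):
        # The intention must be durable BEFORE any data file is touched.
        self.f.write(json.dumps(record) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # this fsync is the durability point

    def replay(self):
        # Crash recovery: re-read every logged intention, in order.
        with open(self.path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

Group commit amounts to batching several `append` payloads behind a single `fsync`, which is why it can improve throughput by an order of magnitude without weakening durability.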
Production Systems Using Write-Ahead Log
- PostgreSQL WAL (pg_wal)
- MySQL InnoDB redo log
- RocksDB WAL
- Kafka commit log
- etcd Raft WAL
- SQLite WAL mode
- CockroachDB Pebble WAL
Common Mistakes with Write-Ahead Log
- Disabling fsync for benchmarks and forgetting to re-enable it in production. This is shockingly common and leads to silent data corruption on crash. Always verify fsync settings before any production deployment
- Not monitoring WAL disk usage. When the WAL disk fills up, most databases halt entirely rather than risk corruption. PostgreSQL will refuse new writes. Set alerts at 70% disk usage
- Ignoring WAL fsync latency during performance tuning. The WAL fsync is often the bottleneck for write-heavy workloads, not CPU or memory. Faster storage under the WAL directory can transform throughput
- Not configuring WAL archiving for point-in-time recovery. Without archived WAL segments, you can only restore to the last full backup, potentially losing hours of transactions
Related to Write-Ahead Log
LSM Trees vs B-Trees, Raft Consensus Protocol
ZSTD Compression (Zstandard) — Storage Internals
Difficulty: Intermediate. Complexity: O(n) compression, O(n) decompression, tunable ratio via levels 1-22
Key Points for ZSTD Compression (Zstandard)
- ZSTD combines LZ77-style dictionary matching (find repeated byte sequences, replace with back-references) with two entropy coders: Finite State Entropy (FSE) for match lengths and offsets, and Huffman coding for literal bytes. This three-stage pipeline achieves gzip-level compression ratios at LZ4-level speeds
- Compression levels 1-22 let you trade CPU time for compression ratio on the same data. Level 1 compresses at ~500 MB/s with ~2.9x ratio. Level 3 (default) hits ~400 MB/s at ~3.3x. Level 9 drops to ~80 MB/s but reaches ~3.8x. Decompression speed stays at ~1500 MB/s no matter which level was used to compress. That asymmetry matters for write-once-read-many workloads like cold storage, where you pay the compression cost once but decompress on every read
- Dictionary mode is ZSTD's secret weapon for small data. Standard compression needs enough bytes to learn patterns within each input. For small messages (1-10 KB, like Kafka records or log lines), there is not enough data to find patterns. Dictionary mode pre-trains a 100 KB dictionary on representative samples, then every message compresses against that shared context. Compression ratio improves 3-5x over standard mode for small inputs
- ZSTD is a general-purpose byte-level compressor. It sees no structure in the input, just bytes. Gorilla, by contrast, exploits time-series-specific patterns like regular timestamp intervals and slowly-changing float values. So Gorilla gets ~12x on sequential metric samples, while ZSTD gets ~3x on the same float64 arrays. Different tools for different tiers: Gorilla for hot-tier sequential samples, ZSTD for cold-tier aggregated data and Parquet blocks
- ZSTD replaced gzip as the default compressor in most modern data infrastructure: Apache Kafka (producer compression), Apache Parquet (column block compression), RocksDB (SST file compression), ClickHouse (column compression), Linux kernel (initramfs, btrfs). It compresses as well as gzip but decompresses 5x faster, and dictionary mode handles small messages that gzip struggles with. That is a hard combination to beat
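The level trade-off is easy to see empirically. The snippet below uses the stdlib `zlib` as a stand-in, since ZSTD's Python binding is the third-party `zstandard` package (whose `ZstdCompressor(level=...)` exposes the same knob); the principle — higher levels spend more CPU searching for matches and emit smaller output on compressible data — is the same:

```python
import zlib

# Repetitive, log-like data compresses well at any level.
data = b"timestamp=1700000000 status=200 path=/api/v1/metrics " * 2000

fast = zlib.compress(data, level=1)   # cheap, larger output
best = zlib.compress(data, level=9)   # expensive, smaller output

assert len(best) <= len(fast) < len(data)
# Decompression is a single code path regardless of the level that wrote
# the data, which is why read-heavy tiers pay no penalty for it.
assert zlib.decompress(fast) == zlib.decompress(best) == data
```

The same experiment with real ZSTD levels 1 through 22 shows the curve described above: ratio climbs slowly while compression speed falls off a cliff past the mid levels.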
Production Systems Using ZSTD Compression (Zstandard)
- Facebook (creator; developed by Yann Collet, released 2016)
- Apache Parquet (default column block compression)
- Apache Kafka (producer-side message compression)
- RocksDB (SST file compression, used by Flink state backend)
- ClickHouse (column compression codec)
- Linux kernel (btrfs filesystem, initramfs)
- VictoriaMetrics (cold-tier S3 Parquet export)
Common Mistakes with ZSTD Compression (Zstandard)
- Using high compression levels (15-22) in latency-sensitive paths. Level 19 compresses at ~5 MB/s, which is 100x slower than level 1. High levels only make sense for offline batch jobs like cold-tier compaction or archival. For real-time ingestion and Kafka messages, stick to levels 1-3
- Not using dictionary mode for small messages. Compressing 2 KB Kafka records with standard ZSTD gives ~1.5x ratio (barely worth the CPU). Training a dictionary on 1000 sample records and compressing with that dictionary gives ~4x. The dictionary is 100 KB of shared context that makes small-message compression viable
- Comparing ZSTD ratio to Gorilla ratio as if they do the same thing. Gorilla achieves 12x on time-series data because it exploits domain-specific structure (delta-of-delta timestamps, XOR-encoded values). ZSTD achieves ~3x on the same data because it only sees bytes. They solve different problems: Gorilla for hot-tier raw samples, ZSTD for cold-tier aggregated doubles in Parquet
- Ignoring decompression speed when choosing a compressor. ZSTD decompresses at ~1500 MB/s regardless of the compression level. gzip tops out at ~300 MB/s. On read-heavy workloads like cold-tier metric queries (write once, read many times), decompression speed matters more than compression speed, and that 5x gap is why ZSTD replaced gzip in Parquet and RocksDB
- Setting the same compression level across all data tiers. Hot-path Kafka messages need level 1 (fast, ~2.9x). Warm-tier Parquet blocks benefit from level 3-6 (~3.3-3.6x). Cold-tier archival can use level 9+ (~3.8x+). Each tier has a different latency budget, so each should use a different level
Related to ZSTD Compression (Zstandard)
Gorilla Compression (Delta-of-Delta + XOR), LSM Tree, Write-Ahead Log, Bloom Filters