Bloom Filters & Probabilistic Membership — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(k) per lookup, O(m) space
Key Points for Bloom Filters & Probabilistic Membership
- Answers 'is this element in the set?' with zero false negatives and a tunable false positive rate. If a Bloom filter says no, the element is definitely not there. If it says yes, the element is probably there
- A bit array of m bits with k independent hash functions. Insertion hashes the element k times and sets those k bit positions to 1. Lookup hashes the element k times and checks whether all k positions are 1
- False positive rate is approximately (1 - e^(-kn/m))^k where n is the number of inserted elements, m is the bit array size, and k is the number of hash functions. Tuning m and k for a target false positive rate is straightforward given expected n
- Standard Bloom filters do not support deletion. Removing an element would require clearing bits that might be shared with other elements. Counting Bloom filters replace each bit with a counter to enable deletion at the cost of 3-4x more memory
- Cassandra, Chrome Safe Browsing, RocksDB, PostgreSQL, and Bitcoin SPV nodes all use Bloom filters in production to avoid expensive lookups against data that is almost certainly not present
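The mechanics above fit in a few lines. A minimal sketch, assuming SHA-256 double hashing to derive the k positions and the standard sizing formulas m = -n·ln(p)/(ln 2)² and k = (m/n)·ln 2 (class and method names are illustrative, not from any particular library):

```python
import hashlib
import math

class BloomFilter:
    """Minimal sketch: m bits, k positions per item via double hashing."""
    def __init__(self, expected_n, fp_rate):
        # Standard sizing: m = -n*ln(p)/(ln 2)^2, k = (m/n)*ln 2
        self.m = max(1, int(-expected_n * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / expected_n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Two 64-bit halves of one SHA-256 digest simulate k independent hashes
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << pos % 8

    def might_contain(self, item):
        return all(self.bits[pos // 8] >> pos % 8 & 1
                   for pos in self._positions(item))

bf = BloomFilter(expected_n=1000, fp_rate=0.01)
bf.add("alice")
bf.add("bob")
print(bf.might_contain("alice"))   # True: a member is never reported absent
print(bf.might_contain("carol"))   # almost certainly False
```

The double-hashing trick (h1 + i·h2) is a common shortcut for simulating k independent hash functions from one digest.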
Production Systems Using Bloom Filters & Probabilistic Membership
- Cassandra SSTable lookup
- Chrome Safe Browsing
- RocksDB block-based filter
- PostgreSQL bloom index
- Akamai cache lookup
- Bitcoin SPV nodes
Common Mistakes with Bloom Filters & Probabilistic Membership
- Not sizing the filter correctly for the expected number of elements. An undersized Bloom filter degrades gracefully in theory but catastrophically in practice: once the bit array fills past the optimal ~50% of bits set, false positive rates spike well above your design target
- Assuming you can delete elements from a standard Bloom filter. Clearing a bit that belongs to multiple elements corrupts the filter silently. If you need deletion, use a Counting Bloom filter or a Cuckoo filter from the start
- Using a Bloom filter where false positives affect correctness, not just performance. A Bloom filter as a negative cache is fine (false positive means an unnecessary lookup). A Bloom filter deciding whether to charge a customer is not fine
- Ignoring hash function quality. Bloom filters need hash functions that produce independent, uniformly distributed outputs. Using a weak hash function creates correlated bit positions and inflates your actual false positive rate well beyond the theoretical bound
Related to Bloom Filters & Probabilistic Membership
HyperLogLog & Cardinality Estimation, Count-Min Sketch & Frequency Estimation
BM25 — Best Matching 25 — Search & Retrieval
Difficulty: Advanced. Complexity: O(q × avgPostingLen)
Key Points for BM25 — Best Matching 25
- Ranks documents by combining term frequency saturation, inverse document frequency, and document length normalization. It fixes the biggest problem with raw TF-IDF: long documents no longer dominate results, and repeating a word 50 times does not give you 50x the score
- Two parameters control everything. k1 (typically 1.2-2.0) sets how quickly term frequency saturates. b (typically 0.75) controls how much document length penalizes the score. Setting b=0 turns off length normalization entirely
- Elasticsearch switched from TF-IDF to BM25 as the default scoring function in version 5.0 (Lucene 6.0). The reason was simple: BM25 consistently beat TF-IDF across every standard information retrieval benchmark they tested
- BM25 scores are decomposable per query term, which enables aggressive query-time optimizations like WAND and Block-Max WAND. These skip large chunks of posting lists that cannot possibly produce top-k results
- BM25 is a lexical matching algorithm. It cannot match 'automobile' to 'car' unless you add synonyms explicitly. Modern search systems combine BM25 with vector embeddings in a hybrid approach to get the best of both worlds
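The scoring formula from the first two bullets can be sketched directly. A minimal version using the smoothed Robertson/Lucene IDF form (function name and argument layout are illustrative):

```python
import math

def bm25_score(query_terms, doc_tf, doc_len, avg_doc_len, df, n_docs,
               k1=1.2, b=0.75):
    """Score one document for a query.

    doc_tf: term -> frequency within this document
    df:     term -> number of documents containing the term
    """
    score = 0.0
    for term in query_terms:
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue
        # IDF with +0.5 smoothing (the Robertson/Lucene form, always >= 0)
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Saturating TF, penalized by relative document length (b controls it)
        norm = k1 * (1 - b + b * doc_len / avg_doc_len)
        score += idf * tf * (k1 + 1) / (tf + norm)
    return score

df, n_docs = {"search": 2}, 3
short = bm25_score(["search"], {"search": 2}, 10, 50, df, n_docs)
longer = bm25_score(["search"], {"search": 2}, 200, 50, df, n_docs)
print(short > longer)   # True: same tf, but the longer document is penalized
```

Setting b=0 makes `norm` collapse to k1, so document length drops out of the score entirely, matching the second bullet.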
Production Systems Using BM25 — Best Matching 25
- Elasticsearch / OpenSearch
- Apache Lucene / Solr
- SQLite FTS5
- PostgreSQL tsvector ranking
- Typesense
- Tantivy (Rust search engine)
Common Mistakes with BM25 — Best Matching 25
- Running with default k1 and b without tuning for your actual data. Short product titles need very different parameters than long-form blog posts. Always run relevance evaluation on a representative query set before shipping
- Ignoring how tokenization and stemming affect BM25 scores. If your indexing analyzer stems 'running' to 'run' but your query analyzer does not, you get zero matches on a query that should return thousands of results
- Treating BM25 scores as absolute relevance measures. A score of 12.5 means nothing on its own. BM25 scores are only meaningful for ranking documents against each other within the same query
- Not accounting for multi-field scoring. A title match should almost always score higher than a body match, but BM25 does not know about fields by itself. Elasticsearch applies BM25 per field and combines them, and those field boost weights need careful thought
Related to BM25 — Best Matching 25
Inverted Index, HyperLogLog & Cardinality Estimation, Bloom Filters & Probabilistic Membership
Chord & Distributed Hash Tables — Data Distribution
Difficulty: Advanced. Complexity: O(log n) lookup with finger table
Key Points for Chord & Distributed Hash Tables
- A Distributed Hash Table (DHT) maps keys to nodes in a decentralized network with no central coordinator. Any node can look up any key by routing through O(log n) intermediate nodes, even in a network with millions of participants
- Chord places both keys and nodes on a circular ID space (ring) of size 2^m. Each key is stored on its successor node, the first node whose ID is equal to or follows the key's ID clockwise on the ring
- Finger tables make lookups fast. Each node maintains a table of O(log n) entries pointing to nodes at exponentially increasing distances around the ring. Instead of hopping one node at a time, lookups jump halfway to the target, then a quarter, then an eighth
- Kademlia (used in BitTorrent and IPFS) uses XOR distance instead of clockwise distance. This makes routing symmetric (if A is close to B, then B is close to A) and enables parallel lookups where a node queries multiple peers simultaneously
- Node joins and departures require updating finger tables and transferring keys. Chord's stabilization protocol runs periodically to fix inconsistencies, but there is a window where lookups can fail during churn. Production systems add replication on top to handle this
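The successor rule and finger table from the bullets above, sketched on a toy 16-bit ring. A real Chord node knows only its fingers, not the full membership; this sketch cheats by sorting all node IDs to keep the routing logic visible:

```python
import hashlib

M = 16            # bits in the ID space
RING = 2 ** M     # ring size 2^M

def node_id(name):
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % RING

def successor(key, nodes):
    """First node whose ID is >= key, wrapping around the ring."""
    for nid in sorted(nodes):
        if nid >= key:
            return nid
    return min(nodes)   # wrapped past the highest ID

def finger_table(n, nodes):
    """Finger i points at successor(n + 2^i): exponentially spaced shortcuts."""
    return [successor((n + 2 ** i) % RING, nodes) for i in range(M)]

nodes = {node_id(f"node-{i}") for i in range(8)}
key = node_id("my-key")
print(f"key {key} is stored on node {successor(key, nodes)}")
```

Each lookup hop uses the highest finger that does not overshoot the key, which is what halves the remaining distance and yields O(log n) hops.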
Production Systems Using Chord & Distributed Hash Tables
- BitTorrent (Mainline DHT / Kademlia)
- IPFS (Kademlia variant)
- Ethereum (Kademlia-based discovery)
- Coral CDN (DSHT)
- Amazon Dynamo (influenced by Chord)
- Apache Cassandra (ring concept)
Common Mistakes with Chord & Distributed Hash Tables
- Confusing DHTs with consistent hashing for server-side load balancing. Consistent hashing (like what Cassandra uses) assumes a central view of the membership. DHTs work when no single node knows the full membership. They solve different problems
- Assuming DHT lookups are fast in absolute terms. O(log n) hops across the internet means O(log n) network round trips, each adding 50-200ms of latency. A lookup in a 10,000-node DHT might take 500-2000ms. This is fine for file sharing, terrible for database queries
- Not planning for churn. In peer-to-peer networks, nodes join and leave constantly. Each departure can orphan keys if replication is insufficient. Production DHTs replicate keys across multiple successor nodes and run background repair processes
- Ignoring Sybil attacks. In an open DHT, an attacker can create thousands of nodes and position them strategically to intercept or censor specific keys. Defending against this requires proof-of-work, social trust graphs, or other identity mechanisms
Related to Chord & Distributed Hash Tables
Consistent Hashing & Ring Algorithms, Gossip Protocols & SWIM, Merkle Trees & Anti-Entropy Repair
Circuit Breakers & Bulkheads — Failure Detection & Quorums
Difficulty: Intermediate. Complexity: O(1) per request
Key Points for Circuit Breakers & Bulkheads
- A circuit breaker monitors calls to a downstream service and trips open when failures exceed a threshold. Once open, requests fail immediately without attempting the call. This prevents a failing service from dragging down its callers and cascading the failure across the entire system
- Three states: Closed (normal, requests flow through), Open (tripped, requests fail fast), Half-Open (testing, a limited number of requests probe the downstream to check if it has recovered). The transition from open to half-open happens after a configurable timeout
- Bulkheads isolate components so that a failure in one does not exhaust shared resources and take down others. Named after ship bulkheads that contain flooding to one compartment. In practice, this means separate thread pools, connection pools, or rate limits per downstream dependency
- Netflix Hystrix popularized both patterns but is now in maintenance mode. Modern alternatives include Resilience4j (Java), Polly (.NET), Envoy proxy (sidecar-based), and Istio (service mesh-level). The concepts are the same regardless of the implementation
- Circuit breakers and bulkheads are defense mechanisms, not solutions. They buy you time to recover. The actual fix is always upstream: retry with backoff, degrade gracefully, serve cached data, or alert an operator. A circuit breaker that trips and stays open forever is just a permanent outage with extra steps
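A toy version of the three-state machine described above, using a percentage-over-window trip condition (all parameter names are illustrative; libraries like Resilience4j expose equivalent knobs):

```python
import collections
import time

class CircuitBreaker:
    """Toy three-state breaker: closed -> open -> half-open -> closed."""
    def __init__(self, failure_rate=0.5, window=20, open_seconds=30.0,
                 probes=3, clock=time.monotonic):
        self.failure_rate = failure_rate           # trip at this failure fraction
        self.window = collections.deque(maxlen=window)
        self.open_seconds = open_seconds           # how long to stay open
        self.probes = probes                       # test calls allowed half-open
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.probe_results = []

    def allow(self):
        """Should the next call be attempted?"""
        if self.state == "open":
            if self.clock() - self.opened_at >= self.open_seconds:
                self.state = "half-open"           # timeout elapsed: start probing
                self.probe_results = []
            else:
                return False                       # fail fast
        if self.state == "half-open":
            return len(self.probe_results) < self.probes
        return True

    def record(self, success):
        """Report the outcome of an attempted call."""
        if self.state == "half-open":
            self.probe_results.append(success)
            if not success:
                self._trip()                       # probe failed: back to open
            elif len(self.probe_results) >= self.probes:
                self.state = "closed"              # all probes passed
                self.window.clear()
            return
        self.window.append(success)
        full = len(self.window) == self.window.maxlen
        if full and self.window.count(False) / len(self.window) >= self.failure_rate:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()

# Simulated clock so the open timeout is observable without sleeping
now = [0.0]
cb = CircuitBreaker(window=10, open_seconds=30.0, clock=lambda: now[0])
for _ in range(10):
    cb.record(False)       # downstream failing
print(cb.state)            # open
now[0] = 31.0              # wait out the open timeout
print(cb.allow())          # True: a half-open probe is permitted
```

Injecting the clock keeps the state machine testable; production libraries do the same internally.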
Production Systems Using Circuit Breakers & Bulkheads
- Netflix (Hystrix, now Resilience4j)
- Envoy Proxy (outlier detection)
- Istio service mesh
- AWS App Mesh
- Spring Cloud Circuit Breaker
- Polly (.NET resilience library)
Common Mistakes with Circuit Breakers & Bulkheads
- Setting the failure threshold too low. If your circuit breaker trips after 3 failures in 60 seconds, a single burst of network timeouts will open it even though the downstream service is fundamentally healthy. Use a percentage-based threshold (e.g., 50% failure rate over the last 100 calls) rather than an absolute count
- Not distinguishing between failure types. A 500 error from the downstream usually means the service is struggling. A 400 error means your request was bad. Counting 400s toward the circuit breaker threshold will trip it when your own requests are malformed, not when the downstream is failing
- Forgetting the half-open state. A circuit breaker that goes from open to closed without testing first floods the recovering service with full traffic. The half-open state lets through a small number of test requests. Only if they succeed does the breaker close
- Using circuit breakers for services that must not be skipped. If your payment processor is down and you cannot complete the checkout, a circuit breaker that fails fast is the right behavior. But if you silently skip the payment step and ship the order for free, that is worse than being slow
Related to Circuit Breakers & Bulkheads
Failure Detection Algorithms, Gossip Protocols & SWIM, Quorum Systems & NRW Notation
Consistent Hashing & Ring Algorithms — Data Distribution
Difficulty: Advanced. Complexity: O(log N) lookup with ring
Key Points for Consistent Hashing & Ring Algorithms
- Adding or removing a server with modulo hashing (hash(key) % N) remaps nearly every key. With 100 servers, adding one server moves about 99% of your data. Consistent hashing remaps only 1/N of keys on average
- Virtual nodes smooth out load distribution by giving each physical server 100-200 positions on the hash ring. Without them, servers can own wildly unequal segments due to hash clustering
- Jump Hash from Google achieves perfect uniformity with zero memory overhead, but only supports append-only scaling. You can add bucket N+1 but you cannot remove bucket 5
- Rendezvous hashing (highest random weight) needs no ring data structure at all. For each key, hash against every server and pick the highest. Clean and elegant, but O(N) per lookup limits it to smaller clusters
- Maglev hashing builds a lookup table for O(1) queries after construction. Google uses it in their network load balancers where per-packet latency matters
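A minimal ring with virtual nodes, demonstrating the 1/N remapping property from the first bullet (MD5 and 150 vnodes are illustrative choices, not a recommendation):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring with virtual nodes per physical server."""
    def __init__(self, servers, vnodes=150):
        self.ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers for i in range(vnodes)
        )
        self.points = [p for p, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def lookup(self, key):
        # First ring point clockwise from the key's hash, wrapping past the end
        i = bisect.bisect(self.points, self._hash(key)) % len(self.ring)
        return self.ring[i][1]

ring_a = HashRing(["s1", "s2", "s3", "s4"])
ring_b = HashRing(["s1", "s2", "s3", "s4", "s5"])   # one server added
keys = [f"key-{i}" for i in range(10_000)]
moved = sum(ring_a.lookup(k) != ring_b.lookup(k) for k in keys)
print(f"{moved / len(keys):.1%} moved")   # ~1/5 of keys, not ~100% as with modulo
```

Every key that moves, moves to the new server: adding points on the ring only steals keys for those points, which is exactly the 1/N property.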
Production Systems Using Consistent Hashing & Ring Algorithms
- Cassandra (token ring with virtual nodes)
- DynamoDB (consistent hashing with partition map)
- Memcached clients (ketama algorithm)
- Akamai CDN (original consistent hashing paper)
- Nginx upstream consistent hashing
- HAProxy consistent hashing
Common Mistakes with Consistent Hashing & Ring Algorithms
- Using too few virtual nodes. With 10 virtual nodes per server, load imbalance can exceed 2x between the most-loaded and least-loaded servers. At 150+ virtual nodes, variance drops below 10%
- Giving identical virtual node counts to servers with different capacities. A 64GB machine and a 16GB machine should not own the same fraction of the ring
- Reaching for consistent hashing when the server set is small and stable. With 5 servers that never change, modulo hashing is faster, simpler, and perfectly uniform
- Not monitoring actual key distribution after deployment. Hash functions can cluster on real-world key patterns even when they look uniform on random inputs
Related to Consistent Hashing & Ring Algorithms
Merkle Trees & Anti-Entropy Repair, Quorum Systems & NRW Notation
Contraction Hierarchies — Geospatial Algorithms
Difficulty: Advanced. Complexity: Preprocessing: hours (500M nodes). Query: < 1ms (~500 nodes explored)
Key Points for Contraction Hierarchies
- CH works by contracting nodes in order of increasing importance. Contracting a node means removing it from the graph and adding shortcut edges between its neighbors if the shortest path between those neighbors went through the removed node. The 'witness path' check determines whether a shortcut is needed: if an alternative path exists that is equally short or shorter, no shortcut is added.
- After preprocessing, every shortest path in the graph can be decomposed into an upward segment (source to a high-importance node) followed by a downward segment (high-importance node to destination). This is the hierarchical property. A query runs bidirectional search: forward from source going only upward, backward from destination going only upward. They meet at the highest-importance node on the path.
- The query explores approximately 500 nodes total, regardless of whether the route is 5 km or 5,000 km. This is because both searches climb the hierarchy quickly: residential streets to arterials to highways to motorways. The hierarchy converges at a small number of high-importance nodes (major highway interchanges) that form the backbone of long-distance routing.
- Customizable Contraction Hierarchies (CCH) solve the dynamic weight problem. CCH separates the hierarchy structure (node ordering, which shortcuts exist) from the edge weights. The structure is preprocessed once based on graph topology. When traffic changes edge weights, only the shortcut weights need recalculation. A bottom-up propagation updates all affected shortcuts in ~2 seconds for the full graph, or ~100ms for localized changes. This is how Google Maps reflects live traffic without re-running 4-6 hour preprocessing.
- Node importance ordering is the most critical preprocessing decision. The standard heuristic considers: edge difference (how many shortcuts would be added vs edges removed), contracted neighbors (avoid creating shortcut chains), and original edge count. Google's variant also incorporates traffic volume: high-traffic corridors get higher importance, which improves query performance for popular routes.
- The preprocessed graph is deployed as a memory-mapped binary file. Each routing server loads the file via mmap. The OS handles paging: frequently queried regions (urban areas) stay in RAM, rarely queried regions (rural) stay on disk until accessed. No deserialization step on server startup. A new graph version is deployed by writing a new file to S3 and swapping the mmap pointer, achieving zero-downtime deployment.
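The contraction step with its witness-path check can be sketched on a toy graph. This ignores node ordering, shortcut unpacking, and the bidirectional upward query; it only shows when a shortcut is or is not added (function names are illustrative):

```python
import heapq

def dijkstra(graph, src, targets, exclude, cutoff):
    """Shortest distances from src to `targets`, never visiting `exclude`,
    abandoning paths longer than `cutoff` (the witness search)."""
    dist, found = {src: 0.0}, {}
    pq = [(0.0, src)]
    while pq and len(found) < len(targets):
        d, node = heapq.heappop(pq)
        if d > cutoff or d > dist.get(node, float("inf")):
            continue
        if node in targets:
            found[node] = d
        for nbr, w in graph.get(node, {}).items():
            if nbr == exclude:
                continue
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(pq, (nd, nbr))
    return found

def contract(graph, v):
    """Contract v: for each in-neighbor u and out-neighbor w, add shortcut
    (u, w) only if no witness path of equal or shorter length avoids v."""
    ins = [(u, graph[u][v]) for u in graph if v in graph.get(u, {})]
    outs = list(graph.get(v, {}).items())
    shortcuts = []
    for u, w_in in ins:
        for w, w_out in outs:
            if u == w:
                continue
            via_v = w_in + w_out
            witness = dijkstra(graph, u, {w}, exclude=v, cutoff=via_v)
            if witness.get(w, float("inf")) > via_v:
                shortcuts.append((u, w, via_v))   # no witness: shortcut needed
    graph.pop(v, None)                            # remove v from the graph...
    for u in graph:
        graph[u].pop(v, None)
    for u, w, d in shortcuts:                     # ...and install the shortcuts
        cur = graph.setdefault(u, {}).get(w, float("inf"))
        graph[u][w] = min(cur, d)
    return shortcuts

g = {"a": {"v": 1.0}, "v": {"b": 1.0}, "b": {}}
print(contract(g, "v"))   # [('a', 'b', 2.0)]: the only a->b path ran through v
```

If a direct edge a→b of weight 1.5 existed, the witness search would find it and no shortcut would be created, per the equal-or-shorter rule above.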
Production Systems Using Contraction Hierarchies
- Google Maps
- Apple Maps
- OSRM
- Valhalla (Mapbox)
- Uber (ETA computation)
Common Mistakes with Contraction Hierarchies
- Running Dijkstra or A* on the full graph at scale. On a 500M-node road network, Dijkstra explores ~2 million nodes per query (~2 seconds). At 58,000 queries/sec, that is 116 billion node visits per second. CH reduces this to ~500 nodes per query, making the same workload feasible on 6-18 servers.
- Treating the CH graph as immutable between weekly rebuilds. Traffic changes edge weights constantly. Without CCH weight updates, the routing engine uses stale weights and produces suboptimal routes. The fix: run CCH weight propagation every 30 seconds from live traffic data in Redis.
- Underestimating shortcut memory. Contracting a 500M-node graph can produce 1.5-2B shortcut edges on top of 1.2B original edges. Each shortcut stores a target node (8 bytes), weight (4 bytes), and middle node for unpacking (8 bytes). That is 30-40 GB of shortcuts alone. Regional sharding (5-12 GB per shard) keeps per-server memory manageable.
- Reprocessing the full graph for every change. A single new road does not require 4-6 hours of CH preprocessing. Minor topology changes can be handled by temporarily marking edges as blocked or open via CCH weight overrides (set weight to infinity for closed roads, normal weight for opened roads). Reserve full reprocessing for when cumulative topology changes exceed a threshold.
- Ignoring cross-region routing. When the graph is sharded by geographic region, routes that cross shard boundaries need a stitching mechanism. The transit node overlay pattern precomputes shortest-path distances between border nodes across regions. A cross-region route becomes 3 CH queries (origin shard + overlay + destination shard), adding ~3-5ms total.
Related to Contraction Hierarchies
Google S2 Geometry, Consistent Hashing & Ring Algorithms
Count-Min Sketch & Frequency Estimation — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(d) per update, O(w*d) space
Key Points for Count-Min Sketch & Frequency Estimation
- Estimates how often an element appears in a data stream using a fixed amount of memory regardless of stream length. The structure never needs to know the universe of possible elements in advance
- Uses d independent hash functions mapping elements to d rows of w counters each. Insertion increments d counters, query returns the minimum of d counters. The minimum is the least-overcounted estimate
- Overestimates frequency by at most epsilon * N with probability at least 1 - delta, where w = ceil(e/epsilon) and d = ceil(ln(1/delta)). There are no underestimates
- Conservative update optimization only increments counters that equal the current minimum, significantly reducing overestimation for frequent items at the cost of losing the strict mergeability property
- Used in production at Cisco, Apache Flink, Redis (RedisBloom Top-K), AT&T, and Twitter for network monitoring, streaming analytics, fraud detection, and trending topic detection
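A minimal sketch of the structure described above, with w and d derived from the epsilon/delta formulas (row-salted SHA-256 is an illustrative stand-in for d independent hash functions):

```python
import hashlib
import math

class CountMinSketch:
    def __init__(self, epsilon=0.001, delta=0.01):
        # Guarantee: estimate <= true + epsilon*N with probability >= 1 - delta
        self.w = math.ceil(math.e / epsilon)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]

    def _cols(self, item):
        # One column per row, derived from a row-salted digest
        for row in range(self.d):
            digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def estimate(self, item):
        # Minimum across rows: the least-overcounted counter, never an undercount
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch(epsilon=0.001, delta=0.01)   # w = 2719, d = 5
for i in range(10_000):
    cms.add(f"user-{i % 100}")    # 100 items, 100 occurrences each
cms.add("whale", 5_000)           # one heavy hitter
print(cms.estimate("whale"))      # >= 5000, very likely exactly 5000 here
```

With N = 15,000 total events, the error bound is epsilon·N = 15, which illustrates the point below about thinking in absolute rather than relative terms.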
Production Systems Using Count-Min Sketch & Frequency Estimation
- Cisco network monitoring
- Apache Flink streaming
- Redis Top-K (RedisBloom)
- AT&T fraud detection
- Google Sawzall
- Twitter trending topics
Common Mistakes with Count-Min Sketch & Frequency Estimation
- Using Count-Min Sketch for low-frequency items where the noise from hash collisions dominates the actual count. If an item appears 3 times and the overcounting noise is 50, your estimate is useless. CMS works best for heavy hitters
- Not tuning width and depth for the actual workload. Default parameters from a library might target epsilon=0.01 and delta=0.01 (width 272, depth 5), but your stream might have 10 billion events, making the absolute error bound 100 million. Think in absolute terms, not just relative
- Treating frequency estimates as exact for billing or compliance. CMS always overcounts. An estimate of 1,247 means the true count is at most 1,247 and at least something lower. If you are charging per event, that overcount becomes a financial dispute
- Forgetting that conservative update breaks mergeability. If you need to merge sketches from different stream partitions, you must use standard update, not conservative update. Choose one or the other based on your architecture
Related to Count-Min Sketch & Frequency Estimation
HyperLogLog & Cardinality Estimation, Bloom Filters & Probabilistic Membership, Streaming Quantiles: T-Digest & DDSketch
CRDTs: Conflict-Free Replicated Data Types — Consensus & Coordination
Difficulty: Expert. Complexity: O(1) per op, varies by type
Key Points for CRDTs: Conflict-Free Replicated Data Types
- CRDTs let multiple replicas accept writes concurrently without any coordination and merge them deterministically. The merge function must be commutative, associative, and idempotent, which guarantees convergence regardless of message ordering or duplication
- State-based CRDTs ship the full state and merge with a join operation. Operation-based CRDTs ship individual operations that must be commutative. State-based is simpler to implement, op-based is more bandwidth-efficient at scale
- The OR-Set (observed-remove set) is the most practically useful CRDT for application developers. It handles concurrent add and remove by tagging each add with a unique identifier, so concurrent add-wins semantics are unambiguous
- Redis Enterprise, Riak, Automerge, and Figma all use CRDTs in production for active-active geo-replication or real-time collaborative editing
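A toy state-based OR-Set showing the add-wins behavior described above, merged by unioning tag sets (the uuid tag scheme is illustrative):

```python
import copy
import uuid

class ORSet:
    """Observed-remove set: every add gets a unique tag, and a remove
    tombstones only the tags it has observed, so a concurrent add wins."""
    def __init__(self):
        self.adds = {}      # element -> set of unique add tags
        self.removes = {}   # element -> set of tombstoned tags

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        observed = self.adds.get(element, set())
        self.removes.setdefault(element, set()).update(observed)

    def contains(self, element):
        live = self.adds.get(element, set()) - self.removes.get(element, set())
        return bool(live)

    def merge(self, other):
        """Join: union both tag maps. Commutative, associative, idempotent."""
        for el, tags in other.adds.items():
            self.adds.setdefault(el, set()).update(tags)
        for el, tags in other.removes.items():
            self.removes.setdefault(el, set()).update(tags)

a = ORSet()
a.add("milk")
b = copy.deepcopy(a)       # replicate the synced state
a.remove("milk")           # replica A removes...
b.add("milk")              # ...while replica B concurrently re-adds
a.merge(b)
print(a.contains("milk"))  # True: B's new tag was never seen by A's remove
```

Note the tombstone sets only ever grow, which is exactly the metadata-accumulation problem called out in the Common Mistakes below.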
Production Systems Using CRDTs: Conflict-Free Replicated Data Types
- Redis Enterprise CRDT (active-active geo-distribution)
- Riak (OR-Set, counters)
- Automerge (collaborative editing)
- Figma (multiplayer design)
- SoundCloud Roshi (LWW-Set)
- Apple Notes (collaborative editing)
Common Mistakes with CRDTs: Conflict-Free Replicated Data Types
- Choosing LWW (last-writer-wins) when concurrent updates carry semantic meaning. If two users update a shopping cart simultaneously, LWW silently drops one update. An OR-Set or counter would preserve both
- Ignoring the metadata overhead of OR-Sets. Each element carries a set of unique tags (one per concurrent add), and removed elements leave tombstones. Without garbage collection, metadata grows without bound
- Assuming CRDTs solve all conflict problems. CRDTs solve a specific mathematical class of conflicts where a commutative merge exists. Many business logic conflicts (like double-booking a hotel room) have no meaningful automatic resolution
- Not implementing garbage collection for tombstones. Deleted elements in OR-Sets leave metadata behind. Over weeks or months of operation, tombstone accumulation can consume more memory than the live data
Related to CRDTs: Conflict-Free Replicated Data Types
Operational Transformation, Logical Clocks & Causality Tracking, Gossip Protocols & SWIM, Quorum Systems & NRW Notation
CRUSH: Topology-Aware Data Placement — Data Distribution
Difficulty: Advanced. Complexity: O(log N) per placement, no central lookup
Key Points for CRUSH: Topology-Aware Data Placement
- CRUSH replaces the question 'where IS my data?' with 'where SHOULD my data be?' Any node can compute the answer locally using the same algorithm and cluster map, with no central placement database and no network call
- The algorithm walks a hierarchy (region → AZ → rack → node) selecting one node at each level using a pseudo-random function seeded by the placement group ID. This makes it deterministic: same PG ID + same cluster map = same node selection every time
- Failure domain awareness is built in. CRUSH rules enforce constraints like 'no two shards on the same rack' or 'max 5 shards per AZ'. This is what prevents a single rack power failure from losing too many shards of the same object
- When a node is added or removed, CRUSH recomputes placement and only the placement groups that were on the changed node need to move. With 100K PGs and 20K nodes, adding one node moves roughly 5 PGs worth of data, not a full reshuffle
- Weights control how much data each node gets. A node with 36TB drives gets 3x the data of a node with 12TB drives. This handles heterogeneous hardware without manual balancing
- CRUSH is not a storage system, not a replication protocol, not a metadata store. It is purely a placement function: input = PG ID + cluster topology, output = list of N nodes
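A toy placement function in the spirit of CRUSH: a deterministic hierarchy walk with rack-level failure-domain exclusion. The weighted-draw formula here is a simplification for illustration, not Ceph's actual straw2 bucket math:

```python
import hashlib

def bucket_choose(pg_id, replica, children, weights):
    """Deterministically pick one child: each child gets a hash-derived draw,
    biased by its weight, and the highest draw wins (simplified straw-style)."""
    best, best_draw = None, -1.0
    for child in children:
        h = hashlib.sha256(f"{pg_id}:{replica}:{child}".encode()).digest()
        u = int.from_bytes(h[:8], "big") / 2**64        # uniform in [0, 1)
        draw = u ** (1.0 / weights.get(child, 1.0))     # higher weight -> higher draw
        if draw > best_draw:
            best, best_draw = child, draw
    return best

topology = {
    "root":  ["rack1", "rack2", "rack3"],
    "rack1": ["node-a", "node-b"],
    "rack2": ["node-c", "node-d"],
    "rack3": ["node-e", "node-f"],
}

def place(pg_id, replicas, topology, weights=None):
    """Walk the hierarchy per replica, never reusing a rack (failure domain)."""
    weights = weights or {}
    chosen_racks, nodes = [], []
    for r in range(replicas):
        candidates = [x for x in topology["root"] if x not in chosen_racks]
        rack = bucket_choose(pg_id, r, candidates, weights)
        chosen_racks.append(rack)
        nodes.append(bucket_choose(pg_id, r, topology[rack], weights))
    return nodes

print(place(42, 2, topology))   # every node computes the same answer, no lookup
```

Same PG ID plus same topology always yields the same nodes, which is the "where SHOULD my data be" property from the first bullet.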
Production Systems Using CRUSH: Topology-Aware Data Placement
- Ceph (RADOS, the original CRUSH implementation)
- Object storage systems with placement groups
- Distributed filesystems that need rack-aware placement
- Any system that needs topology-aware shard distribution without a central lookup
Common Mistakes with CRUSH: Topology-Aware Data Placement
- Confusing CRUSH with consistent hashing. Consistent hashing maps keys to positions on a ring. CRUSH walks a topology tree with failure domain awareness. Consistent hashing has no concept of racks or AZs
- Not defining CRUSH rules for your failure domains. Default rules spread data across hosts but not across racks. If two replicas land on the same rack and the rack loses power, both copies are gone
- Ignoring weight rebalancing when adding nodes with different disk sizes. A new node with 20TB drives gets the same default weight as an old node with 8TB drives, leading to uneven utilization
- Making the topology tree too deep. Every level in the hierarchy adds a random selection step. Region → AZ → rack → node is fine. Adding sub-rack and sub-node levels increases computation and makes debugging placement harder
Related to CRUSH: Topology-Aware Data Placement
Consistent Hashing & Ring Algorithms, Erasure Coding & Reed-Solomon, Quorum Systems & NRW Notation
Cuckoo Hashing & Cuckoo Filters — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(1) lookup, O(1) amortized insert
Key Points for Cuckoo Hashing & Cuckoo Filters
- Cuckoo filters store fingerprints of elements in a cuckoo hash table, enabling both membership queries and deletion. This is the key advantage over Bloom filters, which cannot delete elements without switching to a counting variant that uses 3-4x more memory
- Each element maps to two candidate buckets via two hash functions. Lookup checks both buckets for a matching fingerprint. If found in either, the element is probably present. If not found in either, it is definitely not present
- Cuckoo filters become more space-efficient than Bloom filters as the target false positive rate drops; the original paper puts the crossover around 3% (with its semi-sorting optimization). A Bloom filter needs roughly 1.44 * log2(1/FPR) bits per element, while a 4-way-bucket cuckoo filter needs roughly (log2(1/FPR) + 3)/0.95, so the cuckoo filter's advantage widens at lower FPRs
- The maximum load factor is about 95% with 4-way buckets. Beyond that, insertions start failing because displacement chains cannot find empty slots. Bloom filters degrade gracefully when overfull; cuckoo filters hit a wall
- DPDK uses cuckoo hashing for high-performance packet lookup tables. RocksDB explored cuckoo table format as an alternative to block-based tables for in-memory workloads with pure point lookups
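A toy partial-key cuckoo filter with 8-bit fingerprints and 4-slot buckets. The bucket count must be a power of two so the XOR trick for the alternate index is an involution; hash choices and names are illustrative:

```python
import hashlib
import random

class CuckooFilter:
    """Sketch: fingerprints in a 2-choice cuckoo table with eviction."""
    def __init__(self, n_buckets=1024, slots=4, max_kicks=500):
        self.n = n_buckets                 # must be a power of two
        self.slots = slots
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(n_buckets)]

    def _fingerprint(self, item):
        fp = hashlib.sha256(item.encode()).digest()[0]
        return fp or 1                     # reserve 0 (never stored)

    def _index(self, item):
        h = hashlib.sha256(b"idx" + item.encode()).digest()
        return int.from_bytes(h[:4], "big") % self.n

    def _alt_index(self, i, fp):
        # Partial-key cuckoo: the other bucket is derivable from (index, fp)
        h = int.from_bytes(hashlib.sha256(bytes([fp])).digest()[:4], "big") % self.n
        return i ^ h                       # involution when n is a power of two

    def insert(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        i = random.choice((i1, i2))        # both full: start evicting
        for _ in range(self.max_kicks):
            j = random.randrange(len(self.buckets[i]))
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt_index(i, fp)     # relocate the evicted fingerprint
            if len(self.buckets[i]) < self.slots:
                self.buckets[i].append(fp)
                return True
        return False                       # table is effectively full

    def contains(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def delete(self, item):
        fp = self._fingerprint(item)
        i1 = self._index(item)
        for i in (i1, self._alt_index(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False

cf = CuckooFilter()
cf.insert("alice")
print(cf.contains("alice"))   # True
cf.delete("alice")
print(cf.contains("alice"))   # False: deletion works, unlike a plain Bloom filter
```

The `insert` returning False is the capacity wall described below: unlike a Bloom filter, a full cuckoo filter refuses new elements rather than degrading.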
Production Systems Using Cuckoo Hashing & Cuckoo Filters
- DPDK packet classification
- RocksDB cuckoo table format
- Intel storage libraries
- Network flow tracking systems
- Edge caching systems
- DNS filtering appliances
Common Mistakes with Cuckoo Hashing & Cuckoo Filters
- Not planning for the capacity limit. Once a cuckoo filter is about 95% full, insertions fail. You cannot just keep adding elements like you can with a Bloom filter (where the false positive rate rises but inserts never fail). Size your filter for the maximum expected element count, not the initial count
- Using cuckoo filters for append-only workloads where deletion is not needed. If you never delete, Bloom filters are simpler, well-understood, and have extensive production track records. The whole point of cuckoo filters is deletion support
- Storing too-short fingerprints. Shorter fingerprints save space but increase false positive rates. With f-bit fingerprints and b entries per bucket, a lookup compares against up to 2b stored fingerprints, so the FPR is approximately 2b/2^f. Use at least 8-bit fingerprints for reasonable FPRs
- Inserting the same element multiple times. Each insertion stores a new fingerprint. If you insert element X three times, it occupies three slots. The third insertion might even evict unrelated elements. Always check for membership before inserting if duplicates are possible
Related to Cuckoo Hashing & Cuckoo Filters
Bloom Filters & Probabilistic Membership, Count-Min Sketch & Frequency Estimation, HyperLogLog & Cardinality Estimation
Distributed Snapshots & Chandy-Lamport — Consensus & Coordination
Difficulty: Advanced. Complexity: O(E) messages, O(N) state
Key Points for Distributed Snapshots & Chandy-Lamport
- A consistent distributed snapshot captures the state of every node and every in-flight message without stopping the system. The snapshot represents a state the system could have been in, even if no single moment in real time matched it exactly
- Chandy-Lamport (1985) uses marker messages as logical scissors that cut message flows into before-snapshot and after-snapshot. Any node can initiate. The algorithm requires only FIFO channels, no global clock
- Apache Flink adapted Chandy-Lamport for stream processing using checkpoint barriers injected into data streams. Aligned barriers give exactly-once guarantees but can cause backpressure. Unaligned barriers (Flink 1.11+) reduce latency at the cost of larger checkpoint sizes
- Incremental snapshots capture only the state delta since the last checkpoint. For Flink operators backed by RocksDB, this can shrink checkpoint size from gigabytes to megabytes, making frequent checkpoints practical
- The consistent cut property guarantees that for every message received in the snapshot, the corresponding send is also captured. This is what makes the snapshot usable for recovery without duplicating or losing events
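The marker rules can be simulated in a few dozen lines. A toy two-process run, assuming FIFO channels as the algorithm requires, where a money transfer is in flight when the snapshot starts; the consistent cut captures it in the channel state, so no money is lost or duplicated:

```python
from collections import deque

class Net:
    """FIFO channels keyed by (src, dst), as Chandy-Lamport requires."""
    def __init__(self, channels):
        self.ch = {c: deque() for c in channels}
    def outgoing(self, name):
        return [c for c in self.ch if c[0] == name]
    def incoming(self, name):
        return [c for c in self.ch if c[1] == name]
    def send(self, c, msg):
        self.ch[c].append(msg)
    def deliver_all(self, procs):
        progress = True
        while progress:
            progress = False
            for (src, dst), q in self.ch.items():
                if q:
                    procs[dst].receive(self, (src, dst), q.popleft())
                    progress = True

class Process:
    def __init__(self, name, balance):
        self.name, self.balance = name, balance
        self.recorded_state = None    # own state at snapshot time
        self.recording = {}           # channel -> messages seen since snapshot
        self.channel_state = {}       # channel -> final recorded channel state

    def start_snapshot(self, net):
        self._record_and_broadcast(net)

    def _record_and_broadcast(self, net):
        self.recorded_state = self.balance
        for ch in net.incoming(self.name):
            self.recording[ch] = []               # start recording all inbound
        for ch in net.outgoing(self.name):
            net.send(ch, "MARKER")                # markers cut the message flow

    def receive(self, net, ch, msg):
        if msg == "MARKER":
            if self.recorded_state is None:
                self._record_and_broadcast(net)   # first marker: record state
            # Marker closes ch: its state is what arrived since recording began
            self.channel_state[ch] = self.recording.pop(ch, [])
        else:
            self.balance += msg
            if ch in self.recording:
                self.recording[ch].append(msg)    # in-flight at snapshot time

p, q = Process("P", 100), Process("Q", 50)
net = Net([("P", "Q"), ("Q", "P")])
q.balance -= 20
net.send(("Q", "P"), 20)        # Q's transfer to P is in flight
p.start_snapshot(net)
net.deliver_all({"P": p, "Q": q})
print(p.recorded_state, q.recorded_state, p.channel_state[("Q", "P")])
# 100 30 [20] -- the in-flight 20 lands in the channel state, total conserved
```

The recorded states plus channel states sum to 150 even though no real-time instant had that exact configuration, which is the consistent-cut property.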
Production Systems Using Distributed Snapshots & Chandy-Lamport
- Apache Flink checkpoints (barrier-based)
- Spark Structured Streaming
- Kafka Streams state store snapshots
- CockroachDB MVCC snapshots
- Debezium initial snapshot + CDC
- Apache Samza checkpointing
Common Mistakes with Distributed Snapshots & Chandy-Lamport
- Using aligned checkpoints on high-throughput Flink pipelines without benchmarking backpressure. Aligned barriers force operators to buffer data from fast channels while waiting for slow ones, which can cascade into severe backpressure spikes
- Not enabling incremental checkpoints for large operator state. Full state snapshots of a 50GB RocksDB backend can take minutes and saturate network bandwidth to the checkpoint store
- Setting the checkpoint interval too aggressively. Each checkpoint has coordination overhead (barrier propagation, state serialization, storage writes). An interval of 100ms on a pipeline doing 10ms checkpoints means 10% of capacity goes to checkpointing
- Ignoring checkpoint storage costs. Retained checkpoints on S3 or HDFS accumulate fast, especially with incremental checkpoints that keep SST file references alive. Without a retention policy, storage bills grow quietly
Related to Distributed Snapshots & Chandy-Lamport
Write-Ahead Log, Raft Consensus Protocol, Logical Clocks & Causality Tracking
Erasure Coding & Reed-Solomon — Data Distribution
Difficulty: Advanced. Complexity: O(n·k) encode, O(k²) decode (Galois field ops)
Key Points for Erasure Coding & Reed-Solomon
- Erasure coding splits data into k data shards and computes m parity shards such that any k of the k+m total shards can reconstruct the original data. This tolerates m simultaneous failures, the same failure tolerance as (m+1)-way replication, at a storage overhead of only (k+m)/k instead of m+1
- Reed-Solomon codes operate over Galois fields (finite field arithmetic) to generate parity shards. RS(6,3) stores 9 shards total and tolerates any 3 failures, using 1.5x storage instead of the 3x that triple replication would require for equivalent durability
- The hidden cost is reconstruction. When a shard is lost, the system must read k surviving shards and perform matrix inversion to recover it. This amplifies read latency and network traffic during degraded operation, which is why hot data often still uses replication
- Every major storage system uses erasure coding for cold and warm data: HDFS switched from 3x replication to RS(6,3) and cut storage costs nearly in half, Ceph supports erasure-coded pools, and S3 uses a proprietary scheme across availability zones
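The shard-and-parity idea can be sketched with the m=1 special case, where the single parity shard is just the XOR of the data shards (RAID-5 style). Real Reed-Solomon generalizes this to m parity shards using Galois-field matrix arithmetic; the function names here are illustrative.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int) -> list[bytes]:
    """Split data into k equal shards and append one XOR parity shard."""
    shard_len = -(-len(data) // k)                     # ceil division
    padded = data.ljust(k * shard_len, b"\x00")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    shards.append(reduce(xor_bytes, shards))           # parity = XOR of all data shards
    return shards

def reconstruct(shards: list) -> list:
    """Recover at most one missing shard (None) by XOR-ing the survivors."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("a single parity shard tolerates only one lost shard")
    if missing:
        survivors = [s for s in shards if s is not None]
        shards[missing[0]] = reduce(xor_bytes, survivors)
    return shards
```

Losing any one of the k+1 shards is recoverable because XOR-ing the survivors cancels everything except the missing shard; tolerating m > 1 losses is exactly what the Galois-field construction adds.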
Production Systems Using Erasure Coding & Reed-Solomon
- HDFS (RS(6,3) and RS(10,4) for cold data)
- Ceph (erasure-coded pools with configurable k,m)
- AWS S3 (proprietary erasure coding across AZs)
- Azure Storage (Local Reconstruction Codes)
- MinIO (Reed-Solomon per object)
- Backblaze Vaults (RS(17,3) across 20 drives)
- Google Colossus (successor to GFS, erasure coded)
Common Mistakes with Erasure Coding & Reed-Solomon
- Using erasure coding for small objects. An RS(6,3) scheme turns a 1KB object into 9 separate shards of ~170 bytes each, and every read requires fetching 6 of them from 6 different nodes. The per-request overhead dominates. Most systems set a minimum object size (typically 1-4MB) below which they fall back to replication
- Ignoring tail latency during degraded reads. Normal reads touch k nodes and take the slowest of those k responses. During reconstruction, you still read k nodes but the data also passes through a decode step. If even one of those k nodes is slow, the entire read stalls. Systems like Azure add extra read requests to hedge against stragglers
- Placing all shards in the same failure domain. If your RS(6,3) shards all land on the same rack and that rack loses power, you lose all 9 shards at once. Erasure coding only works when shards are spread across independent failure domains: different racks, different power circuits, or different availability zones
- Not accounting for reconstruction bandwidth during recovery. When a node dies holding thousands of erasure-coded shards, recovering each one requires reading k shards from k different nodes. At scale this can saturate network links and trigger cascading slowdowns. Smart systems throttle recovery bandwidth and prioritize shards that are closest to losing durability
Related to Erasure Coding & Reed-Solomon
Consistent Hashing & Ring Algorithms, Quorum Systems & NRW Notation, Failure Detection Algorithms
Event Sourcing — Data Patterns
Difficulty: Advanced. Complexity: O(1) append, O(n) rebuild from log
Key Points for Event Sourcing
- Store every state change as an immutable event in an append-only log. Current state is derived by replaying events, not stored directly. You get a complete, auditable history of how the system reached any given state
- Snapshots bound rebuild time. Without them, reconstructing state means replaying every event since the beginning. Periodic snapshots let you start from a recent checkpoint and replay only the events after it. The tradeoff: snapshot storage cost vs rebuild speed
- CQRS (Command Query Responsibility Segregation) falls out of this design almost automatically. Writes append events to the log. Reads query pre-built projections optimized for specific access patterns. Separating the two lets each side scale independently
- Event replay is the killer debugging tool. When something goes wrong, replay the event log up to the point of failure and inspect the exact state. No guessing, no log correlation, no 'works on my machine.' The events ARE the truth
- Schema evolution is the hardest operational problem. Once events are persisted, their schema is frozen. Adding fields is easy (default values). Removing or renaming fields requires versioned event handlers that can process both old and new formats indefinitely
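The append/replay/snapshot loop above can be sketched in a few lines. The bank-account domain, event names, and snapshot interval are illustrative, not from the source:

```python
class EventStore:
    def __init__(self, snapshot_every: int = 100):
        self.log: list[dict] = []                   # immutable, append-only
        self.snapshot = (0, 0)                      # (events applied, balance)
        self.snapshot_every = snapshot_every

    def append(self, event: dict) -> None:
        self.log.append(event)
        if len(self.log) % self.snapshot_every == 0:
            # Periodic snapshot bounds future rebuild time.
            self.snapshot = (len(self.log), self.current_balance())

    @staticmethod
    def apply(balance: int, event: dict) -> int:
        # Events record raw facts; this reducer derives state from them.
        if event["type"] == "Deposited":
            return balance + event["amount"]
        if event["type"] == "Withdrawn":
            return balance - event["amount"]
        return balance

    def current_balance(self) -> int:
        # Replay only the events after the last snapshot.
        applied, balance = self.snapshot
        for event in self.log[applied:]:
            balance = self.apply(balance, event)
        return balance
```

Replaying `self.log[:k]` instead of the full log gives the state as of any past point, which is the temporal-query and debugging capability described above.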
Production Systems Using Event Sourcing
- Kafka (commit log as distributed event store)
- EventStoreDB (purpose-built event sourcing database)
- Axon Framework (Java CQRS + event sourcing)
- Yjs collaborative editors (operation log + document snapshots)
- Banking and financial systems (immutable transaction ledgers)
- Git (commit history as event log, working tree as projection)
- Redux (action log with reducer-based state derivation)
Common Mistakes with Event Sourcing
- Not implementing snapshots from the start. A collaborative document with 50,000 edits takes seconds to rebuild from raw events. Without snapshots, every document open becomes a full replay. Add snapshot logic early, even if the initial event count is small
- Using event sourcing for everything. CRUD-heavy domains with simple read patterns gain nothing from event sourcing and pay the full complexity cost. Reserve it for domains where audit history, temporal queries, or replay capability are actual requirements
- Storing derived data in events instead of raw facts. Events should capture what happened (UserClickedButton, DocumentEdited), not what the system decided to do about it. Derived state belongs in projections, not in the event log
- Ignoring event log growth. An active collaborative document generates thousands of events per hour. Without compaction, archival, or retention policies, the event log grows unboundedly. PostgreSQL performance degrades as tables exceed hundreds of millions of rows without partitioning
Related to Event Sourcing
Write-Ahead Log, CRDTs, Saga Pattern
Failure Detection Algorithms — Failure Detection & Quorums
Difficulty: Advanced. Complexity: O(1) per check
Key Points for Failure Detection Algorithms
- You cannot distinguish a crashed node from a slow one. Every failure detector is a tradeoff between detection speed and false positive rate. Faster detection means more healthy nodes incorrectly declared dead
- The phi accrual failure detector outputs a continuous suspicion level instead of a binary alive/dead. It adapts to network conditions automatically by tracking heartbeat interval distribution. Cassandra uses it with phi threshold of 8
- SWIM failure detection achieves O(N) total messages per round instead of O(N^2) all-to-all heartbeats. Each node probes one random peer, then uses indirect probing through k other nodes if the direct probe fails
- Fixed-timeout heartbeat detectors break in cloud environments where latency varies. A 200ms timeout that works in us-east-1 causes constant false positives on cross-region links with 150ms baseline latency
- JVM garbage collection pauses of 5-30 seconds will trigger any aggressive failure detector. Systems running on the JVM need failure detection timeouts that accommodate worst-case GC pauses
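A minimal phi accrual sketch, assuming heartbeat intervals are roughly normally distributed. The Hayashibara paper uses a windowed estimate like this; the window size and fallback stdev below are arbitrary choices for illustration:

```python
import math
import statistics

class PhiAccrualDetector:
    def __init__(self, window: int = 100):
        self.intervals: list[float] = []
        self.window = window
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
            self.intervals = self.intervals[-self.window:]   # sliding window
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """phi = -log10(P(interval > elapsed)); higher means more suspicious."""
        mean = statistics.mean(self.intervals)
        std = statistics.stdev(self.intervals) or 0.1        # floor for constant intervals
        elapsed = now - self.last_heartbeat
        # P(X > elapsed) under a normal distribution, via the complementary error function
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-30))
```

Because phi grows continuously as silence lengthens, a system can take graduated actions (stop routing reads at phi=5, evict at phi=8) instead of a single binary dead/alive decision.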
Production Systems Using Failure Detection Algorithms
- Cassandra (phi accrual, phi threshold = 8)
- Akka Cluster (phi accrual failure detector)
- Consul (SWIM via HashiCorp memberlist)
- Kubernetes (kubelet heartbeat with node lease)
- MongoDB (heartbeat with election timeout)
- Hazelcast (phi accrual)
Common Mistakes with Failure Detection Algorithms
- Using fixed heartbeat timeouts in cloud environments where network latency varies significantly between zones and regions
- Not accounting for GC pauses in JVM-based systems. A 10-second stop-the-world GC in Cassandra or Elasticsearch will look exactly like a dead node to the failure detector
- Having every node heartbeat every other node. With N nodes sending heartbeats to N-1 others, you get O(N^2) network messages. This does not scale past a few dozen nodes
- Setting the phi accrual threshold too low. The expected false positive rate is roughly 10^-phi, so phi=4 means about one mistaken suspicion per 10,000 assessments while phi=8 means about one in 100 million. The difference between regular false alarms and essentially none
Related to Failure Detection Algorithms
Raft Consensus Protocol, Quorum Systems & NRW Notation
Gorilla Compression (Delta-of-Delta + XOR) — Storage Internals
Difficulty: Intermediate. Complexity: O(1) per sample encode/decode, ~1.37 bytes/sample
Key Points for Gorilla Compression (Delta-of-Delta + XOR)
- Gorilla exploits two properties unique to time-series data: timestamps arrive at regular intervals (delta-of-delta is usually zero = 1 bit), and consecutive float values are often identical or change slowly (XOR produces mostly zeros = 1 bit). Generic compressors like LZ4 or ZSTD cannot see this structure
- Delta-of-delta encoding stores each timestamp as the difference between consecutive deltas. With a regular 15-second scrape interval, every delta is 15s, every delta-of-delta is 0, and every timestamp after the second costs exactly 1 bit
- XOR encoding compresses float64 values by XOR-ing consecutive samples. According to Facebook's Gorilla paper, 51% of all production metric values XOR to exactly zero with their predecessor (1 bit). 30% compress to ~26.6 bits. 19% need ~36.9 bits. Average: ~9.5 bits per value
- The combined result is ~11 bits per sample (1.37 bytes) compared to 128 bits uncompressed (16 bytes for a timestamp-value pair). That is roughly a 12x compression ratio. At 500M samples/sec, this is the difference between about 690 TB/day and 59 TB/day of storage
- Every modern time-series database uses Gorilla or a variant: Prometheus TSDB, VictoriaMetrics, M3DB (Uber), InfluxDB IOx, and TimescaleDB. The algorithm is the foundation of practical TSDB storage
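The two observations, zero delta-of-deltas for regular timestamps and zero XOR for repeated float values, can be demonstrated directly. This sketch shows only the classification step, not Gorilla's variable-length bit packing:

```python
import struct

def delta_of_deltas(timestamps: list[int]) -> list[int]:
    """Second differences of a timestamp sequence; all zero for a regular interval."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

def xor_floats(a: float, b: float) -> int:
    """Bitwise XOR of two float64s via their IEEE-754 bit patterns."""
    (ia,) = struct.unpack(">Q", struct.pack(">d", a))
    (ib,) = struct.unpack(">Q", struct.pack(">d", b))
    return ia ^ ib
```

A 15-second scrape produces `delta_of_deltas == [0, 0, 0, ...]`, each encodable as a single bit; an unchanged gauge value XORs to 0, also a single bit, which is where the 51% figure above comes from.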
Production Systems Using Gorilla Compression (Delta-of-Delta + XOR)
- Facebook Gorilla (original in-memory TSDB, 2015)
- Prometheus TSDB (2-hour block chunks)
- VictoriaMetrics (vmselect/vmstorage hot tier)
- M3DB / M3TSZ (Uber, Gorilla variant)
- InfluxDB IOx (Apache Arrow + Gorilla for floats)
- TimescaleDB (gorilla compression option)
- QuestDB (column-level Gorilla for doubles)
Common Mistakes with Gorilla Compression (Delta-of-Delta + XOR)
- Assuming Gorilla compression works equally well on irregular data. Delta-of-delta encoding relies on consistent intervals. If timestamps arrive at random intervals, delta-of-deltas are rarely zero and timestamps cost 9-36 bits each instead of 1 bit
- Confusing the theoretical 1.37 bytes/sample with production reality. Chunk headers, label storage overhead, index structures, and compression boundary effects push the effective rate to 1.5-2 bytes per sample in real systems
- Applying Gorilla to non-numeric time-series data. The algorithm compresses float64 timestamp-value pairs only. Labels, string fields, and metadata require separate compression (inverted index for labels, dictionary encoding for strings)
- Using excessively short chunk durations. Each chunk has fixed overhead (header, block boundary markers). Very short chunks (under 30 seconds of data) waste a higher percentage of space on overhead, reducing the effective compression ratio
- Not layering a general-purpose compressor on top for cold storage. Gorilla is optimized for fast encode/decode in the hot path. For cold data read infrequently, applying ZSTD on top of Gorilla-compressed blocks can squeeze out another 2-3x
Related to Gorilla Compression (Delta-of-Delta + XOR)
Distributed Snapshots & Chandy-Lamport, Write-Ahead Log, LSM Tree, Bloom Filters
Gossip Protocols & SWIM — Consensus & Coordination
Difficulty: Advanced. Complexity: O(log N) convergence, O(N) state
Key Points for Gossip Protocols & SWIM
- Information spreads through a gossip cluster the way a disease spreads through a population. Each node periodically picks a random peer and exchanges state. After O(log N) rounds, the entire cluster has the information with high probability
- SWIM separates failure detection from dissemination. It detects failures through direct and indirect probing, then piggybacks membership updates onto probe messages instead of broadcasting them separately
- Gossip provides eventual consistency, not strong consistency. Two nodes might temporarily disagree about cluster state. This is fine for membership and health but wrong for operations that need linearizability
- Cassandra, Consul, Redis Cluster, and ScyllaDB all use gossip for cluster membership and state dissemination in production
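The O(log N) convergence claim can be checked with a small push-gossip simulation; the cluster size and seed below are arbitrary:

```python
import random

def rounds_to_converge(n: int, seed: int = 42) -> int:
    """Rounds until a rumor starting at node 0 reaches all n nodes."""
    rng = random.Random(seed)
    informed = {0}
    rounds = 0
    while len(informed) < n:
        rounds += 1
        for _node in list(informed):
            informed.add(rng.randrange(n))   # each informed node pushes to one random peer
    return rounds
```

The informed set roughly doubles each round until it covers half the cluster, then the stragglers take a coupon-collector tail, giving total rounds on the order of log2(N) + ln(N).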
Production Systems Using Gossip Protocols & SWIM
- Cassandra cluster membership
- HashiCorp Serf
- HashiCorp Memberlist (used by Consul, Nomad)
- Redis Cluster gossip bus
- Riak cluster coordination
- ScyllaDB
Common Mistakes with Gossip Protocols & SWIM
- Setting gossip interval too aggressively in large clusters. With 1000 nodes gossiping every 200ms, each node handles 5 incoming gossip messages per second, which creates non-trivial CPU and network overhead
- Not tuning suspicion timeouts for cloud environments where instances restart with the same IP address. A node that crashed and restarted quickly might still be marked as suspected, causing unnecessary failover
- Forgetting that gossip provides eventual consistency, not strong consistency. Using gossip state for decisions that require agreement across the cluster (like leader election) will produce split-brain scenarios
- Not limiting the size of piggybacked payloads. If you attach large metadata to gossip messages, the protocol degrades from lightweight chatter to bulk data transfer
Related to Gossip Protocols & SWIM
Raft Consensus Protocol, Failure Detection Algorithms, CRDTs: Conflict-Free Replicated Data Types
Hidden Markov Map Matching — Geospatial Algorithms
Difficulty: Advanced. Complexity: O(nm) per trajectory. n = GPS points, m = candidate segments per point
Key Points for Hidden Markov Map Matching
- Map matching converts raw GPS coordinates (lat, lng) into road segment IDs. GPS says 40.7512, -73.9876. The road graph has segment_42 and segment_43. These are different systems. Map matching bridges them. Every downstream operation (traffic aggregation, routing, ETA) depends on this conversion being correct.
- The Hidden Markov Model treats each GPS point as an observation and each candidate road segment as a hidden state. The emission probability measures how close the GPS point is to the segment (Gaussian, closer = higher). The transition probability measures how realistic the movement is between two segments (routing distance vs GPS distance, similar = higher).
- The Viterbi algorithm finds the most likely sequence of road segments given the full trajectory. It uses dynamic programming: for each GPS point, compute the best probability of reaching each candidate segment considering all previous points. Then backtrack to recover the path. This is what makes HMM better than simple nearest-road snapping.
- Simple nearest-road matching fails at overpasses, parallel streets, and intersections. A GPS point 5m from a highway overpass and 8m from the road below would snap to the highway. But if the vehicle was on the surface street for the last 10 points, HMM correctly keeps it on the surface street because the transition probability from surface→highway is low.
- At 50M GPS points per second, HMM at 0.5ms per point would need 25,000 CPU cores. The optimization: use simple geometric snapping (nearest road within 10m) for unambiguous straight segments. Reserve HMM for complex areas (overpasses, parallel roads, dense intersections). This reduces HMM calls by ~80%, bringing compute down to ~5,000 cores.
- In production, map matching runs inside Apache Flink on GPS streams partitioned by S2 cell. All GPS points from the same geographic area land on the same partition, so the map matcher has the full local road network in memory. Output is (segment_id, speed, timestamp) fed to downstream traffic aggregation.
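A generic Viterbi sketch over candidate segments. The emission and transition scores in the usage below are toy log-probabilities standing in for the Gaussian-distance and routing-distance terms described above:

```python
def viterbi(observations, states, emit_logp, trans_logp):
    """Most likely hidden-state path for a sequence of observations."""
    # best[s] = (log-prob of the best path ending in s, that path)
    best = {s: (emit_logp(observations[0], s), [s]) for s in states}
    for obs in observations[1:]:
        nxt = {}
        for s in states:
            # Pick the predecessor that maximizes path prob + transition prob.
            prev, (lp, path) = max(
                ((p, best[p]) for p in states),
                key=lambda kv: kv[1][0] + trans_logp(kv[0], s),
            )
            nxt[s] = (lp + trans_logp(prev, s) + emit_logp(obs, s), path + [s])
        best = nxt
    # Backtracked path with the highest total log-probability.
    return max(best.values(), key=lambda v: v[0])[1]
```

With a toy overpass scenario (one GPS point slightly closer to the highway, but switching roads penalized), the full-trajectory decoding keeps the vehicle on the surface street, exactly the behavior nearest-road snapping gets wrong.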
Production Systems Using Hidden Markov Map Matching
- Google Maps
- Apple Maps
- Uber
- Lyft
- OSRM
- Valhalla
Common Mistakes with Hidden Markov Map Matching
- Using nearest-road snapping instead of HMM. Nearest-road works 90% of the time but fails catastrophically at overpasses, parallel roads, and complex intersections. These are exactly the high-traffic areas where accuracy matters most. A vehicle incorrectly matched to a highway overpass instead of the congested road below produces wrong traffic data for both roads.
- Not handling GPS dropouts in tunnels and parking garages. When GPS signal is lost, the vehicle continues moving but produces no points. Naive implementations leave a gap. The fix: dead reckoning from last known position + speed + heading until GPS resumes, then re-anchor with HMM on the first reliable point.
- Setting the GPS accuracy sigma too tight. If sigma is 3m but actual GPS accuracy is 8m in urban canyons, the emission probabilities become too peaked and the model rejects correct matches. Use the device-reported accuracy or a conservative default (5-10m).
- Running HMM on every single GPS point at scale. At 50M points/sec, this is 25,000 CPU cores. The geometric fast path (snap to nearest for unambiguous segments, HMM only for complex areas) reduces this by 80%. Without this optimization, the map matching pipeline becomes the most expensive component in the entire traffic system.
- Ignoring one-way streets and turn restrictions in transition probabilities. If the routing distance from segment A to segment B via legal turns is 500m, but the GPS distance is only 50m, the transition probability should be very low. Without turn restrictions, the model may match vehicles to segments they could not legally reach.
Related to Hidden Markov Map Matching
Contraction Hierarchies, Google S2 Geometry
Hybrid Logical Clocks (HLC) — Clocks & Ordering
Difficulty: Advanced. Complexity: O(1) per event, constant-size timestamp
Key Points for Hybrid Logical Clocks (HLC)
- HLC combines a physical timestamp with a logical counter to get the best of both worlds. Events get timestamps that are close to real wall-clock time (useful for humans and TTLs) while still maintaining the causal ordering guarantees of logical clocks
- The physical component tracks the maximum physical time seen across all messages. The logical component breaks ties when multiple events have the same physical time. Together, they form a timestamp that is always within clock skew of the actual physical time
- CockroachDB uses HLC timestamps for MVCC versioning and serializable isolation. Every transaction gets an HLC timestamp, and CockroachDB uses the bounded clock skew to define uncertainty intervals for reads
- Unlike vector clocks, HLC timestamps have a fixed size (one physical counter + one logical counter) regardless of the number of nodes. This makes them practical for systems with thousands of nodes where vector clock metadata would be prohibitively large
- The key assumption is bounded clock skew. NTP typically keeps clocks within a few milliseconds of each other. HLC uses this bound to reason about whether two events at nearby timestamps might be causally related or are definitely concurrent
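The HLC update rules can be sketched as follows. This follows the send/receive algorithm from the Kulkarni et al. paper, with the wall clock injected as a function so the tie-breaking behavior is testable; the class shape is a simplified sketch:

```python
class HLC:
    def __init__(self, now_fn):
        self.now_fn = now_fn      # wall-clock source, e.g. time.time_ns
        self.l = 0                # max physical time seen so far
        self.c = 0                # logical counter, breaks ties at equal l

    def tick(self):
        """Local or send event: advance to wall clock or bump the counter."""
        pt = self.now_fn()
        if pt > self.l:
            self.l, self.c = pt, 0
        else:
            self.c += 1
        return (self.l, self.c)

    def receive(self, ml: int, mc: int):
        """Merge a remote timestamp (ml, mc) into the local clock."""
        pt = self.now_fn()
        new_l = max(self.l, ml, pt)
        if new_l == self.l == ml:
            self.c = max(self.c, mc) + 1
        elif new_l == self.l:
            self.c += 1
        elif new_l == ml:
            self.c = mc + 1
        else:
            self.c = 0            # wall clock moved past everything seen
        self.l = new_l
        return (self.l, self.c)
```

Note that `l` never runs ahead of the maximum physical time observed anywhere, so timestamps stay within clock skew of real time while `(l, c)` pairs remain monotonic across sends and receives.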
Production Systems Using Hybrid Logical Clocks (HLC)
- CockroachDB (MVCC timestamps)
- YugabyteDB (safe timestamps)
- MongoDB (WiredTiger timestamps)
- TiDB (TSO with HLC concepts)
- Couchbase (hybrid clocks)
Common Mistakes with Hybrid Logical Clocks (HLC)
- Treating HLC timestamps as exact wall-clock times. The physical component is always >= real time but can be ahead of it (because it tracks the maximum physical time seen in messages). Using HLC timestamps for user-facing time display works most of the time but can show slightly future times
- Ignoring the clock skew bound. HLC correctness depends on NTP keeping clocks reasonably synchronized. If NTP fails or clock skew exceeds the configured bound, the uncertainty intervals in systems like CockroachDB grow, increasing transaction retry rates
- Assuming HLC gives you a total order like Google TrueTime. HLC tells you 'A happened before B' or 'A and B might be concurrent.' TrueTime tells you 'A definitely happened before B' with confidence intervals. TrueTime provides a total order; HLC provides a partial order with physical time hints
- Not persisting the HLC state across restarts. If a node restarts with its HLC reset to current physical time, it might issue timestamps lower than timestamps it issued before the restart. This can break causal ordering. Persist the logical counter and last known physical time
Related to Hybrid Logical Clocks (HLC)
Logical Clocks & Causality Tracking, Vector Clocks & Version Vectors, Two-Phase Commit (2PC)
HyperLogLog & Cardinality Estimation — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(1) per add, O(m) space
Key Points for HyperLogLog & Cardinality Estimation
- Estimates the count of distinct elements in a stream using roughly 12KB of memory regardless of cardinality. That is 12KB for 1 million uniques or 1 billion uniques
- Standard error of 1.04/sqrt(m) where m is the number of registers. With 16384 registers (the Redis default), you get about 0.81% error
- The core insight: the longest run of leading zeros in a hash tells you something about how many distinct values you have seen. More distinct values means longer runs become more likely
- HyperLogLog structures are fully mergeable. You can compute cardinality of a union by merging HLL sketches from different servers without touching the original data
- Redis PFCOUNT, ClickHouse uniqHLL12, BigQuery APPROX_COUNT_DISTINCT, and Presto approx_distinct all use HyperLogLog under the hood
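A minimal HyperLogLog sketch, assuming a 64-bit hash split into a register index and a leading-zero rank. Only the small-range correction is included, and the register count is illustrative:

```python
import hashlib
import math

class HLL:
    def __init__(self, p: int = 10):
        self.p, self.m = p, 1 << p               # m = 2^p registers
        self.reg = [0] * self.m

    def add(self, item: str) -> None:
        x = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        j = x >> (64 - self.p)                   # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)    # remaining bits
        rho = (64 - self.p) - rest.bit_length() + 1   # leading-zero run + 1
        self.reg[j] = max(self.reg[j], rho)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        # Harmonic mean of 2^register across all registers.
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.reg)
        zeros = self.reg.count(0)
        if raw <= 2.5 * self.m and zeros:        # small-range (linear counting) correction
            return self.m * math.log(self.m / zeros)
        return raw
```

With p=10 (1024 registers, ~1KB of state) the expected error is 1.04/sqrt(1024) = 3.25%; bumping to p=14 gives the 16384-register, 0.81% configuration cited above.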
Production Systems Using HyperLogLog & Cardinality Estimation
- Redis PFADD/PFCOUNT
- ClickHouse uniqHLL12
- BigQuery APPROX_COUNT_DISTINCT
- Presto/Trino approx_distinct
- Flink DataStream approximate distinct
- Druid HyperUnique aggregator
- Elasticsearch cardinality aggregation
Common Mistakes with HyperLogLog & Cardinality Estimation
- Using exact COUNT(DISTINCT) on datasets above 100M rows when approximate is acceptable. A query that takes 45 seconds with exact counting takes 200ms with HLL
- Not understanding that HLL cannot tell you WHICH elements are in the set, only how many. If you need membership testing, you want a Bloom filter instead
- Forgetting that the error is relative to the true cardinality. 0.81% error on 1 billion uniques means you could be off by 8 million. For dashboards that is fine. For billing, it is not
- Using too few registers to save memory. Going from 16384 to 256 registers saves 12KB but increases error from 0.81% to 6.5%. The memory savings is almost never worth the accuracy loss
Related to HyperLogLog & Cardinality Estimation
Bloom Filters & Probabilistic Membership, Count-Min Sketch & Frequency Estimation, Streaming Quantiles: T-Digest & DDSketch
Inverted Index — Search & Retrieval
Difficulty: Advanced. Complexity: O(1) term lookup, O(n) posting traversal
Key Points for Inverted Index
- An inverted index flips the natural document-to-words relationship. Instead of asking 'what words are in this document?' you can ask 'which documents contain this word?' and get an answer in microseconds, even across billions of documents
- Posting lists are the core data structure: sorted arrays of document IDs paired with term frequencies, positions, and offsets. Compression techniques like PForDelta and Variable Byte encoding shrink these lists to 1-2 bits per posting
- Query execution boils down to set operations on posting lists. AND queries intersect sorted lists using skip pointers. OR queries merge them. The sorted order makes these operations efficient even when individual lists contain millions of entries
- Every Elasticsearch shard is a Lucene index, and every Lucene index is a collection of immutable segments. Each segment has its own inverted index. Search fans out to all segments, each one queries its local index, and results merge back
- Building the index requires choosing an analyzer chain: tokenizer, lowercasing, stemming, stop word removal. These choices are permanent. You cannot retroactively change how documents were tokenized without a full reindex
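The posting-list model can be sketched with a toy analyzer (lowercase plus whitespace split; real chains add stemming, stop words, and more) and shortest-first AND intersection:

```python
class InvertedIndex:
    def __init__(self):
        self.postings: dict[str, list[int]] = {}   # term -> sorted doc IDs

    def add(self, doc_id: int, text: str) -> None:
        # Assumes doc IDs are added in increasing order, keeping lists sorted.
        for term in set(text.lower().split()):
            self.postings.setdefault(term, []).append(doc_id)

    def search_and(self, query: str) -> list[int]:
        """Docs containing every query term: intersect posting lists, shortest first."""
        lists = sorted((self.postings.get(t, []) for t in query.lower().split()), key=len)
        if not lists:
            return []
        result = lists[0]
        for plist in lists[1:]:
            members = set(plist)
            result = [d for d in result if d in members]
        return result
```

Intersecting shortest-first bounds the work by the rarest term's list length, the same reason production engines drive intersection from the shortest posting list (with skip pointers instead of a hash set).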
Production Systems Using Inverted Index
- Elasticsearch / Lucene segments
- Apache Solr
- PostgreSQL GIN index
- SQLite FTS5
- Bleve (Go search library)
- Tantivy (Rust search engine)
Common Mistakes with Inverted Index
- Changing the analyzer after indexing data and expecting old documents to match new queries. The analyzer must be identical at index time and query time, or you get silent mismatches where documents exist but queries cannot find them
- Indexing everything as a single giant field. Separate title, body, tags, and metadata into distinct fields so you can boost and query them independently. A match in the title is worth more than a match buried in paragraph 47
- Ignoring stop words in phrase queries. If you remove stop words during indexing, the phrase 'to be or not to be' becomes an empty query. Either keep stop words for fields that need phrase matching, or use a shingle-based approach
- Not planning for updates. Inverted indexes are append-optimized. Updating a document means marking the old version as deleted and inserting a new one. High-update workloads cause segment bloat and need aggressive merge policies
Related to Inverted Index
BM25 — Best Matching 25, LSM Trees vs B-Trees, Bloom Filters & Probabilistic Membership
Logical Clocks & Causality Tracking — Clocks & Ordering
Difficulty: Advanced. Complexity: O(1) Lamport, O(N) vector
Key Points for Logical Clocks & Causality Tracking
- Physical clocks cannot reliably order events across machines. NTP drift, leap seconds, and clock skew mean that wall-clock timestamps from two different servers are not comparable for ordering purposes
- Lamport timestamps give you a total order that is consistent with causality in one direction: if A happened before B, then L(A) < L(B). But the converse is not true. Two events with different timestamps might be concurrent
- Vector clocks can detect concurrency: if neither vector dominates the other element-wise, the events are concurrent. This is what you need for conflict detection in multi-master replication
- DynamoDB uses version vectors, CockroachDB uses Hybrid Logical Clocks, and Google Spanner uses GPS and atomic clocks to bound uncertainty. Each makes a different trade-off between metadata size, accuracy, and hardware requirements
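Concurrency detection with vector clocks reduces to an element-wise dominance check. A minimal sketch, with node names as dict keys for illustration:

```python
def compare(a: dict, b: dict) -> str:
    """'before', 'after', 'equal', or 'concurrent' for two vector clocks."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"          # a causally precedes b
    if b_le_a:
        return "after"
    return "concurrent"          # neither dominates: a genuine conflict

def merge(a: dict, b: dict) -> dict:
    """Element-wise max, applied when a node receives a remote clock."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

The "concurrent" result is exactly what Lamport timestamps cannot express, and it is the signal a multi-master store needs before invoking conflict resolution.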
Production Systems Using Logical Clocks & Causality Tracking
- DynamoDB (version vectors)
- CockroachDB (Hybrid Logical Clocks)
- Google Spanner (TrueTime)
- Riak (dotted version vectors)
- Cassandra (Lamport timestamps for conflict resolution)
- CouchDB (revision trees)
Common Mistakes with Logical Clocks & Causality Tracking
- Relying on System.currentTimeMillis() or time.Now() to order events across machines. Clock skew between machines is typically 1-10ms with NTP, but can spike to seconds during NTP corrections or leap second events
- Using Lamport timestamps when you need to detect concurrency. Lamport timestamps cannot distinguish between 'A happened before B' and 'A and B are concurrent.' You need vector clocks or version vectors for that
- Implementing full vector clocks at scale. With 1000 nodes, every message carries 1000 integers. Version vectors, which track versions per data item rather than per node, are almost always sufficient and much smaller
- Ignoring clock skew when using LWW registers. If node A's clock is 5 seconds ahead of node B, node A's writes always win concurrent conflicts regardless of actual ordering
Related to Logical Clocks & Causality Tracking
CRDTs: Conflict-Free Replicated Data Types, Raft Consensus Protocol, Quorum Systems & NRW Notation
LSM Trees vs B-Trees — Storage Internals
Difficulty: Advanced. Complexity: LSM: O(1) write, B-Tree: O(log N) write
Key Points for LSM Trees vs B-Trees
- B-Trees store data in fixed-size pages organized as a balanced tree. Reads follow root-to-leaf in O(log N) page reads. Writes update pages in place with a write-ahead log for crash safety. PostgreSQL, MySQL InnoDB, and SQLite all use B-Trees
- LSM Trees buffer all writes in memory (memtable), flush to immutable sorted files (SSTables) when full, and merge SSTables in the background via compaction. RocksDB, LevelDB, Cassandra, and ScyllaDB all use LSM trees
- Write amplification in LSM trees comes from compaction rewriting data multiple times across levels. Leveled compaction in RocksDB rewrites data about 10-30x. B-Trees rewrite entire pages for small updates plus WAL entries
- Bloom filters on SSTables are what make LSM read performance tolerable. Without them, a point query might check SSTables across every level. With them, you skip levels that definitely do not contain the key
- The choice between them almost always comes down to workload: write-heavy and append-heavy favors LSM, read-heavy with random updates favors B-Trees
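The LSM write/read path can be sketched without compaction or Bloom filters: O(1) writes into a memtable, an immutable sorted run on each flush, and reads that check newest-first. The memtable limit here is illustrative:

```python
import bisect

class TinyLSM:
    def __init__(self, memtable_limit: int = 4):
        self.memtable: dict[str, str] = {}
        self.sstables: list[list[tuple[str, str]]] = []   # newest run first
        self.limit = memtable_limit

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value                        # O(1) in-memory write
        if len(self.memtable) >= self.limit:
            # Flush to an immutable sorted run; never updated in place.
            self.sstables.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key: str):
        if key in self.memtable:                          # freshest data first
            return self.memtable[key]
        for run in self.sstables:                         # then runs, newest to oldest
            i = bisect.bisect_left(run, (key, ""))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None
```

The read path scanning every run is precisely the cost that per-SSTable Bloom filters eliminate, and the growing run count is what compaction keeps bounded.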
Production Systems Using LSM Trees vs B-Trees
- RocksDB (LSM, embedded in CockroachDB/TiDB/Flink)
- PostgreSQL (B-Tree indexes)
- MySQL InnoDB (B+Tree)
- Cassandra (LSM with configurable compaction)
- LevelDB (LSM)
- ClickHouse (MergeTree, LSM variant)
Common Mistakes with LSM Trees vs B-Trees
- Choosing an LSM-based engine for read-heavy OLTP with lots of random point queries. B-Trees will outperform because they do not need to check multiple levels
- Using default leveled compaction for a write-heavy workload without benchmarking tiered compaction. Leveled has the worst write amplification, and tiered can improve write throughput 3-5x
- Ignoring write amplification on SSDs. SSDs have a limited number of write cycles per cell. A 10x write amplification means your drive wears out 10x faster than the application-level write volume suggests
- Not monitoring compaction lag. When incoming writes outpace compaction, unmerged SSTables pile up, reads degrade because they check more files, and space usage balloons
Related to LSM Trees vs B-Trees
Bloom Filters & Probabilistic Membership, Merkle Trees & Anti-Entropy Repair
Merkle Trees & Anti-Entropy Repair — Data Distribution
Difficulty: Advanced. Complexity: O(log N) verification
Key Points for Merkle Trees & Anti-Entropy Repair
- Two replicas can find their differences in O(log N) hash comparisons instead of comparing every record. Compare root hashes, recurse into the subtree that differs, and narrow down to exactly the divergent blocks
- Cassandra builds Merkle trees during repair to identify divergent data ranges. Only the differing data transfers between replicas, saving enormous bandwidth compared to full sync
- Merkle DAGs generalize the concept beyond binary trees. Git uses Merkle DAGs for commits, trees, and blobs. IPFS uses them for content-addressable storage where the hash IS the address
- Repairing K divergent keys out of N total with a depth-D tree transfers O(K * D) hashes plus K values. Without the tree, you transfer N hashes. When K is much smaller than N, the savings are massive
- Incremental Merkle trees (Cassandra 4.0+) avoid the cost of rebuilding the entire tree from scratch on every repair cycle
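The O(log N) divergence search can be sketched directly: build a binary hash tree over blocks, compare roots, and recurse only where hashes differ. One block per list element keeps the sketch short; real systems hash key ranges:

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build(blocks: list, lo: int = 0, hi=None) -> dict:
    """Leaf: {'hash', 'lo', 'hi'}; internal node adds 'left' and 'right'."""
    if hi is None:
        hi = len(blocks)
    if hi - lo == 1:
        return {"hash": h(blocks[lo]), "lo": lo, "hi": hi}
    mid = (lo + hi + 1) // 2
    left, right = build(blocks, lo, mid), build(blocks, mid, hi)
    return {"hash": h(left["hash"] + right["hash"]), "lo": lo, "hi": hi,
            "left": left, "right": right}

def diff(a: dict, b: dict) -> list:
    """Block indices where two same-shape trees disagree."""
    if a["hash"] == b["hash"]:
        return []                # identical subtree: prune the whole branch
    if "left" not in a:
        return [a["lo"]]         # divergent leaf found
    return diff(a["left"], b["left"]) + diff(a["right"], b["right"])
```

Two replicas only exchange the hashes along the divergent path, then transfer the data for the handful of blocks `diff` reports, rather than comparing every record.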
Production Systems Using Merkle Trees & Anti-Entropy Repair
- Cassandra nodetool repair
- IPFS content addressing
- Git (commits, trees, blobs)
- ZFS data integrity (block checksumming)
- Amazon DynamoDB anti-entropy
- Ethereum state trie (Patricia Merkle Trie)
Common Mistakes with Merkle Trees & Anti-Entropy Repair
- Rebuilding the entire Merkle tree on every sync cycle. Use incremental updates that only rehash the subtrees where data changed
- Choosing tree depth too shallow. If each leaf covers thousands of keys, you detect that something diverged in a large range but still have to compare every key in that range
- Ignoring the CPU cost of hashing large datasets. SHA-256 on terabytes of data is not cheap, and running repair during peak traffic can starve application workloads
- Not scheduling repair during low-traffic periods. Anti-entropy repair is IO and CPU intensive. Running it at peak hours competes with production reads and writes
Related to Merkle Trees & Anti-Entropy Repair
Consistent Hashing & Ring Algorithms, Bloom Filters & Probabilistic Membership
MinHash & Locality-Sensitive Hashing — Probabilistic Data Structures
Difficulty: Advanced. Complexity: O(k) per signature, O(n*b) for LSH indexing
Key Points for MinHash & Locality-Sensitive Hashing
- MinHash estimates the Jaccard similarity between two sets without comparing them element by element. Two documents with 80% word overlap will have matching MinHash values about 80% of the time. This turns a potentially expensive set intersection into a fast signature comparison
- An LSH (Locality-Sensitive Hashing) index groups similar items into the same hash bucket with high probability. Unlike regular hash functions that spread similar inputs apart, LSH deliberately creates collisions for similar items. This enables sub-linear similarity search
- The banding technique controls the trade-off between false positives and false negatives. Split the signature into b bands of r rows each. Two items are candidates if they match in at least one band. More bands with fewer rows catch more true pairs but also more false ones
- SimHash (Charikar, 2002) is the cosine similarity counterpart to MinHash. It produces a fixed-size binary fingerprint where the Hamming distance between fingerprints approximates the cosine distance between the original vectors. Google used SimHash for web page deduplication at scale
- The combination of MinHash signatures and LSH indexing reduces near-duplicate detection from O(n²) pairwise comparisons to roughly O(n), making it practical on datasets with billions of items
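A sketch of the signature and the similarity estimate (hedged: real implementations use a proper universal hash family; XOR-ing Python's built-in `hash` with random salts is a simplification that works adequately for illustration):

```python
import random

def minhash_signature(items, k=256, seed=1):
    # Simulate k independent hash functions by XOR-ing hash(x) with k fixed salts.
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(k)]
    return [min(hash(x) ^ s for x in items) for s in salts]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature slots estimates Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Comparing two signatures of 256 slots replaces an explicit set intersection, and the estimation error shrinks as 1/sqrt(k).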
Production Systems Using MinHash & Locality-Sensitive Hashing
- Google web dedup (SimHash)
- Uber geospatial matching
- Spotify playlist similarity
- Twitter timeline dedup
- Apache Spark MLlib
- Pinecone (vector DB, LSH index)
Common Mistakes with MinHash & Locality-Sensitive Hashing
- Using too few hash functions in the MinHash signature. The estimation error of Jaccard similarity is proportional to 1/sqrt(k) where k is the number of hash functions. With 100 hash functions, your error is about 10%. With 400, it is about 5%. Skimping on k for speed gives unreliable results
- Applying MinHash to weighted or ordered data without adaptation. Standard MinHash works on unweighted sets (is this element present or not). For weighted sets (TF-IDF vectors), you need weighted MinHash. For ordered sequences (sentences), you need shingling first
- Setting LSH bands and rows without understanding the S-curve. The probability of two items becoming candidates is 1-(1-s^r)^b where s is the true similarity. Plotting this curve for your chosen b and r shows you exactly where the threshold falls and how steep it is
- Running all-pairs similarity on a large dataset without LSH. Pairwise comparison of 10 million items is 50 trillion comparisons. No amount of hardware makes this feasible. LSH reduces it to checking only items that fall in the same bucket, which is typically a few orders of magnitude fewer pairs
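The S-curve from the banding discussion is easy to probe directly. A small sketch; the threshold approximation (1/b)^(1/r) is the standard rule of thumb, not an exact inflection point:

```python
def candidate_probability(s, b, r):
    # Probability that two items with true similarity s match in at least
    # one of b bands of r rows: 1 - (1 - s^r)^b.
    return 1 - (1 - s ** r) ** b

def approx_threshold(b, r):
    # Similarity around which the S-curve rises steeply: roughly (1/b)^(1/r).
    return (1 / b) ** (1 / r)
```

With b=20 and r=5, pairs at 0.8 similarity are almost always candidates while pairs at 0.3 almost never are, which is exactly the filtering behavior LSH is chosen for.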
Related to MinHash & Locality-Sensitive Hashing
Bloom Filters & Probabilistic Membership, HyperLogLog & Cardinality Estimation, BM25 — Best Matching 25
Operational Transformation — Consensus & Coordination
Difficulty: Expert. Complexity: O(n²) worst-case transform, O(n) typical
Key Points for Operational Transformation
- OT lets multiple users edit the same document simultaneously by transforming each operation against every concurrent operation before applying it. The transform adjusts character positions so that every client converges to the same final document, even when edits arrive out of order
- OT requires a central server (or at least a total ordering mechanism) to sequence operations. The server is the single source of truth for operation order, which makes correctness tractable but means offline editing and peer-to-peer topologies are not naturally supported
- The correctness of OT depends on transformation functions satisfying TP1: for any two concurrent operations o1 and o2, applying o1 followed by o2 transformed against o1 must produce the same state as applying o2 followed by o1 transformed against o2. TP2 extends this to three or more concurrent operations. Multiple published algorithms were later found to violate TP2
- Google Docs has run OT in production since 2006, making it the most battle-tested approach for real-time collaborative text editing. The Jupiter protocol (client prediction + server canonicalization) handles latency by letting clients apply their own edits immediately and reconcile later
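A minimal sketch of position transformation for character-based OT. Assumptions are loud here: ops are simplified to ('ins', pos, text) and ('del', pos, len) tuples, ties between concurrent inserts at the same position are broken by a priority flag standing in for site IDs, and overlapping deletes are not handled:

```python
def apply_op(doc, op):
    kind, pos, arg = op
    if kind == 'ins':
        return doc[:pos] + arg + doc[pos:]
    return doc[:pos] + doc[pos + arg:]           # 'del' removes arg characters

def transform(op, other, priority=False):
    """Shift op's position so it applies correctly after `other` has been applied."""
    kind, pos, arg = op
    okind, opos, oarg = other
    if okind == 'ins':
        if pos < opos or (pos == opos and priority):
            return op                             # op lands left of other's insert
        return (kind, pos + len(oarg), arg)       # shifted right by the inserted text
    # other is a delete
    if pos <= opos:
        return op
    return (kind, max(opos, pos - oarg), arg)     # collapse into the deleted range if inside it

# Convergence check: both application orders end with the same document.
doc = "abc"
a = ('ins', 1, 'X')                               # site 1 (wins ties)
b = ('ins', 2, 'Y')                               # site 2
site1 = apply_op(apply_op(doc, a), transform(b, a, priority=False))
site2 = apply_op(apply_op(doc, b), transform(a, b, priority=True))
```

Both sites converge on "aXbYc" even though the operations were applied in different orders, which is TP1 in miniature.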
Production Systems Using Operational Transformation
- Google Docs (Jupiter protocol, real-time collaborative editing)
- Google Sheets (structured OT for cell operations)
- Apache Wave (formerly Google Wave)
- CKEditor 5 (rich text OT)
- Firepad (Firebase-backed collaborative editor)
- Codox (collaborative Word/Excel plugins)
Common Mistakes with Operational Transformation
- Writing transform functions that handle insert-insert and insert-delete but forget delete-delete interactions at the same position. Edge cases around overlapping deletions are where most OT bugs hide
- Trying to build OT without a central server. Pure peer-to-peer OT exists in theory, but the TP2 correctness requirement makes it extremely hard to get right. Most peer-to-peer implementations that claim OT are actually doing something closer to CRDTs
- Assuming OT is only for plain text. Google Docs uses OT for rich text with formatting, and Google Sheets extends it to spreadsheet operations. But every new operation type multiplies the number of transform function pairs you need to write and prove correct
- Not implementing operation compression for undo. A naive undo stack stores raw operations, but transformed operations change shape as they pass through the system. You need inverse transforms that account for the operation's history, not just the original
Related to Operational Transformation
CRDTs: Conflict-Free Replicated Data Types, Logical Clocks & Causality Tracking
Paxos Consensus — Consensus & Coordination
Difficulty: Expert. Complexity: O(n) messages per round
Key Points for Paxos Consensus
- Paxos is the original solution to distributed consensus. Leslie Lamport published it in 1998 (written in 1990), and every consensus algorithm since then, including Raft, Zab, and Viewstamped Replication, is either a variant of Paxos or heavily influenced by it
- Single-decree Paxos agrees on exactly one value through two phases: Prepare/Promise (proposer claims a ballot number, acceptors promise not to accept older ballots) and Accept/Accepted (proposer sends the value, acceptors accept if they have not promised a higher ballot)
- Multi-Paxos optimizes for the common case by electing a stable leader. Once a leader is established, it skips the Prepare phase entirely and runs only the Accept phase for each new value. This cuts consensus down to one round trip
- The FLP impossibility result proves that no deterministic consensus protocol can guarantee progress in an asynchronous system with even one crash failure. Paxos handles this by using timeouts and leader election, which can livelock but never violate safety
- Google Spanner, Google Chubby, Azure Storage, and Apache Zookeeper (via the ZAB protocol, a Paxos variant) all run Paxos-family protocols in production at massive scale
Production Systems Using Paxos Consensus
- Google Spanner (Multi-Paxos)
- Google Chubby lock service
- Azure Storage replication
- Apache Zookeeper (ZAB variant)
- WANdisco active-active replication
- Google Megastore
Common Mistakes with Paxos Consensus
- Conflating Paxos with leader election. Paxos does not require a leader. Single-decree Paxos works with multiple concurrent proposers. The leader optimization in Multi-Paxos is a performance improvement, not a correctness requirement
- Assuming that understanding single-decree Paxos means you can implement Multi-Paxos. The gap between agreeing on one value and building a replicated log involves dozens of engineering decisions that the original paper does not cover: snapshotting, membership changes, read leases, client interaction
- Ignoring the livelock scenario. Two proposers alternating Prepare messages with increasing ballot numbers can prevent either from completing the Accept phase. A randomized backoff or leader election breaks the livelock, but you have to implement it
- Thinking that Paxos is too complex to be practical. The single-decree protocol is about 20 lines of pseudocode. What is genuinely complex is building a full replicated state machine on top of it, but that complexity exists regardless of which consensus protocol you choose
Related to Paxos Consensus
Raft Consensus Protocol, Quorum Systems & NRW Notation, Two-Phase Commit (2PC)
PBFT — Practical Byzantine Fault Tolerance — Consensus & Coordination
Difficulty: Expert. Complexity: O(n²) messages per round
Key Points for PBFT — Practical Byzantine Fault Tolerance
- PBFT handles Byzantine faults: nodes that can lie, send conflicting messages, or behave arbitrarily. Raft and Paxos only handle crash faults where nodes either work correctly or stop completely. If you need consensus in an adversarial environment, you need something like PBFT
- The protocol tolerates up to f Byzantine nodes out of 3f+1 total. With 4 nodes you tolerate 1 Byzantine failure. With 7 you tolerate 2. The 3f+1 bound is fundamental and cannot be improved for deterministic protocols
- Three phases drive consensus: pre-prepare (leader proposes), prepare (replicas echo the proposal), commit (replicas confirm they saw enough prepares). A client waits for f+1 matching replies from different replicas before accepting the result
- View changes handle a Byzantine leader. If the leader is faulty or slow, replicas trigger a view change that rotates leadership to the next replica. The protocol ensures no committed request is lost during the transition
- The O(n²) message complexity is the main scalability bottleneck. Every replica sends messages to every other replica in the prepare and commit phases. Beyond about 20 nodes, the network overhead becomes prohibitive for high-throughput systems
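The 3f+1 arithmetic and the all-to-all cost are worth having as executable formulas. A sketch; the message count is a rough tally of the two broadcast phases, ignoring pre-prepare and client replies:

```python
def max_byzantine_faults(n):
    # Largest f satisfying n >= 3f + 1.
    return (n - 1) // 3

def bft_quorum(n):
    # Prepare/commit certificates need 2f + 1 matching messages.
    return 2 * max_byzantine_faults(n) + 1

def broadcast_messages(n):
    # Prepare and commit are each all-to-all: two rounds of n * (n - 1) messages.
    return 2 * n * (n - 1)
```

Growing the cluster 10x grows the per-round message load by roughly 100x, which is why validator sets stay small.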
Production Systems Using PBFT — Practical Byzantine Fault Tolerance
- Hyperledger Fabric (PBFT variant)
- Tendermint / CometBFT
- Zilliqa (pBFT + PoW hybrid)
- IBM blockchain platform
- dYdX (CometBFT-based)
- Cosmos SDK chains
Common Mistakes with PBFT — Practical Byzantine Fault Tolerance
- Assuming PBFT scales like Raft. Raft has O(n) message complexity per consensus round. PBFT has O(n²). A 100-node PBFT cluster sends 10,000 messages per round. This is why most blockchain systems use PBFT only for small validator sets
- Using PBFT when crash fault tolerance is sufficient. If your threat model only includes machines dying or losing network connectivity, Raft or Paxos is simpler, faster, and scales better. PBFT adds significant overhead that is only justified when nodes might actively misbehave
- Forgetting that PBFT requires a known, fixed membership. Unlike permissionless blockchain consensus (Nakamoto), PBFT requires every node to know every other node upfront. Adding or removing nodes requires a reconfiguration protocol
- Ignoring the authentication overhead. Every PBFT message must be authenticated (MAC or digital signature) to prevent forgery. In high-throughput systems, the CPU cost of signing and verifying thousands of messages per second is substantial
Related to PBFT — Practical Byzantine Fault Tolerance
Raft Consensus Protocol, Paxos Consensus, Quorum Systems & NRW Notation
Quorum Systems & NRW Notation — Failure Detection & Quorums
Difficulty: Advanced. Complexity: O(W+R) per operation
Key Points for Quorum Systems & NRW Notation
- With N replicas, W (write quorum) and R (read quorum) define the consistency tradeoff. If W + R > N, every read sees at least one replica with the latest write because the write and read sets must overlap
- N=3, W=2, R=2 is the classic setup. It tolerates 1 node failure for both reads and writes while guaranteeing strong consistency. Cassandra, Riak, and DynamoDB all support this configuration
- Sloppy quorums sacrifice consistency for availability. When designated replicas are down, writes go to any available node. Hinted handoff eventually delivers the data to the correct replicas, but reads may be stale in the meantime
- W + R > N guarantees you will see the latest write, but it does not guarantee linearizability. Concurrent reads during a write can still observe different values. Additional mechanisms like read repair or blocking reads are needed for true linearizability
- Witness replicas participate in quorum votes without storing full data. They confirm they saw a write by storing just the key and version, useful for tie-breaking in even-numbered replica configurations
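The overlap claim is mechanically checkable: enumerate every possible write quorum and read quorum and verify they intersect exactly when W + R > N. A brute-force sketch:

```python
from itertools import combinations

def always_intersect(n, w, r):
    # True iff every size-w write set shares a replica with every size-r read set.
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))
```

Exhaustive checking over small N confirms the pigeonhole argument rather than trusting it.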
Production Systems Using Quorum Systems & NRW Notation
- DynamoDB (sloppy quorum with hinted handoff)
- Cassandra (configurable NRW per query)
- Riak (NRW with read repair)
- ZooKeeper (majority quorum for writes)
- etcd (majority quorum via Raft)
- CockroachDB (Raft-based quorum)
Common Mistakes with Quorum Systems & NRW Notation
- Assuming W + R > N gives you linearizability. It only gives regular register semantics. Without additional mechanisms like Raft or blocking writes, concurrent reads can disagree
- Using the same NRW configuration for every query type. Analytics reads that tolerate staleness should use R=1 for speed. Transaction reads that need consistency should use R=2. Cassandra supports per-query consistency levels for exactly this reason
- Forgetting that sloppy quorums can return stale data even with W + R > N. The N nodes that participated in the write might include non-designated nodes that the read quorum never contacts
- Not monitoring hinted handoff queue depth. If hints pile up faster than they drain, replicas are diverging and the consistency guarantees you think you have are eroding
Related to Quorum Systems & NRW Notation
Raft Consensus Protocol, Failure Detection Algorithms, Consistent Hashing & Ring Algorithms
Raft Consensus Protocol — Consensus & Coordination
Difficulty: Expert. Complexity: O(n) per commit, O(log n) disk
Key Points for Raft Consensus Protocol
- Raft decomposes consensus into three relatively independent sub-problems: leader election, log replication, and safety. Each one is tractable on its own, which is the whole reason Raft succeeded where Paxos confused everyone
- Every committed entry survives any future leader election because candidates must prove they have all committed entries before a majority will vote for them. This is the election restriction, and it is the single most important safety property in the protocol
- Leader election uses randomized timeouts, typically in the 150-300ms range the Raft paper suggests. The randomization prevents split votes from repeating indefinitely, and in practice elections settle in one or two rounds
- etcd, CockroachDB, Consul, TiKV, and RabbitMQ quorum queues all run Raft in production. It is the dominant consensus algorithm for strongly consistent replicated state machines
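The election restriction from the second bullet compares the (term, index) of each log's last entry. A sketch of the voter's check, with RequestVote handling, persistence, and term bookkeeping omitted:

```python
def candidate_log_ok(cand_last_term, cand_last_index, my_last_term, my_last_index):
    # Grant a vote only if the candidate's log is at least as up-to-date as ours:
    # the higher last term wins; equal terms fall back to the longer log.
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index
```

Because a committed entry is on a majority of logs, any candidate that wins a majority of votes under this check necessarily holds every committed entry.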
Production Systems Using Raft Consensus Protocol
- etcd (Kubernetes backing store)
- CockroachDB
- Consul
- TiKV (TiDB storage)
- HashiCorp Vault
- RabbitMQ Quorum Queues
Common Mistakes with Raft Consensus Protocol
- Setting election timeout too low in cloud environments. Network jitter on shared infrastructure can easily exceed 150ms, causing unnecessary leader elections that stall writes for the duration of the election
- Not implementing log compaction or snapshotting. Without it, the Raft log grows unbounded on disk and new followers joining the cluster must replay the entire history from the beginning
- Ignoring learner (non-voting) members when scaling reads. Adding voting members increases the cost of every commit because the leader must wait for a majority. Learners receive the log but do not participate in elections or commit quorums
- Assuming Raft handles Byzantine failures. It does not. Raft trusts that nodes follow the protocol honestly. A malicious node that sends fabricated log entries will corrupt the cluster. For Byzantine tolerance, you need BFT protocols like PBFT or HotStuff
Related to Raft Consensus Protocol
Gossip Protocols & SWIM, Quorum Systems & NRW Notation, Logical Clocks & Causality Tracking
Saga Pattern — Consensus & Coordination
Difficulty: Advanced. Complexity: O(n) steps, n = transaction participants
Key Points for Saga Pattern
- A saga is a sequence of local transactions where each step has a compensating action. If step 3 fails, steps 2 and 1 are rolled back by running their compensating actions in reverse order. This gives you eventual consistency across services without locking resources
- Two execution styles: choreography (each service listens for events and decides what to do next) and orchestration (a central coordinator tells each service what to do). Choreography is simpler for 2-3 services. Orchestration scales better and is easier to debug
- Sagas replace 2PC in microservice architectures because holding distributed locks across services is impractical. A payment service and a shipping service operated by different teams with different SLAs should not be coupled by a blocking transaction protocol
- Compensating actions are not the same as rollbacks. A database rollback undoes uncommitted changes. A compensating action is a new forward operation that semantically reverses a committed change. Refunding a charge is a compensating action for processing a payment
- Temporal (built by the creators of Uber's Cadence), AWS Step Functions, and Axon Framework are the most common saga orchestration platforms. They handle retries, timeouts, state persistence, and failure tracking so you do not have to build all of that from scratch
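A toy orchestrator makes the compensation ordering concrete. A sketch under assumed shapes: each step is a (action, compensate) pair of callables, and platforms like Temporal add the persistence, retries, and timeouts this omits:

```python
def run_saga(steps):
    # Run actions in order; on failure, undo completed steps in reverse order.
    completed = []
    try:
        for action, compensate in steps:
            action()
            completed.append(compensate)
    except Exception:
        for compensate in reversed(completed):
            compensate()
        return 'aborted'
    return 'completed'

log = []
def step(name, fail=False):
    # Hypothetical saga step: the action records its name, the compensation
    # records a semantic undo (a new forward operation, not a rollback).
    def action():
        if fail:
            raise RuntimeError(name + ' failed')
        log.append(name)
    return action, (lambda: log.append('undo-' + name))

outcome = run_saga([step('reserve'), step('charge'), step('ship', fail=True)])
```

When the third step fails, the charge is compensated before the reservation, mirroring the reverse-order rollback described above.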
Production Systems Using Saga Pattern
- Uber (Cadence / Temporal)
- AWS Step Functions
- Axon Framework
- Netflix Conductor
- MassTransit (.NET)
- Eventuate Tram (Java)
Common Mistakes with Saga Pattern
- Assuming every operation has a clean compensating action. Some operations cannot be undone. You cannot un-send an email. You cannot un-ship a package. For these, use techniques like delayed execution (do not send until the saga completes) or accept that some manual cleanup will be needed
- Not making saga steps idempotent. If step 3 fails halfway and gets retried, it might run twice. If it charges a customer without checking whether the charge already happened, you double-charge them. Every step and every compensating action must be safe to retry
- Using choreography with more than 4-5 services. Without a central coordinator, the flow is spread across event handlers in multiple services. Debugging why a saga got stuck means tracing events across service logs. Orchestration keeps the full flow visible in one place
- Ignoring the 'in-flight' state. Between steps, the system is in a partially completed state. A user checking their order might see inventory reserved but no payment processed. Design your UIs and APIs to handle intermediate states gracefully
Related to Saga Pattern
Two-Phase Commit (2PC), Raft Consensus Protocol, Circuit Breakers & Bulkheads
Shard Rebalancing & Virtual Buckets — Data Distribution
Difficulty: Advanced. Complexity: O(data moved / bandwidth)
Key Points for Shard Rebalancing & Virtual Buckets
- Shard rebalancing moves data between nodes when you add capacity, remove nodes, or when load becomes uneven. The core challenge is doing this without downtime and without killing the performance of the live system that is still serving reads and writes
- Range-based sharding splits the key space into contiguous ranges (e.g., keys A-M on shard 1, N-Z on shard 2). Easy to reason about and efficient for range queries, but prone to hotspots if certain key ranges get more traffic
- Hash-based sharding hashes keys and assigns the hash to a shard. This gives even distribution but makes range queries impossible because adjacent keys end up on different shards. DynamoDB and Cassandra use hash-based partitioning; CockroachDB is range-based by default (it rebalances via range splits and merges), with hash-sharded indexes as an opt-in for hotspot-prone keys
- Virtual buckets (or virtual shards) decouple the logical partition from the physical node. Create 1000 virtual buckets, assign them to 10 physical nodes. Adding an 11th node means reassigning ~90 buckets, not rehashing every key
- Online rebalancing while serving traffic requires throttling data movement to avoid saturating network and disk I/O. Every production system has a 'rebalancing rate limit' knob, and getting it right is the difference between a smooth expansion and a cascading outage
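The bucket arithmetic is easy to check directly. A sketch with a hypothetical round-robin initial placement; `add_node` moves only enough buckets to level the load, which is the whole point of virtual buckets:

```python
from collections import Counter

def add_node(assignment, new_node):
    # Steal buckets from over-target nodes until the new node has its fair share.
    counts = Counter(assignment.values())
    target = len(assignment) // (len(counts) + 1)
    out = dict(assignment)
    counts[new_node] = 0
    for bucket, node in assignment.items():
        if counts[new_node] >= target:
            break
        if counts[node] > target:
            out[bucket] = new_node
            counts[node] -= 1
            counts[new_node] += 1
    return out

before = {b: b % 10 for b in range(1000)}      # 1000 buckets on 10 nodes, 100 each
after = add_node(before, 10)                   # add an 11th node
moved = sum(before[b] != after[b] for b in before)
```

Only about 9% of buckets move, and no key-to-bucket mapping changes at all: clients keep hashing keys to buckets while the bucket-to-node map is updated.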
Production Systems Using Shard Rebalancing & Virtual Buckets
- Vitess (MySQL sharding)
- CockroachDB (range splits/merges)
- Apache Kafka (partition reassignment)
- DynamoDB (partition splitting)
- MongoDB (chunk migration)
- Citus (PostgreSQL sharding)
Common Mistakes with Shard Rebalancing & Virtual Buckets
- Rebalancing without rate limiting. Moving terabytes of data at full network speed starves the live workload of bandwidth and I/O. Cassandra, CockroachDB, and Kafka all have rebalancing throttle settings that default to conservative values for good reason
- Splitting shards by key count instead of data size. A shard with 1 million tiny keys and a shard with 1 million large blobs have the same key count but wildly different storage and I/O requirements. Split on bytes, not rows
- Not monitoring shard balance after the initial setup. Data distribution shifts as usage patterns change. The shard that was perfectly sized six months ago might now hold 3x the data of its neighbors. Set up alerts on shard size deviation
- Doing a full rebalance when a partial one would suffice. If one shard is hot, split that shard. Do not reorganize the entire cluster. Minimizing data movement reduces risk and shortens the rebalancing window
Related to Shard Rebalancing & Virtual Buckets
Consistent Hashing & Ring Algorithms, Chord & Distributed Hash Tables, Quorum Systems & NRW Notation
Skip Lists — Storage Internals
Difficulty: Advanced. Complexity: O(log n) search/insert/delete
Key Points for Skip Lists
- A skip list is a layered linked list where higher layers skip over multiple elements, giving you O(log n) search, insert, and delete on a data structure that is conceptually much simpler than a balanced binary tree
- Level assignment is randomized: each new node gets level 1, then flips a coin repeatedly to decide if it should also appear at level 2, level 3, and so on. This probabilistic approach gives expected O(log n) performance without any rebalancing logic
- Redis chose skip lists over balanced trees for sorted sets (ZSET) because skip lists are simpler to implement, naturally support range queries, and are easier to reason about for concurrent access. Antirez has said as much in the Redis documentation
- LevelDB and RocksDB use skip lists for their in-memory memtable because skip lists support concurrent reads without locking and concurrent inserts with minimal synchronization. When the memtable fills up, it flushes to an SSTable on disk
- The space overhead is modest. With p=0.25 (each node has a 25% chance of being promoted to the next level), the expected number of pointers per node is 1.33. With p=0.5, it is 2. In practice, the pointer overhead is small compared to the keys and values stored
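The coin-flip level assignment and the pointer-overhead math fit in a few lines. A sketch, with a seeded RNG for reproducibility; the expected pointers per node is 1/(1-p) because a node at level L carries L forward pointers:

```python
import random

def random_level(p=0.25, max_level=32, coin=random.random):
    # Start at level 1; promote with probability p until a coin flip fails.
    level = 1
    while coin() < p and level < max_level:
        level += 1
    return level

def expected_pointers_per_node(p):
    # E[level] = 1 / (1 - p): p=0.25 gives 1.33 pointers, p=0.5 gives 2.
    return 1 / (1 - p)
```

Sampling many levels and averaging recovers the 1.33 figure quoted above for p=0.25.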
Production Systems Using Skip Lists
- Redis sorted sets (ZSET)
- LevelDB / RocksDB memtable
- MemSQL (SingleStore) indexes
- Apache HBase MemStore
- Java ConcurrentSkipListMap
- Lucene (older versions, doc values)
Common Mistakes with Skip Lists
- Using p=0.5 (50% promotion probability) without thinking about it. p=0.5 gives the best search performance but uses more memory (2 pointers per node on average). p=0.25 uses less memory (1.33 pointers) with only slightly slower search. Redis uses p=0.25
- Setting maxLevel too low. The maximum level should be about log base 1/p of the expected number of elements. With 1 million elements and p=0.25, you need at least 10 levels. Too few levels and the top layers do not skip enough, degrading to linear search
- Forgetting that skip lists are probabilistic. A single skip list can get unlucky and have O(n) performance if the random level assignments are adversarial. In practice with a reasonable random number generator, this essentially never happens, but it is good to know the worst case
- Implementing concurrent skip lists naively. Lock-free skip list insertion requires careful ordering of pointer updates and memory barriers. The Herlihy and Shavit algorithm is the standard reference, but getting it right is harder than it looks
Related to Skip Lists
LSM Trees vs B-Trees, Write-Ahead Log
SSTable & Compaction Strategies — Storage Internals
Difficulty: Advanced. Complexity: Write amp: O(L) leveled, O(T) size-tiered
Key Points for SSTable & Compaction Strategies
- An SSTable (Sorted String Table) is an immutable, sorted file of key-value pairs with an index block for fast lookups. Immutability is the key design decision: writes never modify existing files, they create new ones. This makes writes fast and crash recovery simple
- Compaction merges multiple SSTables into fewer, larger ones, discarding deleted keys (tombstones) and keeping only the latest version of each key. Without compaction, reads would need to check every SSTable, and disk space would grow without bound
- Size-Tiered Compaction (STCS) groups SSTables by size and merges them when enough similar-sized files accumulate. Write-optimized but can temporarily use 2x disk space during compaction and creates wide variation in SSTable sizes
- Leveled Compaction (LCS) organizes SSTables into levels with exponentially increasing size limits. Each level is sorted with no key overlap (except L0). Read-optimized because any key exists in at most one SSTable per level, but write amplification is higher
- Choosing the wrong compaction strategy for your workload is one of the most common performance mistakes in Cassandra and RocksDB. Write-heavy workloads want STCS. Read-heavy workloads with updates want LCS. Time-series data often wants FIFO or a time-window strategy
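The core of any compaction strategy is a newest-wins merge that drops tombstones. A sketch over in-memory lists of (key, value) pairs with None as the tombstone marker; real engines stream sorted files through a heap merge instead of materializing a dict:

```python
def compact(sstables):
    # sstables are ordered newest-first; the first write seen for a key wins.
    merged = {}
    for table in sstables:
        for key, value in table:
            merged.setdefault(key, value)
    # Drop tombstones and emit a new sorted, immutable SSTable.
    return [(k, v) for k, v in sorted(merged.items()) if v is not None]
```

Note that the tombstone for 'b' can only be discarded because the compaction sees every older SSTable that might still contain 'b'; this is exactly the tombstone-accumulation problem described below.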
Production Systems Using SSTable & Compaction Strategies
- Apache Cassandra
- RocksDB / LevelDB
- ScyllaDB
- Apache HBase
- CockroachDB (via RocksDB/Pebble)
- InfluxDB (TSI/TSM)
Common Mistakes with SSTable & Compaction Strategies
- Running leveled compaction on a write-heavy workload without enough disk I/O headroom. LCS rewrites data multiple times as it flows through levels (10-30x write amplification). On spinning disks or underpowered SSDs, compaction falls behind and the system chokes
- Ignoring tombstone accumulation. Deleted keys leave tombstones that persist until compaction removes them. If compaction does not reach old SSTables, tombstones pile up and reads slow down because the system has to process deletes it cannot discard yet
- Not reserving enough disk space for compaction. STCS needs up to 50% free space during a major compaction because it writes new SSTables before deleting old ones. Running at 80% disk utilization with STCS is a ticking time bomb
- Using FIFO compaction for data you query by key. FIFO drops entire SSTables when they exceed a TTL or size limit. It does not merge or sort. If you need to look up individual keys in FIFO-compacted data, reads are slow because there is no compaction to consolidate keys
Related to SSTable & Compaction Strategies
LSM Trees vs B-Trees, Write-Ahead Log, Skip Lists, Bloom Filters & Probabilistic Membership
Streaming Quantiles: T-Digest & DDSketch — Probabilistic Data Structures
Difficulty: Expert. Complexity: O(log n) per add, O(compression) space
Key Points for Streaming Quantiles: T-Digest & DDSketch
- You cannot average percentiles. Averaging the P99 of shard A with the P99 of shard B does not give you the global P99. This is the fundamental reason streaming quantile sketches exist
- T-Digest uses weighted centroids that cluster more tightly at the tails (P1, P99) where accuracy matters most for latency monitoring. The compression parameter (typically 100-300) controls the accuracy-memory trade-off
- DDSketch uses logarithmic bin boundaries to provide a relative error guarantee. If relative error is 1%, your P99 estimate is within 1% of the true value regardless of the underlying distribution shape
- Both T-Digest and DDSketch support merging from multiple servers, which makes them suitable for computing global percentiles across a distributed fleet without centralizing raw data
- Prometheus histogram_quantile, Datadog DDSketch, Elasticsearch percentile aggregation, and Envoy proxy latency histograms all rely on streaming quantile algorithms in production
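The first key point deserves a worked example. A toy demonstration; the `percentile` helper is a simple nearest-rank baseline, not a sketch structure, and exists only to show that averaging per-shard P99s misses a concentrated tail:

```python
def percentile(values, p):
    # Nearest-rank percentile on a sorted copy.
    s = sorted(values)
    rank = max(1, round(p / 100 * len(s)))
    return s[rank - 1]

shard_a = [10] * 100                    # a healthy shard: P99 = 10ms
shard_b = [10] * 50 + [500] * 50        # a shard with a heavy tail: P99 = 500ms

avg_of_p99s = (percentile(shard_a, 99) + percentile(shard_b, 99)) / 2
true_global_p99 = percentile(shard_a + shard_b, 99)
```

The averaged value (255ms) understates the true global P99 (500ms) by nearly half; mergeable sketches like T-Digest and DDSketch exist so the merge happens on the data, not on the quantiles.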
Production Systems Using Streaming Quantiles: T-Digest & DDSketch
- Prometheus histogram_quantile
- Datadog Agent DDSketch
- Elasticsearch percentile aggregation
- Apache Spark approxQuantile
- ClickHouse quantileTDigest
- Envoy proxy latency histograms
Common Mistakes with Streaming Quantiles: T-Digest & DDSketch
- Averaging percentiles across shards or time windows. This is mathematically wrong and produces values that can be arbitrarily far from the true global percentile. A system with 100 shards each reporting P99 = 50ms might actually have a global P99 of 500ms if the tail latency is concentrated on a few shards
- Using fixed-width histogram bins for long-tail latency distributions. Bins of [0-10ms, 10-20ms, 20-30ms, ...] waste resolution on the low end where everything clusters and have no resolution at the tail where the interesting behavior lives. Exponential or logarithmic bins are almost always better
- Setting T-Digest compression too low for tail accuracy. A compression parameter of 25 might seem fine when you look at P50, but the P99.9 estimate can be off by 30% or more. For SLO monitoring, use compression 200+ and verify tail accuracy against known distributions
- Treating Prometheus default histogram buckets as precise quantile estimates. Prometheus histograms use linear interpolation within fixed buckets, and the default buckets (5ms, 10ms, 25ms, 50ms, ...) produce quantile estimates that can be wildly off if your actual latency distribution does not align with the bucket boundaries
Related to Streaming Quantiles: T-Digest & DDSketch
HyperLogLog & Cardinality Estimation, Count-Min Sketch & Frequency Estimation
Two-Phase Commit (2PC) — Consensus & Coordination
Difficulty: Advanced. Complexity: O(n) messages, 2 round trips
Key Points for Two-Phase Commit (2PC)
- 2PC solves the atomic commit problem: either all participants in a distributed transaction commit, or all abort. There is no state where participant A committed and participant B aborted. This is the guarantee that makes distributed transactions possible
- Phase 1 (Prepare): the coordinator asks each participant 'can you commit?' Each participant acquires locks, writes a prepare record to its WAL, and votes yes or no. A yes vote is a binding promise to commit if asked. Phase 2 (Commit/Abort): the coordinator collects votes. If all voted yes, send commit. If any voted no, send abort
- The blocking problem is the biggest weakness. If the coordinator crashes after sending prepare but before sending the commit decision, participants are stuck holding locks and cannot proceed. They voted yes and promised to commit if asked, but nobody is asking. 3PC was invented to address this, but it trades blocking for other problems
- Google Spanner solves the blocking problem by making the coordinator itself a Paxos group. If the coordinator leader crashes, another member of the Paxos group takes over and drives the commit decision. This is 2PC-over-Paxos, and it is the standard approach for production distributed databases
- MySQL XA transactions, PostgreSQL PREPARE TRANSACTION, CockroachDB, and Google Spanner all use 2PC or close variants for cross-shard or cross-database transactions
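The happy path and the abort path of 2PC fit in a short sketch. These are in-process stand-ins: real participants write a prepare record to their WAL and hold locks between the phases, and a real coordinator must persist its decision before broadcasting it:

```python
class Participant:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = 'init'

    def prepare(self):
        # A yes vote is a binding promise to commit if asked.
        self.state = 'prepared' if self.can_commit else 'abort-voted'
        return self.can_commit

    def commit(self):
        self.state = 'committed'

    def abort(self):
        self.state = 'aborted'

def two_phase_commit(participants):
    # Phase 1: collect votes. Phase 2: broadcast the unanimous decision.
    if all(p.prepare() for p in participants):
        for p in participants:
            p.commit()
        return 'committed'
    for p in participants:
        p.abort()
    return 'aborted'
```

A single no vote aborts every participant, which is the all-or-nothing guarantee; the blocking problem appears when the coordinator dies between the two loops.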
Production Systems Using Two-Phase Commit (2PC)
- MySQL XA transactions
- PostgreSQL PREPARE TRANSACTION
- Google Spanner (2PC over Paxos)
- CockroachDB distributed transactions
- Microsoft DTC
- Kafka transactions (variant)
Common Mistakes with Two-Phase Commit (2PC)
- Not planning for coordinator failure. Running 2PC with a single coordinator is asking for trouble in production. Either make the coordinator replicated (Spanner approach) or implement a recovery protocol where participants can query each other to learn the outcome after a coordinator crash
- Holding locks during the prepare phase without a timeout. A participant that voted yes must hold its locks until it hears commit or abort. If the coordinator is slow, those locks can block other transactions for a long time. Set a timeout and abort if no decision arrives
- Using 2PC across services with different availability requirements. If your payment service is in 2PC with your notification service, a notification service outage blocks payments. Keep 2PC boundaries tight: within a database or between closely coupled shards, not across loosely coupled microservices
- Confusing 2PC with consensus protocols like Paxos or Raft. 2PC decides whether to commit a specific transaction. Paxos decides on a value in a replicated log. They solve different problems and are often used together (Spanner uses 2PC across Paxos groups)
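The lock-timeout advice can be sketched from the coordinator's side, where it is always safe: before the commit decision is logged, the coordinator is free to abort, so a slow or unresponsive vote simply becomes an abort. (Names are hypothetical; participants here are plain callables returning a bool vote or raising `TimeoutError`.)

```python
import time


def prepare_with_timeout(participants, txn, timeout_s):
    """Phase 1 with a deadline: a missing, slow, or negative vote aborts.

    Hedged sketch -- real coordinators poll participants asynchronously
    rather than iterating sequentially as done here.
    """
    deadline = time.monotonic() + timeout_s
    for p in participants:
        if time.monotonic() > deadline:
            return "abort"  # safe: no commit decision has been made yet
        try:
            if not p(txn):
                return "abort"
        except TimeoutError:
            return "abort"
    return "commit"
```

Note the asymmetry: a *participant* that already voted yes cannot abort unilaterally on timeout; its timeout should trigger a recovery query to the coordinator or its peers instead.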
Related to Two-Phase Commit (2PC)
Paxos Consensus, Raft Consensus Protocol, Write-Ahead Log
Vector Clocks & Version Vectors — Clocks & Ordering
Difficulty: Expert. Complexity: O(n) per comparison, n = nodes
Key Points for Vector Clocks & Version Vectors
- Lamport clocks tell you 'event A happened before event B' but cannot tell you whether two events are truly concurrent. Vector clocks can. If neither vector dominates the other (neither is component-wise less than or equal to the other), the events are concurrent, and you know there might be a conflict
- Each node maintains a vector of counters, one per node in the system. When a node does local work, it increments its own counter. When it sends a message, it attaches the full vector. When it receives a message, it takes the component-wise maximum of its vector and the incoming vector, then increments its own counter
- Amazon's Dynamo (the system described in the 2007 paper, not today's DynamoDB service), Riak, and Voldemort all used vector clocks (or close variants) for conflict detection in their eventually consistent replication models. When two replicas accept concurrent writes to the same key, the vector clocks reveal the conflict so the application can resolve it
- The size problem is real: the vector grows linearly with the number of nodes that have ever touched the data. In a system with thousands of nodes, each key carrying a vector of thousands of counters is expensive. Dotted version vectors and interval tree clocks were invented to address this
- Version vectors are technically different from vector clocks, though the terms are often used interchangeably. Vector clocks track causal ordering of events. Version vectors track causal ordering of object versions. The mechanics are nearly identical, but the semantics differ in subtle ways that matter when designing garbage collection and pruning
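The increment / merge / compare rules above fit in a short sketch; function names are illustrative, and the vector is a plain dict keyed by node ID:

```python
def tick(vc, node):
    """Local event: increment this node's own counter."""
    vc = dict(vc)
    vc[node] = vc.get(node, 0) + 1
    return vc


def receive(local, incoming, node):
    """On message receipt: component-wise max, then tick own counter."""
    merged = {n: max(local.get(n, 0), incoming.get(n, 0))
              for n in set(local) | set(incoming)}
    return tick(merged, node)


def compare(a, b):
    """Partial order: happens-before, happens-after, equal, or concurrent."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "happens-before"
    if b_le_a:
        return "happens-after"
    return "concurrent"
```

Two nodes that each do local work without communicating produce incomparable vectors, which is exactly the conflict signal an eventually consistent store needs.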
Production Systems Using Vector Clocks & Version Vectors
- Amazon Dynamo (vector clocks, per the 2007 paper)
- Riak (dotted version vectors)
- Voldemort (LinkedIn)
- CouchDB conflict detection
Common Mistakes with Vector Clocks & Version Vectors
- Confusing happens-before with wall-clock before. Vector clocks track causality, not real time. Event A can happen-before event B even if A occurred later in wall-clock time, as long as there is a causal chain from A to B
- Growing the vector indefinitely without pruning. In a system where nodes join and leave frequently, the vector accumulates entries for nodes that no longer exist. You need a pruning strategy (remove entries below a threshold, use node IDs that can be reused) or the metadata bloats over time
- Using vector clocks for ordering when you actually need a total order. Vector clocks give a partial order. Many events will be concurrent (incomparable). If your application needs a single definitive ordering of all events, you need something like Lamport timestamps plus a tiebreaker, or a centralized sequencer
- Not handling the conflict resolution at the application level. Vector clocks detect conflicts; they do not resolve them. If two concurrent writes happen, someone has to decide which one wins: last-writer-wins, merge, ask the user, or some domain-specific logic. Punting this decision to 'the system will handle it' leads to data loss
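When a total order is genuinely required, the usual trick mentioned above is a Lamport timestamp with the node ID as tiebreaker; a minimal sketch (event tuples are hypothetical):

```python
# Each event carries (lamport_timestamp, node_id, payload). Sorting on the
# (timestamp, node_id) pair yields one definitive order even for events
# that are causally concurrent.
events = [(3, "B", "write-x"), (3, "A", "write-y"), (1, "C", "write-z")]
ordered = sorted(events, key=lambda e: (e[0], e[1]))
# Ties (same timestamp) are broken arbitrarily but deterministically by
# node ID, so every replica computes the same order.
```

The tiebreaker is arbitrary with respect to causality, which is exactly the trade-off: you gain a total order but lose the ability to see which events were concurrent.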
Related to Vector Clocks & Version Vectors
Logical Clocks & Causality Tracking, CRDTs & Conflict-Free Replicated Data Types, Quorum Systems & NRW Notation
Write-Ahead Log — Storage Internals
Difficulty: Advanced. Complexity: O(1) append, sequential I/O
Key Points for Write-Ahead Log
- Every durable database writes intentions to a sequential log before touching actual data files. If the process crashes mid-write, the log contains everything needed to either finish the operation or roll it back cleanly
- Sequential I/O is the reason WAL works so well. SSDs can sustain 2-4 GB/s sequentially while random 4KB writes drop to 100-400 MB/s. On spinning disks, the gap is even more dramatic
- Group commit batches multiple transactions into a single fsync call, turning the per-write overhead of durable logging into amortized cost. PostgreSQL does this automatically. Throughput improvements of 10x or more are common
- The WAL doubles as a replication transport. PostgreSQL streams WAL segments to replicas. Raft uses the log as its core abstraction. Kafka is, at its heart, a distributed WAL with consumer offsets
- Checkpointing bounds recovery time by periodically flushing dirty pages to disk and advancing the WAL replay start position. Without checkpointing, crash recovery would replay every write since the database was created
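The write-path discipline above can be sketched in a few lines. This is an illustrative toy (class name and newline-delimited JSON format are assumptions, not any real database's format): append and fsync the intention first, apply it afterwards, and replay the log on restart:

```python
import json
import os


class WriteAheadLog:
    """Toy WAL: append-only file, one JSON record per line."""

    def __init__(self, path):
        self.path = path
        self.f = open(path, "a", encoding="utf-8")

    def append(self, record):
        # The intention must be durable BEFORE any data file is touched.
        self.f.write(json.dumps(record) + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # this fsync is the durability point

    def replay(self):
        # Crash recovery: re-read every logged intention, in order.
        with open(self.path, "r", encoding="utf-8") as f:
            return [json.loads(line) for line in f]
```

Group commit amounts to batching several `append` payloads behind a single `fsync`, which is why it can improve throughput by an order of magnitude without weakening durability.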
Production Systems Using Write-Ahead Log
- PostgreSQL WAL (pg_wal)
- MySQL InnoDB redo log
- RocksDB WAL
- Kafka commit log
- etcd Raft WAL
- SQLite WAL mode
- CockroachDB Pebble WAL
Common Mistakes with Write-Ahead Log
- Disabling fsync for benchmarks and forgetting to re-enable it in production. This is shockingly common and leads to silent data corruption on crash. Always verify fsync settings before any production deployment
- Not monitoring WAL disk usage. When the WAL disk fills up, most databases halt entirely rather than risk corruption. PostgreSQL will refuse new writes. Set alerts at 70% disk usage
- Ignoring WAL fsync latency during performance tuning. The WAL fsync is often the bottleneck for write-heavy workloads, not CPU or memory. Faster storage under the WAL directory can transform throughput
- Not configuring WAL archiving for point-in-time recovery. Without archived WAL segments, you can only restore to the last full backup, potentially losing hours of transactions
Related to Write-Ahead Log
LSM Trees vs B-Trees, Raft Consensus Protocol
ZSTD Compression (Zstandard) — Storage Internals
Difficulty: Intermediate. Complexity: O(n) compression, O(n) decompression, tunable ratio via levels 1-22
Key Points for ZSTD Compression (Zstandard)
- ZSTD combines LZ77-style dictionary matching (find repeated byte sequences, replace with back-references) with two entropy coders: Finite State Entropy (FSE) for match lengths and offsets, and Huffman coding for literal bytes. This three-stage pipeline achieves gzip-level compression ratios at LZ4-level speeds
- Compression levels 1-22 let you trade CPU time for compression ratio on the same data. Level 1 compresses at ~500 MB/s with ~2.9x ratio. Level 3 (default) hits ~400 MB/s at ~3.3x. Level 9 drops to ~80 MB/s but reaches ~3.8x. Decompression speed stays at ~1500 MB/s no matter which level was used to compress. That asymmetry matters for write-once-read-many workloads like cold storage, where you pay the compression cost once but decompress on every read
- Dictionary mode is ZSTD's secret weapon for small data. Standard compression needs enough bytes to learn patterns within each input. For small messages (1-10 KB, like Kafka records or log lines), there is not enough data to find patterns. Dictionary mode pre-trains a 100 KB dictionary on representative samples, then every message compresses against that shared context. Compression ratio improves 3-5x over standard mode for small inputs
- ZSTD is a general-purpose byte-level compressor. It sees no structure in the input, just bytes. Gorilla, by contrast, exploits time-series-specific patterns like regular timestamp intervals and slowly-changing float values. So Gorilla gets ~12x on sequential metric samples, while ZSTD gets ~3x on the same float64 arrays. Different tools for different tiers: Gorilla for hot-tier sequential samples, ZSTD for cold-tier aggregated data and Parquet blocks
- ZSTD replaced gzip as the default compressor in most modern data infrastructure: Apache Kafka (producer compression), Apache Parquet (column block compression), RocksDB (SST file compression), ClickHouse (column compression), Linux kernel (initramfs, btrfs). It compresses as well as gzip but decompresses 5x faster, and dictionary mode handles small messages that gzip struggles with. That is a hard combination to beat
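The level trade-off is easy to see empirically. The snippet below uses the stdlib `zlib` as a stand-in, since ZSTD's Python binding is the third-party `zstandard` package (whose `ZstdCompressor(level=...)` exposes the same knob); the principle — higher levels spend more CPU searching for matches and emit smaller output on compressible data — is the same:

```python
import zlib

# Repetitive, log-like data compresses well at any level.
data = b"timestamp=1700000000 status=200 path=/api/v1/metrics " * 2000

fast = zlib.compress(data, level=1)   # cheap, larger output
best = zlib.compress(data, level=9)   # expensive, smaller output

assert len(best) <= len(fast) < len(data)
# Decompression is a single code path regardless of the level that wrote
# the data, which is why read-heavy tiers pay no penalty for it.
assert zlib.decompress(fast) == zlib.decompress(best) == data
```

The same experiment with real ZSTD levels 1 through 22 shows the curve described above: ratio climbs slowly while compression speed falls off a cliff past the mid levels.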
Production Systems Using ZSTD Compression (Zstandard)
- Facebook (creator; developed by Yann Collet, released 2016)
- Apache Parquet (default column block compression)
- Apache Kafka (producer-side message compression)
- RocksDB (SST file compression, used by Flink state backend)
- ClickHouse (column compression codec)
- Linux kernel (btrfs filesystem, initramfs)
- VictoriaMetrics (cold-tier S3 Parquet export)
Common Mistakes with ZSTD Compression (Zstandard)
- Using high compression levels (15-22) in latency-sensitive paths. Level 19 compresses at ~5 MB/s, which is 100x slower than level 1. High levels only make sense for offline batch jobs like cold-tier compaction or archival. For real-time ingestion and Kafka messages, stick to levels 1-3
- Not using dictionary mode for small messages. Compressing 2 KB Kafka records with standard ZSTD gives ~1.5x ratio (barely worth the CPU). Training a dictionary on 1000 sample records and compressing with that dictionary gives ~4x. The dictionary is 100 KB of shared context that makes small-message compression viable
- Comparing ZSTD ratio to Gorilla ratio as if they do the same thing. Gorilla achieves 12x on time-series data because it exploits domain-specific structure (delta-of-delta timestamps, XOR-encoded values). ZSTD achieves ~3x on the same data because it only sees bytes. They solve different problems: Gorilla for hot-tier raw samples, ZSTD for cold-tier aggregated doubles in Parquet
- Ignoring decompression speed when choosing a compressor. ZSTD decompresses at ~1500 MB/s regardless of the compression level. gzip tops out at ~300 MB/s. On read-heavy workloads like cold-tier metric queries (write once, read many times), decompression speed matters more than compression speed, and that 5x gap is why ZSTD replaced gzip in Parquet and RocksDB
- Setting the same compression level across all data tiers. Hot-path Kafka messages need level 1 (fast, ~2.9x). Warm-tier Parquet blocks benefit from level 3-6 (~3.3-3.6x). Cold-tier archival can use level 9+ (~3.8x+). Each tier has a different latency budget, so each should use a different level
Related to ZSTD Compression (Zstandard)
Gorilla Compression (Delta-of-Delta + XOR), LSM Tree, Write-Ahead Log, Bloom Filters