Quorum Systems & NRW Notation
The Fundamental Question
You replicated your data to three servers. A write comes in. Do you wait for all three to acknowledge? Just one? Two? The answer determines whether your system is consistent, available, fast, or some combination.
NRW notation gives you a language for this tradeoff. N is the total number of replicas. W is the number of replicas that must acknowledge a write before it is considered successful. R is the number of replicas you read from and compare to return the latest value.
These three numbers encode your system's personality.
The Overlap Property
The central insight: if W + R > N, then the set of nodes you wrote to and the set of nodes you read from must share at least one node. That shared node has the latest value. Your read will always find it.
With N=3, W=2, R=2: you write to 2 of 3 replicas. You read from 2 of 3 replicas. No matter which 2 you wrote to and which 2 you read from, at least one node is in both sets. The pigeonhole principle at work. You have 3 slots, you filled 2 with writes and need 2 for reads. There is overlap.
The read process picks the response with the highest version number (or timestamp, depending on the conflict resolution strategy). Since at least one response comes from a node that received the latest write, the latest value is always among the responses.
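A minimal sketch of that read path, assuming an illustrative `node.get(key)` call that returns a versioned response (not any particular database's API):

```python
import random
from dataclasses import dataclass

@dataclass
class Response:
    version: int   # monotonically increasing version number or timestamp
    value: str

def quorum_read(replicas, key, R):
    """Read from any R of the N replicas and return the freshest response.

    If W + R > N, at least one of these R responses comes from a node
    that acknowledged the latest write, so max-by-version finds it.
    """
    responses = [node.get(key) for node in random.sample(replicas, R)]
    return max(responses, key=lambda r: r.version)
```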
Common Configurations and What They Mean
N=3, W=2, R=2. The workhorse configuration. Strong consistency. Tolerates 1 node failure for both reads and writes: the coordinator sends every write to all 3 replicas, so with one node down the 2 healthy replicas can still provide the W=2 acknowledgments, and any 2 healthy replicas satisfy R=2. This is Cassandra's QUORUM consistency level.
N=3, W=3, R=1. Strong consistency with fast reads. Every read touches just one replica and gets the latest value because every replica got every write. The catch: writes fail if any single replica is unavailable. One node down and your write path is dead. This works for read-dominated workloads where you can tolerate brief write outages.
N=3, W=1, R=3. Strong consistency with fast writes. Writes succeed as long as one replica is up (practically guaranteed). Reads must hit all three replicas to ensure they find the one that got the write. If any replica is down, reads fail. The mirror image of W=3/R=1.
N=3, W=1, R=1. Eventual consistency. Fast reads, fast writes, but no overlap guarantee. A read might hit the one replica that has not received the latest write yet. This is Cassandra's ONE consistency level. Fine for analytics, recommendations, and anything where reading slightly stale data is acceptable.
N=5, W=3, R=3. Strong consistency with better fault tolerance. Tolerates 2 node failures for both reads and writes. Used in systems that need higher availability and can afford the extra replication cost. ZooKeeper and etcd use 5-node (or 7-node) clusters with majority quorums for this reason.
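A quick way to sanity-check any of these configurations. The arithmetic is just the overlap rule plus the observation that a write needs W live replicas (so the write path survives N - W failures) and a read needs R (surviving N - R):

```python
def describe(N, W, R):
    return {
        "quorum overlap (W + R > N)": W + R > N,
        "failures writes survive": N - W,
        "failures reads survive": N - R,
    }

print(describe(3, 2, 2))  # overlap, 1 failure tolerated on each path
print(describe(3, 3, 1))  # overlap, fast reads, but writes survive 0 failures
print(describe(3, 1, 1))  # no overlap: eventual consistency
print(describe(5, 3, 3))  # overlap, 2 failures tolerated on each path
```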
Why W + R > N Does Not Give You Linearizability
This trips up a lot of experienced engineers. Quorum intersection guarantees that you see the latest write, but it does not guarantee that all concurrent readers agree on what they see at the same moment.
Scenario: Client A writes value V2 with W=2. Replicas 1 and 2 acknowledge. Replica 3 has not received V2 yet. Client B reads with R=2, hitting replicas 1 and 3. It sees V2 from replica 1 and V1 from replica 3, returns V2 (latest). Meanwhile, Client C reads with R=2, hitting replicas 2 and 3. It sees V2 from replica 2 and V1 from replica 3, returns V2.
That worked out. But what if the timing is slightly different? Client A's write reaches replica 1 but has not reached replica 2 yet. Client B reads from replicas 2 and 3, both of which still have V1. Client B returns V1. Client C reads from replicas 1 and 3, gets V2 from replica 1, returns V2. Now Client B and Client C read at approximately the same time and got different results. Client B saw V1, Client C saw V2. This is not linearizable.
For true linearizability, you need either a consensus protocol (Raft, Paxos) that orders operations globally, or additional mechanisms on top of quorums. CockroachDB and Spanner layer Raft on top of their replication. Cassandra with QUORUM gives you "regular register" semantics, which is weaker than linearizability but stronger than eventual consistency.
The ABD algorithm (Attiya, Bar-Noy, Dolev) achieves linearizability with quorums by adding a second round-trip: after reading, the client writes the latest value back to a quorum before returning. This ensures that any subsequent reader sees at least that value. The cost is double the latency for reads.
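A sketch of the ABD read path, using the same assumed `get`/`put` replica interface as the earlier quorum-read sketch. The second phase is the write-back that buys linearizability:

```python
import random

def abd_read(replicas, key, R, W):
    """Linearizable quorum read, ABD-style: read a quorum, then write the
    winning version back to a write quorum before returning.

    The write-back ensures that any read starting after this one returns
    will also find at least this version, eliminating the anomaly where
    one client sees V2 while a later-starting client still sees V1.
    """
    responses = [node.get(key) for node in random.sample(replicas, R)]
    latest = max(responses, key=lambda r: r.version)
    for node in random.sample(replicas, W):        # phase 2: write-back
        node.put(key, latest.version, latest.value)
    return latest.value
```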
Sloppy Quorums: When Availability Trumps Consistency
Strict quorums require specific designated replicas to participate. With N=3 and replicas A, B, C, a write needs W=2 of {A, B, C}. If B and C are down, the write fails even though there are dozens of other healthy nodes in the cluster.
Sloppy quorums relax this constraint. If the designated replicas are unavailable, write to any W nodes from a larger pool. The data lands somewhere and will be transferred to the correct replicas later.
Amazon's Dynamo paper pioneered this approach. In Dynamo, the preference list for a key includes more than N nodes (the ring neighbors beyond the designated replicas). If the primary replicas are unreachable, writes flow to the next nodes in the preference list. These nodes store the data with a "hint" indicating which replica should actually own it.
The availability benefit is significant. Your write path stays up even during replica failures, network partitions, and maintenance windows. The system continues accepting writes as long as any W nodes in the extended preference list are reachable.
The consistency cost: reads might not find the latest write. If Client A writes to nodes D and E (sloppy, because A, B, C were down), and Client B reads from nodes A and B (which are back up but do not have the write yet), Client B gets stale data. Even though W=2, R=2, and W+R=4 > N=3, the write and read quorums did not overlap because the write went to non-designated nodes.
This is the critical subtlety of sloppy quorums. The W+R>N guarantee holds only when W and R are drawn from the same N designated replicas.
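A sketch of a sloppy write against an extended preference list, loosely modeled on the Dynamo description above; the node methods and list layout are assumptions:

```python
def sloppy_write(preference_list, designated_n, key, version, value, W):
    """Walk the preference list until W live nodes accept the write.

    The first `designated_n` entries are the real replicas; anything past
    them is a stand-in that stores the data with a hint naming the down
    designated replicas it is covering for.
    """
    designated = preference_list[:designated_n]
    acks = 0
    for i, node in enumerate(preference_list):
        if not node.is_up():
            continue
        if i < designated_n:
            node.put(key, version, value)
        else:
            down_owners = [n for n in designated if not n.is_up()]
            node.put_with_hint(key, version, value, owners=down_owners)
        acks += 1
        if acks == W:
            return True   # quorum met, possibly via non-designated nodes
    return False          # fewer than W nodes reachable anywhere
```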
Hinted Handoff: The Glue for Sloppy Quorums
When a node stores data on behalf of an unavailable replica, it creates a "hint": a record that says "this data belongs to replica B, deliver it when B comes back." The node periodically checks whether the intended recipient is available. When it is, the hint is replayed: the data is sent to the correct replica and the hint is deleted.
This mechanism is elegant in theory but has operational gotchas.
Hint accumulation. If replica B is down for an extended period, hints pile up on the nodes that accepted sloppy writes. If many keys hash to B's range and the system has high write throughput, the hint backlog can consume significant disk space. Cassandra limits the hint window (default: 3 hours) and drops hints for nodes that have been down longer.
Hint replay storms. When replica B comes back online, all the accumulated hints replay simultaneously. This can overwhelm B with a burst of writes, especially if B was down for hours during a peak traffic period. Throttling hint replay is essential.
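A throttled replay loop might look like the following sketch. The hint record shape and node methods are assumptions, and real systems use token buckets rather than sleeps:

```python
import time

def replay_hints(stored_hints, max_per_second=100):
    """Deliver queued hints to their intended owners, rate-limited so a
    freshly recovered replica is not crushed by the backlog."""
    for hint in list(stored_hints):
        if not hint.owner.is_up():
            continue                        # owner still down; retry later
        hint.owner.put(hint.key, hint.version, hint.value)
        stored_hints.remove(hint)           # delivered: delete the hint
        time.sleep(1.0 / max_per_second)    # crude replay throttle
```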
Hints are not sufficient for full repair. Hinted handoff only covers writes that happened while the replica was down. It does not catch pre-existing divergence from other causes (bit rot, operator error, bugs). Full anti-entropy repair with Merkle trees is still needed as a backstop.
Read Repair: Opportunistic Healing
When a read quorum returns conflicting values (different replicas have different versions), the coordinator knows which replicas are stale. It can immediately send the latest value to those replicas. This is read repair: every read is also a potential write to fix inconsistencies.
Read repair is cheap because you are already reading the data and comparing responses. The repair write is a side effect of the read. It catches a surprising amount of divergence in practice, because popular keys get read frequently and therefore repaired quickly. Rarely-read keys accumulate staleness until a background repair process catches them.
Cassandra supports two modes: synchronous read repair (the read blocks until the repair completes, slower but guaranteed to fix the inconsistency) and asynchronous read repair (the read returns immediately and the repair happens in the background, faster but the next read might still see stale data on the same replica).
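A sketch of the coordinator logic, again against the assumed `get`/`put` replica interface; `blocking=True` mirrors the synchronous mode, and `enqueue_async_repair` is a hypothetical helper standing in for a background task queue:

```python
import random

def read_with_repair(replicas, key, R, blocking=True):
    """Quorum read that repairs any stale replica it happened to contact."""
    contacted = random.sample(replicas, R)
    responses = {node: node.get(key) for node in contacted}
    latest = max(responses.values(), key=lambda r: r.version)
    stale = [n for n, r in responses.items() if r.version < latest.version]
    for node in stale:
        if blocking:
            node.put(key, latest.version, latest.value)  # read waits for fix
        else:
            enqueue_async_repair(node, key, latest)      # hypothetical helper
    return latest.value
```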
The downside: read repair only fixes keys that are read. A key that is written once and never read again will not be repaired. This is why read repair complements, but does not replace, full anti-entropy repair.
Flexible Quorums: Beyond W + R > N
Heidi Howard's 2016 paper on flexible quorums challenges the assumption that quorums must satisfy W + R > N. The actual requirement is weaker: every write quorum must intersect with every read quorum. The W + R > N formula is a sufficient condition for this, but not a necessary one.
For example, with 5 replicas labeled A through E, you could define:
- Write quorum: any set containing A (the "leader")
- Read quorum: any set containing A
As long as A is available, writes and reads intersect trivially (both include A). This gives you W=1 and R=1 with strong consistency, but only as long as A is up. If A fails, you fall back to a more traditional quorum until A recovers.
This sounds academic, but it captures what many real systems do. Raft requires the leader to be in every write quorum (the leader proposes writes and replicates them). Reads from the leader are linearizable with R=1 because the leader is always in the write quorum.
The flexible quorums framework also enables geographic quorums. In a system replicated across 3 data centers, you might require writes to include at least one replica from 2 different data centers (ensuring durability across DCs) while reads only need to succeed in the local DC (fast reads). The intersecting property can be maintained by designing the quorum sets carefully.
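The general correctness condition is easy to state in code. A sketch that checks pairwise intersection for arbitrary quorum systems, exercised with the leader-based example above and with classic majorities:

```python
from itertools import combinations

def quorums_intersect(write_quorums, read_quorums):
    """Every write quorum must share at least one node with every read
    quorum; W + R > N is merely one sufficient way to arrange this."""
    return all(w & r for w in write_quorums for r in read_quorums)

# Leader-based scheme over replicas A..E: every quorum contains A.
write_quorums = [{"A"}, {"A", "B"}, {"A", "C", "D"}]
read_quorums = [{"A"}, {"A", "E"}]
print(quorums_intersect(write_quorums, read_quorums))   # True: all share A

# Classic majorities with N=5 also pass: any two sets of 3 must overlap.
majorities = [set(c) for c in combinations("ABCDE", 3)]
print(quorums_intersect(majorities, majorities))        # True
```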
Witness Replicas
A witness replica is a lightweight participant in quorum voting. It stores just enough state to participate in consensus (key, version, maybe a small metadata record) without storing the actual data value. The data value is stored on full replicas.
Why would you want this? Consider a 2-node setup. With N=2, strong consistency requires W + R > 2, so at least one of W or R must be 2, and whichever it is, a single node failure blocks that entire operation path. Adding a third full replica restores fault tolerance but raises your storage cost by 50%.
A witness replica gives you N=3 for quorum purposes without the storage cost of a third full copy. The witness confirms "version 7 of key X exists" without storing the actual value of key X. Writes succeed with 2 of 3 (two full replicas, or one full replica plus the witness). Reads succeed with 2 of 3; if the witness reports a newer version than the full replica that responded, the coordinator retries against the other full replica to fetch the actual value.
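A sketch of that coordinator-side resolution, assuming responses carry a version plus a value that is None for the witness:

```python
def resolve_with_witness(responses):
    """Pick the winning version from full-replica and witness responses.

    A witness can vote on which version is latest but cannot supply the
    value, so the value must come from a full replica at that version;
    if only the witness saw it, the coordinator must retry elsewhere.
    """
    latest = max(r.version for r in responses)
    values = [r.value for r in responses
              if r.version == latest and r.value is not None]
    if not values:
        raise LookupError("only the witness has the latest version; "
                          "retry against the other full replica")
    return values[0]
```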
Cloud Spanner uses this pattern: its witness replicas vote in Paxos commit quorums but do not store a full copy of the data and cannot serve reads. This allows placing a witness in a third region for durability without paying the full replication cost. (CockroachDB's non-voting replicas are the mirror image: they hold data to serve local reads but do not vote in the commit quorum.)
Tuning NRW for Real Workloads
The biggest practical advantage of NRW notation is per-query tuning. A single database can serve different queries with different consistency requirements.
User-facing reads that must be consistent (account balance, order status): use QUORUM (R=2 for N=3). The extra latency of waiting for 2 responses is a few milliseconds. Worth it.
Analytics aggregations that can tolerate stale data (daily active users, trending products): use ONE (R=1). Faster, lower load on the cluster, and slightly stale numbers are fine for dashboards.
Critical writes (financial transactions, inventory decrements): use QUORUM or ALL (W=2 or W=3). Ensure the write is durable on multiple replicas before acknowledging.
Best-effort writes (telemetry, click tracking): use ONE (W=1). If the write is lost due to a replica failure before replication, losing one click event is not worth the latency of a quorum write.
Cassandra makes this particularly easy with per-query consistency levels. The same table, same replication factor, but each CQL statement specifies its own consistency level. A read at CONSISTENCY ONE and a read at CONSISTENCY QUORUM hit the same data with different guarantees and latency characteristics.
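With the DataStax Python driver this is a per-statement option. The keyspace, tables, and queries below are illustrative:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")   # hypothetical keyspace

# User-facing read that must be consistent: QUORUM (R=2 with N=3).
row = session.execute(SimpleStatement(
    "SELECT status FROM orders WHERE order_id = %s",
    consistency_level=ConsistencyLevel.QUORUM), ("o-1234",)).one()

# Best-effort telemetry write: ONE (W=1), lowest latency.
session.execute(SimpleStatement(
    "INSERT INTO clicks (user_id, ts, page) VALUES (%s, toTimestamp(now()), %s)",
    consistency_level=ConsistencyLevel.ONE), ("u-42", "/checkout"))
```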
The Monitoring Story
Quorum systems look great on a whiteboard but need operational monitoring to stay healthy.
Hinted handoff queue depth. Rising hint queues mean replicas are unavailable and sloppy quorums are accumulating debt. If hints are not draining, replicas are diverging.
Read repair rate. High read repair rates indicate that replicas are frequently inconsistent. This might mean anti-entropy repair is not running often enough, or hinted handoff is failing, or write paths are flaky.
Write latency at quorum. If W=2 and one replica is slow (GC pauses, disk issues), every quorum write waits for that slow replica. Monitor per-replica latency to find the bottleneck. Some systems (speculative execution in Cassandra) send an extra write to a third replica if the second is slow, racing the responses.
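A sketch of that hedging idea, assuming a blocking `node.put` call and illustrative timings:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed, wait

def hedged_quorum_write(replicas, key, version, value, W=2, hedge_after=0.05):
    """Send the write to W replicas; if acks are slow, race one extra.

    This keeps a single slow replica (GC pause, sick disk) from setting
    the latency floor for every quorum write.
    """
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(n.put, key, version, value) for n in replicas[:W]]
    done, _ = wait(futures, timeout=hedge_after)
    if len(done) < W and len(replicas) > W:
        # One of the first W is lagging: race an extra replica against it.
        futures.append(pool.submit(replicas[W].put, key, version, value))
    acks = 0
    for _ in as_completed(futures):
        acks += 1
        if acks == W:
            pool.shutdown(wait=False)   # quorum reached; ignore stragglers
            return True
    pool.shutdown(wait=False)
    return False
```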
Consistency level distribution. Track what fraction of queries use QUORUM vs ONE. If an application that should be using QUORUM accidentally defaults to ONE, you have a consistency bug that will not show up in functional tests, only in stale reads under failure.
Key Points
- With N replicas, W (write quorum) and R (read quorum) define the consistency tradeoff. If W + R > N, every read sees at least one replica with the latest write because the write and read sets must overlap
- N=3, W=2, R=2 is the classic setup. It tolerates 1 node failure for both reads and writes while guaranteeing strong consistency. Cassandra, Riak, and other Dynamo-style stores all support this configuration
- Sloppy quorums sacrifice consistency for availability. When designated replicas are down, writes go to any available node. Hinted handoff eventually delivers the data to the correct replicas, but reads may be stale in the meantime
- W + R > N guarantees you will see the latest write, but it does not guarantee linearizability. Concurrent reads during a write can still observe different values. Additional mechanisms such as ABD-style write-back on reads or a consensus protocol are needed for true linearizability
- Witness replicas participate in quorum votes without storing full data. They confirm they saw a write by storing just the key and version, useful for tie-breaking in even-numbered replica configurations
Common Mistakes
- ✗ Assuming W + R > N gives you linearizability. It only gives regular register semantics. Without additional mechanisms like consensus (Raft) or ABD-style read write-back, concurrent reads can disagree
- ✗ Using the same NRW configuration for every query type. Analytics reads that tolerate staleness should use R=1 for speed. Transaction reads that need consistency should use R=2. Cassandra supports per-query consistency levels for exactly this reason
- ✗ Forgetting that sloppy quorums can return stale data even with W + R > N. The W nodes that participated in the write might include non-designated nodes that the read quorum never contacts
- ✗ Not monitoring hinted handoff queue depth. If hints pile up faster than they drain, replicas are diverging and the consistency guarantees you think you have are eroding