Erasure Coding & Reed-Solomon
Why Not Just Copy Everything Three Times
The simplest way to protect data in a distributed system is replication. Store three copies on three different nodes. If one node dies, you still have two copies. This is what HDFS did for its first decade, what Kafka does for partition replicas, and what most databases do for high availability.
The problem is math. If you are storing 100 petabytes of data, triple replication means you are actually storing 300 petabytes. At roughly $20 per terabyte per month for cloud storage, that is $6 million per month instead of $2 million. The extra $4 million buys you exactly one thing: the ability to survive two simultaneous node failures per data block.
Erasure coding gives you the same (or better) fault tolerance at a fraction of the storage cost. HDFS switched from 3x replication to Reed-Solomon (6,3) for cold data and cut storage overhead from 200% to 50%. Same durability, half the disks.
The trade-off is not free. Erasure coding adds CPU cost for encoding and decoding, increases read latency (you touch more nodes per read), and makes recovery more expensive when nodes fail. But for data that is written once and read occasionally, the storage savings are too large to ignore.
How It Works
Take a block of data and split it into k equal-sized pieces called data shards. Then compute m additional pieces called parity shards using a mathematical function over the data shards. You now have k+m total shards. The key property: any k of those k+m shards are sufficient to reconstruct the original data.
With RS(6,3), k=6 and m=3. You split the data into 6 shards and compute 3 parity shards. Store all 9 on different nodes. Any 3 nodes can die simultaneously and you can still reconstruct every byte of the original data from the surviving 6 shards.
The storage overhead is (k+m)/k = 9/6 = 1.5x. Compare that to 3x replication, which can only afford to lose 2 of its 3 copies, so it tolerates just 2 failures per block. Erasure coding at RS(6,3) tolerates 3 failures at 1.5x storage; replication would need 4 copies (4x storage) to tolerate 3 failures. You are getting strictly better fault tolerance at lower cost.
The numbers get more dramatic at higher ratios. RS(10,4) tolerates 4 failures at 1.4x storage. RS(14,4) tolerates 4 failures at 1.29x storage. As k grows, the overhead approaches 1x, though you are spreading data across more nodes which introduces other problems.
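Treating replication as the degenerate k=1 case makes the comparison concrete. This is a minimal sketch in plain Go (no external dependencies) that just tabulates the failures tolerated (m) and the storage overhead ((k+m)/k) for the schemes mentioned above.

```go
package main

import "fmt"

// Compare storage overhead and fault tolerance for replication and a few
// Reed-Solomon layouts. Replication is modeled as the degenerate k=1 case.
func main() {
	schemes := []struct {
		name string
		k, m int
	}{
		{"3x replication", 1, 2}, // one data copy plus two extra copies
		{"RS(6,3)", 6, 3},
		{"RS(10,4)", 10, 4},
		{"RS(14,4)", 14, 4},
	}
	for _, s := range schemes {
		overhead := float64(s.k+s.m) / float64(s.k)
		fmt.Printf("%-15s tolerates %d failures at %.2fx storage\n", s.name, s.m, overhead)
	}
}
```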
Reed-Solomon: The Dominant Algorithm
Most production erasure coding systems use Reed-Solomon codes, which operate over Galois fields (finite fields with exactly p^n elements, typically GF(2^8) with 256 elements).
You do not need to understand the full algebra to use it. The mental model is this: arrange the k data shards as a vector, multiply by a carefully chosen (k+m) x k matrix, and you get k+m output shards. The first k outputs are identical to the data shards (systematic encoding). The last m are the parity shards.
The matrix is designed so that any k rows form an invertible submatrix. This is what guarantees that any k of k+m shards can reconstruct the data. You grab the k surviving shards, find the corresponding k rows of the encoding matrix, invert that submatrix, and multiply. Out comes the original data.
Vandermonde and Cauchy matrices are the most common choices because they have the "any k rows are invertible" property by construction. You do not need to check it. It falls out of the math.
The operations (addition, multiplication) happen in the Galois field, not over regular integers. GF(2^8) addition is XOR. Multiplication uses lookup tables (log and antilog tables with 256 entries). Modern implementations use SIMD instructions to process multiple Galois field operations in parallel, and the best libraries (Intel ISA-L, Klauspost's Go implementation) achieve multi-gigabyte-per-second throughput on commodity hardware.
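As a rough illustration of what those lookup tables contain, here is a minimal GF(2^8) arithmetic sketch using the commonly chosen primitive polynomial 0x11d and generator 2. Production libraries build on the same idea but add SIMD and table-splitting tricks; this is only meant to show why addition is XOR and how multiplication reduces to table lookups.

```go
package main

import "fmt"

// GF(2^8) arithmetic as used by Reed-Solomon: addition is XOR, and
// multiplication goes through log/antilog tables built from a primitive
// polynomial (0x11d here, a common choice).
var gfExp [512]byte // antilog table, doubled so we can skip the mod 255 step
var gfLog [256]byte

func init() {
	x := 1
	for i := 0; i < 255; i++ {
		gfExp[i] = byte(x)
		gfLog[x] = byte(i)
		x <<= 1 // multiply by the generator 2
		if x&0x100 != 0 {
			x ^= 0x11d // reduce modulo the primitive polynomial
		}
	}
	for i := 255; i < 512; i++ {
		gfExp[i] = gfExp[i-255]
	}
}

// Addition and subtraction are both XOR in GF(2^8).
func gfAdd(a, b byte) byte { return a ^ b }

// Multiplication: add the logs, then look up the antilog.
func gfMul(a, b byte) byte {
	if a == 0 || b == 0 {
		return 0
	}
	return gfExp[int(gfLog[a])+int(gfLog[b])]
}

func main() {
	a, b := byte(0x57), byte(0x83)
	fmt.Printf("0x57 + 0x83 = 0x%02x\n", gfAdd(a, b))
	fmt.Printf("0x57 * 0x83 = 0x%02x\n", gfMul(a, b))
}
```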
A simpler alternative to Reed-Solomon is XOR-based coding. RAID-5 uses a single XOR parity (m=1, tolerates one failure). RAID-6 adds a second parity using Galois field multiplication (m=2). These are faster to compute but limited to m=1 or m=2 parity shards. Reed-Solomon handles arbitrary m, which is why it dominates in distributed systems where you want to tolerate 3 or more failures.
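For contrast, the single-parity XOR idea fits in a few lines. This sketch builds one parity shard and rebuilds a lost data shard from the survivors, which also shows the m=1 limitation: lose two shards and nothing can bring them back.

```go
package main

import (
	"bytes"
	"fmt"
)

// Single-parity XOR coding (the RAID-5 idea, m=1): the parity shard is the
// XOR of all data shards, and any one missing shard is the XOR of the rest.
func xorParity(shards [][]byte) []byte {
	parity := make([]byte, len(shards[0]))
	for _, shard := range shards {
		for i, b := range shard {
			parity[i] ^= b
		}
	}
	return parity
}

func main() {
	data := [][]byte{
		[]byte("aaaa"), []byte("bbbb"), []byte("cccc"),
	}
	parity := xorParity(data)

	// Simulate losing shard 1 and rebuilding it from the survivors + parity.
	rebuilt := xorParity([][]byte{data[0], data[2], parity})
	fmt.Println(bytes.Equal(rebuilt, data[1])) // true
}
```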
The Reconstruction Tax
Here is the part that product teams do not think about until it bites them.
A normal read in a replicated system touches one node. Pick the closest replica, read the data, done. A normal read in an RS(6,3) system touches 6 nodes (you need all k data shards to assemble the block). That is 6 parallel network requests instead of 1, and your read latency is the slowest of those 6 responses.
During normal operation, this is manageable. All 6 shards are healthy, the reads are parallel, and the slowest of 6 might add a few milliseconds. But when a node is down (degraded mode), you need to read any 6 of the remaining 8 shards, decode through Galois field matrix inversion, and reconstruct the missing shard. This is slower, uses more CPU, and if the degraded state lasts hours or days, every read to any block on the failed node pays this tax.
Azure Storage addresses this with Local Reconstruction Codes (LRC). Instead of a single RS code over all shards, LRC creates local parity groups. If one shard in a local group fails, you can reconstruct from the local group (reading 2-3 shards) instead of the global group (reading k shards). This reduces reconstruction I/O at the cost of slightly higher storage overhead. The paper reports 2x reduction in repair traffic.
Another common optimization is hedged reads. If one of the k data shards is slow, fire off an extra read to a parity shard. Use whichever responds first. This helps tail latency at the cost of higher average bandwidth usage.
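A hedged read can be sketched as a race between a primary request and a delayed backup request. Everything below is a hypothetical stand-in, not any particular system's API: the fetch callback, node names, and hedge delay are assumptions made for illustration.

```go
package hedging

import (
	"context"
	"time"
)

type result struct {
	data []byte
	err  error
}

// hedgedRead asks the primary node for a shard; if it has not answered within
// hedgeAfter, it also asks a fallback node (e.g. one holding a parity shard)
// and returns whichever succeeds first. The caller's ctx bounds the whole call.
func hedgedRead(ctx context.Context,
	fetch func(ctx context.Context, node string) ([]byte, error),
	primary, fallback string, hedgeAfter time.Duration) ([]byte, error) {

	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stop the losing request once a winner is picked

	results := make(chan result, 2)
	read := func(node string) {
		data, err := fetch(ctx, node)
		results <- result{data, err}
	}

	go read(primary)

	timer := time.NewTimer(hedgeAfter)
	defer timer.Stop()

	launched := 1
	var lastErr error
	for received := 0; received < launched; {
		select {
		case r := <-results:
			received++
			if r.err == nil {
				return r.data, nil
			}
			lastErr = r.err
			// The primary failed outright: fire the hedge immediately.
			if launched == 1 {
				launched++
				go read(fallback)
			}
		case <-timer.C:
			// The primary is slow: fire the hedge and keep waiting for both.
			if launched == 1 {
				launched++
				go read(fallback)
			}
		}
	}
	return nil, lastErr
}
```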
When Replication Still Wins
Erasure coding is not always the right choice. There are clear situations where replication is better.
Hot data with latency requirements. If you are serving user-facing requests from this data, the 6-node read pattern adds tail latency that replication avoids. Database primary storage, caches, and real-time serving layers almost always use replication.
Small objects. Splitting a 4KB metadata record into 6 shards of 700 bytes each, storing them on 9 nodes, and reassembling on read is absurd. The overhead per object (9 I/O operations, 9 metadata entries) dwarfs the storage savings. Most systems define a crossover size: objects smaller than 1-4MB use replication, larger objects use erasure coding.
Frequently updated data. Every update to an erasure-coded object requires re-encoding the affected data shard and recomputing all m parity shards. For append-heavy workloads (like log systems), this is expensive. Write-optimized systems like Kafka stick with replication for this reason.
Metadata. Filesystem metadata (which block lives where, which nodes are alive) needs to be read constantly, updated frequently, and accessed with minimal latency. Every distributed storage system replicates its metadata layer even when the data layer is erasure-coded.
The general pattern: replicate the hot path, erasure-code the cold path. HDFS uses 3x replication for data being actively written and computed on, then converts to RS(6,3) once the data cools off. Ceph lets you configure replication pools for hot data and erasure-coded pools for archives, all within the same cluster.
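In practice that split is often just a couple of thresholds in a placement policy. The sketch below is hypothetical: the size and age cutoffs are illustrative, not taken from HDFS or Ceph, but they capture the shape of the decision.

```go
package tiering

import "time"

// Illustrative thresholds only; real systems tune these per workload.
const (
	ecMinObjectSize = 4 << 20            // below ~4MB, per-shard overhead dominates
	coldAfter       = 7 * 24 * time.Hour // untouched for a week counts as cold
)

type Redundancy int

const (
	Replicate3x Redundancy = iota
	ErasureCodeRS63
)

// choosePolicy replicates small or hot objects and erasure-codes large cold ones.
func choosePolicy(sizeBytes int64, lastAccess time.Time) Redundancy {
	if sizeBytes < ecMinObjectSize {
		return Replicate3x
	}
	if time.Since(lastAccess) < coldAfter {
		return Replicate3x // hot path: keep read latency low
	}
	return ErasureCodeRS63 // cold path: take the storage savings
}
```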
Placement Across Failure Domains
Erasure coding only provides durability if the shards are spread across independent failure domains. Put all 9 shards of an RS(6,3) block on the same rack, and a rack power failure wipes out everything.
The ideal placement spreads the k+m shards across k+m failure domains, one shard per domain, so the full m-failure tolerance applies even to whole-domain failures. For RS(6,3), that means 9 nodes in 9 different failure domains. In a cloud environment, failure domains map to availability zones. In a datacenter, they map to racks or power circuits.
This creates a constraint on cluster topology. You want at least k+m nodes in different failure domains to store a single erasure-coded block. A 3-rack cluster cannot safely run RS(6,3): with 3 shards per rack, a single rack failure consumes the entire 3-shard tolerance, so any additional disk failure during recovery loses data.
Backblaze runs RS(17,3) across 20 drives in a Vault, with drives spread across 20 different storage pods. Each pod is a self-contained unit with its own power and network. This gives them 3-drive fault tolerance per block at 1.18x storage overhead. The economics are aggressive: at exabyte scale, even a few percentage points of storage overhead translate to millions of dollars in hardware.
HDFS spreads shards across racks using rack-aware placement. When computing the placement for an RS(6,3) block, it ensures no two shards share a rack. This means a single rack failure takes out at most one shard, well within the 3-shard tolerance.
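A simplified version of that placement rule: walk the candidate nodes and take at most one per rack. The Node type and the greedy selection below are illustrative, not HDFS's actual placement code.

```go
package placement

import "fmt"

// Node is a storage node tagged with the rack (failure domain) it lives in.
type Node struct {
	Name string
	Rack string
}

// placeShards picks one node per rack for the k+m shards of a block, so a
// single rack failure costs at most one shard.
func placeShards(nodes []Node, totalShards int) ([]Node, error) {
	chosen := make([]Node, 0, totalShards)
	usedRacks := make(map[string]bool)
	for _, n := range nodes {
		if usedRacks[n.Rack] {
			continue // never put two shards of the same block on one rack
		}
		usedRacks[n.Rack] = true
		chosen = append(chosen, n)
		if len(chosen) == totalShards {
			return chosen, nil
		}
	}
	return nil, fmt.Errorf("need %d distinct racks, found only %d", totalShards, len(chosen))
}
```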
Real Systems and What They Chose
HDFS introduced erasure coding in version 3.0 (2017). The default policy is RS(6,3) for cold data, with RS(10,4) as an option for higher efficiency. Data starts as 3x replicated during the write pipeline and gets background-converted to erasure coding by a policy engine. Facebook reported a 20% reduction in the storage capacity its clusters required after migrating to erasure coding.
Ceph supports erasure-coded pools alongside replicated pools. You choose per pool. The Ceph OSD daemon handles encoding and decoding transparently. The jerasure and ISA-L plugins provide the math. A common pattern is a fast replicated pool for RBD (block devices) and an erasure-coded pool for RGW (object storage).
AWS S3 uses a proprietary erasure coding scheme that distributes data across multiple availability zones within a region. AWS has not published the exact parameters, but the 99.999999999% (11 nines) durability guarantee implies a scheme with significant redundancy. Objects are chunked and encoded before distribution.
Azure Storage uses Local Reconstruction Codes (LRC), a Microsoft Research innovation that adds local parity groups on top of Reed-Solomon. The 2012 paper describes a (12,2,2) LRC code: 12 data fragments, 2 global parity, 2 local parity. Normal single-fragment failures repair from a local group of 6, avoiding the need to contact all 12 data nodes.
MinIO applies Reed-Solomon per object, with configurable k and m. The default is RS(8,4) for a 12-drive setup (1.5x overhead, 4-drive fault tolerance). Encoding happens inline during the write path. MinIO uses Klauspost's Go Reed-Solomon library, which achieves high throughput through SSSE3 and AVX2 SIMD optimizations.
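A minimal RS(6,3) round trip with the klauspost/reedsolomon library, roughly following its documented split/encode/reconstruct/join flow, looks like the sketch below; the shard indices chosen for the simulated failures are arbitrary.

```go
package main

import (
	"bytes"
	"fmt"
	"log"

	"github.com/klauspost/reedsolomon"
)

func main() {
	enc, err := reedsolomon.New(6, 3) // k=6 data shards, m=3 parity shards
	if err != nil {
		log.Fatal(err)
	}

	data := bytes.Repeat([]byte("erasure coding example "), 1000)

	// Split pads and slices the data into 6 data shards, with room for 3 parity shards.
	shards, err := enc.Split(data)
	if err != nil {
		log.Fatal(err)
	}
	// Encode fills in the 3 parity shards.
	if err := enc.Encode(shards); err != nil {
		log.Fatal(err)
	}

	// Simulate three simultaneous node failures.
	shards[1], shards[4], shards[7] = nil, nil, nil

	// Reconstruct rebuilds the missing shards from the surviving 6.
	if err := enc.Reconstruct(shards); err != nil {
		log.Fatal(err)
	}

	// Join reassembles the original byte stream, trimming the padding.
	var out bytes.Buffer
	if err := enc.Join(&out, shards, len(data)); err != nil {
		log.Fatal(err)
	}
	fmt.Println(bytes.Equal(out.Bytes(), data)) // true
}
```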
The direction is clear. Replication is not going away for hot data and metadata, but erasure coding has become the default for everything else. At petabyte scale, the storage savings are simply too large to leave on the table.
Key Points
- •Erasure coding splits data into k data shards and computes m parity shards such that any k of the k+m total shards can reconstruct the original data. This provides the fault tolerance of (m+1)-way replication at only (k+m)/k times the storage
- •Reed-Solomon codes operate over Galois fields (finite field arithmetic) to generate parity shards. RS(6,3) stores 9 shards total and tolerates any 3 failures at 1.5x storage, where replication would need 4 copies (4x storage) to match that fault tolerance
- •The hidden cost is reconstruction. When a shard is lost, the system must read k surviving shards and perform matrix inversion to recover it. This amplifies read latency and network traffic during degraded operation, which is why hot data often still uses replication
- •Every major storage system uses erasure coding for cold and warm data: HDFS switched from 3x replication to RS(6,3) and cut storage costs nearly in half, Ceph supports erasure-coded pools, and S3 uses a proprietary scheme across availability zones
Common Mistakes
- ✗Using erasure coding for small objects. An RS(6,3) scheme turns a 1KB object into 9 separate shards of ~170 bytes each, and every read requires fetching 6 of them from 6 different nodes. The per-request overhead dominates. Most systems set a minimum object size (typically 1-4MB) below which they fall back to replication
- ✗Ignoring tail latency during degraded reads. Normal reads touch k nodes and take the slowest of those k responses. During reconstruction, you still read k nodes but the data also passes through a decode step. If even one of those k nodes is slow, the entire read stalls. Systems like Azure add extra read requests to hedge against stragglers
- ✗Placing all shards in the same failure domain. If your RS(6,3) shards all land on the same rack and that rack loses power, you lose all 9 shards at once. Erasure coding only works when shards are spread across independent failure domains: different racks, different power circuits, or different availability zones
- ✗Not accounting for reconstruction bandwidth during recovery. When a node dies holding thousands of erasure-coded shards, recovering each one requires reading k shards from k different nodes. At scale this can saturate network links and trigger cascading slowdowns. Smart systems throttle recovery bandwidth and prioritize shards that are closest to losing durability