System Design: Object Storage (Erasure Coding, Flat Namespace, and Exabyte Scale)
Goal: Build a distributed object storage platform storing 100+ trillion objects across exabytes of raw data. Support PutObject, GetObject, DeleteObject, ListObjectsV2, multipart uploads up to 5TB, versioning, storage class transitions, and presigned URLs. Target 11 nines of durability (99.999999999%) using Reed-Solomon erasure coding (10+4) combined with fast repair, multi-AZ isolation, and failure independence. Strong read-after-write consistency. Flat namespace. CDN-accelerated reads. Clear control plane / data plane separation.
Mental model (read this first):
- Metadata Service: object key -> placement group (PG) ID + version + ETag. Knows what exists, not where it lives.
- CRUSH: PG ID -> 14 storage nodes. Computed locally by any component using the cluster map. No central lookup.
- Storage Nodes: store erasure-coded shards in extent files on disk. An object is split into shards (10 data + 4 parity). Each shard is appended into a large extent file (a 256MB container holding many shards sequentially).
- Cluster Map: shared topology tree (AZ -> rack -> node -> disk with weights). Updated by control plane, cached everywhere. What makes CRUSH deterministic.
- Immutability: PUT creates new shards. Old shards untouched until GC. No locking, no conflicts.
System flow in 5 steps:
- PUT arrives -> metadata assigns a PG ID
- CRUSH computes which 14 nodes hold this PG's shards
- Object is erasure-coded into 14 shards
- Shards are appended into extent files on those storage nodes
- Metadata commits via Raft -> object is visible
TL;DR: Reed-Solomon (10+4) splits each object into 10 data + 4 parity shards placed across 14 storage nodes in different failure domains via CRUSH. This targets 11 nines of durability at 1.4x storage overhead, saving ~$32M/month/EB vs triple replication. The control plane manages ~136 PB of metadata split into 1.36M Raft partitions for strong consistency. The data plane is 19,500 storage nodes (14 EB physical storage). CRUSH deterministically maps each object to its 14 nodes using cluster topology. Small objects pack into 256MB extent files. Objects are immutable. Flat namespace with sorted LSM-tree keys gives near-constant-time lookups and efficient prefix scans. For CDN-served private content, CDN-signed URLs with Origin Access Control are the correct pattern. For file-level edits, a manifest service (application-layer) tracks chunk-to-file mappings on top of the immutable blob store.
1. The Core Problems
Problem: Durability at 11 nines. A typical data center runs around 100K drives. At a 2% annual failure rate, roughly 5.5 drives fail daily. Failures are constant, not rare. Storage costs about $0.02/GB/month (~$20M per EB). Triple replication stores 3 EB per EB of logical data, costing ~$60M/month. Reed-Solomon (10+4) stores 1.4 EB, costing ~$28M/month. That saves ~$32M/month and tolerates up to 4 shard failures instead of 2.
Problem: Flat namespace at trillion scale. No directories. The key "photos/2024/vacation/beach.jpg" is opaque. LIST with prefix=photos/2024/ and delimiter=/ simulates directories via efficient range scans over sorted, partitioned keys. This avoids metadata hotspots and distributed locking entirely.
Problem: Strong consistency. PUT then GET must always return the latest version. This requires Raft consensus at every metadata write.
Problem: Partial updates on immutable blobs. There is no PATCH, no append. Editing 10MB in a 1GB file means re-uploading the entire object. The workaround is a manifest-based chunking layer built on top of the storage system. This is an application concern, not a storage concern.
Scale: 100T+ objects, exabytes of data, 350K req/sec peak, 5TB max object size.
2. Requirements
Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-1 | PutObject: upload objects up to 5TB | P0 |
| FR-2 | GetObject: retrieve objects with byte-range reads | P0 |
| FR-3 | DeleteObject: soft delete (versioning) or hard delete (version-id) | P0 |
| FR-4 | ListObjectsV2: prefix + delimiter + pagination over trillions of objects | P0 |
| FR-5 | Multipart upload: create, upload-part (up to 10,000 parts), complete, abort | P0 |
| FR-6 | Presigned URLs: time-limited signed URLs for direct client-to-storage transfers | P0 |
| FR-7 | Bucket versioning: every PUT creates a new version, recoverable deletes | P1 |
| FR-8 | Storage class transitions: Standard, IA, Glacier Instant/Flexible, Deep Archive, Intelligent-Tiering | P1 |
| FR-9 | Lifecycle policies: automatic transition and expiration | P1 |
| FR-10 | CopyObject: server-side copy without re-upload | P1 |
| FR-11 | Object tagging for lifecycle rules and access policies | P1 |
| FR-12 | Bucket policies and IAM: resource-level and user-level access control | P0 |
| FR-13 | Server-side encryption: SSE-Default, SSE-KMS, SSE-C | P0 |
| FR-14 | Event notifications on object create/delete/transition | P1 |
| FR-15 | Cross-region replication: event-driven async replication with version ordering | P2 |
| FR-16 | CDN integration: edge caching via CDN-signed URLs, WAF/DDoS protection | P1 |
| FR-17 | Automatic rebalancing and repair on node addition/removal/failure | P0 |
Non-Functional Requirements
| Property | Target |
|---|---|
| Durability | 99.999999999% (11 nines) |
| Availability | 99.99% (52.6 min downtime/year) |
| PUT latency (p99, <1MB) | < 200ms |
| GET latency (p99, first byte, <1MB) | < 100ms |
| LIST latency (p99, 1000 results) | < 500ms |
| Total objects | 100T+ |
| Total logical storage | Exabytes |
| Maximum object size | 5TB |
| Peak request rate | 350K/sec |
| Storage overhead | < 1.5x (erasure coding) |
| Consistency | Strong read-after-write |
| CDN cache hit ratio | >= 60% for read-heavy workloads |
3. Design Principles
Objects are immutable. There is no PATCH endpoint. No append operation. No in-place modification. PUT always creates a new version with entirely new shards. Old shards sit untouched until GC reclaims them (or are kept forever if versioning is enabled). This removes the need for distributed locking, makes caching straightforward (ETags never go stale), and eliminates conflict resolution from replication. If two clients upload the same key simultaneously, they each create independent shards on independent nodes. Raft orders the metadata commits to decide which version is "latest", but the data writes never conflict.
Separate control plane from data plane. Metadata (Raft consensus) and data (quorum writes, erasure coding) scale independently. Mixing them creates a bottleneck that caps both.
No central bottleneck. The Data Router is a stateless fleet. Placement uses CRUSH (deterministic, no central authority). Metadata is partitioned across 1.36M Raft groups.
Push data to the edge. CDN caches content via CDN-signed URLs and Origin Access Control. Storage presigned URLs transfer bytes directly to storage nodes for uploads. API servers handle auth and metadata, not terabits of data.
Failure is routine. At 5.5 drive failures per day, failure is not an exception. Write with quorum (11/14, so up to 3 slow nodes don't block), read with speculative parity fetches, repair in the background. One slow node never blocks a user request.
Flat namespace is an advantage. No directory tree means no distributed lock coordination for renames. Sorted keys in LSM-trees give near-constant-time point lookups (O(log N) technically, but with bloom filters and block caches it behaves like O(1) in practice) and efficient range scans.
4. Erasure Coding
Problem: Storing 3 copies of every byte wastes money and tolerates fewer failures than the math allows. At exabyte scale, the difference between 3x and 1.4x overhead is ~$32M/month.
Simple example: A 10MB photo is split into 10 slices of 1MB each (data shards). Four additional 1MB slices are computed from the originals using Reed-Solomon math (parity shards). The 14 shards are stored on 14 different machines. To reconstruct the photo, any 10 of the 14 shards suffice. Four machines can fail simultaneously without data loss.
Mental model: A book is photocopied 3 times (3x replication) — lose 2 copies and the book is gone. Alternatively, the book is split into 10 chapters, and 4 "cheat sheets" are created that can regenerate any missing chapters from the remaining ones. The cheat-sheet approach stores less total paper and survives more losses.
| | 3x Replication | 10+4 Erasure Coding |
|---|---|---|
| Physical storage (1 EB logical) | 3 EB | 1.4 EB |
| Monthly cost ($0.02/GB) | $60M | $28M |
| Fault tolerance | 2 failures | 4 failures |
Encoding is pipelined. The Data Router processes 1MB stripes as they arrive over the network, encoding and shipping shards in parallel with receiving the next stripe. It never buffers the full object before starting. ISA-L does ~10 GB/sec encoding per core with AVX-512 acceleration. In most deployments, EC CPU cost is small relative to disk and network I/O.
The trade-off: write latency increases (parity computation before writing) and reconstruction is more expensive (read 10 shards and decode). But the cost savings at scale are overwhelming.
5. Object Placement: Placement Groups and CRUSH
Problem: Managing placement per object does not scale at 100T objects. Adding a single node with naive consistent hashing reshuffles data across the entire cluster. The system needs a way to reason about groups of objects, not individual objects, and to limit rebalancing scope when topology changes.
Simple example: Instead of deciding where each of 100 trillion objects lives individually, objects are grouped into 100,000 buckets (placement groups). Each PG maps to a fixed set of 14 storage nodes. When a node is added, only a small subset of PGs are reassigned — controlled, bulk data movement instead of a full reshuffle.
Mental model: A post office sorts mail not by individual address but by ZIP code. Each ZIP code maps to a carrier. Adding a new carrier means reassigning a few ZIP codes, not re-sorting every letter.
pg_id = hash(bucket_id + "/" + object_key) % 100,000
Each PG maps to 14 storage nodes via CRUSH (Controlled Replication Under Scalable Hashing). CRUSH is a deterministic, topology-aware placement algorithm that selects nodes based on cluster layout (racks, AZs) and capacity weights, ensuring shards distribute across failure domains.
The placement map is computed locally by any component — no central lookup. It changes only when cluster topology changes (node join, leave, or failure), minimizing coordination overhead.
Source of truth: The PG ID stored in metadata is the authoritative reference. CRUSH + cluster map derive node locations from that PG ID. The shard_map field in metadata is a cached optimization for faster reads and may be stale or recomputed.
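To make the two-level mapping concrete, here is a minimal Go sketch. The PG hash matches the formula above; SelectNodes stands in for CRUSH using rendezvous (highest-random-weight) hashing, which shares the property that matters — any component holding the same node list computes the same answer, no central lookup — while omitting real CRUSH's topology walk and capacity weights. Names and hash choices are illustrative, not the production algorithm.

```go
package placement

import (
	"fmt"
	"hash/fnv"
)

const numPGs = 100000

// PGID maps an object to its placement group -- the only placement
// fact the metadata service has to store.
func PGID(bucketID, objectKey string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(bucketID + "/" + objectKey))
	return h.Sum32() % numPGs
}

// SelectNodes stands in for CRUSH: deterministically pick `count`
// distinct nodes for a PG via rendezvous hashing. Real CRUSH also
// walks the AZ -> rack -> node tree so picks land in distinct
// failure domains.
func SelectNodes(pgID uint32, nodeIDs []uint64, count int) []uint64 {
	type scored struct {
		node  uint64
		score uint64
	}
	scores := make([]scored, 0, len(nodeIDs))
	for _, n := range nodeIDs {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d/%d", pgID, n)
		scores = append(scores, scored{node: n, score: h.Sum64()})
	}
	// Partial selection sort: move the `count` highest scores to the front.
	for i := 0; i < count && i < len(scores); i++ {
		best := i
		for j := i + 1; j < len(scores); j++ {
			if scores[j].score > scores[best].score {
				best = j
			}
		}
		scores[i], scores[best] = scores[best], scores[i]
	}
	picked := make([]uint64, 0, count)
	for i := 0; i < count && i < len(scores); i++ {
		picked = append(picked, scores[i].node)
	}
	return picked
}
```

Because the scores depend only on (pgID, nodeID), adding a node changes placement only for the PGs where the new node wins, which is the same bounded-movement property CRUSH provides.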
6. Metadata Architecture
Problem: 100 trillion objects need an index that supports sub-10ms point lookups, efficient prefix range scans, versioned entries, and strong consistency. A single database node cannot hold this. Hash-partitioned stores (like Cassandra or DynamoDB) cannot do cross-partition range scans efficiently — scanning a prefix across thousands of partitions requires a full table scan.
Simple example: Consider listing all objects under photos/2024/. With range-partitioned keys in sorted order, this is a forward scan starting at photos/2024/ and stopping at photos/2024/\xff. With hash-partitioned keys, the prefix is scattered across thousands of partitions and every partition must be scanned.
Mental model: A library organized by call number (range-partitioned) lets a librarian walk to the "P" section and scan forward. A library organized by random shelf assignment (hash-partitioned) requires checking every shelf.
Each object needs roughly 500 bytes to 1KB of metadata (key ~200B, version ID 16B, size/etag/content-type ~80B, ACL/timestamps ~60B, user metadata ~100B, shard map ~225B), averaging ~700 bytes. At 100T objects: ~70 PB raw metadata. With LSM-tree overhead (bloom filters, block indexes, write-ahead logs): ~136 PB.
Partitioning: ~1.36M Raft partitions at 100GB each (small enough for fast Raft snapshots, large enough to keep partition count manageable). 3 replicas per partition. Spread across ~13,600 metadata nodes, each hosting ~100 partitions.
Key format: /{bucket_id}/{object_key}\0{version_id}. Prefixing with bucket_id puts all objects in the same bucket in a contiguous key range, so LIST with a prefix scans forward from that point. The \0 null byte separates the object key from the version. Version IDs use bit-flipped timestamps (XOR each byte with 0xFF) so that newer versions produce smaller byte values and sort first in a forward scan. GET without a version_id reads the first entry it finds — always the newest.
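A sketch of the version-ID encoding, assuming version IDs begin life as a big-endian nanosecond timestamp (the 16-byte width and the uniqueness padding are illustrative; the source specifies only the bit flip):

```go
package metadata

import (
	"bytes"
	"encoding/binary"
	"time"
)

// NewVersionID returns a 16-byte version ID whose bytes are the
// bit-flipped big-endian timestamp: a later timestamp produces
// smaller bytes, so a forward scan over /{bucket}/{key}\0{version}
// hits the newest version first.
func NewVersionID(t time.Time) []byte {
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[:8], uint64(t.UnixNano()))
	// The remaining 8 bytes could carry a node ID / sequence number
	// for uniqueness; left zero here.
	for i := range buf {
		buf[i] ^= 0xFF // the bit flip
	}
	return buf[:]
}

// Newer reports whether version a is newer than b under this
// encoding: smaller bytes sort first, and first == newest.
func Newer(a, b []byte) bool {
	return bytes.Compare(a, b) < 0
}
```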
What metadata stores vs what CRUSH computes: Metadata stores the PG ID for each object, not node locations. Node placement is derived at read/write time by running CRUSH against the cluster map.
Consistency model: All metadata writes go through the Raft leader. Reads are served from the leader (strict consistency) or follower replicas with lease-based staleness bounds (lower latency). Leader reads cost ~2-5ms, but PUT-then-GET always returns the latest version.
LSM-tree internals: Each metadata partition uses a log-structured merge tree. Writes go to an in-memory memtable backed by a write-ahead log (WAL). When the memtable fills, it flushes to an immutable sorted file (SSTable) on disk. Background compaction merges SSTables to bound read amplification. Bloom filters attached to each SSTable let point lookups skip irrelevant files with high probability. Block indexes within each SSTable enable binary search to the exact key. The result: writes are always sequential (fast), reads check the bloom filter first (skip most files), then binary-search into the correct SSTable.
| Metadata Store Option | Consistency | Range Scans |
|---|---|---|
| FoundationDB | Strong (Paxos) | Yes |
| TiKV | Strong (Raft) | Yes |
| CockroachDB KV | Strong (Raft) | Yes |
7. Storage Classes
| Class | Latency | $/TB/mo | Retrieval $/GB | Use Case |
|---|---|---|---|---|
| Standard | ms | $23.00 | $0.00 | Active data |
| Infrequent Access | ms | $12.50 | $0.01 | Backups |
| Glacier Instant | ms | $4.00 | $0.03 | Archives needing instant access |
| Glacier Flexible | min-hrs | $3.60 | $0.01-0.03 | Compliance archives |
| Deep Archive | 12-48h | $0.99 | $0.02 | Legal/regulatory (tape-backed) |
| Intelligent-Tiering | ms | auto | $0.00 | Unknown access patterns |
Standard to IA is a billing change (data stays put). Glacier uses denser HDD arrays with higher EC ratios like 16+4 (1.25x overhead, since access is rare). Deep Archive uses tape-backed systems where retrieval takes 12-48 hours because physical media must be loaded by robotic arms.
8. CDN and Edge Layer
Peak read traffic is 250K GETs/sec x 100KB = 25 GB/sec of egress. CDN offloads 60-80% of reads, cutting origin load to ~7.5 GB/sec.
Objects are immutable, so CDN caching is safe by design. But storage presigned URLs are NOT CDN-friendly: each URL has a unique signature, so CDN sees every request as a different cache key (always a miss). For private content through CDN, CDN-signed URLs with Origin Access Control are the correct pattern.
WAF at CDN edge handles per-IP/bucket/account rate limiting, geo-restrictions, and enumeration attack detection.
9. Control Plane vs Data Plane
| | Control Plane | Data Plane |
|---|---|---|
| Does | Decisions: placement, versioning, access | Bytes: encode, transfer, store |
| Components | Metadata Service, Raft, CRUSH, IAM, Lifecycle/GC | Data Router Fleet (stateless), Storage Nodes |
| Consistency | Strong (Raft) | Quorum (11/14 shards) |
The Data Router is a logical role, not necessarily a separate fleet. It handles erasure coding and shard distribution. In the API path, it runs as a separate stateless fleet (~1,500 instances). In the presigned path, the storage node takes on this role directly. The role is the same regardless of where the code runs: receive bytes, encode, distribute shards, coordinate the metadata commit.
10. Architecture Overview
Three parallel paths exist. Downloads go through CDN. Uploads go through the API path or presigned URLs. All three paths converge at the same storage layer with identical placement, erasure coding, and durability guarantees. Only the access pattern differs.
CDN Download Path
For reads with CDN-signed URLs. CDN talks directly to the storage origin. No API Router, no Metadata Service, no Data Router in this path.
Step by step: The client never talks to the storage system directly — it talks to the CDN. The backend generates a CDN-signed URL (using a CDN key pair, separate from storage credentials). The CDN edge validates the signature, checks expiry, and enforces access policies. On cache hit (60-80% of reads), the response is immediate. On cache miss, CDN fetches from the storage origin using its OAC credential. The storage origin calls the metadata service internally (PG ID, shard locations), fetches 10 data shards in parallel, reconstructs via Reed-Solomon if any shards are missing, and returns the assembled bytes. CDN caches and serves.
API Upload Path
Full controlled path. Every component involved: auth, metadata, encoding, storage, consensus.
Step by step: The client sends PUT /bucket/key with HMAC-SHA256 signature. Request passes through WAF and L7 load balancer. API Router validates the HMAC signature, calls IAM to check PutObject permission. Metadata Service hashes bucket+key to a PG ID, returns 14 storage node addresses via CRUSH. Data Router processes data in 1MB stripes: split into 10 data shards, compute 4 parity shards via Reed-Solomon, pipeline encoding with receiving. All 14 shards are sent to storage nodes in parallel. Each node appends the shard to an extent file, computes CRC32C, fsyncs, and ACKs. Once quorum succeeds (11/14), metadata commits via Raft (2/3 replicas confirm). Only after both shard quorum AND Raft commit succeed does the client get 200 OK.
If shards write successfully but the Raft commit fails (e.g., leader election), those shards become orphans. GC cleans them up within 24 hours. The client gets an error and retries.
Presigned Upload Path
Direct path for large uploads. Zero bytes through the API fleet.
Step by step: The client first obtains a presigned URL from the API server (normal authenticated request). The API server signs the URL with the storage system's secret key, embedding bucket, key, operation, expiry, and conditions. The client uploads directly to the storage node. The storage node validates the presigned URL by reconstructing the "string to sign" from the request, looking up the secret key (cached locally), and comparing HMAC-SHA256 signatures. No database lookup, no session state. The storage node then runs CRUSH lookup, Reed-Solomon encoding, distributes shards (1 local + 13 to peers), waits for quorum, and commits metadata via Raft.
The storage node temporarily takes on the Data Router's role. Once the upload completes, it returns to being a regular storage node.
Background Systems
Always running, not tied to any request path:
- Repair / Rebalance: Detects node/drive failures via heartbeat timeout (~30 seconds), reads surviving shards, reconstructs missing shards using Reed-Solomon, writes to healthy nodes. Also handles rebalancing when nodes are added or removed.
- Garbage Collector: Cleans up shards no longer referenced by any live object (overwrites, deletes, aborted multipart, failed writes). 24-hour grace period before physical deletion.
- Scrubber: Reads every shard on every drive periodically (full scan every 2 weeks), computes CRC32C, compares to stored checksum. Detects silent bit rot before a client reads corrupted data.
- Lifecycle Manager: Evaluates lifecycle rules (e.g., "move to Glacier after 90 days"). Triggers storage class transitions by re-encoding shards with the target EC scheme and moving to the appropriate tier.
Component Reference
| Component | Responsibility |
|---|---|
| CDN Edge | Caches objects for reads. Only involved in downloads. Uses CDN-signed URLs (not storage presigned URLs). |
| WAF / DDoS | Filters malicious requests, rate limiting. In front of the API path. |
| API Router Fleet | Stateless servers: validate HMAC signatures, route to internal services. Horizontally scalable, no state. |
| IAM Engine | Evaluates IAM policies + bucket policies. Returns allow or deny. |
| Metadata Service | Maps object key -> PG ID + version + ETag + shard locations. Does NOT decide which nodes hold shards (that is CRUSH). |
| Raft Groups (1.36M) | Consensus protocol ensuring metadata writes are durable and consistent. 3 replicas per partition, writes confirmed after 2/3 agree. |
| CRUSH Placement | Deterministic algorithm: given a PG ID, returns 14 storage nodes across failure domains. Runs locally, no central server. |
| Data Router Fleet | Stateless servers: receive object bytes, RS encode 10+4 shards, distribute to correct storage nodes. Only in the API path. |
| Storage Nodes | Machines with 36 HDDs each. Store shards in extent files, verify checksums on read, report health via heartbeats. In the presigned path, temporarily act as Data Router. |
Path Comparison
| Step | API Upload | Presigned Upload | CDN Download |
|---|---|---|---|
| Entry | WAF -> LB -> API Router | Storage Node directly | CDN Edge |
| Auth | API Router + IAM | Storage Node validates signature locally | CDN validates CDN-signed URL at edge |
| Placement | Data Router calls CRUSH | Storage Node calls CRUSH | Handled internally by storage origin |
| EC encode | Data Router | Storage Node (acts as Data Router) | N/A (read path) |
| Shard write | Data Router -> 14 nodes (quorum 11/14) | Storage Node -> 13 peers + 1 local (quorum 11/14) | N/A |
| Metadata | Shard quorum -> Raft commit -> ACK | Shard quorum -> Raft commit -> 200 OK | Storage origin handles lookup internally |
11. Scale Estimation
Request Throughput
Daily requests: 10B -> ~115K/sec average, ~350K/sec peak (3x)
GET 70%: 250K/sec
PUT 15%: 50K/sec
LIST 10%: 35K/sec
DELETE 5%: 15K/sec
Storage
100T objects x 100KB avg = 10 EB logical
10+4 EC (1.4x) = 14 EB physical
19,500 nodes x 36 drives x 20TB = 14 EB
700,000 total drives
Metadata
~700 bytes/object x 100T = ~70 PB, with LSM overhead: ~136 PB
~136 PB / 100GB per partition = ~1.36M partitions
13,600 metadata nodes (100 partitions each)
Bandwidth
PUT ingress: 50K/sec x 100KB = 5 GB/sec
GET egress: 250K/sec x 100KB = 25 GB/sec (CDN offloads 70% -> 7.5 GB/sec origin)
Parity write overhead: 5 GB/sec x (4/10) = 2 GB/sec extra
Router Nodes
API Routers: 350K peak req/sec / ~175 req/sec per node = ~2,000 nodes
Data Routers: 50K PUTs/sec / ~33 PUTs/sec per instance = ~1,500 nodes
Total router nodes: ~3,500
Summary
| Resource | Count |
|---|---|
| Storage nodes | ~19,500 |
| Total drives | ~700,000 |
| Physical storage | 14 EB |
| Metadata nodes | ~13,600 |
| Raft partitions | ~1,360,000 |
| Placement groups | ~100,000 |
| API/Data Router nodes | ~3,500 |
| CDN PoPs | 200+ |
| Peak req/sec | ~350,000 |
Cost breakdown:
Capex:
Storage nodes: 19,500 x $15K = $292M
Metadata nodes: 13,600 x $8K = $109M
Router nodes: 3,500 x $5K = $18M
Networking: $50M
Total capex: $469M
Opex (annual):
Power + cooling: 36,600 nodes x 500W x $0.10/kWh x 8,760h = ~$16M
Operations (staff, facilities, bandwidth): $100M
Total opex: ~$116M/year
Amortized over 4 years: ($469M / 4) + $116M = ~$233M/year
Cost per GB: $233M/year / (10 EB x 10^9 GB/EB) / 12 months = ~$0.002/GB/month
Internal cost: ~$2/TB/month
Retail ~$23/TB/month covers margin, multi-region, and ops overhead. At scale, cross-AZ network transfer often costs more than storage hardware itself.
12. Data Model
Bucket Metadata (PostgreSQL)
CREATE TABLE buckets (
account_id BIGINT NOT NULL,
bucket_name TEXT NOT NULL,
bucket_id UUID DEFAULT gen_random_uuid(),
region TEXT NOT NULL,
versioning TEXT DEFAULT 'disabled',
acl JSONB DEFAULT '{}',
policy JSONB,
lifecycle_rules JSONB DEFAULT '[]',
encryption JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
PRIMARY KEY (account_id, bucket_name)
);
CREATE UNIQUE INDEX idx_bucket_name ON buckets (bucket_name);
Object Metadata (Distributed KV, protobuf)
Key: /{bucket_id}/{object_key}\0{version_id}. Version IDs use bit-flipped timestamps (XOR each byte with 0xFF) so newer versions sort first in a forward scan. GET without a version_id reads the first entry — always the newest.
message ObjectMetadata {
string object_key = 1;
bytes version_id = 2;
uint64 size = 3;
string etag = 4;
string content_type = 5;
string storage_class = 6;
string encryption_key_id = 7;
map<string, string> user_metadata = 8;
uint32 placement_group = 9;
repeated ShardLocation shard_map = 10; // cached; recomputable from PG + CRUSH
bool is_delete_marker = 11;
int64 created_at = 12;
}
message ShardLocation {
uint32 shard_index = 1; // 0-13
uint64 node_id = 2;
uint64 extent_id = 3;
uint64 offset = 4;
uint32 length = 5;
}
Placement Group Map (etcd)
message PlacementGroup {
uint32 pg_id = 1;
repeated uint64 node_ids = 2; // 14 entries
uint64 epoch = 3;
string state = 4; // ACTIVE, REBALANCING, DEGRADED
}
Extent Index (local per node)
Each node holds roughly 72 billion shard entries (100T objects x 14 shards / 19,500 nodes). At 32 bytes per entry, a full in-memory index would need ~2.3TB — far exceeding RAM. The index is tiered: a bloom filter in memory (a few GB, eliminates most irrelevant extents), a persistent on-disk sorted index per sealed extent file, and an LRU cache in RAM for hot entries. Most reads hit the bloom filter, skip irrelevant extents, then do one disk seek into the right extent. The top 1-5% of hot entries live in RAM cache, covering the majority of read traffic.
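A simplified sketch of the tiered lookup, with a hand-rolled Bloom filter standing in for the production structure (sizes, probe counts, and the readDiskIndex callback are illustrative assumptions):

```go
package extentindex

import "hash/fnv"

// ShardLoc is where a shard lives on this node's disks.
type ShardLoc struct {
	ExtentID uint64
	Offset   uint64
	Length   uint32
}

// Bloom is a minimal Bloom filter using double hashing over one
// 64-bit FNV hash. A few GB of these in RAM let reads rule out
// almost every sealed extent without touching disk.
type Bloom struct {
	bits []uint64
	k    uint64
}

func NewBloom(mBits, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (mBits+63)/64), k: k}
}

func (b *Bloom) hashes(key string) (h1, h2 uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	return sum, (sum >> 33) | 1 // odd second hash for double hashing
}

func (b *Bloom) Add(key string) {
	h1, h2 := b.hashes(key)
	m := uint64(len(b.bits)) * 64
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % m
		b.bits[bit/64] |= 1 << (bit % 64)
	}
}

func (b *Bloom) MayContain(key string) bool {
	h1, h2 := b.hashes(key)
	m := uint64(len(b.bits)) * 64
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % m
		if b.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false // definitely not in this extent
		}
	}
	return true // maybe present: worth one disk seek
}

// Lookup walks the tiers: hot LRU cache, per-extent Bloom filters,
// then the on-disk sorted index (readDiskIndex stands in for that
// single seek) only for extents the filter cannot rule out.
func Lookup(key string, hot map[string]ShardLoc, filters map[uint64]*Bloom,
	readDiskIndex func(extentID uint64, key string) (ShardLoc, bool)) (ShardLoc, bool) {
	if loc, ok := hot[key]; ok {
		return loc, true // tier 1: RAM cache hit
	}
	for extentID, f := range filters {
		if f.MayContain(key) { // tier 2: bloom says "maybe"
			if loc, ok := readDiskIndex(extentID, key); ok {
				return loc, true // tier 3: one seek into the extent index
			}
		}
	}
	return ShardLoc{}, false
}
```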
Multipart Upload State
Key: /mpu/{bucket_id}/{object_key}/{upload_id}
message MultipartUpload {
string upload_id = 1;
map<uint32, PartInfo> parts = 2;
int64 created_at = 3;
int64 expires_at = 4; // auto-abort deadline
}
message PartInfo {
uint64 size = 1;
string etag = 2;
uint32 placement_group = 3;
repeated ShardLocation shards = 4;
}
Version Listing Index
Secondary index: /versions/{bucket_id}/{object_key}\0{reverse_version_id} pointing to primary metadata. Uses the same bit-flip trick to make newest versions sort first.
Lifecycle State
Key: /lifecycle/{bucket_id}/{rule_id}/{object_key} -> {current_class, target_class, transition_at, state}
13. API Design
PutObject
PUT /{bucket}/{key} HTTP/1.1
Host: storage.us-east-1.example.com
Content-Length: 10485760
Content-Type: image/jpeg
x-storage-class: STANDARD
x-storage-server-side-encryption: kms
x-storage-server-side-encryption-kms-key-id: kms:us-east-1:123456:key/abcd-1234
x-storage-meta-original-filename: beach-photo.jpg
Authorization: HMAC-SHA256 Credential=AKID/20260310/us-east-1/storage/request,
SignedHeaders=content-length;content-type;host;x-storage-date,
Signature=abcdef1234567890...
[10 MB of object bytes]
HTTP/1.1 200 OK
ETag: "d41d8cd98f00b204e9800998ecf8427e"
x-storage-version-id: 3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo
GetObject
GET /{bucket}/{key} HTTP/1.1
Range: bytes=0-1048575
If-None-Match: "d41d8cd98f00b204e9800998ecf8427e"
Returns 206 Partial Content with range, or 304 Not Modified if ETag matches.
DeleteObject
DELETE /{bucket}/{key}?versionId=3sL4kqtJlcpXr... HTTP/1.1
Without versionId on a versioned bucket: creates a delete marker (data untouched). With versionId: permanently deletes that version.
ListObjectsV2
GET /{bucket}?list-type=2&prefix=photos/2024/&delimiter=/&max-keys=1000 HTTP/1.1
<ListBucketResult>
<Prefix>photos/2024/</Prefix>
<CommonPrefixes><Prefix>photos/2024/january/</Prefix></CommonPrefixes>
<CommonPrefixes><Prefix>photos/2024/vacation/</Prefix></CommonPrefixes>
<Contents>
<Key>photos/2024/index.html</Key>
<ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
<Size>2048</Size>
</Contents>
<IsTruncated>true</IsTruncated>
<NextContinuationToken>def456</NextContinuationToken>
</ListBucketResult>
Multipart Upload
POST /{bucket}/{key}?uploads HTTP/1.1 -> UploadId=abc123
PUT /{bucket}/{key}?partNumber=1&uploadId=abc123 HTTP/1.1 -> ETag for part 1
PUT /{bucket}/{key}?partNumber=2&uploadId=abc123 HTTP/1.1 -> ETag for part 2
POST /{bucket}/{key}?uploadId=abc123 HTTP/1.1 -> Complete (send part ETags)
DELETE /{bucket}/{key}?uploadId=abc123 HTTP/1.1 -> Abort (if needed)
Parts upload in parallel, out of order, from different machines. Each part is independently erasure-coded.
Presigned URLs
https://storage.../my-bucket/photo.jpg
?X-Storage-Algorithm=HMAC-SHA256
&X-Storage-Credential=AKID%2F20260310%2Fus-east-1%2Fstorage%2Frequest
&X-Storage-Expires=3600
&X-Storage-Signature=fe5f80f77d5fa3beca038a248ff027...
Self-contained: signature embeds bucket, key, operation, expiry. Conditions: IP range, content-type, content-length-range. These go direct to storage nodes, not through CDN.
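A minimal sketch of presigned-URL signing and node-side validation. The canonical string format here is a stated assumption — a real scheme pins URI encoding, header ordering, and credential scope — but the mechanics match the flow above: rebuild the string to sign from the request, recompute the HMAC with a locally cached secret, compare in constant time.

```go
package presign

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// stringToSign is a hypothetical canonicalization of the request.
func stringToSign(method, bucket, key, credential string, expires int64) string {
	return fmt.Sprintf("%s\n/%s/%s\n%s\n%d", method, bucket, key, credential, expires)
}

// Sign produces the signature the API server embeds in the URL.
func Sign(secret, method, bucket, key, credential string, expires int64) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(stringToSign(method, bucket, key, credential, expires)))
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify is what the storage node runs: no database lookup, no
// session state -- just a recomputed HMAC against the cached secret.
func Verify(secret, method, bucket, key, credential, sig string, expires int64, now time.Time) bool {
	if now.Unix() > expires {
		return false // URL expired
	}
	want := Sign(secret, method, bucket, key, credential, expires)
	return hmac.Equal([]byte(want), []byte(sig)) // constant-time compare
}
```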
14. End-to-End: From Upload to Recovery
Before examining individual flows, this walkthrough shows the full lifecycle using a simplified 2+1 scheme (2 data shards, 1 parity). The real system uses 10+4, but the mechanics are identical.
Setup: File = [A B C D] (4 blocks). Stripe size = 2 blocks (k=2). One parity shard per stripe (m=1).
Step 1: Split into stripes and encode.
| Stripe | Data Shards | Parity Shard | Stored |
|---|---|---|---|
| Stripe 1 | D0 = A, D1 = B | P = A XOR B | [A, B, A XOR B] |
| Stripe 2 | D0 = C, D1 = D | P = C XOR D | [C, D, C XOR D] |
Step 2: Distribute shards across nodes.
| Node | Stripe 1 | Stripe 2 |
|---|---|---|
| Node 1 | A | C |
| Node 2 | B | D |
| Node 3 | A XOR B (parity) | C XOR D (parity) |
Every shard on a different node, different rack. Metadata records the mapping.
Step 3: Node 2 crashes. Heartbeat detects failure within 30 seconds. Shards B and D are now missing.
Step 4: Read during failure. The client gets the complete file. No error, just ~50ms extra latency from reconstruction.
Step 5: Background repair. Repair worker reconstructs B, writes it to Node 7, updates metadata. Full redundancy restored.
Mapping to production: Replace 2+1 with 10+4. Stripes are 1MB. Parity uses GF(2^8) matrix math instead of simple XOR. The system tolerates 4 missing shards instead of 1.
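The 2+1 scheme above fits in a few lines of Go; reconstruction works because XOR is its own inverse:

```go
package ecdemo

// encode21 computes the parity shard for a 2+1 stripe: P = D0 XOR D1.
// d0 and d1 must be the same length.
func encode21(d0, d1 []byte) []byte {
	p := make([]byte, len(d0))
	for i := range d0 {
		p[i] = d0[i] ^ d1[i]
	}
	return p
}

// reconstruct recovers a missing data shard from the survivor and
// the parity: P = D0 XOR D1 implies D1 = D0 XOR P. With Node 2 down,
// B = A XOR (A XOR B).
func reconstruct(survivor, parity []byte) []byte {
	return encode21(survivor, parity)
}
```

In production, the same structure holds, but the parity computation becomes a Reed-Solomon matrix multiply over GF(2^8), and any 10 of the 14 shards can play the role the survivor-plus-parity pair plays here.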
15. The PUT Path
- API Router validates HMAC, calls IAM. (~2ms)
- Metadata Service returns PG + shard-to-node mapping. (~5ms)
- Data Router reads 1MB stripes, RS encodes 10 data + 4 parity shards.
- 14 parallel shard writes to storage nodes (append to extent file, CRC32C, fsync). (~10-20ms)
- Quorum 11/14 means up to 3 slow nodes don't block. Missing shards repaired async.
- Raft commit of object metadata. (~5-10ms)
- Return 200 with ETag and version-id.
Total for 10MB PUT: ~38ms + network transfer. Data is not durable until both shard quorum write and Raft metadata commit succeed.
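The quorum step is worth pinning down. A sketch, assuming a per-node writeShard callback (names are illustrative): the client unblocks at 11 ACKs, and the stragglers either finish in the background or are repaired later.

```go
package put

import (
	"context"
	"errors"
)

// writeShard sends one shard to one storage node (stand-in for the
// real RPC: append to extent, CRC32C, fsync, ACK).
type writeShard func(ctx context.Context, shardIdx int) error

// quorumWrite fires all shard writes in parallel and returns as soon
// as `quorum` have ACKed, so up to total-quorum slow or dead nodes
// never block the client.
func quorumWrite(ctx context.Context, write writeShard, total, quorum int) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stragglers are cancelled or repaired async

	acks := make(chan error, total) // buffered: no goroutine leaks
	for i := 0; i < total; i++ {
		go func(idx int) { acks <- write(ctx, idx) }(i)
	}

	ok, failed := 0, 0
	for ok < quorum {
		if err := <-acks; err != nil {
			failed++
			if failed > total-quorum {
				return errors.New("shard quorum unreachable")
			}
		} else {
			ok++
		}
	}
	return nil // quorum durable; Raft metadata commit comes next
}
```

For this design the call is quorumWrite(ctx, w, 14, 11): 3 failures are tolerable, a 4th fails the PUT and the client retries.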
16. The GET Path
Three ways to read an object. Each skips different layers.
GET via CDN (most common for high-traffic reads): CDN edge validates CDN-signed URL, checks cache. Hit (60-80%): serve immediately. Miss: fetch from storage origin via OAC.
GET via API (normal authenticated read): API Router validates auth, calls IAM. Metadata lookup from Raft leader returns PG ID, ETag, size, cached shard locations (~5ms). Data Router fetches 10 data shards in parallel. Happy path (>99%): all 10 respond, stream to client (~15-20ms). Degraded path: 1-4 shards missing, request parity shards, RS reconstruct (+50-200ms), queue background repair.
GET via presigned URL (direct, no CDN): Storage node validates signature, reads local shards, fetches remaining from peers, reconstructs if needed, streams back.
Speculative read: If 9 of 10 shards respond quickly and 1 lags, fire off a request for a parity shard without waiting. Whichever arrives first wins. This keeps p99 tight even when individual nodes are slow.
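A sketch of that hedged-request pattern, assuming the caller already holds 9 of 10 data shards and is racing the straggler against a parity fetch (function names are illustrative):

```go
package read

import (
	"context"
	"time"
)

// fetchFn retrieves one shard's bytes.
type fetchFn func(ctx context.Context) ([]byte, error)

// speculative returns the missing shard from whichever source
// answers first: the lagging node, or a parity shard requested after
// hedgeDelay. The loser is cancelled.
func speculative(ctx context.Context, slow, parity fetchFn, hedgeDelay time.Duration) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		data []byte
		err  error
	}
	ch := make(chan result, 2) // buffered: neither goroutine leaks

	go func() {
		d, err := slow(ctx)
		ch <- result{d, err}
	}()
	go func() {
		select {
		case <-time.After(hedgeDelay): // hedge only if still waiting
			d, err := parity(ctx)
			ch <- result{d, err}
		case <-ctx.Done():
			ch <- result{nil, ctx.Err()}
		}
	}()

	first := <-ch
	if first.err == nil {
		return first.data, nil // fastest responder wins
	}
	second := <-ch
	return second.data, second.err
}
```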
17. Presigned URLs: Storage vs CDN
Problem: Two completely separate signing systems exist. Mixing them up is one of the most common integration mistakes.
Simple example: A storage presigned URL sends the request directly to a storage node — CDN is bypassed entirely. A CDN-signed URL sends the request to CDN edge, which validates and caches. Using a storage presigned URL through CDN produces 100% cache misses because every URL has a unique signature (different cache key).
Mental model: Storage presigned URLs are direct-dial phone numbers — they reach the storage node. CDN-signed URLs are switchboard numbers — the CDN operator routes the call, and can serve from memory if asked the same question recently.
Storage presigned PUT flow:
CDN-signed URLs (the correct CDN pattern): The storage bucket is private. CDN is the only entity allowed to fetch from storage via OAC. The backend generates a CDN-signed URL using a CDN key pair. CDN verifies signature and expiry at the edge. On cache miss, CDN fetches from storage using its OAC credential.
| Pattern | URL goes to | Who validates | Caching | Best for |
|---|---|---|---|---|
| Storage presigned URL | Storage directly | Storage node | None | Uploads, temp downloads |
| CDN-signed URL + OAC | CDN edge | CDN edge | Full CDN caching | Private content, high read traffic |
| Public bucket + CDN | CDN edge | No auth needed | Full CDN caching | Static assets, public content |
18. Multipart Upload
Problem: A 5TB object cannot be uploaded in a single request. Network failures mid-upload would require starting over. Parts need to upload in parallel from different machines.
Simple example: A 1GB video is split into 64 parts of 16MB each. Each part uploads independently and is erasure-coded on arrival. If the upload fails at part 40, ListParts reveals parts 1-39 and 41-64 are already durable. Only part 40 needs re-upload.
Mental model: Each part is a self-contained mini-upload with full durability. Completion is a metadata operation that stitches the parts together into one visible object. No data is copied during completion — existing part shard locations are linked into the final object record.
Each part is independently erasure-coded with its own shard map. Parts upload in parallel, out of order, from different machines. The upload_id ties them together.
What happens during CompleteMultipartUpload:
- Validate: Check UploadId exists, all referenced parts present, ETags match, ascending order.
- Assemble: Link existing parts into a single logical object by updating metadata pointers. Primarily a metadata operation.
- Compute final ETag: MD5(concat(MD5(part1) + MD5(part2) + ...)) + "-" + part_count (sketched below).
- Atomic commit: Object becomes visible in one metadata write. Before: object does not exist. After: fully readable.
- Cleanup: Temporary multipart state removed. Part data stays (now referenced by final object).
For stronger integrity, explicit SHA-256 or CRC32 checksums can be provided at completion time. Incomplete uploads older than 7 days are automatically cleaned up.
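A sketch of the final-ETag rule described above (MD5 over the concatenated binary part MD5s, suffixed with the part count):

```go
package multipart

import (
	"crypto/md5"
	"fmt"
)

// partMD5 is the digest each UploadPart returns as its ETag (hex-encoded).
func partMD5(part []byte) [16]byte {
	return md5.Sum(part)
}

// finalETag hashes the concatenated binary part digests and appends
// the part count, e.g. "9b2cf...-64" for a 64-part upload.
func finalETag(partMD5s [][16]byte) string {
	h := md5.New()
	for _, m := range partMD5s {
		h.Write(m[:])
	}
	return fmt.Sprintf("\"%x-%d\"", h.Sum(nil), len(partMD5s))
}
```

This is why a multipart ETag is not the MD5 of the whole object: verifying content integrity requires replaying the same part boundaries.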
| Object Size | Part Size | Parts | Parallelism |
|---|---|---|---|
| 100MB | 10MB | 10 | 5-10 |
| 1GB | 16MB | 64 | 10 |
| 10GB | 64MB | ~160 | 10-20 |
| 1TB | 512MB | ~2,000 | 50-100 |
| 5TB | 512MB | ~10,000 | 100 |
19. Small Object Packing: Extent Files
Problem: Most objects are small (median ~10KB). Storing each as a separate file hits three walls simultaneously: inode exhaustion (ext4 tops out around 4 billion inodes per filesystem, and typical formats allocate far fewer), IOPS waste (a 7200 RPM HDD does ~150 random I/O ops/sec because each read requires a physical disk seek of ~7ms), and filesystem fragmentation from billions of tiny files.
Simple example: A 1KB metadata object stored as its own file costs one inode and one disk seek. Packed into a 256MB extent file alongside thousands of other objects, it costs 32 bytes of index memory and one sequential read.
Mental model: Instead of giving every letter its own filing cabinet drawer, stuff thousands of letters into one large envelope and keep an address book that says "letter X is at position 4,096 in envelope 7."
+------------------------------------------------------------------+
| Extent File (256 MB) |
| Header | ObjA shard|CRC|pg| ObjB shard|CRC|pg| ... | free space |
+------------------------------------------------------------------+
Write: Append shard to open extent, update in-memory index, fsync. Seal at 256MB.
Read: O(1) in-memory hash lookup -> seek to offset -> read bytes -> verify CRC32C.
| Metric | One file per object | Extent packing |
|---|---|---|
| Write IOPS (1KB, HDD) | 150/sec (seek) | 5,000+/sec (sequential) |
| Max objects per 20TB | ~4B (inode limit) | Unlimited (memory-bound) |
| Metadata per object | ~256 bytes (inode) | 32 bytes (in-memory) |
Compaction: When >30% of an extent's shards are dead, GC rewrites live shards into a new extent. Large objects (>4MB): stored in dedicated files, not packed.
Small object durability strategy: For objects under ~256KB, EC overhead is disproportionate. A 1KB object produces 14 shards of ~100 bytes each, and every read fetches 10 of them from 10 different nodes. Most production systems handle this with a size-based policy:
| Object size | Strategy | Rationale |
|---|---|---|
| < 256KB | 3x replication | EC overhead too high. Replication is simpler, reads hit one node. |
| 256KB - 4MB | EC or replication (configurable) | Transition zone. EC starts to pay off. |
| > 4MB | Erasure coding (10+4) | EC savings dominate. Stored in dedicated files. |
20. Prefix Listing at Scale
Keys are sorted in the LSM-tree. prefix=photos/2024/ triggers a range scan from "photos/2024/" to "photos/2024/\xff". For each key: strip prefix, find delimiter -> CommonPrefix (simulates subdirectory) or Contents (simulates file).
Scaling to billions of matches: The metadata service scans all relevant partitions in parallel. Results come back sorted per partition. A k-way merge combines these sorted streams into one globally sorted result. Once 1000 results are collected (max-keys default), scanning stops and a continuation token marks where to resume.
Applications must design key structure with LIST in mind. {date}/{category}/{uuid} is efficient. {uuid} with no prefix structure means scanning everything.
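A sketch of the delimiter rule applied to an already-merged sorted key stream (the k-way merge across partitions is assumed to have happened upstream):

```go
package listing

import "strings"

// ListResult mirrors ListObjectsV2 output: plain keys become
// Contents, delimiter groups collapse into CommonPrefixes.
type ListResult struct {
	Contents       []string
	CommonPrefixes []string
}

// classify walks sorted keys under prefix. If the remainder of a key
// contains the delimiter, everything up to and including its first
// occurrence becomes a CommonPrefix (a simulated subdirectory);
// otherwise the key is a regular object.
func classify(sortedKeys []string, prefix, delimiter string, maxKeys int) ListResult {
	var res ListResult
	seen := map[string]bool{}
	for _, key := range sortedKeys {
		if !strings.HasPrefix(key, prefix) {
			continue
		}
		rest := key[len(prefix):]
		if i := strings.Index(rest, delimiter); delimiter != "" && i >= 0 {
			cp := prefix + rest[:i+len(delimiter)]
			if !seen[cp] { // emit each "directory" once
				seen[cp] = true
				res.CommonPrefixes = append(res.CommonPrefixes, cp)
			}
		} else {
			res.Contents = append(res.Contents, key)
		}
		if len(res.Contents)+len(res.CommonPrefixes) >= maxKeys {
			break // a continuation token resumes from the next key
		}
	}
	return res
}
```

With prefix "photos/2024/" and delimiter "/", the key "photos/2024/vacation/beach.jpg" folds into the CommonPrefix "photos/2024/vacation/" while "photos/2024/index.html" lands in Contents, matching the XML example in section 13.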
21. Garbage Collection
Problem: Overwrites, deletes, aborted multipart uploads, and failed writes all leave orphan shards. Without cleanup, leaked storage accumulates to petabytes within months.
Simple example: An object is overwritten. The old shards still sit on disk. Nothing references them. GC finds them 24 hours later and reclaims the space.
Mental model: GC is the janitorial crew. It never enters during business hours (never in the write path). It cleans up after hours (async background), with a waiting period (24-hour grace) so nothing gets thrown away that someone might still be reading.
Dead shards sit for a 24-hour grace period before physical deletion. This protects against in-flight reads, clock skew, and race conditions. GC gets 20% of cluster bandwidth (~1.4 GB/sec, ~121 TB/day).
When an extent passes 30% dead ratio, GC compacts it by rewriting live shards into a fresh extent.
Orphan detection: Once a week, the metadata service exports every referenced shard. Each storage node compares that list against its local index. Unmatched entries are orphan candidates, quarantined for 7 days, then deleted.
22. Rebalancing and Repair
Problem: When a node or drive fails, some placement groups lose a shard. The system must reconstruct missing shards before a second failure hits the same PG and threatens durability.
Simple example: Node X fails (36 drives). The repair worker reads 10 surviving shards per affected PG, reconstructs missing shards via Reed-Solomon, writes them to healthy nodes, and updates the PG map.
Mental model: A library book is damaged. The librarian can reconstruct it from the "cheat sheets" (parity) and the remaining chapters (surviving data shards). The sooner the reconstruction finishes, the lower the chance of losing another chapter.
Triggers: Node added (new capacity), node removed (decommission), disk failure, capacity imbalance.
Repair flow: Heartbeat detects failure (30s) -> PGs prioritized by severity (10/14 shards = critical, 11/14 = high, 13/14 = low) -> read 10 surviving shards -> RS reconstruct -> write to new node -> update PG map.
Rate-limited: 20% of cluster I/O. Backpressure: pause if client p99 exceeds threshold.
Repair time for one node failure: ~72 PGs of data to reconstruct, roughly ~1 PB total. At 1.4 GB/sec with 10 parallel workers: ~20 hours under ideal conditions. Real-world repair takes 2-3x longer due to throttling, network contention, and concurrent repairs. Keeping repair fast is critical because the longer the window stays open, the higher the chance of a second failure hitting the same PG.
Node decommission: Mark DRAINING -> stop writes -> copy all shards to replacements -> verify -> remove from CRUSH.
23. Versioning
Versioning enabled: every PUT creates a new version. DELETE without version_id creates a delete marker (data untouched, recoverable). DELETE with version_id permanently removes that version.
MFA Delete: Requires MFA to permanently delete versions or change versioning state. Protects against compromised credentials.
Storage impact: 1MB file updated 1,000x/day = 1GB/day versioned data. Lifecycle rules are essential: "expire non-current versions after 30 days."
24. Hot Objects and Caching
Anything accessed more than 100 times per minute gets promoted to a local SSD cache on the Data Router (LRU, 10GB per node). For viral content hitting 10K+ req/sec on a single object, the system copies it to multiple Data Router caches and uses request coalescing at the CDN edge. A thousand simultaneous cache misses become one origin fetch instead of a thousand.
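Origin-side coalescing can be sketched with Go's singleflight package — one illustration of the general technique, not necessarily what a CDN edge uses internally (fetchOrigin is a stand-in for the real metadata-lookup-plus-shard-fetch path):

```go
package hotcache

import "golang.org/x/sync/singleflight"

var group singleflight.Group

// fetchOrigin stands in for the real origin read: metadata lookup,
// shard fetch, Reed-Solomon reassembly.
func fetchOrigin(key string) ([]byte, error) {
	return nil, nil // placeholder
}

// Get coalesces concurrent misses: a thousand simultaneous requests
// for the same viral object trigger exactly one origin fetch; the
// other 999 callers block and share the result.
func Get(key string) ([]byte, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return fetchOrigin(key)
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
```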
25. Storage Layer Details
Each node: 36 HDDs (20TB), 2 NVMe SSDs (extent index + WAL), 256GB RAM, 2x25Gbps NIC.
CRUSH failure domain hierarchy: Region -> AZ (3) -> Rack (~540) -> Node (~36/rack) -> Drive (36/node).
For a 10+4 PG: 5+5+4 shard distribution across 3 AZs, each shard on a different rack, max 5 per AZ. Losing the AZ holding 4 shards leaves exactly 10 (sufficient); losing an AZ holding 5 leaves only 9, so those PGs are unreadable until the AZ returns. For AZ-loss tolerance on critical data, use 10+6 (6+5+5 across AZs: losing any AZ leaves at least 10 of 16 shards, enough to reconstruct).
26. User Experience Scenarios
5MB photo upload: Auth 2ms + PG lookup 5ms + encode <1ms + 14 shard writes 20ms + Raft 10ms = ~40ms + transfer. Photo erasure-coded across 3 AZs with 11 nines.
4GB video multipart: 800 parts x 5MB, 10 parallel. ~5.5 minutes on 100 Mbps. WiFi drops at part 400 -> reconnect -> ListParts -> resume from 400. No wasted work.
Data lake byte-range read: Bytes 100MB-110MB of a 50GB Parquet file -> Data Router computes target stripe -> fetches only relevant shards. ~20ms first byte. Never downloads the full 50GB.
Accidental delete recovery: Versioning enabled -> DELETE only creates marker -> list versions -> GET with version_id or delete the marker. Data never lost.
27. The Partial Update Layer
Object storage is immutable by design. There is no PATCH, no partial write, no append. Changing 1 byte in a 5GB file means re-uploading the entire 5GB. This is deliberate: immutability keeps the storage engine simple.
The manifest service described here is application code built on top of the immutable blob store. The storage system just stores blobs.
Problem: File sync tools, collaborative editors, and backup systems need to update portions of files without re-uploading everything. How is file-level editing built on top of an immutable store?
Simple example: A 100MB document is split into 12 chunks of ~8MB each. Each chunk is stored as a regular object. A manifest record says "file doc-123, version 5 = [chunk_a, chunk_b, ..., chunk_l] in this order." Editing chunk 3 means uploading one new 8MB object and updating the manifest to reference it. Total upload: 8MB instead of 100MB.
Mental model: The storage system is a warehouse of numbered boxes. The manifest is a recipe card that says "to assemble this file, combine boxes 7, 12, 3, 45, 22 in that order." Changing one ingredient means putting a new box in the warehouse and updating the recipe card. Other boxes are untouched.
Manifest Data Model
Key: /{tenant_id}/{file_id}\0{version} (newest first via bit-flipped timestamps)
message FileManifest {
string file_id = 1;
string tenant_id = 2;
uint64 version = 3;
repeated ChunkRef chunks = 4;
uint64 total_size = 5;
string content_hash = 6;
int64 created_at = 7;
uint64 expected_version = 8; // optimistic locking
}
message ChunkRef {
uint32 index = 1;
string object_key = 2; // key in chunks bucket
uint64 offset = 3;
uint64 size = 4;
string etag = 5;
}
Manifest API
| Endpoint | Purpose |
|---|---|
| POST /files/{id}/upload-urls | Get presigned URLs for new chunks |
| POST /files/{id}/versions | Commit new manifest (with If-Match: version_N for optimistic locking) |
| GET /files/{id} | Get latest manifest + chunk presigned GET URLs |
| GET /files/{id}/versions | List all versions |
| DELETE /files/{id}/versions/{v} | Delete specific version |
Atomic Commit and Consistency
- Upload new/changed chunks via presigned URLs.
- Create new manifest referencing old chunks (unchanged) + new chunks.
- Atomic pointer switch: manifest service commits new version in a single Raft write.
No partial state is visible. Before commit: readers see version N. After: version N+1. If commit fails, uploaded chunks are orphans cleaned by GC. Reads always go through the Raft leader for the latest manifest.
Concurrency, Failure, Cleanup
Optimistic locking: Read version N -> edit -> submit with expected_version=N. If another client committed N+1, reject with 409 Conflict -> re-read, re-apply, retry. No distributed locks.
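A minimal sketch of that compare-and-set commit, with an in-memory map and mutex standing in for the Raft-backed manifest store (Raft's single-writer ordering plays the mutex's role in production):

```go
package manifest

import (
	"errors"
	"sync"
)

// ErrConflict maps to the 409 returned when expected_version no
// longer matches the latest committed version.
var ErrConflict = errors.New("manifest version changed")

// Store is an in-memory stand-in for the Raft-backed manifest KV.
type Store struct {
	mu       sync.Mutex
	versions map[string]uint64 // file_id -> latest committed version
}

func NewStore() *Store { return &Store{versions: map[string]uint64{}} }

// Commit applies the optimistic-locking rule: succeed only if the
// caller's expectedVersion is still the latest. On ErrConflict the
// client re-reads the manifest, re-applies its edit, and retries.
func (s *Store) Commit(fileID string, expectedVersion uint64) (uint64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.versions[fileID] != expectedVersion {
		return 0, ErrConflict // another client committed N+1 first
	}
	s.versions[fileID] = expectedVersion + 1
	return expectedVersion + 1, nil
}
```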
Read path: Fetch manifest -> get ordered chunk list with presigned GET URLs -> fetch chunks in parallel (10-20 concurrent) -> reassemble. For range reads, the manifest service computes which chunks cover the byte range and returns only those URLs.
Chunk upload failure: retry (idempotent). Manifest commit failure: retry (chunks already durable). Upload dies halfway: file unchanged because manifest unchanged. Check which chunks exist with HEAD, upload missing ones, commit.
GC: Background scanner finds chunks not referenced by any live manifest. 24-hour grace period, then delete. Aborted uploads cleaned after 7 days.
| Chunk Size | Best For |
|---|---|
| 4MB | Document editing, configs (fine-grained updates) |
| 8MB | General-purpose (recommended default) |
| 16MB | Video, media (fewer fetches, less metadata) |
28. What Breaks First at Scale
- Metadata hot spots. Popular bucket overwhelms a single Raft partition. Auto-split at 200GB or 10K ops/sec.
- Reconstruction storms. Rack failure (20 nodes) triggers massive repair I/O. Rate-limited repair + priority queues prevent cascading degradation.
- GC backlog. 5K dead shard sets/sec from overwrites. Without 20% I/O budget for GC, leaked storage hits petabytes in months.
- Small object IOPS wall. Median 10KB object = 150 reads/sec per HDD (seek-bound). Extent packing is mandatory.
- Presigned URL abuse. Leaked URL with 7-day expiry is an open door. Short expiry + IP restrictions + bucket deny policies.
29. Bottlenecks and Mitigations
| Bottleneck | Mitigation |
|---|---|
| Metadata partition hot spots | Auto-split at 200GB or 10K ops/sec. Add hash prefix for sequential keys. |
| EC reconstruction latency (+50-200ms) | Proactive repair, speculative parity reads, SSD cache |
| Small object IOPS (150/sec/HDD) | Extent packing, SSD read cache, batch reads |
| LIST on enormous prefixes | Parallel merge, secondary prefix index, max-keys 1000 |
| LSM compaction write amplification | Rate-limited leveled compaction, NVMe for metadata |
| PG rebalancing I/O spike | Rate-limited migration (2 PGs/node/hour), off-peak scheduling |
| GC vs client I/O contention | 20% I/O budget for GC, lower priority class |
| Cross-AZ EC bandwidth | Place 7/10 data shards in local AZ |
| Multipart orphan storage leak | Auto-abort after 7 days |
| 136 PB metadata footprint | Bloom filters, NVMe mandatory, prefix-partitioned scans |
| Presigned URL abuse | Short expiry (1h default), IP restrictions, audit logging |
| Raft round-trip overhead | Pipelined Raft, read leases |
| CDN cache stampede | Request coalescing, stale-while-revalidate |
| Manifest hot files | Short TTL cache, read replicas, per-file rate limit |
30. Failure Scenarios
| Scenario | Shards Remaining | Impact | Recovery | Data Loss |
|---|---|---|---|---|
| Single drive failure | 13/14 | Zero user impact | Repair in hours | None |
| Node failure (36 drives) | 13/14 per PG (~72 PGs) | Degraded reads +50ms | Background repair, rate-limited | None |
| Full AZ outage | 9-10/14 | PGs with <=9 shards temporarily unreadable | Wait for AZ recovery or rebuild | None (shards in the downed AZ are intact, just unreachable) |
| Raft leader failure | N/A | Partition writes stall ~5-10ms | Raft election (1-2 round trips) | None |
| Metadata corruption | N/A | Corrupted replica removed | New replica from snapshot + log | None (2 healthy replicas) |
| Silent bit rot | 13/14 | Detected by CRC32C on read or scrubber | Reconstruct from 10 good shards | None if <5 corrupted |
| AZ network partition | N/A | Minority-side leaders stall | Leaders step down; majority continues | None (Raft guarantee) |
| Multipart failure | N/A | Incomplete upload | ListParts -> resume. Auto-abort after 7 days. | None |
| EC engine bug | 10/14 data OK | Bad parity reconstruction | Detect via ETag mismatch. Re-encode from 10 data shards. | None if data shards intact |
| GC premature deletion | 13/14 | Durability reduced | 24h grace + weekly reconciliation + soft delete | None (EC covers it) |
| Partition split during writes | N/A | ~100-200ms write stall | Automatic; API Router updates partition map | None |
| Reconstruction storm | 13/14 per PG | Cascading I/O degradation | Rate-limited repair, priority queue, backpressure | None if >=10 shards survive |
| CDN origin failure | N/A | Cached: served. Uncached: 503. | DNS health checks reroute to healthy region | None (availability only) |
| Metadata service fully down | N/A | All reads/writes fail. Data intact on disk. | Raft election or cluster recovery from snapshots | None (availability, not durability) |
When Data Is Actually Lost
Only when 5+ of 14 shards for the same object are destroyed before repair completes.
Probability: With 2% annual drive failure rate and 6-hour repair time:
- P(1 drive fails in 6h) ~ 1.4 x 10^-5
- P(5 of 14 fail before repair) ~ 10^-21 per object per cycle
- Across 100T objects annually: ~10^-4 objects lost/year = 1 object per 10,000 years
That is 11 nines.
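A few lines reproduce each figure in the list above (assuming independent failures, which the mitigations below exist to preserve):

```go
package durability

import "math"

// binom computes the binomial coefficient C(n, k).
func binom(n, k int) float64 {
	r := 1.0
	for i := 0; i < k; i++ {
		r *= float64(n-i) / float64(k-i)
	}
	return r
}

// LossOdds reproduces the back-of-envelope numbers: p is the chance
// one specific drive fails within a single 6-hour repair window.
func LossOdds() (pDrive, pObject, objectsLostPerYear float64) {
	pDrive = 0.02 * 6.0 / 8760.0                 // ~1.4e-5
	pObject = binom(14, 5) * math.Pow(pDrive, 5) // ~1e-21: 5 of 14 shards gone
	cycles := 8760.0 / 6.0                       // repair windows per year
	objectsLostPerYear = pObject * 1e14 * cycles // ~1.4e-4 across 100T objects
	return
}
```

The exponential dependence on pDrive^5 is why the section on repair calls repair speed the primary durability lever: doubling the repair window doubles pDrive and multiplies the loss probability by 32.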
| Risk Factor | Mitigation |
|---|---|
| Correlated failures (bad drive batch) | Diversify vendors, spread same-batch across racks |
| Slow repair (overloaded cluster) | Prioritize repair I/O, pre-provision capacity |
| Silent corruption | Scrubber every 2 weeks, CRC on every read |
| Firmware bug | Verify shards on read-back before ACK |
| Natural disaster | Cross-region replication |
31. Deployment and Operations
Multi-Region Deployment
| Region | API Routers | Data Routers | Metadata | Storage | Total |
|---|---|---|---|---|---|
| us-east-1 | 800 | 600 | 5,500 | 8,000 | 14,900 |
| us-west-2 | 600 | 450 | 4,000 | 6,000 | 11,050 |
| eu-west-1 | 600 | 450 | 4,100 | 5,500 | 10,650 |
Technology Stack
| Layer | Technology |
|---|---|
| CDN / Edge | CloudFront / Fastly |
| API Routers | Go, stateless, Kubernetes |
| Data Routers | Rust, ISA-L, Kubernetes |
| Metadata | Go, Raft + LSM-tree |
| Storage Nodes | Rust, direct I/O, extents |
| Manifest Service | Go, PostgreSQL / Raft KV |
| Coordination | etcd (3-node/region) |
| Observability | Prometheus + Grafana + Jaeger |
| Events | Kafka |
Canary Deployment
5% rollout, 30-minute soak. Auto-rollback if: PUT success <99.9%, GET p99 +20%, reconstruction rate 2x, Raft commit +50%, CDN hit ratio -10%.
Storage node upgrades: rack-by-rack rolling (drain -> upgrade -> smoke test -> re-enable). 19,500 nodes across ~540 racks at 20 min/rack = ~7.5 days full cluster.
Observability
Key SLIs:
# Latency
put_latency_ms{quantile="0.99"}
get_latency_ms{quantile="0.99"}
get_first_byte_latency_ms{quantile="0.99"}
list_latency_ms{quantile="0.99"}
# CDN
cdn_cache_hit_ratio{region}
cdn_origin_fetch_latency_ms{quantile="0.99"}
# Erasure coding
ec_reconstruct_count_total
ec_reconstruct_latency_ms{quantile="0.99"}
# Health
shard_repair_backlog
pgs_degraded_total
gc_backlog_bytes
gc_bytes_reclaimed_per_hour
raft_commit_latency_ms{quantile="0.99"}
raft_leader_elections_total
drive_utilization_percent{node_id}
cross_az_bandwidth_bytes_per_sec{src_az, dst_az}
Critical Alerts:
| Alert | Condition | Severity |
|---|---|---|
| PUT success rate drop | <99.9% for 5 min | P1 |
| Reconstruction rate spike | >5% of GETs | P2 |
| Raft election storm | >10/min for a partition | P1 |
| Drive failure spike | >20/day (vs baseline 5.5) | P1 |
| GC backlog growth | Increasing for >24h | P2 |
| PG degradation | >50 degraded PGs | P1 |
| Raft replication lag | >5 seconds | P1 |
| CDN hit ratio drop | <50% for 15 min | P2 |
Distributed tracing: Every request gets a trace ID at the API Router. Follows through CDN -> API Router -> Metadata -> Data Router -> Storage Nodes -> Raft. Sampling: 0.1% normal, 100% for errors or p99 breaches.
Example degraded GET trace:
TraceID: 7a8b9c0d... Duration: 127ms
cdn_edge.check: 3ms (MISS)
api_router.authenticate: 2ms
metadata_service.lookup: 6ms (partition 87432)
data_router.fetch_shards: 112ms
9/10 shards in 15ms, shard 1 TIMEOUT at 50ms
Parity shard 10: 12ms, RS reconstruct: 0.3ms
Security
IAM and Policy Engine: Every request evaluates (principal, action, resource). IAM policies (user-level) -> bucket policies (resource-level) -> ACLs (legacy). Explicit DENY wins, then ALLOW, default DENY. In-memory policy cache on each API Router, refreshed every 5 minutes.
Encryption: Three-level key hierarchy: Master Key (HSM, annual rotation) -> Bucket Key (cached 24h, reduces KMS calls from 50K/sec to 1/bucket/day) -> Object Data Key (unique per object, AES-256-GCM). SSE-Default (service-managed), SSE-KMS (customer-managed, audited, revocable), SSE-C (customer-provided per request, not stored).
Transport: TLS 1.3 client-to-edge, mTLS service-to-service. HMAC-SHA256 auth with 15-minute replay window. CRC32C per shard, ETag (MD5) end-to-end.
Access controls: Block Public Access ON by default. Object Lock (WORM): Governance mode (override with IAM) and Compliance mode (no deletion until retention expires). MFA Delete for permanent deletion.
Audit logging: Every API call logged with principal, action, resource, source IP, timestamp, request-id, response status. Immutable, retained for compliance, queryable for forensic analysis.
Network segmentation: Storage nodes on isolated network (no internet). API Routers in DMZ. Each component has minimum required permissions.
32. Beyond This Design: Real-World Evolution
This design is the right starting point for an interview and for initial production at scale. Real systems evolve further as constraints tighten.
Metadata becomes the dominant bottleneck. At 100T+ objects, the metadata layer consumes more engineering effort than the storage layer. The 136 PB metadata footprint requires aggressive tiering: hot metadata on NVMe, warm on SSD, cold on HDD with prefetch. LSM-tree compaction at this scale is a continuous background cost — write amplification of 10-30x means every logical write triggers 10-30 physical writes during compaction. Rate-limited leveled compaction and dedicated NVMe are not optional.
Joins disappear entirely. The data model contains no joins. Every lookup is a single key access or a range scan. The shard_map is denormalized into object metadata specifically to avoid any cross-index lookup during the read path. Production systems take this further: pre-computed aggregations for billing, pre-built indexes for lifecycle scanning, materialized views for common LIST patterns.
Hot partition splitting becomes automatic. A single viral bucket can overwhelm its metadata partition. Production systems detect hot partitions by monitoring ops/sec per partition and automatically split them, redistributing load. The split must be transparent to clients — no downtime, no visible key range changes. This requires the partition map to be versioned and propagated to all API Routers within seconds.
Background work dominates I/O. In a steady-state cluster, client traffic may use only 30-40% of disk I/O. The rest is consumed by compaction (LSM-tree maintenance), scrubbing (integrity verification), GC (dead shard reclamation), repair (shard reconstruction), and rebalancing (capacity redistribution). Scheduling and prioritizing these background tasks is a major operational concern. Starvation of any one (especially repair) directly threatens durability.
Cross-AZ network cost exceeds storage cost. At exabyte scale, cross-AZ data transfer is often the largest single line item. Placing the majority of data shards (7 of 10 data shards) in the local AZ reduces cross-AZ traffic by 70%. Parity shards, which are written once and rarely read, can tolerate being remote. This placement bias must be balanced against failure domain requirements.
Small object handling diverges from large object handling. The size-based split (replication for small, EC for large) introduces two distinct code paths with different performance characteristics, failure modes, and operational procedures. Some systems unify this by always using replication below a threshold and transparently converting to EC during background compaction when objects are sealed into extents.
Cluster maps must propagate fast and consistently. CRUSH depends on every component having the same view of the cluster topology. A stale cluster map causes writes to land on wrong nodes and reads to miss. Production systems version the cluster map and include the version in every RPC. If a recipient has an older version, it fetches the update before processing. This propagation must complete in seconds, not minutes.
Repair speed is the primary durability lever. The probability of data loss scales exponentially with repair time. Halving repair time from 40 hours to 20 hours reduces the probability of a second failure hitting the same PG by orders of magnitude. Investment in faster repair (more parallel workers, dedicated repair bandwidth, prioritized I/O scheduling) yields a higher durability return than adding parity shards.
33. Explore the Technologies
Core Technologies
| Technology | Role |
|---|---|
| FoundationDB | Metadata store: ordered KV, strong consistency |
| RocksDB | LSM-tree engine for metadata nodes |
| TiKV | Alternative metadata store: ordered KV, Raft-based |
| CockroachDB | Alternative metadata store: distributed SQL, Raft |
| gRPC | Internal RPC with protobuf |
| etcd | Placement coordination, cluster membership |
| Prometheus + Grafana | Metrics and dashboards |
| Kafka | Event notifications |
Distributed Systems Concepts
| Concept | Role in This Design |
|---|---|
| Raft Consensus | 1.36M Raft groups for strongly consistent metadata |
| LSM Trees vs B-Trees | LSM-tree metadata storage: fast writes, bloom-filter reads |
| Bloom Filters | Extent index: skip irrelevant extents during shard lookup |
| Write-Ahead Log | LSM-tree WAL for crash recovery on metadata nodes |
| Consistent Hashing | Foundation for CRUSH placement algorithm |
| SSTable Compaction | Background LSM compaction to bound read amplification |
| Merkle Trees | Integrity verification for shard data |
Infrastructure Patterns
| Pattern | Relevance to This Design |
|---|---|
| Object Storage | The system being designed |
| Replication & Consistency | Raft for metadata, EC for data |
| Database Sharding | 1.36M metadata partitions |
| Hot/Warm/Cold Tiering | Storage class transitions |
| CDN & Edge | CDN-signed URLs, OAC, WAF |
| Distributed Tracing | End-to-end request tracing |
| Erasure Coding | Reed-Solomon math and shard recovery |
| CRUSH Placement | Topology-aware shard placement (PG -> nodes) |
Further Reading
- Reed-Solomon Error Correction (Plank, 2013) -- Definitive RS tutorial with worked examples
- CRUSH: Controlled, Scalable, Decentralized Placement (Weil et al., SC 2006) -- The topology-aware placement algorithm
- Finding a Needle in Haystack (OSDI 2010) -- Small object packing at scale
- ISA-L: Intel Storage Acceleration Library -- Hardware-accelerated erasure coding with SIMD
- Windows Azure Storage (Calder et al., SOSP 2011) -- Blob storage architecture with erasure coding
- Dynamo: Highly Available Key-Value Store (DeCandia et al., SOSP 2007) -- Consistent hashing and quorum concepts
- The Google File System (Ghemawat et al., SOSP 2003) -- Chunk-based distributed storage
Reed-Solomon (10+4) targets 11 nines at 1.4x overhead. CDN serves reads via CDN-signed URLs and OAC, not storage presigned URLs. The control plane (1.36M Raft groups) runs separately from the data plane (stateless Data Router fleet, 19,500 storage nodes). Immutability removes locking, simplifies caching, and eliminates replication conflicts. Storage presigned URLs go direct to storage for uploads. File-level edits live in an application-layer manifest service on top. The metadata layer is the hardest part of the design, and the part that matters most.