System Design: Object Storage (Erasure Coding, Flat Namespace, and Exabyte Scale)
Goal: Build a distributed object storage platform storing 100+ trillion objects across exabytes of raw data. Support PutObject, GetObject, DeleteObject, ListObjectsV2, multipart uploads up to 5TB, versioning, storage class transitions, and presigned URLs. Target 11 nines of durability (99.999999999%) using Reed-Solomon erasure coding (10+4) combined with fast repair, multi-AZ isolation, and failure independence. Strong read-after-write consistency. Flat namespace. CDN-accelerated reads. Clear control plane / data plane separation.
Important: Object storage is immutable by design. There is no PATCH, no partial write, no append. PUT always creates a new version. This removes the need for distributed locking, conflict resolution, and crash recovery from partial writes. Applications that need file-level edits build a manifest-based chunking layer on top (Section 12).
Reading guide:
- Sections 1-4: Problem, requirements, design principles
- Section 5: Technology selection: erasure coding, placement, metadata, CDN, control/data plane
- Section 6: Architecture diagram with numbered request flows
- Section 7: Back-of-envelope math
- Sections 8-9: Data model and API design
- Section 10: Deep dives: end-to-end walkthrough, PUT/GET, presigned URLs, multipart, extent files, GC, rebalancing
- Section 11: What breaks first at scale
- Section 12: Partial update layer (application-level, not built into storage)
- Sections 13-17: Bottlenecks, failures, deployment, observability, security
Mental model (read this first):
- Metadata Service: object key → placement group (PG) ID + version + ETag. Knows what exists, not where it lives.
- CRUSH: PG ID → 14 storage nodes. Computed locally by any component using the cluster map. No central lookup.
- Storage Nodes: store erasure-coded shards in extent files on disk. An object is split into shards (EC concept: 10 data + 4 parity). Each shard is then appended into a large extent file (storage concept: a 256MB container holding many shards sequentially). Extents are just files on disk. Shards are just slices of objects.
- Cluster Map: shared topology tree (AZ → rack → node → disk with weights). Updated by control plane, cached everywhere. What makes CRUSH deterministic.
- Immutability: PUT creates new shards. Old shards untouched until GC. No locking, no conflicts.
Source of truth for placement: The PG ID stored in metadata is the authoritative reference. CRUSH + cluster map derive node locations from that PG ID. The shard_map field in metadata is a cached optimization for faster reads and may be stale or recomputed.
System flow in 5 steps:
- PUT arrives → metadata assigns a PG ID
- CRUSH computes which 14 nodes hold this PG's shards
- Object is erasure-coded into 14 shards
- Shards are appended into extent files on those storage nodes
- Metadata commits via Raft → object is visible
TL;DR: Reed-Solomon (10+4) splits each object into 10 data + 4 parity shards, so each object is typically placed across 14 storage nodes in different failure domains (racks, AZs) via CRUSH policies. Combined with fast repair and multi-AZ isolation, this targets 11 nines of durability at 1.4x storage overhead, saving $32M/month/EB vs triple replication. The architecture separates control plane from data plane. The control plane manages roughly 136 PB of metadata (500B-1KB per object depending on features; ~700 bytes average x 100T objects), split into 1.36M Raft partitions at 100GB each for strong consistency. The data plane is 19,500 storage nodes (14 EB physical storage at 720TB per node, 36 drives each). CRUSH deterministically maps each object to its 14 nodes using the cluster topology. No per-object lookup needed, but CRUSH depends on a globally consistent cluster map that's distributed and cached by all components. Small objects pack into 256MB extent files. Objects are immutable. Flat namespace with sorted LSM-tree keys gives near-constant-time lookups and efficient prefix scans. For CDN-served private content, use CDN-signed URLs with Origin Access Control (not storage presigned URLs). For file-level edits, build a manifest service (application-layer) on top that tracks chunk-to-file mappings and atomically swaps versions.
1. Problem Statement
Problem 1: Durability at 11 nines. A typical data center runs around 100K drives. At a 2% annual failure rate, roughly 5.5 drives fail daily. Failures are constant, not rare. Storage costs about $0.02/GB/month (around $20M per EB). 3x replication stores 3 EB per EB of logical data, costing around $60M/month. Reed-Solomon (10+4) stores 1.4 EB, costing around $28M/month. That saves roughly $32M/month and tolerates up to 4 shard failures instead of 2.
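The arithmetic above can be checked directly. A minimal sketch, using only the assumptions stated in the text (100K drives, 2% annual failure rate, $0.02/GB/month, 1 EB = 10^9 GB):

```python
# Back-of-envelope check for the failure-rate and cost claims above.
DRIVES = 100_000
AFR = 0.02                              # 2% annual failure rate
failures_per_day = DRIVES * AFR / 365
print(f"{failures_per_day:.1f} drives fail per day")      # → 5.5

GB_PER_EB = 1e9
COST_PER_GB_MONTH = 0.02

def monthly_cost(overhead):
    """Monthly cost for 1 EB logical data at a given storage overhead."""
    return GB_PER_EB * overhead * COST_PER_GB_MONTH

replication = monthly_cost(3.0)         # 3x replication
erasure = monthly_cost(1.4)             # RS(10+4): 14 shards / 10 data
print(f"3x replication: ${replication / 1e6:.0f}M/month")  # → $60M/month
print(f"RS(10+4):       ${erasure / 1e6:.0f}M/month")      # → $28M/month
print(f"savings:        ${(replication - erasure) / 1e6:.0f}M/month")  # → $32M/month
```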
Problem 2: Flat namespace at trillion scale. No directories (avoids metadata hotspots and distributed locking). The key "photos/2024/vacation/beach.jpg" is opaque. LIST with prefix=photos/2024/ and delimiter=/ simulates directories via efficient range scans over sorted, partitioned keys.
Problem 3: Strong consistency. PUT then GET always returns the latest version. Requires Raft consensus at every metadata write.
Problem 4: Partial updates on immutable blobs. There's no PATCH, no append. Editing 10MB in a 1GB file means re-uploading the whole thing. The workaround is a manifest-based chunking layer on top (Section 12). This is an application concern, not a storage concern.
Scale: 100T+ objects, exabytes of data, 350K req/sec peak, 5TB max object size.
Common anti-patterns at scale:
- Blind 3x replication. Costs 3x storage (around $60M/month per EB) with weaker fault tolerance. Use erasure coding (RS 10+4) for better cost and durability.
- Single metadata node. Cannot index or serve 100T objects. Use partitioned, distributed metadata with consensus (Raft).
- One file per object on a filesystem. Filesystems like ext4 allocate a fixed number of inodes (small metadata records that track each file). In practice, a filesystem has tens to hundreds of millions of inodes depending on how it's formatted. At large scale, inodes run out before disk space does. On top of that, very large directories degrade performance due to metadata overhead and lookup costs. The fix: pack many small objects into large append-only segment files (called extents, typically 256MB each). A separate index (persisted on disk and cached in memory) maps each object to its offset and size within these segments. Result: fast sequential writes and efficient reads via a single lookup plus a direct seek. See Section 10.6 for the full extent file design.
- Routing all data through API servers. The API fleet becomes a bandwidth bottleneck. Use presigned URLs for direct client-to-storage transfers.
- Eventual consistency via polling. Leads to stale reads and unpredictable behavior. Use strong consistency (Raft) for metadata.
- Naive hashing without placement groups. Adding a single node reshuffles data across the entire cluster. Use placement groups to limit rebalancing scope.
- No garbage collection. Leaked and orphaned data accumulates to petabytes over months. Run background GC for cleanup.
- Synchronous writes to all AZs. High latency, poor availability under partial failure. Write to a quorum, repair asynchronously.
- Treating all objects the same. Inefficient for both ends of the size spectrum. Small objects need packing into extents. Large objects need multipart upload.
- Ignoring storage tiering. Cold data sitting on SSD wastes money. Use tiered storage (hot/warm/cold) with lifecycle policies.
- Ignoring CDN and edge caching. All reads hitting origin means high bandwidth cost. CDN offloads ~60-80% of read traffic.
The metadata layer is the hardest part of this design. 100T objects with strong consistency across 1.36M Raft partitions, supporting both point lookups and range scans. Get it right and the rest of the system is manageable. Get it wrong and no amount of storage-layer engineering compensates.
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-1 | PutObject: upload objects up to 5TB | P0 |
| FR-2 | GetObject: retrieve objects with byte-range reads | P0 |
| FR-3 | DeleteObject: soft delete (versioning) or hard delete (version-id) | P0 |
| FR-4 | ListObjectsV2: prefix + delimiter + pagination over trillions of objects | P0 |
| FR-5 | Multipart upload: create, upload-part (up to 10,000 parts), complete, abort | P0 |
| FR-6 | Presigned URLs: time-limited signed URLs for direct client-to-storage transfers | P0 |
| FR-7 | Bucket versioning: every PUT creates a new version, recoverable deletes | P1 |
| FR-8 | Storage class transitions: Standard, IA, Glacier Instant/Flexible, Deep Archive, Intelligent-Tiering | P1 |
| FR-9 | Lifecycle policies: automatic transition and expiration | P1 |
| FR-10 | CopyObject: server-side copy without re-upload | P1 |
| FR-11 | Object tagging for lifecycle rules and access policies | P1 |
| FR-12 | Bucket policies and IAM: resource-level and user-level access control | P0 |
| FR-13 | Server-side encryption: SSE-Default, SSE-KMS, SSE-C | P0 |
| FR-14 | Event notifications on object create/delete/transition | P1 |
| FR-15 | Cross-region replication: event-driven async replication with version-based ordering, eventual consistency to target region, automatic retry on failure | P2 |
| FR-16 | CDN integration: edge caching via CDN-signed URLs, WAF/DDoS protection | P1 |
| FR-17 | Automatic rebalancing and repair on node addition/removal/failure | P0 |
3. Non-Functional Requirements
| Property | Target |
|---|---|
| Durability | 99.999999999% (11 nines) |
| Availability | 99.99% (52.6 min downtime/year) |
| PUT latency (p99, <1MB) | < 200ms |
| GET latency (p99, first byte, <1MB) | < 100ms |
| LIST latency (p99, 1000 results) | < 500ms |
| Total objects | 100T+ |
| Total logical storage | Exabytes |
| Maximum object size | 5TB |
| Peak request rate | 350K/sec |
| Storage overhead | < 1.5x (erasure coding) |
| Consistency | Strong read-after-write |
| CDN cache hit ratio | >= 60% for read-heavy workloads |
4. Design Principles
1. Objects are immutable. PUT never modifies in place. This removes the need for distributed locking, makes caching straightforward (ETags never go stale), and eliminates conflict resolution from replication. If two clients upload the same key simultaneously, they each create independent shards on independent nodes. There's nothing shared to lock. Raft orders the metadata commits to decide which version is "latest", but the data writes never conflict.
2. Separate control plane from data plane. Metadata (Raft consensus) and data (quorum writes, EC) need to scale independently. Mixing them creates a bottleneck that caps both.
3. No central bottleneck. Data Router is a stateless fleet. Placement uses CRUSH (deterministic, no central authority). Metadata is partitioned across 1.36M Raft groups.
4. Push data to the edge. CDN caches content via CDN-signed URLs and Origin Access Control. Storage presigned URLs transfer bytes directly to/from storage nodes for uploads. API servers handle auth and metadata, not terabits of data.
5. Failure is routine. At 5.5 drive failures per day, failure is not an exception. Write with quorum (11/14, so up to 3 slow nodes don't block), read with speculative parity fetches (request a backup shard before the primary times out, use whichever arrives first), repair in the background. One slow node should never block a user request.
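The speculative-read idea from principle 5 can be sketched as a hedged request race: fire the primary fetch, and only if it misses a hedging deadline, fire a backup and take whichever answers first. This is a minimal sketch; `fetch` is a stand-in for the shard RPC, and the 50ms hedge delay is an assumed tuning value, not from the text.

```python
import concurrent.futures
import time

HEDGE_DELAY = 0.05  # seconds to wait before firing the backup request (assumption)

def fetch(node, delay):
    """Stand-in for a shard fetch RPC; delay simulates a slow node."""
    time.sleep(delay)
    return f"shard-from-{node}"

def speculative_get(primary, backup, primary_delay, backup_delay):
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(fetch, primary, primary_delay)
        done, _ = concurrent.futures.wait([first], timeout=HEDGE_DELAY)
        if done:
            return first.result()               # primary answered in time
        second = pool.submit(fetch, backup, backup_delay)   # hedge: race a backup shard
        done, _ = concurrent.futures.wait(
            [first, second], return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()              # whichever arrived first wins

print(speculative_get("node-a", "node-b", primary_delay=0.5, backup_delay=0.01))
# → shard-from-node-b (slow primary, so the backup wins the race)
```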
6. Flat namespace is an advantage. No directory tree = no distributed lock coordination for renames. Sorted keys in LSM-trees give near-constant-time point lookups (O(log N) technically, but with bloom filters and block caches it behaves like O(1) in practice) and efficient range scans.
5. High-Level Approach & Technology Selection
5.1 Object Storage is Not a Filesystem
No directory tree, no inodes, no POSIX semantics, no rename(), no partial writes, no file locking. An object is a key, a value, and some headers. That simplicity is deliberate. Every constraint removed from the API is a constraint removed from the storage engine.
5.2 Immutability: The Core Design Decision
PUT creates a new version with entirely new shards. The old shards sit there until GC cleans them up. Because nothing is ever modified in place, there are no partial shard updates, no distributed locks, no torn writes. ETags stay valid forever. Replication is just copying bytes with no conflict resolution.
All metadata writes go through the Raft leader. Metadata reads are typically served from the leader or follower replicas with consistency guarantees (leader reads for strict consistency, follower reads with lease-based staleness bounds for lower latency). Leader reads cost ~2-5ms, but PUT then GET just works. No more workarounds for eventual consistency.
What "immutable" means practically: There is no PATCH endpoint. No append operation. No in-place modification. Uploading report.pdf and then uploading a new version creates entirely new shards on potentially different nodes. The old version's shards are untouched until GC reclaims them (or kept forever if versioning is enabled). This is a deliberate trade-off: simpler storage internals at the cost of re-uploading entire objects for any change. Section 12 covers how to build partial updates on top of this.
5.3 Erasure Coding vs Replication
3x Replication: 10MB object → 30MB stored (3.0x), survives 2 failures. 10+4 Reed-Solomon: 10MB → 10 data shards (1MB each) + 4 parity shards = 14MB stored (1.4x), survives ANY 4 failures.
| 3x Replication | 10+4 Erasure Coding | |
|---|---|---|
| Physical storage (1 EB logical) | 3 EB | 1.4 EB |
| Monthly cost ($0.02/GB) | $60M | $28M |
| Fault tolerance | 2 failures | 4 failures |
Erasure coding wins on both cost and durability. The trade-off? Write latency goes up (you compute parity before writing) and reconstruction is more expensive (read 10 shards and do the math). But ISA-L does ~10 GB/sec encoding per core. In most deployments, the compute overhead is small compared to disk and network I/O.
Encoding is also pipelined. The Data Router processes 1MB stripes as they arrive over the network, encoding and shipping shards in parallel with receiving the next stripe. It never buffers the full object before starting. The encoding CPU typically shows up as a rounding error in flame graphs compared to fsync latency and cross-AZ network time.
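The stripe pipelining described above can be sketched as a generator that consumes the upload one stripe at a time and encodes each stripe before the next arrives. A single XOR parity shard stands in for the 4 independent Reed-Solomon parity shards a real encoder (e.g. ISA-L) would compute; the streaming structure and per-stripe shard split are the point here, not the coding math.

```python
from functools import reduce
import io

STRIPE = 1 << 20                       # 1MB stripe
DATA_SHARDS = 10
SHARD = -(-STRIPE // DATA_SHARDS)      # per-shard size, rounded up

def stripes(stream):
    """Yield 1MB stripes as they arrive; never buffers the whole object."""
    while chunk := stream.read(STRIPE):
        yield chunk

def encode_stripe(stripe):
    stripe = stripe.ljust(SHARD * DATA_SHARDS, b"\0")   # pad the final short stripe
    data = [stripe[i * SHARD:(i + 1) * SHARD] for i in range(DATA_SHARDS)]
    # XOR parity as a stand-in for the 4 Reed-Solomon parity shards
    parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), data)
    return data, parity

obj = io.BytesIO(b"x" * (3 * STRIPE + 100))   # a 3MB + 100B object
for n, s in enumerate(stripes(obj)):
    data, parity = encode_stripe(s)           # real system: ship shards to 14 nodes here
print(f"encoded {n + 1} stripes")             # → encoded 4 stripes

# Any single lost data shard is recoverable: XOR parity with the other nine.
recovered = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), [parity] + data[1:])
assert recovered == data[0]
```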
For the math behind Reed-Solomon (generator matrices, Galois field arithmetic, reconstruction), see the Erasure Coding deep dive. Section 10.1 shows how it works with a concrete end-to-end example.
5.4 Object Placement: Placement Groups + CRUSH
pg_id = hash(bucket_id + "/" + object_key) % 100,000
Each object maps to a placement group (PG), and each PG maps to a fixed set of storage nodes via CRUSH (Controlled Replication Under Scalable Hashing). CRUSH is a deterministic, topology-aware placement algorithm that selects nodes based on cluster layout (racks, AZs) and capacity weights, ensuring shards are distributed across failure domains.
object → PG → CRUSH → nodes
Example:
pg_id = 42
CRUSH(pg_id) → [Node A, Node F, Node K, ...]
These nodes store all 14 shards for objects in that PG.
Each PG maps to 14 nodes, meaning all shards of objects in that PG are consistently placed on the same node set. This avoids per-object placement decisions and keeps data layout predictable.
The placement map is computed locally (no central lookup) and cached by all components. It changes only when cluster topology changes (node join, leave, or failure), minimizing coordination overhead.
Why PGs over direct consistent hashing? Managing placement per object does not scale at 100T objects. By grouping objects into ~100K PGs (a fixed count, independent of object count), the system reasons about groups instead of individual objects. When a node is added or removed, only a small subset of PGs are reassigned, triggering controlled, bulk data movement between specific nodes rather than a full reshuffle. This makes rebalancing predictable, efficient, and operationally manageable.
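The object → PG → nodes pipeline can be sketched with a rendezvous-hash stand-in for CRUSH. Real CRUSH walks a weighted bucket hierarchy (straw2 buckets, capacity weights); this sketch only demonstrates the properties the text relies on: any component holding the cluster map computes the identical 14-node set with no central lookup, and shards land in distinct failure domains.

```python
import hashlib

def pg_for(bucket_id, object_key, num_pgs=100_000):
    """pg_id = hash(bucket_id + "/" + object_key) % 100,000"""
    h = hashlib.md5(f"{bucket_id}/{object_key}".encode()).digest()
    return int.from_bytes(h[:8], "big") % num_pgs

def crush(pg_id, cluster_map, shards=14):
    """Deterministic node selection: score every node by hash(pg_id, node),
    then prefer one node per rack so shards spread across failure domains."""
    scored = sorted(
        cluster_map,
        key=lambda rn: hashlib.md5(f"{pg_id}:{rn[1]}".encode()).digest(),
        reverse=True)
    chosen, used_racks = [], set()
    for rack, node in scored:              # pass 1: distinct racks first
        if rack not in used_racks:
            chosen.append(node)
            used_racks.add(rack)
    for rack, node in scored:              # pass 2: fill any remainder
        if node not in chosen:
            chosen.append(node)
    return chosen[:shards]

# Toy cluster map: 14 racks x 4 nodes (topology shape is illustrative)
cluster_map = [(f"rack{r}", f"node{r}-{n}") for r in range(14) for n in range(4)]
pg = pg_for("bkt1", "photos/2024/vacation/beach.jpg")
nodes = crush(pg, cluster_map)
assert nodes == crush(pg, cluster_map)     # deterministic: no central lookup needed
```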
5.5 Metadata Architecture
Each object needs roughly 500-1KB of metadata depending on features enabled (versioning, ACLs, custom metadata). A typical breakdown: key (~200B), version ID (16B), size/etag/content-type (~80B), ACL/timestamps (~60B), user metadata (~100B), and a shard map with 14 entries (~225B), totaling around 700 bytes on average. Multiply by 100T objects and you get roughly 70 PB of raw metadata. With LSM-tree overhead (bloom filters, block indexes, write-ahead logs), that doubles to ~136 PB.
How to partition it: ~1.36M Raft partitions at 100GB each (small enough for fast Raft snapshots, large enough to keep the total partition count manageable). 3 replicas per partition (standard Raft quorum, where any 2 of 3 form a majority for consensus). Spread across ~13,600 metadata nodes, each hosting ~100 partitions.
Key format: /{bucket_id}/{object_key}\0{version_id}. Prefixing with bucket_id puts all objects in the same bucket in a contiguous key range, so a LIST with a prefix just scans forward from that point. The \0 null byte cleanly separates the object key from the version. The version_id uses bit-flipped timestamps (XOR each byte with 0xFF) so that newer versions produce smaller byte values and sort first in a forward scan. This means GET without a version_id just reads the first entry it finds.
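The key layout and bit-flipped version ordering can be sketched directly (field widths here are illustrative; the text doesn't pin down the timestamp encoding, so an 8-byte big-endian millisecond timestamp is assumed):

```python
import struct

def version_id(ts_ms):
    raw = struct.pack(">Q", ts_ms)          # big-endian 8-byte timestamp
    return bytes(b ^ 0xFF for b in raw)     # bit-flip: newer ts → smaller bytes

def meta_key(bucket_id, object_key, ts_ms):
    # /{bucket_id}/{object_key}\0{version_id}: the \0 separates key from version
    return f"/{bucket_id}/{object_key}".encode() + b"\0" + version_id(ts_ms)

k_old = meta_key("bkt1", "report.pdf", ts_ms=1_700_000_000_000)
k_new = meta_key("bkt1", "report.pdf", ts_ms=1_700_000_999_999)

assert k_new < k_old   # newest version sorts first in a forward scan,
                       # so GET without a version_id reads the first entry
```

A LIST with prefix `photos/` becomes a forward scan starting at `/bkt1/photos/`, since the bucket prefix keeps the bucket's keys contiguous.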
What metadata stores vs what CRUSH computes: Metadata stores the PG ID for each object, not node locations. Node placement is derived at read/write time by running CRUSH against the cluster map. The shard_map field in the protobuf schema (Section 8) is an optimization: cached shard locations for faster reads. It can always be recomputed from PG ID + CRUSH + cluster map.
Requirements: <10ms point lookup, efficient prefix range scan, versioned entries, strong consistency via Raft per partition.
| Metadata Store Option | Consistency | Range Scans | Battle-tested |
|---|---|---|---|
| FoundationDB | Strong (Paxos) | Yes | Apple (iCloud) |
| TiKV | Strong (Raft) | Yes | PingCAP |
| CockroachDB KV | Strong (Raft) | Yes | Cockroach Labs |
5.6 Storage Classes
| Class | Latency | $/TB/mo | Retrieval $/GB | Use Case |
|---|---|---|---|---|
| Standard | ms | $23.00 | $0.00 | Active data |
| Infrequent Access | ms | $12.50 | $0.01 | Backups |
| Glacier Instant | ms | $4.00 | $0.03 | Archives needing instant access |
| Glacier Flexible | min-hrs | $3.60 | $0.01-0.03 | Compliance archives |
| Deep Archive | 12-48h | $0.99 | $0.02 | Legal/regulatory (tape robots) |
| Intelligent-Tiering | ms | auto | $0.00 | Unknown access patterns |
Standard to IA is a billing change (data stays put). Glacier uses denser HDD arrays with higher EC ratios like 16+4 (1.25x overhead instead of 1.4x, since access is rare). Deep Archive uses extremely dense, low-cost storage, often tape-backed systems where retrieval takes 12-48 hours because physical media must be loaded by robotic arms.
5.7 CDN and Edge Layer
At 250K GETs/sec x 100KB = 25 GB/sec egress. CDN offloads 60-80%, reducing origin to ~7.5 GB/sec.
Objects are immutable, so CDN caching is safe by design. But storage presigned URLs are NOT CDN-friendly: each URL has a unique signature, so CDN sees every request as a different cache key (always a miss). For private content through CDN, use CDN-signed URLs with Origin Access Control. Full details in the presigned URL deep dive (Section 10.4).
WAF at CDN edge: per-IP/bucket/account rate limiting, geo-restrictions, enumeration attack detection.
5.8 Control Plane vs Data Plane
| Control Plane | Data Plane | |
|---|---|---|
| Does | Decisions: placement, versioning, access | Bytes: encode, transfer, store |
| Components | Metadata Service, Raft, CRUSH, IAM, Lifecycle/GC | Data Router Fleet (stateless), Storage Nodes |
| Consistency | Strong (Raft) | Quorum (11/14 shards) |
Data Router is a logical role, not necessarily a separate fleet. It handles erasure coding and shard distribution. In the API path, it runs as a separate stateless fleet (1,500 instances, ~33 PUTs/sec each). In the presigned path, the storage node takes on this role directly. Some deployments embed it in storage nodes entirely and skip the separate fleet. The role is the same regardless of where the code runs: receive bytes, encode, distribute shards, coordinate the metadata commit.
5.9 Technology Selection
| Component | Technology | Why |
|---|---|---|
| CDN / Edge | CloudFront / Fastly | CDN-signed URLs, OAC, 200+ PoPs, terabit DDoS |
| Metadata store | Custom LSM-tree + Raft | Ordered KV, range scans, strong consistency |
| Object data | Raw block, extent-based | No filesystem overhead, direct I/O |
| Erasure coding | Reed-Solomon, ISA-L | AVX-512 accelerated, ~10 GB/sec/core |
| Placement | CRUSH-derived | Deterministic, no central authority |
| Internal RPC | gRPC + protobuf | Low latency, bidirectional streaming |
| IAM / Policy | Custom evaluator | Sub-ms evaluation, in-memory cache |
| KMS | HSM-backed | Three-level key hierarchy, bucket key caching |
| Monitoring | Prometheus + Grafana + Jaeger | Standard stack |
| Events | Kafka | Durable event stream |
5.10 Build vs Use
Before building from scratch, it's worth evaluating existing distributed object storage systems. They already implement erasure coding, placement, metadata scaling, and the S3 API. The question is where they hit their limits.
MinIO is lightweight, S3-compatible, and designed for private cloud. Single Go binary, Reed-Solomon per object, distributes data across server pools. Operationally simple. Architected for petabyte scale, not exabyte. At very large scale, the lack of CRUSH-style topology-aware placement becomes a limiting factor.
Ceph (RADOS Gateway) is battle-tested at massive scale. CERN runs it for physics data at exabyte scale. DigitalOcean built their object storage on it. Ceph uses CRUSH for deterministic placement, supports multiple erasure coding schemes. The downside is operational complexity. Ceph clusters need dedicated ops teams who understand PG states, recovery priorities, and OSD tuning.
SeaweedFS is optimized for simplicity and small file performance. Central master for metadata and volume management, Reed-Solomon for data. Fast to set up, good for smaller deployments. The central master becomes a bottleneck at large scale.
| Property | MinIO | Ceph (RADOS Gateway) | SeaweedFS |
|---|---|---|---|
| API compatibility | Near-complete S3 | Near-complete S3 | Partial S3 |
| Erasure coding | RS per object | Multiple schemes (RS, LRC) | RS per volume |
| Placement | Server pools, distributed | CRUSH (topology-aware) | Central master assigns volumes |
| Operational complexity | Low | High | Low |
| Battle-tested scale | PB range | EB range (CERN, DigitalOcean) | TB to low PB range |
| Your scale | Recommendation |
|---|---|
| < 100TB | MinIO. Simple, fast to deploy, good S3 compatibility. |
| 100TB - 10PB | MinIO or Ceph. MinIO for simplicity, Ceph for fine-grained placement control. |
| 10PB - 1EB | Ceph with a dedicated ops team, or custom with sufficient engineering depth. |
| > 1EB | Custom. At this scale, controlling every trade-off matters, and the team size (50-100 engineers) justifies the investment. |
6. High-Level Architecture
Three parallel paths exist. Downloads go through CDN. Uploads go through the API path or presigned URLs. All three paths converge at the same storage layer with identical placement, erasure coding, and durability guarantees. Only the access pattern differs. The storage is the same.
6.1 CDN Download Path
Read-heavy, cached, no API involvement. CDN handles auth and caching at the edge.
For reads with CDN-signed URLs. CDN talks directly to storage origin. No API Router, no Metadata Service, no Data Router in this path.
How Path 1 works step by step:
The client never talks to the storage system directly. It talks to the CDN.
- Your backend generates a CDN-signed URL (using a CDN key pair, completely separate from storage credentials) and gives it to the client.
- Client sends the request to the CDN edge (e.g., https://d1234.cloudfront.net/photos/beach.jpg?Signature=...&Expires=...).
- CDN edge validates the signature using the public key, checks expiry, and enforces access policies (IP restrictions, geo-blocks, header requirements). If invalid or expired: 403, no origin hit.
- Cache hit: CDN already has this object cached from a previous request. Returns it immediately. This is the fast path and handles 60-80% of reads.
- Cache miss: CDN doesn't have it cached. CDN fetches from the storage origin using its Origin Access Control (OAC) credential. The storage bucket is private and blocks all direct access. Only the CDN's service identity is allowed to read from it.
- The storage origin has an internal gateway that calls the metadata service for the object lookup (PG ID, shard locations), then calls the data layer to fetch the 10 data shards in parallel from the appropriate storage nodes (determined by CRUSH). If any shards are missing (degraded read), it fetches parity shards and reconstructs via Reed-Solomon. The assembled object bytes are returned to CDN.
- CDN receives the bytes, caches them (based on Cache-Control headers), and serves to the client.
Notice what's NOT in this path: no API Router, no IAM Engine, no Data Router. The CDN handles auth (its own signature). The storage origin handles internal routing. The CDN sees a single HTTP endpoint and gets back a blob.
6.2 API Upload Path
Full controlled path. Every component involved: auth, metadata, encoding, storage, consensus.
How Path 2 works step by step:
This is the controlled path where every component is involved.
- Client sends PUT /bucket/key with the object bytes and an HMAC-SHA256 signature in the Authorization header.
- Request passes through WAF (malicious request filtering, rate limiting) and the L7 load balancer (distributes across API Router instances).
- API Router validates the HMAC signature (recomputes it from the request and the client's secret key, compares). Then calls the IAM Engine to check if this user has PutObject permission on this bucket. If either fails: 403.
- API Router asks the Metadata Service for the placement group: "where should this object's shards go?" Metadata Service hashes the bucket+key to a PG ID, looks up the CRUSH map, and returns the list of 14 storage nodes.
- API Router streams the object bytes to a Data Router instance (stateless, any instance works).
- Data Router processes the data in 1MB stripes. For each stripe: split into 10 data shards, compute 4 parity shards using Reed-Solomon encoding. Pipelining means it doesn't wait for the full object before starting.
- Data Router sends all 14 shards to the designated storage nodes in parallel. Each storage node appends the shard to an extent file, computes CRC32C checksum, fsyncs to disk, and ACKs.
- Once a quorum of shard writes succeed (at least 11 of 14), the Data Router tells the Metadata Service to commit. The Metadata Service writes the object metadata (key, ETag, shard locations) through Raft consensus. 2 of 3 Raft replicas must confirm.
- Only after both the shard quorum AND the Raft commit succeed does the client get a 200 OK with the ETag and version-id. If shards wrote successfully but the Raft commit fails (e.g., leader election), those shards become orphans. GC cleans them up. Client gets an error and retries.
CDN is not involved in uploads. It would add latency and caching is useless for writes.
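The commit rule from steps 7-9 reduces to a two-gate check: a shard-write quorum AND the Raft metadata commit must both succeed before the client sees 200 OK. A minimal sketch of that decision logic (the return strings are illustrative):

```python
WRITE_QUORUM = 11   # at least 11 of 14 shard writes must ACK

def put_object(shard_acks, raft_committed):
    """shard_acks: 14 booleans, one per storage node ACK.
    raft_committed: whether the metadata commit reached Raft majority (2 of 3)."""
    if sum(shard_acks) < WRITE_QUORUM:
        return "error: shard quorum not met, client retries"
    if not raft_committed:
        # Shards landed but metadata didn't: they become orphans until GC.
        return "error: metadata commit failed, shards orphaned until GC"
    return "200 OK"

assert put_object([True] * 14, True) == "200 OK"
assert put_object([True] * 11 + [False] * 3, True) == "200 OK"   # 3 slow nodes tolerated
assert put_object([True] * 10 + [False] * 4, True).startswith("error")
assert put_object([True] * 14, False).startswith("error")        # orphans → GC
```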
6.3 Presigned Upload Path
Direct path for large uploads. Storage node handles everything. Zero bytes through API fleet.
Bypasses CDN, WAF, LB, API Router, and Data Router. The storage node does all the work:
How Path 3 works step by step:
This is the fast path for large uploads. A 5TB video upload sends zero bytes through the API Router.
- Client first asks the API server (via normal authenticated request) to generate a presigned URL. The API server signs the URL with the storage system's secret key, embedding the bucket, key, operation (PUT), expiry, and any conditions (content-type, max size, IP restriction). Returns the URL to the client.
- Client uploads the object bytes directly to the storage node endpoint in the presigned URL. No CDN, no WAF, no LB, no API Router in the data path.
- The storage node receives the bytes and validates the presigned URL: reconstructs the "string to sign" from the request (method, path, query params, headers, expiry), looks up the secret key for that credential (cached locally on every storage node), computes its own HMAC-SHA256, and compares. If it matches and the expiry hasn't passed, the request is authentic. No database lookup needed, no session state.
- Storage node looks up the CRUSH placement map (cached locally) to determine which 14 nodes should hold this object's shards.
- Storage node runs Reed-Solomon encoding on the incoming bytes (same 1MB stripe pipelining as the Data Router would do). Produces 10 data + 4 parity shards.
- Storage node writes 1 shard locally and sends the other 13 to peer storage nodes in parallel. Each peer appends to extent file, CRC32C, fsync, ACK.
- Once quorum ACKs arrive (at least 11 of 14), the storage node sends the metadata to the Metadata Service for Raft commit. Same Raft consensus as the API path: 2 of 3 replicas confirm.
- Only after both shard quorum and Raft commit succeed does the client get 200 OK.
The storage node temporarily takes on the Data Router's role: signature validation, CRUSH lookup, EC encoding, shard distribution, and Raft commit coordination. Once the upload is done, it goes back to being a regular storage node. This dual role is what makes presigned uploads possible without a separate fleet of proxy servers.
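The stateless signature check in step 3 can be sketched with HMAC-SHA256. The canonical "string to sign" layout and query parameter names below are illustrative, not a real SigV4 implementation; the point is that any node holding the secret key can validate without a database lookup or session state.

```python
import hashlib
import hmac
import time

SECRET = b"storage-secret-key"   # in reality, looked up by access key id from a local cache

def string_to_sign(method, path, expires):
    return f"{method}\n{path}\n{expires}".encode()

def presign(method, path, ttl_sec, now=None):
    """API server side: embed expiry and sign with the storage secret key."""
    expires = int(now if now is not None else time.time()) + ttl_sec
    sig = hmac.new(SECRET, string_to_sign(method, path, expires),
                   hashlib.sha256).hexdigest()
    return f"{path}?Expires={expires}&Signature={sig}"

def validate(method, url, now=None):
    """Storage node side: recompute the HMAC locally and compare."""
    path, query = url.split("?", 1)
    params = dict(kv.split("=") for kv in query.split("&"))
    expected = hmac.new(SECRET, string_to_sign(method, path, int(params["Expires"])),
                        hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, params["Signature"]):
        return False   # tampered method, path, or expiry
    return (now if now is not None else time.time()) < int(params["Expires"])

url = presign("PUT", "/bkt1/video.mp4", ttl_sec=3600, now=1_700_000_000)
assert validate("PUT", url, now=1_700_000_100)       # valid and unexpired
assert not validate("GET", url, now=1_700_000_100)   # method mismatch
assert not validate("PUT", url, now=1_700_010_000)   # expired
```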
6.4 Background Systems
Always running, not tied to any request path:
These run continuously regardless of client traffic:
- Repair / Rebalance: When a storage node or drive fails, the repair worker detects it via heartbeat timeout (~30 seconds), reads surviving shards for affected placement groups, reconstructs missing shards using Reed-Solomon math, and writes them to healthy nodes. Also handles rebalancing when nodes are added or removed.
- Garbage Collector: Cleans up shards that are no longer referenced by any live object (from overwrites, deletes, aborted multipart uploads, or failed writes where shards landed but metadata commit failed). 24-hour grace period before physical deletion.
- Scrubber: Reads every shard on every drive periodically (full scan every 2 weeks), computes CRC32C, and compares to the stored checksum. Detects silent bit rot before a client reads the corrupted data.
- Lifecycle Manager: Evaluates lifecycle rules against objects (e.g., "move to Glacier after 90 days"). Triggers storage class transitions by re-encoding shards with the target EC scheme and moving them to the appropriate storage tier.
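The scrubber's core check from the list above is a recompute-and-compare loop over stored shards. A minimal sketch, using `zlib.crc32` as a stand-in for CRC32C (Castagnoli), which real deployments prefer for its hardware instruction support:

```python
import zlib

def store_shard(extent, shard):
    """Append a shard plus its checksum, as the write path would."""
    extent.append((shard, zlib.crc32(shard)))

def scrub(extent):
    """Recompute every checksum; mismatches mean silent bit rot → EC repair."""
    return [i for i, (shard, stored_crc) in enumerate(extent)
            if zlib.crc32(shard) != stored_crc]

extent = []
store_shard(extent, b"shard-bytes-0")
store_shard(extent, b"shard-bytes-1")
assert scrub(extent) == []                       # healthy extent

extent[1] = (b"shard-bytEs-1", extent[1][1])     # simulate on-disk corruption
assert scrub(extent) == [1]                      # scrubber flags it for repair
```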
6.5 Component Reference
| Component | What it does |
|---|---|
| CDN Edge | Caches objects for reads. Only involved in downloads, never uploads. Uses CDN-signed URLs (not storage presigned URLs). See Section 10.4. |
| WAF / DDoS | Filters malicious requests, rate limiting. Sits in front of the API path. |
| API Router Fleet | Stateless servers that receive client requests, validate HMAC signatures, and route to internal services. The front door. Horizontally scalable, no state. |
| IAM Engine | Checks if this user is allowed to do this operation on this bucket/object. Evaluates IAM policies + bucket policies. Returns allow or deny. |
| Metadata Service | Stores the index of all objects: maps object key → placement group (PG) ID + version + ETag + shard locations. Does NOT decide which nodes hold the shards (that's CRUSH). Metadata answers "does this object exist and what's its PG?" CRUSH answers "which 14 nodes does this PG map to?" |
| Raft Groups (1.36M) | Raft is a consensus protocol that ensures metadata writes are durable and consistent. The metadata is split into 1.36 million partitions, each running its own Raft group (3 replicas). A metadata write is confirmed after 2 of 3 replicas agree, which guarantees PUT-then-GET consistency. |
| CRUSH Placement | Deterministic algorithm that answers: "given this object, which 14 storage nodes hold its shards?" Uses cluster topology (racks, AZs) to spread shards across failure domains. Runs locally on any node without a central server. |
| Data Router Fleet | Stateless servers that handle erasure coding. Receive object bytes, split into 10 data + 4 parity shards (Reed-Solomon), send each shard to the correct storage node (determined by CRUSH). Only used in the API path. |
| Storage Nodes | The actual machines with disks. 36 HDDs each. Store shards in extent files, verify checksums on read, report health via heartbeats. In the presigned path, temporarily take on the Data Router's job (EC encode + distribute shards). |
Storage Node: Dual Role. In the API path, a storage node is simple: it receives a pre-encoded shard from the Data Router, appends it to an extent file, fsyncs, and ACKs. It doesn't know about erasure coding. In the presigned path, the storage node that receives the upload temporarily acts as the Data Router too: validates signature, looks up CRUSH, runs EC encoding, distributes shards to peers, and commits metadata. Same work gets done, just by a different component.
All paths converge at the storage layer with the same EC + durability guarantees. Whether an object arrives via the API path, a presigned URL, or is read through CDN, it ends up as the same 14 erasure-coded shards on the same storage nodes with the same Raft-committed metadata. The path is different. The durability is identical.
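The key property of CRUSH is that any component holding the cluster map computes the same placement locally, with no central lookup. A minimal stand-in that captures this property is rendezvous (HRW) hashing; the sketch below assumes a flat list of node IDs and ignores the rack/AZ hierarchy that real CRUSH respects.

```python
import hashlib

def place_pg(pg_id: int, node_ids: list[int], shards: int = 14) -> list[int]:
    """Rendezvous (HRW) hashing: a simplified stand-in for CRUSH.
    Every component with the same cluster map computes the same answer,
    so no central placement service is needed."""
    def score(node_id: int) -> int:
        h = hashlib.sha256(f"{pg_id}:{node_id}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    # Highest-scoring nodes win; deterministic for a given (pg_id, map).
    ranked = sorted(node_ids, key=score, reverse=True)
    return ranked[:shards]

nodes = list(range(100))               # toy cluster map: 100 node IDs
placement = place_pg(pg_id=42, node_ids=nodes)
assert placement == place_pg(42, nodes)  # same map -> same 14 nodes
assert len(set(placement)) == 14         # 14 distinct nodes per PG
```

Real CRUSH additionally walks a failure-domain tree (region → AZ → rack → node → drive) so the 14 shards land in distinct racks; the deterministic-from-the-map property is the same.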
6.6 Path Comparison
| Step | API Upload Path | Presigned Upload Path | CDN Download Path |
|---|---|---|---|
| Entry point | WAF → LB → API Router | Storage Node directly | CDN Edge |
| Auth | API Router + IAM Engine | Storage Node validates signature locally | CDN validates CDN-signed URL at edge |
| Placement | Data Router calls CRUSH | Storage Node calls CRUSH | Handled internally by storage origin |
| EC encode | Data Router | Storage Node (acts as Data Router) | N/A (read path) |
| Shard write | Data Router → 14 Storage Nodes (quorum 11/14) | Storage Node → 13 peers + 1 local (quorum 11/14) | N/A |
| Metadata | Shard quorum ACK → Raft commit → then final ACK to client | Shard quorum ACK → Raft commit → then 200 OK | Storage origin handles lookup internally |
| What's skipped | Nothing | CDN, WAF, LB, API Router, Data Router | API Router, IAM, Data Router (CDN talks to storage origin directly via OAC) |
Request flows (all operations):
| Operation | Path |
|---|---|
| PUT (API) | Client → WAF → LB → API Router → IAM → Metadata (PG lookup) → Data Router (EC encode) → Storage Nodes (14 shards, quorum 11/14) → Raft commit → ACK |
| PUT (Presigned) | Client → Storage Node directly (validate sig → CRUSH → EC encode → distribute 14 shards, quorum 11/14 → Raft commit → 200 OK) |
| GET (CDN) | Client → CDN Edge (validate CDN-signed URL, cache hit: serve / miss: fetch from storage origin via OAC → cache + serve). No API Router or Data Router in this path. |
| GET (Presigned) | Client → Storage Node directly (validate sig → read shards → stream). No CDN. See Section 10.4 for CDN alternatives. |
| DELETE | Client → API Router → Metadata (write delete marker via Raft) |
| LIST | Client → API Router → Metadata (parallel range scan across partitions, k-way merge, early termination at max-keys) |
6.7 Object Data Lifecycle
Every object follows this lifecycle from creation to eventual cleanup: PUT (EC encode → 14 shards appended to extent files → Raft metadata commit) → optional storage class transitions (re-encode, move tier) → overwrite or delete (new version or delete marker; old shards become garbage) → 24-hour GC grace period → physical deletion and extent compaction.
7. Back-of-Envelope Estimation
7.1 Request Throughput
Daily requests: 10B → ~115K/sec average, ~350K/sec peak (3x)
Share of the ~350K/sec peak (rounded):
- GET 70%: ~250K/sec
- PUT 15%: ~50K/sec
- LIST 10%: ~35K/sec
- DELETE 5%: ~17K/sec
7.2 Storage
100T objects × 100KB avg = 10 EB logical (average varies widely: 1KB for configs, 100KB typical, 1MB+ for media)
10+4 EC (1.4x) = 14 EB physical
19,500 nodes × 36 drives × 20TB = 14 EB
700,000 total drives
7.3 Metadata
~700 bytes/object × 100T = ~70 PB, with LSM overhead (bloom filters, indexes, WAL): ~140 PB
140 PB / 100GB per partition = ~1.4M partitions
13,600 metadata nodes (100 partitions each)
7.4 Bandwidth
PUT ingress: 50K/sec × 100KB = 5 GB/sec
GET egress: 250K/sec × 100KB = 25 GB/sec (CDN offloads 70% → 7.5 GB/sec origin)
Parity write overhead: 5 GB/sec × (4 parity shards / 10 data shards) = 2 GB/sec extra
7.5 Router Nodes
API Routers: 350K peak req/sec / ~175 req/sec per node = ~2,000 nodes
Data Routers: 50K PUTs/sec / ~33 PUTs/sec per instance = ~1,500 nodes
Total router nodes: ~3,500
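The Section 7 arithmetic can be checked mechanically. The sketch below re-derives the headline storage figures from the document's own assumptions:

```python
# Re-deriving the Section 7 estimates; all inputs are the document's assumptions.
objects = 100e12                     # 100T objects
avg_size = 100e3                     # 100KB average object
logical_eb = objects * avg_size / 1e18
physical_eb = logical_eb * 14 / 10   # 10+4 EC -> 1.4x overhead
drives = 19_500 * 36                 # nodes x HDDs per node
capacity_eb = drives * 20e12 / 1e18  # 20TB per drive

assert round(logical_eb) == 10               # 10 EB logical
assert round(physical_eb) == 14              # 14 EB physical
assert abs(capacity_eb - physical_eb) < 0.1  # fleet sized to fit the data
assert drives == 702_000                     # "~700,000 total drives"
```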
7.6 Summary
| Resource | Count |
|---|---|
| Storage nodes | ~19,500 |
| Total drives | ~700,000 |
| Physical storage | 14 EB |
| Metadata nodes | ~13,600 |
| Raft partitions | ~1,360,000 |
| Placement groups | ~100,000 |
| API/Data Router nodes | ~3,500 |
| CDN PoPs | 200+ |
| Peak req/sec | ~350,000 |
Cost breakdown:
Capex:
Storage nodes: 19,500 × $15K/node = $292M
Metadata nodes: 13,600 × $8K/node = $109M
Router nodes: 3,500 × $5K/node = $18M
Networking (switches, cross-connects) = $50M
Total capex: $469M
Opex (annual):
Power + cooling: 36,600 nodes × 500W × $0.10/kWh × 8,760h ≈ $16M
Operations (staff, facilities, bandwidth): $100M
Total opex: ~$116M/year
Amortized over 4 years: ($469M / 4) + $116M ≈ $233M/year
Cost per GB: $233M / (10 EB × 10^9 GB/EB) ≈ $0.023/GB/year ≈ $0.002/GB/month
Cost per TB: about $2/TB/month internal
Retail $23/TB/month covers margin, multi-region, and ops overhead. At scale, cross-AZ network transfer often costs more than the storage hardware itself. Placing the majority of data shards in the local AZ is one of the biggest cost levers available.
8. Data Model
8.1 Bucket Metadata (PostgreSQL)
CREATE TABLE buckets (
account_id BIGINT NOT NULL,
bucket_name TEXT NOT NULL,
bucket_id UUID DEFAULT gen_random_uuid(),
region TEXT NOT NULL,
versioning TEXT DEFAULT 'disabled',
acl JSONB DEFAULT '{}',
policy JSONB,
lifecycle_rules JSONB DEFAULT '[]',
encryption JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
PRIMARY KEY (account_id, bucket_name)
);
CREATE UNIQUE INDEX idx_bucket_name ON buckets (bucket_name);
8.2 Object Metadata (Distributed KV, protobuf)
Key: /{bucket_id}/{object_key}\0{version_id}. Version IDs are timestamps. To make the newest sort first in a forward-only scan, each byte is XORed with 0xFF. This flips the bits so a larger timestamp (newer) becomes a smaller byte value (sorts earlier). Simple trick that avoids needing reverse iteration in the storage engine.
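A minimal sketch of the bit-flip encoding, assuming 8-byte big-endian timestamps (the key layout around it is simplified):

```python
def encode_version(ts: int) -> bytes:
    """XOR each byte of a big-endian timestamp with 0xFF so that a
    larger (newer) timestamp becomes a smaller byte string and
    therefore sorts FIRST in a forward byte-order scan."""
    return bytes(b ^ 0xFF for b in ts.to_bytes(8, "big"))

older, newer = 1_700_000_000, 1_800_000_000
# Plain big-endian bytes sort oldest-first; flipped bytes sort newest-first.
assert newer.to_bytes(8, "big") > older.to_bytes(8, "big")
assert encode_version(newer) < encode_version(older)
keys = sorted([encode_version(older), encode_version(newer)])
assert keys[0] == encode_version(newer)   # newest version scans first
```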
message ObjectMetadata {
string object_key = 1;
bytes version_id = 2;
uint64 size = 3;
string etag = 4;
string content_type = 5;
string storage_class = 6;
string encryption_key_id = 7;
map<string, string> user_metadata = 8;
uint32 placement_group = 9;
repeated ShardLocation shard_map = 10; // cached for fast reads; recomputable from PG + CRUSH
bool is_delete_marker = 11;
int64 created_at = 12;
}
message ShardLocation {
uint32 shard_index = 1; // 0-13
uint64 node_id = 2;
uint64 extent_id = 3;
uint64 offset = 4;
uint32 length = 5;
}
8.3 Placement Group Map (etcd)
message PlacementGroup {
uint32 pg_id = 1;
repeated uint64 node_ids = 2; // 14 entries
uint64 epoch = 3;
string state = 4; // ACTIVE, REBALANCING, DEGRADED
}
8.4 Extent Index (local per node)
Each node holds roughly 72 billion shard entries (100T objects × 14 shards / 19,500 nodes). At 32 bytes per entry, a full in-memory index would need ~2.3TB, which doesn't fit in RAM. In practice, the index is tiered: a bloom filter in memory (a few GB, tells you "this extent definitely doesn't have your shard" for most lookups), a persistent on-disk sorted index per extent file (written when the extent is sealed), and an LRU cache in RAM for hot entries. Most reads hit the bloom filter first, skip irrelevant extents, then do one disk seek into the right extent. In practice, the top 1-5% of hot entries live in the RAM cache, covering the majority of read traffic. Cold reads incur one additional disk seek to the on-disk index before reaching the data.
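A toy illustration of the first tier, assuming one Bloom filter per sealed extent (sizes and hash counts here are illustrative, not the production values):

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: answers 'definitely absent' or 'maybe present'.
    Stands in for the per-extent in-memory tier described above."""
    def __init__(self, bits: int = 8192, hashes: int = 3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)
    def _positions(self, key: bytes):
        for i in range(self.hashes):
            h = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(h[:4], "big") % self.bits
    def add(self, key: bytes):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)
    def might_contain(self, key: bytes) -> bool:
        return all(self.array[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

# One filter per sealed extent: skip extents that definitely lack the shard.
extent_filters = {1: BloomFilter(), 2: BloomFilter()}
extent_filters[1].add(b"shard-aaa")
extent_filters[2].add(b"shard-bbb")

candidates = [eid for eid, f in extent_filters.items()
              if f.might_contain(b"shard-bbb")]
assert 2 in candidates    # the true extent always survives the filter
```

Only the surviving candidates pay the disk seek into the per-extent on-disk index, which is what keeps cold reads to roughly one extra seek.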
8.5 Multipart Upload State
Key: /mpu/{bucket_id}/{object_key}/{upload_id}
message MultipartUpload {
string upload_id = 1;
map<uint32, PartInfo> parts = 2;
int64 created_at = 3;
int64 expires_at = 4; // auto-abort deadline
}
message PartInfo {
uint64 size = 1;
string etag = 2;
uint32 placement_group = 3;
repeated ShardLocation shards = 4;
}
8.6 Version Listing Index
Secondary index: /versions/{bucket_id}/{object_key}\0{reverse_version_id} pointing to primary metadata. Uses the same bit-flip trick from Section 8.2 to make newest versions sort first in a forward scan.
8.7 Lifecycle State
Key: /lifecycle/{bucket_id}/{rule_id}/{object_key} → {current_class, target_class, transition_at, state}
9. API Design
9.1 PutObject
PUT /{bucket}/{key} HTTP/1.1
Host: storage.us-east-1.example.com
Content-Length: 10485760
Content-Type: image/jpeg
x-storage-class: STANDARD
x-storage-server-side-encryption: kms
x-storage-server-side-encryption-kms-key-id: kms:us-east-1:123456:key/abcd-1234
x-storage-meta-original-filename: beach-photo.jpg
Authorization: HMAC-SHA256 Credential=AKID/20260310/us-east-1/storage/request,
SignedHeaders=content-length;content-type;host;x-storage-date,
Signature=abcdef1234567890...
[10 MB of object bytes]

HTTP/1.1 200 OK
ETag: "d41d8cd98f00b204e9800998ecf8427e"
x-storage-version-id: 3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo
9.2 GetObject
GET /{bucket}/{key} HTTP/1.1
Range: bytes=0-1048575
If-None-Match: "d41d8cd98f00b204e9800998ecf8427e"

Returns 206 Partial Content with range, or 304 Not Modified if ETag matches.
9.3 DeleteObject
DELETE /{bucket}/{key}?versionId=3sL4kqtJlcpXr... HTTP/1.1

Without versionId on a versioned bucket: creates a delete marker (data untouched). With versionId: permanently deletes that version.
9.4 ListObjectsV2
GET /{bucket}?list-type=2&prefix=photos/2024/&delimiter=/&max-keys=1000 HTTP/1.1

<ListBucketResult>
<Prefix>photos/2024/</Prefix>
<CommonPrefixes><Prefix>photos/2024/january/</Prefix></CommonPrefixes>
<CommonPrefixes><Prefix>photos/2024/vacation/</Prefix></CommonPrefixes>
<Contents>
<Key>photos/2024/index.html</Key>
<ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
<Size>2048</Size>
</Contents>
<IsTruncated>true</IsTruncated>
<NextContinuationToken>def456</NextContinuationToken>
</ListBucketResult>
9.5 Multipart Upload
POST /{bucket}/{key}?uploads HTTP/1.1 → UploadId=abc123
PUT /{bucket}/{key}?partNumber=1&uploadId=abc123 HTTP/1.1 → ETag for part 1
PUT /{bucket}/{key}?partNumber=2&uploadId=abc123 HTTP/1.1 → ETag for part 2
POST /{bucket}/{key}?uploadId=abc123 HTTP/1.1 → Complete (send part ETags)
DELETE /{bucket}/{key}?uploadId=abc123 HTTP/1.1 → Abort (if needed)

Parts upload in parallel, out of order, from different machines. Each part independently erasure-coded.
9.6 Presigned URLs
https://storage.../my-bucket/photo.jpg
?X-Storage-Algorithm=HMAC-SHA256
&X-Storage-Credential=AKID%2F20260310%2Fus-east-1%2Fstorage%2Frequest
&X-Storage-Expires=3600
&X-Storage-Signature=fe5f80f77d5fa3beca038a248ff027...
Self-contained: signature embeds bucket, key, operation, expiry. Conditions: IP range, content-type, content-length-range. These go direct to storage nodes, not through CDN. See Section 10.4 for the full presigned URL and CDN deep dive.
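A simplified sketch of presigned-URL generation and stateless validation. The string-to-sign here is deliberately minimal (method, path, expiry); the real canonical-request format also covers query params, signed headers, and conditions.

```python
import hmac, hashlib, urllib.parse

SECRET = b"per-credential-secret"   # looked up from the (locally cached) key store

def presign(method: str, path: str, expires_at: int) -> str:
    string_to_sign = f"{method}\n{path}\n{expires_at}"
    sig = hmac.new(SECRET, string_to_sign.encode(), hashlib.sha256).hexdigest()
    q = urllib.parse.urlencode({"X-Storage-Expires": expires_at,
                                "X-Storage-Signature": sig})
    return f"{path}?{q}"

def validate(method: str, url: str, now: int) -> bool:
    path, _, query = url.partition("?")
    params = dict(urllib.parse.parse_qsl(query))
    expires_at = int(params["X-Storage-Expires"])
    if now > expires_at:
        return False                              # past expiry -> 403
    expected = hmac.new(SECRET, f"{method}\n{path}\n{expires_at}".encode(),
                        hashlib.sha256).hexdigest()
    # Constant-time compare; no database lookup, no session state.
    return hmac.compare_digest(expected, params["X-Storage-Signature"])

url = presign("GET", "/my-bucket/photo.jpg", expires_at=1000)
assert validate("GET", url, now=500)        # within expiry: authentic
assert not validate("GET", url, now=2000)   # expired: reject
assert not validate("PUT", url, now=500)    # different operation: reject
```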
10. Deep Dives
10.1 End-to-End: From Upload to Recovery
Before diving into individual flows, here's the full lifecycle of a file using a simplified 2+1 scheme (2 data shards, 1 parity). The real system uses 10+4, but the mechanics are identical.
Setup: File = [A B C D] (4 blocks). Stripe size = 2 blocks (k=2). One parity shard per stripe (m=1).
Step 1: Split into stripes and encode.
| Stripe | Data Shards | Parity Shard | Stored |
|---|---|---|---|
| Stripe 1 | D0 = A, D1 = B | P = A XOR B | [A, B, A⊕B] |
| Stripe 2 | D0 = C, D1 = D | P = C XOR D | [C, D, C⊕D] |
Each stripe is encoded independently. Stripes don't know about each other.
Step 2: Distribute shards across nodes.
| Shard | Stripe 1 | Stripe 2 |
|---|---|---|
| Node 1 | A | C |
| Node 2 | B | D |
| Node 3 | A⊕B (parity) | C⊕D (parity) |
Every shard lands on a different node, different rack. Metadata service records the mapping.
Step 3: Node 2 crashes.
Heartbeat monitor detects the failure within 30 seconds. Shard B (Stripe 1) and shard D (Stripe 2) are now missing.
Step 4: Read during failure. The reader fetches the surviving shards and reconstructs the missing ones: B = A ⊕ (A⊕B), and likewise D = C ⊕ (C⊕D). Client gets [A B C D]. No error, just ~50ms extra latency from reconstruction.
Step 5: Background repair. Repair worker reconstructs B, writes it to Node 7, updates metadata. Full redundancy restored.
Mapping to the production system: Replace 2+1 with the real EC scheme. Stripes are 1MB. Parity uses GF(2^8) matrix math instead of simple XOR. The system tolerates 4 missing shards instead of 1. Everything else works the same.
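The 2+1 walkthrough above can be executed directly. This sketch encodes the two stripes, drops Node 2's shards, and reconstructs them with XOR:

```python
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# File = [A B C D], stripe size k=2, one parity shard (m=1), as in the tables.
A, B, C, D = b"AAAA", b"BBBB", b"CCCC", b"DDDD"
stripes = [(A, B), (C, D)]
encoded = [(d0, d1, xor(d0, d1)) for d0, d1 in stripes]  # [data0, data1, parity]

# Node 2 crashes: shard index 1 (B and D) is lost in every stripe.
recovered = []
for d0, _lost, parity in encoded:
    recovered.append(xor(d0, parity))   # B = A xor (A xor B); D likewise

assert recovered == [B, D]              # client still reads [A B C D]
```

Production replaces the XOR with GF(2^8) Reed-Solomon matrix math so any 10 of 14 shards suffice, but the encode/lose/reconstruct cycle is identical.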
10.2 The PUT Path
- API Router validates HMAC, calls IAM. (~2ms)
- Metadata Service returns PG + shard-to-node mapping. (~5ms)
- Data Router reads 1MB stripes, RS encodes 10 data + 4 parity shards.
- 14 parallel shard writes to storage nodes (append to extent file, CRC32C, fsync). (~10-20ms)
- Quorum 11/14 means up to 3 slow nodes don't block. Missing shards repaired async.
- Raft commit of object metadata. (~5-10ms)
- Return 200 with ETag and version-id.
Total for 10MB PUT: ~38ms + network transfer. Data isn't durable until both the shard quorum write and the Raft metadata commit succeed. If shards land on disk but the Raft commit fails (say, a leader election during commit), those shards become orphans. GC cleans them up within 24 hours. The client gets an error and retries.
10.3 The GET Path
There are three ways to read an object. Each skips different layers:
GET via CDN (most common for high-traffic reads):
- Client sends CDN-signed URL to CDN edge.
- CDN validates signature, checks cache. Hit (60-80%): serve immediately.
- Miss: CDN fetches from storage origin via OAC. Storage handles metadata lookup + shard fetch internally.
GET via API (normal authenticated read):
- Client sends GET with HMAC signature to API Router.
- API Router validates auth, calls IAM.
- Metadata lookup from Raft leader: returns PG ID, ETag, size, and cached shard locations. (~5ms) (Alternatively, the reader can compute shard locations from PG + CRUSH.)
- Data Router fetches 10 data shards in parallel from storage nodes.
- Happy path (>99%): All 10 respond. Stream to client. ~15-20ms server-side.
- Degraded path: 1-4 shards missing → request parity shards → RS reconstruct. +50-200ms. Queue background repair.
GET via storage presigned URL (direct, no CDN):
- Client sends request directly to storage node with presigned URL.
- Storage node validates signature and expiry locally.
- Storage node reads local shards, fetches remaining from peers, reconstructs if needed, streams back.
- No CDN, no API Router, no Data Router in this path.
Speculative read: If 9 of 10 respond quickly and 1 is lagging, fire off a request for a parity shard without waiting. Whichever arrives first wins. This keeps p99 tight even when individual nodes are having a bad day.
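A sketch of the speculative (hedged) read, assuming toy fetchers with fixed delays standing in for shard reads:

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait
import time

def fetch(source: str, delay: float) -> str:
    time.sleep(delay)           # stands in for a shard read from a node
    return source

def speculative_read(primary_delay: float, parity_delay: float,
                     hedge_after: float = 0.05) -> str:
    """Fire the primary shard read; if it hasn't answered within
    `hedge_after` seconds, also request a parity shard. First wins."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(fetch, "primary", primary_delay)}
        done, _ = wait(futures, timeout=hedge_after)
        if not done:            # primary is lagging: hedge with a parity read
            futures.add(pool.submit(fetch, "parity", parity_delay))
            done, _ = wait(futures, return_when=FIRST_COMPLETED)
        return done.pop().result()

assert speculative_read(primary_delay=0.0, parity_delay=0.0) == "primary"
assert speculative_read(primary_delay=1.0, parity_delay=0.0) == "parity"
```

The hedge threshold is the knob: too low and you double read traffic, too high and the p99 win disappears.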
10.4 Presigned URLs: Storage vs CDN
There are two completely separate signing systems. Mixing them up is one of the most common mistakes.
Storage presigned URLs go direct to the storage node. CDN is not involved.
Storage presigned PUT flow:
- Client asks the API server for a signed URL (authenticated with normal credentials).
- API server generates the URL: HMAC-SHA256 signature over bucket, key, operation, expiry, conditions.
- Client uploads directly to the storage node using the signed URL.
- Storage node validates: reconstructs the "string to sign" from the request (method, path, query params, headers, expiry), looks up the secret key for that credential (cached locally), computes its own HMAC-SHA256, compares. Match = authentic. Past expiry = 403. No database lookup, no session state.
- Storage node runs EC, distributes shards, commits metadata via Raft.
Storage presigned GET: Client hits the storage endpoint directly. Node validates signature, streams the object. No CDN, no caching benefit. Each URL is unique (different signature), so CDN would just be a pass-through.
CDN-signed URLs (the correct CDN pattern):
Storage presigned URLs and CDN-signed URLs are two different systems. For private content served through CDN:
- Storage bucket is private. Users cannot reach it directly.
- CDN is the only entity allowed to fetch from storage, via Origin Access Control (OAC).
- Your backend generates a CDN-signed URL using a CDN key pair (completely separate from storage credentials). URL looks like
https://d1234.cloudfront.net/video.mp4?Signature=...&Expires=...&Key-Pair-Id=...
- User requests the CDN-signed URL. CDN verifies the signature and expiry at the edge using the public key. If invalid or expired: 403 immediately, no origin hit.
- If valid: serve from CDN cache (fast). On cache miss: CDN fetches from storage using its OAC credential, caches the response, returns to user.
Wrong: https://bucket.s3.amazonaws.com/file?X-Amz-Signature=... (storage presigned URL, bypasses CDN)
Right: https://d1234.cloudfront.net/file?Signature=... (CDN-signed URL, goes through CDN)
| Pattern | URL goes to | Who validates | Caching | Best for |
|---|---|---|---|---|
| Storage presigned URL | Storage directly | Storage node | None | Uploads, temp downloads |
| CDN-signed URL + OAC | CDN edge | CDN edge | Full CDN caching | Private content, high read traffic |
| Public bucket + CDN | CDN edge | No auth needed | Full CDN caching | Static assets, public content |
| Use case | Best choice |
|---|---|
| Single private file, low traffic | Storage presigned URL |
| Single private file, high traffic | CDN-signed URL |
| Many files (video streaming) | CDN-signed cookies |
| Public assets | CDN + public bucket |
10.5 Multipart Upload
Each part is independently erasure-coded with its own shard map. Parts upload in parallel, out of order, from different machines. The upload_id ties them together.
Each part is already durable on upload. When you upload Part 3, the storage system erasure-codes it, distributes shards, and stores it with full durability guarantees immediately. Parts don't sit around as raw unprotected bytes. They're already encoded and replicated. The system just hasn't stitched them into a visible object yet.
Parts before completion are stored as temporary internal objects. Not visible via normal GET. Don't show up in bucket listing. Only accessible through the UploadId. Think of them as "durable but invisible."
Resume: If upload fails at part 7 of 20, call ListParts (using the UploadId) to see which parts are stored. Parts 1-6 and 8-20 are durable. Re-upload only part 7. Parts are idempotent (same content = same ETag).
What happens during CompleteMultipartUpload:
- Validate: Check that the UploadId exists and all referenced parts are present. Verify ETags match. Parts must be in ascending order.
- Assemble: No full data rewrite is required. The system links existing parts into a single logical object by updating metadata pointers to reference the already-stored parts' shard locations. This is primarily a metadata operation. (Some systems may later compact or re-layout parts for read performance, but this is a background optimization, not part of the completion step.)
- Compute final ETag: MD5(concat(MD5(part1) + MD5(part2) + ...)) + "-" + part_count. Example: "5d41402abc4b2a76b9719d911017c592-4" means 4 parts. Can't be compared to a single-PUT MD5 of the same content.
- Atomic commit: Object becomes visible in one metadata write. Before: object does not exist. After: fully readable.
- Cleanup: Temporary multipart state removed. Part data stays (now referenced by final object metadata).
For stronger integrity, the system also supports explicit SHA-256 or CRC32 checksums at completion time.
Auto-abort: Incomplete uploads older than 7 days are automatically cleaned up.
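The composite-ETag rule can be sketched as follows (part sizes shrunk for illustration; real parts are megabytes):

```python
import hashlib

def multipart_etag(parts: list[bytes]) -> str:
    """ETag = MD5 of the concatenated per-part MD5 digests, plus "-<count>"."""
    digests = b"".join(hashlib.md5(p).digest() for p in parts)
    return f"{hashlib.md5(digests).hexdigest()}-{len(parts)}"

parts = [b"a" * 1024, b"b" * 1024, b"c" * 512]
etag = multipart_etag(parts)
assert etag.endswith("-3")   # suffix encodes the part count
# NOT comparable to the single-PUT MD5 of the same bytes:
assert etag.split("-")[0] != hashlib.md5(b"".join(parts)).hexdigest()
```

This is why integrity checks against a locally computed MD5 only work for single-PUT objects; for multipart objects you must either replay the same part boundaries or use the explicit SHA-256/CRC32 checksums mentioned above.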
| Object Size | Part Size | Parts | Parallelism |
|---|---|---|---|
| 100MB | 10MB | 10 | 5-10 |
| 1GB | 16MB | ~63 | 10 |
| 10GB | 64MB | ~157 | 10-20 |
| 1TB | 512MB | ~2,000 | 50-100 |
| 5TB | 512MB | ~10,000 | 100 |
10.6 Small Object Packing: Extent Files
Most objects are small. The median is closer to 10KB. If you store each one as a separate file, you hit three walls at once: inode exhaustion (ext4 allocates a fixed number of inodes, typically tens to hundreds of millions per filesystem), IOPS waste (a 7200 RPM HDD does about 150 random I/O operations per second because each read requires a physical disk seek of ~7ms, so reading a 1KB file costs the same IOPS as reading a 1MB file), and filesystem fragmentation from billions of tiny files.
Solution: Pack thousands of objects into 256MB extent files. Facebook published this approach as Haystack (OSDI 2010) to solve the exact same small-file problem for photo storage. Append shard to open extent, update in-memory index, fsync, seal at 256MB.
+------------------------------------------------------------------+
| Extent File (256 MB) |
| Header | ObjA shard|CRC|pg| ObjB shard|CRC|pg| ... | free space |
+------------------------------------------------------------------+
Read: O(1) in-memory hash lookup → seek to offset → read bytes → verify CRC32C.
| Metric | One file per object | Extent packing |
|---|---|---|
| Write IOPS (1KB, HDD) | 150/sec (seek) | 5,000+/sec (sequential) |
| Max objects per 20TB | ~4B (inode limit) | Unlimited (memory-bound) |
| Metadata per object | ~256 bytes (inode) | 32 bytes (in-memory) |
Compaction: When >30% of an extent's shards are dead, GC rewrites live shards to a new extent. Large objects (>4MB): stored in dedicated files, not packed.
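A toy extent that captures the core mechanics: append-only packing, an in-memory offset index, and checksum verification on read. zlib.crc32 stands in for CRC32C, and an in-memory buffer stands in for the 256MB on-disk file.

```python
import io, zlib

class Extent:
    """Append-only container packing many small shards into one file."""
    def __init__(self):
        self.buf = io.BytesIO()
        self.index = {}                 # shard_id -> (offset, length, crc)
    def append(self, shard_id: str, data: bytes):
        offset = self.buf.tell()
        self.buf.write(data)            # sequential write: no per-object seek
        self.index[shard_id] = (offset, len(data), zlib.crc32(data))
    def read(self, shard_id: str) -> bytes:
        offset, length, crc = self.index[shard_id]  # O(1) in-memory lookup
        self.buf.seek(offset)
        data = self.buf.read(length)
        if zlib.crc32(data) != crc:
            raise IOError("bit rot detected")       # scrubber would repair
        return data

ext = Extent()
ext.append("objA/shard0", b"x" * 100)
ext.append("objB/shard0", b"y" * 2048)
assert ext.read("objB/shard0") == b"y" * 2048
```

A real extent also fsyncs after each append, seals at 256MB, and persists its index at seal time, per the description above.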
Small object durability strategy (important design decision): For objects under 1MB, EC overhead is disproportionate. A 1KB object produces 14 shards of ~100 bytes each, and every read requires fetching 10 of them from 10 different nodes. The per-request overhead dwarfs the data. Most production systems handle this with a size-based split:
| Object size | Strategy | Why |
|---|---|---|
| < 256KB | 3x replication | EC overhead too high relative to data. Replication is simpler and reads hit one node instead of 10. |
| 256KB - 4MB | EC or replication (configurable) | Transition zone. EC starts to pay off. |
| > 4MB | Erasure coding (10+4) | EC savings dominate. Stored in dedicated files, not packed into extents. |
This policy is typically configurable per bucket or system-wide and applied at write time. The thresholds above are guidelines, not hard rules.
This is a big design choice that affects both cost and read latency for the majority of objects (most are small).
10.7 Prefix Listing at Scale
Keys are sorted in the LSM-tree. prefix=photos/2024/ → range scan from "photos/2024/" to "photos/2024/\xff". For each key: strip prefix, find delimiter → CommonPrefix (simulates subdirectory) or Contents (simulates file).
Scaling to 50B matches: The metadata service scans all relevant partitions in parallel. Results come back sorted per partition. A k-way merge combines these sorted streams into one globally sorted result (like merging k sorted lists at once). Once 1000 results are collected (the max-keys default), scanning stops and a continuation token marks where to resume on the next request.
Applications must design key structure with LIST in mind. {date}/{category}/{uuid} is good. {uuid} with no prefix structure means scanning everything.
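The k-way merge with early termination can be sketched with heapq.merge, assuming each partition already returns its matching keys sorted:

```python
import heapq

# Each metadata partition returns its matches in sorted order.
partition_results = [
    ["photos/2024/a.jpg", "photos/2024/m.jpg"],
    ["photos/2024/b.jpg", "photos/2024/z.jpg"],
    ["photos/2024/c.jpg"],
]

def list_objects(streams, max_keys: int):
    """k-way merge of per-partition sorted streams; stop at max-keys and
    hand back a continuation token (the last key served)."""
    merged = heapq.merge(*streams)      # lazily merges k sorted iterators
    out = []
    for key in merged:
        out.append(key)
        if len(out) == max_keys:
            return out, key             # token: resume scanning after this key
    return out, None                    # streams exhausted: no token

page, token = list_objects(partition_results, max_keys=3)
assert page == ["photos/2024/a.jpg", "photos/2024/b.jpg", "photos/2024/c.jpg"]
assert token == "photos/2024/c.jpg"
```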
10.8 Storage Class Transitions
Per-partition lifecycle scanner runs hourly. Matching objects: read shards → re-encode with target EC (e.g., 16+4 for Glacier) → write to target storage → update metadata → GC old shards. Rate-limited: 1,000 transitions/min/partition.
10.9 Garbage Collection
Sources of garbage: overwrites (old shards), deletes (all shards), aborted multipart (orphan parts), failed writes (orphan shards).
GC is always async. You never delete anything in the write path. Dead shards sit for a 24-hour grace period before physical deletion. This protects against in-flight reads, clock skew, and race conditions. When an extent passes 30% dead ratio, GC compacts it by rewriting the live shards into a fresh extent.
GC gets 20% of cluster bandwidth (~1.4 GB/sec, ~121 TB/day). Keep an eye on the backlog trend. If it's growing, either bump the budget or figure out why the deletion rate spiked.
Orphan detection: Once a week, the metadata service exports every referenced shard. Each storage node compares that list against its local index. Anything unmatched is an orphan candidate, quarantined for 7 days, then deleted.
10.10 Rebalancing and Repair
Triggers: Node added (new capacity), node removed (decommission), disk failure, capacity imbalance.
Repair flow: Heartbeat detects failure (30s) → PGs prioritized by severity (10/14 shards = critical, 11/14 = high, 13/14 = low) → read 10 surviving shards → RS reconstruct → write to new node → update PG map.
Rate-limited: 20% of cluster I/O. Backpressure: pause if client p99 exceeds threshold.
Repair time for one node failure: ~72 PGs affected, roughly 720 TB (36 drives × 20TB) to reconstruct. At 1.4 GB/sec per worker with 10 parallel workers (~14 GB/sec aggregate): ~14 hours under ideal conditions. Real-world repair takes longer due to throttling (to protect client latency), network contention, and concurrent repairs. Budget 2-3x the ideal estimate. Keeping repair fast is critical for durability because the longer the window stays open, the higher the chance of a second failure hitting the same PG.
Node decommission: Mark DRAINING → stop writes → copy all shards to replacements → verify → remove from CRUSH.
10.11 Versioning
Versioning enabled: every PUT creates a new version. DELETE without version_id creates a delete marker (data untouched, recoverable). DELETE with version_id permanently removes that version.
MFA Delete: Requires MFA to permanently delete versions or change versioning state. Protects against compromised credentials.
Storage impact: 1MB file updated 1,000×/day = 1GB/day versioned data. Lifecycle rules essential: "expire non-current versions after 30 days."
10.12 Hot Objects and Caching
Anything accessed more than 100 times per minute gets promoted to a local SSD cache on the Data Router (LRU, 10GB per node). For truly viral content hitting 10K+ req/sec on a single object, the system copies it to multiple Data Router caches and uses request coalescing at the CDN edge. That way a thousand simultaneous cache misses turn into one origin fetch instead of a thousand.
10.13 Storage Layer Realism
Each node: 36 HDDs (20TB), 2 NVMe SSDs (extent index + WAL), 256GB RAM, 2×25Gbps NIC.
CRUSH failure domain hierarchy: Region → AZ (3) → Rack (~540) → Node (~36/rack) → Drive (36/node).
For a 10+4 PG: 5+5+4 shard distribution across 3 AZs, each shard on a different rack, max 5 per AZ. Losing the AZ with 4 leaves exactly 10 (sufficient), but losing an AZ with 5 leaves only 9, so 10+4 survives the loss of only one specific AZ. For true AZ-loss tolerance on critical data, use 10+6 with a 6+5+5 split: losing any AZ leaves at least 10 shards.
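A quick check of which single-AZ losses each layout survives (k = 10 data shards needed to read):

```python
def survives_az_loss(per_az_shards: list[int], k: int) -> list[bool]:
    """For each AZ, can we still read (>= k shards remain) after losing it?"""
    total = sum(per_az_shards)
    return [total - lost >= k for lost in per_az_shards]

# 10+4 with a 5+5+4 split: only losing the AZ holding 4 shards is survivable.
assert survives_az_loss([5, 5, 4], k=10) == [False, False, True]
# 10+6 with a 6+5+5 split: losing ANY one AZ still leaves >= 10 shards.
assert all(survives_az_loss([6, 5, 5], k=10))
```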
10.14 User Experience
5MB photo upload: Auth 2ms + PG lookup 5ms + encode <1ms + 14 shard writes 20ms + Raft 10ms = ~40ms + transfer. Photo erasure-coded across 3 AZs with 11 nines.
4GB video multipart: 800 parts × 5MB, 10 parallel. ~5.5 minutes on 100 Mbps. WiFi drops at part 400 → reconnect → ListParts → resume from 400. No wasted work.
Data lake byte-range read: Bytes 100MB-110MB of a 50GB Parquet file → Data Router computes target stripe → fetches only relevant shards. ~20ms first byte. Never downloads full 50GB.
Accidental delete recovery: Versioning enabled → DELETE only creates marker → list versions → GET with version_id or delete the marker. Data never lost.
11. What Breaks First at Scale
- Metadata hot spots. Popular bucket overwhelms a single Raft partition. Auto-split at 200GB or 10K ops/sec.
- Reconstruction storms. Rack failure (20 nodes) triggers massive repair I/O. Rate-limited repair + priority queues prevent cascading degradation.
- GC backlog. 5K dead shard sets/sec from overwrites. Without 20% I/O budget for GC, leaked storage hits petabytes in months.
- Small object IOPS wall. Median 10KB object = 150 reads/sec per HDD (seek-bound). Extent packing mandatory.
- Presigned URL abuse. Leaked URL with 7-day expiry = open door. Short expiry + IP restrictions + bucket deny policies.
12. Partial Update Layer (Application-Level)
This is NOT built into object storage. S3 has no concept of chunks, manifests, or partial updates. Changing 1 byte in a 5GB file means re-uploading the entire 5GB. This is by design: immutability keeps the storage engine simple. The manifest service described here is application code, not part of S3. S3 just stores blobs.
Dropbox, Google Drive, and backup tools like Restic all solve this the same way: they build a chunking and manifest layer on top of the blob store.
Here's how it works concretely. Say you have a 100MB document split into 12 chunks of ~8MB each:
- Each chunk is stored as a regular S3 object in a dedicated bucket (e.g., my-app-chunks). Chunk keys look like chunks/a1b2c3d4, chunks/e5f6g7h8, etc. S3 doesn't know these are chunks. They're just objects.
- To edit the document (say chunk 3 changed), you upload one new 8MB chunk to S3 (chunks/NEW_p4q5r6), then update the manifest to point to the new chunk while keeping the other 11 chunk references unchanged. Total upload: 8MB instead of 100MB.
- To read the document, fetch the manifest, download all 12 chunks from S3 in parallel, concatenate them in order.
S3 stores the bytes. The manifest service tracks which bytes belong to which file version.
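The chunk-plus-manifest mechanics in miniature, assuming content-addressed chunk keys and a dict standing in for the chunks bucket (chunk size shrunk to 8 bytes for illustration):

```python
import hashlib

CHUNK = 8   # toy chunk size; production uses ~8MB

def chunk_keys(data: bytes) -> list[str]:
    """Content-addressed keys: chunk key = hash of chunk bytes."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

store = {}                                          # stands in for the chunks bucket
doc_v1 = b"AAAAAAAA" + b"BBBBBBBB" + b"CCCCCCCC"    # 3 chunks
manifest_v1 = chunk_keys(doc_v1)
for i, key in enumerate(manifest_v1):
    store[key] = doc_v1[i * CHUNK:(i + 1) * CHUNK]

# Edit chunk 2 only: upload ONE new chunk, then re-point the manifest.
new_chunk = b"XXXXXXXX"
new_key = hashlib.sha256(new_chunk).hexdigest()
store[new_key] = new_chunk
manifest_v2 = [manifest_v1[0], new_key, manifest_v1[2]]

reassembled = b"".join(store[k] for k in manifest_v2)
assert reassembled == b"AAAAAAAA" + b"XXXXXXXX" + b"CCCCCCCC"
assert len(store) == 4   # 3 original chunks + 1 new; nothing was rewritten
```

Content-addressed keys also give deduplication for free: re-uploading an unchanged chunk produces the same key, which is how tools like Restic avoid storing duplicates.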
12.1 Data Model
Key: /{tenant_id}/{file_id}\0{version} (newest first via same XOR trick as Section 8.2)
message FileManifest {
string file_id = 1;
string tenant_id = 2;
uint64 version = 3;
repeated ChunkRef chunks = 4;
uint64 total_size = 5;
string content_hash = 6;
int64 created_at = 7;
uint64 expected_version = 8; // optimistic locking
}
message ChunkRef {
uint32 index = 1;
string object_key = 2; // key in chunks bucket
uint64 offset = 3;
uint64 size = 4;
string etag = 5;
}
12.2 API
| Endpoint | Purpose |
|---|---|
| POST /files/{id}/upload-urls | Get presigned URLs for new chunks |
| POST /files/{id}/versions | Commit new manifest (with If-Match: version_N for optimistic locking) |
| GET /files/{id} | Get latest manifest + chunk presigned GET URLs |
| GET /files/{id}/versions | List all versions |
| DELETE /files/{id}/versions/{v} | Delete specific version |
12.3 Atomic Commit and Consistency
- Upload new/changed chunks via presigned URLs.
- Create new manifest referencing old chunks (unchanged) + new chunks.
- Atomic pointer switch: manifest service commits new version in a single Raft write.
No partial state visible. Before commit: readers see version N. After: version N+1. If commit fails, uploaded chunks are orphans cleaned by GC.
Reads always go through the Raft leader for the latest manifest. Updating a file and reading it back immediately returns the new version. No surprises.
12.4 Concurrency, Failure, Cleanup
Optimistic locking: Read version N → edit → submit with expected_version=N. If another client committed N+1, reject with 409 Conflict → re-read, re-apply, retry. No distributed locks needed.
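A sketch of the compare-and-swap commit (the ManifestService class and its method names are illustrative, not a real API):

```python
class ConflictError(Exception):
    """Maps to HTTP 409 Conflict."""

class ManifestService:
    """Optimistic locking: commit succeeds only if expected_version matches."""
    def __init__(self):
        self.version = 0
        self.manifest = []
    def commit(self, new_manifest: list, expected_version: int) -> int:
        if expected_version != self.version:
            raise ConflictError(f"expected {expected_version}, at {self.version}")
        self.manifest, self.version = new_manifest, self.version + 1
        return self.version

svc = ManifestService()
svc.commit(["chunk-a"], expected_version=0)             # version 0 -> 1

# Two clients both read version 1, then race to commit:
svc.commit(["chunk-a", "chunk-b"], expected_version=1)  # winner -> version 2
try:
    svc.commit(["chunk-a", "chunk-c"], expected_version=1)  # loser
    raise AssertionError("should have conflicted")
except ConflictError:
    pass   # loser re-reads version 2, re-applies its edit, retries
svc.commit(["chunk-a", "chunk-b", "chunk-c"], expected_version=2)
assert svc.version == 3
```

No lock is ever held; contention costs a retry, which is cheap because chunks are already durable before any commit attempt.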
Read path: Fetch manifest → get ordered chunk list with presigned GET URLs → fetch chunks in parallel (10-20 concurrent) → reassemble. For range reads, the manifest service computes which chunks cover the byte range and returns only those URLs.
Chunk upload fails? Retry. It's idempotent. Manifest commit fails? Retry. No data lost since the chunks are already durable. Upload dies halfway through? The file hasn't changed because the manifest hasn't been updated. Check which chunks already exist with HEAD, upload the missing ones, commit.
GC: Background scanner finds chunks not referenced by any live manifest. 24-hour grace period, then delete. Aborted uploads cleaned after 7 days.
Security: Users interact with file_id, never raw chunk keys. Chunks bucket denies all access except from manifest service IAM role.
12.5 End-to-End Flow
Read: GET file → manifest returns chunk URLs → parallel fetch (CDN-cached for unchanged chunks) → reassemble.
Delete: Remove manifest pointer → GC identifies unreferenced chunks → delete after grace period.
| Chunk Size | Best For |
|---|---|
| 4MB | Document editing, configs (fine-grained updates) |
| 8MB | General-purpose (recommended default) |
| 16MB | Video, media (fewer fetches, less metadata) |
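The update granularity in the table above can be made concrete. A sketch of fixed-size chunking with per-chunk content hashes, so a new version re-uploads only chunks whose bytes changed; the 64-byte chunk size is for demonstration (the table's sizes apply in practice) and all names are illustrative:

```python
import hashlib

def build_manifest(data, chunk_size):
    """Manifest = ordered list of chunk content hashes."""
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

def changed_chunks(old_manifest, new_manifest):
    """Indices of chunks that must be uploaded for the new version;
    all other entries reuse the old version's chunks (and stay CDN-hot)."""
    return [
        i for i, digest in enumerate(new_manifest)
        if i >= len(old_manifest) or old_manifest[i] != digest
    ]
```

A one-byte edit inside a 512-byte file dirties exactly one 64-byte chunk; the other seven manifest entries are carried over unchanged.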
13. Identify Bottlenecks
| # | Bottleneck | Mitigation |
|---|---|---|
| 1 | Metadata partition hot spots | Auto-split at 200GB or 10K ops/sec. Adaptive partitioning monitors ops/sec per partition. For sequential key patterns, add a hash prefix to distribute writes (e.g., hash(key)[0:2]/key) |
| 2 | EC reconstruction latency (+50-200ms) | Proactive repair, speculative parity reads, SSD cache |
| 3 | Small object IOPS (150/sec/HDD) | Extent packing, SSD read cache, batch reads |
| 4 | LIST on enormous prefixes | Parallel merge, secondary prefix index, max-keys 1000 |
| 5 | LSM compaction write amplification | Rate-limited leveled compaction, NVMe for metadata |
| 6 | PG rebalancing I/O spike | Rate-limited migration (2 PGs/node/hour), off-peak scheduling |
| 7 | GC vs client I/O contention | 20% I/O budget for GC, lower priority class |
| 8 | Cross-AZ EC bandwidth | Place 7/10 data shards in local AZ |
| 9 | Multipart orphan storage leak | Auto-abort after 7 days |
| 10 | 136 PB metadata footprint | Bloom filters, NVMe mandatory, prefix-partitioned scans |
| 11 | Presigned URL abuse | Short expiry (1h default), IP restrictions, audit logging |
| 12 | Raft round-trip overhead | Pipelined Raft, read leases |
| 13 | CDN cache stampede | Request coalescing, stale-while-revalidate |
| 14 | Manifest hot files | Short TTL cache, read replicas, per-file rate limit |
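The hash-prefix trick from bottleneck #1 is small enough to show directly. A sketch, assuming a two-hex-character prefix (256 buckets); the hash choice and key format are illustrative:

```python
import hashlib

def spread_key(key):
    """Prefix a sequential key with 2 hex chars of its hash so writes
    spread across up to 256 range partitions instead of one."""
    prefix = hashlib.md5(key.encode()).hexdigest()[:2]
    return f"{prefix}/{key}"
```

The cost is the flip side of bottleneck #4: a prefix LIST over the original keyspace must now fan out across all 256 hash buckets and merge the results.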
14. Failure Scenarios
| Scenario | Shards Remaining | Impact | Recovery | Data Loss |
|---|---|---|---|---|
| Single drive failure | 13/14 | Zero user impact | Repair in hours: read 10 → RS reconstruct → write to new drive | None |
| Node failure (36 drives) | 13/14 per PG (~72 PGs) | Degraded reads +50ms | Background repair, rate-limited | None |
| Full AZ outage | 9-10/14 | PGs with ≤9 shards temporarily unreadable | Wait for AZ recovery or rebuild. CRUSH must enforce max 4/AZ. | None if max 4/AZ enforced |
| Raft leader failure | N/A | Partition writes stall ~5-10ms | Raft election (pre-vote: 1-2 round trips) | None |
| Metadata corruption | N/A | Corrupted replica removed | New replica catches up from snapshot + Raft log | None (2 healthy replicas) |
| Silent bit rot | 13/14 (corrupted ≈ missing) | Detected by CRC32C on read or scrubber | Reconstruct from 10 good shards | None if <5 corrupted |
| AZ network partition | N/A | Minority-side leaders stall | Leaders step down; majority-side continues | None (Raft guarantee) |
| Multipart failure mid-stream | N/A | Incomplete upload | ListParts → resume from failed part. Auto-abort after 7 days. | None |
| EC engine bug (bad parity) | 10/14 data OK | Reconstruction from parity produces wrong data | Detect via ETag mismatch. Try other shard combinations. Re-encode parity from 10 data shards. | None if data shards intact |
| GC premature deletion | 13/14 | Durability reduced | 24h grace period + weekly reconciliation + soft delete prevents most cases. Background repair restores. | None (EC covers it) |
| Partition split during writes | N/A | ~100-200ms write stall | Automatic; API Router updates partition map | None |
| Reconstruction storm (rack failure, 20 nodes) | 13/14 per PG | Cascading I/O degradation | Rate-limited repair, priority queue, lower ionice, backpressure on repair if p99 spikes | None if ≥10 shards survive |
| CDN origin failure | N/A | Cached: served. Uncached: 503. | DNS health checks reroute to healthy region | None (availability only) |
| Manifest service failure | N/A | Partial update layer down; core storage works | Raft election or manual replica restart | None |
| Rebalancing during failure | 12-13/14 | Dual disruption; repair takes longer | Pause migrations, prioritize repair, resume after stable | None under normal conditions |
| Metadata service fully down | N/A (data intact on disk) | All reads and writes fail. Data exists on storage nodes but nobody can find it (no index). CDN serves cached objects until cache expires. | Raft election restores within seconds for single partition. Full metadata outage (all partitions) requires cluster-level recovery from Raft snapshots. Data on storage nodes is never lost. | None (availability, not durability) |
When Is Data Actually Lost?
Only when 5+ of 14 shards for the same object are destroyed before repair completes.
Probability: With 2% annual drive failure rate and 6-hour repair time:
- P(1 drive fails in 6h) ≈ 1.4 × 10⁻⁵
- P(5 of 14 fail before repair) ≈ 10⁻²¹ per object per cycle
- Across 100T objects annually: ~10⁻⁴ objects lost/year = 1 object per 10,000 years
That is 11 nines.
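The arithmetic above can be reproduced in a few lines, under the stated assumptions (2% annual drive failure rate, 6-hour repair window, loss only when 5+ of 14 shards die before repair completes):

```python
from math import comb

AFR = 0.02            # annual drive failure rate
REPAIR_HOURS = 6      # repair window
HOURS_PER_YEAR = 8760

# P(one drive fails within a repair window)
p_drive = AFR * REPAIR_HOURS / HOURS_PER_YEAR          # ~1.4e-5

# P(5 of 14 fail in the same window); dominant binomial term only
p_loss_cycle = comb(14, 5) * p_drive ** 5              # ~1e-21 per object

cycles_per_year = HOURS_PER_YEAR / REPAIR_HOURS        # 1460
objects = 100e12
lost_per_year = p_loss_cycle * cycles_per_year * objects  # ~1e-4
```

`1 / lost_per_year` comes out around 7,000 years, matching "1 object per 10,000 years" at order of magnitude.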
| Risk Factor | Mitigation |
|---|---|
| Correlated failures (bad drive batch) | Diversify vendors, spread same-batch across racks |
| Slow repair (overloaded cluster) | Prioritize repair I/O, pre-provision capacity |
| Silent corruption | Scrubber every 2 weeks, CRC on every read |
| Firmware bug | Verify shards on read-back before ACK |
| Natural disaster | Cross-region replication |
15. Deployment Strategy
15.1 Multi-Region Deployment
| Region | API Routers | Data Routers | Metadata | Storage | Total |
|---|---|---|---|---|---|
| us-east-1 | 800 | 600 | 5,500 | 8,000 | 14,900 |
| us-west-2 | 600 | 450 | 4,000 | 6,000 | 11,050 |
| eu-west-1 | 600 | 450 | 4,100 | 5,500 | 10,650 |
15.2 Technology Stack
| Layer | Technology |
|---|---|
| CDN / Edge | CloudFront / Fastly |
| API Routers | Go, stateless, K8s |
| Data Routers | Rust, ISA-L, K8s |
| Metadata | Go, Raft + LSM-tree |
| Storage Nodes | Rust, direct I/O, extents |
| Manifest Service | Go, PostgreSQL / Raft KV |
| Coordination | etcd (3-node/region) |
| Observability | Prometheus + Grafana + Jaeger |
| Events | Kafka |
15.3 Canary Deployment
5% rollout, 30-minute soak. Auto-rollback if: PUT success <99.9%, GET p99 +20%, reconstruction rate 2x, Raft commit +50%, CDN hit ratio -10%.
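The rollback gate above is a pure predicate over two metric snapshots. A sketch with illustrative field names; "CDN hit ratio -10%" is read here as 10 percentage points, which is an assumption:

```python
def should_rollback(baseline, canary):
    """True if any canary auto-rollback trigger fires. Baseline and
    canary are metric snapshots from the 30-minute soak."""
    return (
        canary["put_success_rate"] < 0.999                              # PUT success <99.9%
        or canary["get_p99_ms"] > baseline["get_p99_ms"] * 1.20         # GET p99 +20%
        or canary["reconstruct_rate"] > baseline["reconstruct_rate"] * 2  # 2x reconstruction
        or canary["raft_commit_ms"] > baseline["raft_commit_ms"] * 1.50  # Raft commit +50%
        or canary["cdn_hit_ratio"] < baseline["cdn_hit_ratio"] - 0.10    # hit ratio -10pp
    )
```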
Storage node upgrades: Rack-by-rack rolling (drain → upgrade → smoke test → re-enable). 19,500 nodes across ~540 racks at 20 min/rack = ~7.5 days full cluster.
16. Observability
16.1 Key SLIs
```
# Latency
put_latency_ms{quantile="0.99"}
get_latency_ms{quantile="0.99"}
get_first_byte_latency_ms{quantile="0.99"}
list_latency_ms{quantile="0.99"}

# CDN
cdn_cache_hit_ratio{region}
cdn_origin_fetch_latency_ms{quantile="0.99"}

# Erasure coding
ec_reconstruct_count_total
ec_reconstruct_latency_ms{quantile="0.99"}

# Health
shard_repair_backlog
pgs_degraded_total
gc_backlog_bytes
gc_bytes_reclaimed_per_hour
raft_commit_latency_ms{quantile="0.99"}
raft_leader_elections_total
drive_utilization_percent{node_id}
cross_az_bandwidth_bytes_per_sec{src_az, dst_az}
```
16.2 Critical Alerts
| Alert | Condition | Severity |
|---|---|---|
| PUT success rate drop | <99.9% for 5 min | P1 |
| Reconstruction rate spike | >5% of GETs | P2 |
| Raft election storm | >10/min for a partition | P1 |
| Drive failure spike | >20/day (vs baseline 5.5) | P1 |
| GC backlog growth | Increasing for >24h | P2 |
| PG degradation | >50 degraded PGs | P1 |
| Raft replication lag | >5 seconds | P1 |
| CDN hit ratio drop | <50% for 15 min | P2 |
16.3 Distributed Tracing
Every request gets a trace ID at the API Router. Follows through: CDN → API Router → Metadata → Data Router → Storage Nodes → Raft. Sampling: 0.1% normal, 100% for errors or p99 breaches. Jaeger with 7-day retention.
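The sampling rule reduces to a few lines. A sketch; the p99-breach check and parameter names are illustrative:

```python
import random

def keep_trace(had_error, duration_ms, p99_ms, rng=random.random):
    """Keep 100% of traces with an error or a p99 latency breach,
    0.1% of everything else."""
    if had_error or duration_ms > p99_ms:
        return True
    return rng() < 0.001
```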
Example degraded GET trace:
```
TraceID: 7a8b9c0d...  Duration: 127ms
  cdn_edge.check: 3ms (MISS)
  api_router.authenticate: 2ms
  metadata_service.lookup: 6ms (partition 87432)
  data_router.fetch_shards: 112ms
    9/10 shards in 15ms, shard 1 TIMEOUT at 50ms
    parity shard 10: 12ms, RS reconstruct: 0.3ms
```
Dashboards: executive (rate, errors, latency, cost), API layer, metadata/Raft, data/EC, storage/drives, network, CDN.
17. Security
17.1 IAM and Policy Engine
Every request: API Router → IAM engine with (principal, action, resource). Evaluates IAM policies (user-level) → bucket policies (resource-level) → ACLs (legacy). Explicit DENY wins, then ALLOW, default DENY. In-memory policy cache on each API Router, refreshed every 5 minutes.
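The evaluation order, as a sketch: collect matching statements from all three policy sources, then apply explicit-DENY-wins, then ALLOW, then default DENY. The statement shape is illustrative, not the real policy language:

```python
def evaluate(statements, principal, action, resource):
    """statements: merged list from IAM policies, bucket policies, and
    ACLs, each {"principal", "action", "resource", "effect"} where any
    field may be the wildcard "*"."""
    matched = [
        s["effect"] for s in statements
        if s["principal"] in (principal, "*")
        and s["action"] in (action, "*")
        and s["resource"] in (resource, "*")
    ]
    if "DENY" in matched:
        return "DENY"    # explicit DENY always wins
    if "ALLOW" in matched:
        return "ALLOW"
    return "DENY"        # implicit default: deny
```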
17.2 Encryption
Three-level key hierarchy: Master Key (HSM, annual rotation) → Bucket Key (cached 24h, reduces KMS calls from 50K/sec to 1/bucket/day) → Object Data Key (unique per object, AES-256-GCM).
- SSE-Default: Service-managed keys, transparent.
- SSE-KMS: Customer-managed, every use audited, revocable.
- SSE-C: Customer-provided per request, not stored.
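The middle tier of the hierarchy is what makes the KMS call reduction work. A sketch of the bucket-key cache, with a 24h TTL as stated; `kms_generate_key` is a stand-in for the real KMS call, and per-object data-key derivation is omitted:

```python
import time

TTL = 24 * 3600  # bucket key cache lifetime from the text

class BucketKeyCache:
    def __init__(self, kms_generate_key):
        self.kms = kms_generate_key
        self.cache = {}      # bucket -> (key, fetched_at)
        self.kms_calls = 0   # instrumentation: calls that hit KMS

    def get(self, bucket, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(bucket)
        if entry and now - entry[1] < TTL:
            return entry[0]              # warm: no KMS round trip
        self.kms_calls += 1
        key = self.kms(bucket)
        self.cache[bucket] = (key, now)
        return key
```

A hot bucket taking thousands of PUTs per second costs one KMS call per day instead of one per object, which is where "50K/sec to 1/bucket/day" comes from.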
17.3 Transport and Access Control
Auth: HMAC-SHA256 with 15-minute replay window. Transport: TLS 1.3 client-to-edge, mTLS service-to-service.
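A minimal sketch of HMAC-SHA256 signing with the replay window; the canonical string format here is illustrative, not the real wire format:

```python
import hashlib
import hmac
import time

WINDOW = 15 * 60  # replay window from the text

def sign(secret, method, path, timestamp):
    """Sign a canonical string of (method, path, timestamp)."""
    msg = f"{method}\n{path}\n{timestamp}".encode()
    return hmac.new(secret, msg, hashlib.sha256).hexdigest()

def verify(secret, method, path, timestamp, signature, now=None):
    now = time.time() if now is None else now
    if abs(now - timestamp) > WINDOW:
        return False  # stale: replayed or badly skewed clock
    expected = sign(secret, method, path, timestamp)
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(expected, signature)
```

Because the timestamp is inside the signed string, an attacker cannot refresh a captured signature by changing the timestamp; they can only replay it verbatim, and only within the window.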
Data integrity: CRC32C per shard, ETag (MD5) end-to-end, Content-MD5 header optional.
Block Public Access: ON by default. Object Lock (WORM): Governance mode (override with IAM) and Compliance mode (no one deletes until retention expires). MFA Delete: Requires MFA for permanent deletion.
Audit logging: Every API call logged with: principal, action, resource, source IP, timestamp, request-id, response status. Logs are immutable, retained for compliance (configurable, default 90 days), and queryable for forensic analysis and regulatory audits.
Network segmentation: Storage nodes on isolated network (no internet). API Routers in DMZ. Each component has minimum required permissions.
Explore the Technologies
Core Technologies
| Technology | Role | Learn More |
|---|---|---|
| FoundationDB | Metadata store: ordered KV, strong consistency | FoundationDB |
| RocksDB | LSM-tree engine for metadata nodes | RocksDB |
| gRPC | Internal RPC with protobuf | gRPC |
| etcd | Placement coordination, cluster membership | etcd |
| Prometheus + Grafana | Metrics and dashboards | Prometheus |
| Kafka | Event notifications | Kafka |
| MinIO | S3-compatible private cloud object storage | MinIO |
| Ceph | Exabyte-scale storage with CRUSH placement | Ceph |
| SeaweedFS | Simple small-file optimized storage | SeaweedFS |
Infrastructure Patterns
| Pattern | Role | Learn More |
|---|---|---|
| Object Storage | The system being designed | Object Storage |
| Replication & Consistency | Raft for metadata, EC for data | Replication |
| Database Sharding | 1.36M metadata partitions | Sharding |
| Hot/Warm/Cold Tiering | Storage class transitions | Tiering |
| CDN & Edge | CDN-signed URLs, OAC, WAF | CDN |
| Distributed Tracing | End-to-end request tracing | Tracing |
| Erasure Coding | Reed-Solomon math and shard recovery | Erasure Coding |
| CRUSH Placement | Topology-aware shard placement (PG → nodes) | CRUSH |
Further Reading
- Reed-Solomon Error Correction (Plank, 2013) — definitive RS tutorial with worked examples
- CRUSH: Controlled, Scalable, Decentralized Placement (Weil et al., SC 2006) — the Ceph placement algorithm
- Finding a Needle in Haystack (Facebook, OSDI 2010) — small object packing at scale
- ISA-L: Intel Storage Acceleration Library — hardware-accelerated erasure coding with SIMD
- Windows Azure Storage (Calder et al., SOSP 2011) — blob storage architecture with erasure coding
- Dynamo: Amazon's Highly Available Key-Value Store (DeCandia et al., SOSP 2007) — consistent hashing and quorum concepts
- The Google File System (Ghemawat et al., SOSP 2003) — chunk-based distributed storage
Reed-Solomon (10+4) targets 11 nines at 1.4x overhead. CDN serves reads via CDN-signed URLs and OAC, not storage presigned URLs. Control plane (1.36M Raft groups) runs separately from the data plane (stateless Data Router fleet, 19,500 storage nodes). Immutability removes locking, simplifies caching, and eliminates replication conflicts. Storage presigned URLs go direct to storage for uploads. File-level edits live in an application-layer manifest service on top. The metadata layer is the hardest part of the design, and the part that matters most.