System Design: Object Storage (Erasure Coding, Flat Namespace, and Exabyte Scale)
Goal: Build a distributed object storage platform storing 100+ trillion objects across exabytes of raw data. Support PutObject, GetObject, DeleteObject, ListObjectsV2, multipart uploads up to 5TB, versioning, storage class transitions, and presigned URLs. Target 11 nines of durability (99.999999999%) using Reed-Solomon erasure coding (10+4) combined with fast repair, multi-AZ isolation, and failure independence. Strong read-after-write consistency. Flat namespace. CDN-accelerated reads. Clear control plane / data plane separation.
Mental model (read this first):
- Metadata Service: object key -> placement group (PG) ID + version + ETag. Knows what exists, not where it lives.
- CRUSH: PG ID -> 14 storage nodes. Computed locally by any component using the cluster map. No central lookup.
- Storage Nodes: store erasure-coded shards in extent files on disk. An object is split into shards (10 data + 4 parity). Each shard is appended into a large extent file (a 256MB container holding many shards sequentially).
- Cluster Map: shared topology tree (AZ -> rack -> node -> disk with weights). Updated by control plane, cached everywhere. What makes CRUSH deterministic.
- Immutability: PUT creates new shards. Old shards untouched until GC. No locking, no conflicts.
System flow in 5 steps:
- PUT arrives -> metadata assigns a PG ID
- CRUSH computes which 14 nodes hold this PG's shards
- Object is erasure-coded into 14 shards
- Shards are appended into extent files on those storage nodes
- Metadata commits via Raft -> object is visible
TL;DR: Reed-Solomon (10+4) splits each object into 10 data + 4 parity shards placed across 14 storage nodes in different failure domains via CRUSH. This targets 11 nines of durability at 1.4x storage overhead, saving ~$32M/month/EB vs triple replication. The control plane manages ~136 PB of metadata split into 1.36M Raft partitions for strong consistency. The data plane is 19,500 storage nodes (14 EB physical storage). CRUSH deterministically maps each object to its 14 nodes using cluster topology. Small objects pack into 256MB extent files. Objects are immutable. Flat namespace with sorted LSM-tree keys gives near-constant-time lookups and efficient prefix scans. For CDN-served private content, CDN-signed URLs with Origin Access Control are the correct pattern. For file-level edits, a manifest service (application-layer) tracks chunk-to-file mappings on top of the immutable blob store.
1. The Core Problems
Problem: Durability at 11 nines. A typical data center runs around 100K drives. At a 2% annual failure rate, roughly 5.5 drives fail daily. Failures are constant, not rare. Storage costs about $0.02/GB/month (~$20M per EB). Triple replication stores 3 EB per EB of logical data, costing ~$60M/month. Reed-Solomon (10+4) stores 1.4 EB, costing ~$28M/month. That saves ~$32M/month and tolerates up to 4 shard failures instead of 2.
Problem: Flat namespace at trillion scale. No directories. The key "photos/2024/vacation/beach.jpg" is opaque. LIST with prefix=photos/2024/ and delimiter=/ simulates directories via efficient range scans over sorted, partitioned keys. This avoids metadata hotspots and distributed locking entirely.
Problem: Strong consistency. PUT then GET must always return the latest version. This requires Raft consensus at every metadata write.
Problem: Partial updates on immutable blobs. There is no PATCH, no append. Editing 10MB in a 1GB file means re-uploading the entire object. The workaround is a manifest-based chunking layer built on top of the storage system. This is an application concern, not a storage concern.
Scale: 100T+ objects, exabytes of data, 350K req/sec peak, 5TB max object size.
2. Requirements
Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-1 | PutObject: upload objects up to 5TB | P0 |
| FR-2 | GetObject: retrieve objects with byte-range reads | P0 |
| FR-3 | DeleteObject: soft delete (versioning) or hard delete (version-id) | P0 |
| FR-4 | ListObjectsV2: prefix + delimiter + pagination over trillions of objects | P0 |
| FR-5 | Multipart upload: create, upload-part (up to 10,000 parts), complete, abort | P0 |
| FR-6 | Presigned URLs: time-limited signed URLs for direct client-to-storage transfers | P0 |
| FR-7 | Bucket versioning: every PUT creates a new version, recoverable deletes | P1 |
| FR-8 | Storage class transitions: Standard, IA, Glacier Instant/Flexible, Deep Archive, Intelligent-Tiering | P1 |
| FR-9 | Lifecycle policies: automatic transition and expiration | P1 |
| FR-10 | CopyObject: server-side copy without re-upload | P1 |
| FR-11 | Object tagging for lifecycle rules and access policies | P1 |
| FR-12 | Bucket policies and IAM: resource-level and user-level access control | P0 |
| FR-13 | Server-side encryption: SSE-Default, SSE-KMS, SSE-C | P0 |
| FR-14 | Event notifications on object create/delete/transition | P1 |
| FR-15 | Cross-region replication: event-driven async replication with version ordering | P2 |
| FR-16 | CDN integration: edge caching via CDN-signed URLs, WAF/DDoS protection | P1 |
| FR-17 | Automatic rebalancing and repair on node addition/removal/failure | P0 |
Non-Functional Requirements
| Property | Target |
|---|---|
| Durability | 99.999999999% (11 nines) |
| Availability | 99.99% (52.6 min downtime/year) |
| PUT latency (p99, <1MB) | < 200ms |
| GET latency (p99, first byte, <1MB) | < 100ms |
| LIST latency (p99, 1000 results) | < 500ms |
| Total objects | 100T+ |
| Total logical storage | Exabytes |
| Maximum object size | 5TB |
| Peak request rate | 350K/sec |
| Storage overhead | < 1.5x (erasure coding) |
| Consistency | Strong read-after-write |
| CDN cache hit ratio | >= 60% for read-heavy workloads |
3. Design Principles
Objects are immutable. There is no PATCH endpoint. No append operation. No in-place modification. PUT always creates a new version with entirely new shards. Old shards sit untouched until GC reclaims them (or are kept forever if versioning is enabled). This removes the need for distributed locking, makes caching straightforward (ETags never go stale), and eliminates conflict resolution from replication. If two clients upload the same key simultaneously, they each create independent shards on independent nodes. Raft orders the metadata commits to decide which version is "latest", but the data writes never conflict.
Separate control plane from data plane. Metadata (Raft consensus) and data (quorum writes, erasure coding) scale independently. Mixing them creates a bottleneck that caps both.
No central bottleneck. The Data Router is a stateless fleet. Placement uses CRUSH (deterministic, no central authority). Metadata is partitioned across 1.36M Raft groups.
Push data to the edge. CDN caches content via CDN-signed URLs and Origin Access Control. Storage presigned URLs transfer bytes directly to storage nodes for uploads. API servers handle auth and metadata, not terabits of data.
Failure is routine. At 5.5 drive failures per day, failure is not an exception. Write with quorum (11/14, so up to 3 slow nodes don't block), read with speculative parity fetches, repair in the background. One slow node never blocks a user request.
Flat namespace is an advantage. No directory tree means no distributed lock coordination for renames. Sorted keys in LSM-trees give near-constant-time point lookups (O(log N) technically, but with bloom filters and block caches it behaves like O(1) in practice) and efficient range scans.
4. Erasure Coding
Problem: Storing 3 copies of every byte wastes money and tolerates fewer failures than the math allows. At exabyte scale, the difference between 3x and 1.4x overhead is ~$32M/month.
Simple example: A 10MB photo is split into 10 slices of 1MB each (data shards). Four additional 1MB slices are computed from the originals using Reed-Solomon math (parity shards). The 14 shards are stored on 14 different machines. To reconstruct the photo, any 10 of the 14 shards suffice. Four machines can fail simultaneously without data loss.
Mental model: A book is photocopied 3 times (3x replication) — lose 2 copies and the book is gone. Alternatively, the book is split into 10 chapters, and 4 "cheat sheets" are created that can regenerate any missing chapters from the remaining ones. The cheat-sheet approach stores less total paper and survives more losses.
| | 3x Replication | 10+4 Erasure Coding |
|---|---|---|
| Physical storage (1 EB logical) | 3 EB | 1.4 EB |
| Monthly cost ($0.02/GB) | $60M | $28M |
| Fault tolerance | 2 failures | 4 failures |
Encoding is pipelined. The Data Router processes 1MB stripes as they arrive over the network, encoding and shipping shards in parallel with receiving the next stripe. It never buffers the full object before starting. ISA-L does ~10 GB/sec encoding per core with AVX-512 acceleration. In most deployments, EC CPU cost is small relative to disk and network I/O.
The trade-off: write latency increases (parity computation before writing) and reconstruction is more expensive (read 10 shards and decode). But the cost savings at scale are overwhelming.
5. Object Placement: Placement Groups and CRUSH
Problem: Managing placement per object does not scale at 100T objects. Adding a single node with naive consistent hashing reshuffles data across the entire cluster. The system needs a way to reason about groups of objects, not individual objects, and to limit rebalancing scope when topology changes.
Simple example: Instead of deciding where each of 100 trillion objects lives individually, objects are grouped into 100,000 buckets (placement groups). Each PG maps to a fixed set of 14 storage nodes. When a node is added, only a small subset of PGs are reassigned — controlled, bulk data movement instead of a full reshuffle.
Mental model: A post office sorts mail not by individual address but by ZIP code. Each ZIP code maps to a carrier. Adding a new carrier means reassigning a few ZIP codes, not re-sorting every letter.
pg_id = hash(bucket_id + "/" + object_key) % 100,000
Each PG maps to 14 storage nodes via CRUSH (Controlled Replication Under Scalable Hashing). CRUSH is a deterministic, topology-aware placement algorithm that selects nodes based on cluster layout (racks, AZs) and capacity weights, ensuring shards distribute across failure domains.
The placement map is computed locally by any component — no central lookup. It changes only when cluster topology changes (node join, leave, or failure), minimizing coordination overhead.
Source of truth: The PG ID stored in metadata is the authoritative reference. CRUSH + cluster map derive node locations from that PG ID. The shard_map field in metadata is a cached optimization for faster reads and may be stale or recomputed.
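To make the two-level mapping concrete, here is a minimal Go sketch. The PG hash matches the formula above; SelectNodes stands in for CRUSH using rendezvous (highest-random-weight) hashing, which shares the property that matters — any component holding the same node list computes the same answer, no central lookup — while omitting real CRUSH's topology walk and capacity weights. Names and hash choices are illustrative, not the production algorithm.

```go
package placement

import (
	"fmt"
	"hash/fnv"
)

const numPGs = 100000

// PGID maps an object to its placement group -- the only placement
// fact the metadata service has to store.
func PGID(bucketID, objectKey string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(bucketID + "/" + objectKey))
	return h.Sum32() % numPGs
}

// SelectNodes stands in for CRUSH: deterministically pick `count`
// distinct nodes for a PG via rendezvous hashing. Real CRUSH also
// walks the AZ -> rack -> node tree so picks land in distinct
// failure domains.
func SelectNodes(pgID uint32, nodeIDs []uint64, count int) []uint64 {
	type scored struct {
		node  uint64
		score uint64
	}
	scores := make([]scored, 0, len(nodeIDs))
	for _, n := range nodeIDs {
		h := fnv.New64a()
		fmt.Fprintf(h, "%d/%d", pgID, n)
		scores = append(scores, scored{node: n, score: h.Sum64()})
	}
	// Partial selection sort: move the `count` highest scores to the front.
	for i := 0; i < count && i < len(scores); i++ {
		best := i
		for j := i + 1; j < len(scores); j++ {
			if scores[j].score > scores[best].score {
				best = j
			}
		}
		scores[i], scores[best] = scores[best], scores[i]
	}
	picked := make([]uint64, 0, count)
	for i := 0; i < count && i < len(scores); i++ {
		picked = append(picked, scores[i].node)
	}
	return picked
}
```

Because the scores depend only on (pgID, nodeID), adding a node changes placement only for the PGs where the new node wins, which is the same bounded-movement property CRUSH provides.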
6. Metadata Architecture
Problem: 100 trillion objects need an index that supports sub-10ms point lookups, efficient prefix range scans, versioned entries, and strong consistency. A single database node cannot hold this. Hash-partitioned stores (like Cassandra or DynamoDB) cannot do cross-partition range scans efficiently — scanning a prefix across thousands of partitions requires a full table scan.
Simple example: Consider listing all objects under photos/2024/. With range-partitioned keys in sorted order, this is a forward scan starting at photos/2024/ and stopping at photos/2024/\xff. With hash-partitioned keys, the prefix is scattered across thousands of partitions and every partition must be scanned.
Mental model: A library organized by call number (range-partitioned) lets a librarian walk to the "P" section and scan forward. A library organized by random shelf assignment (hash-partitioned) requires checking every shelf.
Each object needs roughly 500 bytes to 1KB of metadata (key ~200B, version ID 16B, size/etag/content-type ~80B, ACL/timestamps ~60B, user metadata ~100B, shard map ~225B), averaging ~700 bytes. At 100T objects: ~70 PB raw metadata. With LSM-tree overhead (bloom filters, block indexes, write-ahead logs): ~136 PB.
Partitioning: ~1.36M Raft partitions at 100GB each (small enough for fast Raft snapshots, large enough to keep partition count manageable). 3 replicas per partition. Spread across ~13,600 metadata nodes, each hosting ~100 partitions.
Key format: /{bucket_id}/{object_key}\0{version_id}. Prefixing with bucket_id puts all objects in the same bucket in a contiguous key range, so LIST with a prefix scans forward from that point. The \0 null byte separates the object key from the version. Version IDs use bit-flipped timestamps (XOR each byte with 0xFF) so that newer versions produce smaller byte values and sort first in a forward scan. GET without a version_id reads the first entry it finds — always the newest.
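A sketch of the version-ID encoding, assuming version IDs begin life as a big-endian nanosecond timestamp (the 16-byte width and the uniqueness padding are illustrative; the source specifies only the bit flip):

```go
package metadata

import (
	"bytes"
	"encoding/binary"
	"time"
)

// NewVersionID returns a 16-byte version ID whose bytes are the
// bit-flipped big-endian timestamp: a later timestamp produces
// smaller bytes, so a forward scan over /{bucket}/{key}\0{version}
// hits the newest version first.
func NewVersionID(t time.Time) []byte {
	var buf [16]byte
	binary.BigEndian.PutUint64(buf[:8], uint64(t.UnixNano()))
	// The remaining 8 bytes could carry a node ID / sequence number
	// for uniqueness; left zero here.
	for i := range buf {
		buf[i] ^= 0xFF // the bit flip
	}
	return buf[:]
}

// Newer reports whether version a is newer than b under this
// encoding: smaller bytes sort first, and first == newest.
func Newer(a, b []byte) bool {
	return bytes.Compare(a, b) < 0
}
```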
What metadata stores vs what CRUSH computes: Metadata stores the PG ID for each object, not node locations. Node placement is derived at read/write time by running CRUSH against the cluster map.
Consistency model: All metadata writes go through the Raft leader. Reads are served from the leader (strict consistency) or follower replicas with lease-based staleness bounds (lower latency). Leader reads cost ~2-5ms, but PUT-then-GET always returns the latest version.
LSM-tree internals: Each metadata partition uses a log-structured merge tree. Writes go to an in-memory memtable backed by a write-ahead log (WAL). When the memtable fills, it flushes to an immutable sorted file (SSTable) on disk. Background compaction merges SSTables to bound read amplification. Bloom filters attached to each SSTable let point lookups skip irrelevant files with high probability. Block indexes within each SSTable enable binary search to the exact key. The result: writes are always sequential (fast), reads check the bloom filter first (skip most files), then binary-search into the correct SSTable.
| Metadata Store Option | Consistency | Range Scans |
|---|---|---|
| FoundationDB | Strong (Paxos) | Yes |
| TiKV | Strong (Raft) | Yes |
| CockroachDB KV | Strong (Raft) | Yes |
7. Storage Classes
| Class | Latency | $/TB/mo | Retrieval $/GB | Use Case |
|---|---|---|---|---|
| Standard | ms | $23.00 | $0.00 | Active data |
| Infrequent Access | ms | $12.50 | $0.01 | Backups |
| Glacier Instant | ms | $4.00 | $0.03 | Archives needing instant access |
| Glacier Flexible | min-hrs | $3.60 | $0.01-0.03 | Compliance archives |
| Deep Archive | 12-48h | $0.99 | $0.02 | Legal/regulatory (tape-backed) |
| Intelligent-Tiering | ms | auto | $0.00 | Unknown access patterns |
Standard to IA is a billing change (data stays put). Glacier uses denser HDD arrays with higher EC ratios like 16+4 (1.25x overhead, since access is rare). Deep Archive uses tape-backed systems where retrieval takes 12-48 hours because physical media must be loaded by robotic arms.
8. CDN and Edge Layer
Peak read traffic is 250K GETs/sec x 100KB = 25 GB/sec of egress. CDN offloads 60-80% of reads, cutting origin load to ~7.5 GB/sec.
Objects are immutable, so CDN caching is safe by design. But storage presigned URLs are NOT CDN-friendly: each URL has a unique signature, so CDN sees every request as a different cache key (always a miss). For private content through CDN, CDN-signed URLs with Origin Access Control are the correct pattern.
WAF at CDN edge handles per-IP/bucket/account rate limiting, geo-restrictions, and enumeration attack detection.
9. Control Plane vs Data Plane
| | Control Plane | Data Plane |
|---|---|---|
| Does | Decisions: placement, versioning, access | Bytes: encode, transfer, store |
| Components | Metadata Service, Raft, CRUSH, IAM, Lifecycle/GC | Data Router Fleet (stateless), Storage Nodes |
| Consistency | Strong (Raft) | Quorum (11/14 shards) |
The Data Router is a logical role, not necessarily a separate fleet. It handles erasure coding and shard distribution. In the API path, it runs as a separate stateless fleet (~1,500 instances). In the presigned path, the storage node takes on this role directly. The role is the same regardless of where the code runs: receive bytes, encode, distribute shards, coordinate the metadata commit.
10. Architecture Overview
Three parallel paths exist. Downloads go through CDN. Uploads go through the API path or presigned URLs. All three paths converge at the same storage layer with identical placement, erasure coding, and durability guarantees. Only the access pattern differs.
CDN Download Path
For reads with CDN-signed URLs. CDN talks directly to the storage origin. No API Router, no Metadata Service, no Data Router in this path.
Step by step: The client never talks to the storage system directly — it talks to the CDN. The backend generates a CDN-signed URL (using a CDN key pair, separate from storage credentials). The CDN edge validates the signature, checks expiry, and enforces access policies. On cache hit (60-80% of reads), the response is immediate. On cache miss, CDN fetches from the storage origin using its OAC credential. The storage origin calls the metadata service internally (PG ID, shard locations), fetches 10 data shards in parallel, reconstructs via Reed-Solomon if any shards are missing, and returns the assembled bytes. CDN caches and serves.
API Upload Path
Full controlled path. Every component involved: auth, metadata, encoding, storage, consensus.
Step by step: The client sends PUT /bucket/key with HMAC-SHA256 signature. Request passes through WAF and L7 load balancer. API Router validates the HMAC signature, calls IAM to check PutObject permission. Metadata Service hashes bucket+key to a PG ID, returns 14 storage node addresses via CRUSH. Data Router processes data in 1MB stripes: split into 10 data shards, compute 4 parity shards via Reed-Solomon, pipeline encoding with receiving. All 14 shards are sent to storage nodes in parallel. Each node appends the shard to an extent file, computes CRC32C, fsyncs, and ACKs. Once quorum succeeds (11/14), metadata commits via Raft (2/3 replicas confirm). Only after both shard quorum AND Raft commit succeed does the client get 200 OK.
If shards write successfully but the Raft commit fails (e.g., leader election), those shards become orphans. GC cleans them up within 24 hours. The client gets an error and retries.
Presigned Upload Path
Direct path for large uploads. Zero bytes through the API fleet.
Step by step: The client first obtains a presigned URL from the API server (normal authenticated request). The API server signs the URL with the storage system's secret key, embedding bucket, key, operation, expiry, and conditions. The client uploads directly to the storage node. The storage node validates the presigned URL by reconstructing the "string to sign" from the request, looking up the secret key (cached locally), and comparing HMAC-SHA256 signatures. No database lookup, no session state. The storage node then runs CRUSH lookup, Reed-Solomon encoding, distributes shards (1 local + 13 to peers), waits for quorum, and commits metadata via Raft.
The storage node temporarily takes on the Data Router's role. Once the upload completes, it returns to being a regular storage node.
Background Systems
Always running, not tied to any request path:
- Repair / Rebalance: Detects node/drive failures via heartbeat timeout (~30 seconds), reads surviving shards, reconstructs missing shards using Reed-Solomon, writes to healthy nodes. Also handles rebalancing when nodes are added or removed.
- Garbage Collector: Cleans up shards no longer referenced by any live object (overwrites, deletes, aborted multipart, failed writes). 24-hour grace period before physical deletion.
- Scrubber: Reads every shard on every drive periodically (full scan every 2 weeks), computes CRC32C, compares to stored checksum. Detects silent bit rot before a client reads corrupted data.
- Lifecycle Manager: Evaluates lifecycle rules (e.g., "move to Glacier after 90 days"). Triggers storage class transitions by re-encoding shards with the target EC scheme and moving to the appropriate tier.
Component Reference
| Component | Responsibility |
|---|---|
| CDN Edge | Caches objects for reads. Only involved in downloads. Uses CDN-signed URLs (not storage presigned URLs). |
| WAF / DDoS | Filters malicious requests, rate limiting. In front of the API path. |
| API Router Fleet | Stateless servers: validate HMAC signatures, route to internal services. Horizontally scalable, no state. |
| IAM Engine | Evaluates IAM policies + bucket policies. Returns allow or deny. |
| Metadata Service | Maps object key -> PG ID + version + ETag + shard locations. Does NOT decide which nodes hold shards (that is CRUSH). |
| Raft Groups (1.36M) | Consensus protocol ensuring metadata writes are durable and consistent. 3 replicas per partition, writes confirmed after 2/3 agree. |
| CRUSH Placement | Deterministic algorithm: given a PG ID, returns 14 storage nodes across failure domains. Runs locally, no central server. |
| Data Router Fleet | Stateless servers: receive object bytes, RS encode 10+4 shards, distribute to correct storage nodes. Only in the API path. |
| Storage Nodes | Machines with 36 HDDs each. Store shards in extent files, verify checksums on read, report health via heartbeats. In the presigned path, temporarily act as Data Router. |
Path Comparison
| Step | API Upload | Presigned Upload | CDN Download |
|---|---|---|---|
| Entry | WAF -> LB -> API Router | Storage Node directly | CDN Edge |
| Auth | API Router + IAM | Storage Node validates signature locally | CDN validates CDN-signed URL at edge |
| Placement | Data Router calls CRUSH | Storage Node calls CRUSH | Handled internally by storage origin |
| EC encode | Data Router | Storage Node (acts as Data Router) | N/A (read path) |
| Shard write | Data Router -> 14 nodes (quorum 11/14) | Storage Node -> 13 peers + 1 local (quorum 11/14) | N/A |
| Metadata | Shard quorum -> Raft commit -> ACK | Shard quorum -> Raft commit -> 200 OK | Storage origin handles lookup internally |
11. Scale Estimation
Request Throughput
Daily requests: 10B -> ~115K/sec average, ~350K/sec peak (3x)
GET 70%: 250K/sec
PUT 15%: 50K/sec
LIST 10%: 35K/sec
DELETE 5%: 15K/sec
Storage
100T objects x 100KB avg = 10 EB logical
10+4 EC (1.4x) = 14 EB physical
19,500 nodes x 36 drives x 20TB = 14 EB
700,000 total drives
Metadata
~700 bytes/object x 100T = ~70 PB, with LSM overhead: ~136 PB
~136 PB / 100GB per partition = ~1.36M partitions
13,600 metadata nodes (100 partitions each)
Bandwidth
PUT ingress: 50K/sec x 100KB = 5 GB/sec
GET egress: 250K/sec x 100KB = 25 GB/sec (CDN offloads 70% -> 7.5 GB/sec origin)
Parity write overhead: 5 GB/sec x (4/10) = 2 GB/sec extra
Router Nodes
API Routers: 350K peak req/sec / ~175 req/sec per node = ~2,000 nodes
Data Routers: 50K PUTs/sec / ~33 PUTs/sec per instance = ~1,500 nodes
Total router nodes: ~3,500
Summary
| Resource | Count |
|---|---|
| Storage nodes | ~19,500 |
| Total drives | ~700,000 |
| Physical storage | 14 EB |
| Metadata nodes | ~13,600 |
| Raft partitions | ~1,360,000 |
| Placement groups | ~100,000 |
| API/Data Router nodes | ~3,500 |
| CDN PoPs | 200+ |
| Peak req/sec | ~350,000 |
Cost breakdown:
Capex:
Storage nodes: 19,500 x $15K = $292M
Metadata nodes: 13,600 x $8K = $109M
Router nodes: 3,500 x $5K = $18M
Networking: $50M
Total capex: $469M
Opex (annual):
Power + cooling: 36,600 nodes x 500W x $0.10/kWh x 8,760h = ~$16M
Operations (staff, facilities, bandwidth): $100M
Total opex: ~$116M/year
Amortized over 4 years: ($469M / 4) + $116M = ~$233M/year
Cost per GB: $233M/year / (10 EB x 10^9 GB/EB) / 12 months = ~$0.002/GB/month
Internal cost: ~$2/TB/month
Retail ~$23/TB/month covers margin, multi-region, and ops overhead. At scale, cross-AZ network transfer often costs more than storage hardware itself.
12. Data Model
Bucket Metadata (PostgreSQL)
CREATE TABLE buckets (
account_id BIGINT NOT NULL,
bucket_name TEXT NOT NULL,
bucket_id UUID DEFAULT gen_random_uuid(),
region TEXT NOT NULL,
versioning TEXT DEFAULT 'disabled',
acl JSONB DEFAULT '{}',
policy JSONB,
lifecycle_rules JSONB DEFAULT '[]',
encryption JSONB,
created_at TIMESTAMPTZ DEFAULT now(),
PRIMARY KEY (account_id, bucket_name)
);
CREATE UNIQUE INDEX idx_bucket_name ON buckets (bucket_name);
Object Metadata (Distributed KV, protobuf)
Key: /{bucket_id}/{object_key}\0{version_id}. Version IDs use bit-flipped timestamps (XOR each byte with 0xFF) so newer versions sort first in a forward scan. GET without a version_id reads the first entry — always the newest.
message ObjectMetadata {
string object_key = 1;
bytes version_id = 2;
uint64 size = 3;
string etag = 4;
string content_type = 5;
string storage_class = 6;
string encryption_key_id = 7;
map<string, string> user_metadata = 8;
uint32 placement_group = 9;
repeated ShardLocation shard_map = 10; // cached; recomputable from PG + CRUSH
bool is_delete_marker = 11;
int64 created_at = 12;
}
message ShardLocation {
uint32 shard_index = 1; // 0-13
uint64 node_id = 2;
uint64 extent_id = 3;
uint64 offset = 4;
uint32 length = 5;
}
Placement Group Map (etcd)
message PlacementGroup {
uint32 pg_id = 1;
repeated uint64 node_ids = 2; // 14 entries
uint64 epoch = 3;
string state = 4; // ACTIVE, REBALANCING, DEGRADED
}
Extent Index (local per node)
Each node holds roughly 72 billion shard entries (100T objects x 14 shards / 19,500 nodes). At 32 bytes per entry, a full in-memory index would need ~2.3TB — far exceeding RAM. The index is tiered: a bloom filter in memory (a few GB, eliminates most irrelevant extents), a persistent on-disk sorted index per sealed extent file, and an LRU cache in RAM for hot entries. Most reads hit the bloom filter, skip irrelevant extents, then do one disk seek into the right extent. The top 1-5% of hot entries live in RAM cache, covering the majority of read traffic.
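A simplified sketch of the tiered lookup, with a hand-rolled Bloom filter standing in for the production structure (sizes, probe counts, and the readDiskIndex callback are illustrative assumptions):

```go
package extentindex

import "hash/fnv"

// ShardLoc is where a shard lives on this node's disks.
type ShardLoc struct {
	ExtentID uint64
	Offset   uint64
	Length   uint32
}

// Bloom is a minimal Bloom filter using double hashing over one
// 64-bit FNV hash. A few GB of these in RAM let reads rule out
// almost every sealed extent without touching disk.
type Bloom struct {
	bits []uint64
	k    uint64
}

func NewBloom(mBits, k uint64) *Bloom {
	return &Bloom{bits: make([]uint64, (mBits+63)/64), k: k}
}

func (b *Bloom) hashes(key string) (h1, h2 uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	return sum, (sum >> 33) | 1 // odd second hash for double hashing
}

func (b *Bloom) Add(key string) {
	h1, h2 := b.hashes(key)
	m := uint64(len(b.bits)) * 64
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % m
		b.bits[bit/64] |= 1 << (bit % 64)
	}
}

func (b *Bloom) MayContain(key string) bool {
	h1, h2 := b.hashes(key)
	m := uint64(len(b.bits)) * 64
	for i := uint64(0); i < b.k; i++ {
		bit := (h1 + i*h2) % m
		if b.bits[bit/64]&(1<<(bit%64)) == 0 {
			return false // definitely not in this extent
		}
	}
	return true // maybe present: worth one disk seek
}

// Lookup walks the tiers: hot LRU cache, per-extent Bloom filters,
// then the on-disk sorted index (readDiskIndex stands in for that
// single seek) only for extents the filter cannot rule out.
func Lookup(key string, hot map[string]ShardLoc, filters map[uint64]*Bloom,
	readDiskIndex func(extentID uint64, key string) (ShardLoc, bool)) (ShardLoc, bool) {
	if loc, ok := hot[key]; ok {
		return loc, true // tier 1: RAM cache hit
	}
	for extentID, f := range filters {
		if f.MayContain(key) { // tier 2: bloom says "maybe"
			if loc, ok := readDiskIndex(extentID, key); ok {
				return loc, true // tier 3: one seek into the extent index
			}
		}
	}
	return ShardLoc{}, false
}
```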
Multipart Upload State
Key: /mpu/{bucket_id}/{object_key}/{upload_id}
message MultipartUpload {
string upload_id = 1;
map<uint32, PartInfo> parts = 2;
int64 created_at = 3;
int64 expires_at = 4; // auto-abort deadline
}
message PartInfo {
uint64 size = 1;
string etag = 2;
uint32 placement_group = 3;
repeated ShardLocation shards = 4;
}
Version Listing Index
Secondary index: /versions/{bucket_id}/{object_key}\0{reverse_version_id} pointing to primary metadata. Uses the same bit-flip trick to make newest versions sort first.
Lifecycle State
Key: /lifecycle/{bucket_id}/{rule_id}/{object_key} -> {current_class, target_class, transition_at, state}
13. API Design
PutObject
PUT /{bucket}/{key} HTTP/1.1
Host: storage.us-east-1.example.com
Content-Length: 10485760
Content-Type: image/jpeg
x-storage-class: STANDARD
x-storage-server-side-encryption: kms
x-storage-server-side-encryption-kms-key-id: kms:us-east-1:123456:key/abcd-1234
x-storage-meta-original-filename: beach-photo.jpg
Authorization: HMAC-SHA256 Credential=AKID/20260310/us-east-1/storage/request,
SignedHeaders=content-length;content-type;host;x-storage-date,
Signature=abcdef1234567890...
[10 MB of object bytes]
HTTP/1.1 200 OK
ETag: "d41d8cd98f00b204e9800998ecf8427e"
x-storage-version-id: 3sL4kqtJlcpXroDTDmJ+rmSpXd3dIbrHY+MTRCxf3vjVBH40Nr8X8gdRQBpUMLUo
GetObject
GET /{bucket}/{key} HTTP/1.1
Range: bytes=0-1048575
If-None-Match: "d41d8cd98f00b204e9800998ecf8427e"
Returns 206 Partial Content with range, or 304 Not Modified if ETag matches.
DeleteObject
DELETE /{bucket}/{key}?versionId=3sL4kqtJlcpXr... HTTP/1.1
Without versionId on a versioned bucket: creates a delete marker (data untouched). With versionId: permanently deletes that version.
ListObjectsV2
GET /{bucket}?list-type=2&prefix=photos/2024/&delimiter=/&max-keys=1000 HTTP/1.1
<ListBucketResult>
<Prefix>photos/2024/</Prefix>
<CommonPrefixes><Prefix>photos/2024/january/</Prefix></CommonPrefixes>
<CommonPrefixes><Prefix>photos/2024/vacation/</Prefix></CommonPrefixes>
<Contents>
<Key>photos/2024/index.html</Key>
<ETag>"d41d8cd98f00b204e9800998ecf8427e"</ETag>
<Size>2048</Size>
</Contents>
<IsTruncated>true</IsTruncated>
<NextContinuationToken>def456</NextContinuationToken>
</ListBucketResult>
Multipart Upload
POST /{bucket}/{key}?uploads HTTP/1.1 -> UploadId=abc123
PUT /{bucket}/{key}?partNumber=1&uploadId=abc123 HTTP/1.1 -> ETag for part 1
PUT /{bucket}/{key}?partNumber=2&uploadId=abc123 HTTP/1.1 -> ETag for part 2
POST /{bucket}/{key}?uploadId=abc123 HTTP/1.1 -> Complete (send part ETags)
DELETE /{bucket}/{key}?uploadId=abc123 HTTP/1.1 -> Abort (if needed)
Parts upload in parallel, out of order, from different machines. Each part is independently erasure-coded.
Presigned URLs
https://storage.../my-bucket/photo.jpg
?X-Storage-Algorithm=HMAC-SHA256
&X-Storage-Credential=AKID%2F20260310%2Fus-east-1%2Fstorage%2Frequest
&X-Storage-Expires=3600
&X-Storage-Signature=fe5f80f77d5fa3beca038a248ff027...
Self-contained: signature embeds bucket, key, operation, expiry. Conditions: IP range, content-type, content-length-range. These go direct to storage nodes, not through CDN.
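A minimal sketch of presigned-URL signing and node-side validation. The canonical string format here is a stated assumption — a real scheme pins URI encoding, header ordering, and credential scope — but the mechanics match the flow above: rebuild the string to sign from the request, recompute the HMAC with a locally cached secret, compare in constant time.

```go
package presign

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"time"
)

// stringToSign is a hypothetical canonicalization of the request.
func stringToSign(method, bucket, key, credential string, expires int64) string {
	return fmt.Sprintf("%s\n/%s/%s\n%s\n%d", method, bucket, key, credential, expires)
}

// Sign produces the signature the API server embeds in the URL.
func Sign(secret, method, bucket, key, credential string, expires int64) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(stringToSign(method, bucket, key, credential, expires)))
	return hex.EncodeToString(mac.Sum(nil))
}

// Verify is what the storage node runs: no database lookup, no
// session state -- just a recomputed HMAC against the cached secret.
func Verify(secret, method, bucket, key, credential, sig string, expires int64, now time.Time) bool {
	if now.Unix() > expires {
		return false // URL expired
	}
	want := Sign(secret, method, bucket, key, credential, expires)
	return hmac.Equal([]byte(want), []byte(sig)) // constant-time compare
}
```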
14. End-to-End: From Upload to Recovery
Before examining individual flows, this walkthrough shows the full lifecycle using a simplified 2+1 scheme (2 data shards, 1 parity). The real system uses 10+4, but the mechanics are identical.
Setup: File = [A B C D] (4 blocks). Stripe size = 2 blocks (k=2). One parity shard per stripe (m=1).
Step 1: Split into stripes and encode.
| Stripe | Data Shards | Parity Shard | Stored |
|---|---|---|---|
| Stripe 1 | D0 = A, D1 = B | P = A XOR B | [A, B, A XOR B] |
| Stripe 2 | D0 = C, D1 = D | P = C XOR D | [C, D, C XOR D] |
Step 2: Distribute shards across nodes.
| Node | Stripe 1 | Stripe 2 |
|---|---|---|
| Node 1 | A | C |
| Node 2 | B | D |
| Node 3 | A XOR B (parity) | C XOR D (parity) |
Every shard on a different node, different rack. Metadata records the mapping.
Step 3: Node 2 crashes. Heartbeat detects failure within 30 seconds. Shards B and D are now missing.
Step 4: Read during failure. The client gets the complete file. No error, just ~50ms extra latency from reconstruction.
Step 5: Background repair. Repair worker reconstructs B, writes it to Node 7, updates metadata. Full redundancy restored.
Mapping to production: Replace 2+1 with 10+4. Stripes are 1MB. Parity uses GF(2^8) matrix math instead of simple XOR. The system tolerates 4 missing shards instead of 1.
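The 2+1 scheme above fits in a few lines of Go; reconstruction works because XOR is its own inverse:

```go
package ecdemo

// encode21 computes the parity shard for a 2+1 stripe: P = D0 XOR D1.
// d0 and d1 must be the same length.
func encode21(d0, d1 []byte) []byte {
	p := make([]byte, len(d0))
	for i := range d0 {
		p[i] = d0[i] ^ d1[i]
	}
	return p
}

// reconstruct recovers a missing data shard from the survivor and
// the parity: P = D0 XOR D1 implies D1 = D0 XOR P. With Node 2 down,
// B = A XOR (A XOR B).
func reconstruct(survivor, parity []byte) []byte {
	return encode21(survivor, parity)
}
```

In production, the same structure holds, but the parity computation becomes a Reed-Solomon matrix multiply over GF(2^8), and any 10 of the 14 shards can play the role the survivor-plus-parity pair plays here.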
15. The PUT Path
- API Router validates HMAC, calls IAM. (~2ms)
- Metadata Service returns PG + shard-to-node mapping. (~5ms)
- Data Router reads 1MB stripes, RS encodes 10 data + 4 parity shards.
- 14 parallel shard writes to storage nodes (append to extent file, CRC32C, fsync). (~10-20ms)
- Quorum 11/14 means up to 3 slow nodes don't block. Missing shards repaired async.
- Raft commit of object metadata. (~5-10ms)
- Return 200 with ETag and version-id.
Total for 10MB PUT: ~38ms + network transfer. Data is not durable until both shard quorum write and Raft metadata commit succeed.
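The quorum step is worth pinning down. A sketch, assuming a per-node writeShard callback (names are illustrative): the client unblocks at 11 ACKs, and the stragglers either finish in the background or are repaired later.

```go
package put

import (
	"context"
	"errors"
)

// writeShard sends one shard to one storage node (stand-in for the
// real RPC: append to extent, CRC32C, fsync, ACK).
type writeShard func(ctx context.Context, shardIdx int) error

// quorumWrite fires all shard writes in parallel and returns as soon
// as `quorum` have ACKed, so up to total-quorum slow or dead nodes
// never block the client.
func quorumWrite(ctx context.Context, write writeShard, total, quorum int) error {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // stragglers are cancelled or repaired async

	acks := make(chan error, total) // buffered: no goroutine leaks
	for i := 0; i < total; i++ {
		go func(idx int) { acks <- write(ctx, idx) }(i)
	}

	ok, failed := 0, 0
	for ok < quorum {
		if err := <-acks; err != nil {
			failed++
			if failed > total-quorum {
				return errors.New("shard quorum unreachable")
			}
		} else {
			ok++
		}
	}
	return nil // quorum durable; Raft metadata commit comes next
}
```

For this design the call is quorumWrite(ctx, w, 14, 11): 3 failures are tolerable, a 4th fails the PUT and the client retries.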
16. The GET Path
Three ways to read an object. Each skips different layers.
GET via CDN (most common for high-traffic reads): CDN edge validates CDN-signed URL, checks cache. Hit (60-80%): serve immediately. Miss: fetch from storage origin via OAC.
GET via API (normal authenticated read): API Router validates auth, calls IAM. Metadata lookup from Raft leader returns PG ID, ETag, size, cached shard locations (~5ms). Data Router fetches 10 data shards in parallel. Happy path (>99%): all 10 respond, stream to client (~15-20ms). Degraded path: 1-4 shards missing, request parity shards, RS reconstruct (+50-200ms), queue background repair.
GET via presigned URL (direct, no CDN): Storage node validates signature, reads local shards, fetches remaining from peers, reconstructs if needed, streams back.
Speculative read: If 9 of 10 shards respond quickly and 1 lags, fire off a request for a parity shard without waiting. Whichever arrives first wins. This keeps p99 tight even when individual nodes are slow.
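A sketch of that hedged-request pattern, assuming the caller already holds 9 of 10 data shards and is racing the straggler against a parity fetch (function names are illustrative):

```go
package read

import (
	"context"
	"time"
)

// fetchFn retrieves one shard's bytes.
type fetchFn func(ctx context.Context) ([]byte, error)

// speculative returns the missing shard from whichever source
// answers first: the lagging node, or a parity shard requested after
// hedgeDelay. The loser is cancelled.
func speculative(ctx context.Context, slow, parity fetchFn, hedgeDelay time.Duration) ([]byte, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	type result struct {
		data []byte
		err  error
	}
	ch := make(chan result, 2) // buffered: neither goroutine leaks

	go func() {
		d, err := slow(ctx)
		ch <- result{d, err}
	}()
	go func() {
		select {
		case <-time.After(hedgeDelay): // hedge only if still waiting
			d, err := parity(ctx)
			ch <- result{d, err}
		case <-ctx.Done():
			ch <- result{nil, ctx.Err()}
		}
	}()

	first := <-ch
	if first.err == nil {
		return first.data, nil // fastest responder wins
	}
	second := <-ch
	return second.data, second.err
}
```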
17. Presigned URLs: Storage vs CDN
Problem: Two completely separate signing systems exist. Mixing them up is one of the most common integration mistakes.
Simple example: A storage presigned URL sends the request directly to a storage node — CDN is bypassed entirely. A CDN-signed URL sends the request to CDN edge, which validates and caches. Using a storage presigned URL through CDN produces 100% cache misses because every URL has a unique signature (different cache key).
Mental model: Storage presigned URLs are direct-dial phone numbers — they reach the storage node. CDN-signed URLs are switchboard numbers — the CDN operator routes the call, and can serve from memory if asked the same question recently.
Storage presigned PUT flow:
CDN-signed URLs (the correct CDN pattern): The storage bucket is private. CDN is the only entity allowed to fetch from storage via OAC. The backend generates a CDN-signed URL using a CDN key pair. CDN verifies signature and expiry at the edge. On cache miss, CDN fetches from storage using its OAC credential.
| Pattern | URL goes to | Who validates | Caching | Best for |
|---|---|---|---|---|
| Storage presigned URL | Storage directly | Storage node | None | Uploads, temp downloads |
| CDN-signed URL + OAC | CDN edge | CDN edge | Full CDN caching | Private content, high read traffic |
| Public bucket + CDN | CDN edge | No auth needed | Full CDN caching | Static assets, public content |
18. Multipart Upload
Problem: A 5TB object cannot be uploaded in a single request. Network failures mid-upload would require starting over. Parts need to upload in parallel from different machines.
Simple example: A 1GB video is split into 64 parts of 16MB each. Each part uploads independently and is erasure-coded on arrival. If the upload fails at part 40, ListParts reveals parts 1-39 and 41-64 are already durable. Only part 40 needs re-upload.
Mental model: Each part is a self-contained mini-upload with full durability. Completion is a metadata operation that stitches the parts together into one visible object. No data is copied during completion — existing part shard locations are linked into the final object record.
Each part is independently erasure-coded with its own shard map. Parts upload in parallel, out of order, from different machines. The upload_id ties them together.
What happens during CompleteMultipartUpload:
- Validate: Check UploadId exists, all referenced parts present, ETags match, ascending order.
- Assemble: Link existing parts into a single logical object by updating metadata pointers. Primarily a metadata operation.
- Compute final ETag: MD5(concat(MD5(part1) + MD5(part2) + ...)) + "-" + part_count (sketched below).
- Atomic commit: Object becomes visible in one metadata write. Before: object does not exist. After: fully readable.
- Cleanup: Temporary multipart state removed. Part data stays (now referenced by final object).
For stronger integrity, explicit SHA-256 or CRC32 checksums can be provided at completion time. Incomplete uploads older than 7 days are automatically cleaned up.
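A sketch of the final-ETag rule described above (MD5 over the concatenated binary part MD5s, suffixed with the part count):

```go
package multipart

import (
	"crypto/md5"
	"fmt"
)

// partMD5 is the digest each UploadPart returns as its ETag (hex-encoded).
func partMD5(part []byte) [16]byte {
	return md5.Sum(part)
}

// finalETag hashes the concatenated binary part digests and appends
// the part count, e.g. "9b2cf...-64" for a 64-part upload.
func finalETag(partMD5s [][16]byte) string {
	h := md5.New()
	for _, m := range partMD5s {
		h.Write(m[:])
	}
	return fmt.Sprintf("\"%x-%d\"", h.Sum(nil), len(partMD5s))
}
```

This is why a multipart ETag is not the MD5 of the whole object: verifying content integrity requires replaying the same part boundaries.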
| Object Size | Part Size | Parts | Parallelism |
|---|---|---|---|
| 100MB | 10MB | 10 | 5-10 |
| 1GB | 16MB | 64 | 10 |
| 10GB | 64MB | ~160 | 10-20 |
| 1TB | 512MB | ~2,000 | 50-100 |
| 5TB | 512MB | ~10,000 | 100 |
19. Small Object Packing: Extent Files
Problem: Most objects are small (median ~10KB). Storing each as a separate file hits three walls simultaneously: inode exhaustion (ext4 tops out around 4 billion inodes per filesystem, and typical formats allocate far fewer), IOPS waste (a 7200 RPM HDD does ~150 random I/O ops/sec because each read requires a physical disk seek of ~7ms), and filesystem fragmentation from billions of tiny files.
Simple example: A 1KB metadata object stored as its own file costs one inode and one disk seek. Packed into a 256MB extent file alongside thousands of other objects, it costs 32 bytes of index memory and one sequential read.
Mental model: Instead of giving every letter its own filing cabinet drawer, stuff thousands of letters into one large envelope and keep an address book that says "letter X is at position 4,096 in envelope 7."
+------------------------------------------------------------------+
| Extent File (256 MB) |
| Header | ObjA shard|CRC|pg| ObjB shard|CRC|pg| ... | free space |
+------------------------------------------------------------------+
Write: Append shard to open extent, update in-memory index, fsync. Seal at 256MB.
Read: O(1) in-memory hash lookup -> seek to offset -> read bytes -> verify CRC32C.
| Metric | One file per object | Extent packing |
|---|---|---|
| Write IOPS (1KB, HDD) | 150/sec (seek) | 5,000+/sec (sequential) |
| Max objects per 20TB | ~4B (inode limit) | Unlimited (memory-bound) |
| Metadata per object | ~256 bytes (inode) | 32 bytes (in-memory) |
Compaction: When >30% of an extent's shards are dead, GC rewrites live shards into a new extent. Large objects (>4MB): stored in dedicated files, not packed.
Small object durability strategy: For objects under ~256KB, EC overhead is disproportionate. A 1KB object produces 14 shards of ~100 bytes each, and every read fetches 10 of them from 10 different nodes. Most production systems handle this with a size-based policy:
| Object size | Strategy | Rationale |
|---|---|---|
| < 256KB | 3x replication | EC overhead too high. Replication is simpler, reads hit one node. |
| 256KB - 4MB | EC or replication (configurable) | Transition zone. EC starts to pay off. |
| > 4MB | Erasure coding (10+4) | EC savings dominate. Stored in dedicated files. |
20. Prefix Listing at Scale
Keys are sorted in the LSM-tree. prefix=photos/2024/ triggers a range scan from "photos/2024/" to "photos/2024/\xff". For each key: strip prefix, find delimiter -> CommonPrefix (simulates subdirectory) or Contents (simulates file).
Scaling to billions of matches: The metadata service scans all relevant partitions in parallel. Results come back sorted per partition. A k-way merge combines these sorted streams into one globally sorted result. Once 1000 results are collected (max-keys default), scanning stops and a continuation token marks where to resume.
Applications must design key structure with LIST in mind. {date}/{category}/{uuid} is efficient. {uuid} with no prefix structure means scanning everything.
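A sketch of the delimiter rule applied to an already-merged sorted key stream (the k-way merge across partitions is assumed to have happened upstream):

```go
package listing

import "strings"

// ListResult mirrors ListObjectsV2 output: plain keys become
// Contents, delimiter groups collapse into CommonPrefixes.
type ListResult struct {
	Contents       []string
	CommonPrefixes []string
}

// classify walks sorted keys under prefix. If the remainder of a key
// contains the delimiter, everything up to and including its first
// occurrence becomes a CommonPrefix (a simulated subdirectory);
// otherwise the key is a regular object.
func classify(sortedKeys []string, prefix, delimiter string, maxKeys int) ListResult {
	var res ListResult
	seen := map[string]bool{}
	for _, key := range sortedKeys {
		if !strings.HasPrefix(key, prefix) {
			continue
		}
		rest := key[len(prefix):]
		if i := strings.Index(rest, delimiter); delimiter != "" && i >= 0 {
			cp := prefix + rest[:i+len(delimiter)]
			if !seen[cp] { // emit each "directory" once
				seen[cp] = true
				res.CommonPrefixes = append(res.CommonPrefixes, cp)
			}
		} else {
			res.Contents = append(res.Contents, key)
		}
		if len(res.Contents)+len(res.CommonPrefixes) >= maxKeys {
			break // a continuation token resumes from the next key
		}
	}
	return res
}
```

With prefix "photos/2024/" and delimiter "/", the key "photos/2024/vacation/beach.jpg" folds into the CommonPrefix "photos/2024/vacation/" while "photos/2024/index.html" lands in Contents, matching the XML example in section 13.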
21. Garbage Collection
Problem: Overwrites, deletes, aborted multipart uploads, and failed writes all leave orphan shards. Without cleanup, leaked storage accumulates to petabytes within months.
Simple example: An object is overwritten. The old shards still sit on disk. Nothing references them. GC finds them 24 hours later and reclaims the space.
Mental model: GC is the janitorial crew. It never enters during business hours (never in the write path). It cleans up after hours (async background), with a waiting period (24-hour grace) so nothing gets thrown away that someone might still be reading.
Dead shards sit for a 24-hour grace period before physical deletion. This protects against in-flight reads, clock skew, and race conditions. GC gets 20% of cluster bandwidth (~1.4 GB/sec, ~121 TB/day).
When an extent passes 30% dead ratio, GC compacts it by rewriting live shards into a fresh extent.
Orphan detection: Once a week, the metadata service exports every referenced shard. Each storage node compares that list against its local index. Unmatched entries are orphan candidates, quarantined for 7 days, then deleted.
22. Rebalancing and Repair
Problem: When a node or drive fails, some placement groups lose a shard. The system must reconstruct missing shards before a second failure hits the same PG and threatens durability.
Simple example: Node X fails (36 drives). The repair worker reads 10 surviving shards per affected PG, reconstructs missing shards via Reed-Solomon, writes them to healthy nodes, and updates the PG map.
Mental model: A library book is damaged. The librarian can reconstruct it from the "cheat sheets" (parity) and the remaining chapters (surviving data shards). The sooner the reconstruction finishes, the lower the chance of losing another chapter.
Triggers: Node added (new capacity), node removed (decommission), disk failure, capacity imbalance.
Repair flow: Heartbeat detects failure (30s) -> PGs prioritized by severity (10/14 shards = critical, 11/14 = high, 13/14 = low) -> read 10 surviving shards -> RS reconstruct -> write to new node -> update PG map.
Rate-limited: 20% of cluster I/O. Backpressure: pause if client p99 exceeds threshold.
Repair time for one node failure: ~72 PGs of data to reconstruct, roughly ~1 PB total. At 1.4 GB/sec with 10 parallel workers: ~20 hours under ideal conditions. Real-world repair takes 2-3x longer due to throttling, network contention, and concurrent repairs. Keeping repair fast is critical because the longer the window stays open, the higher the chance of a second failure hitting the same PG.
Node decommission: Mark DRAINING -> stop writes -> copy all shards to replacements -> verify -> remove from CRUSH.
23. Versioning
Versioning enabled: every PUT creates a new version. DELETE without version_id creates a delete marker (data untouched, recoverable). DELETE with version_id permanently removes that version.
MFA Delete: Requires MFA to permanently delete versions or change versioning state. Protects against compromised credentials.
Storage impact: 1MB file updated 1,000x/day = 1GB/day versioned data. Lifecycle rules are essential: "expire non-current versions after 30 days."
24. Hot Objects and Caching
Anything accessed more than 100 times per minute gets promoted to a local SSD cache on the Data Router (LRU, 10GB per node). For viral content hitting 10K+ req/sec on a single object, the system copies it to multiple Data Router caches and uses request coalescing at the CDN edge. A thousand simultaneous cache misses become one origin fetch instead of a thousand.
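Origin-side coalescing can be sketched with Go's singleflight package — one illustration of the general technique, not necessarily what a CDN edge uses internally (fetchOrigin is a stand-in for the real metadata-lookup-plus-shard-fetch path):

```go
package hotcache

import "golang.org/x/sync/singleflight"

var group singleflight.Group

// fetchOrigin stands in for the real origin read: metadata lookup,
// shard fetch, Reed-Solomon reassembly.
func fetchOrigin(key string) ([]byte, error) {
	return nil, nil // placeholder
}

// Get coalesces concurrent misses: a thousand simultaneous requests
// for the same viral object trigger exactly one origin fetch; the
// other 999 callers block and share the result.
func Get(key string) ([]byte, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return fetchOrigin(key)
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
```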
25. Storage Layer Details
Each node: 36 HDDs (20TB), 2 NVMe SSDs (extent index + WAL), 256GB RAM, 2x25Gbps NIC.
CRUSH failure domain hierarchy: Region -> AZ (3) -> Rack (~540) -> Node (~36/rack) -> Drive (36/node).
For a 10+4 PG: 5+5+4 shard distribution across 3 AZs, each shard on a different rack, max 5 per AZ. Losing the AZ holding 4 shards leaves exactly 10 (sufficient); losing an AZ holding 5 leaves only 9, so those PGs are unreadable until the AZ returns. For AZ-loss tolerance on critical data, use 10+6 (6+5+5 across AZs: losing any AZ leaves at least 10 of 16 shards, enough to reconstruct).
26. User Experience Scenarios
5MB photo upload: Auth 2ms + PG lookup 5ms + encode <1ms + 14 shard writes 20ms + Raft 10ms = ~40ms + transfer. Photo erasure-coded across 3 AZs with 11 nines.
4GB video multipart: 800 parts x 5MB, 10 parallel. ~5.5 minutes on 100 Mbps. WiFi drops at part 400 -> reconnect -> ListParts -> resume from 400. No wasted work.
Data lake byte-range read: Bytes 100MB-110MB of a 50GB Parquet file -> Data Router computes target stripe -> fetches only relevant shards. ~20ms first byte. Never downloads the full 50GB.
Accidental delete recovery: Versioning enabled -> DELETE only creates marker -> list versions -> GET with version_id or delete the marker. Data never lost.
27. The Partial Update Layer
Object storage is immutable by design. There is no PATCH, no partial write, no append. Changing 1 byte in a 5GB file means re-uploading the entire 5GB. This is deliberate: immutability keeps the storage engine simple.
The manifest service described here is application code built on top of the immutable blob store. The storage system just stores blobs.
Problem: File sync tools, collaborative editors, and backup systems need to update portions of files without re-uploading everything. How is file-level editing built on top of an immutable store?
Simple example: A 100MB document is split into 12 chunks of ~8MB each. Each chunk is stored as a regular object. A manifest record says "file doc-123, version 5 = [chunk_a, chunk_b, ..., chunk_l] in this order." Editing chunk 3 means uploading one new 8MB object and updating the manifest to reference it. Total upload: 8MB instead of 100MB.
Mental model: The storage system is a warehouse of numbered boxes. The manifest is a recipe card that says "to assemble this file, combine boxes 7, 12, 3, 45, 22 in that order." Changing one ingredient means putting a new box in the warehouse and updating the recipe card. Other boxes are untouched.
Manifest Data Model
Key: /{tenant_id}/{file_id}\0{version} (newest first via bit-flipped timestamps)
message FileManifest {
string file_id = 1;
string tenant_id = 2;
uint64 version = 3;
repeated ChunkRef chunks = 4;
uint64 total_size = 5;
string content_hash = 6;
int64 created_at = 7;
uint64 expected_version = 8; // optimistic locking
}
message ChunkRef {
uint32 index = 1;
string object_key = 2; // key in chunks bucket
uint64 offset = 3;
uint64 size = 4;
string etag = 5;
}
Manifest API
| Endpoint | Purpose |
|---|---|
| POST /files/{id}/upload-urls | Get presigned URLs for new chunks |
| POST /files/{id}/versions | Commit new manifest (with If-Match: version_N for optimistic locking) |
| GET /files/{id} | Get latest manifest + chunk presigned GET URLs |
| GET /files/{id}/versions | List all versions |
| DELETE /files/{id}/versions/{v} | Delete specific version |
Atomic Commit and Consistency
- Upload new/changed chunks via presigned URLs.
- Create new manifest referencing old chunks (unchanged) + new chunks.
- Atomic pointer switch: manifest service commits new version in a single Raft write.
No partial state is visible. Before commit: readers see version N. After: version N+1. If commit fails, uploaded chunks are orphans cleaned by GC. Reads always go through the Raft leader for the latest manifest.
Concurrency, Failure, Cleanup
Optimistic locking: Read version N -> edit -> submit with expected_version=N. If another client committed N+1, reject with 409 Conflict -> re-read, re-apply, retry. No distributed locks.
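A minimal sketch of that compare-and-set commit, with an in-memory map and mutex standing in for the Raft-backed manifest store (Raft's single-writer ordering plays the mutex's role in production):

```go
package manifest

import (
	"errors"
	"sync"
)

// ErrConflict maps to the 409 returned when expected_version no
// longer matches the latest committed version.
var ErrConflict = errors.New("manifest version changed")

// Store is an in-memory stand-in for the Raft-backed manifest KV.
type Store struct {
	mu       sync.Mutex
	versions map[string]uint64 // file_id -> latest committed version
}

func NewStore() *Store { return &Store{versions: map[string]uint64{}} }

// Commit applies the optimistic-locking rule: succeed only if the
// caller's expectedVersion is still the latest. On ErrConflict the
// client re-reads the manifest, re-applies its edit, and retries.
func (s *Store) Commit(fileID string, expectedVersion uint64) (uint64, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.versions[fileID] != expectedVersion {
		return 0, ErrConflict // another client committed N+1 first
	}
	s.versions[fileID] = expectedVersion + 1
	return expectedVersion + 1, nil
}
```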
Read path: Fetch manifest -> get ordered chunk list with presigned GET URLs -> fetch chunks in parallel (10-20 concurrent) -> reassemble. For range reads, the manifest service computes which chunks cover the byte range and returns only those URLs.
Chunk upload failure: retry (idempotent). Manifest commit failure: retry (chunks already durable). Upload dies halfway: file unchanged because manifest unchanged. Check which chunks exist with HEAD, upload missing ones, commit.
GC: Background scanner finds chunks not referenced by any live manifest. 24-hour grace period, then delete. Aborted uploads cleaned after 7 days.
| Chunk Size | Best For |
|---|---|
| 4MB | Document editing, configs (fine-grained updates) |
| 8MB | General-purpose (recommended default) |
| 16MB | Video, media (fewer fetches, less metadata) |
28. What Breaks First at Scale
- Metadata hot spots. Popular bucket overwhelms a single Raft partition. Auto-split at 200GB or 10K ops/sec.
- Reconstruction storms. Rack failure (20 nodes) triggers massive repair I/O. Rate-limited repair + priority queues prevent cascading degradation.
- GC backlog. 5K dead shard sets/sec from overwrites. Without 20% I/O budget for GC, leaked storage hits petabytes in months.
- Small object IOPS wall. Median 10KB object = 150 reads/sec per HDD (seek-bound). Extent packing is mandatory.
- Presigned URL abuse. Leaked URL with 7-day expiry is an open door. Short expiry + IP restrictions + bucket deny policies.
29. Bottlenecks and Mitigations
| Bottleneck | Mitigation |
|---|---|
| Metadata partition hot spots | Auto-split at 200GB or 10K ops/sec. Add hash prefix for sequential keys. |
| EC reconstruction latency (+50-200ms) | Proactive repair, speculative parity reads, SSD cache |
| Small object IOPS (150/sec/HDD) | Extent packing, SSD read cache, batch reads |
| LIST on enormous prefixes | Parallel merge, secondary prefix index, max-keys 1000 |
| LSM compaction write amplification | Rate-limited leveled compaction, NVMe for metadata |
| PG rebalancing I/O spike | Rate-limited migration (2 PGs/node/hour), off-peak scheduling |
| GC vs client I/O contention | 20% I/O budget for GC, lower priority class |
| Cross-AZ EC bandwidth | Place 7/10 data shards in local AZ |
| Multipart orphan storage leak | Auto-abort after 7 days |
| 136 PB metadata footprint | Bloom filters, NVMe mandatory, prefix-partitioned scans |
| Presigned URL abuse | Short expiry (1h default), IP restrictions, audit logging |
| Raft round-trip overhead | Pipelined Raft, read leases |
| CDN cache stampede | Request coalescing, stale-while-revalidate |
| Manifest hot files | Short TTL cache, read replicas, per-file rate limit |
30. Failure Scenarios
| Scenario | Shards Remaining | Impact | Recovery | Data Loss |
|---|---|---|---|---|
| Single drive failure | 13/14 | Zero user impact | Repair in hours | None |
| Node failure (36 drives) | 13/14 per PG (~72 PGs) | Degraded reads +50ms | Background repair, rate-limited | None |
| Full AZ outage | 9-10/14 | PGs with <=9 shards temporarily unreadable | Wait for AZ recovery or rebuild | None (shards in the downed AZ are intact, just unreachable) |
| Raft leader failure | N/A | Partition writes stall ~5-10ms | Raft election (1-2 round trips) | None |
| Metadata corruption | N/A | Corrupted replica removed | New replica from snapshot + log | None (2 healthy replicas) |
| Silent bit rot | 13/14 | Detected by CRC32C on read or scrubber | Reconstruct from 10 good shards | None if <5 corrupted |
| AZ network partition | N/A | Minority-side leaders stall | Leaders step down; majority continues | None (Raft guarantee) |
| Multipart failure | N/A | Incomplete upload | ListParts -> resume. Auto-abort after 7 days. | None |
| EC engine bug | 10/14 data OK | Bad parity reconstruction | Detect via ETag mismatch. Re-encode from 10 data shards. | None if data shards intact |
| GC premature deletion | 13/14 | Durability reduced | 24h grace + weekly reconciliation + soft delete | None (EC covers it) |
| Partition split during writes | N/A | ~100-200ms write stall | Automatic; API Router updates partition map | None |
| Reconstruction storm | 13/14 per PG | Cascading I/O degradation | Rate-limited repair, priority queue, backpressure | None if >=10 shards survive |
| CDN origin failure | N/A | Cached: served. Uncached: 503. | DNS health checks reroute to healthy region | None (availability only) |
| Metadata service fully down | N/A | All reads/writes fail. Data intact on disk. | Raft election or cluster recovery from snapshots | None (availability, not durability) |
When Data Is Actually Lost
Only when 5+ of 14 shards for the same object are destroyed before repair completes.
Probability: With 2% annual drive failure rate and 6-hour repair time:
- P(1 drive fails in 6h) ~ 1.4 x 10^-5
- P(5 of 14 fail before repair) ~ 10^-21 per object per cycle
- Across 100T objects annually: ~10^-4 objects lost/year = 1 object per 10,000 years
That is 11 nines.
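A few lines reproduce each figure in the list above (assuming independent failures, which the mitigations below exist to preserve):

```go
package durability

import "math"

// binom computes the binomial coefficient C(n, k).
func binom(n, k int) float64 {
	r := 1.0
	for i := 0; i < k; i++ {
		r *= float64(n-i) / float64(k-i)
	}
	return r
}

// LossOdds reproduces the back-of-envelope numbers: p is the chance
// one specific drive fails within a single 6-hour repair window.
func LossOdds() (pDrive, pObject, objectsLostPerYear float64) {
	pDrive = 0.02 * 6.0 / 8760.0                 // ~1.4e-5
	pObject = binom(14, 5) * math.Pow(pDrive, 5) // ~1e-21: 5 of 14 shards gone
	cycles := 8760.0 / 6.0                       // repair windows per year
	objectsLostPerYear = pObject * 1e14 * cycles // ~1.4e-4 across 100T objects
	return
}
```

The exponential dependence on pDrive^5 is why the section on repair calls repair speed the primary durability lever: doubling the repair window doubles pDrive and multiplies the loss probability by 32.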
| Risk Factor | Mitigation |
|---|---|
| Correlated failures (bad drive batch) | Diversify vendors, spread same-batch across racks |
| Slow repair (overloaded cluster) | Prioritize repair I/O, pre-provision capacity |
| Silent corruption | Scrubber every 2 weeks, CRC on every read |
| Firmware bug | Verify shards on read-back before ACK |
| Natural disaster | Cross-region replication |
31. Deployment and Operations
Multi-Region Deployment
| Region | API Routers | Data Routers | Metadata | Storage | Total |
|---|---|---|---|---|---|
| us-east-1 | 800 | 600 | 5,500 | 8,000 | 14,900 |
| us-west-2 | 600 | 450 | 4,000 | 6,000 | 11,050 |
| eu-west-1 | 600 | 450 | 4,100 | 5,500 | 10,650 |
Technology Stack
| Layer | Technology |
|---|---|
| CDN / Edge | CloudFront / Fastly |
| API Routers | Go, stateless, Kubernetes |
| Data Routers | Rust, ISA-L, Kubernetes |
| Metadata | Go, Raft + LSM-tree |
| Storage Nodes | Rust, direct I/O, extents |
| Manifest Service | Go, PostgreSQL / Raft KV |
| Coordination | etcd (3-node/region) |
| Observability | Prometheus + Grafana + Jaeger |
| Events | Kafka |
Canary Deployment
5% rollout, 30-minute soak. Auto-rollback if: PUT success <99.9%, GET p99 +20%, reconstruction rate 2x, Raft commit +50%, CDN hit ratio -10%.
Storage node upgrades: rack-by-rack rolling (drain -> upgrade -> smoke test -> re-enable). 19,500 nodes across ~540 racks at 20 min/rack = ~7.5 days full cluster.
Observability
Key SLIs:
# Latency
put_latency_ms{quantile="0.99"}
get_latency_ms{quantile="0.99"}
get_first_byte_latency_ms{quantile="0.99"}
list_latency_ms{quantile="0.99"}
# CDN
cdn_cache_hit_ratio{region}
cdn_origin_fetch_latency_ms{quantile="0.99"}
# Erasure coding
ec_reconstruct_count_total
ec_reconstruct_latency_ms{quantile="0.99"}
# Health
shard_repair_backlog
pgs_degraded_total
gc_backlog_bytes
gc_bytes_reclaimed_per_hour
raft_commit_latency_ms{quantile="0.99"}
raft_leader_elections_total
drive_utilization_percent{node_id}
cross_az_bandwidth_bytes_per_sec{src_az, dst_az}
Critical Alerts:
| Alert | Condition | Severity |
|---|---|---|
| PUT success rate drop | <99.9% for 5 min | P1 |
| Reconstruction rate spike | >5% of GETs | P2 |
| Raft election storm | >10/min for a partition | P1 |
| Drive failure spike | >20/day (vs baseline 5.5) | P1 |
| GC backlog growth | Increasing for >24h | P2 |
| PG degradation | >50 degraded PGs | P1 |
| Raft replication lag | >5 seconds | P1 |
| CDN hit ratio drop | <50% for 15 min | P2 |
Distributed tracing: Every request gets a trace ID at the API Router. Follows through CDN -> API Router -> Metadata -> Data Router -> Storage Nodes -> Raft. Sampling: 0.1% normal, 100% for errors or p99 breaches.
Example degraded GET trace:
TraceID: 7a8b9c0d... Duration: 127ms
cdn_edge.check: 3ms (MISS)
api_router.authenticate: 2ms
metadata_service.lookup: 6ms (partition 87432)
data_router.fetch_shards: 112ms
9/10 shards in 15ms, shard 1 TIMEOUT at 50ms
Parity shard 10: 12ms, RS reconstruct: 0.3ms
Security
IAM and Policy Engine: Every request evaluates (principal, action, resource). IAM policies (user-level) -> bucket policies (resource-level) -> ACLs (legacy). Explicit DENY wins, then ALLOW, default DENY. In-memory policy cache on each API Router, refreshed every 5 minutes.
Encryption: Three-level key hierarchy: Master Key (HSM, annual rotation) -> Bucket Key (cached 24h, reduces KMS calls from 50K/sec to 1/bucket/day) -> Object Data Key (unique per object, AES-256-GCM). SSE-Default (service-managed), SSE-KMS (customer-managed, audited, revocable), SSE-C (customer-provided per request, not stored).
Transport: TLS 1.3 client-to-edge, mTLS service-to-service. HMAC-SHA256 auth with 15-minute replay window. CRC32C per shard, ETag (MD5) end-to-end.
Access controls: Block Public Access ON by default. Object Lock (WORM): Governance mode (override with IAM) and Compliance mode (no deletion until retention expires). MFA Delete for permanent deletion.
Audit logging: Every API call logged with principal, action, resource, source IP, timestamp, request-id, response status. Immutable, retained for compliance, queryable for forensic analysis.
Network segmentation: Storage nodes on isolated network (no internet). API Routers in DMZ. Each component has minimum required permissions.
32. Beyond This Design: Real-World Evolution
This design is the right starting point for an interview and for initial production at scale. Real systems evolve further as constraints tighten.
Metadata becomes the dominant bottleneck. At 100T+ objects, the metadata layer consumes more engineering effort than the storage layer. The 136 PB metadata footprint requires aggressive tiering: hot metadata on NVMe, warm on SSD, cold on HDD with prefetch. LSM-tree compaction at this scale is a continuous background cost — write amplification of 10-30x means every logical write triggers 10-30 physical writes during compaction. Rate-limited leveled compaction and dedicated NVMe are not optional.
Joins disappear entirely. The data model contains no joins. Every lookup is a single key access or a range scan. The shard_map is denormalized into object metadata specifically to avoid any cross-index lookup during the read path. Production systems take this further: pre-computed aggregations for billing, pre-built indexes for lifecycle scanning, materialized views for common LIST patterns.
Hot partition splitting becomes automatic. A single viral bucket can overwhelm its metadata partition. Production systems detect hot partitions by monitoring ops/sec per partition and automatically split them, redistributing load. The split must be transparent to clients — no downtime, no visible key range changes. This requires the partition map to be versioned and propagated to all API Routers within seconds.
Background work dominates I/O. In a steady-state cluster, client traffic may use only 30-40% of disk I/O. The rest is consumed by compaction (LSM-tree maintenance), scrubbing (integrity verification), GC (dead shard reclamation), repair (shard reconstruction), and rebalancing (capacity redistribution). Scheduling and prioritizing these background tasks is a major operational concern. Starvation of any one (especially repair) directly threatens durability.
Cross-AZ network cost exceeds storage cost. At exabyte scale, cross-AZ data transfer is often the largest single line item. Placing the majority of data shards (7 of 10 data shards) in the local AZ reduces cross-AZ traffic by 70%. Parity shards, which are written once and rarely read, can tolerate being remote. This placement bias must be balanced against failure domain requirements.
Small object handling diverges from large object handling. The size-based split (replication for small, EC for large) introduces two distinct code paths with different performance characteristics, failure modes, and operational procedures. Some systems unify this by always using replication below a threshold and transparently converting to EC during background compaction when objects are sealed into extents.
Cluster maps must propagate fast and consistently. CRUSH depends on every component having the same view of the cluster topology. A stale cluster map causes writes to land on wrong nodes and reads to miss. Production systems version the cluster map and include the version in every RPC. If a recipient has an older version, it fetches the update before processing. This propagation must complete in seconds, not minutes.
Repair speed is the primary durability lever. The probability of data loss scales exponentially with repair time. Halving repair time from 40 hours to 20 hours reduces the probability of a second failure hitting the same PG by orders of magnitude. Investment in faster repair (more parallel workers, dedicated repair bandwidth, prioritized I/O scheduling) yields a higher durability return than adding parity shards.
33. Explore the Technologies
Core Technologies
| Technology | Role |
|---|---|
| FoundationDB | Metadata store: ordered KV, strong consistency |
| RocksDB | LSM-tree engine for metadata nodes |
| TiKV | Alternative metadata store: ordered KV, Raft-based |
| CockroachDB | Alternative metadata store: distributed SQL, Raft |
| gRPC | Internal RPC with protobuf |
| etcd | Placement coordination, cluster membership |
| Prometheus + Grafana | Metrics and dashboards |
| Kafka | Event notifications |
Distributed Systems Concepts
| Concept | Role in This Design |
|---|---|
| Raft Consensus | 1.36M Raft groups for strongly consistent metadata |
| LSM Trees vs B-Trees | LSM-tree metadata storage: fast writes, bloom-filter reads |
| Bloom Filters | Extent index: skip irrelevant extents during shard lookup |
| Write-Ahead Log | LSM-tree WAL for crash recovery on metadata nodes |
| Consistent Hashing | Foundation for CRUSH placement algorithm |
| SSTable Compaction | Background LSM compaction to bound read amplification |
| Merkle Trees | Integrity verification for shard data |
Infrastructure Patterns
| Pattern | Relevance to This Design |
|---|---|
| Object Storage | The system being designed |
| Replication & Consistency | Raft for metadata, EC for data |
| Database Sharding | 1.36M metadata partitions |
| Hot/Warm/Cold Tiering | Storage class transitions |
| CDN & Edge | CDN-signed URLs, OAC, WAF |
| Distributed Tracing | End-to-end request tracing |
| Erasure Coding | Reed-Solomon math and shard recovery |
| CRUSH Placement | Topology-aware shard placement (PG -> nodes) |
Further Reading
- Reed-Solomon Error Correction (Plank, 2013) -- Definitive RS tutorial with worked examples
- CRUSH: Controlled, Scalable, Decentralized Placement (Weil et al., SC 2006) -- The topology-aware placement algorithm
- Finding a Needle in Haystack (OSDI 2010) -- Small object packing at scale
- ISA-L: Intel Storage Acceleration Library -- Hardware-accelerated erasure coding with SIMD
- Windows Azure Storage (Calder et al., SOSP 2011) -- Blob storage architecture with erasure coding
- Dynamo: Highly Available Key-Value Store (DeCandia et al., SOSP 2007) -- Consistent hashing and quorum concepts
- The Google File System (Ghemawat et al., SOSP 2003) -- Chunk-based distributed storage
Reed-Solomon (10+4) targets 11 nines at 1.4x overhead. CDN serves reads via CDN-signed URLs and OAC, not storage presigned URLs. The control plane (1.36M Raft groups) runs separately from the data plane (stateless Data Router fleet, 19,500 storage nodes). Immutability removes locking, simplifies caching, and eliminates replication conflicts. Storage presigned URLs go direct to storage for uploads. File-level edits live in an application-layer manifest service on top. The metadata layer is the hardest part of the design, and the part that matters most.