System Design: Dropbox (File Sync, Chunking, and Deduplication)
Goal: Build a cloud file sync platform serving 500M registered users, 50M DAU, syncing 10B files across 100PB of storage. Support chunked uploads, resumable transfers, delta sync (upload only changed bytes), cross-device sync via WebSocket, content deduplication across users, peer-to-peer LAN sync, and file versioning. Files under 10MB should be available on other devices within 10 seconds on broadband.
Mental model — four ideas that make everything else click:
- A file = ordered list of chunk hashes. Chunks are immutable, content-addressed blobs in object storage.
- Editing a file means producing a new list of hashes. Most stay the same. Only new chunks get uploaded.
- Dedup = identical content produces the same SHA-256 hash, stored once, referenced many times.
- Sync = push metadata via WebSocket, other devices download only missing chunks.
TL;DR: Split files into ~4MB chunks. SHA-256 per chunk enables content-addressable storage and cross-user dedup. Presigned URLs let clients upload directly to object storage — the server never touches file bytes. Resumable uploads: fail at chunk 47, resume from 48. Filesystem watcher detects changes, client uploads only changed chunks (delta sync), WebSocket pushes notifications in under 1 second. LAN sync for same-network devices. Last-writer-wins with conflicted copies for conflicts.
The Three Problems
File sync looks simple until the math gets real. Three problems shape every decision in this design.
Problem 1: Efficient transfer. The platform processes 1M file uploads per minute at peak, average file size 2MB. A user opens a 100MB spreadsheet, changes one cell (maybe 1KB of actual data), and saves. Re-uploading the entire 100MB wastes bandwidth. Multiply that by millions of users and the cost becomes unsustainable. The system must detect what actually changed and upload only those bytes.
Problem 2: Multi-device sync. 50M daily active users, each with 2-3 devices on average. That's 100-150M devices that need to stay in sync. When a user edits a file on their laptop, their phone and work desktop need to know within seconds. And "know about it" means downloading exactly the changed portions, not the whole file. Polling is out of the question — 150M devices hitting the server every 5 seconds asking "anything new?" is 30M requests/second of pure waste. The server must push changes.
Problem 3: Storage efficiency. 100PB of total storage. Without deduplication, every copy of the same file takes separate space. A company onboards 500 employees who all receive the same 50MB employee handbook PDF. That's 500 copies, 25GB, for one file. With content-addressable deduplication, the system stores one copy and 500 pointers.
Scale numbers:
- 500M registered users
- 50M DAU
- 10B files under management
- 100PB total storage
- 1M uploads/minute at peak (~16,667/sec)
- Average file size: 2MB
Requirements
Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Upload files from desktop, mobile, and web clients | P0 |
| FR-02 | Download files to any connected device | P0 |
| FR-03 | Resume interrupted uploads and downloads from the point of failure | P0 |
| FR-04 | Multi-device sync: file changes propagate to all user devices automatically | P0 |
| FR-05 | Delta sync: upload only changed portions of a file, not the entire file | P0 |
| FR-06 | Content deduplication: identical content stored once across all users | P0 |
| FR-07 | File versioning: retain previous versions for rollback | P1 |
| FR-08 | Conflict resolution: handle concurrent edits from multiple devices | P0 |
| FR-09 | Selective sync: choose which folders sync to which devices | P1 |
| FR-10 | LAN sync: transfer files directly between devices on the same network | P1 |
| FR-11 | File and folder sharing with permissions (view, edit) | P1 |
| FR-12 | Full-text search by filename, file type, and content metadata | P2 |
| FR-13 | Bandwidth throttling: user-configurable upload/download limits | P2 |
| FR-14 | File preview: thumbnails and previews for images, documents, videos | P2 |
| FR-15 | Team workspaces: shared folders with role-based access control | P1 |
Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | Sync latency (metadata propagation) | < 1 second |
| NFR-02 | File availability on other devices | < 10 seconds for files under 10MB on broadband |
| NFR-03 | Resumable upload gap tolerance | Up to 7 days between pause and resume |
| NFR-04 | Deduplication ratio | 50% storage savings across all users |
| NFR-05 | Data durability | 99.999999999% (11 nines, standard tier) |
| NFR-06 | Service availability | 99.99% (52 minutes downtime/year) |
| NFR-07 | Maximum file size | 50GB |
| NFR-08 | Concurrent sync connections | ~45M (50M DAU * ~3 devices * ~30% concurrency) |
| NFR-09 | Version retention | 180 days or 100 versions, whichever comes first |
Why Naive Approaches Fail
Before arriving at the solution, it is worth seeing why three simpler approaches fail.
Approach 1: Full-file upload. Client detects a change, uploads the entire file. A 100MB spreadsheet with a 1KB edit uploads 100MB. At 1M uploads/minute, that is bandwidth suicide. Only viable for files under ~1MB.
Approach 2: Server-side binary diff. Client uploads the full file, server computes a diff against the old version. The server needs both versions in memory — about 1GB for a 500MB file. At 1M concurrent uploads, that is 1PB of server memory. Worse: routing all bytes through the server kills presigned URLs — the optimization that lets clients upload directly to object storage without the server touching file bytes. Verdict: impractical at scale.
Approach 3: Fixed-size chunking. Split the file into 4MB blocks at fixed byte positions. Hash each. Upload only blocks with changed hashes. Works for appends and in-place edits. Breaks on insertions:
Original (20MB, 5 chunks of 4MB each):
Position: [0-4MB] [4-8MB] [8-12MB] [12-16MB] [16-20MB]
Chunks: [C1:h_a] [C2:h_b] [C3:h_c] [C4:h_d] [C5:h_e]
After inserting 1KB at byte 100:
Position: [0-4MB] [4-8MB] [8-12MB] [12-16MB] [16-20MB+1KB]
Chunks: [C1':h_f] [C2':h_g] [C3':h_h] [C4':h_i] [C5':h_j]
Every hash changed. The 1KB insertion shifted all subsequent bytes. Fixed chunking degrades to full upload for mid-file insertions — which text editors, spreadsheets, and document processors do constantly.
What is needed is a chunking strategy where boundaries do not shift on insertion. Two solutions exist: content-defined chunking (CDC) and rsync-style delta sync. Both rely on a shared foundational concept.
The Two-Hash Foundation
Problem: Scanning a file for chunk boundaries or matching content requires checking every byte position. Cryptographic hashes are too expensive for this. But cheap hashes produce false matches.
Simple example: Imagine scanning a 100MB file byte-by-byte. At each position, you need to answer: "Is this a chunk boundary?" or "Does this data match something I've seen before?" SHA-256 at every byte position would take minutes. A fast rolling hash takes milliseconds — but it sometimes says "match" when the data is different.
Mental model: Think of a metal detector (weak hash) and a gemologist (strong hash). The metal detector scans the entire beach quickly — it beeps at every metal object, including bottle caps. The gemologist examines only the beeps to determine which are actually gold. You would never hire the gemologist to scan the beach, and you would never trust the metal detector to authenticate gold.
Every file scanning algorithm in this design uses two hashes with different roles:
| Dimension | Weak hash (Rabin / rolling) | Strong hash (SHA-256) |
|---|---|---|
| Speed | O(1) per byte position | Expensive (processes entire block) |
| Used where | Every byte position in the file | Only after a weak hash match |
| In CDC | Find chunk boundaries (hash % TARGET == 0) | Compute chunk identity for dedup and storage |
| In rsync | Find candidate block matches in old file | Verify the match is real, not a collision |
| Collisions | Yes — fast but imprecise | Cryptographically collision-resistant |
# The two-tier pattern used by both CDC and rsync
for each byte_position in file:
    weak = rolling_hash_update(weak, outgoing_byte, incoming_byte)  # O(1)
    if weak_hash_matches(weak):            # Fast check at every position
        strong = sha256(block_data)        # Expensive, but rare
        if strong_hash_matches(strong):    # Confirms real match
            # CDC: declare chunk boundary
            # rsync: emit COPY instruction

The most common weak hash is the Rabin fingerprint: a polynomial rolling hash that treats bytes as coefficients over a Galois field. What matters for the design: O(1) sliding update, uniformly distributed output, and deterministic behavior (same bytes always produce the same hash and the same reproducible boundaries).
Key distinction: Rabin finds boundaries and candidate matches. SHA-256 identifies content. Rabin is never used for dedup or storage addressing — that is SHA-256's job.
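The O(1) sliding update is the load-bearing piece of this pattern. Below is a minimal sketch in Python using a Rabin-Karp style polynomial hash as a stand-in for a true Rabin fingerprint (which works over GF(2) polynomials); the constants `BASE`, `MOD`, and the window size are illustrative, not production values:

```python
# Rabin-Karp style polynomial rolling hash (a stand-in for a true Rabin
# fingerprint). Illustrative constants, not production parameters.
BASE = 257
MOD = (1 << 61) - 1   # large prime modulus

def hash_window(data: bytes) -> int:
    """Hash a window from scratch -- O(window size)."""
    h = 0
    for b in data:
        h = (h * BASE + b) % MOD
    return h

def slide(h: int, outgoing: int, incoming: int, window_size: int) -> int:
    """Move the window one byte to the right in O(1)."""
    top = pow(BASE, window_size - 1, MOD)   # weight of the byte leaving the window
    h = (h - outgoing * top) % MOD          # remove the outgoing byte
    return (h * BASE + incoming) % MOD      # shift and add the incoming byte

data = b"the quick brown fox jumps over the lazy dog"
W = 8
h = hash_window(data[:W])
for i in range(W, len(data)):
    h = slide(h, data[i - W], data[i], W)
    # The O(1) update agrees with a from-scratch hash at every position.
    assert h == hash_window(data[i - W + 1 : i + 1])
```

The loop demonstrates the invariant that makes byte-by-byte scanning affordable: sliding is a constant-time update, yet it always equals the full recomputation.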
Content-Defined Chunking (CDC)
Problem: Fixed chunk boundaries shift when bytes are inserted, causing all downstream hashes to change. How do you chunk a file so that a small edit only affects the chunk containing the edit?
Simple example: A 12MB file is split into three ~4MB chunks. You insert 1KB in the first chunk. With fixed boundaries, all three chunks change. With content-defined boundaries, only the first chunk changes — because the boundaries are determined by the content near each split point, and the content at those split points did not change.
Mental model: Imagine cutting a rope wherever you find a red bead. The positions of the red beads depend on the rope's content, not on fixed measurements. If you splice new rope into the middle, only the section between the nearest two red beads changes. All other sections remain identical.
How it works:
- Slide a 48-byte window across the file, one byte at a time.
- Compute a Rabin fingerprint at each position. The rolling hash updates in O(1).
- Declare a boundary when rabin_hash % TARGET == 0. TARGET controls average chunk size. For ~4MB chunks, roughly 1 in 4,194,304 positions triggers a boundary. Min/max constraints (2MB–8MB) prevent pathological sizes.
- SHA-256 hash each resulting chunk. This hash becomes the chunk's permanent identity, used for dedup, storage addressing, and integrity verification.
Pseudocode:
WINDOW_SIZE = 48
TARGET = 4 * 1024 * 1024      # ~4MB average chunk
MIN_CHUNK = 2 * 1024 * 1024   # 2MB minimum
MAX_CHUNK = 8 * 1024 * 1024   # 8MB maximum

def content_defined_chunking(file_bytes):
    chunks = []
    chunk_start = 0
    window = file_bytes[0:WINDOW_SIZE]
    fingerprint = rabin_init(window)
    i = WINDOW_SIZE
    while i < len(file_bytes):
        outgoing = file_bytes[i - WINDOW_SIZE]
        incoming = file_bytes[i]
        fingerprint = rabin_slide(fingerprint, outgoing, incoming)
        chunk_length = i - chunk_start
        is_boundary = (fingerprint % TARGET == 0 and chunk_length >= MIN_CHUNK)
        is_max = (chunk_length >= MAX_CHUNK)
        if is_boundary or is_max:
            chunk_data = file_bytes[chunk_start:i]
            chunks.append({
                'data': chunk_data,
                'hash': sha256(chunk_data),
                'offset': chunk_start,
                'size': len(chunk_data)
            })
            chunk_start = i
        i += 1
    if chunk_start < len(file_bytes):
        chunk_data = file_bytes[chunk_start:]
        chunks.append({
            'data': chunk_data,
            'hash': sha256(chunk_data),
            'offset': chunk_start,
            'size': len(chunk_data)
        })
    return chunks

In practice, the file is streamed; memory usage is constant regardless of file size: just the 48-byte window, the current chunk buffer (max 8MB), and the hash state.
Why it works — concrete example:
Original file, CDC produces 3 chunks:
[----Chunk A (3.8MB)----][----Chunk B (4.2MB)----][----Chunk C (4.0MB)----]
hash: h_a hash: h_b hash: h_c
Insert 1KB in the middle of Chunk A:
[----Chunk A' (3.8MB)---][----Chunk B (4.2MB)----][----Chunk C (4.0MB)----]
hash: h_a' hash: h_b hash: h_c
Only Chunk A changed. The boundary between A and B is determined by the content at that point, which did not change. Chunks B and C keep the same hash. The client uploads one ~4MB chunk instead of the entire 12MB file.
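The boundary-stability claim can be checked with a toy chunker. The sketch below uses demo-scale parameters (8-byte window, ~64-byte chunks, SHA-256 of the window as a stand-in for the rolling Rabin fingerprint); it is illustrative, not the production algorithm:

```python
import hashlib
import random

# Toy content-defined chunker. Demo-scale parameters; production uses a
# rolling Rabin fingerprint, a 48-byte window, and ~4MB target chunks.
W, TARGET, MIN_CHUNK = 8, 64, 16

def chunk_hashes(data: bytes):
    """Split on content-defined boundaries, return each chunk's hash."""
    chunks, start = [], 0
    for i in range(W, len(data) + 1):
        window = data[i - W:i]
        h = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if h % TARGET == 0 and i - start >= MIN_CHUNK:   # boundary rule
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return [hashlib.sha256(c).hexdigest()[:8] for c in chunks]

random.seed(7)
original = bytes(random.randrange(256) for _ in range(4000))
edited = original[:100] + b"INSERTED" + original[100:]   # insert 8 bytes early

old, new = chunk_hashes(original), chunk_hashes(edited)
changed = set(new) - set(old)
print(f"{len(old)} chunks before, {len(new)} after, {len(changed)} changed")
```

Because boundaries depend only on the bytes near each split point, the chunks downstream of the insertion keep their hashes; only the chunk (or two) around the edit produces new hashes.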
CDC enables cross-user deduplication. Because chunk boundaries are content-defined and SHA-256 is deterministic, two users uploading the same file produce identical chunk hashes. The system stores one copy and two pointers.
rsync-Style Delta Sync
Problem: CDC reduces upload size to whole chunks (~4MB). But a 1KB edit still uploads an entire ~4MB chunk. Can we transmit only the actual changed bytes?
Simple example: You have a 100MB file. You change 1KB in the middle. CDC uploads ~4MB (the affected chunk). rsync-style delta sync uploads ~1KB — the literal changed bytes — plus lightweight COPY instructions telling the receiver "reuse these parts from the old file."
Mental model: Imagine you have two nearly identical books. Instead of sending the entire new book, you send instructions: "Copy pages 1-50 from the old book. Here are 2 new pages. Copy pages 53-200 from the old book." The instructions are tiny compared to shipping a whole book.
Production file sync systems use this approach for transfers: fixed-size blocks with rsync-style rolling hash scanning. The core idea is position-independent matching — finding reusable data anywhere in the old file, regardless of alignment.
Terminology: Receiver = side with the old file. Sender = side with the new file. When uploading: client = Sender, server = Receiver. When downloading: roles reverse.
How it works:
- Receiver splits the old file into fixed-size blocks (~4MB). For each block, compute a weak hash (rolling) and a strong hash (SHA-256). Send the (weak, strong) pairs to the sender.
- Sender scans the new file with a rolling hash, one byte at a time. At each position: does this rolling hash match any of the receiver's block hashes?
- Match -> COPY. Weak hash matches? Verify with strong hash. If confirmed, emit COPY block_N: the receiver reuses that block from the old file.
- No match -> INSERT. Bytes between matches are new data. Emit INSERT "new_bytes": literal bytes the receiver must use.
- Receiver reconstructs the new file by executing COPY/INSERT instructions in order.
Concrete example:
Old file (receiver has this):
Block 0: [ABCD] -> (weak0, strong0)
Block 1: [EFGH] -> (weak1, strong1)
New file (sender has this):
XXABCDEFGH
Sender scans with rolling hash:
Position 0: [XXAB] -> no match
Position 1: [XABC] -> no match
Position 2: [ABCD] -> weak match! -> verify strong hash -> confirmed (Block 0)
...
Position 6: [EFGH] -> weak match! -> verify strong hash -> confirmed (Block 1)
The delta payload sent over the network:
[
  { "type": "INSERT", "data": "XX" },
  { "type": "COPY", "block_index": 0 },
  { "type": "COPY", "block_index": 1 }
]

Only INSERT carries actual bytes. COPY is a lightweight pointer. This achieves byte-level delta granularity: a 1KB edit transmits ~1KB of INSERT data plus COPY instructions, not an entire ~4MB chunk.
The key insight: matching is position-independent. Block 0 was at position 0 in the old file but appears at position 2 in the new file. rsync finds it anyway because the rolling hash scans every position. This is what solves the insertion problem that breaks fixed-size chunking.
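The signature/scan/reconstruct loop can be sketched end-to-end. This is a demo-scale version under stated simplifications: the weak hash is a plain byte sum recomputed per window (real rsync uses a rolling Adler-32-like checksum updated in O(1)), and the block size is 4 bytes instead of ~4MB:

```python
import hashlib

# Minimal rsync-style delta: signatures -> scan -> COPY/INSERT -> reconstruct.
# Simplified: byte-sum weak hash, tiny 4-byte blocks (production: ~4MB).
BLOCK = 4

def signatures(old: bytes):
    """Receiver side: (weak, strong) hashes per fixed-size block of the old file."""
    sigs = {}
    for idx in range(len(old) // BLOCK):
        block = old[idx * BLOCK:(idx + 1) * BLOCK]
        sigs.setdefault(sum(block), []).append((idx, hashlib.sha256(block).digest()))
    return sigs

def delta(new: bytes, sigs):
    """Sender side: scan the new file, emit COPY for matches, INSERT otherwise."""
    ops, literal, i = [], bytearray(), 0
    while i + BLOCK <= len(new):
        window = new[i:i + BLOCK]
        match = None
        for idx, strong in sigs.get(sum(window), []):   # weak hit -> verify strong
            if hashlib.sha256(window).digest() == strong:
                match = idx
                break
        if match is not None:
            if literal:
                ops.append(("INSERT", bytes(literal)))
                literal = bytearray()
            ops.append(("COPY", match))
            i += BLOCK
        else:
            literal.append(new[i])   # unmatched byte becomes literal data
            i += 1
    literal.extend(new[i:])          # trailing bytes shorter than a block
    if literal:
        ops.append(("INSERT", bytes(literal)))
    return ops

def reconstruct(old: bytes, ops):
    """Receiver rebuilds the new file by executing COPY/INSERT in order."""
    out = bytearray()
    for op, arg in ops:
        out.extend(old[arg * BLOCK:(arg + 1) * BLOCK] if op == "COPY" else arg)
    return bytes(out)

old, new = b"ABCDEFGH", b"XXABCDEFGH"
ops = delta(new, signatures(old))
assert reconstruct(old, ops) == new
print(ops)   # INSERT "XX", then COPY block 0 and COPY block 1
```

Note how block 0 is found at position 2 of the new file even though it sat at position 0 of the old file: the scan tries every byte offset, which is exactly the position-independence described above.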
Tradeoffs:
- CPU cost: Scans the entire file byte-by-byte with a rolling hash. For a 1GB file: ~1 billion hash computations. More CPU than fixed chunking, comparable to CDC.
- No cross-user dedup. rsync compares old vs new versions of the same file. It does not deduplicate across different users' files. Separate dedup mechanisms handle cross-user storage efficiency.
CDC vs rsync: Choosing the Right Model
Both approaches solve the insertion problem. They differ in what they optimize for.
CDC asks: "Do you already have this exact chunk?" rsync asks: "Can I find this data anywhere in your file?"
CDC dedup model — hash-based, global:
Client: file -> CDC -> chunks -> hashes -> [h_a, h_b, h_c]
Server: h_a exists (another user uploaded it) -> skip
h_b exists (previous version) -> skip
h_c is new -> upload needed
Result: upload only h_c. Dedup works across ALL users and files.
rsync dedup model — content-matching, local:
Receiver sends block hashes from old file
Sender scans new file, finds matches -> COPY / INSERT
Result: minimal bytes transferred. But only compares old vs new version of the SAME file.
Concrete example showing the difference:
Alice uploads Q1-report.pptx (100MB)
CDC chunks: [slides-A][company-logo][chart-data][slides-B]
All new -> 100MB uploaded.
Bob uploads Q2-report.pptx (120MB) -- different report, same logo and chart
CDC chunks: [slides-X][company-logo][chart-data][slides-Y][slides-Z]
Server: company-logo and chart-data already exist (from Alice) -> skip
Upload: only 72MB instead of 120MB. Cross-user dedup saved 48MB.
rsync: no equivalent. rsync compares Bob's file against Bob's previous
version of that file -- not against Alice's file. No cross-user dedup.
Comparison table:
| Dimension | Fixed Chunking | CDC | rsync |
|---|---|---|---|
| Boundary depends on position? | Yes | No (content-defined) | No (position-independent scan) |
| Handles insertions? | No — all chunks shift | Yes — only affected chunk changes | Yes — finds blocks at new positions |
| Delta granularity | Whole chunks (~4MB) | Whole chunks (~4MB) | Byte-level (actual changed bytes) |
| Dedup across users/files? | Yes (hash identity) | Yes (hash identity) | No (same-file pairs only) |
| CPU cost | Low | Medium (Rabin at every byte) | High (rolling hash at every byte) |
| Server touches file bytes? | No | No | Sender processes file for delta |
Bandwidth comparison for a 100MB file:
| Scenario | Full Upload | Fixed Chunks | CDC | rsync |
|---|---|---|---|---|
| 1KB edit at start | 100MB | 100MB (all shifted) | ~4MB (1 chunk) | ~1KB |
| 1KB edit in middle | 100MB | ~50MB (half shifted) | ~4MB (1 chunk) | ~1KB |
| 1KB edit at end | 100MB | 4MB (last chunk) | ~4MB (1 chunk) | ~1KB |
| 1KB append | 100MB | 4MB (new last chunk) | ~4MB (1 chunk) | ~1KB |
| 50MB overwrite | 100MB | ~52MB | ~52MB | ~50MB |
| No change | 100MB | 0MB (hashes match) | 0MB (hashes match) | 0MB |
This design uses CDC as the primary model because: hash-based dedup maps cleanly to content-addressable object storage, presigned URLs work directly with CDC chunks, and global cross-user dedup saves 50%+ storage. Production systems combine rsync-style delta sync for transfers with chunk-based dedup for storage — the best of both worlds.
Chunk Size Tuning
The TARGET parameter controls average chunk size. This is a critical tuning knob.
Smaller chunks (1MB average): Better dedup and delta granularity. More metadata per file (100 records for a 100MB file), more objects to manage, more SHA-256 computations.
Larger chunks (16MB average): Less metadata, fewer objects. Worse dedup and delta granularity. Larger minimum upload size.
The sweet spot: 4MB average. A 100MB file produces ~25 chunks (manageable metadata). A small edit changes 1-2 chunks (4-8MB delta — acceptable on broadband). With min 2MB and max 8MB constraints, pathological cases stay bounded. At 10B files averaging 3 chunks each: ~30B chunk references, ~15B unique chunks after 50% dedup.
| Tradeoff | Impact |
|---|---|
| CPU cost per file | Rabin rolling hash at every byte, plus SHA-256 once per chunk. ~2-3 seconds for a 1GB file on a modern laptop. |
| More metadata per file | 25 chunk records for 100MB. More rows in version_chunks. |
| Minimum effective edit size | A 1-byte change still produces one new ~4MB chunk. |
| Complexity | CDC boundary detection, min/max enforcement, hash computation add client-side code. |
These costs are worth it. The alternative (full-file upload) wastes orders of magnitude more bandwidth at scale.
Technology Selection
| Component | Technology | Role |
|---|---|---|
| Chunk storage | Object Storage | Content-addressable blob storage, 11 nines durability, multipart upload |
| File metadata | PostgreSQL + Citus | Users, files, versions, ACLs. Sharded by user_id for horizontal scale |
| Dedup cache | Redis Cluster | Bloom filter for chunk existence checks, upload session state, connection registry |
| Sync events | Kafka (RF=3) | Sync event distribution, upload completion events, dedup events |
| File search | Elasticsearch | Search by filename, type, content metadata |
| CDN | CloudFront | Edge caching for frequently downloaded chunks |
| Real-time sync | WebSocket Gateway | Push sync notifications to connected devices |
| LAN discovery | mDNS (Bonjour/Avahi) | Discover devices on the same local network for P2P sync |
Why PostgreSQL + Citus? File metadata is inherently relational: files belong to users, live in folder hierarchies, have versions, and each version references an ordered list of chunks. ACLs need transactional consistency. A single PostgreSQL instance cannot handle 10B files and 50B version-chunk mappings. Citus gives horizontal sharding with full SQL and ACID guarantees. Since almost all queries are user-scoped, Citus co-locates everything on one shard per user: fast, local, transactional.
| Dimension | PostgreSQL + Citus | DynamoDB | MongoDB | CockroachDB |
|---|---|---|---|---|
| Data model fit | Relational. File trees, ACLs, versions. | Key-value. Hierarchy requires denormalization. | Document. Decent for metadata, poor for ACL joins. | Relational, distributed. |
| Sharding | Automatic by user_id. Co-locates user data. | Partition key required. Cross-partition expensive. | Manual, complex at scale. | Automatic range-based. |
| Transactions | Full ACID, even across co-located tables. | Single-partition only. | Multi-document with caveats. | Full distributed (Spanner-style). |
| Cost | Open source. Commodity hardware. | Expensive at this write volume. | Open source but operationally complex. | Higher resource overhead than Citus. |
CockroachDB is the closest competitor — also distributed SQL. But its consensus-based writes (Raft per range) add latency compared to Citus's single-node writes for co-located data.
Architecture Overview
The architecture has three planes: Control plane handles metadata and orchestration. Data plane stores chunk bytes. Event plane delivers async notifications. The invariant that ties it all together: a file is always an ordered list of chunk hashes. Everything else is machinery to keep that list consistent across devices.
The Upload Path
Every step below must succeed for the upload to complete. Three phases: decision (which chunks need uploading), data transfer (chunk upload via presigned URLs), and commit (metadata write).
- User edits a file on their laptop
- Desktop client's filesystem watcher detects the change
- Client re-chunks the file using CDC, producing a list of chunk hashes
- Client diffs the new chunk list against its local SQLite DB to find new chunks
- Client calls /upload/init with the list of new chunk hashes
- Server runs dedup: checks each hash against the Bloom filter (Redis), then PostgreSQL. Already-existing chunks are skipped.
- Server returns presigned URLs only for chunks that don't already exist
- Client uploads new chunks directly to object storage via presigned URLs (file bytes never touch the server)
- Client calls /upload/finalize. Server writes the new file version (metadata + chunk references) to PostgreSQL. The Sync Service publishes an event to Kafka.
- Kafka delivers the event to the WebSocket Gateway, which pushes a notification to all other devices owned by this user
- Other devices pull the updated metadata and download only the changed chunks
File bytes bypass the entire server fleet. At petabyte-scale upload ingress, routing data through API servers would require thousands of proxy nodes. Presigned URLs eliminate that.
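The decision phase of the upload path might look like the following sketch. The helper names, the in-memory stand-ins, and the URL format are illustrative assumptions, not a real object-store API:

```python
# Sketch of the /upload/init decision step: given the client's chunk hashes,
# return upload URLs only for chunks the system does not already store.
# known_chunks, chunk_exists, and presign are illustrative stand-ins.

known_chunks = {"h_b"}   # stands in for the Bloom filter + chunks table

def chunk_exists(chunk_hash: str) -> bool:
    # Production: Bloom filter first (fast, may false-positive), then
    # PostgreSQL to confirm. Here: a plain set lookup.
    return chunk_hash in known_chunks

def presign(chunk_hash: str) -> str:
    # Production: an object-store presigned PUT URL scoped to the
    # content-addressed key. Here: a placeholder string.
    return f"https://storage.example/chunks/{chunk_hash[:2]}/{chunk_hash}?sig=..."

def upload_init(chunk_hashes: list[str]) -> dict:
    missing = [h for h in chunk_hashes if not chunk_exists(h)]
    return {
        "skip": [h for h in chunk_hashes if h not in missing],   # dedup hits
        "upload": {h: presign(h) for h in missing},              # client PUTs these
    }

plan = upload_init(["h_a", "h_b", "h_c"])
print(plan["skip"])   # h_b is already stored, so no upload URL is issued for it
```

The shape of the response is the point: the server hands out capabilities (URLs) for missing chunks only, and never receives file bytes itself.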
Critical path vs async path:
| Critical path (must succeed) | Async path (eventual) |
|---|---|
| /upload/init (dedup check) | Kafka event delivery |
| Chunk upload to object storage | WebSocket notification to other devices |
| /upload/finalize (metadata write) | Search indexing (Elasticsearch) |
| | Thumbnail generation |
The Sync Path
After a successful upload, the Sync Service publishes an event to Kafka. This path is decoupled from the upload — if Kafka is slow, sync is delayed but no data is lost.
Full Architecture
Layer Responsibilities
Layer 1: Edge. The L7 load balancer terminates TLS and routes requests. The API Gateway handles authentication (JWT validation), rate limiting (per-user, per-IP), and request routing. WebSocket connections are sticky-routed to the same gateway pod.
Layer 2: Services. Four stateless services:
- Sync Service: Handles file metadata CRUD, version creation, cursor-based change feeds, and folder operations. Writes to PostgreSQL, publishes events to Kafka.
- Upload Orchestrator: Manages the upload lifecycle — calls Dedup Service for hash checks, generates presigned URLs for new chunks, tracks progress in Redis. On finalize, delegates metadata write to the Sync Service (which owns metadata and publishes the Kafka event).
- WebSocket Gateway: Maintains persistent connections with all online devices. Consumes sync events from Kafka and pushes them to the right devices. Registers connections in Redis so other services can look up which pod holds which user.
- Dedup Service: Called during /upload/init before any data upload. Handles chunk existence checks (Bloom filter in Redis + PostgreSQL verification), chunk insertion into the chunks table, and ref_count management. This is on the critical path.
Layer 3: Message Bus. Kafka with replication factor 3. Three topic families: sync-events for file change notifications, upload-complete for post-upload processing, and dedup-events for async verification. Kafka buffers upload spikes, guarantees ordering per user (partitioned by user_id), and enables fan-out: one event feeds search indexing, thumbnail generation, and sync notifications independently.
Layer 4: Storage.
- Object Storage stores chunk bytes. Content-addressable: the key is the SHA-256 hash (e.g., chunks/a1/b2/a1b2c3d4e5f6...). Same content always maps to the same key.
- PostgreSQL + Citus stores all metadata: users, files, folders, versions, chunk lists, ACLs. Sharded by user_id. Source of truth for "what files exist and what chunks they contain."
- Redis Cluster stores ephemeral state: (1) Bloom filter for chunk existence checks, (2) upload session progress for resumable uploads, (3) WebSocket connection registry mapping user_id to gateway pod.
- Elasticsearch indexes file metadata for search.
Why both local diff AND server dedup? Local diff (client compares chunk hashes against its own SQLite DB) reduces upload volume within a single file across versions. Server dedup (Bloom filter + PostgreSQL) eliminates duplicates across users and files. A 50MB PDF uploaded by 500 employees: local diff does not help (each client sees it as new). Server dedup stores it once.
Why Bloom filter AND PostgreSQL for dedup? The Bloom filter is fast but has false positives. PostgreSQL is accurate but slow. The Bloom filter handles 99%+ of checks; PostgreSQL verifies the rest.
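A toy version of that two-tier check, with a bit-array Bloom filter standing in for the Redis filter and a set standing in for the chunks table (sizes here are demo-scale, not the multi-gigabyte production filter):

```python
import hashlib

# Toy Bloom filter: a negative answer is definitive (skip the DB round trip),
# a positive answer is only "maybe" and must be verified against the source
# of truth. M and K are demo-scale parameters.
M, K = 8192, 10   # bits in the filter, hash functions per element

def _positions(chunk_hash: bytes):
    """Derive K bit positions from the chunk hash."""
    for k in range(K):
        digest = hashlib.sha256(bytes([k]) + chunk_hash).digest()
        yield int.from_bytes(digest[:8], "big") % M

class Bloom:
    def __init__(self):
        self.bits = 0
    def add(self, h: bytes):
        for p in _positions(h):
            self.bits |= 1 << p
    def might_contain(self, h: bytes) -> bool:
        return all(self.bits >> p & 1 for p in _positions(h))

bloom, db = Bloom(), set()           # db stands in for the chunks table

def chunk_exists(h: bytes) -> bool:
    if not bloom.might_contain(h):   # definite miss: no DB query needed
        return False
    return h in db                   # "maybe": verify against the database

h1 = hashlib.sha256(b"chunk-1").digest()
bloom.add(h1)
db.add(h1)
assert chunk_exists(h1)
assert not chunk_exists(hashlib.sha256(b"chunk-2").digest())
```

The asymmetry is what makes the design work: the filter can never say "absent" for a chunk that exists, so skipping the database on a miss is always safe.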
Why Sync Service AND Upload Orchestrator? Upload Orchestrator manages the upload lifecycle (presigned URLs, session tracking, resume). Sync Service owns metadata consistency (file versions, chunk references, cursor-based change feeds). On finalize, the Orchestrator delegates the metadata write to the Sync Service. Separation keeps each service focused and independently scalable.
Scale Estimation
Upload Throughput
Peak uploads: 1M files/minute = 16,667 files/sec
Avg chunks/file: 3 (many small files are 1 chunk, large files bring up the average)
Total chunks/sec: 16,667 * 3 = 50,000 chunks/sec
Dedup hit rate: 50% (half the chunks already exist in the system)
New store PUTs/sec: 50,000 * 0.5 = 25,000 PUT operations/sec
Avg chunk size: 4MB
Object storage ingress: 25,000 * 4MB = 100GB/sec peak
25K PUTs per second is within the object store's capabilities (3,500 PUT/sec per prefix, so we use 256 hash-based prefixes for headroom).
Storage
Total files: 10B
Average file size: 2MB
Gross storage: 10B * 2MB = 20PB
Average versions: 5 per file
Gross with versions: 20PB * 5 = 100PB
Dedup ratio: 50%
Actual object storage: 100PB * 0.5 = 50PB
Unique chunks: 50PB / 4MB = 12.5 billion unique chunks
50PB in Standard tier. Tiered storage (intelligent tiering, cold archive for old versions) reduces cost significantly.
Metadata (PostgreSQL + Citus)
Chunk record: hash(32B) + size(4B) + bucket(8B) + ref_count(4B) + timestamps(8B) = ~56 bytes
12.5B chunks * 56B = 700GB
File record: id(16B) + user_id(16B) + parent_id(16B) + name(256B) + metadata(128B)
+ timestamps(16B) + indexes(~116B) = ~564 bytes
10B files * 564B = 5.6TB
Version record: id(16B) + file_id(16B) + user_id(16B) + version(4B) + size(8B)
+ checksum(32B) + device_id(16B) + chunk_count(4B) + timestamps(16B)
+ indexes(~72B) = ~200 bytes
10B files * 5 versions = 50B versions * 200B = 10TB
Version-chunk mapping: version_id(16B) + user_id(16B) + chunk_index(4B)
+ chunk_hash(32B) + chunk_size(4B) = ~72 bytes
50B versions * 3 chunks avg = 150B rows * 72B = 10.8TB
Total PostgreSQL: 700GB + 5.6TB + 10TB + 10.8TB = 27TB
27TB sharded across 50+ Citus worker nodes. Each node holds ~500GB — within a single PostgreSQL instance's comfort zone with NVMe storage.
Bandwidth
Upload to object storage: 100GB/sec peak
Download from object storage: ~200GB/sec peak (downloads typically 2x uploads)
CDN hit rate: 60% for popular files
CDN offload: 200GB/sec * 0.6 = 120GB/sec served from CDN
Object storage egress: 200GB/sec * 0.4 = 80GB/sec
WebSocket Connections
DAU: 50M
Devices per user: 2.5 average
Online concurrency: 30% at peak
Connections: 50M * 2.5 * 0.3 = 37.5M ~ 40M concurrent connections
Memory per conn: ~15KB (buffers, session state, send queue)
Total memory: 40M * 15KB = 600GB
Connections per pod: 500K (practical limit per gateway instance)
Gateway pods: 40M / 500K = 80 pods
80 WebSocket gateway pods, each holding 500K connections. Memory-bound, not CPU-bound.
Redis Cluster
Bloom filter: 12.5B chunks, 0.1% false positive rate
15 bits per element = 12.5B * 15 / 8 = ~23.4GB
Active upload sessions: 100K concurrent * 2KB each = 200MB
Connection registry: 40M entries * 32B each = 1.28GB
Rate limit counters: 5M * 16B = 80MB
Total Redis memory: ~25GB
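The ~15 bits per element above follows from the standard Bloom filter sizing formula, m/n = -ln(p) / (ln 2)^2, with optimal hash count k = (m/n) * ln 2. A quick check of the figures (n and p as in the estimate):

```python
import math

# Standard Bloom filter sizing check for the numbers used above.
n = 12.5e9   # unique chunks
p = 0.001    # 0.1% false positive rate

bits_per_elem = -math.log(p) / math.log(2) ** 2   # optimal m/n
k = bits_per_elem * math.log(2)                   # optimal hash count
total_gb = n * 15 / 8 / 1e9                       # using the rounded 15 bits

print(f"{bits_per_elem:.1f} bits/elem, k~{k:.0f} hashes, {total_gb:.1f} GB")
# -> 14.4 bits/elem, k~10 hashes, 23.4 GB
```

The optimum is ~14.4 bits per element; rounding up to 15 bits gives a small margin and lands on the ~23.4GB quoted above.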
Summary
| Resource | Quantity | Notes |
|---|---|---|
| Object storage | 50PB | Content-addressable chunks, deduplicated |
| Object store PUTs/sec | 25K | After 50% dedup |
| Object storage ingress | 100GB/sec | Peak upload bandwidth |
| PostgreSQL storage | 27TB | Sharded across 50+ Citus nodes |
| Redis memory | ~25GB | Bloom filter + sessions + connections |
| WebSocket pods | 80 | 500K connections each |
| Kafka throughput | 50K events/sec | Sync events, upload completions |
| Elasticsearch | ~2TB | Filename and metadata index |
Data Model
Users Table
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT UNIQUE NOT NULL,
display_name TEXT NOT NULL,
quota_bytes BIGINT NOT NULL DEFAULT 2147483648, -- 2GB free tier
storage_used BIGINT NOT NULL DEFAULT 0,
plan TEXT NOT NULL DEFAULT 'free',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Citus distribution
SELECT create_distributed_table('users', 'id');

File Metadata Table
CREATE TABLE file_metadata (
id UUID NOT NULL DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
parent_id UUID, -- NULL for root folder
name TEXT NOT NULL,
is_folder BOOLEAN NOT NULL DEFAULT false,
latest_version INTEGER NOT NULL DEFAULT 0,
size_bytes BIGINT NOT NULL DEFAULT 0,
mime_type TEXT,
checksum TEXT, -- SHA-256 of full file content
is_deleted BOOLEAN NOT NULL DEFAULT false,
deleted_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, id) -- Citus requires dist key in PK
);
SELECT create_distributed_table('file_metadata', 'user_id');
CREATE INDEX idx_file_parent ON file_metadata(user_id, parent_id);
CREATE UNIQUE INDEX idx_file_name ON file_metadata(user_id, parent_id, name)
  WHERE is_deleted = false;

File Versions Table
CREATE TABLE file_versions (
id UUID NOT NULL DEFAULT gen_random_uuid(),
file_id UUID NOT NULL,
user_id UUID NOT NULL,
version_number INTEGER NOT NULL,
size_bytes BIGINT NOT NULL,
checksum TEXT NOT NULL, -- SHA-256 of full file (Merkle root)
device_id UUID NOT NULL,
chunk_count INTEGER NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, id),
FOREIGN KEY (user_id, file_id) REFERENCES file_metadata(user_id, id)
);
SELECT create_distributed_table('file_versions', 'user_id',
colocate_with => 'file_metadata');
CREATE UNIQUE INDEX idx_version_number
ON file_versions(user_id, file_id, version_number);
Version Chunks Table
CREATE TABLE version_chunks (
version_id UUID NOT NULL,
user_id UUID NOT NULL,
chunk_index INTEGER NOT NULL, -- 0-based order within the file
chunk_hash BYTEA NOT NULL, -- SHA-256 (32 bytes)
chunk_size INTEGER NOT NULL,
storage_key TEXT NOT NULL, -- denormalized (avoids cross-shard join)
PRIMARY KEY (user_id, version_id, chunk_index),
FOREIGN KEY (user_id, version_id) REFERENCES file_versions(user_id, id)
);
SELECT create_distributed_table('version_chunks', 'user_id',
colocate_with => 'file_metadata');
Byte offsets are not stored explicitly. The file is reconstructed by concatenating chunks in chunk_index order. Each chunk knows its own size, so byte offsets can be derived at read time if needed for range reads.
Chunks Table (Global, Not User-Sharded)
CREATE TABLE chunks (
hash BYTEA PRIMARY KEY, -- SHA-256 (32 bytes)
size_bytes INTEGER NOT NULL,
storage_bucket TEXT NOT NULL,
storage_key TEXT NOT NULL,
ref_count INTEGER NOT NULL DEFAULT 1,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
last_referenced TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Sharded by hash (NOT a reference table -- 700GB replicated to every node is impractical)
SELECT create_distributed_table('chunks', 'hash');
Why cross-shard joins are expensive: The chunks table is sharded by hash, while version_chunks is sharded by user_id. A JOIN between them fans out to multiple shards, adding network round trips and coordination overhead.
How cross-shard joins are avoided:
- Bloom filter + point lookups for dedup. The Bloom filter handles 99%+ of chunk existence checks. The chunks table is queried only for verification and ref count management — point lookups by hash (single-shard), not joins.
- Denormalized storage_key in version_chunks. When downloading a file, the server needs the storage key for each chunk to generate presigned URLs. Instead of joining across shards, storage_key is stored directly in version_chunks. One extra column (~50 bytes per row) eliminates a cross-shard join on every download. Written once during /upload/complete, never changes.
- Co-location by user_id. All user-scoped tables are co-located on the same Citus shard. Queries within a user's data are single-shard operations — fast and transactional.
The chunks table remains the source of truth for chunk existence and ref counts. But hot-path reads (download, sync) never touch it — they read denormalized data from version_chunks on the user's local shard.
Upload Sessions in Redis
{
"key": "upload:{upload_id}",
"value": {
"upload_id": "up_abc123",
"user_id": "usr_xyz789",
"file_id": "file_def456",
"storage_multipart_id": "mpu_id",
"total_chunks": 25,
"completed_chunks": [0, 1, 2, 3, 5, 6, 7],
"failed_chunks": [4],
"presigned_urls": {
"8": "https://storage.example.com/...?X-Storage-Signature=...",
"9": "https://storage.example.com/...?X-Storage-Signature=..."
},
"chunk_hashes": ["a1b2c3...", "d4e5f6...", "..."],
"chunks_to_upload": [4, 8, 9, 10, 11, 12],
"chunks_existing": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
"started_at": "2026-03-10T14:30:00Z",
"expires_at": "2026-03-17T14:30:00Z"
},
"TTL": "7 days"
}
Upload sessions live in Redis with a 7-day TTL matching the multipart upload lifecycle.
Client-Side SQLite Schema
Every desktop client maintains a local SQLite database mirroring server state for synced folders:
CREATE TABLE local_files (
path TEXT PRIMARY KEY, -- relative to sync root
file_id TEXT, -- server file ID
version_number INTEGER,
size_bytes INTEGER,
checksum TEXT, -- SHA-256 of full file
chunk_hashes TEXT, -- JSON array of chunk SHA-256 hashes
local_mtime INTEGER, -- filesystem modification time
sync_status TEXT DEFAULT 'synced', -- synced, pending_upload, pending_download,
-- conflicted, error
last_synced_at INTEGER -- epoch seconds
);
CREATE TABLE sync_queue (
id INTEGER PRIMARY KEY AUTOINCREMENT,
path TEXT NOT NULL,
action TEXT NOT NULL, -- upload, download, delete, rename, move
priority INTEGER DEFAULT 0,
created_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
last_error TEXT
);
CREATE TABLE sync_state (
key TEXT PRIMARY KEY,
value TEXT
);
-- stores: last_cursor, last_sync_time, device_id
Object key format: Chunks are stored with a key derived from their SHA-256 hash, distributed across prefixes:
chunks/{hash[0:2]}/{hash[2:4]}/{full_sha256_hash}
Example: chunks/a1/b2/a1b2c3d4e5f6...full64charhash
The two-level prefix creates 65,536 prefixes (256 * 256), distributing storage operations evenly to avoid per-prefix rate limits.
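The key derivation above is a pure function of the chunk's content, which is what makes the store content-addressable. A minimal sketch:

```python
import hashlib

def chunk_storage_key(chunk_data: bytes) -> str:
    """Derive the content-addressed object key for a chunk.

    Layout: chunks/{hash[0:2]}/{hash[2:4]}/{full_hash}, giving
    256 * 256 = 65,536 prefixes to spread PUT/GET load evenly.
    """
    h = hashlib.sha256(chunk_data).hexdigest()
    return f"chunks/{h[0:2]}/{h[2:4]}/{h}"

key = chunk_storage_key(b"hello chunk")
```

Identical content always maps to the same key, so the same function serves both upload (where to PUT) and dedup (what to check).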
API Design
Upload Flow
Initialize upload:
POST /v1/files/upload/init
Authorization: Bearer <jwt>
Content-Type: application/json
{
"path": "/Documents/report.xlsx",
"size_bytes": 104857600,
"checksum": "sha256:abc123...",
"base_version": 5,
"chunk_hashes": [
"a1b2c3d4...",
"e5f6a7b8...",
"c9d0e1f2...",
...
],
"chunk_sizes": [
4194304,
4194304,
3670016,
...
]
}
Response 200:
{
"upload_id": "up_abc123",
"chunks_needed": [0, 5, 12],
"chunks_existing": [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, ...],
"presigned_urls": {
"0": {
"url": "https://storage.example.com/chunks/a1/b2/a1b2...?...",
"expires_at": "2026-03-10T15:00:00Z"
},
"5": {
"url": "https://storage.example.com/chunks/e5/f6/e5f6...?...",
"expires_at": "2026-03-10T15:00:00Z"
},
"12": {
"url": "https://storage.example.com/chunks/c9/d0/c9d0...?...",
"expires_at": "2026-03-10T15:00:00Z"
}
}
}
The server checks all chunk hashes against the dedup system. Only truly new chunks get presigned URLs.
Complete upload:
POST /v1/files/upload/{upload_id}/complete
Authorization: Bearer <jwt>
{
"etags": {
"0": "\"etag_for_chunk_0\"",
"5": "\"etag_for_chunk_5\"",
"12": "\"etag_for_chunk_12\""
}
}
Response 200:
{
"file_id": "file_def456",
"version": 6,
"size_bytes": 104857600,
"synced_at": "2026-03-10T14:35:00Z"
}
Check progress (for resume):
GET /v1/files/upload/{upload_id}/progress
Authorization: Bearer <jwt>
Response 200:
{
"upload_id": "up_abc123",
"total_chunks": 25,
"completed_chunks": [0, 5],
"remaining_chunks": [12],
"presigned_urls": {
"12": {
"url": "https://storage.example.com/chunks/c9/d0/c9d0...?...",
"expires_at": "2026-03-10T15:15:00Z"
}
}
}
If presigned URLs have expired, the server generates fresh ones. The multipart upload stays valid for 7 days.
Download
GET /v1/files/{file_id}/download?version=6
Authorization: Bearer <jwt>
Response 200:
{
"file_id": "file_def456",
"version": 6,
"size_bytes": 104857600,
"checksum": "sha256:abc123...",
"chunks": [
{
"index": 0,
"hash": "a1b2c3d4...",
"size": 4194304,
"url": "https://cdn.example.com/chunks/a1/b2/a1b2...?..."
},
{
"index": 1,
"hash": "e5f6a7b8...",
"size": 4194304,
"url": "https://cdn.example.com/chunks/e5/f6/e5f6...?..."
}
]
}
URLs point to the CDN for caching. The client downloads chunks in parallel, verifies each against its hash, and assembles the file locally.
Sync (Cursor-Based)
POST /v1/sync/changes
Authorization: Bearer <jwt>
{
"cursor": "cursor_2026031014300000_shard42",
"limit": 500
}
Response 200:
{
"changes": [
{
"type": "file_modified",
"file_id": "file_def456",
"path": "/Documents/report.xlsx",
"version": 6,
"size_bytes": 104857600,
"checksum": "sha256:abc123...",
"modified_by_device": "device_laptop01",
"timestamp": "2026-03-10T14:35:00Z"
},
{
"type": "file_created",
"file_id": "file_ghi789",
"path": "/Photos/vacation.jpg",
"version": 1,
"size_bytes": 5242880,
"checksum": "sha256:def456...",
"modified_by_device": "device_phone01",
"timestamp": "2026-03-10T14:36:00Z"
}
],
"cursor": "cursor_2026031014360000_shard42",
"has_more": false
}
The cursor is opaque to the client. Internally it encodes a timestamp and shard identifier. The client stores it in local SQLite and uses it on reconnection to catch up on missed changes.
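The exact cursor encoding is internal, but a hypothetical codec consistent with the example strings above (a timestamp with two trailing sub-second digits plus a shard id) might look like this; a production encoding would likely be signed or base64-wrapped so clients cannot depend on the layout:

```python
from datetime import datetime, timezone

TS_FMT = "%Y%m%d%H%M%S"  # assumed layout, e.g. "cursor_2026031014300000_shard42"

def encode_cursor(ts: datetime, shard: int) -> str:
    # Two trailing zeros stand in for sub-second precision in this sketch.
    return f"cursor_{ts.strftime(TS_FMT)}00_shard{shard}"

def decode_cursor(cursor: str):
    _, raw_ts, raw_shard = cursor.split("_")
    ts = datetime.strptime(raw_ts[:-2], TS_FMT).replace(tzinfo=timezone.utc)
    return ts, int(raw_shard[len("shard"):])

ts, shard = decode_cursor("cursor_2026031014300000_shard42")
```

The shard id lets the server route the changes query straight to the user's Citus shard without a lookup.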
WebSocket Protocol
Connection handshake:
wss://ws.example.com/v1/sync/ws
Authorization: Bearer <jwt>
X-Device-Id: device_laptop01
Message types:
// Client -> Server: Subscribe to sync events
{
"type": "SUBSCRIBE",
"folders": ["/", "/Documents", "/Photos"],
"cursor": "cursor_2026031014300000_shard42"
}
// Server -> Client: File change notification
{
"type": "CHANGE",
"id": "evt_123",
"change_type": "file_modified",
"file_id": "file_def456",
"path": "/Documents/report.xlsx",
"version": 6,
"modified_by": "device_phone01",
"timestamp": "2026-03-10T14:35:00Z"
}
// Client -> Server: Acknowledge receipt
{
"type": "ACK",
"id": "evt_123"
}
// Bidirectional: Keepalive
{ "type": "PING", "timestamp": 1741614900000 }
{ "type": "PONG", "timestamp": 1741614900000 }
The SUBSCRIBE message includes a cursor so the server can replay missed events. PING/PONG every 30 seconds detects dead connections. Three unanswered PINGs trigger connection teardown and removal from the Redis connection registry.
File Operations
POST /v1/files/move
{ "file_id": "file_def456", "new_parent_id": "folder_abc", "new_name": "renamed-report.xlsx" }
POST /v1/files/copy
{ "file_id": "file_def456", "destination_parent_id": "folder_xyz", "name": "report-copy.xlsx" }
DELETE /v1/files/{file_id}
// Soft delete. Moves to trash. Hard delete after 30 days.
GET /v1/files/{file_id}/versions
// Returns list of all versions with timestamps, sizes, device info.
POST /v1/files/{file_id}/restore?version=3
// Creates new version identical to version 3.
POST /v1/shares
{ "file_id": "file_def456", "email": "colleague@example.com", "permission": "edit", "expires_at": "2026-04-10T00:00:00Z" }
GET /v1/search?q=quarterly+report&type=xlsx&modified_after=2026-01-01
The Upload Pipeline
Problem: How does the client go from "file changed on disk" to "change confirmed and syncing to other devices"?
Mental model: A factory assembly line. Raw material (file bytes) enters at one end. The file is sliced, fingerprinted, checked against inventory, and only new pieces ship to the warehouse. A receipt is filed, and all branch offices are notified.
The chunking engine is the most performance-critical component in the desktop client. It runs entirely on the user's machine.
Step 1: Filesystem watcher detects a change. The desktop client installs OS-level watchers: FSEvents (macOS), inotify (Linux), ReadDirectoryChangesW (Windows). The event is added to a sync queue with 500ms debouncing to avoid processing mid-write states.
Step 2: Read and chunk the file. The client streams the file through the CDC engine. Memory usage is constant: only the 48-byte sliding window plus the current chunk buffer (max 8MB). A 50GB file processes in a single streaming pass.
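The CDC engine in Step 2 can be sketched as follows. This is a toy with deliberately shrunk parameters (tiny chunks, a simple accumulating hash rather than windowed Rabin fingerprinting over a 48-byte window) so the boundary behavior is visible on small inputs; only the principle — boundaries chosen by content, not offset — carries over:

```python
import random

MASK = (1 << 6) - 1   # boundary when low 6 hash bits are zero (~64B average here)
MIN_CHUNK = 32
MAX_CHUNK = 256

def cdc_stream(data: bytes):
    """Yield content-defined chunks: boundaries depend on the bytes
    themselves, so an edit only reshapes chunks near the change."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Toy polynomial rolling hash standing in for Rabin fingerprinting.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (rolling & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk

random.seed(1)
data = bytes(random.randrange(256) for _ in range(2000))
chunks = list(cdc_stream(data))
```

MIN_CHUNK and MAX_CHUNK bound the chunk-size distribution; the mask width sets the average. The production values (~4MB average, 8MB max) follow the same scheme.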
Step 3: SHA-256 hash each chunk. On modern CPUs with SHA-NI extensions, SHA-256 runs at ~2GB/sec. A 4MB chunk hashes in ~2ms.
Step 4: Compare against local SQLite. The client compares the new chunk hash list against the previous version stored locally. Unchanged chunks (same hash, same position) don't need uploading.
Step 5: Send hash list to server. POST /v1/files/upload/init with the full list of chunk hashes.
Step 6: Server responds with dedup results. "Chunks 0, 1, 3, 4 already exist. Only chunks 2 and 5 need uploading." This combines delta sync (only changed chunks from the previous version) with cross-user dedup (chunks that exist from other users' files).
Step 7: Upload new chunks via presigned URLs. Parallel uploads (4-8 concurrent by default). Each chunk is a separate PUT operation directly to object storage.
Concrete example: A user edits a 100MB Excel file, changing values in row 5000. CDC produces 25 chunks. Compared to the previous version, 1-2 chunks have different hashes. The server reports that 23 of 25 chunks already exist. The client uploads only 2 chunks (8MB total) — 92% bandwidth savings.
Full pipeline pseudocode:
def sync_file(file_path):
    # Step 1: Read and chunk
    chunks = []
    with open(file_path, 'rb') as f:
        for chunk_data in cdc_stream(f):
            chunk_hash = sha256(chunk_data)
            chunks.append({
                'index': len(chunks),
                'hash': chunk_hash,
                'size': len(chunk_data),
                'data': chunk_data
            })
    # Step 2: Local delta detection
    prev_hashes = sqlite_get_chunk_hashes(file_path)
    changed_indices = []
    for i, chunk in enumerate(chunks):
        if i >= len(prev_hashes) or chunk['hash'] != prev_hashes[i]:
            changed_indices.append(i)
    if not changed_indices and len(chunks) == len(prev_hashes):
        return  # No changes
    # Step 3: Server dedup check
    response = api_upload_init(
        path=file_path,
        chunk_hashes=[c['hash'] for c in chunks],
        chunk_sizes=[c['size'] for c in chunks]
    )
    # Step 4: Upload only needed chunks, collecting the ETag per chunk
    etags = {}
    for idx in response['chunks_needed']:
        presigned_url = response['presigned_urls'][str(idx)]
        etags[str(idx)] = upload_chunk(presigned_url, chunks[idx]['data'])
    # Step 5: Finalize
    api_upload_complete(response['upload_id'], etags)
    # Step 6: Update local state
    sqlite_update_file(file_path, chunks)
Resumable Uploads
Problem: A user uploading a 2GB video over a flaky connection will experience drops. Without resumable uploads, they restart from scratch every time. At chunk 47 of 500, the connection dies. How do we resume from chunk 48?
Simple example: Think of mailing 500 packages to the same address. You track which packages were delivered. If the mail truck breaks down after 300 deliveries, you resume from package 301 — not package 1.
Mental model: Each chunk is an independent package with its own tracking number (presigned URL). The upload session is a shipping manifest in Redis. Progress survives client restarts, network failures, and even server pod restarts (the object store is the ultimate source of truth for what was delivered).
The flow:
- Create session. The Upload Orchestrator stores session state in Redis.
- Generate presigned URLs. Each needed chunk gets a presigned PUT URL. URLs expire in 15 minutes (security best practice).
- Client uploads chunks directly. Object store returns an ETag per completed chunk.
- Track progress. Client notifies the Orchestrator after each chunk. Redis session updated.
- Network failure. Client loses connectivity. Some chunks uploaded, some not. The session and uploaded chunks are preserved.
- Resume. Client calls GET /upload/{id}/progress. Orchestrator checks the Redis session plus ListParts on the object store. Returns remaining chunks with fresh presigned URLs.
- Continue. Client uploads remaining chunks from where it left off.
- Finalize. Once all chunks are uploaded, create version record and clean up.
Key detail: Presigned URLs expire in 15 minutes, but the multipart upload stays valid for 7 days. When a client reconnects after URLs expire, the Orchestrator generates fresh URLs. Chunks already uploaded are preserved.
Edge case: Redis session lost. If Redis loses the session (node failure, TTL expiry), the Orchestrator reconstructs it by calling ListParts on the object store. Slower but ensures resumability through infrastructure failures.
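The reconstruction path can be sketched as below. `list_parts` is an injected callable standing in for the object store's part-listing API (e.g. S3 ListParts), returning (part_number, etag) pairs; part numbers are 1-based in multipart APIs while chunk indices are 0-based in this design:

```python
def rebuild_session(upload_id: str, total_chunks: int, list_parts):
    """Rebuild a lost upload session from the object store's part listing.

    The store is the source of truth for what was durably uploaded, so the
    session can always be reconstructed even if Redis loses it.
    """
    uploaded = {part_number - 1: etag for part_number, etag in list_parts(upload_id)}
    return {
        "upload_id": upload_id,
        "total_chunks": total_chunks,
        "completed_chunks": sorted(uploaded),
        "etags": uploaded,
        "chunks_to_upload": [i for i in range(total_chunks) if i not in uploaded],
    }

# Example: the store reports parts 1, 2, and 4 as durably uploaded.
fake_listing = lambda _id: [(1, "etag-a"), (2, "etag-b"), (4, "etag-d")]
session = rebuild_session("up_abc123", 5, fake_listing)
```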
Content Deduplication
Problem: 500 employees upload the same 50MB PDF. Without dedup, that is 25GB for one file. With dedup, 50MB. How does the system detect that a chunk already exists before uploading it?
Simple example: A library catalogues books by ISBN. When someone donates a book, the librarian checks the ISBN. If the library already has a copy, the donation is politely declined. The ISBN is the SHA-256 hash; the library catalogue is the Bloom filter + PostgreSQL.
Mental model: Two-tier lookup. A fast approximate check (Bloom filter) says "definitely new" or "probably exists." If "probably exists," a slow exact check (PostgreSQL) confirms. This keeps the fast path fast while maintaining correctness.
The dedup flow (runs during /upload/init, before any bytes upload):
- Client sends chunk hashes (SHA-256) to server
- Server checks each hash: Redis Bloom filter (fast) then PostgreSQL chunks table (accurate)
- Server returns: chunks_existing (skip) + chunks_needed (upload)
- Only chunks_needed get presigned URLs
Bloom filter sizing: 12.5B chunks with 15 bits per element gives a 0.1% false positive rate in ~23GB. This fits in Redis Cluster memory. If the Bloom filter says "not present," the chunk is guaranteed new — no further checks needed. The 0.1% false positives are caught by PostgreSQL verification.
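The two-tier check can be sketched as below. `bloom_maybe_contains` and `db_lookup` stand in for RedisBloom and the chunks table; the Bloom filter may return false positives but never false negatives, so a "no" short-circuits and a "maybe" is verified exactly:

```python
def check_chunks(chunk_hashes, bloom_maybe_contains, db_lookup):
    """Partition chunk hashes into existing (skip) and needed (upload)."""
    existing, needed = [], []
    for idx, h in enumerate(chunk_hashes):
        if not bloom_maybe_contains(h):
            needed.append(idx)    # definitely new: skip the DB entirely
        elif db_lookup(h):
            existing.append(idx)  # confirmed duplicate
        else:
            needed.append(idx)    # Bloom false positive (~0.1% of checks)
    return existing, needed

# Stand-ins: "h_fp" simulates a Bloom false positive.
stored = {"h1", "h2"}
bloom = lambda h: h in stored or h == "h_fp"
db = lambda h: h in stored
existing, needed = check_chunks(["h1", "h_new", "h_fp", "h2"], bloom, db)
```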
Cross-user dedup example: A company with 500 employees distributes a 50MB handbook PDF. The first employee uploads it: CDC produces 13 chunks, all new, all uploaded, ref_count = 1 each. The second employee uploads the same PDF: same 13 hashes, all exist, zero bytes uploaded, ref_counts incremented to 2. Employees 3-500: zero uploads. Storage savings: 99.8% for this file. Across the entire platform, empirical data suggests 50-60% overall dedup ratio.
Security tradeoff: If the server tells a client "this chunk already exists, skip upload," the client can infer someone else has the same content. Mitigations: (1) dedup only within the same organization, (2) per-tenant encryption keys so chunks differ across tenants, or (3) per-upload salt (eliminates cross-user dedup). Consumer services accept the risk. Enterprise deployments restrict to within-tenant.
Reference Counting and Garbage Collection
Every chunk has a ref_count tracking how many file versions reference it:
- Incremented when a new file version references an existing chunk
- Decremented when a file version is deleted (version expiry, user deletion)
When ref_count reaches 0, the chunk is eligible for deletion. A 24-hour grace period protects against race conditions (a chunk's ref_count could temporarily hit 0 during a concurrent upload about to reference it).
Garbage collection process:
- Background job runs hourly, scans for chunks where ref_count = 0 AND last_referenced < now() - 24 hours
- Delete the object from the store
- Delete the chunks table row
- Delete the chunks table row
- Remove from Bloom filter (requires periodic rebuild from the chunks table, since standard Bloom filters don't support deletion; rebuilt nightly)
Race condition: simultaneous uploads of the same chunk. User A and User B both upload a file containing the same chunk. Both pass the Bloom filter as "new." Both try to insert into the chunks table. The unique constraint on hash catches it: one insert succeeds (ref_count = 1), the other hits a conflict and retries as a ref_count increment. No duplicate stored.
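The insert-or-increment behavior can be modeled with a single atomic UPSERT, shown here against SQLite for a runnable sketch (Postgres `INSERT ... ON CONFLICT` behaves the same way); the insert-then-retry flow described above collapses into one statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chunks (
    hash TEXT PRIMARY KEY,
    size_bytes INTEGER NOT NULL,
    ref_count INTEGER NOT NULL DEFAULT 1)""")

def register_chunk(conn, chunk_hash, size_bytes):
    # First writer inserts with ref_count = 1; a concurrent writer's
    # unique-constraint conflict becomes an increment, never a duplicate row.
    conn.execute(
        """INSERT INTO chunks (hash, size_bytes, ref_count)
           VALUES (?, ?, 1)
           ON CONFLICT(hash) DO UPDATE SET ref_count = ref_count + 1""",
        (chunk_hash, size_bytes))

register_chunk(conn, "h_abc", 4194304)  # User A: new row
register_chunk(conn, "h_abc", 4194304)  # User B: same chunk, same instant
(ref_count,) = conn.execute(
    "SELECT ref_count FROM chunks WHERE hash = 'h_abc'").fetchone()
```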
Weekly reconciliation: Safety net. Scans all version_chunks references, counts actual references per chunk, compares against chunks.ref_count, and fixes discrepancies.
Integrity Verification
Problem: Presigned URLs mean file bytes bypass the server entirely. The client claims "I uploaded chunk with hash h_abc," but the server never saw the bytes. How do we verify nothing is corrupted or tampered with?
Mental model: Think of a notary system. Individual documents (chunks) are notarized independently. Then the collection of documents is sealed in an envelope with a tamper-evident seal (Merkle root). If any document is swapped, reordered, or missing, the seal breaks.
Chunk-Level Verification
A malicious or buggy client could claim hash h_abc but upload different data. Every subsequent download of that chunk delivers garbage.
Step 1: ETag check (fast, inline). During /upload/complete, the server compares ETags. The ETag is typically an MD5 of the uploaded content — catches accidental corruption.
Step 2: SHA-256 verification (async, thorough). A Kafka event triggers a verification worker that downloads the chunk, computes SHA-256, and compares against the claimed hash. Catches everything: malicious uploads, silent corruption, hash collisions.
Step 3: Version status. New versions are created with status: pending_verification. The version becomes verified once all chunks pass. Other devices are notified only after verification completes (typically 1-5 seconds added).
File-Level Integrity (Merkle Tree)
Individual chunk hashes prove each chunk is valid. But they don't prove the whole file is valid: chunks could be reordered, duplicated, or missing. The checksum field in file_versions solves this via a Merkle tree over chunk hashes.
File: [Chunk0] [Chunk1] [Chunk2] [Chunk3]
| | | |
h_0 h_1 h_2 h_3 <- leaf hashes (SHA-256 of chunk content)
\ / \ /
SHA-256(h_0||h_1) SHA-256(h_2||h_3) <- intermediate nodes
\ /
SHA-256(h_01 || h_23) <- Merkle root = file checksum
def merkle_root(chunk_hashes):
    if len(chunk_hashes) == 0:
        return sha256(b'')
    nodes = list(chunk_hashes)
    while len(nodes) > 1:
        if len(nodes) % 2 == 1:
            nodes.append(nodes[-1])  # duplicate last hash at each odd level
        next_level = []
        for i in range(0, len(nodes), 2):
            combined = sha256(nodes[i] + nodes[i+1])
            next_level.append(combined)
        nodes = next_level
    return nodes[0]  # Root hash = file checksum
Why a Merkle tree instead of concatenating all hashes?
- Partial verification. To verify a single chunk belongs to the file, you need only its sibling hashes along the path to the root — O(log N) hashes, not all N. For 12,500 chunks: 14 hashes instead of 12,500.
- Corruption localization. Binary-search for exactly which chunk is corrupted — O(log N) checks.
- Tamper evidence. Changing, reordering, or removing any chunk changes the Merkle root.
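Partial verification can be sketched as a standard Merkle proof check. Each proof step is a (sibling_hash, sibling_is_left) pair matching the concatenation order used by merkle_root; the names here are illustrative:

```python
import hashlib

sha256 = lambda b: hashlib.sha256(b).digest()

def verify_chunk(leaf_hash, proof, expected_root):
    """Recompute the root from one leaf plus O(log N) sibling hashes."""
    node = leaf_hash
    for sibling, sibling_is_left in proof:
        node = sha256(sibling + node) if sibling_is_left else sha256(node + sibling)
    return node == expected_root

# Four-leaf tree: root = H(H(h0||h1) || H(h2||h3)).
h = [sha256(bytes([i]) * 4) for i in range(4)]
root = sha256(sha256(h[0] + h[1]) + sha256(h[2] + h[3]))
# Proof for leaf 1: sibling h0 (on the left), then sibling H(h2||h3) (right).
proof = [(h[0], True), (sha256(h[2] + h[3]), False)]
ok = verify_chunk(h[1], proof, root)
```

Two hashes prove membership here; for 12,500 chunks the proof is 14 hashes.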
| When | What | How |
|---|---|---|
| Upload: client | Merkle root from chunk hashes | Sent in /upload/init as checksum |
| Upload: server | Recompute Merkle root from request | Must match client's claimed checksum |
| Download: client | Each chunk hash | SHA-256(downloaded_bytes) == expected hash |
| Download: client | Merkle root after assembly | Recompute from chunk hashes, compare to version checksum |
| Periodic: server | Async chunk verification | Worker re-hashes chunks in object store |
The Sync Protocol
Problem: When a file changes on one device, how do all other devices learn about it and download only the changed parts — ideally within seconds?
Simple example: A group chat where the server posts "report.xlsx updated to version 6" and each member checks what's new since their last read. Members who were offline scroll back through history when they return.
Mental model: WebSocket is the real-time notification channel (instant push). Cursor-based sync is the catch-up mechanism (reliable recovery). Together they ensure no change is ever missed, regardless of connectivity.
Upload Notification Path
After a successful upload, the Sync Service publishes a sync event to Kafka. The WebSocket Gateway consumes it and pushes to all other devices.
Download Path
1. WebSocket CHANGE message arrives
2. Client fetches file version metadata (chunk list)
3. Client diffs chunk list against local SQLite
4. Client downloads only new/changed chunks from CDN/object storage
5. Client assembles file from chunks (local cache + downloaded)
6. Client writes file to filesystem
7. Client updates local SQLite with new version
8. FS watcher fires (from own write) - client ignores
(filters out self-triggered events by checking sync_status)
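Steps 3-5 of the download path reduce to a diff against the local content-addressed cache. A minimal sketch, where `local_cache` maps hash to bytes and `fetch` stands in for a CDN GET:

```python
def assemble_version(chunk_list, local_cache, fetch):
    """Fetch only missing chunks, then assemble the file in chunk order."""
    missing = [h for h in chunk_list if h not in local_cache]
    for h in missing:               # downloaded in parallel in practice
        local_cache[h] = fetch(h)
    file_bytes = b"".join(local_cache[h] for h in chunk_list)
    return file_bytes, missing

cache = {"h1": b"AAAA", "h2": b"BBBB"}   # chunks already on disk
remote = {"h3": b"CCCC"}                 # stand-in for CDN/object store
data, fetched = assemble_version(["h1", "h3", "h2"], cache, remote.__getitem__)
```

Because chunks are content-addressed, a cached hash never needs revalidation: only "h3" crosses the network here.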
Cursor-Based Offline Sync
When a device comes online after being offline, it cannot rely on missed WebSocket messages. Instead:
- Client reads its last cursor from local SQLite
- Calls POST /v1/sync/changes with the cursor
- Server returns all changes since that cursor, paginated
- Client processes each change: download new files, update modified, delete removed
- Client stores the new cursor
The cursor encodes a timestamp plus shard identifier, allowing efficient queries on the user's Citus shard.
Kafka topic design: Sync events are partitioned by user_id_hash % 1000, creating 1000 partitions. All events for a single user stay ordered (same partition) while load distributes across brokers.
Conflict Resolution
Problem: Two devices edit the same file while one is offline. When both try to upload, whose edit wins? And what happens to the loser's work?
Simple example: Two people write different notes on the same shared whiteboard while in different rooms. When they reconvene, both notes are preserved — the later one is on the whiteboard, the earlier one is photographed and kept as a "conflicted copy."
Mental model: Last-writer-wins for the primary file, safety copy for the loser. No data is ever silently lost.
Every file upload includes a base_version in the init request. The server checks:
- If base_version == server's latest version: success, create version N+1.
- If base_version < server's latest version: conflict detected.
Concrete scenario: Alice has budget.xlsx synced on laptop and phone. Current version is v5.
- Laptop goes offline (airplane mode)
- Alice edits on laptop. Local version: v5-local-edit.
- Meanwhile, she edits the same file on her phone (online).
- Phone uploads. Server creates v6 (base_version=5, matches, no conflict).
- Laptop reconnects. Tries to upload with base_version=5.
- Server sees base_version=5 but current=6. Conflict.
Resolution: Server accepts the laptop's upload as v7. Creates a new file containing the phone's v6 content as budget (conflicted copy - Phone - 2026-03-10).xlsx. Both devices are notified. Alice sees both files, manually merges, and deletes the conflicted copy.
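The server-side decision reduces to one comparison. A hypothetical sketch (function and parameter names are illustrative; the real service also persists both versions and notifies devices):

```python
def resolve_upload(filename: str, current_version: int, base_version: int,
                   losing_device: str, date: str):
    """Last-writer-wins with a conflicted copy for the displaced version."""
    new_version = current_version + 1
    if base_version == current_version:
        return new_version, None    # clean fast-forward, no conflict
    stem, _, ext = filename.rpartition(".")
    copy = f"{stem} (conflicted copy - {losing_device} - {date}).{ext}"
    return new_version, copy

# The laptop scenario above: current head is v6, laptop uploaded against v5.
version, copy_name = resolve_upload("budget.xlsx", 6, 5, "Phone", "2026-03-10")
```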
Why last-writer-wins? For a general-purpose file sync platform handling binary files (images, videos, compiled documents), there is no universal merge algorithm. Last-writer-wins with conflicted copies is the safest general solution. More sophisticated strategies (operational transformation) apply only to specific file types.
Conflict rate: Less than 0.01% of file operations in practice. But when they happen, the system must handle them gracefully and never lose data.
Delta Sync in Practice
A user has a 20MB PowerPoint file synced across laptop and phone. On the laptop, they change a few slides (affecting about 2MB of data in the middle).
Upload delta (laptop to server):
Version 1 (synced to both devices):
Chunks: [A, B, C, D, E]
Hashes: [h1, h2, h3, h4, h5]
User edits slides in the region covered by chunk C.
Version 2 (laptop computes via CDC):
Chunks: [A, B, C', D, E]
Hashes: [h1, h2, h3', h4, h5]
- Local diff: only index 2 changed (h3 -> h3')
- Server checks dedup: h1, h2, h4, h5 exist. h3' is new.
- Laptop uploads only chunk C' (~4MB) via presigned URL
- Server creates version 2 referencing [h1, h2, h3', h4, h5]
- Bandwidth: ~4MB instead of 20MB (80% savings)
Download delta (server to phone):
- Phone receives WebSocket notification: file updated to v2
- Phone fetches v2 chunk list: [h1, h2, h3', h4, h5]
- Phone's local SQLite has v1: [h1, h2, h3, h4, h5]
- Diff: only h3' is new. Phone already has h1, h2, h4, h5 locally.
- Phone downloads only chunk C' (~4MB)
- Assembles file: local A, B + downloaded C' + local D, E
- Bandwidth: ~4MB instead of 20MB
Scale impact: At 1M uploads/minute, if average delta sync saves 70% of bandwidth, upload volume drops from 2TB/minute to ~600GB/minute. rsync-style delta sync achieves even greater savings with byte-level granularity.
LAN Sync
Problem: Two devices on the same local network both need a file. Downloading from the cloud wastes internet bandwidth when the bytes are right there on the other machine, reachable at gigabit speeds.
Simple example: Two laptops in the same office. Instead of both downloading a 500MB file from the cloud, the second laptop copies it from the first over the local network in half a second.
Mental model: Cloud for truth, LAN for speed. Metadata always flows through the server (who has what, what version). Only chunk bytes take the LAN shortcut.
Discovery via mDNS. When the desktop client starts, it broadcasts on the local network:
Service type: _filesync._tcp.local
Instance: {user_id_hash}_{device_id}
Port: 17500
TXT: version=2, user_id_hash=abc123
Other instances discover this broadcast and filter for matching user_id_hash to find devices belonging to the same user.
Authentication. Each device has a TLS certificate signed by the server during device registration. Devices perform mutual TLS handshake, verifying certificates against the server's CA. This prevents impersonation on shared networks.
Chunk transfer over LAN. When Device B needs chunks that Device A has:
Device B needs chunks: [h3', h7', h15']
Device B asks Device A (LAN peer): "Do you have h3', h7', h15'?"
Device A responds: "I have h3' and h7'. Don't have h15'."
Device B downloads h3' and h7' from Device A over LAN (1Gbps+)
Device B downloads h15' from object store/CDN (internet speed)
Fallback. If LAN transfer fails (device offline, network change), the client transparently falls back to cloud download.
Office-scale impact: 200 employees, 500MB shared presentation. Without LAN sync: 200 * 500MB = 100GB from the internet. With LAN sync: first device downloads 500MB from the cloud, remaining 199 download from LAN peers. Internet bandwidth: ~500MB. That is 99.5% reduction.
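The peer negotiation above can be sketched as a simple planning step: ask each authenticated peer which needed hashes it holds, assign those to LAN transfer, and leave the rest for the cloud fallback. `peers` here is a hypothetical map of peer id to the set of chunk hashes it advertises:

```python
def plan_downloads(needed, peers):
    """Split needed chunk hashes between LAN peers and the cloud."""
    from_lan, remaining = {}, list(needed)
    for peer_id, held in peers.items():
        still = []
        for h in remaining:
            if h in held:
                from_lan.setdefault(peer_id, []).append(h)
            else:
                still.append(h)
        remaining = still
    return from_lan, remaining   # remaining falls back to CDN/object store

lan, cloud = plan_downloads(["h3'", "h7'", "h15'"], {"device_a": {"h3'", "h7'"}})
```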
File Versioning
Problem: Users need to undo changes, recover deleted content, and compare versions. How do you store version history without multiplying storage costs?
Simple example: A photo album where you don't print duplicate photos. Version 1 and Version 3 both include the same photo — they share the same print. Each version is just a list of which photos to include.
Mental model: Versions are pointers, not copies. Creating a new version only stores what changed. Restoring an old version is a metadata operation — no bytes are copied.
v1: [A, B, C] -- Initial upload
v2: [A, B', C] -- B edited, becomes B'
v3: [A, B', C, D] -- D appended
v4 (restore v1): [A, B, C] -- Same chunks as v1, new version record
Chunk A is referenced by all 4 versions (ref_count = 4). Chunk B is in v1 and v4 (ref_count = 2). Chunk B' is in v2 and v3 (ref_count = 2). Chunk D is only in v3 (ref_count = 1).
Restoring a version creates a new version record pointing to the same chunks. No data copied. Milliseconds, regardless of file size.
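Because versions are just lists of chunk references, ref counts fall out of counting hash occurrences across version records. Mirroring the v1-v4 example above:

```python
from collections import Counter

versions = {
    "v1": ["A", "B", "C"],
    "v2": ["A", "B'", "C"],
    "v3": ["A", "B'", "C", "D"],
    "v4": ["A", "B", "C"],   # restore of v1: same chunks, new version record
}
# ref_count per chunk = number of versions referencing its hash.
ref_counts = Counter(chunk for chunks in versions.values() for chunk in chunks)
```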
Version retention: 180 days or 100 versions, whichever comes first. When a version expires:
- Version record and version_chunks entries deleted from PostgreSQL
- Each referenced chunk's ref_count decremented
- Chunks reaching ref_count = 0 enter the GC pipeline (24-hour grace, then deletion)
Version listing is fast: the file_versions table is indexed by (user_id, file_id, version_number) and co-located on the same Citus shard.
Download Path and Caching
On download, the client receives a chunk list and diffs it against its local chunk cache (SQLite tracks which hashes are on disk). Only missing chunks are downloaded. Editing a 100MB file where 2 chunks changed means downloading 8MB, not 100MB.
Chunks are cached locally on disk. Since chunks are content-addressed (hash = identity), a cached chunk is valid forever. If the same hash appears in a different file, the cached copy is reused without downloading.
Small file optimization: Files under 1MB produce a single chunk. At billions of small files, this creates billions of individual objects. At extreme scale, small files can be packed into larger container blobs to reduce object count. For most deployments, one object per chunk is fine up to tens of billions of objects.
Why object storage, not a filesystem? At 12.5 billion chunks, a traditional filesystem would fail. ext4 allocates a fixed number of inodes (tens to hundreds of millions per filesystem). 12.5 billion chunks would require hundreds of filesystems just for inode count. Large directories degrade lookup performance. Object storage uses a flat key-value namespace with no inode concept, no directory hierarchy, no per-file metadata overhead.
Bottlenecks and Mitigations
Twelve bottlenecks to watch in production.
1. Client CPU for SHA-256. SHA-256 at software speed: ~500MB/sec. For a 4GB file, 8 seconds. Mitigation: SHA-NI hardware instructions (~2GB/sec). Hash while chunking (streaming, not two-pass).
2. Presigned URL generation at 25K/sec. Each requires HMAC-SHA256. Mitigation: batch generation, pre-compute URL templates, connection pooling.
3. Redis Bloom filter false positives. At 0.1% FP rate, 25 of every 25K checks hit PostgreSQL. Mitigation: add an LRU cache in the Dedup Service for recently verified chunks.
4. PostgreSQL write amplification during bulk uploads. Bulk initial sync creates thousands of INSERTs. Mitigation: batch inserts using COPY, group commits, connection pooling (PgBouncer).
5. WebSocket fan-out for shared folders. 500-member shared folder means 500 notifications per change. Mitigation: batch notifications (group changes within 100ms), compress payloads, rate-limit per-folder.
6. Chunk existence checks at 50K/sec. Mitigation: multi-tier caching. L1: in-process LRU (top 1M hashes, ~32MB). L2: Redis Bloom filter. L3: PostgreSQL. 95% hit L1 or L2.
7. Object storage rate limits per prefix. 3,500 PUT/sec per prefix. Mitigation: 65,536 prefixes from hash-based keys. 25K PUTs/sec distributed = 0.38 PUTs/sec per prefix.
8. Large file chunk lists. A 50GB file has 12,500 chunks. Mitigation: index on (user_id, version_id, chunk_index). Cache chunk lists in Redis for actively syncing files.
9. GC thundering herd on version expiry. If the product launched 180 days ago, the first retention window expires in one massive deletion wave. Mitigation: rate-limited GC (10K deletions/min), jitter on the expiry cutoff.
10. Client memory during large file chunking. Mitigation: streaming pipeline. Read 64KB blocks, feed through Rabin fingerprinting, emit chunks as found. Peak memory: 8MB (max chunk) + 48 bytes (window).
11. Metadata DB growth. 27TB today, growing ~50TB/year. Mitigation: partition file_versions and version_chunks by month. Archive old partitions to columnar storage.
12. FS watcher event storms. Extracting a ZIP with 10,000 files generates 10,000 events. Mitigation: directory-level debounce (500ms), batch into single sync operation, rate-limit (100 files/sec).
Failure Scenarios
Twelve failure scenarios with recovery strategies.
Object Storage Outage
Impact: All uploads and downloads fail. Recovery: Local files safe. Uploads resume when storage returns (multipart valid 7 days). Metadata operations (rename, move) still work (PostgreSQL only). Data loss: None. 11 nines durability — data is temporarily inaccessible, not lost.
Upload Interrupted Mid-Transfer
Impact: Partial upload. Some chunks in store, others not. Recovery: Client calls GET /upload/{id}/progress, gets remaining chunks and fresh presigned URLs, resumes. Data loss: None.
PostgreSQL Citus Node Failure
Impact: One shard unavailable. Affected users can't sync; others unaffected. Recovery: Automatic failover to standby (10-30 seconds via Patroni). Data loss: Possible loss of ~1 second of metadata writes with async replication. Chunks in object storage are safe. Synchronous replication eliminates this at the cost of latency.
Redis Cluster Failure
Impact: Bloom filter unavailable, upload sessions lost, connection registry gone. Recovery: Bloom filter rebuilt from chunks table (30-60 min). Sessions reconstructed via ListParts. Connection registry rebuilds organically on reconnect. Data loss: None. Redis stores only derived/ephemeral data.
WebSocket Gateway Crash
Impact: Up to 500K connections drop. Recovery: Clients reconnect with exponential backoff (1s, 2s, 4s). On reconnect, SUBSCRIBE with last cursor replays missed events. Data loss: None. All changes persisted in PostgreSQL before WebSocket notification sent.
Kafka Broker Failure
Impact: Sync event delivery delayed for affected partitions. Recovery: New leader elected (5-15 seconds). Producers retry. Consumers resume from last committed offset. Data loss: None with acks=all and min.insync.replicas=2.
Client Crash During Chunking
Impact: Partially computed hashes, partially sent request. Recovery: Client restarts, finds file marked pending_upload in SQLite, re-runs CDC (deterministic, same result). Resumes existing upload or starts new one. Data loss: None.
Concurrent Conflicting Edits
Impact: Two devices edited same file. Recovery: Last-writer-wins + conflicted copy. Both edits preserved. Data loss: None.
Chunk Reference Count Corruption
Impact: Premature deletion (ref_count too low) or storage leak (ref_count too high). Recovery: 24-hour GC grace period prevents premature deletion. Weekly reconciliation job corrects discrepancies. Even if prematurely deleted, same content re-uploaded produces same hash — chunk recreated. Data loss: Extremely unlikely.
Presigned URL Expiry During Slow Upload
Impact: Upload fails with 403. Recovery: Client calls GET /upload/{id}/progress for fresh URLs. Multipart upload still valid (7 days). Already-uploaded chunks preserved. Data loss: None.
Network Partition Between Client and Server
Impact: Client works locally, can't sync. Recovery: Changes accumulate in local sync queue. On reconnect, cursor-based sync catches up. Conflicts resolved if applicable. Data loss: None. Eventually consistent.
Data Corruption (Bit Rot)
Impact: Chunk content doesn't match SHA-256 on download. Recovery: Client detects mismatch, reports to server. Server fetches from another replica or region. If all copies corrupt, user's local device may still have the chunk for re-upload. Data loss: Astronomically rare (11 nines durability). SHA-256 verification ensures detection.
When Data Is Actually Lost
Data is truly irrecoverable only when ALL of:
- The object in the store is deleted or corrupted (11 nines against this), AND
- No other user has the same chunk (dedup means popular content has many references), AND
- The user's local copy is also gone (disk failure, device stolen)
The system provides defense in depth: object storage durability, content addressing, local device copies, and version history. True data loss requires a catastrophic combination of failures that is, for practical purposes, impossible.
Deployment and Operations
Multi-Region Deployment
Three regions for global coverage, each with a full stack:
| Region | Location | Primary User Base | Pod Counts |
|---|---|---|---|
| US-East-1 | Virginia | North/South America | Sync: 40, Upload: 20, WS: 30, Dedup: 10 |
| EU-West-1 | Ireland | Europe, Africa | Sync: 30, Upload: 15, WS: 25, Dedup: 8 |
| AP-Southeast-1 | Singapore | Asia-Pacific | Sync: 20, Upload: 10, WS: 25, Dedup: 6 |
Object storage buckets are region-local with cross-region replication. PostgreSQL uses logical replication for cross-region read replicas. Writes route to the user's home region (assigned at signup based on geographic IP).
Canary Deployment
- Deploy to 1 canary pod in US-East-1
- Route 5% of traffic for 15 minutes
- Monitor: upload success rate (>99.5%), sync latency p99 (<3s), error rate (<0.5%), dedup hit rate (stable)
- Roll to 25%, 50%, 100% with 10-minute holds
- Automated rollback if thresholds breached
- Full rollout across three regions: ~2 hours
Observability
Key metrics:
# Upload pipeline
upload_success_rate # Gauge: successful / total attempts
upload_duration_seconds # Histogram: init to complete
presigned_url_generation_seconds # Histogram: URL generation latency
# Sync pipeline
sync_propagation_latency_seconds # Histogram: upload complete to WS notification
websocket_connections_active # Gauge: current connections
# Dedup
dedup_hit_rate # Gauge: chunks skipped / total checked
bloom_filter_false_positive_rate # Gauge
# Infrastructure
kafka_consumer_lag # Gauge: messages behind per partition
pg_replication_lag_bytes # Gauge: standby lag
redis_memory_used_bytes # Gauge
Critical alerts:
| Alert | Condition | Severity |
|---|---|---|
| Upload success rate drop | < 99% for 5 min | P1 |
| Sync latency spike | p99 > 5s for 5 min | P1 |
| Object storage errors | 5xx > 1% for 5 min | P1 |
| Kafka consumer lag | > 100K messages for 10 min | P2 |
| PostgreSQL replication lag | > 10MB for 5 min | P2 |
| WebSocket connection drop | > 10% in 1 min | P2 |
| Dedup ratio anomaly | hit rate drops > 20% | P3 |
Distributed tracing: Every file operation carries a trace ID through HTTP headers, Kafka message headers, and WebSocket fields. When a user reports "my file isn't syncing," support looks up the trace and sees exactly where the pipeline stalled.
trace-abc123:
/upload/init 150ms
dedup.bloom_check 2ms
dedup.pg_verify 15ms
url.generate 5ms
storage.put_chunk 800ms [client-side]
/upload/complete 200ms
pg.create_version 20ms
kafka.publish 5ms
kafka.consume 2ms
ws.push 1ms
Security
Transport: TLS 1.3 for all client-server communication. mTLS between internal services (service mesh). WebSocket uses wss://. LAN sync uses mTLS with device certificates.
Authentication: OAuth 2.0 with JWT access tokens (15-minute expiry), refresh tokens stored securely on client. Device certificates for LAN sync.
Authorization: Per-file ACL in PostgreSQL. Three levels: owner, editor, viewer. Shared folders inherit permissions. Presigned URLs scoped to specific keys, 15-minute expiry.
Encryption at rest: Server-side encryption with KMS. Per-customer KMS keys available for enterprise. PostgreSQL encrypted via volume-level encryption.
Zero-knowledge option: Client encrypts before chunking using a user-held key. Server never sees plaintext. Trade-off: breaks cross-user dedup (identical files produce different ciphertext). Offered as opt-in for enterprise.
Malware scanning: Object storage event triggers serverless function on each new chunk. ClamAV or commercial scanner. Infected files quarantined. Async, doesn't block upload.
Rate limiting: Per-user: 1000 API calls/minute, 100 uploads/minute, 50GB/day upload. Per-IP: 5000 API calls/minute. Redis counters with sliding window at the API Gateway.
Audit logging: All file operations logged to immutable Kafka topic, archived to object storage. Enterprise admins query for compliance (SOC 2, GDPR).
Beyond This Design: Real-World Evolution
This design is the right starting point for an interview and for initial production. Real systems evolve beyond it as scale grows.
Metadata growth becomes the dominant challenge. At tens of billions of files and hundreds of billions of version-chunk mappings, relational databases struggle even with sharding. Production systems evolve toward custom distributed metadata services — purpose-built key-value stores optimized for the specific access patterns of file sync (user-scoped tree lookups, cursor-based change feeds, reference counting).
Joins disappear entirely from hot paths. The denormalized storage_key in version_chunks is a first step. Mature systems pre-compute every relationship needed for reads: materialized views for folder listings, denormalized stores for version histories, pre-joined chunk manifests for downloads. The write path becomes more complex (maintaining denormalized views), but reads become simple key lookups.
Chunk metadata separates from file metadata. The global chunks table (sharded by hash, queried during dedup) operates at a different scale and access pattern than user-scoped file metadata. Production systems split these into separate storage systems: a dedicated chunk index (possibly a distributed hash table) for dedup, and a separate per-user metadata store for file trees.
Bloom filter management becomes a distributed systems problem. A single Bloom filter in Redis works at 12.5B chunks. At 100B+ chunks, the filter exceeds single-cluster memory. Solutions include partitioned Bloom filters (shard by hash prefix), counting Bloom filters (support deletion without full rebuilds), or replacing with a scalable probabilistic index.
Storage tiering becomes critical for cost. Not all chunks are equal. A chunk last referenced 6 months ago should not sit in the same storage tier as one accessed hourly. Production systems implement automatic lifecycle policies: hot tier (SSD-backed, frequently accessed), warm tier (standard object storage), cold tier (archival, rarely accessed). The transition must be transparent — a download request for a cold chunk triggers automatic retrieval with slightly higher latency.
Consistency models get more nuanced. This design uses eventual consistency for sync notifications and strong consistency for metadata writes. At global scale, even metadata writes face latency-consistency tradeoffs across regions. Systems evolve toward tunable consistency: strong within a region, eventually consistent across regions, with conflict resolution at the application layer.
The client becomes smarter. Early clients are thin — chunk, hash, upload. Mature clients predict which files will be needed (pre-fetch based on usage patterns), compress chunks before upload, prioritize sync order by file importance, and coordinate directly with LAN peers for bulk transfers without server involvement.
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This Design | Learn More |
|---|---|---|
| PostgreSQL | File metadata, versions, ACLs, folder trees (sharded via Citus) | PostgreSQL |
| Redis | Chunk existence Bloom filter, upload sessions, WebSocket connection registry | Redis |
| Kafka | Sync event distribution, upload completion, dedup events | Kafka |
| Elasticsearch | File search by name, type, content metadata | Elasticsearch |
| Prometheus | Metrics for upload SLOs, sync latency, dedup ratio, connection counts | Prometheus |
| Grafana | Dashboards for sync monitoring, storage analytics, dedup tracking | Grafana |
Distributed Systems Concepts
| Concept | Role in This Design | Learn More |
|---|---|---|
| Bloom Filters | Probabilistic chunk existence check in Redis dedup layer | Bloom Filters |
| Merkle Trees | File integrity verification via Merkle root over chunk hashes | Merkle Trees |
| Write-Ahead Log | PostgreSQL WAL for crash recovery, Kafka durable log | Write-Ahead Log |
| Consistent Hashing | Foundation for content-addressable chunk storage addressing | Consistent Hashing |
Infrastructure Patterns
| Pattern | Relevance to This Design | Learn More |
|---|---|---|
| Object Storage and Data Lake | Object storage as content-addressable chunk store, lifecycle policies for version expiry | Object Storage |
| Message Queues and Event Streaming | Kafka for async sync event propagation, upload completion processing | Event Streaming |
| Database Sharding | Citus shards PostgreSQL by user_id, co-locating files/versions/chunks per user | Database Sharding |
| WebSocket and Real-Time Communication | Push-based sync notifications to 40M concurrent device connections | WebSocket and Real-Time |
| CDN and Edge Computing | CloudFront caching for frequently downloaded chunks, 60% hit rate | CDN and Edge Computing |
| Caching Strategies | Multi-tier: in-process LRU, Redis Bloom filter, PostgreSQL for chunk dedup | Caching Strategies |
| Rate Limiting and Throttling | Per-user upload limits, API rate limiting, bandwidth throttling | Rate Limiting |
| Replication and Consistency | Object storage cross-region replication, PostgreSQL streaming replication, Kafka RF=3 | Replication and Consistency |
| Kubernetes Architecture | Stateless services auto-scaled, WebSocket gateways with sticky sessions | Kubernetes Architecture |
Further Reading
- Streaming File Synchronization (Dropbox Engineering Blog) -- How Dropbox rebuilt their sync engine
- The Rabin Fingerprint (Michael O. Rabin, 1981) -- The rolling hash algorithm that makes CDC possible
- FastCDC: A Fast and Efficient Content-Defined Chunking Approach -- Gear-based CDC achieving 10x speedup over Rabin
- Multipart Upload Overview -- The resumable upload mechanism under the hood
- Citus: Distributed PostgreSQL -- How Citus shards PostgreSQL for horizontal scale
- TUS: Open Protocol for Resumable File Uploads -- Standard protocol for resumable uploads (alternative to presigned URLs)
The key insight: a file is an ordered list of chunk references. Unchanged content doesn't need re-uploading, and identical content doesn't need re-storing. Everything else — the sync protocol, the presigned URLs, the Merkle trees, the conflict resolution — is machinery to keep that list consistent, efficient, and durable across devices, users, and failures.
Delta sync (the rsync idea, realized here as chunk-list diffs) solves bandwidth efficiency. CDC + dedup solve storage efficiency. Distributed metadata solves scalability. Together, they make petabyte-scale file sync possible.