System Design: Dropbox (File Sync, Chunking, and Deduplication)
Goal: Build a cloud file sync platform serving 500M registered users, 50M DAU, syncing 10B files across 100PB of storage. Support chunked uploads, resumable transfers, delta sync (upload only changed bytes), cross-device sync via WebSocket, content deduplication across users, peer-to-peer LAN sync, and file versioning. Files under 10MB should be available on other devices within 10 seconds on broadband.
Mental model — four ideas that make everything else click:
- A file = ordered list of chunk hashes. Chunks are immutable, content-addressed blobs in object storage.
- Editing a file means producing a new list of hashes. Most stay the same. Only new chunks get uploaded.
- Dedup = identical content produces the same SHA-256 hash, stored once, referenced many times.
- Sync = push metadata via WebSocket, other devices download only missing chunks.
TL;DR: Split files into ~4MB chunks. SHA-256 per chunk enables content-addressable storage and cross-user dedup. Presigned URLs let clients upload directly to object storage — the server never touches file bytes. Resumable uploads: fail at chunk 47, resume from 48. Filesystem watcher detects changes, client uploads only changed chunks (delta sync), WebSocket pushes notifications in under 1 second. LAN sync for same-network devices. Last-writer-wins with conflicted copies for conflicts.
The Three Problems
File sync looks simple until the math gets real. Three problems shape every decision in this design.
Problem 1: Efficient transfer. The platform processes 1M file uploads per minute at peak, average file size 2MB. A user opens a 100MB spreadsheet, changes one cell (maybe 1KB of actual data), and saves. Re-uploading the entire 100MB wastes bandwidth. Multiply that by millions of users and the cost becomes unsustainable. The system must detect what actually changed and upload only those bytes.
Problem 2: Multi-device sync. 50M daily active users, each with 2-3 devices on average. That's 100-150M devices that need to stay in sync. When a user edits a file on their laptop, their phone and work desktop need to know within seconds. And "know about it" means downloading exactly the changed portions, not the whole file. Polling is out of the question — 150M devices hitting the server every 5 seconds asking "anything new?" is 30M requests/second of pure waste. The server must push changes.
Problem 3: Storage efficiency. 100PB of total storage. Without deduplication, every copy of the same file takes separate space. A company onboards 500 employees who all receive the same 50MB employee handbook PDF. That's 500 copies, 25GB, for one file. With content-addressable deduplication, the system stores one copy and 500 pointers.
Scale numbers:
- 500M registered users
- 50M DAU
- 10B files under management
- 100PB total storage
- 1M uploads/minute at peak (~16,667/sec)
- Average file size: 2MB
Requirements
Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Upload files from desktop, mobile, and web clients | P0 |
| FR-02 | Download files to any connected device | P0 |
| FR-03 | Resume interrupted uploads and downloads from the point of failure | P0 |
| FR-04 | Multi-device sync: file changes propagate to all user devices automatically | P0 |
| FR-05 | Delta sync: upload only changed portions of a file, not the entire file | P0 |
| FR-06 | Content deduplication: identical content stored once across all users | P0 |
| FR-07 | File versioning: retain previous versions for rollback | P1 |
| FR-08 | Conflict resolution: handle concurrent edits from multiple devices | P0 |
| FR-09 | Selective sync: choose which folders sync to which devices | P1 |
| FR-10 | LAN sync: transfer files directly between devices on the same network | P1 |
| FR-11 | File and folder sharing with permissions (view, edit) | P1 |
| FR-12 | Full-text search by filename, file type, and content metadata | P2 |
| FR-13 | Bandwidth throttling: user-configurable upload/download limits | P2 |
| FR-14 | File preview: thumbnails and previews for images, documents, videos | P2 |
| FR-15 | Team workspaces: shared folders with role-based access control | P1 |
Non-Functional Requirements
| ID | Requirement | Target |
|---|---|---|
| NFR-01 | Sync latency (metadata propagation) | < 1 second |
| NFR-02 | File availability on other devices | < 10 seconds for files under 10MB on broadband |
| NFR-03 | Resumable upload gap tolerance | Up to 7 days between pause and resume |
| NFR-04 | Deduplication ratio | 50% storage savings across all users |
| NFR-05 | Data durability | 99.999999999% (11 nines, standard tier) |
| NFR-06 | Service availability | 99.99% (52 minutes downtime/year) |
| NFR-07 | Maximum file size | 50GB |
| NFR-08 | Concurrent sync connections | ~45M (50M DAU * ~3 devices * ~30% concurrency) |
| NFR-09 | Version retention | 180 days or 100 versions, whichever comes first |
Why Naive Approaches Fail
Before arriving at the solution, it is worth seeing why three simpler approaches fail.
Approach 1: Full-file upload. Client detects a change, uploads the entire file. A 100MB spreadsheet with a 1KB edit uploads 100MB. At 1M uploads/minute, that is bandwidth suicide. Only viable for files under ~1MB.
Approach 2: Server-side binary diff. Client uploads the full file, server computes a diff against the old version. The server needs both versions in memory — about 1GB for a 500MB file. At 1M concurrent uploads, that is 1PB of server memory. Worse: routing all bytes through the server kills presigned URLs — the optimization that lets clients upload directly to object storage without the server touching file bytes. Verdict: impractical at scale.
Approach 3: Fixed-size chunking. Split the file into 4MB blocks at fixed byte positions. Hash each. Upload only blocks with changed hashes. Works for appends and in-place edits. Breaks on insertions:
Original (20MB, 5 chunks of 4MB each):
Position: [0-4MB] [4-8MB] [8-12MB] [12-16MB] [16-20MB]
Chunks: [C1:h_a] [C2:h_b] [C3:h_c] [C4:h_d] [C5:h_e]
After inserting 1KB at byte 100:
Position: [0-4MB] [4-8MB] [8-12MB] [12-16MB] [16-20MB+1KB]
Chunks: [C1':h_f] [C2':h_g] [C3':h_h] [C4':h_i] [C5':h_j]
Every hash changed. The 1KB insertion shifted all subsequent bytes. Fixed chunking degrades to full upload for mid-file insertions — which text editors, spreadsheets, and document processors do constantly.
What is needed is a chunking strategy where boundaries do not shift on insertion. Two solutions exist: content-defined chunking (CDC) and rsync-style delta sync. Both rely on a shared foundational concept.
The Two-Hash Foundation
Problem: Scanning a file for chunk boundaries or matching content requires checking every byte position. Cryptographic hashes are too expensive for this. But cheap hashes produce false matches.
Simple example: Imagine scanning a 100MB file byte-by-byte. At each position, you need to answer: "Is this a chunk boundary?" or "Does this data match something I've seen before?" SHA-256 at every byte position would take minutes. A fast rolling hash takes milliseconds — but it sometimes says "match" when the data is different.
Mental model: Think of a metal detector (weak hash) and a gemologist (strong hash). The metal detector scans the entire beach quickly — it beeps at every metal object, including bottle caps. The gemologist examines only the beeps to determine which are actually gold. You would never hire the gemologist to scan the beach, and you would never trust the metal detector to authenticate gold.
Every file scanning algorithm in this design uses two hashes with different roles:
| Dimension | Weak hash (Rabin / rolling) | Strong hash (SHA-256) |
|---|---|---|
| Speed | O(1) per byte position | Expensive (processes entire block) |
| Used where | Every byte position in the file | Only after a weak hash match |
| In CDC | Find chunk boundaries (hash % TARGET == 0) | Compute chunk identity for dedup and storage |
| In rsync | Find candidate block matches in old file | Verify the match is real, not a collision |
| Collisions | Yes — fast but imprecise | Cryptographically collision-resistant |
# The two-tier pattern used by both CDC and rsync
for each byte_position in file:
    weak = rolling_hash_update(weak, outgoing_byte, incoming_byte)  # O(1)
    if weak_hash_matches(weak):            # Fast check at every position
        strong = sha256(block_data)        # Expensive, but rare
        if strong_hash_matches(strong):    # Confirms real match
            # CDC: declare chunk boundary
            # rsync: emit COPY instruction

The most common weak hash is the Rabin fingerprint: a polynomial rolling hash that treats bytes as coefficients over a Galois field. What matters for the design: O(1) sliding update, uniformly distributed output, and deterministic behavior (same bytes always produce the same hash and the same reproducible boundaries).
Key distinction: Rabin finds boundaries and candidate matches. SHA-256 identifies content. Rabin is never used for dedup or storage addressing — that is SHA-256's job.
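The O(1) sliding update is the load-bearing piece of this pattern. Below is a minimal sketch in Python using a Rabin-Karp style polynomial hash as a stand-in for a true Rabin fingerprint (which works over GF(2) polynomials); the constants `BASE`, `MOD`, and the window size are illustrative, not production values:

```python
# Rabin-Karp style polynomial rolling hash (a stand-in for a true Rabin
# fingerprint). Illustrative constants, not production parameters.
BASE = 257
MOD = (1 << 61) - 1   # large prime modulus

def hash_window(data: bytes) -> int:
    """Hash a window from scratch -- O(window size)."""
    h = 0
    for b in data:
        h = (h * BASE + b) % MOD
    return h

def slide(h: int, outgoing: int, incoming: int, window_size: int) -> int:
    """Move the window one byte to the right in O(1)."""
    top = pow(BASE, window_size - 1, MOD)   # weight of the byte leaving the window
    h = (h - outgoing * top) % MOD          # remove the outgoing byte
    return (h * BASE + incoming) % MOD      # shift and add the incoming byte

data = b"the quick brown fox jumps over the lazy dog"
W = 8
h = hash_window(data[:W])
for i in range(W, len(data)):
    h = slide(h, data[i - W], data[i], W)
    # The O(1) update agrees with a from-scratch hash at every position.
    assert h == hash_window(data[i - W + 1 : i + 1])
```

The loop demonstrates the invariant that makes byte-by-byte scanning affordable: sliding is a constant-time update, yet it always equals the full recomputation.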
Content-Defined Chunking (CDC)
Problem: Fixed chunk boundaries shift when bytes are inserted, causing all downstream hashes to change. How do you chunk a file so that a small edit only affects the chunk containing the edit?
Simple example: A 12MB file is split into three ~4MB chunks. You insert 1KB in the first chunk. With fixed boundaries, all three chunks change. With content-defined boundaries, only the first chunk changes — because the boundaries are determined by the content near each split point, and the content at those split points did not change.
Mental model: Imagine cutting a rope wherever you find a red bead. The positions of the red beads depend on the rope's content, not on fixed measurements. If you splice new rope into the middle, only the section between the nearest two red beads changes. All other sections remain identical.
How it works:
- Slide a 48-byte window across the file, one byte at a time.
- Compute a Rabin fingerprint at each position. The rolling hash updates in O(1).
- Declare a boundary when rabin_hash % TARGET == 0. TARGET controls average chunk size. For ~4MB chunks, roughly 1 in 4,194,304 positions triggers a boundary. Min/max constraints (2MB–8MB) prevent pathological sizes.
- SHA-256 hash each resulting chunk. This hash becomes the chunk's permanent identity, used for dedup, storage addressing, and integrity verification.
Pseudocode:
WINDOW_SIZE = 48
TARGET = 4 * 1024 * 1024      # ~4MB average chunk
MIN_CHUNK = 2 * 1024 * 1024   # 2MB minimum
MAX_CHUNK = 8 * 1024 * 1024   # 8MB maximum

def content_defined_chunking(file_bytes):
    chunks = []
    chunk_start = 0
    window = file_bytes[0:WINDOW_SIZE]
    fingerprint = rabin_init(window)
    i = WINDOW_SIZE
    while i < len(file_bytes):
        outgoing = file_bytes[i - WINDOW_SIZE]
        incoming = file_bytes[i]
        fingerprint = rabin_slide(fingerprint, outgoing, incoming)
        chunk_length = i - chunk_start
        is_boundary = (fingerprint % TARGET == 0 and chunk_length >= MIN_CHUNK)
        is_max = (chunk_length >= MAX_CHUNK)
        if is_boundary or is_max:
            chunk_data = file_bytes[chunk_start:i]
            chunks.append({
                'data': chunk_data,
                'hash': sha256(chunk_data),
                'offset': chunk_start,
                'size': len(chunk_data)
            })
            chunk_start = i
        i += 1
    if chunk_start < len(file_bytes):
        chunk_data = file_bytes[chunk_start:]
        chunks.append({
            'data': chunk_data,
            'hash': sha256(chunk_data),
            'offset': chunk_start,
            'size': len(chunk_data)
        })
    return chunks

In practice, the file is streamed; memory usage is constant regardless of file size: just the 48-byte window, the current chunk buffer (max 8MB), and the hash state.
Why it works — concrete example:
Original file, CDC produces 3 chunks:
[----Chunk A (3.8MB)----][----Chunk B (4.2MB)----][----Chunk C (4.0MB)----]
hash: h_a hash: h_b hash: h_c
Insert 1KB in the middle of Chunk A:
[----Chunk A' (3.8MB)---][----Chunk B (4.2MB)----][----Chunk C (4.0MB)----]
hash: h_a' hash: h_b hash: h_c
Only Chunk A changed. The boundary between A and B is determined by the content at that point, which did not change. Chunks B and C keep the same hash. The client uploads one ~4MB chunk instead of the entire 12MB file.
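The boundary-stability claim can be checked with a toy chunker. The sketch below uses demo-scale parameters (8-byte window, ~64-byte chunks, SHA-256 of the window as a stand-in for the rolling Rabin fingerprint); it is illustrative, not the production algorithm:

```python
import hashlib
import random

# Toy content-defined chunker. Demo-scale parameters; production uses a
# rolling Rabin fingerprint, a 48-byte window, and ~4MB target chunks.
W, TARGET, MIN_CHUNK = 8, 64, 16

def chunk_hashes(data: bytes):
    """Split on content-defined boundaries, return each chunk's hash."""
    chunks, start = [], 0
    for i in range(W, len(data) + 1):
        window = data[i - W:i]
        h = int.from_bytes(hashlib.sha256(window).digest()[:4], "big")
        if h % TARGET == 0 and i - start >= MIN_CHUNK:   # boundary rule
            chunks.append(data[start:i])
            start = i
    if start < len(data):
        chunks.append(data[start:])
    return [hashlib.sha256(c).hexdigest()[:8] for c in chunks]

random.seed(7)
original = bytes(random.randrange(256) for _ in range(4000))
edited = original[:100] + b"INSERTED" + original[100:]   # insert 8 bytes early

old, new = chunk_hashes(original), chunk_hashes(edited)
changed = set(new) - set(old)
print(f"{len(old)} chunks before, {len(new)} after, {len(changed)} changed")
```

Because boundaries depend only on the bytes near each split point, the chunks downstream of the insertion keep their hashes; only the chunk (or two) around the edit produces new hashes.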
CDC enables cross-user deduplication. Because chunk boundaries are content-defined and SHA-256 is deterministic, two users uploading the same file produce identical chunk hashes. The system stores one copy and two pointers.
rsync-Style Delta Sync
Problem: CDC reduces upload size to whole chunks (~4MB). But a 1KB edit still uploads an entire ~4MB chunk. Can we transmit only the actual changed bytes?
Simple example: You have a 100MB file. You change 1KB in the middle. CDC uploads ~4MB (the affected chunk). rsync-style delta sync uploads ~1KB — the literal changed bytes — plus lightweight COPY instructions telling the receiver "reuse these parts from the old file."
Mental model: Imagine you have two nearly identical books. Instead of sending the entire new book, you send instructions: "Copy pages 1-50 from the old book. Here are 2 new pages. Copy pages 53-200 from the old book." The instructions are tiny compared to shipping a whole book.
Production file sync systems use this approach for transfers: fixed-size blocks with rsync-style rolling hash scanning. The core idea is position-independent matching — finding reusable data anywhere in the old file, regardless of alignment.
Terminology: Receiver = side with the old file. Sender = side with the new file. When uploading: client = Sender, server = Receiver. When downloading: roles reverse.
How it works:
- Receiver splits the old file into fixed-size blocks (~4MB). For each block, compute a weak hash (rolling) and a strong hash (SHA-256). Send the (weak, strong) pairs to the sender.
- Sender scans the new file with a rolling hash, one byte at a time. At each position: does this rolling hash match any of the receiver's block hashes?
- Match -> COPY. Weak hash matches? Verify with strong hash. If confirmed, emit COPY block_N: the receiver reuses that block from the old file.
- No match -> INSERT. Bytes between matches are new data. Emit INSERT "new_bytes": literal bytes the receiver must use.
- Receiver reconstructs the new file by executing COPY/INSERT instructions in order.
Concrete example:
Old file (receiver has this):
Block 0: [ABCD] -> (weak0, strong0)
Block 1: [EFGH] -> (weak1, strong1)
New file (sender has this):
XXABCDEFGH
Sender scans with rolling hash:
Position 0: [XXAB] -> no match
Position 1: [XABC] -> no match
Position 2: [ABCD] -> weak match! -> verify strong hash -> confirmed (Block 0)
...
Position 6: [EFGH] -> weak match! -> verify strong hash -> confirmed (Block 1)
The delta payload sent over the network:
[
  { "type": "INSERT", "data": "XX" },
  { "type": "COPY", "block_index": 0 },
  { "type": "COPY", "block_index": 1 }
]

Only INSERT carries actual bytes. COPY is a lightweight pointer. This achieves byte-level delta granularity: a 1KB edit transmits ~1KB of INSERT data plus COPY instructions, not an entire ~4MB chunk.
The key insight: matching is position-independent. Block 0 was at position 0 in the old file but appears at position 2 in the new file. rsync finds it anyway because the rolling hash scans every position. This is what solves the insertion problem that breaks fixed-size chunking.
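The signature/scan/reconstruct loop can be sketched end-to-end. This is a demo-scale version under stated simplifications: the weak hash is a plain byte sum recomputed per window (real rsync uses a rolling Adler-32-like checksum updated in O(1)), and the block size is 4 bytes instead of ~4MB:

```python
import hashlib

# Minimal rsync-style delta: signatures -> scan -> COPY/INSERT -> reconstruct.
# Simplified: byte-sum weak hash, tiny 4-byte blocks (production: ~4MB).
BLOCK = 4

def signatures(old: bytes):
    """Receiver side: (weak, strong) hashes per fixed-size block of the old file."""
    sigs = {}
    for idx in range(len(old) // BLOCK):
        block = old[idx * BLOCK:(idx + 1) * BLOCK]
        sigs.setdefault(sum(block), []).append((idx, hashlib.sha256(block).digest()))
    return sigs

def delta(new: bytes, sigs):
    """Sender side: scan the new file, emit COPY for matches, INSERT otherwise."""
    ops, literal, i = [], bytearray(), 0
    while i + BLOCK <= len(new):
        window = new[i:i + BLOCK]
        match = None
        for idx, strong in sigs.get(sum(window), []):   # weak hit -> verify strong
            if hashlib.sha256(window).digest() == strong:
                match = idx
                break
        if match is not None:
            if literal:
                ops.append(("INSERT", bytes(literal)))
                literal = bytearray()
            ops.append(("COPY", match))
            i += BLOCK
        else:
            literal.append(new[i])   # unmatched byte becomes literal data
            i += 1
    literal.extend(new[i:])          # trailing bytes shorter than a block
    if literal:
        ops.append(("INSERT", bytes(literal)))
    return ops

def reconstruct(old: bytes, ops):
    """Receiver rebuilds the new file by executing COPY/INSERT in order."""
    out = bytearray()
    for op, arg in ops:
        out.extend(old[arg * BLOCK:(arg + 1) * BLOCK] if op == "COPY" else arg)
    return bytes(out)

old, new = b"ABCDEFGH", b"XXABCDEFGH"
ops = delta(new, signatures(old))
assert reconstruct(old, ops) == new
print(ops)   # INSERT "XX", then COPY block 0 and COPY block 1
```

Note how block 0 is found at position 2 of the new file even though it sat at position 0 of the old file: the scan tries every byte offset, which is exactly the position-independence described above.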
Tradeoffs:
- CPU cost: Scans the entire file byte-by-byte with a rolling hash. For a 1GB file: ~1 billion hash computations. More CPU than fixed chunking, comparable to CDC.
- No cross-user dedup. rsync compares old vs new versions of the same file. It does not deduplicate across different users' files. Separate dedup mechanisms handle cross-user storage efficiency.
CDC vs rsync: Choosing the Right Model
Both approaches solve the insertion problem. They differ in what they optimize for.
CDC asks: "Do you already have this exact chunk?" rsync asks: "Can I find this data anywhere in your file?"
CDC dedup model — hash-based, global:
Client: file -> CDC -> chunks -> hashes -> [h_a, h_b, h_c]
Server: h_a exists (another user uploaded it) -> skip
h_b exists (previous version) -> skip
h_c is new -> upload needed
Result: upload only h_c. Dedup works across ALL users and files.
rsync dedup model — content-matching, local:
Receiver sends block hashes from old file
Sender scans new file, finds matches -> COPY / INSERT
Result: minimal bytes transferred. But only compares old vs new version of the SAME file.
Concrete example showing the difference:
Alice uploads Q1-report.pptx (100MB)
CDC chunks: [slides-A][company-logo][chart-data][slides-B]
All new -> 100MB uploaded.
Bob uploads Q2-report.pptx (120MB) -- different report, same logo and chart
CDC chunks: [slides-X][company-logo][chart-data][slides-Y][slides-Z]
Server: company-logo and chart-data already exist (from Alice) -> skip
Upload: only 72MB instead of 120MB. Cross-user dedup saved 48MB.
rsync: no equivalent. rsync compares Bob's file against Bob's previous
version of that file -- not against Alice's file. No cross-user dedup.
Comparison table:
| Dimension | Fixed Chunking | CDC | rsync |
|---|---|---|---|
| Boundary depends on position? | Yes | No (content-defined) | No (position-independent scan) |
| Handles insertions? | No — all chunks shift | Yes — only affected chunk changes | Yes — finds blocks at new positions |
| Delta granularity | Whole chunks (~4MB) | Whole chunks (~4MB) | Byte-level (actual changed bytes) |
| Dedup across users/files? | Yes (hash identity) | Yes (hash identity) | No (same-file pairs only) |
| CPU cost | Low | Medium (Rabin at every byte) | High (rolling hash at every byte) |
| Server touches file bytes? | No | No | Sender processes file for delta |
Bandwidth comparison for a 100MB file:
| Scenario | Full Upload | Fixed Chunks | CDC | rsync |
|---|---|---|---|---|
| 1KB edit at start | 100MB | 100MB (all shifted) | ~4MB (1 chunk) | ~1KB |
| 1KB edit in middle | 100MB | ~50MB (half shifted) | ~4MB (1 chunk) | ~1KB |
| 1KB edit at end | 100MB | 4MB (last chunk) | ~4MB (1 chunk) | ~1KB |
| 1KB append | 100MB | 4MB (new last chunk) | ~4MB (1 chunk) | ~1KB |
| 50MB overwrite | 100MB | ~52MB | ~52MB | ~50MB |
| No change | 100MB | 0MB (hashes match) | 0MB (hashes match) | 0MB |
This design uses CDC as the primary model because: hash-based dedup maps cleanly to content-addressable object storage, presigned URLs work directly with CDC chunks, and global cross-user dedup saves 50%+ storage. Production systems combine rsync-style delta sync for transfers with chunk-based dedup for storage — the best of both worlds.
Chunk Size Tuning
The TARGET parameter controls average chunk size. This is a critical tuning knob.
Smaller chunks (1MB average): Better dedup and delta granularity. More metadata per file (100 records for a 100MB file), more objects to manage, more SHA-256 computations.
Larger chunks (16MB average): Less metadata, fewer objects. Worse dedup and delta granularity. Larger minimum upload size.
The sweet spot: 4MB average. A 100MB file produces ~25 chunks (manageable metadata). A small edit changes 1-2 chunks (4-8MB delta — acceptable on broadband). With min 2MB and max 8MB constraints, pathological cases stay bounded. At 10B files averaging 3 chunks each: ~30B chunk references, ~15B unique chunks after 50% dedup.
| Tradeoff | Impact |
|---|---|
| CPU cost per file | Rabin rolling hash at every byte, plus SHA-256 once per chunk. ~2-3 seconds for a 1GB file on a modern laptop. |
| More metadata per file | 25 chunk records for 100MB. More rows in version_chunks. |
| Minimum effective edit size | A 1-byte change still produces one new ~4MB chunk. |
| Complexity | CDC boundary detection, min/max enforcement, hash computation add client-side code. |
These costs are worth it. The alternative (full-file upload) wastes orders of magnitude more bandwidth at scale.
Technology Selection
| Component | Technology | Role |
|---|---|---|
| Chunk storage | Object Storage | Content-addressable blob storage, 11 nines durability, multipart upload |
| File metadata | PostgreSQL + Citus | Users, files, versions, ACLs. Sharded by user_id for horizontal scale |
| Dedup cache | Redis Cluster | Bloom filter for chunk existence checks, upload session state, connection registry |
| Sync events | Kafka (RF=3) | Sync event distribution, upload completion events, dedup events |
| File search | Elasticsearch | Search by filename, type, content metadata |
| CDN | CloudFront | Edge caching for frequently downloaded chunks |
| Real-time sync | WebSocket Gateway | Push sync notifications to connected devices |
| LAN discovery | mDNS (Bonjour/Avahi) | Discover devices on the same local network for P2P sync |
Why PostgreSQL + Citus? File metadata is inherently relational: files belong to users, live in folder hierarchies, have versions, and each version references an ordered list of chunks. ACLs need transactional consistency. A single PostgreSQL instance cannot handle 10B files and 50B version-chunk mappings. Citus gives horizontal sharding with full SQL and ACID guarantees. Since almost all queries are user-scoped, Citus co-locates everything on one shard per user: fast, local, transactional.
| Dimension | PostgreSQL + Citus | DynamoDB | MongoDB | CockroachDB |
|---|---|---|---|---|
| Data model fit | Relational. File trees, ACLs, versions. | Key-value. Hierarchy requires denormalization. | Document. Decent for metadata, poor for ACL joins. | Relational, distributed. |
| Sharding | Automatic by user_id. Co-locates user data. | Partition key required. Cross-partition expensive. | Manual, complex at scale. | Automatic range-based. |
| Transactions | Full ACID, even across co-located tables. | Single-partition only. | Multi-document with caveats. | Full distributed (Spanner-style). |
| Cost | Open source. Commodity hardware. | Expensive at this write volume. | Open source but operationally complex. | Higher resource overhead than Citus. |
CockroachDB is the closest competitor — also distributed SQL. But its consensus-based writes (Raft per range) add latency compared to Citus's single-node writes for co-located data.
Architecture Overview
The architecture has three planes: Control plane handles metadata and orchestration. Data plane stores chunk bytes. Event plane delivers async notifications. The invariant that ties it all together: a file is always an ordered list of chunk hashes. Everything else is machinery to keep that list consistent across devices.
The Upload Path
Every step below must succeed for the upload to complete. Three phases: decision (which chunks need uploading), data transfer (chunk upload via presigned URLs), and commit (metadata write).
- User edits a file on their laptop
- Desktop client's filesystem watcher detects the change
- Client re-chunks the file using CDC, producing a list of chunk hashes
- Client diffs the new chunk list against its local SQLite DB to find new chunks
- Client calls /upload/init with the list of new chunk hashes
- Server runs dedup: checks each hash against the Bloom filter (Redis), then PostgreSQL. Already-existing chunks are skipped.
- Server returns presigned URLs only for chunks that don't already exist
- Client uploads new chunks directly to object storage via presigned URLs (file bytes never touch the server)
- Client calls /upload/finalize. Server writes the new file version (metadata + chunk references) to PostgreSQL. The Sync Service publishes an event to Kafka.
- Kafka delivers the event to the WebSocket Gateway, which pushes a notification to all other devices owned by this user
- Other devices pull the updated metadata and download only the changed chunks
File bytes bypass the entire server fleet. At petabyte-scale upload ingress, routing data through API servers would require thousands of proxy nodes. Presigned URLs eliminate that.
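The decision phase of the upload path might look like the following sketch. The helper names, the in-memory stand-ins, and the URL format are illustrative assumptions, not a real object-store API:

```python
# Sketch of the /upload/init decision step: given the client's chunk hashes,
# return upload URLs only for chunks the system does not already store.
# known_chunks, chunk_exists, and presign are illustrative stand-ins.

known_chunks = {"h_b"}   # stands in for the Bloom filter + chunks table

def chunk_exists(chunk_hash: str) -> bool:
    # Production: Bloom filter first (fast, may false-positive), then
    # PostgreSQL to confirm. Here: a plain set lookup.
    return chunk_hash in known_chunks

def presign(chunk_hash: str) -> str:
    # Production: an object-store presigned PUT URL scoped to the
    # content-addressed key. Here: a placeholder string.
    return f"https://storage.example/chunks/{chunk_hash[:2]}/{chunk_hash}?sig=..."

def upload_init(chunk_hashes: list[str]) -> dict:
    missing = [h for h in chunk_hashes if not chunk_exists(h)]
    return {
        "skip": [h for h in chunk_hashes if h not in missing],   # dedup hits
        "upload": {h: presign(h) for h in missing},              # client PUTs these
    }

plan = upload_init(["h_a", "h_b", "h_c"])
print(plan["skip"])   # h_b is already stored, so no upload URL is issued for it
```

The shape of the response is the point: the server hands out capabilities (URLs) for missing chunks only, and never receives file bytes itself.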
Critical path vs async path:
| Critical path (must succeed) | Async path (eventual) |
|---|---|
| /upload/init (dedup check) | Kafka event delivery |
| Chunk upload to object storage | WebSocket notification to other devices |
| /upload/finalize (metadata write) | Search indexing (Elasticsearch) |
| | Thumbnail generation |
The Sync Path
After a successful upload, the Sync Service publishes an event to Kafka. This path is decoupled from the upload — if Kafka is slow, sync is delayed but no data is lost.
Full Architecture
Layer Responsibilities
Layer 1: Edge. The L7 load balancer terminates TLS and routes requests. The API Gateway handles authentication (JWT validation), rate limiting (per-user, per-IP), and request routing. WebSocket connections are sticky-routed to the same gateway pod.
Layer 2: Services. Four stateless services:
- Sync Service: Handles file metadata CRUD, version creation, cursor-based change feeds, and folder operations. Writes to PostgreSQL, publishes events to Kafka.
- Upload Orchestrator: Manages the upload lifecycle — calls Dedup Service for hash checks, generates presigned URLs for new chunks, tracks progress in Redis. On finalize, delegates metadata write to the Sync Service (which owns metadata and publishes the Kafka event).
- WebSocket Gateway: Maintains persistent connections with all online devices. Consumes sync events from Kafka and pushes them to the right devices. Registers connections in Redis so other services can look up which pod holds which user.
- Dedup Service: Called during /upload/init before any data upload. Handles chunk existence checks (Bloom filter in Redis + PostgreSQL verification), chunk insertion into the chunks table, and ref_count management. This is on the critical path.
Layer 3: Message Bus. Kafka with replication factor 3. Three topic families: sync-events for file change notifications, upload-complete for post-upload processing, and dedup-events for async verification. Kafka buffers upload spikes, guarantees ordering per user (partitioned by user_id), and enables fan-out: one event feeds search indexing, thumbnail generation, and sync notifications independently.
Layer 4: Storage.
- Object Storage stores chunk bytes. Content-addressable: the key is the SHA-256 hash (e.g., chunks/a1/b2/a1b2c3d4e5f6...). Same content always maps to the same key.
- PostgreSQL + Citus stores all metadata: users, files, folders, versions, chunk lists, ACLs. Sharded by user_id. Source of truth for "what files exist and what chunks they contain."
- Redis Cluster stores ephemeral state: (1) Bloom filter for chunk existence checks, (2) upload session progress for resumable uploads, (3) WebSocket connection registry mapping user_id to gateway pod.
- Elasticsearch indexes file metadata for search.
Why both local diff AND server dedup? Local diff (client compares chunk hashes against its own SQLite DB) reduces upload volume within a single file across versions. Server dedup (Bloom filter + PostgreSQL) eliminates duplicates across users and files. A 50MB PDF uploaded by 500 employees: local diff does not help (each client sees it as new). Server dedup stores it once.
Why Bloom filter AND PostgreSQL for dedup? The Bloom filter is fast but has false positives. PostgreSQL is accurate but slow. The Bloom filter handles 99%+ of checks; PostgreSQL verifies the rest.
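A toy version of that two-tier check, with a bit-array Bloom filter standing in for the Redis filter and a set standing in for the chunks table (sizes here are demo-scale, not the multi-gigabyte production filter):

```python
import hashlib

# Toy Bloom filter: a negative answer is definitive (skip the DB round trip),
# a positive answer is only "maybe" and must be verified against the source
# of truth. M and K are demo-scale parameters.
M, K = 8192, 10   # bits in the filter, hash functions per element

def _positions(chunk_hash: bytes):
    """Derive K bit positions from the chunk hash."""
    for k in range(K):
        digest = hashlib.sha256(bytes([k]) + chunk_hash).digest()
        yield int.from_bytes(digest[:8], "big") % M

class Bloom:
    def __init__(self):
        self.bits = 0
    def add(self, h: bytes):
        for p in _positions(h):
            self.bits |= 1 << p
    def might_contain(self, h: bytes) -> bool:
        return all(self.bits >> p & 1 for p in _positions(h))

bloom, db = Bloom(), set()           # db stands in for the chunks table

def chunk_exists(h: bytes) -> bool:
    if not bloom.might_contain(h):   # definite miss: no DB query needed
        return False
    return h in db                   # "maybe": verify against the database

h1 = hashlib.sha256(b"chunk-1").digest()
bloom.add(h1)
db.add(h1)
assert chunk_exists(h1)
assert not chunk_exists(hashlib.sha256(b"chunk-2").digest())
```

The asymmetry is what makes the design work: the filter can never say "absent" for a chunk that exists, so skipping the database on a miss is always safe.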
Why Sync Service AND Upload Orchestrator? Upload Orchestrator manages the upload lifecycle (presigned URLs, session tracking, resume). Sync Service owns metadata consistency (file versions, chunk references, cursor-based change feeds). On finalize, the Orchestrator delegates the metadata write to the Sync Service. Separation keeps each service focused and independently scalable.
Scale Estimation
Upload Throughput
Peak uploads: 1M files/minute = 16,667 files/sec
Avg chunks/file: 3 (many small files are 1 chunk, large files bring up the average)
Total chunks/sec: 16,667 * 3 = 50,000 chunks/sec
Dedup hit rate: 50% (half the chunks already exist in the system)
New store PUTs/sec: 50,000 * 0.5 = 25,000 PUT operations/sec
Avg chunk size: 4MB
Object storage ingress: 25,000 * 4MB = 100GB/sec peak
25K PUTs per second is within the object store's capabilities (3,500 PUT/sec per prefix, so we use 256 hash-based prefixes for headroom).
Storage
Total files: 10B
Average file size: 2MB
Gross storage: 10B * 2MB = 20PB
Average versions: 5 per file
Gross with versions: 20PB * 5 = 100PB
Dedup ratio: 50%
Actual object storage: 100PB * 0.5 = 50PB
Unique chunks: 50PB / 4MB = 12.5 billion unique chunks
50PB in Standard tier. Tiered storage (intelligent tiering, cold archive for old versions) reduces cost significantly.
Metadata (PostgreSQL + Citus)
Chunk record: hash(32B) + size(4B) + bucket(8B) + ref_count(4B) + timestamps(8B) = ~56 bytes
12.5B chunks * 56B = 700GB
File record: id(16B) + user_id(16B) + parent_id(16B) + name(256B) + metadata(128B)
+ timestamps(16B) + indexes(~116B) = ~564 bytes
10B files * 564B = 5.6TB
Version record: id(16B) + file_id(16B) + user_id(16B) + version(4B) + size(8B)
+ checksum(32B) + device_id(16B) + chunk_count(4B) + timestamps(16B)
+ indexes(~72B) = ~200 bytes
10B files * 5 versions = 50B versions * 200B = 10TB
Version-chunk mapping: version_id(16B) + user_id(16B) + chunk_index(4B)
+ chunk_hash(32B) + chunk_size(4B) = ~72 bytes
50B versions * 3 chunks avg = 150B rows * 72B = 10.8TB
Total PostgreSQL: 700GB + 5.6TB + 10TB + 10.8TB = 27TB
27TB sharded across 50+ Citus worker nodes. Each node holds ~500GB — within a single PostgreSQL instance's comfort zone with NVMe storage.
Bandwidth
Upload to object storage: 100GB/sec peak
Download from object storage: ~200GB/sec peak (downloads typically 2x uploads)
CDN hit rate: 60% for popular files
CDN offload: 200GB/sec * 0.6 = 120GB/sec served from CDN
Object storage egress: 200GB/sec * 0.4 = 80GB/sec
WebSocket Connections
DAU: 50M
Devices per user: 2.5 average
Online concurrency: 30% at peak
Connections: 50M * 2.5 * 0.3 = 37.5M ~ 40M concurrent connections
Memory per conn: ~15KB (buffers, session state, send queue)
Total memory: 40M * 15KB = 600GB
Connections per pod: 500K (practical limit per gateway instance)
Gateway pods: 40M / 500K = 80 pods
80 WebSocket gateway pods, each holding 500K connections. Memory-bound, not CPU-bound.
Redis Cluster
Bloom filter: 12.5B chunks, 0.1% false positive rate
15 bits per element = 12.5B * 15 / 8 = ~23.4GB
Active upload sessions: 100K concurrent * 2KB each = 200MB
Connection registry: 40M entries * 32B each = 1.28GB
Rate limit counters: 5M * 16B = 80MB
Total Redis memory: ~25GB
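The ~15 bits per element above follows from the standard Bloom filter sizing formula, m/n = -ln(p) / (ln 2)^2, with optimal hash count k = (m/n) * ln 2. A quick check of the figures (n and p as in the estimate):

```python
import math

# Standard Bloom filter sizing check for the numbers used above.
n = 12.5e9   # unique chunks
p = 0.001    # 0.1% false positive rate

bits_per_elem = -math.log(p) / math.log(2) ** 2   # optimal m/n
k = bits_per_elem * math.log(2)                   # optimal hash count
total_gb = n * 15 / 8 / 1e9                       # using the rounded 15 bits

print(f"{bits_per_elem:.1f} bits/elem, k~{k:.0f} hashes, {total_gb:.1f} GB")
# -> 14.4 bits/elem, k~10 hashes, 23.4 GB
```

The optimum is ~14.4 bits per element; rounding up to 15 bits gives a small margin and lands on the ~23.4GB quoted above.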
Summary
| Resource | Quantity | Notes |
|---|---|---|
| Object storage | 50PB | Content-addressable chunks, deduplicated |
| Object store PUTs/sec | 25K | After 50% dedup |
| Object storage ingress | 100GB/sec | Peak upload bandwidth |
| PostgreSQL storage | 27TB | Sharded across 50+ Citus nodes |
| Redis memory | ~25GB | Bloom filter + sessions + connections |
| WebSocket pods | 80 | 500K connections each |
| Kafka throughput | 50K events/sec | Sync events, upload completions |
| Elasticsearch | ~2TB | Filename and metadata index |
Data Model
Users Table
CREATE TABLE users (
id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
email TEXT UNIQUE NOT NULL,
display_name TEXT NOT NULL,
quota_bytes BIGINT NOT NULL DEFAULT 2147483648, -- 2GB free tier
storage_used BIGINT NOT NULL DEFAULT 0,
plan TEXT NOT NULL DEFAULT 'free',
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Citus distribution
SELECT create_distributed_table('users', 'id');

File Metadata Table
CREATE TABLE file_metadata (
id UUID NOT NULL DEFAULT gen_random_uuid(),
user_id UUID NOT NULL REFERENCES users(id),
parent_id UUID, -- NULL for root folder
name TEXT NOT NULL,
is_folder BOOLEAN NOT NULL DEFAULT false,
latest_version INTEGER NOT NULL DEFAULT 0,
size_bytes BIGINT NOT NULL DEFAULT 0,
mime_type TEXT,
checksum TEXT, -- SHA-256 of full file content
is_deleted BOOLEAN NOT NULL DEFAULT false,
deleted_at TIMESTAMPTZ,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, id) -- Citus requires dist key in PK
);
SELECT create_distributed_table('file_metadata', 'user_id');
CREATE INDEX idx_file_parent ON file_metadata(user_id, parent_id);
CREATE UNIQUE INDEX idx_file_name ON file_metadata(user_id, parent_id, name)
  WHERE is_deleted = false;

File Versions Table
CREATE TABLE file_versions (
id UUID NOT NULL DEFAULT gen_random_uuid(),
file_id UUID NOT NULL,
user_id UUID NOT NULL,
version_number INTEGER NOT NULL,
size_bytes BIGINT NOT NULL,
checksum TEXT NOT NULL, -- SHA-256 of full file (Merkle root)
device_id UUID NOT NULL,
chunk_count INTEGER NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
PRIMARY KEY (user_id, id),
FOREIGN KEY (user_id, file_id) REFERENCES file_metadata(user_id, id)
);
SELECT create_distributed_table('file_versions', 'user_id',
colocate_with => 'file_metadata');
CREATE UNIQUE INDEX idx_version_number
ON file_versions(user_id, file_id, version_number);
Version Chunks Table
CREATE TABLE version_chunks (
version_id UUID NOT NULL,
user_id UUID NOT NULL,
chunk_index INTEGER NOT NULL, -- 0-based order within the file
chunk_hash BYTEA NOT NULL, -- SHA-256 (32 bytes)
chunk_size INTEGER NOT NULL,
storage_key TEXT NOT NULL, -- denormalized (avoids cross-shard join)
PRIMARY KEY (user_id, version_id, chunk_index),
FOREIGN KEY (user_id, version_id) REFERENCES file_versions(user_id, id)
);
SELECT create_distributed_table('version_chunks', 'user_id',
colocate_with => 'file_metadata');
Byte offsets are not stored explicitly. The file is reconstructed by concatenating chunks in chunk_index order. Each chunk knows its own size, so byte offsets can be derived at read time if needed for range reads.
Chunks Table (Global, Not User-Sharded)
CREATE TABLE chunks (
hash BYTEA PRIMARY KEY, -- SHA-256 (32 bytes)
size_bytes INTEGER NOT NULL,
storage_bucket TEXT NOT NULL,
storage_key TEXT NOT NULL,
ref_count INTEGER NOT NULL DEFAULT 1,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
last_referenced TIMESTAMPTZ NOT NULL DEFAULT now()
);
-- Sharded by hash (NOT a reference table -- 700GB replicated to every node is impractical)
SELECT create_distributed_table('chunks', 'hash');
Why cross-shard joins are expensive: The chunks table is sharded by hash, while version_chunks is sharded by user_id. A JOIN between them fans out to multiple shards, adding network round trips and coordination overhead.
How cross-shard joins are avoided:
- Bloom filter + point lookups for dedup. The Bloom filter handles 99%+ of chunk existence checks. The chunks table is queried only for verification and ref count management — point lookups by hash (single-shard), not joins.
- Denormalized storage_key in version_chunks. When downloading a file, the server needs the storage key for each chunk to generate presigned URLs. Instead of joining across shards, storage_key is stored directly in version_chunks. One extra column (~50 bytes per row) eliminates a cross-shard join on every download. Written once during /upload/complete, never changes.
- Co-location by user_id. All user-scoped tables are co-located on the same Citus shard. Queries within a user's data are single-shard operations — fast and transactional.
The chunks table remains the source of truth for chunk existence and ref counts. But hot-path reads (download, sync) never touch it — they read denormalized data from version_chunks on the user's local shard.
Upload Sessions in Redis
{
"key": "upload:{upload_id}",
"value": {
"upload_id": "up_abc123",
"user_id": "usr_xyz789",
"file_id": "file_def456",
"storage_multipart_id": "mpu_id",
"total_chunks": 25,
"completed_chunks": [0, 1, 2, 3, 5, 6, 7],
"failed_chunks": [4],
"presigned_urls": {
"8": "https://storage.example.com/...?X-Storage-Signature=...",
"9": "https://storage.example.com/...?X-Storage-Signature=..."
},
"chunk_hashes": ["a1b2c3...", "d4e5f6...", "..."],
"chunks_to_upload": [4, 8, 9, 10, 11, 12],
"chunks_existing": [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24],
"started_at": "2026-03-10T14:30:00Z",
"expires_at": "2026-03-17T14:30:00Z"
},
"TTL": "7 days"
}
Upload sessions live in Redis with a 7-day TTL matching the multipart upload lifecycle.
Client-Side SQLite Schema
Every desktop client maintains a local SQLite database mirroring server state for synced folders:
CREATE TABLE local_files (
path TEXT PRIMARY KEY, -- relative to sync root
file_id TEXT, -- server file ID
version_number INTEGER,
size_bytes INTEGER,
checksum TEXT, -- SHA-256 of full file
chunk_hashes TEXT, -- JSON array of chunk SHA-256 hashes
local_mtime INTEGER, -- filesystem modification time
sync_status TEXT DEFAULT 'synced', -- synced, pending_upload, pending_download,
-- conflicted, error
last_synced_at INTEGER -- epoch seconds
);
CREATE TABLE sync_queue (
id INTEGER PRIMARY KEY AUTOINCREMENT,
path TEXT NOT NULL,
action TEXT NOT NULL, -- upload, download, delete, rename, move
priority INTEGER DEFAULT 0,
created_at INTEGER NOT NULL,
attempts INTEGER DEFAULT 0,
last_error TEXT
);
CREATE TABLE sync_state (
key TEXT PRIMARY KEY,
value TEXT
);
-- stores: last_cursor, last_sync_time, device_id
Object key format: Chunks are stored with a key derived from their SHA-256 hash, distributed across prefixes:
chunks/{hash[0:2]}/{hash[2:4]}/{full_sha256_hash}
Example: chunks/a1/b2/a1b2c3d4e5f6...full64charhash
The two-level prefix creates 65,536 prefixes (256 * 256), distributing storage operations evenly to avoid per-prefix rate limits.
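The key derivation above is a pure function of the chunk's content, which is what makes the store content-addressable. A minimal sketch:

```python
import hashlib

def chunk_storage_key(chunk_data: bytes) -> str:
    """Derive the content-addressed object key for a chunk.

    Layout: chunks/{hash[0:2]}/{hash[2:4]}/{full_hash}, giving
    256 * 256 = 65,536 prefixes to spread PUT/GET load evenly.
    """
    h = hashlib.sha256(chunk_data).hexdigest()
    return f"chunks/{h[0:2]}/{h[2:4]}/{h}"

key = chunk_storage_key(b"hello chunk")
```

Identical content always maps to the same key, so the same function serves both upload (where to PUT) and dedup (what to check).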
API Design
Upload Flow
Initialize upload:
POST /v1/files/upload/init
Authorization: Bearer <jwt>
Content-Type: application/json
{
"path": "/Documents/report.xlsx",
"size_bytes": 104857600,
"checksum": "sha256:abc123...",
"base_version": 5,
"chunk_hashes": [
"a1b2c3d4...",
"e5f6a7b8...",
"c9d0e1f2...",
...
],
"chunk_sizes": [
4194304,
4194304,
3670016,
...
]
}
Response 200:
{
"upload_id": "up_abc123",
"chunks_needed": [0, 5, 12],
"chunks_existing": [1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 13, ...],
"presigned_urls": {
"0": {
"url": "https://storage.example.com/chunks/a1/b2/a1b2...?...",
"expires_at": "2026-03-10T15:00:00Z"
},
"5": {
"url": "https://storage.example.com/chunks/e5/f6/e5f6...?...",
"expires_at": "2026-03-10T15:00:00Z"
},
"12": {
"url": "https://storage.example.com/chunks/c9/d0/c9d0...?...",
"expires_at": "2026-03-10T15:00:00Z"
}
}
}
The server checks all chunk hashes against the dedup system. Only truly new chunks get presigned URLs.
Complete upload:
POST /v1/files/upload/{upload_id}/complete
Authorization: Bearer <jwt>
{
"etags": {
"0": "\"etag_for_chunk_0\"",
"5": "\"etag_for_chunk_5\"",
"12": "\"etag_for_chunk_12\""
}
}
Response 200:
{
"file_id": "file_def456",
"version": 6,
"size_bytes": 104857600,
"synced_at": "2026-03-10T14:35:00Z"
}
Check progress (for resume):
GET /v1/files/upload/{upload_id}/progress
Authorization: Bearer <jwt>
Response 200:
{
"upload_id": "up_abc123",
"total_chunks": 25,
"completed_chunks": [0, 5],
"remaining_chunks": [12],
"presigned_urls": {
"12": {
"url": "https://storage.example.com/chunks/c9/d0/c9d0...?...",
"expires_at": "2026-03-10T15:15:00Z"
}
}
}
If presigned URLs have expired, the server generates fresh ones. The multipart upload stays valid for 7 days.
Download
GET /v1/files/{file_id}/download?version=6
Authorization: Bearer <jwt>
Response 200:
{
"file_id": "file_def456",
"version": 6,
"size_bytes": 104857600,
"checksum": "sha256:abc123...",
"chunks": [
{
"index": 0,
"hash": "a1b2c3d4...",
"size": 4194304,
"url": "https://cdn.example.com/chunks/a1/b2/a1b2...?..."
},
{
"index": 1,
"hash": "e5f6a7b8...",
"size": 4194304,
"url": "https://cdn.example.com/chunks/e5/f6/e5f6...?..."
}
]
}
URLs point to the CDN for caching. The client downloads chunks in parallel, verifies each against its hash, and assembles the file locally.
Sync (Cursor-Based)
POST /v1/sync/changes
Authorization: Bearer <jwt>
{
"cursor": "cursor_2026031014300000_shard42",
"limit": 500
}
Response 200:
{
"changes": [
{
"type": "file_modified",
"file_id": "file_def456",
"path": "/Documents/report.xlsx",
"version": 6,
"size_bytes": 104857600,
"checksum": "sha256:abc123...",
"modified_by_device": "device_laptop01",
"timestamp": "2026-03-10T14:35:00Z"
},
{
"type": "file_created",
"file_id": "file_ghi789",
"path": "/Photos/vacation.jpg",
"version": 1,
"size_bytes": 5242880,
"checksum": "sha256:def456...",
"modified_by_device": "device_phone01",
"timestamp": "2026-03-10T14:36:00Z"
}
],
"cursor": "cursor_2026031014360000_shard42",
"has_more": false
}
The cursor is opaque to the client. Internally it encodes a timestamp and shard identifier. The client stores it in local SQLite and uses it on reconnection to catch up on missed changes.
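The exact cursor encoding is internal, but a hypothetical codec consistent with the example strings above (a timestamp with two trailing sub-second digits plus a shard id) might look like this; a production encoding would likely be signed or base64-wrapped so clients cannot depend on the layout:

```python
from datetime import datetime, timezone

TS_FMT = "%Y%m%d%H%M%S"  # assumed layout, e.g. "cursor_2026031014300000_shard42"

def encode_cursor(ts: datetime, shard: int) -> str:
    # Two trailing zeros stand in for sub-second precision in this sketch.
    return f"cursor_{ts.strftime(TS_FMT)}00_shard{shard}"

def decode_cursor(cursor: str):
    _, raw_ts, raw_shard = cursor.split("_")
    ts = datetime.strptime(raw_ts[:-2], TS_FMT).replace(tzinfo=timezone.utc)
    return ts, int(raw_shard[len("shard"):])

ts, shard = decode_cursor("cursor_2026031014300000_shard42")
```

The shard id lets the server route the changes query straight to the user's Citus shard without a lookup.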
WebSocket Protocol
Connection handshake:
wss://ws.example.com/v1/sync/ws
Authorization: Bearer <jwt>
X-Device-Id: device_laptop01
Message types:
// Client -> Server: Subscribe to sync events
{
"type": "SUBSCRIBE",
"folders": ["/", "/Documents", "/Photos"],
"cursor": "cursor_2026031014300000_shard42"
}
// Server -> Client: File change notification
{
"type": "CHANGE",
"id": "evt_123",
"change_type": "file_modified",
"file_id": "file_def456",
"path": "/Documents/report.xlsx",
"version": 6,
"modified_by": "device_phone01",
"timestamp": "2026-03-10T14:35:00Z"
}
// Client -> Server: Acknowledge receipt
{
"type": "ACK",
"id": "evt_123"
}
// Bidirectional: Keepalive
{ "type": "PING", "timestamp": 1741614900000 }
{ "type": "PONG", "timestamp": 1741614900000 }
The SUBSCRIBE message includes a cursor so the server can replay missed events. PING/PONG every 30 seconds detects dead connections. Three unanswered PINGs trigger connection teardown and removal from the Redis connection registry.
File Operations
POST /v1/files/move
{ "file_id": "file_def456", "new_parent_id": "folder_abc", "new_name": "renamed-report.xlsx" }
POST /v1/files/copy
{ "file_id": "file_def456", "destination_parent_id": "folder_xyz", "name": "report-copy.xlsx" }
DELETE /v1/files/{file_id}
// Soft delete. Moves to trash. Hard delete after 30 days.
GET /v1/files/{file_id}/versions
// Returns list of all versions with timestamps, sizes, device info.
POST /v1/files/{file_id}/restore?version=3
// Creates new version identical to version 3.
POST /v1/shares
{ "file_id": "file_def456", "email": "colleague@example.com", "permission": "edit", "expires_at": "2026-04-10T00:00:00Z" }
GET /v1/search?q=quarterly+report&type=xlsx&modified_after=2026-01-01
The Upload Pipeline
Problem: How does the client go from "file changed on disk" to "change confirmed and syncing to other devices"?
Mental model: A factory assembly line. Raw material (file bytes) enters at one end. The file is sliced, fingerprinted, checked against inventory, and only new pieces ship to the warehouse. A receipt is filed, and all branch offices are notified.
The chunking engine is the most performance-critical component in the desktop client. It runs entirely on the user's machine.
Step 1: Filesystem watcher detects a change. The desktop client installs OS-level watchers: FSEvents (macOS), inotify (Linux), ReadDirectoryChangesW (Windows). The event is added to a sync queue with 500ms debouncing to avoid processing mid-write states.
Step 2: Read and chunk the file. The client streams the file through the CDC engine. Memory usage is constant: only the 48-byte sliding window plus the current chunk buffer (max 8MB). A 50GB file processes in a single streaming pass.
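The CDC engine in Step 2 can be sketched as follows. This is a toy with deliberately shrunk parameters (tiny chunks, a simple accumulating hash rather than windowed Rabin fingerprinting over a 48-byte window) so the boundary behavior is visible on small inputs; only the principle — boundaries chosen by content, not offset — carries over:

```python
import random

MASK = (1 << 6) - 1   # boundary when low 6 hash bits are zero (~64B average here)
MIN_CHUNK = 32
MAX_CHUNK = 256

def cdc_stream(data: bytes):
    """Yield content-defined chunks: boundaries depend on the bytes
    themselves, so an edit only reshapes chunks near the change."""
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Toy polynomial rolling hash standing in for Rabin fingerprinting.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (rolling & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        yield data[start:]  # trailing partial chunk

random.seed(1)
data = bytes(random.randrange(256) for _ in range(2000))
chunks = list(cdc_stream(data))
```

MIN_CHUNK and MAX_CHUNK bound the chunk-size distribution; the mask width sets the average. The production values (~4MB average, 8MB max) follow the same scheme.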
Step 3: SHA-256 hash each chunk. On modern CPUs with SHA-NI extensions, SHA-256 runs at ~2GB/sec. A 4MB chunk hashes in ~2ms.
Step 4: Compare against local SQLite. The client compares the new chunk hash list against the previous version stored locally. Unchanged chunks (same hash, same position) don't need uploading.
Step 5: Send hash list to server. POST /v1/files/upload/init with the full list of chunk hashes.
Step 6: Server responds with dedup results. "Chunks 0, 1, 3, 4 already exist. Only chunks 2 and 5 need uploading." This combines delta sync (only changed chunks from the previous version) with cross-user dedup (chunks that exist from other users' files).
Step 7: Upload new chunks via presigned URLs. Parallel uploads (4-8 concurrent by default). Each chunk is a separate PUT operation directly to object storage.
Concrete example: A user edits a 100MB Excel file, changing values in row 5000. CDC produces 25 chunks. Compared to the previous version, 1-2 chunks have different hashes. The server reports that 23 of 25 chunks already exist. The client uploads only 2 chunks (8MB total) — 92% bandwidth savings.
Full pipeline pseudocode:
def sync_file(file_path):
    # Step 1: Read and chunk
    chunks = []
    with open(file_path, 'rb') as f:
        for chunk_data in cdc_stream(f):
            chunk_hash = sha256(chunk_data)
            chunks.append({
                'index': len(chunks),
                'hash': chunk_hash,
                'size': len(chunk_data),
                'data': chunk_data
            })
    # Step 2: Local delta detection
    prev_hashes = sqlite_get_chunk_hashes(file_path)
    changed_indices = []
    for i, chunk in enumerate(chunks):
        if i >= len(prev_hashes) or chunk['hash'] != prev_hashes[i]:
            changed_indices.append(i)
    if not changed_indices and len(chunks) == len(prev_hashes):
        return  # No changes
    # Step 3: Server dedup check
    response = api_upload_init(
        path=file_path,
        chunk_hashes=[c['hash'] for c in chunks],
        chunk_sizes=[c['size'] for c in chunks]
    )
    # Step 4: Upload only needed chunks, collecting the ETag per chunk
    etags = {}
    for idx in response['chunks_needed']:
        presigned_url = response['presigned_urls'][str(idx)]
        etags[str(idx)] = upload_chunk(presigned_url, chunks[idx]['data'])
    # Step 5: Finalize
    api_upload_complete(response['upload_id'], etags)
    # Step 6: Update local state
    sqlite_update_file(file_path, chunks)
Resumable Uploads
Problem: A user uploading a 2GB video over a flaky connection will experience drops. Without resumable uploads, they restart from scratch every time. At chunk 47 of 500, the connection dies. How do we resume from chunk 48?
Simple example: Think of mailing 500 packages to the same address. You track which packages were delivered. If the mail truck breaks down after 300 deliveries, you resume from package 301 — not package 1.
Mental model: Each chunk is an independent package with its own tracking number (presigned URL). The upload session is a shipping manifest in Redis. Progress survives client restarts, network failures, and even server pod restarts (the object store is the ultimate source of truth for what was delivered).
The flow:
- Create session. The Upload Orchestrator stores session state in Redis.
- Generate presigned URLs. Each needed chunk gets a presigned PUT URL. URLs expire in 15 minutes (security best practice).
- Client uploads chunks directly. Object store returns an ETag per completed chunk.
- Track progress. Client notifies the Orchestrator after each chunk. Redis session updated.
- Network failure. Client loses connectivity. Some chunks uploaded, some not. The session and uploaded chunks are preserved.
- Resume. Client calls GET /upload/{id}/progress. Orchestrator checks the Redis session plus ListParts on the object store. Returns remaining chunks with fresh presigned URLs.
- Continue. Client uploads remaining chunks from where it left off.
- Finalize. Once all chunks are uploaded, create version record and clean up.
Key detail: Presigned URLs expire in 15 minutes, but the multipart upload stays valid for 7 days. When a client reconnects after URLs expire, the Orchestrator generates fresh URLs. Chunks already uploaded are preserved.
Edge case: Redis session lost. If Redis loses the session (node failure, TTL expiry), the Orchestrator reconstructs it by calling ListParts on the object store. Slower but ensures resumability through infrastructure failures.
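The reconstruction path can be sketched as below. `list_parts` is an injected callable standing in for the object store's part-listing API (e.g. S3 ListParts), returning (part_number, etag) pairs; part numbers are 1-based in multipart APIs while chunk indices are 0-based in this design:

```python
def rebuild_session(upload_id: str, total_chunks: int, list_parts):
    """Rebuild a lost upload session from the object store's part listing.

    The store is the source of truth for what was durably uploaded, so the
    session can always be reconstructed even if Redis loses it.
    """
    uploaded = {part_number - 1: etag for part_number, etag in list_parts(upload_id)}
    return {
        "upload_id": upload_id,
        "total_chunks": total_chunks,
        "completed_chunks": sorted(uploaded),
        "etags": uploaded,
        "chunks_to_upload": [i for i in range(total_chunks) if i not in uploaded],
    }

# Example: the store reports parts 1, 2, and 4 as durably uploaded.
fake_listing = lambda _id: [(1, "etag-a"), (2, "etag-b"), (4, "etag-d")]
session = rebuild_session("up_abc123", 5, fake_listing)
```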
Content Deduplication
Problem: 500 employees upload the same 50MB PDF. Without dedup, that is 25GB for one file. With dedup, 50MB. How does the system detect that a chunk already exists before uploading it?
Simple example: A library catalogues books by ISBN. When someone donates a book, the librarian checks the ISBN. If the library already has a copy, the donation is politely declined. The ISBN is the SHA-256 hash; the library catalogue is the Bloom filter + PostgreSQL.
Mental model: Two-tier lookup. A fast approximate check (Bloom filter) says "definitely new" or "probably exists." If "probably exists," a slow exact check (PostgreSQL) confirms. This keeps the fast path fast while maintaining correctness.
The dedup flow (runs during /upload/init, before any bytes upload):
- Client sends chunk hashes (SHA-256) to server
- Server checks each hash: Redis Bloom filter (fast) then PostgreSQL chunks table (accurate)
- Server returns: chunks_existing (skip) + chunks_needed (upload)
- Only chunks_needed get presigned URLs
Bloom filter sizing: 12.5B chunks with 15 bits per element gives a 0.1% false positive rate in ~23GB. This fits in Redis Cluster memory. If the Bloom filter says "not present," the chunk is guaranteed new — no further checks needed. The 0.1% false positives are caught by PostgreSQL verification.
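The two-tier check can be sketched as below. `bloom_maybe_contains` and `db_lookup` stand in for RedisBloom and the chunks table; the Bloom filter may return false positives but never false negatives, so a "no" short-circuits and a "maybe" is verified exactly:

```python
def check_chunks(chunk_hashes, bloom_maybe_contains, db_lookup):
    """Partition chunk hashes into existing (skip) and needed (upload)."""
    existing, needed = [], []
    for idx, h in enumerate(chunk_hashes):
        if not bloom_maybe_contains(h):
            needed.append(idx)    # definitely new: skip the DB entirely
        elif db_lookup(h):
            existing.append(idx)  # confirmed duplicate
        else:
            needed.append(idx)    # Bloom false positive (~0.1% of checks)
    return existing, needed

# Stand-ins: "h_fp" simulates a Bloom false positive.
stored = {"h1", "h2"}
bloom = lambda h: h in stored or h == "h_fp"
db = lambda h: h in stored
existing, needed = check_chunks(["h1", "h_new", "h_fp", "h2"], bloom, db)
```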
Cross-user dedup example: A company with 500 employees distributes a 50MB handbook PDF. The first employee uploads it: CDC produces 13 chunks, all new, all uploaded, ref_count = 1 each. The second employee uploads the same PDF: same 13 hashes, all exist, zero bytes uploaded, ref_counts incremented to 2. Employees 3-500: zero uploads. Storage savings: 99.8% for this file. Across the entire platform, empirical data suggests 50-60% overall dedup ratio.
Security tradeoff: If the server tells a client "this chunk already exists, skip upload," the client can infer someone else has the same content. Mitigations: (1) dedup only within the same organization, (2) per-tenant encryption keys so chunks differ across tenants, or (3) per-upload salt (eliminates cross-user dedup). Consumer services accept the risk. Enterprise deployments restrict to within-tenant.
Reference Counting and Garbage Collection
Every chunk has a ref_count tracking how many file versions reference it:
- Incremented when a new file version references an existing chunk
- Decremented when a file version is deleted (version expiry, user deletion)
When ref_count reaches 0, the chunk is eligible for deletion. A 24-hour grace period protects against race conditions (a chunk's ref_count could temporarily hit 0 during a concurrent upload about to reference it).
Garbage collection process:
- Background job runs hourly, scans for chunks where ref_count = 0 AND last_referenced < now() - 24 hours
- Delete the object from the store
- Delete the chunks table row
- Delete the chunks table row
- Remove from Bloom filter (requires periodic rebuild from the chunks table, since standard Bloom filters don't support deletion; rebuilt nightly)
Race condition: simultaneous uploads of the same chunk. User A and User B both upload a file containing the same chunk. Both pass the Bloom filter as "new." Both try to insert into the chunks table. The unique constraint on hash catches it: one insert succeeds (ref_count = 1), the other hits a conflict and retries as a ref_count increment. No duplicate stored.
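The insert-or-increment behavior can be modeled with a single atomic UPSERT, shown here against SQLite for a runnable sketch (Postgres `INSERT ... ON CONFLICT` behaves the same way); the insert-then-retry flow described above collapses into one statement:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE chunks (
    hash TEXT PRIMARY KEY,
    size_bytes INTEGER NOT NULL,
    ref_count INTEGER NOT NULL DEFAULT 1)""")

def register_chunk(conn, chunk_hash, size_bytes):
    # First writer inserts with ref_count = 1; a concurrent writer's
    # unique-constraint conflict becomes an increment, never a duplicate row.
    conn.execute(
        """INSERT INTO chunks (hash, size_bytes, ref_count)
           VALUES (?, ?, 1)
           ON CONFLICT(hash) DO UPDATE SET ref_count = ref_count + 1""",
        (chunk_hash, size_bytes))

register_chunk(conn, "h_abc", 4194304)  # User A: new row
register_chunk(conn, "h_abc", 4194304)  # User B: same chunk, same instant
(ref_count,) = conn.execute(
    "SELECT ref_count FROM chunks WHERE hash = 'h_abc'").fetchone()
```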
Weekly reconciliation: Safety net. Scans all version_chunks references, counts actual references per chunk, compares against chunks.ref_count, and fixes discrepancies.
Integrity Verification
Problem: Presigned URLs mean file bytes bypass the server entirely. The client claims "I uploaded chunk with hash h_abc," but the server never saw the bytes. How do we verify nothing is corrupted or tampered with?
Mental model: Think of a notary system. Individual documents (chunks) are notarized independently. Then the collection of documents is sealed in an envelope with a tamper-evident seal (Merkle root). If any document is swapped, reordered, or missing, the seal breaks.
Chunk-Level Verification
A malicious or buggy client could claim hash h_abc but upload different data. Every subsequent download of that chunk delivers garbage.
Step 1: ETag check (fast, inline). During /upload/complete, the server compares ETags. The ETag is typically an MD5 of the uploaded content — catches accidental corruption.
Step 2: SHA-256 verification (async, thorough). A Kafka event triggers a verification worker that downloads the chunk, computes SHA-256, and compares against the claimed hash. Catches everything: malicious uploads, silent corruption, hash collisions.
Step 3: Version status. New versions are created with status: pending_verification. The version becomes verified once all chunks pass. Other devices are notified only after verification completes (typically 1-5 seconds added).
File-Level Integrity (Merkle Tree)
Individual chunk hashes prove each chunk is valid. But they don't prove the whole file is valid: chunks could be reordered, duplicated, or missing. The checksum field in file_versions solves this via a Merkle tree over chunk hashes.
File: [Chunk0] [Chunk1] [Chunk2] [Chunk3]
| | | |
h_0 h_1 h_2 h_3 <- leaf hashes (SHA-256 of chunk content)
\ / \ /
SHA-256(h_0||h_1) SHA-256(h_2||h_3) <- intermediate nodes
\ /
SHA-256(h_01 || h_23) <- Merkle root = file checksum
def merkle_root(chunk_hashes):
    if len(chunk_hashes) == 0:
        return sha256(b'')
    nodes = list(chunk_hashes)
    while len(nodes) > 1:
        if len(nodes) % 2 == 1:
            nodes.append(nodes[-1])  # duplicate last hash at each odd level
        next_level = []
        for i in range(0, len(nodes), 2):
            combined = sha256(nodes[i] + nodes[i+1])
            next_level.append(combined)
        nodes = next_level
    return nodes[0]  # Root hash = file checksum
Why a Merkle tree instead of concatenating all hashes?
- Partial verification. To verify a single chunk belongs to the file, you need only its sibling hashes along the path to the root — O(log N) hashes, not all N. For 12,500 chunks: 14 hashes instead of 12,500.
- Corruption localization. Binary-search for exactly which chunk is corrupted — O(log N) checks.
- Tamper evidence. Changing, reordering, or removing any chunk changes the Merkle root.
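Partial verification can be sketched as a standard Merkle proof check. Each proof step is a (sibling_hash, sibling_is_left) pair matching the concatenation order used by merkle_root; the names here are illustrative:

```python
import hashlib

sha256 = lambda b: hashlib.sha256(b).digest()

def verify_chunk(leaf_hash, proof, expected_root):
    """Recompute the root from one leaf plus O(log N) sibling hashes."""
    node = leaf_hash
    for sibling, sibling_is_left in proof:
        node = sha256(sibling + node) if sibling_is_left else sha256(node + sibling)
    return node == expected_root

# Four-leaf tree: root = H(H(h0||h1) || H(h2||h3)).
h = [sha256(bytes([i]) * 4) for i in range(4)]
root = sha256(sha256(h[0] + h[1]) + sha256(h[2] + h[3]))
# Proof for leaf 1: sibling h0 (on the left), then sibling H(h2||h3) (right).
proof = [(h[0], True), (sha256(h[2] + h[3]), False)]
ok = verify_chunk(h[1], proof, root)
```

Two hashes prove membership here; for 12,500 chunks the proof is 14 hashes.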
| When | What | How |
|---|---|---|
| Upload: client | Merkle root from chunk hashes | Sent in /upload/init as checksum |
| Upload: server | Recompute Merkle root from request | Must match client's claimed checksum |
| Download: client | Each chunk hash | SHA-256(downloaded_bytes) == expected hash |
| Download: client | Merkle root after assembly | Recompute from chunk hashes, compare to version checksum |
| Periodic: server | Async chunk verification | Worker re-hashes chunks in object store |
The Sync Protocol
Problem: When a file changes on one device, how do all other devices learn about it and download only the changed parts — ideally within seconds?
Simple example: A group chat where the server posts "report.xlsx updated to version 6" and each member checks what's new since their last read. Members who were offline scroll back through history when they return.
Mental model: WebSocket is the real-time notification channel (instant push). Cursor-based sync is the catch-up mechanism (reliable recovery). Together they ensure no change is ever missed, regardless of connectivity.
Upload Notification Path
After a successful upload, the Sync Service publishes a sync event to Kafka. The WebSocket Gateway consumes it and pushes to all other devices.
Download Path
1. WebSocket CHANGE message arrives
2. Client fetches file version metadata (chunk list)
3. Client diffs chunk list against local SQLite
4. Client downloads only new/changed chunks from CDN/object storage
5. Client assembles file from chunks (local cache + downloaded)
6. Client writes file to filesystem
7. Client updates local SQLite with new version
8. FS watcher fires (from own write) - client ignores
(filters out self-triggered events by checking sync_status)
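Steps 3-5 of the download path reduce to a diff against the local content-addressed cache. A minimal sketch, where `local_cache` maps hash to bytes and `fetch` stands in for a CDN GET:

```python
def assemble_version(chunk_list, local_cache, fetch):
    """Fetch only missing chunks, then assemble the file in chunk order."""
    missing = [h for h in chunk_list if h not in local_cache]
    for h in missing:               # downloaded in parallel in practice
        local_cache[h] = fetch(h)
    file_bytes = b"".join(local_cache[h] for h in chunk_list)
    return file_bytes, missing

cache = {"h1": b"AAAA", "h2": b"BBBB"}   # chunks already on disk
remote = {"h3": b"CCCC"}                 # stand-in for CDN/object store
data, fetched = assemble_version(["h1", "h3", "h2"], cache, remote.__getitem__)
```

Because chunks are content-addressed, a cached hash never needs revalidation: only "h3" crosses the network here.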
Cursor-Based Offline Sync
When a device comes online after being offline, it cannot rely on missed WebSocket messages. Instead:
- Client reads its last cursor from local SQLite
- Calls POST /v1/sync/changes with the cursor
- Server returns all changes since that cursor, paginated
- Client processes each change: download new files, update modified, delete removed
- Client stores the new cursor
The cursor encodes a timestamp plus shard identifier, allowing efficient queries on the user's Citus shard.
Kafka topic design: Sync events are partitioned by user_id_hash % 1000, creating 1000 partitions. All events for a single user stay ordered (same partition) while load distributes across brokers.
Conflict Resolution
Problem: Two devices edit the same file while one is offline. When both try to upload, whose edit wins? And what happens to the loser's work?
Simple example: Two people write different notes on the same shared whiteboard while in different rooms. When they reconvene, both notes are preserved — the later one is on the whiteboard, the earlier one is photographed and kept as a "conflicted copy."
Mental model: Last-writer-wins for the primary file, safety copy for the loser. No data is ever silently lost.
Every file upload includes a base_version in the init request. The server checks:
- If base_version == server's latest version: success, create version N+1.
- If base_version < server's latest version: conflict detected.
Concrete scenario: Alice has budget.xlsx synced on laptop and phone. Current version is v5.
- Laptop goes offline (airplane mode)
- Alice edits on laptop. Local version: v5-local-edit.
- Meanwhile, she edits the same file on her phone (online).
- Phone uploads. Server creates v6 (base_version=5, matches, no conflict).
- Laptop reconnects. Tries to upload with base_version=5.
- Server sees base_version=5 but current=6. Conflict.
Resolution: Server accepts the laptop's upload as v7. Creates a new file containing the phone's v6 content as budget (conflicted copy - Phone - 2026-03-10).xlsx. Both devices are notified. Alice sees both files, manually merges, and deletes the conflicted copy.
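The server-side decision reduces to one comparison. A hypothetical sketch (function and parameter names are illustrative; the real service also persists both versions and notifies devices):

```python
def resolve_upload(filename: str, current_version: int, base_version: int,
                   losing_device: str, date: str):
    """Last-writer-wins with a conflicted copy for the displaced version."""
    new_version = current_version + 1
    if base_version == current_version:
        return new_version, None    # clean fast-forward, no conflict
    stem, _, ext = filename.rpartition(".")
    copy = f"{stem} (conflicted copy - {losing_device} - {date}).{ext}"
    return new_version, copy

# The laptop scenario above: current head is v6, laptop uploaded against v5.
version, copy_name = resolve_upload("budget.xlsx", 6, 5, "Phone", "2026-03-10")
```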
Why last-writer-wins? For a general-purpose file sync platform handling binary files (images, videos, compiled documents), there is no universal merge algorithm. Last-writer-wins with conflicted copies is the safest general solution. More sophisticated strategies (operational transformation) apply only to specific file types.
Conflict rate: Less than 0.01% of file operations in practice. But when they happen, the system must handle them gracefully and never lose data.
Delta Sync in Practice
A user has a 20MB PowerPoint file synced across laptop and phone. On the laptop, they change a few slides (affecting about 2MB of data in the middle).
Upload delta (laptop to server):
Version 1 (synced to both devices):
Chunks: [A, B, C, D, E]
Hashes: [h1, h2, h3, h4, h5]
User edits slides in the region covered by chunk C.
Version 2 (laptop computes via CDC):
Chunks: [A, B, C', D, E]
Hashes: [h1, h2, h3', h4, h5]
- Local diff: only index 2 changed (h3 -> h3')
- Server checks dedup: h1, h2, h4, h5 exist. h3' is new.
- Laptop uploads only chunk C' (~4MB) via presigned URL
- Server creates version 2 referencing [h1, h2, h3', h4, h5]
- Bandwidth: ~4MB instead of 20MB (80% savings)
Download delta (server to phone):
- Phone receives WebSocket notification: file updated to v2
- Phone fetches v2 chunk list: [h1, h2, h3', h4, h5]
- Phone's local SQLite has v1: [h1, h2, h3, h4, h5]
- Diff: only h3' is new. Phone already has h1, h2, h4, h5 locally.
- Phone downloads only chunk C' (~4MB)
- Assembles file: local A, B + downloaded C' + local D, E
- Bandwidth: ~4MB instead of 20MB
Scale impact: At 1M uploads/minute, if average delta sync saves 70% of bandwidth, upload volume drops from 2TB/minute to ~600GB/minute. rsync-style delta sync achieves even greater savings with byte-level granularity.
LAN Sync
Problem: Two devices on the same local network both need a file. Downloading from the cloud wastes internet bandwidth when the bytes are right there on the other machine, reachable at gigabit speeds.
Simple example: Two laptops in the same office. Instead of both downloading a 500MB file from the cloud, the second laptop copies it from the first over the local network in half a second.
Mental model: Cloud for truth, LAN for speed. Metadata always flows through the server (who has what, what version). Only chunk bytes take the LAN shortcut.
Discovery via mDNS. When the desktop client starts, it broadcasts on the local network:
Service type: _filesync._tcp.local
Instance: {user_id_hash}_{device_id}
Port: 17500
TXT: version=2, user_id_hash=abc123
Other instances discover this broadcast and filter for matching user_id_hash to find devices belonging to the same user.
Authentication. Each device has a TLS certificate signed by the server during device registration. Devices perform mutual TLS handshake, verifying certificates against the server's CA. This prevents impersonation on shared networks.
Chunk transfer over LAN. When Device B needs chunks that Device A has:
Device B needs chunks: [h3', h7', h15']
Device B asks Device A (LAN peer): "Do you have h3', h7', h15'?"
Device A responds: "I have h3' and h7'. Don't have h15'."
Device B downloads h3' and h7' from Device A over LAN (1Gbps+)
Device B downloads h15' from object store/CDN (internet speed)
Fallback. If LAN transfer fails (device offline, network change), the client transparently falls back to cloud download.
Office-scale impact: 200 employees, 500MB shared presentation. Without LAN sync: 200 * 500MB = 100GB from the internet. With LAN sync: first device downloads 500MB from the cloud, remaining 199 download from LAN peers. Internet bandwidth: ~500MB. That is 99.5% reduction.
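The peer negotiation above can be sketched as a simple planning step: ask each authenticated peer which needed hashes it holds, assign those to LAN transfer, and leave the rest for the cloud fallback. `peers` here is a hypothetical map of peer id to the set of chunk hashes it advertises:

```python
def plan_downloads(needed, peers):
    """Split needed chunk hashes between LAN peers and the cloud."""
    from_lan, remaining = {}, list(needed)
    for peer_id, held in peers.items():
        still = []
        for h in remaining:
            if h in held:
                from_lan.setdefault(peer_id, []).append(h)
            else:
                still.append(h)
        remaining = still
    return from_lan, remaining   # remaining falls back to CDN/object store

lan, cloud = plan_downloads(["h3'", "h7'", "h15'"], {"device_a": {"h3'", "h7'"}})
```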
File Versioning
Problem: Users need to undo changes, recover deleted content, and compare versions. How do you store version history without multiplying storage costs?
Simple example: A photo album where you don't print duplicate photos. Version 1 and Version 3 both include the same photo — they share the same print. Each version is just a list of which photos to include.
Mental model: Versions are pointers, not copies. Creating a new version only stores what changed. Restoring an old version is a metadata operation — no bytes are copied.
v1: [A, B, C] -- Initial upload
v2: [A, B', C] -- B edited, becomes B'
v3: [A, B', C, D] -- D appended
v4 (restore v1): [A, B, C] -- Same chunks as v1, new version record
Chunk A is referenced by all 4 versions (ref_count = 4). Chunk B is in v1 and v4 (ref_count = 2). Chunk B' is in v2 and v3 (ref_count = 2). Chunk D is only in v3 (ref_count = 1).
Restoring a version creates a new version record pointing to the same chunks. No data copied. Milliseconds, regardless of file size.
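Because versions are just lists of chunk references, ref counts fall out of counting hash occurrences across version records. Mirroring the v1-v4 example above:

```python
from collections import Counter

versions = {
    "v1": ["A", "B", "C"],
    "v2": ["A", "B'", "C"],
    "v3": ["A", "B'", "C", "D"],
    "v4": ["A", "B", "C"],   # restore of v1: same chunks, new version record
}
# ref_count per chunk = number of versions referencing its hash.
ref_counts = Counter(chunk for chunks in versions.values() for chunk in chunks)
```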
Version retention: 180 days or 100 versions, whichever comes first. When a version expires:
- Version record and version_chunks entries deleted from PostgreSQL
- Each referenced chunk's ref_count decremented
- Chunks reaching ref_count = 0 enter the GC pipeline (24-hour grace, then deletion)
Version listing is fast: the file_versions table is indexed by (user_id, file_id, version_number) and co-located on the same Citus shard.
Download Path and Caching
On download, the client receives a chunk list and diffs it against its local chunk cache (SQLite tracks which hashes are on disk). Only missing chunks are downloaded. Editing a 100MB file where 2 chunks changed means downloading 8MB, not 100MB.
Chunks are cached locally on disk. Since chunks are content-addressed (hash = identity), a cached chunk is valid forever. If the same hash appears in a different file, the cached copy is reused without downloading.
Small file optimization: Files under 1MB produce a single chunk. At billions of small files, this creates billions of individual objects. At extreme scale, small files can be packed into larger container blobs to reduce object count. For most deployments, one object per chunk is fine up to tens of billions of objects.
Why object storage, not a filesystem? At 12.5 billion chunks, a traditional filesystem would fail. ext4 allocates a fixed number of inodes (tens to hundreds of millions per filesystem). 12.5 billion chunks would require hundreds of filesystems just for inode count. Large directories degrade lookup performance. Object storage uses a flat key-value namespace with no inode concept, no directory hierarchy, no per-file metadata overhead.
Bottlenecks and Mitigations
Twelve bottlenecks to watch in production.
1. Client CPU for SHA-256. SHA-256 at software speed: ~500MB/sec. For a 4GB file, 8 seconds. Mitigation: SHA-NI hardware instructions (~2GB/sec). Hash while chunking (streaming, not two-pass).
2. Presigned URL generation at 25K/sec. Each requires HMAC-SHA256. Mitigation: batch generation, pre-compute URL templates, connection pooling.
3. Redis Bloom filter false positives. At 0.1% FP rate, 25 of every 25K checks hit PostgreSQL. Mitigation: add an LRU cache in the Dedup Service for recently verified chunks.
4. PostgreSQL write amplification during bulk uploads. Bulk initial sync creates thousands of INSERTs. Mitigation: batch inserts using COPY, group commits, connection pooling (PgBouncer).
5. WebSocket fan-out for shared folders. 500-member shared folder means 500 notifications per change. Mitigation: batch notifications (group changes within 100ms), compress payloads, rate-limit per-folder.
6. Chunk existence checks at 50K/sec. Mitigation: multi-tier caching. L1: in-process LRU (top 1M hashes, ~32MB). L2: Redis Bloom filter. L3: PostgreSQL. 95% hit L1 or L2.
7. Object storage rate limits per prefix. 3,500 PUT/sec per prefix. Mitigation: 65,536 prefixes from hash-based keys. 25K PUTs/sec distributed = 0.38 PUTs/sec per prefix.
8. Large file chunk lists. A 50GB file has 12,500 chunks. Mitigation: index on (user_id, version_id, chunk_index). Cache chunk lists in Redis for actively syncing files.
9. GC thundering herd on version expiry. If the product launched 180 days ago, the first retention window expires in one massive deletion wave. Mitigation: rate-limited GC (10K deletions/min), jitter on the expiry cutoff.
10. Client memory during large file chunking. Mitigation: streaming pipeline. Read 64KB blocks, feed through Rabin fingerprinting, emit chunks as found. Peak memory: 8MB (max chunk) + 48 bytes (window).
11. Metadata DB growth. 27TB today, growing ~50TB/year. Mitigation: partition file_versions and version_chunks by month. Archive old partitions to columnar storage.
12. FS watcher event storms. Extracting a ZIP with 10,000 files generates 10,000 events. Mitigation: directory-level debounce (500ms), batch into single sync operation, rate-limit (100 files/sec).
Failure Scenarios
Twelve failure scenarios with recovery strategies.
Object Storage Outage
Impact: All uploads and downloads fail. Recovery: Local files safe. Uploads resume when storage returns (multipart valid 7 days). Metadata operations (rename, move) still work (PostgreSQL only). Data loss: None. 11 nines durability — data is temporarily inaccessible, not lost.
Upload Interrupted Mid-Transfer
Impact: Partial upload. Some chunks in store, others not. Recovery: Client calls GET /upload/{id}/progress, gets remaining chunks and fresh presigned URLs, resumes. Data loss: None.
PostgreSQL Citus Node Failure
Impact: One shard unavailable. Affected users can't sync; others unaffected. Recovery: Automatic failover to standby (10-30 seconds via Patroni). Data loss: Possible loss of ~1 second of metadata writes with async replication. Chunks in object storage are safe. Synchronous replication eliminates this at the cost of latency.
Redis Cluster Failure
Impact: Bloom filter unavailable, upload sessions lost, connection registry gone. Recovery: Bloom filter rebuilt from chunks table (30-60 min). Sessions reconstructed via ListParts. Connection registry rebuilds organically on reconnect. Data loss: None. Redis stores only derived/ephemeral data.
WebSocket Gateway Crash
Impact: Up to 500K connections drop. Recovery: Clients reconnect with exponential backoff (1s, 2s, 4s). On reconnect, SUBSCRIBE with last cursor replays missed events. Data loss: None. All changes persisted in PostgreSQL before WebSocket notification sent.
Kafka Broker Failure
Impact: Sync event delivery delayed for affected partitions. Recovery: New leader elected (5-15 seconds). Producers retry. Consumers resume from last committed offset. Data loss: None with acks=all and min.insync.replicas=2.
Client Crash During Chunking
Impact: Partially computed hashes, partially sent request. Recovery: Client restarts, finds file marked pending_upload in SQLite, re-runs CDC (deterministic, same result). Resumes existing upload or starts new one. Data loss: None.
Concurrent Conflicting Edits
Impact: Two devices edited same file. Recovery: Last-writer-wins + conflicted copy. Both edits preserved. Data loss: None.
Chunk Reference Count Corruption
Impact: Premature deletion (ref_count too low) or storage leak (ref_count too high). Recovery: 24-hour GC grace period prevents premature deletion. Weekly reconciliation job corrects discrepancies. Even if prematurely deleted, same content re-uploaded produces same hash — chunk recreated. Data loss: Extremely unlikely.
Presigned URL Expiry During Slow Upload
Impact: Upload fails with 403. Recovery: Client calls GET /upload/{id}/progress for fresh URLs. Multipart upload still valid (7 days). Already-uploaded chunks preserved. Data loss: None.
Network Partition Between Client and Server
Impact: Client works locally, can't sync. Recovery: Changes accumulate in local sync queue. On reconnect, cursor-based sync catches up. Conflicts resolved if applicable. Data loss: None. Eventually consistent.
Data Corruption (Bit Rot)
Impact: Chunk content doesn't match SHA-256 on download. Recovery: Client detects mismatch, reports to server. Server fetches from another replica or region. If all copies corrupt, user's local device may still have the chunk for re-upload. Data loss: Astronomically rare (11 nines durability). SHA-256 verification ensures detection.
When Data Is Actually Lost
Data is truly irrecoverable only when ALL of:
- The object in the store is deleted or corrupted (11 nines against this), AND
- No other user has the same chunk (dedup means popular content has many references), AND
- The user's local copy is also gone (disk failure, device stolen)
The system provides defense in depth: object storage durability, content addressing, local device copies, and version history. True data loss requires a catastrophic combination of failures that is, for practical purposes, impossible.
Deployment and Operations
Multi-Region Deployment
Three regions for global coverage, each with a full stack:
| Region | Location | Primary User Base | Pod Counts |
|---|---|---|---|
| US-East-1 | Virginia | North/South America | Sync: 40, Upload: 20, WS: 30, Dedup: 10 |
| EU-West-1 | Ireland | Europe, Africa | Sync: 30, Upload: 15, WS: 25, Dedup: 8 |
| AP-Southeast-1 | Singapore | Asia-Pacific | Sync: 20, Upload: 10, WS: 25, Dedup: 6 |
Object storage buckets are region-local with cross-region replication. PostgreSQL uses logical replication for cross-region read replicas. Writes route to the user's home region (assigned at signup based on geographic IP).
Canary Deployment
- Deploy to 1 canary pod in US-East-1
- Route 5% of traffic for 15 minutes
- Monitor: upload success rate (>99.5%), sync latency p99 (<3s), error rate (<0.5%), dedup hit rate (stable)
- Roll to 25%, 50%, 100% with 10-minute holds
- Automated rollback if thresholds breached
- Full rollout across three regions: ~2 hours
Observability
Key metrics:
# Upload pipeline
upload_success_rate # Gauge: successful / total attempts
upload_duration_seconds # Histogram: init to complete
presigned_url_generation_seconds # Histogram: URL generation latency
# Sync pipeline
sync_propagation_latency_seconds # Histogram: upload complete to WS notification
websocket_connections_active # Gauge: current connections
# Dedup
dedup_hit_rate # Gauge: chunks skipped / total checked
bloom_filter_false_positive_rate # Gauge
# Infrastructure
kafka_consumer_lag # Gauge: messages behind per partition
pg_replication_lag_bytes # Gauge: standby lag
redis_memory_used_bytes # Gauge
Critical alerts:
| Alert | Condition | Severity |
|---|---|---|
| Upload success rate drop | < 99% for 5 min | P1 |
| Sync latency spike | p99 > 5s for 5 min | P1 |
| Object storage errors | 5xx > 1% for 5 min | P1 |
| Kafka consumer lag | > 100K messages for 10 min | P2 |
| PostgreSQL replication lag | > 10MB for 5 min | P2 |
| WebSocket connection drop | > 10% in 1 min | P2 |
| Dedup ratio anomaly | hit rate drops > 20% | P3 |
Distributed tracing: Every file operation carries a trace ID through HTTP headers, Kafka message headers, and WebSocket fields. When a user reports "my file isn't syncing," support looks up the trace and sees exactly where the pipeline stalled.
trace-abc123:
/upload/init 150ms
dedup.bloom_check 2ms
dedup.pg_verify 15ms
url.generate 5ms
storage.put_chunk 800ms [client-side]
/upload/complete 200ms
pg.create_version 20ms
kafka.publish 5ms
kafka.consume 2ms
ws.push 1ms
Security
Transport: TLS 1.3 for all client-server communication. mTLS between internal services (service mesh). WebSocket uses wss://. LAN sync uses mTLS with device certificates.
Authentication: OAuth 2.0 with JWT access tokens (15-minute expiry), refresh tokens stored securely on client. Device certificates for LAN sync.
Authorization: Per-file ACL in PostgreSQL. Three levels: owner, editor, viewer. Shared folders inherit permissions. Presigned URLs scoped to specific keys, 15-minute expiry.
Encryption at rest: Server-side encryption with KMS. Per-customer KMS keys available for enterprise. PostgreSQL encrypted via volume-level encryption.
Zero-knowledge option: Client encrypts before chunking using a user-held key. Server never sees plaintext. Trade-off: breaks cross-user dedup (identical files produce different ciphertext). Offered as opt-in for enterprise.
Malware scanning: Object storage event triggers serverless function on each new chunk. ClamAV or commercial scanner. Infected files quarantined. Async, doesn't block upload.
Rate limiting: Per-user: 1000 API calls/minute, 100 uploads/minute, 50GB/day upload. Per-IP: 5000 API calls/minute. Redis counters with sliding window at the API Gateway.
Audit logging: All file operations logged to immutable Kafka topic, archived to object storage. Enterprise admins query for compliance (SOC 2, GDPR).
Beyond This Design: Real-World Evolution
This design is the right starting point for an interview and for initial production. Real systems evolve beyond it as scale grows.
Metadata growth becomes the dominant challenge. At tens of billions of files and hundreds of billions of version-chunk mappings, relational databases struggle even with sharding. Production systems evolve toward custom distributed metadata services — purpose-built key-value stores optimized for the specific access patterns of file sync (user-scoped tree lookups, cursor-based change feeds, reference counting).
Joins disappear entirely from hot paths. The denormalized storage_key in version_chunks is a first step. Mature systems pre-compute every relationship needed for reads: materialized views for folder listings, denormalized stores for version histories, pre-joined chunk manifests for downloads. The write path becomes more complex (maintaining denormalized views), but reads become simple key lookups.
Chunk metadata separates from file metadata. The global chunks table (sharded by hash, queried during dedup) operates at a different scale and access pattern than user-scoped file metadata. Production systems split these into separate storage systems: a dedicated chunk index (possibly a distributed hash table) for dedup, and a separate per-user metadata store for file trees.
Bloom filter management becomes a distributed systems problem. A single Bloom filter in Redis works at 12.5B chunks. At 100B+ chunks, the filter exceeds single-cluster memory. Solutions include partitioned Bloom filters (shard by hash prefix), counting Bloom filters (support deletion without full rebuilds), or replacing with a scalable probabilistic index.
Storage tiering becomes critical for cost. Not all chunks are equal. A chunk last referenced 6 months ago should not sit in the same storage tier as one accessed hourly. Production systems implement automatic lifecycle policies: hot tier (SSD-backed, frequently accessed), warm tier (standard object storage), cold tier (archival, rarely accessed). The transition must be transparent — a download request for a cold chunk triggers automatic retrieval with slightly higher latency.
Consistency models get more nuanced. This design uses eventual consistency for sync notifications and strong consistency for metadata writes. At global scale, even metadata writes face latency-consistency tradeoffs across regions. Systems evolve toward tunable consistency: strong within a region, eventually consistent across regions, with conflict resolution at the application layer.
The client becomes smarter. Early clients are thin — chunk, hash, upload. Mature clients predict which files will be needed (pre-fetch based on usage patterns), compress chunks before upload, prioritize sync order by file importance, and coordinate directly with LAN peers for bulk transfers without server involvement.
Explore the Technologies
Dive deeper into the technologies and infrastructure patterns used in this design:
Core Technologies
| Technology | Role in This Design | Learn More |
|---|---|---|
| PostgreSQL | File metadata, versions, ACLs, folder trees (sharded via Citus) | PostgreSQL |
| Redis | Chunk existence Bloom filter, upload sessions, WebSocket connection registry | Redis |
| Kafka | Sync event distribution, upload completion, dedup events | Kafka |
| Elasticsearch | File search by name, type, content metadata | Elasticsearch |
| Prometheus | Metrics for upload SLOs, sync latency, dedup ratio, connection counts | Prometheus |
| Grafana | Dashboards for sync monitoring, storage analytics, dedup tracking | Grafana |
Distributed Systems Concepts
| Concept | Role in This Design | Learn More |
|---|---|---|
| Bloom Filters | Probabilistic chunk existence check in Redis dedup layer | Bloom Filters |
| Merkle Trees | File integrity verification via Merkle root over chunk hashes | Merkle Trees |
| Write-Ahead Log | PostgreSQL WAL for crash recovery, Kafka durable log | Write-Ahead Log |
| Consistent Hashing | Foundation for content-addressable chunk storage addressing | Consistent Hashing |
Infrastructure Patterns
| Pattern | Relevance to This Design | Learn More |
|---|---|---|
| Object Storage and Data Lake | Object storage as content-addressable chunk store, lifecycle policies for version expiry | Object Storage |
| Message Queues and Event Streaming | Kafka for async sync event propagation, upload completion processing | Event Streaming |
| Database Sharding | Citus shards PostgreSQL by user_id, co-locating files/versions/chunks per user | Database Sharding |
| WebSocket and Real-Time Communication | Push-based sync notifications to 40M concurrent device connections | WebSocket and Real-Time |
| CDN and Edge Computing | CloudFront caching for frequently downloaded chunks, 60% hit rate | CDN and Edge Computing |
| Caching Strategies | Multi-tier: in-process LRU, Redis Bloom filter, PostgreSQL for chunk dedup | Caching Strategies |
| Rate Limiting and Throttling | Per-user upload limits, API rate limiting, bandwidth throttling | Rate Limiting |
| Replication and Consistency | Object storage cross-region replication, PostgreSQL streaming replication, Kafka RF=3 | Replication and Consistency |
| Kubernetes Architecture | Stateless services auto-scaled, WebSocket gateways with sticky sessions | Kubernetes Architecture |
Further Reading
- Streaming File Synchronization (Dropbox Engineering Blog) -- How Dropbox rebuilt their sync engine
- The Rabin Fingerprint (Michael O. Rabin, 1981) -- The rolling hash algorithm that makes CDC possible
- FastCDC: A Fast and Efficient Content-Defined Chunking Approach -- Gear-based CDC achieving 10x speedup over Rabin
- Multipart Upload Overview -- The resumable upload mechanism under the hood
- Citus: Distributed PostgreSQL -- How Citus shards PostgreSQL for horizontal scale
- TUS: Open Protocol for Resumable File Uploads -- Standard protocol for resumable uploads (alternative to presigned URLs)
The key insight: a file is an ordered list of chunk references. Unchanged content doesn't need re-uploading, and identical content doesn't need re-storing. Everything else — the sync protocol, the presigned URLs, the Merkle trees, the conflict resolution — is machinery to keep that list consistent, efficient, and durable across devices, users, and failures.
Delta sync (the rsync idea, realized here as chunk-list diffs) solves bandwidth efficiency. CDC + dedup solve storage efficiency. Distributed metadata solves scalability. Together, they make petabyte-scale file sync possible.