Design Dropbox
Design a cloud file storage and sync service like Dropbox. Users can upload files from one device and access them from any other device. Support file versioning, sharing, and efficient sync.
Key Topics
Interview Cheat Sheet
60s skim · 3min careful readThree planes (control for metadata, data for bytes via presigned URLs, event for sync fan-out). Content-defined chunking with a Rabin fingerprint plus SHA-256 gives both delta sync and cross-user dedup, saving about 50% of storage. WebSocket plus cursor-based catch-up handles online and offline devices. Last-writer-wins with a conflicted-copy file preserves both edits on a race. The biggest operational risk is chunk reference-count drift, mitigated by a 24-hour quarantine.
- Upload · ~4MB chunks · bytes skip the API fleet
The client splits the file into content-defined chunks, hashes each one, and asks the server which ones it already has. Only the new chunks get uploaded, and they go straight to object storage using presigned URLs. The API never touches the bytes.
Client splits file (Rabin 48-byte window finds boundaries, SHA-256 per chunk)/upload/init with hash listDedup Service checks Redis Bloom filter (>99% hit)PostgreSQL confirms existenceserver returns presigned S3 URLs only for missing chunksclient PUTs directly to S3 (~100GB/sec peak path)/upload/finalize updates metadata (fileversionchunks) - Sync · WebSocket push · cursor catch-up on reconnect
Online devices get pushed updates over WebSocket within seconds of an edit. Devices that were offline reopen and ask 'give me everything since cursor X', and the server pages back the missed changes. Both paths together make sure nothing gets dropped.
Edit on device Ametadata update committedKafka eventfan-out to 40M concurrent WebSockets (500K per pod, 80 pods)device B receives changedevice B fetches only changed chunks from S3offline device on reconnect: read last cursor from local SQLite/v1/sync/changes?cursor=Xpaginated catch-up from server - Conflict · last-writer-wins + conflicted-copy file
If two devices edit the same file before syncing, the second upload doesn't overwrite the first. The server keeps both versions and creates a 'conflicted copy' file so the user sees both. Data loss is worse than a notification.
Phone uploads with base_version=5, server current=5accepted as v6laptop uploads with base_version=5 but current is now 6conflict detectedlaptop's edit accepted as v7server writes second file 'budget (conflicted copy - Phone - 2026-03-10).xlsx' from v6both devices see both files
- •500M users, 50M DAU, 10B files, 100PB total, 1M uploads/min peak
- •2MB average file, 4MB target chunk (min 2MB, max 8MB)
- •Sync budget: under 1s for metadata, under 10s for a small file end-to-end
- •Bytes path peak: 100GB/sec direct to object store, never through API fleet
- •WebSockets: 40M concurrent at 30% online, 500K per pod, 80 pods, memory-bound at 15KB/conn
- •Metadata: PostgreSQL + Citus sharded by user_id, 27TB across 50+ workers
- •12.5B unique chunks, 24-hour quarantine before GC, weekly reconciliation
- •Content-defined chunking (Rabin 48-byte window), not fixed-size (insertion shifts all chunks), not full-file (100MB upload for 1KB edit)
- •Two-hash: Rabin finds boundaries fast, SHA-256 confirms identity (metal detector then gemologist)
- •Presigned URLs so file bytes go client to S3 directly, API fleet only sees hashes
- •PostgreSQL + Citus user_id sharding, co-locates files, versions, version_chunks on one shard
- •WebSocket push plus cursor-based catch-up, not one or the other
- •Last-writer-wins plus conflicted-copy file, not reject-the-laptop (data loss worse than a notification)
- •Dedup-vs-E2EE tension: server sees plaintext hashes, that's what enables 50% storage savings
- •Object-store 3,500 PUT/sec per-prefix limit, hash chunk IDs into 65,536 prefixes naturally
- •Chunk ref-count drift is the tier-1 operational risk, 24h quarantine plus weekly reconciliation
- Content-defined chunking (not fixed-size, not full-file)
Fixed-size chunking shifts every boundary on a mid-file insert, so a 1KB edit in a 20MB file rehashes every chunk. Content-defined chunking with a 48-byte Rabin window places boundaries by content, so the same edit only changes the chunk that contains it. 50x bandwidth saving on the most common edit pattern (mid-document changes). The two-hash trick: Rabin finds boundaries fast (the metal detector), SHA-256 confirms identity (the gemologist).
- Why bytes skip the API fleet
Peak ingress is 100GB/sec. Routing bytes through the API tier would need thousands of proxy nodes just to forward data. Presigned S3 URLs let the client PUT directly to object storage, so the server fleet only sees metadata calls (init, finalize) and stays bandwidth-light. About 99% of bytes never cross the API tier. Pairs with hashing chunk IDs into 65,536 prefixes so the 3,500-PUT/sec-per-prefix object-store limit is never hit (per-prefix load drops to <1 PUT/sec).
- Chunk ref-count drift (the tier-1 operational risk)
Every chunk has a ref_count that goes up on new versions using it and down on version expiry. Premature decrement is the killer bug because deleting a chunk that's still referenced loses user data. Mitigations: 24-hour quarantine before actual deletion plus weekly reconciliation that scans for discrepancies. Even if we delete prematurely, the same content rehashes to the same key so the chunk reappears on re-upload. Defense in depth.
- PostgreSQL + Citus sharded by user_id
Almost every metadata query is user-scoped: list this user's files, show this file's versions, fetch this version's chunks. Citus sharding by user_id co-locates files, versions, and version_chunks on the same shard, so multi-row joins stay local. ACID matters because metadata loss means unrecoverable files. DynamoDB's per-item model would force application-side joins; Cassandra would force giving up the ACID story for metadata that must not be lossy.
40-Minute Interview Playbook
Each phase is what the interviewer expects you to do and say. Concrete steps, not topic hints. The diagrams are what you sketch on the board.
- 15 min
Clarify Requirements and Scale
GoalPin the three problems (transfer, sync, storage), get the scale numbers on the board, and lock in the sync-latency budget so every choice afterward has a target.
Do & Say- ASK·1ASK: There are three problems hiding inside file sync: Transfer efficiency, multi-device sync, storage efficiency. Can I walk through each before I draw anything? Get the nod, then proceed.
- WRITE·2Write the scale numbers: 500M users, 50M DAU, 10B files, 100PB total, 1M uploads/minute peak, 2MB avg file size. Say: The 2MB average matters because it means most files are small enough that chunking metadata becomes a bigger problem than chunking math.
- SAY·3Lock the sync latency budget: under 1 second for metadata propagation, under 10 seconds for a small file to land on a second device, files up to 50GB, version retention 180 days or 100 versions. Wait for confirmation.
- SAY·4Get the durability story right: 11 nines on object storage, 99.99% on the service. The asymmetry matters because the architecture will treat the object store as the durable substrate and PostgreSQL as the metadata index, not the other way around.
- SAY·5Park out of scope. In scope: chunking, dedup, the sync protocol, conflict resolution, the upload pipeline. Defer: sharing permissions, team workspaces, search to the end if there's time.
Interviewer is grading: You frame the three problems before drawing anything. You quote the 2MB average and use it as a justification later, not just a number you wrote down.
- 210 min
Chunking and Dedup
GoalPut three chunking strategies on the board, kill two with concrete failure modes, defend CDC with the insertion example.
Draw on the boardDo & Say- SAY·1Put three approaches on the board: full-file upload, fixed-size chunking, content-defined chunking. Cross them off as you go.
- WATCH·2Kill full-file fast: A 100MB spreadsheet with a 1KB edit becomes a 100MB upload. At 1M uploads/minute that's bandwidth suicide. Only viable below maybe 1MB. Cross it off.
- WATCH·3Kill fixed chunking with the insertion example: 20MB file into five 4MB chunks. Insert 1KB at byte 100, every byte after shifts so every chunk hash changes. Fixed chunking degrades to full-file upload for middle-of-document edits. Cross it off.
- SAY·4Defend CDC: Slide a 48-byte window with Rabin fingerprint. Boundary when hash mod TARGET equals zero, TARGET tuned for 4MB average. Min 2MB, max 8MB to bound pathological cases. Boundaries are content-defined, so the 1KB edit only changes its own chunk.
- DRAW·5Sketch the two-hash idea: Rabin is cheap, O(1) per byte position, but it has collisions. SHA-256 is expensive but content-addressable. Rabin finds candidate boundaries and matches, SHA-256 confirms identity. Think metal detector versus gemologist, you'd never use either alone.
- SAY·6Connect chunking to dedup: Because SHA-256 is deterministic, two users uploading the same file produce identical chunk hashes. Server stores one copy and 500 pointers. The 500-person company onboarding example: a 50MB handbook becomes 50MB total, not 25GB.
Interviewer is grading: You volunteer the insertion failure mode for fixed chunking without prompting. You name the 48-byte window and Rabin fingerprint as the specific weak hash, not 'a rolling hash'.
- 310 min
High-Level Design (the three planes)
GoalDraw control, data, event planes separately. Make presigned URLs and the metadata write the two load-bearing decisions, label arrows with QPS or bandwidth.
Draw on the boardDo & Say- DRAW·1Draw three planes in three lanes: control (metadata, orchestration), data (chunks, metadata storage), event (Kafka for sync fan-out). Say: The whole point of this split is that file bytes never touch the API server fleet. They go straight to object storage via presigned URLs.
- SAY·2Walk the upload critical path: Client computes chunk hashes locally, calls /upload/init with the hash list only. Dedup Service checks each hash, Bloom filter first then PostgreSQL to verify. Server returns presigned URLs only for chunks that don't exist yet. Label that arrow hashes, not bytes.
- SAY·3Justify presigned URLs in one breath: At 100GB/sec of upload ingress at peak, routing bytes through the API fleet would need thousands of proxy nodes just to forward data. Presigned URLs eliminate that. The byte path is client to object store, server is only in the metadata path.
- DRAW·4Sketch the sync fan-out: On finalize, Sync Service writes the version to PostgreSQL and publishes to Kafka. Partitioned by user_id_hash mod 1000 so all events for one user stay ordered. WebSocket Gateway consumes, looks up the device's pod in Redis, pushes CHANGE.
- SAY·5Name the WebSocket sizing: 40M concurrent connections at 30% online concurrency, 500K per pod max, 80 WebSocket pods. Memory-bound at 15KB per connection, not CPU-bound. Write 40M / 80 pods on the diagram.
- SAY·6Name the metadata sharding: PostgreSQL + Citus, sharded by user_id. Almost every query is user-scoped, so Citus co-locates files, versions, version_chunks on the same shard. Cross-shard joins basically don't happen. 27TB total across 50+ worker nodes.
Interviewer is grading: You name 'bytes never touch the API server' as the load-bearing decision. You quote the 80 WebSocket pods and the user_id sharding, not generic 'partition for scale'.
- 410 min
Deep Dive: Sync Protocol and Conflict Resolution
GoalTwo sub-dives. The dual-channel sync (WebSocket push + cursor catch-up), and last-writer-wins-with-conflicted-copy for concurrent edits.
Draw on the boardDo & Say- SAY·1Sub-dive 1, the sync protocol. Frame it: A device that has been online the whole time can rely on WebSocket pushes. A device that was offline for two weeks can't. So we need two channels.
- SAY·2WebSocket path: Kafka delivers to WebSocket Gateway, which pushes CHANGE to the user's other online devices within seconds. Device fetches the new version's chunk list and downloads only chunks it doesn't have locally. A 2-chunk edit on a 100MB file is 8MB, not 100MB.
- SAY·3Cursor-based catch-up: A device that comes back online reads its last cursor from local SQLite, calls /v1/sync/changes with that cursor, server returns paginated changes since then. Cursor encodes timestamp plus shard identifier. WebSocket gives speed and cursors give correctness, so no change is ever missed.
- SAY·4Sub-dive 2, conflict resolution. Frame it with the concrete example: Alice has budget.xlsx synced. Laptop goes offline. Both laptop and phone edit. Phone uploads first, server creates v6. Laptop comes back, tries to upload with base_version=5 but current is 6. Walk through it.
- SAY·5Pick last-writer-wins-with-conflicted-copy: Server accepts the laptop's edit as v7, then creates a new file called "budget (conflicted copy - Phone - 2026-03-10).xlsx" from the v6 content. Both devices get notified. Alice sees both files and decides what to do.
- SAY·6Defend the choice: For a general-purpose sync platform that handles binary files like images, videos, compiled documents, there is no universal merge algorithm. Last-writer-wins plus conflicted copy is the safest. Data loss is worse than a conflict notification. Quote the rate: Less than 0.01% of file operations in practice.
- SAY·7Name the cycle-prevention mechanic: Each upload carries a base_version that pins the parent. That kills sync loops where device A syncs to B, B syncs back to A, repeat forever. A device only accepts a version with a strictly newer base_version than its local copy.
Interviewer is grading: You volunteer the dual-channel architecture (WebSocket plus cursor) before being asked about offline support. You can recite the conflicted-copy filename format and explain why it's better than rejecting the laptop's write.
- 55 min
Trade-offs, Bottlenecks, and Wrap-up
GoalName the deliberate trade-offs (delta granularity, sync latency, dedup vs encryption), give the GC plan for chunk reference counting, close with one sentence.
Do & Say- SAY·1Trade-off 1, CDC over rsync as the primary model: CDC gives cross-user dedup since hash identity is global. rsync only compares old vs new of the same file. Production uses CDC primary, rsync-style delta when applicable. 50%+ storage savings from cross-user dedup is the bigger win at this scale.
- SAY·2Trade-off 2, dedup versus end-to-end encryption: Cross-user dedup requires the server to see plaintext chunk hashes. True end-to-end encryption with per-user keys would kill that, because the same plaintext becomes different ciphertext for each user. The design accepts server-side hashing, which is what allows the 50% storage savings.
- SAY·3Bottleneck: Chunk reference counting. 12.5B chunks, ref_count up on new versions, down on expiry. Premature decrement deletes a referenced chunk. Mitigation: 24-hour quarantine plus weekly reconciliation scanning for discrepancies. Even on premature delete, re-upload of the same content rehashes to the same key, chunk reappears.
- SAY·4Bottleneck: Object storage rate limits. Most providers cap at 3,500 PUTs per second per prefix. At 25K PUTs/sec we need ~256 prefixes. We hash the chunk ID into a prefix structure like chunks/a1/b2/a1b2c3... so writes spread across 65,536 prefixes naturally. Per-prefix load drops to under 1 PUT/sec.
- SAY·5Close: Three planes. Control for metadata, data via presigned URLs so bytes skip the server, event for fan-out via Kafka and WebSockets. CDC plus SHA-256 gives delta sync and cross-user dedup. LWW plus conflicted copy preserves both edits. Biggest risk is chunk ref-count drift, mitigated by 24h quarantine.
Interviewer is grading: You volunteer the dedup-versus-E2EE tension without being asked. You quote the 24-hour quarantine and the 3,500 PUT-per-prefix limit, not generic 'we batch writes'.
Interview Grading by Level
What an interviewer at each level expects to see in your answer. Use this to calibrate, not to perform.
Mid-Level Engineer (L4 / SDE-II)
Lands on chunking and an object-store-plus-metadata-DB split, but the specifics (chunking strategy, dedup mechanics, sync channel) stay vague.
- Picks object storage for chunks and a relational DB for metadata, can explain why files don't live in PostgreSQL.
- Knows about chunking as 'upload only what changed' and quotes a chunk size around 4MB.
- Adds a sync channel via WebSocket or long polling and knows polling at scale is bad.
- Names SHA-256 as the chunk identifier without prompting.
- Picks last-writer-wins for conflicts when asked.
- Picks fixed-size chunking and doesn't see the insertion problem until pushed.
- Doesn't distinguish weak (Rabin) versus strong (SHA-256) hashes, treats them as one thing.
- Has no answer for offline sync beyond 'reconnect and sync'. Cursor-based catch-up doesn't come up.
- Says 'use presigned URLs' when asked about bandwidth but doesn't explain why (server can't proxy 100GB/sec).
- Conflict handling stops at 'last-writer-wins' without the conflicted-copy mechanism, so data loss risk stays on the table.
Senior Engineer (L5 / SDE-III)
Drives the design end-to-end, picks CDC with reasons, quotes scale numbers, and separates the byte path from the metadata path deliberately.
- Writes the scale (500M users, 10B files, 100PB total, 1M uploads/min, 50% dedup) on the board in the first 5 minutes.
- Picks CDC with Rabin fingerprint and 48-byte window, can explain the insertion failure of fixed chunking with a concrete example.
- Names presigned URLs as the load-bearing decision because the API fleet can't proxy 100GB/sec.
- Stacks the dedup pipeline: local diff against SQLite, Bloom filter in Redis, then PostgreSQL verification. Quotes 99%+ Bloom filter hit rate.
- Picks PostgreSQL + Citus with user_id sharding and explains co-location of files, versions, version_chunks on the same shard.
- Stacks WebSocket plus cursor-based catch-up explicitly, names the dual-channel reason.
- Picks last-writer-wins with a conflicted-copy file, names the filename convention and the 0.01% conflict rate.
- Mentions chunk reference counting but doesn't bring up the 24-hour quarantine or the reconciliation job until asked.
- Doesn't volunteer the dedup-versus-E2EE tension. Treats encryption as out of scope.
- Quotes the WebSocket numbers (40M conns) but not the per-pod sizing (500K per pod, 80 pods, memory-bound).
Staff+ Engineer (L6+)
Names the operational risks (ref-count drift, prefix hot-spotting, WebSocket pod sizing) before being asked, and ties every trade-off to a measurable SLO or storage cost.
- Volunteers the dedup-versus-end-to-end-encryption tension and names what the design accepts as a trade ('server-side hashing in exchange for 50% storage savings, which we'd revisit if regulatory requirements changed').
- Brings up object-storage prefix hot-spotting (3,500 PUT/sec per prefix) before being asked and prescribes hash-based prefix structure for 65,536 prefixes.
- Names the chunk reference-count drift as the load-bearing operational risk, prescribes 24-hour quarantine plus weekly reconciliation, and explains why prematurely deleted chunks are recoverable (same content rehashes to same key).
- Defines SLA tiers: metadata writes are tier-1, sync notifications are tier-2, search indexing is tier-3, and uses the hierarchy to justify the Kafka decoupling.
- Brings up the FS watcher event-storm problem (extracting a ZIP with 10K files generates 10K events) and prescribes directory-level debounce of 500ms with 100 files/sec rate limit.
- Names the WebSocket-pod failure mode: 500K connections dropping at once when one pod crashes, mitigated by exponential backoff reconnect (1s, 2s, 4s) and cursor replay on resubscribe.
- Closes with the one-sentence summary plus offers deeper dives (rsync hybrid for transfers, LAN sync via mDNS, large-file streaming chunking) for remaining time.
Common Follow-up Questions
click to expandQuestions an interviewer is likely to ask after your walkthrough. Rehearse the short answer.