Design Google Docs
Design a real-time collaborative document editor like Google Docs. Multiple users can edit the same document simultaneously, see each other's cursors, and all changes are synced in real-time without conflicts.
Key Topics
Interview Cheat Sheet
60s skim · 3min careful read
Client-authoritative collaborative editor where the Yjs CRDT lives on the device and the server is a relay, not a merge authority. Consistent hashing on document_id plus Redis Pub/Sub plus persistence-layer merge form a three-layer split-brain defense. The load-bearing decision is CRDT over OT, justified by offline-first behavior.
- Local edit · keystroke under 16ms · sync under 100ms
Typing updates the document in the browser right away and saves to IndexedDB, so nothing waits for the server. The change is then sent as a tiny binary message over WebSocket. The server stores it and forwards it to the other people in the doc. It does not try to merge anything; the client does the merging.
Keystroke → TipTap (ProseMirror) → Yjs document apply → IndexedDB persist → WebSocket binary update (~30 bytes/char, run-length encoded) → L7 LB consistent-hash on document_id → Hocuspocus instance → append to PostgreSQL op log + Redis hot cache → broadcast to other clients in room → @hocuspocus/extension-redis Pub/Sub on hocuspocus:document_id (cross-instance during failover)
- Offline reconnect · O(m log n) state vector diff
When someone comes back online after editing offline, the client sends a small summary of what it already has. The server compares it to its own version and sends back only the missing edits, no matter how many were made offline.
Client offline (buffers ops in IndexedDB, Yjs assigns stable IDs locally) → reconnect → send state vector (clock per client, ~few hundred bytes) → server diff → server ships only missing Yjs updates → CRDT merges locally by item ID (originLeft/originRight pointers, not integer positions) → IndexedDB checkpoint
- Cursor presence · ephemeral, never persisted
Each cursor is tied to a specific character ID, not a position number, so it stays in the right spot even when other people add or delete text around it. Cursor updates ride the same connection as edits, limited to 20 per second per person, and disappear when someone closes the tab.
Selection change in ProseMirror → y-prosemirror converts to Yjs relative position (anchored to item ID) → 50ms debounce → ~200-byte awareness update (anchor, head, user color, status) → Hocuspocus broadcasts on same WebSocket (message type byte = 1) → peers update remote cursor in their local Yjs awareness map → disconnect or 90s heartbeat timeout drops the entry
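The cursor-anchoring idea above can be illustrated without any library. The following is a minimal sketch, not the Yjs or y-prosemirror implementation: the `Item` class, the `(clock, client)` ID shape, and `resolve_cursor` are hypothetical names chosen for illustration. It shows why a cursor tied to an item ID survives a concurrent insert while a stored integer offset does not.

```python
from dataclasses import dataclass
from typing import List, Tuple

ItemId = Tuple[int, str]  # (clock, client): an assumed ID shape for illustration


@dataclass
class Item:
    item_id: ItemId
    char: str
    deleted: bool = False  # tombstones stay in the list so anchors keep resolving


def resolve_cursor(items: List[Item], anchor: ItemId) -> int:
    """Turn an ID-anchored cursor into the integer offset the editor renders."""
    offset = 0
    for item in items:
        if item.item_id == anchor:
            return offset
        if not item.deleted:
            offset += 1
    raise KeyError(f"unknown anchor {anchor}")


def visible_text(items: List[Item]) -> str:
    return "".join(i.char for i in items if not i.deleted)


doc = [Item((clock, "alice"), ch) for clock, ch in enumerate("HELLO")]
anchor = (3, "alice")  # cursor sits just before the second "L"
offset_before = resolve_cursor(doc, anchor)

# Bob concurrently inserts "XY" at the start of the document.
doc[0:0] = [Item((0, "bob"), "X"), Item((1, "bob"), "Y")]
offset_after = resolve_cursor(doc, anchor)

# A stale integer offset now points at the wrong character.
stale_char = visible_text(doc)[offset_before]
anchored_char = visible_text(doc)[offset_after]
```

Here the anchored cursor moves from offset 3 to offset 5 and still sits before the second "L", whereas the stale offset 3 now lands on "E".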
- 1B docs, 10M open simultaneously, 30M live WebSockets, 100 editors max per doc
- Latency: under 16ms keystroke (60fps frame), under 100ms same-region sync, under 300ms cross-region
- Yjs metadata overhead 1.5x to 3x plain text with binary RLE encoding
- Awareness updates ~200 bytes, 50ms debounce, ~400KB/s presence on a 100-editor doc
- Snapshot cadence: every 500 ops or 5 minutes, whichever first
- Storage totals: 10TB content, 100TB op log, 50TB snapshots, 300TB media, ~460TB
- Fleet: 600 WebSocket gateways at 50K conns each, 1,000 Hocuspocus instances at 10K docs each
- Document hard cap 1.5M characters (~500 pages); WebSocket proxy block rate 1-3%
- Yjs CRDT, not OT (Google Docs uses OT, we picked CRDT for offline-first)
- Hocuspocus as relay, not merge authority (client owns the merge)
- Consistent hashing on document_id at the LB (primary split-brain defense)
- Redis Pub/Sub via @hocuspocus/extension-redis for cross-instance relay
- PostgreSQL op log plus periodic snapshots, not Kafka or pure event store
- Yjs relative positions for cursors, not integer offsets
- CRDT over OT is the load-bearing decision because offline reconnect is O(m log n) not O(local × missed)
- Tombstone GC only runs when every connected client has integrated the deletion
- WebSocket proxy block (1-3%) needs a long-polling fallback with a degraded-mode banner
- CRDT (Yjs) over OT — the load-bearing decision
Both algorithms work. OT forces an O(local × missed) transformation matrix on reconnect for a 500-op offline buffer, and three-user transform-correctness bugs are notorious (Google Docs has shipped fixes for years). Yjs gives mathematical convergence and an O(m log n) state vector diff. We accept Yjs's 1.5-3x metadata overhead (originLeft/originRight per item, run-length-encoded binary) because offline-first is a hard requirement, not a nice-to-have.
- Three-layer split-brain defense
During a Hocuspocus failover or rolling deploy, two instances can briefly hold the same document. Layer 1: L7 consistent-hash on document_id routes every editor of one doc to the same instance in steady state. Layer 2: @hocuspocus/extension-redis relays updates between instances on the hocuspocus:document_id channel through the failover gap. Layer 3: CRDT commutativity plus persistence-layer merge on reconnect guarantees eventual correctness even if Pub/Sub drops a message.
- 100-editor hard cap (broadcast amplification)
100 concurrent editors is the hard limit, set by awareness broadcast, not Yjs merging. At 50ms debounce that is ~2,000 presence messages/sec per doc and ~400KB/s out per editor. Beyond 100 you would need to shard the document into sub-documents (block-based Notion architecture), which is a different design. We hard-cap docs at 1.5M characters (~500 pages) so a single Hocuspocus instance never holds more than ~50MB of Yjs state for one room.
- Tombstone GC and the 'connected client' constraint
Yjs keeps deleted items as tombstones so concurrent operations can still reference their IDs by originLeft/originRight. Tombstone GC only runs during snapshot creation (every 500 ops or 5 min, whichever first), and only when every currently connected client has integrated the deletion. Offline clients hold up GC until they reconnect and acknowledge. This is why GC tends to run in bursts once a long-offline user reconnects, rather than as a steady drumbeat.
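The broadcast-amplification numbers quoted for the 100-editor cap can be reproduced with a few lines of arithmetic. This is a rough worst-case sketch under two assumptions that are not stated in the text: every editor emits awareness updates at the debounce ceiling, and the relay sends one unbatched copy of each update to every other editor. The function name `presence_load` is hypothetical.

```python
def presence_load(editors: int, debounce_ms: int = 50, update_bytes: int = 200) -> dict:
    """Worst-case awareness traffic for one document (assumptions in the lead-in)."""
    per_editor_rate = 1000 // debounce_ms          # updates/sec one editor may send
    inbound_msgs = editors * per_editor_rate       # messages/sec arriving at the relay
    inbound_bytes = inbound_msgs * update_bytes    # bytes/sec arriving at the relay
    # Each editor receives the updates of every other editor.
    per_editor_down = (editors - 1) * per_editor_rate * update_bytes
    relay_egress = editors * per_editor_down       # grows with the square of editors
    return {
        "inbound_msgs_per_sec": inbound_msgs,
        "inbound_bytes_per_sec": inbound_bytes,
        "per_editor_down_bytes_per_sec": per_editor_down,
        "relay_egress_bytes_per_sec": relay_egress,
    }


load_100 = presence_load(100)
load_200 = presence_load(200)
```

At 100 editors this gives 2,000 messages/sec and 400,000 bytes/sec into the relay, about 396KB/s down to each editor, and roughly 40MB/s of total relay egress. Doubling to 200 editors roughly quadruples the egress, which is the amplification the cap is guarding against.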
40-Minute Interview Playbook
Each phase is what the interviewer expects you to do and say. Concrete steps, not topic hints. The diagrams are what you sketch on the board.
- 1 · 5 min
Clarify Requirements and Scale
Goal: Pin down the editor paradigm and the concurrency model before drawing anything. Most candidates skip this and end up redesigning halfway through.
Do & Say
- ASK·1: Rich text like Google Docs, or block-based like Notion, or canvas like Figma? They need different CRDT data models. Park canvas/block, lock rich text. Write rich text, ProseMirror tree on the board.
- SAY·2: Pin the scale: 10M documents open simultaneously, 3 concurrent editors average, 100 peak per document, 1B total docs. That's 30M live WebSockets. Write WS=30M, peak 100/doc top-left.
- SAY·3: Pin the latency budget out loud: Local keystroke under 16ms (one 60fps frame), same-region sync under 100ms, cross-region under 300ms. Wait for the nod, write it down.
- SAY·4: State scope explicitly. In scope: real-time sync, presence/cursors, offline editing, version history, rich text formatting. Out of scope: comments threading, suggestions mode, export to PDF, AI assist. Pull any back if you want.
- SAY·5: Force the offline-editing question now: Should I design for users editing offline for hours on a plane and merging on reconnect? That decision drives OT vs CRDT. Get a yes, lock it in.
Interviewer is grading: Picks the rich-text paradigm deliberately, names the 16ms keystroke target unprompted, and forces the offline-editing requirement before any algorithm choice.
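The scale figures pinned in this phase, together with the per-box and storage numbers from the cheat sheet, can be checked with back-of-envelope arithmetic. The sketch below only restates the document's own inputs; the variable names are illustrative, and decimal units (1TB = 10^12 bytes) are an assumption.

```python
OPEN_DOCS = 10_000_000
AVG_EDITORS_PER_DOC = 3
CONNS_PER_GATEWAY = 50_000
DOCS_PER_SYNC_INSTANCE = 10_000

live_websockets = OPEN_DOCS * AVG_EDITORS_PER_DOC
# Ceiling division: a partially filled box still has to exist.
gateways = -(-live_websockets // CONNS_PER_GATEWAY)
sync_instances = -(-OPEN_DOCS // DOCS_PER_SYNC_INSTANCE)

TOTAL_DOCS = 1_000_000_000
AVG_DOC_BYTES = 10_000  # 10KB average document, decimal units assumed
content_tb = TOTAL_DOCS * AVG_DOC_BYTES / 1e12

# Op log, snapshot, and media sizes are taken from the text as given.
storage_tb = {"content": content_tb, "op_log": 100, "snapshots": 50, "media": 300}
total_tb = sum(storage_tb.values())
```

These inputs yield 30M live WebSockets, 600 gateways, 1,000 sync instances, and 460TB of storage, matching the numbers written on the board.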
- 2 · 8 min
Pick the Merge Algorithm (OT vs CRDT)
Goal: Two algorithms on the board. Walk through a concurrent edit with each. End with a defensible choice and a one-line reason.
Do & Say
- SAY·1: Write both names: Operational Transformation (OT) and CRDT (Yjs/YATA). Say: Google Docs uses OT, Figma and Notion use CRDTs. Both work. The choice is driven by offline behavior.
- SAY·2: Run the canonical example: doc is HELLO. Alice inserts X at pos 1, Bob deletes pos 3, concurrently. With OT, server transforms ops: Alice's insert stays at 1, Bob's delete shifts from 3 to 4. Server is the ordering authority.
- SAY·3: Same example with CRDT: every character has a stable ID like (clock, client). Alice's insert references originLeft=H_id, originRight=E_id. Bob's delete tombstones the L item but keeps its ID. Merge is order-independent because operations reference IDs, not integer positions.
- SAY·4: Say: Pick CRDT (Yjs) because offline sync is first-class. OT forces O(local × missed) transformation on reconnect for a 500-op buffer, and three-user transform-correctness bugs are notorious. CRDT gives mathematical convergence and O(m log n) state vector diff.
- WATCH·5: Get ahead of the metadata pushback: Yes, CRDTs carry per-character metadata, originLeft and originRight pointers. Naive CRDTs blow up to 16x plain text size. Yjs's optimized binary encoding with run-length encoding for consecutive edits brings it down to 1.5 to 3x. That's the trade-off I'm taking.
- SAY·6: Name the server's role explicitly: The server is a relay, not a merge authority. Hocuspocus persists ops and broadcasts them. The merge intelligence is on the client. This is the load-bearing sentence.
Interviewer is grading: You name OT and CRDT, you can walk the HELLO example through both, and the choice is justified by offline behavior, not vibes. You volunteer the metadata-overhead trade-off before being asked.
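The HELLO walkthrough from this phase can be run end to end. The sketch below is a toy model under stated assumptions: positions are 0-indexed, the OT rule covers only the insert-versus-delete case from the example, and the ID-based list omits the YATA tie-breaking that real Yjs uses when two inserts target the same gap. All function names and the `"seed"` client label are hypothetical.

```python
def transform_delete_against_insert(delete_pos: int, insert_pos: int) -> int:
    """OT rule: a concurrent insert at or before the target shifts the delete right."""
    return delete_pos + 1 if insert_pos <= delete_pos else delete_pos


# OT path: the server orders Alice's insert first, then applies Bob's transformed delete.
ot_text = "HELLO"
ot_text = ot_text[:1] + "X" + ot_text[1:]                   # Alice: insert X at 1
bob_pos = transform_delete_against_insert(3, insert_pos=1)   # Bob: 3 becomes 4
ot_text = ot_text[:bob_pos] + ot_text[bob_pos + 1:]


# CRDT path: operations name item IDs, so the order of application does not matter.
def fresh_doc():
    return [{"id": (i, "seed"), "char": c, "deleted": False} for i, c in enumerate("HELLO")]


def apply_insert(items, new_id, char, origin_left):
    idx = next(i for i, it in enumerate(items) if it["id"] == origin_left)
    items.insert(idx + 1, {"id": new_id, "char": char, "deleted": False})


def apply_delete(items, target_id):
    for it in items:
        if it["id"] == target_id:
            it["deleted"] = True  # tombstone: the item and its ID stay in the list


def text(items):
    return "".join(it["char"] for it in items if not it["deleted"])


alice_first = fresh_doc()
apply_insert(alice_first, (0, "alice"), "X", origin_left=(0, "seed"))  # after H
apply_delete(alice_first, (3, "seed"))                                # second L

bob_first = fresh_doc()
apply_delete(bob_first, (3, "seed"))
apply_insert(bob_first, (0, "alice"), "X", origin_left=(0, "seed"))

crdt_a, crdt_b = text(alice_first), text(bob_first)
```

Both CRDT orderings and the OT path end at the same text, but only the OT path needed a position rewrite that depends on knowing which operation came first.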
- 3 · 10 min
High-Level Design
Goal: One diagram. Client holds the full Yjs document, server is a relay, persistence is operation log plus periodic snapshots. Label the WebSocket and the Redis Pub/Sub hop.
Do & Say
- DRAW·1: Draw the client first and label it heavy. The client holds the full Yjs document in memory, persists to IndexedDB, and renders through TipTap on top of ProseMirror. Local edits commit instantly to IndexedDB. No server round-trip on the keystroke path.
- DRAW·2: Draw the WebSocket and label it binary Yjs update, ~30 bytes per character insert. Say: Custom binary protocol on WebSocket, not JSON. A run of 100 same-user characters encodes as one item, ~120 bytes, because of run-length encoding.
- DRAW·3: Draw the Hocuspocus tier and call out: Consistent hashing on document_id routes every editor of one doc to the same Hocuspocus instance. That's the primary defense against split-brain. The Yjs document lives in that instance's memory.
- DRAW·4: Draw the Redis Pub/Sub box and label the channel hocuspocus:document_id. Say: During failovers, scaling events, or rolling deploys, two editors can land on different instances. @hocuspocus/extension-redis relays updates between them through Pub/Sub. CRDT commutativity means out-of-order or duplicate delivery is harmless, so the relay doesn't need to be ordering-correct.
- DRAW·5: Draw persistence: Postgres holds op log and snapshots every 500 ops or 5 min, whichever first. S3 is cold storage. Redis caches hot doc state for sub-ms access. Cold load from Postgres snapshot plus replay takes 500ms-1s for large docs. The under-100ms target applies to steady-state, not cold opens.
- SAY·6: Storage math: 1B docs at 10KB avg is 10TB content, 100TB compacted op logs, 50TB snapshots, 300TB media. ~460TB total. WebSocket fleet is 30M connections at 50K per box, so ~600 gateway boxes. Sync tier 1,000 Hocuspocus instances at 10K active docs each.
Interviewer is grading: Client is drawn as heavy and authoritative. Server is drawn as a relay. The Redis Pub/Sub channel is named and its role is justified by failover, not steady state. You volunteer the cold-load latency carve-out.
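The consistent-hashing step from the diagram can be sketched as a small hash ring. This is an illustration of the general technique, not the load balancer's actual implementation, which the text does not specify: the `HashRing` class, the virtual-node count, the SHA-256 choice, and the `hp-N` instance names are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Minimal consistent-hash ring mapping document_id to a sync instance."""

    def __init__(self, nodes, vnodes=100):
        # Each node is placed at `vnodes` pseudo-random points to smooth the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")

    def node_for(self, document_id: str) -> str:
        # First ring point clockwise from the document's hash, wrapping at the end.
        idx = bisect.bisect_right(self._keys, self._hash(document_id)) % len(self._keys)
        return self._ring[idx][1]


ring = HashRing(["hp-0", "hp-1", "hp-2", "hp-3"])
docs = [f"doc-{i}" for i in range(1000)]
before = {d: ring.node_for(d) for d in docs}

# Simulate losing one instance: only its documents should be reassigned.
smaller = HashRing(["hp-0", "hp-1", "hp-2"])
moved = [d for d in docs if smaller.node_for(d) != before[d]]
```

Every editor of a given document hashes to the same instance, and removing `hp-3` remaps only the documents that lived on it. That remapping window is the gap the Redis Pub/Sub relay is meant to cover.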
- 4 · 12 min
Deep Dive: Presence, Offline, and Split-Brain
Goal: Three sub-dives that every interviewer asks about. Have a one-paragraph answer for each, with a concrete data structure.
Do & Say
- SAY·1: Presence sub-dive: Cursors use Yjs relative positions, not integer offsets. A relative position anchors to a CRDT item ID. When Bob inserts before Alice's cursor, Alice's cursor stays correct because it references item (155, alice), not position 5. Draw awareness payload with anchor, head, user color, status.
- SAY·2: Mention the throttle number: Awareness updates are debounced at 50ms, so up to 20 per second per user. At 100 editors on a hot doc, that's about 2,000 presence messages per second, around 400KB/s. Presence is actually heavier than edits in a busy room.
- SAY·3: Presence persistence: Never persisted. Not to Postgres, not to disk Redis, only through ephemeral Pub/Sub. Disconnected clients drop off after 90s heartbeat. If you want last-seen, that's a separate REST audit log, not the awareness protocol.
- SAY·4: Offline sub-dive: User offline on a plane, 500 ops over four hours, reconnects. With OT this is a brutal O(500 × concurrent) transform matrix. With Yjs the client sends its state vector, server diffs and ships only missing ops. Convergence is mathematical. Draw the state vector exchange.
- SAY·5: Split-brain sub-dive: Two Hocuspocus instances briefly hold the same doc during failover, each broadcasts to its clients, edits diverge for seconds. Three defenses: Consistent hashing prevents most cases, extension-redis catches scaling and failover gaps, CRDT convergence plus persist-layer merge guarantees eventual correctness.
- SAY·6: Doc-too-large pushback: A 1.5M-char doc with 100 editors hits ~50MB of Yjs state on the sync server. We cap docs at 1.5M chars (~500 pages) and GC tombstones during snapshot creation, but only when every connected client has integrated the deletion.
- WATCH·7: Be ready for the hot-document question: A 100-editor document is the hard limit. Beyond that, the broadcast amplification becomes the bottleneck, not Yjs merging. We'd need to shard the document into sub-documents, which is the block-based architecture, not what this design supports.
Interviewer is grading: You name relative positions for cursors and explain why integer offsets break. You volunteer the three-layer split-brain defense without prompting. You give concrete numbers for awareness throttling and document size limits.
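The state vector exchange from the offline sub-dive can be sketched in plain Python. This is a simplified model, not the Yjs wire format: updates are assumed to be `(client_id, clock, payload)` tuples with contiguous per-client clocks, and the function names are hypothetical. The diff here is a linear scan for clarity, so it does not demonstrate the O(m log n) bound quoted in the text.

```python
from typing import Dict, List, Tuple

Update = Tuple[str, int, str]  # (client_id, clock, payload): an assumed shape


def state_vector(updates: List[Update]) -> Dict[str, int]:
    """Next expected clock per client, i.e. a compact summary of what a peer holds."""
    sv: Dict[str, int] = {}
    for client, clock, _ in updates:
        sv[client] = max(sv.get(client, 0), clock + 1)
    return sv


def missing_for(peer_sv: Dict[str, int], log: List[Update]) -> List[Update]:
    """Updates in `log` that the peer described by `peer_sv` has not seen yet."""
    return [u for u in log if u[1] >= peer_sv.get(u[0], 0)]


# Server saw Alice's first 3 ops before she went offline, plus 5 ops from Bob.
server_log = [("alice", c, f"a{c}") for c in range(3)] + [("bob", c, f"b{c}") for c in range(5)]
# Alice made 2 more ops offline and only ever received Bob's first 2.
client_log = [("alice", c, f"a{c}") for c in range(5)] + [("bob", c, f"b{c}") for c in range(2)]

client_sv = state_vector(client_log)
to_client = missing_for(client_sv, server_log)                 # what the server ships
to_server = missing_for(state_vector(server_log), client_log)  # what the client uploads
```

The exchange is symmetric: each side sends only its summary, and each receives only the updates it lacks, regardless of how long the client was offline.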
- 5 · 5 min
Trade-offs and Wrap-up
Goal: Two deliberate trade-offs, one operational tail risk, one-sentence summary.
Do & Say
- SAY·1: Metadata trade-off: I'm taking 1.5 to 3x plain-text storage overhead for tombstones and origin pointers. The win is offline-first correctness. Periodic snapshot GC keeps tombstones from growing forever, but only when every connected client has seen the deletion.
- SAY·2: WebSocket trade-off: WebSocket through corporate proxies is blocked ~1-3% of the time in 2026. Ship a long-polling fallback for those users with 300-500ms sync and a compatibility-mode banner. Google Docs and Notion accept this rather than building a fully separate transport.
- SAY·3: Operational risk: The biggest tail risk is a Postgres write-amplification storm if the op log isn't batch-flushed. At 60M ops/sec steady state, single-row inserts would crater the primary. We batch inserts in 100ms windows and shard the op log across 100+ Citus nodes keyed by document_id.
- SAY·4: Close: Relay-server architecture. Client owns the merge via Yjs CRDT, server only persists and broadcasts, presence rides the same WebSocket. Consistent hashing plus Redis Pub/Sub plus persist-layer merge is the three-layer split-brain defense. Load-bearing decision is CRDT over OT, justified by offline-first.
- SAY·5: Offer deeper dives if there's time: Peritext for rich-text marks, the Hocuspocus persistence hooks, or how block-based editors like Notion differ from this single-document model.
Interviewer is grading: Trade-offs are named with a number attached ('1.5-3x overhead', '1-3% proxy block rate'). Closing summary names the load-bearing decision. Offers specific deeper-dive options, not generic 'we can talk about anything'.
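The 100ms batch-flush idea from the operational-risk point can be sketched as a small in-memory buffer. This is a hypothetical illustration, not Hocuspocus or Postgres code: the `OpBatcher` class is invented here, timestamps are passed in explicitly so the behavior is deterministic, and the actual multi-row insert into the sharded op log is left out.

```python
from collections import defaultdict


class OpBatcher:
    """Groups ops by document_id and releases them once per time window."""

    def __init__(self, window_ms: int = 100):
        self.window_ms = window_ms
        self._pending = defaultdict(list)
        self._window_start = None

    def add(self, now_ms: int, document_id: str, op: str):
        """Buffer one op. Returns the previous batch if its window has elapsed."""
        flushed = None
        if self._window_start is not None and now_ms - self._window_start >= self.window_ms:
            flushed = self.flush()
        if self._window_start is None:
            self._window_start = now_ms
        self._pending[document_id].append(op)
        return flushed

    def flush(self) -> dict:
        """Hand back everything buffered so far, keyed by document_id."""
        batch = dict(self._pending)
        self._pending = defaultdict(list)
        self._window_start = None
        return batch


batcher = OpBatcher(window_ms=100)
r1 = batcher.add(0, "doc-a", "op1")
r2 = batcher.add(40, "doc-a", "op2")
r3 = batcher.add(90, "doc-b", "op3")
r4 = batcher.add(120, "doc-a", "op4")  # window elapsed: first three ops come back
remaining = batcher.flush()
```

In a real deployment each returned batch would presumably become one multi-row insert per shard rather than one insert per op, which is the write-amplification saving the text is after. A production version would also need a timer so a quiet window still flushes.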
Interview Grading by Level
What an interviewer at each level expects to see in your answer. Use this to calibrate, not to perform.
Mid-Level Engineer (L4 / SDE-II)
Gets to a working architecture with WebSocket and a database, but the merge algorithm stays hand-wavy and the failure modes are not named.
- Identifies WebSocket as the right transport and explains why REST polling doesn't work for sub-second sync.
- Adds a queue or pub/sub for broadcasting edits to multiple clients on the same document.
- Knows last-writer-wins is wrong and proposes some kind of operation transform or merge logic, even if vague.
- Separates document storage from real-time sync state (Postgres for persistence, Redis or in-memory for hot state).
- Adds cursor presence on a separate channel or message type and throttles updates.
- Can't articulate the difference between OT and CRDT, or picks one without a reason.
- Doesn't address offline editing beyond 'sync on reconnect' with no merge algorithm.
- Treats the server as the merge authority for everything, including for users on the same instance.
- No story for split-brain when two sync servers hold the same document during a failover.
- Storage math is missing or wrong; doesn't size the WebSocket fleet or the in-memory document tier.
Senior Engineer (L5 / SDE-III)
Picks CRDT or OT with a reason, names the Yjs metadata overhead, designs offline sync correctly, and quantifies the hot-document broadcast limit.
- Walks through OT and CRDT on a concrete two-user example and picks CRDT because of offline-first requirements.
- Names Yjs binary encoding, originLeft/originRight pointers, and the 1.5-3x storage overhead explicitly.
- Designs the offline reconnect path using state vector diffing (O(m log n)) and contrasts it with OT's O(local × missed).
- Adds consistent hashing on document_id at the load balancer and explains why it's the primary defense against split-brain.
- Quantifies presence: 50ms debounce, ~200 bytes per awareness update, 100 editors saturates a doc at ~400KB/s presence traffic.
- Describes Yjs relative positions for cursors and why integer offsets break under concurrent edits.
- Names the snapshot-plus-op-log persistence model and the 500-ops-or-5-min snapshot cadence.
- Mentions split-brain risk only after the interviewer hints at it, not on their own.
- Doesn't address the 1.5M-character document hard limit or what happens at 100+ editors on one doc.
- Acknowledges proxy/firewall WebSocket blocking only as a fallback, without sizing the failure rate.
Staff+ Engineer (L6+)
Owns the room, drives every trade-off with numbers, surfaces the load-bearing decision (CRDT for offline) before being asked, and brings operational maturity for split-brain and persistence.
- Volunteers that CRDT over OT is the load-bearing decision and ties it explicitly to offline-first requirements; doesn't wait to be asked.
- Names the three-layer split-brain defense unprompted: consistent hashing, extension-redis relay, persistence-layer merge on reconnect.
- Quantifies the broadcast amplification limit on a hot document (100 editors at ~2,000 presence messages/sec, ~400KB/s) and explains why beyond that you'd need to shard the document into blocks.
- Describes the Yjs garbage collection cycle and the constraint that tombstones can only be collected when every connected client has integrated the deletion; explains the snapshot-creation GC trigger.
- Pushes back on requirements: 'Do we really need rich-text marks, or can we ship plain-text first and add Peritext later? The marks protocol is the part that takes a quarter to debug.'
- Names the WebSocket proxy block rate (1-3%) and proposes a long-polling compatibility mode with a degraded latency banner; cites that Google Docs and Notion both accept this.
- Closes with a one-sentence summary that names CRDT, the relay-not-authority server model, the three split-brain defenses, and the metadata trade-off.
Common Follow-up Questions
Questions an interviewer is likely to ask after your walkthrough. Rehearse the short answer.