System Design: Real-Time Collaborative Editor
Goal: Build a real-time collaborative text editor that supports 100 concurrent editors per document, 10 million simultaneously active documents, and 1 billion total documents. Handle rich text formatting, offline editing, version history, and presence indicators. Merge concurrent edits from multiple users without conflicts, data loss, or perceptible latency.
Reading guide: This post covers the full architecture of a real-time collaborative editor, from CRDT algorithms to block-based editing paradigms. It is long and detailed. Reading it linearly is not required.
Sections 1-8: Core architecture, data model, and API design
Section 4: CRDT vs OT deep dive (the heart of this design, both algorithms explained end-to-end with concrete examples)
Section 9: Implementation deep dives (WebSocket lifecycle, sync protocol, presence, offline, version history, search, rich text, multi-tenancy, GDPR, undo/redo, AI-assisted editing)
Sections 10-14: Operational concerns (bottlenecks, failures, observability, deployment, security)
New to collaborative editing? Start with Sections 1-4. Section 4 walks through both OT and CRDTs with character-level examples.
Building something similar? Sections 4-9 have the algorithm details, sizing math, and implementation deep dives. Jump straight to Section 9 if you already understand CRDTs.
Preparing for a system design interview? Sections 1-8 cover what interviewers expect. Section 4 (CRDT vs OT) is the most commonly asked follow-up. Sections 10-11 (bottlenecks and failures) round out the discussion.
TL;DR: CRDT-based architecture using Yjs (YATA algorithm) for conflict-free merging. TipTap/ProseMirror for the rich text editor. WebSocket for real-time sync. Server acts as relay and persistence layer, not as ordering authority. Document tree model with snapshot + operation log storage in PostgreSQL. Presence via ephemeral Redis Pub/Sub. Offline editing syncs automatically on reconnect via state vector diffing. The hardest problems: merging concurrent edits (solved by CRDTs), rich text conflict resolution (Peritext approach), and scaling WebSocket connections across regions.
1. Problem Statement
Real-time collaborative editing means multiple users modify the same document simultaneously and every participant sees a coherent, converging result. It looks simple until you try to build it. Three problems make this genuinely hard:
- Consistency. Alice and Bob both type at the same position in the same paragraph at the same time. Without a merge strategy, their documents diverge permanently. The system must guarantee that all replicas converge to an identical document state, regardless of the order operations arrive.
- Latency. Every keystroke must appear on screen within 16ms (one frame at 60fps). Waiting for a server round-trip before showing the character is unacceptable. The system must apply edits optimistically on the local client and reconcile with remote edits asynchronously.
- Rich text. This is not a plain text buffer. Bold, italic, headings, tables, images, comments, and nested lists all need to survive concurrent edits. Two users formatting the same word simultaneously, or one deleting a paragraph while another adds a comment to it, must produce a sensible result.
Scale: 10 million documents open simultaneously. Average of 3 concurrent editors per document (30 million WebSocket connections). Up to 100 concurrent editors on a single document. 1 billion total documents in storage.
The core question: How to merge concurrent edits from multiple users without conflicts, data loss, or unacceptable latency. Two approaches exist: Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs). Section 4 walks through both in full detail before making a choice.
What NOT to do:
- Lock the document while one user is editing. This caps throughput at one edit at a time and makes the system feel single-player.
- Send every keystroke to a central server and wait for confirmation before displaying it. At 200ms+ latency per character, the editor becomes unusable for real-time typing.
- Use a simple "last write wins" strategy. This silently drops one user's edits with no notification or recovery path. Users lose work and lose trust.
- Store the document as a flat string and use line-level diff/merge. Two users editing the same line simultaneously will corrupt each other's work.
- Assume all users are always online. Real users close laptops, lose signal on trains, and work on planes. Offline editing is a requirement, not a nice-to-have.
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Real-time collaborative editing with multiple simultaneous editors | P0 |
| FR-02 | Rich text formatting (bold, italic, underline, headings, lists, code blocks) | P0 |
| FR-03 | Document CRUD (create, read, update, delete) | P0 |
| FR-04 | Sharing with permission levels (owner, editor, commenter, viewer) | P0 |
| FR-05 | Cursor and presence indicators (see who is editing and where) | P0 |
| FR-06 | Version history with ability to view and restore past versions | P0 |
| FR-07 | Tables with concurrent cell editing | P0 |
| FR-08 | Offline editing with automatic sync on reconnect | P1 |
| FR-09 | Comments and suggestions on text ranges | P1 |
| FR-10 | Image and media embedding | P1 |
| FR-11 | In-document search (Ctrl+F) | P1 |
| FR-12 | Undo/redo per user (not global) | P1 |
| FR-13 | Cross-document full-text search | P2 |
| FR-14 | Templates | P2 |
| FR-15 | Export to PDF/DOCX | P2 |
| FR-16 | Document analytics (view count, edit frequency) | P2 |
3. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Operation sync latency (same region) | < 100ms |
| Operation sync latency (cross region) | < 300ms |
| Local keystroke response | < 16ms (60fps rendering) |
| Concurrent editors per document | 100 |
| Concurrent viewers per document | 200 |
| Document size limit | 1.5M characters (~500 pages) |
| Availability | 99.99% |
| Durability | Zero user-visible data loss (client retains canonical state in IndexedDB; server persists via operation log with 100ms batch window) |
| Offline tolerance | Hours of offline editing, merge on reconnect |
| Active documents (simultaneously open) | 10M |
| Total documents | 1B |
4. High-Level Approach & Technology Selection
Both OT and CRDTs are explained end-to-end with concrete examples covering: how the algorithm works, online editing flow, offline editing flow, and conflict resolution. By the end of this section, the choice between the two should feel inevitable.
4.1 The Core Problem: Concurrent Edits
Consider a simple document containing "HELLO". Alice inserts "X" at position 1 (between H and E). Bob deletes the character at position 3 (an L). Both edits happen simultaneously on different clients.
Alice's local view: "HXELLO" (inserted X at position 1).
Bob's local view: "HELO" (deleted L at position 3).
Now they exchange operations. If Bob naively applies Alice's insert at position 1, he gets "HXELO". If Alice naively applies Bob's delete at position 3, she gets "HXELO". Lucky coincidence? Not always. With three or more concurrent operations, naive application diverges fast.
Two approaches solve this problem:
- Operational Transformation (OT): Transform operations against each other so positions stay correct after concurrent edits. The server is the authority.
- CRDTs (Conflict-free Replicated Data Types): Give each character a globally unique identity so position is defined by neighbors, not indices. No authority needed.
Both are walked through in full below.
4.2 Approach 1: Operational Transformation (OT)
4.2.1 How OT Works (Concrete Example)
Document: "ABCD". Alice deletes position 1 (character 'B'). Bob inserts 'X' at position 3 (between 'C' and 'D'). Both edits happen simultaneously.
Server receives Alice's operation first:
- Apply `DELETE(1)` to `"ABCD"` → `"ACD"`. Server is now at revision 6.
- Bob's `INSERT(3, 'X')` was composed against revision 5 (before Alice's delete). It needs transformation.
- Transform: Alice deleted at position 1, which is before Bob's insertion point (3). Everything after position 1 shifted left by one. Bob's operation becomes `INSERT(2, 'X')`.
- Apply the transformed op to `"ACD"` → `"ACXD"`. Server is now at revision 7.
Verify with reverse order (TP1 property):
- Apply Bob's `INSERT(3, 'X')` to `"ABCD"` → `"ABCXD"`. Server at revision 6.
- Alice's `DELETE(1)` was composed against revision 5. Transform: Bob inserted at position 3, which is after Alice's delete point (1). Alice's delete is unaffected. Still `DELETE(1)`.
- Apply `DELETE(1)` to `"ABCXD"` → `"ACXD"`. Server at revision 7.
Both orderings produce "ACXD". This is the Transformation Property 1 (TP1): the result must be the same regardless of which operation the server processes first. TP1 is easy to verify for two operations of the same type. It becomes exponentially harder with three or more concurrent operations mixing inserts, deletes, and formatting. That complexity is the fundamental reason OT implementations are fragile at scale.
Transform function rules (pseudocode for all four operation pairs):
```
transform(op1, op2):
  INSERT(p1, c1) vs INSERT(p2, c2):
    if p1 < p2:  return INSERT(p1, c1)      # op1 is before op2, no shift
    if p1 > p2:  return INSERT(p1 + 1, c1)  # op2 shifted text right
    if p1 == p2: break tie by client ID     # deterministic ordering

  INSERT(p1, c1) vs DELETE(p2):
    if p1 <= p2: return INSERT(p1, c1)      # delete is at/after insert, no shift
    if p1 > p2:  return INSERT(p1 - 1, c1)  # delete shifted text left

  DELETE(p1) vs INSERT(p2, c2):
    if p1 < p2:  return DELETE(p1)          # insert is after delete, no shift
    if p1 >= p2: return DELETE(p1 + 1)      # insert shifted text right

  DELETE(p1) vs DELETE(p2):
    if p1 < p2:  return DELETE(p1)          # independent deletes
    if p1 > p2:  return DELETE(p1 - 1)      # earlier delete shifted text
    if p1 == p2: return NOOP                # same character deleted twice
```
These four rules are the entire foundation of OT. Every operation arriving at the server gets transformed against all operations that happened since the client's last known revision.
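The rules run directly in code. A minimal TypeScript sketch (my own helper names, not a production OT library) that implements the four pairs and checks the TP1 example from above:

```typescript
// transform(a, b) rewrites op `a` so it remains correct after op `b` has
// already been applied to the document. Returns null for a no-op.
type Op =
  | { kind: "insert"; pos: number; ch: string; client: string }
  | { kind: "delete"; pos: number; client: string };

function transform(a: Op, b: Op): Op | null {
  if (a.kind === "insert" && b.kind === "insert") {
    // Tie at the same position is broken deterministically by client id.
    if (a.pos < b.pos || (a.pos === b.pos && a.client < b.client)) return a;
    return { ...a, pos: a.pos + 1 };           // b shifted text right
  }
  if (a.kind === "insert" && b.kind === "delete") {
    if (a.pos <= b.pos) return a;              // delete is at/after insert
    return { ...a, pos: a.pos - 1 };           // delete shifted text left
  }
  if (a.kind === "delete" && b.kind === "insert") {
    if (a.pos < b.pos) return a;               // insert is after delete
    return { ...a, pos: a.pos + 1 };           // insert shifted text right
  }
  // delete vs delete
  if (a.pos < b.pos) return a;
  if (a.pos > b.pos) return { ...a, pos: a.pos - 1 };
  return null;                                 // same character deleted twice
}

function apply(doc: string, op: Op | null): string {
  if (op === null) return doc;
  if (op.kind === "insert") return doc.slice(0, op.pos) + op.ch + doc.slice(op.pos);
  return doc.slice(0, op.pos) + doc.slice(op.pos + 1);
}

// TP1 check with the "ABCD" example: both orders must yield "ACXD".
const aliceDel: Op = { kind: "delete", pos: 1, client: "alice" };
const bobIns: Op = { kind: "insert", pos: 3, ch: "X", client: "bob" };

const order1 = apply(apply("ABCD", aliceDel), transform(bobIns, aliceDel));
const order2 = apply(apply("ABCD", bobIns), transform(aliceDel, bobIns));
console.log(order1, order2); // "ACXD" "ACXD"
```

Both processing orders converge on `"ACXD"`, which is exactly the TP1 property the server relies on.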
4.2.2 Online Flow (Step by Step)
Alice types 'X' at position 3 in a shared document. Walk through every hop:
- Alice's keystroke applies to her local document immediately (optimistic update). She sees the 'X' on screen within 16ms. No server round-trip.
- Alice's client sends `INSERT(3, 'X', rev=5)` to the server over WebSocket. `rev=5` means "this operation was composed against server revision 5."
- Server receives Alice's operation. The server is currently at revision 7 (two other operations from Bob landed while Alice's was in flight). Server transforms Alice's op against revisions 6 and 7 to produce a shifted op. Server applies the transformed op, increments to revision 8, and broadcasts the transformed op to all other clients.
- Acknowledgement: Server sends an ACK back to Alice with `rev=8`. Alice's client marks the pending op as confirmed. If Alice typed more keystrokes while waiting, those pending ops are now rebased against revisions 6 through 8.
- Bob's client receives the broadcast of Alice's transformed op. Bob transforms any of his own pending (unacknowledged) ops against Alice's incoming op, then applies Alice's op to his local document.
Key insight: In OT, the client maintains three states at all times:
- Confirmed state: What the server has acknowledged. The client knows this is canonical.
- Pending op: Sent to server, waiting for ACK. Only one pending op at a time.
- Buffer ops: Not yet sent. Accumulating locally while the pending op awaits acknowledgement.
When the ACK arrives, the pending op moves to confirmed. The buffer ops compose into a new pending op and get sent. This is the OT client state machine, and getting it wrong causes divergence.
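A minimal sketch of that state machine (my own naming, not taken from any specific OT client; rebasing of pending ops against incoming remote ops is elided to keep the shape visible):

```typescript
// The confirmed / pending / buffer client state machine: one batch in
// flight at a time, everything else queues locally.
type Op = { pos: number; ch: string };

class OtClient {
  confirmedRev = 0;            // last server revision known to be canonical
  pending: Op[] | null = null; // one in-flight batch awaiting the server ACK
  buffer: Op[] = [];           // edits accumulated while pending is in flight

  constructor(private send: (ops: Op[], rev: number) => void) {}

  // Every keystroke is applied to the local document first (optimistic),
  // then routed through here.
  localEdit(op: Op): void {
    if (this.pending === null) {
      this.pending = [op];
      this.send(this.pending, this.confirmedRev); // send against known rev
    } else {
      this.buffer.push(op); // hold until the in-flight batch is ACKed
    }
  }

  // Server acknowledged the pending batch at revision newRev.
  onAck(newRev: number): void {
    this.confirmedRev = newRev;
    this.pending = null;
    if (this.buffer.length > 0) {
      // Compose buffered edits into the next pending batch and send it.
      this.pending = this.buffer;
      this.buffer = [];
      this.send(this.pending, this.confirmedRev);
    }
  }
}

// Simulate: two keystrokes while the first is still in flight.
const sent: Array<{ ops: Op[]; rev: number }> = [];
const client = new OtClient((ops, rev) => sent.push({ ops, rev }));
client.localEdit({ pos: 0, ch: "a" }); // sent immediately (pending)
client.localEdit({ pos: 1, ch: "b" }); // buffered behind the pending op
client.onAck(1);                       // ACK promotes the buffer to pending
console.log(sent.length); // 2 sends: the first op, then the buffered batch
```

The real implementation must additionally transform both `pending` and `buffer` against every remote op that arrives, which is where most OT divergence bugs live.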
4.2.3 Offline Flow
- Alice goes offline. The WebSocket connection drops.
- Alice continues typing. Each keystroke produces an operation applied against her local revision counter.
- Alice accumulates 500 operations over 2 hours offline.
- Alice reconnects. Her client reports its last known server revision (say rev=50). The server has advanced to rev=200 (150 ops from other users).
- The server sends Alice the 150 missed operations. Alice's client must transform her 500 pending ops against these 150 server ops. That is 500 x 150 = 75,000 transform calls. Each transform must be mathematically correct or the document diverges permanently.
- After transformation, Alice's rebased ops are sent to the server. The server transforms them again against any ops that arrived during the rebasing process.
The problem: Long offline periods create quadratic transformation cost (O(local_ops x missed_ops)). A single bug in the transform function across 75,000 calls causes permanent divergence that is extremely difficult to debug. Google Docs historically limited offline OT resolution, requiring server round-trips to reconcile diverged states.
Pros:
- Server maintains a single canonical document state (easy to reason about)
- Mature ecosystem (Google Docs has used OT since 2010)
- Lower per-character metadata than CRDTs (no unique ID per character)
- Access control is straightforward (server can reject operations)
Cons:
- Transform functions must satisfy TP1 (applying transformed ops in either order yields the same result). Getting this right for every operation pair is notoriously hard
- Central server is required for ordering (single point of failure for correctness)
- Offline sync is expensive and fragile (quadratic transformation cost)
- Client state machine (confirmed/pending/buffer) is complex to implement correctly
- Every new operation type (tables, images, formatting) needs new transform functions
Used by: Google Docs, CKEditor 5, Etherpad
A third approach: Microsoft Loop and Fluid Framework.
OT requires pairwise transforms. CRDTs require per-character metadata. Fluid Framework sidesteps both with total order broadcast: the server does not transform operations, and clients do not carry CRDT metadata. Instead, the server assigns a global sequence number to every incoming operation and broadcasts them to all clients in that order.
Concrete example: Alice sends `INSERT(3, 'X')` and Bob sends `DELETE(5)` simultaneously. Both arrive at the Fluid relay server. The server does not transform them. It assigns Alice's op sequence #41 and Bob's op sequence #42, then broadcasts both to all clients in that order. Every client applies #41 first, then #42. Because everyone applies the same operations in the same order, all replicas converge without transforms.

Why this works: Divergence in OT happens because clients apply operations in different orders, requiring transforms to reconcile. Fluid eliminates divergence by forcing a single global order. No transforms needed. No CRDT item IDs needed. Operations are plain position-based, like OT, but applied in server-assigned order.
The tradeoff: The server is a sequencing authority. If the server is unreachable, clients cannot apply operations (they queue locally and wait). This makes Fluid less offline-friendly than CRDTs, where the client applies edits immediately without any server involvement. Fluid sits between OT and CRDTs: simpler than OT (no transform functions), less offline-capable than CRDTs (needs server for ordering), and lighter than CRDTs (no per-character metadata).
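The mechanism fits in a few lines. A toy sketch (my own model, not the Fluid Framework API) of total order broadcast:

```typescript
// The server assigns a sequence number to each op and every client applies
// ops strictly in that order, so replicas converge without transforms or
// per-character metadata.
type Seq<T> = { seq: number; op: T };

class Sequencer<T> {
  private nextSeq = 1;
  private subscribers: Array<(m: Seq<T>) => void> = [];

  subscribe(fn: (m: Seq<T>) => void): void {
    this.subscribers.push(fn);
  }

  // The server never inspects or transforms the op; it only orders it.
  submit(op: T): void {
    const msg: Seq<T> = { seq: this.nextSeq++, op };
    for (const fn of this.subscribers) fn(msg);
  }
}

type Ins = { pos: number; ch: string };
const applyIns = (doc: string, { pos, ch }: Ins): string =>
  doc.slice(0, pos) + ch + doc.slice(pos);

const server = new Sequencer<Ins>();
let docA = "HELLO";
let docB = "HELLO";
server.subscribe(m => { docA = applyIns(docA, m.op); });
server.subscribe(m => { docB = applyIns(docB, m.op); });

server.submit({ pos: 1, ch: "X" }); // Alice's op becomes seq #1
server.submit({ pos: 1, ch: "Y" }); // Bob's op becomes seq #2
console.log(docA === docB); // true: same ops, same order, same result
```

The same pair of same-position inserts that diverged under naive merging converges here, because the sequencer imposes one order on everyone.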
4.3 Approach 2: CRDTs (Conflict-free Replicated Data Types)
4.3.1 How CRDTs Work (Concrete Example)
Document: "AC". Character A has a unique ID (1, server). Character C has ID (2, server). These IDs never change.
Alice inserts 'B' after 'A'. Bob inserts 'X' after 'A'. Both happen simultaneously.
Each new character gets a globally unique ID composed of (lamport_clock, client_id):
- Alice's insert: `{ char: 'B', id: (3, alice), originLeft: (1, server), originRight: (2, server) }`. Alice sees: `"ABC"`
- Bob's insert: `{ char: 'X', id: (3, bob), originLeft: (1, server), originRight: (2, server) }`. Bob sees: `"AXC"`
Now they exchange operations. Both replicas receive an item that has the same originLeft as an existing item (a conflict). The tie-breaking rule: compare client_id lexicographically. "alice" < "bob", so B sorts before X.
Both replicas independently arrive at: "ABXC". No server decided this ordering. Both replicas applied the same deterministic rule.
The YATA algorithm (used by Yjs) formalizes this. Each item stores { id, originLeft, originRight, content }. On conflict (two items share the same originLeft):
- Start at the conflict position.
- Scan right through existing items.
- Compare the conflicting item's ID against each existing item's ID.
- Insert before the first item whose `originLeft` is different from ours, or whose `client_id` is greater than ours.
This produces a total order that every replica agrees on, without any communication beyond exchanging the items themselves. The mathematical property: the merge function is commutative (A merge B = B merge A), associative ((A merge B) merge C = A merge (B merge C)), and idempotent (A merge A = A). These three properties guarantee convergence regardless of message ordering or duplication.
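A simplified integration rule can be sketched directly (my own minimal model, not the full YATA algorithm from the Yjs codebase): concurrent items that share an `originLeft` are ordered by client id, bounded by `originRight`.

```typescript
// Each item carries a unique id plus origin pointers; integrate() finds the
// deterministic position for a new item regardless of arrival order.
type Id = { clock: number; client: string };
type Item = { id: Id; originLeft: Id; originRight: Id; ch: string };

const sameId = (a: Id, b: Id) => a.clock === b.clock && a.client === b.client;

function integrate(items: Item[], item: Item): Item[] {
  // Start just after our left origin...
  let i = items.findIndex(it => sameId(it.id, item.originLeft)) + 1;
  // ...then scan right past concurrent siblings (same originLeft) whose
  // client id sorts before ours, stopping at our right origin.
  while (
    i < items.length &&
    !sameId(items[i].id, item.originRight) &&
    sameId(items[i].originLeft, item.originLeft) &&
    items[i].id.client < item.id.client
  ) {
    i++;
  }
  return [...items.slice(0, i), item, ...items.slice(i)];
}

// Document "AC": A = (1, server), C = (2, server); sentinel ids for the edges.
const start: Id = { clock: 0, client: "start" };
const end: Id = { clock: 0, client: "end" };
const A: Item = { id: { clock: 1, client: "server" }, originLeft: start, originRight: end, ch: "A" };
const C: Item = { id: { clock: 2, client: "server" }, originLeft: A.id, originRight: end, ch: "C" };

// Concurrent inserts between A and C: identical origins, different clients.
const B: Item = { id: { clock: 3, client: "alice" }, originLeft: A.id, originRight: C.id, ch: "B" };
const X: Item = { id: { clock: 3, client: "bob" }, originLeft: A.id, originRight: C.id, ch: "X" };

const text = (items: Item[]) => items.map(it => it.ch).join("");

// Replica 1 receives B first; replica 2 receives X first. Both converge.
const replica1 = integrate(integrate([A, C], B), X);
const replica2 = integrate(integrate([A, C], X), B);
console.log(text(replica1), text(replica2)); // "ABXC" "ABXC"
```

Both replicas arrive at `"ABXC"` without exchanging anything beyond the items themselves, which is the convergence property the section describes.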
4.3.2 Online Flow (Step by Step)
Alice types 'X' in the document. Walk through every hop:
- Alice's keystroke applies to the local Yjs document immediately. A new CRDT item is created with a unique ID `(clock, alice)`. She sees the 'X' on screen within 16ms. This edit is final. It will never be rebased, transformed, or undone by the sync process.
- Yjs encodes the new item as a compact binary update (~20-50 bytes for a single character).
- Alice's client sends the binary update to the sync server over WebSocket.
- Acknowledgement model: There is no ACK required for correctness. The edit is already committed to Alice's local document. The server is a relay, not an authority. It does not transform operations, though it does establish a durable total ordering via the append-only operation log. The client does not wait for server confirmation before proceeding. If the server is slow, unreachable, or crashed, Alice keeps typing with zero impact on her local experience. Eventual consistency is verified via periodic state vector reconciliation, not per-operation ACKs.
- The sync server receives the update, appends it to the operation log, and broadcasts it to all other connected clients editing the same document.
- Bob's client receives the update. Yjs merges it into Bob's local document using the CRDT rules (unique IDs + deterministic ordering). If Bob has concurrent edits, they merge automatically. No transformation needed.
Key insight: In CRDTs, there is no "pending" state. Every local edit is immediately part of the document's permanent history. There is no rebasing. The merge function is idempotent (applying the same update twice has no effect) and commutative (the order updates arrive does not matter). This eliminates the entire class of bugs related to OT's client state machine.
4.3.3 Offline Flow
- Alice goes offline. The WebSocket connection drops.
- Alice continues typing. Each keystroke creates a new CRDT item with a unique ID in her local Yjs document. Everything is stored in IndexedDB.
- Alice accumulates 500 edits over 2 hours offline.
- Meanwhile, Bob and others make 150 edits on the server.
- Alice reconnects. Her Yjs client sends its state vector: a compact summary of what it has already seen. Example: `{ alice: 550, bob: 200, carol: 100 }`, meaning "I have all edits from Alice up to clock 550, Bob up to 200, Carol up to 100."
- The server compares Alice's state vector against its own. It computes the diff: "Alice is missing these 150 updates from Bob and Carol." The server sends only those 150 updates.
- Alice's client applies the 150 incoming updates. The CRDT merge handles ordering automatically. No transforms. No 75,000 pairwise transform calls. Just 150 merge operations, each O(log n) where n is the document length.
- Alice's client sends her 500 offline edits to the server (the server's state vector showed it was missing those). The server merges them the same way.
- All clients converge. Total merge work: O(m log n) where m is the number of missed operations and n is the document length. No quadratic blowup.
The advantage: Offline for 2 hours or 2 weeks makes no difference to correctness. The merge protocol is identical whether the gap is 1 operation or 10,000. No rebasing, no transform chains, no divergence risk. This is what makes CRDTs genuinely offline-first, instead of bolting offline support onto an architecture that was never built for it.
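The state-vector diff itself is a one-liner. A sketch with my own types (not the Yjs wire format, which is a compact binary encoding of the same idea):

```typescript
// The client reports the highest clock it has seen per peer; the server
// replies with only the updates the client is missing.
type StateVector = Record<string, number>; // client id -> max clock seen
type Update = { client: string; clock: number; payload: string };

function missingUpdates(log: Update[], sv: StateVector): Update[] {
  // An update is missing if its clock is beyond what the client has seen
  // from that peer (0 if the peer is entirely unknown to the client).
  return log.filter(u => u.clock > (sv[u.client] ?? 0));
}

// Server log: edits that landed while Alice was offline.
const serverLog: Update[] = [
  { client: "alice", clock: 550, payload: "..." }, // Alice's own last edit
  { client: "bob", clock: 201, payload: "..." },
  { client: "carol", clock: 101, payload: "..." },
];

// Alice reconnects and reports what she has already seen.
const aliceSv: StateVector = { alice: 550, bob: 200, carol: 100 };

const diff = missingUpdates(serverLog, aliceSv);
console.log(diff.map(u => u.client)); // only bob's and carol's new updates
```

The same comparison runs in the other direction for Alice's 500 offline edits: the server's state vector shows it has nothing from Alice past clock 550, so her client sends exactly those.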
4.3.4 Conflict Resolution
Same-position insert: Alice inserts 'A', Bob inserts 'B', both after the same character. In CRDTs, "same position" means both items share the same originLeft. The YATA algorithm compares client IDs to break the tie deterministically. Every replica applies the same rule independently. No server arbitration.
Delete vs. insert: Alice deletes a word. Bob simultaneously inserts new text inside that word. The deleted characters become tombstones (marked as deleted but retained in the CRDT structure). Bob's insert still lands correctly because its position is defined by originLeft and originRight pointers to specific item IDs, not integer indices. The inserted text appears, and the surrounding deleted text remains invisible. Clean resolution, no special case needed.
Delete vs. format: Alice deletes a paragraph. Bob simultaneously bolds a word in that paragraph. The bold mark applies to tombstoned characters (no visible effect). When both operations merge, the paragraph is gone and the formatting goes with it. If the paragraph is later restored (undo), the bold formatting reappears.
Pros:
- Offline-first by mathematical design (commutative, associative, idempotent merges)
- No central ordering authority required (server is a relay, not a coordinator)
- Local edits are instant and permanent (no pending state, no rebasing)
- Merge complexity is O(m log n) (m = missed ops, n = doc length), not O(n x m) for offline catch-up
- Adding new operation types does not require new transform functions
Cons:
- Per-character metadata overhead (each character carries a unique ID + origin pointers)
- Tombstones for deleted text consume memory until garbage collected
- Document size is 1.5-3x the plain text size (Yjs binary encoding)
- Undo/redo requires per-user tracking (Yjs `UndoManager` handles this; see Section 9.11)
- Garbage collection of tombstones requires coordination across all connected clients
Used by: Figma, Apple Notes, Jupyter Notebooks (Yjs), AFFiNE, Notion (partial)
4.4 Side-by-Side Comparison
| Dimension | OT | CRDT |
|---|---|---|
| Offline sync | Expensive: O(local x missed) transforms | Cheap: O(m log n) state vector diff |
| Acknowledgement model | Server ACKs each op; client has pending/buffer states | No ACK needed; local edits are final |
| Metadata per character | None (positions are integers) | Unique ID + originLeft + originRight (~20 bytes) |
| Transform/merge complexity | O(n x m) for n local, m remote ops | O(m log n) for m ops, n doc length |
| Correctness risk | High (TP1 bugs cause silent divergence) | Low (mathematical convergence guarantee) |
| Rich text maturity | Google Docs (15+ years of production) | Yjs + Peritext (newer but production-proven) |
| Latency model | Local instant, server round-trip for confirmation | Local instant, no confirmation needed |
| Used by | Google Docs, CKEditor 5, Etherpad | Figma, Apple Notes, Jupyter (Yjs), AFFiNE |
Summary: OT gives the server full control at the cost of offline fragility and transformation correctness risk. CRDTs provide offline-first and mathematical convergence guarantees at the cost of metadata overhead and tombstone management. For a modern collaborative editor where offline capability matters, CRDTs have won the practical argument. Yjs's optimized binary encoding keeps metadata overhead to 1.5-3x plain text, down from 16x in naive CRDT implementations.
4.5 Our Choice: CRDT (Yjs) with a Relay Server
After walking through both algorithms in detail, the choice:
Why CRDT over OT:
- Offline sync is a first-class citizen, not a bolted-on afterthought
- No transformation correctness bugs. I've seen teams spend months debugging OT transform bugs that only surface under specific three-user concurrent editing scenarios. CRDTs eliminate this entire class of problem.
- Local edits are instant and final (no pending/buffer state machine, no rebasing)
- Merge complexity scales linearly, not quadratically, with offline duration
Why Yjs specifically:
- Mature ecosystem with ProseMirror and TipTap integrations
- Optimized binary encoding (1.5-3x plain text size, not 16x like academic CRDT implementations)
- Production-proven at scale (Jupyter Notebooks, AFFiNE, multiple enterprise deployments)
- Run-length encoding for consecutive edits (typing "hello" stores as one item, not five)
- Built-in awareness protocol for presence and cursor tracking
Why still use a server:
- Access control enforcement (reject unauthorized edits before relaying)
- Durable persistence (operation log + snapshots in PostgreSQL and S3)
- Presence relay across clients (cursor positions, online status)
- Single WebSocket endpoint (simpler than peer-to-peer NAT traversal at scale)
The hybrid model: Yjs handles the merge algorithm. The server handles auth, persistence, relay, and presence. The client handles editing via TipTap/ProseMirror with Yjs bindings. Each layer does what it is best at.
4.6 Collaboration Paradigms: Where This Design Fits
This post designs a rich text collaborative editor. That is one of four distinct paradigms, each requiring a different CRDT data model and sync architecture:
| Paradigm | Examples | CRDT Data Model | Key Difference |
|---|---|---|---|
| Plain text | Etherpad, VS Code Live Share | Sequence CRDT (characters only) | No formatting marks, simpler merge |
| Rich text | Google Docs, this design | Sequence CRDT + marks (Peritext) | ProseMirror tree with inline formatting |
| Block-based | Notion, Coda, AFFiNE, Microsoft Loop | Tree CRDT (blocks) + per-block text CRDT | Each block is an independent CRDT subdocument |
| Canvas / design | Figma, Miro, FigJam | Map CRDT (objects with properties) | CRDTs operate on position/color/size, not text |
Block-Based Editors (Notion, AFFiNE, Coda)
In our rich text design, the entire document is a single ProseMirror tree backed by one Yjs document. Every character, heading, and table lives in the same CRDT. In a block-based editor, the document is a tree of independent blocks, each with its own CRDT:
```
Document
├── Block (heading): "Project Roadmap"          ← Y.Text CRDT
├── Block (paragraph): "The first milestone..." ← Y.Text CRDT
├── Block (table)                               ← Y.Map CRDT
│   ├── Cell (0,0): "Task"                      ← Y.Text CRDT
│   └── Cell (0,1): "Owner"                     ← Y.Text CRDT
├── Block (toggle): "Implementation details"    ← Y.Text CRDT
│   ├── Block (paragraph): "Step 1..."          ← Y.Text CRDT (nested child)
│   └── Block (paragraph): "Step 2..."          ← Y.Text CRDT (nested child)
└── Block (image): screenshot.png               ← Y.Map CRDT (properties only)
```
In Yjs terms, the data structure looks like this:
```
Document = Y.Array<BlockID>            // ordering CRDT (which blocks, in what order)
Block = Y.Map {
  id: string,                          // globally unique block ID
  type: "paragraph" | "heading" | "table" | "image" | "toggle" | ...,
  content: Y.Text,                     // inline text with marks (per-block CRDT)
  children: Y.Array<BlockID>,          // nested blocks (toggles, columns, callouts)
  props: Y.Map { level, checked, ... } // block-level properties
}
```
The key difference: in our design, Alice editing paragraph 1 and Bob editing paragraph 5 both modify the same Yjs document. In a block model, they modify completely independent CRDTs. This has four architectural consequences:
1. Per-block lazy loading. The client does not need the entire document to start editing. On page open, the server sends block metadata (IDs, types, ordering) and the content of blocks visible in the viewport. As the user scrolls, additional blocks are loaded on demand. This is how Notion achieves fast initial page loads even for documents with hundreds of blocks. In our design, the full Yjs document must be synced before the editor is interactive.
2. Per-block permissions. Because each block is an independent CRDT, access control can be enforced per block, not just per document. A team lead can share a specific section of a planning document with an external contractor without exposing the rest. The sync server checks block-level ACLs before sending block content. Our design only supports document-level permissions.
3. Transclusion (synced blocks). A "synced block" in Notion appears in multiple documents simultaneously. The block lives in a shared block store, and documents reference it by ID. Edits to the block propagate to every document that references it. This is straightforward when each block is an independent CRDT: the block's Yjs document is synced independently of any parent document. In our ProseMirror tree model, transclusion would require splitting the single document CRDT, which defeats the tree structure.
4. The block move problem. Drag-and-drop reordering means moving a block from one position to another in the Y.Array. If Alice moves block X above block Y while Bob moves block X below block Z, the CRDT must resolve this conflict. Sequence CRDTs (including Yjs) do not natively support "move" as an atomic operation. A move is typically implemented as delete-then-insert, which can produce duplicates or losses under concurrent moves. Notion handles this by routing structural operations (block moves, block deletions) through the server as ordered transactions, while using CRDTs for within-block text editing. This is the "hybrid" approach: tree structure is server-ordered, text content is CRDT-merged. Martin Kleppmann's "Moving Elements in List CRDTs" (2020) proposes a native move operation for sequence CRDTs, but production implementations are still limited.
Trade-offs: rich text vs. block-based
| Aspect | Rich Text (This Design) | Block-Based (Notion) |
|---|---|---|
| Sync unit | Entire document as one CRDT | Per-block, independent CRDTs |
| Initial load | Full document sync required | Metadata + visible blocks only |
| Large documents | Memory-bound (entire doc in RAM) | Viewport-bound (only visible blocks in RAM) |
| Permissions | Document-level only | Block-level possible |
| Reordering | ProseMirror transactions | Tree CRDT or server-ordered moves |
| Offline | Full document available offline | Only cached blocks available offline |
| Complexity | Single CRDT, simpler architecture | Tree of CRDTs, more moving parts |
Block-based editing is the direction the industry is moving. If I were starting this system from scratch with Notion-level ambitions (databases, synced blocks, per-block permissions), I would choose the block model. For a Google Docs-like rich text editor, the single-document CRDT is simpler, better supported by existing tooling (TipTap, ProseMirror, Hocuspocus), and sufficient for the core use case.
Canvas Editors (Figma)
Canvas editors (Figma) apply CRDTs to objects with properties (x, y, width, height, color, z-order) rather than text sequences. Figma runs the CRDT on the server and sends rendered state to clients via WebGL, which enables 500+ concurrent viewers without each viewer holding a full CRDT document in memory. The data model is fundamentally different from a document tree.
Why This Design Focuses on Rich Text
This design focuses on rich text because it represents the core challenge of collaborative document editing and the most common interview question. The CRDT fundamentals (unique IDs, deterministic ordering, offline merge) apply across all four paradigms; the difference is what you attach those IDs to.
4.7 Technology Selection
| Component | Technology | Why |
|---|---|---|
| Sync algorithm | Yjs (YATA CRDT) | Offline-first, proven binary encoding, rich ecosystem |
| Rich text editor | TipTap (ProseMirror) | Yjs binding (y-prosemirror), extensible schema, production-ready |
| Sync server | Hocuspocus / Custom Node.js | Yjs-native WebSocket server, handles auth hooks and persistence |
| Real-time transport | WebSocket | Bidirectional, low overhead, browser-native, wide proxy support |
| Document storage (hot) | Redis | In-memory Yjs document state for active documents, sub-ms access |
| Document storage (warm) | PostgreSQL | Operation logs, snapshots, document metadata, ACLs |
| Document storage (cold) | S3 | Archived snapshots, old operation logs, media files |
| Media storage | S3 + CloudFront CDN | Presigned uploads, edge-cached delivery |
| Search | Elasticsearch | Full-text search across billions of documents with ACL filtering |
| Presence | Redis Pub/Sub | Ephemeral cursor/selection broadcast across sync server instances |
| Auth | JWT + OAuth 2.0 | Stateless token validation at WebSocket handshake |
4.8 What Runs Where
Before diving into the architecture diagram, here is what each core technology actually does and which side of the network it runs on.
Client side (runs in the browser):
| Technology | What It Is | Problem It Solves |
|---|---|---|
| ProseMirror | Rich text editor framework | Renders the document, handles the editing UI (typing, cursor, selections, toolbar), and enforces the document schema. This is the "word processor" layer. |
| TipTap | Wrapper around ProseMirror | Makes ProseMirror easier to configure. Adds an extension API and the Yjs binding (y-prosemirror) that connects the editor to the CRDT layer. |
| Yjs | CRDT library (YATA algorithm) | The merge engine. Every character gets a unique ID. When two users type simultaneously, Yjs merges their edits deterministically without conflicts. Each user holds a full copy of the document in memory. |
| IndexedDB | Browser storage | Persists the Yjs document locally. Enables offline editing and instant page reloads without fetching from the server. |
Server side:
| Technology | What It Is | Problem It Solves |
|---|---|---|
| Hocuspocus | Yjs-native WebSocket server (Node.js) | The sync relay. Receives binary Yjs updates from one client and broadcasts them to all other clients editing the same document. Also handles auth hooks and persistence hooks. |
| Redis | In-memory store | Hot document state cache (sub-ms access) and presence broadcast via Pub/Sub (cursor positions across server instances). |
| PostgreSQL | Relational database | Durable persistence: operation log (every edit), snapshots (periodic full state), document metadata, and ACLs. |
| S3 | Object storage | Cold storage for archived snapshots, old operation logs, and uploaded media files. |
The flow when Alice types a character:
```
Alice's browser                     Server                          Bob's browser
───────────────                     ──────                          ─────────────
TipTap (editor UI)                  Hocuspocus                  TipTap (editor UI)
        ↓                                                                 ↑
ProseMirror (document tree)                            ProseMirror (document tree)
        ↓                                                                 ↑
Yjs (creates CRDT item)                                    Yjs (merges CRDT item)
        ↓                                                                 ↑
Binary update ──→ WebSocket ──→ Persist + Broadcast ──→ WebSocket ──→ Binary update
                                (PostgreSQL + Redis)
```
- Alice types 'X' → ProseMirror creates a transaction → Yjs creates a CRDT item with a unique ID (clock, alice)
- Yjs encodes the item as a binary update (~30 bytes)
- The update travels over WebSocket to Hocuspocus
- Hocuspocus appends it to the operation log (PostgreSQL) and broadcasts it to Bob
- Bob's Yjs merges the item using deterministic ID ordering. No transforms. ProseMirror renders it.
The key point: the server does not merge. Hocuspocus is a relay, not a merge authority. All merge intelligence lives in Yjs on the client. If the server crashes, Alice and Bob keep typing locally (Yjs saves to IndexedDB). When the server comes back, they reconnect and Yjs syncs automatically via state vector diffing. This is the fundamental difference from Google Docs (OT), where the server is the ordering authority.
Tombstone lifecycle (how deletions work in Yjs):
When a character is deleted, Yjs does not remove it from the CRDT. It marks it as a tombstone: the content is cleared, but the unique ID and origin pointers are retained. A document where you type 10,000 characters and delete 9,000 still has 10,000 items internally. The 9,000 deleted items are tombstones, skipped during rendering.
Tombstones are not permanent. Yjs can garbage-collect them, but only when every connected client has integrated the deletion. If even one client is offline and has not seen the delete, the tombstone must stay. Removing it would break that client's merge on reconnect, because its state vector references items that no longer exist.
In practice, GC runs during snapshot creation. The snapshot stores only live content plus a compressed tombstone summary. Clients reconnecting after a GC cycle receive the full snapshot instead of incremental updates. This is why Yjs documents grow during active editing but shrink back to near-plain-text size when a snapshot is taken with all clients in sync.
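The GC eligibility rule above reduces to a state-vector check. Here is a toy version in plain JavaScript; the function and field names (`canCollect`, `deletedBy`) are illustrative, not Yjs internals:

```js
// Sketch: a tombstone can be garbage-collected only when every known
// client's state vector covers the delete operation's (client, clock).
function canCollect(tombstone, stateVectors) {
  // tombstone.deletedBy = { client, clock } of the delete operation
  return stateVectors.every(
    (sv) => (sv[tombstone.deletedBy.client] ?? -1) >= tombstone.deletedBy.clock
  );
}

const tombstone = { id: { client: 1, clock: 5 }, deletedBy: { client: 2, clock: 9 } };

// Client 2 deleted at clock 9. One client has only seen client 2 up to clock 8.
const behind = [{ 1: 10, 2: 9 }, { 1: 10, 2: 8 }];
console.log(canCollect(tombstone, behind)); // false: the lagging client would break

const caughtUp = [{ 1: 10, 2: 9 }, { 1: 10, 2: 9 }];
console.log(canCollect(tombstone, caughtUp)); // true: safe to GC
```

In production the "known clients" set includes offline clients that may reconnect, which is exactly why GC is deferred to snapshot boundaries.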
4.9 End-to-End Example: From Keystroke to CRDT
The previous sections explain what each technology does. This section shows how they work together with actual code.
The document:
Document
├── Block b1 (heading): "Project Plan"
└── Block b2 (paragraph): "Build authentication service using OAuth"
Two blocks. Each block gets its own TipTap editor, its own ProseMirror instance, and its own Yjs text fragment. Here is the setup with collaboration enabled:
```js
import { Editor } from "@tiptap/core"
import StarterKit from "@tiptap/starter-kit"
import Collaboration from "@tiptap/extension-collaboration"
import { HocuspocusProvider } from "@hocuspocus/provider"
import * as Y from "yjs"

// Yjs document + WebSocket connection to the sync server
const ydoc = new Y.Doc()
const provider = new HocuspocusProvider({
  url: "wss://sync.example.com",
  name: "doc-abc", // document ID
  document: ydoc,
})

// Block b1: heading
const headingEditor = new Editor({
  element: document.querySelector("#heading"),
  extensions: [
    StarterKit,
    Collaboration.configure({ document: ydoc, field: "block-b1" }),
  ],
})

// Block b2: paragraph
const paragraphEditor = new Editor({
  element: document.querySelector("#paragraph"),
  extensions: [
    StarterKit,
    Collaboration.configure({ document: ydoc, field: "block-b2" }),
  ],
})
```
The line Collaboration.configure({ document: ydoc, field: "block-b2" }) is what connects ProseMirror to Yjs. It installs the y-prosemirror binding, which automatically converts every ProseMirror transaction into Yjs CRDT operations. No manual conversion code needed.
What happens when Alice bolds "using OAuth":
Alice selects "using OAuth" (positions 28-40 in block b2) and clicks the Bold button.
```js
document.getElementById("boldBtn").onclick = () => {
  paragraphEditor.commands.toggleBold()
}
// The developer writes this one line. Everything below is automatic.

// Layer 2: ProseMirror
// The editor already knows the selection from the browser.
//   EditorState.selection = { from: 28, to: 40 }
// TipTap creates a ProseMirror transaction:
//   tr.addMark(28, 40, schema.marks.bold.create())

// Layer 3: Document tree update
// Before: paragraph → text("Build authentication service using OAuth")
// After:  paragraph → text("Build authentication service ") + bold("using OAuth")

// Layer 4: Yjs (automatic via y-prosemirror binding)
// The binding intercepts the ProseMirror transaction and converts it:
//   Y.Text.format(28, 12, { bold: true })
// A CRDT operation is created with a unique ID (clock, alice).

// Layer 5: Network (automatic via HocuspocusProvider)
// Yjs encodes the operation as a binary update (~40 bytes).
// HocuspocusProvider sends it over WebSocket to the server.
// Server persists to op log, broadcasts to all other clients.
// Bob's Yjs merges the bold mark. His ProseMirror renders it.
```
The developer writes one line: editor.commands.toggleBold(). ProseMirror tracks the selection. The y-prosemirror binding converts to CRDT operations. HocuspocusProvider handles the network. Four layers, but the developer only touches the top one.
Concurrent edit: Alice bolds while Bob types
While Alice bolds "using OAuth", Bob types " v2.0" at the end of the same sentence. Both edits happen simultaneously.
- Alice's operation: Y.Text.format(28, 12, { bold: true }) with a Peritext "no-expand" boundary at the right edge of "OAuth"
- Bob's operation: inserts " v2.0" as a new CRDT item with originLeft pointing to the 'h' in "OAuth"
Result on both clients: "Build authentication service using OAuth v2.0"
Bob's " v2.0" appears unbolded because the Peritext no-expand boundary at the end of Alice's bold range prevents new text from inheriting the mark. No transforms needed. Each operation references stable CRDT item IDs, not integer positions, so they merge independently regardless of arrival order.
5. High-Level Architecture
5.1 Bird's-Eye View
5.2 Component Glossary
(1) WebSocket Connection. Client connects through a load balancer to a WebSocket gateway. The load balancer uses document ID-based routing (consistent hashing) so all editors of a single document land on the same gateway instance when possible. In practice, the "gateway" and "sync server" can be a single Hocuspocus process: Hocuspocus is itself a WebSocket server that handles TLS termination (behind NGINX), auth hooks, and Yjs sync in one process. The diagram separates them to show the logical responsibilities. At scale, you may split them: a stateless gateway tier for TLS and connection management, proxying to a stateful sync tier that holds Yjs documents in memory. The inter-tier protocol would be WebSocket passthrough (the gateway forwards raw WebSocket frames to the sync server after auth validation).
(2) Auth and Sync. On connection, the gateway validates the JWT token with the Auth Service. The sync server loads the Yjs document state (from Redis if hot, PostgreSQL snapshot if cold) and runs the Yjs sync protocol (state vector exchange) to bring the client up to date. Cold document loads (no Redis cache, loading from PostgreSQL snapshot + replaying operations) can take 500ms-1s for large documents. The <100ms sync latency target applies to steady-state incremental updates, not cold opens.
(3) Real-time Broadcast. Local edits are sent as binary Yjs updates over WebSocket. The sync server persists the update and broadcasts it via Redis Pub/Sub to other sync server instances, which relay to their connected clients.
(4) Persistence. Every Yjs update is appended to the operation log in PostgreSQL. Periodic snapshots (full Yjs document binary) are saved every 500 operations or every 5 minutes, whichever comes first. Hot document state is cached in Redis.
(5) Media. Images and files are uploaded to S3 via presigned URLs. The media node in the document tree references the S3 key. CloudFront CDN caches media at the edge.
(6) REST API. Document CRUD, sharing, permission management, version history browsing, and search all go through the REST API layer. These are not real-time operations and do not use WebSocket.
6. Back-of-the-Envelope Estimation
Documents
Total documents in storage: 1B
Active documents (open right now): 10M
Average concurrent editors per active doc: 3
Total WebSocket connections: 30M (10M × 3)
Peak concurrent editors on one doc: 100
Operations
Average ops/sec per active editor: 2 (typing + formatting; upper bound)
Total ops/sec (steady state): 60M (30M × 2; design ceiling)
Realistic ops/sec (many editors idle): 15-25M (most connections are idle readers)
Peak ops/sec (3x burst): 180M
Ops/sec on a busy single document: 200 (100 editors × 2 ops/sec)
Bandwidth
Average Yjs binary update size: 100 bytes
Steady-state bandwidth: 6 GB/sec (60M × 100 bytes)
Per-document bandwidth (100 editors): 20 KB/sec (trivial)
Per-WebSocket-server bandwidth: ~10 MB/sec (at 50K connections)
Storage
Average document content size: 10 KB
Document content (1B docs): 10 TB
Operation logs (avg 100 KB/doc): 100 TB (compacted)
Snapshots (avg 50 KB/doc): 50 TB
Media (avg 1 MB/doc, 30% have media): 300 TB
Total: ~460 TB
WebSocket Servers
Connections per server: 50K (practical limit with epoll/kqueue)
Servers needed: 600 (30M / 50K)
Sync Servers (Hocuspocus)
Active documents per instance: 10K (each doc held in memory ~50-200 KB)
Memory per instance: ~1-2 GB for document state (avg doc ~50-200 KB; a 1.5M-char doc with 100 editors can peak at ~50 MB)
Instances needed: 1,000 (10M / 10K)
PostgreSQL
Write throughput (op log): 60M inserts/sec (sharded across 100+ nodes, with batch writes; see Section 10.3)
Read throughput (snapshots): ~10K reads/sec (cold document loads)
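The sizing arithmetic above can be checked mechanically. This sketch just reproduces the document's own numbers; nothing here is new data:

```js
// Reproducing the back-of-the-envelope numbers from this section.
const activeDocs = 10e6;
const editorsPerDoc = 3;
const connections = activeDocs * editorsPerDoc;          // 30M WebSocket connections
const opsPerEditorPerSec = 2;
const opsPerSec = connections * opsPerEditorPerSec;      // 60M ops/sec design ceiling
const updateBytes = 100;
const bandwidthGBs = (opsPerSec * updateBytes) / 1e9;    // 6 GB/sec steady state
const connsPerServer = 50e3;
const wsServers = connections / connsPerServer;          // 600 WebSocket servers
const docsPerSyncInstance = 10e3;
const syncInstances = activeDocs / docsPerSyncInstance;  // 1,000 sync instances

console.log({ connections, opsPerSec, bandwidthGBs, wsServers, syncInstances });
```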
7. Data Model
7.1 Document Tree (ProseMirror Schema)
The document is a tree, not a flat string. ProseMirror defines a schema of allowed node types and their nesting rules:
json"type": "doc", "content": [ { "type": "heading", "attrs": { "level": 1 }, "content": [ { "type": "text", "text": "Project Roadmap" } ] }, { "type": "paragraph", "content": [ { "type": "text", "text": "The first milestone ", "marks": [{ "type": "bold" }] }, { "type": "text", "text": "is due next week." } ] } ] }
Yjs maps this tree structure using Y.XmlFragment (for block nodes) and Y.Text (for inline text with marks). Each node in the ProseMirror tree corresponds to a CRDT type in Yjs, enabling per-character and per-node conflict resolution.
7.2 CRDT Operation Format
Each Yjs item (the fundamental unit of the CRDT) carries:
{
id: { client: 42, clock: 157 }, // globally unique
originLeft: { client: 42, clock: 156 }, // left neighbor at insertion time
originRight: { client: 7, clock: 89 }, // right neighbor at insertion time
parent: "paragraph_node_id", // which Y.XmlFragment this belongs to
content: "hello" // the actual text (run-length encoded)
}
Binary encoding: Yjs uses a custom binary format that encodes items compactly. A single character insert is ~20-30 bytes. A run of 100 characters by the same user is ~120 bytes (not 2,000-3,000). Deleted items are tombstoned: the content field is cleared but the ID and origin pointers are retained for ordering.
7.3 Storage Schema (PostgreSQL)
```sql
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    title TEXT NOT NULL,
    owner_id UUID NOT NULL REFERENCES users(id),
    permission_default TEXT DEFAULT 'viewer', -- viewer, commenter, editor
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Yjs document snapshots (periodic full state)
CREATE TABLE document_snapshots (
    id BIGSERIAL PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id),
    snapshot_blob BYTEA NOT NULL, -- full Yjs binary state
    version BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Operation log (every Yjs update appended here)
CREATE TABLE operation_log (
    id BIGSERIAL PRIMARY KEY,
    document_id UUID NOT NULL,
    client_id BIGINT NOT NULL,
    op_binary BYTEA NOT NULL, -- Yjs binary update
    version BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY HASH (document_id); -- sharded by document

-- Document sharing / ACL
CREATE TABLE document_shares (
    document_id UUID NOT NULL REFERENCES documents(id),
    user_id UUID NOT NULL REFERENCES users(id),
    permission TEXT NOT NULL, -- owner, editor, commenter, viewer
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (document_id, user_id)
);
```
7.4 Media Schema
S3 key pattern: media/{document_id}/{media_id}.{extension}
Media is referenced in the document tree as an image or file node:
json"type": "image", "attrs": { "src": "media/doc-abc/img-123.png", "width": 800, "height": 600, "alt": "Architecture diagram" } }
The client resolves the S3 key to a CDN URL at render time.
8. API Design
8.1 REST Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/documents | Create a new document |
| GET | /api/documents/:id | Get document metadata (title, owner, permissions) |
| PATCH | /api/documents/:id | Update document metadata (title, settings) |
| DELETE | /api/documents/:id | Soft-delete a document |
| POST | /api/documents/:id/share | Add/update sharing permissions |
| DELETE | /api/documents/:id/share/:userId | Revoke a share |
| GET | /api/documents/:id/history | List version history (snapshots + named versions) |
| GET | /api/documents/:id/history/:version | Get document content at a specific version |
| POST | /api/documents/:id/history | Create a named version (user-triggered save point) |
| POST | /api/documents/:id/media | Upload media (returns presigned S3 URL) |
| GET | /api/search?q=term | Cross-document full-text search (ACL-filtered) |
8.2 WebSocket Protocol
Connection: wss://sync.example.com/doc/:id?token=<JWT>
The WebSocket connection follows the Yjs sync protocol:
| Message Type | Direction | Payload | Purpose |
|---|---|---|---|
| sync-step-1 | Client → Server | Client's state vector (binary) | "Here is what I have" |
| sync-step-2 | Server → Client | Missing updates (binary) | "Here is what the client is missing" |
| update | Bidirectional | Yjs binary update | Incremental edit broadcast |
| awareness | Bidirectional | { clientId, user, cursor, selection, color } | Presence and cursor tracking |
| token-refresh | Client → Server | New JWT | Client sends a fresh JWT before the current token expires. Server validates and updates the session without disconnecting |
| ping | Client → Server | (empty) | Heartbeat every 30s |
| pong | Server → Client | (empty) | Heartbeat response |
Initial sync flow:
- Client sends sync-step-1 with its state vector
- Server computes the diff and responds with sync-step-2 containing all missing updates
- Client sends sync-step-1 back (the server now acts as client to receive any updates the client has that the server is missing)
- Connection enters steady state: incremental update messages flow bidirectionally
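To make the state-vector diff concrete, here is a toy version in plain JavaScript. The op-log entries and vector shapes are illustrative; real Yjs exchanges compact binary encodings, not JSON:

```js
// Sketch of sync-step-1/2: the server compares the client's state vector
// against its op log and returns only the ops the client has not yet seen.
function missingOps(opLog, clientVector) {
  return opLog.filter(
    (op) => op.id.clock > (clientVector[op.id.client] ?? -1)
  );
}

const opLog = [
  { id: { client: 1, clock: 0 }, content: "H" },
  { id: { client: 1, clock: 1 }, content: "i" },
  { id: { client: 2, clock: 0 }, content: "!" },
];

// Client has seen client 1 up to clock 0 and nothing from client 2.
const diff = missingOps(opLog, { 1: 0 });
console.log(diff.map((op) => op.content)); // ["i", "!"]
```

The same function run in the other direction (server's vector against the client's local ops) yields the client's offline edits, which is how reconnection sync works with no extra machinery.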
9. Deep Dives
9.1 WebSocket Connection Lifecycle
Connection flow:
Reconnection: On disconnect, the client uses exponential backoff with jitter (initial delay 1s, max 30s, jitter factor 0.5). On reconnect, the client resumes from its last known state vector. The Yjs sync protocol handles catching up automatically. No operations are lost because the local Yjs document retains all state in IndexedDB.
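The backoff parameters above (1s initial, 30s cap, 0.5 jitter factor) can be sketched as a pure function. This is one common reading of a 0.5 jitter factor (up to +50% of the base delay, never exceeding the cap); the function name is illustrative:

```js
// Exponential backoff with jitter for WebSocket reconnection attempts.
function reconnectDelay(attempt, rand = Math.random()) {
  const cap = 30000;                 // 30s maximum
  const base = 1000 * 2 ** attempt;  // 1s, 2s, 4s, 8s, ...
  // jitter factor 0.5: add up to +50% of the base delay, capped at 30s
  return Math.min(base * (1 + 0.5 * rand), cap);
}

// attempt 0 → 1-1.5s, attempt 3 → 8-12s, attempt 10 → capped at 30s
console.log(reconnectDelay(0, 0));  // 1000
console.log(reconnectDelay(10, 1)); // 30000
```

Jitter matters here: without it, a sync server crash makes every client of that server reconnect at the same instant, producing a thundering herd.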
Connection routing: The load balancer uses consistent hashing on document_id to route all editors of a single document to the same sync server instance. This avoids cross-server coordination for most operations. When a sync server fails, connections rehash to a different instance, which loads the document from persistence.
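The routing property described above — every load balancer independently maps a document_id to the same sync server, and a server failure remaps only that server's documents — can be sketched with rendezvous (highest-random-weight) hashing, a close cousin of consistent hashing that needs less code. The hash function and names below are illustrative:

```js
// FNV-1a string hash (illustrative; any well-mixed hash works here).
function hash(str) {
  let h = 2166136261;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Rendezvous hashing: each document goes to the server with the highest
// combined hash. All gateways compute the same answer with no shared state.
function routeDocument(documentId, servers) {
  return servers.reduce((best, s) =>
    hash(documentId + "|" + s) > hash(documentId + "|" + best) ? s : best
  );
}

const servers = ["sync-1", "sync-2", "sync-3"];
const chosen = routeDocument("doc-abc", servers);
// All editors of "doc-abc" deterministically land on the same instance.
console.log(chosen === routeDocument("doc-abc", servers)); // true
```

When a server is removed from the list, only documents whose winner was that server get remapped; everything else keeps its existing placement.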
Heartbeat: Client sends a ping every 30 seconds. Server responds with pong. If no pong within 90 seconds, the client treats the connection as dead and initiates reconnection. The server evicts clients that miss 3 consecutive heartbeats.
9.2 CRDT Sync Protocol (Yjs Internals)
The state vector and sync protocol are described in Section 4.3.3. This section covers the implementation-level details that matter for production systems.
Incremental updates: After initial sync, every local edit is immediately encoded as a binary Yjs update and sent to the server. The server persists it, then broadcasts it to all other connected clients for that document. Each update is typically 20-100 bytes.
Document encoding efficiency: Yjs binary encoding typically lands at 1.5-3x the plain text size for a typical document. A 10 KB plain text document is 15-30 KB as a Yjs binary — remarkably compact for a structure that stores a unique ID and origin pointers for every insertion. Three optimizations make this possible:
- Run-length encoding for consecutive edits by the same client (typing "hello" stores as one item, not five)
- Delta encoding for lamport clocks (store increments, not absolute values)
- Compact varint encoding for IDs and positions
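The varint idea is worth seeing concretely. Below is a LEB128-style encoder/decoder in the spirit of what Yjs's codec does — illustrative, not the exact Yjs wire format:

```js
// Variable-length integer encoding: small numbers take one byte,
// larger numbers grow by 7 payload bits per byte.
function encodeVarint(n) {
  const bytes = [];
  do {
    let b = n & 0x7f;
    n >>>= 7;
    if (n !== 0) b |= 0x80; // continuation bit: more bytes follow
    bytes.push(b);
  } while (n !== 0);
  return bytes;
}

function decodeVarint(bytes) {
  let n = 0, shift = 0;
  for (const b of bytes) {
    n |= (b & 0x7f) << shift;
    shift += 7;
    if ((b & 0x80) === 0) break;
  }
  return n >>> 0;
}

console.log(encodeVarint(157));               // [157, 1] — two bytes, not four or eight
console.log(decodeVarint(encodeVarint(157))); // 157
```

Combined with delta encoding, most clocks in a Yjs update encode as a single byte, which is where much of the 20-30 bytes-per-insert figure comes from.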
Tombstone garbage collection: When a character is deleted, Yjs tombstones it (marks it as deleted but retains the ID and origin pointers). Tombstones accumulate over time. Yjs can garbage-collect tombstones when all connected clients have integrated the deletion. In practice, GC runs during snapshot creation: the snapshot stores only live content plus a compressed tombstone summary. Clients that reconnect after a GC cycle receive the full snapshot instead of incremental updates.
9.3 Conflict Resolution Deep Dive
Scenario: Three users edit the same sentence simultaneously.
Document: "The quick brown fox". Alice, Bob, and Carol all edit at the same time.
- Alice inserts " lazy" before "fox", targeting position between "brown " and "fox"
- Bob deletes "quick " (characters 4-9)
- Carol bolds the word "brown"
Each operation in CRDT terms:
- Alice's insert: Creates items with originLeft pointing to the space after "brown" and originRight pointing to 'f' in "fox". These pointers are stable item IDs, not integer positions.
- Bob's delete: Tombstones the items corresponding to "quick ". The items still exist in the CRDT with their IDs and origin pointers intact, but their content is marked as deleted.
- Carol's formatting: Uses the Peritext approach. The bold mark starts at the 'b' in "brown" with an "expand" boundary (new characters typed at the end inherit the mark) and ends after 'n' with a "no-expand" boundary.
Merge result regardless of arrival order: "The brown lazy fox" with "brown" in bold. Alice's "lazy" appears because its origin pointers still reference live items. Bob's deletion removes "quick " without affecting Alice's or Carol's operations. Carol's bold mark applies to "brown" even though Bob's delete changed the text around it. Zero transforms needed. Each operation references stable item IDs, so the merge function handles all three independently.
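The whole merge rests on one mechanism: deterministic integration of items by stable IDs. Here is a deliberately minimal sequence-CRDT sketch in plain JavaScript — RGA-style rather than full YATA (no originRight, no deletes), just enough to show order-independent convergence. All names are illustrative:

```js
const idEq = (a, b) => a.client === b.client && a.clock === b.clock;

// Insert `item` into `items` after its originLeft. Concurrent siblings that
// share the same origin are ordered by client ID, so every replica agrees.
function integrate(items, item) {
  const originIdx = item.originLeft
    ? items.findIndex((i) => idEq(i.id, item.originLeft))
    : -1;
  let idx = originIdx + 1;
  while (
    idx < items.length &&
    item.originLeft && items[idx].originLeft &&
    idEq(items[idx].originLeft, item.originLeft) &&
    items[idx].id.client < item.id.client
  ) idx++;
  items.splice(idx, 0, item);
}

const A = { id: { client: 1, clock: 0 }, originLeft: null, content: "A" };
const X = { id: { client: 1, clock: 1 }, originLeft: A.id, content: "X" }; // Alice
const Y = { id: { client: 2, clock: 1 }, originLeft: A.id, content: "Y" }; // Bob

const render = (items) => items.map((i) => i.content).join("");

// Replica 1 receives Alice's insert first; replica 2 receives Bob's first.
const r1 = [A]; integrate(r1, X); integrate(r1, Y);
const r2 = [A]; integrate(r2, Y); integrate(r2, X);
console.log(render(r1), render(r2)); // AXY AXY — identical, no transforms
```

Real YATA additionally uses originRight and a subtler conflict-resolution rule to avoid interleaving longer concurrent runs, but the convergence argument is the same: same items plus same deterministic ordering equals same document.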
9.4 Presence and Cursor Tracking
Yjs awareness protocol: Each connected client publishes an awareness state object:
json"clientId": 42, "user": { "name": "Alice", "color": "#FF6B6B" }, "cursor": { "anchor": { "type": "relative", "item": "(155, alice)", "assoc": -1 }, "head": { "type": "relative", "item": "(160, alice)", "assoc": 1 } } }
assoc indicates cursor affinity: -1 associates with the character to the left (end of a word), 1 with the character to the right (start of a word). This determines where new text appears when another user inserts at the same position.
Cursor positions use Yjs relative positions, not integer offsets. A relative position references a specific CRDT item ID. When remote edits insert or delete text around the cursor, the cursor position remains correct because it is anchored to an item, not an index. In OT, cursor positions must be transformed against every incoming operation.
Throttling: Awareness updates are sent at most every 50ms (20 updates/sec). This limits bandwidth while keeping cursor movement smooth. At 100 editors, each cursor move generates 99 broadcasts. At 50ms throttling, that is 99 × 20 = 1,980 presence messages/sec per document. Small payload (~200 bytes each), so ~400 KB/sec per document. Manageable.
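The 50ms throttle can be sketched as a small gate function; the clock is injected so the behavior is deterministic and testable. Names are illustrative:

```js
// At most one awareness broadcast per interval; updates inside the window
// are dropped (a real implementation would keep the latest suppressed
// update and flush it when the window reopens).
function makeThrottle(intervalMs, clock) {
  let lastSent = -Infinity;
  return () => {
    if (clock.now - lastSent < intervalMs) return false; // drop
    lastSent = clock.now;
    return true; // forward this awareness update
  };
}

const clock = { now: 0 };
const shouldSend = makeThrottle(50, clock);

clock.now = 0;  console.log(shouldSend()); // true  — first update goes out
clock.now = 20; console.log(shouldSend()); // false — inside the 50ms window
clock.now = 60; console.log(shouldSend()); // true  — window elapsed
```

Flushing the last suppressed update matters in practice: without it, a cursor that stops moving mid-window would be rendered at a stale position on every other screen.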
Cross-server relay: When editors of the same document are split across multiple sync server instances (due to load balancer rebalancing or failover), presence updates are relayed via Redis Pub/Sub. Each sync server subscribes to a presence channel keyed by document ID.
Ephemeral: Presence data is never persisted. If a sync server restarts, connected clients re-publish their awareness state within one heartbeat cycle (30s). Disconnected clients are automatically removed after the heartbeat timeout (90s).
9.5 Offline Editing and Sync
Client-side persistence: The Yjs document state is continuously persisted to IndexedDB. Every local edit updates IndexedDB within a debounced window (100ms). On page reload or app restart, the Yjs document is reconstructed from IndexedDB, not fetched from the server.
Offline workflow:
- WebSocket disconnects (network loss, airplane mode, etc.)
- The editor remains fully functional. Every keystroke creates CRDT items in the local Yjs document.
- IndexedDB stores the growing document state.
- A local queue tracks which updates have not been sent to the server.
Reconnection:
- WebSocket reconnects.
- Yjs sync protocol resumes: client sends its state vector, server responds with missed updates, client sends its offline edits.
- Both sides merge automatically using CRDT rules.
- Within one sync round-trip, the client and server converge.
Conflict visibility: When Alice reconnects after a long offline period and other users have edited the same regions she edited, the CRDT merge produces a technically correct result. But the merged text might not be semantically meaningful (both Alice and Bob rewrote the same sentence differently). The client can detect "conflicting regions" by comparing the merged document against the offline-start version and highlighting areas that changed from both local and remote edits. This is a UI hint, not a merge failure.
Edge case: Two users both offline for hours, both extensively editing the same paragraph. CRDT convergence is guaranteed, but the result interleaves their edits at the character level. Mitigation: the presence system shows that the other user was recently active in the same region, discouraging simultaneous offline edits. Post-merge, the document shows a clear diff of what changed, letting users clean up the result manually.
9.6 Version History and Snapshots
Operation log: Every Yjs update (binary blob, typically 20-100 bytes) is appended to the operation_log table in PostgreSQL. This is an append-only log, partitioned by document_id for write distribution.
Snapshots: A full Yjs document binary (the complete CRDT state) is saved to document_snapshots:
- Every 500 operations (with a minimum 60-second cooldown per document), or
- Every 5 minutes during active editing, or
- On explicit user action ("save as named version")
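The three triggers combine into a single decision function. A sketch, with the thresholds taken from the policy above and an illustrative name:

```js
// Snapshot policy: 500 ops (with a 60s cooldown), 5 minutes of active
// editing, or an explicit "save as named version" — whichever fires first.
function shouldSnapshot({ opsSinceSnapshot, msSinceSnapshot, userRequested }) {
  if (userRequested) return true;                                // named version
  if (opsSinceSnapshot >= 500 && msSinceSnapshot >= 60_000) return true;
  if (msSinceSnapshot >= 300_000 && opsSinceSnapshot > 0) return true;
  return false;
}

console.log(shouldSnapshot({ opsSinceSnapshot: 600, msSinceSnapshot: 30_000 }));  // false: cooldown
console.log(shouldSnapshot({ opsSinceSnapshot: 600, msSinceSnapshot: 90_000 }));  // true: op threshold
console.log(shouldSnapshot({ opsSinceSnapshot: 10, msSinceSnapshot: 310_000 }));  // true: 5-min timer
```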
A snapshot captures the entire document state at a point in time. Loading a snapshot reconstructs the full Yjs document without replaying any operations.
Compaction: Operation log entries older than the most recent snapshot can be archived to S3 (cold storage). The hot operation log in PostgreSQL only needs to contain entries since the last snapshot. This bounds PostgreSQL storage growth.
Version browsing: To reconstruct the document at any past version:
- Load the nearest snapshot before the target version.
- Replay operations from the operation log forward to the target version.
- Render the resulting Yjs document state in a read-only editor.
Same pattern as database point-in-time recovery: the snapshot is a base backup, the operation log is the WAL.
Named versions: Users can trigger "Save version" (like Google Docs "Name this version"). This creates a snapshot with user-provided metadata (name, description). Named versions are never compacted.
Undo/redo: Yjs provides UndoManager, which tracks items created or deleted by the local user. Undo generates inverse operations (re-insert deleted items, delete inserted items) and applies them as new CRDT operations. This means undo is per-user: Alice undoing her last action does not affect Bob's edits. The undo history is local to each client and not persisted.
9.7 Search and Indexing
In-document search (Ctrl+F): Purely client-side. The TipTap editor searches the local ProseMirror document state using standard text matching. No server involvement. Fast regardless of document size because the full document is already in memory.
Cross-document search: Elasticsearch indexes document content for full-text search across all documents.
Indexing pipeline:
- A Yjs update arrives at the sync server.
- The server debounces (waits 30 seconds of inactivity or a maximum of 2 minutes after the first change).
- The server extracts plain text from the current Yjs document state.
- The plain text is indexed to Elasticsearch with metadata: { document_id, title, owner_id, updated_at }.
ACL-aware search: Search queries include the requesting user's ID. Elasticsearch filters results against precomputed access lists. Each document's index entry includes a list of user IDs and group IDs that have at least viewer permission. The query adds a filter: user_id IN accessible_ids OR group_id IN user_groups.
Incremental vs. full reindexing: Incremental reindexing (update only the changed document) handles steady-state traffic. A full reindex job runs weekly to catch any missed updates (safety net). The full reindex reads snapshots from PostgreSQL, extracts text, and bulk-indexes to Elasticsearch.
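The debounce-with-max-wait rule from the pipeline above reduces to two comparisons. A sketch with an illustrative name:

```js
// Reindex after 30s of editing inactivity, but never wait more than
// 2 minutes past the first unindexed change (max-wait safety cap).
function shouldReindex({ msSinceLastEdit, msSinceFirstUnindexedEdit }) {
  if (msSinceLastEdit >= 30_000) return true;            // editing went quiet
  if (msSinceFirstUnindexedEdit >= 120_000) return true; // cap for busy documents
  return false;
}

console.log(shouldReindex({ msSinceLastEdit: 31_000, msSinceFirstUnindexedEdit: 31_000 }));  // true
console.log(shouldReindex({ msSinceLastEdit: 5_000, msSinceFirstUnindexedEdit: 60_000 }));   // false
console.log(shouldReindex({ msSinceLastEdit: 5_000, msSinceFirstUnindexedEdit: 125_000 }));  // true
```

Without the max-wait cap, a document under continuous heavy editing would never go quiet and never get reindexed.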
9.8 Rich Text and Document Schema
ProseMirror schema defines the allowed structure:
Block nodes: doc, paragraph, heading (levels 1-6), blockquote, code_block, image, table, table_row, table_cell, table_header, bullet_list, ordered_list, list_item, horizontal_rule
Marks (inline formatting): bold, italic, underline, strikethrough, link (with href), code, comment (with metadata)
Schema enforcement: ProseMirror validates every transaction (insert, delete, format) against the schema before it is applied. Invalid operations (e.g., a heading inside a table cell that does not allow headings) are rejected client-side before reaching the CRDT layer. This prevents malformed document states.
Defense in depth: Client-side validation is not sufficient because a malicious client can bypass ProseMirror and send crafted Yjs binary updates directly. The sync server validates incoming updates against a schema allowlist before relaying or persisting them. Raw HTML nodes, script tags, and event handler attributes are stripped. This prevents XSS injection via the CRDT layer.
Tables: Each table cell contains a mini-document (a Y.XmlFragment). Concurrent edits to different cells are fully independent. Structural changes (add row, delete column) are atomic operations that modify the table node itself. Concurrent structural changes (Alice adds a row, Bob deletes a column) are resolved by CRDT ordering: both operations apply, producing a table with the new row and without the deleted column.
Images: Upload flow:
- Client requests a presigned S3 upload URL via POST /api/documents/:id/media.
- Client uploads the image directly to S3 using the presigned URL.
- Client inserts an image node into the ProseMirror document referencing the S3 key.
- Other clients see the image node immediately (the S3 key resolves to a CDN URL).
Comments: Stored as marks on text ranges. Each comment mark carries metadata:
json"type": "comment", "attrs": { "commentId": "cmt-456", "author": "alice", "createdAt": "2026-03-10T14:22:00Z", "resolved": false } }
Comment positions survive concurrent edits because the mark is attached to specific CRDT items (characters), not integer ranges. If the commented text is deleted, the comment mark is tombstoned with the text. The comment thread is preserved in PostgreSQL (linked by commentId) and can still be viewed in the document's comment history, even if its anchor text is gone. If the deletion is undone, the comment mark reappears on the restored text.
Comment thread storage:
```sql
CREATE TABLE comments (
    id UUID PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id),
    comment_id TEXT NOT NULL, -- matches commentId in the ProseMirror mark
    parent_id UUID REFERENCES comments(id), -- NULL for top-level, set for replies
    author_id UUID NOT NULL REFERENCES users(id),
    body TEXT NOT NULL,
    resolved BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
API endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/documents/:id/comments | Create a comment (inserts mark into CRDT + row into PostgreSQL) |
| GET | /api/documents/:id/comments | List all comment threads for a document |
| POST | /api/documents/:id/comments/:commentId/reply | Reply to a comment thread |
| PATCH | /api/documents/:id/comments/:commentId/resolve | Resolve or reopen a comment thread |
Notifications: When a comment mentions another user (@alice), the API service publishes a notification event. Comment creation and resolution events are also candidates for webhook delivery to external integrations (Slack, email, Zapier).
9.9 Multi-Tenancy and Rate Limiting
Tenant isolation: Each document is a separate Yjs document instance. There is no shared CRDT state across documents. ACLs are enforced at the sync server: before relaying any Yjs update, the server verifies the sending client has editor permission on the document. Viewers receive updates but cannot send them.
Rate limiting:
| Limit | Value | Enforcement Point |
|---|---|---|
| Operations per user per second | 100 | Sync server |
| WebSocket messages per second | 200 | WebSocket gateway |
| Document size (characters) | 1.5M | Sync server (reject ops exceeding limit) |
| Editors per document | 100 | Sync server (reject connections) |
| Viewers per document | 200 | Sync server (reject connections) |
| Media upload size | 25 MB | Media service |
| Media uploads per document per hour | 50 | Media service |
Document size enforcement: The sync server tracks the approximate character count of each active document. When an incoming update would push the document beyond 1.5M characters, the server rejects it and sends an error to the client. The client displays a "document size limit reached" warning.
Abuse protection: If a single client exceeds the operation rate limit, the server drops excess operations and sends a rate-limit warning. Persistent abuse (10+ warnings in 1 minute) triggers a temporary connection ban (5 minutes).
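The op rate limit and ban escalation above can be sketched as a token bucket with a sliding warning window. This is a minimal sketch under the stated limits (100 ops/sec, 10 warnings in 1 minute, 5-minute ban); the class and method names are assumptions, not the real server's API.

```typescript
// Per-client limiter sketch: token bucket for the 100 ops/sec cap, plus a
// warning counter that escalates to a temporary connection ban.
class OpLimiter {
  private tokens: number;
  private lastRefill: number;
  private warnings: number[] = []; // timestamps of recent rate-limit warnings
  bannedUntil = 0;

  constructor(private ratePerSec = 100, private burst = 100, now = Date.now()) {
    this.tokens = burst;
    this.lastRefill = now;
  }

  // Returns true if the operation may proceed, false if it must be dropped.
  tryConsume(now = Date.now()): boolean {
    if (now < this.bannedUntil) return false;
    // Refill proportionally to elapsed time, capped at the burst size.
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    // Over the limit: record a warning; 10+ warnings in 1 minute => 5-minute ban.
    this.warnings = this.warnings.filter((t) => now - t < 60_000);
    this.warnings.push(now);
    if (this.warnings.length >= 10) this.bannedUntil = now + 5 * 60_000;
    return false;
  }
}
```

Injecting `now` keeps the limiter deterministic and unit-testable; in production the timer source would simply default to the wall clock.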
9.10 GDPR and Data Deletion
CRDT items permanently embed the originating client ID. When a user requests account deletion under GDPR (right to erasure), the system must anonymize their contributions without breaking CRDT invariants. The deletion pipeline rewrites every snapshot and op-log entry that contains the user's client ID, replacing it with anon-<hash>. The CRDT only requires unique IDs, not identifiable ones, so document structure and merge correctness are preserved. Rewriting snapshots is expensive: every snapshot containing the user's edits must be re-serialized. Deletions are batched and processed nightly. During the processing window, the user's data is marked as "pending deletion" and excluded from search results and API responses.
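The anonymization step can be sketched with a deterministic one-way hash, assuming Node's built-in crypto module. The CRDT only needs IDs to stay unique and stable, so hashing every occurrence of the client ID the same way preserves merge correctness while removing identifiability. The function name and the `deletionSalt` parameter are illustrative assumptions.

```typescript
import { createHash } from "node:crypto";

// Same input always maps to the same anon ID, so every occurrence of the
// user's client ID across snapshots and op-log entries rewrites consistently.
function anonymizeClientId(clientId: string, deletionSalt: string): string {
  const digest = createHash("sha256")
    .update(deletionSalt + ":" + clientId)
    .digest("hex");
  return "anon-" + digest.slice(0, 16); // anon-<hash>, truncated for readability
}
```

A per-deletion salt (discarded once the nightly batch completes) would make the mapping irreversible even against brute-force guessing of client IDs.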
9.11 Undo/Redo
Yjs provides an UndoManager that scopes undo to the local user's operations. Each client maintains its own undo stack, tracking which CRDT items were created or deleted by that user. Undoing an insert tombstones the inserted items; undoing a delete restores the tombstoned items. Other users' concurrent edits are unaffected. Undoing Alice's insertion does not touch Bob's edits, even if they are adjacent. The undo stack is stored client-side and survives page reloads via IndexedDB. In the collaborative context, this means each user has independent undo history: pressing Ctrl+Z only reverts your own changes, never a collaborator's.
9.12 AI-Assisted Editing
Every major collaborative editor now ships AI features: Google Docs "Help me write," Notion AI, Microsoft Loop Copilot. In a CRDT-based architecture, AI integration is architecturally clean because the AI is just another client.
How it works:
- User triggers an AI action (slash command `/ai`, toolbar button, or inline suggestion prompt).
- The client sends the request to an AI service via REST: `POST /api/documents/:id/ai` with the prompt, selected text range, and surrounding context. The client includes Yjs relative positions for the selection boundaries so the server can map them to the current document state.
- The AI service calls the LLM (streaming response). As tokens arrive, the service inserts them into the Yjs document as a dedicated AI client (with its own `clientId`). The AI's edits are CRDT operations like any other, meaning they merge correctly with concurrent human edits.
- Other editors see the AI-generated text appearing in real time, with an "AI writing" indicator in the presence system (the AI client has a distinct awareness state: `{ user: { name: "AI Assistant", isAI: true } }`).
- The user can accept, reject, or edit the AI output. Rejecting is an undo of the AI client's operations via `UndoManager` scoped to the AI's `clientId`.
Rate limiting: AI requests are rate-limited per user (5 requests/minute) and per document (20 requests/minute) to manage LLM costs. Long-running generations are capped at 60 seconds.
Why this works cleanly with CRDTs: The AI does not need special merge logic. Its edits have unique CRDT IDs like any other client. If the user types while the AI is generating, both sets of edits merge automatically. In an OT system, the AI's streaming insertions would need to be transformed against every concurrent human edit in real time, adding significant complexity.
10. Identify Bottlenecks
10.1 WebSocket Connection Thundering Herd
Symptoms: During a deployment or sync server restart, all connected clients disconnect and reconnect simultaneously. 50,000 connections hitting a single server in a 5-second window causes CPU saturation from TLS handshakes and Yjs sync-step-1 processing.
Mitigation: Graceful drain during deployments (stop accepting new connections, wait 60s for existing ones to close naturally). Client-side reconnection uses jitter (random delay 0-5s) to spread the reconnection wave. The load balancer distributes reconnections across healthy instances.
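The client-side reconnection delay can be sketched as exponential backoff plus the 0-5s jitter described above. The function name, base delay, and 30s cap are illustrative assumptions; only the jitter range comes from the text.

```typescript
// Reconnection delay sketch: exponential backoff with random jitter so a
// restarted server is not hit by all clients in the same instant.
function reconnectDelayMs(attempt: number, rand: () => number = Math.random): number {
  const base = Math.min(30_000, 1_000 * 2 ** attempt); // 1s, 2s, 4s... capped at 30s
  const jitter = rand() * 5_000; // 0-5s spread to break up the thundering herd
  return base + jitter;
}
```

Passing the random source as a parameter keeps the function deterministic in tests while defaulting to `Math.random` in production.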
10.2 Sync Server Memory Pressure
Symptoms: A document with 100 active editors and 1.5M characters consumes ~50 MB of RAM as a Yjs document instance. If the sync server holds 10,000 active documents and several are large, memory usage spikes past the instance limit.
Mitigation: Evict idle documents from memory after 5 minutes of no activity (reload from snapshot on next access). Set a per-document memory cap (100 MB). For extremely large documents, stream the Yjs state from Redis instead of holding it entirely in process memory. Monitor per-instance document count and memory, auto-scale sync server instances based on memory pressure.
10.3 Operation Log Write Throughput
Symptoms: At 60M ops/sec globally (design ceiling; realistic steady state is 15-25M), PostgreSQL must sustain up to 60M inserts/sec into the operation log. A single PostgreSQL instance handles ~50K inserts/sec at most.
Mitigation: The operation log table is hash-partitioned by document_id across 100+ PostgreSQL shards, bringing each shard down to ~600K rows/sec. That still exceeds row-at-a-time insert capacity, so the sync server batches writes: it buffers operations for 100ms and inserts them in bulk (500-1,000 rows per batch insert), which turns ~600K rows/sec into roughly 600-1,200 bulk inserts/sec per shard, well within capacity.
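The buffering half of this mitigation can be sketched as a small batcher. This is a storage-agnostic sketch: `OpBatcher` and its callback are assumptions, and the real server would wire `flush()` to a 100ms timer and a bulk `INSERT`.

```typescript
// Write-buffer sketch: operations accumulate and are flushed as one bulk
// insert per batch instead of one insert per row.
type Op = { docId: string; payload: Uint8Array };

class OpBatcher {
  private buffer: Op[] = [];

  constructor(
    private flushFn: (batch: Op[]) => void, // e.g. a multi-row INSERT
    private maxBatch = 1000,
  ) {}

  add(op: Op): void {
    this.buffer.push(op);
    if (this.buffer.length >= this.maxBatch) this.flush(); // size-triggered flush
  }

  // Called by a 100ms timer in the real server; exposed here for clarity.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.flushFn(batch); // one bulk write instead of batch.length row writes
  }
}
```

Flushing on whichever comes first, size cap or timer, bounds both latency (100ms worst case) and memory (maxBatch rows per document buffer).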
10.4 Snapshot Computation Cost
Symptoms: Serializing a 1.5M-character Yjs document into a binary snapshot takes 50-200ms of CPU time. If 10,000 documents simultaneously trigger a snapshot (e.g., all hitting the 5-minute timer), the sync server stalls.
Mitigation: Stagger snapshot timers with random jitter (5 minutes ± 60 seconds). Run snapshot serialization on a background thread (Node.js worker thread) to avoid blocking the event loop. Limit concurrent snapshots to 10 per sync server instance.
10.5 Elasticsearch Indexing Lag
Symptoms: During heavy editing, documents change faster than the 30-second debounce window. Search results show stale content. Users search for text they just typed and find nothing.
Mitigation: The 30-second debounce is a deliberate tradeoff (search freshness vs. indexing load). For most use cases, a 30-second delay is acceptable. For users who need instant search, in-document Ctrl+F (client-side) provides real-time results. Cross-document search is expected to be near-real-time, not real-time.
10.6 Redis Pub/Sub Fan-Out for Presence
Symptoms: 100 editors per document, each sending cursor updates at 20/sec. Each update fans out to 99 other editors: 2,000 publishes/sec and nearly 200,000 fan-out deliveries/sec for a single fully loaded document. Across 10M active documents (average 3 editors): ~1.2M presence messages/sec through Redis.
Mitigation: Presence is fan-out at the sync server level, not Redis level. Each sync server subscribes once per document to a Redis channel. The server handles local fan-out to its connected clients. Redis sees one publish and N subscriber deliveries (where N = number of sync server instances hosting that document, typically 1-3, not 99).
10.7 S3 Media Upload Latency
Symptoms: User inserts an image. The document references the S3 key immediately, but the image is not yet uploaded. Other editors see a broken image placeholder for 2-5 seconds until the upload completes.
Mitigation: The client shows a local preview (from the file picker) while uploading. The S3 key is not inserted into the document until the upload completes successfully. Other editors see the image node appear only after it is available. Upload progress is shown to the inserting user.
10.8 Cross-Region WebSocket Latency
Symptoms: Alice in London edits a document hosted on a sync server in US-East. Her operations travel across the Atlantic (70-100ms one way). Bob in New York sees Alice's edits in ~70ms, but Alice sees Bob's edits in ~140ms (round trip through the server). The 300ms cross-region SLA is met, but the experience feels sluggish.
Mitigation: Route documents to the region where the majority of active editors are located. If Alice is the only London-based editor and 9 others are in New York, the document stays in US-East. If the editing pattern shifts (London team takes over), migrate the document's primary sync server to EU-West. Migration: snapshot the Yjs state, load it on the new region's sync server, redirect WebSocket connections.
11. Failure Scenarios
11.1 Sync Server Crash
Impact: All active documents on that instance lose their in-memory Yjs state. Connected clients disconnect.
Recovery:
- The load balancer detects the failed instance within 10 seconds (health check failure).
- Reconnecting clients are routed to a healthy sync server instance (consistent hashing rehash).
- The new instance loads the latest snapshot from PostgreSQL and replays operations from the operation log since that snapshot.
- Clients run the Yjs sync protocol to converge.
- Zero user-visible data loss. The operation log contains all persisted updates. If the crashed server had buffered operations in its 100ms batch window, those operations survive in the clients' local Yjs documents (IndexedDB) and are resent on reconnect.
Time to recover: 15-30 seconds. Clients experience a brief disconnection, then resume editing.
11.2 Redis Failure
Impact: Hot document state cache is unavailable. Presence data is lost. Cross-server presence relay stops working.
Recovery:
- Sync servers fall back to loading Yjs state from PostgreSQL snapshots (slower, ~500ms vs. ~5ms from Redis).
- Presence stops updating across servers. Clients on the same sync server still see each other's cursors (local awareness still works).
- Redis recovers (failover to a replica or restart). Sync servers re-cache active documents. Clients re-publish awareness state.
Time to recover: With Redis Sentinel or Cluster, automatic failover takes 10-30 seconds. Presence gap during this window.
11.3 PostgreSQL Failure
Impact: Operation log writes fail. No new snapshots can be saved. Document metadata and ACLs are unavailable for new connections.
Recovery:
- Sync servers buffer incoming operations in memory (up to 10,000 per document or 5 minutes, whichever comes first).
- Active editing sessions continue because the Yjs document is held in memory on the sync server. No immediate user impact for already-connected editors.
- New connections fail (cannot load document or verify permissions). The client shows "temporarily unavailable" and retries.
- PostgreSQL recovers (failover to standby replica). Sync servers flush buffered operations to the operation log. Normal operation resumes.
Time to recover: Depends on PostgreSQL HA setup. With synchronous replication and automatic failover, 10-30 seconds. Buffered operations cover the gap.
11.4 WebSocket Gateway Overload
Impact: New connections are rejected. Existing connections may be dropped due to resource exhaustion (file descriptors, memory).
Recovery:
- The auto-scaler adds gateway instances within 60 seconds.
- The load balancer redirects new connections to healthy instances.
- Dropped clients reconnect with jitter and are distributed across the expanded pool.
Prevention: Set connection limits per gateway instance (50K hard cap). Monitor connection counts and trigger scaling at 70% capacity.
11.5 Network Partition (Client to Server)
Impact: The client enters offline mode. Local edits continue without interruption.
Recovery:
- Client detects partition via heartbeat timeout (90 seconds).
- Client enters explicit offline mode. Edits are stored in IndexedDB.
- Client attempts reconnection with exponential backoff.
- On reconnect, the Yjs sync protocol merges all offline edits.
This is not a failure from the user's perspective. The editor works identically online and offline. The merge happens transparently.
11.6 Corrupted Yjs Document State
Impact: The in-memory Yjs document enters an inconsistent state (e.g., broken origin pointers due to a Yjs library bug). Edits produce garbled output.
Recovery:
- Detect: clients report document hash mismatches during sync (state vectors agree but rendered content differs).
- The sync server evicts the corrupted in-memory state.
- Reload from the latest known-good snapshot in PostgreSQL.
- Replay operations from the operation log since that snapshot.
- Force-resync all connected clients (server sends the full state as sync-step-2).
Prevention: Run integrity checks on snapshots before persisting (verify the Yjs document renders valid ProseMirror JSON). Keep the last 5 snapshots per document for rollback.
11.7 Split-Brain: Two Sync Servers Serving the Same Document
Impact: Two sync server instances both load the same document independently. Edits on one instance are not visible to editors on the other. The document forks.
Recovery:
- Detect: the document's Redis key is claimed by two different server instance IDs.
- One instance wins (compare instance startup timestamps, older wins).
- The losing instance drains its connections and redirects clients to the winning instance.
- The winning instance merges any operations it missed from the losing instance's operation log entries.
Prevention: Use Redis SET NX (set if not exists) with a TTL as a distributed lock when loading a document. If the lock is already held, redirect the connection to the lock-holding instance. Renew the lock every 30 seconds.
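The lock-acquisition step can be sketched against a minimal key-value interface whose `setNx` mirrors Redis's `SET key value NX EX ttl` semantics. Everything here is illustrative: `KvStore`, `claimDocument`, and the in-memory stub are assumptions standing in for a real Redis client.

```typescript
// Document-lock sketch for split-brain prevention: SET NX with a TTL.
interface KvStore {
  setNx(key: string, value: string, ttlSec: number): boolean; // true if acquired
  get(key: string): string | undefined;
}

// Try to claim a document for this sync server instance. On failure, the
// caller redirects the client to the instance named in the lock.
function claimDocument(kv: KvStore, docId: string, instanceId: string):
  { acquired: boolean; holder: string } {
  const key = `doclock:${docId}`;
  if (kv.setNx(key, instanceId, 30)) {
    return { acquired: true, holder: instanceId }; // renew every 30s while active
  }
  return { acquired: false, holder: kv.get(key) ?? "unknown" };
}

// Minimal in-memory stub so the sketch is runnable without Redis.
class MemoryKv implements KvStore {
  private m = new Map<string, string>();
  setNx(key: string, value: string, _ttlSec: number): boolean {
    if (this.m.has(key)) return false;
    this.m.set(key, value);
    return true;
  }
  get(key: string): string | undefined { return this.m.get(key); }
}
```

The TTL is the safety net: if the holding instance dies without releasing the lock, the key expires within 30 seconds and another instance can claim the document.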
11.8 S3 Unavailability
Impact: Media uploads fail. Existing media still served from CDN cache (CloudFront). New images cannot be embedded.
Recovery:
- The media service returns a retry-able error to the client.
- The client queues the upload and retries with backoff.
- The document can reference a placeholder image node until the upload succeeds.
- S3 recovers. Queued uploads complete. Placeholders resolve to actual images.
12. Observability
Key metrics to track:
| Metric | Description | Alert Threshold |
|---|---|---|
| ws_connections_total | Active WebSocket connections per gateway | Drop >20% in 5 min |
| ops_per_sec_per_doc | Operations per second per document | Drops to 0 for an active doc |
| sync_latency_p99 | Time from op send to remote client receive | > 500ms |
| document_size_bytes | Yjs document binary size in memory | > 100 MB per document |
| op_log_write_latency_p99 | PostgreSQL operation log insert latency | > 100ms |
| snapshot_duration_ms | Time to serialize a Yjs document snapshot | > 1,000ms |
| search_index_lag_sec | Time since last Elasticsearch index update | > 120s for active docs |
| presence_broadcast_latency | Redis Pub/Sub relay time | > 200ms |
| offline_reconnect_duration | Time from reconnect to full sync | > 10s |
Dashboards:
- System overview: Total connections, total ops/sec, active documents, per-region breakdown
- Per-document health: Ops/sec, connected editors, document size, last snapshot age
- Storage growth: Operation log size, snapshot storage, media storage, PostgreSQL shard utilization
- Sync performance: Sync latency heatmap (p50/p95/p99), reconnection success rate, offline sync merge times
Structured logging: Every WebSocket message, operation log write, snapshot, and auth event is logged with document_id, client_id, user_id, and region as structured fields. Log aggregation via VictoriaLogs or similar. Trace IDs propagated from client through gateway to sync server to PostgreSQL for end-to-end tracing of a single operation.
13. Deployment Strategy
Multi-region deployment:
Deploy sync server clusters in 3+ regions (e.g., US-East, EU-West, AP-Southeast). Each document has a primary region based on where the majority of its active editors are located. WebSocket connections route to the primary region.
US-East: 400 sync servers, 250 WS gateways
EU-West: 300 sync servers, 200 WS gateways
AP-SE: 300 sync servers, 150 WS gateways
Document region assignment:
- On creation: document is assigned to the creator's nearest region.
- During active editing: if >60% of active editors are in a different region for >10 minutes, trigger region migration.
- Migration: snapshot the Yjs state on the source region, transfer to the target region, load on the target sync server, redirect WebSocket connections. Brief interruption (~5 seconds) during migration.
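The migration trigger above reduces to a small decision function. This is a sketch: the >60% and >10 minute thresholds come from the text, but the function signature and region-count representation are assumptions.

```typescript
// Region-migration decision sketch: return the target region if another
// region has held a >60% majority of active editors for >10 minutes.
function shouldMigrate(
  editorsByRegion: Record<string, number>,
  currentRegion: string,
  minutesSustained: number,
): string | null {
  const total = Object.values(editorsByRegion).reduce((a, b) => a + b, 0);
  if (total === 0 || minutesSustained <= 10) return null;
  for (const [region, count] of Object.entries(editorsByRegion)) {
    if (region !== currentRegion && count / total > 0.6) return region;
  }
  return null; // no other region holds a >60% majority
}
```

The sustained-duration check is what prevents thrashing: a brief burst of editors from another region should not trigger a ~5-second migration interruption.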
Rolling upgrades:
- Drain a sync server instance (stop accepting new document loads, keep serving active documents).
- Wait for active documents to reach a quiet period (no ops for 30 seconds) or a maximum drain timeout (5 minutes).
- Snapshot all in-memory documents.
- Shut down the instance. Deploy the new version.
- Start the new instance. It picks up new document loads from the consistent hash ring.
Canary deployment: New versions deployed to 5% of sync server instances first. Monitor error rate, sync latency, and document integrity for 30 minutes before proceeding. Automated rollback if error rate exceeds baseline by 2x.
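The automated-rollback condition can be written down directly. The 2x threshold is from the text; the function name and the epsilon guard for a zero-error baseline are assumptions.

```typescript
// Canary rollback sketch: roll back when the canary's error rate exceeds
// the baseline fleet's error rate by more than 2x.
function shouldRollback(canaryErrorRate: number, baselineErrorRate: number): boolean {
  const floor = Math.max(baselineErrorRate, 1e-6); // avoid divide-by-zero on a clean baseline
  return canaryErrorRate / floor > 2;
}
```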
Stateless components (API gateway, auth service, media service, search indexer) use standard blue-green deployment. No drain needed because they hold no document state.
14. Security
Transport security: TLS 1.3 for all WebSocket and REST connections. Certificate pinning for mobile apps. WebSocket Secure (wss://) is the only allowed transport.
Authentication:
- JWT tokens issued at login via OAuth 2.0 (Google, GitHub, email/password).
- Token validated at WebSocket handshake (gateway extracts and verifies JWT before upgrading the connection).
- Tokens include `user_id`, `email`, and `expiry`. Short-lived (15 minutes) with refresh tokens.
- On token expiry during an active WebSocket session, the server requests a token refresh over the existing connection. No reconnection needed.
Authorization:
- Per-document ACL checked on every operation at the sync server.
- Permission levels: owner (full control), editor (read + write), commenter (read + comment marks), viewer (read only).
- Viewers receive Yjs updates (to see live edits) but their outgoing operations are rejected by the sync server.
- Permission changes propagate within 5 seconds (ACL cache TTL). This is a deliberate tradeoff favoring sync server performance over instant revocation. Reducing the TTL increases database load linearly.
Encryption at rest:
- AES-256 for document snapshots and operation logs in PostgreSQL (using pgcrypto or transparent data encryption).
- S3 server-side encryption (SSE-S3 or SSE-KMS) for media and cold storage.
Content validation: The sync server validates incoming Yjs updates against the ProseMirror schema. Operations that would produce an invalid document structure (e.g., script injection via a malformed text node) are rejected before being persisted or relayed.
Audit logging: All permission changes, document access events, share invitations, and administrative actions are logged to an append-only audit table. Retained for 1 year. Queryable for compliance and incident investigation.
Wrapping Up
The three problems from Section 1 drive every decision in this design. Consistency is handled by CRDTs: deterministic merge rules guarantee convergence without a central authority. Latency is handled by optimistic local application: every keystroke is final before it hits the network. Rich text is handled by the Peritext approach layered on top of ProseMirror's document tree, giving concurrent formatting edits the same convergence guarantees as text.
The hardest part is not any single component. It is the interaction between all of them: offline users returning with thousands of edits, tombstone garbage collection across clients that may not be online simultaneously, presence fan-out at 100 editors per document, and keeping the operation log from swamping PostgreSQL. The architecture above handles these interactions, but building it is a multi-year effort. If you are starting from scratch, Yjs + Hocuspocus + TipTap gives you a working collaborative editor in days. Scaling it to 10 million simultaneous documents is where the real engineering begins.
Explore the Technologies
| Technology / Pattern | Role in This System | Learn More |
|---|---|---|
| Yjs | CRDT library implementing the YATA algorithm for conflict-free document merging | Yjs |
| TipTap / ProseMirror | Rich text editor framework with Yjs bindings for collaborative editing | TipTap |
| Hocuspocus | Yjs-native WebSocket sync server with auth hooks and persistence adapters | Hocuspocus |
| WebSocket | Bidirectional real-time transport for editor sync and presence broadcasts | WebSocket |
| PostgreSQL | Document metadata, operation logs, snapshots, and ACLs | PostgreSQL |
| Redis | Hot document state, presence pub/sub, and distributed locking | Redis |
| CRDTs | YATA algorithm for conflict-free merging, Peritext for rich text marks | CRDTs |
| Consistent Hashing | Document-to-sync-server routing for WebSocket connection affinity | Consistent Hashing |
| Vector Clocks | Foundation for Yjs state vectors, tracking causal ordering across clients | Vector Clocks |
| Event Sourcing | Operation log as append-only source of truth, snapshots for fast recovery | Event Sourcing |
Further Reading
- Yjs Docs: Internals: How Yjs structures CRDT items, encoding format, and document state internals
- CRDT.tech: Comprehensive CRDT resource with papers, talks, and implementations (maintained by Martin Kleppmann et al.)
- Peritext: A CRDT for Rich-Text Collaboration: How to handle concurrent formatting in CRDTs (Ink & Switch, 2022)
- How Figma's Multiplayer Technology Works: Figma's custom CRDT-inspired approach with a centralized server for ordering
- CRDTs Go Brrr: Performance comparison of CRDT implementations (Diamond Types, Yjs, Automerge) by Seph Gentle
- CRDT Benchmarks: Community benchmark suite comparing Yjs, Automerge, and other CRDT libraries