System Design: Real-Time Collaborative Editor
Goal: Build a real-time collaborative text editor that supports 100 concurrent editors per document, 10 million simultaneously active documents, and 1 billion total documents. Handle rich text formatting, offline editing, version history, and presence indicators. Merge concurrent edits from multiple users without conflicts, data loss, or perceptible latency.
Reading guide: This post covers the full architecture of a real-time collaborative editor, from CRDT algorithms to block-based editing paradigms. It is long and detailed. Reading it linearly is not required.
Sections 1-8: Core architecture, data model, and API design
Section 4: CRDT vs OT deep dive (the heart of this design, both algorithms explained end-to-end with concrete examples)
Section 9: Implementation deep dives (WebSocket lifecycle, sync protocol, presence, offline, version history, search, rich text, multi-tenancy, GDPR, undo/redo, AI-assisted editing)
Sections 10-14: Operational concerns (bottlenecks, failures, observability, deployment, security)
New to collaborative editing? Start with Sections 1-4. Section 4 walks through both OT and CRDTs with character-level examples.
Building something similar? Sections 4-9 have the algorithm details, sizing math, and implementation deep dives. Jump straight to Section 9 if you already understand CRDTs.
Preparing for a system design interview? Sections 1-8 cover what interviewers expect. Section 4 (CRDT vs OT) is the most commonly asked follow-up. Sections 10-11 (bottlenecks and failures) round out the discussion.
TL;DR: CRDT-based architecture using Yjs (YATA algorithm) for conflict-free merging. TipTap/ProseMirror for the rich text editor. WebSocket for real-time sync. Server acts as relay and persistence layer, not as ordering authority. Document tree model with snapshot + operation log storage in PostgreSQL. Presence via ephemeral Redis Pub/Sub. Offline editing syncs automatically on reconnect via state vector diffing. The hardest problems: merging concurrent edits (solved by CRDTs), rich text conflict resolution (Peritext approach), and scaling WebSocket connections across regions.
1. Problem Statement
Real-time collaborative editing means multiple users modify the same document simultaneously and every participant sees a coherent, converging result. It looks simple until you try to build it. Three problems make this genuinely hard:
- Consistency. Alice and Bob both type at the same position in the same paragraph at the same time. Without a merge strategy, their documents diverge permanently. The system must guarantee that all replicas converge to an identical document state, regardless of the order operations arrive.
- Latency. Every keystroke must appear on screen within 16ms (one frame at 60fps). Waiting for a server round-trip before showing the character is unacceptable. The system must apply edits optimistically on the local client and reconcile with remote edits asynchronously.
- Rich text. This is not a plain text buffer. Bold, italic, headings, tables, images, comments, and nested lists all need to survive concurrent edits. Two users formatting the same word simultaneously, or one deleting a paragraph while another adds a comment to it, must produce a sensible result.
Scale: 10 million documents open simultaneously. Average of 3 concurrent editors per document (30 million WebSocket connections). Up to 100 concurrent editors on a single document. 1 billion total documents in storage.
The core question: How to merge concurrent edits from multiple users without conflicts, data loss, or unacceptable latency. Two approaches exist: Operational Transformation (OT) and Conflict-free Replicated Data Types (CRDTs). Section 4 walks through both in full detail before making a choice.
What NOT to do:
- Lock the document while one user is editing. This caps throughput at one edit at a time and makes the system feel single-player.
- Send every keystroke to a central server and wait for confirmation before displaying it. At 200ms+ latency per character, the editor becomes unusable for real-time typing.
- Use a simple "last write wins" strategy. This silently drops one user's edits with no notification or recovery path. Users lose work and lose trust.
- Store the document as a flat string and use line-level diff/merge. Two users editing the same line simultaneously will corrupt each other's work.
- Assume all users are always online. Real users close laptops, lose signal on trains, and work on planes. Offline editing is a requirement, not a nice-to-have.
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Real-time collaborative editing with multiple simultaneous editors | P0 |
| FR-02 | Rich text formatting (bold, italic, underline, headings, lists, code blocks) | P0 |
| FR-03 | Document CRUD (create, read, update, delete) | P0 |
| FR-04 | Sharing with permission levels (owner, editor, commenter, viewer) | P0 |
| FR-05 | Cursor and presence indicators (see who is editing and where) | P0 |
| FR-06 | Version history with ability to view and restore past versions | P0 |
| FR-07 | Tables with concurrent cell editing | P0 |
| FR-08 | Offline editing with automatic sync on reconnect | P1 |
| FR-09 | Comments and suggestions on text ranges | P1 |
| FR-10 | Image and media embedding | P1 |
| FR-11 | In-document search (Ctrl+F) | P1 |
| FR-12 | Undo/redo per user (not global) | P1 |
| FR-13 | Cross-document full-text search | P2 |
| FR-14 | Templates | P2 |
| FR-15 | Export to PDF/DOCX | P2 |
| FR-16 | Document analytics (view count, edit frequency) | P2 |
3. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Operation sync latency (same region) | < 100ms |
| Operation sync latency (cross region) | < 300ms |
| Local keystroke response | < 16ms (60fps rendering) |
| Concurrent editors per document | 100 |
| Concurrent viewers per document | 200 |
| Document size limit | 1.5M characters (~500 pages) |
| Availability | 99.99% |
| Durability | Zero user-visible data loss (client retains canonical state in IndexedDB; server persists via operation log with 100ms batch window) |
| Offline tolerance | Hours of offline editing, merge on reconnect |
| Active documents (simultaneously open) | 10M |
| Total documents | 1B |
4. High-Level Approach & Technology Selection
Both OT and CRDTs are explained end-to-end with concrete examples covering: how the algorithm works, online editing flow, offline editing flow, and conflict resolution. By the end of this section, the choice between the two should feel inevitable.
4.1 The Core Problem: Concurrent Edits
Consider a simple document containing "HELLO". Alice inserts "X" at position 1 (between H and E). Bob deletes the character at position 3 (an L). Both edits happen simultaneously on different clients.
Alice's local view: "HXELLO" (inserted X at position 1).
Bob's local view: "HELO" (deleted L at position 3).
Now they exchange operations. If Bob naively applies Alice's insert at position 1, he gets "HXELO". If Alice naively applies Bob's delete at position 3, she gets "HXELO". Lucky coincidence? Not always. With three or more concurrent operations, naive application diverges fast.
Two approaches solve this problem:
- Operational Transformation (OT): Transform operations against each other so positions stay correct after concurrent edits. The server is the authority.
- CRDTs (Conflict-free Replicated Data Types): Give each character a globally unique identity so position is defined by neighbors, not indices. No authority needed.
Both are walked through in full below.
4.2 Approach 1: Operational Transformation (OT)
4.2.1 How OT Works (Concrete Example)
Document: "ABCD". Alice deletes position 1 (character 'B'). Bob inserts 'X' at position 3 (between 'C' and 'D'). Both edits happen simultaneously.
Server receives Alice's operation first:
- Apply `DELETE(1)` to `"ABCD"` → `"ACD"`. Server is now at revision 6.
- Bob's `INSERT(3, 'X')` was composed against revision 5 (before Alice's delete). It needs transformation.
- Transform: Alice deleted at position 1, which is before Bob's insertion point (3). Everything after position 1 shifted left by one. Bob's operation becomes `INSERT(2, 'X')`.
- Apply the transformed op to `"ACD"` → `"ACXD"`. Server is now at revision 7.
Verify with reverse order (TP1 property):
- Apply Bob's `INSERT(3, 'X')` to `"ABCD"` → `"ABCXD"`. Server at revision 6.
- Alice's `DELETE(1)` was composed against revision 5. Transform: Bob inserted at position 3, which is after Alice's delete point (1). Alice's delete is unaffected. Still `DELETE(1)`.
- Apply `DELETE(1)` to `"ABCXD"` → `"ACXD"`. Server at revision 7.
Both orderings produce "ACXD". This is the Transformation Property 1 (TP1): the result must be the same regardless of which operation the server processes first. TP1 is easy to verify for two operations of the same type. It becomes exponentially harder with three or more concurrent operations mixing inserts, deletes, and formatting. That complexity is the fundamental reason OT implementations are fragile at scale.
Transform function rules (pseudocode for all four operation pairs):
```
transform(op1, op2):
  INSERT(p1, c1) vs INSERT(p2, c2):
    if p1 < p2:  return INSERT(p1, c1)      # op1 is before op2, no shift
    if p1 > p2:  return INSERT(p1 + 1, c1)  # op2 shifted text right
    if p1 == p2: break tie by client ID     # deterministic ordering

  INSERT(p1, c1) vs DELETE(p2):
    if p1 <= p2: return INSERT(p1, c1)      # delete is at/after insert, no shift
    if p1 > p2:  return INSERT(p1 - 1, c1)  # delete shifted text left

  DELETE(p1) vs INSERT(p2, c2):
    if p1 < p2:  return DELETE(p1)          # insert is after delete, no shift
    if p1 >= p2: return DELETE(p1 + 1)      # insert shifted text right

  DELETE(p1) vs DELETE(p2):
    if p1 < p2:  return DELETE(p1)          # independent deletes
    if p1 > p2:  return DELETE(p1 - 1)      # earlier delete shifted text
    if p1 == p2: return NOOP                # same character deleted twice
```
These four rules are the entire foundation of OT. Every operation arriving at the server gets transformed against all operations that happened since the client's last known revision.
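The rules run directly in code. A minimal TypeScript sketch (my own helper names, not a production OT library) that implements the four pairs and checks the TP1 example from above:

```typescript
// transform(a, b) rewrites op `a` so it remains correct after op `b` has
// already been applied to the document. Returns null for a no-op.
type Op =
  | { kind: "insert"; pos: number; ch: string; client: string }
  | { kind: "delete"; pos: number; client: string };

function transform(a: Op, b: Op): Op | null {
  if (a.kind === "insert" && b.kind === "insert") {
    // Tie at the same position is broken deterministically by client id.
    if (a.pos < b.pos || (a.pos === b.pos && a.client < b.client)) return a;
    return { ...a, pos: a.pos + 1 };           // b shifted text right
  }
  if (a.kind === "insert" && b.kind === "delete") {
    if (a.pos <= b.pos) return a;              // delete is at/after insert
    return { ...a, pos: a.pos - 1 };           // delete shifted text left
  }
  if (a.kind === "delete" && b.kind === "insert") {
    if (a.pos < b.pos) return a;               // insert is after delete
    return { ...a, pos: a.pos + 1 };           // insert shifted text right
  }
  // delete vs delete
  if (a.pos < b.pos) return a;
  if (a.pos > b.pos) return { ...a, pos: a.pos - 1 };
  return null;                                 // same character deleted twice
}

function apply(doc: string, op: Op | null): string {
  if (op === null) return doc;
  if (op.kind === "insert") return doc.slice(0, op.pos) + op.ch + doc.slice(op.pos);
  return doc.slice(0, op.pos) + doc.slice(op.pos + 1);
}

// TP1 check with the "ABCD" example: both orders must yield "ACXD".
const aliceDel: Op = { kind: "delete", pos: 1, client: "alice" };
const bobIns: Op = { kind: "insert", pos: 3, ch: "X", client: "bob" };

const order1 = apply(apply("ABCD", aliceDel), transform(bobIns, aliceDel));
const order2 = apply(apply("ABCD", bobIns), transform(aliceDel, bobIns));
console.log(order1, order2); // "ACXD" "ACXD"
```

Both processing orders converge on `"ACXD"`, which is exactly the TP1 property the server relies on.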
4.2.2 Online Flow (Step by Step)
Alice types 'X' at position 3 in a shared document. Walk through every hop:
- Alice's keystroke applies to her local document immediately (optimistic update). She sees the 'X' on screen within 16ms. No server round-trip.
- Alice's client sends `INSERT(3, 'X', rev=5)` to the server over WebSocket. `rev=5` means "this operation was composed against server revision 5."
- Server receives Alice's operation. The server is currently at revision 7 (two other operations from Bob landed while Alice's was in flight). Server transforms Alice's op against revisions 6 and 7 to produce a shifted op. Server applies the transformed op, increments to revision 8, and broadcasts the transformed op to all other clients.
- Acknowledgement: Server sends an ACK back to Alice with `rev=8`. Alice's client marks the pending op as confirmed. If Alice typed more keystrokes while waiting, those pending ops are now rebased against revisions 6 through 8.
- Bob's client receives the broadcast of Alice's transformed op. Bob transforms any of his own pending (unacknowledged) ops against Alice's incoming op, then applies Alice's op to his local document.
Key insight: In OT, the client maintains three states at all times:
- Confirmed state: What the server has acknowledged. The client knows this is canonical.
- Pending op: Sent to server, waiting for ACK. Only one pending op at a time.
- Buffer ops: Not yet sent. Accumulating locally while the pending op awaits acknowledgement.
When the ACK arrives, the pending op moves to confirmed. The buffer ops compose into a new pending op and get sent. This is the OT client state machine, and getting it wrong causes divergence.
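A minimal sketch of that state machine (my own naming, not taken from any specific OT client; rebasing of pending ops against incoming remote ops is elided to keep the shape visible):

```typescript
// The confirmed / pending / buffer client state machine: one batch in
// flight at a time, everything else queues locally.
type Op = { pos: number; ch: string };

class OtClient {
  confirmedRev = 0;            // last server revision known to be canonical
  pending: Op[] | null = null; // one in-flight batch awaiting the server ACK
  buffer: Op[] = [];           // edits accumulated while pending is in flight

  constructor(private send: (ops: Op[], rev: number) => void) {}

  // Every keystroke is applied to the local document first (optimistic),
  // then routed through here.
  localEdit(op: Op): void {
    if (this.pending === null) {
      this.pending = [op];
      this.send(this.pending, this.confirmedRev); // send against known rev
    } else {
      this.buffer.push(op); // hold until the in-flight batch is ACKed
    }
  }

  // Server acknowledged the pending batch at revision newRev.
  onAck(newRev: number): void {
    this.confirmedRev = newRev;
    this.pending = null;
    if (this.buffer.length > 0) {
      // Compose buffered edits into the next pending batch and send it.
      this.pending = this.buffer;
      this.buffer = [];
      this.send(this.pending, this.confirmedRev);
    }
  }
}

// Simulate: two keystrokes while the first is still in flight.
const sent: Array<{ ops: Op[]; rev: number }> = [];
const client = new OtClient((ops, rev) => sent.push({ ops, rev }));
client.localEdit({ pos: 0, ch: "a" }); // sent immediately (pending)
client.localEdit({ pos: 1, ch: "b" }); // buffered behind the pending op
client.onAck(1);                       // ACK promotes the buffer to pending
console.log(sent.length); // 2 sends: the first op, then the buffered batch
```

The real implementation must additionally transform both `pending` and `buffer` against every remote op that arrives, which is where most OT divergence bugs live.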
4.2.3 Offline Flow
- Alice goes offline. The WebSocket connection drops.
- Alice continues typing. Each keystroke produces an operation applied against her local revision counter.
- Alice accumulates 500 operations over 2 hours offline.
- Alice reconnects. Her client reports its last known server revision (say rev=50). The server has advanced to rev=200 (150 ops from other users).
- The server sends Alice the 150 missed operations. Alice's client must transform her 500 pending ops against these 150 server ops. That is 500 x 150 = 75,000 transform calls. Each transform must be mathematically correct or the document diverges permanently.
- After transformation, Alice's rebased ops are sent to the server. The server transforms them again against any ops that arrived during the rebasing process.
The problem: Long offline periods create quadratic transformation cost (O(local_ops x missed_ops)). A single bug in the transform function across 75,000 calls causes permanent divergence that is extremely difficult to debug. Google Docs historically limited offline OT resolution, requiring server round-trips to reconcile diverged states.
Pros:
- Server maintains a single canonical document state (easy to reason about)
- Mature ecosystem (Google Docs has used OT since 2010)
- Lower per-character metadata than CRDTs (no unique ID per character)
- Access control is straightforward (server can reject operations)
Cons:
- Transform functions must satisfy TP1 (applying transformed ops in either order yields the same result). Getting this right for every operation pair is notoriously hard
- Central server is required for ordering (single point of failure for correctness)
- Offline sync is expensive and fragile (quadratic transformation cost)
- Client state machine (confirmed/pending/buffer) is complex to implement correctly
- Every new operation type (tables, images, formatting) needs new transform functions
Used by: Google Docs, CKEditor 5, Etherpad
A third approach: Microsoft Loop and Fluid Framework.
OT requires pairwise transforms. CRDTs require per-character metadata. Fluid Framework sidesteps both with total order broadcast: the server does not transform operations, and clients do not carry CRDT metadata. Instead, the server assigns a global sequence number to every incoming operation and broadcasts them to all clients in that order.
Concrete example: Alice sends `INSERT(3, 'X')` and Bob sends `DELETE(5)` simultaneously. Both arrive at the Fluid relay server. The server does not transform them. It assigns Alice's op sequence #41 and Bob's op sequence #42, then broadcasts both to all clients in that order. Every client applies #41 first, then #42. Because everyone applies the same operations in the same order, all replicas converge without transforms.

Why this works: Divergence in OT happens because clients apply operations in different orders, requiring transforms to reconcile. Fluid eliminates divergence by forcing a single global order. No transforms needed. No CRDT item IDs needed. Operations are plain position-based, like OT, but applied in server-assigned order.
The tradeoff: The server is a sequencing authority. If the server is unreachable, clients cannot apply operations (they queue locally and wait). This makes Fluid less offline-friendly than CRDTs, where the client applies edits immediately without any server involvement. Fluid sits between OT and CRDTs: simpler than OT (no transform functions), less offline-capable than CRDTs (needs server for ordering), and lighter than CRDTs (no per-character metadata).
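The mechanism fits in a few lines. A toy sketch (my own model, not the Fluid Framework API) of total order broadcast:

```typescript
// The server assigns a sequence number to each op and every client applies
// ops strictly in that order, so replicas converge without transforms or
// per-character metadata.
type Seq<T> = { seq: number; op: T };

class Sequencer<T> {
  private nextSeq = 1;
  private subscribers: Array<(m: Seq<T>) => void> = [];

  subscribe(fn: (m: Seq<T>) => void): void {
    this.subscribers.push(fn);
  }

  // The server never inspects or transforms the op; it only orders it.
  submit(op: T): void {
    const msg: Seq<T> = { seq: this.nextSeq++, op };
    for (const fn of this.subscribers) fn(msg);
  }
}

type Ins = { pos: number; ch: string };
const applyIns = (doc: string, { pos, ch }: Ins): string =>
  doc.slice(0, pos) + ch + doc.slice(pos);

const server = new Sequencer<Ins>();
let docA = "HELLO";
let docB = "HELLO";
server.subscribe(m => { docA = applyIns(docA, m.op); });
server.subscribe(m => { docB = applyIns(docB, m.op); });

server.submit({ pos: 1, ch: "X" }); // Alice's op becomes seq #1
server.submit({ pos: 1, ch: "Y" }); // Bob's op becomes seq #2
console.log(docA === docB); // true: same ops, same order, same result
```

The same pair of same-position inserts that diverged under naive merging converges here, because the sequencer imposes one order on everyone.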
4.3 Approach 2: CRDTs (Conflict-free Replicated Data Types)
4.3.1 How CRDTs Work (Concrete Example)
Document: "AC". Character A has a unique ID (1, server). Character C has ID (2, server). These IDs never change.
Alice inserts 'B' after 'A'. Bob inserts 'X' after 'A'. Both happen simultaneously.
Each new character gets a globally unique ID composed of (lamport_clock, client_id):
- Alice's insert: `{ char: 'B', id: (3, alice), originLeft: (1, server), originRight: (2, server) }`. Alice sees: `"ABC"`
- Bob's insert: `{ char: 'X', id: (3, bob), originLeft: (1, server), originRight: (2, server) }`. Bob sees: `"AXC"`
Now they exchange operations. Both replicas receive an item that has the same originLeft as an existing item (a conflict). The tie-breaking rule: compare client_id lexicographically. "alice" < "bob", so B sorts before X.
Both replicas independently arrive at: "ABXC". No server decided this ordering. Both replicas applied the same deterministic rule.
The YATA algorithm (used by Yjs) formalizes this. Each item stores { id, originLeft, originRight, content }. On conflict (two items share the same originLeft):
- Start at the conflict position.
- Scan right through existing items.
- Compare the conflicting item's ID against each existing item's ID.
- Insert before the first item whose `originLeft` is different from ours, or whose `client_id` is greater than ours.
This produces a total order that every replica agrees on, without any communication beyond exchanging the items themselves. The mathematical property: the merge function is commutative (A merge B = B merge A), associative ((A merge B) merge C = A merge (B merge C)), and idempotent (A merge A = A). These three properties guarantee convergence regardless of message ordering or duplication.
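A simplified integration rule can be sketched directly (my own minimal model, not the full YATA algorithm from the Yjs codebase): concurrent items that share an `originLeft` are ordered by client id, bounded by `originRight`.

```typescript
// Each item carries a unique id plus origin pointers; integrate() finds the
// deterministic position for a new item regardless of arrival order.
type Id = { clock: number; client: string };
type Item = { id: Id; originLeft: Id; originRight: Id; ch: string };

const sameId = (a: Id, b: Id) => a.clock === b.clock && a.client === b.client;

function integrate(items: Item[], item: Item): Item[] {
  // Start just after our left origin...
  let i = items.findIndex(it => sameId(it.id, item.originLeft)) + 1;
  // ...then scan right past concurrent siblings (same originLeft) whose
  // client id sorts before ours, stopping at our right origin.
  while (
    i < items.length &&
    !sameId(items[i].id, item.originRight) &&
    sameId(items[i].originLeft, item.originLeft) &&
    items[i].id.client < item.id.client
  ) {
    i++;
  }
  return [...items.slice(0, i), item, ...items.slice(i)];
}

// Document "AC": A = (1, server), C = (2, server); sentinel ids for the edges.
const start: Id = { clock: 0, client: "start" };
const end: Id = { clock: 0, client: "end" };
const A: Item = { id: { clock: 1, client: "server" }, originLeft: start, originRight: end, ch: "A" };
const C: Item = { id: { clock: 2, client: "server" }, originLeft: A.id, originRight: end, ch: "C" };

// Concurrent inserts between A and C: identical origins, different clients.
const B: Item = { id: { clock: 3, client: "alice" }, originLeft: A.id, originRight: C.id, ch: "B" };
const X: Item = { id: { clock: 3, client: "bob" }, originLeft: A.id, originRight: C.id, ch: "X" };

const text = (items: Item[]) => items.map(it => it.ch).join("");

// Replica 1 receives B first; replica 2 receives X first. Both converge.
const replica1 = integrate(integrate([A, C], B), X);
const replica2 = integrate(integrate([A, C], X), B);
console.log(text(replica1), text(replica2)); // "ABXC" "ABXC"
```

Both replicas arrive at `"ABXC"` without exchanging anything beyond the items themselves, which is the convergence property the section describes.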
4.3.2 Online Flow (Step by Step)
Alice types 'X' in the document. Walk through every hop:
- Alice's keystroke applies to the local Yjs document immediately. A new CRDT item is created with a unique ID `(clock, alice)`. She sees the 'X' on screen within 16ms. This edit is final. It will never be rebased, transformed, or undone by the sync process.
- Yjs encodes the new item as a compact binary update (~20-50 bytes for a single character).
- Alice's client sends the binary update to the sync server over WebSocket.
- Acknowledgement model: There is no ACK required for correctness. The edit is already committed to Alice's local document. The server is a relay, not an authority. It does not transform operations, though it does establish a durable total ordering via the append-only operation log. The client does not wait for server confirmation before proceeding. If the server is slow, unreachable, or crashed, Alice keeps typing with zero impact on her local experience. Eventual consistency is verified via periodic state vector reconciliation, not per-operation ACKs.
- The sync server receives the update, appends it to the operation log, and broadcasts it to all other connected clients editing the same document.
- Bob's client receives the update. Yjs merges it into Bob's local document using the CRDT rules (unique IDs + deterministic ordering). If Bob has concurrent edits, they merge automatically. No transformation needed.
Key insight: In CRDTs, there is no "pending" state. Every local edit is immediately part of the document's permanent history. There is no rebasing. The merge function is idempotent (applying the same update twice has no effect) and commutative (the order updates arrive does not matter). This eliminates the entire class of bugs related to OT's client state machine.
4.3.3 Offline Flow
- Alice goes offline. The WebSocket connection drops.
- Alice continues typing. Each keystroke creates a new CRDT item with a unique ID in her local Yjs document. Everything is stored in IndexedDB.
- Alice accumulates 500 edits over 2 hours offline.
- Meanwhile, Bob and others make 150 edits on the server.
- Alice reconnects. Her Yjs client sends its state vector: a compact summary of what it has already seen. Example: `{ alice: 550, bob: 200, carol: 100 }`, meaning "I have all edits from Alice up to clock 550, Bob up to 200, Carol up to 100."
- The server compares Alice's state vector against its own. It computes the diff: "Alice is missing these 150 updates from Bob and Carol." The server sends only those 150 updates.
- Alice's client applies the 150 incoming updates. The CRDT merge handles ordering automatically. No transforms. No 75,000 pairwise transform calls. Just 150 merge operations, each O(log n) where n is the document length.
- Alice's client sends her 500 offline edits to the server (the server's state vector showed it was missing those). The server merges them the same way.
- All clients converge. Total merge work: O(m log n) where m is the number of missed operations and n is the document length. No quadratic blowup.
The advantage: Offline for 2 hours or 2 weeks makes no difference to correctness. The merge protocol is identical whether the gap is 1 operation or 10,000. No rebasing, no transform chains, no divergence risk. This is what makes CRDTs genuinely offline-first, instead of bolting offline support onto an architecture that was never built for it.
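The state-vector diff itself is a one-liner. A sketch with my own types (not the Yjs wire format, which is a compact binary encoding of the same idea):

```typescript
// The client reports the highest clock it has seen per peer; the server
// replies with only the updates the client is missing.
type StateVector = Record<string, number>; // client id -> max clock seen
type Update = { client: string; clock: number; payload: string };

function missingUpdates(log: Update[], sv: StateVector): Update[] {
  // An update is missing if its clock is beyond what the client has seen
  // from that peer (0 if the peer is entirely unknown to the client).
  return log.filter(u => u.clock > (sv[u.client] ?? 0));
}

// Server log: edits that landed while Alice was offline.
const serverLog: Update[] = [
  { client: "alice", clock: 550, payload: "..." }, // Alice's own last edit
  { client: "bob", clock: 201, payload: "..." },
  { client: "carol", clock: 101, payload: "..." },
];

// Alice reconnects and reports what she has already seen.
const aliceSv: StateVector = { alice: 550, bob: 200, carol: 100 };

const diff = missingUpdates(serverLog, aliceSv);
console.log(diff.map(u => u.client)); // only bob's and carol's new updates
```

The same comparison runs in the other direction for Alice's 500 offline edits: the server's state vector shows it has nothing from Alice past clock 550, so her client sends exactly those.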
4.3.4 Conflict Resolution
Same-position insert: Alice inserts 'A', Bob inserts 'B', both after the same character. In CRDTs, "same position" means both items share the same originLeft. The YATA algorithm compares client IDs to break the tie deterministically. Every replica applies the same rule independently. No server arbitration.
Delete vs. insert: Alice deletes a word. Bob simultaneously inserts new text inside that word. The deleted characters become tombstones (marked as deleted but retained in the CRDT structure). Bob's insert still lands correctly because its position is defined by originLeft and originRight pointers to specific item IDs, not integer indices. The inserted text appears, and the surrounding deleted text remains invisible. Clean resolution, no special case needed.
Delete vs. format: Alice deletes a paragraph. Bob simultaneously bolds a word in that paragraph. The bold mark applies to tombstoned characters (no visible effect). When both operations merge, the paragraph is gone and the formatting goes with it. If the paragraph is later restored (undo), the bold formatting reappears.
Pros:
- Offline-first by mathematical design (commutative, associative, idempotent merges)
- No central ordering authority required (server is a relay, not a coordinator)
- Local edits are instant and permanent (no pending state, no rebasing)
- Merge complexity is O(m log n) (m = missed ops, n = doc length), not O(n x m) for offline catch-up
- Adding new operation types does not require new transform functions
Cons:
- Per-character metadata overhead (each character carries a unique ID + origin pointers)
- Tombstones for deleted text consume memory until garbage collected
- Document size is 1.5-3x the plain text size (Yjs binary encoding)
- Undo/redo requires per-user tracking (Yjs `UndoManager` handles this; see Section 9.11)
- Garbage collection of tombstones requires coordination across all connected clients
Used by: Figma, Apple Notes, Jupyter Notebooks (Yjs), AFFiNE, Notion (partial)
4.4 Side-by-Side Comparison
| Dimension | OT | CRDT |
|---|---|---|
| Offline sync | Expensive: O(local x missed) transforms | Cheap: O(m log n) state vector diff |
| Acknowledgement model | Server ACKs each op; client has pending/buffer states | No ACK needed; local edits are final |
| Metadata per character | None (positions are integers) | Unique ID + originLeft + originRight (~20 bytes) |
| Transform/merge complexity | O(n x m) for n local, m remote ops | O(m log n) for m ops, n doc length |
| Correctness risk | High (TP1 bugs cause silent divergence) | Low (mathematical convergence guarantee) |
| Rich text maturity | Google Docs (15+ years of production) | Yjs + Peritext (newer but production-proven) |
| Latency model | Local instant, server round-trip for confirmation | Local instant, no confirmation needed |
| Used by | Google Docs, CKEditor 5, Etherpad | Figma, Apple Notes, Jupyter (Yjs), AFFiNE |
Summary: OT gives the server full control at the cost of offline fragility and transformation correctness risk. CRDTs provide offline-first and mathematical convergence guarantees at the cost of metadata overhead and tombstone management. For a modern collaborative editor where offline capability matters, CRDTs have won the practical argument. Yjs's optimized binary encoding keeps metadata overhead to 1.5-3x plain text, down from 16x in naive CRDT implementations.
4.5 Our Choice: CRDT (Yjs) with a Relay Server
After walking through both algorithms in detail, the choice:
Why CRDT over OT:
- Offline sync is a first-class citizen, not a bolted-on afterthought
- No transformation correctness bugs. I've seen teams spend months debugging OT transform bugs that only surface under specific three-user concurrent editing scenarios. CRDTs eliminate this entire class of problem.
- Local edits are instant and final (no pending/buffer state machine, no rebasing)
- Merge complexity scales linearly, not quadratically, with offline duration
Why Yjs specifically:
- Mature ecosystem with ProseMirror and TipTap integrations
- Optimized binary encoding (1.5-3x plain text size, not 16x like academic CRDT implementations)
- Production-proven at scale (Jupyter Notebooks, AFFiNE, multiple enterprise deployments)
- Run-length encoding for consecutive edits (typing "hello" stores as one item, not five)
- Built-in awareness protocol for presence and cursor tracking
Why still use a server:
- Access control enforcement (reject unauthorized edits before relaying)
- Durable persistence (operation log + snapshots in PostgreSQL and S3)
- Presence relay across clients (cursor positions, online status)
- Single WebSocket endpoint (simpler than peer-to-peer NAT traversal at scale)
The hybrid model: Yjs handles the merge algorithm. The server handles auth, persistence, relay, and presence. The client handles editing via TipTap/ProseMirror with Yjs bindings. Each layer does what it is best at.
4.6 Collaboration Paradigms: Where This Design Fits
This post designs a rich text collaborative editor. That is one of four distinct paradigms, each requiring a different CRDT data model and sync architecture:
| Paradigm | Examples | CRDT Data Model | Key Difference |
|---|---|---|---|
| Plain text | Etherpad, VS Code Live Share | Sequence CRDT (characters only) | No formatting marks, simpler merge |
| Rich text | Google Docs, this design | Sequence CRDT + marks (Peritext) | ProseMirror tree with inline formatting |
| Block-based | Notion, Coda, AFFiNE, Microsoft Loop | Tree CRDT (blocks) + per-block text CRDT | Each block is an independent CRDT subdocument |
| Canvas / design | Figma, Miro, FigJam | Map CRDT (objects with properties) | CRDTs operate on position/color/size, not text |
Block-Based Editors (Notion, AFFiNE, Coda)
In our rich text design, the entire document is a single ProseMirror tree backed by one Yjs document. Every character, heading, and table lives in the same CRDT. In a block-based editor, the document is a tree of independent blocks, each with its own CRDT:
```
Document
├── Block (heading): "Project Roadmap"          ← Y.Text CRDT
├── Block (paragraph): "The first milestone..." ← Y.Text CRDT
├── Block (table)                               ← Y.Map CRDT
│   ├── Cell (0,0): "Task"                      ← Y.Text CRDT
│   └── Cell (0,1): "Owner"                     ← Y.Text CRDT
├── Block (toggle): "Implementation details"    ← Y.Text CRDT
│   ├── Block (paragraph): "Step 1..."          ← Y.Text CRDT (nested child)
│   └── Block (paragraph): "Step 2..."          ← Y.Text CRDT (nested child)
└── Block (image): screenshot.png               ← Y.Map CRDT (properties only)
```
In Yjs terms, the data structure looks like this:
```
Document = Y.Array<BlockID>            // ordering CRDT (which blocks, in what order)
Block = Y.Map {
  id: string,                          // globally unique block ID
  type: "paragraph" | "heading" | "table" | "image" | "toggle" | ...,
  content: Y.Text,                     // inline text with marks (per-block CRDT)
  children: Y.Array<BlockID>,          // nested blocks (toggles, columns, callouts)
  props: Y.Map { level, checked, ... } // block-level properties
}
```
The key difference: in our design, Alice editing paragraph 1 and Bob editing paragraph 5 both modify the same Yjs document. In a block model, they modify completely independent CRDTs. This has four architectural consequences:
1. Per-block lazy loading. The client does not need the entire document to start editing. On page open, the server sends block metadata (IDs, types, ordering) and the content of blocks visible in the viewport. As the user scrolls, additional blocks are loaded on demand. This is how Notion achieves fast initial page loads even for documents with hundreds of blocks. In our design, the full Yjs document must be synced before the editor is interactive.
2. Per-block permissions. Because each block is an independent CRDT, access control can be enforced per block, not just per document. A team lead can share a specific section of a planning document with an external contractor without exposing the rest. The sync server checks block-level ACLs before sending block content. Our design only supports document-level permissions.
3. Transclusion (synced blocks). A "synced block" in Notion appears in multiple documents simultaneously. The block lives in a shared block store, and documents reference it by ID. Edits to the block propagate to every document that references it. This is straightforward when each block is an independent CRDT: the block's Yjs document is synced independently of any parent document. In our ProseMirror tree model, transclusion would require splitting the single document CRDT, which defeats the tree structure.
4. The block move problem. Drag-and-drop reordering means moving a block from one position to another in the Y.Array. If Alice moves block X above block Y while Bob moves block X below block Z, the CRDT must resolve this conflict. Sequence CRDTs (including Yjs) do not natively support "move" as an atomic operation. A move is typically implemented as delete-then-insert, which can produce duplicates or losses under concurrent moves. Notion handles this by routing structural operations (block moves, block deletions) through the server as ordered transactions, while using CRDTs for within-block text editing. This is the "hybrid" approach: tree structure is server-ordered, text content is CRDT-merged. Martin Kleppmann's "Moving Elements in List CRDTs" (2020) proposes a native move operation for sequence CRDTs, but production implementations are still limited.
Trade-offs: rich text vs. block-based
| Aspect | Rich Text (This Design) | Block-Based (Notion) |
|---|---|---|
| Sync unit | Entire document as one CRDT | Per-block, independent CRDTs |
| Initial load | Full document sync required | Metadata + visible blocks only |
| Large documents | Memory-bound (entire doc in RAM) | Viewport-bound (only visible blocks in RAM) |
| Permissions | Document-level only | Block-level possible |
| Reordering | ProseMirror transactions | Tree CRDT or server-ordered moves |
| Offline | Full document available offline | Only cached blocks available offline |
| Complexity | Single CRDT, simpler architecture | Tree of CRDTs, more moving parts |
Block-based editing is the direction the industry is moving. If I were starting this system from scratch with Notion-level ambitions (databases, synced blocks, per-block permissions), I would choose the block model. For a Google Docs-like rich text editor, the single-document CRDT is simpler, better supported by existing tooling (TipTap, ProseMirror, Hocuspocus), and sufficient for the core use case.
Canvas Editors (Figma)
Canvas editors (Figma) apply CRDTs to objects with properties (x, y, width, height, color, z-order) rather than text sequences. Figma runs the CRDT on the server and sends rendered state to clients via WebGL, which enables 500+ concurrent viewers without each viewer holding a full CRDT document in memory. The data model is fundamentally different from a document tree.
Why This Design Focuses on Rich Text
This design focuses on rich text because it represents the core challenge of collaborative document editing and the most common interview question. The CRDT fundamentals (unique IDs, deterministic ordering, offline merge) apply across all four paradigms; the difference is what you attach those IDs to.
4.7 Technology Selection
| Component | Technology | Why |
|---|---|---|
| Sync algorithm | Yjs (YATA CRDT) | Offline-first, proven binary encoding, rich ecosystem |
| Rich text editor | TipTap (ProseMirror) | Yjs binding (y-prosemirror), extensible schema, production-ready |
| Sync server | Hocuspocus / Custom Node.js | Yjs-native WebSocket server, handles auth hooks and persistence |
| Real-time transport | WebSocket | Bidirectional, low overhead, browser-native, wide proxy support |
| Document storage (hot) | Redis | In-memory Yjs document state for active documents, sub-ms access |
| Document storage (warm) | PostgreSQL | Operation logs, snapshots, document metadata, ACLs |
| Document storage (cold) | S3 | Archived snapshots, old operation logs, media files |
| Media storage | S3 + CloudFront CDN | Presigned uploads, edge-cached delivery |
| Search | Elasticsearch | Full-text search across billions of documents with ACL filtering |
| Presence | Redis Pub/Sub | Ephemeral cursor/selection broadcast across sync server instances |
| Auth | JWT + OAuth 2.0 | Stateless token validation at WebSocket handshake |
4.8 What Runs Where
Before diving into the architecture diagram, here is what each core technology actually does and which side of the network it runs on.
Client side (runs in the browser):
| Technology | What It Is | Problem It Solves |
|---|---|---|
| ProseMirror | Rich text editor framework | Renders the document, handles the editing UI (typing, cursor, selections, toolbar), and enforces the document schema. This is the "word processor" layer. |
| TipTap | Wrapper around ProseMirror | Makes ProseMirror easier to configure. Adds an extension API and the Yjs binding (y-prosemirror) that connects the editor to the CRDT layer. |
| Yjs | CRDT library (YATA algorithm) | The merge engine. Every character gets a unique ID. When two users type simultaneously, Yjs merges their edits deterministically without conflicts. Each user holds a full copy of the document in memory. |
| IndexedDB | Browser storage | Persists the Yjs document locally. Enables offline editing and instant page reloads without fetching from the server. |
Server side:
| Technology | What It Is | Problem It Solves |
|---|---|---|
| Hocuspocus | Yjs-native WebSocket server (Node.js) | The sync relay. Receives binary Yjs updates from one client and broadcasts them to all other clients editing the same document. Also handles auth hooks and persistence hooks. |
| Redis | In-memory store | Hot document state cache (sub-ms access) and presence broadcast via Pub/Sub (cursor positions across server instances). |
| PostgreSQL | Relational database | Durable persistence: operation log (every edit), snapshots (periodic full state), document metadata, and ACLs. |
| S3 | Object storage | Cold storage for archived snapshots, old operation logs, and uploaded media files. |
The flow when Alice types a character:
```
Alice's browser                     Server                          Bob's browser
───────────────                     ──────                          ─────────────
TipTap (editor UI)                  Hocuspocus                  TipTap (editor UI)
        ↓                                                                 ↑
ProseMirror (document tree)                            ProseMirror (document tree)
        ↓                                                                 ↑
Yjs (creates CRDT item)                                    Yjs (merges CRDT item)
        ↓                                                                 ↑
Binary update ──→ WebSocket ──→ Persist + Broadcast ──→ WebSocket ──→ Binary update
                                (PostgreSQL + Redis)
```
- Alice types 'X' → ProseMirror creates a transaction → Yjs creates a CRDT item with a unique ID (clock, alice)
- Yjs encodes the item as a binary update (~30 bytes)
- The update travels over WebSocket to Hocuspocus
- Hocuspocus appends it to the operation log (PostgreSQL) and broadcasts it to Bob
- Bob's Yjs merges the item using deterministic ID ordering. No transforms. ProseMirror renders it.
The key point: the server does not merge. Hocuspocus is a relay, not a merge authority. All merge intelligence lives in Yjs on the client. If the server crashes, Alice and Bob keep typing locally (Yjs saves to IndexedDB). When the server comes back, they reconnect and Yjs syncs automatically via state vector diffing. This is the fundamental difference from Google Docs (OT), where the server is the ordering authority.
Tombstone lifecycle (how deletions work in Yjs):
When a character is deleted, Yjs does not remove it from the CRDT. It marks it as a tombstone: the content is cleared, but the unique ID and origin pointers are retained. A document where you type 10,000 characters and delete 9,000 still has 10,000 items internally. The 9,000 deleted items are tombstones, skipped during rendering.
Tombstones are not permanent. Yjs can garbage-collect them, but only when every connected client has integrated the deletion. If even one client is offline and has not seen the delete, the tombstone must stay. Removing it would break that client's merge on reconnect, because its state vector references items that no longer exist.
In practice, GC runs during snapshot creation. The snapshot stores only live content plus a compressed tombstone summary. Clients reconnecting after a GC cycle receive the full snapshot instead of incremental updates. This is why Yjs documents grow during active editing but shrink back to near-plain-text size when a snapshot is taken with all clients in sync.
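The GC eligibility rule above reduces to a state-vector check. Here is a toy version in plain JavaScript; the function and field names (`canCollect`, `deletedBy`) are illustrative, not Yjs internals:

```js
// Sketch: a tombstone can be garbage-collected only when every known
// client's state vector covers the delete operation's (client, clock).
function canCollect(tombstone, stateVectors) {
  // tombstone.deletedBy = { client, clock } of the delete operation
  return stateVectors.every(
    (sv) => (sv[tombstone.deletedBy.client] ?? -1) >= tombstone.deletedBy.clock
  );
}

const tombstone = { id: { client: 1, clock: 5 }, deletedBy: { client: 2, clock: 9 } };

// Client 2 deleted at clock 9. One client has only seen client 2 up to clock 8.
const behind = [{ 1: 10, 2: 9 }, { 1: 10, 2: 8 }];
console.log(canCollect(tombstone, behind)); // false: the lagging client would break

const caughtUp = [{ 1: 10, 2: 9 }, { 1: 10, 2: 9 }];
console.log(canCollect(tombstone, caughtUp)); // true: safe to GC
```

In production the "known clients" set includes offline clients that may reconnect, which is exactly why GC is deferred to snapshot boundaries.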
4.9 End-to-End Example: From Keystroke to CRDT
The previous sections explain what each technology does. This section shows how they work together with actual code.
The document:
Document
├── Block b1 (heading): "Project Plan"
└── Block b2 (paragraph): "Build authentication service using OAuth"
Two blocks. Each block gets its own TipTap editor, its own ProseMirror instance, and its own Yjs text fragment. Here is the setup with collaboration enabled:
```js
import { Editor } from "@tiptap/core"
import StarterKit from "@tiptap/starter-kit"
import Collaboration from "@tiptap/extension-collaboration"
import { HocuspocusProvider } from "@hocuspocus/provider"
import * as Y from "yjs"

// Yjs document + WebSocket connection to the sync server
const ydoc = new Y.Doc()
const provider = new HocuspocusProvider({
  url: "wss://sync.example.com",
  name: "doc-abc", // document ID
  document: ydoc,
})

// Block b1: heading
const headingEditor = new Editor({
  element: document.querySelector("#heading"),
  extensions: [
    StarterKit,
    Collaboration.configure({ document: ydoc, field: "block-b1" }),
  ],
})

// Block b2: paragraph
const paragraphEditor = new Editor({
  element: document.querySelector("#paragraph"),
  extensions: [
    StarterKit,
    Collaboration.configure({ document: ydoc, field: "block-b2" }),
  ],
})
```
The line Collaboration.configure({ document: ydoc, field: "block-b2" }) is what connects ProseMirror to Yjs. It installs the y-prosemirror binding, which automatically converts every ProseMirror transaction into Yjs CRDT operations. No manual conversion code needed.
What happens when Alice bolds "using OAuth":
Alice selects "using OAuth" (positions 28-40 in block b2) and clicks the Bold button.
```js
document.getElementById("boldBtn").onclick = () => {
  paragraphEditor.commands.toggleBold()
}
// The developer writes this one line. Everything below is automatic.

// Layer 2: ProseMirror
// The editor already knows the selection from the browser.
//   EditorState.selection = { from: 28, to: 40 }
// TipTap creates a ProseMirror transaction:
//   tr.addMark(28, 40, schema.marks.bold.create())

// Layer 3: Document tree update
// Before: paragraph → text("Build authentication service using OAuth")
// After:  paragraph → text("Build authentication service ") + bold("using OAuth")

// Layer 4: Yjs (automatic via y-prosemirror binding)
// The binding intercepts the ProseMirror transaction and converts it:
//   Y.Text.format(28, 12, { bold: true })
// A CRDT operation is created with a unique ID (clock, alice).

// Layer 5: Network (automatic via HocuspocusProvider)
// Yjs encodes the operation as a binary update (~40 bytes).
// HocuspocusProvider sends it over WebSocket to the server.
// Server persists to op log, broadcasts to all other clients.
// Bob's Yjs merges the bold mark. His ProseMirror renders it.
```
The developer writes one line: editor.commands.toggleBold(). ProseMirror tracks the selection. The y-prosemirror binding converts to CRDT operations. HocuspocusProvider handles the network. Four layers, but the developer only touches the top one.
Concurrent edit: Alice bolds while Bob types
While Alice bolds "using OAuth", Bob types " v2.0" at the end of the same sentence. Both edits happen simultaneously.
- Alice's operation: Y.Text.format(28, 12, { bold: true }) with a Peritext "no-expand" boundary at the right edge of "OAuth"
- Bob's operation: inserts " v2.0" as a new CRDT item with originLeft pointing to the 'h' in "OAuth"
Result on both clients: "Build authentication service using OAuth v2.0"
Bob's " v2.0" appears unbolded because the Peritext no-expand boundary at the end of Alice's bold range prevents new text from inheriting the mark. No transforms needed. Each operation references stable CRDT item IDs, not integer positions, so they merge independently regardless of arrival order.
5. High-Level Architecture
5.1 Bird's-Eye View
5.2 Component Glossary
(1) WebSocket Connection. Client connects through a load balancer to a WebSocket gateway. The load balancer uses document ID-based routing (consistent hashing) so all editors of a single document land on the same gateway instance when possible. In practice, the "gateway" and "sync server" can be a single Hocuspocus process: Hocuspocus is itself a WebSocket server that handles TLS termination (behind NGINX), auth hooks, and Yjs sync in one process. The diagram separates them to show the logical responsibilities. At scale, you may split them: a stateless gateway tier for TLS and connection management, proxying to a stateful sync tier that holds Yjs documents in memory. The inter-tier protocol would be WebSocket passthrough (the gateway forwards raw WebSocket frames to the sync server after auth validation).
(2) Auth and Sync. On connection, the gateway validates the JWT token with the Auth Service. The sync server loads the Yjs document state (from Redis if hot, PostgreSQL snapshot if cold) and runs the Yjs sync protocol (state vector exchange) to bring the client up to date. Cold document loads (no Redis cache, loading from PostgreSQL snapshot + replaying operations) can take 500ms-1s for large documents. The <100ms sync latency target applies to steady-state incremental updates, not cold opens.
(3) Real-time Broadcast. Local edits are sent as binary Yjs updates over WebSocket. The sync server persists the update and broadcasts it via Redis Pub/Sub to other sync server instances, which relay to their connected clients.
(4) Persistence. Every Yjs update is appended to the operation log in PostgreSQL. Periodic snapshots (full Yjs document binary) are saved every 500 operations or every 5 minutes, whichever comes first. Hot document state is cached in Redis.
(5) Media. Images and files are uploaded to S3 via presigned URLs. The media node in the document tree references the S3 key. CloudFront CDN caches media at the edge.
(6) REST API. Document CRUD, sharing, permission management, version history browsing, and search all go through the REST API layer. These are not real-time operations and do not use WebSocket.
6. Back-of-the-Envelope Estimation
Documents
Total documents in storage: 1B
Active documents (open right now): 10M
Average concurrent editors per active doc: 3
Total WebSocket connections: 30M (10M × 3)
Peak concurrent editors on one doc: 100
Operations
Average ops/sec per active editor: 2 (typing + formatting; upper bound)
Total ops/sec (steady state): 60M (30M × 2; design ceiling)
Realistic ops/sec (many editors idle): 15-25M (most connections are idle readers)
Peak ops/sec (3x burst): 180M
Ops/sec on a busy single document: 200 (100 editors × 2 ops/sec)
Bandwidth
Average Yjs binary update size: 100 bytes
Steady-state bandwidth: 6 GB/sec (60M × 100 bytes)
Per-document bandwidth (100 editors): 20 KB/sec (trivial)
Per-WebSocket-server bandwidth: ~10 MB/sec (at 50K connections)
Storage
Average document content size: 10 KB
Document content (1B docs): 10 TB
Operation logs (avg 100 KB/doc): 100 TB (compacted)
Snapshots (avg 50 KB/doc): 50 TB
Media (avg 1 MB/doc, 30% have media): 300 TB
Total: ~460 TB
WebSocket Servers
Connections per server: 50K (practical limit with epoll/kqueue)
Servers needed: 600 (30M / 50K)
Sync Servers (Hocuspocus)
Active documents per instance: 10K (each doc held in memory ~50-200 KB)
Memory per instance: ~1-2 GB for document state (avg doc ~50-200 KB; a 1.5M-char doc with 100 editors can peak at ~50 MB)
Instances needed: 1,000 (10M / 10K)
PostgreSQL
Write throughput (op log): 60M inserts/sec (sharded across 100+ nodes, with batch writes; see Section 10.3)
Read throughput (snapshots): ~10K reads/sec (cold document loads)
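The sizing arithmetic above can be checked mechanically. This sketch just reproduces the document's own numbers; nothing here is new data:

```js
// Reproducing the back-of-the-envelope numbers from this section.
const activeDocs = 10e6;
const editorsPerDoc = 3;
const connections = activeDocs * editorsPerDoc;          // 30M WebSocket connections
const opsPerEditorPerSec = 2;
const opsPerSec = connections * opsPerEditorPerSec;      // 60M ops/sec design ceiling
const updateBytes = 100;
const bandwidthGBs = (opsPerSec * updateBytes) / 1e9;    // 6 GB/sec steady state
const connsPerServer = 50e3;
const wsServers = connections / connsPerServer;          // 600 WebSocket servers
const docsPerSyncInstance = 10e3;
const syncInstances = activeDocs / docsPerSyncInstance;  // 1,000 sync instances

console.log({ connections, opsPerSec, bandwidthGBs, wsServers, syncInstances });
```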
7. Data Model
7.1 Document Tree (ProseMirror Schema)
The document is a tree, not a flat string. ProseMirror defines a schema of allowed node types and their nesting rules:
json"type": "doc", "content": [ { "type": "heading", "attrs": { "level": 1 }, "content": [ { "type": "text", "text": "Project Roadmap" } ] }, { "type": "paragraph", "content": [ { "type": "text", "text": "The first milestone ", "marks": [{ "type": "bold" }] }, { "type": "text", "text": "is due next week." } ] } ] }
Yjs maps this tree structure using Y.XmlFragment (for block nodes) and Y.Text (for inline text with marks). Each node in the ProseMirror tree corresponds to a CRDT type in Yjs, enabling per-character and per-node conflict resolution.
7.2 CRDT Operation Format
Each Yjs item (the fundamental unit of the CRDT) carries:
{
id: { client: 42, clock: 157 }, // globally unique
originLeft: { client: 42, clock: 156 }, // left neighbor at insertion time
originRight: { client: 7, clock: 89 }, // right neighbor at insertion time
parent: "paragraph_node_id", // which Y.XmlFragment this belongs to
content: "hello" // the actual text (run-length encoded)
}
Binary encoding: Yjs uses a custom binary format that encodes items compactly. A single character insert is ~20-30 bytes. A run of 100 characters by the same user is ~120 bytes (not 2,000-3,000). Deleted items are tombstoned: the content field is cleared but the ID and origin pointers are retained for ordering.
7.3 Storage Schema (PostgreSQL)
```sql
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    title TEXT NOT NULL,
    owner_id UUID NOT NULL REFERENCES users(id),
    permission_default TEXT DEFAULT 'viewer', -- viewer, commenter, editor
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Yjs document snapshots (periodic full state)
CREATE TABLE document_snapshots (
    id BIGSERIAL PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id),
    snapshot_blob BYTEA NOT NULL, -- full Yjs binary state
    version BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- Operation log (every Yjs update appended here)
CREATE TABLE operation_log (
    id BIGSERIAL PRIMARY KEY,
    document_id UUID NOT NULL,
    client_id BIGINT NOT NULL,
    op_binary BYTEA NOT NULL, -- Yjs binary update
    version BIGINT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
) PARTITION BY HASH (document_id); -- sharded by document

-- Document sharing / ACL
CREATE TABLE document_shares (
    document_id UUID NOT NULL REFERENCES documents(id),
    user_id UUID NOT NULL REFERENCES users(id),
    permission TEXT NOT NULL, -- owner, editor, commenter, viewer
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (document_id, user_id)
);
```
7.4 Media Schema
S3 key pattern: media/{document_id}/{media_id}.{extension}
Media is referenced in the document tree as an image or file node:
json"type": "image", "attrs": { "src": "media/doc-abc/img-123.png", "width": 800, "height": 600, "alt": "Architecture diagram" } }
The client resolves the S3 key to a CDN URL at render time.
8. API Design
8.1 REST Endpoints
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/documents | Create a new document |
| GET | /api/documents/:id | Get document metadata (title, owner, permissions) |
| PATCH | /api/documents/:id | Update document metadata (title, settings) |
| DELETE | /api/documents/:id | Soft-delete a document |
| POST | /api/documents/:id/share | Add/update sharing permissions |
| DELETE | /api/documents/:id/share/:userId | Revoke a share |
| GET | /api/documents/:id/history | List version history (snapshots + named versions) |
| GET | /api/documents/:id/history/:version | Get document content at a specific version |
| POST | /api/documents/:id/history | Create a named version (user-triggered save point) |
| POST | /api/documents/:id/media | Upload media (returns presigned S3 URL) |
| GET | /api/search?q=term | Cross-document full-text search (ACL-filtered) |
8.2 WebSocket Protocol
Connection: wss://sync.example.com/doc/:id?token=<JWT>
The WebSocket connection follows the Yjs sync protocol:
| Message Type | Direction | Payload | Purpose |
|---|---|---|---|
| sync-step-1 | Client → Server | Client's state vector (binary) | "Here is what I have" |
| sync-step-2 | Server → Client | Missing updates (binary) | "Here is what the client is missing" |
| update | Bidirectional | Yjs binary update | Incremental edit broadcast |
| awareness | Bidirectional | { clientId, user, cursor, selection, color } | Presence and cursor tracking |
| token-refresh | Client → Server | New JWT | Client sends a fresh JWT before the current token expires. Server validates and updates the session without disconnecting |
| ping | Client → Server | (empty) | Heartbeat every 30s |
| pong | Server → Client | (empty) | Heartbeat response |
Initial sync flow:
- Client sends sync-step-1 with its state vector
- Server computes the diff and responds with sync-step-2 containing all missing updates
- Client sends sync-step-1 back (the server now acts as client to receive any updates the client has that the server is missing)
- Connection enters steady state: incremental update messages flow bidirectionally
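To make the state-vector diff concrete, here is a toy version in plain JavaScript. The op-log entries and vector shapes are illustrative; real Yjs exchanges compact binary encodings, not JSON:

```js
// Sketch of sync-step-1/2: the server compares the client's state vector
// against its op log and returns only the ops the client has not yet seen.
function missingOps(opLog, clientVector) {
  return opLog.filter(
    (op) => op.id.clock > (clientVector[op.id.client] ?? -1)
  );
}

const opLog = [
  { id: { client: 1, clock: 0 }, content: "H" },
  { id: { client: 1, clock: 1 }, content: "i" },
  { id: { client: 2, clock: 0 }, content: "!" },
];

// Client has seen client 1 up to clock 0 and nothing from client 2.
const diff = missingOps(opLog, { 1: 0 });
console.log(diff.map((op) => op.content)); // ["i", "!"]
```

The same function run in the other direction (server's vector against the client's local ops) yields the client's offline edits, which is how reconnection sync works with no extra machinery.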
9. Deep Dives
9.1 WebSocket Connection Lifecycle
Connection flow:
Reconnection: On disconnect, the client uses exponential backoff with jitter (initial delay 1s, max 30s, jitter factor 0.5). On reconnect, the client resumes from its last known state vector. The Yjs sync protocol handles catching up automatically. No operations are lost because the local Yjs document retains all state in IndexedDB.
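The backoff parameters above (1s initial, 30s cap, 0.5 jitter factor) can be sketched as a pure function. This is one common reading of a 0.5 jitter factor (up to +50% of the base delay, never exceeding the cap); the function name is illustrative:

```js
// Exponential backoff with jitter for WebSocket reconnection attempts.
function reconnectDelay(attempt, rand = Math.random()) {
  const cap = 30000;                 // 30s maximum
  const base = 1000 * 2 ** attempt;  // 1s, 2s, 4s, 8s, ...
  // jitter factor 0.5: add up to +50% of the base delay, capped at 30s
  return Math.min(base * (1 + 0.5 * rand), cap);
}

// attempt 0 → 1-1.5s, attempt 3 → 8-12s, attempt 10 → capped at 30s
console.log(reconnectDelay(0, 0));  // 1000
console.log(reconnectDelay(10, 1)); // 30000
```

Jitter matters here: without it, a sync server crash makes every client of that server reconnect at the same instant, producing a thundering herd.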
Connection routing: The load balancer uses consistent hashing on document_id to route all editors of a single document to the same sync server instance. This avoids cross-server coordination for most operations. When a sync server fails, connections rehash to a different instance, which loads the document from persistence.
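The routing property described above — every load balancer independently maps a document_id to the same sync server, and a server failure remaps only that server's documents — can be sketched with rendezvous (highest-random-weight) hashing, a close cousin of consistent hashing that needs less code. The hash function and names below are illustrative:

```js
// FNV-1a string hash (illustrative; any well-mixed hash works here).
function hash(str) {
  let h = 2166136261;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// Rendezvous hashing: each document goes to the server with the highest
// combined hash. All gateways compute the same answer with no shared state.
function routeDocument(documentId, servers) {
  return servers.reduce((best, s) =>
    hash(documentId + "|" + s) > hash(documentId + "|" + best) ? s : best
  );
}

const servers = ["sync-1", "sync-2", "sync-3"];
const chosen = routeDocument("doc-abc", servers);
// All editors of "doc-abc" deterministically land on the same instance.
console.log(chosen === routeDocument("doc-abc", servers)); // true
```

When a server is removed from the list, only documents whose winner was that server get remapped; everything else keeps its existing placement.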
Heartbeat: Client sends a ping every 30 seconds. Server responds with pong. If no pong within 90 seconds, the client treats the connection as dead and initiates reconnection. The server evicts clients that miss 3 consecutive heartbeats.
9.2 CRDT Sync Protocol (Yjs Internals)
The state vector and sync protocol are described in Section 4.3.3. This section covers the implementation-level details that matter for production systems.
Incremental updates: After initial sync, every local edit is immediately encoded as a binary Yjs update and sent to the server. The server persists it, then broadcasts it to all other connected clients for that document. Each update is typically 20-100 bytes.
Document encoding efficiency: Yjs binary encoding typically lands at 1.5-3x the plain text size for a typical document. A 10 KB plain text document is 15-30 KB as a Yjs binary — remarkably compact for a structure that stores a unique ID and origin pointers for every insertion. Three optimizations make this possible:
- Run-length encoding for consecutive edits by the same client (typing "hello" stores as one item, not five)
- Delta encoding for lamport clocks (store increments, not absolute values)
- Compact varint encoding for IDs and positions
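The varint idea is worth seeing concretely. Below is a LEB128-style encoder/decoder in the spirit of what Yjs's codec does — illustrative, not the exact Yjs wire format:

```js
// Variable-length integer encoding: small numbers take one byte,
// larger numbers grow by 7 payload bits per byte.
function encodeVarint(n) {
  const bytes = [];
  do {
    let b = n & 0x7f;
    n >>>= 7;
    if (n !== 0) b |= 0x80; // continuation bit: more bytes follow
    bytes.push(b);
  } while (n !== 0);
  return bytes;
}

function decodeVarint(bytes) {
  let n = 0, shift = 0;
  for (const b of bytes) {
    n |= (b & 0x7f) << shift;
    shift += 7;
    if ((b & 0x80) === 0) break;
  }
  return n >>> 0;
}

console.log(encodeVarint(157));               // [157, 1] — two bytes, not four or eight
console.log(decodeVarint(encodeVarint(157))); // 157
```

Combined with delta encoding, most clocks in a Yjs update encode as a single byte, which is where much of the 20-30 bytes-per-insert figure comes from.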
Tombstone garbage collection: When a character is deleted, Yjs tombstones it (marks it as deleted but retains the ID and origin pointers). Tombstones accumulate over time. Yjs can garbage-collect tombstones when all connected clients have integrated the deletion. In practice, GC runs during snapshot creation: the snapshot stores only live content plus a compressed tombstone summary. Clients that reconnect after a GC cycle receive the full snapshot instead of incremental updates.
9.3 Conflict Resolution Deep Dive
Scenario: Three users edit the same sentence simultaneously.
Document: "The quick brown fox". Alice, Bob, and Carol all edit at the same time.
- Alice inserts " lazy" before "fox", targeting position between "brown " and "fox"
- Bob deletes "quick " (characters 4-9)
- Carol bolds the word "brown"
Each operation in CRDT terms:
- Alice's insert: Creates items with originLeft pointing to the space after "brown" and originRight pointing to 'f' in "fox". These pointers are stable item IDs, not integer positions.
- Bob's delete: Tombstones the items corresponding to "quick ". The items still exist in the CRDT with their IDs and origin pointers intact, but their content is marked as deleted.
- Carol's formatting: Uses the Peritext approach. The bold mark starts at the 'b' in "brown" with an "expand" boundary (new characters typed at the end inherit the mark) and ends after 'n' with a "no-expand" boundary.
Merge result regardless of arrival order: "The brown lazy fox" with "brown" in bold. Alice's "lazy" appears because its origin pointers still reference live items. Bob's deletion removes "quick " without affecting Alice's or Carol's operations. Carol's bold mark applies to "brown" even though Bob's delete changed the text around it. Zero transforms needed. Each operation references stable item IDs, so the merge function handles all three independently.
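The whole merge rests on one mechanism: deterministic integration of items by stable IDs. Here is a deliberately minimal sequence-CRDT sketch in plain JavaScript — RGA-style rather than full YATA (no originRight, no deletes), just enough to show order-independent convergence. All names are illustrative:

```js
const idEq = (a, b) => a.client === b.client && a.clock === b.clock;

// Insert `item` into `items` after its originLeft. Concurrent siblings that
// share the same origin are ordered by client ID, so every replica agrees.
function integrate(items, item) {
  const originIdx = item.originLeft
    ? items.findIndex((i) => idEq(i.id, item.originLeft))
    : -1;
  let idx = originIdx + 1;
  while (
    idx < items.length &&
    item.originLeft && items[idx].originLeft &&
    idEq(items[idx].originLeft, item.originLeft) &&
    items[idx].id.client < item.id.client
  ) idx++;
  items.splice(idx, 0, item);
}

const A = { id: { client: 1, clock: 0 }, originLeft: null, content: "A" };
const X = { id: { client: 1, clock: 1 }, originLeft: A.id, content: "X" }; // Alice
const Y = { id: { client: 2, clock: 1 }, originLeft: A.id, content: "Y" }; // Bob

const render = (items) => items.map((i) => i.content).join("");

// Replica 1 receives Alice's insert first; replica 2 receives Bob's first.
const r1 = [A]; integrate(r1, X); integrate(r1, Y);
const r2 = [A]; integrate(r2, Y); integrate(r2, X);
console.log(render(r1), render(r2)); // AXY AXY — identical, no transforms
```

Real YATA additionally uses originRight and a subtler conflict-resolution rule to avoid interleaving longer concurrent runs, but the convergence argument is the same: same items plus same deterministic ordering equals same document.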
9.4 Presence and Cursor Tracking
Yjs awareness protocol: Each connected client publishes an awareness state object:
json"clientId": 42, "user": { "name": "Alice", "color": "#FF6B6B" }, "cursor": { "anchor": { "type": "relative", "item": "(155, alice)", "assoc": -1 }, "head": { "type": "relative", "item": "(160, alice)", "assoc": 1 } } }
assoc indicates cursor affinity: -1 associates with the character to the left (end of a word), 1 with the character to the right (start of a word). This determines where new text appears when another user inserts at the same position.
Cursor positions use Yjs relative positions, not integer offsets. A relative position references a specific CRDT item ID. When remote edits insert or delete text around the cursor, the cursor position remains correct because it is anchored to an item, not an index. In OT, cursor positions must be transformed against every incoming operation.
Throttling: Awareness updates are sent at most every 50ms (20 updates/sec). This limits bandwidth while keeping cursor movement smooth. At 100 editors, each cursor move generates 99 broadcasts. At 50ms throttling, that is 99 × 20 = 1,980 presence messages/sec per document. Small payload (~200 bytes each), so ~400 KB/sec per document. Manageable.
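The 50ms throttle can be sketched as a small gate function; the clock is injected so the behavior is deterministic and testable. Names are illustrative:

```js
// At most one awareness broadcast per interval; updates inside the window
// are dropped (a real implementation would keep the latest suppressed
// update and flush it when the window reopens).
function makeThrottle(intervalMs, clock) {
  let lastSent = -Infinity;
  return () => {
    if (clock.now - lastSent < intervalMs) return false; // drop
    lastSent = clock.now;
    return true; // forward this awareness update
  };
}

const clock = { now: 0 };
const shouldSend = makeThrottle(50, clock);

clock.now = 0;  console.log(shouldSend()); // true  — first update goes out
clock.now = 20; console.log(shouldSend()); // false — inside the 50ms window
clock.now = 60; console.log(shouldSend()); // true  — window elapsed
```

Flushing the last suppressed update matters in practice: without it, a cursor that stops moving mid-window would be rendered at a stale position on every other screen.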
Cross-server relay: When editors of the same document are split across multiple sync server instances (due to load balancer rebalancing or failover), presence updates are relayed via Redis Pub/Sub. Each sync server subscribes to a presence channel keyed by document ID.
Ephemeral: Presence data is never persisted. If a sync server restarts, connected clients re-publish their awareness state within one heartbeat cycle (30s). Disconnected clients are automatically removed after the heartbeat timeout (90s).
9.5 Offline Editing and Sync
Client-side persistence: The Yjs document state is continuously persisted to IndexedDB. Every local edit updates IndexedDB within a debounced window (100ms). On page reload or app restart, the Yjs document is reconstructed from IndexedDB, not fetched from the server.
Offline workflow:
- WebSocket disconnects (network loss, airplane mode, etc.)
- The editor remains fully functional. Every keystroke creates CRDT items in the local Yjs document.
- IndexedDB stores the growing document state.
- A local queue tracks which updates have not been sent to the server.
Reconnection:
- WebSocket reconnects.
- Yjs sync protocol resumes: client sends its state vector, server responds with missed updates, client sends its offline edits.
- Both sides merge automatically using CRDT rules.
- Within one sync round-trip, the client and server converge.
Conflict visibility: When Alice reconnects after a long offline period and other users have edited the same regions she edited, the CRDT merge produces a technically correct result. But the merged text might not be semantically meaningful (both Alice and Bob rewrote the same sentence differently). The client can detect "conflicting regions" by comparing the merged document against the offline-start version and highlighting areas that changed from both local and remote edits. This is a UI hint, not a merge failure.
Edge case: Two users both offline for hours, both extensively editing the same paragraph. CRDT convergence is guaranteed, but the result interleaves their edits at the character level. Mitigation: the presence system shows that the other user was recently active in the same region, discouraging simultaneous offline edits. Post-merge, the document shows a clear diff of what changed, letting users clean up the result manually.
9.6 Version History and Snapshots
Operation log: Every Yjs update (binary blob, typically 20-100 bytes) is appended to the operation_log table in PostgreSQL. This is an append-only log, partitioned by document_id for write distribution.
Snapshots: A full Yjs document binary (the complete CRDT state) is saved to document_snapshots:
- Every 500 operations (with a minimum 60-second cooldown per document), or
- Every 5 minutes during active editing, or
- On explicit user action ("save as named version")
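The three triggers combine into a single decision function. A sketch, with the thresholds taken from the policy above and an illustrative name:

```js
// Snapshot policy: 500 ops (with a 60s cooldown), 5 minutes of active
// editing, or an explicit "save as named version" — whichever fires first.
function shouldSnapshot({ opsSinceSnapshot, msSinceSnapshot, userRequested }) {
  if (userRequested) return true;                                // named version
  if (opsSinceSnapshot >= 500 && msSinceSnapshot >= 60_000) return true;
  if (msSinceSnapshot >= 300_000 && opsSinceSnapshot > 0) return true;
  return false;
}

console.log(shouldSnapshot({ opsSinceSnapshot: 600, msSinceSnapshot: 30_000 }));  // false: cooldown
console.log(shouldSnapshot({ opsSinceSnapshot: 600, msSinceSnapshot: 90_000 }));  // true: op threshold
console.log(shouldSnapshot({ opsSinceSnapshot: 10, msSinceSnapshot: 310_000 }));  // true: 5-min timer
```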
A snapshot captures the entire document state at a point in time. Loading a snapshot reconstructs the full Yjs document without replaying any operations.
Compaction: Operation log entries older than the most recent snapshot can be archived to S3 (cold storage). The hot operation log in PostgreSQL only needs to contain entries since the last snapshot. This bounds PostgreSQL storage growth.
Version browsing: To reconstruct the document at any past version:
- Load the nearest snapshot before the target version.
- Replay operations from the operation log forward to the target version.
- Render the resulting Yjs document state in a read-only editor.
Same pattern as database point-in-time recovery: the snapshot is a base backup, the operation log is the WAL.
Named versions: Users can trigger "Save version" (like Google Docs "Name this version"). This creates a snapshot with user-provided metadata (name, description). Named versions are never compacted.
Undo/redo: Yjs provides UndoManager, which tracks items created or deleted by the local user. Undo generates inverse operations (re-insert deleted items, delete inserted items) and applies them as new CRDT operations. This means undo is per-user: Alice undoing her last action does not affect Bob's edits. The undo history is local to each client and not persisted.
9.7 Search and Indexing
In-document search (Ctrl+F): Purely client-side. The TipTap editor searches the local ProseMirror document state using standard text matching. No server involvement. Fast regardless of document size because the full document is already in memory.
Cross-document search: Elasticsearch indexes document content for full-text search across all documents.
Indexing pipeline:
- A Yjs update arrives at the sync server.
- The server debounces (waits 30 seconds of inactivity or a maximum of 2 minutes after the first change).
- The server extracts plain text from the current Yjs document state.
- The plain text is indexed to Elasticsearch with metadata: { document_id, title, owner_id, updated_at }.
ACL-aware search: Search queries include the requesting user's ID. Elasticsearch filters results against precomputed access lists. Each document's index entry includes a list of user IDs and group IDs that have at least viewer permission. The query adds a filter: user_id IN accessible_ids OR group_id IN user_groups.
Incremental vs. full reindexing: Incremental reindexing (update only the changed document) handles steady-state traffic. A full reindex job runs weekly to catch any missed updates (safety net). The full reindex reads snapshots from PostgreSQL, extracts text, and bulk-indexes to Elasticsearch.
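The debounce-with-max-wait rule from the pipeline above reduces to two comparisons. A sketch with an illustrative name:

```js
// Reindex after 30s of editing inactivity, but never wait more than
// 2 minutes past the first unindexed change (max-wait safety cap).
function shouldReindex({ msSinceLastEdit, msSinceFirstUnindexedEdit }) {
  if (msSinceLastEdit >= 30_000) return true;            // editing went quiet
  if (msSinceFirstUnindexedEdit >= 120_000) return true; // cap for busy documents
  return false;
}

console.log(shouldReindex({ msSinceLastEdit: 31_000, msSinceFirstUnindexedEdit: 31_000 }));  // true
console.log(shouldReindex({ msSinceLastEdit: 5_000, msSinceFirstUnindexedEdit: 60_000 }));   // false
console.log(shouldReindex({ msSinceLastEdit: 5_000, msSinceFirstUnindexedEdit: 125_000 }));  // true
```

Without the max-wait cap, a document under continuous heavy editing would never go quiet and never get reindexed.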
9.8 Rich Text and Document Schema
ProseMirror schema defines the allowed structure:
Block nodes: doc, paragraph, heading (levels 1-6), blockquote, code_block, image, table, table_row, table_cell, table_header, bullet_list, ordered_list, list_item, horizontal_rule
Marks (inline formatting): bold, italic, underline, strikethrough, link (with href), code, comment (with metadata)
Schema enforcement: ProseMirror validates every transaction (insert, delete, format) against the schema before it is applied. Invalid operations (e.g., a heading inside a table cell that does not allow headings) are rejected client-side before reaching the CRDT layer. This prevents malformed document states.
Defense in depth: Client-side validation is not sufficient because a malicious client can bypass ProseMirror and send crafted Yjs binary updates directly. The sync server validates incoming updates against a schema allowlist before relaying or persisting them. Raw HTML nodes, script tags, and event handler attributes are stripped. This prevents XSS injection via the CRDT layer.
Tables: Each table cell contains a mini-document (a Y.XmlFragment). Concurrent edits to different cells are fully independent. Structural changes (add row, delete column) are atomic operations that modify the table node itself. Concurrent structural changes (Alice adds a row, Bob deletes a column) are resolved by CRDT ordering: both operations apply, producing a table with the new row and without the deleted column.
Images: Upload flow:
- Client requests a presigned S3 upload URL via POST /api/documents/:id/media.
- Client uploads the image directly to S3 using the presigned URL.
- Client inserts an image node into the ProseMirror document referencing the S3 key.
- Other clients see the image node immediately (the S3 key resolves to a CDN URL).
Comments: Stored as marks on text ranges. Each comment mark carries metadata:
json"type": "comment", "attrs": { "commentId": "cmt-456", "author": "alice", "createdAt": "2026-03-10T14:22:00Z", "resolved": false } }
Comment positions survive concurrent edits because the mark is attached to specific CRDT items (characters), not integer ranges. If the commented text is deleted, the comment mark is tombstoned with the text. The comment thread is preserved in PostgreSQL (linked by commentId) and can still be viewed in the document's comment history, even if its anchor text is gone. If the deletion is undone, the comment mark reappears on the restored text.
Comment thread storage:
```sql
CREATE TABLE comments (
    id UUID PRIMARY KEY,
    document_id UUID NOT NULL REFERENCES documents(id),
    comment_id TEXT NOT NULL, -- matches commentId in the ProseMirror mark
    parent_id UUID REFERENCES comments(id), -- NULL for top-level, set for replies
    author_id UUID NOT NULL REFERENCES users(id),
    body TEXT NOT NULL,
    resolved BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
```
API endpoints:
| Method | Endpoint | Description |
|---|---|---|
| POST | /api/documents/:id/comments | Create a comment (inserts mark into CRDT + row into PostgreSQL) |
| GET | /api/documents/:id/comments | List all comment threads for a document |
| POST | /api/documents/:id/comments/:commentId/reply | Reply to a comment thread |
| PATCH | /api/documents/:id/comments/:commentId/resolve | Resolve or reopen a comment thread |
Notifications: When a comment mentions another user (@alice), the API service publishes a notification event. Comment creation and resolution events are also candidates for webhook delivery to external integrations (Slack, email, Zapier).
9.9 Multi-Tenancy and Rate Limiting
Tenant isolation: Each document is a separate Yjs document instance. There is no shared CRDT state across documents. ACLs are enforced at the sync server: before relaying any Yjs update, the server verifies the sending client has editor permission on the document. Viewers receive updates but cannot send them.
Rate limiting:
| Limit | Value | Enforcement Point |
|---|---|---|
| Operations per user per second | 100 | Sync server |
| WebSocket messages per second | 200 | WebSocket gateway |
| Document size (characters) | 1.5M | Sync server (reject ops exceeding limit) |
| Editors per document | 100 | Sync server (reject connections) |
| Viewers per document | 200 | Sync server (reject connections) |
| Media upload size | 25 MB | Media service |
| Media uploads per document per hour | 50 | Media service |
Document size enforcement: The sync server tracks the approximate character count of each active document. When an incoming update would push the document beyond 1.5M characters, the server rejects it and sends an error to the client. The client displays a "document size limit reached" warning.
Abuse protection: If a single client exceeds the operation rate limit, the server drops excess operations and sends a rate-limit warning. Persistent abuse (10+ warnings in 1 minute) triggers a temporary connection ban (5 minutes).
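The op rate limit and ban escalation above can be sketched as a token bucket with a sliding warning window. This is a minimal sketch under the stated limits (100 ops/sec, 10 warnings in 1 minute, 5-minute ban); the class and method names are assumptions, not the real server's API.

```typescript
// Per-client limiter sketch: token bucket for the 100 ops/sec cap, plus a
// warning counter that escalates to a temporary connection ban.
class OpLimiter {
  private tokens: number;
  private lastRefill: number;
  private warnings: number[] = []; // timestamps of recent rate-limit warnings
  bannedUntil = 0;

  constructor(private ratePerSec = 100, private burst = 100, now = Date.now()) {
    this.tokens = burst;
    this.lastRefill = now;
  }

  // Returns true if the operation may proceed, false if it must be dropped.
  tryConsume(now = Date.now()): boolean {
    if (now < this.bannedUntil) return false;
    // Refill proportionally to elapsed time, capped at the burst size.
    const elapsedSec = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.burst, this.tokens + elapsedSec * this.ratePerSec);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    // Over the limit: record a warning; 10+ warnings in 1 minute => 5-minute ban.
    this.warnings = this.warnings.filter((t) => now - t < 60_000);
    this.warnings.push(now);
    if (this.warnings.length >= 10) this.bannedUntil = now + 5 * 60_000;
    return false;
  }
}
```

Injecting `now` keeps the limiter deterministic and unit-testable; in production the timer source would simply default to the wall clock.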
9.10 GDPR and Data Deletion
CRDT items permanently embed the originating client ID. When a user requests account deletion under GDPR (right to erasure), the system must anonymize their contributions without breaking CRDT invariants. The deletion pipeline rewrites every snapshot and op-log entry that contains the user's client ID, replacing it with anon-<hash>. The CRDT only requires unique IDs, not identifiable ones, so document structure and merge correctness are preserved. Rewriting snapshots is expensive: every snapshot containing the user's edits must be re-serialized. Deletions are batched and processed nightly. During the processing window, the user's data is marked as "pending deletion" and excluded from search results and API responses.
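The anonymization step can be sketched with a deterministic one-way hash, assuming Node's built-in crypto module. The CRDT only needs IDs to stay unique and stable, so hashing every occurrence of the client ID the same way preserves merge correctness while removing identifiability. The function name and the `deletionSalt` parameter are illustrative assumptions.

```typescript
import { createHash } from "node:crypto";

// Same input always maps to the same anon ID, so every occurrence of the
// user's client ID across snapshots and op-log entries rewrites consistently.
function anonymizeClientId(clientId: string, deletionSalt: string): string {
  const digest = createHash("sha256")
    .update(deletionSalt + ":" + clientId)
    .digest("hex");
  return "anon-" + digest.slice(0, 16); // anon-<hash>, truncated for readability
}
```

A per-deletion salt (discarded once the nightly batch completes) would make the mapping irreversible even against brute-force guessing of client IDs.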
9.11 Undo/Redo
Yjs provides an UndoManager that scopes undo to the local user's operations. Each client maintains its own undo stack, tracking which CRDT items were created or deleted by that user. Undoing an insert tombstones the inserted items; undoing a delete restores the tombstoned items. Other users' concurrent edits are unaffected. Undoing Alice's insertion does not touch Bob's edits, even if they are adjacent. The undo stack is stored client-side and survives page reloads via IndexedDB. In the collaborative context, this means each user has independent undo history: pressing Ctrl+Z only reverts your own changes, never a collaborator's.
9.12 AI-Assisted Editing
Every major collaborative editor now ships AI features: Google Docs "Help me write," Notion AI, Microsoft Loop Copilot. In a CRDT-based architecture, AI integration is architecturally clean because the AI is just another client.
How it works:
- User triggers an AI action (slash command `/ai`, toolbar button, or inline suggestion prompt).
- The client sends the request to an AI service via REST: `POST /api/documents/:id/ai` with the prompt, selected text range, and surrounding context. The client includes Yjs relative positions for the selection boundaries so the server can map them to the current document state.
- The AI service calls the LLM (streaming response). As tokens arrive, the service inserts them into the Yjs document as a dedicated AI client (with its own `clientId`). The AI's edits are CRDT operations like any other, meaning they merge correctly with concurrent human edits.
- Other editors see the AI-generated text appearing in real time, with an "AI writing" indicator in the presence system (the AI client has a distinct awareness state: `{ user: { name: "AI Assistant", isAI: true } }`).
- The user can accept, reject, or edit the AI output. Rejecting is an undo of the AI client's operations via `UndoManager` scoped to the AI's `clientId`.
Rate limiting: AI requests are rate-limited per user (5 requests/minute) and per document (20 requests/minute) to manage LLM costs. Long-running generations are capped at 60 seconds.
Why this works cleanly with CRDTs: The AI does not need special merge logic. Its edits have unique CRDT IDs like any other client. If the user types while the AI is generating, both sets of edits merge automatically. In an OT system, the AI's streaming insertions would need to be transformed against every concurrent human edit in real time, adding significant complexity.
10. Identify Bottlenecks
10.1 WebSocket Connection Thundering Herd
Symptoms: During a deployment or sync server restart, all connected clients disconnect and reconnect simultaneously. 50,000 connections hitting a single server in a 5-second window causes CPU saturation from TLS handshakes and Yjs sync-step-1 processing.
Mitigation: Graceful drain during deployments (stop accepting new connections, wait 60s for existing ones to close naturally). Client-side reconnection uses jitter (random delay 0-5s) to spread the reconnection wave. The load balancer distributes reconnections across healthy instances.
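The client-side reconnection delay can be sketched as exponential backoff plus the 0-5s jitter described above. The function name, base delay, and 30s cap are illustrative assumptions; only the jitter range comes from the text.

```typescript
// Reconnection delay sketch: exponential backoff with random jitter so a
// restarted server is not hit by all clients in the same instant.
function reconnectDelayMs(attempt: number, rand: () => number = Math.random): number {
  const base = Math.min(30_000, 1_000 * 2 ** attempt); // 1s, 2s, 4s... capped at 30s
  const jitter = rand() * 5_000; // 0-5s spread to break up the thundering herd
  return base + jitter;
}
```

Passing the random source as a parameter keeps the function deterministic in tests while defaulting to `Math.random` in production.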
10.2 Sync Server Memory Pressure
Symptoms: A document with 100 active editors and 1.5M characters consumes ~50 MB of RAM as a Yjs document instance. If the sync server holds 10,000 active documents and several are large, memory usage spikes past the instance limit.
Mitigation: Evict idle documents from memory after 5 minutes of no activity (reload from snapshot on next access). Set a per-document memory cap (100 MB). For extremely large documents, stream the Yjs state from Redis instead of holding it entirely in process memory. Monitor per-instance document count and memory, auto-scale sync server instances based on memory pressure.
10.3 Operation Log Write Throughput
Symptoms: At 60M ops/sec globally (design ceiling; realistic steady state is 15-25M), PostgreSQL must sustain up to 60M inserts/sec into the operation log. A single PostgreSQL instance handles ~50K inserts/sec at most.
Mitigation: The operation log table is hash-partitioned by document_id across 100+ PostgreSQL shards, bringing each shard down to ~600K rows/sec. That still exceeds row-at-a-time insert capacity, so the sync server batches writes: it buffers operations for 100ms and inserts them in bulk (500-1,000 rows per batch insert), which turns ~600K rows/sec into roughly 600-1,200 bulk inserts/sec per shard, well within capacity.
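The buffering half of this mitigation can be sketched as a small batcher. This is a storage-agnostic sketch: `OpBatcher` and its callback are assumptions, and the real server would wire `flush()` to a 100ms timer and a bulk `INSERT`.

```typescript
// Write-buffer sketch: operations accumulate and are flushed as one bulk
// insert per batch instead of one insert per row.
type Op = { docId: string; payload: Uint8Array };

class OpBatcher {
  private buffer: Op[] = [];

  constructor(
    private flushFn: (batch: Op[]) => void, // e.g. a multi-row INSERT
    private maxBatch = 1000,
  ) {}

  add(op: Op): void {
    this.buffer.push(op);
    if (this.buffer.length >= this.maxBatch) this.flush(); // size-triggered flush
  }

  // Called by a 100ms timer in the real server; exposed here for clarity.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.flushFn(batch); // one bulk write instead of batch.length row writes
  }
}
```

Flushing on whichever comes first, size cap or timer, bounds both latency (100ms worst case) and memory (maxBatch rows per document buffer).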
10.4 Snapshot Computation Cost
Symptoms: Serializing a 1.5M-character Yjs document into a binary snapshot takes 50-200ms of CPU time. If 10,000 documents simultaneously trigger a snapshot (e.g., all hitting the 5-minute timer), the sync server stalls.
Mitigation: Stagger snapshot timers with random jitter (5 minutes ± 60 seconds). Run snapshot serialization on a background thread (Node.js worker thread) to avoid blocking the event loop. Limit concurrent snapshots to 10 per sync server instance.
10.5 Elasticsearch Indexing Lag
Symptoms: During heavy editing, documents change faster than the 30-second debounce window. Search results show stale content. Users search for text they just typed and find nothing.
Mitigation: The 30-second debounce is a deliberate tradeoff (search freshness vs. indexing load). For most use cases, a 30-second delay is acceptable. For users who need instant search, in-document Ctrl+F (client-side) provides real-time results. Cross-document search is expected to be near-real-time, not real-time.
10.6 Redis Pub/Sub Fan-Out for Presence
Symptoms: 100 editors per document, each sending cursor updates at 20/sec. Each update fans out to 99 other editors: 2,000 publishes/sec and nearly 200,000 fan-out deliveries/sec for a single fully loaded document. Across 10M active documents (average 3 editors): ~1.2M presence messages/sec through Redis.
Mitigation: Presence is fan-out at the sync server level, not Redis level. Each sync server subscribes once per document to a Redis channel. The server handles local fan-out to its connected clients. Redis sees one publish and N subscriber deliveries (where N = number of sync server instances hosting that document, typically 1-3, not 99).
10.7 S3 Media Upload Latency
Symptoms: User inserts an image. The document references the S3 key immediately, but the image is not yet uploaded. Other editors see a broken image placeholder for 2-5 seconds until the upload completes.
Mitigation: The client shows a local preview (from the file picker) while uploading. The S3 key is not inserted into the document until the upload completes successfully. Other editors see the image node appear only after it is available. Upload progress is shown to the inserting user.
10.8 Cross-Region WebSocket Latency
Symptoms: Alice in London edits a document hosted on a sync server in US-East. Her operations travel across the Atlantic (70-100ms one way). Bob in New York sees Alice's edits in ~70ms, but Alice sees Bob's edits in ~140ms (round trip through the server). The 300ms cross-region SLA is met, but the experience feels sluggish.
Mitigation: Route documents to the region where the majority of active editors are located. If Alice is the only London-based editor and 9 others are in New York, the document stays in US-East. If the editing pattern shifts (London team takes over), migrate the document's primary sync server to EU-West. Migration: snapshot the Yjs state, load it on the new region's sync server, redirect WebSocket connections.
11. Failure Scenarios
11.1 Sync Server Crash
Impact: All active documents on that instance lose their in-memory Yjs state. Connected clients disconnect.
Recovery:
- The load balancer detects the failed instance within 10 seconds (health check failure).
- Reconnecting clients are routed to a healthy sync server instance (consistent hashing rehash).
- The new instance loads the latest snapshot from PostgreSQL and replays operations from the operation log since that snapshot.
- Clients run the Yjs sync protocol to converge.
- Zero user-visible data loss. The operation log contains all persisted updates. If the crashed server had buffered operations in its 100ms batch window, those operations survive in the clients' local Yjs documents (IndexedDB) and are resent on reconnect.
Time to recover: 15-30 seconds. Clients experience a brief disconnection, then resume editing.
11.2 Redis Failure
Impact: Hot document state cache is unavailable. Presence data is lost. Cross-server presence relay stops working.
Recovery:
- Sync servers fall back to loading Yjs state from PostgreSQL snapshots (slower, ~500ms vs. ~5ms from Redis).
- Presence stops updating across servers. Clients on the same sync server still see each other's cursors (local awareness still works).
- Redis recovers (failover to a replica or restart). Sync servers re-cache active documents. Clients re-publish awareness state.
Time to recover: With Redis Sentinel or Cluster, automatic failover takes 10-30 seconds. Presence gap during this window.
11.3 PostgreSQL Failure
Impact: Operation log writes fail. No new snapshots can be saved. Document metadata and ACLs are unavailable for new connections.
Recovery:
- Sync servers buffer incoming operations in memory (up to 10,000 per document or 5 minutes, whichever comes first).
- Active editing sessions continue because the Yjs document is held in memory on the sync server. No immediate user impact for already-connected editors.
- New connections fail (cannot load document or verify permissions). The client shows "temporarily unavailable" and retries.
- PostgreSQL recovers (failover to standby replica). Sync servers flush buffered operations to the operation log. Normal operation resumes.
Time to recover: Depends on PostgreSQL HA setup. With synchronous replication and automatic failover, 10-30 seconds. Buffered operations cover the gap.
11.4 WebSocket Gateway Overload
Impact: New connections are rejected. Existing connections may be dropped due to resource exhaustion (file descriptors, memory).
Recovery:
- The auto-scaler adds gateway instances within 60 seconds.
- The load balancer redirects new connections to healthy instances.
- Dropped clients reconnect with jitter and are distributed across the expanded pool.
Prevention: Set connection limits per gateway instance (50K hard cap). Monitor connection counts and trigger scaling at 70% capacity.
11.5 Network Partition (Client to Server)
Impact: The client enters offline mode. Local edits continue without interruption.
Recovery:
- Client detects partition via heartbeat timeout (90 seconds).
- Client enters explicit offline mode. Edits are stored in IndexedDB.
- Client attempts reconnection with exponential backoff.
- On reconnect, the Yjs sync protocol merges all offline edits.
This is not a failure from the user's perspective. The editor works identically online and offline. The merge happens transparently.
11.6 Corrupted Yjs Document State
Impact: The in-memory Yjs document enters an inconsistent state (e.g., broken origin pointers due to a Yjs library bug). Edits produce garbled output.
Recovery:
- Detect: clients report document hash mismatches during sync (state vectors agree but rendered content differs).
- The sync server evicts the corrupted in-memory state.
- Reload from the latest known-good snapshot in PostgreSQL.
- Replay operations from the operation log since that snapshot.
- Force-resync all connected clients (server sends the full state as sync-step-2).
Prevention: Run integrity checks on snapshots before persisting (verify the Yjs document renders valid ProseMirror JSON). Keep the last 5 snapshots per document for rollback.
11.7 Split-Brain: Two Sync Servers Serving the Same Document
Impact: Two sync server instances both load the same document independently. Edits on one instance are not visible to editors on the other. The document forks.
Recovery:
- Detect: the document's Redis key is claimed by two different server instance IDs.
- One instance wins (compare instance startup timestamps, older wins).
- The losing instance drains its connections and redirects clients to the winning instance.
- The winning instance merges any operations it missed from the losing instance's operation log entries.
Prevention: Use Redis SET NX (set if not exists) with a TTL as a distributed lock when loading a document. If the lock is already held, redirect the connection to the lock-holding instance. Renew the lock every 30 seconds.
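The lock-acquisition step can be sketched against a minimal key-value interface whose `setNx` mirrors Redis's `SET key value NX EX ttl` semantics. Everything here is illustrative: `KvStore`, `claimDocument`, and the in-memory stub are assumptions standing in for a real Redis client.

```typescript
// Document-lock sketch for split-brain prevention: SET NX with a TTL.
interface KvStore {
  setNx(key: string, value: string, ttlSec: number): boolean; // true if acquired
  get(key: string): string | undefined;
}

// Try to claim a document for this sync server instance. On failure, the
// caller redirects the client to the instance named in the lock.
function claimDocument(kv: KvStore, docId: string, instanceId: string):
  { acquired: boolean; holder: string } {
  const key = `doclock:${docId}`;
  if (kv.setNx(key, instanceId, 30)) {
    return { acquired: true, holder: instanceId }; // renew every 30s while active
  }
  return { acquired: false, holder: kv.get(key) ?? "unknown" };
}

// Minimal in-memory stub so the sketch is runnable without Redis.
class MemoryKv implements KvStore {
  private m = new Map<string, string>();
  setNx(key: string, value: string, _ttlSec: number): boolean {
    if (this.m.has(key)) return false;
    this.m.set(key, value);
    return true;
  }
  get(key: string): string | undefined { return this.m.get(key); }
}
```

The TTL is the safety net: if the holding instance dies without releasing the lock, the key expires within 30 seconds and another instance can claim the document.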
11.8 S3 Unavailability
Impact: Media uploads fail. Existing media still served from CDN cache (CloudFront). New images cannot be embedded.
Recovery:
- The media service returns a retry-able error to the client.
- The client queues the upload and retries with backoff.
- The document can reference a placeholder image node until the upload succeeds.
- S3 recovers. Queued uploads complete. Placeholders resolve to actual images.
12. Observability
Key metrics to track:
| Metric | Description | Alert Threshold |
|---|---|---|
| ws_connections_total | Active WebSocket connections per gateway | Drop >20% in 5 min |
| ops_per_sec_per_doc | Operations per second per document | Drops to 0 for an active doc |
| sync_latency_p99 | Time from op send to remote client receive | > 500ms |
| document_size_bytes | Yjs document binary size in memory | > 100 MB per document |
| op_log_write_latency_p99 | PostgreSQL operation log insert latency | > 100ms |
| snapshot_duration_ms | Time to serialize a Yjs document snapshot | > 1,000ms |
| search_index_lag_sec | Time since last Elasticsearch index update | > 120s for active docs |
| presence_broadcast_latency | Redis Pub/Sub relay time | > 200ms |
| offline_reconnect_duration | Time from reconnect to full sync | > 10s |
Dashboards:
- System overview: Total connections, total ops/sec, active documents, per-region breakdown
- Per-document health: Ops/sec, connected editors, document size, last snapshot age
- Storage growth: Operation log size, snapshot storage, media storage, PostgreSQL shard utilization
- Sync performance: Sync latency heatmap (p50/p95/p99), reconnection success rate, offline sync merge times
Structured logging: Every WebSocket message, operation log write, snapshot, and auth event is logged with document_id, client_id, user_id, and region as structured fields. Log aggregation via VictoriaLogs or similar. Trace IDs propagated from client through gateway to sync server to PostgreSQL for end-to-end tracing of a single operation.
13. Deployment Strategy
Multi-region deployment:
Deploy sync server clusters in 3+ regions (e.g., US-East, EU-West, AP-Southeast). Each document has a primary region based on where the majority of its active editors are located. WebSocket connections route to the primary region.
US-East: 400 sync servers, 250 WS gateways
EU-West: 300 sync servers, 200 WS gateways
AP-SE: 300 sync servers, 150 WS gateways
Document region assignment:
- On creation: document is assigned to the creator's nearest region.
- During active editing: if >60% of active editors are in a different region for >10 minutes, trigger region migration.
- Migration: snapshot the Yjs state on the source region, transfer to the target region, load on the target sync server, redirect WebSocket connections. Brief interruption (~5 seconds) during migration.
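The migration trigger above reduces to a small decision function. This is a sketch: the >60% and >10 minute thresholds come from the text, but the function signature and region-count representation are assumptions.

```typescript
// Region-migration decision sketch: return the target region if another
// region has held a >60% majority of active editors for >10 minutes.
function shouldMigrate(
  editorsByRegion: Record<string, number>,
  currentRegion: string,
  minutesSustained: number,
): string | null {
  const total = Object.values(editorsByRegion).reduce((a, b) => a + b, 0);
  if (total === 0 || minutesSustained <= 10) return null;
  for (const [region, count] of Object.entries(editorsByRegion)) {
    if (region !== currentRegion && count / total > 0.6) return region;
  }
  return null; // no other region holds a >60% majority
}
```

The sustained-duration check is what prevents thrashing: a brief burst of editors from another region should not trigger a ~5-second migration interruption.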
Rolling upgrades:
- Drain a sync server instance (stop accepting new document loads, keep serving active documents).
- Wait for active documents to reach a quiet period (no ops for 30 seconds) or a maximum drain timeout (5 minutes).
- Snapshot all in-memory documents.
- Shut down the instance. Deploy the new version.
- Start the new instance. It picks up new document loads from the consistent hash ring.
Canary deployment: New versions deployed to 5% of sync server instances first. Monitor error rate, sync latency, and document integrity for 30 minutes before proceeding. Automated rollback if error rate exceeds baseline by 2x.
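The automated-rollback condition can be written down directly. The 2x threshold is from the text; the function name and the epsilon guard for a zero-error baseline are assumptions.

```typescript
// Canary rollback sketch: roll back when the canary's error rate exceeds
// the baseline fleet's error rate by more than 2x.
function shouldRollback(canaryErrorRate: number, baselineErrorRate: number): boolean {
  const floor = Math.max(baselineErrorRate, 1e-6); // avoid divide-by-zero on a clean baseline
  return canaryErrorRate / floor > 2;
}
```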
Stateless components (API gateway, auth service, media service, search indexer) use standard blue-green deployment. No drain needed because they hold no document state.
14. Security
Transport security: TLS 1.3 for all WebSocket and REST connections. Certificate pinning for mobile apps. WebSocket Secure (wss://) is the only allowed transport.
Authentication:
- JWT tokens issued at login via OAuth 2.0 (Google, GitHub, email/password).
- Token validated at WebSocket handshake (gateway extracts and verifies JWT before upgrading the connection).
- Tokens include `user_id`, `email`, and `expiry`. Short-lived (15 minutes) with refresh tokens.
- On token expiry during an active WebSocket session, the server requests a token refresh over the existing connection. No reconnection needed.
Authorization:
- Per-document ACL checked on every operation at the sync server.
- Permission levels: owner (full control), editor (read + write), commenter (read + comment marks), viewer (read only).
- Viewers receive Yjs updates (to see live edits) but their outgoing operations are rejected by the sync server.
- Permission changes propagate within 5 seconds (ACL cache TTL). This is a deliberate tradeoff favoring sync server performance over instant revocation. Reducing the TTL increases database load linearly.
Encryption at rest:
- AES-256 for document snapshots and operation logs in PostgreSQL (using pgcrypto or transparent data encryption).
- S3 server-side encryption (SSE-S3 or SSE-KMS) for media and cold storage.
Content validation: The sync server validates incoming Yjs updates against the ProseMirror schema. Operations that would produce an invalid document structure (e.g., script injection via a malformed text node) are rejected before being persisted or relayed.
Audit logging: All permission changes, document access events, share invitations, and administrative actions are logged to an append-only audit table. Retained for 1 year. Queryable for compliance and incident investigation.
Wrapping Up
The three problems from Section 1 drive every decision in this design. Consistency is handled by CRDTs: deterministic merge rules guarantee convergence without a central authority. Latency is handled by optimistic local application: every keystroke is final before it hits the network. Rich text is handled by the Peritext approach layered on top of ProseMirror's document tree, giving concurrent formatting edits the same convergence guarantees as text.
The hardest part is not any single component. It is the interaction between all of them: offline users returning with thousands of edits, tombstone garbage collection across clients that may not be online simultaneously, presence fan-out at 100 editors per document, and keeping the operation log from swamping PostgreSQL. The architecture above handles these interactions, but building it is a multi-year effort. If you are starting from scratch, Yjs + Hocuspocus + TipTap gives you a working collaborative editor in days. Scaling it to 10 million simultaneous documents is where the real engineering begins.
Explore the Technologies
| Technology / Pattern | Role in This System | Learn More |
|---|---|---|
| Yjs | CRDT library implementing the YATA algorithm for conflict-free document merging | Yjs |
| TipTap / ProseMirror | Rich text editor framework with Yjs bindings for collaborative editing | TipTap |
| Hocuspocus | Yjs-native WebSocket sync server with auth hooks and persistence adapters | Hocuspocus |
| WebSocket | Bidirectional real-time transport for editor sync and presence broadcasts | WebSocket |
| PostgreSQL | Document metadata, operation logs, snapshots, and ACLs | PostgreSQL |
| Redis | Hot document state, presence pub/sub, and distributed locking | Redis |
| CRDTs | YATA algorithm for conflict-free merging, Peritext for rich text marks | CRDTs |
| Consistent Hashing | Document-to-sync-server routing for WebSocket connection affinity | Consistent Hashing |
| Vector Clocks | Foundation for Yjs state vectors, tracking causal ordering across clients | Vector Clocks |
| Event Sourcing | Operation log as append-only source of truth, snapshots for fast recovery | Event Sourcing |
Further Reading
- Yjs Docs: Internals: How Yjs structures CRDT items, encoding format, and document state internals
- CRDT.tech: Comprehensive CRDT resource with papers, talks, and implementations (maintained by Martin Kleppmann et al.)
- Peritext: A CRDT for Rich-Text Collaboration: How to handle concurrent formatting in CRDTs (Ink & Switch, 2022)
- How Figma's Multiplayer Technology Works: Figma's custom CRDT-inspired approach with a centralized server for ordering
- CRDTs Go Brrr: Performance comparison of CRDT implementations (Diamond Types, Yjs, Automerge) by Seph Gentle
- CRDT Benchmarks: Community benchmark suite comparing Yjs, Automerge, and other CRDT libraries