Spreadsheet Engine
Reactive cell dependency graph with Observer propagation, Composite expression trees for formulas, Interpreter for parsing, and Memento for undo/redo. The Google Sheets interview classic.
Key Abstractions
Holds a raw value or formula, observes its dependencies for changes, and notifies dependent cells to recalculate
Composite interface for formula AST nodes — supports evaluate(context) and accept(visitor) for double dispatch
Composite leaf and branch nodes representing constants, cell lookups, arithmetic, and aggregations like SUM
Interpreter that tokenizes and parses formula strings like '=A1+SUM(B1:B3)' into an expression tree
Snapshot of all cell values and formulas at a point in time, enabling full undo/redo without replaying history
Visits the expression tree for evaluation, dependency extraction, and cycle detection — no modifications to expression classes
Class Diagram
How It Works
A spreadsheet is really just a dependency graph displayed as a grid. Each cell holds either a raw value or a formula. Formulas reference other cells, creating edges in a directed graph. When a value changes, every downstream cell must recalculate, and in the right order.
The Observer pattern handles change propagation. Each cell is both subject (notifying dependents when its value changes) and observer (watching the cells it references). Observer alone is not enough. If B1 depends on A1 and C1 depends on B1, you must recalculate B1 before C1. That requires a topological sort over the dependency graph, not arbitrary notification order.
The Composite pattern gives formulas their tree structure. A formula like =A1+SUM(B1,B2) is a BinaryOp whose left child is a CellReference and whose right child is a FunctionCall containing two more CellReference nodes. Every node implements the same Expression interface, so the evaluator walks the tree uniformly.
The Interpreter pattern handles parsing. The formula language is small: cell references, numbers, arithmetic operators, and functions. A recursive descent parser tokenizes the input and builds the composite expression tree, respecting operator precedence.
The Visitor pattern makes the expression tree multi-purpose. EvalVisitor computes the value. DependencyVisitor extracts which cells a formula references (needed to build the dependency graph). Future visitors could detect cycles, optimize constant sub-expressions, or rename cell references during column insertion. None of this requires changing the expression node classes.
The Memento pattern provides undo/redo. Before every mutation, the engine snapshots all cell states into a SpreadsheetMemento. Undo pops from the undo stack and pushes the current state onto the redo stack. Each operation is O(1) in terms of restoring state, not replaying the entire edit history.
Requirements
Functional
- Store typed values (numbers) in cells identified by column-row IDs like A1, B2
- Parse and evaluate formulas with arithmetic operators (+, -, *, /) and functions (SUM, MAX, MIN, AVG)
- Formulas can reference other cells and nest arbitrarily:
=SUM(A1, B1*2+3) - When a cell value changes, all dependent cells recalculate automatically
- Detect circular references before accepting a formula
- Full undo and redo support for all cell mutations
Non-Functional
- Propagation must follow topological order: no stale reads during recalculation
- Adding new functions (COUNT, IF) should not require changes to the expression node classes
- Undo/redo should be O(1) per operation, not O(n) replay
Design Decisions
Does polling for changes work at spreadsheet scale?
Polling means iterating every cell after every change, checking if its dependencies have new values. For a sheet with 10,000 cells where only one value changed, that's 9,999 wasted checks. Observer inverts the control: only the cells that actually depend on the changed cell recalculate. The dependency graph encodes exactly which cells care about which changes.
Observer alone is not enough. You need topological ordering. If A1 fires notifications to B1 and C1 simultaneously, and C1 depends on B1, C1 might evaluate before B1 has its new value. The engine does a BFS traversal from the changed cell, producing an evaluation order that respects the dependency edges. Recalculation follows that order, not the arbitrary notification order.
Couldn't you just eval() the formula string?
You could eval() the formula string directly. Fast to implement, impossible to inspect. You cannot extract dependencies from a string without parsing it. You cannot detect cycles without knowing which cells a formula references. You cannot optimize without a tree to transform.
The Composite approach gives you the AST. Every formula is a tree of Expression nodes. Walk it with DependencyVisitor to find references. Walk it with EvalVisitor to compute the value. The tree is the reusable intermediate representation that every downstream operation depends on.
What if you put evaluate() directly on expression nodes?
Putting evaluate() on every expression node works for one operation, but you need at least three: evaluate, extract dependencies, detect cycles. With polymorphic methods, each new operation adds a method to every node class. With Visitor, each new operation is one new class. The expression hierarchy stays frozen. The expression types (NumberLiteral, CellReference, BinaryOp, FunctionCall) are stable; the operations over them are what grow.
Couldn't Command-based undo work here?
Command-based undo requires each command to know how to reverse itself. SetValue must remember the old value. SetFormula must remember the old formula, old dependencies, and old computed values of all affected cells. A single change that cascades through 50 cells means the command must capture 50 old values.
Memento sidesteps this by snapshotting the entire state. It uses more memory, but the logic is trivial: save before, restore after. In a spreadsheet where mutations cascade unpredictably through the dependency graph, Memento is far simpler than trying to compute the inverse of every propagation.
Interview Follow-ups
-
Lazy vs. eager evaluation: Eager recalculates all dependents immediately on change. Lazy marks dependents as dirty and recalculates only when a cell value is read. Lazy is better for batch updates (setting 100 cells, then reading results), but adds complexity: you need dirty flags, and reads become unpredictable in cost. Most production spreadsheets use eager evaluation with batch coalescing.
-
Range references like B1:B3: The parser expands
B1:B3into a list ofCellReferencenodes[B1, B2, B3]at parse time. Alternatively, introduce aRangeReferenceexpression node that lazily resolves to the constituent cells. The lazy approach handles inserted rows gracefully: the range adapts without re-parsing. -
Collaborative editing with multiple users: Each user's edits need conflict resolution. Operational Transformation (OT) or CRDTs can merge concurrent cell edits. The dependency graph must be globally consistent: if user A changes A1 and user B changes B1 simultaneously, both dependents must see both updates. The topological propagation becomes a distributed coordination problem.
-
Performance at 1M cells: The dependency graph should use adjacency lists, not a matrix. Topological sort should be incremental: when A1 changes, only sort the subgraph reachable from A1, not all one million cells. Memento at this scale needs delta snapshots (store only changed cells, not the full sheet). Cache formula parsing results and only re-parse when the formula string changes.
Code Implementation
1 from __future__ import annotations
2 from abc import ABC, abstractmethod
3 from dataclasses import dataclass, field
4 from typing import Any
5 import copy, re
6
7
8 # ── Expression AST (Composite + Visitor) ────────────────────────
9
10 class Expression(ABC):
11 @abstractmethod
12 def accept(self, visitor: ExpressionVisitor) -> Any:
13 ...
14
15
16 @dataclass
17 class NumberLiteral(Expression):
18 value: float
19
20 def accept(self, visitor: ExpressionVisitor):
21 return visitor.visit_number(self)
22
23
24 @dataclass
25 class CellReference(Expression):
26 cell_id: str
27
28 def accept(self, visitor: ExpressionVisitor):
29 return visitor.visit_cell_ref(self)
30
31
32 @dataclass
33 class BinaryOp(Expression):
34 left: Expression
35 op: str
36 right: Expression
37
38 def accept(self, visitor: ExpressionVisitor):
39 return visitor.visit_binary(self)
40
41
42 @dataclass
43 class FunctionCall(Expression):
44 name: str
45 args: list[Expression]
46
47 def accept(self, visitor: ExpressionVisitor):
48 return visitor.visit_function(self)
49
50
51 # ── Visitor Interface + Concrete Visitors ───────────────────────
52
53 class ExpressionVisitor(ABC):
54 @abstractmethod
55 def visit_number(self, expr: NumberLiteral) -> Any: ...
56 @abstractmethod
57 def visit_cell_ref(self, expr: CellReference) -> Any: ...
58 @abstractmethod
59 def visit_binary(self, expr: BinaryOp) -> Any: ...
60 @abstractmethod
61 def visit_function(self, expr: FunctionCall) -> Any: ...
62
63
64 class EvalVisitor(ExpressionVisitor):
65 """Evaluates an expression tree given a context mapping cell ids to values."""
66
67 def __init__(self, context: dict[str, float]):
68 self.context = context
69
70 def visit_number(self, expr: NumberLiteral) -> float:
71 return expr.value
72
73 def visit_cell_ref(self, expr: CellReference) -> float:
74 return self.context.get(expr.cell_id.upper(), 0.0)
75
76 def visit_binary(self, expr: BinaryOp) -> float:
77 left = expr.left.accept(self)
78 right = expr.right.accept(self)
79 if expr.op == "+": return left + right
80 if expr.op == "-": return left - right
81 if expr.op == "*": return left * right
82 if expr.op == "/":
83 if right == 0:
84 raise ZeroDivisionError(f"Division by zero in formula")
85 return left / right
86 raise ValueError(f"Unknown operator: {expr.op}")
87
88 def visit_function(self, expr: FunctionCall) -> float:
89 values = [arg.accept(self) for arg in expr.args]
90 if expr.name == "SUM": return sum(values)
91 if expr.name == "MAX": return max(values)
92 if expr.name == "MIN": return min(values)
93 if expr.name == "AVG": return sum(values) / len(values)
94 raise ValueError(f"Unknown function: {expr.name}")
95
96
97 class DependencyVisitor(ExpressionVisitor):
98 """Collects all cell references an expression depends on."""
99
100 def __init__(self):
101 self.deps: set[str] = set()
102
103 def visit_number(self, expr: NumberLiteral): pass
104
105 def visit_cell_ref(self, expr: CellReference):
106 self.deps.add(expr.cell_id.upper())
107
108 def visit_binary(self, expr: BinaryOp):
109 expr.left.accept(self)
110 expr.right.accept(self)
111
112 def visit_function(self, expr: FunctionCall):
113 for arg in expr.args:
114 arg.accept(self)
115
116
117 # ── Formula Parser (Interpreter Pattern) ────────────────────────
118
119 @dataclass
120 class Token:
121 type: str # NUMBER, CELLREF, OP, LPAREN, RPAREN, COMMA, FUNC
122 value: str
123
124
125 class FormulaParser:
126 """Parses formulas like '=A1+5', '=A1*B1+3', '=SUM(A1,B1,C1)'."""
127
128 def __init__(self, text: str):
129 raw = text.lstrip("=").strip()
130 self.tokens = self._tokenize(raw)
131 self.pos = 0
132
133 def _tokenize(self, text: str) -> list[Token]:
134 tokens: list[Token] = []
135 i = 0
136 while i < len(text):
137 ch = text[i]
138 if ch.isspace():
139 i += 1
140 continue
141 if ch in "+-*/":
142 tokens.append(Token("OP", ch))
143 i += 1
144 elif ch == "(":
145 tokens.append(Token("LPAREN", ch))
146 i += 1
147 elif ch == ")":
148 tokens.append(Token("RPAREN", ch))
149 i += 1
150 elif ch == ",":
151 tokens.append(Token("COMMA", ch))
152 i += 1
153 elif ch.isdigit() or ch == ".":
154 start = i
155 while i < len(text) and (text[i].isdigit() or text[i] == "."):
156 i += 1
157 tokens.append(Token("NUMBER", text[start:i]))
158 elif ch.isalpha():
159 start = i
160 while i < len(text) and text[i].isalnum():
161 i += 1
162 word = text[start:i]
163 if i < len(text) and text[i] == "(":
164 tokens.append(Token("FUNC", word.upper()))
165 elif re.match(r"^[A-Za-z]+\d+$", word):
166 tokens.append(Token("CELLREF", word.upper()))
167 else:
168 raise ValueError(f"Unknown identifier: {word}")
169 else:
170 raise ValueError(f"Unexpected character: {ch}")
171 return tokens
172
173 def _peek(self) -> Token | None:
174 return self.tokens[self.pos] if self.pos < len(self.tokens) else None
175
176 def _consume(self) -> Token:
177 tok = self.tokens[self.pos]
178 self.pos += 1
179 return tok
180
181 def parse(self) -> Expression:
182 expr = self._parse_expr()
183 if self.pos < len(self.tokens):
184 raise ValueError(f"Unexpected token: {self.tokens[self.pos].value}")
185 return expr
186
187 def _parse_expr(self) -> Expression:
188 left = self._parse_term()
189 while (tok := self._peek()) and tok.type == "OP" and tok.value in "+-":
190 op = self._consume().value
191 right = self._parse_term()
192 left = BinaryOp(left, op, right)
193 return left
194
195 def _parse_term(self) -> Expression:
196 left = self._parse_factor()
197 while (tok := self._peek()) and tok.type == "OP" and tok.value in "*/":
198 op = self._consume().value
199 right = self._parse_factor()
200 left = BinaryOp(left, op, right)
201 return left
202
203 def _parse_factor(self) -> Expression:
204 tok = self._peek()
205 if tok is None:
206 raise ValueError("Unexpected end of formula")
207 if tok.type == "NUMBER":
208 self._consume()
209 return NumberLiteral(float(tok.value))
210 if tok.type == "CELLREF":
211 self._consume()
212 return CellReference(tok.value)
213 if tok.type == "FUNC":
214 return self._parse_function()
215 if tok.type == "LPAREN":
216 self._consume()
217 expr = self._parse_expr()
218 closing = self._peek()
219 if closing is None or closing.type != "RPAREN":
220 raise ValueError("Missing closing parenthesis")
221 self._consume()
222 return expr
223 raise ValueError(f"Unexpected token: {tok.value}")
224
225 def _parse_function(self) -> Expression:
226 name = self._consume().value
227 self._consume() # LPAREN
228 args: list[Expression] = []
229 if self._peek() and self._peek().type != "RPAREN":
230 args.append(self._parse_expr())
231 while self._peek() and self._peek().type == "COMMA":
232 self._consume()
233 args.append(self._parse_expr())
234 closing = self._peek()
235 if closing is None or closing.type != "RPAREN":
236 raise ValueError(f"Missing closing parenthesis for {name}")
237 self._consume()
238 return FunctionCall(name, args)
239
240
241 # ── Cell (Observer : subject and observer) ──────────────────────
242
243 @dataclass
244 class Cell:
245 cell_id: str
246 raw_value: float | None = None
247 formula: Expression | None = None
248 formula_str: str = ""
249 computed_value: float = 0.0
250 dependents: set[str] = field(default_factory=set)
251 dependencies: set[str] = field(default_factory=set)
252
253
254 # ── Memento ─────────────────────────────────────────────────────
255
256 @dataclass
257 class CellSnapshot:
258 raw_value: float | None
259 formula: Expression | None
260 formula_str: str
261 computed_value: float
262 dependencies: set[str]
263
264 @dataclass
265 class SpreadsheetMemento:
266 state: dict[str, CellSnapshot]
267
268
269 # ── Spreadsheet Engine ──────────────────────────────────────────
270
271 class Spreadsheet:
272 def __init__(self):
273 self.cells: dict[str, Cell] = {}
274 self._undo_stack: list[SpreadsheetMemento] = []
275 self._redo_stack: list[SpreadsheetMemento] = []
276
277 def _get_or_create(self, cell_id: str) -> Cell:
278 cell_id = cell_id.upper()
279 if cell_id not in self.cells:
280 self.cells[cell_id] = Cell(cell_id=cell_id)
281 return self.cells[cell_id]
282
283 def _save_memento(self):
284 state = {}
285 for cid, cell in self.cells.items():
286 state[cid] = CellSnapshot(
287 raw_value=cell.raw_value,
288 formula=copy.deepcopy(cell.formula),
289 formula_str=cell.formula_str,
290 computed_value=cell.computed_value,
291 dependencies=set(cell.dependencies),
292 )
293 self._undo_stack.append(SpreadsheetMemento(state))
294 self._redo_stack.clear()
295
296 def _restore_memento(self, memento: SpreadsheetMemento):
297 self.cells.clear()
298 for cid, snap in memento.state.items():
299 cell = Cell(
300 cell_id=cid,
301 raw_value=snap.raw_value,
302 formula=copy.deepcopy(snap.formula),
303 formula_str=snap.formula_str,
304 computed_value=snap.computed_value,
305 dependencies=set(snap.dependencies),
306 )
307 self.cells[cid] = cell
308 # Rebuild dependents from dependencies
309 for cid, cell in self.cells.items():
310 for dep in cell.dependencies:
311 self._get_or_create(dep).dependents.add(cid)
312
313 def undo(self) -> bool:
314 if not self._undo_stack:
315 return False
316 # Save current state to redo
317 current = {}
318 for cid, cell in self.cells.items():
319 current[cid] = CellSnapshot(
320 cell.raw_value, copy.deepcopy(cell.formula),
321 cell.formula_str, cell.computed_value, set(cell.dependencies),
322 )
323 self._redo_stack.append(SpreadsheetMemento(current))
324 self._restore_memento(self._undo_stack.pop())
325 return True
326
327 def redo(self) -> bool:
328 if not self._redo_stack:
329 return False
330 current = {}
331 for cid, cell in self.cells.items():
332 current[cid] = CellSnapshot(
333 cell.raw_value, copy.deepcopy(cell.formula),
334 cell.formula_str, cell.computed_value, set(cell.dependencies),
335 )
336 self._undo_stack.append(SpreadsheetMemento(current))
337 self._restore_memento(self._redo_stack.pop())
338 return True
339
340 def _get_context(self) -> dict[str, float]:
341 return {cid: c.computed_value for cid, c in self.cells.items()}
342
343 def _detect_cycle(self, start: str, deps: set[str]) -> bool:
344 visited: set[str] = set()
345 stack = list(deps)
346 while stack:
347 cid = stack.pop()
348 if cid == start:
349 return True
350 if cid in visited:
351 continue
352 visited.add(cid)
353 if cid in self.cells:
354 stack.extend(self.cells[cid].dependencies)
355 return False
356
357 def _topological_sort(self, start: str) -> list[str]:
358 """BFS topological order of all cells downstream from start."""
359 order: list[str] = []
360 visited: set[str] = set()
361 queue = [start]
362 while queue:
363 cid = queue.pop(0)
364 if cid in visited:
365 continue
366 visited.add(cid)
367 order.append(cid)
368 cell = self.cells.get(cid)
369 if cell:
370 for dep in sorted(cell.dependents):
371 if dep not in visited:
372 queue.append(dep)
373 return order
374
375 def _propagate(self, cell_id: str):
376 order = self._topological_sort(cell_id)
377 ctx = self._get_context()
378 for cid in order:
379 cell = self.cells.get(cid)
380 if cell and cell.formula:
381 evaluator = EvalVisitor(ctx)
382 cell.computed_value = cell.formula.accept(evaluator)
383 ctx[cid] = cell.computed_value
384
385 def set_value(self, cell_id: str, value: float):
386 self._save_memento()
387 cell_id = cell_id.upper()
388 cell = self._get_or_create(cell_id)
389 # Clear old dependencies
390 for dep in cell.dependencies:
391 if dep in self.cells:
392 self.cells[dep].dependents.discard(cell_id)
393 cell.raw_value = value
394 cell.formula = None
395 cell.formula_str = ""
396 cell.computed_value = value
397 cell.dependencies = set()
398 self._propagate(cell_id)
399
400 def set_formula(self, cell_id: str, formula_str: str):
401 self._save_memento()
402 cell_id = cell_id.upper()
403 cell = self._get_or_create(cell_id)
404 expr = FormulaParser(formula_str).parse()
405 # Extract dependencies via visitor
406 dep_visitor = DependencyVisitor()
407 expr.accept(dep_visitor)
408 new_deps = dep_visitor.deps
409 # Cycle detection
410 if cell_id in new_deps or self._detect_cycle(cell_id, new_deps):
411 self._undo_stack.pop() # Revert saved memento
412 raise ValueError(f"Circular reference detected for {cell_id}")
413 # Clear old dependencies
414 for dep in cell.dependencies:
415 if dep in self.cells:
416 self.cells[dep].dependents.discard(cell_id)
417 cell.formula = expr
418 cell.formula_str = formula_str
419 cell.raw_value = None
420 cell.dependencies = new_deps
421 # Register as dependent on each referenced cell
422 for dep in new_deps:
423 self._get_or_create(dep).dependents.add(cell_id)
424 self._propagate(cell_id)
425
426 def get_value(self, cell_id: str) -> float:
427 cell_id = cell_id.upper()
428 if cell_id in self.cells:
429 return self.cells[cell_id].computed_value
430 return 0.0
431
432 def display(self, cell_ids: list[str]):
433 for cid in cell_ids:
434 cid = cid.upper()
435 cell = self.cells.get(cid)
436 if cell and cell.formula_str:
437 print(f" {cid} = {cell.formula_str} -> {cell.computed_value}")
438 elif cell:
439 print(f" {cid} = {cell.computed_value}")
440 else:
441 print(f" {cid} = (empty)")
442
443
444 # ── Demo ────────────────────────────────────────────────────────
445
446 if __name__ == "__main__":
447 ss = Spreadsheet()
448
449 print("=== Setting raw values ===")
450 ss.set_value("A1", 10)
451 ss.set_value("A2", 20)
452 ss.set_value("A3", 30)
453 ss.display(["A1", "A2", "A3"])
454
455 print("\n=== Setting formulas ===")
456 ss.set_formula("B1", "=A1+A2") # B1 = 10+20 = 30
457 ss.set_formula("C1", "=B1*2") # C1 = 30*2 = 60
458 ss.set_formula("D1", "=SUM(A1,A2,A3)") # D1 = 60
459 ss.display(["B1", "C1", "D1"])
460
461 print("\n=== Observer propagation: changing A1 to 50 ===")
462 ss.set_value("A1", 50)
463 print("After A1 = 50:")
464 ss.display(["A1", "B1", "C1", "D1"])
465 # B1 = 50+20 = 70, C1 = 70*2 = 140, D1 = 50+20+30 = 100
466
467 print("\n=== Cycle detection ===")
468 try:
469 ss.set_formula("A1", "=C1+1")
470 except ValueError as e:
471 print(f" Caught: {e}")
472
473 print("\n=== Undo: reverting A1 back to 50 ===")
474 ss.undo()
475 ss.display(["A1", "B1", "C1", "D1"])
476
477 print("\n=== Undo again: reverting A1 back to 10 ===")
478 ss.undo()
479 ss.display(["A1", "B1", "C1", "D1"])
480
481 print("\n=== Redo: A1 back to 50 ===")
482 ss.redo()
483 ss.display(["A1", "B1", "C1", "D1"])Common Mistakes
- ✗No cycle detection: A1=B1, B1=A1 causes infinite recalculation. You must detect cycles before accepting a formula.
- ✗Evaluating dependents in arbitrary order: can read stale values. Topological sort of the dependency graph is required.
- ✗Storing formatted strings instead of typed values: breaks arithmetic on cells displayed as currency or percentages.
- ✗Not using Memento: undo requires replaying all edits from the start, which is O(n) per undo instead of O(1).
Key Points
- ✓Observer with topological ordering: when A1 changes, recalculate B1=A1+5 before C1=B1*2. You must evaluate the dependency graph in topological order so no cell reads a stale value.
- ✓Composite: formulas are expression trees. SUM(A1, A2+3) is a FunctionCall containing a CellReference and a BinaryOp. Every node implements the same Expression interface.
- ✓Interpreter: the formula mini-language (=, +, -, *, SUM, MAX) is parsed into an AST via recursive descent and evaluated by walking the tree.
- ✓Visitor: evaluate(), getDependencies(), and detectCycles() are visitors over the expression tree. You can add new operations without modifying the expression node classes.