Cell-Based Architecture
Why Horizontal Scaling Hits a Wall
Scaling horizontally by adding more instances behind a load balancer works until it does not. The problem is not capacity. It is blast radius. One bad deploy goes to every instance. One schema migration locks every user's data. One configuration change poisons the entire fleet. Slack learned this the hard way during a series of incidents in 2020-2021 where fleet-wide rollouts caused cascading failures affecting all customers simultaneously. Their response was to carve their infrastructure into cells.
DoorDash arrived at the same conclusion from a different angle. After their high-profile outages in 2022, post-incident analysis revealed that their architecture had too many shared dependencies. A single Redis cluster going unhealthy would cascade across the entire platform. Cells gave them isolation boundaries that limited the damage any single failure could cause.
Cell Boundary Decisions
Choosing how to partition users into cells is the most consequential decision you will make. Three common approaches:
Customer ID hash is the simplest. Run the customer ID through a consistent hash to assign them to a cell. Stripe uses a variant of this for payment processing isolation. The advantage: even distribution, deterministic routing. The downside: customers from the same organization might land in different cells, making cross-account features harder to build.
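A minimal sketch of hash-based cell assignment, using rendezvous (highest-random-weight) hashing so that adding or removing a cell only reassigns the customers whose winning cell changed. The cell count and naming scheme here are illustrative, not drawn from any specific production system.

```python
import hashlib

CELLS = [f"cell-{i:02d}" for i in range(16)]  # illustrative cell names

def _score(customer_id: str, cell: str) -> int:
    # Stable hash (not Python's built-in hash(), which is randomized
    # per process) so every service computes the same answer.
    digest = hashlib.sha256(f"{cell}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def assign_cell(customer_id: str) -> str:
    # Rendezvous hashing: each cell gets a deterministic score per
    # customer; the highest scorer wins. Unlike plain modulo, resizing
    # the cell list moves only ~1/N of customers.
    return max(CELLS, key=lambda cell: _score(customer_id, cell))
```

Note the choice of rendezvous hashing over `hash(customer_id) % len(CELLS)`: modulo reshuffles nearly every customer when the cell count changes, which turns a routine capacity change into a mass migration.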
Geographic partitioning maps cells to physical regions or metros. DoorDash uses this because their marketplace is inherently local. A cell serving San Francisco does not need to know about restaurants in Chicago. Geographic cells also help with data residency requirements.
Tenant-based partitioning assigns each large enterprise customer (or group of smaller customers) to a dedicated cell. Salesforce has done this for years. It gives you per-tenant isolation and makes it easier to offer dedicated capacity to high-value accounts. The trade-off is uneven cell utilization.
Cell-Aware Routing
Every request needs to reach the right cell, and routing decisions must happen at the edge. The typical approach uses a lightweight routing service that maps identifiers (user ID, API key, tenant slug) to cell assignments stored in a fast lookup table. Cloudflare Workers or AWS Lambda@Edge can handle this at the CDN layer.
Keep the routing table small and cacheable. Slack's routing layer resolves cell assignment in under 1ms by keeping the full mapping in memory. If your lookup requires a database round-trip per request, you have already lost.
Cross-Cell Communication
Cells should be self-contained for the vast majority of operations. When cross-cell communication is unavoidable (a user in Cell A sends a message to a user in Cell B), treat it as an asynchronous operation. Use event buses or message queues with cell-specific topics. Synchronous cross-cell RPC creates the tight coupling that cells are designed to eliminate.
Keep a strict inventory of cross-cell data flows. At Slack, cross-cell traffic accounts for less than 5% of total request volume. If your cross-cell communication exceeds 10-15%, your cell boundaries are probably wrong.
When Cells Are Overkill
Cell-based architecture is heavy machinery. If you have fewer than 100K users, a single region with availability zone redundancy handles your blast radius concerns. If your failure scenarios are mostly code bugs (not infrastructure), feature flags and canary deploys give you similar protection at a fraction of the operational cost.
Cells start paying for themselves when you have a large enough user base that a full-fleet incident causes meaningful business damage, when your SLAs demand isolation guarantees, or when regulatory requirements force you to separate specific customer groups. Below that threshold, the operational overhead of managing multiple independent environments will slow your team down more than it protects your customers.
Key Points
- Cells are isolation boundaries, not scaling units. The primary value is blast radius reduction. A bad deploy, a runaway query, or a poisoned config change only affects one cell's users, not your entire customer base
- Cell sizing is a business decision disguised as a technical one. Slack's cells serve roughly 50K concurrent users each. DoorDash sized theirs by metro region. The right boundary depends on your failure cost per customer segment
- Cross-cell communication must be treated as a foreign API call with circuit breakers, retries, and explicit contracts. The moment cells start sharing state liberally, you have a distributed monolith with extra network hops
- Cell-aware routing at the edge is the linchpin. If your routing layer cannot deterministically map a request to the correct cell within 1-2ms, you will eat the latency budget before your application code runs
- Operational tooling cost dwarfs the infrastructure cost. You need per-cell dashboards, per-cell deployment pipelines, per-cell runbooks, and engineers who can reason about 20+ independent environments simultaneously
Common Mistakes
- ✗ Sharing a database across cells. This is the single most common way to accidentally couple cells together. Each cell needs its own data store, even if that means duplicating reference data
- ✗ Making cells too small. A team that built 200 cells for 50K users discovered they spent more time on cell management than on product development. Cell count should grow with customer scale, not ahead of it
- ✗ Deploying to all cells simultaneously. Staggered rollouts across cells (deploy to 2 cells, observe for 30 minutes, continue) give you a natural canary mechanism. Deploying everywhere at once defeats the isolation benefit
- ✗ Ignoring cell rebalancing. Customers grow, usage patterns shift, and cells become hot. Without automated rebalancing or at least clear playbooks for cell migration, you end up with severe skew within 6-12 months
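The staggered rollout described above can be sketched as a simple wave loop. `deploy_to_cell` and `cell_is_healthy` are assumed hooks into your deploy tooling and monitoring, not real APIs, and the wave size and soak time are the illustrative values from the bullet.

```python
import time

def staggered_rollout(cells, deploy_to_cell, cell_is_healthy,
                      wave_size=2, soak_seconds=30 * 60):
    """Deploy wave by wave; each wave is a natural canary for the next.

    Halts the rollout if any cell in the current wave looks unhealthy
    after the soak period, leaving the remaining cells on the old version.
    """
    for i in range(0, len(cells), wave_size):
        wave = cells[i:i + wave_size]
        for cell in wave:
            deploy_to_cell(cell)
        time.sleep(soak_seconds)  # observe before touching the next wave
        if not all(cell_is_healthy(cell) for cell in wave):
            raise RuntimeError(f"rollout halted: unhealthy cells in wave {wave}")
```

Because the unit of rollout is a cell rather than an instance, a bad build is caught while it can only hurt one or two cells' users.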