Cell-Based Architecture
Why Horizontal Scaling Hits a Wall
Scaling horizontally by adding more instances behind a load balancer works until it does not. The problem is not capacity. It is blast radius. One bad deploy goes to every instance. One schema migration locks every user's data. One configuration change poisons the entire fleet. Slack learned this the hard way during a series of incidents in 2020-2021 where fleet-wide rollouts caused cascading failures affecting all customers simultaneously. Their response was to carve their infrastructure into cells.
DoorDash arrived at the same conclusion from a different angle. After their high-profile outages in 2022, post-incident analysis revealed that their architecture had too many shared dependencies. A single Redis cluster going unhealthy would cascade across the entire platform. Cells gave them isolation boundaries that limited the damage any single failure could cause.
Cell Boundary Decisions
Choosing how to partition users into cells is the most consequential decision you will make. Three common approaches:
Customer ID hash is the simplest. Run the customer ID through a consistent hash to assign them to a cell. Stripe uses a variant of this for payment processing isolation. The advantage: even distribution, deterministic routing. The downside: customers from the same organization might land in different cells, making cross-account features harder to build.
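A minimal sketch of hash-based cell assignment, using rendezvous (highest-random-weight) hashing so that adding or removing a cell only reassigns the customers whose winning cell changed. The cell count and naming scheme here are illustrative, not drawn from any specific production system.

```python
import hashlib

CELLS = [f"cell-{i:02d}" for i in range(16)]  # illustrative cell names

def _score(customer_id: str, cell: str) -> int:
    # Stable hash (not Python's built-in hash(), which is randomized
    # per process) so every service computes the same answer.
    digest = hashlib.sha256(f"{cell}:{customer_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def assign_cell(customer_id: str) -> str:
    # Rendezvous hashing: each cell gets a deterministic score per
    # customer; the highest scorer wins. Unlike plain modulo, resizing
    # the cell list moves only ~1/N of customers.
    return max(CELLS, key=lambda cell: _score(customer_id, cell))
```

Note the choice of rendezvous hashing over `hash(customer_id) % len(CELLS)`: modulo reshuffles nearly every customer when the cell count changes, which turns a routine capacity change into a mass migration.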
Geographic partitioning maps cells to physical regions or metros. DoorDash uses this because their marketplace is inherently local. A cell serving San Francisco does not need to know about restaurants in Chicago. Geographic cells also help with data residency requirements.
Tenant-based partitioning assigns each large enterprise customer (or group of smaller customers) to a dedicated cell. Salesforce has done this for years. It gives you per-tenant isolation and makes it easier to offer dedicated capacity to high-value accounts. The trade-off is uneven cell utilization.
Cell-Aware Routing
Every request needs to reach the right cell, and routing decisions must happen at the edge. The typical approach uses a lightweight routing service that maps identifiers (user ID, API key, tenant slug) to cell assignments stored in a fast lookup table. Cloudflare Workers or AWS Lambda@Edge can handle this at the CDN layer.
Keep the routing table small and cacheable. Slack's routing layer resolves cell assignment in under 1ms by keeping the full mapping in memory. If your lookup requires a database round-trip per request, you have already lost.
Cross-Cell Communication
Cells should be self-contained for the vast majority of operations. When cross-cell communication is unavoidable (a user in Cell A sends a message to a user in Cell B), treat it as an asynchronous operation. Use event buses or message queues with cell-specific topics. Synchronous cross-cell RPC creates the tight coupling that cells are designed to eliminate.
Keep a strict inventory of cross-cell data flows. At Slack, cross-cell traffic accounts for less than 5% of total request volume. If your cross-cell communication exceeds 10-15%, your cell boundaries are probably wrong.
When Cells Are Overkill
Cell-based architecture is heavy machinery. If you have fewer than 100K users, a single region with availability zone redundancy handles your blast radius concerns. If your failure scenarios are mostly code bugs (not infrastructure), feature flags and canary deploys give you similar protection at a fraction of the operational cost.
Cells start paying for themselves when you have a large enough user base that a full-fleet incident causes meaningful business damage, when your SLAs demand isolation guarantees, or when regulatory requirements force you to separate specific customer groups. Below that threshold, the operational overhead of managing multiple independent environments will slow your team down more than it protects your customers.
Key Points
- Cells are isolation boundaries, not scaling units. The primary value is blast radius reduction. A bad deploy, a runaway query, or a poisoned config change only affects one cell's users, not your entire customer base
- Cell sizing is a business decision disguised as a technical one. Slack's cells serve roughly 50K concurrent users each. DoorDash sized theirs by metro region. The right boundary depends on your failure cost per customer segment
- Cross-cell communication must be treated as a foreign API call with circuit breakers, retries, and explicit contracts. The moment cells start sharing state liberally, you have a distributed monolith with extra network hops
- Cell-aware routing at the edge is the linchpin. If your routing layer cannot deterministically map a request to the correct cell within 1-2ms, you will eat the latency budget before your application code runs
- Operational tooling cost dwarfs the infrastructure cost. You need per-cell dashboards, per-cell deployment pipelines, per-cell runbooks, and engineers who can reason about 20+ independent environments simultaneously
Common Mistakes
- ✗ Sharing a database across cells. This is the single most common way to accidentally couple cells together. Each cell needs its own data store, even if that means duplicating reference data
- ✗ Making cells too small. A team that built 200 cells for 50K users discovered they spent more time on cell management than on product development. Cell count should grow with customer scale, not ahead of it
- ✗ Deploying to all cells simultaneously. Staggered rollouts across cells (deploy to 2 cells, observe for 30 minutes, continue) give you a natural canary mechanism. Deploying everywhere at once defeats the isolation benefit
- ✗ Ignoring cell rebalancing. Customers grow, usage patterns shift, and cells become hot. Without automated rebalancing or at least clear playbooks for cell migration, you end up with severe skew within 6-12 months
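The staggered rollout described above can be sketched as a simple wave loop. `deploy_to_cell` and `cell_is_healthy` are assumed hooks into your deploy tooling and monitoring, not real APIs, and the wave size and soak time are the illustrative values from the bullet.

```python
import time

def staggered_rollout(cells, deploy_to_cell, cell_is_healthy,
                      wave_size=2, soak_seconds=30 * 60):
    """Deploy wave by wave; each wave is a natural canary for the next.

    Halts the rollout if any cell in the current wave looks unhealthy
    after the soak period, leaving the remaining cells on the old version.
    """
    for i in range(0, len(cells), wave_size):
        wave = cells[i:i + wave_size]
        for cell in wave:
            deploy_to_cell(cell)
        time.sleep(soak_seconds)  # observe before touching the next wave
        if not all(cell_is_healthy(cell) for cell in wave):
            raise RuntimeError(f"rollout halted: unhealthy cells in wave {wave}")
```

Because the unit of rollout is a cell rather than an instance, a bad build is caught while it can only hurt one or two cells' users.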