CRUSH: Topology-Aware Data Placement
Architecture
The Problem
Consider a cluster with 20,000 storage nodes spread across 3 availability zones and 500 racks. An object needs 14 shards (10 data + 4 parity). The system must pick 14 specific nodes for those shards, subject to three constraints that fight each other:
- No central lookup. At 350K requests/sec, a central placement database becomes a bottleneck. Every read and write would need to ask "which nodes hold this object?" before doing anything. That's a single point of failure and a latency tax on every operation.
- Failure domain isolation. The 14 nodes can't be random. If 5 shards land on the same rack and that rack loses power, 5 of 14 shards are lost. With 10+4 EC, you can only survive 4 losses. Same-rack placement turns a routine rack failure into data loss. Shards need to spread across different racks and different AZs.
- Minimal data movement. Adding or removing a node cannot trigger a full reshuffle. With 100 trillion objects, even a lightweight per-object check is impossible. Only the data that was on the changed node should move.
Naive solutions fail at least one of these:
| Approach | Central lookup? | Failure domain aware? | Minimal movement? |
|---|---|---|---|
| Metadata DB lookup | Yes (bottleneck) | Yes (if you code it) | Yes |
| hash(key) % N | No | No | No (adding 1 node moves ~everything) |
| Consistent hashing | No | No (ring has no topology) | Yes |
| CRUSH | No | Yes | Yes |
CRUSH is the only approach that solves all three simultaneously.
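To see how badly hash(key) % N fails the minimal-movement test, here is a short Python sketch (key names and counts are illustrative, not from the text) that counts how many objects change nodes when one node joins a 20,000-node cluster:

```python
# Sketch: mod-N placement before and after adding one node.
import hashlib

def node_for(key: str, n_nodes: int) -> int:
    # Stable hash; Python's built-in hash() is salted per process.
    h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
    return h % n_nodes

keys = [f"object-{i}" for i in range(100_000)]
moved = sum(node_for(k, 20_000) != node_for(k, 20_001) for k in keys)
print(f"{moved / len(keys):.2%} of objects moved")  # ~99.99%: nearly everything
```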
How CRUSH Works
CRUSH takes two inputs and produces one output:
Input: a placement group ID (integer) and a cluster topology map.
Output: an ordered list of N nodes (e.g., 14 nodes for 10+4 EC).
No database lookup. No network call. Any node that has the cluster map can run CRUSH locally and get the same answer.
The Cluster Topology Map
The map is a tree describing your physical infrastructure:
root (entire cluster)
├── AZ-1 (us-east-1a)
│ ├── Rack-A (weight: 720TB)
│ │ ├── Node-1 (weight: 720TB, 36 drives × 20TB)
│ │ ├── Node-2
│ │ └── Node-3
│ ├── Rack-B
│ │ ├── Node-4
│ │ └── Node-5
│ └── ... (180 racks)
├── AZ-2 (us-east-1b)
│ └── ... (180 racks)
└── AZ-3 (us-east-1c)
└── ... (180 racks)
Each node has a weight proportional to its storage capacity. A node with 36 × 20TB drives has weight 720. A smaller node with 12 × 12TB drives has weight 144. CRUSH gives more data to heavier nodes.
The map is small (a few MB for 20K nodes) and changes rarely (only when nodes join, leave, or fail). Every component in the system caches a copy.
The Selection Algorithm
CRUSH walks the tree from root to leaf, selecting one branch at each level using a pseudo-random function seeded by the placement group ID. Here's the step-by-step for placing 14 shards with the rule "spread across 3 AZs, different rack per shard":
Step 1: Select AZs.
CRUSH needs to pick 3 AZs. For each AZ slot, it computes:
az_index = hash(pg_id, az_slot, attempt=0) % number_of_AZs
The hash is deterministic. PG-42 always selects the same AZs. The algorithm distributes shards across AZs: say 5 in AZ-1, 5 in AZ-2, 4 in AZ-3.
Step 2: Select racks within each AZ.
For each shard assigned to AZ-1, CRUSH picks a rack:
rack_index = hash(pg_id, shard_number, attempt=0) % racks_in_AZ1
The shard_number in the hash ensures different shards pick different racks. If a collision happens (two shards pick the same rack), CRUSH increments the attempt counter and rehashes. This guarantees each shard lands on a different rack.
Step 3: Select a node within each rack.
node_index = hash(pg_id, shard_number, attempt=0), weighted by node capacity
The node is selected proportional to weight. A plain modulo would treat every node equally, so CRUSH instead gives each candidate a hash-derived score scaled by its weight and keeps the highest (Ceph implements this as the straw2 bucket). Heavier nodes are more likely to win. This is how CRUSH handles heterogeneous hardware.
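The formulas above are a simplification of the real algorithm. Here is a compact Python sketch of the same ideas — hash-seeded, weight-proportional, collision-retried — with all names invented for illustration:

```python
# Simplified sketch of CRUSH-style selection (not Ceph's actual code).
import hashlib, math

def h(*parts) -> float:
    """Deterministic hash of the inputs, mapped to (0, 1]."""
    raw = hashlib.sha256("/".join(map(str, parts)).encode()).digest()
    return (int.from_bytes(raw[:8], "big") + 1) / 2**64

def straw2_pick(pg_id, shard, attempt, candidates):
    """candidates: list of (name, weight). Heavier -> more likely to win.
    ln(u)/w is an exponential draw with rate w; the max wins with
    probability proportional to w (the straw2 trick)."""
    return max(candidates,
               key=lambda c: math.log(h(pg_id, shard, attempt, c[0])) / c[1])[0]

def choose_distinct(pg_id, n_shards, candidates):
    """One pick per shard, all distinct; rehash on collision."""
    chosen = []
    for shard in range(n_shards):
        attempt = 0
        pick = straw2_pick(pg_id, shard, attempt, candidates)
        while pick in chosen:            # collision -> bump attempt, rehash
            attempt += 1
            pick = straw2_pick(pg_id, shard, attempt, candidates)
        chosen.append(pick)
    return chosen

racks = [("Rack-A", 720), ("Rack-B", 720), ("Rack-C", 720),
         ("Rack-D", 720), ("Rack-E", 360)]
print(choose_distinct(42, 3, racks))  # deterministic: same output on every node
```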
Result for PG-42:
Shard 0 → AZ-1, Rack-A, Node-1
Shard 1 → AZ-1, Rack-B, Node-4
Shard 2 → AZ-1, Rack-C, Node-8
Shard 3 → AZ-1, Rack-D, Node-12
Shard 4 → AZ-1, Rack-E, Node-15
Shard 5 → AZ-2, Rack-F, Node-20
Shard 6 → AZ-2, Rack-G, Node-25
Shard 7 → AZ-2, Rack-H, Node-30
Shard 8 → AZ-2, Rack-I, Node-33
Shard 9 → AZ-2, Rack-J, Node-38
Shard 10 → AZ-3, Rack-K, Node-42
Shard 11 → AZ-3, Rack-L, Node-47
Shard 12 → AZ-3, Rack-M, Node-51
Shard 13 → AZ-3, Rack-N, Node-55
14 shards across 14 different racks, 3 AZs, weighted by capacity. Every node in the cluster can compute this exact same mapping independently.
Concrete Example: What Happens When a Node is Added
The cluster has 20,000 nodes and 100,000 placement groups. Each PG maps to 14 nodes. Node-20001 is added to Rack-Z in AZ-2.
Before: CRUSH runs on every PG using the old cluster map. PG-42 maps to [Node-1, Node-4, Node-8, ...]. No PG mentions Node-20001 (it didn't exist).
After: The cluster map is updated with Node-20001. CRUSH re-runs on every PG using the new map.
For most PGs, the hash computation lands on the same nodes as before. Node-20001 holds a tiny fraction of the total weight, so only a proportionally tiny fraction of selections shift to include it.
In this cluster that works out to roughly 70 PGs: 100,000 PGs × 14 shards is 1.4M shard placements, and an equal-weight node in a 20,001-node cluster ends up with about 70 of them, one shard from each of ~70 PGs. The other 99,900+ PGs are completely unaffected. Compare this to hash(key) % N, where adding one node moves nearly every object.
The data migration for those ~70 PGs happens in the background: read each affected shard from the node that currently holds it, write it to Node-20001, and update the PG membership in the cluster map.
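Continuing the sketch above, the minimal-movement property can be checked directly: recompute every PG under the old and new maps and count the diffs (small numbers on purpose so it runs quickly):

```python
# Reuses choose_distinct() from the earlier sketch. 20 nodes -> 21,
# a deliberately tiny cluster so the effect is visible.
nodes_v1 = [(f"Node-{i}", 720) for i in range(1, 21)]
nodes_v2 = nodes_v1 + [("Node-21", 720)]   # cluster map v48 -> v49

moved = sum(
    set(choose_distinct(pg, 6, nodes_v1)) != set(choose_distinct(pg, 6, nodes_v2))
    for pg in range(10_000)
)
print(f"{moved} of 10,000 PGs changed placement")
# Expected ~6/21 of PGs here, each handing ONE shard to Node-21; in the
# 20,000-node cluster the fraction drops to roughly 14/20,001 of PGs.
```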
CRUSH vs Consistent Hashing
People often ask why not just use consistent hashing. The answer is topology.
| Property | Consistent Hashing | CRUSH |
|---|---|---|
| Failure domain aware? | No. Ring positions are random. Two replicas can land on the same rack by chance. | Yes. Rules enforce rack/AZ separation. |
| Weighted placement? | Via virtual nodes (more vnodes = more data). Works but imprecise. | Native. Weight directly controls selection probability. |
| Topology changes? | Add/remove moves ~1/N of keys. | Add/remove moves only PGs on the changed node. |
| Central state? | Ring stored locally (sorted array). | Topology map stored locally (tree). |
| Computation? | Hash + binary search. O(log(N×V)). | Hash + tree walk. O(depth × replicas). |
| Used for? | Caches, load balancers, simple KV stores. | Storage systems that care about physical failure domains. |
Consistent hashing is fine when physical topology doesn't matter. For a cache cluster where losing a node just means cache misses, consistent hashing is simpler and works well. For a storage system where losing the wrong combination of nodes means data loss, CRUSH's topology awareness is not optional.
CRUSH Rules
CRUSH rules define the placement policy. A rule for 10+4 EC across 3 AZs looks like this (pseudocode):
rule ec_10_4 {
type erasure_coded
min_size 10
max_size 14
step take root
step choose 3 type az # pick 3 AZs
step chooseleaf 5 type rack # within each AZ, pick racks, then nodes
# (5+5+4 distribution across AZs)
step emit
}
The chooseleaf step picks a rack and then a node within it in one operation. The 5 means "up to 5 per AZ" but CRUSH distributes as evenly as possible (5+5+4 for 14 shards across 3 AZs).
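The "as evenly as possible" split is plain integer division. A tiny illustrative helper shows where 5+5+4 comes from:

```python
def distribute(n_shards: int, n_buckets: int) -> list[int]:
    """Spread n_shards over n_buckets as evenly as possible."""
    base, extra = divmod(n_shards, n_buckets)
    return [base + 1 if i < extra else base for i in range(n_buckets)]

print(distribute(14, 3))  # [5, 5, 4] -- the split the rule above produces
```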
Multiple rules can serve different use cases:
- rule replicated_3x: 3 replicas across 3 different racks (for small objects where EC is overkill)
- rule ec_16_4: 16+4 EC for Glacier-tier data (more data shards, less overhead)
- rule local_az: keep all replicas in one AZ (for latency-sensitive workloads that sacrifice AZ-failure tolerance)
Where Does the Cluster Map Come From?
CRUSH is a function. It needs a cluster map as input. But who builds and maintains that map?
The control plane manages the cluster map. In Ceph this is the Monitor cluster. In our object storage design, it's the Placement Service running on the control plane. The map is versioned (v1, v2, v3...) and changes only when the cluster topology changes.
How a node joins the cluster:
- New storage node starts up and contacts the control plane: "I'm Node-20001, I have 36 drives of 20TB each (weight: 720), I'm in AZ-2, Rack-Z."
- Control plane adds Node-20001 to the topology tree, assigns its weight, and generates a new cluster map (v47 → v48).
- New map is distributed to every component: API Routers, Data Routers, all storage nodes, repair workers. Usually via a watch/subscription mechanism (push) or periodic poll (a minimal polling sketch follows this list).
- Now every component has the updated map. CRUSH computations that involve Rack-Z in AZ-2 may now select Node-20001. Only the PGs whose CRUSH output changes need to migrate data.
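For the poll variant, the consumer side might look like this minimal sketch (class and method names are hypothetical; a production system would more likely hold a watch/stream open to the control plane):

```python
import time

class MapCache:
    """Sketch: poll the control plane for newer cluster-map versions."""
    def __init__(self):
        self.version, self.map = 0, None

    def fetch_newer(self, after_version: int):
        """Hypothetical control-plane call: (version, map) if newer, else None."""
        ...

    def run(self, poll_interval_s: float = 5.0):
        while True:
            update = self.fetch_newer(self.version)
            if update:
                self.version, self.map = update   # e.g. v48 -> v49
                # all subsequent local CRUSH calls see the new topology
            time.sleep(poll_interval_s)
```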
How a node fails:
- Heartbeat monitor detects Node-500 is unreachable (30-second timeout).
- Control plane marks Node-500 as DOWN in the map. New version generated (v48 → v49).
- Map distributed. CRUSH now excludes Node-500 from all computations.
- PGs that had shards on Node-500 get new node assignments. Repair workers read surviving shards, reconstruct the missing ones, and write them to the newly assigned nodes.
Key insight: CRUSH is local computation. The cluster map is shared global state. The map is small (a few MB for 20K nodes), changes rarely (a few times per day at most), and is cached everywhere. No per-request network call needed.
End-to-End Example: Storing an Object
Let's walk through storing photos/vacation/beach.jpg (10MB) from start to finish. We'll use a small cluster for clarity: 12 nodes, 3 AZs, 6 racks, erasure coding 4+2 (4 data shards, 2 parity shards).
Cluster map:
root
├── AZ-1
│ ├── Rack-A
│ │ ├── Node-1 (weight: 720)
│ │ └── Node-2 (weight: 720)
│ └── Rack-B
│ ├── Node-3 (weight: 720)
│ └── Node-4 (weight: 720)
├── AZ-2
│ ├── Rack-C
│ │ ├── Node-5 (weight: 720)
│ │ └── Node-6 (weight: 720)
│ └── Rack-D
│ ├── Node-7 (weight: 360) ← smaller node
│ └── Node-8 (weight: 720)
└── AZ-3
├── Rack-E
│ ├── Node-9 (weight: 720)
│ └── Node-10 (weight: 720)
└── Rack-F
├── Node-11 (weight: 720)
└── Node-12 (weight: 720)
Step 1: Object to Placement Group.
pg_id = hash("photos/vacation/beach.jpg") % 1000 = 42
Every component in the cluster computes this same hash. PG-42 is now the unit of placement.
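A sketch of this step (the walkthrough pins the result to PG-42 for readability; md5 is an arbitrary stand-in, and any stable hash works as long as every component uses the same one):

```python
import hashlib

PG_COUNT = 1000  # fixed when the cluster is created

def pg_for(key: str) -> int:
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % PG_COUNT

print(pg_for("photos/vacation/beach.jpg"))  # identical on every node
```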
Step 2: CRUSH computes placement for PG-42.
Rule: 4+2 EC, spread across 3 AZs, each shard on a different rack.
CRUSH walks the tree:
CRUSH(pg_id=42, map, rule=ec_4_2)
Step 1: Select AZs → AZ-1, AZ-2, AZ-3 (distribute 2+2+2)
Step 2: Within AZ-1, select 2 racks:
hash(42, shard=0) % 2 racks → Rack-A
hash(42, shard=1) % 2 racks → Rack-B
Step 3: Within each rack, select node (weighted):
Rack-A: hash(42, 0) selects Node-1 (weight 720)
Rack-B: hash(42, 1) selects Node-3 (weight 720)
Step 4: Repeat for AZ-2:
Rack-C → Node-5
Rack-D → Node-7 (lower weight, but hash landed here)
Step 5: Repeat for AZ-3:
Rack-E → Node-9
Rack-F → Node-11
Result:
Shard 0 (data) → AZ-1, Rack-A, Node-1
Shard 1 (data) → AZ-1, Rack-B, Node-3
Shard 2 (data) → AZ-2, Rack-C, Node-5
Shard 3 (data) → AZ-2, Rack-D, Node-7
Shard 4 (parity) → AZ-3, Rack-E, Node-9
Shard 5 (parity) → AZ-3, Rack-F, Node-11
6 shards, 6 different racks, 3 AZs. No two shards share a failure domain.
Step 3: Write the shards.
The Data Router (API path) or storage node (presigned path) runs the exact same CRUSH computation, gets the exact same 6 nodes, and sends:
Node-1 ← shard 0 (2.5MB data shard) ✓ written
Node-3 ← shard 1 (2.5MB data shard) ✓ written
Node-5 ← shard 2 (2.5MB data shard) ✓ written
Node-7 ← shard 3 (2.5MB data shard) ✓ written (slow, smaller node)
Node-9 ← shard 4 (2.5MB parity shard) ✓ written
Node-11 ← shard 5 (2.5MB parity shard) ✓ written
Wait for quorum (say 5 of 6). All 6 responded. Commit metadata via Raft. Done.
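A sketch of that fan-out write with a quorum wait, using a thread pool; send_shard is a hypothetical RPC, simulated here as an instant success so the sketch runs:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def send_shard(node: str, shard_id: int, data: bytes) -> str:
    # Hypothetical RPC: PUT the shard to the storage node.
    return node

def write_shards(placement: list[str], shards: list[bytes], quorum: int) -> None:
    pool = ThreadPoolExecutor(max_workers=len(placement))
    futures = [pool.submit(send_shard, node, i, shard)
               for i, (node, shard) in enumerate(zip(placement, shards))]
    acked = 0
    for fut in as_completed(futures):
        if fut.exception() is None:
            acked += 1
        if acked >= quorum:             # e.g. 5 of 6: safe to commit metadata
            pool.shutdown(wait=False)   # stragglers finish in the background
            return
    raise IOError("quorum not reached")

placement = ["Node-1", "Node-3", "Node-5", "Node-7", "Node-9", "Node-11"]
write_shards(placement, [b"..."] * 6, quorum=5)
```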
Step 4: Metadata stores the mapping.
key: "photos/vacation/beach.jpg"
pg: 42
version: v1
etag: "a1b2c3..."
shard_map: [Node-1, Node-3, Node-5, Node-7, Node-9, Node-11]
The metadata records PG-42 and the shard locations. But notice: anyone with the cluster map can recompute the same node list from PG-42 using CRUSH. The shard_map in metadata is a cache for fast lookup, not the source of truth for placement.
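The record might look like this sketch (field names mirror the listing above; the type is illustrative):

```python
from dataclasses import dataclass

@dataclass
class ObjectMeta:
    key: str
    pg: int
    version: str
    etag: str
    shard_map: list[str]  # cached CRUSH output, not the source of truth

meta = ObjectMeta(
    key="photos/vacation/beach.jpg", pg=42, version="v1", etag="a1b2c3...",
    shard_map=["Node-1", "Node-3", "Node-5", "Node-7", "Node-9", "Node-11"],
)
```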
End-to-End Example: Reading the Object (Happy Path)
Client wants to read photos/vacation/beach.jpg.
Step 1: Metadata lookup.
GET "photos/vacation/beach.jpg"
→ Metadata Service: PG-42, ETag "a1b2c3...", shard_map: [Node-1, Node-3, Node-5, Node-7, Node-9, Node-11]
Step 2: Fetch data shards.
We need any 4 of 6 shards to reconstruct (4+2 EC). Start by fetching the 4 data shards:
Node-1 → shard 0 ✓ (15ms)
Node-3 → shard 1 ✓ (12ms)
Node-5 → shard 2 ✓ (14ms)
Node-7 → shard 3 ✓ (22ms, slower node)
All 4 data shards arrived. Concatenate them in order. Return beach.jpg to the client. Parity shards not needed.
End-to-End Example: Reading During Node Failure
Same object, but Node-5 crashed 10 minutes ago.
Step 1: Metadata lookup. Same as before. Returns PG-42, shard map.
Step 2: Fetch data shards.
Node-1 → shard 0 ✓ (15ms)
Node-3 → shard 1 ✓ (12ms)
Node-5 → shard 2 ✗ TIMEOUT (50ms)
Node-7 → shard 3 ✓ (22ms)
Only 3 of 4 data shards. Not enough. Need 1 more shard (any shard).
Step 3: Fetch a parity shard.
Node-9 → shard 4 (parity) ✓ (18ms)
Now we have 4 shards: [0, 1, 3, 4]. Reed-Solomon reconstruction recovers shard 2 from these 4.
Step 4: Return object. Client gets beach.jpg. Total latency: ~70ms instead of ~22ms. The extra time is the timeout waiting for Node-5 plus the parity fetch. With speculative reads (fire off a parity request before the timeout), this drops to ~25ms.
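The shard-gathering logic from steps 2–3, sketched in Python. fetch_shard and rs_reconstruct are hypothetical placeholders (a real decoder would come from an erasure-coding library such as zfec); this is the sequential, non-speculative version:

```python
def fetch_shard(node: str, idx: int, timeout_ms: int) -> bytes | None:
    """Hypothetical RPC: shard bytes, or None on timeout/failure."""
    ...

def rs_reconstruct(shards: dict[int, bytes], k: int) -> bytes:
    """Hypothetical Reed-Solomon decode from any k of n shards."""
    ...

def read_object(shard_map: list[str], k: int = 4) -> bytes:
    shards: dict[int, bytes] = {}
    for i in range(k):                        # try the k data shards first
        data = fetch_shard(shard_map[i], i, timeout_ms=50)
        if data is not None:
            shards[i] = data
    parity = k                                # fall back to parity shards
    while len(shards) < k and parity < len(shard_map):
        data = fetch_shard(shard_map[parity], parity, timeout_ms=50)
        if data is not None:
            shards[parity] = data
        parity += 1
    if len(shards) < k:
        raise IOError("fewer than k shards reachable; read fails")
    if all(i in shards for i in range(k)):    # happy path: plain concatenation
        return b"".join(shards[i] for i in range(k))
    return rs_reconstruct(shards, k)          # degraded path: decode
```

A speculative-read variant fires the parity fetch before the data-shard timeout expires, trading a little extra I/O for the ~25ms tail latency mentioned above.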
Step 5: Background repair. The repair worker notices Node-5 is down. For PG-42, it reads shards [0, 1, 3, 4], reconstructs shard 2, and writes it to a new node (say Node-6 in the same rack). CRUSH with the updated cluster map (Node-5 marked DOWN) selects Node-6 as the replacement. Metadata updated.
What Happens When the Cluster Map Changes
All CRUSH computations depend on the cluster map. When the map changes, some PGs get different node assignments.
Adding Node-13 to Rack-A in AZ-1:
- Cluster map v48 → v49 (Node-13 added).
- CRUSH re-runs for all PGs. Most PGs are unaffected because their hash didn't land on Rack-A.
- A few PGs that previously selected Node-1 or Node-2 (the other nodes in Rack-A) now select Node-13 instead, because the weight distribution in Rack-A changed.
- For those PGs, shards migrate from Node-1/Node-2 to Node-13 in the background. Rate-limited to avoid I/O storms.
Removing Node-7 (failure):
- Cluster map v49 → v50 (Node-7 marked DOWN).
- CRUSH re-runs. PGs that had shards on Node-7 now get a different node in Rack-D (Node-8).
- Repair workers reconstruct the missing shards on Node-8.
In both cases, only PGs involving the changed node are affected. The rest of the cluster is untouched.
What CRUSH is NOT
CRUSH is not a storage system. It doesn't store data. It doesn't replicate data. It doesn't track which objects exist. It is purely a mathematical function:
f(pg_id, cluster_map, rule) → [node_1, node_2, ..., node_N]
The metadata service answers: "Does this object exist, and what's its PG?" CRUSH answers: "Given this PG, which nodes should hold the shards?" These are different questions answered by different systems.
Think of it this way: the cluster map is the world map. CRUSH is the GPS. You don't ask "where is my data?" You compute "where should it be?" And every node with the same map computes the same answer.
Why It Matters
At 350K requests/sec with 100 trillion objects, a central placement lookup would need to handle every read and write. That's a database serving 350K queries/sec just to answer "where is this data?" before the actual read or write even starts.
CRUSH eliminates that entire lookup. The API Router, Data Router, storage node, or repair worker can all compute placement locally in microseconds. No network call. No database dependency. No bottleneck. The cluster map that CRUSH needs is a few MB that changes a few times per day, cached everywhere.
This is what makes it possible to scale storage to exabytes without the placement layer becoming the bottleneck.
Key Points
- CRUSH replaces the question 'where IS my data?' with 'where SHOULD my data be?' Any node can compute the answer locally using the same algorithm and cluster map, with no central placement database and no network call
- The algorithm walks a hierarchy (region → AZ → rack → node), selecting one branch at each level using a pseudo-random function seeded by the placement group ID. This makes it deterministic: same PG ID + same cluster map = same node selection every time
- Failure domain awareness is built in. CRUSH rules enforce constraints like 'no two shards on the same rack' or 'max 5 shards per AZ'. This is what prevents a single rack power failure from losing too many shards of the same object
- When a node is added or removed, CRUSH recomputes placement and only the placement groups whose selection touches the changed node move. With 100K PGs and 20K nodes, adding one equal-weight node moves on the order of 70 PGs' worth of shards (one shard each), not a full reshuffle
- Weights control how much data each node gets: a 720TB node receives 5x the data of a 144TB node. This handles heterogeneous hardware without manual balancing
- CRUSH is not a storage system, not a replication protocol, not a metadata store. It is purely a placement function: input = PG ID + cluster topology, output = list of N nodes
Common Mistakes
- ✗Confusing CRUSH with consistent hashing. Consistent hashing maps keys to positions on a ring. CRUSH walks a topology tree with failure domain awareness. Consistent hashing has no concept of racks or AZs
- ✗Not defining CRUSH rules for your failure domains. Default rules spread data across hosts but not across racks. If two replicas land on the same rack and the rack loses power, both copies are gone
- ✗Ignoring weight rebalancing when adding nodes with different disk sizes. A new node with 20TB drives gets the same default weight as an old node with 8TB drives, leading to uneven utilization
- ✗Making the topology tree too deep. Every level in the hierarchy adds a random selection step. Region → AZ → rack → node is fine. Adding sub-rack and sub-node levels increases computation and makes debugging placement harder