Shard Rebalancing & Virtual Buckets
Why Rebalancing Exists
You start with 3 database nodes, each holding a third of the data. The system grows. One node fills up, or you need more read throughput, or you want to survive a hardware failure without losing a third of your data. You add a fourth node.
Now what? The new node sits empty. You need to move data from the existing nodes to the new one. This is shard rebalancing, and it is one of the most operationally dangerous procedures in a distributed database. Move data too fast and you overwhelm the network. Move it too slow and the cluster stays imbalanced for hours. Get the routing wrong during the transition and queries return stale or missing results.
Every distributed database handles rebalancing differently, but they all deal with the same fundamental trade-offs.
Sharding Strategies
Range-Based Sharding
Divide the key space into contiguous ranges. Keys from "aardvark" up to (but not including) "mammoth" go to shard 1; keys from "mammoth" through "zebra" go to shard 2. Half-open ranges keep boundary keys unambiguous: every key has exactly one owner.
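A minimal router over such boundaries is just a binary search. A sketch, with illustrative boundary values:

```python
import bisect

# Half-open ranges: shard i owns keys in [lower_i, upper_i).
# Exclusive upper bounds give a boundary key like "mammoth" exactly one owner.
SHARD_UPPER_BOUNDS = ["mammoth", "\uffff"]  # shard 0: < "mammoth"; shard 1: the rest

def shard_for(key: str) -> int:
    """Binary-search the sorted upper bounds to find the owning shard."""
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, key)

assert shard_for("aardvark") == 0
assert shard_for("mammoth") == 1   # the boundary key lands on the second shard
```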
Range-based sharding makes range queries efficient: "give me all users with last names starting with S" hits one shard instead of all of them. It also makes the data model intuitive; you can reason about which shard owns which data.
The problem is hotspots. If everyone creates accounts starting with "user_" today, one shard gets all the writes. Time-series data is even worse: if shards are split by timestamp ranges, the "current" shard gets 100% of writes while historical shards sit idle.
CockroachDB uses range-based sharding but with automatic range splitting. When a range gets too large (default: 512 MB) or too hot (too many reads/writes per second), CockroachDB splits it into two smaller ranges. This is the best of both worlds: you get the range query benefit plus automatic load balancing.
Hash-Based Sharding
Hash the key and assign the hash to a shard: shard = hash(key) % num_shards. Distribution is uniform by construction. No hotspots unless individual keys are themselves hot (e.g., a single popular key absorbing a disproportionate share of traffic).
The cost: range queries must scatter to all shards because hash functions destroy key ordering. "Give me all keys between X and Y" requires asking every shard.
DynamoDB uses hash-based partitioning on the partition key. This is why DynamoDB access patterns must be designed around single-key lookups rather than range scans across partitions.
Hash + Range (Compound)
Cassandra combines both: the partition key is hashed to pick a shard, and the clustering key determines sort order within the shard. This lets you do range queries within a partition efficiently while distributing partitions evenly across nodes.
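A sketch of the compound scheme (Cassandra actually hashes with Murmur3 over a token ring; plain modulo over an MD5 digest stands in for it here):

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4

def shard_for(partition_key: str) -> int:
    # Only the partition key is hashed; the clustering key never affects placement.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All rows for one partition land on one shard, kept sorted by clustering key,
# so "readings for sensor-17 between two timestamps" is a single-shard scan.
partitions = defaultdict(list)
for ts in (1703, 1701, 1702):
    partitions[shard_for("sensor-17")].append(("sensor-17", ts))
for rows in partitions.values():
    rows.sort(key=lambda row: row[1])   # clustering-key order within the partition
```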
Virtual Buckets: The Key Abstraction
The modulo approach (shard = hash(key) % N) has a fatal problem: changing N remaps almost every key. Adding a fourth shard to a three-shard system moves 75% of the data.
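A quick simulation makes the damage concrete; Python's built-in hash stands in for whatever hash function the database uses:

```python
# Fraction of keys that change shards when going from 3 to 4 shards
# under plain modulo placement. Expect roughly 3/4.
keys = [f"user_{i}" for i in range(100_000)]
moved = sum(1 for k in keys if hash(k) % 3 != hash(k) % 4)
print(f"{moved / len(keys):.0%} of keys moved")   # ~75%
```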
Virtual buckets fix this. Instead of mapping keys directly to physical nodes, create a much larger number of virtual buckets (say, 1024) and assign those buckets to physical nodes.
key → hash(key) % 1024 → virtual_bucket → physical_node
With 1024 virtual buckets and 3 physical nodes, each node owns about 341 buckets. Adding a fourth node means reassigning about 256 buckets (one-quarter) to the new node. The key-to-bucket mapping never changes. Only the bucket-to-node mapping changes.
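A sketch of the two-level mapping, with hypothetical bucket counts; note that only the second table ever changes:

```python
NUM_BUCKETS = 1024

def bucket_for(key: str) -> int:
    return hash(key) % NUM_BUCKETS   # fixed for the life of the cluster

# The only mutable state: which node owns which bucket.
bucket_to_node = {b: b % 3 for b in range(NUM_BUCKETS)}   # 3 nodes, ~341 buckets each

def node_for(key: str) -> int:
    return bucket_to_node[bucket_for(key)]

# Adding node 3: hand it every fourth bucket. ~256 buckets move; the other
# ~768 buckets (and every key-to-bucket mapping) stay exactly where they were.
for b in range(0, NUM_BUCKETS, 4):
    bucket_to_node[b] = 3
```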
This is exactly the same principle as virtual nodes in consistent hashing, just viewed from a different angle.
Vitess manages an equivalent mapping through "vindexes" defined in its VSchema. MongoDB calls the buckets "chunks" and moves chunks between shards. Kafka calls them "partitions" and reassigns partitions between brokers. Different names, same idea.
Online Rebalancing
The hard part is not deciding which data to move. It is moving data while the system is still serving traffic.
The Double-Serve Period
During rebalancing, a key might exist on both the old shard and the new shard. The system needs to decide: which shard handles reads and writes for this key right now?
Approach 1: Move then switch. Copy all data to the new shard. Once the copy is complete, atomically update the routing table so new requests go to the new shard. This is clean but requires the old shard to forward any writes that happen during the copy to the new shard (or buffer them for replay).
Approach 2: Switch then backfill. Update the routing table first so new writes go to the new shard. Then copy historical data in the background. Reads during the backfill might need to check both shards. This is faster to start but more complex to implement correctly.
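A toy sketch of approach 1 for a single migrating bucket; the names and structure are illustrative, not any particular database's API:

```python
shards = {"shard-1": {}, "shard-2": {}, "shard-3": {}}
routing = {0: "shard-1", 1: "shard-2"}   # bucket -> owning shard

# One in-flight migration: bucket 0 is being copied from shard-1 to shard-3.
migration = {"bucket": 0, "source": "shard-1", "target": "shard-3"}

def write(bucket: int, key: str, value: str) -> None:
    owner = routing[bucket]
    shards[owner][key] = value
    # While the copy runs, the source forwards each write to the target
    # so updates that race with the backfill are not lost.
    if migration and bucket == migration["bucket"] and owner == migration["source"]:
        shards[migration["target"]][key] = value

def finish_migration() -> None:
    global migration
    # Backfill anything the target has not seen, then flip routing atomically.
    shards[migration["target"]].update(shards[migration["source"]])
    routing[migration["bucket"]] = migration["target"]
    migration = None
```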
CockroachDB uses approach 1 with a Raft-based handoff. The range is first replicated to the new node (as a Raft follower), then leadership is transferred, then the old replica is removed. At no point is data unavailable.
Rate Limiting
Every production system rate-limits rebalancing to avoid starving the live workload.
Cassandra's stream_throughput_outbound setting (historically stream_throughput_outbound_megabits_per_sec, with a default of 200 megabits per second) caps how fast data streams between nodes during repair and rebalancing. Set it too high, and compaction, reads, and writes compete for I/O. Set it too low, and rebalancing takes days.
Kafka's kafka-reassign-partitions tool has a --throttle flag (in bytes per second) that limits replication bandwidth during partition reassignment. The default behavior (no throttle) has caused plenty of production incidents.
CockroachDB uses a combination of rate limiting and snapshot queuing. It limits the number of concurrent snapshot transfers per node and caps the transfer rate.
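Under the hood, knobs like these are typically token buckets over bytes transferred. A minimal sketch:

```python
import time

class TransferThrottle:
    """Token bucket capping rebalancing traffic at `rate` bytes per second,
    with at most one second of burst."""
    def __init__(self, rate: float):
        self.rate = rate
        self.tokens = rate
        self.last = time.monotonic()

    def acquire(self, nbytes: int) -> None:
        """Block until `nbytes` of budget is available, then spend it."""
        assert nbytes <= self.rate, "chunk larger than the burst budget"
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# throttle = TransferThrottle(25 * 1024 * 1024)   # ~25 MB/s
# for chunk in chunks: throttle.acquire(len(chunk)); send(chunk)
```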
Detecting Balance
How do you know when rebalancing is needed? Common triggers:
- Storage skew. One node has 2x the data of others. Measure by bytes, not key count.
- Load skew. One node handles 3x the QPS of others. This can happen even with even data distribution if some keys are hotter than others.
- Node addition. A new empty node joins and needs data.
- Node removal. A node is being decommissioned and its data needs to move elsewhere.
Most systems monitor these automatically and either alert (requiring manual intervention) or trigger automatic rebalancing. A minimal version of such a balance check is sketched below.
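The numbers and thresholds here are illustrative; real systems tune both:

```python
nodes = {
    "n1": {"bytes": 900e9, "qps": 12_000},   # suspiciously heavy node
    "n2": {"bytes": 420e9, "qps": 4_000},
    "n3": {"bytes": 410e9, "qps": 4_500},
}

def skew(metric: str) -> float:
    """Ratio of the heaviest node to the cluster mean for one metric."""
    values = [stats[metric] for stats in nodes.values()]
    return max(values) / (sum(values) / len(values))

# Trigger thresholds are policy, not physics; the 1.3x-2x band is common.
if skew("bytes") > 1.5:
    print("storage skew: plan data moves (measured in bytes, not key count)")
if skew("qps") > 2.0:
    print("load skew: split or relocate hot buckets")
```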
CockroachDB Range Splits in Detail
CockroachDB is one of the most sophisticated examples of automatic shard rebalancing, so it is worth examining closely.
Data is divided into ranges, each covering a contiguous span of the key space. Each range is replicated across 3-5 nodes using Raft. Ranges are the unit of replication, splitting, and rebalancing.
Automatic splitting. When a range exceeds 512 MB, CockroachDB splits it into two ranges at the midpoint of that range's keyspan. Each half gets its own Raft group. The split is an atomic operation that does not block reads or writes.
Automatic merging. When two adjacent ranges are both small (under 128 MB combined) and have low traffic, CockroachDB merges them back into one. This prevents range fragmentation.
Load-based splitting. Even if a range is small, if it receives high QPS, CockroachDB splits it at a key that divides the load. This prevents hotspots where a single range's Raft leader becomes a bottleneck.
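These three triggers compose into one decision per range. A sketch of that logic (thresholds and field names are illustrative, not CockroachDB's actual code):

```python
MAX_RANGE_BYTES = 512 * 2**20    # size-based split threshold
MERGE_BYTES = 128 * 2**20        # combined-size threshold for merging
HOT_QPS = 2_500                  # load-based split threshold (illustrative)

def plan_for_range(stats: dict) -> tuple:
    """stats: {'bytes': int, 'qps': int, 'load_split_key': str | None}."""
    if stats["bytes"] > MAX_RANGE_BYTES:
        return ("split", "size-midpoint")            # halve the range's keyspan
    if stats["qps"] > HOT_QPS and stats["load_split_key"]:
        return ("split", stats["load_split_key"])    # split where load divides evenly
    if stats["bytes"] < MERGE_BYTES:
        return ("consider-merge", None)              # also depends on the neighbor
    return ("leave", None)
```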
Rebalancing. A background process continuously monitors replica distribution across nodes. If one node has too many replicas, it transfers some to nodes with fewer. The transfer uses Raft: the new node joins the Raft group as a learner, catches up, gets promoted to a voter, and then the old replica is removed.
Kafka Partition Reassignment
Kafka has a different rebalancing model because partitions are append-only logs, not mutable key-value data.
Each Kafka topic has a fixed number of partitions, set at creation time; the count can be increased later but never decreased, and increasing it changes the key-to-partition mapping. Each partition is an ordered, immutable sequence of records. Partitions are distributed across brokers.
When you add a new broker, existing partitions do not automatically move. You must explicitly trigger a reassignment using kafka-reassign-partitions.sh or a rebalancing tool like Cruise Control (LinkedIn's open-source Kafka load balancer).
Reassignment copies the partition's data to the new broker, catches up with live writes, and then switches leadership. The catch: Kafka partitions can be very large (hundreds of GB), and copying them eats network bandwidth. Without throttling, a reassignment can saturate the network and cause consumer lag spikes.
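A reassignment is described by a JSON plan file. A Python sketch of generating one (the topic name and broker IDs are hypothetical; the --throttle value is in bytes per second):

```python
import json

# Move partition 3 of topic "events" onto new broker 4 (replicas listed
# leader-first). The throttle is given on the command line, not in the file.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "events", "partition": 3, "replicas": [4, 1, 2]},
    ],
}
with open("reassign.json", "w") as f:
    json.dump(plan, f)

# kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
#     --reassignment-json-file reassign.json --execute --throttle 50000000
```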
Cruise Control automates this by continuously monitoring broker load and proposing reassignment plans that minimize data movement while improving balance. It is effectively an autopilot for Kafka rebalancing.
When Rebalancing Goes Wrong
The most common rebalancing failure mode is cascading load. You start moving data from an overloaded node. The data movement itself adds load (disk reads on the source, disk writes on the destination, network transfer). The already-overloaded node gets even more overloaded. Response times spike. Clients start timing out and retrying, adding more load. The system spirals.
Prevention: aggressive rate limiting, monitoring during rebalancing, and circuit breakers that pause rebalancing if the cluster's health degrades.
The second most common failure: routing errors during the transition. A client sends a write to the old shard, but the routing table has already been updated to point to the new shard. The write goes to the wrong place. Prevention: the source shard must either proxy writes to the new shard or reject writes after the routing switch, forcing the client to retry (and hit the new shard via the updated routing table).
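One common way to enforce the reject-and-retry rule is epoch fencing: the routing table carries a generation number, clients attach it to every write, and a shard refuses writes from a stale generation. A sketch:

```python
class StaleRouting(Exception):
    """Signals the client to refetch the routing table and retry."""
    def __init__(self, current_epoch: int):
        self.current_epoch = current_epoch

class Shard:
    def __init__(self) -> None:
        self.owned = set()     # buckets this shard currently owns
        self.epoch = 0         # bumped on every ownership change
        self.data = {}

    def write(self, bucket: int, client_epoch: int, key: str, value: str) -> None:
        # Reject rather than silently accept a misrouted write; the retry
        # goes through the updated routing table and reaches the new owner.
        if client_epoch < self.epoch or bucket not in self.owned:
            raise StaleRouting(self.epoch)
        self.data[key] = value
```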
Key Points
- Shard rebalancing moves data between nodes when you add capacity, remove nodes, or when load becomes uneven. The core challenge is doing this without downtime and without killing the performance of the live system that is still serving reads and writes
- Range-based sharding splits the key space into contiguous ranges (e.g., keys A-M on shard 1, N-Z on shard 2). Easy to reason about and efficient for range queries, but prone to hotspots if certain key ranges get more traffic
- Hash-based sharding hashes keys and assigns the hash to a shard. This gives even distribution but forces range queries to scatter to every shard because adjacent keys land on different ones. DynamoDB and Cassandra use hash-based approaches
- Virtual buckets (or virtual shards) decouple the logical partition from the physical node. Create 1000 virtual buckets, assign them to 10 physical nodes. Adding an 11th node means reassigning ~90 buckets, not rehashing every key
- Online rebalancing while serving traffic requires throttling data movement to avoid saturating network and disk I/O. Every production system has a 'rebalancing rate limit' knob, and getting it right is the difference between a smooth expansion and a cascading outage
Common Mistakes
- Rebalancing without rate limiting. Moving terabytes of data at full network speed starves the live workload of bandwidth and I/O. Cassandra and CockroachDB throttle transfers by default, and Kafka's reassignment tool takes a --throttle flag, for good reason
- Splitting shards by key count instead of data size. A shard with 1 million tiny keys and a shard with 1 million large blobs have the same key count but wildly different storage and I/O requirements. Split on bytes, not rows
- Not monitoring shard balance after the initial setup. Data distribution shifts as usage patterns change. The shard that was perfectly sized six months ago might now hold 3x the data of its neighbors. Set up alerts on shard size deviation
- Doing a full rebalance when a partial one would suffice. If one shard is hot, split that shard. Do not reorganize the entire cluster. Minimizing data movement reduces risk and shortens the rebalancing window