Shard Rebalancing & Virtual Buckets
Why Rebalancing Exists
You start with 3 database nodes, each holding a third of the data. The system grows. One node fills up, or you need more read throughput, or you want to survive a hardware failure without losing a third of your data. You add a fourth node.
Now what? The new node sits empty. You need to move data from the existing nodes to the new one. This is shard rebalancing, and it is one of the most operationally dangerous procedures in a distributed database. Move data too fast and you overwhelm the network. Move it too slow and the cluster stays imbalanced for hours. Get the routing wrong during the transition and queries return stale or missing results.
Every distributed database handles rebalancing differently, but they all deal with the same fundamental trade-offs.
Sharding Strategies
Range-Based Sharding
Divide the key space into contiguous ranges. Keys from "aardvark" up to (but not including) "mammoth" go to shard 1; keys from "mammoth" through "zebra" go to shard 2. Half-open ranges keep boundary keys unambiguous: every key has exactly one owner.
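A minimal router over such boundaries is just a binary search. A sketch, with illustrative boundary values:

```python
import bisect

# Half-open ranges: shard i owns keys in [lower_i, upper_i).
# Exclusive upper bounds give a boundary key like "mammoth" exactly one owner.
SHARD_UPPER_BOUNDS = ["mammoth", "\uffff"]  # shard 0: < "mammoth"; shard 1: the rest

def shard_for(key: str) -> int:
    """Binary-search the sorted upper bounds to find the owning shard."""
    return bisect.bisect_right(SHARD_UPPER_BOUNDS, key)

assert shard_for("aardvark") == 0
assert shard_for("mammoth") == 1   # the boundary key lands on the second shard
```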
Range-based sharding makes range queries efficient: "give me all users with last names starting with S" hits one shard instead of all of them. It also makes the data model intuitive; you can reason about which shard owns which data.
The problem is hotspots. If everyone creates accounts starting with "user_" today, one shard gets all the writes. Time-series data is even worse: if shards are split by timestamp ranges, the "current" shard gets 100% of writes while historical shards sit idle.
CockroachDB uses range-based sharding but with automatic range splitting. When a range gets too large (default: 512 MB) or too hot (too many reads/writes per second), CockroachDB splits it into two smaller ranges. This is the best of both worlds: you get the range query benefit plus automatic load balancing.
Hash-Based Sharding
Hash the key and assign the hash to a shard: shard = hash(key) % num_shards. Distribution is uniform by construction. No hotspots unless individual keys are themselves hot (e.g., a single popular key absorbing a disproportionate share of traffic).
The cost: range queries must scatter to all shards because hash functions destroy key ordering. "Give me all keys between X and Y" requires asking every shard.
DynamoDB uses hash-based partitioning on the partition key. This is why DynamoDB access patterns must be designed around single-key lookups rather than range scans across partitions.
Hash + Range (Compound)
Cassandra combines both: the partition key is hashed to pick a shard, and the clustering key determines sort order within the shard. This lets you do range queries within a partition efficiently while distributing partitions evenly across nodes.
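A sketch of the compound scheme (Cassandra actually hashes with Murmur3 over a token ring; plain modulo over an MD5 digest stands in for it here):

```python
import hashlib
from collections import defaultdict

NUM_SHARDS = 4

def shard_for(partition_key: str) -> int:
    # Only the partition key is hashed; the clustering key never affects placement.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# All rows for one partition land on one shard, kept sorted by clustering key,
# so "readings for sensor-17 between two timestamps" is a single-shard scan.
partitions = defaultdict(list)
for ts in (1703, 1701, 1702):
    partitions[shard_for("sensor-17")].append(("sensor-17", ts))
for rows in partitions.values():
    rows.sort(key=lambda row: row[1])   # clustering-key order within the partition
```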
Virtual Buckets: The Key Abstraction
The modulo approach (shard = hash(key) % N) has a fatal problem: changing N remaps almost every key. Adding a fourth shard to a three-shard system moves 75% of the data.
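A quick simulation makes the damage concrete; Python's built-in hash stands in for whatever hash function the database uses:

```python
# Fraction of keys that change shards when going from 3 to 4 shards
# under plain modulo placement. Expect roughly 3/4.
keys = [f"user_{i}" for i in range(100_000)]
moved = sum(1 for k in keys if hash(k) % 3 != hash(k) % 4)
print(f"{moved / len(keys):.0%} of keys moved")   # ~75%
```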
Virtual buckets fix this. Instead of mapping keys directly to physical nodes, create a much larger number of virtual buckets (say, 1024) and assign those buckets to physical nodes.
key → hash(key) % 1024 → virtual_bucket → physical_node
With 1024 virtual buckets and 3 physical nodes, each node owns about 341 buckets. Adding a fourth node means reassigning about 256 buckets (one-quarter) to the new node. The key-to-bucket mapping never changes. Only the bucket-to-node mapping changes.
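A sketch of the two-level mapping, with hypothetical bucket counts; note that only the second table ever changes:

```python
NUM_BUCKETS = 1024

def bucket_for(key: str) -> int:
    return hash(key) % NUM_BUCKETS   # fixed for the life of the cluster

# The only mutable state: which node owns which bucket.
bucket_to_node = {b: b % 3 for b in range(NUM_BUCKETS)}   # 3 nodes, ~341 buckets each

def node_for(key: str) -> int:
    return bucket_to_node[bucket_for(key)]

# Adding node 3: hand it every fourth bucket. ~256 buckets move; the other
# ~768 buckets (and every key-to-bucket mapping) stay exactly where they were.
for b in range(0, NUM_BUCKETS, 4):
    bucket_to_node[b] = 3
```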
This is exactly the same principle as virtual nodes in consistent hashing, just viewed from a different angle.
Vitess manages an equivalent mapping through "vindexes" defined in its VSchema. MongoDB calls the buckets "chunks" and moves chunks between shards. Kafka calls them "partitions" and reassigns partitions between brokers. Different names, same idea.
Online Rebalancing
The hard part is not deciding which data to move. It is moving data while the system is still serving traffic.
The Double-Serve Period
During rebalancing, a key might exist on both the old shard and the new shard. The system needs to decide: which shard handles reads and writes for this key right now?
Approach 1: Move then switch. Copy all data to the new shard. Once the copy is complete, atomically update the routing table so new requests go to the new shard. This is clean but requires the old shard to forward any writes that happen during the copy to the new shard (or buffer them for replay).
Approach 2: Switch then backfill. Update the routing table first so new writes go to the new shard. Then copy historical data in the background. Reads during the backfill might need to check both shards. This is faster to start but more complex to implement correctly.
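A toy sketch of approach 1 for a single migrating bucket; the names and structure are illustrative, not any particular database's API:

```python
shards = {"shard-1": {}, "shard-2": {}, "shard-3": {}}
routing = {0: "shard-1", 1: "shard-2"}   # bucket -> owning shard

# One in-flight migration: bucket 0 is being copied from shard-1 to shard-3.
migration = {"bucket": 0, "source": "shard-1", "target": "shard-3"}

def write(bucket: int, key: str, value: str) -> None:
    owner = routing[bucket]
    shards[owner][key] = value
    # While the copy runs, the source forwards each write to the target
    # so updates that race with the backfill are not lost.
    if migration and bucket == migration["bucket"] and owner == migration["source"]:
        shards[migration["target"]][key] = value

def finish_migration() -> None:
    global migration
    # Backfill anything the target has not seen, then flip routing atomically.
    shards[migration["target"]].update(shards[migration["source"]])
    routing[migration["bucket"]] = migration["target"]
    migration = None
```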
CockroachDB uses approach 1 with a Raft-based handoff. The range is first replicated to the new node (as a Raft follower), then leadership is transferred, then the old replica is removed. At no point is data unavailable.
Rate Limiting
Every production system rate-limits rebalancing to avoid starving the live workload.
Cassandra's stream_throughput_outbound setting (historically stream_throughput_outbound_megabits_per_sec, with a default of 200 megabits per second) caps how fast data streams between nodes during repair and rebalancing. Set it too high, and compaction, reads, and writes compete for I/O. Set it too low, and rebalancing takes days.
Kafka's kafka-reassign-partitions tool has a --throttle flag (in bytes per second) that limits replication bandwidth during partition reassignment. The default behavior (no throttle) has caused plenty of production incidents.
CockroachDB uses a combination of rate limiting and snapshot queuing. It limits the number of concurrent snapshot transfers per node and caps the transfer rate.
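Under the hood, knobs like these are typically token buckets over bytes transferred. A minimal sketch:

```python
import time

class TransferThrottle:
    """Token bucket capping rebalancing traffic at `rate` bytes per second,
    with at most one second of burst."""
    def __init__(self, rate: float):
        self.rate = rate
        self.tokens = rate
        self.last = time.monotonic()

    def acquire(self, nbytes: int) -> None:
        """Block until `nbytes` of budget is available, then spend it."""
        assert nbytes <= self.rate, "chunk larger than the burst budget"
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)

# throttle = TransferThrottle(25 * 1024 * 1024)   # ~25 MB/s
# for chunk in chunks: throttle.acquire(len(chunk)); send(chunk)
```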
Detecting Balance
How do you know when rebalancing is needed? Common triggers:
- Storage skew. One node has 2x the data of others. Measure by bytes, not key count.
- Load skew. One node handles 3x the QPS of others. This can happen even with even data distribution if some keys are hotter than others.
- Node addition. A new empty node joins and needs data.
- Node removal. A node is being decommissioned and its data needs to move elsewhere.
Most systems monitor these automatically and either alert (requiring manual intervention) or trigger automatic rebalancing. A minimal version of such a balance check is sketched below.
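The numbers and thresholds here are illustrative; real systems tune both:

```python
nodes = {
    "n1": {"bytes": 900e9, "qps": 12_000},   # suspiciously heavy node
    "n2": {"bytes": 420e9, "qps": 4_000},
    "n3": {"bytes": 410e9, "qps": 4_500},
}

def skew(metric: str) -> float:
    """Ratio of the heaviest node to the cluster mean for one metric."""
    values = [stats[metric] for stats in nodes.values()]
    return max(values) / (sum(values) / len(values))

# Trigger thresholds are policy, not physics; the 1.3x-2x band is common.
if skew("bytes") > 1.5:
    print("storage skew: plan data moves (measured in bytes, not key count)")
if skew("qps") > 2.0:
    print("load skew: split or relocate hot buckets")
```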
CockroachDB Range Splits in Detail
CockroachDB is one of the most sophisticated examples of automatic shard rebalancing, so it is worth examining closely.
Data is divided into ranges, each covering a contiguous span of the key space. Each range is replicated across 3-5 nodes using Raft. Ranges are the unit of replication, splitting, and rebalancing.
Automatic splitting. When a range exceeds 512 MB, CockroachDB splits it into two ranges at the midpoint of that range's keyspan. Each half gets its own Raft group. The split is an atomic operation that does not block reads or writes.
Automatic merging. When two adjacent ranges are both small (under 128 MB combined) and have low traffic, CockroachDB merges them back into one. This prevents range fragmentation.
Load-based splitting. Even if a range is small, if it receives high QPS, CockroachDB splits it at a key that divides the load. This prevents hotspots where a single range's Raft leader becomes a bottleneck.
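These three triggers compose into one decision per range. A sketch of that logic (thresholds and field names are illustrative, not CockroachDB's actual code):

```python
MAX_RANGE_BYTES = 512 * 2**20    # size-based split threshold
MERGE_BYTES = 128 * 2**20        # combined-size threshold for merging
HOT_QPS = 2_500                  # load-based split threshold (illustrative)

def plan_for_range(stats: dict) -> tuple:
    """stats: {'bytes': int, 'qps': int, 'load_split_key': str | None}."""
    if stats["bytes"] > MAX_RANGE_BYTES:
        return ("split", "size-midpoint")            # halve the range's keyspan
    if stats["qps"] > HOT_QPS and stats["load_split_key"]:
        return ("split", stats["load_split_key"])    # split where load divides evenly
    if stats["bytes"] < MERGE_BYTES:
        return ("consider-merge", None)              # also depends on the neighbor
    return ("leave", None)
```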
Rebalancing. A background process continuously monitors replica distribution across nodes. If one node has too many replicas, it transfers some to nodes with fewer. The transfer uses Raft: the new node joins the Raft group as a learner, catches up, gets promoted to a voter, and then the old replica is removed.
Kafka Partition Reassignment
Kafka has a different rebalancing model because partitions are append-only logs, not mutable key-value data.
Each Kafka topic has a fixed number of partitions, set at creation time; the count can be increased later but never decreased, and increasing it changes the key-to-partition mapping. Each partition is an ordered, immutable sequence of records. Partitions are distributed across brokers.
When you add a new broker, existing partitions do not automatically move. You must explicitly trigger a reassignment using kafka-reassign-partitions.sh or a rebalancing tool like Cruise Control (LinkedIn's open-source Kafka load balancer).
Reassignment copies the partition's data to the new broker, catches up with live writes, and then switches leadership. The catch: Kafka partitions can be very large (hundreds of GB), and copying them eats network bandwidth. Without throttling, a reassignment can saturate the network and cause consumer lag spikes.
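A reassignment is described by a JSON plan file. A Python sketch of generating one (the topic name and broker IDs are hypothetical; the --throttle value is in bytes per second):

```python
import json

# Move partition 3 of topic "events" onto new broker 4 (replicas listed
# leader-first). The throttle is given on the command line, not in the file.
plan = {
    "version": 1,
    "partitions": [
        {"topic": "events", "partition": 3, "replicas": [4, 1, 2]},
    ],
}
with open("reassign.json", "w") as f:
    json.dump(plan, f)

# kafka-reassign-partitions.sh --bootstrap-server broker:9092 \
#     --reassignment-json-file reassign.json --execute --throttle 50000000
```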
Cruise Control automates this by continuously monitoring broker load and proposing reassignment plans that minimize data movement while improving balance. It is effectively an autopilot for Kafka rebalancing.
When Rebalancing Goes Wrong
The most common rebalancing failure mode is cascading load. You start moving data from an overloaded node. The data movement itself adds load (disk reads on the source, disk writes on the destination, network transfer). The already-overloaded node gets even more overloaded. Response times spike. Clients start timing out and retrying, adding more load. The system spirals.
Prevention: aggressive rate limiting, monitoring during rebalancing, and circuit breakers that pause rebalancing if the cluster's health degrades.
The second most common failure: routing errors during the transition. A client sends a write to the old shard, but the routing table has already been updated to point to the new shard. The write goes to the wrong place. Prevention: the source shard must either proxy writes to the new shard or reject writes after the routing switch, forcing the client to retry (and hit the new shard via the updated routing table).
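One common way to enforce the reject-and-retry rule is epoch fencing: the routing table carries a generation number, clients attach it to every write, and a shard refuses writes from a stale generation. A sketch:

```python
class StaleRouting(Exception):
    """Signals the client to refetch the routing table and retry."""
    def __init__(self, current_epoch: int):
        self.current_epoch = current_epoch

class Shard:
    def __init__(self) -> None:
        self.owned = set()     # buckets this shard currently owns
        self.epoch = 0         # bumped on every ownership change
        self.data = {}

    def write(self, bucket: int, client_epoch: int, key: str, value: str) -> None:
        # Reject rather than silently accept a misrouted write; the retry
        # goes through the updated routing table and reaches the new owner.
        if client_epoch < self.epoch or bucket not in self.owned:
            raise StaleRouting(self.epoch)
        self.data[key] = value
```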
Key Points
- Shard rebalancing moves data between nodes when you add capacity, remove nodes, or when load becomes uneven. The core challenge is doing this without downtime and without killing the performance of the live system that is still serving reads and writes
- Range-based sharding splits the key space into contiguous ranges (e.g., keys A-M on shard 1, N-Z on shard 2). Easy to reason about and efficient for range queries, but prone to hotspots if certain key ranges get more traffic
- Hash-based sharding hashes keys and assigns the hash to a shard. This gives even distribution but forces range queries to scatter to every shard because adjacent keys land on different ones. DynamoDB and Cassandra use hash-based approaches
- Virtual buckets (or virtual shards) decouple the logical partition from the physical node. Create 1000 virtual buckets, assign them to 10 physical nodes. Adding an 11th node means reassigning ~90 buckets, not rehashing every key
- Online rebalancing while serving traffic requires throttling data movement to avoid saturating network and disk I/O. Every production system has a 'rebalancing rate limit' knob, and getting it right is the difference between a smooth expansion and a cascading outage
Common Mistakes
- Rebalancing without rate limiting. Moving terabytes of data at full network speed starves the live workload of bandwidth and I/O. Cassandra and CockroachDB throttle transfers by default, and Kafka's reassignment tool takes a --throttle flag, for good reason
- Splitting shards by key count instead of data size. A shard with 1 million tiny keys and a shard with 1 million large blobs have the same key count but wildly different storage and I/O requirements. Split on bytes, not rows
- Not monitoring shard balance after the initial setup. Data distribution shifts as usage patterns change. The shard that was perfectly sized six months ago might now hold 3x the data of its neighbors. Set up alerts on shard size deviation
- Doing a full rebalance when a partial one would suffice. If one shard is hot, split that shard. Do not reorganize the entire cluster. Minimizing data movement reduces risk and shortens the rebalancing window