TiKV
Distributed key-value store that delivers transactions without giving up range scans
How It Works Internally
TiKV came out of PingCAP as the storage layer for TiDB, their distributed MySQL-compatible database. But it stands on its own as a general-purpose distributed key-value store. Think of it as "what if someone took RocksDB, replicated it with Raft, and added transactions?"
The data model is simple: sorted byte-string keys mapping to byte-string values. The key space gets chopped into regions, each about 96MB by default. Every region is a contiguous key range with its own 3-replica Raft group. Region 1 might cover keys a through m, Region 2 covers m through z, and so on. When a region grows past the threshold, it splits. When traffic drops and neighboring regions shrink, they merge.
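To make the region model concrete, here's a minimal sketch, assuming the 96MB threshold described above. The `Region` type and `maybe_split` helper are invented for illustration; they are not TiKV's actual structures:

```rust
// Illustrative only: a region is a contiguous key range with its own Raft group.
const REGION_SPLIT_SIZE: u64 = 96 * 1024 * 1024; // 96 MB default threshold

struct Region {
    id: u64,
    start_key: Vec<u8>, // inclusive; empty means "start of key space"
    end_key: Vec<u8>,   // exclusive; empty means "end of key space"
    approximate_size: u64,
}

impl Region {
    /// Does this region own the given key?
    fn contains(&self, key: &[u8]) -> bool {
        key >= self.start_key.as_slice()
            && (self.end_key.is_empty() || key < self.end_key.as_slice())
    }

    /// Split into two regions at `split_key` once the size threshold is exceeded.
    fn maybe_split(self, split_key: Vec<u8>, next_id: u64) -> Vec<Region> {
        if self.approximate_size <= REGION_SPLIT_SIZE {
            return vec![self];
        }
        vec![
            Region { id: self.id, start_key: self.start_key, end_key: split_key.clone(),
                     approximate_size: self.approximate_size / 2 },
            Region { id: next_id, start_key: split_key, end_key: self.end_key,
                     approximate_size: self.approximate_size / 2 },
        ]
    }
}
```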
Write path: Client sends a write to the region's Raft leader. The leader appends to its Raft log (stored in a dedicated RocksDB instance), replicates to followers, waits for quorum, then applies the mutation to the data RocksDB. Two RocksDB instances per node sounds weird, but it makes sense. The Raft log is append-only sequential writes, and data mutations are applied in order. Keeping them separate means compaction on one doesn't interfere with the other.
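A toy sketch of that write path, with the two RocksDB instances stood in by plain in-memory maps. Names and the entry encoding are invented; the real path goes through raft-rs, batches writes, and persists both stores to disk:

```rust
use std::collections::{BTreeMap, HashMap};

/// Stand-ins for the two RocksDB instances on one node.
struct Node {
    raft_log: BTreeMap<u64, Vec<u8>>, // append-only log, keyed by log index ("raftdb")
    data: HashMap<Vec<u8>, Vec<u8>>,  // applied key-value state ("kvdb")
    last_index: u64,
    applied_index: u64,
}

impl Node {
    /// Leader path, step 1: append the mutation to the Raft log.
    /// Followers replicate this entry before anything touches the data store.
    fn propose(&mut self, entry: Vec<u8>) -> u64 {
        self.last_index += 1;
        self.raft_log.insert(self.last_index, entry);
        self.last_index
    }

    /// Step 2: once a quorum has acknowledged entries up to `commit_index`,
    /// apply them in order to the data store.
    fn apply_committed(&mut self, commit_index: u64) {
        while self.applied_index < commit_index {
            self.applied_index += 1;
            let entry = self.raft_log[&self.applied_index].clone();
            // Toy encoding: "key=value" in one byte string.
            if let Some(pos) = entry.iter().position(|&b| b == b'=') {
                let (k, v) = entry.split_at(pos);
                self.data.insert(k.to_vec(), v[1..].to_vec());
            }
        }
    }
}
```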
Read path: Client reads from the region leader at a specific timestamp (provided by PD's timestamp oracle). The leader serves directly from its local RocksDB. No Raft round-trip needed for reads because the leader already has the latest committed state. For stale reads (slightly old data is acceptable), any replica can serve, which spreads read load.
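The timestamped read works because every value is stored as a version. A simplified MVCC sketch, assuming versions keyed by (key, commit timestamp); TiKV actually spreads this across separate default, lock, and write column families:

```rust
use std::collections::BTreeMap;

/// Versioned store: (user key, commit timestamp) -> value.
struct MvccStore {
    versions: BTreeMap<(Vec<u8>, u64), Vec<u8>>,
}

impl MvccStore {
    /// Return the newest version of `key` committed at or before `read_ts`.
    /// Reading at an older timestamp is exactly how stale/follower reads work.
    fn get(&self, key: &[u8], read_ts: u64) -> Option<&Vec<u8>> {
        self.versions
            .range((key.to_vec(), 0)..=(key.to_vec(), read_ts))
            .next_back()
            .map(|(_, value)| value)
    }
}
```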
Transactions follow a Percolator-style two-phase commit. Phase one: the client writes "prewrite" locks on all keys involved, picking one as the primary lock. Phase two: the client commits the primary lock by writing the commit timestamp, then asynchronously cleans up secondary locks. If the client crashes mid-transaction, other clients can detect orphaned locks and resolve them. The timestamp oracle in PD provides globally unique, monotonically increasing timestamps that order all transactions.
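A heavily simplified single-process sketch of that Percolator flow, with invented types. The real protocol also tracks lock TTLs, checks write conflicts against the commit history, and cleans up secondary locks asynchronously:

```rust
use std::collections::HashMap;

struct Lock { primary: Vec<u8>, start_ts: u64 }

#[derive(Default)]
struct Store {
    data: HashMap<Vec<u8>, Vec<(u64, Vec<u8>)>>, // key -> committed versions (commit_ts, value)
    locks: HashMap<Vec<u8>, Lock>,
    writes: HashMap<Vec<u8>, (u64, Vec<u8>)>,    // staged prewrites: key -> (start_ts, value)
}

impl Store {
    /// Phase 1: lock every key in the transaction; the first key acts as the primary lock.
    fn prewrite(&mut self, start_ts: u64, mutations: &[(Vec<u8>, Vec<u8>)]) -> Result<(), String> {
        let primary = mutations[0].0.clone();
        for (key, value) in mutations {
            if self.locks.contains_key(key) {
                return Err(format!("key {:?} is locked by another txn", key));
            }
            self.locks.insert(key.clone(), Lock { primary: primary.clone(), start_ts });
            self.writes.insert(key.clone(), (start_ts, value.clone()));
        }
        Ok(())
    }

    /// Phase 2: committing the primary key makes the whole transaction durable;
    /// secondary keys are committed (or rolled forward by other readers) lazily.
    fn commit(&mut self, key: &[u8], start_ts: u64, commit_ts: u64) {
        let owns_lock = self.locks.get(key).map_or(false, |l| l.start_ts == start_ts);
        if owns_lock {
            let (_, value) = self.writes.remove(key).expect("prewritten value");
            self.data.entry(key.to_vec()).or_default().push((commit_ts, value));
            self.locks.remove(key);
        }
    }
}
```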
The whole thing is written in Rust, which matters for a system that manages its own memory and does a lot of concurrent I/O. No GC pauses. Predictable tail latency.
Production Architecture
A minimal production cluster needs:
- 3 PD nodes on dedicated machines (lightweight but latency-sensitive)
- 3+ TiKV nodes (the actual data storage, scale as needed)
- Monitoring stack (Prometheus + Grafana, with TiKV's built-in metrics exporters)
PD is the scheduling brain. It allocates timestamps for transactions, tracks which TiKV node holds which region, decides when to split or merge regions, and rebalances data when nodes are added or removed. PD itself is a small Raft group. Losing PD quorum halts all write transactions because nobody can get timestamps.
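The timestamp oracle itself is conceptually small: a physical wall-clock component plus a logical counter, packed into one 64-bit value and handed out by the PD leader. A toy sketch of the idea (invented type; the real TSO persists a watermark and allocates timestamps in batches so restarts and failovers never go backwards):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// TSO-style timestamp: physical milliseconds shifted left, logical counter in the low bits.
struct TimestampOracle {
    last_physical_ms: u64,
    logical: u64,
}

impl TimestampOracle {
    fn next(&mut self) -> u64 {
        let now_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis() as u64;
        if now_ms > self.last_physical_ms {
            self.last_physical_ms = now_ms;
            self.logical = 0;
        } else {
            // Clock hasn't advanced (or went backwards): bump the logical counter
            // so handed-out timestamps stay strictly monotonic.
            self.logical += 1;
        }
        (self.last_physical_ms << 18) | self.logical
    }
}
```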
For cross-AZ deployments, place Raft replicas across 3 AZs. Writes pay the cross-AZ latency (one Raft round-trip), typically adding 5-15ms. Leader reads stay within a single AZ only when the client happens to be in the same AZ as the leader; otherwise they pay a cross-AZ hop too.
Scaling: Add TiKV nodes and PD automatically migrates regions to balance the load. Removing nodes works the same way in reverse: PD drains regions off the departing node before shutdown. No manual data movement.
| Cluster Size | TiKV Nodes | PD Nodes | Regions | Typical Use |
|---|---|---|---|---|
| Small | 3 | 3 | Hundreds | Dev, staging, small prod |
| Medium | 5-20 | 3 | Tens of thousands | Mid-size production |
| Large | 20-100 | 5 | Hundreds of thousands | Heavy production, multi-TB |
| Very Large | 100+ | 5 | Millions | TiDB at serious scale |
Decision Criteria
| Criteria | TiKV | etcd | FoundationDB | CockroachDB |
|---|---|---|---|---|
| Data model | Ordered KV | Flat KV | Ordered KV | SQL tables |
| Max data size | Petabytes | 8 GB | Petabytes | Terabytes |
| Transactions | Snapshot isolation, cross-region | Single-key mini-txn | Serializable, cross-shard | Serializable SQL |
| Consensus | Raft per region | Raft (single group) | Paxos + TLogs | Raft per range |
| Range scans | Native, efficient | Prefix-based | Native | SQL queries |
| Write throughput | ~100K-200K/sec | ~10K-30K/sec | Millions/sec | ~50K-100K/sec |
| Language | Rust | Go | C++ / Flow | Go |
| Primary use | TiDB backend, general KV | K8s, coordination | Infrastructure metadata | Distributed SQL |
When TiKV over FoundationDB: Snapshot isolation is sufficient for most workloads, Raft-per-region gives a simpler operational model, and the Rust ecosystem is a plus. Serializable isolation and simulation testing (FoundationDB's strengths) aren't needed.
When TiKV over etcd: Data exceeds 8GB. Real distributed transactions are needed. The system must scale beyond a single Raft group.
When TiKV over CockroachDB: A raw KV API is preferred over SQL. The use case is a custom storage layer where SQL overhead isn't justified.
Capacity Planning
Write throughput: A 3-node cluster on NVMe SSDs handles ~50K-80K writes/sec. Scale roughly linearly by adding nodes. 10 nodes reaches 200K+ writes/sec. The bottleneck is usually Raft replication latency, not RocksDB throughput.
Read throughput: Leader reads hit local RocksDB. Expect ~100K point reads/sec per TiKV node with hot data in the block cache. Range scans depend on scan width, but 10K scans/sec of 100-key ranges is reasonable per node.
Storage: Each TiKV node uses RocksDB with default compression (no compression on the top levels, LZ4 on the middle levels, ZSTD on the bottom levels). Plan for ~1.5x raw data size to account for RocksDB space amplification. A node with 2TB NVMe can hold roughly 1.3TB of logical data.
Latency: Point writes: p50 ~3ms, p99 ~15ms (single-AZ). Point reads: p50 ~1ms, p99 ~5ms (from leader, data in cache). Cross-AZ writes add 5-15ms for Raft replication.
Region count: Keep below 20K-30K regions per TiKV node. Each region's Raft group has heartbeat and election overhead. At 50K regions per node, Raft store thread pool saturation becomes a real problem.
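These heuristics compose into a quick back-of-the-envelope sizing pass. The sketch below just encodes the planning numbers from this section (3 replicas, 1.5x space amplification, 96MB regions, ~30K regions per node, 2TB NVMe per node); it is a rough estimate, not a benchmark:

```rust
/// Rough cluster sizing from the heuristics above.
fn size_cluster(logical_data_tb: f64) -> (u64, u64) {
    let replication = 3.0;                              // 3 Raft replicas
    let space_amplification = 1.5;                      // RocksDB planning factor
    let usable_per_node_tb = 2.0 / space_amplification; // ~1.3 TB logical per 2 TB NVMe node
    let region_size_gb = 96.0 / 1024.0;                 // 96 MB regions
    let max_regions_per_node = 30_000.0;

    // Nodes needed for storage (logical data is stored 3x across replicas).
    let nodes_for_storage = (logical_data_tb * replication / usable_per_node_tb).ceil();

    // Regions implied by the data volume, and the node count needed to stay
    // under the per-node region budget (replicas counted as regions too).
    let regions = (logical_data_tb * 1024.0 / region_size_gb).ceil();
    let nodes_for_regions = (regions * replication / max_regions_per_node).ceil();

    (nodes_for_storage.max(nodes_for_regions) as u64, regions as u64)
}

fn main() {
    // e.g. 10 TB of logical data
    let (nodes, regions) = size_cluster(10.0);
    println!("~{} TiKV nodes, ~{} regions", nodes, regions);
}
```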
Failure Scenarios
Scenario 1: TiKV Node Failure
Trigger: A TiKV node crashes or becomes unreachable. Could be hardware failure, OOM, or network partition.
Impact: All regions where this node was the Raft leader lose their leader. Followers in those regions detect missing heartbeats and trigger elections. New leaders get elected on the surviving nodes within 10-30 seconds (default election timeout). During the election window, writes to those regions stall. Reads from followers are stale until the new leader is confirmed.
Detection: PD detects the node is down via heartbeat timeout. Grafana dashboards show region leader count dropping on the failed node and rising on others. Monitor tikv_raftstore_region_count and tikv_server_report_failure_msg_total.
Recovery: Automatic. PD schedules new replicas for regions that lost a member, restoring the 3-replica count. The new replica syncs via Raft snapshot from the leader. Full recovery time depends on data volume on the failed node, typically minutes to hours.
Scenario 2: Hot Region Causing Latency Spikes
Trigger: A single region receives disproportionate traffic. Common causes: auto-increment primary keys, timestamp-ordered inserts, or a single popular key prefix. All writes to that key range funnel through one Raft leader.
Impact: The hot region's leader node sees CPU and I/O saturation. Latency for all regions on that node increases, not just the hot one. p99 write latency jumps from 15ms to 200ms+. Other TiKV nodes sit idle.
Detection: PD's hot region scheduler detects traffic imbalance. Check tikv_grpc_msg_duration_seconds by region. Grafana's TiKV dashboard has a "Hot Read/Write Region" panel.
Recovery: Short-term: PD can split the hot region and scatter the halves across different nodes. Medium-term: change the key design. Hash-prefix keys (sha256(id)[0:4] + id) spread sequential inserts across regions. For read-hot keys, enable follower reads to distribute load across replicas instead of hammering the leader.
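A sketch of the hash-prefix trick using the sha2 crate, matching the 4-byte prefix from the example above. The prefix length is a design choice, and note the trade-off: ordered range scans over the original id space no longer work once keys are prefixed.

```rust
use sha2::{Digest, Sha256};

/// Prepend 4 bytes of sha256(id) so sequential ids land in different regions
/// instead of funneling every insert through one Raft leader.
fn hash_prefixed_key(id: &str) -> Vec<u8> {
    let digest = Sha256::digest(id.as_bytes());
    let mut key = Vec::with_capacity(4 + id.len());
    key.extend_from_slice(&digest[..4]);
    key.extend_from_slice(id.as_bytes());
    key
}

// "order-00000001" and "order-00000002" now start with unrelated prefixes,
// so timestamp- or sequence-ordered inserts scatter across regions.
```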
Scenario 3: PD Quorum Loss
Trigger: Two of three PD nodes go down simultaneously (rack failure, bad deployment, network partition isolating them).
Impact: No new timestamps can be allocated. All distributed transactions fail because they can't get a start or commit timestamp. TiKV nodes continue serving reads at their last known timestamp, but no new writes can commit. Existing data is safe, but the cluster is effectively read-only.
Detection: All transaction-based operations return timestamp errors. PD leader metrics disappear. TiKV logs show pd is not ready or timestamp allocation failures.
Recovery: Restore PD quorum by bringing at least one failed PD node back online. If machines are permanently lost, use pd-recover tool to bootstrap a new PD cluster from the surviving member's data. PD stores very little state (cluster topology and timestamp watermark), so recovery is fast once quorum is restored. This is why PD nodes should be on separate failure domains, same as etcd coordinators.
Pros
- • Distributed ACID transactions with snapshot isolation across shards
- • Sorted keys with efficient range scans, not just point lookups
- • Raft consensus per region gives strong consistency without a single leader bottleneck
- • Automatic region splitting and merging as data grows or shrinks
- • Coprocessor pushes computation to storage nodes, cutting network round-trips
- • Built on RocksDB. Battle-tested LSM-tree performance underneath
Cons
- • Operational complexity. You're running Placement Driver (PD) + TiKV nodes + monitoring
- • Snapshot isolation, not serializable. Phantom reads are possible in edge cases
- • Write latency depends on Raft replication. Cross-AZ deployments add 5-15ms
- • Region splitting can cause brief latency spikes during transitions
- • Smaller ecosystem than etcd or Redis. Fewer client libraries and community resources
When to use
- • Need ordered key-value storage with distributed transactions
- • Outgrowing a single-node store and need horizontal scaling
- • Running TiDB and want its native storage engine
- • Workload needs both fast point reads and efficient range scans
When NOT to use
- • Simple coordination or config storage (etcd is simpler and good enough)
- • Pure cache workloads (Redis is faster and more appropriate)
- • Data fits on one machine (RocksDB embedded avoids all the distributed overhead)
- • Need serializable isolation (FoundationDB is a better fit)
- • Team doesn't have distributed systems operational experience
Key Points
- • Data is split into regions (default 96MB each). Each region is a contiguous key range with its own Raft group (3 replicas). This provides strong consistency per region without a global leader bottleneck.
- • Transactions use a two-phase commit protocol with a timestamp oracle (provided by PD). Reads snapshot at a specific timestamp. Writes lock keys, then commit atomically across regions. The isolation level is snapshot isolation, not serializable.
- • The Placement Driver (PD) is the cluster brain. It hands out timestamps, tracks region locations, schedules region splits/merges, and balances load across nodes. PD itself runs as a Raft group (3 or 5 nodes) for high availability.
- • Each TiKV node runs RocksDB underneath. Two instances, actually: one for data, one for the Raft log. Writes hit the Raft log RocksDB first, then get applied to the data RocksDB after Raft consensus.
- • The coprocessor framework pushes filtering and aggregation down to TiKV nodes. Instead of pulling all rows to the client and filtering there, the storage node does the work and sends back only matching results (see the sketch after this list). This is how TiDB pushes SQL predicates into the storage layer.
- • Region splitting happens automatically when a region exceeds the size threshold. The TiKV node's split checker detects the oversized region and picks a split key, PD allocates the new region IDs, and the region leader performs the split. Brief write pause (~50ms) on the splitting region during transition.
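A conceptual sketch of that pushdown, with invented types rather than the real coprocessor DAG protocol:

```rust
/// One row of a region's local data on a TiKV node.
struct Row { key: Vec<u8>, value: i64 }

/// Runs on the storage node: the predicate travels to the data,
/// and only matching rows cross the network back to the client.
fn coprocessor_scan(rows: &[Row], min_value: i64) -> Vec<&Row> {
    rows.iter().filter(|r| r.value >= min_value).collect()
}

// Without pushdown, the client pulls every row in the range and filters locally;
// with pushdown, the filter runs next to the data and the result set stays small.
```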
Common Mistakes
- ✗ Running PD, TiKV, and TiDB all on the same machines. PD needs stable latency for timestamp allocation. Noisy neighbors kill it. Give PD dedicated nodes.
- ✗ Not monitoring region count per node. Each region is its own Raft group with its own heartbeats, and that overhead adds up. Keep it under 20K-30K regions per node.
- ✗ Ignoring hot region detection. One region getting 80% of the traffic is common with sequential inserts (auto-increment keys, timestamps). Use scatter or hash-prefixed keys to spread the load.
- ✗ Setting the Raft store thread pool too small. The default is fine for small clusters, but at 50+ TiKV nodes with heavy write throughput, you'll see Raft apply lag.
- ✗ Forgetting that snapshot isolation allows write skew. If two transactions read overlapping data and write to different keys based on what they read, both can commit (see the sketch after this list). This matters for financial workloads.
- ✗ Skipping TiKV's built-in encryption at rest for compliance workloads. It's there, but not enabled by default.
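Here is the write-skew sketch referenced above, using an invented "someone must be on call" invariant: both transactions read the same snapshot, write disjoint keys, and both commit, leaving the invariant broken in a way serializable isolation would have prevented.

```rust
use std::collections::HashMap;

fn main() {
    // Snapshot both transactions read from (same start timestamp).
    let snapshot: HashMap<&str, bool> =
        HashMap::from([("on_call:alice", true), ("on_call:bob", true)]);

    // Txn A: "bob is still on call, so alice can go off call".
    let txn_a_write = snapshot["on_call:bob"].then_some(("on_call:alice", false));
    // Txn B: "alice is still on call, so bob can go off call".
    let txn_b_write = snapshot["on_call:alice"].then_some(("on_call:bob", false));

    // The write sets are disjoint, so neither prewrite sees the other's lock
    // and snapshot isolation lets both transactions commit.
    let mut committed = snapshot.clone();
    for write in [txn_a_write, txn_b_write].into_iter().flatten() {
        committed.insert(write.0, write.1);
    }

    // Invariant "someone is always on call" is now broken.
    let someone_on_call = committed["on_call:alice"] || committed["on_call:bob"];
    assert!(!someone_on_call);
    println!("both committed; nobody is on call -> write skew");
}
```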