TiKV
Distributed key-value store that delivers transactions without giving up range scans
How It Works Internally
TiKV came out of PingCAP as the storage layer for TiDB, their distributed MySQL-compatible database. But it stands on its own as a general-purpose distributed key-value store. Think of it as "what if someone took RocksDB, replicated it with Raft, and added transactions?"
The data model is simple: sorted byte-string keys mapping to byte-string values. The key space gets chopped into regions, each about 96MB by default. Every region is a contiguous key range with its own 3-replica Raft group. Region 1 might cover keys a through m, Region 2 covers m through z, and so on. When a region grows past the threshold, it splits. When traffic drops and neighboring regions shrink, they merge.
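To make the region model concrete, here's a minimal sketch, assuming the 96MB threshold described above. The `Region` type and `maybe_split` helper are invented for illustration; they are not TiKV's actual structures:

```rust
// Illustrative only: a region is a contiguous key range with its own Raft group.
const REGION_SPLIT_SIZE: u64 = 96 * 1024 * 1024; // 96 MB default threshold

struct Region {
    id: u64,
    start_key: Vec<u8>, // inclusive; empty means "start of key space"
    end_key: Vec<u8>,   // exclusive; empty means "end of key space"
    approximate_size: u64,
}

impl Region {
    /// Does this region own the given key?
    fn contains(&self, key: &[u8]) -> bool {
        key >= self.start_key.as_slice()
            && (self.end_key.is_empty() || key < self.end_key.as_slice())
    }

    /// Split into two regions at `split_key` once the size threshold is exceeded.
    fn maybe_split(self, split_key: Vec<u8>, next_id: u64) -> Vec<Region> {
        if self.approximate_size <= REGION_SPLIT_SIZE {
            return vec![self];
        }
        vec![
            Region { id: self.id, start_key: self.start_key, end_key: split_key.clone(),
                     approximate_size: self.approximate_size / 2 },
            Region { id: next_id, start_key: split_key, end_key: self.end_key,
                     approximate_size: self.approximate_size / 2 },
        ]
    }
}
```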
Write path: Client sends a write to the region's Raft leader. The leader appends to its Raft log (stored in a dedicated RocksDB instance), replicates to followers, waits for quorum, then applies the mutation to the data RocksDB. Two RocksDB instances per node sounds weird, but it makes sense. The Raft log is append-only sequential writes, and data mutations are applied in order. Keeping them separate means compaction on one doesn't interfere with the other.
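A toy sketch of that write path, with the two RocksDB instances stood in by plain in-memory maps. Names and the entry encoding are invented; the real path goes through raft-rs, batches writes, and persists both stores to disk:

```rust
use std::collections::{BTreeMap, HashMap};

/// Stand-ins for the two RocksDB instances on one node.
struct Node {
    raft_log: BTreeMap<u64, Vec<u8>>, // append-only log, keyed by log index ("raftdb")
    data: HashMap<Vec<u8>, Vec<u8>>,  // applied key-value state ("kvdb")
    last_index: u64,
    applied_index: u64,
}

impl Node {
    /// Leader path, step 1: append the mutation to the Raft log.
    /// Followers replicate this entry before anything touches the data store.
    fn propose(&mut self, entry: Vec<u8>) -> u64 {
        self.last_index += 1;
        self.raft_log.insert(self.last_index, entry);
        self.last_index
    }

    /// Step 2: once a quorum has acknowledged entries up to `commit_index`,
    /// apply them in order to the data store.
    fn apply_committed(&mut self, commit_index: u64) {
        while self.applied_index < commit_index {
            self.applied_index += 1;
            let entry = self.raft_log[&self.applied_index].clone();
            // Toy encoding: "key=value" in one byte string.
            if let Some(pos) = entry.iter().position(|&b| b == b'=') {
                let (k, v) = entry.split_at(pos);
                self.data.insert(k.to_vec(), v[1..].to_vec());
            }
        }
    }
}
```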
Read path: Client reads from the region leader at a specific timestamp (provided by PD's timestamp oracle). The leader serves directly from its local RocksDB. No Raft round-trip needed for reads because the leader already has the latest committed state. For stale reads (slightly old data is acceptable), any replica can serve, which spreads read load.
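The timestamped read works because every value is stored as a version. A simplified MVCC sketch, assuming versions keyed by (key, commit timestamp); TiKV actually spreads this across separate default, lock, and write column families:

```rust
use std::collections::BTreeMap;

/// Versioned store: (user key, commit timestamp) -> value.
struct MvccStore {
    versions: BTreeMap<(Vec<u8>, u64), Vec<u8>>,
}

impl MvccStore {
    /// Return the newest version of `key` committed at or before `read_ts`.
    /// Reading at an older timestamp is exactly how stale/follower reads work.
    fn get(&self, key: &[u8], read_ts: u64) -> Option<&Vec<u8>> {
        self.versions
            .range((key.to_vec(), 0)..=(key.to_vec(), read_ts))
            .next_back()
            .map(|(_, value)| value)
    }
}
```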
Transactions follow a Percolator-style two-phase commit. Phase one: the client writes "prewrite" locks on all keys involved, picking one as the primary lock. Phase two: the client commits the primary lock by writing the commit timestamp, then asynchronously cleans up secondary locks. If the client crashes mid-transaction, other clients can detect orphaned locks and resolve them. The timestamp oracle in PD provides globally unique, monotonically increasing timestamps that order all transactions.
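A heavily simplified single-process sketch of that Percolator flow, with invented types. The real protocol also tracks lock TTLs, checks write conflicts against the commit history, and cleans up secondary locks asynchronously:

```rust
use std::collections::HashMap;

struct Lock { primary: Vec<u8>, start_ts: u64 }

#[derive(Default)]
struct Store {
    data: HashMap<Vec<u8>, Vec<(u64, Vec<u8>)>>, // key -> committed versions (commit_ts, value)
    locks: HashMap<Vec<u8>, Lock>,
    writes: HashMap<Vec<u8>, (u64, Vec<u8>)>,    // staged prewrites: key -> (start_ts, value)
}

impl Store {
    /// Phase 1: lock every key in the transaction; the first key acts as the primary lock.
    fn prewrite(&mut self, start_ts: u64, mutations: &[(Vec<u8>, Vec<u8>)]) -> Result<(), String> {
        let primary = mutations[0].0.clone();
        for (key, value) in mutations {
            if self.locks.contains_key(key) {
                return Err(format!("key {:?} is locked by another txn", key));
            }
            self.locks.insert(key.clone(), Lock { primary: primary.clone(), start_ts });
            self.writes.insert(key.clone(), (start_ts, value.clone()));
        }
        Ok(())
    }

    /// Phase 2: committing the primary key makes the whole transaction durable;
    /// secondary keys are committed (or rolled forward by other readers) lazily.
    fn commit(&mut self, key: &[u8], start_ts: u64, commit_ts: u64) {
        let owns_lock = self.locks.get(key).map_or(false, |l| l.start_ts == start_ts);
        if owns_lock {
            let (_, value) = self.writes.remove(key).expect("prewritten value");
            self.data.entry(key.to_vec()).or_default().push((commit_ts, value));
            self.locks.remove(key);
        }
    }
}
```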
The whole thing is written in Rust, which matters for a system that manages its own memory and does a lot of concurrent I/O. No GC pauses. Predictable tail latency.
Production Architecture
A minimal production cluster needs:
- 3 PD nodes on dedicated machines (lightweight but latency-sensitive)
- 3+ TiKV nodes (the actual data storage, scale as needed)
- Monitoring stack (Prometheus + Grafana, with TiKV's built-in metrics exporters)
PD is the scheduling brain. It allocates timestamps for transactions, tracks which TiKV node holds which region, decides when to split or merge regions, and rebalances data when nodes are added or removed. PD itself is a small Raft group. Losing PD quorum halts all write transactions because nobody can get timestamps.
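The timestamp oracle itself is conceptually small: a physical wall-clock component plus a logical counter, packed into one 64-bit value and handed out by the PD leader. A toy sketch of the idea (invented type; the real TSO persists a watermark and allocates timestamps in batches so restarts and failovers never go backwards):

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// TSO-style timestamp: physical milliseconds shifted left, logical counter in the low bits.
struct TimestampOracle {
    last_physical_ms: u64,
    logical: u64,
}

impl TimestampOracle {
    fn next(&mut self) -> u64 {
        let now_ms = SystemTime::now()
            .duration_since(UNIX_EPOCH)
            .unwrap()
            .as_millis() as u64;
        if now_ms > self.last_physical_ms {
            self.last_physical_ms = now_ms;
            self.logical = 0;
        } else {
            // Clock hasn't advanced (or went backwards): bump the logical counter
            // so handed-out timestamps stay strictly monotonic.
            self.logical += 1;
        }
        (self.last_physical_ms << 18) | self.logical
    }
}
```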
For cross-AZ deployments, place Raft replicas across 3 AZs. Writes pay the cross-AZ latency (one Raft round-trip), typically adding 5-15ms. Leader reads stay within a single AZ only when the client happens to be in the same AZ as the leader; otherwise they pay a cross-AZ hop too.
Scaling: Add TiKV nodes and PD automatically migrates regions to balance the load. Removing nodes works the same way in reverse: PD drains regions off the departing node before shutdown. No manual data movement.
| Cluster Size | TiKV Nodes | PD Nodes | Regions | Typical Use |
|---|---|---|---|---|
| Small | 3 | 3 | Hundreds | Dev, staging, small prod |
| Medium | 5-20 | 3 | Tens of thousands | Mid-size production |
| Large | 20-100 | 5 | Hundreds of thousands | Heavy production, multi-TB |
| Very Large | 100+ | 5 | Millions | TiDB at serious scale |
Decision Criteria
| Criteria | TiKV | etcd | FoundationDB | CockroachDB |
|---|---|---|---|---|
| Data model | Ordered KV | Flat KV | Ordered KV | SQL tables |
| Max data size | Petabytes | 8 GB | Petabytes | Terabytes |
| Transactions | Snapshot isolation, cross-region | Single-key mini-txn | Serializable, cross-shard | Serializable SQL |
| Consensus | Raft per region | Raft (single group) | Paxos + TLogs | Raft per range |
| Range scans | Native, efficient | Prefix-based | Native | SQL queries |
| Write throughput | ~100K-200K/sec | ~10K-30K/sec | Millions/sec | ~50K-100K/sec |
| Language | Rust | Go | C++ / Flow | Go |
| Primary use | TiDB backend, general KV | K8s, coordination | Infrastructure metadata | Distributed SQL |
When TiKV over FoundationDB: Snapshot isolation is sufficient for most workloads, Raft-per-region gives a simpler operational model, and the Rust ecosystem is a plus. Serializable isolation and simulation testing (FoundationDB's strengths) aren't needed.
When TiKV over etcd: Data exceeds 8GB. Real distributed transactions are needed. The system must scale beyond a single Raft group.
When TiKV over CockroachDB: A raw KV API is preferred over SQL. The use case is a custom storage layer where SQL overhead isn't justified.
Capacity Planning
Write throughput: A 3-node cluster on NVMe SSDs handles ~50K-80K writes/sec. Scale roughly linearly by adding nodes. 10 nodes reaches 200K+ writes/sec. The bottleneck is usually Raft replication latency, not RocksDB throughput.
Read throughput: Leader reads hit local RocksDB. Expect ~100K point reads/sec per TiKV node with hot data in the block cache. Range scans depend on scan width, but 10K scans/sec of 100-key ranges is reasonable per node.
Storage: Each TiKV node uses RocksDB with default compression (no compression on the top levels, LZ4 on the middle levels, ZSTD on the bottom levels). Plan for ~1.5x raw data size to account for RocksDB space amplification. A node with 2TB NVMe can hold roughly 1.3TB of logical data.
Latency: Point writes: p50 ~3ms, p99 ~15ms (single-AZ). Point reads: p50 ~1ms, p99 ~5ms (from leader, data in cache). Cross-AZ writes add 5-15ms for Raft replication.
Region count: Keep below 20K-30K regions per TiKV node. Each region's Raft group has heartbeat and election overhead. At 50K regions per node, Raft store thread pool saturation becomes a real problem.
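These heuristics compose into a quick back-of-the-envelope sizing pass. The sketch below just encodes the planning numbers from this section (3 replicas, 1.5x space amplification, 96MB regions, ~30K regions per node, 2TB NVMe per node); it is a rough estimate, not a benchmark:

```rust
/// Rough cluster sizing from the heuristics above.
fn size_cluster(logical_data_tb: f64) -> (u64, u64) {
    let replication = 3.0;                              // 3 Raft replicas
    let space_amplification = 1.5;                      // RocksDB planning factor
    let usable_per_node_tb = 2.0 / space_amplification; // ~1.3 TB logical per 2 TB NVMe node
    let region_size_gb = 96.0 / 1024.0;                 // 96 MB regions
    let max_regions_per_node = 30_000.0;

    // Nodes needed for storage (logical data is stored 3x across replicas).
    let nodes_for_storage = (logical_data_tb * replication / usable_per_node_tb).ceil();

    // Regions implied by the data volume, and the node count needed to stay
    // under the per-node region budget (replicas counted as regions too).
    let regions = (logical_data_tb * 1024.0 / region_size_gb).ceil();
    let nodes_for_regions = (regions * replication / max_regions_per_node).ceil();

    (nodes_for_storage.max(nodes_for_regions) as u64, regions as u64)
}

fn main() {
    // e.g. 10 TB of logical data
    let (nodes, regions) = size_cluster(10.0);
    println!("~{} TiKV nodes, ~{} regions", nodes, regions);
}
```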
Failure Scenarios
Scenario 1: TiKV Node Failure
Trigger: A TiKV node crashes or becomes unreachable. Could be hardware failure, OOM, or network partition.
Impact: All regions where this node was the Raft leader lose their leader. Followers in those regions detect missing heartbeats and trigger elections. New leaders get elected on the surviving nodes within 10-30 seconds (default election timeout). During the election window, writes to those regions stall. Reads from followers are stale until the new leader is confirmed.
Detection: PD detects the node is down via heartbeat timeout. Grafana dashboards show region leader count dropping on the failed node and rising on others. Monitor tikv_raftstore_region_count and tikv_server_report_failure_msg_total.
Recovery: Automatic. PD schedules new replicas for regions that lost a member, restoring the 3-replica count. The new replica syncs via Raft snapshot from the leader. Full recovery time depends on data volume on the failed node, typically minutes to hours.
Scenario 2: Hot Region Causing Latency Spikes
Trigger: A single region receives disproportionate traffic. Common causes: auto-increment primary keys, timestamp-ordered inserts, or a single popular key prefix. All writes to that key range funnel through one Raft leader.
Impact: The hot region's leader node sees CPU and I/O saturation. Latency for all regions on that node increases, not just the hot one. p99 write latency jumps from 15ms to 200ms+. Other TiKV nodes sit idle.
Detection: PD's hot region scheduler detects traffic imbalance. Check tikv_grpc_msg_duration_seconds by region. Grafana's TiKV dashboard has a "Hot Read/Write Region" panel.
Recovery: Short-term: PD can split the hot region and scatter the halves across different nodes. Medium-term: change the key design. Hash-prefix keys (sha256(id)[0:4] + id) spread sequential inserts across regions. For read-hot keys, enable follower reads to distribute load across replicas instead of hammering the leader.
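A sketch of the hash-prefix trick using the sha2 crate, matching the 4-byte prefix from the example above. The prefix length is a design choice, and note the trade-off: ordered range scans over the original id space no longer work once keys are prefixed.

```rust
use sha2::{Digest, Sha256};

/// Prepend 4 bytes of sha256(id) so sequential ids land in different regions
/// instead of funneling every insert through one Raft leader.
fn hash_prefixed_key(id: &str) -> Vec<u8> {
    let digest = Sha256::digest(id.as_bytes());
    let mut key = Vec::with_capacity(4 + id.len());
    key.extend_from_slice(&digest[..4]);
    key.extend_from_slice(id.as_bytes());
    key
}

// "order-00000001" and "order-00000002" now start with unrelated prefixes,
// so timestamp- or sequence-ordered inserts scatter across regions.
```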
Scenario 3: PD Quorum Loss
Trigger: Two of three PD nodes go down simultaneously (rack failure, bad deployment, network partition isolating them).
Impact: No new timestamps can be allocated. All distributed transactions fail because they can't get a start or commit timestamp. TiKV nodes continue serving reads at their last known timestamp, but no new writes can commit. Existing data is safe, but the cluster is effectively read-only.
Detection: All transaction-based operations return timestamp errors. PD leader metrics disappear. TiKV logs show pd is not ready or timestamp allocation failures.
Recovery: Restore PD quorum by bringing at least one failed PD node back online. If machines are permanently lost, use pd-recover tool to bootstrap a new PD cluster from the surviving member's data. PD stores very little state (cluster topology and timestamp watermark), so recovery is fast once quorum is restored. This is why PD nodes should be on separate failure domains, same as etcd coordinators.
Pros
- • Distributed ACID transactions with snapshot isolation across shards
- • Sorted keys with efficient range scans, not just point lookups
- • Raft consensus per region gives strong consistency without a single leader bottleneck
- • Automatic region splitting and merging as data grows or shrinks
- • Coprocessor pushes computation to storage nodes, cutting network round-trips
- • Built on RocksDB. Battle-tested LSM-tree performance underneath
Cons
- • Operational complexity. You're running Placement Driver (PD) + TiKV nodes + monitoring
- • Snapshot isolation, not serializable. Phantom reads are possible in edge cases
- • Write latency depends on Raft replication. Cross-AZ deployments add 5-15ms
- • Region splitting can cause brief latency spikes during transitions
- • Smaller ecosystem than etcd or Redis. Fewer client libraries and community resources
When to use
- • Need ordered key-value storage with distributed transactions
- • Outgrowing a single-node store and need horizontal scaling
- • Running TiDB and want its native storage engine
- • Workload needs both fast point reads and efficient range scans
When NOT to use
- • Simple coordination or config storage (etcd is simpler and good enough)
- • Pure cache workloads (Redis is faster and more appropriate)
- • Data fits on one machine (RocksDB embedded avoids all the distributed overhead)
- • Need serializable isolation (FoundationDB is a better fit)
- • Team doesn't have distributed systems operational experience
Key Points
- • Data is split into regions (default 96MB each). Each region is a contiguous key range with its own Raft group (3 replicas). This provides strong consistency per region without a global leader bottleneck.
- • Transactions use a two-phase commit protocol with a timestamp oracle (provided by PD). Reads snapshot at a specific timestamp. Writes lock keys, then commit atomically across regions. The isolation level is snapshot isolation, not serializable.
- • The Placement Driver (PD) is the cluster brain. It hands out timestamps, tracks region locations, schedules region splits/merges, and balances load across nodes. PD itself runs as a Raft group (3 or 5 nodes) for high availability.
- • Each TiKV node runs RocksDB underneath. Two instances, actually: one for data, one for the Raft log. Writes hit the Raft log RocksDB first, then get applied to the data RocksDB after Raft consensus.
- • The coprocessor framework pushes filtering and aggregation down to TiKV nodes. Instead of pulling all rows to the client and filtering there, the storage node does the work and sends back only matching results (see the sketch after this list). This is how TiDB pushes SQL predicates into the storage layer.
- • Region splitting happens automatically when a region exceeds the size threshold. The TiKV node's split checker detects the oversized region and picks a split key, PD allocates the new region IDs, and the region leader performs the split. Brief write pause (~50ms) on the splitting region during transition.
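A conceptual sketch of that pushdown, with invented types rather than the real coprocessor DAG protocol:

```rust
/// One row of a region's local data on a TiKV node.
struct Row { key: Vec<u8>, value: i64 }

/// Runs on the storage node: the predicate travels to the data,
/// and only matching rows cross the network back to the client.
fn coprocessor_scan(rows: &[Row], min_value: i64) -> Vec<&Row> {
    rows.iter().filter(|r| r.value >= min_value).collect()
}

// Without pushdown, the client pulls every row in the range and filters locally;
// with pushdown, the filter runs next to the data and the result set stays small.
```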
Common Mistakes
- ✗ Running PD, TiKV, and TiDB all on the same machines. PD needs stable latency for timestamp allocation. Noisy neighbors kill it. Give PD dedicated nodes.
- ✗ Not monitoring region count per node. Each region is its own Raft group with its own heartbeats, and that overhead adds up. Keep it under 20K-30K regions per node.
- ✗ Ignoring hot region detection. One region getting 80% of the traffic is common with sequential inserts (auto-increment keys, timestamps). Use scatter or hash-prefixed keys to spread the load.
- ✗ Setting the Raft store thread pool too small. The default is fine for small clusters, but at 50+ TiKV nodes with heavy write throughput, you'll see Raft apply lag.
- ✗ Forgetting that snapshot isolation allows write skew. If two transactions read overlapping data and write to different keys based on what they read, both can commit (see the sketch after this list). This matters for financial workloads.
- ✗ Skipping TiKV's built-in encryption at rest for compliance workloads. It's there, but not enabled by default.
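Here is the write-skew sketch referenced above, using an invented "someone must be on call" invariant: both transactions read the same snapshot, write disjoint keys, and both commit, leaving the invariant broken in a way serializable isolation would have prevented.

```rust
use std::collections::HashMap;

fn main() {
    // Snapshot both transactions read from (same start timestamp).
    let snapshot: HashMap<&str, bool> =
        HashMap::from([("on_call:alice", true), ("on_call:bob", true)]);

    // Txn A: "bob is still on call, so alice can go off call".
    let txn_a_write = snapshot["on_call:bob"].then_some(("on_call:alice", false));
    // Txn B: "alice is still on call, so bob can go off call".
    let txn_b_write = snapshot["on_call:alice"].then_some(("on_call:bob", false));

    // The write sets are disjoint, so neither prewrite sees the other's lock
    // and snapshot isolation lets both transactions commit.
    let mut committed = snapshot.clone();
    for write in [txn_a_write, txn_b_write].into_iter().flatten() {
        committed.insert(write.0, write.1);
    }

    // Invariant "someone is always on call" is now broken.
    let someone_on_call = committed["on_call:alice"] || committed["on_call:bob"];
    assert!(!someone_on_call);
    println!("both committed; nobody is on call -> write skew");
}
```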