TiDB
MySQL-compatible distributed database that actually handles both OLTP and OLAP
Architecture
Most teams hit the same wall eventually: the MySQL instance is maxed out on writes, the analytics pipeline is a fragile mess of ETL jobs copying data into a warehouse, and half the reports are hours stale. TiDB exists to solve that specific pain. It is a distributed database that speaks the MySQL protocol and can serve both transactional and analytical queries from the same cluster. That sounds like marketing, but it actually works in practice, with tradeoffs worth understanding before committing.
PingCAP built TiDB in the open. Companies like JD.com, Bank of Beijing, and BookMyShow run it in production at serious scale. But the real reason to pay attention is the architecture. It separates concerns in a way that allows each layer to scale independently.
How It Works Internally
Three component types talk to each other over gRPC.
The TiDB server is a stateless SQL layer. It parses MySQL-compatible SQL, runs a cost-based optimizer with statistics, and coordinates distributed execution across TiKV. Because it holds no state, any number of TiDB servers can sit behind a load balancer. More instances, more SQL throughput. Simple.
TiKV is the storage layer. It encodes every SQL row as a key-value pair (table prefix + row ID maps to the encoded row), preserving primary key order. The key space gets divided into Regions of roughly 96 MB each. Every Region is replicated across three TiKV stores using an independent Raft consensus group. The Raft leader for each Region handles reads and writes, replicating to followers before acknowledging. When a Region gets too big, TiKV splits it automatically and PD rebalances the pieces across stores.
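You can inspect this layout directly from SQL. A minimal sketch, assuming a hypothetical orders table:

```sql
-- Show the Regions backing a table: encoded key ranges (t_<table_id>_r_<row_id>),
-- which TiKV store holds each Raft leader, and approximate sizes.
SHOW TABLE orders REGIONS;

-- Pre-split a fresh table into 16 Regions so early writes don't pile
-- onto a single Region before automatic splitting kicks in.
SPLIT TABLE orders BETWEEN (0) AND (1000000) REGIONS 16;
```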
PD (Placement Driver) is the brain of the cluster, deployed as a 3-node or 5-node Raft group for its own availability. Its most latency-sensitive job is the Timestamp Oracle (TSO), which hands out globally unique, monotonically increasing timestamps. TiDB uses these for MVCC snapshot reads and transaction ordering. PD also tracks where every Region lives, who leads it, and decides when to migrate, split, or merge Regions to keep things balanced.
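TSO timestamps are visible from any session, which is handy when reasoning about snapshots. A quick sketch:

```sql
-- Every transaction gets its start timestamp from PD's TSO.
BEGIN;
-- @@tidb_current_ts holds the transaction's start_ts; TIDB_PARSE_TSO
-- decodes the physical (wall-clock) part of the hybrid timestamp.
SELECT @@tidb_current_ts AS start_ts,
       TIDB_PARSE_TSO(@@tidb_current_ts) AS wall_clock;
ROLLBACK;
```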
Transactions follow a variant of Google's Percolator protocol, which is a two-phase commit over distributed key-value storage. Phase one (prewrite): the transaction writes tentative values to all involved Regions and picks a primary key as the commit lock. Phase two (commit): it writes a commit record for the primary key. Once that lands, all other keys resolve asynchronously. The result is snapshot isolation with a global timestamp ordering, providing ACID transactions across any number of Regions.
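From the client's side all of this is invisible; an ordinary multi-row transaction triggers the full protocol. A sketch with a hypothetical accounts table:

```sql
-- The two rows may live in different Regions (different Raft groups).
BEGIN;                              -- start_ts fetched from PD's TSO
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 9000001;
COMMIT;  -- prewrite locks both keys (one chosen as primary), then a
         -- commit record on the primary key makes the whole txn visible
```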
HTAP Architecture: TiFlash
This is the feature that actually makes TiDB different from CockroachDB or Vitess.
TiFlash maintains columnar replicas of TiKV Regions using the Raft Learner protocol. Raft Learners receive log entries but never vote in leader elections. That detail matters because it keeps TiFlash off the commit quorum path: OLTP writes on TiKV never wait for TiFlash replication.
When a query hits a TiDB server, the optimizer decides where to route it. Point lookups and transactional queries go to TiKV (row store). Aggregations, big scans, and joins over large tables get pushed to TiFlash (columnar store). It can even split a single query across both, reading a small dimension table from TiKV and joining it with a huge fact table scan from TiFlash.
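The routing is visible in query plans. A sketch, assuming the hypothetical orders table has a TiFlash replica; look at the task column for cop[tikv] versus batchCop/mpp[tiflash] operators (exact names vary by version):

```sql
-- An aggregation over a large table should be pushed to TiFlash.
EXPLAIN ANALYZE
SELECT status, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM orders
GROUP BY status;

-- To verify engine behavior, a session can be pinned to specific engines:
SET SESSION tidb_isolation_read_engines = 'tiflash,tidb';
```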
TiFlash freshness is typically within seconds of the TiKV leader; the lag grows when the write rate outpaces TiFlash's apply throughput. Reads stay consistent despite the lag: before serving a query at a given timestamp, TiFlash waits until its replica has applied the Raft log up to that timestamp (a learner read), so results are snapshot-consistent rather than stale, at the cost of extra query latency while it catches up. In practice, for dashboards and reporting, a few seconds of apply lag is rarely an issue.
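Creating a columnar replica and watching its backfill is a two-statement affair; a sketch against the hypothetical orders table:

```sql
ALTER TABLE orders SET TIFLASH REPLICA 2;

-- PROGRESS climbs from 0 to 1 as TiFlash backfills; AVAILABLE flips to 1
-- once the optimizer is allowed to route queries to the replica.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'orders';
```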
Production Architecture
Run TiDB on Kubernetes with TiDB Operator for anything production-grade. Here is the minimum topology that avoids regret:
- 3 PD instances across 3 availability zones (lightweight: 4 vCPU, 8 GB RAM each)
- 3+ TiKV instances across 3 AZs (storage-heavy: 16 vCPU, 64 GB RAM, NVMe SSD with 2-4 TB each)
- 2+ TiDB instances behind a load balancer (compute-heavy: 16 vCPU, 32 GB RAM each)
- 2+ TiFlash instances for HTAP workloads (memory-heavy: 16 vCPU, 128 GB RAM, large SSD)
Placement Rules pin specific databases or tables to TiKV stores in particular geographic regions, which is how data residency compliance gets handled. Tune coprocessor.region-split-size (default 96 MB) based on the workload. Larger Regions (144-256 MB) reduce Raft group count and metadata overhead for big datasets. Smaller Regions improve parallel query execution for analytical queries.
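Placement is expressed in SQL since TiDB 6.0. A minimal sketch; the region labels here are hypothetical and must match the location labels configured on your TiKV stores:

```sql
-- Keep an EU tenant's data on EU stores for residency compliance.
CREATE PLACEMENT POLICY eu_only
  PRIMARY_REGION="eu-west-1"
  REGIONS="eu-west-1,eu-central-1";

ALTER TABLE eu_customers PLACEMENT POLICY = eu_only;
```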
For monitoring, TiDB ships with a built-in Dashboard (on PD's HTTP port) that provides SQL statement analysis, slow query logs, hot Region visualization, and cluster diagnostics. Pipe metrics into Prometheus and Grafana using the standard TiDB monitoring stack. Set up those Grafana dashboards on day one, not day thirty.
Capacity Planning
TiKV throughput depends on Region count, Raft consensus latency, and RocksDB performance. On a 16-vCPU node with NVMe, expect roughly 15,000-30,000 point reads/sec and 5,000-10,000 writes/sec per TiKV instance. Writes are bounded by Raft consensus: within a single AZ, plan for 3-8 ms per write. Across AZs in the same cloud region, 5-15 ms.
| Component | Sizing Guideline |
|---|---|
| TiKV storage | Raw data x 3 (replication factor) x 1.5 (compaction + MVCC overhead) |
| TiKV memory | 64 GB per node. RocksDB block cache (40%), Raft engine (10%), system (50%) |
| TiFlash storage | Raw data x 0.3-0.5 (columnar compression) x replica count |
| TiFlash memory | 128 GB per node. Columnar read cache dominates. |
| PD | 3 nodes, 8 GB RAM each. Stores only metadata (Region locations, TSO state) |
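To make the table concrete, here is the arithmetic for a hypothetical 2 TB raw dataset with two TiFlash replicas, runnable as a plain query:

```sql
SELECT
  2 * 3 * 1.5   AS tikv_storage_tb,    -- 2 TB raw x 3 replicas x 1.5 overhead = 9 TB
  2 * 0.4 * 2   AS tiflash_storage_tb; -- 2 TB raw x ~0.4 compression x 2 replicas = 1.6 TB
```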
Keep an eye on tikv_raftstore_region_count per store (stay below 40,000 for good performance), pd_tso_wait_duration (should be under 5 ms, spikes mean PD is the bottleneck), and tiflash_raft_wait_index_duration for TiFlash replication lag.
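Per-store counts are also queryable from SQL, which makes a quick skew check possible without opening Grafana:

```sql
-- Region and leader counts per TiKV store; a store with far more
-- leaders than its peers indicates a scheduling or hotspot problem.
SELECT STORE_ID, ADDRESS, REGION_COUNT, LEADER_COUNT, CAPACITY, AVAILABLE
FROM information_schema.tikv_store_status;
```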
Failure Scenarios
Scenario 1: The classic auto-increment hotspot. Someone creates an orders table with an auto-increment order_id as the primary key. Every single new order writes to the same Region (the one holding the highest key range), and that Region's Raft leader chokes. One TiKV store will be pinned at 90% CPU while the others sit at 20%. Writes that normally take 5-10 ms now take 50-100 ms. PD tries to help by splitting the Region, but it does not matter because the newest Region still gets all the inserts.
Detection: PD Dashboard's Hot Regions panel shows one Region with 10x the write QPS of anything else. tikv_grpc_msg_duration_seconds{type="kv_prewrite"} p99 spikes for that one store.
Fix: change the primary key to BIGINT AUTO_RANDOM, which allocates IDs with a random shard prefix that spreads inserts across multiple Regions. For existing tables without a clustered integer primary key (rows keyed by the internal _tidb_rowid), setting SHARD_ROW_ID_BITS = 4 distributes writes across up to 16 Regions; it has no effect when the integer primary key itself is the row key. Going forward, watch approximate_keys and approximate_size per Region in PD to catch skew before it becomes a problem.
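A sketch of both fixes; table and column names are hypothetical:

```sql
-- New tables: AUTO_RANDOM stores random shard bits in the high bits of
-- the id, so inserts land on many Regions instead of one hot tail.
CREATE TABLE orders (
  order_id    BIGINT AUTO_RANDOM PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  amount      DECIMAL(12,2) NOT NULL
);

-- Existing tables keyed by the internal _tidb_rowid (no clustered
-- integer primary key): 4 shard bits spread writes across up to 16 Regions.
ALTER TABLE legacy_orders SHARD_ROW_ID_BITS = 4;
```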
Scenario 2: TiFlash falls behind during a traffic spike. The finance team has a dashboard querying TiFlash for real-time transaction summaries. A marketing campaign launches and OLTP writes triple from 10,000 to 30,000/sec. TiFlash's Raft Learner apply queue backs up because columnar compaction cannot keep pace with the incoming log entries. The tiflash_raft_wait_index_duration metric climbs from 2 seconds to 45 seconds. Now the dashboard shows totals that are 45 seconds stale, and an analyst opens a P1 ticket about "missing transactions."
Detection: alert on tiflash_raft_wait_index_duration over 10 seconds. Track the gap between TiKV's latest applied index and TiFlash's applied index per Region.
Fix: add more TiFlash nodes so each one handles fewer Regions. Bump profiles.default.max_threads to let TiFlash use more CPU for apply and compaction. For more headroom, raise the table's TiFlash replica count (ALTER TABLE ... SET TIFLASH REPLICA n) so read load spreads across more copies. For queries where a few seconds of staleness is fine (most dashboards), set the tidb_read_staleness session variable so TiDB reads a slightly older snapshot that TiFlash has already applied. That reduces read latency during lag spikes and keeps analysts from panicking.
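The staleness knob is one session variable; a sketch:

```sql
-- Let TiDB pick a read timestamp up to 5 seconds old, so the query can be
-- served from whatever snapshot TiFlash has already applied.
SET @@tidb_read_staleness = '-5';

SELECT status, COUNT(*) FROM orders GROUP BY status;
```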
Pros
- Speaks the MySQL wire protocol, so most MySQL drivers, ORMs, and tools just work
- TiFlash columnar replicas let you run analytics without hurting OLTP performance
- Scales horizontally with automatic Region splitting and rebalancing
- Strong consistency through Raft consensus on every data Region
- Open source with a real community and commercial backing from PingCAP
Cons
- Write latency is higher than single-node MySQL because of Raft consensus round-trips
- Not every MySQL feature is supported (stored procedures and triggers are limited)
- Three component types to operate (TiDB, TiKV, PD), which adds real operational overhead
- Cross-Region transactions pay for multi-Raft-group coordination
- TiFlash replicas add storage cost on top of the three row-store copies, only partially offset by columnar compression
When to use
- You have outgrown a single MySQL instance and need horizontal scale with SQL
- You want real-time analytics on live transactional data, no ETL
- You need distributed ACID transactions with MySQL compatibility
- You are running separate OLTP and OLAP systems and want to merge them
When NOT to use
- Your workload fits comfortably on a single MySQL or PostgreSQL instance
- You depend heavily on stored procedures and triggers
- You need sub-millisecond latency where single-node databases are just faster
- Your team has never operated a distributed database cluster
Key Points
- Architecture splits into three layers: stateless SQL engines (TiDB), Raft-replicated key-value storage (TiKV), and a metadata/scheduling service (PD). TiDB servers parse SQL and coordinate execution, TiKV stores data in Regions, PD hands out timestamps and balances load.
- Data lives in ~96 MB Regions, each replicated by its own Raft group (default factor of 3). A 1 TB database runs roughly 10,000 independent Raft groups in parallel.
- TiFlash enables real-time HTAP by keeping columnar copies of TiKV Regions via the Raft Learner protocol. The optimizer routes analytical queries to TiFlash and point queries to TiKV automatically.
- Distributed ACID transactions use a Percolator-style two-phase commit with a Timestamp Oracle (TSO) in PD for globally ordered timestamps. This provides snapshot isolation across all TiKV nodes.
- PD (Placement Driver) is the cluster brain. It allocates monotonically increasing timestamps, tracks Region locations, schedules splits and merges, and rebalances leaders across TiKV stores.
- TiKV runs RocksDB under the hood. Each TiKV instance actually has two RocksDB instances (one for the Raft log, one for state machine data), which provides LSM-tree write optimization.
Common Mistakes
- ✗ Using auto-increment primary keys. Sequential IDs concentrate all writes on a single Region's Raft leader, creating a hotspot. Use AUTO_RANDOM, SHARD_ROW_ID_BITS, or random UUIDs to spread inserts across Regions.
- ✗ Forgetting to set TiFlash replicas for analytical tables. Without TiFlash, OLAP queries hit TiKV's row store, triggering full table scans that compete with the OLTP workload. Run ALTER TABLE t SET TIFLASH REPLICA 2.
- ✗ Ignoring Region hotspots. A single hot Region is capped at one Raft leader's throughput. Watch the PD Dashboard's Hot Regions panel and split hot Regions manually or tune Region size.
- ✗ Running TiDB, TiKV, and PD on the same machines. Resource contention between the SQL layer and storage layer degrades both. Give each component its own dedicated instances.
- ✗ Not tuning GC life time. With the default 10-minute window, MVCC versions older than 10 minutes are collected; long-running transactions or analytical queries reading older snapshots either fail or hold back the GC safepoint and cause storage bloat (see the sketch below).
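For the GC window specifically, the fix is one variable plus a status check; a sketch:

```sql
-- Retain MVCC versions for 30 minutes so longer analytical reads survive.
SET GLOBAL tidb_gc_life_time = '30m';

-- Verify the effective window and the last GC run.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';
```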