TiDB
MySQL-compatible distributed database that actually handles both OLTP and OLAP
Architecture
Most teams hit the same wall eventually: the MySQL instance is maxed out on writes, the analytics pipeline is a fragile mess of ETL jobs copying data into a warehouse, and half the reports are hours stale. TiDB exists to solve that specific pain. It is a distributed database that speaks the MySQL protocol and can serve both transactional and analytical queries from the same cluster. That sounds like marketing, but it actually works in practice, with tradeoffs worth understanding before committing.
PingCAP built TiDB in the open. Companies like JD.com, Bank of Beijing, and BookMyShow run it in production at serious scale. But the real reason to pay attention is the architecture. It separates concerns in a way that allows each layer to scale independently.
How It Works Internally
Three component types talk to each other over gRPC.
The TiDB server is a stateless SQL layer. It parses MySQL-compatible SQL, runs a cost-based optimizer with statistics, and coordinates distributed execution across TiKV. Because it holds no state, any number of TiDB servers can sit behind a load balancer. More instances, more SQL throughput. Simple.
TiKV is the storage layer. It encodes every SQL row as a key-value pair (table prefix + row ID maps to the encoded row), preserving primary key order. The key space gets divided into Regions of roughly 96 MB each. Every Region is replicated across three TiKV stores using an independent Raft consensus group. The Raft leader for each Region handles reads and writes, replicating to followers before acknowledging. When a Region gets too big, TiKV splits it automatically and PD rebalances the pieces across stores.
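You can inspect this layout directly from SQL. A minimal sketch, assuming a hypothetical orders table:

```sql
-- Show the Regions backing a table: encoded key ranges (t_<table_id>_r_<row_id>),
-- which TiKV store holds each Raft leader, and approximate sizes.
SHOW TABLE orders REGIONS;

-- Pre-split a fresh table into 16 Regions so early writes don't pile
-- onto a single Region before automatic splitting kicks in.
SPLIT TABLE orders BETWEEN (0) AND (1000000) REGIONS 16;
```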
PD (Placement Driver) is the brain of the cluster, deployed as a 3-node or 5-node Raft group for its own availability. Its most latency-sensitive job is the Timestamp Oracle (TSO), which hands out globally unique, monotonically increasing timestamps. TiDB uses these for MVCC snapshot reads and transaction ordering. PD also tracks where every Region lives, who leads it, and decides when to migrate, split, or merge Regions to keep things balanced.
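TSO timestamps are visible from any session, which is handy when reasoning about snapshots. A quick sketch:

```sql
-- Every transaction gets its start timestamp from PD's TSO.
BEGIN;
-- @@tidb_current_ts holds the transaction's start_ts; TIDB_PARSE_TSO
-- decodes the physical (wall-clock) part of the hybrid timestamp.
SELECT @@tidb_current_ts AS start_ts,
       TIDB_PARSE_TSO(@@tidb_current_ts) AS wall_clock;
ROLLBACK;
```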
Transactions follow a variant of Google's Percolator protocol, which is a two-phase commit over distributed key-value storage. Phase one (prewrite): the transaction writes tentative values to all involved Regions and picks a primary key as the commit lock. Phase two (commit): it writes a commit record for the primary key. Once that lands, all other keys resolve asynchronously. The result is snapshot isolation with a global timestamp ordering, providing ACID transactions across any number of Regions.
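From the client's side all of this is invisible; an ordinary multi-row transaction triggers the full protocol. A sketch with a hypothetical accounts table:

```sql
-- The two rows may live in different Regions (different Raft groups).
BEGIN;                              -- start_ts fetched from PD's TSO
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 9000001;
COMMIT;  -- prewrite locks both keys (one chosen as primary), then a
         -- commit record on the primary key makes the whole txn visible
```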
HTAP Architecture: TiFlash
This is the feature that actually makes TiDB different from CockroachDB or Vitess.
TiFlash maintains columnar replicas of TiKV Regions using the Raft Learner protocol. Raft Learners receive log entries but never vote in leader elections. That detail matters because it keeps TiFlash off the commit quorum path: OLTP writes on TiKV never wait for TiFlash replication.
When a query hits a TiDB server, the optimizer decides where to route it. Point lookups and transactional queries go to TiKV (row store). Aggregations, big scans, and joins over large tables get pushed to TiFlash (columnar store). It can even split a single query across both, reading a small dimension table from TiKV and joining it with a huge fact table scan from TiFlash.
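The routing is visible in query plans. A sketch, assuming the hypothetical orders table has a TiFlash replica; look at the task column for cop[tikv] versus batchCop/mpp[tiflash] operators (exact names vary by version):

```sql
-- An aggregation over a large table should be pushed to TiFlash.
EXPLAIN ANALYZE
SELECT status, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM orders
GROUP BY status;

-- To verify engine behavior, a session can be pinned to specific engines:
SET SESSION tidb_isolation_read_engines = 'tiflash,tidb';
```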
TiFlash freshness is typically within seconds of the TiKV leader; the lag grows when the write rate outpaces TiFlash's apply throughput. Reads stay consistent despite the lag: before serving a query at a given timestamp, TiFlash waits until its replica has applied the Raft log up to that timestamp (a learner read), so results are snapshot-consistent rather than stale, at the cost of extra query latency while it catches up. In practice, for dashboards and reporting, a few seconds of apply lag is rarely an issue.
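Creating a columnar replica and watching its backfill is a two-statement affair; a sketch against the hypothetical orders table:

```sql
ALTER TABLE orders SET TIFLASH REPLICA 2;

-- PROGRESS climbs from 0 to 1 as TiFlash backfills; AVAILABLE flips to 1
-- once the optimizer is allowed to route queries to the replica.
SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_NAME = 'orders';
```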
Production Architecture
Run TiDB on Kubernetes with TiDB Operator for anything production-grade. Here is the minimum topology that avoids regret:
- 3 PD instances across 3 availability zones (lightweight: 4 vCPU, 8 GB RAM each)
- 3+ TiKV instances across 3 AZs (storage-heavy: 16 vCPU, 64 GB RAM, NVMe SSD with 2-4 TB each)
- 2+ TiDB instances behind a load balancer (compute-heavy: 16 vCPU, 32 GB RAM each)
- 2+ TiFlash instances for HTAP workloads (memory-heavy: 16 vCPU, 128 GB RAM, large SSD)
Placement Rules pin specific databases or tables to TiKV stores in particular geographic regions, which is how data residency compliance gets handled. Tune coprocessor.region-split-size (default 96 MB) based on the workload. Larger Regions (144-256 MB) reduce Raft group count and metadata overhead for big datasets. Smaller Regions improve parallel query execution for analytical queries.
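Placement is expressed in SQL since TiDB 6.0. A minimal sketch; the region labels here are hypothetical and must match the location labels configured on your TiKV stores:

```sql
-- Keep an EU tenant's data on EU stores for residency compliance.
CREATE PLACEMENT POLICY eu_only
  PRIMARY_REGION="eu-west-1"
  REGIONS="eu-west-1,eu-central-1";

ALTER TABLE eu_customers PLACEMENT POLICY = eu_only;
```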
For monitoring, TiDB ships with a built-in Dashboard (on PD's HTTP port) that provides SQL statement analysis, slow query logs, hot Region visualization, and cluster diagnostics. Pipe metrics into Prometheus and Grafana using the standard TiDB monitoring stack. Set up those Grafana dashboards on day one, not day thirty.
Capacity Planning
TiKV throughput depends on Region count, Raft consensus latency, and RocksDB performance. On a 16-vCPU node with NVMe, expect roughly 15,000-30,000 point reads/sec and 5,000-10,000 writes/sec per TiKV instance. Writes are bounded by Raft consensus: within a single AZ, plan for 3-8 ms per write. Across AZs in the same cloud region, 5-15 ms.
| Component | Sizing Guideline |
|---|---|
| TiKV storage | Raw data x 3 (replication factor) x 1.5 (compaction + MVCC overhead) |
| TiKV memory | 64 GB per node. RocksDB block cache (40%), Raft engine (10%), system (50%) |
| TiFlash storage | Raw data x 0.3-0.5 (columnar compression) x replica count |
| TiFlash memory | 128 GB per node. Columnar read cache dominates. |
| PD | 3 nodes, 8 GB RAM each. Stores only metadata (Region locations, TSO state) |
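To make the table concrete, here is the arithmetic for a hypothetical 2 TB raw dataset with two TiFlash replicas, runnable as a plain query:

```sql
SELECT
  2 * 3 * 1.5   AS tikv_storage_tb,    -- 2 TB raw x 3 replicas x 1.5 overhead = 9 TB
  2 * 0.4 * 2   AS tiflash_storage_tb; -- 2 TB raw x ~0.4 compression x 2 replicas = 1.6 TB
```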
Keep an eye on tikv_raftstore_region_count per store (stay below 40,000 for good performance), pd_tso_wait_duration (should be under 5 ms, spikes mean PD is the bottleneck), and tiflash_raft_wait_index_duration for TiFlash replication lag.
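Per-store counts are also queryable from SQL, which makes a quick skew check possible without opening Grafana:

```sql
-- Region and leader counts per TiKV store; a store with far more
-- leaders than its peers indicates a scheduling or hotspot problem.
SELECT STORE_ID, ADDRESS, REGION_COUNT, LEADER_COUNT, CAPACITY, AVAILABLE
FROM information_schema.tikv_store_status;
```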
Failure Scenarios
Scenario 1: The classic auto-increment hotspot. Someone creates an orders table with an auto-increment order_id as the primary key. Every single new order writes to the same Region (the one holding the highest key range), and that Region's Raft leader chokes. One TiKV store will be pinned at 90% CPU while the others sit at 20%. Writes that normally take 5-10 ms now take 50-100 ms. PD tries to help by splitting the Region, but it does not matter because the newest Region still gets all the inserts.
Detection: PD Dashboard's Hot Regions panel shows one Region with 10x the write QPS of anything else. tikv_grpc_msg_duration_seconds{type="kv_prewrite"} p99 spikes for that one store.
Fix: change the primary key to BIGINT AUTO_RANDOM, which allocates IDs with a random shard prefix that spreads inserts across multiple Regions. For existing tables without a clustered integer primary key (rows keyed by the internal _tidb_rowid), setting SHARD_ROW_ID_BITS = 4 distributes writes across up to 16 Regions; it has no effect when the integer primary key itself is the row key. Going forward, watch approximate_keys and approximate_size per Region in PD to catch skew before it becomes a problem.
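A sketch of both fixes; table and column names are hypothetical:

```sql
-- New tables: AUTO_RANDOM stores random shard bits in the high bits of
-- the id, so inserts land on many Regions instead of one hot tail.
CREATE TABLE orders (
  order_id    BIGINT AUTO_RANDOM PRIMARY KEY,
  customer_id BIGINT NOT NULL,
  amount      DECIMAL(12,2) NOT NULL
);

-- Existing tables keyed by the internal _tidb_rowid (no clustered
-- integer primary key): 4 shard bits spread writes across up to 16 Regions.
ALTER TABLE legacy_orders SHARD_ROW_ID_BITS = 4;
```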
Scenario 2: TiFlash falls behind during a traffic spike. The finance team has a dashboard querying TiFlash for real-time transaction summaries. A marketing campaign launches and OLTP writes triple from 10,000 to 30,000/sec. TiFlash's Raft Learner apply queue backs up because columnar compaction cannot keep pace with the incoming log entries. The tiflash_raft_wait_index_duration metric climbs from 2 seconds to 45 seconds. Now the dashboard shows totals that are 45 seconds stale, and an analyst opens a P1 ticket about "missing transactions."
Detection: alert on tiflash_raft_wait_index_duration over 10 seconds. Track the gap between TiKV's latest applied index and TiFlash's applied index per Region.
Fix: add more TiFlash nodes so each one handles fewer Regions. Bump profiles.default.max_threads to let TiFlash use more CPU for apply and compaction. For more headroom, raise the table's TiFlash replica count (ALTER TABLE ... SET TIFLASH REPLICA n) so read load spreads across more copies. For queries where a few seconds of staleness is fine (most dashboards), set the tidb_read_staleness session variable so TiDB reads a slightly older snapshot that TiFlash has already applied. That reduces read latency during lag spikes and keeps analysts from panicking.
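The staleness knob is one session variable; a sketch:

```sql
-- Let TiDB pick a read timestamp up to 5 seconds old, so the query can be
-- served from whatever snapshot TiFlash has already applied.
SET @@tidb_read_staleness = '-5';

SELECT status, COUNT(*) FROM orders GROUP BY status;
```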
Pros
- Speaks the MySQL wire protocol, so most MySQL drivers, ORMs, and tools just work
- TiFlash columnar replicas let you run analytics without hurting OLTP performance
- Scales horizontally with automatic Region splitting and rebalancing
- Strong consistency through Raft consensus on every data Region
- Open source with a real community and commercial backing from PingCAP
Cons
- Write latency is higher than single-node MySQL because of Raft consensus round-trips
- Not every MySQL feature is supported (stored procedures and triggers are limited)
- Three component types to operate (TiDB, TiKV, PD), which adds real operational overhead
- Cross-Region transactions pay for multi-Raft-group coordination
- TiFlash replicas add storage cost on top of the three row-store copies, only partially offset by columnar compression
When to use
- You have outgrown a single MySQL instance and need horizontal scale with SQL
- You want real-time analytics on live transactional data, no ETL
- You need distributed ACID transactions with MySQL compatibility
- You are running separate OLTP and OLAP systems and want to merge them
When NOT to use
- Your workload fits comfortably on a single MySQL or PostgreSQL instance
- You depend heavily on stored procedures and triggers
- You need sub-millisecond latency where single-node databases are just faster
- Your team has never operated a distributed database cluster
Key Points
- Architecture splits into three layers: stateless SQL engines (TiDB), Raft-replicated key-value storage (TiKV), and a metadata/scheduling service (PD). TiDB servers parse SQL and coordinate execution, TiKV stores data in Regions, PD hands out timestamps and balances load.
- Data lives in ~96 MB Regions, each replicated by its own Raft group (default factor of 3). A 1 TB database runs roughly 10,000 independent Raft groups in parallel.
- TiFlash enables real-time HTAP by keeping columnar copies of TiKV Regions via the Raft Learner protocol. The optimizer routes analytical queries to TiFlash and point queries to TiKV automatically.
- Distributed ACID transactions use a Percolator-style two-phase commit with a Timestamp Oracle (TSO) in PD for globally ordered timestamps. This provides snapshot isolation across all TiKV nodes.
- PD (Placement Driver) is the cluster brain. It allocates monotonically increasing timestamps, tracks Region locations, schedules splits and merges, and rebalances leaders across TiKV stores.
- TiKV runs RocksDB under the hood. Each TiKV instance actually has two RocksDB instances (one for the Raft log, one for state machine data), which provides LSM-tree write optimization.
Common Mistakes
- ✗ Using auto-increment primary keys. Sequential IDs concentrate all writes on a single Region's Raft leader, creating a hotspot. Use AUTO_RANDOM, SHARD_ROW_ID_BITS, or random UUIDs to spread inserts across Regions.
- ✗ Forgetting to set TiFlash replicas for analytical tables. Without TiFlash, OLAP queries hit TiKV's row store, triggering full table scans that compete with the OLTP workload. Run ALTER TABLE t SET TIFLASH REPLICA 2.
- ✗ Ignoring Region hotspots. A single hot Region is capped at one Raft leader's throughput. Watch the PD Dashboard's Hot Regions panel and split hot Regions manually or tune Region size.
- ✗ Running TiDB, TiKV, and PD on the same machines. Resource contention between the SQL layer and storage layer degrades both. Give each component its own dedicated instances.
- ✗ Not tuning GC life time. With the default 10-minute window, MVCC versions older than 10 minutes are collected; long-running transactions or analytical queries reading older snapshots either fail or hold back the GC safepoint and cause storage bloat (see the sketch below).
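For the GC window specifically, the fix is one variable plus a status check; a sketch:

```sql
-- Retain MVCC versions for 30 minutes so longer analytical reads survive.
SET GLOBAL tidb_gc_life_time = '30m';

-- Verify the effective window and the last GC run.
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM mysql.tidb
WHERE VARIABLE_NAME LIKE 'tikv_gc%';
```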