CockroachDB
Distributed SQL that actually survives failures
Use Cases
CockroachDB is the tool to reach for when the requirements include the relational model, real ACID transactions, and the ability to scale horizontally across regions. It draws heavily from Google's Spanner paper, but unlike Spanner it runs without a fleet of atomic clocks. The core problem it solves is real: scaling a relational database horizontally without giving up consistency. Anyone who has operated a sharded Postgres cluster with application-level routing knows exactly why this matters.
That said, it is not magic. The cost of distribution shows up in write latency, operational complexity, and dollars. For many workloads, a single well-tuned Postgres instance is the right answer. CockroachDB earns its place when multi-region consistency is genuinely needed or when the data no longer fits on one machine.
How It Works Internally
There are four layers, and understanding them prevents a lot of confusion later.
First, the SQL layer. It parses queries using a PostgreSQL-compatible parser and feeds them into a cost-based optimizer. Then the Distributed SQL (DistSQL) engine figures out how to run the query across multiple nodes. It pushes computation close to the data to cut down on network transfers. Think of it as MapReduce for SQL, which is basically what it is.
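To see where DistSQL sends the work, the EXPLAIN (DISTSQL) variant renders the physical plan. A minimal sketch, assuming a hypothetical orders table:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE orders (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    total       DECIMAL(10,2)
);

-- EXPLAIN (DISTSQL) shows the distributed physical plan: which nodes
-- scan which ranges, and where partial aggregates get merged.
EXPLAIN (DISTSQL) SELECT customer_id, sum(total)
FROM orders
GROUP BY customer_id;
```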
Below that sits the transactional key-value store. SQL tables get encoded as sorted key-value pairs: keys are built from table ID + index ID + primary-key values, with the remaining column values stored in the value. Transactions use a hybrid logical clock (HLC) that combines NTP-synchronized wall time with a logical counter. This provides a global ordering without needing perfectly synchronized clocks. Serializable isolation works through timestamp ordering. When the system detects read-write conflicts, it restarts the transaction. Transaction restarts are normal, not errors. Application code needs to handle them.
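Restarts surface to clients as retry errors (SQLSTATE 40001), and CockroachDB documents a client-side retry protocol built on a specially named savepoint. A minimal sketch, with an illustrative schema:

```sql
-- CockroachDB's documented client-side retry protocol. On a 40001
-- retry error, the client rolls back to the savepoint and replays
-- the statements instead of abandoning the transaction.
BEGIN;
SAVEPOINT cockroach_restart;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- illustrative table
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

RELEASE SAVEPOINT cockroach_restart;
-- On failure: ROLLBACK TO SAVEPOINT cockroach_restart, then replay.
COMMIT;
```

In practice most teams let a helper library (such as the cockroach-go package's transaction wrapper) run this loop rather than hand-rolling it.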
The key-value space splits into ranges, roughly 512MB each. Every range gets its own Raft consensus group, usually with a replication factor of 3 or 5. The range's leaseholder, normally colocated with its Raft leader, serves reads and coordinates writes, and a write is acknowledged only after a quorum of replicas has accepted it. When a range grows past 512MB, it splits automatically. The system rebalances replicas across nodes to keep things even. At the bottom, Pebble (a custom Go LSM-tree engine inspired by the RocksDB/LevelDB lineage) handles the actual disk I/O, tuned specifically for CockroachDB's access patterns.
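Range boundaries, replica placement, and leaseholders are all inspectable from SQL, which helps when debugging rebalancing. A sketch against the hypothetical orders table from above:

```sql
-- Show how the table's key space is split into ranges, plus each
-- range's replicas and current leaseholder.
SHOW RANGES FROM TABLE orders;
```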
One thing that trips people up: every node runs the full stack. There are no separate coordinator or storage roles. Any node can serve any query. This simplifies deployment but means every node needs decent resources.
Production Architecture
Run at least 5 nodes across 3 availability zones for production. That way the cluster can lose an entire AZ and still maintain quorum. Put a load balancer in front to spread client connections. Since every node can handle any query, the balancing strategy is simple.
For multi-region setups, CockroachDB offers three table-level topologies. REGIONAL BY TABLE pins the leaseholder to a primary region for low-latency reads. REGIONAL BY ROW pins each row to a specific region, which is the right choice for data residency rules. GLOBAL is for read-heavy reference tables accessed everywhere, though writes get slower. Pick the right one per table. Most teams need a mix.
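All three topologies are one-line locality settings once the database knows its regions. A sketch with hypothetical region and table names:

```sql
-- Region names are illustrative; they must match the nodes' --locality flags.
ALTER DATABASE app SET PRIMARY REGION "us-east1";
ALTER DATABASE app ADD REGION "eu-west1";
ALTER DATABASE app ADD REGION "ap-southeast1";

-- One topology per table, matched to its access pattern:
ALTER TABLE billing_config SET LOCALITY REGIONAL BY TABLE IN "us-east1";
ALTER TABLE users          SET LOCALITY REGIONAL BY ROW;  -- residency per row
ALTER TABLE currencies     SET LOCALITY GLOBAL;           -- read-mostly reference data
```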
Set --cache to 25-35% of available RAM and --max-sql-memory to 25% for query execution buffers. Use the built-in DB Console (it listens on port 8080 by default) from day one. It shows range distribution, hot ranges, SQL stats, and Raft health. It gets used constantly.
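Both flags belong on the start command. A sketch of a single node's invocation; paths, addresses, and locality values are placeholders:

```sh
cockroach start \
  --certs-dir=certs \
  --store=/mnt/data1 \
  --locality=region=us-east1,zone=us-east1-a \
  --cache=.30 \
  --max-sql-memory=.25 \
  --join=node1:26257,node2:26257,node3:26257 \
  --listen-addr=:26257 \
  --http-addr=:8080
```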
Capacity Planning
Here are some real numbers to plan around. A 3-node cluster on 8-vCPU instances with SSDs handles roughly 2,000-5,000 simple read/write transactions per second. Write throughput is bounded by Raft consensus latency: 2-5ms within a single AZ, 50-200ms across regions. Budget 3x the data size in raw storage for the default replication factor of 3, plus another 1.5-2x for LSM compaction headroom. So 1TB of data needs about 4.5-6TB of disk. Plan accordingly.
Keep an eye on these metrics: ranges.unavailable (anything above 0 is a fire), admission.wait_durations for overload detection, and liveness.heartbeatlatency for slow nodes. Watch per-node range counts too. If they are heavily skewed, something is wrong with rebalancing. CockroachDB's own guidance is to keep per-node data under 2.5TB and per-node range count under 50,000. In practice, I would start getting nervous before those limits.
Failure Scenarios
Scenario 1: Hot range bottleneck from sequential keys. This one bites almost everyone who does not read the docs carefully. The app uses auto-incrementing integer primary keys, so every insert lands on the same range's Raft leader. That single range caps out at roughly 1,000-2,000 writes/sec. Even 100 nodes would not matter. The fix: check the DB Console's Hot Ranges page. If one range is eating all the QPS, switch to UUID DEFAULT gen_random_uuid() for the primary key, or add a hash-sharded index with USING HASH WITH BUCKET_COUNT = 8 to spread writes across ranges. Do this before going to production, not after.
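Both fixes in DDL form; a sketch with illustrative table and column names:

```sql
-- Fix 1: random UUID primary key spreads inserts across the key space.
CREATE TABLE events (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload    JSONB
);

-- Fix 2: hash-sharded index when a column must stay time-ordered.
-- Writes fan out across 8 shards instead of hammering one range.
CREATE INDEX events_by_time ON events (created_at)
    USING HASH WITH BUCKET_COUNT = 8;
```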
Scenario 2: Multi-region transaction latency spiral. The deployment spans US-East, EU-West, and AP-Southeast. A checkout transaction reads the user's cart (pinned to US-East), checks inventory (pinned to EU-West), and writes the order. Every cross-region Raft consensus adds 80-120ms. That "simple" transaction is now 500ms+. Under load, transaction retries stack up and things get worse fast. The fix: redesign so hot-path transactions stay in one region. Pin cart, inventory snapshot, and order data to the user's region using REGIONAL BY ROW. Reconcile inventory across regions asynchronously. This is a schema design problem, not a database tuning problem. Get it right early.
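One way to express the pinning is REGIONAL BY ROW with an explicit region column; a sketch with a hypothetical carts table (the crdb_internal_region enum is created automatically when regions are added to the database):

```sql
-- All hot-path rows for a user carry the user's home region, so the
-- checkout transaction never leaves that region.
CREATE TABLE carts (
    user_id UUID NOT NULL,
    item_id UUID NOT NULL,
    qty     INT  NOT NULL,
    region  crdb_internal_region NOT NULL,
    PRIMARY KEY (user_id, item_id)
) LOCALITY REGIONAL BY ROW AS region;
```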
Pros
- Distributed ACID transactions that actually work
- Automatic horizontal scaling and rebalancing
- PostgreSQL-compatible wire protocol
- Multi-region with locality-aware reads
- Survives node, rack, and datacenter failures
Cons
- Write latency is higher because of consensus overhead
- Not fully PostgreSQL-compatible (missing extensions will surprise you)
- Steep learning curve if you have never operated distributed SQL
- Overkill and expensive for small datasets that fit on one Postgres box
When to use
- You need horizontal scaling with SQL and ACID, not just one of them
- Multi-region deployment with strong consistency
- You want automatic failover without manual runbooks
- Your single PostgreSQL instance is starting to sweat
When NOT to use
- Single-region, small data volume. Just use Postgres.
- Ultra-low latency requirements where single-node PG is measurably faster
- You depend on PostgreSQL extensions like PostGIS or pg_cron
- Budget-constrained projects where the infra cost is hard to justify
Key Points
- Four layers do the heavy lifting: SQL parser/optimizer, distributed SQL execution, transactional key-value store, and per-range Raft consensus on top of Pebble (LSM-tree storage)
- Data splits into ranges (default 512MB each). Each range runs its own Raft group. A 10TB database has roughly 20,000 Raft groups running in parallel.
- Serializable is the default isolation level, and for most of CockroachDB's history it was the only one. It uses a hybrid logical clock (HLC) that combines wall-clock time with logical counters to order transactions globally.
- PostgreSQL wire protocol compatibility means most ORMs, drivers, and tools just work. But stored procedures, custom extensions, and some advanced types do not.
- Geo-partitioning pins row-level data to specific regions using zone configs. This handles GDPR and data residency at the database layer instead of in application code.
- Range splits, merges, and rebalancing run continuously. Add a node and lease redistribution kicks in within minutes. No operator intervention needed.
Common Mistakes
- ✗ Expecting single-node Postgres performance. Cross-range transactions need Raft consensus round-trips. A simple insert touching one range takes 2-5ms vs sub-millisecond on Postgres.
- ✗ Ignoring hotspots from sequential primary keys. Auto-increment IDs and timestamps concentrate writes on one range. Use UUIDs or hash-sharded indexes to spread the load.
- ✗ Forgetting about cross-region latency. A transaction spanning regions eats WAN round-trips (50-200ms). Design the schema so hot-path transactions stay in one region.
- ✗ Leaving the default GC TTL untouched. The 25-hour default keeps all MVCC revisions, eating storage fast. Drop it to 1-4 hours for write-heavy workloads (see the zone-config sketch after this list).
- ✗ Running too few nodes. CockroachDB needs at least 3 nodes for Raft quorum. Five nodes across 3 AZs is the realistic minimum for production.
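For the GC TTL item above, the knob lives in the zone config; a sketch with an illustrative table name:

```sql
-- Keep 4 hours of MVCC history instead of the 25-hour default.
-- Caveat: this also bounds AS OF SYSTEM TIME queries and incremental
-- backup windows, so align it with the backup schedule.
ALTER TABLE events CONFIGURE ZONE USING gc.ttlseconds = 14400;
```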