CockroachDB
Distributed SQL that actually survives failures
Use Cases
CockroachDB is the tool to reach for when the requirements include the relational model, real ACID transactions, and the ability to scale horizontally across regions. It draws heavily from Google's Spanner paper, but unlike Spanner it runs without a fleet of atomic clocks. The core problem it solves is real: scaling a relational database horizontally without giving up consistency. Anyone who has operated a sharded Postgres cluster with application-level routing knows exactly why this matters.
That said, it is not magic. The cost of distribution shows up in write latency, operational complexity, and dollars. For many workloads, a single well-tuned Postgres instance is the right answer. CockroachDB earns its place when multi-region consistency is genuinely needed or when the data no longer fits on one machine.
How It Works Internally
There are four layers, and understanding them prevents a lot of confusion later.
First, the SQL layer. It parses queries using a PostgreSQL-compatible parser and feeds them into a cost-based optimizer. Then the Distributed SQL (DistSQL) engine figures out how to run the query across multiple nodes. It pushes computation close to the data to cut down on network transfers. Think of it as MapReduce for SQL, which is basically what it is.
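To see where DistSQL sends the work, the EXPLAIN (DISTSQL) variant renders the physical plan. A minimal sketch, assuming a hypothetical orders table:

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE orders (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    customer_id UUID NOT NULL,
    total       DECIMAL(10,2)
);

-- EXPLAIN (DISTSQL) shows the distributed physical plan: which nodes
-- scan which ranges, and where partial aggregates get merged.
EXPLAIN (DISTSQL) SELECT customer_id, sum(total)
FROM orders
GROUP BY customer_id;
```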
Below that sits the transactional key-value store. SQL tables get encoded as sorted key-value pairs: keys are built from table ID + index ID + primary-key values, with the remaining column values stored in the value. Transactions use a hybrid logical clock (HLC) that combines NTP-synchronized wall time with a logical counter. This provides a global ordering without needing perfectly synchronized clocks. Serializable isolation works through timestamp ordering. When the system detects read-write conflicts, it restarts the transaction. Transaction restarts are normal, not errors. Application code needs to handle them.
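Restarts surface to clients as retry errors (SQLSTATE 40001), and CockroachDB documents a client-side retry protocol built on a specially named savepoint. A minimal sketch, with an illustrative schema:

```sql
-- CockroachDB's documented client-side retry protocol. On a 40001
-- retry error, the client rolls back to the savepoint and replays
-- the statements instead of abandoning the transaction.
BEGIN;
SAVEPOINT cockroach_restart;

UPDATE accounts SET balance = balance - 100 WHERE id = 1;  -- illustrative table
UPDATE accounts SET balance = balance + 100 WHERE id = 2;

RELEASE SAVEPOINT cockroach_restart;
-- On failure: ROLLBACK TO SAVEPOINT cockroach_restart, then replay.
COMMIT;
```

In practice most teams let a helper library (such as the cockroach-go package's transaction wrapper) run this loop rather than hand-rolling it.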
The key-value space splits into ranges, roughly 512MB each. Every range gets its own Raft consensus group, usually with a replication factor of 3 or 5. The range's leaseholder, normally colocated with its Raft leader, serves reads and coordinates writes, and a write is acknowledged only after a quorum of replicas has accepted it. When a range grows past 512MB, it splits automatically. The system rebalances replicas across nodes to keep things even. At the bottom, Pebble (a custom Go LSM-tree engine inspired by the RocksDB/LevelDB lineage) handles the actual disk I/O, tuned specifically for CockroachDB's access patterns.
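Range boundaries, replica placement, and leaseholders are all inspectable from SQL, which helps when debugging rebalancing. A sketch against the hypothetical orders table from above:

```sql
-- Show how the table's key space is split into ranges, plus each
-- range's replicas and current leaseholder.
SHOW RANGES FROM TABLE orders;
```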
One thing that trips people up: every node runs the full stack. There are no separate coordinator or storage roles. Any node can serve any query. This simplifies deployment but means every node needs decent resources.
Production Architecture
Run at least 5 nodes across 3 availability zones for production. That way the cluster can lose an entire AZ and still maintain quorum. Put a load balancer in front to spread client connections. Since every node can handle any query, the balancing strategy is simple.
For multi-region setups, CockroachDB offers three table-level topologies. REGIONAL BY TABLE pins the leaseholder to a primary region for low-latency reads. REGIONAL BY ROW pins each row to a specific region, which is the right choice for data residency rules. GLOBAL is for read-heavy reference tables accessed everywhere, though writes get slower. Pick the right one per table. Most teams need a mix.
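All three topologies are one-line locality settings once the database knows its regions. A sketch with hypothetical region and table names:

```sql
-- Region names are illustrative; they must match the nodes' --locality flags.
ALTER DATABASE app SET PRIMARY REGION "us-east1";
ALTER DATABASE app ADD REGION "eu-west1";
ALTER DATABASE app ADD REGION "ap-southeast1";

-- One topology per table, matched to its access pattern:
ALTER TABLE billing_config SET LOCALITY REGIONAL BY TABLE IN "us-east1";
ALTER TABLE users          SET LOCALITY REGIONAL BY ROW;  -- residency per row
ALTER TABLE currencies     SET LOCALITY GLOBAL;           -- read-mostly reference data
```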
Set --cache to 25-35% of available RAM and --max-sql-memory to 25% for query execution buffers. Use the built-in DB Console (it listens on port 8080 by default) from day one. It shows range distribution, hot ranges, SQL stats, and Raft health. It gets used constantly.
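Both flags belong on the start command. A sketch of a single node's invocation; paths, addresses, and locality values are placeholders:

```sh
cockroach start \
  --certs-dir=certs \
  --store=/mnt/data1 \
  --locality=region=us-east1,zone=us-east1-a \
  --cache=.30 \
  --max-sql-memory=.25 \
  --join=node1:26257,node2:26257,node3:26257 \
  --listen-addr=:26257 \
  --http-addr=:8080
```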
Capacity Planning
Here are some real numbers to plan around. A 3-node cluster on 8-vCPU instances with SSDs handles roughly 2,000-5,000 simple read/write transactions per second. Write throughput is bounded by Raft consensus latency: 2-5ms within a single AZ, 50-200ms across regions. Budget 3x the data size in raw storage for the default replication factor of 3, plus another 1.5-2x for LSM compaction headroom. So 1TB of data needs about 4.5-6TB of disk. Plan accordingly.
Keep an eye on these metrics: ranges.unavailable (anything above 0 is a fire), admission.wait_durations for overload detection, and liveness.heartbeatlatency for slow nodes. Watch per-node range counts too. If they are heavily skewed, something is wrong with rebalancing. CockroachDB's own guidance is to keep per-node data under 2.5TB and per-node range count under 50,000. In practice, I would start getting nervous before those limits.
Failure Scenarios
Scenario 1: Hot range bottleneck from sequential keys. This one bites almost everyone who does not read the docs carefully. The app uses auto-incrementing integer primary keys, so every insert lands on the same range's Raft leader. That single range caps out at roughly 1,000-2,000 writes/sec. Even 100 nodes would not matter. The fix: check the DB Console's Hot Ranges page. If one range is eating all the QPS, switch to UUID DEFAULT gen_random_uuid() for the primary key, or add a hash-sharded index with USING HASH WITH BUCKET_COUNT = 8 to spread writes across ranges. Do this before going to production, not after.
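Both fixes in DDL form; a sketch with illustrative table and column names:

```sql
-- Fix 1: random UUID primary key spreads inserts across the key space.
CREATE TABLE events (
    id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    payload    JSONB
);

-- Fix 2: hash-sharded index when a column must stay time-ordered.
-- Writes fan out across 8 shards instead of hammering one range.
CREATE INDEX events_by_time ON events (created_at)
    USING HASH WITH BUCKET_COUNT = 8;
```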
Scenario 2: Multi-region transaction latency spiral. The deployment spans US-East, EU-West, and AP-Southeast. A checkout transaction reads the user's cart (pinned to US-East), checks inventory (pinned to EU-West), and writes the order. Every cross-region Raft consensus adds 80-120ms. That "simple" transaction is now 500ms+. Under load, transaction retries stack up and things get worse fast. The fix: redesign so hot-path transactions stay in one region. Pin cart, inventory snapshot, and order data to the user's region using REGIONAL BY ROW. Reconcile inventory across regions asynchronously. This is a schema design problem, not a database tuning problem. Get it right early.
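One way to express the pinning is REGIONAL BY ROW with an explicit region column; a sketch with a hypothetical carts table (the crdb_internal_region enum is created automatically when regions are added to the database):

```sql
-- All hot-path rows for a user carry the user's home region, so the
-- checkout transaction never leaves that region.
CREATE TABLE carts (
    user_id UUID NOT NULL,
    item_id UUID NOT NULL,
    qty     INT  NOT NULL,
    region  crdb_internal_region NOT NULL,
    PRIMARY KEY (user_id, item_id)
) LOCALITY REGIONAL BY ROW AS region;
```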
Pros
- Distributed ACID transactions that actually work
- Automatic horizontal scaling and rebalancing
- PostgreSQL-compatible wire protocol
- Multi-region with locality-aware reads
- Survives node, rack, and datacenter failures
Cons
- Write latency is higher because of consensus overhead
- Not fully PostgreSQL-compatible (missing extensions will surprise you)
- Steep learning curve if you have never operated distributed SQL
- Overkill and expensive for small datasets that fit on one Postgres box
When to use
- You need horizontal scaling with SQL and ACID, not just one of them
- Multi-region deployment with strong consistency
- You want automatic failover without manual runbooks
- Your single PostgreSQL instance is starting to sweat
When NOT to use
- Single-region, small data volume. Just use Postgres.
- Ultra-low latency requirements where single-node PG is measurably faster
- You depend on PostgreSQL extensions like PostGIS or pg_cron
- Budget-constrained projects where the infra cost is hard to justify
Key Points
- Four layers do the heavy lifting: SQL parser/optimizer, distributed SQL execution, transactional key-value store, and per-range Raft consensus on top of Pebble (LSM-tree storage)
- Data splits into ranges (default 512MB each). Each range runs its own Raft group. A 10TB database has roughly 20,000 Raft groups running in parallel.
- Serializable is the default isolation level, and for most of CockroachDB's history it was the only one. It uses a hybrid logical clock (HLC) that combines wall-clock time with logical counters to order transactions globally.
- PostgreSQL wire protocol compatibility means most ORMs, drivers, and tools just work. But stored procedures, custom extensions, and some advanced types do not.
- Geo-partitioning pins row-level data to specific regions using zone configs. This handles GDPR and data residency at the database layer instead of in application code.
- Range splits, merges, and rebalancing run continuously. Add a node and lease redistribution kicks in within minutes. No operator intervention needed.
Common Mistakes
- ✗ Expecting single-node Postgres performance. Cross-range transactions need Raft consensus round-trips. A simple insert touching one range takes 2-5ms vs sub-millisecond on Postgres.
- ✗ Ignoring hotspots from sequential primary keys. Auto-increment IDs and timestamps concentrate writes on one range. Use UUIDs or hash-sharded indexes to spread the load.
- ✗ Forgetting about cross-region latency. A transaction spanning regions eats WAN round-trips (50-200ms). Design the schema so hot-path transactions stay in one region.
- ✗ Leaving the default GC TTL untouched. The 25-hour default keeps all MVCC revisions, eating storage fast. Drop it to 1-4 hours for write-heavy workloads (see the zone-config sketch after this list).
- ✗ Running too few nodes. CockroachDB needs at least 3 nodes for Raft quorum. Five nodes across 3 AZs is the realistic minimum for production.
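For the GC TTL item above, the knob lives in the zone config; a sketch with an illustrative table name:

```sql
-- Keep 4 hours of MVCC history instead of the 25-hour default.
-- Caveat: this also bounds AS OF SYSTEM TIME queries and incremental
-- backup windows, so align it with the backup schedule.
ALTER TABLE events CONFIGURE ZONE USING gc.ttlseconds = 14400;
```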