Google Cloud Spanner
The database that uses atomic clocks to solve distributed consistency
Most distributed databases force a choice: strong consistency or horizontal scale. Spanner is the rare system that actually delivers both, and the trick behind it is wonderfully weird. Google put atomic clocks and GPS receivers in every data center, built a time API around them, and used bounded clock uncertainty to make globally consistent transactions possible. It has been running internally at Google since before it became a cloud product in 2017, powering AdWords, Google Play, and Photos at scale.
Is it the right choice for a given system? Maybe. Spanner's 99.999% multi-region SLA (under 5.3 minutes of downtime per year) is real, and the consistency guarantees are the strongest available in production. But it locks teams into Google Cloud, costs real money, and comes with design constraints that must be respected.
How It Works Internally
Three ideas make Spanner work: TrueTime, Paxos-based replication, and automatic split management. They are deeply connected, and understanding one without the others gives an incomplete picture.
TrueTime is the clever part. Every Google data center has a mix of atomic clocks (cesium and rubidium) and GPS receivers that cross-check each other. The TrueTime API does not return a single timestamp. It returns an interval [earliest, latest] representing the true current time with bounded uncertainty, typically under 7ms. When a transaction commits, Spanner assigns a commit timestamp and then waits for that uncertainty window to pass before making the data visible. This "commit wait" is the core insight. By waiting out the clock uncertainty, Spanner guarantees that any transaction starting after the wait period will see the committed data. The result is external consistency without needing perfectly synchronized clocks. The wait adds roughly 7ms of latency per commit. Compare that to the hundreds of milliseconds that software-based clock sync like NTP would burn.
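To make the commit-wait rule concrete, here is a toy Python sketch. Every name in it (TTInterval, tt_now, commit) is invented for illustration; the real TrueTime API and the Paxos machinery are internal to Spanner, so this only models the waiting logic described above.

```python
# Toy model of TrueTime and commit wait (illustration only, not a real API).
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on the true time, in seconds
    latest: float    # upper bound on the true time, in seconds

def tt_now(epsilon: float = 0.007) -> TTInterval:
    """Hypothetical TrueTime read with ~7 ms of bounded uncertainty."""
    t = time.time()
    return TTInterval(t - epsilon, t + epsilon)

def commit(apply_mutations) -> float:
    """Pick a commit timestamp, apply the write, then wait out the uncertainty."""
    ts = tt_now().latest     # timestamp is no earlier than the true current time
    apply_mutations(ts)      # replicate via Paxos (elided in this sketch)
    # Commit wait: block until the timestamp is definitely in the past, i.e.
    # until tt_now().earliest > ts. Any transaction that starts after this
    # point is guaranteed a strictly larger timestamp, which is what gives
    # external consistency.
    while tt_now().earliest <= ts:
        time.sleep(0.001)
    return ts
```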
Paxos replication handles fault tolerance. Each split (a contiguous range of rows sorted by primary key) gets replicated across zones using its own independent Paxos group. Multi-region configs typically use 5 replicas; single-region uses 3. The Paxos leader for each split handles all writes and strong reads. Writes need a quorum (majority of replicas) before committing. When a transaction touches multiple splits, Spanner runs a two-phase commit protocol, coordinated by one of the participating Paxos leaders.
Colossus (Google's successor to GFS) handles storage. Each Paxos replica stores data as SSTables on Colossus, which runs its own replication and fault tolerance at the file system level. This is an important distinction: Paxos replication is for consistency (all replicas agree on what happened), while Colossus replication is for durability (data survives disk failures). Two layers of protection, each doing a different job.
Schema Design
Spanner schema design is not like regular relational database design. Two things will cause problems if ignored.
Primary key design controls data distribution. Spanner partitions rows by primary key into splits. Sequential keys (timestamps, auto-increment integers) mean every new write slams into the last split. Guaranteed hotspot. I have seen teams discover this the hard way in production. Use UUIDv4 for uniform distribution, or prefix keys with a hash of a high-cardinality attribute like hash(customer_id) || customer_id || order_id.
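A minimal client-side sketch of both strategies, assuming a hypothetical Orders table keyed on (ShardId, CustomerId, OrderId); hashlib stands in here for GoogleSQL's server-side FARM_FINGERPRINT, and the column names are placeholders.

```python
# Client-side key construction that avoids sequential-key hotspots.
import hashlib
import uuid

def shard_id(customer_id: str, shards: int = 100) -> int:
    """Deterministic shard prefix derived from a high-cardinality attribute."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shards

def new_order_key(customer_id: str) -> tuple[int, str, str]:
    """Primary key (ShardId, CustomerId, OrderId); writes spread across shards."""
    return (shard_id(customer_id), customer_id, uuid.uuid4().hex)

print(new_order_key("cust-42"))  # e.g. (<shard>, 'cust-42', '<uuid hex>')
```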
Interleaved tables are the killer feature most people underuse. Declaring a child table as INTERLEAVE IN PARENT physically stores child rows right next to their parent row in the same split. A Customers table interleaved with Orders means a customer's row and all their orders live together. A join between them costs about the same as reading a single row. Skip interleaving, and that customer and their orders might land on different machines in different splits. Now that simple join is a distributed transaction. Huge difference.
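Here is roughly what that interleaved schema looks like, sketched with the google-cloud-spanner Python client; the Customers and Orders columns and the instance and database IDs are placeholders.

```python
# Sketch: parent table plus a child table interleaved into it.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

operation = database.update_ddl([
    """CREATE TABLE Customers (
         CustomerId STRING(36) NOT NULL,
         Name       STRING(MAX),
       ) PRIMARY KEY (CustomerId)""",
    # Child rows are stored physically next to their parent row in the same split.
    """CREATE TABLE Orders (
         CustomerId STRING(36) NOT NULL,
         OrderId    STRING(36) NOT NULL,
         Total      NUMERIC,
       ) PRIMARY KEY (CustomerId, OrderId),
       INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
])
operation.result(300)  # wait for the schema change to finish
```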
Stale reads are the key to read scaling. Strong reads (the default) hit the Paxos leader and return the latest committed data. Bounded-staleness reads (e.g., MAX_STALENESS 15s) can be served by any replica in any zone, whichever is closest, because they only need data up to 15 seconds old. For dashboards, recommendation engines, and other read-heavy global workloads where slight staleness is fine, this drops p50 read latency from 50-100ms (cross-region round-trip to the leader) down to 1-5ms (local replica). That is a massive win, and the only price is staleness the workload already tolerates.
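A sketch of a bounded-staleness read with the Python client, assuming placeholder instance, database, and table names:

```python
# Sketch: single-use bounded-staleness read served by the nearest replica.
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

# Any nearby replica that is caught up to "now - 15s" can serve this read,
# so no round-trip to the Paxos leader is required.
with database.snapshot(max_staleness=datetime.timedelta(seconds=15)) as snapshot:
    rows = snapshot.execute_sql(
        "SELECT CustomerId, Name FROM Customers LIMIT 10"
    )
    for row in rows:
        print(row)
```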
Production Architecture
Two instance configurations, and picking the right one matters a lot.
Regional instances keep all replicas within one Google Cloud region (e.g., us-central1) spread across 3 zones. Write latency sits around 5-10ms. Availability is 99.99%. If an entire zone fails, the remaining two still have quorum and serve traffic with zero data loss and near-zero downtime. For most workloads, this is the right starting point.
Multi-region instances distribute replicas across multiple regions (e.g., nam-eur-asia1 spans North America, Europe, and Asia). The result is 99.999% availability, enough to survive an entire region going offline. The cost: write latency jumps to 50-200ms because Paxos quorum now requires cross-region round-trips. Use leader placement configuration to pin Paxos leaders in the region closest to the primary write traffic.
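Leader placement is a database option set through DDL. A sketch, assuming a database named payments; the region value would be whichever read-write region sits closest to the write traffic:

```python
# Sketch: pin the default Paxos leader region for a multi-region database.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("payments-instance").database("payments")

operation = database.update_ddl([
    "ALTER DATABASE payments SET OPTIONS (default_leader = 'europe-west1')"
])
operation.result(300)  # leader changes take effect within minutes
```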
Change Streams (Spanner's CDC feature) capture row-level changes in commit-timestamp order. Connect them to Pub/Sub for event-driven architectures, BigQuery for real-time analytics, or Dataflow for stream processing. They add minimal overhead because they read from the Paxos replication log directly rather than polling tables.
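Creating a change stream is also just DDL. A sketch that captures every table in a placeholder database (the stream name is arbitrary):

```python
# Sketch: create a change stream covering the whole database.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

operation = database.update_ddl([
    # Scope it to specific tables instead with: CREATE CHANGE STREAM ... FOR Orders
    "CREATE CHANGE STREAM AllChanges FOR ALL"
])
operation.result(300)  # wait for the schema change to finish
```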
Use the Query Statistics dashboard. Seriously. It automatically surfaces the top queries by CPU consumption and latency, along with missing indexes and lock contention that would otherwise go unnoticed. Use Query Optimizer versioning to test new optimizer versions in staging before rolling them out to production.
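The same statistics are also queryable in SQL through the built-in SPANNER_SYS views, which is handy for automation. A sketch, assuming the QUERY_STATS_TOP_HOUR view with its text, execution_count, avg_cpu_seconds, and avg_latency_seconds columns:

```python
# Sketch: pull the top CPU-consuming queries from the built-in statistics views.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("""
        SELECT text, execution_count, avg_cpu_seconds, avg_latency_seconds
        FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
        ORDER BY avg_cpu_seconds DESC
        LIMIT 10
    """)
    for row in rows:
        print(row)
```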
Capacity Planning
Spanner capacity is measured in processing units (PU); 1 node = 1,000 PU. Here is roughly what a single node provides:
| Metric | Single-Region | Multi-Region |
|---|---|---|
| Read throughput | 10,000 QPS (8 KB reads) | 10,000 QPS |
| Write throughput | 2,000 QPS (1 KB writes) | 2,000 QPS |
| Storage per node | 2 TB (recommended max) | 2 TB |
| Storage limit | 10 TB per node (hard) | 10 TB per node |
On cost: a 3-node regional instance runs about $2,000/month for nodes at the $0.90/node-hour rate, plus $0.30/GB/month for storage. Multi-region instances cost 3x the node rate. If the workload is below 1,000 PU, use granular instance sizing (minimum 100 PU at $0.09/hour) to avoid paying for a full node that is not needed.
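A back-of-the-envelope sizing helper built from the per-node figures in the table above; the workload numbers in the example are made up:

```python
# Sketch: size an instance by whichever dimension is the bottleneck.
import math

READ_QPS_PER_NODE = 10_000   # 8 KB reads
WRITE_QPS_PER_NODE = 2_000   # 1 KB writes
STORAGE_TB_PER_NODE = 2      # recommended ceiling per node

def nodes_needed(read_qps: float, write_qps: float, storage_tb: float) -> int:
    """Take the max across read, write, and storage requirements, rounded up."""
    return math.ceil(max(
        read_qps / READ_QPS_PER_NODE,
        write_qps / WRITE_QPS_PER_NODE,
        storage_tb / STORAGE_TB_PER_NODE,
    ))

# Example: 25k reads/s, 3k writes/s, 5 TB of data -> 3 nodes (3,000 PU),
# before adding CPU headroom (see the 65% utilization target below).
print(nodes_needed(25_000, 3_000, 5))
```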
Autoscaling (GA) adjusts processing units between a configured min and max based on CPU utilization targets (default 65%). Scale-up takes 5-10 minutes. Scale-down is more conservative, with a 30-minute cooldown. For predictable traffic patterns, set the autoscaler minimum to the baseline PU count to avoid cold-start latency spikes.
Keep an eye on three metrics: cpu/utilization (keep under 65% for headroom), api/received_bytes_count for traffic patterns, and lock_stat/total/lock_wait_time for contention. High lock wait times point to hot rows or transactions that are too broad.
Failure Scenarios
Scenario 1: The sequential key hotspot. A team builds a logging service and defines their table as CREATE TABLE Events (event_time TIMESTAMP, ...) PRIMARY KEY (event_time). Looks reasonable. Every insert targets the most recent split because event_time keeps going up. That single split's Paxos leader saturates at around 2,000 writes/sec while the rest of the cluster sits idle. Spanner tries to help by automatically splitting the hot range, but each new split immediately becomes the hot one. In the Cloud Console, one split shows 100% CPU while api/request_latencies p99 climbs from 10ms to 500ms. The Key Visualizer tool shows a bright diagonal line: all traffic concentrated on the latest key range. split_load metrics show extreme imbalance. The fix: redesign the primary key to (ShardId, event_time), where ShardId = MOD(FARM_FINGERPRINT(event_source), 100). Now writes spread across 100 key ranges. For the existing table, create a new table with the correct key design (sketched below) and backfill using a Dataflow pipeline. There is no in-place fix. The lesson: always run the schema design linter and check Key Visualizer in staging before going to production.
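A sketch of what the re-keyed table could look like, using the google-cloud-spanner Python client; the exact column set and the instance and database IDs are illustrative:

```python
# Sketch: hash-sharded replacement for the hotspotting Events table.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("logging-instance").database("logs")

operation = database.update_ddl([
    # EventSource is appended to the key here purely for uniqueness.
    """CREATE TABLE EventsSharded (
         ShardId     INT64 NOT NULL,
         EventSource STRING(MAX) NOT NULL,
         EventTime   TIMESTAMP NOT NULL,
         Payload     JSON,
       ) PRIMARY KEY (ShardId, EventTime, EventSource)"""
])
operation.result(300)

# The writer fills ShardId with MOD(FARM_FINGERPRINT(EventSource), 100), so
# inserts land across ~100 key ranges instead of one; the backfill from the
# old table runs as a Dataflow pipeline.
```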
Scenario 2: Multi-region write latency blows past the SLOs. A fintech platform deploys a multi-region Spanner instance (nam-eur-asia1) with leaders in us-central1. European users hit 180ms write latency because every write has to cross the Atlantic for Paxos quorum with the US-based leader. During European business hours, European write traffic is 3x US traffic, but all of it travels to US-Central for leader coordination. The platform's 200ms SLO for payment confirmations leaves only 20ms of budget for application logic. Not great. In api/request_latencies segmented by method and region, the p50 write latency shows 180ms from Europe versus 8ms from the US. api/request_count by region confirms Europe is generating more write traffic. The quick fix: change the leader region during European business hours with ALTER DATABASE ... SET OPTIONS (default_leader = 'europe-west1'). Leader changes take effect within minutes. For a permanent fix, evaluate whether the eur3 multi-region configuration (all replicas in Europe) is sufficient. Or, if the workload can tolerate it, implement application-level routing that directs writes to a region-local Spanner instance with async cross-region reconciliation.
Pros
- • External consistency (the strongest guarantee you can get). Linearizable reads and writes, globally.
- • Fully managed. You do not deal with replication, sharding, or failover at all.
- • Automatic split-based sharding with zero manual partitioning
- • 99.999% SLA for multi-region setups (under 5.3 minutes of downtime per year)
- • Real SQL support with schemas, secondary indexes, and interleaved tables
Cons
- • Hard vendor lock-in to Google Cloud. No on-prem, no multi-cloud.
- • Expensive. Minimum $0.90/hour per node (~$650/month) before storage and network.
- • Custom SQL dialect. The PostgreSQL interface exists but has real limitations.
- • Write latency goes up for multi-region instances because Paxos has to cross continents
- • No stored procedures, no triggers, no user-defined functions
When to use
- • You are already on Google Cloud and need globally consistent transactions
- • Regulations or business rules require the absolute strongest consistency guarantees
- • You need a managed database that scales from 1 node to thousands without rearchitecting
- • Your workload needs multi-region writes without building conflict handling yourself (synchronous Paxos replication prevents conflicts rather than resolving them after the fact)
When NOT to use
- • You need multi-cloud or on-prem deployment flexibility
- • You are budget-constrained. Spanner's minimum cost is overkill for smaller workloads.
- • You depend on the full PostgreSQL extension ecosystem or stored procedures
- • Your workload is read-heavy and you can tolerate eventual consistency (much cheaper options exist)
Key Points
- • TrueTime uses atomic clocks and GPS receivers in every Google data center to bound clock uncertainty (typically under 7ms). Spanner waits out that uncertainty before making a commit visible, which provides external consistency without requiring perfectly synchronized clocks.
- • External consistency is actually stronger than linearizability. If transaction T1 commits before T2 starts (real wall-clock time, anywhere on earth), T1's commit timestamp is guaranteed to be lower than T2's. The database behaves as if every transaction ran sequentially on one machine.
- • Data lives in splits (key ranges) managed automatically. Each split runs its own Paxos group across zones. Spanner splits hot or large key ranges and merges cold, small ones without any manual intervention.
- • Interleaved tables physically co-locate parent and child rows in the same split. A Customer row and all its Orders sit together on disk, so parent-child joins are basically free. No cross-split coordination needed.
- • Change Streams capture row-level mutations in commit-timestamp order. They can be piped to BigQuery, Pub/Sub, or external systems for CDC patterns and event-driven flows without polling.
- • Stale reads with bounded staleness (e.g., 'give me data up to 15 seconds old') get served by any Paxos replica in any zone, no leader contact required. This drops read latency dramatically for workloads that can tolerate slight staleness.
Common Mistakes
- ✗ Using monotonically increasing primary keys (timestamps, auto-increment). All inserts hit the same split, creating a hotspot. Use UUIDv4, bit-reversed sequences, or hash-prefixed keys instead.
- ✗ Skipping interleaved tables for parent-child relationships. Without interleaving, a Customer and their Orders can end up in different splits, turning a simple join into a distributed cross-split transaction.
- ✗ Over-provisioning nodes for peak traffic instead of using the autoscaler. Spanner supports processing-unit-based autoscaling that adjusts within minutes. Static provisioning for peak wastes serious money.
- ✗ Running full table scans on large tables. Spanner's distributed nature makes full scans expensive because every split has to participate. Design secondary indexes for every query pattern, and check the Query Statistics dashboard for unoptimized queries.
- ✗ Not configuring leader placement for multi-region instances. By default, Paxos leaders can land in any region. For write-heavy workloads, pin the leader region near the primary write source to cut commit latency.