Google Cloud Spanner
The database that uses atomic clocks to solve distributed consistency
Most distributed databases force a choice: strong consistency or horizontal scale. Spanner is the rare system that actually delivers both, and the trick behind it is wonderfully weird. Google put atomic clocks and GPS receivers in every data center, built a time API around them, and used bounded clock uncertainty to make globally consistent transactions possible. It has been running internally at Google since before it became a cloud product in 2017, powering AdWords, Google Play, and Photos at scale.
Is it the right choice for a given system? Maybe. Spanner's 99.999% multi-region SLA (under 5.3 minutes of downtime per year) is real, and the consistency guarantees are the strongest available in production. But it locks teams into Google Cloud, costs real money, and comes with design constraints that must be respected.
How It Works Internally
Three ideas make Spanner work: TrueTime, Paxos-based replication, and automatic split management. They are deeply connected, and understanding one without the others gives an incomplete picture.
TrueTime is the clever part. Every Google data center has a mix of atomic clocks (cesium and rubidium) and GPS receivers that cross-check each other. The TrueTime API does not return a single timestamp. It returns an interval [earliest, latest] representing the true current time with bounded uncertainty, typically under 7ms. When a transaction commits, Spanner assigns a commit timestamp and then waits for that uncertainty window to pass before making the data visible. This "commit wait" is the core insight. By waiting out the clock uncertainty, Spanner guarantees that any transaction starting after the wait period will see the committed data. The result is external consistency without needing perfectly synchronized clocks. The wait adds roughly 7ms of latency per commit. Compare that to the hundreds of milliseconds that software-based clock sync like NTP would burn.
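To make the commit-wait rule concrete, here is a toy Python sketch. Every name in it (TTInterval, tt_now, commit) is invented for illustration; the real TrueTime API and the Paxos machinery are internal to Spanner, so this only models the waiting logic described above.

```python
# Toy model of TrueTime and commit wait (illustration only, not a real API).
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float  # lower bound on the true time, in seconds
    latest: float    # upper bound on the true time, in seconds

def tt_now(epsilon: float = 0.007) -> TTInterval:
    """Hypothetical TrueTime read with ~7 ms of bounded uncertainty."""
    t = time.time()
    return TTInterval(t - epsilon, t + epsilon)

def commit(apply_mutations) -> float:
    """Pick a commit timestamp, apply the write, then wait out the uncertainty."""
    ts = tt_now().latest     # timestamp is no earlier than the true current time
    apply_mutations(ts)      # replicate via Paxos (elided in this sketch)
    # Commit wait: block until the timestamp is definitely in the past, i.e.
    # until tt_now().earliest > ts. Any transaction that starts after this
    # point is guaranteed a strictly larger timestamp, which is what gives
    # external consistency.
    while tt_now().earliest <= ts:
        time.sleep(0.001)
    return ts
```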
Paxos replication handles fault tolerance. Each split (a contiguous range of rows sorted by primary key) gets replicated across zones using its own independent Paxos group. Multi-region configs typically use 5 replicas; single-region uses 3. The Paxos leader for each split handles all writes and strong reads. Writes need a quorum (majority of replicas) before committing. When a transaction touches multiple splits, Spanner runs a two-phase commit protocol, coordinated by one of the participating Paxos leaders.
Colossus (Google's successor to GFS) handles storage. Each Paxos replica stores data as SSTables on Colossus, which runs its own replication and fault tolerance at the file system level. This is an important distinction: Paxos replication is for consistency (all replicas agree on what happened), while Colossus replication is for durability (data survives disk failures). Two layers of protection, each doing a different job.
Schema Design
Spanner schema design is not like regular relational database design. Two things will cause problems if ignored.
Primary key design controls data distribution. Spanner partitions rows by primary key into splits. Sequential keys (timestamps, auto-increment integers) mean every new write slams into the last split. Guaranteed hotspot. I have seen teams discover this the hard way in production. Use UUIDv4 for uniform distribution, or prefix keys with a hash of a high-cardinality attribute like hash(customer_id) || customer_id || order_id.
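A minimal client-side sketch of both strategies, assuming a hypothetical Orders table keyed on (ShardId, CustomerId, OrderId); hashlib stands in here for GoogleSQL's server-side FARM_FINGERPRINT, and the column names are placeholders.

```python
# Client-side key construction that avoids sequential-key hotspots.
import hashlib
import uuid

def shard_id(customer_id: str, shards: int = 100) -> int:
    """Deterministic shard prefix derived from a high-cardinality attribute."""
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % shards

def new_order_key(customer_id: str) -> tuple[int, str, str]:
    """Primary key (ShardId, CustomerId, OrderId); writes spread across shards."""
    return (shard_id(customer_id), customer_id, uuid.uuid4().hex)

print(new_order_key("cust-42"))  # e.g. (<shard>, 'cust-42', '<uuid hex>')
```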
Interleaved tables are the killer feature most people underuse. Declaring a child table as INTERLEAVE IN PARENT physically stores child rows right next to their parent row in the same split. A Customers table interleaved with Orders means a customer's row and all their orders live together. A join between them costs about the same as reading a single row. Skip interleaving, and that customer and their orders might land on different machines in different splits. Now that simple join is a distributed transaction. Huge difference.
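Here is roughly what that interleaved schema looks like, sketched with the google-cloud-spanner Python client; the Customers and Orders columns and the instance and database IDs are placeholders.

```python
# Sketch: parent table plus a child table interleaved into it.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

operation = database.update_ddl([
    """CREATE TABLE Customers (
         CustomerId STRING(36) NOT NULL,
         Name       STRING(MAX),
       ) PRIMARY KEY (CustomerId)""",
    # Child rows are stored physically next to their parent row in the same split.
    """CREATE TABLE Orders (
         CustomerId STRING(36) NOT NULL,
         OrderId    STRING(36) NOT NULL,
         Total      NUMERIC,
       ) PRIMARY KEY (CustomerId, OrderId),
       INTERLEAVE IN PARENT Customers ON DELETE CASCADE""",
])
operation.result(300)  # wait for the schema change to finish
```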
Stale reads are the key to read scaling. Strong reads (the default) hit the Paxos leader and return the latest committed data. Bounded-staleness reads (e.g., MAX_STALENESS 15s) can be served by any replica in any zone, whichever is closest, because they only need data up to 15 seconds old. For dashboards, recommendation engines, and other read-heavy global workloads where slight staleness is fine, this drops p50 read latency from 50-100ms (cross-region round-trip to the leader) down to 1-5ms (local replica). That is a massive win, and the only price is staleness the workload already tolerates.
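A sketch of a bounded-staleness read with the Python client, assuming placeholder instance, database, and table names:

```python
# Sketch: single-use bounded-staleness read served by the nearest replica.
import datetime
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

# Any nearby replica that is caught up to "now - 15s" can serve this read,
# so no round-trip to the Paxos leader is required.
with database.snapshot(max_staleness=datetime.timedelta(seconds=15)) as snapshot:
    rows = snapshot.execute_sql(
        "SELECT CustomerId, Name FROM Customers LIMIT 10"
    )
    for row in rows:
        print(row)
```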
Production Architecture
Two instance configurations, and picking the right one matters a lot.
Regional instances keep all replicas within one Google Cloud region (e.g., us-central1) spread across 3 zones. Write latency sits around 5-10ms. Availability is 99.99%. If an entire zone fails, the remaining two still have quorum and serve traffic with zero data loss and near-zero downtime. For most workloads, this is the right starting point.
Multi-region instances distribute replicas across multiple regions (e.g., nam-eur-asia1 spans North America, Europe, and Asia). The result is 99.999% availability, enough to survive an entire region going offline. The cost: write latency jumps to 50-200ms because Paxos quorum now requires cross-region round-trips. Use leader placement configuration to pin Paxos leaders in the region closest to the primary write traffic.
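Leader placement is a database option set through DDL. A sketch, assuming a database named payments; the region value would be whichever read-write region sits closest to the write traffic:

```python
# Sketch: pin the default Paxos leader region for a multi-region database.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("payments-instance").database("payments")

operation = database.update_ddl([
    "ALTER DATABASE payments SET OPTIONS (default_leader = 'europe-west1')"
])
operation.result(300)  # leader changes take effect within minutes
```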
Change Streams (Spanner's CDC feature) capture row-level changes in commit-timestamp order. Connect them to Pub/Sub for event-driven architectures, BigQuery for real-time analytics, or Dataflow for stream processing. They add minimal overhead because they read from the Paxos replication log directly rather than polling tables.
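Creating a change stream is also just DDL. A sketch that captures every table in a placeholder database (the stream name is arbitrary):

```python
# Sketch: create a change stream covering the whole database.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

operation = database.update_ddl([
    # Scope it to specific tables instead with: CREATE CHANGE STREAM ... FOR Orders
    "CREATE CHANGE STREAM AllChanges FOR ALL"
])
operation.result(300)  # wait for the schema change to finish
```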
Use the Query Statistics dashboard. Seriously. It automatically surfaces the top queries by CPU consumption and latency, along with missing indexes and lock contention that would otherwise go unnoticed. Use Query Optimizer versioning to test new optimizer versions in staging before rolling them out to production.
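The same statistics are also queryable in SQL through the built-in SPANNER_SYS views, which is handy for automation. A sketch, assuming the QUERY_STATS_TOP_HOUR view with its text, execution_count, avg_cpu_seconds, and avg_latency_seconds columns:

```python
# Sketch: pull the top CPU-consuming queries from the built-in statistics views.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("my-instance").database("my-db")

with database.snapshot() as snapshot:
    rows = snapshot.execute_sql("""
        SELECT text, execution_count, avg_cpu_seconds, avg_latency_seconds
        FROM SPANNER_SYS.QUERY_STATS_TOP_HOUR
        ORDER BY avg_cpu_seconds DESC
        LIMIT 10
    """)
    for row in rows:
        print(row)
```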
Capacity Planning
Spanner capacity is measured in processing units (PU); 1 node = 1,000 PU. Here is roughly what a single node provides:
| Metric | Single-Region | Multi-Region |
|---|---|---|
| Read throughput | 10,000 QPS (8 KB reads) | 10,000 QPS |
| Write throughput | 2,000 QPS (1 KB writes) | 2,000 QPS |
| Storage per node | 2 TB (recommended max) | 2 TB |
| Storage limit | 10 TB per node (hard) | 10 TB per node |
On cost: a 3-node regional instance runs about $2,000/month for nodes at the $0.90/node-hour rate, plus $0.30/GB/month for storage. Multi-region instances cost 3x the node rate. If the workload is below 1,000 PU, use granular instance sizing (minimum 100 PU at $0.09/hour) to avoid paying for a full node that is not needed.
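A back-of-the-envelope sizing helper built from the per-node figures in the table above; the workload numbers in the example are made up:

```python
# Sketch: size an instance by whichever dimension is the bottleneck.
import math

READ_QPS_PER_NODE = 10_000   # 8 KB reads
WRITE_QPS_PER_NODE = 2_000   # 1 KB writes
STORAGE_TB_PER_NODE = 2      # recommended ceiling per node

def nodes_needed(read_qps: float, write_qps: float, storage_tb: float) -> int:
    """Take the max across read, write, and storage requirements, rounded up."""
    return math.ceil(max(
        read_qps / READ_QPS_PER_NODE,
        write_qps / WRITE_QPS_PER_NODE,
        storage_tb / STORAGE_TB_PER_NODE,
    ))

# Example: 25k reads/s, 3k writes/s, 5 TB of data -> 3 nodes (3,000 PU),
# before adding CPU headroom (see the 65% utilization target below).
print(nodes_needed(25_000, 3_000, 5))
```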
Autoscaling (GA) adjusts processing units between a configured min and max based on CPU utilization targets (default 65%). Scale-up takes 5-10 minutes. Scale-down is more conservative, with a 30-minute cooldown. For predictable traffic patterns, set the autoscaler minimum to the baseline PU count to avoid cold-start latency spikes.
Keep an eye on three metrics: cpu/utilization (keep under 65% for headroom), api/received_bytes_count for traffic patterns, and lock_stat/total/lock_wait_time for contention. High lock wait times point to hot rows or transactions that are too broad.
Failure Scenarios
Scenario 1: The sequential key hotspot. A team builds a logging service and defines their table as CREATE TABLE Events (event_time TIMESTAMP, ...) PRIMARY KEY (event_time). Looks reasonable. Every insert targets the most recent split because event_time keeps going up. That single split's Paxos leader saturates at around 2,000 writes/sec while the rest of the cluster sits idle. Spanner tries to help by automatically splitting the hot range, but each new split immediately becomes the hot one. In the Cloud Console, one split shows 100% CPU while api/request_latencies p99 climbs from 10ms to 500ms. The Key Visualizer tool shows a bright diagonal line: all traffic concentrated on the latest key range. split_load metrics show extreme imbalance. The fix: redesign the primary key to (ShardId, event_time), where ShardId = MOD(FARM_FINGERPRINT(event_source), 100). Now writes spread across 100 key ranges. For the existing table, create a new table with the correct key design (sketched below) and backfill using a Dataflow pipeline. There is no in-place fix. The lesson: always run the schema design linter and check Key Visualizer in staging before going to production.
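A sketch of what the re-keyed table could look like, using the google-cloud-spanner Python client; the exact column set and the instance and database IDs are illustrative:

```python
# Sketch: hash-sharded replacement for the hotspotting Events table.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("logging-instance").database("logs")

operation = database.update_ddl([
    # EventSource is appended to the key here purely for uniqueness.
    """CREATE TABLE EventsSharded (
         ShardId     INT64 NOT NULL,
         EventSource STRING(MAX) NOT NULL,
         EventTime   TIMESTAMP NOT NULL,
         Payload     JSON,
       ) PRIMARY KEY (ShardId, EventTime, EventSource)"""
])
operation.result(300)

# The writer fills ShardId with MOD(FARM_FINGERPRINT(EventSource), 100), so
# inserts land across ~100 key ranges instead of one; the backfill from the
# old table runs as a Dataflow pipeline.
```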
Scenario 2: Multi-region write latency blows past the SLOs. A fintech platform deploys a multi-region Spanner instance (nam-eur-asia1) with leaders in us-central1. European users hit 180ms write latency because every write has to cross the Atlantic for Paxos quorum with the US-based leader. During European business hours, European write traffic is 3x US traffic, but all of it travels to US-Central for leader coordination. The platform's 200ms SLO for payment confirmations leaves only 20ms of budget for application logic. Not great. In api/request_latencies segmented by method and region, the p50 write latency shows 180ms from Europe versus 8ms from the US. api/request_count by region confirms Europe is generating more write traffic. The quick fix: change the leader region during European business hours with ALTER DATABASE ... SET OPTIONS (default_leader = 'europe-west1'). Leader changes take effect within minutes. For a permanent fix, evaluate whether the eur3 multi-region configuration (all replicas in Europe) is sufficient. Or, if the workload can tolerate it, implement application-level routing that directs writes to a region-local Spanner instance with async cross-region reconciliation.
Pros
- • External consistency (the strongest guarantee you can get). Linearizable reads and writes, globally.
- • Fully managed. You do not deal with replication, sharding, or failover at all.
- • Automatic split-based sharding with zero manual partitioning
- • 99.999% SLA for multi-region setups (under 5.3 minutes of downtime per year)
- • Real SQL support with schemas, secondary indexes, and interleaved tables
Cons
- • Hard vendor lock-in to Google Cloud. No on-prem, no multi-cloud.
- • Expensive. Minimum $0.90/hour per node (~$650/month) before storage and network.
- • Custom SQL dialect. The PostgreSQL interface exists but has real limitations.
- • Write latency goes up for multi-region instances because Paxos has to cross continents
- • No stored procedures, no triggers, no user-defined functions
When to use
- • You are already on Google Cloud and need globally consistent transactions
- • Regulations or business rules require the absolute strongest consistency guarantees
- • You need a managed database that scales from 1 node to thousands without rearchitecting
- • Your workload needs multi-region writes without building conflict handling yourself (synchronous Paxos replication prevents conflicts rather than resolving them after the fact)
When NOT to use
- • You need multi-cloud or on-prem deployment flexibility
- • You are budget-constrained. Spanner's minimum cost is overkill for smaller workloads.
- • You depend on the full PostgreSQL extension ecosystem or stored procedures
- • Your workload is read-heavy and you can tolerate eventual consistency (much cheaper options exist)
Key Points
- • TrueTime uses atomic clocks and GPS receivers in every Google data center to bound clock uncertainty (typically under 7ms). Spanner waits out that uncertainty before making a commit visible, which provides external consistency without requiring perfectly synchronized clocks.
- • External consistency is actually stronger than linearizability. If transaction T1 commits before T2 starts (real wall-clock time, anywhere on earth), T1's commit timestamp is guaranteed to be lower than T2's. The database behaves as if every transaction ran sequentially on one machine.
- • Data lives in splits (key ranges) managed automatically. Each split runs its own Paxos group across zones. Spanner splits hot or large key ranges and merges cold, small ones without any manual intervention.
- • Interleaved tables physically co-locate parent and child rows in the same split. A Customer row and all its Orders sit together on disk, so parent-child joins are basically free. No cross-split coordination needed.
- • Change Streams capture row-level mutations in commit-timestamp order. They can be piped to BigQuery, Pub/Sub, or external systems for CDC patterns and event-driven flows without polling.
- • Stale reads with bounded staleness (e.g., 'give me data up to 15 seconds old') get served by any Paxos replica in any zone, no leader contact required. This drops read latency dramatically for workloads that can tolerate slight staleness.
Common Mistakes
- ✗ Using monotonically increasing primary keys (timestamps, auto-increment). All inserts hit the same split, creating a hotspot. Use UUIDv4, bit-reversed sequences, or hash-prefixed keys instead.
- ✗ Skipping interleaved tables for parent-child relationships. Without interleaving, a Customer and their Orders can end up in different splits, turning a simple join into a distributed cross-split transaction.
- ✗ Over-provisioning nodes for peak traffic instead of using the autoscaler. Spanner supports processing-unit-based autoscaling that adjusts within minutes. Static provisioning for peak wastes serious money.
- ✗ Running full table scans on large tables. Spanner's distributed nature makes full scans expensive because every split has to participate. Design secondary indexes for every query pattern, and check the Query Statistics dashboard for unoptimized queries.
- ✗ Not configuring leader placement for multi-region instances. By default, Paxos leaders can land in any region. For write-heavy workloads, pin the leader region near the primary write source to cut commit latency.