FoundationDB
The ordered, transactional key-value store that Apple trusts with iCloud
Why It Exists
At some point, every distributed system team asks the same question: how do we get ACID transactions across shards without tanking performance? Most systems dodge this. They go eventual consistency, or limit transactions to a single shard, or skip transactions altogether.
FoundationDB actually solves it. Serializable ACID transactions across the entire keyspace, automatic sharding, and it scales. The design is clever: optimistic concurrency control with a hard 5-second transaction window, a dedicated conflict detection layer (Resolvers), and a write-ahead log (TLogs) that decouples commit durability from how fast storage servers can apply mutations.
Apple bought FoundationDB in 2015 and built iCloud's metadata layer on top of it. Hundreds of millions of users, billions of keys, strong consistency. They open-sourced it in 2018. Snowflake runs their metadata catalog on it. Apple's Record Layer adds structured records, indexes, and query planning on top of the raw key-value API.
How It Works Internally
A transaction follows this path:
- Begin: Client gets a read version (timestamp) from a proxy. All reads in this transaction see a consistent snapshot at this version.
- Read: Client reads keys directly from storage servers. Each storage server owns a sorted key range and serves reads from its local B-tree.
- Write: Client buffers writes locally. Nothing hits the cluster until commit.
- Commit: Client sends the full read set (keys read) and write set (keys + values to write) to a proxy.
- Conflict check: The proxy forwards to a Resolver, which checks whether any key in the read set was modified by another committed transaction after the read version. If yes, abort (client retries). If no, proceed.
- Durability: The proxy writes mutations to a quorum of Transaction Logs (TLogs). Once a quorum acknowledges, the transaction is committed and durable.
- Storage: Storage servers asynchronously pull committed mutations from TLogs, apply them to their local B-trees, and become readable at the new version.
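The steps above can be sketched as a toy model. This is pure Python simulating the optimistic commit protocol, not the real fdb client API; the `Resolver` and `Cluster` classes and their method names are illustrative only:

```python
# Toy model of FoundationDB's optimistic commit path (illustrative,
# not the real client API or wire protocol).

class Resolver:
    """Tracks recent commits and detects read-write conflicts."""
    def __init__(self):
        self.commit_history = []  # list of (commit_version, write_key_set)

    def check(self, read_version, read_set):
        # Conflict if any key in the read set was written by a
        # transaction that committed AFTER our read snapshot.
        for commit_version, write_set in self.commit_history:
            if commit_version > read_version and read_set & write_set:
                return False  # stale read -> abort
        return True

class Cluster:
    def __init__(self):
        self.version = 0
        self.resolver = Resolver()
        self.storage = {}  # applied asynchronously by storage servers in reality

    def get_read_version(self):
        return self.version  # step 1: consistent snapshot version

    def commit(self, read_version, read_set, writes):
        # Steps 4-5: send read/write sets, resolver checks conflicts.
        if not self.resolver.check(read_version, set(read_set)):
            return None  # aborted; the client retries the transaction
        # Step 6: quorum TLog write, modeled as a version bump.
        self.version += 1
        self.resolver.commit_history.append((self.version, set(writes)))
        self.storage.update(writes)  # step 7, simplified to synchronous
        return self.version

cluster = Cluster()

# Transactions A and B both read 'balance' at the same snapshot.
rv_a = cluster.get_read_version()
rv_b = cluster.get_read_version()

# A commits first: no conflicting history, so it succeeds.
assert cluster.commit(rv_a, {"balance"}, {"balance": b"90"}) is not None

# B read 'balance' before A's commit, so the resolver aborts it.
assert cluster.commit(rv_b, {"balance"}, {"balance": b"80"}) is None
```

Note how the abort carries no locks: B simply retries with a fresh read version, which is why short transactions keep the conflict window small.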
Here's what makes this fast: the commit path (steps 4-6) doesn't wait for storage servers. It only waits for TLogs. Commits take ~5ms in a well-tuned cluster regardless of what storage servers are doing. They catch up in the background.
Reads go directly to storage servers, which serve from their local sorted B-trees. Because keys are ordered, scanning from /metadata/bucket-42/a to /metadata/bucket-42/z is a sequential B-tree read, which is exactly the access pattern needed for prefix listing in object storage metadata.
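A quick sketch of why this is cheap, assuming a storage server's sorted B-tree modeled as a sorted list: a prefix listing becomes one contiguous range read from the prefix to the first key past it (the same idea as the fdb client's `strinc` helper, reimplemented here for illustration):

```python
# Sketch: prefix listing as a single contiguous range read over
# an ordered keyspace. The sorted list + bisect stands in for a
# storage server's sorted B-tree.
import bisect

def strinc(prefix: bytes) -> bytes:
    """First key strictly greater than every key starting with `prefix`
    (same idea as the fdb client's strinc helper)."""
    prefix = prefix.rstrip(b"\xff")  # 0xff bytes cannot be incremented
    return prefix[:-1] + bytes([prefix[-1] + 1])

keys = sorted([
    b"/metadata/bucket-41/zzz",
    b"/metadata/bucket-42/apple",
    b"/metadata/bucket-42/pear",
    b"/metadata/bucket-43/aaa",
])

def range_scan(keys, begin, end):
    # Two binary searches, then one sequential slice -- no full scan.
    lo = bisect.bisect_left(keys, begin)
    hi = bisect.bisect_left(keys, end)
    return keys[lo:hi]

prefix = b"/metadata/bucket-42/"
listing = range_scan(keys, prefix, strinc(prefix))
assert listing == [b"/metadata/bucket-42/apple", b"/metadata/bucket-42/pear"]
```

Keys from other buckets never enter the scan, which is what makes prefix listing cost proportional to the result size rather than the keyspace size.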
Production Architecture
A production cluster has four roles:
- Coordinators (3 or 5): Paxos-based. Store cluster metadata like who the current transaction system is. These are the root of trust. Give them dedicated machines with local SSDs.
- Proxies: Accept client transactions, hand out read versions, forward commits to Resolvers and TLogs. Stateless. Scale horizontally.
- Transaction Logs (TLogs): Durable write-ahead log for committed mutations. Run on fast SSDs. A quorum (3 of 5, typically) must ack each commit.
- Storage Servers: Own key ranges, serve reads, pull mutations from TLogs. Need more capacity? Add servers. FoundationDB splits and migrates key ranges automatically.
For an object storage metadata layer storing 136 PB across 1.36M partitions, FoundationDB's ordered keyspace maps directly: partition keys are sorted, range scans for prefix listing are efficient, and cross-partition transactions (e.g., atomic bucket operations) work without application-level 2PC.
Cluster sizing guidance:
| Scale | Coordinators | Proxies | TLogs | Storage Servers |
|---|---|---|---|---|
| < 1TB metadata | 3 | 2-4 | 3 | 3-10 |
| 1-10TB | 5 | 4-8 | 5 | 10-50 |
| 10-100TB | 5 | 8-16 | 5-7 | 50-200 |
| > 100TB | 5 | 16+ | 7+ | 200+ |
Decision Criteria
| Criteria | FoundationDB | etcd | CockroachDB | TiKV |
|---|---|---|---|---|
| Transactions | Serializable, cross-shard | Single-key or mini-txn | Serializable SQL | Distributed txn |
| Data model | Ordered KV (layers add structure) | Flat KV | SQL tables | Ordered KV |
| Max data size | Petabytes | 8 GB | Terabytes | Terabytes |
| Range scans | Native, efficient | Prefix-based | SQL queries | Native |
| Write throughput | Millions/sec at scale | ~10K-30K/sec | ~50K-100K/sec | ~100K-200K/sec |
| Consistency | Serializable | Linearizable | Serializable | Snapshot isolation |
| Query language | None (client API) | KV API | SQL | KV API |
| Simulation testing | Yes (deterministic) | No | No | No |
Capacity Planning
Write throughput: A 20-node cluster (5 proxies, 5 TLogs, 10 storage servers) handles ~200,000 writes/sec with sub-10ms commit latency. Throughput scales linearly by adding proxies and TLogs.
Read throughput: Storage servers handle ~50,000 reads/sec each for point lookups, ~10,000/sec for range scans (depends on scan size). Add storage servers to scale reads.
Storage: Each storage server uses a modified SQLite engine. Plan for 2x raw data size to account for B-tree overhead, write amplification, and compaction. 100TB of logical metadata ≈ 200TB of physical storage across storage servers.
Commit latency: Dominated by TLog fsync. With NVMe SSDs: p50 ~2ms, p99 ~8ms. With network-attached storage: p50 ~5ms, p99 ~25ms. Use local SSDs for TLogs.
Transaction limits: 5-second duration, 10MB read/write set. Design operations to fit within these bounds. For bulk operations, use multiple transactions with continuation tokens.
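One way to stay inside those bounds is continuation-token batching: each iteration is one bounded transaction, and the token records where the next one resumes. A minimal sketch, where `load_batch` stands in for a real bounded range read (the names and the in-memory `DATA` store are illustrative):

```python
# Sketch: processing a large keyspace in bounded batches so each
# "transaction" stays under the 5-second / 10 MB limits.

DATA = {f"key-{i:06d}".encode(): b"v" for i in range(2500)}
SORTED_KEYS = sorted(DATA)

def load_batch(begin: bytes, limit: int):
    """Return up to `limit` keys >= begin, plus a continuation token."""
    batch = [k for k in SORTED_KEYS if k >= begin][:limit]
    # Token = first key AFTER the batch (exclusive resume point).
    # A short final batch means the keyspace is exhausted.
    token = batch[-1] + b"\x00" if len(batch) == limit else None
    return batch, token

processed = 0
token = b""  # start of the keyspace
while token is not None:
    # Each iteration would be its own transaction in a real cluster.
    batch, token = load_batch(token, limit=1000)
    processed += len(batch)

assert processed == 2500
```

Because the token is just a key, a crashed batch job can resume from its last checkpoint instead of restarting the whole scan.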
Failure Scenarios
Scenario 1: TLog Failure During Commit
Trigger: One TLog node crashes (disk failure, OOM, kernel panic) while transactions are in flight.
Impact: Commits that were waiting for that TLog's acknowledgment may need to retry. If a quorum of TLogs (e.g., 3 of 5) already acknowledged, those transactions are committed. If the quorum is now impossible (e.g., 2 of 4 remaining when quorum is 3), the transaction system recruits a new TLog from available machines, recovers uncommitted mutations from surviving TLogs, and resumes. Client-visible impact: 1-5 seconds of elevated commit latency during recovery.
Detection: Monitor TLogQueueDiskAvailableSpace, Latency.CommitLatency, and process uptime metrics. Alert on TLog process restarts.
Recovery: Automatic. The recovery coordinator detects the failure, recruits a replacement TLog, and restores the transaction system. Committed transactions are safe on the surviving quorum. Uncommitted ones get aborted and clients retry.
Scenario 2: Storage Server Falls Behind TLogs
Trigger: A storage server is slow (disk degradation, compaction storm, overloaded CPU). It falls behind in pulling mutations from TLogs. The TLog mutation backlog grows.
Impact: The lagging storage server cannot serve reads at recent versions for data it has not yet applied; clients reading the affected key range get errors and retry against other replicas. If the lag exceeds the TLog retention window, the storage server must be rebuilt from a peer.
Detection: Monitor StorageQueueDiskAvailableSpace and StorageLag. Alert when storage lag exceeds 5 seconds. Dashboard DataDistributionQueueSize for rebalancing pressure.
Recovery: If the lag is temporary (compaction finished), the server catches up automatically. If persistent, FoundationDB's data distributor moves the key range to a healthier server. The lagging server is drained and can be recycled.
Scenario 3: Coordinator Quorum Loss
Trigger: A majority of coordinators become unreachable (e.g., 2 of 3 in the same rack, rack power failure).
Impact: The entire cluster stops. Coordinators are the root of trust. Without quorum, no new transaction system gets elected, no proxies get recruited, and every client operation fails.
Detection: All client operations return errors. Monitor coordinator process health and network reachability separately from the rest of the cluster.
Recovery: Get coordinator quorum back. If the machines are gone for good, use fdbcli to swap the coordinator set to surviving machines plus fresh replacements. This is the single most dangerous failure mode. It's why coordinators need to be on separate failure domains (different racks, ideally different AZs) and dedicated hardware.
Pros
- Serializable ACID transactions across the entire keyspace
- Ordered keys with efficient range scans
- Extremely high write throughput (millions of writes/sec at scale)
- Simulation testing framework catches bugs before production
- Multi-tenant isolation at the key prefix level
- Automatic sharding and rebalancing
Cons
- 5-second transaction time limit, so long-running transactions must be broken up
- Value size limit of 100KB, so large blobs must be chunked
- Key size limit of 10KB
- Operational complexity at scale (coordinators, storage servers, proxies)
- Smaller community than PostgreSQL or Cassandra
- Limited built-in query language (requires layers or client logic)
When to use
- Need ordered key-value store with ACID transactions
- Building infrastructure that requires strong consistency at scale (metadata stores, control planes)
- Want to build custom data models on a reliable foundation
- Need cross-shard transactions without 2PC complexity
When NOT to use
- Simple CRUD applications (PostgreSQL is easier)
- Analytics workloads (use ClickHouse or a columnar store)
- Need values larger than 100KB without chunking
- Small team without distributed systems experience
- Need a full SQL query engine out of the box
Key Points
- Transactions use optimistic concurrency control with a 5-second window. Reads snapshot at transaction start, writes buffer client-side, and commit sends the read/write set to proxies for conflict detection against concurrent commits.
- The Resolver is the serialization point. It maintains a recent commit history and rejects transactions whose read set overlaps with another transaction's write set committed after the read snapshot. No locks, just pure conflict detection.
- Transaction Logs (TLogs) are the durable write-ahead log. Writes are committed when a quorum of TLogs acknowledges. Storage servers pull mutations from TLogs asynchronously, making the commit path independent of storage server speed.
- Storage servers are key-range-sharded, sorted B-trees backed by SQLite (modified). Each server owns a contiguous key range and serves reads directly. The sorted order is what makes range scans efficient.
- The simulation testing framework (deterministic simulation) runs the entire database in a single-threaded simulator, injecting every possible failure (disk, network, process crash) deterministically. This is why FoundationDB rarely has correctness bugs in production.
- The 5-second transaction limit is deliberate. Short transactions keep conflict windows small and prevent long-running transactions from blocking the resolver's commit history buffer.
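The value of deterministic simulation is that a failing run can be replayed bit-for-bit. A vastly simplified toy, where a single seeded RNG drives both the workload and the fault injection (nothing here resembles FoundationDB's actual simulator internals):

```python
# Toy illustration of deterministic simulation: with a fixed seed,
# the same "crashes" fire at the same points, so any bug the run
# exposes can be replayed exactly. Vastly simplified.
import random

def simulate(seed: int):
    rng = random.Random(seed)   # the single source of nondeterminism
    trace = []
    store = {}
    for op in range(20):
        if rng.random() < 0.2:          # inject a "process crash"
            trace.append(("crash", op))
            store.clear()               # lose volatile state
        else:
            store[op] = "v"
            trace.append(("write", op))
    return trace

# Same seed -> identical event trace, so failures reproduce on demand.
assert simulate(42) == simulate(42)
```

Sweeping seeds then explores many distinct failure interleavings, each one individually replayable, which is how simulation testing turns rare race conditions into deterministic regression tests.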
Common Mistakes
- ✗ Writing transactions that exceed the 5-second limit. Batch large operations into smaller transactions (e.g., 1000 key writes per transaction, not 1 million).
- ✗ Storing values larger than 10KB without thinking about read amplification. Large values bloat the storage engine and slow range scans.
- ✗ Using random key distribution when range scans are needed. FoundationDB's strength is ordered keys, so design key prefixes for the actual access patterns.
- ✗ Not using the Directory Layer or Tuple Layer for key management. This leads to key encoding bugs and collisions.
- ✗ Running coordinators on the same machines as storage servers in production. Coordinator availability is the single most critical factor. Give them dedicated machines.
- ✗ Ignoring the transaction byte limit (10MB). Transactions that read or write too many bytes will fail. Use versionstamp-indexed queues for large batch operations.
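The key-encoding mistake is easy to see in miniature. Naive string concatenation can map distinct logical keys to the same byte string; a tuple-style encoding keeps them distinct and order-preserving. The `pack` helper below is an illustrative simplification, not the real Tuple Layer format (which uses typed, order-preserving byte codes):

```python
# Why tuple-style key encoding matters. Naive concatenation collides:

def naive_key(*parts):
    return "".join(parts).encode()

# Two different logical keys, one physical key -> collision.
assert naive_key("users", "1", "2") == naive_key("users", "12")

def pack(*parts):
    """Terminate each element with 0x00, escaping embedded 0x00 bytes
    (illustrative sketch, not fdb's actual Tuple Layer encoding)."""
    out = b""
    for p in parts:
        out += p.encode().replace(b"\x00", b"\x00\xff") + b"\x00"
    return out

# Tuple encoding keeps distinct tuples distinct...
assert pack("users", "1", "2") != pack("users", "12")
# ...and preserves element-wise ordering, so prefix range scans still work.
assert pack("users", "1") < pack("users", "1", "2") < pack("users", "2")
```

The ordering property in the last assertion is the important one: every key under the logical prefix ("users", "1") sorts contiguously, so the Directory/Tuple Layers give you collision-free keys without giving up efficient range scans.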