FoundationDB
The ordered, transactional key-value store that Apple trusts with iCloud
Why It Exists
At some point, every distributed system team asks the same question: how do we get ACID transactions across shards without tanking performance? Most systems dodge this. They go eventual consistency, or limit transactions to a single shard, or skip transactions altogether.
FoundationDB actually solves it. Serializable ACID transactions across the entire keyspace, automatic sharding, and it scales. The design is clever: optimistic concurrency control with a hard 5-second transaction window, a dedicated conflict detection layer (Resolvers), and a write-ahead log (TLogs) that decouples commit durability from how fast storage servers can apply mutations.
Apple bought FoundationDB in 2015 and built iCloud's metadata layer on top of it. Hundreds of millions of users, billions of keys, strong consistency. They open-sourced it in 2018. Snowflake runs their metadata catalog on it. Apple's Record Layer adds structured records, indexes, and query planning on top of the raw key-value API.
How It Works Internally
A transaction follows this path:
- Begin: Client gets a read version (timestamp) from a proxy. All reads in this transaction see a consistent snapshot at this version.
- Read: Client reads keys directly from storage servers. Each storage server owns a sorted key range and serves reads from its local B-tree.
- Write: Client buffers writes locally. Nothing hits the cluster until commit.
- Commit: Client sends the full read set (keys read) and write set (keys + values to write) to a proxy.
- Conflict check: The proxy forwards to a Resolver, which checks whether any key in the read set was modified by another committed transaction after the read version. If yes, abort (client retries). If no, proceed.
- Durability: The proxy writes mutations to a quorum of Transaction Logs (TLogs). Once a quorum acknowledges, the transaction is committed and durable.
- Storage: Storage servers asynchronously pull committed mutations from TLogs, apply them to their local B-trees, and become readable at the new version.
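The steps above can be sketched as a toy model. This is pure Python simulating the optimistic commit protocol, not the real fdb client API; the `Resolver` and `Cluster` classes and their method names are illustrative only:

```python
# Toy model of FoundationDB's optimistic commit path (illustrative,
# not the real client API or wire protocol).

class Resolver:
    """Tracks recent commits and detects read-write conflicts."""
    def __init__(self):
        self.commit_history = []  # list of (commit_version, write_key_set)

    def check(self, read_version, read_set):
        # Conflict if any key in the read set was written by a
        # transaction that committed AFTER our read snapshot.
        for commit_version, write_set in self.commit_history:
            if commit_version > read_version and read_set & write_set:
                return False  # stale read -> abort
        return True

class Cluster:
    def __init__(self):
        self.version = 0
        self.resolver = Resolver()
        self.storage = {}  # applied asynchronously by storage servers in reality

    def get_read_version(self):
        return self.version  # step 1: consistent snapshot version

    def commit(self, read_version, read_set, writes):
        # Steps 4-5: send read/write sets, resolver checks conflicts.
        if not self.resolver.check(read_version, set(read_set)):
            return None  # aborted; the client retries the transaction
        # Step 6: quorum TLog write, modeled as a version bump.
        self.version += 1
        self.resolver.commit_history.append((self.version, set(writes)))
        self.storage.update(writes)  # step 7, simplified to synchronous
        return self.version

cluster = Cluster()

# Transactions A and B both read 'balance' at the same snapshot.
rv_a = cluster.get_read_version()
rv_b = cluster.get_read_version()

# A commits first: no conflicting history, so it succeeds.
assert cluster.commit(rv_a, {"balance"}, {"balance": b"90"}) is not None

# B read 'balance' before A's commit, so the resolver aborts it.
assert cluster.commit(rv_b, {"balance"}, {"balance": b"80"}) is None
```

Note how the abort carries no locks: B simply retries with a fresh read version, which is why short transactions keep the conflict window small.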
Here's what makes this fast: the commit path (steps 4-6) doesn't wait for storage servers. It only waits for TLogs. Commits take ~5ms in a well-tuned cluster regardless of what storage servers are doing. They catch up in the background.
Reads go directly to storage servers, which serve from their local sorted B-trees. Because keys are ordered, scanning from /metadata/bucket-42/a to /metadata/bucket-42/z is a sequential B-tree read, which is exactly the access pattern needed for prefix listing in object storage metadata.
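A quick sketch of why this is cheap, assuming a storage server's sorted B-tree modeled as a sorted list: a prefix listing becomes one contiguous range read from the prefix to the first key past it (the same idea as the fdb client's `strinc` helper, reimplemented here for illustration):

```python
# Sketch: prefix listing as a single contiguous range read over
# an ordered keyspace. The sorted list + bisect stands in for a
# storage server's sorted B-tree.
import bisect

def strinc(prefix: bytes) -> bytes:
    """First key strictly greater than every key starting with `prefix`
    (same idea as the fdb client's strinc helper)."""
    prefix = prefix.rstrip(b"\xff")  # 0xff bytes cannot be incremented
    return prefix[:-1] + bytes([prefix[-1] + 1])

keys = sorted([
    b"/metadata/bucket-41/zzz",
    b"/metadata/bucket-42/apple",
    b"/metadata/bucket-42/pear",
    b"/metadata/bucket-43/aaa",
])

def range_scan(keys, begin, end):
    # Two binary searches, then one sequential slice -- no full scan.
    lo = bisect.bisect_left(keys, begin)
    hi = bisect.bisect_left(keys, end)
    return keys[lo:hi]

prefix = b"/metadata/bucket-42/"
listing = range_scan(keys, prefix, strinc(prefix))
assert listing == [b"/metadata/bucket-42/apple", b"/metadata/bucket-42/pear"]
```

Keys from other buckets never enter the scan, which is what makes prefix listing cost proportional to the result size rather than the keyspace size.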
Production Architecture
A production cluster has four roles:
- Coordinators (3 or 5): Paxos-based. Store cluster metadata like who the current transaction system is. These are the root of trust. Give them dedicated machines with local SSDs.
- Proxies: Accept client transactions, hand out read versions, forward commits to Resolvers and TLogs. Stateless. Scale horizontally.
- Transaction Logs (TLogs): Durable write-ahead log for committed mutations. Run on fast SSDs. A quorum (3 of 5, typically) must ack each commit.
- Storage Servers: Own key ranges, serve reads, pull mutations from TLogs. Need more capacity? Add servers. FoundationDB splits and migrates key ranges automatically.
For an object storage metadata layer storing 136 PB across 1.36M partitions, FoundationDB's ordered keyspace maps directly: partition keys are sorted, range scans for prefix listing are efficient, and cross-partition transactions (e.g., atomic bucket operations) work without application-level 2PC.
Cluster sizing guidance:
| Scale | Coordinators | Proxies | TLogs | Storage Servers |
|---|---|---|---|---|
| < 1TB metadata | 3 | 2-4 | 3 | 3-10 |
| 1-10TB | 5 | 4-8 | 5 | 10-50 |
| 10-100TB | 5 | 8-16 | 5-7 | 50-200 |
| > 100TB | 5 | 16+ | 7+ | 200+ |
Decision Criteria
| Criteria | FoundationDB | etcd | CockroachDB | TiKV |
|---|---|---|---|---|
| Transactions | Serializable, cross-shard | Single-key or mini-txn | Serializable SQL | Distributed txn |
| Data model | Ordered KV (layers add structure) | Flat KV | SQL tables | Ordered KV |
| Max data size | Petabytes | 8 GB | Terabytes | Terabytes |
| Range scans | Native, efficient | Prefix-based | SQL queries | Native |
| Write throughput | Millions/sec at scale | ~10K-30K/sec | ~50K-100K/sec | ~100K-200K/sec |
| Consistency | Serializable | Linearizable | Serializable | Snapshot isolation |
| Query language | None (client API) | KV API | SQL | KV API |
| Simulation testing | Yes (deterministic) | No | No | No |
Capacity Planning
Write throughput: A 20-node cluster (5 proxies, 5 TLogs, 10 storage servers) handles ~200,000 writes/sec with sub-10ms commit latency. Throughput scales linearly by adding proxies and TLogs.
Read throughput: Storage servers handle ~50,000 reads/sec each for point lookups, ~10,000/sec for range scans (depends on scan size). Add storage servers to scale reads.
Storage: Each storage server uses a modified SQLite engine. Plan for 2x raw data size to account for B-tree overhead, write amplification, and compaction. 100TB of logical metadata ≈ 200TB of physical storage across storage servers.
Commit latency: Dominated by TLog fsync. With NVMe SSDs: p50 ~2ms, p99 ~8ms. With network-attached storage: p50 ~5ms, p99 ~25ms. Use local SSDs for TLogs.
Transaction limits: 5-second duration, 10MB read/write set. Design operations to fit within these bounds. For bulk operations, use multiple transactions with continuation tokens.
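One way to stay inside those bounds is continuation-token batching: each iteration is one bounded transaction, and the token records where the next one resumes. A minimal sketch, where `load_batch` stands in for a real bounded range read (the names and the in-memory `DATA` store are illustrative):

```python
# Sketch: processing a large keyspace in bounded batches so each
# "transaction" stays under the 5-second / 10 MB limits.

DATA = {f"key-{i:06d}".encode(): b"v" for i in range(2500)}
SORTED_KEYS = sorted(DATA)

def load_batch(begin: bytes, limit: int):
    """Return up to `limit` keys >= begin, plus a continuation token."""
    batch = [k for k in SORTED_KEYS if k >= begin][:limit]
    # Token = first key AFTER the batch (exclusive resume point).
    # A short final batch means the keyspace is exhausted.
    token = batch[-1] + b"\x00" if len(batch) == limit else None
    return batch, token

processed = 0
token = b""  # start of the keyspace
while token is not None:
    # Each iteration would be its own transaction in a real cluster.
    batch, token = load_batch(token, limit=1000)
    processed += len(batch)

assert processed == 2500
```

Because the token is just a key, a crashed batch job can resume from its last checkpoint instead of restarting the whole scan.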
Failure Scenarios
Scenario 1: TLog Failure During Commit
Trigger: One TLog node crashes (disk failure, OOM, kernel panic) while transactions are in flight.
Impact: Commits that were waiting for that TLog's acknowledgment may need to retry. If a quorum of TLogs (e.g., 3 of 5) already acknowledged, those transactions are committed. If the quorum is now impossible (e.g., 2 of 4 remaining when quorum is 3), the transaction system recruits a new TLog from available machines, recovers uncommitted mutations from surviving TLogs, and resumes. Client-visible impact: 1-5 seconds of elevated commit latency during recovery.
Detection: Monitor TLogQueueDiskAvailableSpace, Latency.CommitLatency, and process uptime metrics. Alert on TLog process restarts.
Recovery: Automatic. The recovery coordinator detects the failure, recruits a replacement TLog, and restores the transaction system. Committed transactions are safe on the surviving quorum. Uncommitted ones get aborted and clients retry.
Scenario 2: Storage Server Falls Behind TLogs
Trigger: A storage server is slow (disk degradation, compaction storm, overloaded CPU). It falls behind in pulling mutations from TLogs. The TLog mutation backlog grows.
Impact: The lagging storage server cannot serve reads at recent versions for data it has not yet applied; clients reading the affected key range get errors and retry against other replicas. If the lag exceeds the TLog retention window, the storage server must be rebuilt from a peer.
Detection: Monitor StorageQueueDiskAvailableSpace and StorageLag. Alert when storage lag exceeds 5 seconds. Dashboard DataDistributionQueueSize for rebalancing pressure.
Recovery: If the lag is temporary (compaction finished), the server catches up automatically. If persistent, FoundationDB's data distributor moves the key range to a healthier server. The lagging server is drained and can be recycled.
Scenario 3: Coordinator Quorum Loss
Trigger: A majority of coordinators become unreachable (e.g., 2 of 3 in the same rack, rack power failure).
Impact: The entire cluster stops. Coordinators are the root of trust. Without quorum, no new transaction system gets elected, no proxies get recruited, and every client operation fails.
Detection: All client operations return errors. Monitor coordinator process health and network reachability separately from the rest of the cluster.
Recovery: Get coordinator quorum back. If the machines are gone for good, use fdbcli to swap the coordinator set to surviving machines plus fresh replacements. This is the single most dangerous failure mode. It's why coordinators need to be on separate failure domains (different racks, ideally different AZs) and dedicated hardware.
Pros
- Serializable ACID transactions across the entire keyspace
- Ordered keys with efficient range scans
- Extremely high write throughput (millions of writes/sec at scale)
- Simulation testing framework catches bugs before production
- Multi-tenant isolation at the key prefix level
- Automatic sharding and rebalancing
Cons
- 5-second transaction time limit, so long-running transactions must be broken up
- Value size limit of 100KB, so large blobs must be chunked
- Key size limit of 10KB
- Operational complexity at scale (coordinators, storage servers, proxies)
- Smaller community than PostgreSQL or Cassandra
- Limited built-in query language (requires layers or client logic)
When to use
- Need ordered key-value store with ACID transactions
- Building infrastructure that requires strong consistency at scale (metadata stores, control planes)
- Want to build custom data models on a reliable foundation
- Need cross-shard transactions without 2PC complexity
When NOT to use
- Simple CRUD applications (PostgreSQL is easier)
- Analytics workloads (use ClickHouse or a columnar store)
- Need values larger than 100KB without chunking
- Small team without distributed systems experience
- Need a full SQL query engine out of the box
Key Points
- Transactions use optimistic concurrency control with a 5-second window. Reads snapshot at transaction start, writes buffer client-side, and commit sends the read/write set to proxies for conflict detection against concurrent commits.
- The Resolver is the serialization point. It maintains a recent commit history and rejects transactions whose read set overlaps with another transaction's write set committed after the read snapshot. No locks, just pure conflict detection.
- Transaction Logs (TLogs) are the durable write-ahead log. Writes are committed when a quorum of TLogs acknowledges. Storage servers pull mutations from TLogs asynchronously, making the commit path independent of storage server speed.
- Storage servers are key-range-sharded, sorted B-trees backed by SQLite (modified). Each server owns a contiguous key range and serves reads directly. The sorted order is what makes range scans efficient.
- The simulation testing framework (deterministic simulation) runs the entire database in a single-threaded simulator, injecting every possible failure (disk, network, process crash) deterministically. This is why FoundationDB rarely has correctness bugs in production.
- The 5-second transaction limit is deliberate. Short transactions keep conflict windows small and prevent long-running transactions from blocking the resolver's commit history buffer.
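The value of deterministic simulation is that a failing run can be replayed bit-for-bit. A vastly simplified toy, where a single seeded RNG drives both the workload and the fault injection (nothing here resembles FoundationDB's actual simulator internals):

```python
# Toy illustration of deterministic simulation: with a fixed seed,
# the same "crashes" fire at the same points, so any bug the run
# exposes can be replayed exactly. Vastly simplified.
import random

def simulate(seed: int):
    rng = random.Random(seed)   # the single source of nondeterminism
    trace = []
    store = {}
    for op in range(20):
        if rng.random() < 0.2:          # inject a "process crash"
            trace.append(("crash", op))
            store.clear()               # lose volatile state
        else:
            store[op] = "v"
            trace.append(("write", op))
    return trace

# Same seed -> identical event trace, so failures reproduce on demand.
assert simulate(42) == simulate(42)
```

Sweeping seeds then explores many distinct failure interleavings, each one individually replayable, which is how simulation testing turns rare race conditions into deterministic regression tests.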
Common Mistakes
- ✗ Writing transactions that exceed the 5-second limit. Batch large operations into smaller transactions (e.g., 1000 key writes per transaction, not 1 million).
- ✗ Storing values larger than 10KB without thinking about read amplification. Large values bloat the storage engine and slow range scans.
- ✗ Using random key distribution when range scans are needed. FoundationDB's strength is ordered keys, so design key prefixes for the actual access patterns.
- ✗ Not using the Directory Layer or Tuple Layer for key management. This leads to key encoding bugs and collisions.
- ✗ Running coordinators on the same machines as storage servers in production. Coordinator availability is the single most critical factor. Give them dedicated machines.
- ✗ Ignoring the transaction byte limit (10MB). Transactions that read or write too many bytes will fail. Use versionstamp-indexed queues for large batch operations.
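The key-encoding mistake is easy to see in miniature. Naive string concatenation can map distinct logical keys to the same byte string; a tuple-style encoding keeps them distinct and order-preserving. The `pack` helper below is an illustrative simplification, not the real Tuple Layer format (which uses typed, order-preserving byte codes):

```python
# Why tuple-style key encoding matters. Naive concatenation collides:

def naive_key(*parts):
    return "".join(parts).encode()

# Two different logical keys, one physical key -> collision.
assert naive_key("users", "1", "2") == naive_key("users", "12")

def pack(*parts):
    """Terminate each element with 0x00, escaping embedded 0x00 bytes
    (illustrative sketch, not fdb's actual Tuple Layer encoding)."""
    out = b""
    for p in parts:
        out += p.encode().replace(b"\x00", b"\x00\xff") + b"\x00"
    return out

# Tuple encoding keeps distinct tuples distinct...
assert pack("users", "1", "2") != pack("users", "12")
# ...and preserves element-wise ordering, so prefix range scans still work.
assert pack("users", "1") < pack("users", "1", "2") < pack("users", "2")
```

The ordering property in the last assertion is the important one: every key under the logical prefix ("users", "1") sorts contiguously, so the Directory/Tuple Layers give you collision-free keys without giving up efficient range scans.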