Write-Ahead Log
Crashes Happen Mid-Write. Now What?
Picture a database updating a B-Tree page. The page is 8KB. The operating system writes 4KB blocks. So updating one page is at least two block writes. If power goes out after the first 4KB but before the second, you now have half an old page and half a new page. That is a torn write, and the page is garbage.
This is not a theoretical concern. Production databases crash. Disks lose power. OOM killers terminate processes mid-operation. Kernels panic. Any storage system that modifies data in place needs a plan for what happens when those modifications are interrupted.
The write-ahead log is that plan. Before modifying any data file, write a description of what you intend to do to a separate, append-only log file. Only after that log entry is safely on disk (fsynced) do you proceed with the actual modification. If you crash before modifying the data, the log tells you what to finish. If you crash after modifying the data, the log entry is harmless. Either way, you can recover to a consistent state.
The "ahead" in "write-ahead" is doing all the work. Write the intent before the action. Not during. Not after. Before.
Sequential I/O: Why Appending is Fast
WAL entries are always appended to the end of the current log file. Never inserted in the middle, never overwritten. This matters enormously for performance.
An NVMe SSD doing sequential writes can sustain 2-4 GB/s. That same SSD doing random 4KB writes drops to 100-400 MB/s. The difference comes from how the flash translation layer handles writes internally, how the device can queue and batch sequential operations, and how the OS page cache interacts with write patterns.
On spinning disks, the gap is catastrophic. Sequential writes might run at 150 MB/s while random writes drop below 1 MB/s because the disk head physically moves for each write.
A WAL converts the random write pattern of "update page 7, then page 3041, then page 19" into the sequential pattern of "append entry, append entry, append entry." The actual random writes to data pages still happen eventually, but they happen asynchronously, in the background, batched together. The client only waits for the sequential WAL append.
The Mechanics: Append, Fsync, Apply
A write in a WAL-based system follows a specific sequence.
The database receives a write request. It serializes the operation (which page to modify, what the new content should be, the transaction ID) into a log record. That record is appended to the in-memory WAL buffer, then written out to the current WAL file. The database calls fsync on the WAL file, which forces the OS to flush the write to durable storage.
Only after fsync returns successfully does the database acknowledge the write to the client. At this point, the client has a durability guarantee: even if the process dies immediately after, the write is recoverable.
Then, separately, the database applies the change to its actual data structures. In PostgreSQL, this means modifying buffer pool pages. In RocksDB, this means inserting into the memtable. These in-memory modifications get written back to disk lazily, in the background, during checkpointing.
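A minimal sketch of this sequence in Python; the record format, the `pages` dictionary standing in for the buffer pool or memtable, and the class names are invented for illustration:

```python
import json
import os

class WAL:
    def __init__(self, path: str):
        # Open in append mode so every record goes to the end of the file.
        self.f = open(path, "ab")

    def log(self, record: dict) -> None:
        # 1. Serialize the intended change into a log record.
        line = (json.dumps(record) + "\n").encode()
        # 2. Append it to the WAL (still only in process/OS buffers).
        self.f.write(line)
        self.f.flush()
        # 3. fsync: only after this returns is the record durable.
        os.fsync(self.f.fileno())

wal = WAL("data.wal")
pages = {}  # stand-in for the buffer pool / memtable

def write(page_id: int, content: str) -> None:
    wal.log({"page": page_id, "content": content})  # durable intent first
    # 4. Acknowledge the client here: the write is now recoverable.
    pages[page_id] = content  # 5. Apply in memory; flushed to data files later.
```

The ordering is the whole point: fsync returns before the client is acknowledged, and the acknowledgment happens before the data structures are touched.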
Group Commit: Batching Fsync Calls
Fsync is expensive. On a good NVMe drive, each fsync might take 20-50 microseconds. That puts a ceiling of roughly 20,000-50,000 fsyncs per second. If every single transaction requires its own fsync, throughput caps at that number regardless of how fast everything else is.
Group commit solves this. Instead of fsyncing after each individual write, the database accumulates several writes in the WAL buffer over a short window (often a few milliseconds), then issues a single fsync for the entire batch. Ten transactions that each would have needed their own fsync now share one.
PostgreSQL implements this automatically. When a transaction commits, it waits for its WAL records to be flushed; whichever backend performs the flush writes out everything accumulated in the WAL buffer, so all the transactions committing at that moment share a single fsync. Under load, this naturally batches more transactions together, so throughput scales well as concurrency increases.
The tradeoff is slightly higher per-transaction latency (you wait for the batch window) in exchange for dramatically higher throughput. In practice, the batch window is small enough that the added latency is negligible for most workloads, and the throughput gain is 10x or more.
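A toy version of the mechanism, assuming a background flusher thread and a fixed batch window (a sketch of the idea, not PostgreSQL's implementation):

```python
import os
import threading
import time

class GroupCommitLog:
    def __init__(self, path: str, window: float = 0.002):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.pending = []          # (record, event) pairs waiting for fsync
        self.window = window       # batch window in seconds
        threading.Thread(target=self._flusher, daemon=True).start()

    def commit(self, record: bytes) -> None:
        done = threading.Event()
        with self.lock:
            self.pending.append((record, done))
        done.wait()                # returns once the batch containing us is fsynced

    def _flusher(self) -> None:
        while True:
            time.sleep(self.window)
            with self.lock:
                batch, self.pending = self.pending, []
            if not batch:
                continue
            for record, _ in batch:
                self.f.write(record + b"\n")
            self.f.flush()
            os.fsync(self.f.fileno())   # one fsync covers the whole batch
            for _, done in batch:
                done.set()              # wake every waiting transaction
```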
Fsync Tradeoffs: Durability vs Speed
Different systems choose different fsync strategies, and the choice reveals what they prioritize.
PostgreSQL defaults to fsyncing on every transaction commit. When you get COMMIT back, the data is on disk. Maximum durability. You can tune this by setting synchronous_commit = off, which lets PostgreSQL acknowledge commits before fsync. This risks losing the last few milliseconds of commits on crash, but some applications (analytics ingestion, logging) accept that tradeoff for the speed boost.
Kafka takes the opposite default. It does not fsync after each produce request. Instead, it relies on the OS page cache and replication for durability. If a single broker crashes without fsyncing, the data is on other replicas. This is why Kafka recommends a replication factor of 3. The flush.messages and flush.ms configurations exist, but the Kafka documentation explicitly recommends against using them because replication provides sufficient durability with much better performance.
MySQL's InnoDB has innodb_flush_log_at_trx_commit with three settings: 1 (fsync every commit, safest), 2 (write to OS cache every commit, fsync every second), and 0 (write and fsync every second). Setting 2 means you lose at most one second of commits on an OS crash but survive process crashes. Many MySQL deployments use setting 2 because the performance gain over setting 1 is significant.
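The three settings boil down to how far each commit pushes the log record before returning. Roughly, in a hedged Python sketch (the `commit` helper and its arguments are hypothetical, not InnoDB code):

```python
import os

def commit(record: bytes, policy: int, log_file, log_buffer: list) -> None:
    # Illustrative sketch of innodb_flush_log_at_trx_commit semantics.
    log_buffer.append(record)            # the record always enters the in-process log buffer
    if policy == 0:
        return                           # 0: a background task writes + fsyncs roughly once a second
    for rec in log_buffer:
        log_file.write(rec + b"\n")      # 1 and 2: write the buffer to the OS at commit time
    log_buffer.clear()
    log_file.flush()
    if policy == 1:
        os.fsync(log_file.fileno())      # 1 only: force to durable storage before returning
    # 2: no fsync here; a background task fsyncs roughly once a second
```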
Log Segments and Rotation
A WAL is not a single infinitely growing file. It gets split into fixed-size segments.
PostgreSQL uses 16MB segments by default (configurable at initdb time). When the current segment fills up, the database starts a new one. The naming convention is a monotonically increasing 24-character hex string that encodes the timeline, log file number, and segment number.
Old segments become candidates for removal after a checkpoint confirms that all the changes in those segments have been applied to the actual data files. But "removal" has options. The segments can be deleted outright, recycled (renamed and reused for future WAL data), or archived to a separate location for point-in-time recovery.
WAL archiving is critical for disaster recovery. By shipping completed WAL segments to an archive (S3, NFS, another server), you can restore a base backup and then replay archived WAL segments to reach any point in time. Without archiving, you can only restore to the moment of the last base backup.
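Segment rotation is mostly bookkeeping: when the current segment fills, close it, hand it to whatever archiving hook is configured, and open the next one. A sketch (the naming scheme and the archive callback are invented for illustration):

```python
import os

SEGMENT_SIZE = 16 * 1024 * 1024   # 16MB, PostgreSQL's default segment size

class SegmentedWAL:
    def __init__(self, directory: str, archive=None):
        self.dir = directory
        self.archive = archive     # optional callback, e.g. upload the file to S3
        self.seg_no = 0
        self._open_segment()

    def _open_segment(self) -> None:
        # Segment names increase monotonically; a real system encodes more
        # than a plain counter (PostgreSQL packs timeline, log, and segment
        # numbers into 24 hex characters).
        self.path = os.path.join(self.dir, f"{self.seg_no:024x}")
        self.f = open(self.path, "ab")

    def append(self, record: bytes) -> None:
        if self.f.tell() + len(record) > SEGMENT_SIZE:
            self.f.close()
            if self.archive:
                self.archive(self.path)   # ship the completed segment
            self.seg_no += 1
            self._open_segment()          # start a fresh segment
        self.f.write(record)
        os.fsync(self.f.fileno())
```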
Checkpointing: Bounding Recovery Time
Without checkpointing, crash recovery means replaying the entire WAL from the beginning. For a database that has been running for months, that could mean replaying terabytes of log records. Recovery would take hours.
Checkpointing solves this. Periodically, the database flushes all dirty pages to their data files, ensures those writes are durable, and writes a special checkpoint record to the WAL. This record says "everything before this point has been applied to the data files."
On crash recovery, the database only needs to find the last checkpoint record and replay WAL entries from that point forward. If checkpoints happen every 5 minutes, recovery replays at most 5 minutes of WAL. Predictable, bounded, fast.
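Recovery itself is then a scan: find the last checkpoint record, replay only what comes after it. A sketch over a JSON-lines WAL (the record format is made up for illustration):

```python
import json

def recover(wal_lines, pages):
    """Replay only the WAL entries after the last checkpoint record."""
    # Find the last checkpoint; recovery starts just after it.
    start = 0
    for i, line in enumerate(wal_lines):
        if json.loads(line).get("type") == "checkpoint":
            start = i + 1
    # Replay everything from that point forward.
    for line in wal_lines[start:]:
        rec = json.loads(line)
        if rec.get("type") == "update":
            pages[rec["page"]] = rec["content"]
```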
The cost of checkpointing is a burst of I/O when dirty pages get flushed. PostgreSQL spreads this out using checkpoint_completion_target (default 0.9), which means the checkpoint tries to spread its writes over 90% of the checkpoint interval rather than dumping everything at once. This prevents checkpoint-related I/O spikes from disrupting normal query performance.
RocksDB has an analogous concept: when the memtable fills up, it gets flushed to an SST file on disk, and the corresponding WAL records become unnecessary. The WAL file can then be deleted.
WAL as the Replication Unit
Here is where the WAL transcends storage internals and becomes a distributed systems primitive.
Raft, the consensus protocol behind etcd, CockroachDB, and TiKV, is fundamentally a protocol for replicating a log. The leader appends entries to its log, replicates them to followers, and once a majority have the entry, it is considered committed. The log IS the shared state machine input.
PostgreSQL streaming replication works by shipping WAL records from the primary to standbys in near-real-time. The standby replays those WAL records against its own data files, staying in sync with the primary. This is physical replication: the standby gets the exact byte-level changes, not logical SQL statements.
Kafka is perhaps the purest example. A Kafka topic partition is a WAL. Producers append to it. Consumers read from it at their own pace using offsets. The entire value proposition of Kafka is "a durable, distributed, replayable log." Every stream processing framework, CDC pipeline, and event sourcing system built on Kafka is really built on a distributed WAL.
The log works so well for replication because it captures the exact sequence of changes. Ordering is built in. Deduplication is straightforward (each entry has a unique sequence number). And replaying a log is idempotent if the operations themselves are.
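That is why a follower's apply loop can be so simple. Something like the following sketch, where each entry carries a sequence number (a generic illustration, not any particular protocol's code):

```python
def apply_entries(entries, state, applied_up_to):
    """Apply replicated log entries to a follower's in-memory state.

    Each entry carries a monotonically increasing sequence number, so
    re-delivered entries are detected and skipped; replaying the log is
    safe even if the same batch arrives twice.
    """
    for seq, key, value in entries:
        if seq <= applied_up_to:
            continue              # duplicate from a retried shipment
        state[key] = value        # idempotent as long as the operation itself is
        applied_up_to = seq
    return applied_up_to
```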
Log Compaction
Append-only logs grow forever. If every update to a key produces a new log entry, and the key has been updated 10,000 times, the log has 9,999 entries that will never be needed again.
Kafka's log compaction addresses this. For compacted topics, Kafka periodically scans the log and removes older entries for each key, keeping only the most recent value. The result is a log that contains exactly one entry per key: a compact snapshot of the current state. This is how Kafka stores consumer offsets internally (the __consumer_offsets topic is compacted) and how it enables changelog topics for stateful stream processors.
etcd compacts its Raft WAL for similar reasons. Without compaction, the WAL would grow unboundedly, and new cluster members would need to replay the entire history to catch up. After compaction, etcd takes a snapshot of the current state, and new members can start from the snapshot instead of replaying from zero.
WAL vs In-Place Updates: The Storage Engine Split
B-Tree databases like PostgreSQL and MySQL modify data pages in place. They read a page, update it, and write it back. The WAL provides crash safety for these in-place modifications. This means every write happens twice: once to the WAL and once to the data page. That is write amplification, and it is the cost of crash safety with in-place updates.
PostgreSQL actually has a third write: full-page images. After a checkpoint, the first modification to any page writes a complete copy of the page to the WAL (not just the diff). This protects against torn pages at the cost of even more write amplification. The full_page_writes setting controls this, and turning it off is dangerous unless your filesystem guarantees atomic page writes.
LSM-Tree databases like RocksDB take a different approach entirely. Writes go to the WAL, then to an in-memory memtable. There are no in-place updates to on-disk data structures at all. When the memtable fills up, it gets flushed to a new sorted file on disk. The WAL is only needed until the memtable is flushed. This eliminates the double-write problem but introduces its own write amplification through compaction (merging sorted files together).
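A compressed sketch of that write path, loosely modeled on the flow described above (the file formats, names, and flush threshold are invented for illustration, not RocksDB internals):

```python
import json
import os

class LSMWritePath:
    def __init__(self, wal_path: str, flush_threshold: int = 4):
        self.wal_path = wal_path
        self.wal = open(wal_path, "ab")
        self.memtable = {}
        self.flush_threshold = flush_threshold
        self.sst_count = 0

    def put(self, key: str, value: str) -> None:
        rec = json.dumps({"k": key, "v": value}).encode() + b"\n"
        self.wal.write(rec)
        os.fsync(self.wal.fileno())     # durable before acknowledging
        self.memtable[key] = value      # no in-place update of on-disk data
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self) -> None:
        # Write the memtable as a new sorted file; never modify old files.
        path = f"sst_{self.sst_count:06d}.json"
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
            f.flush()
            os.fsync(f.fileno())
        self.sst_count += 1
        self.memtable.clear()
        # The WAL only protected the memtable; once flushed, it can go.
        self.wal.close()
        os.remove(self.wal_path)
        self.wal = open(self.wal_path, "ab")
```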
The choice between B-Trees and LSM-Trees is partly a choice about how you want your WAL to interact with your data files.
When WAL is Not Needed
Not every storage system needs a WAL.
If your data structure is already append-only, the data IS the log. Time-series databases that only append new data points do not need a separate WAL because the data files themselves are written sequentially and a torn write at the tail only means losing the last few data points. Event stores have the same property.
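Recovering such a file typically means scanning length-prefixed, checksummed records and truncating at the first one that fails validation. A sketch of that common technique (the on-disk format here is invented):

```python
import struct
import zlib

def read_valid_records(path: str):
    """Read an append-only file of checksummed records, stopping at the
    first record that fails validation, so a torn write at the tail
    costs only the final record(s)."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(8)            # 4-byte length + 4-byte CRC32
            if len(header) < 8:
                break                     # clean end of file
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if len(payload) < length or zlib.crc32(payload) != crc:
                break                     # torn tail: ignore everything after
            records.append(payload)
    return records
```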
In-memory databases that accept data loss on crash can skip the WAL entirely. Redis with appendonly disabled, Memcached, and similar caches restart empty after a crash, so there is nothing to recover. When Redis does enable its append-only file (AOF), that file is a WAL by another name.
Some databases use a combination. Redis can run with RDB snapshots only (periodic full dumps with no WAL) and accept losing writes between snapshots. This is a valid tradeoff for caching workloads where the source of truth lives elsewhere.
The WAL is the Foundation
Almost every durable storage system sits on top of a write-ahead log. Understanding how WAL works, why fsync matters, how group commit amortizes the cost, and how checkpointing bounds recovery time gives you the vocabulary to reason about database performance and reliability at a deep level. When someone asks "can we lose data if this crashes?" the answer almost always starts with "what does the WAL configuration look like?"
Key Points
- Every durable database writes intentions to a sequential log before touching actual data files. If the process crashes mid-write, the log contains everything needed to either finish the operation or roll it back cleanly
- Sequential I/O is the reason WAL works so well. SSDs can sustain 2-4 GB/s sequentially while random 4KB writes drop to 100-400 MB/s. On spinning disks, the gap is even more dramatic
- Group commit batches multiple transactions into a single fsync call, turning the per-write overhead of durable logging into amortized cost. PostgreSQL does this automatically. Throughput improvements of 10x or more are common
- The WAL doubles as a replication transport. PostgreSQL streams WAL segments to replicas. Raft uses the log as its core abstraction. Kafka is, at its heart, a distributed WAL with consumer offsets
- Checkpointing bounds recovery time by periodically flushing dirty pages to disk and advancing the WAL replay start position. Without checkpointing, crash recovery would replay every write since the database was created
Common Mistakes
- Disabling fsync for benchmarks and forgetting to re-enable it in production. This is shockingly common and leads to silent data corruption on crash. Always verify fsync settings before any production deployment
- Not monitoring WAL disk usage. When the WAL disk fills up, most databases halt entirely rather than risk corruption. PostgreSQL will refuse new writes. Set alerts at 70% disk usage
- Ignoring WAL fsync latency during performance tuning. The WAL fsync is often the bottleneck for write-heavy workloads, not CPU or memory. Faster storage under the WAL directory can transform throughput
- Not configuring WAL archiving for point-in-time recovery. Without archived WAL segments, you can only restore to the last full backup, potentially losing hours of transactions