Page Cache & Block I/O
Mental Model
A desk in front of a filing cabinet. Pull a document out once and it stays on the desk. Every subsequent read grabs it from the desk (fast) instead of the cabinet (slow). Writes go to the desk first; a clerk files them back into the cabinet every few minutes. The desk has limited space, so the documents nobody has touched recently get put back to make room.
The Problem
A database calls write() and gets control back in 50 microseconds. The developer assumes the data is safe on disk. Twenty-eight seconds later the server loses power -- and those 28 seconds of writes vanish. write() succeeded, but the data never left RAM; it was sitting dirty in the page cache, waiting for the kernel's writeback threads. Meanwhile, the monitoring dashboard says only 2 GB is "free" on a 64 GB server while no process claims more than 40 GB. The missing 22 GB is page cache. It is reclaimable, not lost.
Architecture
When write() is called, the data does not go to disk.
It goes to RAM.
Linux just pretends the write is done and moves on. The data sits in a memory buffer called the page cache, marked as "dirty," waiting for the kernel to flush it to disk later. That "later" could be 30 seconds. If the power goes out before then, that data is gone forever.
This is not a bug. It is a deliberate design choice. And every high-performance system in production -- Kafka, PostgreSQL, Redis -- is built around this reality.
What Actually Happens
The read path:
read() → kernel checks page cache
↳ HIT → copy from RAM to your buffer (instant)
↳ MISS → read from disk → store in cache → copy to buffer
That is why the second read of the same file is fast. The data is already on the desk.
The kernel is even smarter than that. It detects sequential read patterns and prefetches pages before they are requested. The readahead window starts at 128KB and grows up to 2MB. By the time the next chunk is requested, it is already in cache.
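An application can nudge that machinery from userspace. A minimal sketch, assuming a hypothetical file, that declares a sequential access pattern so the kernel can widen the readahead window before the first cache miss:
#include <fcntl.h>      // open, posix_fadvise
#include <unistd.h>     // read, close

void scan_sequentially(const char *path) {            // path is a hypothetical example
    int fd = open(path, O_RDONLY);
    if (fd < 0) return;
    // Declare a front-to-back scan; the kernel may enlarge the readahead window for this fd.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    char buf[1 << 16];
    while (read(fd, buf, sizeof buf) > 0) { /* consume buf */ }
    close(fd);
}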
The write path (this is the dangerous one):
write() → copy data to page cache → mark page DIRTY → return immediately
↓
(sometime later)
writeback threads → disk
The write() call returns in microseconds. But the data is sitting in RAM, not on disk. The kernel's writeback threads will eventually flush it, triggered by one of three things:
- A timer — dirty_writeback_centisecs (default: every 5 seconds, the kernel checks for dirty pages)
- Background pressure — when dirty pages exceed dirty_background_ratio (default 10% of available memory), background writeback kicks in
- Blocking pressure — when dirty pages exceed dirty_ratio (default 20%), the write() call actually blocks until pages are flushed. This is the throttle that prevents dirty pages from eating all of RAM.
The only thing that guarantees data is on disk:
fsync(fd); // forces page cache → disk, waits for confirmation
The faster alternative:
fdatasync(fd); // flushes data only, skips metadata like timestamps
// often 2x faster for append workloads
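Put together, a durability-critical append looks roughly like this. A sketch only, not any particular database's code; the descriptor is assumed to be an already-open log file:
#include <unistd.h>

// Append one record and do not report success until it is on stable storage.
int durable_append(int fd, const void *rec, size_t len) {
    ssize_t n = write(fd, rec, len);       // lands in page cache; page is marked dirty
    if (n != (ssize_t)len) return -1;
    if (fdatasync(fd) != 0) return -1;     // block until the dirty pages reach the device
    return 0;                              // only now is the record crash-safe
}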
Under the Hood
The page cache is a unified, system-wide cache that stores file data in 4KB pages. It's indexed per-file through the inode's address_space structure, which contains an xarray mapping file offsets to struct page pointers. There is no separate buffer cache — it was unified with the page cache in Linux 2.4.
On a write, the kernel copies the data into the relevant page (allocating one if needed), sets the PG_dirty flag on the struct page, and returns. The page is now part of the dirty page pool.
Writeback is handled by per-device kworker/flush threads (formerly pdflush/bdflush). When triggered, they walk the dirty page list for each device, build struct bio requests (representing contiguous disk sectors), and submit them to the block I/O layer's scheduler for merging and dispatch.
fsync() vs fdatasync() vs sync():
- sync() flushes ALL dirty pages across ALL filesystems. System-wide nuclear option.
- fsync(fd) flushes dirty pages for one file AND its inode metadata (size, timestamps, block allocation), then waits for the disk to confirm.
- fdatasync(fd) flushes data but only flushes metadata if it's needed for subsequent reads (e.g., file size changes are flushed, but mtime changes are not).
For in-place overwrites of fixed-size files, fdatasync() skips the metadata write entirely. For append-only logs where the file size changes with every write, the difference is minimal because the size change counts as "necessary metadata."
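One common mitigation, sketched below under the assumption of fixed-size log segments: preallocate the file so appends overwrite already-reserved blocks, the size stops changing, and fdatasync() can stay on the cheap no-metadata path. Some write-ahead-log implementations preallocate segments for exactly this reason.
#include <fcntl.h>
#include <unistd.h>

// Illustrative: reserve a fixed-size segment up front so later appends overwrite
// already-allocated blocks and the file size never changes per write.
int open_preallocated_segment(const char *path, off_t segment_size) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return -1;
    if (posix_fallocate(fd, 0, segment_size) != 0) {   // allocate blocks and set the size now
        close(fd);
        return -1;
    }
    return fd;   // writes within [0, segment_size) keep fdatasync() on the no-metadata path
}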
O_DIRECT — bypassing the page cache:
Opens a file so that reads and writes go directly between the application buffer and the disk controller. No page cache involved. Databases like PostgreSQL and MySQL InnoDB use this because they implement their own buffer pools and don't want double-caching.
The catch: O_DIRECT requires memory-aligned buffers (aligned to the filesystem block size, typically 4096 bytes) and I/O sizes that are multiples of the block size. And it does NOT guarantee durability — the disk controller has its own volatile write cache. An fsync() or O_DSYNC is still needed to flush that.
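A hedged sketch of the alignment dance, assuming a 4096-byte filesystem block size and a placeholder path:
#define _GNU_SOURCE            // O_DIRECT is Linux-specific
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int direct_write_block(const char *path) {              // path is a placeholder
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return -1; }   // aligned buffer, aligned size
    memset(buf, 'x', 4096);

    ssize_t n = pwrite(fd, buf, 4096, 0);                // bypasses the page cache entirely
    int ok = (n == 4096) && (fdatasync(fd) == 0);        // still flush the drive's volatile write cache

    free(buf);
    close(fd);
    return ok ? 0 : -1;
}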
Page cache and memory pressure:
The page cache is the largest consumer of "free" memory on most Linux systems. This is by design — unused memory is wasted memory. When an application needs more memory, the kernel evicts clean pages (free to discard since data is on disk) and writes back dirty pages if needed. The OOM killer only fires as a last resort after all cache has been evicted. The free command's "available" column accounts for this reclaimable cache.
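For monitoring, the same distinction can be read programmatically from /proc/meminfo. A small sketch (threshold logic is left to the caller) that looks at MemAvailable rather than MemFree:
#include <stdio.h>

// Read MemAvailable (which counts reclaimable cache) instead of MemFree.
long mem_available_kb(void) {
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) return -1;
    char line[256];
    long kb = -1;
    while (fgets(line, sizeof line, f))
        if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1) break;
    fclose(f);
    return kb;
}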
Common Questions
"A server writes data and loses power. How much data is lost?"
All data written since the last fsync(). With default settings, dirty pages can exist for up to 30 seconds before background writeback starts for them. Under load, it could be longer. For durability-critical writes — database commits, financial transactions — every write must be followed by fsync(), or the file must be opened with O_SYNC/O_DSYNC.
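If remembering to call fsync() after every write is fragile, the file can be opened so each write() blocks until the data is durable. A two-line sketch with a placeholder path:
#include <fcntl.h>

// Every write() on this descriptor behaves like write() followed by fdatasync().
int fd = open("/var/lib/app/journal", O_WRONLY | O_CREAT | O_APPEND | O_DSYNC, 0644);   // placeholder path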
"Why does PostgreSQL need both O_DIRECT and fsync()?"
O_DIRECT bypasses the page cache but does not guarantee data hits the physical disk platter. Modern disks have volatile write caches. O_DIRECT gets data from the application buffer to the disk controller, but the controller may hold it in RAM. fsync() issues a cache flush command (FLUSH CACHE / FUA) that forces the controller to write to the platter. For full durability: O_DIRECT to avoid kernel cache overhead, fsync() to flush the hardware cache. Or use O_DSYNC which combines both.
"What do the drop_caches values 1, 2, and 3 do?"
Writing to /proc/sys/vm/drop_caches: (1) frees clean page cache pages, (2) frees dentries and inodes from slab caches, (3) frees both. Only clean pages are dropped — dirty pages must be written back first (run sync before drop_caches). This is a benchmarking tool, never for production. It evicts warm cache entries that must be painfully re-read from disk.
"What's the difference between posix_fadvise(FADV_DONTNEED) and madvise(MADV_DONTNEED)?"
posix_fadvise(FADV_DONTNEED) evicts pages from the page cache for a file range — subsequent reads hit disk again. madvise(MADV_DONTNEED) operates on virtual memory mappings — for anonymous pages, they're zeroed on next access. The key difference: fadvise works on fd+offset, madvise works on virtual address ranges. RocksDB uses FADV_DONTNEED after compaction to prevent cache pollution.
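A sketch of that cache-hygiene pattern (descriptor and timing are illustrative). Flush first, then advise, because FADV_DONTNEED silently skips pages that are still dirty:
#include <fcntl.h>
#include <unistd.h>

// After writing a file we will not read again soon, evict its pages so they
// do not crowd out hotter data. Dirty pages are skipped, so make them clean first.
void write_then_evict(int fd) {
    fdatasync(fd);                                    // clean the pages
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);     // len 0 means "to end of file"; drop the clean pages
}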
How Technologies Use This
Why can Kafka serve 10,000 consumers reading different offsets of the same topic at 800MB/s aggregate throughput without any custom caching code? Because Kafka delegates all caching to the OS page cache.
Producers write() log segments into page cache, and consumers read from the same cached pages via sendfile(), which transfers data directly from page cache to the network socket with zero copies into userspace. If the broker restarts after a crash, the page cache is still warm in kernel memory, so consumers resume at full speed without any cache warmup period.
This design means Kafka's heap stays small (typically 4-6GB JVM) while the page cache uses all remaining RAM, often 50GB or more, for log data. The lesson: let the OS cache what the OS is already good at caching.
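The zero-copy hop Kafka leans on is the sendfile() system call. A minimal sketch with hypothetical descriptors; error handling omitted:
#include <sys/sendfile.h>
#include <sys/types.h>

// Ship count bytes of a log segment straight from page cache to a socket;
// nothing is copied into userspace along the way.
ssize_t serve_segment(int sock_fd, int segment_fd, off_t offset, size_t count) {
    return sendfile(sock_fd, segment_fd, &offset, count);   // returns bytes sent, offset is advanced
}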
Why does PostgreSQL maintain its own 8GB shared_buffers pool when the kernel already caches file data in the page cache? Because a database cannot tolerate the kernel evicting a dirty page before the corresponding WAL record is durable.
PostgreSQL's buffer pool uses a clock-sweep eviction algorithm with pin counts and dirty-page tracking that the kernel's LRU cannot replicate. For WAL writes, PostgreSQL calls fdatasync() after every commit to force dirty pages to disk, guaranteeing transaction durability.
The page cache still helps as a second-chance read cache underneath shared_buffers, reducing disk reads by roughly 20-30% for working sets that exceed the buffer pool. But PostgreSQL never relies on it for write durability. The takeaway: when the kernel's eviction policy conflicts with application durability requirements, build a dedicated buffer pool.
How does Redis persist 50,000 writes per second to the AOF log without blocking client requests? By letting dirty pages sit in the page cache.
With appendfsync=everysec, Redis calls write() which copies data to page cache and returns in microseconds, then a background thread calls fsync() once per second to flush dirty pages to disk. The trade-off is explicit: a power failure in that 1-second window loses the unflushed writes. During BGSAVE, the forked child reads the entire dataset through the page cache via copy-on-write, keeping snapshot overhead under 10% additional RSS even for datasets exceeding 20GB.
The page cache gives Redis microsecond writes with background durability and zero-copy snapshots. The cost is a 1-second data loss window, which is an acceptable trade-off for most workloads.
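The everysec pattern has a simple shape. This sketch shows the idea, not Redis's actual code: the foreground path only ever calls write(), and a helper thread pays the fsync cost once per second:
#include <pthread.h>
#include <unistd.h>

static int aof_fd;                         // assumed: opened elsewhere with O_WRONLY | O_APPEND

// Foreground path: copy into page cache and return in microseconds.
void append_command(const void *buf, size_t len) {
    (void)write(aof_fd, buf, len);         // error handling omitted in this sketch
}

// Background thread: once per second, force the accumulated dirty pages to disk.
// Started once at init:  pthread_t t; pthread_create(&t, NULL, fsync_thread, NULL);
void *fsync_thread(void *unused) {
    (void)unused;
    for (;;) {
        sleep(1);
        fdatasync(aof_fd);                 // at most about one second of writes is ever at risk
    }
    return NULL;
}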
Same Concept Across Tech
| Technology | How page cache affects it | Key config |
|---|---|---|
| PostgreSQL | shared_buffers is a separate cache on top of the page cache. Double caching is intentional and controlled | shared_buffers + effective_cache_size |
| Redis | RDB/AOF writes go through page cache. AOF fsync policy controls durability | appendfsync: always (safe, slow), everysec (default), no (fast, risky) |
| Kafka | Relies heavily on page cache for consumer reads. Consumers reading recent data hit cache, not disk | log.flush.interval.messages controls explicit flush |
| MySQL (InnoDB) | innodb_buffer_pool is its own cache. O_DIRECT bypasses page cache for data files | innodb_flush_method = O_DIRECT avoids double caching |
| Docker | Container writes go through host page cache. Multiple containers reading same image layer share cache | Overlay2 storage driver benefits from shared page cache |
Stack layer mapping (data loss after crash):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is fsync/fdatasync called after critical writes? | Code review, strace -e fsync |
| Runtime | Does the database/framework have a durability setting? | DB config (e.g., PostgreSQL synchronous_commit) |
| Page cache | How many dirty pages are pending writeback? | cat /proc/vmstat |
| Kernel | What are the dirty page writeback settings? | sysctl vm.dirty_ratio, vm.dirty_expire_centisecs |
| Storage | Does the disk have a write cache? Is it battery-backed? | hdparm -W /dev/sda, check RAID controller |
Design Rationale
Write-back is the default because the alternative -- blocking on every write() until bytes hit the platter -- would cap throughput at disk speed, turning a microsecond memory copy into a millisecond operation. Data sits in volatile RAM until writeback flushes it, which means a power failure loses recent writes. The dirty_ratio knobs throttle producers so they cannot fill all of RAM with dirty pages and starve reads. Making fsync opt-in lets each application decide where it sits on the speed-vs-safety spectrum instead of forcing everyone to pay for the safest option.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Data lost after power failure despite successful write() | Dirty pages not flushed to disk. write() only writes to page cache | Check if fsync is called after critical writes |
| Server shows 2 GB free on 64 GB but no process uses that much | Page cache consuming "free" memory. This is normal and reclaimable | Check "available" in free -h, not just "free" |
| Periodic I/O spikes every 30 seconds | Dirty page writeback timer flushing accumulated writes | Check vm.dirty_writeback_centisecs and dirty ratio |
| First read of a large file is slow, second read is instant | First read goes to disk (cache miss). Second read served from page cache | Expected behavior. Use fadvise to prefetch |
| Application I/O latency spikes during heavy writes | Dirty ratio exceeded, writes become synchronous until writeback catches up | Lower vm.dirty_ratio or increase writeback frequency |
| free memory drops to near zero under I/O load | Page cache growing. Not a problem if "available" is healthy | Only worry if available memory is also low |
When to Use / Avoid
Relevant when:
- Investigating why write() returns in microseconds -- the data is in RAM, not on disk
- Diagnosing "low free memory" alerts on a server that is running fine (page cache fills free RAM by design)
- Guaranteeing data survives a crash -- fsync/fdatasync is the only way to force dirty pages to disk
- Tuning I/O: readahead size, dirty_ratio, and writeback frequency all live here
Watch out for:
- Treating write() as durable without fsync
- Panicking at low MemFree -- check MemAvailable instead, which accounts for reclaimable cache
- Cranking dirty_ratio too high, causing massive writeback storms on sync
- Setting it too low, forcing constant small flushes that spike latency
Try It Yourself
# Show page cache size, dirty pages awaiting writeback, and pages currently being written back
grep -E 'Cached|Dirty|Writeback' /proc/meminfo

# Monitor I/O activity: bi (blocks in), bo (blocks out), wa (I/O wait %) every second
vmstat 1 5

# Display the three key writeback tuning parameters
sysctl vm.dirty_background_ratio vm.dirty_ratio vm.dirty_expire_centisecs

# Write 100MB and force fdatasync. conv=fdatasync makes dd report throughput only after the data hits disk
dd if=/dev/zero of=/tmp/testfile bs=1M count=100 conv=fdatasync

# Show the readahead window size for a block device (default 128KB)
cat /sys/block/sda/queue/read_ahead_kb

# Flush dirty pages (sync), then drop clean page cache. For benchmarking only, not production
sudo sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

Debug Checklist
1. Check page cache size: free -h (look at the buff/cache column)
2. Check dirty pages waiting for writeback: cat /proc/vmstat | grep -E 'dirty|writeback'
3. Check dirty page settings: sysctl vm.dirty_ratio vm.dirty_background_ratio
4. Drop page cache (careful in production): echo 3 > /proc/sys/vm/drop_caches
5. Check cache hit rate: perf stat -e cache-misses,cache-references -p <pid>
6. Monitor writeback activity: watch -n1 'cat /proc/vmstat | grep -E nr_dirty'
Key Takeaways
- ✓ write() does NOT put data on disk. It copies to page cache, marks the page dirty, and returns. Without fsync(), you're trusting electricity.
- ✓ The kernel detects sequential reads and prefetches pages before you ask. Readahead starts at 128KB and grows up to 2MB. That's why second reads are fast — data is already waiting in RAM.
- ✓ dirty_background_ratio (10%) triggers background writeback. dirty_ratio (20%) triggers BLOCKING writeback — your write() call hangs. If you hit dirty_ratio, your disk can't keep up with your writes.
- ✓ O_DIRECT bypasses the page cache entirely — used by databases that run their own buffer pool. But O_DIRECT alone doesn't guarantee durability. You still need fsync() to flush the disk's hardware write cache.
- ✓ drop_caches evicts clean pages from RAM. It does NOT flush dirty pages. It's for benchmarking only — never use it in production.
Common Pitfalls
- ✗ Thinking write() means data is safe. Reality: it's only in RAM. The default 30-second writeback delay means up to 30 seconds of data loss on power failure. For anything that matters, you need fsync().
- ✗ Using fsync() when fdatasync() is enough. fsync() flushes data AND metadata (inode), requiring an extra disk write. For append workloads like logs, fdatasync() is sufficient and can be 2x faster.
- ✗ Calling fsync() on the file but forgetting the parent directory. On ext4, a newly created file's directory entry may not be persisted until you fsync the directory itself. A crash can make your file vanish entirely (see the sketch after this list).
- ✗ Using O_DIRECT without understanding alignment. Buffers must be aligned to the filesystem block size (usually 4096 bytes). Misaligned I/O silently falls back to buffered mode on some kernels or returns EINVAL on others.
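For the parent-directory pitfall, the full create-and-persist sequence looks roughly like this (paths and flags are illustrative, not lifted from any particular project):
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

// Persist both the new file's data and the directory entry that names it.
int create_durably(const char *dir, const char *path, const void *buf, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0644);
    if (fd < 0) return -1;
    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) { close(fd); return -1; }
    close(fd);

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);   // fsync the directory too, or the
    if (dfd < 0) return -1;                        // filename itself can vanish after a crash
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}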
Reference
In One Line
Reads come from RAM, writes land in RAM -- nothing is durable until fsync forces dirty pages to disk.