Direct I/O (O_DIRECT): Bypassing the Page Cache
Mental Model
A warehouse has a loading dock (the storage device) and a staging area (the page cache). Normally, every delivery goes through staging -- packages are unloaded into the staging area, sorted, then moved to the final shelf. This works well when multiple people need the same package soon. But a database already has its own private staging area (the buffer pool) with a sorting system tuned to its exact needs. Routing packages through two staging areas means every box gets handled twice, and the shared staging area fills up with packages that only one tenant ever needs. Direct I/O is a dedicated loading dock door that lets the database move packages straight from the truck to its private staging area, skipping the shared one entirely.
The Problem
A database is showing high memory usage because both the database buffer pool AND the kernel page cache hold copies of the same data. On a 64 GB server with a 40 GB buffer pool, free reports only 2 GB available. Dropping the page cache with echo 3 > /proc/sys/vm/drop_caches frees 18 GB instantly, proving that the page cache was mostly holding redundant copies of pages the buffer pool already manages. Every dirty page flush goes through two layers -- the application writes to its buffer, the kernel copies it into the page cache, and then the kernel writes it to disk. Throughput under heavy write loads is 30-40% lower than expected because the CPU spends cycles copying data into a cache that serves no purpose.
Architecture
A database holds 40 GB of hot data in its buffer pool. The server has 64 GB of RAM. But free shows only 2 GB available, and grep Cached /proc/meminfo reports 18 GB in the page cache. That 18 GB is not caching anything useful -- it is a second copy of pages the database already manages in its own buffer pool. Every write passes through two caches: the application writes to its buffer, the kernel copies the data into the page cache, and then a background thread flushes the page cache to disk. The CPU burns cycles on a memcpy that serves no purpose.
This is the double buffering problem. O_DIRECT exists to solve it.
What O_DIRECT Does
Opening a file with O_DIRECT tells the kernel to transfer data directly between the application's memory buffer and the storage device, bypassing the page cache entirely.
Normal (buffered) I/O path:
Application buffer → copy_from_user() → Page Cache → writeback → Device
Direct I/O path:
Application buffer → get_user_pages() → DMA to Device
The kernel pins the application's buffer pages in physical memory, builds a scatter-gather list for the block device, and initiates DMA directly from those pages. No intermediate copy. No page cache allocation. No writeback thread involvement.
The result: one copy of the data in memory instead of two, predictable write latency instead of writeback-induced spikes, and more free memory for processes that actually need it.
The Alignment Contract
O_DIRECT imposes strict requirements that the kernel enforces at every read and write call:
- Buffer address must be aligned to the filesystem's logical block size (typically 512 or 4096 bytes).
- File offset must be aligned to the same boundary.
- Transfer size must be a multiple of that block size.
Violating any of these returns EINVAL. There is no silent fallback.
/* fd was opened with O_RDWR | O_DIRECT */
/* Wrong: malloc gives only 8- or 16-byte alignment */
char *bad = malloc(4096);
write(fd, bad, 4096);              /* fails with EINVAL */
/* Right: posix_memalign returns a 4096-byte-aligned buffer */
void *good;
if (posix_memalign(&good, 4096, 4096) != 0)
    abort();
write(fd, good, 4096);             /* succeeds */
To determine the required alignment:
# Filesystem block size
stat -f /var/lib/postgresql/data | grep "Block size"
# Or via the logical block size of the underlying device
cat /sys/block/sda/queue/logical_block_size
Most modern filesystems and NVMe devices use 4096-byte logical blocks. Older SATA drives may report 512 bytes, but aligning to 4096 is always safe.
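For programs rather than shell sessions, newer kernels can report the requirement per file. A minimal sketch, assuming a 6.1+ kernel and glibc headers that expose STATX_DIOALIGN; on older systems the field simply comes back unavailable and 4096 remains the safe assumption:
/* Query the O_DIRECT alignment for a file (sketch; needs Linux 6.1+
 * for STATX_DIOALIGN -- otherwise fall back to 4096). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
int main(int argc, char **argv)
{
    struct statx stx;
    const char *path = argc > 1 ? argv[1] : ".";
    if (statx(AT_FDCWD, path, 0, STATX_DIOALIGN, &stx) == 0 &&
        (stx.stx_mask & STATX_DIOALIGN)) {
        printf("buffer alignment: %u bytes\n", stx.stx_dio_mem_align);
        printf("offset alignment: %u bytes\n", stx.stx_dio_offset_align);
    } else {
        printf("STATX_DIOALIGN not available; assume 4096\n");
    }
    return 0;
}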
When the Page Cache Hurts
The page cache is one of the most effective optimizations in Linux. For general-purpose workloads -- a web server serving static files, a compiler reading source files, a shell script processing logs -- the page cache is strictly better than no cache. It absorbs repeated reads, coalesces small writes, provides transparent readahead, and requires zero application code.
But databases are not general-purpose workloads. A database buffer pool is a specialized cache with capabilities the page cache lacks:
- Transaction-aware eviction. InnoDB's LRU knows which pages are pinned by active transactions.
- Dirty page priority. The database controls when and in what order dirty pages are flushed.
- Access pattern prediction. The query executor knows which pages a sequential scan will need next.
- Memory accounting. The DBA sets buffer pool size explicitly. The page cache grows and shrinks unpredictably based on system-wide memory pressure.
When a database writes a page through buffered I/O, two things happen:
- The database copies the page from its buffer pool into a kernel buffer (the page cache).
- A kernel writeback thread eventually copies the page from the page cache to the device.
That first copy is pure waste. The database already has the authoritative version of the page. The page cache copy will never be read (the database reads from its own buffer pool). It sits in memory until writeback picks it up, evicts it, or memory pressure reclaims it.
With O_DIRECT, step 1 is eliminated. The write goes from the database buffer pool directly to the device. One copy instead of two.
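A minimal sketch of that single-copy flush path, with a hypothetical flush_page() helper standing in for a storage engine's writer -- the aligned buffer-pool frame goes straight to the device via pwrite() on an O_DIRECT descriptor:
/* Sketch: flush one buffer-pool page with a single copy in memory.
 * flush_page() and PAGE_SIZE_DB are illustrative names, not taken
 * from any particular engine. */
#define _GNU_SOURCE              /* exposes O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE_DB 8192

/* buf must come from posix_memalign(); offset must be page-aligned */
static int flush_page(int fd, const void *buf, off_t offset)
{
    if (pwrite(fd, buf, PAGE_SIZE_DB, offset) != PAGE_SIZE_DB)
        return -1;               /* short write, or EINVAL on misalignment */
    return fdatasync(fd);        /* O_DIRECT alone is not durable */
}

int main(void)
{
    void *frame;
    int fd = open("table.dat", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0 || posix_memalign(&frame, 4096, PAGE_SIZE_DB) != 0)
        return 1;
    memset(frame, 0, PAGE_SIZE_DB);   /* an engine would fill this from its pool */
    return flush_page(fd, frame, 0) == 0 ? 0 : 1;
}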
Interaction with io_uring
Traditional direct I/O still pays the cost of one syscall per I/O operation. A database issuing 100,000 random 4 KB reads per second makes 100,000 pread() syscalls, each crossing the user-kernel boundary.
io_uring eliminates this overhead. The application writes I/O requests into a submission queue (SQ) in shared memory. The kernel consumes them and posts completions to a completion queue (CQ). No syscall needed for submission or completion in the fast path.
Combining O_DIRECT with io_uring:
/* Pseudocode: io_uring + O_DIRECT */
int fd = open("/data/table.ibd", O_RDWR | O_DIRECT);
struct io_uring ring;
io_uring_queue_init(256, &ring, 0);
/* Submit a direct I/O read */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, aligned_buf, 4096, file_offset);
io_uring_submit(&ring);
/* Reap the completion (io_uring_peek_cqe() can avoid the syscall when one is already posted) */
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
/* cqe->res contains bytes read or error */
io_uring_cqe_seen(&ring, cqe);
This combination gives the best of both worlds: no page cache overhead (O_DIRECT) and no syscall overhead (io_uring). Modern storage engines like TigerBeetle, ScyllaDB, and newer versions of RocksDB are built around this pairing.
With IORING_SETUP_SQPOLL, the kernel runs a dedicated polling thread that drains the submission queue without any syscall at all. The application writes to shared memory; the kernel thread picks it up and issues the I/O. Latency drops to near-device levels.
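A minimal setup sketch, assuming liburing; the idle timeout below is an arbitrary example value:
/* Sketch: create a ring with a kernel SQ polling thread.  Kernels
 * before 5.11 also require registered files for SQPOLL. */
#include <liburing.h>
#include <string.h>

static int setup_sqpoll_ring(struct io_uring *ring)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;   /* kernel thread drains the SQ */
    p.sq_thread_idle = 2000;         /* ms of idle before it sleeps */
    return io_uring_queue_init_params(256, ring, &p);
}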
Under the Hood
The direct I/O code path. When the kernel processes a direct I/O request (__blockdev_direct_IO or the newer iomap_dio_rw), it:
- Calls get_user_pages_fast() to pin the application's buffer pages in physical memory.
- Builds a bio (block I/O descriptor) with the physical page addresses.
- Submits the bio to the block layer, which passes it to the device driver.
- On completion, unpins the pages and wakes the waiting thread (or completes the io_uring CQE).
get_user_pages and GUP. Pinning user pages for DMA is a delicate operation. The kernel must ensure the pages are not swapped out, migrated by NUMA balancing, or reclaimed while the device is doing DMA. GUP (Get User Pages) increments the page reference count and marks pages as "pinned." This is why O_DIRECT buffers must be in anonymous memory (not file-backed mmap) and properly aligned -- the DMA hardware operates on physical page boundaries.
The iomap path. Modern filesystems (XFS, ext4 on recent kernels) use the iomap infrastructure for direct I/O instead of the older __blockdev_direct_IO. The iomap path is cleaner, handles extent-based filesystems better, and supports both synchronous and asynchronous direct I/O through io_uring with less overhead.
Mixed access coherency. When a file has both buffered and direct I/O users, the kernel does not automatically synchronize them. A buffered write modifies the page cache but not the on-disk data immediately. A subsequent direct read bypasses the page cache and reads stale data from disk. The POSIX standard explicitly warns that mixing access modes on the same file produces undefined results unless the application manages synchronization (e.g., calling fsync() after buffered writes before direct reads).
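A minimal sketch of the synchronization the standard expects when the two modes must touch the same file -- the fsync() between the buffered write and the direct read is what keeps the reader from seeing stale on-disk data:
/* Sketch: one buffered descriptor and one O_DIRECT descriptor on the
 * same file.  Error handling omitted for brevity. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    void *buf;
    int buffered_fd = open("shared.dat", O_RDWR | O_CREAT, 0644);
    int direct_fd   = open("shared.dat", O_RDONLY | O_DIRECT);
    posix_memalign(&buf, 4096, 4096);
    memset(buf, 'x', 4096);

    pwrite(buffered_fd, buf, 4096, 0);   /* lands in the page cache   */
    fsync(buffered_fd);                  /* force it to the device    */
    pread(direct_fd, buf, 4096, 0);      /* now sees the written data */
    return 0;
}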
Common Questions
Does O_DIRECT guarantee data durability?
No. O_DIRECT bypasses the page cache but not the storage device's volatile write cache. A power failure after a successful O_DIRECT write can lose data if the device has write caching enabled. For durability, combine O_DIRECT with fsync() or fdatasync() after writing, or open with O_DIRECT | O_DSYNC to make every write synchronous. Database engines always pair O_DIRECT with some form of sync.
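A short fragment showing both options (illustrative; fd, buf, and offset are assumed to satisfy the alignment contract):
/* Option 1: explicit flush after the direct write */
pwrite(fd, buf, 4096, offset);
fdatasync(fd);                          /* drain the device write cache */

/* Option 2: make every write synchronous at open time */
int wal_fd = open("wal.log", O_WRONLY | O_DIRECT | O_DSYNC);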
Is O_DIRECT faster than buffered I/O?
It depends on the workload. For large sequential writes (database WAL, compaction output), O_DIRECT is often faster because it eliminates the copy into the page cache and avoids writeback contention. For small random reads without an application cache, buffered I/O is faster because the page cache provides readahead and absorbs repeated accesses. For databases with their own buffer pool, O_DIRECT is almost always the right choice because the page cache provides no additional benefit.
Why does Linus Torvalds dislike O_DIRECT?
Linus has called O_DIRECT "brain-damaged" because it pushes caching responsibility to the application, creating a two-tier system where some applications use the page cache and others bypass it. His preferred alternative is fadvise() with POSIX_FADV_DONTNEED to tell the kernel to evict pages after use, keeping the page cache in the loop but avoiding the accumulation problem. In practice, databases use O_DIRECT anyway because fadvise is advisory (the kernel can ignore it), and double buffering is a measurable performance problem at database scale.
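A short fragment of that alternative (illustrative; fd, buf, and len are assumed): write through the page cache, flush, then ask the kernel to drop the now-clean pages. Because DONTNEED leaves dirty pages alone, the sync has to come first.
write(fd, buf, len);                           /* normal buffered write       */
fdatasync(fd);                                 /* make the cached pages clean */
posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);  /* hint: drop them -- advisory only */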
What about mmap with MAP_POPULATE vs O_DIRECT?
mmap maps file pages directly into the process address space through the page cache. It avoids the explicit read/write copy but still uses the page cache. For databases, mmap has additional problems: no control over eviction timing, TLB shootdowns on unmap, and the need for msync() for durability. LMDB and early WiredTiger used mmap successfully, but most high-performance databases (PostgreSQL, MySQL, RocksDB) prefer explicit read/write I/O -- buffered or direct -- for the control it provides.
Can O_DIRECT be used with NFS or network filesystems?
NFS supports O_DIRECT since NFSv3. The client bypasses its local page cache and sends the read/write RPC directly to the server. The server may still use its own page cache. This is useful for clustered databases where multiple nodes access the same files and local caching would cause coherency issues. However, NFS O_DIRECT has higher per-operation overhead than local O_DIRECT due to network round trips.
How Technologies Use This
A 32-core OLTP server running PostgreSQL handles 50,000 transactions per second, generating roughly 500 MB/s of WAL (Write-Ahead Log) traffic. Without Direct I/O, every WAL write passes through the kernel page cache first, meaning 500 MB/s of sequential, write-once data occupies page cache pages that will never be read back under normal operation. That pressure evicts hot heap and index pages from the OS cache, degrading read performance for the actual query workload.
When wal_sync_method is set to open_datasync or open_sync and neither WAL archiving nor streaming replication is in use, PostgreSQL opens WAL segment files with the O_DIRECT flag. Each WAL write goes from the process-private WAL buffers (sized by wal_buffers, typically 64 MB) directly to the storage device, bypassing the page cache entirely. The kernel never allocates page cache pages for WAL data, so shared_buffers and the remaining page cache stay populated with frequently accessed table and index pages. The double-buffering problem disappears because PostgreSQL already manages WAL data in its own memory, and the kernel no longer duplicates that effort.
The measurable effect on a system with 64 GB RAM and shared_buffers set to 16 GB is that the remaining 48 GB of page cache stays available for data file reads instead of being consumed by write-once WAL pages. WAL fsync latency also becomes more predictable because the kernel writeback threads (kworker/flush) no longer compete with WAL writes for device bandwidth.
A production MySQL instance on a 128 GB server has innodb_buffer_pool_size set to 100 GB, leaving 28 GB for the OS, page cache, and other processes. Without Direct I/O, every InnoDB data page read or written also gets cached in the kernel page cache. The same 16 KB page exists in both the InnoDB buffer pool and the page cache simultaneously, wasting 20 to 30 GB of RAM on duplicate copies of hot data.
Setting innodb_flush_method=O_DIRECT causes InnoDB to open its tablespace files and redo log files with O_DIRECT. Reads and writes bypass the page cache, going straight between the InnoDB buffer pool and the storage device. InnoDB already implements its own sophisticated caching layer with a young/old sublist LRU, adaptive hash indexes, and change buffer merging. It tracks access patterns, dirty page ratios, and flush scheduling internally, making the kernel page cache redundant for InnoDB-managed files.
The practical impact is that those 20 to 30 GB previously wasted on duplicate caching become available for filesystem metadata caching, temporary table operations, and OS-level needs. On a server running both MySQL and a reporting workload that reads CSV exports, the freed page cache capacity can hold 25 GB of exported data instead of duplicating InnoDB pages that are already cached in the buffer pool.
A RocksDB instance serving a real-time analytics workload maintains a 16 GB block cache on a 64 GB machine. Background compaction continuously reads and rewrites SST (Sorted String Table) files ranging from 64 to 256 MB each, producing 2 to 5 GB/s of I/O during peak compaction periods. Without Direct I/O, every compaction-written SST file lands in the page cache, flooding it with cold data that will not be read for hours or days, and evicting the hot SST blocks that serve active point lookups and range scans.
Enabling use_direct_io_for_flush_and_compaction=true instructs RocksDB to open SST output files with O_DIRECT during both memtable flushes and compaction. The freshly written SST data goes straight to the NVMe device without allocating page cache pages. Read caching remains under RocksDB's control through its block cache (backed by LRUCache or ClockCache), which applies application-aware eviction policies that understand access frequency, data block vs. index block priority, and pinned iterator references.
On the 64 GB machine, this configuration keeps roughly 40 GB of page cache free for other processes and OS needs instead of allowing compaction to consume 20 to 30 GB with data that has near-zero reuse probability. The block cache hit rate for foreground reads improves measurably because compaction output no longer triggers kernel-level eviction of the very pages the block cache also relies on for fast lookups.
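A minimal sketch of that configuration, assuming the options are exposed through RocksDB's C bindings (rocksdb/c.h) under these names -- check the header shipped with your RocksDB version:
#include <rocksdb/c.h>
#include <stdio.h>

int main(void)
{
    rocksdb_options_t *opts = rocksdb_options_create();
    char *err = NULL;

    rocksdb_options_set_create_if_missing(opts, 1);
    /* flush and compaction output bypasses the page cache */
    rocksdb_options_set_use_direct_io_for_flush_and_compaction(opts, 1);
    /* optionally bypass it for reads too; size the block cache accordingly */
    rocksdb_options_set_use_direct_reads(opts, 1);

    rocksdb_t *db = rocksdb_open(opts, "/data/analytics", &err);
    if (err) { fprintf(stderr, "open failed: %s\n", err); return 1; }

    rocksdb_close(db);
    rocksdb_options_destroy(opts);
    return 0;
}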
Same Concept Across Tech
| Technology | How it uses O_DIRECT | Key gotcha |
|---|---|---|
| PostgreSQL | WAL writes with open_datasync/open_sync bypass page cache for write-once log data | Data files still use buffered I/O; WAL only gets O_DIRECT when archiving and streaming replication are off (effectively wal_level=minimal) |
| MySQL/InnoDB | innodb_flush_method=O_DIRECT for data files and redo log, eliminating double buffering with buffer pool | Without O_DIRECT, the page cache and buffer pool hold duplicate copies, wasting up to 30% of RAM |
| RocksDB | Direct I/O for compaction writes (SST files) prevents cold data from evicting hot block cache entries | Read path can also use direct I/O but needs careful block cache sizing to compensate for lost page cache |
| Oracle | SGA + direct I/O since the 1990s; the original database use case for bypassing the page cache | Oracle on Linux without O_DIRECT or async I/O is a known misconfiguration that causes severe performance issues |
| ScyllaDB | Uses O_DIRECT with io_uring for all storage I/O, managing its own cache with per-shard memory allocation | Requires XFS; ext4 O_DIRECT has historically had edge cases that ScyllaDB works around |
Stack layer mapping (high memory + low cache hit ratio):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the application managing its own buffer pool? | Application config (buffer_pool_size, shared_buffers, block_cache_size) |
| Page Cache | How much RAM is the page cache using for this workload's files? | /proc/meminfo Cached, fincore on data files |
| Block Layer | Are writes going through writeback or direct? | blktrace, check for WB vs D flags |
| Filesystem | Does the filesystem support O_DIRECT properly? (XFS preferred) | mount options, filesystem type |
| Device | Is the device write cache enabled? (O_DIRECT still needs fsync for durability) | hdparm -W /dev/sdX, smartctl |
Design Rationale
The page cache is one of the most effective optimizations in Linux -- for general-purpose workloads. It absorbs repeated reads, coalesces small writes, and provides transparent readahead. But databases are not general-purpose workloads. A database buffer pool is a specialized cache that understands transaction isolation, dirty page priority, pin counts, and access pattern prediction. Running a sophisticated application cache behind a generic kernel cache means every page exists in memory twice, every write gets copied an extra time, and the kernel's LRU eviction competes with the application's carefully tuned eviction policy. O_DIRECT exists because the page cache, despite being excellent, is the wrong abstraction when the application already solved the caching problem.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| High memory usage with low application cache hit ratio | Page cache and application buffer pool holding duplicate pages | grep Cached /proc/meminfo, compare with buffer pool size |
| Write latency spikes every 5-30 seconds | Kernel writeback flushing dirty pages from page cache | Check /proc/meminfo Dirty and Writeback values |
| EINVAL from read() or write() after opening with O_DIRECT | Buffer, offset, or size not aligned to filesystem block size | strace the failing call, check alignment with stat -f -c %S |
| Sequential read throughput much lower with O_DIRECT | No kernel readahead; application not issuing large enough reads | Increase read size or implement application-level readahead |
| Stale data when mixing buffered and direct I/O on same file | Page cache and direct path not coherent | Use O_DIRECT consistently or add fsync between access mode changes |
| High CPU usage during large writes | Without O_DIRECT, kernel copies every byte into page cache (copy_from_user) | Enable O_DIRECT to eliminate the copy, check perf top for copy_page |
When to Use / Avoid
Relevant when:
- Running a database (PostgreSQL, MySQL, RocksDB, Oracle) that manages its own buffer pool
- Writing large sequential data that has no reuse (log files, WAL segments, compaction output)
- Memory pressure is high and the page cache is holding duplicate data
- Benchmarking storage devices and need to measure true device latency without page cache interference
Watch out for:
- Alignment requirements: buffer, offset, and size must all be aligned to the filesystem block size
- No kernel readahead: sequential direct I/O requires application-level prefetching (see the sketch after this list)
- No write coalescing: many small direct writes are slower than buffered writes that the kernel can merge
- Mixing buffered and direct I/O on the same file risks data coherency issues
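A minimal sketch of application-level readahead for sequential direct reads, assuming liburing: keep a small window of reads in flight so the device never sits idle waiting for the next request. CHUNK and QD are arbitrary example values.
#include <liburing.h>
#include <stdint.h>
#include <stdlib.h>

#define CHUNK (1 << 20)   /* 1 MB per read; must be block-aligned */
#define QD    8           /* reads kept in flight                 */

static void scan_file(int fd, off_t file_size)   /* fd opened with O_DIRECT */
{
    struct io_uring ring;
    void *bufs[QD];
    off_t next = 0;
    int inflight = 0;

    io_uring_queue_init(QD, &ring, 0);
    for (int i = 0; i < QD; i++)
        posix_memalign(&bufs[i], 4096, CHUNK);

    /* prime the window, tagging each request with its buffer index */
    while (inflight < QD && next < file_size) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[inflight], CHUNK, next);
        io_uring_sqe_set_data(sqe, (void *)(uintptr_t)inflight);
        next += CHUNK;
        inflight++;
    }
    io_uring_submit(&ring);

    while (inflight > 0) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        size_t idx = (size_t)(uintptr_t)io_uring_cqe_get_data(cqe);
        /* ... consume cqe->res bytes from bufs[idx] here ... */
        io_uring_cqe_seen(&ring, cqe);
        inflight--;

        if (next < file_size) {          /* refill with the freed buffer */
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[idx], CHUNK, next);
            io_uring_sqe_set_data(sqe, (void *)(uintptr_t)idx);
            next += CHUNK;
            io_uring_submit(&ring);
            inflight++;
        }
    }
    io_uring_queue_exit(&ring);
}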
Try It Yourself
# Write 1 GB with direct I/O, bypassing the page cache
# (use a directory on a filesystem that supports O_DIRECT; tmpfs does not)
dd if=/dev/zero of=/tmp/testfile bs=4096 count=262144 oflag=direct conv=fdatasync

# Read with direct I/O and measure throughput
dd if=/tmp/testfile of=/dev/null bs=4096 count=262144 iflag=direct

# Compare page cache usage: buffered vs direct
echo 3 > /proc/sys/vm/drop_caches && dd if=/dev/zero of=/tmp/buf bs=1M count=512 && grep Cached /proc/meminfo
echo 3 > /proc/sys/vm/drop_caches && dd if=/dev/zero of=/tmp/dio bs=1M count=512 oflag=direct && grep Cached /proc/meminfo

# Check filesystem block size (alignment requirement for O_DIRECT)
stat -f /path/to/datadir | grep "Block size"

# Trace O_DIRECT usage in a running process
strace -e trace=open,openat -p $(pidof postgres) 2>&1 | grep DIRECT

# Check if files are in page cache (requires fincore or vmtouch)
vmtouch /var/lib/postgresql/data/base/16384/*

# Monitor writeback activity (high values suggest O_DIRECT would help)
watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'

Debug Checklist
1. Check if O_DIRECT is being used: strace -e open,openat <process> 2>&1 | grep O_DIRECT
2. Measure page cache usage: grep -E 'Cached|Dirty|Writeback' /proc/meminfo
3. Verify alignment: stat -f -c %S /path/to/datadir (shows the filesystem block size)
4. Compare direct vs buffered throughput: dd if=/dev/zero of=test bs=4096 count=100000 oflag=direct
5. Check for EINVAL errors in application logs (alignment violation symptom)
6. Monitor dirty page writeback: watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'
7. PostgreSQL WAL: SHOW wal_sync_method; (open_datasync or open_sync uses O_DIRECT)
8. MySQL: SHOW VARIABLES LIKE 'innodb_flush_method'; (should be O_DIRECT for production)
Key Takeaways
- ✓O_DIRECT eliminates double buffering. Without it, a database with its own buffer pool stores every page twice: once in the application's managed cache and once in the kernel's page cache. On a 128 GB server with a 96 GB buffer pool, this can waste 30-40 GB of RAM holding redundant copies.
- ✓Alignment is not optional. The buffer address must be aligned to the filesystem block size (usually 512 or 4096 bytes), the file offset must be aligned, and the transfer size must be a multiple of the block size. Use posix_memalign(&buf, 4096, size) or aligned_alloc(4096, size). A misaligned O_DIRECT call returns EINVAL, not silently falls back to buffered I/O.
- ✓O_DIRECT does not mean durable. The data bypasses the page cache but may still sit in the storage device's volatile write cache. Combine O_DIRECT with fsync() for durability, or use O_DIRECT|O_DSYNC to get both bypass and synchronous writes in one flag combination.
- ✓Some filesystems handle O_DIRECT differently. ext4 silently falls back to buffered I/O in some cases (for example with data=journal mode). XFS has historically had the best O_DIRECT support and is the preferred filesystem for database workloads. btrfs supports O_DIRECT but may fall back to buffered I/O for compressed extents.
- ✓io_uring with IORING_OP_READ/WRITE and O_DIRECT gives the best of both worlds: no page cache overhead and no syscall-per-I/O overhead. The submission queue batches multiple direct I/O operations, and the kernel processes them asynchronously. This is the path modern databases like TigerBeetle and ScyllaDB are moving toward.
Common Pitfalls
- ✗Using O_DIRECT without alignment. The most common failure mode: allocating a buffer with malloc() (which returns 8 or 16 byte aligned memory) and passing it to a direct I/O read or write. The call fails with EINVAL. Always use posix_memalign() or aligned_alloc() with the filesystem block size as the alignment.
- ✗Assuming O_DIRECT means data is on disk. O_DIRECT bypasses the page cache, not the device write cache. Without fsync() or O_DSYNC, a power failure can lose data that O_DIRECT "wrote" successfully. This catches teams that switch from buffered+fsync to O_DIRECT and drop the fsync call.
- ✗Using O_DIRECT for small random reads in a general-purpose application. The page cache exists for a reason -- it absorbs repeated reads, coalesces small writes, and handles readahead. O_DIRECT makes sense when the application manages its own cache (databases) or when data has near-zero reuse (streaming writes). For typical application workloads, the page cache is strictly better.
- ✗Mixing O_DIRECT and buffered I/O on the same file from different processes. The page cache and direct I/O path can see stale data. One process writes via the page cache, another reads with O_DIRECT and gets old data because the dirty page has not been flushed yet. If multiple access patterns are required, use O_DIRECT consistently or add explicit synchronization.
- ✗Forgetting that O_DIRECT disables kernel readahead. Sequential reads through O_DIRECT get no automatic prefetching. The application must implement its own readahead by issuing larger reads or submitting multiple asynchronous read requests. Without this, sequential direct I/O throughput can be 50-70% lower than buffered I/O.
Reference
In One Line
O_DIRECT sends I/O straight to the device, skipping the page cache -- saving memory and removing double buffering when the application already manages its own cache.