io_uring: Modern Async I/O
Mental Model
Old drive-through: pull up, place one order, wait for it, then place the next. io_uring model: a shared order pad slides through the window. Write five orders on the pad, push it through. The kitchen works them in parallel and sets finished plates on a pickup shelf. Grab completed orders from the shelf whenever -- no line, no conversation, no waiting between submissions.
The Problem
An NVMe drive rated at 2 million IOPS, an application stuck at 200,000, and 90% of the drive sitting idle. The disk is not the bottleneck -- the CPU is, spending 200-500 ns per syscall just crossing the kernel boundary. At 200K IOPS that adds up to 40-100 ms of pure overhead per second, and the remaining 1.8 million IOPS go unused because the CPU cannot submit requests fast enough through the traditional syscall interface.
Architecture
Two million IOPS available, 200,000 delivered: the bottleneck is not disk speed. It is the cost of asking the kernel to do things. Every read(), every write(), every accept() -- each one is a syscall that crosses the user-kernel boundary, and each crossing costs hundreds of nanoseconds. Multiply by millions and the system spends more time asking for I/O than doing it.
io_uring was built to eliminate that tax. And it does something radical to get there.
What Actually Happens
io_uring (Linux 5.1+) replaces the traditional syscall-per-operation model with two shared-memory ring buffers between userspace and kernel.
Setup happens once. io_uring_setup() creates the ring buffers and returns an fd. The application mmap()s the Submission Queue (SQ), Completion Queue (CQ), and SQE array into its address space. Optionally, io_uring_register() pre-registers file descriptors and memory buffers to skip per-operation overhead.
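Here is what that setup looks like in practice through liburing, the helper library that wraps the raw syscall-and-mmap() dance. A minimal sketch -- the queue depth is illustrative, and error handling is kept to the essentials:

```c
// Minimal io_uring setup via liburing (compile with -luring).
// io_uring_queue_init() wraps io_uring_setup() plus the mmap() calls
// described above, mapping the SQ, CQ, and SQE array into our process.
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct io_uring ring;

    // 256 = submission queue depth; flags = 0 means the default
    // (syscall-driven, non-SQPOLL) mode.
    int ret = io_uring_queue_init(256, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // ... write SQEs, submit, reap CQEs (see below) ...

    io_uring_queue_exit(&ring);  // unmaps the rings and closes the fd
    return 0;
}
```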
Submitting I/O means writing a 64-byte SQE to the submission ring. The SQE specifies what to do (read, write, accept, connect, etc.), which fd, which buffer, and how much. Then bump the tail pointer. That is it.
Completing I/O means the kernel writes a 16-byte CQE to the completion ring with the result code and the correlation tag. The application checks the CQ head pointer to see if new completions arrived.
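Put together, one submit/complete round trip looks roughly like this. A sketch, assuming a `ring` initialized as above and an already-open `fd`; the tag value 42 is arbitrary, and the set_data64 helper needs liburing 2.2+:

```c
#include <liburing.h>

// Queue one read, submit it, and block for its completion.
static int read_once(struct io_uring *ring, int fd, char *buf, unsigned len) {
    // Grab a free SQE slot in the submission ring and fill in the
    // operation, fd, buffer, length, and file offset.
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);
    io_uring_sqe_set_data64(sqe, 42);   // correlation tag, echoed in the CQE

    io_uring_submit(ring);              // one io_uring_enter() for the batch

    // Wait for a CQE, read the result, then advance the CQ head pointer.
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);
    int res = cqe->res;                 // bytes read, or -errno on failure
    io_uring_cqe_seen(ring, cqe);       // marks the CQE as consumed
    return res;
}
```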
All pointer updates are atomic. In SQPOLL mode, no syscalls are involved at all.
Under the Hood
Three syscalls, total. io_uring_setup() creates the instance. io_uring_enter() tells the kernel to process new SQEs (and optionally wait for CQEs). io_uring_register() pre-registers resources. In the steady-state hot path, only io_uring_enter() is called -- and in SQPOLL mode, even that disappears.
SQPOLL mode is the most aggressive optimization. The kernel spawns a dedicated thread (io_uring-sq, visible in ps) that continuously polls the submission queue. The application writes SQEs and bumps the tail pointer. The kernel thread picks them up immediately. No syscall. No notification. The thread parks itself after sq_thread_idle milliseconds of inactivity and wakes when new SQEs appear.
Here is the catch: that polling thread consumes a full CPU core while active. For sustained high-IOPS workloads, the tradeoff is worth it. For bursty workloads, set sq_thread_idle appropriately.
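Opting in is a setup-time decision. A sketch, with an illustrative 2-second idle timeout (see Common Pitfalls for the privilege requirements):

```c
#include <liburing.h>
#include <string.h>

// Create a ring whose SQ is drained by a kernel polling thread.
int setup_sqpoll(struct io_uring *ring) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;   // kernel thread polls the SQ for us
    p.sq_thread_idle = 2000;         // park the thread after 2000 ms idle

    // Afterward, io_uring_submit() usually just bumps the tail pointer;
    // liburing only falls back to io_uring_enter() when the poll thread
    // has parked and must be woken.
    return io_uring_queue_init_params(256, ring, &p);
}
```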
Fixed files and fixed buffers eliminate the remaining per-operation kernel overhead. Without registration, every I/O operation requires the kernel to look up the fd in the process's file table, increment reference counts, and pin user-space pages via get_user_pages(). With IORING_REGISTER_FILES and IORING_REGISTER_BUFFERS, this is done once at registration time. The difference is measurable at high IOPS -- this is what separates 1M IOPS from 2M IOPS.
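A sketch of both registrations together -- the single fd and buffer are illustrative, and error handling is elided:

```c
#include <liburing.h>
#include <sys/uio.h>

// Register one file and one buffer up front, then do a "fixed" read.
void fixed_read(struct io_uring *ring, int fd, void *buf, unsigned len) {
    int fds[1] = { fd };
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    io_uring_register_files(ring, fds, 1);     // fd table lookup done once
    io_uring_register_buffers(ring, &iov, 1);  // pages pinned once

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    // The first 0 is an *index* into the registered file table, not an fd;
    // the final 0 is the index into the registered buffer table.
    io_uring_prep_read_fixed(sqe, 0, buf, len, 0, 0);
    sqe->flags |= IOSQE_FIXED_FILE;            // treat the fd field as an index
    io_uring_submit(ring);
}
```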
Linked SQEs create operation chains. Set IOSQE_IO_LINK on an SQE and the next SQE in the ring only executes after the first completes. Chain open -> read -> close as three linked entries submitted in one batch. The kernel handles the sequencing without returning to userspace. If any link fails, subsequent entries get -ECANCELED.
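A sketch of a simpler two-entry chain -- write, then fsync -- which follows the same pattern as the three-entry chain above:

```c
#include <liburing.h>

// The fsync runs only after the write completes successfully.
void write_then_fsync(struct io_uring *ring, int fd,
                      const char *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;      // the next SQE waits for this one

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, 0);  // gets -ECANCELED if the write fails

    io_uring_submit(ring);            // both entries go up in one batch
}
```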
The operation set goes far beyond read/write. io_uring supports accept, connect, recv, send, splice, tee, openat, close, statx, timeout, and more. This makes it a unified async interface that can replace epoll, AIO, and dozens of individual syscalls.
Common Questions
How much faster is io_uring than epoll for network workloads?
epoll needs three syscalls per event cycle: epoll_wait() to discover ready fds, then read()/recv() to get data, then write()/send() to respond. io_uring batches all of these as SQEs in a single io_uring_enter() call -- or zero syscalls in SQPOLL mode. For small-message workloads, the syscall reduction alone delivers 2-3x throughput improvement.
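To make the contrast concrete, here is an illustrative sketch of queueing receives for a batch of ready connections -- with epoll, each of these would be its own recv() syscall:

```c
#include <liburing.h>

// Queue a recv for each of n connections, then submit the whole batch
// with a single io_uring_enter() (or none at all under SQPOLL).
void queue_recv_batch(struct io_uring *ring, const int *fds,
                      char bufs[][4096], int n) {
    for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, fds[i], bufs[i], 4096, 0);
        io_uring_sqe_set_data64(sqe, (unsigned long long)i);  // which conn
    }
    io_uring_submit(ring);
    // Completions are reaped from the CQ; each response can then be
    // queued as a send SQE in the next batch.
}
```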
What happens when the CQ overflows?
This is where things break quietly. If CQEs are not drained fast enough, the kernel sets IORING_SQ_CQ_OVERFLOW and stores extras in an internal overflow list. If even that fills up, completions are dropped. Gone. There is no way to know an I/O completed. Size the CQ at 4x the SQ depth using IORING_SETUP_CQSIZE and drain aggressively.
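The CQ size is fixed at setup time. A sketch with a 256-deep SQ and the 4x CQ suggested above:

```c
#include <liburing.h>
#include <string.h>

// Ask for a completion ring four times deeper than the submission ring.
int setup_big_cq(struct io_uring *ring) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_CQSIZE;  // honor cq_entries instead of the 2x SQ default
    p.cq_entries = 1024;            // 4x the 256-entry SQ
    return io_uring_queue_init_params(256, ring, &p);
}
```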
Why have databases not all switched to io_uring?
Several reasons. Minimum kernel version (5.1+, really 5.10+ for stability) limits deployment. Some cloud providers disable io_uring in containers for security. The programming model is significantly more complex than synchronous I/O with a thread pool. And for many database workloads, the bottleneck is query processing, not I/O syscall overhead. Newer databases (TiKV, ScyllaDB) are adopting it. Legacy databases (PostgreSQL, MySQL) have deep investments in their current architecture.
What about buffered file I/O?
Here is the fine print. For O_DIRECT file I/O, io_uring submits directly to the block layer and gets true async completion. For buffered file I/O (page cache), true async is not possible because the page cache may need to allocate memory, read from disk, or take locks. io_uring handles this by offloading to its internal io-wq worker thread pool. The API looks async, but behind the scenes, worker threads are doing blocking I/O. For maximum performance, use O_DIRECT with io_uring and manage a custom buffer cache -- which is exactly what high-performance storage engines do.
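For reference, the O_DIRECT path looks like this -- a sketch assuming 4096-byte alignment (the real requirement is the device's logical block size), with cleanup and error handling elided:

```c
#define _GNU_SOURCE         // for O_DIRECT
#include <fcntl.h>
#include <stdlib.h>
#include <liburing.h>

// Queue one direct read that bypasses the page cache entirely, so the
// completion is truly async -- no io-wq worker threads involved.
int queue_direct_read(struct io_uring *ring, const char *path) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)  // aligned buffer required
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);  // offset must also be aligned
    return io_uring_submit(ring);
}
```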
Why do some distros disable io_uring?
Security. io_uring's shared-memory interface bypasses traditional syscall auditing (seccomp BPF cannot easily filter io_uring operations). The attack surface is large. Google disables it for unprivileged users in container environments. Check /proc/sys/kernel/io_uring_disabled to see the system's policy.
How Technologies Use This
Redis throughput drops and all 100,000+ connected clients see latency spikes of 5-25ms during AOF persistence. The single-threaded event loop freezes every time fsync blocks waiting for the disk to confirm a write.
The bottleneck is that fsync is a synchronous syscall. When Redis calls it on the AOF file, the entire event loop stalls until the disk responds. Redis 7.0+ uses io_uring to submit fsync and file write operations asynchronously, batching multiple AOF entries into a single io_uring_enter call so the main event loop never blocks on disk.
Upgrade to Redis 7.0+ and enable io_uring-backed AOF persistence. This eliminates the fsync-induced latency spikes that previously forced operators to choose between data durability (appendfsync always) and performance (appendfsync everysec).
A Go service handling 500,000 concurrent connections is burning significant CPU despite low application logic overhead. Profiling reveals over 1.5 million syscalls per second, with most time spent crossing the kernel boundary rather than processing data.
The Go netpoller uses epoll, and each event cycle requires three separate syscalls: epoll_wait to discover ready fds, then read, then write. Third-party libraries like go-uring expose io_uring to batch all three operations into a single submission or eliminate syscalls entirely via SQPOLL mode, where a kernel thread polls the submission queue with zero syscalls in steady state.
Integrate go-uring for small-message, high-connection workloads to see 40-60% throughput improvement. There is active discussion about making io_uring the Go runtime's native network backend to benefit all Go programs automatically.
A file-heavy Node.js service chokes at high concurrency even with plenty of CPU available. HTTP responses stall, DNS lookups queue, and increasing UV_THREADPOOL_SIZE only delays the inevitable collapse under load.
The bottleneck is libuv's thread pool. Epoll cannot handle regular files, so all file reads and writes are dispatched to just 4 worker threads by default. When those 4 threads are blocked on disk, everything queues behind them. io_uring replaces this thread pool with true kernel-level async file I/O, where operations are submitted as SQEs and completed without any worker threads.
Adopt an io_uring-backed libuv build for file-heavy workloads. No UV_THREADPOOL_SIZE tuning is needed, and file-heavy Node.js apps see up to 3x throughput improvement because the bottleneck shifts from thread pool starvation to actual disk speed.
Same Concept Across Tech
| Technology | io_uring adoption status | Key detail |
|---|---|---|
| RocksDB | io_uring for async reads in compaction and flush (since 6.29) | Significant improvement on NVMe for compaction-heavy workloads |
| PostgreSQL | Experimental io_uring support for streaming reads (since PG 16) | Still behind the direct I/O + AIO path in maturity |
| TigerBeetle | Built on io_uring from the ground up for deterministic performance | Designed around the io_uring submission model |
| Rust (tokio) | tokio-uring crate provides io_uring integration | Complement to epoll-based default runtime |
| Go | No native io_uring support. Third-party libraries (go-uring) exist | Go runtime's netpoller uses epoll |
| Node.js | No native support. libuv still uses epoll + thread pool | Thread pool for disk I/O remains the default |
Evolution of Linux async I/O:
| Interface | Year | Limitation |
|---|---|---|
| POSIX AIO | 2001 | Implemented via thread pool (not truly async in glibc) |
| Linux AIO (libaio) | 2002 | Only works with O_DIRECT, no buffered I/O |
| epoll | 2002 | Network only, does not work with regular files |
| io_uring | 2019 | Full async for everything, at the cost of API complexity and security concerns |
Design Rationale
The bottleneck was never kernel I/O processing -- it was crossing the user-kernel boundary on every operation. epoll required three syscalls per event cycle (wait, read, write); Linux AIO only worked with O_DIRECT. Neither could keep pace with NVMe hardware delivering millions of IOPS. Shared mmap'd ring buffers with lock-free atomic pointer updates cut per-I/O overhead to a single memory store in the best case (SQPOLL mode). That buys the ability to saturate modern storage hardware, at the price of a significantly larger API surface and kernel attack exposure.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| NVMe at 10% utilization despite I/O-heavy workload | Syscall overhead limiting submission rate | strace -c to count syscalls per second |
| High system CPU (sy%) with I/O-bound application | Each I/O op is a separate syscall crossing the kernel boundary | Consider io_uring to batch submissions |
| io_uring_setup fails with ENOSYS | Kernel too old or io_uring disabled | uname -r (need 5.1+), check io_uring_disabled |
| io_uring_setup fails with EPERM | Security policy blocking io_uring (common in containers) | Check seccomp profile and io_uring_disabled |
| Application using libaio hits O_DIRECT requirement | Linux AIO only works with O_DIRECT, not buffered I/O | Switch to io_uring which works with both |
| Completion queue overflow | Not consuming completions fast enough, SQ depth too large | Increase CQ ring size or consume completions faster |
When to Use / Avoid
Use io_uring when:
- Saturating NVMe storage (millions of IOPS where syscall overhead matters)
- Need async disk I/O (epoll does not work with regular files)
- Building high-performance databases or storage engines
- Network I/O where every microsecond matters (DPDK alternative)
Avoid when:
- The application is not I/O bound (io_uring adds complexity for no benefit)
- Running on kernels older than 5.6 (incomplete feature set)
- Security is critical and the attack surface must be minimal (io_uring has had CVEs)
- epoll handles the network load fine and disk I/O is not a bottleneck
Try It Yourself
```bash
# Check whether io_uring is disabled (0 = enabled, 1 = disabled for unprivileged
# users, 2 = disabled for all). Some distros disable it for security.
cat /proc/sys/kernel/io_uring_disabled

# Benchmark random-read IOPS using io_uring with 128 in-flight requests and O_DIRECT
fio --name=test --ioengine=io_uring --rw=randread --bs=4k --size=1G --iodepth=128 --direct=1 --filename=/tmp/fio_test

# Check whether a process has io_uring instances (io_uring_fds field, Linux 6.1+)
grep io_uring /proc/$$/status 2>/dev/null

# io_uring's io-wq worker pool is bounded by this; relevant for buffered async I/O workloads
cat /proc/sys/kernel/threads-max

# Count io_uring_enter() syscalls over 5 seconds; in SQPOLL mode this should be near zero
perf stat -e 'syscalls:sys_enter_io_uring_enter' -p <PID> -- sleep 5

# Trace io_uring setup and submission syscalls to understand an application's io_uring usage pattern
strace -e io_uring_setup,io_uring_enter,io_uring_register -f <program>
```

Debug Checklist
1. Check if io_uring is available: cat /proc/sys/kernel/io_uring_disabled (0 = enabled)
2. Check kernel version: uname -r (need 5.1+ for basic, 5.6+ for full features)
3. Monitor io_uring activity: perf trace -e 'io_uring:*' -a -- sleep 5
4. Check if a process uses io_uring: strace -e io_uring_setup,io_uring_enter -p <pid>
5. Monitor submission rates: bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe { @++ }'
6. Check for io_uring security restrictions: grep io_uring /proc/<pid>/status
Key Takeaways
- ✓ SQPOLL mode spawns a kernel thread that drains the submission queue without any syscall. Your app writes an SQE, bumps a pointer, and the kernel picks it up. True zero-syscall I/O in steady state. This is how you saturate NVMe.
- ✓ Fixed files and fixed buffers (via io_uring_register) skip per-operation fd lookup and page pinning. For high-IOPS workloads on NVMe, this eliminates the remaining kernel overhead that was not syscall-related.
- ✓ Linked SQEs (IOSQE_IO_LINK) chain operations: read then process then write, each starting only after the previous completes. Complex I/O workflows without returning to userspace between steps.
- ✓ Unlike epoll, which only tells you an fd is ready and then requires separate read/write syscalls, io_uring handles the entire operation -- accept, recv, send -- as a single submitted entry. One interface for both file and network I/O.
- ✓ Size the CQ larger than the SQ (IORING_SETUP_CQSIZE). If completions arrive faster than you drain them, the CQ overflows and the kernel silently drops completions. This is one of the hardest bugs to debug.
Common Pitfalls
- ✗ Mistake: Not draining the CQ frequently enough. Reality: CQ overflow causes the kernel to drop completions and set IORING_SQ_CQ_OVERFLOW. You must call io_uring_enter(IORING_ENTER_GETEVENTS) to recover, and you may have already lost data.
- ✗ Mistake: Expecting true async behavior from buffered file I/O. Reality: Buffered I/O goes through io-wq worker threads internally. You get an async API, but workers consume CPU and memory behind the scenes. True async requires O_DIRECT.
- ✗ Mistake: Enabling SQPOLL mode without understanding the cost. Reality: The polling thread consumes a full CPU core even when idle unless you set sq_thread_idle to park it. Also requires root or CAP_SYS_NICE since Linux 5.12.
- ✗ Mistake: Skipping IORING_SETUP_COOP_TASKRUN in non-SQPOLL mode. Reality: Without it, the kernel uses inter-processor interrupts to signal completions, adding latency on multi-core systems. Available since Linux 5.19.
Reference
In One Line
Shared ring buffers replace the syscall-per-operation model -- use io_uring when NVMe hardware is faster than the CPU can submit requests through traditional interfaces.