io_uring: Modern Async I/O
Mental Model
Old drive-through: pull up, place one order, wait for it, then place the next. io_uring model: a shared order pad slides through the window. Write five orders on the pad, push it through. The kitchen works them in parallel and sets finished plates on a pickup shelf. Grab completed orders from the shelf whenever -- no line, no conversation, no waiting between submissions.
The Problem
An NVMe drive rated at 2 million IOPS, an application stuck at 200,000, and 90% of the drive sitting idle. The disk is not the bottleneck -- the CPU is, spending 200-500 ns per syscall just crossing the kernel boundary. At 200K IOPS that adds up to 40-100 ms of pure overhead per second, and the remaining 1.8 million IOPS go unused because the CPU cannot submit requests fast enough through the traditional syscall interface.
Architecture
Two million IOPS available, 200,000 delivered: the bottleneck is not disk speed. It is the cost of asking the kernel to do things. Every read(), every write(), every accept() -- each one is a syscall that crosses the user-kernel boundary, and each crossing costs hundreds of nanoseconds. Multiply by millions and the system spends more time asking for I/O than doing it.
io_uring was built to eliminate that tax. And it does something radical to get there.
What Actually Happens
io_uring (Linux 5.1+) replaces the traditional syscall-per-operation model with two shared-memory ring buffers between userspace and kernel.
Setup happens once. io_uring_setup() creates the ring buffers and returns an fd. The application mmap()s the Submission Queue (SQ), Completion Queue (CQ), and SQE array into its address space. Optionally, io_uring_register() pre-registers file descriptors and memory buffers to skip per-operation overhead.
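Here is what that setup looks like in practice through liburing, the helper library that wraps the raw syscall-and-mmap() dance. A minimal sketch -- the queue depth is illustrative, and error handling is kept to the essentials:

```c
// Minimal io_uring setup via liburing (compile with -luring).
// io_uring_queue_init() wraps io_uring_setup() plus the mmap() calls
// described above, mapping the SQ, CQ, and SQE array into our process.
#include <liburing.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    struct io_uring ring;

    // 256 = submission queue depth; flags = 0 means the default
    // (syscall-driven, non-SQPOLL) mode.
    int ret = io_uring_queue_init(256, &ring, 0);
    if (ret < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-ret));
        return 1;
    }

    // ... write SQEs, submit, reap CQEs (see below) ...

    io_uring_queue_exit(&ring);  // unmaps the rings and closes the fd
    return 0;
}
```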
Submitting I/O means writing a 64-byte SQE to the submission ring. The SQE specifies what to do (read, write, accept, connect, etc.), which fd, which buffer, and how much. Then bump the tail pointer. That is it.
Completing I/O means the kernel writes a 16-byte CQE to the completion ring with the result code and the correlation tag. The application checks the CQ head pointer to see if new completions arrived.
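Put together, one submit/complete round trip looks roughly like this. A sketch, assuming a `ring` initialized as above and an already-open `fd`; the tag value 42 is arbitrary, and the set_data64 helper needs liburing 2.2+:

```c
#include <liburing.h>

// Queue one read, submit it, and block for its completion.
static int read_once(struct io_uring *ring, int fd, char *buf, unsigned len) {
    // Grab a free SQE slot in the submission ring and fill in the
    // operation, fd, buffer, length, and file offset.
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);
    io_uring_sqe_set_data64(sqe, 42);   // correlation tag, echoed in the CQE

    io_uring_submit(ring);              // one io_uring_enter() for the batch

    // Wait for a CQE, read the result, then advance the CQ head pointer.
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(ring, &cqe);
    int res = cqe->res;                 // bytes read, or -errno on failure
    io_uring_cqe_seen(ring, cqe);       // marks the CQE as consumed
    return res;
}
```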
All pointer updates are atomic. In SQPOLL mode, no syscalls are involved at all.
Under the Hood
Three syscalls, total. io_uring_setup() creates the instance. io_uring_enter() tells the kernel to process new SQEs (and optionally wait for CQEs). io_uring_register() pre-registers resources. In the steady-state hot path, only io_uring_enter() is called -- and in SQPOLL mode, even that disappears.
SQPOLL mode is the most aggressive optimization. The kernel spawns a dedicated thread (io_uring-sq, visible in ps) that continuously polls the submission queue. The application writes SQEs and bumps the tail pointer. The kernel thread picks them up immediately. No syscall. No notification. The thread parks itself after sq_thread_idle milliseconds of inactivity and wakes when new SQEs appear.
Here is the catch: that polling thread consumes a full CPU core while active. For sustained high-IOPS workloads, the tradeoff is worth it. For bursty workloads, set sq_thread_idle appropriately.
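Opting in is a setup-time decision. A sketch, with an illustrative 2-second idle timeout (see Common Pitfalls for the privilege requirements):

```c
#include <liburing.h>
#include <string.h>

// Create a ring whose SQ is drained by a kernel polling thread.
int setup_sqpoll(struct io_uring *ring) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_SQPOLL;   // kernel thread polls the SQ for us
    p.sq_thread_idle = 2000;         // park the thread after 2000 ms idle

    // Afterward, io_uring_submit() usually just bumps the tail pointer;
    // liburing only falls back to io_uring_enter() when the poll thread
    // has parked and must be woken.
    return io_uring_queue_init_params(256, ring, &p);
}
```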
Fixed files and fixed buffers eliminate the remaining per-operation kernel overhead. Without registration, every I/O operation requires the kernel to look up the fd in the process's file table, increment reference counts, and pin user-space pages via get_user_pages(). With IORING_REGISTER_FILES and IORING_REGISTER_BUFFERS, this is done once at registration time. The difference is measurable at high IOPS -- this is what separates 1M IOPS from 2M IOPS.
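A sketch of both registrations together -- the single fd and buffer are illustrative, and error handling is elided:

```c
#include <liburing.h>
#include <sys/uio.h>

// Register one file and one buffer up front, then do a "fixed" read.
void fixed_read(struct io_uring *ring, int fd, void *buf, unsigned len) {
    int fds[1] = { fd };
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    io_uring_register_files(ring, fds, 1);     // fd table lookup done once
    io_uring_register_buffers(ring, &iov, 1);  // pages pinned once

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    // The first 0 is an *index* into the registered file table, not an fd;
    // the final 0 is the index into the registered buffer table.
    io_uring_prep_read_fixed(sqe, 0, buf, len, 0, 0);
    sqe->flags |= IOSQE_FIXED_FILE;            // treat the fd field as an index
    io_uring_submit(ring);
}
```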
Linked SQEs create operation chains. Set IOSQE_IO_LINK on an SQE and the next SQE in the ring only executes after the first completes. Chain open -> read -> close as three linked entries submitted in one batch. The kernel handles the sequencing without returning to userspace. If any link fails, subsequent entries get -ECANCELED.
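A sketch of a simpler two-entry chain -- write, then fsync -- which follows the same pattern as the three-entry chain above:

```c
#include <liburing.h>

// The fsync runs only after the write completes successfully.
void write_then_fsync(struct io_uring *ring, int fd,
                      const char *buf, unsigned len) {
    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, fd, buf, len, 0);
    sqe->flags |= IOSQE_IO_LINK;      // the next SQE waits for this one

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_fsync(sqe, fd, 0);  // gets -ECANCELED if the write fails

    io_uring_submit(ring);            // both entries go up in one batch
}
```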
The operation set goes far beyond read/write. io_uring supports accept, connect, recv, send, splice, tee, openat, close, statx, timeout, and more. This makes it a unified async interface that can replace epoll, AIO, and dozens of individual syscalls.
Common Questions
How much faster is io_uring than epoll for network workloads?
epoll needs three syscalls per event cycle: epoll_wait() to discover ready fds, then read()/recv() to get data, then write()/send() to respond. io_uring batches all of these as SQEs in a single io_uring_enter() call -- or zero syscalls in SQPOLL mode. For small-message workloads, the syscall reduction alone delivers 2-3x throughput improvement.
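To make the contrast concrete, here is an illustrative sketch of queueing receives for a batch of ready connections -- with epoll, each of these would be its own recv() syscall:

```c
#include <liburing.h>

// Queue a recv for each of n connections, then submit the whole batch
// with a single io_uring_enter() (or none at all under SQPOLL).
void queue_recv_batch(struct io_uring *ring, const int *fds,
                      char bufs[][4096], int n) {
    for (int i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_recv(sqe, fds[i], bufs[i], 4096, 0);
        io_uring_sqe_set_data64(sqe, (unsigned long long)i);  // which conn
    }
    io_uring_submit(ring);
    // Completions are reaped from the CQ; each response can then be
    // queued as a send SQE in the next batch.
}
```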
What happens when the CQ overflows?
This is where things break quietly. If CQEs are not drained fast enough, the kernel sets IORING_SQ_CQ_OVERFLOW and stores extras in an internal overflow list. If even that fills up, completions are dropped. Gone. There is no way to know an I/O completed. Size the CQ at 4x the SQ depth using IORING_SETUP_CQSIZE and drain aggressively.
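The CQ size is fixed at setup time. A sketch with a 256-deep SQ and the 4x CQ suggested above:

```c
#include <liburing.h>
#include <string.h>

// Ask for a completion ring four times deeper than the submission ring.
int setup_big_cq(struct io_uring *ring) {
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_CQSIZE;  // honor cq_entries instead of the 2x SQ default
    p.cq_entries = 1024;            // 4x the 256-entry SQ
    return io_uring_queue_init_params(256, ring, &p);
}
```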
Why have databases not all switched to io_uring?
Several reasons. Minimum kernel version (5.1+, really 5.10+ for stability) limits deployment. Some cloud providers disable io_uring in containers for security. The programming model is significantly more complex than synchronous I/O with a thread pool. And for many database workloads, the bottleneck is query processing, not I/O syscall overhead. Newer databases (TiKV, ScyllaDB) are adopting it. Legacy databases (PostgreSQL, MySQL) have deep investments in their current architecture.
What about buffered file I/O?
Here is the fine print. For O_DIRECT file I/O, io_uring submits directly to the block layer and gets true async completion. For buffered file I/O (page cache), true async is not possible because the page cache may need to allocate memory, read from disk, or take locks. io_uring handles this by offloading to its internal io-wq worker thread pool. The API looks async, but behind the scenes, worker threads are doing blocking I/O. For maximum performance, use O_DIRECT with io_uring and manage a custom buffer cache -- which is exactly what high-performance storage engines do.
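For reference, the O_DIRECT path looks like this -- a sketch assuming 4096-byte alignment (the real requirement is the device's logical block size), with cleanup and error handling elided:

```c
#define _GNU_SOURCE         // for O_DIRECT
#include <fcntl.h>
#include <stdlib.h>
#include <liburing.h>

// Queue one direct read that bypasses the page cache entirely, so the
// completion is truly async -- no io-wq worker threads involved.
int queue_direct_read(struct io_uring *ring, const char *path) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;

    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)  // aligned buffer required
        return -1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, fd, buf, 4096, 0);  // offset must also be aligned
    return io_uring_submit(ring);
}
```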
Why do some distros disable io_uring?
Security. io_uring's shared-memory interface bypasses traditional syscall auditing (seccomp BPF cannot easily filter io_uring operations). The attack surface is large. Google disables it for unprivileged users in container environments. Check /proc/sys/kernel/io_uring_disabled to see the system's policy.
How Technologies Use This
Redis throughput drops and all 100,000+ connected clients see latency spikes of 5-25ms during AOF persistence. The single-threaded event loop freezes every time fsync blocks waiting for the disk to confirm a write.
The bottleneck is that fsync is a synchronous syscall. When Redis calls it on the AOF file, the entire event loop stalls until the disk responds. Redis 7.0+ uses io_uring to submit fsync and file write operations asynchronously, batching multiple AOF entries into a single io_uring_enter call so the main event loop never blocks on disk.
Upgrade to Redis 7.0+ and enable io_uring-backed AOF persistence. This eliminates the fsync-induced latency spikes that previously forced operators to choose between data durability (appendfsync always) and performance (appendfsync everysec).
A Go service handling 500,000 concurrent connections is burning significant CPU despite low application logic overhead. Profiling reveals over 1.5 million syscalls per second, with most time spent crossing the kernel boundary rather than processing data.
The Go netpoller uses epoll, and each event cycle requires three separate syscalls: epoll_wait to discover ready fds, then read, then write. Third-party libraries like go-uring expose io_uring to batch all three operations into a single submission or eliminate syscalls entirely via SQPOLL mode, where a kernel thread polls the submission queue with zero syscalls in steady state.
Integrate go-uring for small-message, high-connection workloads to see 40-60% throughput improvement. There is active discussion about making io_uring the Go runtime's native network backend to benefit all Go programs automatically.
A file-heavy Node.js service chokes at high concurrency even with plenty of CPU available. HTTP responses stall, DNS lookups queue, and increasing UV_THREADPOOL_SIZE only delays the inevitable collapse under load.
The bottleneck is libuv's thread pool. Epoll cannot handle regular files, so all file reads and writes are dispatched to just 4 worker threads by default. When those 4 threads are blocked on disk, everything queues behind them. io_uring replaces this thread pool with true kernel-level async file I/O, where operations are submitted as SQEs and completed without any worker threads.
Adopt an io_uring-backed libuv build for file-heavy workloads. No UV_THREADPOOL_SIZE tuning is needed, and file-heavy Node.js apps see up to 3x throughput improvement because the bottleneck shifts from thread pool starvation to actual disk speed.
Same Concept Across Tech
| Technology | io_uring adoption status | Key detail |
|---|---|---|
| RocksDB | io_uring for async reads in compaction and flush (since 6.29) | Significant improvement on NVMe for compaction-heavy workloads |
| PostgreSQL | Experimental io_uring support for streaming reads (since PG 16) | Still behind the direct I/O + AIO path in maturity |
| TigerBeetle | Built on io_uring from the ground up for deterministic performance | Designed around the io_uring submission model |
| Rust (tokio) | tokio-uring crate provides io_uring integration | Complement to epoll-based default runtime |
| Go | No native io_uring support. Third-party libraries (go-uring) exist | Go runtime's netpoller uses epoll |
| Node.js | No native support. libuv still uses epoll + thread pool | Thread pool for disk I/O remains the default |
Evolution of Linux async I/O:
| Interface | Year | Limitation |
|---|---|---|
| POSIX AIO | 2001 | Implemented via thread pool (not truly async in glibc) |
| Linux AIO (libaio) | 2002 | Only works with O_DIRECT, no buffered I/O |
| epoll | 2002 | Network only, does not work with regular files |
| io_uring | 2019 | Full async for everything, at the cost of API complexity and security concerns |
Design Rationale
The bottleneck was never kernel I/O processing -- it was crossing the user-kernel boundary on every operation. epoll required three syscalls per event cycle (wait, read, write); Linux AIO only worked with O_DIRECT. Neither could keep pace with NVMe hardware delivering millions of IOPS. Shared mmap'd ring buffers with lock-free atomic pointer updates cut per-I/O overhead to a single memory store in the best case (SQPOLL mode). That buys the ability to saturate modern storage hardware, at the price of a significantly larger API surface and kernel attack exposure.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| NVMe at 10% utilization despite I/O-heavy workload | Syscall overhead limiting submission rate | strace -c to count syscalls per second |
| High system CPU (sy%) with I/O-bound application | Each I/O op is a separate syscall crossing the kernel boundary | Consider io_uring to batch submissions |
| io_uring_setup fails with ENOSYS | Kernel too old or io_uring disabled | uname -r (need 5.1+), check io_uring_disabled |
| io_uring_setup fails with EPERM | Security policy blocking io_uring (common in containers) | Check seccomp profile and io_uring_disabled |
| Application using libaio hits O_DIRECT requirement | Linux AIO only works with O_DIRECT, not buffered I/O | Switch to io_uring which works with both |
| Completion queue overflow | Not consuming completions fast enough, SQ depth too large | Increase CQ ring size or consume completions faster |
When to Use / Avoid
Use io_uring when:
- Saturating NVMe storage (millions of IOPS where syscall overhead matters)
- Need async disk I/O (epoll does not work with regular files)
- Building high-performance databases or storage engines
- Network I/O where every microsecond matters (DPDK alternative)
Avoid when:
- The application is not I/O bound (io_uring adds complexity for no benefit)
- Running on kernels older than 5.6 (incomplete feature set)
- Security is critical and the attack surface must be minimal (io_uring has had CVEs)
- epoll handles the network load fine and disk I/O is not a bottleneck
Try It Yourself
```bash
# Check whether io_uring is disabled (0 = enabled, 1 = disabled for unprivileged
# users, 2 = disabled for all). Some distros disable it for security.
cat /proc/sys/kernel/io_uring_disabled

# Benchmark random-read IOPS using io_uring with 128 in-flight requests and O_DIRECT
fio --name=test --ioengine=io_uring --rw=randread --bs=4k --size=1G --iodepth=128 --direct=1 --filename=/tmp/fio_test

# Check whether a process has io_uring instances (io_uring_fds field, Linux 6.1+)
grep io_uring /proc/$$/status 2>/dev/null

# io_uring's io-wq worker pool is bounded by this; relevant for buffered async I/O workloads
cat /proc/sys/kernel/threads-max

# Count io_uring_enter() syscalls over 5 seconds; in SQPOLL mode this should be near zero
perf stat -e 'syscalls:sys_enter_io_uring_enter' -p <PID> -- sleep 5

# Trace io_uring setup and submission syscalls to understand an application's io_uring usage pattern
strace -e io_uring_setup,io_uring_enter,io_uring_register -f <program>
```

Debug Checklist
1. Check if io_uring is available: cat /proc/sys/kernel/io_uring_disabled (0 = enabled)
2. Check kernel version: uname -r (need 5.1+ for basic, 5.6+ for full features)
3. Monitor io_uring activity: perf trace -e 'io_uring:*' -a -- sleep 5
4. Check if a process uses io_uring: strace -e io_uring_setup,io_uring_enter -p <pid>
5. Monitor submission rates: bpftrace -e 'tracepoint:io_uring:io_uring_submit_sqe { @++ }'
6. Check for io_uring security restrictions: grep io_uring /proc/<pid>/status
Key Takeaways
- ✓ SQPOLL mode spawns a kernel thread that drains the submission queue without any syscall. Your app writes an SQE, bumps a pointer, and the kernel picks it up. True zero-syscall I/O in steady state. This is how you saturate NVMe.
- ✓ Fixed files and fixed buffers (via io_uring_register) skip per-operation fd lookup and page pinning. For high-IOPS workloads on NVMe, this eliminates the remaining kernel overhead that was not syscall-related.
- ✓ Linked SQEs (IOSQE_IO_LINK) chain operations: read then process then write, each starting only after the previous completes. Complex I/O workflows without returning to userspace between steps.
- ✓ Unlike epoll, which only tells you an fd is ready and then requires separate read/write syscalls, io_uring handles the entire operation -- accept, recv, send -- as a single submitted entry. One interface for both file and network I/O.
- ✓ Size the CQ larger than the SQ (IORING_SETUP_CQSIZE). If completions arrive faster than you drain them, the CQ overflows and the kernel silently drops completions. This is one of the hardest bugs to debug.
Common Pitfalls
- ✗ Mistake: Not draining the CQ frequently enough. Reality: CQ overflow causes the kernel to drop completions and set IORING_SQ_CQ_OVERFLOW. You must call io_uring_enter(IORING_ENTER_GETEVENTS) to recover, and you may have already lost data.
- ✗ Mistake: Expecting true async behavior from buffered file I/O. Reality: Buffered I/O goes through io-wq worker threads internally. You get an async API, but workers consume CPU and memory behind the scenes. True async requires O_DIRECT.
- ✗ Mistake: Enabling SQPOLL mode without understanding the cost. Reality: The polling thread consumes a full CPU core even when idle unless you set sq_thread_idle to park it. Also requires root or CAP_SYS_NICE since Linux 5.12.
- ✗ Mistake: Skipping IORING_SETUP_COOP_TASKRUN in non-SQPOLL mode. Reality: Without it, the kernel uses inter-processor interrupts to signal completions, adding latency on multi-core systems. Available since Linux 5.19.
Reference
In One Line
Shared ring buffers replace the syscall-per-operation model -- use io_uring when NVMe hardware is faster than the CPU can submit requests through traditional interfaces.