I/O Models: Blocking, Non-Blocking, Async
Mental Model
Five ways to wait for food at a restaurant. Blocking: stand at the counter staring until the plate appears. Non-blocking: wander past every few seconds -- "ready yet?" Select/poll: sit in a room of buzzers and check each one. Epoll: the PA calls out the specific table when the order is done. io_uring: slide a list of orders through a window and pick up finished plates from a shelf -- no counter visits, no conversations, no waiting.
The Problem
Ten thousand users connected to a chat server; five are typing at any given moment. Thread-per-connection: 10,000 threads, 80 GB of stack, nearly all sleeping. select/poll: one thread, but it scans every socket every iteration -- O(n) per pass, burning cycles on 9,995 idle connections. Throughput plateaus at a few thousand connections while the hardware sits underutilized.
Architecture
A server has 10,000 open connections. Right now, 5 of them have data to read. The other 9,995 are idle.
How the code waits for those 5 determines everything. Whether the server handles 10 million connections or crashes at 10 thousand. Whether 10,000 threads burn doing nothing or a single thread does everything.
This is the I/O model problem. And every high-performance system in production has already solved it -- differently.
What Actually Happens
Linux offers five distinct I/O models, each with different kernel-to-userspace notification semantics.
Blocking I/O is the default. The thread calls read() and goes to sleep on the socket's wait queue. When data arrives, the kernel wakes the thread, copies data to the buffer, and read() returns. Simple, but it requires one thread per connection.
Non-blocking I/O uses O_NONBLOCK. If no data is available, read() returns immediately with EAGAIN. The code must retry. This avoids blocking but wastes CPU spinning on idle connections.
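The EAGAIN contract is easy to see with a few lines of Python on a pipe -- a sketch, not a server, but the flag and the errno are the same ones a C program deals with:

```python
import errno
import os

# Create a pipe and mark the read end non-blocking.
# os.set_blocking() sets O_NONBLOCK via fcntl; the same flag works on sockets.
r, w = os.pipe()
os.set_blocking(r, False)

# No data yet: a blocking read would sleep; a non-blocking read fails fast.
try:
    os.read(r, 1024)
except BlockingIOError as e:
    print("EAGAIN:", e.errno == errno.EAGAIN)  # → EAGAIN: True

# Once data arrives, the same call succeeds immediately.
os.write(w, b"hello")
print(os.read(r, 1024))  # → b'hello'
```

The retry burden is exactly the "wasted CPU" problem: without a notification mechanism, the only way to know when to call `read()` again is to keep calling it.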
I/O multiplexing -- select, poll, epoll -- lets one thread monitor many fds. Block until ANY of them has data, then handle just the ready ones. This is the model that powers every serious network server.
Signal-driven I/O (SIGIO) delivers a signal when an fd becomes ready. Rarely used in modern code -- signals have limited queuing, restricted handler safety, and complex per-fd management.
Asynchronous I/O (POSIX AIO, io_uring) goes further. Operations are submitted to the kernel, and completion notifications arrive later. No blocking at any point. io_uring is the modern implementation of this model.
Under the Hood
Here is where things break without understanding the internals.
select() and poll() have a fundamental scaling problem. On every call, they copy the entire fd set from userspace to kernel, and the kernel scans all fds to check readiness. With 10,000 fds and 5 ready, both still scan all 10,000. That is O(n) per call, even when almost nothing is happening.
select() has an additional hard limit: FD_SETSIZE (typically 1024). Using fds above 1023 with select() corrupts its internal bitmap. This is not a graceful error. It is a buffer overflow.
epoll changes the architecture. Registration and waiting are separate operations. epoll_ctl() registers an fd once and installs a callback (ep_poll_callback) on the fd's wait queue. When data arrives, the network stack's softirq handler fires that callback, which adds the fd to epoll's ready list. Calling epoll_wait() only drains the ready list -- no scanning.
The internal data structures matter:
- An RB-tree (red-black tree) stores all monitored fds for O(log n) add/remove.
- A ready list (doubly-linked list) holds fds with pending events for O(1) access.
epoll_wait() cost is O(k), where k = number of ready fds. Not n. That is the entire point.
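The O(k) behavior is visible from userspace with the stdlib `select.epoll` wrapper -- a sketch: register a few hundred fds, make exactly five readable, and epoll_wait reports exactly five:

```python
import os
import select

# Register 400 pipe read-ends with one epoll instance.
# epoll_ctl(EPOLL_CTL_ADD) happens once per fd, at registration time.
ep = select.epoll()
pipes = [os.pipe() for _ in range(400)]
for r, w in pipes:
    ep.register(r, select.EPOLLIN)

# Make only 5 of the 400 readable.
for i in (3, 42, 100, 250, 399):
    os.write(pipes[i][1], b"x")

# epoll_wait() drains the ready list: it reports the 5 ready fds
# without scanning the other 395 registered ones.
events = ep.poll(timeout=0)
print(len(events))  # → 5
```

The same experiment with select() would hand the kernel all 400 fds on every call and get back a scan of all 400.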
Level-triggered vs edge-triggered is where correctness bugs live. LT (default) re-reports an fd as ready on every epoll_wait() call as long as data remains in the buffer. Safe. Forgiving. Partial reads are fine -- the rest is picked up on the next call.
ET only reports when the state transitions from not-ready to ready. Reading half the buffer and returning to epoll_wait() means no notification about the other half. It sits there until new data arrives and triggers another transition. This is why ET requires non-blocking sockets and a read loop that drains until EAGAIN.
ET reduces wakeups. LT reduces bugs. The tradeoff depends on the workload.
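The ET failure mode and its fix both fit in a short sketch. A partial read leaves data stranded; the drain-until-EAGAIN loop is the required pattern:

```python
import os
import select

r, w = os.pipe()
os.set_blocking(r, False)            # ET requires non-blocking fds

ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLET)   # edge-triggered

os.write(w, b"a" * 10)               # not-ready -> ready transition
print(len(ep.poll(timeout=0)))       # → 1: the transition is reported

# Read only part of the buffer, then poll again: ET stays silent even
# though 5 bytes are still waiting. This is the classic ET bug.
os.read(r, 5)
print(len(ep.poll(timeout=0)))       # → 0

# The correct ET pattern: drain until read() fails with EAGAIN.
def drain(fd):
    chunks = []
    while True:
        try:
            chunks.append(os.read(fd, 4096))
        except BlockingIOError:      # errno EAGAIN: buffer is empty
            return b"".join(chunks)

print(drain(r))                      # → b'aaaaa' (the stranded 5 bytes)
```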
Common Questions
Why is epoll O(1) while select and poll are O(n)?
select/poll copy the entire fd set to kernel space on every call and linearly scan all fds. epoll separates registration (done once) from waiting. Callbacks fire asynchronously when data arrives, populating a ready list. epoll_wait() just checks that list. The cost is proportional to ready fds, not total fds.
What is the thundering herd problem?
When multiple threads call epoll_wait() on the same listening socket, a new connection wakes ALL of them. Only one can successfully accept(). The rest waste a context switch. Solutions: EPOLLEXCLUSIVE (Linux 4.5+) wakes only one waiter. SO_REUSEPORT gives each thread its own socket. EPOLLONESHOT with explicit re-arm. Or accept in a single thread and dispatch to workers.
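Of those solutions, the EPOLLONESHOT disarm/re-arm cycle is the easiest to sketch with stdlib Python (EPOLLEXCLUSIVE and SO_REUSEPORT are not shown here):

```python
import os
import select

# EPOLLONESHOT: the fd is reported once, then disarmed until explicitly
# re-armed with epoll_ctl(EPOLL_CTL_MOD) -- ep.modify() in Python.
r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLONESHOT)

os.write(w, b"first")
print(len(ep.poll(timeout=0)))   # → 1: the one shot fires

os.write(w, b"second")
print(len(ep.poll(timeout=0)))   # → 0: disarmed, even though data is waiting

ep.modify(r, select.EPOLLIN | select.EPOLLONESHOT)   # explicit re-arm
print(len(ep.poll(timeout=0)))   # → 1: armed again, pending data reported
```

In a multithreaded server the re-arm happens after the owning thread finishes with the fd, which guarantees no second thread sees it in between.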
When is blocking I/O actually better than epoll?
When there are few connections doing heavy per-connection work. Database query execution, video transcoding, file processing -- if the bottleneck is computation, not connection management, then the overhead of non-blocking state machines and epoll syscalls is wasted complexity. Below ~1000 connections with CPU-heavy workloads, a thread pool with blocking I/O is often simpler and faster.
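The "simpler" claim is concrete: with blocking I/O and a thread pool, there is no event loop and no per-connection state machine -- the OS scheduler does the multiplexing. A sketch (the file names and sizes are made up for illustration):

```python
import concurrent.futures
import hashlib
import os

# A blocking worker: plain read(), then CPU-heavy work on the result.
def process(path):
    with open(path, "rb") as f:          # ordinary blocking I/O
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical workload: a handful of scratch files, heavy per-item work.
paths = []
for i in range(8):
    p = f"/tmp/blocking_demo_{i}.bin"
    with open(p, "wb") as f:
        f.write(os.urandom(1 << 16))
    paths.append(p)

# Four threads, each running the blocking worker to completion.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(process, paths))

print(len(digests))  # → 8

for p in paths:      # clean up the scratch files
    os.remove(p)
```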
How does Node.js handle file I/O if epoll cannot work with regular files?
Regular files are always "ready" from epoll's perspective. They never block on readiness, only on completion (disk latency). So epoll returns immediately for files, which defeats the purpose. libuv handles this with a thread pool (default 4 threads): file operations run as blocking I/O in worker threads, and completions are signaled back to the event loop via eventfd. Network I/O uses native epoll. It is a hybrid architecture.
How Technologies Use This
A Kafka broker with 10,000 producer and consumer connections is burning 80 GB of memory and drowning the scheduler in context switches. Fewer than 100 connections have data at any instant, yet 10,000 threads sit idle consuming 8 MB of stack each.
The root cause is a thread-per-connection model where every idle socket pins a sleeping thread. Kafka eliminates this by wrapping epoll via Java NIO Selectors, multiplexing all 10,000 connections onto just 3 network threads. Each thread calls epoll_wait and processes only the sockets that have data ready.
Configure num.network.threads to match the CPU core count and leave disk I/O blocking so the page cache absorbs write bursts. This keeps CPU usage under 5% on connection handling even at peak connection counts.
An Nginx server with 10,000 connected clients is consuming 80 GB of stack memory and spending most of its time in context switches. Only 50 clients are actively sending requests, yet every idle connection holds a dedicated thread.
The thread-per-connection model wastes resources on sleeping threads. Nginx replaces it with one non-blocking event loop per worker using edge-triggered epoll. The worker sleeps in epoll_wait until sockets transition to ready, then drains them in a tight loop until EAGAIN. EPOLLEXCLUSIVE ensures only one worker wakes per new connection, eliminating thundering herd.
Set worker_connections to match the expected concurrency and use edge-triggered mode for maximum efficiency. A single Nginx worker routinely handles 10,000+ concurrent connections at under 2% CPU.
A file-heavy Node.js application randomly blocks the event loop even though libuv claims to use non-blocking I/O. HTTP responses stall, DNS lookups time out, and the process appears frozen for hundreds of milliseconds at a time.
The hidden cause is that epoll cannot do async file I/O. Regular files always report ready, so epoll_wait returns immediately and the actual disk read blocks the thread. libuv works around this with a hybrid model: network sockets go through epoll, while file reads and writes are dispatched to a thread pool of just 4 workers by default. When all 4 threads are blocked on slow disk I/O, everything queues behind them, including DNS lookups.
Set UV_THREADPOOL_SIZE to 64 or higher for file-heavy services and signal completions back to the main loop via eventfd. The lesson is that Node.js non-blocking I/O is only truly non-blocking for network sockets, not files.
Same Concept Across Tech
| Technology | I/O model used | Why this choice |
|---|---|---|
| Node.js | epoll (via libuv) for network, thread pool for disk | Event loop for network concurrency, threads for blocking disk I/O |
| Go | epoll (via netpoller) for network, goroutines for everything | Goroutines give the illusion of blocking I/O with epoll underneath |
| Nginx | epoll (edge-triggered) per worker process | Maximum network concurrency with minimal threads |
| Redis | epoll (level-triggered), single thread | Proves one thread + epoll handles 100K+ ops/sec |
| Java (Netty) | epoll via NIO Selector, optional native epoll transport | NIO Selector is the portable path, native epoll is the fast path |
| Rust (Tokio) | epoll (via mio), io_uring support emerging | Async/await model wraps epoll in ergonomic syntax |
Progression of I/O models (why each was invented):
| Model | Problem it solved | New problem it created |
|---|---|---|
| Blocking | Simple programming model | Does not scale past a few thousand threads |
| Non-blocking | No threads blocked | Busy-loop polling wastes CPU |
| select/poll | No busy loop, kernel notifies | O(n) scan of all fds every time |
| epoll | O(ready) instead of O(total) | Does not work with regular files |
| io_uring | Async everything including files | High complexity, security attack surface |
Design Rationale
Each model appeared because the previous one hit a wall that could not be patched. Blocking I/O collapsed under memory pressure at a few thousand threads. select/poll eliminated threads but scanned every fd every call because the kernel kept no state between invocations. epoll split registration from waiting, maintaining a persistent ready list fed by callbacks -- but regular files are always "ready" (they block on completion, not readiness), so epoll could not help with disk I/O at all. io_uring closed that last gap by moving the entire operation -- submission, execution, completion -- into the kernel, trading a much larger API surface and attack profile for the ability to finally do async disk I/O without a thread pool.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Thousands of threads, most sleeping | Thread-per-connection model | ls /proc/PID/task |
| High CPU with mostly idle connections | Using select/poll scanning all fds | strace -c to see select/poll syscall count |
| Server hits wall at ~1000 connections | FD_SETSIZE limit on select, or thread stack memory | Check ulimit -n and thread count |
| Disk I/O blocks the event loop | epoll does not work with regular files, blocking the single thread | Move file I/O to thread pool or use io_uring |
| Latency spikes every few milliseconds | Context switching between too many threads | pidstat -w, reduce thread count |
| Application works on macOS but breaks on Linux | kqueue vs epoll differences in edge cases | Check for platform-specific behavior in event library |
When to Use / Avoid
Blocking I/O (thread-per-connection):
- Simple protocol, low connection count (under 1,000)
- Code clarity matters more than scalability
select/poll:
- Cross-platform compatibility needed, moderate connection count
epoll:
- Linux-only, thousands to millions of connections, most idle
- The standard for production servers (Nginx, Redis, Node.js, Go)
io_uring:
- Need async disk I/O (epoll does not work with regular files)
- Highest performance requirements, willing to accept complexity
Avoid:
- Thread-per-connection above a few thousand connections (stack memory and context switch overhead)
- select with more than 1,024 fds (FD_SETSIZE limit on most systems)
Try It Yourself
# Trace Nginx's epoll setup to see how it creates and populates the epoll instance
strace -e epoll_create1,epoll_ctl,epoll_wait nginx -t 2>&1 | head -20

# Inspect epoll fd 4's monitored fds, event masks, and data values (replace PID and fd)
cat /proc/<PID>/fdinfo/4

# Count poll/select syscalls during a curl request to see I/O multiplexing in action
strace -e poll,ppoll,select,pselect6 -c curl -s https://example.com > /dev/null

# Show listening TCP sockets with extended process info
ss -tlnpe

# Check if the Python runtime has epoll support (Linux-specific)
python3 -c "import select; print('epoll' if hasattr(select,'epoll') else 'no epoll')"

# Show the per-user limit on epoll watches (default scales with RAM, tunable)
cat /proc/sys/fs/epoll/max_user_watches
Debug Checklist
1. Identify which I/O model a process uses: timeout 5 strace -e select,poll,epoll_wait,io_uring_enter -c -p <pid>
2. Count threads: ls /proc/<pid>/task | wc -l
3. Check if thread-per-connection: compare thread count with connection count via ss -tnp
4. Monitor epoll wait times: perf trace -e epoll_wait -p <pid> -- sleep 10
5. Check for blocking reads: strace -e read -T -p <pid> (look for long durations)
6. Check io_uring support: uname -r (5.1+ for basic, 5.6+ for full features)
Key Takeaways
- ✓ epoll does not scan. When a packet arrives, the network stack calls ep_poll_callback(), which adds the fd to a ready list. epoll_wait() just drains that list. Cost: O(ready fds), not O(total fds). That is why it scales to millions of connections.
- ✓ Level-triggered epoll re-notifies you every call if data remains. Edge-triggered only notifies on state transitions -- you MUST drain the fd with non-blocking reads until EAGAIN, or the remaining data goes silent. Missing this is the most common epoll bug.
- ✓ select() is capped at 1024 fds and copies the entire fd set to/from kernel on every call. poll() removes the fd limit but still scans linearly. Both are O(n) per call, even if only one fd is ready.
- ✓ EPOLLONESHOT prevents two threads from handling the same fd simultaneously in multithreaded epoll. It disables the fd after one event -- the owning thread must re-arm it with EPOLL_CTL_MOD before the next event fires.
- ✓ Blocking I/O is not always wrong. For low-connection, high-throughput workloads (file processing, video transcoding), blocking I/O with a thread pool is simpler and can outperform epoll by avoiding syscall overhead and state machine complexity.
Common Pitfalls
- ✗ Mistake: Using edge-triggered epoll without non-blocking sockets. Reality: If you read only part of the available data, ET will not re-notify you. The remaining data stalls until new data arrives. The connection appears frozen.
- ✗ Mistake: Adding the same fd to multiple epoll instances. Reality: It works, but the kernel wakes ALL epoll instances when the fd becomes ready, causing thundering herd issues and wasted CPU.
- ✗ Mistake: Assuming closed fds auto-remove from epoll. Reality: The kernel auto-removes when the underlying file description is destroyed. But if another fd still references the same struct file (via dup), the epoll entry persists and causes spurious events.
- ✗ Mistake: Using select() with fds numbered above 1023. Reality: FD_SET uses the fd as an array index into a fixed-size bitfield. Fds above 1023 corrupt memory. This is undefined behavior and often a security vulnerability.
Reference
In One Line
Use epoll for network concurrency, blocking I/O with a thread pool for heavy per-connection work, and io_uring when syscall overhead is the actual bottleneck.