I/O Models: Blocking, Non-Blocking, Async
Mental Model
Five ways to wait for food at a restaurant. Blocking: stand at the counter staring until the plate appears. Non-blocking: wander past every few seconds -- "ready yet?" Select/poll: sit in a room of buzzers and check each one. Epoll: the PA calls out the specific table when the order is done. io_uring: slide a list of orders through a window and pick up finished plates from a shelf -- no counter visits, no conversations, no waiting.
The Problem
Ten thousand users connected to a chat server; five are typing at any given moment. Thread-per-connection: 10,000 threads, 80 GB of stack, nearly all sleeping. select/poll: one thread, but it scans every socket every iteration -- O(n) per pass, burning cycles on 9,995 idle connections. Throughput plateaus at a few thousand connections while the hardware sits underutilized.
Architecture
A server has 10,000 open connections. Right now, 5 of them have data to read. The other 9,995 are idle.
How the code waits for those 5 determines everything. Whether the server handles 10 million connections or crashes at 10 thousand. Whether 10,000 threads burn doing nothing or a single thread does everything.
This is the I/O model problem. And every high-performance system in production has already solved it -- differently.
What Actually Happens
Linux offers five distinct I/O models, each with different kernel-to-userspace notification semantics.
Blocking I/O is the default. The thread calls read() and goes to sleep on the socket's wait queue. When data arrives, the kernel wakes the thread, copies data to the buffer, and read() returns. Simple, but it requires one thread per connection.
Non-blocking I/O uses O_NONBLOCK. If no data is available, read() returns immediately with EAGAIN. The code must retry. This avoids blocking but wastes CPU spinning on idle connections.
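The EAGAIN contract is easy to see with a few lines of Python on a pipe -- a sketch, not a server, but the flag and the errno are the same ones a C program deals with:

```python
import errno
import os

# Create a pipe and mark the read end non-blocking.
# os.set_blocking() sets O_NONBLOCK via fcntl; the same flag works on sockets.
r, w = os.pipe()
os.set_blocking(r, False)

# No data yet: a blocking read would sleep; a non-blocking read fails fast.
try:
    os.read(r, 1024)
except BlockingIOError as e:
    print("EAGAIN:", e.errno == errno.EAGAIN)  # → EAGAIN: True

# Once data arrives, the same call succeeds immediately.
os.write(w, b"hello")
print(os.read(r, 1024))  # → b'hello'
```

The retry burden is exactly the "wasted CPU" problem: without a notification mechanism, the only way to know when to call `read()` again is to keep calling it.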
I/O multiplexing -- select, poll, epoll -- lets one thread monitor many fds. Block until ANY of them has data, then handle just the ready ones. This is the model that powers every serious network server.
Signal-driven I/O (SIGIO) delivers a signal when an fd becomes ready. Rarely used in modern code -- signals have limited queuing, restricted handler safety, and complex per-fd management.
Asynchronous I/O (POSIX AIO, io_uring) goes further. Operations are submitted to the kernel, and completion notifications arrive later. No blocking at any point. io_uring is the modern implementation of this model.
Under the Hood
Here is where things break without understanding the internals.
select() and poll() have a fundamental scaling problem. On every call, they copy the entire fd set from userspace to kernel, and the kernel scans all fds to check readiness. With 10,000 fds and 5 ready, both still scan all 10,000. That is O(n) per call, even when almost nothing is happening.
select() has an additional hard limit: FD_SETSIZE (typically 1024). Using fds above 1023 with select() corrupts its internal bitmap. This is not a graceful error. It is a buffer overflow.
epoll changes the architecture. Registration and waiting are separate operations. epoll_ctl() registers an fd once and installs a callback (ep_poll_callback) on the fd's wait queue. When data arrives, the network stack's softirq handler fires that callback, which adds the fd to epoll's ready list. Calling epoll_wait() only drains the ready list -- no scanning.
The internal data structures matter:
- An RB-tree (red-black tree) stores all monitored fds for O(log n) add/remove.
- A ready list (doubly-linked list) holds fds with pending events for O(1) access.
epoll_wait() cost is O(k), where k = number of ready fds. Not n. That is the entire point.
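The O(k) behavior is visible from userspace with the stdlib `select.epoll` wrapper -- a sketch: register a few hundred fds, make exactly five readable, and epoll_wait reports exactly five:

```python
import os
import select

# Register 400 pipe read-ends with one epoll instance.
# epoll_ctl(EPOLL_CTL_ADD) happens once per fd, at registration time.
ep = select.epoll()
pipes = [os.pipe() for _ in range(400)]
for r, w in pipes:
    ep.register(r, select.EPOLLIN)

# Make only 5 of the 400 readable.
for i in (3, 42, 100, 250, 399):
    os.write(pipes[i][1], b"x")

# epoll_wait() drains the ready list: it reports the 5 ready fds
# without scanning the other 395 registered ones.
events = ep.poll(timeout=0)
print(len(events))  # → 5
```

The same experiment with select() would hand the kernel all 400 fds on every call and get back a scan of all 400.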
Level-triggered vs edge-triggered is where correctness bugs live. LT (default) re-reports an fd as ready on every epoll_wait() call as long as data remains in the buffer. Safe. Forgiving. Partial reads are fine -- the rest is picked up on the next call.
ET only reports when the state transitions from not-ready to ready. Reading half the buffer and returning to epoll_wait() means no notification about the other half. It sits there until new data arrives and triggers another transition. This is why ET requires non-blocking sockets and a read loop that drains until EAGAIN.
ET reduces wakeups. LT reduces bugs. The tradeoff depends on the workload.
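The ET failure mode and its fix both fit in a short sketch. A partial read leaves data stranded; the drain-until-EAGAIN loop is the required pattern:

```python
import os
import select

r, w = os.pipe()
os.set_blocking(r, False)            # ET requires non-blocking fds

ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLET)   # edge-triggered

os.write(w, b"a" * 10)               # not-ready -> ready transition
print(len(ep.poll(timeout=0)))       # → 1: the transition is reported

# Read only part of the buffer, then poll again: ET stays silent even
# though 5 bytes are still waiting. This is the classic ET bug.
os.read(r, 5)
print(len(ep.poll(timeout=0)))       # → 0

# The correct ET pattern: drain until read() fails with EAGAIN.
def drain(fd):
    chunks = []
    while True:
        try:
            chunks.append(os.read(fd, 4096))
        except BlockingIOError:      # errno EAGAIN: buffer is empty
            return b"".join(chunks)

print(drain(r))                      # → b'aaaaa' (the stranded 5 bytes)
```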
Common Questions
Why is epoll O(1) while select and poll are O(n)?
select/poll copy the entire fd set to kernel space on every call and linearly scan all fds. epoll separates registration (done once) from waiting. Callbacks fire asynchronously when data arrives, populating a ready list. epoll_wait() just checks that list. The cost is proportional to ready fds, not total fds.
What is the thundering herd problem?
When multiple threads call epoll_wait() on the same listening socket, a new connection wakes ALL of them. Only one can successfully accept(). The rest waste a context switch. Solutions: EPOLLEXCLUSIVE (Linux 4.5+) wakes only one waiter. SO_REUSEPORT gives each thread its own socket. EPOLLONESHOT with explicit re-arm. Or accept in a single thread and dispatch to workers.
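Of those solutions, the EPOLLONESHOT disarm/re-arm cycle is the easiest to sketch with stdlib Python (EPOLLEXCLUSIVE and SO_REUSEPORT are not shown here):

```python
import os
import select

# EPOLLONESHOT: the fd is reported once, then disarmed until explicitly
# re-armed with epoll_ctl(EPOLL_CTL_MOD) -- ep.modify() in Python.
r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN | select.EPOLLONESHOT)

os.write(w, b"first")
print(len(ep.poll(timeout=0)))   # → 1: the one shot fires

os.write(w, b"second")
print(len(ep.poll(timeout=0)))   # → 0: disarmed, even though data is waiting

ep.modify(r, select.EPOLLIN | select.EPOLLONESHOT)   # explicit re-arm
print(len(ep.poll(timeout=0)))   # → 1: armed again, pending data reported
```

In a multithreaded server the re-arm happens after the owning thread finishes with the fd, which guarantees no second thread sees it in between.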
When is blocking I/O actually better than epoll?
When there are few connections doing heavy per-connection work. Database query execution, video transcoding, file processing -- if the bottleneck is computation, not connection management, then the overhead of non-blocking state machines and epoll syscalls is wasted complexity. Below ~1000 connections with CPU-heavy workloads, a thread pool with blocking I/O is often simpler and faster.
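The "simpler" claim is concrete: with blocking I/O and a thread pool, there is no event loop and no per-connection state machine -- the OS scheduler does the multiplexing. A sketch (the file names and sizes are made up for illustration):

```python
import concurrent.futures
import hashlib
import os

# A blocking worker: plain read(), then CPU-heavy work on the result.
def process(path):
    with open(path, "rb") as f:          # ordinary blocking I/O
        return hashlib.sha256(f.read()).hexdigest()

# Hypothetical workload: a handful of scratch files, heavy per-item work.
paths = []
for i in range(8):
    p = f"/tmp/blocking_demo_{i}.bin"
    with open(p, "wb") as f:
        f.write(os.urandom(1 << 16))
    paths.append(p)

# Four threads, each running the blocking worker to completion.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    digests = list(pool.map(process, paths))

print(len(digests))  # → 8

for p in paths:      # clean up the scratch files
    os.remove(p)
```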
How does Node.js handle file I/O if epoll cannot work with regular files?
Regular files are always "ready" from epoll's perspective. They never block on readiness, only on completion (disk latency). So epoll returns immediately for files, which defeats the purpose. libuv handles this with a thread pool (default 4 threads): file operations run as blocking I/O in worker threads, and completions are signaled back to the event loop via eventfd. Network I/O uses native epoll. It is a hybrid architecture.
How Technologies Use This
A Kafka broker with 10,000 producer and consumer connections is burning 80 GB of memory and drowning the scheduler in context switches. Fewer than 100 connections have data at any instant, yet 10,000 threads sit idle consuming 8 MB of stack each.
The root cause is a thread-per-connection model where every idle socket pins a sleeping thread. Kafka eliminates this by wrapping epoll via Java NIO Selectors, multiplexing all 10,000 connections onto just 3 network threads. Each thread calls epoll_wait and processes only the sockets that have data ready.
Configure num.network.threads to match the CPU core count and leave disk I/O blocking so the page cache absorbs write bursts. This keeps CPU usage under 5% on connection handling even at peak connection counts.
An Nginx server with 10,000 connected clients is consuming 80 GB of stack memory and spending most of its time in context switches. Only 50 clients are actively sending requests, yet every idle connection holds a dedicated thread.
The thread-per-connection model wastes resources on sleeping threads. Nginx replaces it with one non-blocking event loop per worker using edge-triggered epoll. The worker sleeps in epoll_wait until sockets transition to ready, then drains them in a tight loop until EAGAIN. EPOLLEXCLUSIVE ensures only one worker wakes per new connection, eliminating thundering herd.
Set worker_connections to match the expected concurrency and use edge-triggered mode for maximum efficiency. A single Nginx worker routinely handles 10,000+ concurrent connections at under 2% CPU.
A file-heavy Node.js application randomly blocks the event loop even though libuv claims to use non-blocking I/O. HTTP responses stall, DNS lookups time out, and the process appears frozen for hundreds of milliseconds at a time.
The hidden cause is that epoll cannot do async file I/O. Regular files always report ready, so epoll_wait returns immediately and the actual disk read blocks the thread. libuv works around this with a hybrid model: network sockets go through epoll, while file reads and writes are dispatched to a thread pool of just 4 workers by default. When all 4 threads are blocked on slow disk I/O, everything queues behind them, including DNS lookups.
Set UV_THREADPOOL_SIZE to 64 or higher for file-heavy services and signal completions back to the main loop via eventfd. The lesson is that Node.js non-blocking I/O is only truly non-blocking for network sockets, not files.
Same Concept Across Tech
| Technology | I/O model used | Why this choice |
|---|---|---|
| Node.js | epoll (via libuv) for network, thread pool for disk | Event loop for network concurrency, threads for blocking disk I/O |
| Go | epoll (via netpoller) for network, goroutines for everything | Goroutines give the illusion of blocking I/O with epoll underneath |
| Nginx | epoll (edge-triggered) per worker process | Maximum network concurrency with minimal threads |
| Redis | epoll (level-triggered), single thread | Proves one thread + epoll handles 100K+ ops/sec |
| Java (Netty) | epoll via NIO Selector, optional native epoll transport | NIO Selector is the portable path, native epoll is the fast path |
| Rust (Tokio) | epoll (via mio), io_uring support emerging | Async/await model wraps epoll in ergonomic syntax |
Progression of I/O models (why each was invented):
| Model | Problem it solved | New problem it created |
|---|---|---|
| Blocking | Simple programming model | Does not scale past a few thousand threads |
| Non-blocking | No threads blocked | Busy-loop polling wastes CPU |
| select/poll | No busy loop, kernel notifies | O(n) scan of all fds every time |
| epoll | O(ready) instead of O(total) | Does not work with regular files |
| io_uring | Async everything including files | High complexity, security attack surface |
Design Rationale
Each model appeared because the previous one hit a wall that could not be patched. Blocking I/O collapsed under memory pressure at a few thousand threads. select/poll eliminated threads but scanned every fd every call because the kernel kept no state between invocations. epoll split registration from waiting, maintaining a persistent ready list fed by callbacks -- but regular files are always "ready" (they block on completion, not readiness), so epoll could not help with disk I/O at all. io_uring closed that last gap by moving the entire operation -- submission, execution, completion -- into the kernel, trading a much larger API surface and attack profile for the ability to finally do async disk I/O without a thread pool.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Thousands of threads, most sleeping | Thread-per-connection model | ls /proc/PID/task |
| High CPU with mostly idle connections | Using select/poll scanning all fds | strace -c to see select/poll syscall count |
| Server hits wall at ~1000 connections | FD_SETSIZE limit on select, or thread stack memory | Check ulimit -n and thread count |
| Disk I/O blocks the event loop | epoll does not work with regular files, blocking the single thread | Move file I/O to thread pool or use io_uring |
| Latency spikes every few milliseconds | Context switching between too many threads | pidstat -w, reduce thread count |
| Application works on macOS but breaks on Linux | kqueue vs epoll differences in edge cases | Check for platform-specific behavior in event library |
When to Use / Avoid
Blocking I/O (thread-per-connection):
- Simple protocol, low connection count (under 1,000)
- Code clarity matters more than scalability
select/poll:
- Cross-platform compatibility needed, moderate connection count
epoll:
- Linux-only, thousands to millions of connections, most idle
- The standard for production servers (Nginx, Redis, Node.js, Go)
io_uring:
- Need async disk I/O (epoll does not work with regular files)
- Highest performance requirements, willing to accept complexity
Avoid:
- Thread-per-connection above a few thousand connections (stack memory and context switch overhead)
- select with more than 1,024 fds (FD_SETSIZE limit on most systems)
Try It Yourself
# Trace Nginx's epoll setup to see how it creates and populates the epoll instance
strace -e epoll_create1,epoll_ctl,epoll_wait nginx -t 2>&1 | head -20

# Inspect epoll fd 4's monitored fds, event masks, and data values (replace PID and fd)
cat /proc/<PID>/fdinfo/4

# Count poll/select syscalls during a curl request to see I/O multiplexing in action
strace -e poll,ppoll,select,pselect6 -c curl -s https://example.com > /dev/null

# Show listening TCP sockets with extended process info
ss -tlnpe

# Check if the Python runtime has epoll support (Linux-specific)
python3 -c "import select; print('epoll' if hasattr(select,'epoll') else 'no epoll')"

# Show the per-user limit on epoll watches (default scales with RAM, tunable)
cat /proc/sys/fs/epoll/max_user_watches
Debug Checklist
1. Identify which I/O model a process uses: timeout 5 strace -e select,poll,epoll_wait,io_uring_enter -c -p <pid>
2. Count threads: ls /proc/<pid>/task | wc -l
3. Check if thread-per-connection: compare thread count with connection count via ss -tnp
4. Monitor epoll wait times: perf trace -e epoll_wait -p <pid> -- sleep 10
5. Check for blocking reads: strace -e read -T -p <pid> (look for long durations)
6. Check io_uring support: uname -r (5.1+ for basic, 5.6+ for full features)
Key Takeaways
- ✓ epoll does not scan. When a packet arrives, the network stack calls ep_poll_callback(), which adds the fd to a ready list. epoll_wait() just drains that list. Cost: O(ready fds), not O(total fds). That is why it scales to millions of connections.
- ✓ Level-triggered epoll re-notifies you every call if data remains. Edge-triggered only notifies on state transitions -- you MUST drain the fd with non-blocking reads until EAGAIN, or the remaining data goes silent. Missing this is the most common epoll bug.
- ✓ select() is capped at 1024 fds and copies the entire fd set to/from kernel on every call. poll() removes the fd limit but still scans linearly. Both are O(n) per call, even if only one fd is ready.
- ✓ EPOLLONESHOT prevents two threads from handling the same fd simultaneously in multithreaded epoll. It disables the fd after one event -- the owning thread must re-arm it with EPOLL_CTL_MOD before the next event fires.
- ✓ Blocking I/O is not always wrong. For low-connection, high-throughput workloads (file processing, video transcoding), blocking I/O with a thread pool is simpler and can outperform epoll by avoiding syscall overhead and state machine complexity.
Common Pitfalls
- ✗ Mistake: Using edge-triggered epoll without non-blocking sockets. Reality: If you read only part of the available data, ET will not re-notify you. The remaining data stalls until new data arrives. The connection appears frozen.
- ✗ Mistake: Adding the same fd to multiple epoll instances. Reality: It works, but the kernel wakes ALL epoll instances when the fd becomes ready, causing thundering herd issues and wasted CPU.
- ✗ Mistake: Assuming closed fds auto-remove from epoll. Reality: The kernel auto-removes when the underlying file description is destroyed. But if another fd still references the same struct file (via dup), the epoll entry persists and causes spurious events.
- ✗ Mistake: Using select() with fds numbered above 1023. Reality: FD_SET uses the fd as an array index into a fixed-size bitfield. Fds above 1023 corrupt memory. This is undefined behavior and often a security vulnerability.
Reference
In One Line
Use epoll for network concurrency, blocking I/O with a thread pool for heavy per-connection work, and io_uring when syscall overhead is the actual bottleneck.