epoll & I/O Multiplexing
Mental Model
A hotel notification board. Guests press a button when they need something and their room number lights up on the board. The concierge never walks door to door asking. Check the board, serve those guests, go back to waiting. Work scales with guests who actually need something, not with how many rooms the hotel has.
The Problem
Redis has 50,000 connected clients. Maybe 20 are sending commands right now. The other 49,980 are idle persistent connections with nothing to say. The event loop needs to find those 20 active sockets, and it needs to do this thousands of times per second without spending a single cycle on the silent majority.
Architecture
Redis is single-threaded. One thread. One CPU core. Yet it handles 100,000+ operations per second across tens of thousands of concurrent connections.
How? It never wastes a single cycle asking "got data?" to a socket that is silent. The kernel tells Redis exactly which sockets have something to say. Redis handles those and ignores the rest.
That mechanism is epoll. And it is not just Redis. Nginx, Node.js, Go, Kafka, Chrome, PostgreSQL -- every high-performance program on Linux sits in an epoll_wait() loop. Anyone doing systems work on Linux needs to understand this.
What Actually Happens
The epoll API (Linux 2.6+) provides scalable I/O event notification through three syscalls.
epoll_create1() creates an epoll instance -- a kernel object backed by struct eventpoll. This is the dashboard.
epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event) registers a file descriptor to monitor. The kernel inserts an epitem into a red-black tree and installs a callback (ep_poll_callback) on the fd's wait queue. This happens once per fd.
epoll_wait(epfd, events, maxevents, timeout) blocks until at least one monitored fd has an event. It drains the ready list and returns only the fds with pending events. No scanning.
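A minimal sketch of the three calls working together -- level-triggered, one already-listening socket, error handling omitted. The function and buffer names are illustrative, not from any particular codebase:

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_EVENTS 64

/* Minimal level-triggered event loop (sketch): one epoll instance,
 * one already-listening socket, plain reads on client sockets. */
void event_loop(int listen_fd) {
    int epfd = epoll_create1(0);                       /* create the "dashboard" */

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);    /* register once per fd */

    struct epoll_event events[MAX_EVENTS];
    for (;;) {
        /* Blocks until at least one monitored fd is ready; returns only ready fds. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int client = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = client };
                epoll_ctl(epfd, EPOLL_CTL_ADD, client, &cev);
            } else {
                char buf[4096];
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r <= 0) {                          /* EOF or error: stop watching */
                    epoll_ctl(epfd, EPOLL_CTL_DEL, fd, NULL);
                    close(fd);
                } /* else: handle r bytes of the command */
            }
        }
    }
}
```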
Here is what makes the performance difference stark. With 10,000 monitored sockets where 10 are ready:
- select() copies a 10,000-bit bitmap to the kernel, scans all bits, copies it back. O(10,000).
- poll() copies a 10,000-element pollfd array, scans it all. O(10,000).
- epoll_wait() returns 10 events from the ready list. O(10).
That is the entire C10K problem, solved.
Under the Hood
The red-black tree stores all monitored fds. epoll_ctl(EPOLL_CTL_ADD) inserts an epitem, EPOLL_CTL_DEL removes it, EPOLL_CTL_MOD updates the event mask. All operations are O(log n). The tree ensures each fd can only be registered once per epoll instance.
The ready list is a doubly-linked list of epitem nodes that have pending events. This is the structure that makes epoll fast. It is populated asynchronously -- not by epoll_wait(), but by the kernel's network stack.
The wakeup path is where the magic happens. When a packet arrives on a socket:
- NIC interrupt fires, NAPI softirq processes the packet.
- The TCP stack puts data in the socket's receive buffer.
- sock_def_readable() walks the socket's wait queue.
- ep_poll_callback() fires, moving the epitem to the ready list.
- Any thread blocked in epoll_wait() is woken.
No scanning. The check happened at interrupt time, not at epoll_wait() time. The cost of monitoring 10,000 idle sockets is zero.
The choice between level-triggered and edge-triggered determines when events fire.
Level-triggered (LT, the default) works like poll(): if data exists in the buffer, epoll_wait() reports the fd as ready. Every time. Even if it was already reported. This is forgiving -- partial reads are fine because the notification comes again.
Edge-triggered (ET) fires only on state transitions -- when NEW data arrives, not when data EXISTS. Once an ET event fires, the fd must be drained completely with non-blocking reads until EAGAIN. Reading half the buffer and going back to epoll_wait() means the other half sits there silently until more data arrives. This is the most common epoll bug.
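What "drain until EAGAIN" means in practice -- a sketch of an ET read loop, assuming the fd was registered with EPOLLIN | EPOLLET and set non-blocking (the helper name drain_fd is illustrative):

```c
#include <errno.h>
#include <unistd.h>

/* Edge-triggered read: keep reading until the kernel says the buffer
 * is empty (EAGAIN/EWOULDBLOCK). Stopping earlier leaves data stranded
 * in the socket buffer with no further notification. */
void drain_fd(int fd) {
    char buf[4096];
    for (;;) {
        ssize_t r = read(fd, buf, sizeof(buf));
        if (r > 0) {
            /* ...process r bytes... */
            continue;
        }
        if (r == 0) {            /* peer closed the connection */
            close(fd);
            return;
        }
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return;              /* fully drained, safe to return to epoll_wait() */
        if (errno == EINTR)
            continue;            /* interrupted by a signal, retry */
        close(fd);               /* real error */
        return;
    }
}
```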
ET reduces epoll_wait() wakeups. LT reduces correctness bugs. Nginx uses ET for maximum throughput. Redis uses LT for simplicity. Both are valid choices.
Common Questions
Why is epoll O(1) and poll O(n)?
poll() passes the entire pollfd array to the kernel on every call. The kernel checks each fd's readiness. Every call, all fds, even if nothing changed. epoll_wait() only drains a ready list that was populated asynchronously by callbacks at interrupt time. The cost is proportional to ready fds, not total fds.
What is the thundering herd problem?
When multiple threads call epoll_wait() on the same listening socket, a new connection wakes ALL of them. Only one can accept(). The rest waste a context switch. Solutions: EPOLLEXCLUSIVE (Linux 4.5+) wakes only one waiter. SO_REUSEPORT gives each thread its own socket. EPOLLONESHOT with explicit re-arm. Or accept in one thread and dispatch to workers.
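A sketch of the EPOLLEXCLUSIVE variant (Linux 4.5+), where each worker has its own epoll instance and registers the shared listening socket with the flag; the helper name watch_listener is illustrative:

```c
#include <sys/epoll.h>

/* Each worker registers the shared listening socket with EPOLLEXCLUSIVE
 * so a new connection wakes only one of the workers blocked in
 * epoll_wait(), instead of all of them. */
int watch_listener(int epfd, int listen_fd) {
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLEXCLUSIVE,   /* flag is only valid with EPOLL_CTL_ADD */
        .data.fd = listen_fd,
    };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);
}
```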
When is level-triggered preferable to edge-triggered?
Start with LT. It is simpler and there is no risk of losing data by reading too little. Switch to ET only when minimizing epoll_wait() wakeups matters in a high-throughput server -- and make sure every read/write loop drains until EAGAIN. ET is an optimization, not a default.
How does epoll compare to io_uring?
epoll reports when a fd is ready. The application still calls read()/write() itself. io_uring handles the entire operation -- submit a "read this" request and get back a "here is the data" completion. io_uring also works for disk I/O (epoll does not) and batches syscalls via shared ring buffers. epoll is simpler and battle-tested. io_uring is faster but more complex.
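For contrast, a rough sketch of the completion model using the liburing helper library -- the application submits the whole read and gets back the result, instead of being told the fd is readable. Function name and queue depth are illustrative:

```c
#include <liburing.h>

/* Submit one read and wait for its completion. With epoll you would be
 * woken and then call read() yourself; here the kernel performs the read
 * and reports the byte count in the completion entry. */
int read_with_uring(int fd, char *buf, unsigned len) {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);            /* ring with 8 SQ entries */

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, len, 0);    /* the "read this" request */
    io_uring_submit(&ring);

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);              /* the "here is the data" completion */
    int res = cqe->res;                          /* bytes read, or -errno */
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return res;
}
```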
Why does epoll not work with regular files?
Regular files are always "ready" from the kernel's readiness model: a read on a regular file might block waiting for disk, but the fd never signals "not ready", so readiness notification is meaningless for it. select() and poll() simply report regular files as readable immediately; epoll refuses to register them at all -- epoll_ctl(EPOLL_CTL_ADD) on a regular file fails with EPERM. For async disk I/O, io_uring or AIO is the answer.
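A quick way to see the limitation, assuming any regular file such as /etc/hostname; the registration is expected to fail with EPERM:

```c
#include <sys/epoll.h>
#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void) {
    int epfd = epoll_create1(0);
    int fd = open("/etc/hostname", O_RDONLY);    /* any regular file */

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1)
        /* Typically prints: epoll_ctl: Operation not permitted (EPERM) */
        printf("epoll_ctl: %s\n", strerror(errno));
    return 0;
}
```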
How Technologies Use This
A Kafka broker with 10,000 producer and consumer connections is consuming 80 GB of stack memory and drowning the scheduler in context switches. Most connections are idle, yet each one holds a dedicated thread doing nothing.
The thread-per-connection model scales linearly in memory regardless of activity. Kafka wraps epoll via Java NIO Selectors so each network thread monitors thousands of sockets at once. When epoll_wait returns, only the handful of connections with pending data are processed, and the cost of 9,900 idle connections is zero CPU.
Configure num.network.threads to a small number matching the core count. Kafka broker CPU stays under 5% on connection handling even at peak connection counts, leaving cores free for log compaction and replication.
A single Nginx worker serving 10,000 simultaneous connections is burning excessive CPU or failing to accept new connections fast enough. The thread-per-connection model would require 10,000 threads and 80 GB of stack memory.
Nginx uses edge-triggered epoll instead. The worker calls epoll_wait and blocks until at least one socket transitions to ready. It then drains every ready socket in a tight loop until EAGAIN before returning to epoll_wait. Because ET only fires on state transitions, the worker avoids redundant wakeups that level-triggered mode would cause.
Enable EPOLLEXCLUSIVE in multi-worker setups to ensure only one worker wakes per new connection. A single worker routinely handles 10,000+ concurrent connections at under 2% CPU, making edge-triggered epoll the foundation of Nginx's performance.
PgBouncer CPU climbs steadily as client connections grow toward 5,000, even though only 200 are actively running queries at any moment. The connection pooler is spending all its time scanning idle sockets instead of forwarding real traffic.
Without epoll, PgBouncer would poll all 5,000 sockets every iteration, burning CPU proportional to total connections. With epoll, PgBouncer monitors client connections, backend connections, replication streams, and internal signal pipes on a single event loop. Only sockets with actual activity trigger a wakeup.
Deploy PgBouncer on a Linux host where it can use epoll natively. This keeps CPU under 3% even at 5,000 connections, which is critical for high-connection-count deployments where the alternative of one PostgreSQL backend per client would exhaust the server's 32 GB of shared memory.
Redis ops/sec drops as persistent connections are added, even though most are idle. At 100,000 connected clients, the single thread is spending more time scanning sockets for commands than actually executing them.
Without epoll, checking each socket in a loop at 100,000 sockets would consume significant CPU every iteration. Redis wraps epoll in its ae event library using level-triggered mode. The thread sleeps in epoll_wait until the ready list is populated by kernel callbacks. Only clients with commands to process cause a wakeup, and adding 50,000 idle persistent connections costs exactly zero additional CPU.
Use level-triggered epoll as Redis does for simplicity and safety. The kernel callback mechanism means idle sockets are never touched, which is why Redis achieves 100,000+ ops/sec on one core regardless of total connection count.
A Go service running 2 million goroutines doing network I/O would need 2 million OS threads without epoll. That is 16 TB of stack memory at the 8 MB default per thread, which is obviously impossible on any real machine.
The Go runtime solves this with its netpoller, which registers every network socket with epoll. When a goroutine calls net.Conn.Read, the runtime parks it without holding an OS thread. epoll_wait wakes it only when data arrives on that socket. This decouples goroutine count from thread count entirely.
Let the Go runtime manage its netpoller automatically and focus on keeping goroutines lightweight. OS thread count stays in the single digits while goroutine count scales to millions, consuming roughly 4 KB per goroutine instead of 8 MB per thread.
A Netty-based microservice handling 20,000 concurrent connections would need 160 GB of stack memory with blocking threads at 8 MB each. The service either crashes with OutOfMemoryError or grinds to a halt under context switch overhead.
The JVM NIO Selector maps directly to epoll on Linux. Selector.select() is epoll_wait under the hood. When data arrives on any of the 20,000 sockets, the kernel fires ep_poll_callback to place it on the ready list, and just 4 event loop threads process only the ready connections. Netty also offers its own epoll transport bypassing JVM NIO for edge-triggered mode and lower latency.
Use Netty's native epoll transport on Linux for roughly 30% lower per-event overhead compared to the standard NIO Selector. This lets 4 event loop threads handle 20,000 connections without any blocking threads or excessive memory consumption.
A Node.js server with 50,000 concurrent WebSocket connections is burning 100% CPU on a single thread. The process is spinning in a tight loop checking every socket for data, with no time left to execute JavaScript callbacks.
Without epoll, the thread would poll all 50,000 sockets every iteration. With epoll, the thread spends most of its time blocked inside epoll_wait, consuming zero CPU while waiting. When any socket receives data, the kernel callback adds it to the ready list and wakes the thread. libuv fires JavaScript callbacks for only the ready sockets. The catch is that epoll does not work with regular files, so file I/O goes to a thread pool of 4 workers by default.
Let libuv manage the epoll event loop for network I/O and tune UV_THREADPOOL_SIZE for file-heavy workloads. This hybrid design lets Node.js handle massive network concurrency at under 1% CPU while idle, without multi-threaded complexity.
Chrome's browser process needs to respond instantly to tab clicks, GPU compositing, network events, and 50+ renderer IPCs simultaneously. Without a multiplexing mechanism, it would need a thread per event source or burn 100% CPU polling hundreds of channels in a busy loop.
On Linux, Chrome multiplexes IPC channels from 50+ renderer processes, GPU messages, Wayland/X11 display server events, and timer fds into a single epoll instance. The main thread sleeps in epoll_wait until any fd has activity, then dispatches to the appropriate handler. The cost of hundreds of idle event sources is zero CPU.
This epoll-based event loop is why Chrome's UI process uses under 0.5% CPU between user interactions, waking only when an event source actually has data to deliver. The takeaway is that epoll scales to desktop applications just as well as servers.
Same Concept Across Tech
| Technology | How it uses epoll | Abstraction layer |
|---|---|---|
| Node.js | libuv event loop wraps epoll for sockets. File I/O goes to a 4-thread pool because epoll does not work with regular files | libuv |
| Go | netpoller registers every socket with epoll. Goroutines park without holding OS threads, wake on data arrival | runtime netpoller |
| JVM (Netty) | NIO Selector maps to epoll. Netty offers a native epoll transport for edge-triggered mode and 30% less overhead | java.nio.Selector / Netty epoll |
| Kafka | Broker wraps epoll via NIO Selectors. num.network.threads controls event loop count, not connection count | Java NIO |
| Nginx | Edge-triggered epoll per worker process. EPOLLEXCLUSIVE prevents thundering herd across workers | direct epoll syscalls |
| Redis | ae event library uses level-triggered epoll. Single thread handles 100K+ connections at zero idle cost | ae_epoll.c |
Design Rationale
Each runtime wraps epoll differently to solve the same problem: matching massive concurrency to limited OS threads. Node.js avoids thread overhead for I/O-bound work. Go decouples goroutines from OS threads entirely. JVM NIO brings event-driven I/O to Java without rewriting everything as async. Redis proves that one thread plus epoll beats thousands of threads with blocking I/O.
How does epoll compare to alternatives?
| Approach | Concurrency model | Connections per thread | File I/O | Complexity |
|---|---|---|---|---|
| Thread-per-connection | One OS thread per socket | 1 | Blocking (simple) | Low |
| select/poll | Single thread scans all fds | Thousands (but O(n) scan) | Not supported | Medium |
| epoll (level-triggered) | Single thread, kernel notifies ready fds | Tens of thousands at O(ready) | Not supported | Medium |
| epoll (edge-triggered) | Same, but fires only on state change | Same, fewer wakeups | Not supported | High |
| io_uring | Kernel-side async for everything | Same as epoll for network, also handles files | Async (native) | Very high |
Thread-per-connection is simplest but does not scale past a few thousand connections (stack memory plus context switch overhead). select/poll scale in connections but not in CPU (scan cost grows linearly). epoll scales in both. io_uring goes further by also handling disk I/O asynchronously, but with significantly more complexity.
Stack layer mapping (debugging a slow request):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the handler blocking? CPU-bound work in the event loop? | Application logs, profiler |
| Runtime | Is the thread pool exhausted? (Node: UV_THREADPOOL_SIZE, Go: GOMAXPROCS) | Runtime metrics |
| Syscall | Is epoll_wait returning slowly? Are reads blocking? | strace -e epoll_wait,read -T |
| Kernel | Is the socket receive buffer full? Is the NIC dropping packets? | ss -tnp, cat /proc/net/softnet_stat |
| Hardware | NIC ring buffer overflow? IRQ affinity misconfigured? | ethtool -S, /proc/interrupts |
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| High CPU with mostly idle connections | Using select/poll instead of epoll, or scanning all fds manually | strace -e select,poll,epoll_wait -c -p PID |
| Connection hangs after first message | Edge-triggered mode without reading until EAGAIN | strace -e read -p PID and look for missing EAGAIN |
| epoll_wait returns immediately in a loop | EPOLLHUP or EPOLLERR not handled, fd is in error state | strace -e epoll_wait -p PID, check return events |
| Too many open files error | Per-process fd limit reached, not an epoll limit | cat /proc/PID/limits |
| Kernel memory growing with idle connections | Each epitem costs ~128 bytes. 10M connections = 1.2 GB kernel slab | slabtop or cat /proc/slabinfo |
| Thundering herd on accept() | Multiple workers wake for same connection | Add EPOLLEXCLUSIVE flag |
When to Use / Avoid
Use epoll when:
- Managing thousands of concurrent network connections (servers, proxies, load balancers)
- Most connections are idle at any given moment (chat, WebSocket, long-polling)
- Building event-driven architectures on Linux
Avoid epoll when:
- Working with regular files (epoll_ctl rejects them with EPERM; use io_uring instead)
- Only handling a handful of connections (select is simpler for under 50 fds)
- Targeting cross-platform (epoll is Linux-only, use kqueue on macOS, IOCP on Windows)
- CPU-bound workloads where the bottleneck is computation, not I/O waiting
Try It Yourself
# Trace epoll syscalls of a process
strace -e epoll_create1,epoll_ctl,epoll_wait -c -p $(pidof nginx) -f &

# Show fds registered in an epoll instance
ls -la /proc/$(pidof redis-server)/fd | head
cat /proc/$(pidof redis-server)/fdinfo/5  # if 5 is the epoll fd

# Benchmark: compare select vs poll vs epoll

# Monitor event loop latency via epoll_wait timing
perf trace -e epoll_wait -p $(pidof node) -- sleep 10 2>&1 | awk '/epoll_wait/{print $NF}'

# Check max epoll instances and fd limits
cat /proc/sys/fs/epoll/max_user_watches
ulimit -n  # max open file descriptors per process

Debug Checklist
1. Check if the process is using epoll: strace -e epoll_wait -c -p <pid> -- sleep 5
2. Count open fds in the process: ls /proc/<pid>/fd | wc -l
3. Check max watches: cat /proc/sys/fs/epoll/max_user_watches
4. Check per-process fd limit: cat /proc/<pid>/limits | grep 'open files'
5. Trace event loop latency: perf trace -e epoll_wait -p <pid> -- sleep 10
6. Check if edge-triggered mode is draining fully: strace -e read -p <pid> (look for EAGAIN)
Key Takeaways
- ✓ select is O(n) where n = highest fd number. poll is O(n) where n = total monitored fds. epoll_wait is O(k) where k = READY fds only. At 50,000 connections with 20 active, that is 50,000 vs 20. The difference is not theoretical -- it is the reason servers scale.
- ✓ Edge-triggered (EPOLLET) fires only on state CHANGES -- new data arriving, not data existing. You MUST read until EAGAIN or the fd goes silent forever. Miss this and your connection hangs with data sitting unread in the buffer.
- ✓ Level-triggered (default) fires on every epoll_wait() call as long as data remains. Safer and simpler. More wakeups, but you cannot lose data by reading too little. Start here unless you have a specific reason for ET.
- ✓ EPOLLONESHOT disarms the fd after one event delivery. Essential for multi-threaded event loops -- without it, two threads can receive events for the same fd simultaneously. The owning thread must re-arm via EPOLL_CTL_MOD after processing (see the sketch after this list).
- ✓ epoll does NOT work with regular files. epoll_ctl(EPOLL_CTL_ADD) on a regular file fails with EPERM, because disk files are always considered ready and readiness notification tells you nothing. For async disk I/O, use io_uring. This trips up every developer who tries to use epoll for file watching.
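A sketch of the EPOLLONESHOT re-arm mentioned above, assuming the fd was originally added with EPOLLIN | EPOLLONESHOT (the helper name rearm is illustrative):

```c
#include <sys/epoll.h>

/* After one event delivery the fd is disarmed; the thread that finished
 * processing must re-arm it explicitly, otherwise the fd never shows up
 * in epoll_wait() again. */
int rearm(int epfd, int fd) {
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLONESHOT,
        .data.fd = fd,
    };
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```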
Common Pitfalls
- ✗ Mistake: Using edge-triggered mode without reading until EAGAIN. Reality: If you read one buffer's worth on an ET event, remaining data stays in the socket buffer with no further notification. The connection appears to hang permanently.
- ✗ Mistake: Ignoring EPOLLHUP and EPOLLERR. Reality: These are always reported regardless of what you requested. Not checking for them causes busy-loops where epoll_wait returns immediately but your code does not process the error condition.
- ✗ Mistake: Sharing an epoll fd across fork() without understanding the consequences. Reality: The child inherits a reference to the same kernel eventpoll structure. Events go to whichever process calls epoll_wait() first, causing silent race conditions.
- ✗ Mistake: Monitoring millions of idle connections without considering memory. Reality: Each epitem is ~128 bytes of kernel slab memory. Ten million idle connections cost ~1.2 GB of kernel memory just for epoll metadata.
Reference
In One Line
Waiting costs nothing for idle connections -- epoll_wait returns only what is active, which is why every high-concurrency Linux server lives in that loop.