File Descriptors & File Tables
Mental Model
A coat check at a restaurant. Hand over a coat, get a numbered ticket back. The ticket is not the coat -- it is a reference to where the coat hangs. After a fork, two people hold tickets for the same coat on the same hook. The attendant tracks which hook each ticket maps to, and the restaurant has a limited number of hooks.
The Problem
A web server at 10,000 concurrent connections suddenly refuses new ones: "too many open files." Each connection is a socket, each socket is an fd, and the default per-process limit is 1,024. Elsewhere, a forked worker inherited a database connection fd it should never have seen -- both processes write through the same file offset, interleaving output and corrupting the stream.
Architecture
A program calls open() and gets back the number 3. It calls open() again and gets 4. It reads, writes, seeks — all through these tiny integers.
But what IS the number 3? It's not a pointer. It's not a handle. It's an index into a per-process table, and that table points to a system-wide table, and that table points to an inode. Three levels of indirection. And the sharing semantics at each level are different.
This three-layer design explains everything: why forked children share file offsets with parents, why closing one copy of a dup'd fd doesn't close the file, why a server crashes at exactly 1024 connections, and why a leaked fd can be a security vulnerability.
What Actually Happens
When a process calls open("/tmp/data.txt", O_RDWR), the kernel does three things:
1. Finds the lowest unused slot in the process's fd table (struct fdtable). This produces the integer — say, 3.
2. Allocates a struct file (called an "open file description" in POSIX terminology) in the system-wide table. This struct holds the current file offset, access mode (read/write/append), and status flags.
3. Resolves the path to a struct inode via a VFS path walk, and links the struct file to the inode.
The fd table entry points to the struct file. The struct file points to the inode. That's the three-level chain.
Now here's the critical insight. dup() and fork() create new fd table entries that point to the same struct file. The offset is shared. If the parent writes 100 bytes and advances the offset to position 100, the child's fd sees position 100 too.
Two independent open() calls on the same file? Completely separate struct files. Independent offsets. Independent flags. The only thing shared is the inode.
This distinction is not academic. It's the reason shell I/O redirection works. It's the reason >> (append mode) from multiple processes doesn't corrupt files. And it's the reason debugging fork()-based servers requires understanding which fds are shared and which are independent.
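A minimal sketch of both behaviors (the path is illustrative; error handling omitted for brevity):

```c
/* Sketch: fork() shares the file offset (same struct file);
 * a second open() gets an independent one. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int shared = open("/tmp/offset-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fork() == 0) {                    /* child holds the same struct file */
        write(shared, "0123456789", 10);  /* advances the shared offset */
        _exit(0);
    }
    wait(NULL);

    /* Parent's fd sees the offset the child advanced. */
    printf("parent offset after child write: %ld\n",
           (long)lseek(shared, 0, SEEK_CUR));      /* prints 10 */

    /* A second open() creates its own struct file: offset starts at 0. */
    int independent = open("/tmp/offset-demo", O_RDWR);
    printf("independent open() offset:      %ld\n",
           (long)lseek(independent, 0, SEEK_CUR)); /* prints 0 */
    return 0;
}
```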
Under the Hood
O_CLOEXEC: the flag that must never be forgotten. When a process calls exec(), all open fds are inherited by the new program — unless they're marked close-on-exec. Without O_CLOEXEC, a privilege-dropping daemon that exec's a child process hands it every open socket, every database connection, every file opened with elevated privileges. Since Linux 2.6.23, O_CLOEXEC can be set atomically at open() time. Before that, calling fcntl(fd, F_SETFD, FD_CLOEXEC) separately was the only option, creating a race window in multithreaded programs where another thread could fork+exec between the open and the fcntl.
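A sketch of both patterns; the helper names are ours, not from any library:

```c
#include <fcntl.h>

/* Racy pre-2.6.23 pattern: between open() and fcntl(), another thread
 * can fork+exec and the child inherits the fd without close-on-exec. */
int open_racy(const char *path) {        /* hypothetical helper */
    int fd = open(path, O_RDWR);
    if (fd != -1)
        fcntl(fd, F_SETFD, FD_CLOEXEC);  /* too late if a fork won the race */
    return fd;
}

/* Atomic pattern (Linux 2.6.23+): the fd is never observable
 * without the close-on-exec flag set. */
int open_safe(const char *path) {        /* hypothetical helper */
    return open(path, O_RDWR | O_CLOEXEC);
}
```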
Lowest-available-fd allocation. POSIX mandates that open() returns the lowest available fd number. This is deliberate, not arbitrary. Daemon code that closes stdin/stdout/stderr (fds 0, 1, 2) before opening log files gets those log files assigned to fds 0, 1, 2. This prevents accidental writes to stdin/stdout/stderr from reaching unexpected destinations — any printf() goes to the log file instead of a random terminal.
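A sketch of the rule in action (log path illustrative):

```c
/* Sketch: POSIX lowest-available-fd allocation. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    close(0);                         /* free fd 0 (stdin) */
    int fd = open("/tmp/daemon.log",  /* illustrative log path */
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    printf("log file got fd %d\n", fd);  /* prints 0 */
    return 0;
}
```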
Fd table growth. The kernel's fd table starts small (typically 64 entries) and grows dynamically using RCU-protected pointer replacement. Reads are lockless; writes (open/close) take files_struct->file_lock. High-connection servers can have fd tables with hundreds of thousands of entries, and the table resizes transparently.
The two fd limits. Per-process: RLIMIT_NOFILE, typically 1024 soft / 1048576 hard. System-wide: /proc/sys/fs/file-max, capping total struct file objects across all processes. Hit the per-process limit and open() fails with EMFILE. Hit the system-wide limit and it fails with ENFILE. Production systems must tune both — a web server with 10,000 connections needs at least that many fds per worker, and the system-wide limit must accommodate all workers plus everything else.
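A sketch of the startup pattern servers such as Nginx and Redis use, raising the soft limit toward the hard cap (helper name is ours):

```c
/* Sketch: raise RLIMIT_NOFILE's soft limit to the hard limit.
 * Raising soft up to hard requires no privileges. */
#include <stdio.h>
#include <sys/resource.h>

int raise_nofile(void) {             /* hypothetical helper */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;       /* soft limit up to the hard cap */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    fprintf(stderr, "fd limit: soft=%llu hard=%llu\n",
            (unsigned long long)rl.rlim_cur,
            (unsigned long long)rl.rlim_max);
    return 0;
}
```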
Common Questions
After fork(), if the child calls lseek(), does the parent see the new offset?
Yes. fork() duplicates fd table entries, but they point to the same struct file. The offset lives in the struct file, not in the fd table entry. So lseek() in the child changes the offset for both processes. The sharing semantics are identical to dup(). Only independent open() calls create separate struct files with independent offsets.
What happens if a process is at its fd limit and calls pipe()?
pipe() needs two fds. If the process is at RLIMIT_NOFILE, pipe() fails with EMFILE. Same for socketpair(). This is a common gotcha in event-loop architectures where internal signaling pipes or self-pipe tricks count against the fd limit.
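A sketch that forces the failure (the limit of 4 is purely illustrative):

```c
/* Sketch: drop RLIMIT_NOFILE so low that pipe() cannot get two fds. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    struct rlimit rl = { .rlim_cur = 4, .rlim_max = 4 };
    setrlimit(RLIMIT_NOFILE, &rl);   /* only fds 0-3 may exist */

    int fds[2];
    if (pipe(fds) == -1)             /* needs fds 3 AND 4: the second fails */
        printf("pipe: %s\n", strerror(errno));   /* EMFILE */
    return 0;
}
```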
How does close() work in multithreaded code?
Carefully, or not at all. close() removes the fd from the table, but if another thread is blocked in read() on that fd, the behavior is undefined on Linux. The blocked read might return an error, or — worse — the fd number might get reused by another thread's open(), and the original read returns data from the wrong file. This is why reliable code uses shutdown() on sockets instead of close() to unblock waiting threads, then closes the fd after all threads are done.
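A sketch of the shutdown-then-close pattern described above (helper name is ours):

```c
/* Sketch: retire a socket shared across threads without racing
 * on fd reuse. shutdown() wakes threads blocked in read();
 * close() happens only after every thread has stopped using it. */
#include <sys/socket.h>
#include <unistd.h>

void retire_socket(int sock) {       /* hypothetical helper */
    shutdown(sock, SHUT_RDWR);       /* blocked read()s return immediately */
    /* ... join or signal the worker threads using this fd ... */
    close(sock);                     /* safe: no thread still holds it */
}
```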
What's the difference between EMFILE and ENFILE?
EMFILE: this process has too many fds open. Fix: close unused fds or raise ulimit -n. ENFILE: the entire system has too many open file handles. Fix: system-wide intervention — check /proc/sys/fs/file-max and hunt for processes leaking fds across the machine. EMFILE is a per-process problem. ENFILE usually means something is leaking system-wide.
How Technologies Use This
A container process starts and discovers it can read a database socket and a secret file that belong to the host. Without fd sanitization, every file descriptor open in the runtime process is inherited across exec() into the container entrypoint, creating a privilege escalation vector.
File descriptors survive exec() by default. Docker's containerd runtime solves this by calling close_range(3, UINT_MAX, 0) before execing the entrypoint, atomically closing every fd above stderr in a single syscall. Internal fds used during container setup are opened with O_CLOEXEC so they cannot survive exec() even in a race with another thread forking.
This ensures exactly 3 fds (stdin, stdout, stderr) reach the container. The lesson: fd inheritance is the default, not the exception. Any runtime that exec()s untrusted code must sanitize fds explicitly or leak host state.
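A sketch of that sanitization step. close_range() needs Linux 5.11+ (and glibc 2.34+ for the library wrapper); on older systems, loop close() up to sysconf(_SC_OPEN_MAX):

```c
/* Sketch: close every fd above stderr, then exec the entrypoint. */
#define _GNU_SOURCE
#include <limits.h>
#include <unistd.h>

void sanitize_and_exec(char *const argv[]) {   /* hypothetical helper */
    close_range(3, UINT_MAX, 0);   /* one atomic syscall */
    execv(argv[0], argv);          /* only fds 0, 1, 2 reach the program */
}
```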
A Kafka broker managing 10,000 partitions suddenly stops accepting new producer connections. Writes stall cluster-wide. The broker logs show "Too many open files" but the server has plenty of memory and CPU. The default RLIMIT_NOFILE of 1024 was hit silently.
Each partition has 3 log segments plus 3 index files, totaling 60,000 file descriptors just for storage, before counting client sockets, inter-broker replication connections, and ZooKeeper links. The default per-process fd limit of 1024 is exhausted almost instantly at production scale.
Kafka's operational docs require ulimit -n of at least 100,000. The broker exposes open-file-descriptor-count via JMX so monitoring can alert when usage crosses 80% of the limit. Set the limit before the first broker starts, not after the first outage.
Nginx starts dropping connections at exactly 5,000 concurrent proxied requests per worker. The error log fills with "Too many open files" but the server is barely at 20% CPU. The culprit is invisible unless someone counts fds: each proxied connection costs 2 fds (one client socket, one upstream backend), so 5,000 connections means 10,000 fds per worker.
With 4 workers that is 40,000 fds system-wide. Nginx reads worker_rlimit_nofile from config and calls setrlimit() to raise RLIMIT_NOFILE at startup, logging a warning if the hard limit is too low. After daemonizing, it uses dup2() to redirect stderr to the error log fd, and sets O_CLOEXEC on listener sockets so spawned CGI children never inherit them.
Set worker_rlimit_nofile to at least double the expected concurrent connections per worker. Proxy architectures double fd consumption, and the default 1024 limit is exhausted almost immediately at production traffic levels.
A complex analytical query joins 20 tables, each partitioned into 25 segments, with 10 WAL segments and 20 temp sort files open simultaneously. That is over 500 file references in a single backend. With a default RLIMIT_NOFILE of 1024, the query would crash mid-execution with EMFILE.
PostgreSQL solves this with a Virtual File Descriptor (VFD) layer that tracks thousands of logical file references but keeps only a bounded number of real kernel fds open simultaneously. When a VFD needs I/O, the layer closes an LRU fd and opens the needed file, recycling kernel fds transparently.
This lets a single backend reference over 10,000 files while never exceeding a few hundred real fds. The lesson: when the OS fd limit cannot accommodate the workload, build a userspace multiplexing layer rather than demanding unlimited kernel resources.
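A toy sketch of the idea (this is not PostgreSQL's actual fd.c; the names, cap, and eviction policy are illustrative, and error handling is omitted):

```c
/* Toy VFD layer: many logical file references, few real kernel fds.
 * On a miss, evict the least-recently-used real fd, saving its offset
 * so the file can be reopened and repositioned transparently. */
#include <fcntl.h>
#include <unistd.h>

#define MAX_VFDS     1024
#define MAX_REAL_FDS 4               /* tiny cap for illustration */

struct vfd {
    const char *path;
    off_t offset;                    /* saved position while closed */
    int real_fd;                     /* -1 when no kernel fd is held */
    unsigned long last_used;         /* LRU timestamp */
};

static struct vfd table[MAX_VFDS];
static int open_count;
static unsigned long tick;

/* Give a vfd a real kernel fd, evicting the LRU holder if at the cap. */
int vfd_activate(struct vfd *v) {
    if (v->real_fd == -1) {
        if (open_count == MAX_REAL_FDS) {
            struct vfd *lru = NULL;
            for (int i = 0; i < MAX_VFDS; i++)
                if (table[i].real_fd != -1 &&
                    (!lru || table[i].last_used < lru->last_used))
                    lru = &table[i];
            lru->offset = lseek(lru->real_fd, 0, SEEK_CUR);
            close(lru->real_fd);     /* recycle the kernel fd */
            lru->real_fd = -1;
            open_count--;
        }
        v->real_fd = open(v->path, O_RDWR);
        lseek(v->real_fd, v->offset, SEEK_SET);  /* restore position */
        open_count++;
    }
    v->last_used = ++tick;
    return v->real_fd;
}
```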
Redis maxclients is configured to 10,000 but the startup log shows it silently dropped to 992. Clients start getting rejected long before expected. The server has plenty of memory, yet Redis refuses new connections.
Each client connection consumes exactly one file descriptor, and maxclients is directly bounded by RLIMIT_NOFILE minus 32 reserved fds for internal use (AOF persistence, RDB snapshots, replication links, Lua scripts). At startup, Redis checks the soft limit via getrlimit() and attempts setrlimit() to raise it. If the limit is only 1024, maxclients silently drops to 992 with no error, just a warning in the log.
A production Redis instance serving 10K concurrent clients needs ulimit -n set to at least 10,032. Always check the Redis startup log for fd limit warnings and set RLIMIT_NOFILE before launch, not after the first connection storm.
Same Concept Across Tech
| Technology | How fd limits affect it | Key config |
|---|---|---|
| Nginx | worker_connections is bounded by fd limit per worker | worker_rlimit_nofile in nginx.conf |
| Node.js | Each TCP connection = 1 fd, each file open = 1 fd | ulimit -n before starting node |
| JVM | Sockets + file handles + JNI resources all consume fds | -XX:MaxDirectMemorySize does not help, need ulimit |
| Docker | Container inherits host ulimit unless overridden | docker run --ulimit nofile=65536:65536 |
| Kubernetes | Pod inherits node ulimit, can override via securityContext | Set in node provisioning, not pod spec |
| Go | Goroutines share process fd table, high goroutine count can exhaust fds | Set RLIMIT_NOFILE before starting |
Stack layer mapping (too many open files debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app closing connections/files after use? Fd leak? | Application logs, lsof -p PID |
| Runtime | Is the connection pool configured with a max? | Pool config, runtime metrics |
| Process | What is the current fd count vs limit? | ls /proc/PID/fd |
| System | Is the system-wide fd limit reached? | cat /proc/sys/fs/file-nr |
| Kernel | Is the kernel fd table full? | dmesg, /proc/sys/fs/file-max |
Design Rationale
Small integers pass trivially across fork() and exec() boundaries and can live in environment variables or command-line arguments -- opaque pointers cannot. The three-layer architecture exists because sharing requirements differ at each level: fork() should share file offsets (shell pipelines depend on this), but independent opens of the same file need independent offsets (concurrent readers depend on that). The default limit of 1024 was a conservative guard against runaway processes exhausting kernel memory, set in an era when total system memory was measured in megabytes.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| EMFILE: too many open files | Per-process fd limit reached | ulimit -n and ls /proc/PID/fd |
| ENFILE: file table overflow | System-wide fd limit reached | cat /proc/sys/fs/file-nr |
| Fd count growing over time | Fd leak, not closing sockets or files | watch ls /proc/PID/fd |
| Child process has unexpected database connection | Fd inherited across fork(), missing O_CLOEXEC | lsof -p child_pid |
| Writes from two processes interleaving in same file | Shared fd (from fork) with shared file offset | Use O_APPEND or separate fds |
| accept() fails on listening socket | Fd limit reached, cannot allocate a new socket fd | Check ulimit -n, raise it, restart |
When to Use / Avoid
Relevant when:
- Debugging "too many open files" errors (EMFILE vs ENFILE)
- Understanding how fork() shares or duplicates file descriptors
- Building servers that manage thousands of concurrent connections
- Diagnosing fd leaks where a long-running process slowly exhausts its limit
Watch out for:
- Forgetting to close fds before exec(), leaking them to child processes (use O_CLOEXEC)
- Confusing per-process limit (ulimit -n) with system-wide limit (/proc/sys/fs/file-max)
- Assuming dup2(oldfd, newfd) closes oldfd: it closes only the target (newfd) if it was open; the source stays open and must be closed separately (see the sketch below)
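A sketch of the correct dup2() pattern (helper name and path are illustrative):

```c
/* Sketch: redirect stderr to a log file with dup2(). dup2 closes the
 * TARGET (fd 2) before duplicating; the SOURCE fd stays open and
 * must be closed explicitly. */
#include <fcntl.h>
#include <unistd.h>

int redirect_stderr(const char *logpath) {   /* hypothetical helper */
    int log = open(logpath, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (log == -1)
        return -1;
    dup2(log, STDERR_FILENO);   /* closes old fd 2, then duplicates */
    close(log);                 /* the source fd is still ours to close */
    return 0;
}
```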
Try It Yourself
```sh
# List all open file descriptors of the current shell, showing symlinks to actual files
ls -la /proc/$$/fd

# Shows three numbers: allocated file handles, free file handles, and the maximum (system-wide fd usage)
cat /proc/sys/fs/file-nr

# Display the soft limit on per-process file descriptors (RLIMIT_NOFILE)
ulimit -n

# Count the number of open file descriptors for the current shell process
lsof -p $$ | wc -l

# Trace openat/close syscalls to see the fd lifecycle for a simple file write
strace -e openat,close -f bash -c 'echo hello > /tmp/test'

# Resolve what file fd 0 (stdin) actually points to (reveals /dev/pts/N for terminals)
readlink /proc/$$/fd/0
```
Debug Checklist
1. List open fds for a process: ls -la /proc/<pid>/fd
2. Count open fds: ls /proc/<pid>/fd | wc -l
3. Check per-process limit: cat /proc/<pid>/limits | grep 'open files'
4. Check system-wide fd usage: cat /proc/sys/fs/file-nr
5. Find which files a process has open: lsof -p <pid>
6. Check for fd leaks over time: watch -n5 'ls /proc/<pid>/fd | wc -l'
Key Takeaways
- ✓ dup()/dup2() and fork() share the SAME struct file — meaning file offset changes in one process are visible in the other. Two separate open() calls create independent struct files with independent offsets. This distinction constantly breaks people's mental models
- ✓ Two different limits, two different errors: RLIMIT_NOFILE (per-process, default 1024) produces EMFILE; /proc/sys/fs/file-max (system-wide) produces ENFILE. Production servers must tune both
- ✓ Without O_CLOEXEC, child processes inherit every open fd across exec() — including sockets, database connections, and files opened with elevated privileges. This is a security risk and a resource leak
- ✓ The kernel always assigns the lowest available fd number on open() — close stdin (fd 0), then open a file, and it gets fd 0. A classic footgun in daemon code
- ✓ close_range() (Linux 5.11+) atomically closes a range of fds in one syscall — far more efficient than looping close() for fd sanitization before exec()
Common Pitfalls
- ✗ Thinking two fds from separate open() calls share a file offset — they don't. Each open() creates a new struct file with its own position. Only dup() and fork() share offsets, because they share the struct file
- ✗ Forgetting O_CLOEXEC in multithreaded programs — between open() and a subsequent fcntl(FD_CLOEXEC), another thread can fork+exec and leak the fd. Set O_CLOEXEC atomically at open() time
- ✗ Ignoring close()'s return value — on NFS, close() can return EIO if a deferred write failed. Ignoring this means silently losing data and discovering it much, much later
- ✗ Leaking pipe() or socketpair() fds — each creates TWO fds, and forgetting to close the unused end in parent or child is one of the most common fd leaks in production systems
Reference
In One Line
Most "too many open files" errors trace to fd leaks or a default ulimit of 1024 that nobody raised -- check both before blaming the application.