File Descriptors & File Tables
Mental Model
A coat check at a restaurant. Hand over a coat, get a numbered ticket back. The ticket is not the coat -- it is a reference to where the coat hangs. After a fork, two people hold tickets for the same coat on the same hook. The attendant tracks which hook each ticket maps to, and the restaurant has a limited number of hooks.
The Problem
A web server at 10,000 concurrent connections suddenly refuses new ones: "too many open files." Each connection is a socket, each socket is an fd, and the default per-process limit is 1,024. Elsewhere, a forked worker inherited a database connection fd it should never have seen -- both processes write through the same file offset, interleaving output and corrupting the stream.
Architecture
A program calls open() and gets back the number 3. It calls open() again and gets 4. It reads, writes, seeks — all through these tiny integers.
But what IS the number 3? It's not a pointer. It's not a handle. It's an index into a per-process table, and that table points to a system-wide table, and that table points to an inode. Three levels of indirection. And the sharing semantics at each level are different.
This three-layer design explains everything: why forked children share file offsets with parents, why closing one copy of a dup'd fd doesn't close the file, why a server crashes at exactly 1024 connections, and why a leaked fd can be a security vulnerability.
What Actually Happens
When a process calls open("/tmp/data.txt", O_RDWR), the kernel does three things:
1. Finds the lowest unused slot in the process's fd table (struct fdtable). This produces the integer — say, 3.
2. Allocates a struct file (called an "open file description" in POSIX terminology) in the system-wide table. This struct holds the current file offset, access mode (read/write/append), and status flags.
3. Resolves the path to a struct inode via a VFS path walk, and links the struct file to the inode.
The fd table entry points to the struct file. The struct file points to the inode. That's the three-level chain.
Now here's the critical insight. dup() and fork() create new fd table entries that point to the same struct file. The offset is shared. If the parent writes 100 bytes and advances the offset to position 100, the child's fd sees position 100 too.
Two independent open() calls on the same file? Completely separate struct files. Independent offsets. Independent flags. The only thing shared is the inode.
This distinction is not academic. It's the reason shell I/O redirection works. It's the reason >> (append mode) from multiple processes doesn't corrupt files. And it's the reason debugging fork()-based servers requires understanding which fds are shared and which are independent.
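A minimal sketch of both behaviors (the path is illustrative; error handling omitted for brevity):

```c
/* Sketch: fork() shares the file offset (same struct file);
 * a second open() gets an independent one. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int shared = open("/tmp/offset-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);

    if (fork() == 0) {                    /* child holds the same struct file */
        write(shared, "0123456789", 10);  /* advances the shared offset */
        _exit(0);
    }
    wait(NULL);

    /* Parent's fd sees the offset the child advanced. */
    printf("parent offset after child write: %ld\n",
           (long)lseek(shared, 0, SEEK_CUR));      /* prints 10 */

    /* A second open() creates its own struct file: offset starts at 0. */
    int independent = open("/tmp/offset-demo", O_RDWR);
    printf("independent open() offset:      %ld\n",
           (long)lseek(independent, 0, SEEK_CUR)); /* prints 0 */
    return 0;
}
```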
Under the Hood
O_CLOEXEC: the flag that must never be forgotten. When a process calls exec(), all open fds are inherited by the new program — unless they're marked close-on-exec. Without O_CLOEXEC, a privilege-dropping daemon that exec's a child process hands it every open socket, every database connection, every file opened with elevated privileges. Since Linux 2.6.23, O_CLOEXEC can be set atomically at open() time. Before that, calling fcntl(fd, F_SETFD, FD_CLOEXEC) separately was the only option, creating a race window in multithreaded programs where another thread could fork+exec between the open and the fcntl.
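A sketch of both patterns; the helper names are ours, not from any library:

```c
#include <fcntl.h>

/* Racy pre-2.6.23 pattern: between open() and fcntl(), another thread
 * can fork+exec and the child inherits the fd without close-on-exec. */
int open_racy(const char *path) {        /* hypothetical helper */
    int fd = open(path, O_RDWR);
    if (fd != -1)
        fcntl(fd, F_SETFD, FD_CLOEXEC);  /* too late if a fork won the race */
    return fd;
}

/* Atomic pattern (Linux 2.6.23+): the fd is never observable
 * without the close-on-exec flag set. */
int open_safe(const char *path) {        /* hypothetical helper */
    return open(path, O_RDWR | O_CLOEXEC);
}
```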
Lowest-available-fd allocation. POSIX mandates that open() returns the lowest available fd number. This is deliberate, not arbitrary. Daemon code that closes stdin/stdout/stderr (fds 0, 1, 2) before opening log files gets those log files assigned to fds 0, 1, 2. This prevents accidental writes to stdin/stdout/stderr from reaching unexpected destinations — any printf() goes to the log file instead of a random terminal.
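A sketch of the rule in action (log path illustrative):

```c
/* Sketch: POSIX lowest-available-fd allocation. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    close(0);                         /* free fd 0 (stdin) */
    int fd = open("/tmp/daemon.log",  /* illustrative log path */
                  O_WRONLY | O_CREAT | O_APPEND, 0644);
    printf("log file got fd %d\n", fd);  /* prints 0 */
    return 0;
}
```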
Fd table growth. The kernel's fd table starts small (typically 64 entries) and grows dynamically using RCU-protected pointer replacement. Reads are lockless; writes (open/close) take files_struct->file_lock. High-connection servers can have fd tables with hundreds of thousands of entries, and the table resizes transparently.
The two fd limits. Per-process: RLIMIT_NOFILE, typically 1024 soft / 1048576 hard. System-wide: /proc/sys/fs/file-max, capping total struct file objects across all processes. Hit the per-process limit and open() fails with EMFILE. Hit the system-wide limit and it fails with ENFILE. Production systems must tune both — a web server with 10,000 connections needs at least that many fds per worker, and the system-wide limit must accommodate all workers plus everything else.
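A sketch of the startup pattern servers such as Nginx and Redis use, raising the soft limit toward the hard cap (helper name is ours):

```c
/* Sketch: raise RLIMIT_NOFILE's soft limit to the hard limit.
 * Raising soft up to hard requires no privileges. */
#include <stdio.h>
#include <sys/resource.h>

int raise_nofile(void) {             /* hypothetical helper */
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;       /* soft limit up to the hard cap */
    if (setrlimit(RLIMIT_NOFILE, &rl) != 0)
        return -1;
    fprintf(stderr, "fd limit: soft=%llu hard=%llu\n",
            (unsigned long long)rl.rlim_cur,
            (unsigned long long)rl.rlim_max);
    return 0;
}
```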
Common Questions
After fork(), if the child calls lseek(), does the parent see the new offset?
Yes. fork() duplicates fd table entries, but they point to the same struct file. The offset lives in the struct file, not in the fd table entry. So lseek() in the child changes the offset for both processes. The sharing semantics are identical to dup(). Only independent open() calls create separate struct files with independent offsets.
What happens if a process is at its fd limit and calls pipe()?
pipe() needs two fds. If the process is at RLIMIT_NOFILE, pipe() fails with EMFILE. Same for socketpair(). This is a common gotcha in event-loop architectures where internal signaling pipes or self-pipe tricks count against the fd limit.
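A sketch that forces the failure (the limit of 4 is purely illustrative):

```c
/* Sketch: drop RLIMIT_NOFILE so low that pipe() cannot get two fds. */
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/resource.h>
#include <unistd.h>

int main(void) {
    struct rlimit rl = { .rlim_cur = 4, .rlim_max = 4 };
    setrlimit(RLIMIT_NOFILE, &rl);   /* only fds 0-3 may exist */

    int fds[2];
    if (pipe(fds) == -1)             /* needs fds 3 AND 4: the second fails */
        printf("pipe: %s\n", strerror(errno));   /* EMFILE */
    return 0;
}
```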
How does close() work in multithreaded code?
Carefully, or not at all. close() removes the fd from the table, but if another thread is blocked in read() on that fd, the behavior is undefined on Linux. The blocked read might return an error, or — worse — the fd number might get reused by another thread's open(), and the original read returns data from the wrong file. This is why reliable code uses shutdown() on sockets instead of close() to unblock waiting threads, then closes the fd after all threads are done.
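A sketch of the shutdown-then-close pattern described above (helper name is ours):

```c
/* Sketch: retire a socket shared across threads without racing
 * on fd reuse. shutdown() wakes threads blocked in read();
 * close() happens only after every thread has stopped using it. */
#include <sys/socket.h>
#include <unistd.h>

void retire_socket(int sock) {       /* hypothetical helper */
    shutdown(sock, SHUT_RDWR);       /* blocked read()s return immediately */
    /* ... join or signal the worker threads using this fd ... */
    close(sock);                     /* safe: no thread still holds it */
}
```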
What's the difference between EMFILE and ENFILE?
EMFILE: this process has too many fds open. Fix: close unused fds or raise ulimit -n. ENFILE: the entire system has too many open file handles. Fix: system-wide intervention — check /proc/sys/fs/file-max and hunt for processes leaking fds across the machine. EMFILE is a per-process problem. ENFILE usually means something is leaking system-wide.
How Technologies Use This
A container process starts and discovers it can read a database socket and a secret file that belong to the host. Without fd sanitization, every file descriptor open in the runtime process is inherited across exec() into the container entrypoint, creating a privilege escalation vector.
File descriptors survive exec() by default. Docker's containerd runtime solves this by calling close_range(3, UINT_MAX, 0) before execing the entrypoint, atomically closing every fd above stderr in a single syscall. Internal fds used during container setup are opened with O_CLOEXEC so they cannot survive exec() even in a race with another thread forking.
This ensures exactly 3 fds (stdin, stdout, stderr) reach the container. The lesson: fd inheritance is the default, not the exception. Any runtime that exec()s untrusted code must sanitize fds explicitly or leak host state.
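A sketch of that sanitization step. close_range() needs Linux 5.11+ (and glibc 2.34+ for the library wrapper); on older systems, loop close() up to sysconf(_SC_OPEN_MAX):

```c
/* Sketch: close every fd above stderr, then exec the entrypoint. */
#define _GNU_SOURCE
#include <limits.h>
#include <unistd.h>

void sanitize_and_exec(char *const argv[]) {   /* hypothetical helper */
    close_range(3, UINT_MAX, 0);   /* one atomic syscall */
    execv(argv[0], argv);          /* only fds 0, 1, 2 reach the program */
}
```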
A Kafka broker managing 10,000 partitions suddenly stops accepting new producer connections. Writes stall cluster-wide. The broker logs show "Too many open files" but the server has plenty of memory and CPU. The default RLIMIT_NOFILE of 1024 was hit silently.
Each partition has 3 log segments plus 3 index files, totaling 60,000 file descriptors just for storage, before counting client sockets, inter-broker replication connections, and ZooKeeper links. The default per-process fd limit of 1024 is exhausted almost instantly at production scale.
Kafka's operational docs require ulimit -n of at least 100,000. The broker exposes open-file-descriptor-count via JMX so monitoring can alert when usage crosses 80% of the limit. Set the limit before the first broker starts, not after the first outage.
Nginx starts dropping connections at exactly 5,000 concurrent proxied requests per worker. The error log fills with "Too many open files" but the server is barely at 20% CPU. The culprit is invisible unless someone counts fds: each proxied connection costs 2 fds (one client socket, one upstream backend), so 5,000 connections means 10,000 fds per worker.
With 4 workers that is 40,000 fds system-wide. Nginx reads worker_rlimit_nofile from config and calls setrlimit() to raise RLIMIT_NOFILE at startup, logging a warning if the hard limit is too low. After daemonizing, it uses dup2() to redirect stderr to the error log fd, and sets O_CLOEXEC on listener sockets so spawned CGI children never inherit them.
Set worker_rlimit_nofile to at least double the expected concurrent connections per worker. Proxy architectures double fd consumption, and the default 1024 limit is exhausted almost immediately at production traffic levels.
A complex analytical query joins 20 tables, each partitioned into 25 segments, with 10 WAL segments and 20 temp sort files open simultaneously. That is over 500 file references in a single backend. With a default RLIMIT_NOFILE of 1024, the query would crash mid-execution with EMFILE.
PostgreSQL solves this with a Virtual File Descriptor (VFD) layer that tracks thousands of logical file references but keeps only a bounded number of real kernel fds open simultaneously. When a VFD needs I/O, the layer closes an LRU fd and opens the needed file, recycling kernel fds transparently.
This lets a single backend reference over 10,000 files while never exceeding a few hundred real fds. The lesson: when the OS fd limit cannot accommodate the workload, build a userspace multiplexing layer rather than demanding unlimited kernel resources.
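A toy sketch of the idea (this is not PostgreSQL's actual fd.c; the names, cap, and eviction policy are illustrative, and error handling is omitted):

```c
/* Toy VFD layer: many logical file references, few real kernel fds.
 * On a miss, evict the least-recently-used real fd, saving its offset
 * so the file can be reopened and repositioned transparently. */
#include <fcntl.h>
#include <unistd.h>

#define MAX_VFDS     1024
#define MAX_REAL_FDS 4               /* tiny cap for illustration */

struct vfd {
    const char *path;
    off_t offset;                    /* saved position while closed */
    int real_fd;                     /* -1 when no kernel fd is held */
    unsigned long last_used;         /* LRU timestamp */
};

static struct vfd table[MAX_VFDS];
static int open_count;
static unsigned long tick;

/* Give a vfd a real kernel fd, evicting the LRU holder if at the cap. */
int vfd_activate(struct vfd *v) {
    if (v->real_fd == -1) {
        if (open_count == MAX_REAL_FDS) {
            struct vfd *lru = NULL;
            for (int i = 0; i < MAX_VFDS; i++)
                if (table[i].real_fd != -1 &&
                    (!lru || table[i].last_used < lru->last_used))
                    lru = &table[i];
            lru->offset = lseek(lru->real_fd, 0, SEEK_CUR);
            close(lru->real_fd);     /* recycle the kernel fd */
            lru->real_fd = -1;
            open_count--;
        }
        v->real_fd = open(v->path, O_RDWR);
        lseek(v->real_fd, v->offset, SEEK_SET);  /* restore position */
        open_count++;
    }
    v->last_used = ++tick;
    return v->real_fd;
}
```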
Redis maxclients is configured to 10,000 but the startup log shows it silently dropped to 992. Clients start getting rejected long before expected. The server has plenty of memory, yet Redis refuses new connections.
Each client connection consumes exactly one file descriptor, and maxclients is directly bounded by RLIMIT_NOFILE minus 32 reserved fds for internal use (AOF persistence, RDB snapshots, replication links, Lua scripts). At startup, Redis checks the soft limit via getrlimit() and attempts setrlimit() to raise it. If the limit is only 1024, maxclients silently drops to 992 with no error, just a warning in the log.
A production Redis instance serving 10K concurrent clients needs ulimit -n set to at least 10,032. Always check the Redis startup log for fd limit warnings and set RLIMIT_NOFILE before launch, not after the first connection storm.
Same Concept Across Tech
| Technology | How fd limits affect it | Key config |
|---|---|---|
| Nginx | worker_connections is bounded by fd limit per worker | worker_rlimit_nofile in nginx.conf |
| Node.js | Each TCP connection = 1 fd, each file open = 1 fd | ulimit -n before starting node |
| JVM | Sockets + file handles + JNI resources all consume fds | -XX:MaxDirectMemorySize does not help, need ulimit |
| Docker | Container inherits host ulimit unless overridden | docker run --ulimit nofile=65536:65536 |
| Kubernetes | Pod inherits node ulimit, can override via securityContext | Set in node provisioning, not pod spec |
| Go | Goroutines share process fd table, high goroutine count can exhaust fds | Set RLIMIT_NOFILE before starting |
Stack layer mapping (too many open files debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app closing connections/files after use? Fd leak? | Application logs, lsof -p PID |
| Runtime | Is the connection pool configured with a max? | Pool config, runtime metrics |
| Process | What is the current fd count vs limit? | ls /proc/PID/fd |
| System | Is the system-wide fd limit reached? | cat /proc/sys/fs/file-nr |
| Kernel | Is the kernel fd table full? | dmesg, /proc/sys/fs/file-max |
Design Rationale
Small integers pass trivially across fork() and exec() boundaries and can live in environment variables or command-line arguments -- opaque pointers cannot. The three-layer architecture exists because sharing requirements differ at each level: fork() should share file offsets (shell pipelines depend on this), but independent opens of the same file need independent offsets (concurrent readers depend on that). The default limit of 1024 was a conservative guard against runaway processes exhausting kernel memory, set in an era when total system memory was measured in megabytes.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| EMFILE: too many open files | Per-process fd limit reached | ulimit -n and ls /proc/PID/fd |
| ENFILE: file table overflow | System-wide fd limit reached | cat /proc/sys/fs/file-nr |
| Fd count growing over time | Fd leak, not closing sockets or files | watch ls /proc/PID/fd |
| Child process has unexpected database connection | Fd inherited across fork(), missing O_CLOEXEC | lsof -p child_pid |
| Writes from two processes interleaving in same file | Shared fd (from fork) with shared file offset | Use O_APPEND or separate fds |
| accept() fails on listening socket | Fd limit reached, cannot allocate a new socket fd | Check ulimit -n, raise it, restart |
When to Use / Avoid
Relevant when:
- Debugging "too many open files" errors (EMFILE vs ENFILE)
- Understanding how fork() shares or duplicates file descriptors
- Building servers that manage thousands of concurrent connections
- Diagnosing fd leaks where a long-running process slowly exhausts its limit
Watch out for:
- Forgetting to close fds before exec(), leaking them to child processes (use O_CLOEXEC)
- Confusing per-process limit (ulimit -n) with system-wide limit (/proc/sys/fs/file-max)
- Assuming dup2(oldfd, newfd) closes oldfd: it closes only the target (newfd) if it was open; the source stays open and must be closed separately (see the sketch below)
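A sketch of the correct dup2() pattern (helper name and path are illustrative):

```c
/* Sketch: redirect stderr to a log file with dup2(). dup2 closes the
 * TARGET (fd 2) before duplicating; the SOURCE fd stays open and
 * must be closed explicitly. */
#include <fcntl.h>
#include <unistd.h>

int redirect_stderr(const char *logpath) {   /* hypothetical helper */
    int log = open(logpath, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (log == -1)
        return -1;
    dup2(log, STDERR_FILENO);   /* closes old fd 2, then duplicates */
    close(log);                 /* the source fd is still ours to close */
    return 0;
}
```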
Try It Yourself
```sh
# List all open file descriptors of the current shell, showing symlinks to actual files
ls -la /proc/$$/fd

# Shows three numbers: allocated file handles, free file handles, and the maximum (system-wide fd usage)
cat /proc/sys/fs/file-nr

# Display the soft limit on per-process file descriptors (RLIMIT_NOFILE)
ulimit -n

# Count the number of open file descriptors for the current shell process
lsof -p $$ | wc -l

# Trace openat/close syscalls to see the fd lifecycle for a simple file write
strace -e openat,close -f bash -c 'echo hello > /tmp/test'

# Resolve what file fd 0 (stdin) actually points to (reveals /dev/pts/N for terminals)
readlink /proc/$$/fd/0
```
Debug Checklist
1. List open fds for a process: ls -la /proc/<pid>/fd
2. Count open fds: ls /proc/<pid>/fd | wc -l
3. Check per-process limit: cat /proc/<pid>/limits | grep 'open files'
4. Check system-wide fd usage: cat /proc/sys/fs/file-nr
5. Find which files a process has open: lsof -p <pid>
6. Check for fd leaks over time: watch -n5 'ls /proc/<pid>/fd | wc -l'
Key Takeaways
- ✓ dup()/dup2() and fork() share the SAME struct file — meaning file offset changes in one process are visible in the other. Two separate open() calls create independent struct files with independent offsets. This distinction constantly breaks people's mental models
- ✓ Two different limits, two different errors: RLIMIT_NOFILE (per-process, default 1024) produces EMFILE; /proc/sys/fs/file-max (system-wide) produces ENFILE. Production servers must tune both
- ✓ Without O_CLOEXEC, child processes inherit every open fd across exec() — including sockets, database connections, and files opened with elevated privileges. This is a security risk and a resource leak
- ✓ The kernel always assigns the lowest available fd number on open() — close stdin (fd 0), then open a file, and it gets fd 0. A classic footgun in daemon code
- ✓ close_range() (Linux 5.11+) atomically closes a range of fds in one syscall — far more efficient than looping close() for fd sanitization before exec()
Common Pitfalls
- ✗ Thinking two fds from separate open() calls share a file offset — they don't. Each open() creates a new struct file with its own position. Only dup() and fork() share offsets, because they share the struct file
- ✗ Forgetting O_CLOEXEC in multithreaded programs — between open() and a subsequent fcntl(FD_CLOEXEC), another thread can fork+exec and leak the fd. Set O_CLOEXEC atomically at open() time
- ✗ Ignoring close()'s return value — on NFS, close() can return EIO if a deferred write failed. Ignoring this means silently losing data and discovering it much, much later
- ✗ Leaking pipe() or socketpair() fds — each creates TWO fds, and forgetting to close the unused end in parent or child is one of the most common fd leaks in production systems
Reference
In One Line
Most "too many open files" errors trace to fd leaks or a default ulimit of 1024 that nobody raised -- check both before blaming the application.