Process Lifecycle (fork/exec/wait)
Mental Model
Cell division. A cell splits into two identical copies. One copy transforms entirely into a different cell type (exec). The original monitors its offspring. If the offspring dies and the original never acknowledges it, the dead cell stays on the registry -- taking up a slot, doing nothing, blocking new cells from that space.
The Problem
A long-running server has 200 zombie entries in the process table -- dead workers still holding PID slots because the parent never called waitpid(). PID exhaustion is approaching. In a separate incident, a containerized app crashes because PID 1 inside the container never reaps its children, and nobody configured an init process. Both problems have the same root cause: the parent side of the lifecycle was never finished.
Architecture
Every time ls is typed into a terminal, something surprising happens. The shell does not simply "run" the ls program. It clones itself. Then the clone erases its own identity and becomes ls. The original shell waits for its clone to finish, collects its exit status, and prints the next prompt.
This three-step dance -- fork, exec, wait -- is how every single program is launched on Linux. Every server worker. Every Docker container. Every browser tab. It has been this way since the 1970s, and there is a good reason it has survived.
What Actually Happens
fork() creates a new task_struct -- the kernel's ~8KB data structure representing a process. The child gets a new PID, but inherits almost everything else. Memory is shared via copy-on-write (page table entries are duplicated, physical pages are not). The file descriptor table is copied, but the underlying file descriptions are shared -- parent and child share file offsets.
fork() returns twice: once in the parent (returning the child's PID) and once in the child (returning 0). A return of -1 means failure, usually EAGAIN from PID or memory limits.
execve() replaces the process image. The kernel loads the new ELF binary, replaces text, data, BSS, heap, and stack segments, resets signal handlers to defaults, and starts execution at the new entry point. The PID stays the same. Open file descriptors survive (unless marked O_CLOEXEC). Process relationships are preserved.
Here is the critical detail: execve() is a point of no return. If it succeeds, it never returns -- the old code is gone. If it fails, it returns -1, and execution continues in the original process.
waitpid() collects the child's termination status. Between the child's _exit() and the parent's waitpid(), the child exists as a zombie -- memory freed, but task_struct retained to hold the exit status. Modern systems provide safety nets: systemd acts as a subreaper, and setting SIGCHLD to SIG_IGN tells the kernel to auto-reap.
Under the Hood
The exec family. execve() is the only real syscall. execl, execlp, execvp, execvpe are all libc wrappers that handle PATH searching and argument formatting. The p variants search $PATH; the e variants accept an explicit environment.
Zombie vs orphan. A zombie has exited but not been waited on. An orphan is a running child whose parent has died. Orphans get reparented to PID 1 (or the nearest subreaper); zombies wait to be reaped. A process can be both -- an orphan zombie is a zombie whose original parent died, now reparented to init, which will reap it promptly.
vfork() optimization. vfork() dates from the era when fork() physically copied memory: it shares the parent's address space directly and suspends the parent until the child calls exec() or _exit(). On modern kernels with COW fork, the benefit is marginal and the danger is real -- any variable the child modifies corrupts the parent. posix_spawn() is the modern safe alternative.
Signal handler reset across exec. Handlers point to functions in the old process image, which is gone after exec. So they are all reset to SIG_DFL. But SIG_IGN survives -- it is a kernel-level setting, not a pointer to user code. That is how nohup works: it sets SIGHUP to SIG_IGN before exec'ing the target program.
Common Questions
What happens when fork() is called in a multithreaded process?
Only the calling thread is replicated in the child. All other threads vanish. Any mutexes they held remain locked -- this is why only async-signal-safe functions (or an immediate exec) should be called in the child of a multithreaded fork. Use pthread_atfork() to register handlers that release locks before fork and reacquire after. Or better yet, do what Go does: combine fork and exec into a single atomic operation.
How does the kernel decide whether to run the parent or child first after fork()?
Historically, Linux ran the child first so that an immediate exec would avoid pointless COW faults; under the old O(1) scheduler the child also inherited half the parent's remaining timeslice. CFS has no fixed timeslices -- it generally runs whichever task has less vruntime -- and exposes the choice via /proc/sys/kernel/sched_child_runs_first (0 by default, meaning parent first).
What is the maximum number of processes a user can create?
Bounded by three limits: RLIMIT_NPROC (per-user, check with ulimit -u), the system-wide /proc/sys/kernel/pid_max (default 32768, max 4194304 on 64-bit), and available memory for task_struct allocation. When any limit is hit, fork() returns -1 with errno EAGAIN.
Why does execve() reset custom signal handlers to SIG_DFL but preserve SIG_IGN?
Custom handlers are function pointers into the old process image. After exec, that code is gone -- the pointer would be dangling. But SIG_IGN is not a pointer. It is a kernel-level constant. So it survives. This design choice is what makes nohup possible: set SIGHUP to SIG_IGN, then exec the target program. The ignore sticks.
How Technologies Use This
A containerized Python app starts failing with "fork: Resource temporarily unavailable" after running for a few days. Inside the container, ps shows hundreds of zombie processes. The app spawns subprocesses that crash, but never calls wait() to reap them.
Inside a container's PID namespace, there is no systemd or init to clean up orphans. Zombies accumulate silently, each consuming a PID slot, until the container's PID limit -- often capped at a few thousand via the pids cgroup -- is exhausted and fork() fails for every new request. The app itself appears healthy -- it is the invisible dead children that are consuming all the PIDs.
Fix: use Docker's --init flag, which injects tini as PID 1 inside the container. Tini catches SIGCHLD and calls waitpid(-1, WNOHANG) in a loop, reaping every orphaned zombie automatically. It also forwards SIGTERM to the actual application, so docker stop triggers graceful shutdown instead of a SIGKILL after the 10-second timeout.
A busy Nginx instance serving 50,000 requests per second loses a worker to an OOM kill. Without a supervisor, that worker's share of capacity is gone -- requests queue up and latency spikes within milliseconds. Nobody notices until users start complaining.
The master process monitors all workers with waitpid(). When a worker dies from any cause -- segfault, OOM, or manual kill -- the master detects it within one event loop iteration through SIGCHLD delivery and forks an immediate replacement. No manual intervention, no capacity gap.
During config reload via SIGHUP, the same lifecycle applies: the master forks new workers with the updated nginx.conf and signals old workers to drain. Old workers finish in-flight requests (up to worker_shutdown_timeout seconds), then exit and get reaped. The entire reload happens with zero dropped connections.
A PostgreSQL server has 300 connections and max_connections set to 300. A backend crashes mid-transaction, but the connection count still shows 300. New clients get "too many connections" errors even though one slot should be free. The dead backend is lingering as a zombie, its shared memory slot still allocated.
The postmaster must reap dead backends immediately, but standard signals do not queue. If 5 backends crash simultaneously, only one SIGCHLD may be delivered. A single waitpid() call per signal misses the other 4, leaving zombie backends that inflate the connection count and hold shared memory slots hostage.
The postmaster's SIGCHLD handler calls waitpid(-1, WNOHANG) in a loop to reap every exited backend, not just one. It recycles each shared memory slot and logs the cause. If a critical process like the checkpointer or autovacuum launcher crashes, the postmaster forks a replacement within seconds, ensuring background maintenance never stalls.
A single-threaded Redis server holds a 20 GB dataset. It needs to write it to disk for persistence, but even 100 ms of blocking would cascade latency spikes through every dependent service. A blocking snapshot is not an option.
Redis fork()s a child process that inherits the entire dataset through COW pages. The child writes the RDB dump sequentially while the parent continues serving reads and writes at full speed. The critical lifecycle detail: the parent cannot call a blocking waitpid() or it would stall the event loop. Instead, Redis checks waitpid(WNOHANG) on every event loop iteration, polling for child completion without blocking.
The fork itself takes 1-25 ms depending on dataset size (proportional to page table entries, not data). Monitor this as latest_fork_usec in INFO output. When the child finishes, the parent atomically renames the temp file to dump.rdb -- a complete snapshot with zero client-visible pause.
A Go program calls fork() to spawn a helper process. The child immediately deadlocks on its first malloc call. No error, no panic -- just a process that hangs forever. The same code works perfectly when run as a standalone binary.
Go is always multithreaded. Even a trivial program has garbage collector threads, network poller threads, and scheduler threads running. When fork() copies only the calling thread, every mutex held by those other threads remains permanently locked in the child. Any operation requiring an internal lock -- malloc, file open, logging -- deadlocks instantly.
Go's os/exec package solves this with a combined ForkExec: fork() and execve() happen in a single atomic sequence, so the child replaces itself with the new program before any Go runtime code runs. The only code between fork and exec is a minimal assembly stub that sets up file descriptors and calls execve. Never use raw fork() in Go -- always go through os/exec.
A Node.js HTTP server is pinned at 100% on one core while 7 other cores sit completely idle. Throughput plateaus at 15,000 req/s no matter how much hardware is thrown at the problem. The single-threaded event loop cannot use more than one core.
The cluster module calls fork() to create one worker process per core. All workers inherit the master's listening socket file descriptor, so the kernel distributes incoming TCP connections across workers automatically. The master monitors each worker through SIGCHLD and waitpid() -- when a worker crashes from OOM, uncaught exception, or segfault, the master detects it within one event loop tick and forks a replacement.
Total throughput scales from 15,000 req/s on one core to over 100,000 req/s across 8 cores, with automatic recovery from worker failures. The fork/wait lifecycle gives Node.js multi-core scaling without rewriting a single line of application code.
Clicking "New Tab" should feel instant. But a fresh renderer process needs to load V8, ICU data, Blink, and shared libraries -- about 150 MB of code. Loading this from disk for every tab would add 200-500 ms of startup latency, making Chrome feel sluggish.
Chrome's zygote process pre-loads all shared libraries and V8 at browser launch, then fork()s to create each new renderer. COW means the new process shares all 150 MB of pre-loaded code without copying a single page. The renderer only allocates private memory for the specific page's DOM, JavaScript heap, and rendering buffers.
The browser process monitors all renderers via waitpid(). If a tab crashes from a segfault or OOM, the browser detects it immediately through the child's exit status and shows the "Aw, Snap!" page without affecting other tabs. The zygote pattern gives Chrome instant tab creation with full process isolation.
A git fetch runs on a repository with 15 remotes. Each network round-trip takes 200-500 ms. Done sequentially, that is 3-8 seconds of pure waiting. With --jobs=N, it should be parallel -- but how does Git manage 15 concurrent network operations and detect individual failures?
Git uses fork()+exec() to launch each helper -- git-remote-https for transport, git-credential-store for credentials, external diff tools -- as a separate child process. Each helper runs in its own isolated address space, so a crash in a credential helper cannot corrupt Git's index or object database. The parent collects each exit code via waitpid() to determine per-remote success or failure.
A single failed remote (exit code non-zero) is reported without aborting the others. This same fork+exec+wait pattern powers git-difftool (forking an external diff viewer per file) and git-filter-branch (forking a shell per commit). Process isolation turns concurrent networking into a safe, fault-tolerant operation.
Same Concept Across Tech
| Technology | How it uses fork/exec/wait | Key detail |
|---|---|---|
| Node.js | cluster.fork() creates workers via fork+exec. Master must handle 'exit' events | Missing 'exit' handler = zombie workers |
| Docker | Container PID 1 must reap children. Use --init flag to add tini as PID 1 | Without --init, zombies accumulate inside containers |
| Nginx | Master forks worker processes, manages lifecycle with SIGCHLD | Master reaps workers, respawns on crash |
| Chrome | Zygote process forks renderer processes. Each tab = forked process | Copy-on-write makes fork fast despite large zygote memory |
| Go | Does not use fork for goroutines (they are green threads). Uses fork+exec for os/exec.Command | Goroutines are not processes |
| PostgreSQL | Postmaster forks one backend per connection, reaps on disconnect | autovacuum workers are also forked children |
Stack layer mapping (zombie process debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the parent calling waitpid() or handling SIGCHLD? | Code review, strace -e wait4 -p parent_pid |
| Runtime | Does the runtime reap children automatically? | Node cluster handles this, raw fork does not |
| Container | Is PID 1 configured to reap? Is --init or tini present? | docker inspect, check ENTRYPOINT |
| Kernel | How many PIDs are in use? Is pid_max close? | /proc/sys/kernel/pid_max, ps -e |
| Process table | Are zombie entries accumulating? | ps aux |
Design Rationale
Splitting creation into fork and exec instead of a single "spawn" call was intentional: the gap between the two is where all the setup happens -- redirecting file descriptors, changing directories, dropping privileges, setting environment variables. A monolithic spawn would need dozens of parameters and still miss edge cases. The downside -- fork in a multithreaded process only copies one thread, leaving every other thread's mutexes permanently locked -- is real, but the design predates threads by two decades. Zombies exist because exit status has to live somewhere between the child's death and the parent's wait(). Retaining a 2 KB task_struct briefly is a small price for reliable status delivery.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Zombie processes accumulating (Z state) | Parent not calling waitpid() | ps -eo pid,ppid,stat,comm |
| Cannot fork: resource temporarily unavailable | PID limit or memory limit reached | Check pid_max and memory |
| Container has zombies but app does not fork | PID 1 in container not reaping adopted orphans | Use --init flag or tini |
| Fork succeeds but child immediately exits | exec() failed (binary not found, permission denied) | strace -f -e fork,execve |
| Application slow after fork | Copy-on-write pages being faulted in by writes | perf stat -e page-faults |
| Orphan processes running after parent exits | Normal, init (PID 1) adopts them | Not a bug unless orphans consume resources |
When to Use / Avoid
Relevant when:
- Any server that spawns workers -- web servers, build systems, CI runners -- must get the wait() side right
- Writing container entrypoints or init systems that adopt and reap orphaned children
- Zombies are accumulating in the process table and PID slots are leaking
- Understanding why Docker's --init flag and tini exist
Watch out for:
- No SIGCHLD handler and no waitpid() loop in the parent -- guaranteed zombie accumulation
- Running a multi-process app as PID 1 in a container without --init or tini to handle reaping
- Assuming fork() copies memory (it sets up COW page tables; the real cost is near zero until pages are written)
Try It Yourself
# Trace all process-related syscalls for a command
strace -f -e trace=clone,execve,wait4 ls -la 2>&1 | head -20

# Show process tree with PIDs
ps axjf | head -30

# Create a zombie for observation: the background child exits, but its
# parent (exec'd into sleep) never calls wait()
bash -c 'sleep 1 & exec sleep 30' & sleep 2; ps -eo pid,ppid,stat,comm | grep Z

# Inspect a process's status including parent PID and state
cat /proc/$$/status | grep -E '^(Name|State|Pid|PPid|Threads)'

# Watch fork rates system-wide via /proc/stat
grep processes /proc/stat

# List all zombie processes and their parents
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/ {print}'

Debug Checklist
1. Find zombies: ps aux | grep Z
2. Count zombies: ps -eo stat | grep -c Z
3. Find zombie parent: ps -eo pid,ppid,stat,comm | grep Z
4. Check if PID 1 is reaping: strace -e wait4 -p 1
5. Check PID limit: cat /proc/sys/kernel/pid_max
6. Check current PID count: ls /proc | grep -c '^[0-9]'
Key Takeaways
- ✓ fork() does not copy memory. It sets up copy-on-write page table entries, making the cost proportional to page table size, not memory size. This is why forking a 2GB process takes microseconds, not seconds.
- ✓ execve() is a point of no return. It replaces the process image atomically -- if it succeeds, it never returns. All signal handlers with custom dispositions reset to SIG_DFL because the handler code no longer exists in memory.
- ✓ Zombies are not as scary as they sound. They consume a PID and a tiny task_struct (~2KB) but no memory pages, file descriptors, or CPU. The real danger is PID exhaustion in long-running daemons that never reap their children.
- ✓ When a parent dies without calling wait(), orphaned children are reparented to the nearest subreaper (set via prctl(PR_SET_CHILD_SUBREAPER)) or to PID 1 (init/systemd), which automatically reaps them. No process is truly abandoned.
- ✓ Here is a subtle trap: exit() vs _exit(). The C library exit() flushes stdio buffers and runs atexit handlers. The syscall _exit() goes straight to the kernel. After fork(), the child should call _exit() if it is not exec'ing, to avoid double-flushing the parent's buffers.
Common Pitfalls
- ✗ Forgetting to loop waitpid(). Mistake: calling it once and assuming you are done. Reality: waitpid() can be interrupted by signals (returns -1 with EINTR), and with WNOHANG you must loop until it returns 0 or -1. One call is not enough.
- ✗ Using exit() instead of _exit() in a forked child that does not exec. This flushes the parent's stdio buffers a second time, corrupting output. The fix is simple: always use _exit() in the child after fork unless you are about to exec.
- ✗ Setting SA_NOCLDWAIT or ignoring SIGCHLD without understanding the side effects. With SA_NOCLDWAIT, children are auto-reaped and wait() returns ECHILD. This can break code that relies on collecting exit status.
- ✗ Assuming fork() copies threads. It does not. In a multithreaded process, only the calling thread is duplicated. Mutexes held by other threads remain locked in the child -- deadlock city. This is why Go's os/exec does fork+exec atomically.
Reference
In One Line
Fork, exec, wait -- skip the last step and dead children pile up as zombies until PID slots run out.