Copy-on-Write & Process Creation Internals
Mental Model
Two people reading from the same printed document. As long as both just read, there is one copy. The moment someone picks up a pen to edit a page, that page -- and only that page -- gets photocopied. The editor marks up their copy; the other person's original stays untouched. Pages nobody edits are never duplicated. Sharing costs nothing until a write actually happens.
The Problem
Redis holds 30 GB of data and kicks off a background snapshot. fork() completes in tens of milliseconds. Memory barely moves. Everything seems fine -- until writes keep hitting the parent during the snapshot. RSS creeps from 30 GB toward 50 GB over the next minute as each modified page triggers a copy. With Transparent Huge Pages on, a single byte write copies 2 MB instead of 4 KB. The spike is real. It is just delayed and unpredictable.
Architecture
A process is using 10 gigabytes of memory. It calls fork(). How long does it take? How much extra memory does it need?
If the guess was "seconds" and "10 more gigabytes," that mental model is wrong. On Linux, that fork finishes in tens of milliseconds and barely touches free memory. Two processes, sharing everything, paying for nothing.
This is copy-on-write. It is the single most important optimization in Unix process creation, and once it clicks, containers, database snapshots, and the entire fork-exec pattern suddenly make sense.
What Actually Happens
Here is the sequence when a process calls fork():
- The kernel creates a new task_struct for the child -- PID, scheduling info, all of it.
- It calls dup_mm() to duplicate the parent's memory descriptor (mm_struct).
- It walks the parent's page tables and copies the page table entries (PTEs) -- the metadata that maps virtual addresses to physical pages.
- Every writable PTE in both parent and child is marked read-only.
- The reference count on each physical page is incremented.
That is it. No data pages are copied. The child gets a set of page table entries pointing to the same physical frames as the parent.
The cost? Proportional to the number of PTEs, not the amount of memory. A process with 10GB of 4KB pages has about 2.5 million PTEs. Copying those at roughly 60ns per entry takes around 150ms -- far from the seconds a full data copy would need. The 10GB of data? Untouched.
Now when either process writes to a shared page, the CPU triggers a page fault. The PTE says read-only, but the VMA says the process has write permission. The kernel recognizes this mismatch as a COW fault.
The fault handler (do_wp_page() in mm/memory.c) does the following:
- Allocates a fresh physical page.
- Copies the contents of the shared page.
- Updates the faulting process's PTE to point to the new page, now with write permission.
- Decrements the old page's reference count.
- If the reference count drops to 1, the remaining process gets its PTE upgraded to writable. No future COW fault needed.
The beauty is that pages never written are never copied. Code pages, read-only data, pages the child ignores before calling exec() -- all free.
Under the Hood
Page table walk and TLB. On x86-64, virtual-to-physical translation uses 4-level page tables (PML4, PDPT, PD, PT). Each level is a 4KB page containing 512 entries. After fork(), all four levels are duplicated. The TLB (Translation Lookaside Buffer) caches translations and must be flushed on the forking CPU. Other CPUs get TLB shootdown IPIs to invalidate their cached entries. On large NUMA systems, these IPIs alone can be the bottleneck.
The do_wp_page fast paths. The COW handler is not a single code path. It has several optimizations:
- mapcount == 1: only one process maps this page. Just flip the write bit. Zero cost.
- Zero page: allocate a new zeroed page. No copy needed.
- Swap cache: careful handling to avoid races with the swap subsystem.
- The actual copy-and-update is the slow path, used only when multiple processes genuinely share a page.
vfork() is a different beast. Created in the pre-COW era, vfork() does not copy page tables at all. The child shares the parent's mm_struct directly (CLONE_VM | CLONE_VFORK), and the parent is frozen until the child calls exec() or _exit(). It is faster, but terrifyingly dangerous -- any write by the child corrupts the parent. posix_spawn() is the modern safe replacement.
clone() as the universal primitive. This is where things get elegant. fork() is just clone(SIGCHLD, 0). A thread is clone(CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD, stack). A container is clone(CLONE_NEWNS|CLONE_NEWPID|CLONE_NEWNET|.., stack). Same syscall, different flags. The sharing spectrum goes from "full process" (nothing shared) to "thread" (everything shared), with containers somewhere in between.
Transparent Huge Pages and COW don't mix well. THP merges contiguous 4KB pages into 2MB huge pages for TLB efficiency. But COW copies at the huge page granularity -- a single byte write copies 2MB instead of 4KB. For fork-heavy workloads like Redis BGSAVE, THP causes massive memory spikes. Redis recommends echo madvise > /sys/kernel/mm/transparent_hugepage/enabled. PostgreSQL recommends never.
Common Questions
What is the actual cost of fork() for a process with a large address space?
The cost scales with page table size, not RSS. A process with 10GB RSS and 4KB pages has about 2.5 million PTEs. Copying them at roughly 60ns per entry takes around 150ms. Add TLB shootdown IPIs across all CPUs where the process has run, and fork() for large-memory processes (Redis at 30GB, PostgreSQL at 100GB shared_buffers) lands in the 10-200ms range. Huge pages reduce PTE count by 512x but increase per-fault COW cost by the same factor.
How does the kernel tell a COW fault from a real segfault?
It checks the faulting address against the VMA (vm_area_struct) list. If the VMA says VM_WRITE (the process is allowed to write) but the PTE says read-only, it is a COW fault. If the VMA does not have VM_WRITE, it is a genuine access violation -- SIGSEGV. If the address is not in any VMA, also SIGSEGV.
Can COW lead to OOM kills after fork()?
Yes, and this catches people off guard. The kernel allows fork() to succeed because no new memory is allocated at fork time (overcommit). But as processes write to shared pages, physical memory is consumed by COW copies. If the system runs out of memory and swap, the OOM killer strikes. A Redis instance with 25GB of data on a 32GB machine can fork successfully for BGSAVE, then trigger OOM as writes pile up. The fix: set maxmemory to 60-75% of available RAM, or enforce strict accounting with vm.overcommit_memory=2 so fork() fails up front instead of OOM-killing later (note that Redis itself recommends vm.overcommit_memory=1 precisely so BGSAVE forks can succeed).
What is KSM (Kernel Same-page Merging) and how does it relate to COW?
KSM is COW in reverse. Instead of splitting shared pages on write, KSM scans for pages with identical content across processes and merges them into a single COW-protected page. It is mainly used in virtualization -- multiple VMs running the same OS share many identical pages (kernel code, shared libraries). KSM can recover 20-40% of memory in homogeneous VM environments. Configure it via /sys/kernel/mm/ksm/.
How Technologies Use This
Fifty containers launch from the same 2 GB image on one host. Memory usage should be 100 GB, but the system barely flinches. Startup takes under a second per container, not the multi-second delay a full memory copy would require.
The trick is that no memory is actually copied at launch. Every container shares the image's pages read-only, and only the pages a container writes to get duplicated. Docker's OverlayFS extends this to the filesystem level -- editing a config file copies just that one file from the read-only layer, not the entire image.
On a 50-container host, shared base layers exist in memory exactly once, cutting per-container overhead by 80-90%. This is why container density works at all -- without COW, the host would need 50x the RAM or accept multi-second startup times.
A Redis BGSAVE fails with "Can't save in background: fork" on a 30 GB dataset. The server has 64 GB of RAM. Why is fork failing when there seems to be plenty of memory left?
The hidden cost: fork() copies page tables, not data. But as the parent keeps writing during the snapshot, each modified page triggers a COW fault that allocates a new physical page. In a typical workload where 10-20% of keys change during BGSAVE, Redis needs only 3-6 GB of extra memory. But with Transparent Huge Pages enabled, a single byte write copies 2 MB instead of 4 KB, and write-heavy workloads can temporarily double RSS from 30 GB to 60 GB.
Fix: set maxmemory to 60-75% of available RAM to leave headroom for COW copies. Disable THP so each fault copies 4 KB, not 2 MB. Monitor latest_fork_usec in INFO persistence to catch degradation early.
A PostgreSQL instance has 200 clients connected. Each connection is a separate forked backend process. Quick math says 200 copies of the 8 GB shared buffer pool should require 1.6 TB of RAM. The server has 32 GB. Yet everything runs fine.
Every forked backend shares the same physical pages for code segments, shared library code, and shared_buffers mappings through COW. Only per-query work memory, stack frames, and local buffers get private allocations. The result is roughly 5-10 MB of private memory per connection instead of hundreds of megabytes.
This is why PostgreSQL's one-process-per-connection model scales at all. With 8 GB of shared_buffers, 200 connections use about 10 GB total. Watch per-backend memory with /proc/[pid]/smaps -- Shared_Clean pages are COW shared, Private_Dirty pages are the real per-connection cost.
Same Concept Across Tech
| Technology | How it uses CoW | Key gotcha |
|---|---|---|
| Redis | fork() for BGSAVE snapshots. Parent keeps serving, child writes snapshot from shared pages | Write-heavy load during BGSAVE causes RSS to spike. THP makes it worse (2 MB per fault) |
| PostgreSQL | fork() per backend connection. 200 backends share shared_buffers via CoW | Only per-query work memory is private (~5-10 MB). Shared_buffers pages stay shared unless modified |
| Docker | OverlayFS layers are CoW at the filesystem level. Writes go to the upper (container) layer | Base image stays read-only. Container writes create copies in the diff directory |
| Go | fork()+exec() for os/exec.Command. Brief CoW window between fork and exec | Fast exec minimizes CoW exposure. Large Go processes still copy page tables |
| JVM | Rarely forks directly, but GC can trigger CoW storms if a forked child exists | Avoid fork in JVM-heavy containers (e.g., metrics collectors that fork) |
Stack layer mapping (unexpected memory growth after fork):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the parent writing heavily during the child's lifetime? | Application write rate metrics |
| Runtime | Is THP enabled, amplifying each CoW fault from 4 KB to 2 MB? | cat /sys/kernel/mm/transparent_hugepage/enabled |
| Process | What fraction of pages are now private (copied) vs shared? | /proc/PID/smaps_rollup (Shared vs Private) |
| Kernel | How many minor faults are occurring? | perf stat -e minor-faults |
| Hardware | Is physical RAM sufficient for worst-case CoW doubling? | free -h during fork+write storm |
Design Rationale
The dominant use of fork() in Unix is immediately followed by exec(), which throws the entire address space away. Eagerly copying gigabytes of memory only to discard them milliseconds later was indefensible. vfork() -- sharing the address space directly -- was faster but so dangerous that a single child write could corrupt the parent. COW landed in the sweet spot: fork() cost scales with page tables rather than data, the fork-then-exec path pays nearly nothing, and the rarer fork-then-write path pays only for pages actually modified rather than the entire address space.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| RSS doubles after fork despite CoW | Parent writing to most pages during child lifetime | Compare Shared vs Private in smaps_rollup |
| Redis BGSAVE causes latency spikes | CoW page faults on write-heavy keys during snapshot | redis-cli info persistence, check latest_fork_usec |
| Fork takes 10ms+ for large process | Page table copying (proportional to number of PTEs, not data size) | perf trace -e clone for fork latency |
| RSS grows 2 MB at a time, not 4 KB | THP enabled, each CoW fault copies a 2 MB huge page | Set THP to madvise instead of always |
| Container memory grows beyond image size | Writes creating CoW copies in OverlayFS upper layer | Check docker diff or overlay diff directory |
| Minor page faults spike after fork | Expected. Each first write to a shared page triggers a CoW fault | perf stat -e minor-faults, this is normal behavior |
When to Use / Avoid
Relevant when:
- Understanding why fork() completes in milliseconds regardless of process memory size
- Diagnosing memory spikes during Redis BGSAVE or PostgreSQL checkpoints
- Working with container filesystems (OverlayFS layers use the same CoW principle)
- Estimating actual memory usage of forked processes (RSS vs VSZ)
Watch out for:
- Write-heavy workloads after fork cause RSS to grow as each written page gets duplicated
- Transparent Huge Pages amplify CoW cost (2 MB per write fault instead of 4 KB)
- vfork() is NOT copy-on-write. The child shares the address space directly and the parent is frozen
Try It Yourself
```shell
# Watch COW in action: fork a large process and observe memory
python3 -c "import os; d=[0]*10000000; pid=os.fork(); os.waitpid(pid,0) if pid else exit()" &

# Check shared vs private pages for a process
grep -E '^(Shared|Private)' /proc/$$/smaps | awk '{a[$1]+=$2}END{for(k in a)print k,a[k],"kB"}'

# Count page faults during a fork-heavy workload
# (/bin/true forces a real fork+exec per iteration; the `true` builtin would not fork)
perf stat -e page-faults,minor-faults,major-faults -- bash -c 'for i in $(seq 100); do /bin/true; done'

# Show clone flags used by a program
strace -f -e trace=clone,clone3 bash -c 'echo hello' 2>&1 | head -5

# Monitor fork rate system-wide
watch -n1 'grep processes /proc/stat'

# Check THP (Transparent Huge Pages) status -- affects COW granularity
cat /sys/kernel/mm/transparent_hugepage/enabled && grep -i huge /proc/meminfo
```
Debug Checklist
1. Check shared vs private pages: cat /proc/<pid>/smaps_rollup | grep -E 'Shared|Private'
2. Monitor RSS growth after fork: watch -n1 'cat /proc/<pid>/status | grep VmRSS'
3. Count minor page faults (CoW triggers): perf stat -e minor-faults -p <pid>
4. Check THP impact: grep AnonHugePages /proc/<pid>/smaps_rollup
5. Check THP mode: cat /sys/kernel/mm/transparent_hugepage/enabled
6. Redis fork overhead: redis-cli info persistence | grep latest_fork_usec
Key Takeaways
- ✓fork() cost is proportional to page table entries, not memory size. A 100GB process has roughly 25 million PTEs -- about 200MB of page tables to copy, which at tens of nanoseconds per entry takes hundreds of milliseconds to around a second. The actual data pages? Untouched.
- ✓The COW fault handler (do_wp_page) is smarter than you think. If only one process maps the page (mapcount=1), it just flips the write bit -- no copy at all. Zero page? Allocate a fresh zeroed page. The expensive copy-and-update path is the last resort.
- ✓vfork() is NOT copy-on-write. The child shares the parent's mm_struct directly, and the parent is frozen until the child calls exec or _exit. It is faster because there is no page table copy, but the child must not modify any data. posix_spawn() is the modern safe alternative.
- ✓clone3() is the modern, extensible version of clone(). It uses a struct clone_args with explicit size field, supporting all clone flags plus CLONE_CLEAR_SIGHAND, CLONE_INTO_CGROUP, and CLONE_NEWTIME. New flags will only be added to clone3(), not clone().
- ✓Here is the elegant part: Linux uses the same clone() syscall for processes (no sharing flags), threads (CLONE_VM|CLONE_FILES|..), and containers (CLONE_NEWNS|CLONE_NEWPID|..). The only difference is which flags you pass. There is no separate 'create thread' or 'create container' syscall.
Common Pitfalls
- ✗Thinking fork() is slow for large processes. Reality: with COW, even a 50GB process forks in a fraction of a second, because pages are not copied until written. But the page table copying still takes time, and the TLB flush across all CPUs can cause latency spikes.
- ✗Forking Redis with 30GB of data and then writing extensively. Each written page triggers a COW fault, allocating physical memory. If you write to most pages during a bulk update, you temporarily need double the memory. Transparent Huge Pages make this worse -- a single byte write copies an entire 2MB page instead of 4KB.
- ✗Using vfork() and modifying variables. Since the child shares the parent's address space directly, any modification corrupts the parent. The only safe operations after vfork() are exec() and _exit(). Even calling exit() is unsafe because atexit handlers may modify global state.
- ✗Forgetting that shared pages after fork still share the same struct file for open fds. Parent and child share file offsets. If both write to the same fd without coordination, output is interleaved. Each should close or dup fds they do not need.
Reference
In One Line
fork() is fast because nothing is copied -- but write-heavy workloads afterward pay the real bill, and Transparent Huge Pages make the invoice 500x larger per fault.