OverlayFS & Union File Systems
Mental Model
A laminated reference poster is stuck to a whiteboard. A clear dry-erase sheet sits over it. People scrawl notes and corrections on the clear sheet, and from across the room the poster and notes merge into one view. Erase a spot and the poster shows through. Stick an opaque patch over a section and that part of the poster disappears without being damaged. After the meeting, wipe the sheet -- the poster is pristine for the next team. Thirty rooms can share the same poster, each with its own disposable clear sheet.
The Problem
Without layer sharing, 100 containers from the same 800 MB image eat 80 GB of disk and minutes of copy time just to start. Change a single byte in a 2 GB lower-layer file and the entire 2 GB gets copied into the writable layer. High container churn racks up whiteout files, each consuming an inode, until inodes are exhausted while half the disk bytes sit free -- "no space left on device" with plenty of space. And hard links across lower layers silently break on copy-up, producing divergent files that no longer share data.
Architecture
Every time docker run executes, the container sees a full filesystem -- Ubuntu packages, application code, configuration files, all of it. But nothing was copied. Not a single byte.
So where are the files coming from? And what happens when a container changes one?
This is the story of OverlayFS -- the filesystem illusion that makes containers practical.
What Actually Happens
OverlayFS (merged into Linux 3.18) is a union filesystem that stacks multiple directory trees into a single view. The mount takes four parameters:
- lowerdir: One or more read-only layers (the image). Stacked bottom-to-top.
- upperdir: A single read-write layer (the container's private space).
- workdir: Scratch space for atomic operations. Must be on the same filesystem as upperdir.
- merged: The mount point where the unified view appears.
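The four parameters map onto a mount command directly. A minimal sketch using throwaway directories under mktemp (all paths are examples; the mount itself needs root and kernel overlay support, and the script skips it otherwise):

```shell
# Build a throwaway overlay: two lower layers, one upper (example paths)
cd "$(mktemp -d)"
mkdir lower1 lower2 upper work merged
echo bottom > lower1/shared.txt
echo top    > lower2/shared.txt   # same name, higher lower layer

# lowerdir entries are colon-separated, topmost layer listed first
if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
     -o lowerdir=lower2:lower1,upperdir=upper,workdir=work merged 2>/dev/null
then
    cat merged/shared.txt   # prints "top": first lowerdir listed wins
    umount merged
else
    echo "skipped: needs root and overlay support"
fi
```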
Reading a file is straightforward. The kernel checks upperdir first. If the file is there, serve it. If not, check each lowerdir from top to bottom. First match wins.
Writing a file is where things get interesting. If the file already lives in upperdir, the write goes directly there. But if the file only exists in a lowerdir, the kernel must perform a copy-up: it copies the entire file to upperdir first, then applies the write. From that point forward, all access goes through the upper copy. The lower copy remains untouched.
Deleting a file does not delete anything at all. The kernel creates a whiteout -- a character device with major/minor 0/0 -- in upperdir. When the path lookup encounters a whiteout, it reports the file as nonexistent, even though the original still sits in a lower layer, completely intact.
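The write and delete rules above can be watched in action. A sketch under the same throwaway-directory assumptions (the mount step needs root; unprivileged runs skip it):

```shell
# Observe copy-up on write and whiteout on delete (example paths)
cd "$(mktemp -d)"
mkdir lower upper work merged
echo v1 > lower/config
echo keep > lower/other

if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
     -o lowerdir=lower,upperdir=upper,workdir=work merged 2>/dev/null
then
    echo v2 > merged/config   # write => copy-up: config now exists in upper/
    rm merged/other           # delete => whiteout char device in upper/
    ls -l upper/              # config (regular file) and other (c 0,0)
    cat lower/config          # still v1: the lower layer is untouched
    umount merged
else
    echo "skipped: needs root and overlay support"
fi
```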
Under the Hood
Copy-up is file-granular, not block-granular. This is the key design tradeoff. Writing one byte to a 5 GB lower-layer file triggers the kernel to copy all 5 GB to upperdir before applying the change. The workdir is used as staging to make the copy atomic -- if the system crashes mid-copy, the result is the old version, not a corrupt half-copy.
This is simpler than block-level copy-on-write (used by btrfs or device-mapper), but it means the first write to a large lower-layer file is painful. Subsequent writes are cheap because they hit the upper copy directly.
Opaque directories handle a subtle edge case. If rm -rf /app/logs followed by mkdir /app/logs runs inside a container, the new directory needs to hide all lower-layer contents of the old /app/logs. The kernel sets trusted.overlay.opaque=y as an extended attribute on the new directory in upperdir. This tells lookup to stop searching lower layers for that path.
Docker's overlay2 driver maps this directly to container images. Each image layer becomes a lowerdir. Each container gets a fresh, empty upperdir. The driver's l/ directory of shortened symlink names works around the 4096-byte limit on the mount option string. When docker build runs a RUN instruction, it starts a temporary container, executes the command (writes go to upperdir), then snapshots that upperdir as a new read-only layer. That snapshot becomes a lowerdir for the next instruction.
Metacopy (Linux 4.19+) is a clever optimization. If only metadata changes -- chmod, chown, touch -- there is no reason to copy gigabytes of file data. The kernel creates a tiny metacopy node in upperdir that stores the new metadata and redirects data reads back to the lower layer. This turns a multi-gigabyte copy-up into a near-instant operation for permission changes.
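Whether a given kernel enables metacopy by default can be checked through the overlay module parameter (assuming a Linux host with sysfs; it can also be requested per-mount with metacopy=on in the -o options):

```shell
# Check whether this kernel enables metacopy by default; the module
# parameter exists on overlay-capable kernels once the module is loaded
cat /sys/module/overlay/parameters/metacopy 2>/dev/null \
    || echo "overlay module not loaded"
```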
Common Questions
Why did Docker choose overlay2 over device-mapper or AUFS?
Three reasons. AUFS was never accepted into the mainline kernel, making it a maintenance headache. Device-mapper requires pre-allocated block devices, adds block-level overhead, and had historical stability issues with metadata snapshots. overlay2 works on top of any ext4/xfs filesystem, requires no pre-allocation, shares layers efficiently, and has had continuous kernel development since 3.18. It is the sweet spot of simplicity, performance, and mainline support.
What happens when a container modifies one byte of a 5 GB base image file?
The full 5 GB gets copied to upperdir before that one byte is written. The container's writable layer just grew by 5 GB. Best practice: never modify large base-image files at runtime. If a file must be writable, generate it at container startup (so it starts in upperdir) or use a volume mount that bypasses overlay entirely.
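One way to follow that best practice is an entrypoint that generates large mutable files at container startup, so they are born writable (in upperdir or a volume) and never trigger copy-up. A sketch -- DATA_DIR, the file name, and the 4 MB size are all illustrative:

```shell
# Entrypoint sketch: create the large file at startup instead of
# shipping it in a lower layer and modifying it at runtime
DATA_DIR=${DATA_DIR:-$(mktemp -d)}
if [ ! -f "$DATA_DIR/cache.bin" ]; then
    dd if=/dev/zero of="$DATA_DIR/cache.bin" bs=1M count=4 2>/dev/null
fi
ls -lh "$DATA_DIR/cache.bin"
```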
How does overlay handle hard links?
Carefully, and imperfectly. Hard links within a single lower layer are preserved in the merged view (same inode). But copy-up breaks the relationship -- each copied-up name gets a new inode in upperdir. The index=on mount option (Linux 4.13+) adds tracking to maintain hard-link consistency after copy-up, but it is not enabled by default in all configurations.
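The link-breaking behavior is easy to reproduce. A sketch with throwaway example paths (the mount step needs root and is skipped otherwise):

```shell
# Watch copy-up split a lower-layer hard link into two inodes
cd "$(mktemp -d)"
mkdir lower upper work merged
echo data > lower/orig
ln lower/orig lower/alias          # orig and alias share one inode

if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
     -o lowerdir=lower,upperdir=upper,workdir=work merged 2>/dev/null
then
    echo change >> merged/orig     # copy-up of orig only, not alias
    stat -c '%n inode %i' merged/orig merged/alias  # inodes now differ
    umount merged
else
    echo "skipped: needs root and overlay support"
fi
```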
Why do containers sometimes fail with "no space left on device" when disk is not full?
Inodes. Every whiteout file and every copied-up file consumes an inode on the upper filesystem. High container churn with lots of file deletions can exhaust inodes long before bytes run out. Check with df -i on the filesystem backing /var/lib/docker.
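The check looks like this sketch; it falls back to / on machines where Docker's directory does not exist:

```shell
# Compare byte usage vs inode usage on the filesystem backing Docker
target=/var/lib/docker
[ -d "$target" ] || target=/
df -h "$target" | tail -1    # bytes used
df -i "$target" | tail -1    # inodes used; IUse% near 100% means exhaustion
```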
How Technologies Use This
Launching 100 containers from the same 800 MB Node.js image causes startup stalls. Disk usage spikes to 80 GB as each container copies the full image. The host runs out of storage before reaching the target container count.
The problem is that without layer sharing, every container gets its own complete copy of every image file. Docker solves this with OverlayFS, stacking read-only image layers as lowerdirs and giving each container a thin upperdir for writes. Only files a container actually modifies trigger a copy-up, typically under 5 MB per container.
Enable the overlay2 storage driver and structure Dockerfiles so large files are created in early layers, not modified in later ones. The result is sub-second startup, roughly 95% less disk usage, and hundreds of containers on a single host without storage becoming the bottleneck.
Scheduling 50 pods from the same 800 MB image fills the node's 100 GB ephemeral storage. New pods fail with disk pressure evictions even though each pod only writes a few megabytes of its own data.
Without OverlayFS, each pod needs its own full copy of the base filesystem, consuming over 40 GB for just 50 pods. Kubernetes relies on the container runtime's overlay2 driver to share a single set of read-only lowerdirs across all 50 pods while each pod writes only its unique changes to a private upperdir.
Ensure the container runtime uses the overlay2 driver and monitor per-pod storage overhead with kubectl describe node. This keeps per-pod overhead under 10 MB in most workloads, enabling dense scheduling of 110+ pods per node without exhausting ephemeral storage.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Layer sharing | overlay2 shares lowerdirs across containers | N/A | N/A | N/A | Container runtime shares image layers across pods |
| Copy-on-write | Full-file copy-up to upperdir | N/A | N/A | N/A | Same as container runtime |
| Image layers | Each filesystem-changing Dockerfile instruction = one layer | N/A | N/A | N/A | Same image spec (OCI) |
| Writable layer | Container upperdir | N/A | N/A | N/A | Pod ephemeral storage in upperdir |
| Volume bypass | -v mount bypasses overlay | N/A | N/A | N/A | PVC/emptyDir bypasses overlay |
Stack Layer Mapping
| Layer | OverlayFS Mechanism |
|---|---|
| Block device | ext4/xfs backing filesystem for upper and lower layers |
| VFS | OverlayFS registers as a filesystem type, intercepts path lookups |
| Lowerdir stack | Read-only image layers, content-addressed and shared |
| Upperdir | Per-container writable layer with copy-ups and whiteouts |
| Workdir | Atomic staging area for copy-up (crash consistency) |
| Merged mount | Unified view presented to container processes |
Design Rationale
File-granular COW won out over block-level COW (device-mapper style) because it sits on top of any ext4/xfs filesystem with no pre-allocated block devices required. Whiteouts are a consequence of shared, read-only lower layers -- since nothing can be deleted from them directly, the upper layer records the deletion instead. Metacopy came later to address an obvious pain point: chmod on a multi-gigabyte file should not trigger a multi-gigabyte copy when only a handful of metadata bytes changed.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| "No space left on device" with disk not full | Inode exhaustion from whiteouts and copy-ups | df -i /var/lib/docker to check inode usage |
| Container startup slow (seconds instead of ms) | Wrong storage driver (vfs instead of overlay2) | docker info \| grep "Storage Driver" |
| Unexpected large container writable layer | Copy-up of large lower-layer files | docker system df -v and check container size |
| Hard links broken after container writes | Copy-up creates separate inodes per name | Check index=on mount option for hard-link tracking |
| Permission changes on large files are slow | Metacopy not enabled | cat /sys/module/overlay/parameters/metacopy |
| Container fs shows deleted file still exists | Whiteout not created properly | find <upperdir> -type c -perm 0000 for whiteout entries |
When to Use / Avoid
Use when:
- Running containers that share base image layers (Docker overlay2 default)
- Building Live CD/USB systems with tmpfs upper layer over read-only media
- Creating ephemeral build environments that write to a disposable layer
- Implementing rollback by discarding the upper layer and starting fresh
Avoid when:
- The workload writes heavily to large lower-layer files (copy-up cost is prohibitive)
- Block-level COW is needed (use btrfs or device-mapper thin provisioning)
- NFS or network filesystem is required for the upper layer (POSIX rename atomicity not guaranteed)
- Database files need direct I/O (use volume mounts that bypass overlay)
Try It Yourself
```shell
# Mount a basic overlay filesystem with two lower layers
sudo mount -t overlay overlay -o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work /merged

# List all overlay mounts showing full options
findmnt -t overlay -o TARGET,SOURCE,OPTIONS

# Inspect a Docker container's overlay2 merged directory
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container_id>

# List overlay mounts visible in the current mount namespace
grep overlay /proc/self/mountinfo

# Find whiteout files (deletions) in a container's upper layer
find /var/lib/docker/overlay2/<layer_id>/diff -type c -perm 0000

# Check if a directory is marked opaque (hides lower layer contents)
getfattr -n trusted.overlay.opaque /var/lib/docker/overlay2/<layer_id>/diff/some_dir

# Show Docker layer disk usage with per-layer breakdown
docker system df -v

# Inspect overlay mount options for metacopy and index features
mount -t overlay | grep -oE '(metacopy|index)=\w+'
```
Debug Checklist
1. findmnt -t overlay -- list all overlay mounts with lowerdir/upperdir/workdir
2. docker system df -v -- per-image and per-container disk usage
3. df -i /var/lib/docker -- check inode usage on overlay backing filesystem
4. find <upperdir> -type c -perm 0000 -- find whiteout files (deletions)
5. getfattr -n trusted.overlay.opaque <dir> -- check for opaque directories
6. docker inspect <container> | grep -A5 GraphDriver -- show overlay mount details
Key Takeaways
- ✓ First write to a lower-layer file is expensive -- the kernel copies the ENTIRE file to upperdir before applying your one-byte change. A 2 GB base-image file means a 2 GB copy-up, even if you only appended a newline. Subsequent writes hit the upper copy directly.
- ✓ Deleting a file does not actually delete anything. The kernel drops a "whiteout" (character device 0/0) in upperdir that hides the lower-layer file. Opaque directories (xattr trusted.overlay.opaque=y) hide everything below when you rm -rf and recreate a directory.
- ✓ 100 containers from one image cost almost zero extra disk. All share read-only lowerdirs; only unique writes accumulate in each container's upperdir. This is why Docker images are small but containers feel full-size.
- ✓ Metacopy (Linux 4.19+) is the performance shortcut for chmod/chown -- it creates a tiny metadata node in upperdir instead of copying gigabytes of file data. For permission changes on large files, this is the difference between milliseconds and minutes.
- ✓ upperdir and workdir must live on the same filesystem (typically ext4, or xfs formatted with ftype=1). tmpfs gives fast writes but no persistence. NFS is not supported as an upper layer because overlay needs rename and whiteout semantics that NFS cannot guarantee.
Common Pitfalls
- ✗ Mistake: Containers mysteriously run out of inodes with plenty of disk space. Reality: Every whiteout file and copied-up file consumes an inode on the upper filesystem. High container churn with lots of deletions exhausts inodes before bytes.
- ✗ Mistake: Assuming copy-up is instant. Reality: Writing a single byte to a 2 GB lower-layer file triggers a full 2 GB copy to upperdir. Structure Dockerfiles so large files are created once in early layers and never modified in later ones.
- ✗ Mistake: Manually mounting overlayfs to debug Docker and getting confused by the options. Reality: Docker's overlay2 driver manages lowerdir stacking, link indirection, and layer metadata automatically. Debugging by hand means reconstructing the full lowerdir chain from /var/lib/docker/overlay2/*/diff.
- ✗ Mistake: Expecting hard links to survive across layers. Reality: A file hard-linked under two names in a lower layer splits into separate inodes as soon as either name is copied up -- the hard-link relationship silently breaks unless index=on tracking is enabled.
Reference
In One Line
Shared read-only layers plus per-container writable layers -- keep large files in early Dockerfile stages so copy-up never bites.