OverlayFS & Union File Systems
Mental Model
A laminated reference poster is stuck to a whiteboard. A clear dry-erase sheet sits over it. People scrawl notes and corrections on the clear sheet, and from across the room the poster and notes merge into one view. Erase a spot and the poster shows through. Stick an opaque patch over a section and that part of the poster disappears without being damaged. After the meeting, wipe the sheet -- the poster is pristine for the next team. Thirty rooms can share the same poster, each with its own disposable clear sheet.
The Problem
Without layer sharing, 100 containers from the same 800 MB image eat 80 GB of disk and minutes of copy time just to start. Change a single byte in a 2 GB lower-layer file and the entire 2 GB gets copied into the writable layer. High container churn racks up whiteout files, each consuming an inode, until inodes are exhausted while half the disk bytes sit free -- "no space left on device" with plenty of space. And hard links across lower layers silently break on copy-up, producing divergent files that no longer share data.
Architecture
Every time docker run executes, the container sees a full filesystem -- Ubuntu packages, application code, configuration files, all of it. But nothing was copied. Not a single byte.
So where are the files coming from? And what happens when a container changes one?
This is the story of OverlayFS -- the filesystem illusion that makes containers practical.
What Actually Happens
OverlayFS (merged into Linux 3.18) is a union filesystem that stacks multiple directory trees into a single view. The mount takes four parameters:
- lowerdir: One or more read-only layers (the image). Stacked bottom-to-top.
- upperdir: A single read-write layer (the container's private space).
- workdir: Scratch space for atomic operations. Must be on the same filesystem as upperdir.
- merged: The mount point where the unified view appears.
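The four parameters map onto a mount command directly. A minimal sketch using throwaway directories under mktemp (all paths are examples; the mount itself needs root and kernel overlay support, and the script skips it otherwise):

```shell
# Build a throwaway overlay: two lower layers, one upper (example paths)
cd "$(mktemp -d)"
mkdir lower1 lower2 upper work merged
echo bottom > lower1/shared.txt
echo top    > lower2/shared.txt   # same name, higher lower layer

# lowerdir entries are colon-separated, topmost layer listed first
if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
     -o lowerdir=lower2:lower1,upperdir=upper,workdir=work merged 2>/dev/null
then
    cat merged/shared.txt   # prints "top": first lowerdir listed wins
    umount merged
else
    echo "skipped: needs root and overlay support"
fi
```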
Reading a file is straightforward. The kernel checks upperdir first. If the file is there, serve it. If not, check each lowerdir from top to bottom. First match wins.
Writing a file is where things get interesting. If the file already lives in upperdir, the write goes directly there. But if the file only exists in a lowerdir, the kernel must perform a copy-up: it copies the entire file to upperdir first, then applies the write. From that point forward, all access goes through the upper copy. The lower copy remains untouched.
Deleting a file does not delete anything at all. The kernel creates a whiteout -- a character device with major/minor 0/0 -- in upperdir. When the path lookup encounters a whiteout, it reports the file as nonexistent, even though the original still sits in a lower layer, completely intact.
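The write and delete rules above can be watched in action. A sketch under the same throwaway-directory assumptions (the mount step needs root; unprivileged runs skip it):

```shell
# Observe copy-up on write and whiteout on delete (example paths)
cd "$(mktemp -d)"
mkdir lower upper work merged
echo v1 > lower/config
echo keep > lower/other

if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
     -o lowerdir=lower,upperdir=upper,workdir=work merged 2>/dev/null
then
    echo v2 > merged/config   # write => copy-up: config now exists in upper/
    rm merged/other           # delete => whiteout char device in upper/
    ls -l upper/              # config (regular file) and other (c 0,0)
    cat lower/config          # still v1: the lower layer is untouched
    umount merged
else
    echo "skipped: needs root and overlay support"
fi
```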
Under the Hood
Copy-up is file-granular, not block-granular. This is the key design tradeoff. Writing one byte to a 5 GB lower-layer file triggers the kernel to copy all 5 GB to upperdir before applying the change. The workdir is used as staging to make the copy atomic -- if the system crashes mid-copy, the result is the old version, not a corrupt half-copy.
This is simpler than block-level copy-on-write (used by btrfs or device-mapper), but it means the first write to a large lower-layer file is painful. Subsequent writes are cheap because they hit the upper copy directly.
Opaque directories handle a subtle edge case. If rm -rf /app/logs followed by mkdir /app/logs runs inside a container, the new directory needs to hide all lower-layer contents of the old /app/logs. The kernel sets trusted.overlay.opaque=y as an extended attribute on the new directory in upperdir. This tells lookup to stop searching lower layers for that path.
Docker's overlay2 driver maps this directly to container images. Each image layer becomes a lowerdir. Each container gets a fresh, empty upperdir. The driver's l/ directory of shortened symlink names works around the 4096-byte limit on the mount option string. When docker build runs a RUN instruction, it starts a temporary container, executes the command (writes go to upperdir), then snapshots that upperdir as a new read-only layer. That snapshot becomes a lowerdir for the next instruction.
Metacopy (Linux 4.19+) is a clever optimization. If only metadata changes -- chmod, chown, touch -- there is no reason to copy gigabytes of file data. The kernel creates a tiny metacopy node in upperdir that stores the new metadata and redirects data reads back to the lower layer. This turns a multi-gigabyte copy-up into a near-instant operation for permission changes.
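Whether a given kernel enables metacopy by default can be checked through the overlay module parameter (assuming a Linux host with sysfs; it can also be requested per-mount with metacopy=on in the -o options):

```shell
# Check whether this kernel enables metacopy by default; the module
# parameter exists on overlay-capable kernels once the module is loaded
cat /sys/module/overlay/parameters/metacopy 2>/dev/null \
    || echo "overlay module not loaded"
```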
Common Questions
Why did Docker choose overlay2 over device-mapper or AUFS?
Three reasons. AUFS was never accepted into the mainline kernel, making it a maintenance headache. Device-mapper requires pre-allocated block devices, adds block-level overhead, and had historical stability issues with metadata snapshots. overlay2 works on top of any ext4/xfs filesystem, requires no pre-allocation, shares layers efficiently, and has had continuous kernel development since 3.18. It is the sweet spot of simplicity, performance, and mainline support.
What happens when a container modifies one byte of a 5 GB base image file?
The full 5 GB gets copied to upperdir before that one byte is written. The container's writable layer just grew by 5 GB. Best practice: never modify large base-image files at runtime. If a file must be writable, generate it at container startup (so it starts in upperdir) or use a volume mount that bypasses overlay entirely.
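One way to follow that best practice is an entrypoint that generates large mutable files at container startup, so they are born writable (in upperdir or a volume) and never trigger copy-up. A sketch -- DATA_DIR, the file name, and the 4 MB size are all illustrative:

```shell
# Entrypoint sketch: create the large file at startup instead of
# shipping it in a lower layer and modifying it at runtime
DATA_DIR=${DATA_DIR:-$(mktemp -d)}
if [ ! -f "$DATA_DIR/cache.bin" ]; then
    dd if=/dev/zero of="$DATA_DIR/cache.bin" bs=1M count=4 2>/dev/null
fi
ls -lh "$DATA_DIR/cache.bin"
```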
How does overlay handle hard links?
Carefully, and imperfectly. Hard links within a single lower layer are preserved in the merged view (same inode). But copy-up breaks the relationship -- each copied-up name gets a new inode in upperdir. The index=on mount option (Linux 4.13+) adds tracking to maintain hard-link consistency after copy-up, but it is not enabled by default in all configurations.
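The link-breaking behavior is easy to reproduce. A sketch with throwaway example paths (the mount step needs root and is skipped otherwise):

```shell
# Watch copy-up split a lower-layer hard link into two inodes
cd "$(mktemp -d)"
mkdir lower upper work merged
echo data > lower/orig
ln lower/orig lower/alias          # orig and alias share one inode

if [ "$(id -u)" -eq 0 ] && mount -t overlay overlay \
     -o lowerdir=lower,upperdir=upper,workdir=work merged 2>/dev/null
then
    echo change >> merged/orig     # copy-up of orig only, not alias
    stat -c '%n inode %i' merged/orig merged/alias  # inodes now differ
    umount merged
else
    echo "skipped: needs root and overlay support"
fi
```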
Why do containers sometimes fail with "no space left on device" when disk is not full?
Inodes. Every whiteout file and every copied-up file consumes an inode on the upper filesystem. High container churn with lots of file deletions can exhaust inodes long before bytes run out. Check with df -i on the filesystem backing /var/lib/docker.
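The check looks like this sketch; it falls back to / on machines where Docker's directory does not exist:

```shell
# Compare byte usage vs inode usage on the filesystem backing Docker
target=/var/lib/docker
[ -d "$target" ] || target=/
df -h "$target" | tail -1    # bytes used
df -i "$target" | tail -1    # inodes used; IUse% near 100% means exhaustion
```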
How Technologies Use This
Launching 100 containers from the same 800 MB Node.js image causes startup stalls. Disk usage spikes to 80 GB as each container copies the full image. The host runs out of storage before reaching the target container count.
The problem is that without layer sharing, every container gets its own complete copy of every image file. Docker solves this with OverlayFS, stacking read-only image layers as lowerdirs and giving each container a thin upperdir for writes. Only files a container actually modifies trigger a copy-up, typically under 5 MB per container.
Enable the overlay2 storage driver and structure Dockerfiles so large files are created in early layers, not modified in later ones. The result is sub-second startup, roughly 95% less disk usage, and hundreds of containers on a single host without storage becoming the bottleneck.
Scheduling 50 pods from the same 800 MB image fills the node's 100 GB ephemeral storage. New pods fail with disk pressure evictions even though each pod only writes a few megabytes of its own data.
Without OverlayFS, each pod needs its own full copy of the base filesystem, consuming over 40 GB for just 50 pods. Kubernetes relies on the container runtime's overlay2 driver to share a single set of read-only lowerdirs across all 50 pods while each pod writes only its unique changes to a private upperdir.
Ensure the container runtime uses the overlay2 driver and monitor per-pod storage overhead with kubectl describe node. This keeps per-pod overhead under 10 MB in most workloads, enabling dense scheduling of 110+ pods per node without exhausting ephemeral storage.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Layer sharing | overlay2 shares lowerdirs across containers | N/A | N/A | N/A | Container runtime shares image layers across pods |
| Copy-on-write | Full-file copy-up to upperdir | N/A | N/A | N/A | Same as container runtime |
| Image layers | Each filesystem-changing Dockerfile instruction = one layer | N/A | N/A | N/A | Same image spec (OCI) |
| Writable layer | Container upperdir | N/A | N/A | N/A | Pod ephemeral storage in upperdir |
| Volume bypass | -v mount bypasses overlay | N/A | N/A | N/A | PVC/emptyDir bypasses overlay |
Stack Layer Mapping
| Layer | OverlayFS Mechanism |
|---|---|
| Block device | ext4/xfs backing filesystem for upper and lower layers |
| VFS | OverlayFS registers as a filesystem type, intercepts path lookups |
| Lowerdir stack | Read-only image layers, content-addressed and shared |
| Upperdir | Per-container writable layer with copy-ups and whiteouts |
| Workdir | Atomic staging area for copy-up (crash consistency) |
| Merged mount | Unified view presented to container processes |
Design Rationale
File-granular COW won out over block-level COW (device-mapper style) because it sits on top of any ext4/xfs filesystem with no pre-allocated block devices required. Whiteouts are a consequence of shared, read-only lower layers -- since nothing can be deleted from them directly, the upper layer records the deletion instead. Metacopy came later to address an obvious pain point: chmod on a multi-gigabyte file should not trigger a multi-gigabyte copy when only a handful of metadata bytes changed.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| "No space left on device" with disk not full | Inode exhaustion from whiteouts and copy-ups | df -i /var/lib/docker to check inode usage |
| Container startup slow (seconds instead of ms) | Wrong storage driver (vfs instead of overlay2) | docker info \| grep "Storage Driver" |
| Unexpected large container writable layer | Copy-up of large lower-layer files | docker system df -v and check container size |
| Hard links broken after container writes | Copy-up creates separate inodes per name | Check index=on mount option for hard-link tracking |
| Permission changes on large files are slow | Metacopy not enabled | cat /sys/module/overlay/parameters/metacopy |
| Container fs shows deleted file still exists | Whiteout not created properly | find <upperdir> -type c -perm 0000 for whiteout entries |
When to Use / Avoid
Use when:
- Running containers that share base image layers (Docker overlay2 default)
- Building Live CD/USB systems with tmpfs upper layer over read-only media
- Creating ephemeral build environments that write to a disposable layer
- Implementing rollback by discarding the upper layer and starting fresh
Avoid when:
- The workload writes heavily to large lower-layer files (copy-up cost is prohibitive)
- Block-level COW is needed (use btrfs or device-mapper thin provisioning)
- NFS or network filesystem is required for the upper layer (POSIX rename atomicity not guaranteed)
- Database files need direct I/O (use volume mounts that bypass overlay)
Try It Yourself
```shell
# Mount a basic overlay filesystem with two lower layers
sudo mount -t overlay overlay -o lowerdir=/lower2:/lower1,upperdir=/upper,workdir=/work /merged

# List all overlay mounts showing full options
findmnt -t overlay -o TARGET,SOURCE,OPTIONS

# Inspect a Docker container's overlay2 merged directory
docker inspect --format '{{.GraphDriver.Data.MergedDir}}' <container_id>

# List overlay mounts visible in the current mount namespace
grep overlay /proc/self/mountinfo

# Find whiteout files (deletions) in a container's upper layer
find /var/lib/docker/overlay2/<layer_id>/diff -type c -perm 0000

# Check if a directory is marked opaque (hides lower layer contents)
getfattr -n trusted.overlay.opaque /var/lib/docker/overlay2/<layer_id>/diff/some_dir

# Show Docker layer disk usage with per-layer breakdown
docker system df -v

# Inspect overlay mount options for metacopy and index features
mount -t overlay | grep -oE '(metacopy|index)=\w+'
```
Debug Checklist
1. findmnt -t overlay -- list all overlay mounts with lowerdir/upperdir/workdir
2. docker system df -v -- per-image and per-container disk usage
3. df -i /var/lib/docker -- check inode usage on overlay backing filesystem
4. find <upperdir> -type c -perm 0000 -- find whiteout files (deletions)
5. getfattr -n trusted.overlay.opaque <dir> -- check for opaque directories
6. docker inspect <container> | grep -A5 GraphDriver -- show overlay mount details
Key Takeaways
- ✓ First write to a lower-layer file is expensive -- the kernel copies the ENTIRE file to upperdir before applying your one-byte change. A 2 GB base-image file means a 2 GB copy-up, even if you only appended a newline. Subsequent writes hit the upper copy directly.
- ✓ Deleting a file does not actually delete anything. The kernel drops a "whiteout" (character device 0/0) in upperdir that hides the lower-layer file. Opaque directories (xattr trusted.overlay.opaque=y) hide everything below when you rm -rf and recreate a directory.
- ✓ 100 containers from one image cost almost zero extra disk. All share read-only lowerdirs; only unique writes accumulate in each container's upperdir. This is why Docker images are small but containers feel full-size.
- ✓ Metacopy (Linux 4.19+) is the performance shortcut for chmod/chown -- it creates a tiny metadata node in upperdir instead of copying gigabytes of file data. For permission changes on large files, this is the difference between milliseconds and minutes.
- ✓ upperdir and workdir must live on the same filesystem (typically ext4, or xfs formatted with ftype=1). tmpfs gives fast writes but no persistence. NFS is not supported as an upper layer because overlay needs rename and whiteout semantics that NFS cannot guarantee.
Common Pitfalls
- ✗ Mistake: Containers mysteriously run out of inodes with plenty of disk space. Reality: Every whiteout file and copied-up file consumes an inode on the upper filesystem. High container churn with lots of deletions exhausts inodes before bytes.
- ✗ Mistake: Assuming copy-up is instant. Reality: Writing a single byte to a 2 GB lower-layer file triggers a full 2 GB copy to upperdir. Structure Dockerfiles so large files are created once in early layers and never modified in later ones.
- ✗ Mistake: Manually mounting overlayfs to debug Docker and getting confused by the options. Reality: Docker's overlay2 driver manages lowerdir stacking, link indirection, and layer metadata automatically. Debugging by hand means reconstructing the full lowerdir chain from /var/lib/docker/overlay2/*/diff.
- ✗ Mistake: Expecting hard links to survive across layers. Reality: A file hard-linked under two names in a lower layer splits into separate inodes as soon as either name is copied up -- the hard-link relationship silently breaks unless index=on tracking is enabled.
Reference
In One Line
Shared read-only layers plus per-container writable layers -- keep large files in early Dockerfile stages so copy-up never bites.