chroot & pivot_root

Docker, Kubernetes

Mental Model

A playground sandbox. chroot draws a chalk line around it and says "stay inside." Step over the line and the rest of the playground is right there -- slides, swings, everything. pivot_root lifts the sandbox onto a flatbed truck, drives it to a fenced yard, and locks the gate. The playground is not off-limits. It is somewhere else entirely. No line to step over because there is nothing on the other side.

The Problem

Untrusted code sits in a chroot jail. A root process inside escapes with 4 lines of C -- mkdir, chroot, chdir, chroot -- in under a millisecond, no special tools needed. On a multi-tenant host with 40 containers, a single compromised container using chroot-based isolation gets full read/write access to every other container and the host filesystem. Even runtimes built on pivot_root have had to defend their boundary: CVE-2019-5736 overwrote the host runc binary from inside a container by way of /proc/self/exe, not by a chroot escape.

Architecture

Run docker run alpine ls / and the output shows a clean Alpine filesystem. No trace of the host's files. No /home/alice. No /var/log from the host. Nothing.

How? Not chroot. Docker does not use chroot for container isolation by default. Neither does Podman, containerd, or any serious container runtime.

They all use pivot_root. And the difference between these two syscalls is the difference between "the door is closed" and "the door does not exist."

What Actually Happens

chroot changes one pointer in the process's fs_struct: the root directory reference. After chroot("/jail"), every absolute path lookup starts from /jail. Open /etc/passwd? The kernel resolves it as /jail/etc/passwd.

Here is the catch. chroot does not change the current working directory. It does not close open file descriptors. It does not restrict syscalls.

A root process can escape chroot in four lines:

Open a file descriptor to a directory.
Call chroot() to a new subdirectory.
fchdir() to the saved file descriptor (now outside the new root).
Walk .. up to the real root and chroot(".").

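Even without a saved fd, the classic escape works: mkdir("x"); chroot("x"); chdir("../../.."); chroot("."). This works because chroot("x") moves the root but leaves cwd where it was, now outside the new root. The kernel only stops .. at the process root, so repeated .. traversals from cwd never hit it and reach the real root.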
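
A toy model makes the rule concrete. This is a sketch of the ".." rule only, not kernel code: the helper names dotdot and climb and the /jail paths are invented here, and real path resolution also deals with mounts and symlinks.

```shell
# Toy model: ".." is ignored only when the current position IS the process root.
dotdot() {
    root=$1
    pos=$2
    if [ "$pos" = "$root" ] || [ "$pos" = "/" ]; then
        printf '%s\n' "$pos"            # already at a root: stay put
    else
        parent=${pos%/*}                # strip the last path component
        printf '%s\n' "${parent:-/}"
    fi
}

# climb ROOT START COUNT: apply ".." COUNT times starting from START
climb() {
    root=$1
    pos=$2
    n=$3
    while [ "$n" -gt 0 ]; do
        pos=$(dotdot "$root" "$pos")
        n=$((n - 1))
    done
    printf '%s\n' "$pos"
}

# cwd inside the jail: the walk is clamped at the jail root
climb /jail /jail/a/b 5      # prints /jail

# After chroot("x") with no chdir: root is /jail/x, cwd is still /jail.
# The walk never meets the root, so it runs all the way up.
climb /jail/x /jail 5        # prints /
```

The second call is the escape in miniature: the root moved below cwd, so nothing stops the climb.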

chroot was designed in 1979 for build environments. It was never meant to be a security boundary.

pivot_root operates at a completely different level. It does not change a pointer. It swaps mount points.

pivot_root(new_root, put_old) takes the current root mount, moves it to put_old, and makes new_root the root mount for the entire mount namespace. After this, umount2(put_old, MNT_DETACH) detaches the old root entirely.

Now no path traversal reaches the host. No /proc reference does either. As long as the runtime has closed its own descriptors to host directories before exec, no file descriptor does. The old root is not hidden -- it is unmounted. It does not exist in this mount namespace.

Under the Hood

The runc container init sequence -- the code that runs when Docker starts a container -- follows roughly this pattern:

Step 1. clone() with CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | ... to create new namespaces.

Step 2. Mount the overlay filesystem (image layers as lowerdir, writable container layer as upperdir) onto the new root path.

Step 3. Bind-mount the new root onto itself. This is required because pivot_root demands that new_root be a mount point. A plain directory will not work.

Step 4. chdir() into the new root.

Step 5. pivot_root(".", ".old_root") to swap the root mount.

Step 6. umount2("/.old_root", MNT_DETACH) to detach the host root. This is the critical security step. Skip it, and any process in the container that navigates to /.old_root can access the entire host filesystem.

Step 7. Remount /proc, /dev, /sys with appropriate restrictions.

Step 8. Apply seccomp filters and drop capabilities.

Step 9. exec() the container entrypoint.

Steps 5 and 6 are what provide filesystem isolation. Everything else builds on top of them.
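
Steps 2 through 7 map onto ordinary commands. The sketch below only prints them, since running them needs root and a fresh mount namespace from step 1. The function name print_pivot_plan and the /run/demo paths are made up for illustration, the workdir option is an overlayfs requirement that the steps leave implicit, and runc issues these as syscalls from its own code rather than from a shell.

```shell
# Dry run: print the shell equivalent of steps 2-7 without executing anything.
# print_pivot_plan ROOTFS LOWERDIR UPPERDIR WORKDIR   (all paths hypothetical)
print_pivot_plan() {
    rootfs=$1
    lower=$2
    upper=$3
    work=$4
    cat <<EOF
mount -t overlay overlay -o lowerdir=$lower,upperdir=$upper,workdir=$work $rootfs
mount --bind $rootfs $rootfs
cd $rootfs
mkdir -p .old_root
pivot_root . .old_root
umount -l /.old_root
mount -t proc proc /proc
EOF
}

print_pivot_plan /run/demo/rootfs /run/demo/lower /run/demo/upper /run/demo/work
```

Lines 5 and 6 of the output are the root swap and the detach. Everything before them is preparation, and the proc mount after them belongs to step 7.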

The mount namespace prerequisite is non-negotiable. pivot_root modifies the root mount of the current mount namespace. Without CLONE_NEWNS, the call would change the root for every process on the system -- including PID 1. That would be catastrophic. The mount namespace provides the isolation boundary: mount operations inside it are invisible to processes in other namespaces.
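
One way to check the prerequisite is to compare mount namespace identifiers, which the kernel exposes as symlinks under /proc/<pid>/ns/mnt. A minimal sketch, assuming a Linux /proc; the helper names ns_inode, mntns_of, and same_mntns are hypothetical, and the inode number in the demo line is only a sample value.

```shell
# Extract the inode number from a namespace link such as "mnt:[4026531840]".
ns_inode() {
    s=${1#*\[}                  # drop everything up to and including "["
    printf '%s\n' "${s%\]}"     # drop the trailing "]"
}

# Print the mount namespace inode of a PID (empty if the link is unreadable).
mntns_of() {
    ns_inode "$(readlink "/proc/$1/ns/mnt" 2>/dev/null)"
}

# Exit 0 only when both PIDs are readable and share one mount namespace.
same_mntns() {
    a=$(mntns_of "$1")
    b=$(mntns_of "$2")
    [ -n "$a" ] && [ "$a" = "$b" ]
}

ns_inode 'mnt:[4026531840]'     # prints 4026531840
```

Reading another process's namespace link normally needs root, so a comparison against PID 1 is best run under sudo. If same_mntns reports that a would-be container init shares a namespace with host PID 1, pivot_root must not be called yet.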

Common Questions

Why do containers use pivot_root instead of chroot?

chroot only changes a per-process pointer and is trivially escapable by a privileged process. pivot_root swaps the root mount at the namespace level, and after unmounting the old root, there is literally no path or reference back to the host filesystem. Additionally, pivot_root affects all processes in the mount namespace, providing consistent isolation for every process in the container.

Can a non-root process call chroot or pivot_root?

chroot() requires CAP_SYS_CHROOT. pivot_root() requires CAP_SYS_ADMIN. In rootless containers (Podman, rootless Docker), user namespaces provide these capabilities within the namespace without requiring real root on the host. That is how filesystem isolation works without privileges.
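
To see which of the two capabilities a process holds, decode the CapEff bitmask from /proc/self/status. A small sketch: the helper names has_cap and cap_eff are invented here, bit 18 is CAP_SYS_CHROOT and bit 21 is CAP_SYS_ADMIN in the kernel's capability numbering, and the sample mask is the value often seen in a default Docker container (treat it as illustrative).

```shell
# has_cap HEXMASK BIT: exit 0 when capability bit BIT is set in HEXMASK.
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Print the effective capability mask of the current process (Linux only).
cap_eff() {
    awk '/^CapEff:/ { print $2 }' /proc/self/status
}

# Sample mask: CAP_SYS_CHROOT (18) is set, CAP_SYS_ADMIN (21) is not.
has_cap 00000000a80425fb 18 && echo "CAP_SYS_CHROOT present"
has_cap 00000000a80425fb 21 || echo "CAP_SYS_ADMIN absent"

# Live check, for use on a Linux host or inside a container:
#   has_cap "$(cap_eff)" 21 && echo "can call pivot_root"
```

A process with that sample mask may call chroot() but not pivot_root() or mount().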

What happens if the old root is not unmounted after pivot_root?

The host root filesystem remains accessible at the put_old mount point inside the container. Any process that can navigate there has full access to the host. This completely defeats the isolation. That is why runc calls umount2("/.old_root", MNT_DETACH) immediately, and why container security audits verify that no host mounts leak into the container's mount table.

How does pivot_root interact with /proc?

After pivot_root, the host's /proc is part of the old root -- it gets unmounted. The container init process must mount a fresh proc filesystem for the new PID namespace. This new /proc only shows processes within the container's PID namespace. If the container inherited the host's /proc, it would leak information about all host processes and potentially allow manipulation via /proc/<pid>/ entries.

How Technologies Use This

Docker

A root process inside a container escapes to the host filesystem using the classic chroot breakout: mkdir, chroot, chdir, chroot. The four-line C program reaches the real root in under a millisecond, and the container's filesystem isolation is completely defeated.

The reason this attack works against chroot is that chroot only changes a pointer in the process's fs_struct -- the old root is still accessible through saved file descriptors, current working directory traversal, or /proc references. Chroot was designed for build environments in 1979, not security isolation, and it has never been a confinement mechanism.

Docker uses pivot_root inside a mount namespace instead. runc swaps the overlay filesystem in as the new root, then calls umount2 with MNT_DETACH to detach the old root entirely. After this sequence, no path traversal and no /proc reference back to the host filesystem survives. The container's /proc/self/mountinfo lists the overlay root plus the mounts the runtime set up on purpose (proc, dev, sys, and a few bind-mounted files such as /etc/hosts). The host's root mount is not among them.
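
A quick way to confirm what sits at / is to pull the filesystem type of the root mount out of mountinfo. A minimal sketch, assuming the standard mountinfo layout (mount point in field 5, filesystem type right after the "-" separator); the function name root_fs_type and the two sample lines are made up for illustration.

```shell
# Read mountinfo text on stdin and print the filesystem type mounted at "/".
root_fs_type() {
    awk '$5 == "/" {
        # optional fields start at 7 and end at the "-" separator
        for (i = 7; i <= NF; i++)
            if ($i == "-") { fstype = $(i + 1); break }
    }
    END { if (fstype != "") print fstype }'
}

# Hypothetical two-line mountinfo sample:
printf '%s\n' \
    '100 50 0:45 / / rw,relatime master:1 - overlay overlay rw,lowerdir=/l,upperdir=/u,workdir=/w' \
    '101 100 0:46 / /proc rw,nosuid - proc proc rw' |
    root_fs_type                # prints overlay

# Live use inside a container:
#   root_fs_type < /proc/self/mountinfo
```

Inside a Docker container with the overlay storage driver the expected answer is overlay. A host filesystem type such as ext4 or xfs at / is worth a closer look.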

Kubernetes

Forty containers on a node each need their own completely separate view of the filesystem with no path leaking back to the host. Higher-level Kubernetes controls like readOnlyRootFilesystem and allowedHostPaths must be enforceable against a determined root process inside the container.

The critical insight is that all of these higher-level filesystem controls depend on the container runtime having performed a proper pivot_root at the base of the isolation stack. If the runtime used chroot instead, a root process could escape in 4 lines of C, and every Kubernetes filesystem restriction layered on top would be meaningless.

The container runtime runs the full pivot_root sequence for each container: create mount namespace, mount overlay, pivot_root, then umount the old root. Kubernetes layers readOnlyRootFilesystem to prevent writes, hostPath restrictions to limit bind mounts (allowedHostPaths under the now-removed PodSecurityPolicy, admission policies such as Pod Security Standards in current releases), and mount propagation controls to limit mount visibility. Because pivot_root plus umount removes the host root from the namespace, these controls are enforceable rather than advisory.

Same Concept Across Tech

Concept	Docker	JVM	Node.js	Go	K8s
Filesystem isolation	pivot_root in runc init sequence	N/A (no native fs isolation)	N/A (OS-level only)	syscall.PivotRoot in container libs	Inherited from container runtime (containerd/CRI-O)
Root swap mechanism	overlay mount + pivot_root + umount old	N/A	N/A	mount + pivot_root + umount2	Pod sandbox runs full pivot_root per container
Build-time isolation	Multi-stage builds use chroot internally	N/A	N/A	N/A	InitContainers run in already-pivoted namespace
Rootless mode	User namespace + pivot_root (rootless Docker)	N/A	N/A	User namespace + pivot_root	Rootless K8s uses usernetes with pivot_root
Escape prevention	Drops CAP_SYS_ADMIN + default seccomp profile blocks mount and pivot_root	N/A	N/A	Seccomp profile blocks chroot	Pod Security Standards block privileged containers

Stack Layer Mapping

Layer	Isolation Mechanism
Hardware	N/A (filesystem isolation is a kernel abstraction)
Kernel	pivot_root syscall swaps root mount in mount namespace
VFS layer	Path resolution starts from new root, old root unreachable
Container runtime	runc/crun execute pivot_root + umount sequence
Orchestrator	K8s relies on runtime for fs isolation, adds policy enforcement
Application	Sees only container rootfs, no host paths exposed

Design Rationale

chroot dates to 1979 and was built for build environments, not adversaries. pivot_root was designed specifically for mount namespace isolation; requiring the new root to be a mount point lets the kernel swap mount trees cleanly. The umount step afterward is non-negotiable -- skip it and the host root sits at put_old, fully accessible, defeating the whole point.

If You See This, Think This

Symptom	Likely Cause	First Check
Host files visible inside container	Old root not unmounted after pivot_root	`cat /proc/self/mountinfo` for host mount entries
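pivot_root fails with EINVAL	new_root is not a mount point, or the root mount has shared propagation	`mount --bind /newroot /newroot` (and `mount --make-rprivate /`) before pivot_root
pivot_root fails with EBUSY	new_root or put_old is on the current root mount	`findmnt --mountpoint /newroot` to confirm new_root is its own mount
umount of old root fails with EBUSY	Old root still has active references	`lsof +D /.old_root` to find open files, or detach with `umount -l`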
Container escape via chroot	Using chroot instead of pivot_root for isolation	Verify runtime uses pivot_root: `strace -e pivot_root`
/proc shows host processes	Fresh /proc not mounted after pivot_root	`mount -t proc proc /proc` inside new namespace
Mount namespace changes affect host	Forgot to unshare(CLONE_NEWNS) before pivot_root	`lsns -t mnt -p <pid>` to verify separate namespace

When to Use / Avoid

Use pivot_root when:

Building container runtimes that need real filesystem isolation (always pivot_root)
Creating hermetic build environments where the host filesystem must be invisible
Running untrusted code that might attempt filesystem escape
Implementing rootless containers with user namespaces

Avoid (plain chroot is enough) when:

Building packages in trusted environments (debootstrap-style chroot is fine)
The process runs as non-root with no CAP_SYS_CHROOT (chroot escape requires root)
Testing or development where isolation is convenience, not security

Try It Yourself

# Create a minimal chroot environment
mkdir -p /tmp/jail/bin
cp /bin/bash /tmp/jail/bin/
# Copy each shared library to the same path inside the jail so the loader finds it
ldd /bin/bash | grep -o '/[^ ]*' | xargs -I{} cp --parents {} /tmp/jail/
sudo chroot /tmp/jail /bin/bash

# Demonstrate chroot escape (run as root inside chroot).
# The calls must happen inside one process, e.g. a small C program:
#   mkdir("escape"); chroot("escape"); chdir("../../.."); chroot(".")
# The chroot(1) command will not do: it calls chdir("/") after chroot().

# Use unshare to create mount namespace + new root (modern approach)
# Note: --root switches root with chroot(2) inside the new namespaces
sudo unshare --mount --pid --fork --root=/path/to/rootfs /bin/sh

# Inspect a container's mount namespace
sudo nsenter -t $(docker inspect -f '{{.State.Pid}}' mycontainer) --mount findmnt

# Show mount namespaces on the system
lsns -t mnt

# See the pivot_root in action via strace on runc
sudo strace -f -e pivot_root,mount,umount2,unshare runc run test-container 2>&1 | head -50

# Verify container has no access to host root
docker run --rm alpine cat /proc/self/mountinfo

# Create a proper pivot_root setup manually
# (/tmp/newroot must already hold a root filesystem with /bin/sh and umount)
sudo unshare --mount --fork bash -c '
  mount --bind /tmp/newroot /tmp/newroot   # new_root must be a mount point
  cd /tmp/newroot
  mkdir -p .old
  pivot_root . .old                        # arguments: new_root put_old
  umount -l /.old                          # lazy detach of the old root
  exec /bin/sh
'

Debug Checklist

1. cat /proc/self/mountinfo -- verify no host mounts leak into container
2. ls -la /proc/1/root -- check what container PID 1 sees as root
3. findmnt --list -- show mount tree inside the namespace
4. nsenter -t <pid> --mount findmnt -- inspect container mount namespace from host
5. strace -e pivot_root,mount,umount2 runc run test -- trace the root swap sequence
6. lsns -t mnt -- list all mount namespaces on the system

Key Takeaways

✓ chroot is a filesystem trick, not a security boundary. It does not create namespaces, does not restrict syscalls, and does not stop a root process from escaping. The classic escape: mkdir(d); chroot(d); chdir('../../..'); chroot('.') -- it works because chroot never changes the current working directory.
✓ pivot_root operates at the mount namespace level and requires CLONE_NEWNS. After pivot_root, the old root lands at put_old and MUST be unmounted. Skip the umount and you have a path straight back to the host filesystem inside the container.
✓ The runc container init sequence is: clone(CLONE_NEWNS|CLONE_NEWPID|..), mount overlay on new root, pivot_root(new_root, old_root), umount2(old_root, MNT_DETACH), then exec the container entrypoint. After this, no path and no /proc reference to the host filesystem survives.
✓ chroot has a simpler API but a weaker security model. pivot_root has a more complex API (mount namespace, mount point requirements) but provides actual isolation. In interviews, knowing WHY containers use pivot_root instead of chroot demonstrates deep understanding of Linux security.
✓ After pivot_root plus umount of old root, /proc/1/root inside the container points to the overlayfs mount. There is zero reference to the host's / in the container's mount table. Verify with cat /proc/self/mountinfo.

Common Pitfalls

✗ Mistake: Using chroot for security isolation in production. Reality: chroot was designed for build environments (like debootstrap), not security boundaries. Any process with CAP_SYS_CHROOT can escape with the open-dir-chroot-fchdir technique. It was never meant to contain adversaries.
✗ Mistake: Forgetting to umount the old root after pivot_root. Reality: if put_old remains mounted, any process in the container that navigates to that path has full access to the host filesystem. runc unmounts it immediately and recursively. Skip this step and your container is not isolated.
✗ Mistake: Calling pivot_root without creating a mount namespace first. Reality: pivot_root modifies the root mount of the current namespace. Without CLONE_NEWNS, you would change the root for ALL processes sharing the namespace -- including the host init process. This would be catastrophic.
✗ Mistake: Not bind-mounting the new root onto itself before pivot_root. Reality: the kernel requires new_root to be a mount point. A plain directory will not work. You must run mount --bind /new_root /new_root first to satisfy this requirement.

Reference

System Calls

chroot, pivot_root, mount, umount2, unshare

Tools

unshare(1), nsenter(1), lsns, findmnt

In One Line

pivot_root plus umount of the old root gives real filesystem isolation; chroot was never a security boundary and never will be.
