chroot & pivot_root
Mental Model
A playground sandbox. chroot draws a chalk line around it and says "stay inside." Step over the line and the rest of the playground is right there -- slides, swings, everything. pivot_root lifts the sandbox onto a flatbed truck, drives it to a fenced yard, and locks the gate. The playground is not off-limits. It is somewhere else entirely. No line to step over because there is nothing on the other side.
The Problem
Untrusted code sits in a chroot jail. A root process inside escapes with 4 lines of C -- mkdir, chroot, chdir, chroot -- in under a millisecond, no special tools needed. On a multi-tenant host with 40 containers, a single compromised container using chroot-based isolation gets full read/write access to every other container and the host filesystem. CVE-2019-5736 showed how costly even one leaked reference is: the attack reached the host runc binary through /proc/self/exe and overwrote it from inside a container.
Architecture
Run docker run alpine ls / and the output shows a clean Alpine filesystem. No trace of the host's files. No /home/alice. No /var/log from the host. Nothing.
How? Not chroot. Docker does not use chroot. Neither does Podman, containerd, or any serious container runtime.
They all use pivot_root. And the difference between these two syscalls is the difference between "the door is closed" and "the door does not exist."
What Actually Happens
chroot changes one pointer in the process's fs_struct: the root directory reference. After chroot("/jail"), every absolute path lookup starts from /jail. Open /etc/passwd? The kernel resolves it as /jail/etc/passwd.
Here is the catch. chroot does not change the current working directory. It does not close open file descriptors. It does not restrict syscalls.
A root process can escape chroot in four lines:
- Open a file descriptor to a directory.
- Call chroot() to a new subdirectory.
- fchdir() to the saved file descriptor (now outside the new root).
- Walk .. up to the real root and chroot(".").
Even without a saved fd, the classic escape works: mkdir("x"); chroot("x"); chdir("../../.."); chroot("."). This works because the first chroot does not set cwd, so repeated .. traversals reach the real root.
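To make the mechanics concrete, here is a toy model of the two fields involved, the process root and the current working directory. It is a simulation under simplifying assumptions (no symlinks, no mounts, and mkdir is implied), not kernel code, and the names Proc, resolve, and classic_escape are made up for this sketch. The one rule it borrows from the kernel is that ".." is ignored only when the lookup is standing exactly on the process root.

```python
import posixpath


class Proc:
    """Toy process holding only the root and cwd references."""

    def __init__(self, root="/", cwd="/"):
        self.root = root  # real path this process treats as "/"
        self.cwd = cwd    # real path of the current working directory

    def resolve(self, path):
        # Absolute paths start at the process root, relative ones at cwd.
        cur = self.root if path.startswith("/") else self.cwd
        for part in path.split("/"):
            if part in ("", "."):
                continue
            if part == "..":
                # ".." is a no-op only when standing exactly on the root.
                if cur != self.root:
                    cur = posixpath.dirname(cur)
            else:
                cur = posixpath.join(cur, part)
        return cur

    def chroot(self, path):
        self.root = self.resolve(path)  # cwd is deliberately left untouched

    def chdir(self, path):
        self.cwd = self.resolve(path)


def classic_escape(proc):
    """mkdir("x") is implied; the model has no real directories."""
    proc.chroot("x")         # root moves below cwd, cwd stays where it was
    proc.chdir("../../..")   # cwd is never equal to root, so ".." keeps climbing
    proc.chroot(".")         # adopt the real root


jailed = Proc(root="/jail", cwd="/jail")
before = jailed.resolve("/etc/passwd")
jailed.chdir("../../..")     # inside the jail, ".." is pinned at the root
pinned = jailed.cwd
classic_escape(jailed)
after = jailed.resolve("/etc/passwd")
```

Before the escape, the lookup of /etc/passwd lands in /jail/etc/passwd and ".." cannot climb past /jail. After the first chroot moves the root below the working directory, the "standing on the root" check never fires, and the walk reaches the real root.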
chroot was designed in 1979 for build environments. It was never meant to be a security boundary.
pivot_root operates at a completely different level. It does not change a pointer. It swaps mount points.
pivot_root(new_root, put_old) takes the current root mount, moves it to put_old, and makes new_root the root mount for the entire mount namespace. After this, umount2(put_old, MNT_DETACH) detaches the old root entirely.
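Now no path traversal reaches the host. The old root is not hidden -- it is unmounted. It does not exist in this mount namespace. The only way back is a reference that was already open before the swap, such as a file descriptor to a host directory or a /proc/<pid> link to a host file, so the runtime must not leak any into the container.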
Under the Hood
The runc container init sequence -- the actual code that runs when Docker starts a container -- follows this exact pattern:
Step 1. clone() with CLONE_NEWNS | CLONE_NEWPID | CLONE_NEWNET | ... to create new namespaces.
Step 2. Mount the overlay filesystem (image layers as lowerdir, writable container layer as upperdir) onto the new root path.
Step 3. Bind-mount the new root onto itself. This is required because pivot_root demands that new_root be a mount point. A plain directory will not work.
Step 4. chdir() into the new root.
Step 5. pivot_root(".", ".old_root") to swap the root mount.
Step 6. umount2("/.old_root", MNT_DETACH) to detach the host root. This is the critical security step. Skip it, and any process in the container that navigates to /.old_root can access the entire host filesystem.
Step 7. Remount /proc, /dev, /sys with appropriate restrictions.
Step 8. Apply seccomp filters and drop capabilities.
Step 9. exec() the container entrypoint.
Steps 5 and 6 are what provide filesystem isolation. Everything else builds on top of them.
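The filesystem part of that sequence (steps 1 and 3 through 6) can be sketched in Python with ctypes. This is an illustration, not runc's code: root_swap_plan and run_plan are hypothetical names, the extra mount call that makes / recursively private is an addition not in the list above (pivot_root rejects shared propagation on the relevant parent mounts), and the sketch assumes the C library exports a pivot_root symbol. The plan is built as plain data so the ordering can be inspected without privileges; run_plan needs CAP_SYS_ADMIN and really does change the calling process's root.

```python
import ctypes
import os

# Constants from the Linux headers (sched.h, sys/mount.h).
CLONE_NEWNS = 0x00020000
MS_BIND, MS_REC, MS_PRIVATE = 4096, 16384, 1 << 18
MNT_DETACH = 2


def root_swap_plan(new_root, put_old=".old_root"):
    """Return the ordered (operation, args) list; nothing is executed here."""
    new, old = os.fsencode(new_root), os.fsencode(put_old)
    return [
        ("unshare", (CLONE_NEWNS,)),                                  # step 1, mount namespace only
        ("mount", (b"none", b"/", None, MS_REC | MS_PRIVATE, None)),  # assumption: stop propagation
        ("mount", (new, new, None, MS_BIND | MS_REC, None)),          # step 3: make new_root a mount point
        ("chdir", (new,)),                                            # step 4
        ("mkdir", (old,)),                                            # put_old must exist
        ("pivot_root", (b".", old)),                                  # step 5
        ("umount2", (b"/" + old, MNT_DETACH)),                        # step 6: detach the host root
        ("chdir", (b"/",)),
    ]


def run_plan(plan):
    """Execute a plan for real. Needs CAP_SYS_ADMIN; changes this process's root."""
    libc = ctypes.CDLL(None, use_errno=True)
    char_p = ctypes.c_char_p
    libc.unshare.argtypes = (ctypes.c_int,)
    libc.mount.argtypes = (char_p, char_p, char_p, ctypes.c_ulong, char_p)
    libc.pivot_root.argtypes = (char_p, char_p)  # assumption: libc exports this symbol
    libc.umount2.argtypes = (char_p, ctypes.c_int)
    for op, args in plan:
        if op == "chdir":
            os.chdir(args[0])
        elif op == "mkdir":
            os.makedirs(args[0], exist_ok=True)
        elif getattr(libc, op)(*args) != 0:
            err = ctypes.get_errno()
            raise OSError(err, f"{op}: {os.strerror(err)}")


plan = root_swap_plan("/tmp/newroot")
```

Mounting the overlay (step 2), remounting /proc, /dev, and /sys (step 7), seccomp and capability dropping (step 8), and the final exec (step 9) are left out to keep the sketch short.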
The mount namespace prerequisite is non-negotiable. pivot_root modifies the root mount of the current mount namespace. Without CLONE_NEWNS, the call would change the root for every process on the system -- including PID 1. That would be catastrophic. The mount namespace provides the isolation boundary: mount operations inside it are invisible to processes in other namespaces.
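One way to confirm that a process really sits in its own mount namespace is to compare the /proc/<pid>/ns/mnt links, which read like mnt:[4026531840]. The helpers below are a small sketch with made-up names (ns_inode, mount_ns_of, shares_mount_ns); reading the link of a process owned by another user usually needs elevated privileges.

```python
import os
import re


def ns_inode(link_text):
    """Extract the inode number from a link such as 'mnt:[4026531840]'."""
    match = re.fullmatch(r"mnt:\[(\d+)\]", link_text)
    if match is None:
        raise ValueError(f"not a mount namespace link: {link_text!r}")
    return int(match.group(1))


def mount_ns_of(pid="self"):
    """Mount namespace inode of a process, read from /proc/<pid>/ns/mnt."""
    return ns_inode(os.readlink(f"/proc/{pid}/ns/mnt"))


def shares_mount_ns(pid_a, pid_b="self"):
    """True when both processes see the same mount table."""
    return mount_ns_of(pid_a) == mount_ns_of(pid_b)
```

If shares_mount_ns(1) returns True for a process that is about to call pivot_root, the call would act on the namespace PID 1 lives in, which is the catastrophic case described above.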
Common Questions
Why do containers use pivot_root instead of chroot?
chroot only changes a per-process pointer and is trivially escapable by a privileged process. pivot_root swaps the root mount at the namespace level, and after unmounting the old root, there is literally no path or reference back to the host filesystem. Additionally, pivot_root affects all processes in the mount namespace, providing consistent isolation for every process in the container.
Can a non-root process call chroot or pivot_root?
chroot() requires CAP_SYS_CHROOT. pivot_root() requires CAP_SYS_ADMIN. In rootless containers (Podman, rootless Docker), user namespaces provide these capabilities within the namespace without requiring real root on the host. That is how filesystem isolation works without privileges.
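Whether a given process holds these capabilities can be read from the CapEff line of /proc/<pid>/status, which is a hexadecimal bitmask. The sketch below decodes it; the function names are illustrative, the bit numbers come from linux/capability.h, and the sample mask is the value typically seen inside a default Docker container.

```python
CAP_SYS_CHROOT = 18  # bit numbers from linux/capability.h
CAP_SYS_ADMIN = 21


def effective_caps(status_text):
    """Pull the CapEff hex mask out of /proc/<pid>/status content."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def has_cap(cap_eff_hex, cap):
    """True when bit number `cap` is set in the hex mask."""
    return bool((int(cap_eff_hex, 16) >> cap) & 1)


# Sample status excerpt (illustrative, not captured from a live system).
SAMPLE_STATUS = (
    "Name:\tsh\n"
    "CapPrm:\t00000000a80425fb\n"
    "CapEff:\t00000000a80425fb\n"
)

mask = effective_caps(SAMPLE_STATUS)
may_chroot = has_cap(mask, CAP_SYS_CHROOT)
may_pivot_root = has_cap(mask, CAP_SYS_ADMIN)
```

With this mask, chroot() is permitted and pivot_root() is not. On a live system, pass the contents of /proc/self/status instead of the sample.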
What happens if the old root is not unmounted after pivot_root?
The host root filesystem remains accessible at the put_old mount point inside the container. Any process that can navigate there has full access to the host. This completely defeats the isolation. That is why runc calls umount2("/.old_root", MNT_DETACH) immediately, and why container security audits verify that no host mounts leak into the container's mount table.
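A check of this kind can be sketched as a small parser over /proc/<pid>/mountinfo. Each line carries the mount point in its fifth field, then a lone "-" separator followed by the filesystem type and source. The function names and the sample text are invented for illustration, and the default put_old path simply mirrors the /.old_root name used in this article.

```python
def parse_mountinfo(text):
    """Yield (mount_point, fs_type, source) for each mountinfo line."""
    for line in text.splitlines():
        if not line.strip():
            continue
        # Fields before " - " are positional; optional fields sit just before it.
        left, _, right = line.partition(" - ")
        mount_point = left.split()[4]
        fs_type, source = right.split()[:2]
        yield mount_point, fs_type, source


def leftover_old_root(text, put_old="/.old_root"):
    """Mount points at or below put_old, meaning the old root is still attached."""
    return [
        mount_point
        for mount_point, _, _ in parse_mountinfo(text)
        if mount_point == put_old or mount_point.startswith(put_old + "/")
    ]


# Made-up sample: an overlay root, a fresh /proc, and a forgotten old root.
SAMPLE = """\
22 21 0:20 / / rw,relatime - overlay overlay rw,lowerdir=/l,upperdir=/u,workdir=/w
23 22 0:21 / /proc rw,nosuid - proc proc rw
24 22 8:1 / /.old_root rw,relatime master:1 - ext4 /dev/sda1 rw
"""

leaks = leftover_old_root(SAMPLE)
```

An empty result for the real /proc/self/mountinfo is what a correctly detached old root looks like; any entry means the host filesystem is still reachable at that path.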
How does pivot_root interact with /proc?
After pivot_root, the host's /proc is part of the old root -- it gets unmounted. The container init process must mount a fresh proc filesystem for the new PID namespace. This new /proc only shows processes within the container's PID namespace. If the container inherited the host's /proc, it would leak information about all host processes and potentially allow manipulation via /proc/<pid>/ entries.
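A quick sanity check for this is to list the numeric entries under /proc, since each one is a PID visible through that proc mount. The helper names below are made up for this sketch.

```python
import os


def visible_pids(proc_entries):
    """Numeric names under /proc are the PIDs this proc mount exposes."""
    return sorted(int(name) for name in proc_entries if name.isdigit())


def pids_here(proc="/proc"):
    """PIDs visible through the given proc mount."""
    return visible_pids(os.listdir(proc))
```

Inside a container with a fresh /proc, pids_here() should return only the container's own processes, starting with PID 1 for the container init. A list that runs into the hundreds or thousands suggests the host's /proc is still mounted.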
How Technologies Use This
Docker
Suppose a container were isolated by chroot alone. A root process inside it escapes to the host filesystem using the classic chroot breakout: mkdir, chroot, chdir, chroot. The four-line C program reaches the real root in under a millisecond, and the container's filesystem isolation is completely defeated.
The reason this attack works against chroot is that chroot only changes a pointer in the process's fs_struct -- the old root is still accessible through saved file descriptors, current working directory traversal, or /proc references. chroot was designed for build environments in 1979, not security isolation, and it has never been a confinement mechanism.
Docker uses pivot_root inside a mount namespace instead. runc swaps the overlay filesystem in as the new root, then calls umount2 with MNT_DETACH to detach the old root entirely. After this sequence, no path traversal leads back to the host filesystem, and the runtime takes care not to leak any file descriptor or /proc reference that could. The container's /proc/self/mountinfo shows the overlay mount as / plus the mounts the runtime added on top of it (/proc, /dev, /sys, and bind-mounted files such as /etc/resolv.conf); the host's root mount does not appear.
Kubernetes
Forty containers on a node each need their own completely separate view of the filesystem with no path leaking back to the host. Higher-level Kubernetes controls like readOnlyRootFilesystem and allowedHostPaths must be enforceable against a determined root process inside the container. (allowedHostPaths is a PodSecurityPolicy field; PodSecurityPolicy was removed in Kubernetes v1.25, so newer clusters restrict hostPath volumes through admission policy instead.)
The critical insight is that all of these higher-level filesystem controls depend on the container runtime having performed a proper pivot_root at the base of the isolation stack. If the runtime used chroot instead, a root process could escape in 4 lines of C, and every Kubernetes filesystem restriction layered on top would be meaningless.
The container runtime runs the full pivot_root sequence for each container: create mount namespace, mount overlay, pivot_root, then umount the old root. Kubernetes layers readOnlyRootFilesystem to prevent writes, hostPath restrictions such as allowedHostPaths to limit bind mounts, and mount propagation controls to limit mount visibility. Because pivot_root followed by the umount removes the host root from the namespace, these controls are enforceable rather than advisory.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Filesystem isolation | pivot_root in runc init sequence | N/A (no native fs isolation) | N/A (OS-level only) | syscall.PivotRoot in container libs | Inherited from container runtime (containerd/CRI-O) |
| Root swap mechanism | overlay mount + pivot_root + umount old | N/A | N/A | mount + pivot_root + umount2 | Pod sandbox runs full pivot_root per container |
| Build-time isolation | Build steps (RUN) execute in containers, so they get the same pivot_root setup | N/A | N/A | N/A | InitContainers run in already-pivoted namespace |
| Rootless mode | User namespace + pivot_root (rootless Docker) | N/A | N/A | User namespace + pivot_root | Rootless K8s uses usernetes with pivot_root |
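| Escape prevention | Drops CAP_SYS_ADMIN by default; default seccomp profile blocks pivot_root and mount (chroot stays allowed because CAP_SYS_CHROOT is in the default set) | N/A | N/A | Seccomp profile blocks chroot | Pod Security Standards block privileged containers |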
Stack Layer Mapping
| Layer | Isolation Mechanism |
|---|---|
| Hardware | N/A (filesystem isolation is a kernel abstraction) |
| Kernel | pivot_root syscall swaps root mount in mount namespace |
| VFS layer | Path resolution starts from new root, old root unreachable |
| Container runtime | runc/crun execute pivot_root + umount sequence |
| Orchestrator | K8s relies on runtime for fs isolation, adds policy enforcement |
| Application | Sees only container rootfs, no host paths exposed |
Design Rationale
chroot dates to 1979 and was built for build environments, not adversaries. pivot_root was introduced to switch root filesystems at boot (moving from an initrd to the real root), so it operates on mounts rather than on a per-process pointer; that is what makes it a fit for mount namespace isolation. Requiring the new root to be a mount point lets the kernel swap mount trees cleanly. The umount step afterward is non-negotiable -- skip it and the host root sits at put_old, fully accessible, defeating the whole point.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Host files visible inside container | Old root not unmounted after pivot_root | cat /proc/self/mountinfo for host mount entries |
| pivot_root fails with EINVAL | new_root is not a mount point | mount --bind /newroot /newroot before pivot_root |
| pivot_root fails with EBUSY | new_root or put_old is on the current root mount (new_root was never bind-mounted, or is /) | findmnt /newroot to confirm it is a separate mount |
| Container escape via chroot | Using chroot instead of pivot_root for isolation | Verify runtime uses pivot_root: strace -e pivot_root |
| /proc shows host processes | Fresh /proc not mounted after pivot_root | mount -t proc proc /proc inside new namespace |
| Mount namespace changes affect host | Forgot to unshare(CLONE_NEWNS) before pivot_root | lsns -t mnt -p <pid> to verify separate namespace |
When to Use / Avoid
Use when:
- Building container runtimes that need real filesystem isolation (always pivot_root)
- Creating hermetic build environments where the host filesystem must be invisible
- Running untrusted code that might attempt filesystem escape
- Implementing rootless containers with user namespaces
Avoid when:
- Building packages in trusted environments (debootstrap-style chroot is fine)
- The process runs as non-root with no CAP_SYS_CHROOT (chroot escape requires root)
- Testing or development where isolation is convenience, not security
Try It Yourself
    # Create a minimal chroot environment
    mkdir -p /tmp/jail/bin
    cp /bin/bash /tmp/jail/bin/
    # Copy each shared library to the same path inside the jail (loader included)
    for lib in $(ldd /bin/bash | grep -o '/[^ ]*'); do cp --parents "$lib" /tmp/jail/; done
    sudo chroot /tmp/jail /bin/bash

    # Demonstrate chroot escape (must run as root inside the chroot).
    # This is the syscall sequence, not a shell one-liner: chroot(1) execs a new
    # command, so all four calls have to come from one process (e.g. a small C program).
    # mkdir("escape"); chroot("escape"); chdir("../../.."); chroot(".")

    # Use unshare to create a mount namespace and switch root
    # (note: unshare --root calls chroot inside the new namespace, not pivot_root)
    sudo unshare --mount --pid --fork --root=/path/to/rootfs /bin/sh

    # Inspect a container's mount namespace
    sudo nsenter -t "$(docker inspect -f '{{.State.Pid}}' mycontainer)" --mount findmnt

    # Show mount namespaces on the system
    lsns -t mnt

    # See the pivot_root in action via strace on runc
    sudo strace -f -e trace=pivot_root,mount,umount2,unshare runc run test-container 2>&1 | head -50

    # Inspect the container's mount table (/ should be the overlay mount, not the host root)
    docker run --rm alpine cat /proc/self/mountinfo

    # Create a proper pivot_root setup manually
    # (/tmp/newroot must already hold a root filesystem with /bin/sh and umount)
    sudo unshare --mount --fork bash -c '
      mount --bind /tmp/newroot /tmp/newroot
      cd /tmp/newroot
      mkdir -p .old
      pivot_root . .old
      umount -l /.old
      exec /bin/sh
    '
Debug Checklist
1. cat /proc/self/mountinfo -- verify no host mounts leak into container
2. ls -la /proc/1/root -- check what container PID 1 sees as root
3. findmnt --list -- show mount tree inside the namespace
4. nsenter -t <pid> --mount findmnt -- inspect container mount namespace from host
5. strace -e pivot_root,mount,umount2 runc run test -- trace the root swap sequence
6. lsns -t mnt -- list all mount namespaces on the system
Key Takeaways
- ✓ chroot is a filesystem trick, not a security boundary. It does not create namespaces, does not restrict syscalls, and does not stop a root process from escaping. The classic escape: mkdir(d); chroot(d); chdir('../../..'); chroot('.') -- it works because chroot never changes the current working directory.
- ✓ pivot_root operates at the mount namespace level and should only be called inside a fresh mount namespace (CLONE_NEWNS). After pivot_root, the old root lands at put_old and MUST be unmounted. Skip the umount and you have a path straight back to the host filesystem inside the container.
- ✓ The runc container init sequence is: clone(CLONE_NEWNS|CLONE_NEWPID|..), mount overlay on new root, pivot_root(new_root, old_root), umount2(old_root, MNT_DETACH), then exec the container entrypoint. After this, no path to the host filesystem survives, provided the runtime leaked no open file descriptor or /proc reference.
- ✓ chroot has a simpler API but a weaker security model. pivot_root has a more complex API (mount namespace, mount point requirements) but provides actual isolation. In interviews, knowing WHY containers use pivot_root instead of chroot demonstrates deep understanding of Linux security.
- ✓ After pivot_root plus umount of old root, /proc/1/root inside the container points to the overlayfs mount. There is zero reference to the host's / in the container's mount table. Verify with cat /proc/self/mountinfo.
Common Pitfalls
- ✗ Mistake: Using chroot for security isolation in production. Reality: chroot was designed for build environments (like debootstrap), not security boundaries. Any process with CAP_SYS_CHROOT can escape with the open-dir-chroot-fchdir technique. It was never meant to contain adversaries.
- ✗ Mistake: Forgetting to umount the old root after pivot_root. Reality: if put_old remains mounted, any process in the container that navigates to that path has full access to the host filesystem. runc unmounts it immediately and recursively. Skip this step and your container is not isolated.
- ✗ Mistake: Calling pivot_root without creating a mount namespace first. Reality: pivot_root modifies the root mount of the current namespace. Without CLONE_NEWNS, you would change the root for ALL processes sharing the namespace -- including the host init process. This would be catastrophic.
- ✗ Mistake: Not bind-mounting the new root onto itself before pivot_root. Reality: the kernel requires new_root to be a mount point. A plain directory will not work. You must run mount --bind /new_root /new_root first to satisfy this requirement.
Reference
In One Line
pivot_root plus umount of the old root gives real filesystem isolation; chroot was never a security boundary and never will be.