User Namespaces in Depth
Mental Model
Think of a foreign embassy on sovereign soil. Inside the embassy, the ambassador (UID 0) has full authority and can issue orders, sign documents, and grant access. But step outside the embassy gates and that ambassador is just another civilian with no special powers. The embassy walls are the user namespace boundary. The ambassador's credentials (UID 0) are only valid within that boundary. On the host country's soil, the ambassador is identified by their passport number (host UID 1000), not their rank.
The Problem
Consider a container that runs as root inside but is mapped to UID 1000 on the host. A file created as root (UID 0) inside appears as UID 1000 on the host filesystem. A second container with a different mapping sees that same file as "nobody" (65534) because the host UID does not fall within its mapped range. Shared volumes between containers with different UID mappings therefore produce permission errors that have nothing to do with the traditional Unix permission bits.
Architecture
A container process runs as root. It can install packages, bind to port 80, and modify /etc/passwd inside the container. On the host, that same process runs as UID 1000 with no special privileges. If it escapes the container, it lands as a regular user, not root.
This is what user namespaces provide: the mechanism that decouples "root inside the container" from "root on the host." Without an understanding of UID mapping, rootless containers, shared volume permissions, and the security boundaries they create remain opaque.
What Actually Happens
When a process calls unshare(CLONE_NEWUSER) or clone() with CLONE_NEWUSER, the kernel creates a new user_namespace structure. The process enters this new namespace, but initially has no valid UID or GID mapping. The kernel treats it as the overflow UID (65534) until mappings are written.
The mapping is configured by writing to /proc/PID/uid_map and /proc/PID/gid_map. Each line has three fields:
inside_start host_start count
Example mapping for a rootless container:
0 1000 1 # container UID 0 -> host UID 1000
1 100000 65535 # container UIDs 1-65535 -> host UIDs 100000-165534
This means:
- UID 0 inside the container (root) maps to host UID 1000 (an unprivileged user).
- UIDs 1 through 65535 inside map to host UIDs 100000 through 165534 (subordinate range).
- A host UID that no mapping line covers appears as 65534 (nobody) when viewed from inside the container, and a container UID that no line covers cannot be used at all: a setuid() call to it fails.
The mapping is immutable. Once written, it cannot be changed. This is a deliberate security decision: allowing runtime remapping would open privilege escalation paths.
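The translation rule can be sketched in a few lines of shell and awk. This is only an illustration of the arithmetic, not how the kernel implements it; the function name `ns_to_host_uid` is made up for this example, and the map is passed in as text using the same three-field format described above.

```shell
# Hypothetical helper: translate a UID used inside the namespace to the host
# UID, given uid_map text (inside_start host_start count, one range per line).
# Prints the host UID, or "unmapped" if no range covers the inside UID.
ns_to_host_uid() {
    inside_uid=$1
    map_text=$2
    printf '%s\n' "$map_text" | awk -v uid="$inside_uid" '
        NF >= 3 && uid + 0 >= $1 + 0 && uid + 0 < $1 + $3 {
            print $2 + (uid - $1); found = 1; exit
        }
        END { if (!found) print "unmapped" }
    '
}

# The example mapping from above (a real uid_map cannot contain comments).
example_map='0 1000 1
1 100000 65535'

ns_to_host_uid 0 "$example_map"      # container root
ns_to_host_uid 33 "$example_map"     # an ordinary service UID
ns_to_host_uid 70000 "$example_map"  # not covered by any range
```

With the example mapping, container UID 33 lands at host UID 100032 because it sits 32 positions into the range that starts at container UID 1.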
The Subordinate UID System
An unprivileged process can only map its own UID into the new namespace (a single mapping line: "0 1000 1"). To map additional UIDs, the system uses /etc/subuid and /etc/subgid:
alice:100000:65536
bob:165536:65536
This allocates host UIDs 100000-165535 exclusively to alice for use inside user namespaces. The newuidmap and newgidmap setuid helpers read these files and write the mapping on behalf of the unprivileged process after validating that the requested range falls within the allocation.
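The range check that the helpers perform can be approximated as follows. This is a simplified sketch, not the actual newuidmap logic (the real helper also lets the caller map their own UID and performs further checks); the function name `subuid_allows` and its argument order are assumptions made for this example.

```shell
# Hypothetical helper: succeed if [start, start+count) lies entirely inside
# one of USER's allocations in a subuid-format file (name:start:count).
# The file defaults to /etc/subuid and can be overridden for experiments.
subuid_allows() {
    user=$1
    req_start=$2
    req_count=$3
    file=${4:-/etc/subuid}
    awk -F: -v user="$user" -v s="$req_start" -v c="$req_count" '
        $1 == user && s + 0 >= $2 + 0 && s + c <= $2 + $3 { ok = 1 }
        END { exit !ok }
    ' "$file"
}

# Usage (commented out because it depends on the local /etc/subuid):
# subuid_allows alice 100000 65535 && echo "request fits alice's allocation"
```

A request for 100000 with count 65537 would be rejected under the alice entry above, because it reaches one ID past the end of her allocation.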
The sequence for creating a multi-UID user namespace as an unprivileged user:
1. Call unshare(CLONE_NEWUSER). The kernel creates the new namespace.
2. From outside the namespace (or using a helper), write "deny" to /proc/PID/setgroups.
3. Call newuidmap PID 0 1000 1 1 100000 65535.
4. Call newgidmap PID 0 1000 1 1 100000 65535.
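The setgroups deny step is mandatory when an unprivileged process writes gid_map itself; otherwise the kernel rejects the write with EPERM. Without it, a process could use setgroups(2) to drop supplementary groups and thereby regain access to files whose permissions deliberately deny a group it belongs to. A privileged helper such as newgidmap, which holds CAP_SETGID in the parent namespace, is not subject to this kernel check.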
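The setgroups, newuidmap, and newgidmap steps of the sequence can be laid out as a dry run. The sketch below only prints the commands; it does not execute them. The function name `userns_map_plan`, the PID 4242, and the assumption that /etc/subgid mirrors the /etc/subuid range are all illustrative.

```shell
# Hypothetical helper: print (not run) the mapping commands for a process
# that has already called unshare(CLONE_NEWUSER).
# Arguments: target PID, caller's host UID, caller's host GID, start of the
# subordinate range, and how many subordinate IDs to map from inside ID 1.
userns_map_plan() {
    pid=$1
    host_uid=$2
    host_gid=$3
    sub_start=$4
    sub_count=$5
    printf 'echo deny > /proc/%s/setgroups\n' "$pid"
    printf 'newuidmap %s 0 %s 1 1 %s %s\n' "$pid" "$host_uid" "$sub_start" "$sub_count"
    printf 'newgidmap %s 0 %s 1 1 %s %s\n' "$pid" "$host_gid" "$sub_start" "$sub_count"
}

# Placeholder PID and the ID values used in the running example.
userns_map_plan 4242 1000 1000 100000 65535
```

Printing the plan first makes it easy to check the arguments against /etc/subuid and /etc/subgid before anything is written, which matters because a uid_map or gid_map cannot be corrected once written.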
Under the Hood
Kernel UID translation. Every time the kernel checks file permissions, delivers a signal, or creates a file, it translates between namespace UIDs and host UIDs. The functions from_kuid() and make_kuid() in kernel/user_namespace.c handle this. File ownership stored on disk uses host UIDs. When a process inside a user namespace calls stat(), the kernel translates the on-disk host UID through the namespace's uid_map to produce the UID the process sees.
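The direction used by stat(), from on-disk host UID to the UID a process sees, can be mimicked in shell. This is a rough model of the lookup, not kernel code; `host_to_ns_uid` is a hypothetical name, and the overflow value 65534 is assumed (the live value is exposed in /proc/sys/kernel/overflowuid).

```shell
# Hypothetical helper: which UID does a file owned by HOST_UID show inside a
# namespace with this uid_map? Owners that no range covers fall back to the
# overflow UID (65534 assumed here).
host_to_ns_uid() {
    host_uid=$1
    map_text=$2
    printf '%s\n' "$map_text" | awk -v uid="$host_uid" -v overflow=65534 '
        NF >= 3 && uid + 0 >= $2 + 0 && uid + 0 < $2 + $3 {
            print $1 + (uid - $2); found = 1; exit
        }
        END { if (!found) print overflow }
    '
}

map_a='0 1000 1
1 100000 65535'

host_to_ns_uid 1000 "$map_a"     # files owned by the namespace owner
host_to_ns_uid 100032 "$map_a"   # a file created by inside UID 33
host_to_ns_uid 200000 "$map_a"   # a file from some other container's range
```

The last call is the shared-volume case: host UID 200000 has no entry in this map, so the owner is reported as 65534.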
Capability scoping. A process that is UID 0 inside a user namespace receives a full capability set, but those capabilities only apply within the namespace. CAP_SYS_ADMIN inside the namespace allows creating mount namespaces, network namespaces, and other nested namespaces. It does not grant any capability on the host. The kernel checks ns_capable() instead of capable() for namespace-scoped operations.
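One way to observe this is the CapEff line in /proc/PID/status, which is a hexadecimal bit mask. The sketch below decodes one bit of such a mask; `cap_mask_has` is a hypothetical helper, and the bit numbers come from linux/capability.h (CAP_SYS_ADMIN is 21, CAP_NET_BIND_SERVICE is 10). Note that the mask is always relative to the process's own user namespace, so a full mask inside a namespace says nothing about host privilege.

```shell
# Hypothetical helper: succeed if a CapEff-style hex mask (without the 0x
# prefix) has the given capability bit set.
cap_mask_has() {
    mask_hex=$1
    bit=$2
    [ $(( (0x$mask_hex >> bit) & 1 )) -eq 1 ]
}

# Literal masks for illustration: all of bits 0-40 set, and only bit 10 set.
cap_mask_has 000001ffffffffff 21 && echo "full mask: CAP_SYS_ADMIN present"
cap_mask_has 0000000000000400 21 || echo "bit 10 only: CAP_SYS_ADMIN absent"

# Against a live process (commented out because it depends on the system):
# cap_mask_has "$(awk '/^CapEff:/ { print $2 }' /proc/self/status)" 21
```

Running the live check once in a normal shell and once under `unshare --user --map-root-user` should show the bit absent and then present, even though the host-side identity of the process has not changed.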
The namespace hierarchy. User namespaces form a tree. Each namespace has a parent (except the initial namespace). When the kernel checks whether a process has permission to signal another process, it walks the namespace tree to find a common ancestor and translates UIDs at each level. This hierarchical model allows nested containers: a container inside a container, each with its own UID mapping.
idmapped mounts (Linux 5.12+). The classic problem: two containers with different UID mappings cannot share a volume because the on-disk UIDs only make sense to one mapping. idmapped mounts solve this by adding a translation layer at the VFS level. The mount_setattr() syscall with MOUNT_ATTR_IDMAP attaches a UID mapping to a mount point. When a process accesses a file through an idmapped mount, the kernel applies the mount's mapping in addition to the namespace's mapping. This allows the same filesystem to appear with correct ownership in multiple containers without changing any on-disk UIDs.
Security implications. User namespaces expand the kernel's attack surface. An unprivileged user can exercise kernel code paths (mount, network configuration, cgroup manipulation) that previously required root. Several privilege escalation CVEs have involved user namespaces as the entry point (CVE-2022-0185, CVE-2023-32233). Some distributions disable unprivileged user namespaces by default (kernel.unprivileged_userns_clone=0) and require explicit opt-in.
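A quick way to see whether a host restricts unprivileged user namespaces through sysctls is sketched below. The helper name `userns_status` and its messages are made up for this example. It assumes that kernel.unprivileged_userns_clone exists only on kernels carrying the distribution patch that adds it, and it also checks the upstream user.max_user_namespaces limit. Other mechanisms (for example LSM policy) can still block namespace creation, so a clean result is not a guarantee.

```shell
# Hypothetical helper: report sysctl-level restrictions on user namespaces.
# The first argument overrides the /proc/sys root, which is handy for testing.
userns_status() {
    proc_sys=${1:-/proc/sys}
    clone_knob=$proc_sys/kernel/unprivileged_userns_clone
    max_knob=$proc_sys/user/max_user_namespaces
    if [ -r "$clone_knob" ] && [ "$(cat "$clone_knob")" = "0" ]; then
        echo "disabled: unprivileged_userns_clone=0"
    elif [ -r "$max_knob" ] && [ "$(cat "$max_knob")" = "0" ]; then
        echo "disabled: max_user_namespaces=0"
    else
        echo "no sysctl restriction found"
    fi
}

userns_status
```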
Common Questions
Can a process inside a user namespace access files owned by UIDs outside its mapping?
Not as their owner. If the host UID of a file does not fall within any range in the namespace's uid_map, the file appears owned by the overflow UID (65534, nobody), and the process reaches it only through the file's "other" permission bits. The process cannot chown() or otherwise manipulate ownership of unmapped UIDs, even as namespace root. This is how user namespaces provide isolation: the process holds no privilege over UIDs it has no mapping for.
What happens to setuid binaries inside a user namespace?
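Setuid and setgid bits are honored only when the file's owner UID or GID has a mapping in the namespace. In that case the process's effective ID becomes the mapped in-namespace ID, so a setuid-root binary yields namespace root, not host root. If the owner has no mapping, the kernel silently ignores the bit. Filesystems mounted from inside a user namespace are further restricted so that setuid files on them confer nothing outside that namespace. Together these rules prevent a mapped "root" from exploiting a setuid binary to gain actual host root.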
How does the kernel prevent a user namespace from mapping arbitrary host UIDs?
An unprivileged process writing its own uid_map can only create a single mapping from namespace UID to its own host UID. Mapping to any other host UID requires the newuidmap setuid helper, which validates against /etc/subuid. A process with CAP_SETUID in the parent namespace can write arbitrary mappings. This two-tier model allows flexibility for administrators while preventing unprivileged users from impersonating other host users.
Why does Docker not use user namespaces by default?
Historically, user namespaces added complexity: volume permissions broke, some storage drivers required real root, and performance-sensitive paths had an extra translation step. Docker has supported userns-remap since version 1.10 but does not activate it by default, for backward compatibility. Podman, designed rootless from the start, uses user namespaces by default because it never had the legacy assumption of a root daemon.
How Technologies Use This
A Docker host running 60 containers in rootless mode serves a multi-tenant SaaS platform. Docker's rootless mode, introduced in Docker 19.03, runs dockerd itself as a regular user (UID 1000) inside a user namespace. The daemon, containerd, and all container processes map container UID 0 to host UID 1000 via the entry "0 1000 1" in /proc/PID/uid_map. A process that is root inside the container and creates files in a bind-mounted volume produces files owned by UID 1000 on the host filesystem.
The RootlessKit component sets up the user namespace and configures network isolation via slirp4netns or pasta. Each container's UID range is drawn from /etc/subuid, which allocates host UIDs 100000 through 165535 to the docker user. Container UID 1 maps to host UID 100000, UID 2 to 100001, and so on. The newuidmap setuid helper writes these multi-UID mappings because an unprivileged process can only map its own single UID without assistance. The overlay2 storage driver works inside the user namespace because Linux 5.11+ permits unprivileged overlayfs mounts within a user namespace.
If a container process exploits a vulnerability and escapes the container boundary, it lands on the host as UID 1000 with zero elevated capabilities. Without rootless mode, that same escape would grant host UID 0. The trade-off is that rootless Docker cannot bind to ports below 1024 without setting net.ipv4.ip_unprivileged_port_start=0 and cannot use storage drivers like devicemapper or btrfs that require real root privileges.
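The port rule reduces to a single comparison against the sysctl threshold: ports at or above net.ipv4.ip_unprivileged_port_start need no capability. The helper below is a hypothetical sketch of that comparison, with the threshold passed as an optional second argument (1024 assumed when omitted).

```shell
# Hypothetical helper: succeed if PORT can be bound without
# CAP_NET_BIND_SERVICE for a given ip_unprivileged_port_start threshold.
can_bind_unprivileged() {
    port=$1
    start=${2:-1024}
    [ "$port" -ge "$start" ]
}

can_bind_unprivileged 8080 && echo "8080: no capability needed"
can_bind_unprivileged 80 || echo "80: needs the capability or a lower threshold"
can_bind_unprivileged 80 0 && echo "80 with threshold 0: no capability needed"

# Against the live setting (commented out because it depends on the host):
# can_bind_unprivileged 80 "$(cat /proc/sys/net/ipv4/ip_unprivileged_port_start)"
```

Setting the threshold to 80 would unblock a web server just as well as setting it to 0, while leaving lower ports protected.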
A 200-pod Kubernetes cluster runs multi-tenant workloads where each tenant's pods must be isolated so that a container escape does not grant host root. KEP-127 (User Namespaces), which reached beta in Kubernetes 1.30, assigns each pod its own user namespace with a unique UID mapping. Pod A maps container UID 0 to host UID 200000, while Pod B maps container UID 0 to host UID 265536. Neither pod's "root" has any privilege on the host, and neither can access the other's files because the host UIDs do not overlap.
The kubelet coordinates with the container runtime (CRI-O or containerd) to allocate non-overlapping subordinate UID ranges from a pool. When a pod spec includes hostUsers: false, the runtime creates a user namespace before launching the container. The /proc/PID/uid_map inside the pod shows the assigned mapping. On the host, all container files appear owned by high-numbered UIDs (200000+) that belong to no real user. The kernel enforces this mapping on every credential check, file access, and signal delivery.
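A minimal manifest that opts a pod in might look like the output of the sketch below. The function name `userns_pod_manifest`, the pod name, the container name, and the image are placeholders chosen for this example; only the hostUsers field comes from the text above.

```shell
# Hypothetical helper: print a minimal Pod manifest that requests a per-pod
# user namespace via hostUsers: false.
userns_pod_manifest() {
    name=$1
    image=$2
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $name
spec:
  hostUsers: false
  containers:
  - name: app
    image: $image
    command: ["sleep", "infinity"]
EOF
}

userns_pod_manifest userns-demo debian
```

On a cluster whose runtime and kernel support the feature, piping this output to `kubectl apply -f -` and then reading /proc/self/uid_map inside the pod should show a mapping that does not start at host UID 0.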
The practical impact is defense in depth. Even if an attacker exploits a container runtime vulnerability to escape the mount and PID namespaces, the user namespace ensures they land as an unprivileged high-UID user on the host with no capabilities. Combined with seccomp profiles and AppArmor policies, per-pod user namespaces close the last major privilege escalation path that traditional Kubernetes containers left open. The requirement is a container runtime that supports idmapped mounts and a host kernel at version 6.3 or newer for full functionality.
A development team of 15 engineers runs Podman on shared Linux workstations to build and test microservices. Each engineer operates as a regular user (UIDs 1001 through 1015). Podman runs rootless whenever an unprivileged user invokes it, requiring no daemon and no root privileges. When engineer alice (UID 1001) runs "podman run -it fedora bash," Podman creates a user namespace that maps container UID 0 to host UID 1001 and maps container UIDs 1 through 65535 to host UIDs 100000 through 165534, as defined in alice's /etc/subuid entry "alice:100000:65536."
The newuidmap and newgidmap setuid helpers write the multi-UID mapping into /proc/PID/uid_map and /proc/PID/gid_map on alice's behalf. These helpers validate every requested range against /etc/subuid before writing, preventing alice from mapping UIDs that belong to another user's subordinate range. Inside the container, alice's processes see a full UID space from 0 to 65535. On the host filesystem, files created by container root are owned by alice herself (UID 1001), while files created by any other container UID are owned by UIDs in the 100000-165534 range, which belong exclusively to alice's subordinate allocation.
When engineer bob (UID 1002, subuid range 165536-231071) runs a separate Podman container on the same workstation, his container's files use a completely different host UID range. Alice's containers cannot read or write bob's container files because the UIDs do not overlap. If either container is compromised, the attacker has the host privileges of that specific engineer's unprivileged account, not root. Podman achieves all of this without a root daemon, without setuid binaries (other than the newuidmap/newgidmap helpers), and without any kernel modules beyond the built-in user namespace support.
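Whether two subordinate allocations are disjoint is a simple interval test. The sketch below uses a hypothetical helper, `subid_ranges_overlap`, and demonstrates it with the two /etc/subuid entries shown earlier (alice:100000:65536 and bob:165536:65536).

```shell
# Hypothetical helper: succeed if two ID ranges, each given as start and
# count, share at least one ID.
subid_ranges_overlap() {
    a_start=$1
    a_count=$2
    b_start=$3
    b_count=$4
    [ "$a_start" -lt $(( b_start + b_count )) ] &&
        [ "$b_start" -lt $(( a_start + a_count )) ]
}

if subid_ranges_overlap 100000 65536 165536 65536; then
    echo "ranges overlap"
else
    echo "ranges are disjoint"
fi
```

The second range starts exactly one ID past the end of the first, so the two are adjacent but disjoint.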
Same Concept Across Tech
| Technology | How it uses user namespaces | Key gotcha |
|---|---|---|
| Podman | Default rootless mode. All containers run in user namespace mapped via /etc/subuid | Shared volumes between pods with different mappings show files as nobody |
| Docker | Rootless mode via RootlessKit. Daemon itself runs in user namespace | Ports below 1024 require sysctl or setcap workaround |
| Kubernetes | Pod-level user namespaces (beta in 1.30). Each pod gets unique host UID range | Requires runtime support (CRI-O, containerd with idmapped mounts) |
| Flatpak | bubblewrap creates user namespace for app sandbox. App sees UID 0 inside | Host files shared into sandbox need correct permission setup |
| Chromium | Renderer processes sandboxed in user namespace. No host capabilities | Systems with unprivileged_userns_clone=0 break the sandbox |
Stack layer mapping (unexpected file ownership in container volumes):
| Layer | What to check | Tool |
|---|---|---|
| Application | Which UID is the process running as inside the container? | id command inside container |
| Container runtime | What UID mapping is configured? | cat /proc/PID/uid_map |
| Host | What is the actual host UID of the file? | ls -ln on the host filesystem |
| Kernel | Is the UID within the mapped range? | Compare file UID against uid_map ranges |
| Storage | Are idmapped mounts in use? | mount |
Design Rationale
Early Linux containers ran as real root on the host. A container escape meant full host compromise. User namespaces, merged in Linux 3.8, broke the assumption that UID 0 means host root. The mapping mechanism is intentionally one-shot and immutable to prevent privilege escalation through runtime remapping. The /etc/subuid delegation model follows the principle of least privilege: each user gets exactly the UID range the administrator allocates, nothing more. The newuidmap setuid helper exists because allowing arbitrary UID mapping would let any user impersonate any other user on the system.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Files in container volume owned by nobody (65534) | Host UID not in container's uid_map range | cat /proc/container_pid/uid_map |
| "Operation not permitted" creating user namespace | Kernel restricts unprivileged user namespaces | cat /proc/sys/kernel/unprivileged_userns_clone |
| newuidmap: write to uid_map failed | /etc/subuid missing entry for user | grep username /etc/subuid |
| EPERM writing gid_map | Did not write "deny" to setgroups first | cat /proc/PID/setgroups, write "deny" before gid_map |
| Container process cannot bind port 80 | Rootless container has no CAP_NET_BIND_SERVICE on host | sysctl net.ipv4.ip_unprivileged_port_start=0 |
| Two containers cannot share files | Non-overlapping UID mappings | Compare uid_map of both containers, consider idmapped mounts |
When to Use / Avoid
Relevant when:
- Running containers without a root daemon (Podman rootless, Docker rootless mode)
- Isolating multi-tenant workloads where container escape must not grant host root
- Sandboxing desktop applications (Flatpak, Chromium, Firefox)
- Understanding why files in container volumes show unexpected ownership
Watch out for:
- Shared volumes between containers with different UID mappings cause "nobody" ownership
- /etc/subuid must be configured before rootless containers work
- Some distributions disable unprivileged user namespaces by default (kernel.unprivileged_userns_clone=0)
- Performance overhead is negligible for most workloads but adds a translation step on every credential check
Try It Yourself
# Create a user namespace and become root inside
unshare --user --map-root-user bash -c "id; cat /proc/self/uid_map"

# Check the current UID mapping of a running container
cat /proc/$(pgrep -f "container_process")/uid_map

# View subordinate UID allocations
cat /etc/subuid && cat /etc/subgid

# Create a user namespace with custom mapping using newuidmap
# (assumes the caller's /etc/subuid range starts at 100000; the short sleep
# gives unshare time to create the namespace before the map is written)
unshare --user bash -c 'echo $$ && sleep 60' & CHILD=$!; sleep 1; newuidmap $CHILD 0 $(id -u) 1 1 100000 65535; cat /proc/$CHILD/uid_map

# List all user namespaces on the system
lsns -t user -o NS,PID,USER,COMMAND

# Check if a process runs in a non-initial user namespace
# (compare against PID 1; reading /proc/1/ns usually requires root)
readlink /proc/$$/ns/user; sudo readlink /proc/1/ns/user

# Trace UID mapping calls
strace -f -e trace=unshare,clone,write -p $(pgrep podman) 2>&1 | grep -E "uid_map|gid_map"

# Show file ownership from both perspectives
echo "Host view:" && ls -ln /tmp/container_vol/; echo "Container view:" && podman exec mycontainer ls -ln /data/

Debug Checklist
1. Check if user namespaces are enabled: cat /proc/sys/kernel/unprivileged_userns_clone
2. View UID mapping: cat /proc/<container_pid>/uid_map
3. Check subordinate UID allocation: grep <username> /etc/subuid
4. Verify setgroups status: cat /proc/<pid>/setgroups
5. List all user namespaces: lsns -t user
6. Check file ownership from host perspective: ls -ln /path/to/container/rootfs
7. Trace UID translation: strace -e trace=open,openat,stat container_command 2>&1 | grep -i perm
Key Takeaways
- ✓A process with CAP_SYS_ADMIN inside a user namespace has that capability only within the namespace. It can mount filesystems, create network namespaces, and manipulate cgroups scoped to its namespace. On the host, the same process runs as an unprivileged user with zero extra capabilities.
- ✓Writing to uid_map is a one-shot operation. Once written, the mapping is immutable for the lifetime of the namespace. There is no way to change or extend the mapping after the fact. Getting the mapping wrong means destroying and recreating the namespace.
- ✓Without a valid UID mapping, the kernel uses the overflow UID (65534, typically "nobody"). Any file owned by an unmapped host UID appears as owned by 65534 inside the namespace. This is why cross-container shared volumes show files as "nobody" when UID mappings do not overlap.
- ✓Linux 5.12 introduced idmapped mounts (mount_setattr with MOUNT_ATTR_IDMAP). This allows a single filesystem to be mounted with different UID translations for different containers, solving the shared volume problem without changing on-disk ownership. The translation happens at the VFS layer during path resolution.
- ✓The user namespace is the root of all other namespace capabilities. Creating a network namespace, PID namespace, or mount namespace as an unprivileged user requires first creating a user namespace. The user namespace grants the in-namespace CAP_SYS_ADMIN needed to create the others.
Common Pitfalls
- ✗Assuming "root inside the container" means root on the host. With user namespaces, container UID 0 maps to an unprivileged host UID. But without user namespaces (still Docker's default mode), container UID 0 is real host root. The difference is the presence or absence of the user namespace layer.
- ✗Forgetting to write "deny" to /proc/PID/setgroups before writing gid_map. The kernel requires this to prevent an unprivileged process from dropping supplementary groups it does not control. Omitting this step results in EPERM when writing gid_map and a confusing error message.
- ✗Sharing a volume between two containers with non-overlapping UID mappings. Container A maps host UID 100000 as its UID 0. Container B maps host UID 200000 as its UID 0. Files created by container A's root (host 100000) appear as nobody (65534) inside container B. Solutions: idmapped mounts, shared subordinate ranges, or running both containers with the same UID mapping.
- ✗Not allocating enough subordinate UIDs in /etc/subuid. A container running real services (systemd, multiple users) needs a range of at least 65536. Allocating only 1000 subordinate UIDs causes the container to fail when any process tries to use a UID above 1000 inside.
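Undersized allocations are easy to spot before they cause failures. The sketch below lists entries in a subuid-format file whose count is below a minimum; `small_subid_ranges` is a hypothetical helper, and the default threshold of 65536 is simply the figure suggested in the last pitfall.

```shell
# Hypothetical helper: print "user: count" for every entry in a
# subuid-format file (name:start:count) whose count is below MIN.
# FILE defaults to /etc/subuid and MIN to 65536.
small_subid_ranges() {
    file=${1:-/etc/subuid}
    min=${2:-65536}
    awk -F: -v min="$min" 'NF == 3 && $3 + 0 < min + 0 { print $1 ": " $3 }' "$file"
}

# Usage (commented out because it depends on the local files):
# small_subid_ranges /etc/subuid
# small_subid_ranges /etc/subgid
```

An empty result means every listed user has at least the minimum number of subordinate IDs; it does not show users who have no entry at all.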
Reference
In One Line
User namespaces remap UID 0 inside a container to an unprivileged host UID, making rootless containers possible and container escapes survivable.