Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)
Mental Model
Apartments in a building. Each one has its own address, mailbox, door lock, and view of who lives inside. Shared foundation, plumbing, electrical grid. Tenants cannot see into each other's units even though the walls are thin and the building is one structure.
The Problem
Multiple teams, one host. Every process can see every other process, bind to any port, read any file, and send signals to anything on the machine. A misbehaving service kills another team's database. Port conflicts block deployments. A filesystem change in one service corrupts another.
Architecture
Docker containers are not virtual machines.
They are regular Linux processes. The only difference is that each container gets its own private view of the system -- its own process list, its own network stack, its own filesystem. The host sees the container as PID 47832. Inside, it thinks it is PID 1.
That illusion is namespaces.
What Actually Happens
When Docker starts a container, runc calls clone() with flags like CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS. The kernel creates a new nsproxy structure pointing to new namespace objects and attaches it to the child process.
PID namespace: The first process gets PID 1 and takes on init responsibilities -- it reaps orphaned processes and receives SIGCHLD. The process also has a PID in every ancestor namespace (visible via NSpid in /proc/PID/status). If PID 1 exits, every process in the namespace gets SIGKILL.
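The NSpid line can be read as a chain from the outermost namespace to the innermost. A minimal sketch follows; `nspid_chain` is a hypothetical helper name and the sample status excerpt is made up for illustration, reusing the PIDs from the text above.

```shell
#!/usr/bin/env bash
# nspid_chain (hypothetical helper): print the NSpid entries of a
# /proc/<pid>/status file, outermost namespace first.
nspid_chain() {
  awk '$1 == "NSpid:" {
    for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? " -> " : "\n")
  }' "${1:-/proc/self/status}"
}

# Illustrative excerpt for a containerized process: PID 47832 on the host,
# PID 1 inside the container. Real files contain many more fields.
sample=$(mktemp)
printf 'Name:\tsleep\nNSpid:\t47832\t1\n' > "$sample"
nspid_chain "$sample"   # prints: 47832 -> 1
rm -f "$sample"
```

On a live system, `nspid_chain /proc/<pid>/status` run from the host shows one entry per nesting level; a process in the root PID namespace shows a single number.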
Network namespace: Starts with only a loopback device. To connect it to the outside, a veth pair is created -- a virtual ethernet cable with one end in each namespace. Docker puts one end on the docker0 bridge, assigns an IP from 172.17.0.0/16, and sets up NAT rules for outbound traffic.
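The wiring described above can be approximated by hand with ip(8). The sketch below only prints the commands rather than running them; the names `demo-ns`, `veth-host`, `veth-ctr` and the address 172.17.0.2 are illustrative assumptions, and `veth_plan` is a hypothetical helper, not what Docker itself executes.

```shell
#!/usr/bin/env bash
# veth_plan (hypothetical helper): print the ip(8) commands that connect a
# new network namespace to an existing bridge. Nothing is executed here.
veth_plan() {
  local ns=$1 bridge=$2 addr=$3 gateway=$4
  cat <<EOF
ip netns add $ns
ip link add veth-host type veth peer name veth-ctr
ip link set veth-ctr netns $ns
ip link set veth-host master $bridge
ip link set veth-host up
ip netns exec $ns ip addr add $addr dev veth-ctr
ip netns exec $ns ip link set lo up
ip netns exec $ns ip link set veth-ctr up
ip netns exec $ns ip route add default via $gateway
EOF
}

# Assumed values: the docker0 bridge at 172.17.0.1 and a free address in its subnet.
veth_plan demo-ns docker0 172.17.0.2/16 172.17.0.1
```

Printing the plan keeps the example safe to run anywhere. On a disposable test machine the output could be reviewed and then piped to `sudo sh`.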
Mount namespace + pivot_root: The container runtime builds a filesystem (usually an overlayfs stack of image layers), enters a mount namespace, and calls pivot_root() to swap the root. The host filesystem becomes invisible. Bind mounts selectively expose host paths.
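The root swap can be sketched with the mount(8) and pivot_root(8) utilities. This is a simplified outline, not runc's actual sequence: it assumes it runs as root inside a fresh mount namespace (for example one started with `unshare --mount`), that ROOTFS points at an unpacked image containing its own `umount` and `rmdir`, and `swap_root` and `run` are hypothetical helper names. With DRY_RUN=1, the default, the commands are only printed.

```shell
#!/usr/bin/env bash
DRY_RUN=${DRY_RUN:-1}

# run: print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then printf '+ %s\n' "$*"; else "$@"; fi
}

swap_root() {
  local rootfs=$1
  run mount --make-rprivate /            # keep mount events from propagating to the host
  run mount --bind "$rootfs" "$rootfs"   # pivot_root requires new_root to be a mount point
  run cd "$rootfs"
  run mkdir -p .oldroot
  run pivot_root . .oldroot              # new root becomes /, old root moves to /.oldroot
  run cd /
  run umount -l /.oldroot                # detach the host filesystem from view
  run rmdir /.oldroot
}

# /tmp/rootfs is a placeholder path, used only when ROOTFS is unset.
swap_root "${ROOTFS:-/tmp/rootfs}"
```

The final lazy unmount is what makes the host filesystem invisible: after it, no path inside the namespace leads back to the old root.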
User namespace: Maps UID/GID ranges between namespaces. UID 0 inside can map to UID 100000 on the host. This is the foundation of rootless containers -- the container has "root" capabilities inside its namespace while remaining unprivileged on the host.
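The mapping itself is plain arithmetic over the lines of /proc/PID/uid_map, each of which reads "inside-start outside-start length". A small sketch of that translation follows; `map_uid` is a hypothetical helper and the sample map mirrors the 0 to 100000 example above.

```shell
#!/usr/bin/env bash
# map_uid (hypothetical helper): translate a UID as seen inside a user
# namespace to the UID seen by the parent namespace, given a uid_map-style
# file with lines of the form "<inside-start> <outside-start> <length>".
map_uid() {
  local uid=$1 map_file=$2
  awk -v uid="$uid" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { if (!found) print "unmapped" }
  ' "$map_file"
}

# Illustrative map: inside UIDs 0..65535 map to host UIDs 100000..165535.
map=$(mktemp)
printf '0 100000 65536\n' > "$map"
map_uid 0 "$map"       # prints: 100000
map_uid 33 "$map"      # prints: 100033
map_uid 70000 "$map"   # prints: unmapped
rm -f "$map"
```

Against a running container, the same helper can be pointed at `/proc/<pid>/uid_map`. A UID outside every mapped range has no host identity, which is why files owned by such UIDs show up as the overflow user.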
Under the Hood
Namespace lifecycle. A namespace exists as long as at least one process is a member, an open file descriptor refers to it, or a bind-mount of /proc/PID/ns/<type> exists. When the last reference drops, the namespace and its resources are destroyed. That is why the Kubernetes "pause container" exists -- it keeps the network namespace alive even when application containers restart.
Mount propagation matters. Mount namespaces support propagation modes: shared, private, slave, unbindable. With shared propagation, mounts in one namespace appear in another. The unshare(1) utility sets propagation to private in the new namespace by default, whereas the raw unshare() and clone() syscalls copy the parent's propagation settings unchanged. Getting this wrong leads to mount leaks or missing device access inside containers.
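The propagation mode of each mount is recorded in the optional fields of /proc/self/mountinfo (`shared:N`, `master:N`, `unbindable`). A rough sketch of decoding them follows; `propagation_of` is a hypothetical helper and the three sample lines are invented for illustration.

```shell
#!/usr/bin/env bash
# propagation_of (hypothetical helper): read mountinfo-format lines on stdin
# and print "<mount point> <mode>". Optional fields start at field 7 and end
# at the "-" separator; a mount with no propagation tags is private.
propagation_of() {
  awk '{
    mode = ""
    for (i = 7; i <= NF && $i != "-"; i++) {
      if ($i ~ /^shared:/)         mode = mode (mode ? "," : "") "shared"
      else if ($i ~ /^master:/)    mode = mode (mode ? "," : "") "slave"
      else if ($i == "unbindable") mode = "unbindable"
    }
    print $5, (mode ? mode : "private")
  }'
}

# Illustrative input; on a real system use: propagation_of < /proc/self/mountinfo
propagation_of <<'EOF'
22 1 8:1 / / rw,relatime shared:1 - ext4 /dev/sda1 rw
40 22 0:35 / /mnt/a rw master:1 - tmpfs tmpfs rw
41 22 0:36 / /mnt/b rw - tmpfs tmpfs rw
EOF
# prints:
# / shared
# /mnt/a slave
# /mnt/b private
```

`findmnt -o TARGET,PROPAGATION` reports the same information where util-linux is available; the awk version is mainly useful for seeing where that data comes from.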
setns() is the key to docker exec. setns() takes a file descriptor (from opening /proc/PID/ns/<type>) and moves the calling process into that namespace. docker exec opens the target container's namespace FDs and calls setns() for each one before exec'ing the command.
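One way to observe the effect of setns() from a shell is to compare the namespace links of two processes: after a successful join, the /proc/PID/ns/<type> symlinks of both point at the same namespace identifier. The sketch below assumes a hypothetical helper `same_ns`; the PROC_ROOT variable exists only so the helper can be exercised against a fake directory tree.

```shell
#!/usr/bin/env bash
PROC_ROOT=${PROC_ROOT:-/proc}

# same_ns (hypothetical helper): succeed if two PIDs are in the same
# namespace of the given type. Returns 2 if either link cannot be read
# (missing process or insufficient privileges).
same_ns() {
  local type=$1 pid_a=$2 pid_b=$3 a b
  a=$(readlink "$PROC_ROOT/$pid_a/ns/$type" 2>/dev/null) || return 2
  b=$(readlink "$PROC_ROOT/$pid_b/ns/$type" 2>/dev/null) || return 2
  [ "$a" = "$b" ]
}

# Compare this shell with its parent for the classic namespace types.
for type in pid net mnt uts ipc user; do
  if same_ns "$type" "$$" "$PPID"; then
    echo "$type: shared"
  else
    echo "$type: different or unreadable"
  fi
done
```

To check a docker exec session, the same comparison would be run (as root) between the exec'd process and the container's main process, using their host PIDs.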
User namespaces have security implications. A process with UID 0 in a user namespace can exercise capabilities that interact with kernel code paths normally restricted to true root. Kernel vulnerabilities in these paths become exploitable by unprivileged users. That is why some distributions disable unprivileged user namespaces (kernel.unprivileged_userns_clone=0).
Common Questions
How does Docker implement container networking?
Docker creates a network namespace per container, creates a veth pair, places one end in the container and the other on the docker0 bridge. It assigns an IP from a private subnet, sets up a default route via the bridge, and configures iptables MASQUERADE for outbound traffic. Port publishing (-p 8080:80) adds DNAT rules.
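For a sense of what those NAT rules look like, the sketch below prints rules in the general shape that `iptables -t nat -S` tends to show on a Docker host using the iptables backend. The exact rules vary by Docker version and configuration; `publish_rules` is a hypothetical helper, and the container address 172.17.0.2 is an assumed example.

```shell
#!/usr/bin/env bash
# publish_rules (hypothetical helper): print approximate NAT rules for one
# bridge subnet and one published TCP port. Nothing is applied to the system.
publish_rules() {
  local bridge=$1 subnet=$2 host_port=$3 ctr_ip=$4 ctr_port=$5
  # Outbound: masquerade traffic from the bridge subnet that leaves via another interface.
  echo "iptables -t nat -A POSTROUTING -s $subnet ! -o $bridge -j MASQUERADE"
  # Inbound: publishing host_port:ctr_port becomes a DNAT to the container address.
  # DOCKER is a chain that Docker creates in the nat table.
  echo "iptables -t nat -A DOCKER ! -i $bridge -p tcp --dport $host_port -j DNAT --to-destination $ctr_ip:$ctr_port"
}

# Equivalent of -p 8080:80 for an assumed container at 172.17.0.2.
publish_rules docker0 172.17.0.0/16 8080 172.17.0.2 80
```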
What happens when PID 1 in a namespace crashes?
The kernel sends SIGKILL to every process in that namespace -- the entire container is torn down. This is different from the host's PID 1: if the host init dies, the kernel panics. In a namespace, it is a contained shutdown. That is why container init processes (tini, dumb-init) are kept deliberately simple.
How do Kubernetes pods share networking but not filesystems?
All containers in a pod share one network namespace (same IP, same ports, same localhost). But each container gets its own PID and mount namespace (different filesystem views, different process lists). The pause container creates and holds the shared network namespace. Application containers join it via setns().
How Technologies Use This
Running 50 isolated workloads on one host where each thinks it is PID 1 with its own IP address sounds like it requires VMs, but spinning up 50 VMs would consume 25GB of overhead and take minutes to boot. The workloads must be isolated from each other but share the host kernel.
The trick is that namespaces give each process its own private view of system resources without duplicating the kernel. Docker's runc calls clone() with CLONE_NEWPID, CLONE_NEWNET, and CLONE_NEWNS to create isolated views in under 100ms. Each container sees itself as PID 1 with its own /proc, its own 172.17.x.x address, and its own overlayfs root, while the host sees it as just PID 47832.
This approach consumes about 50MB of overhead per container instead of the 500MB a VM would require. The takeaway is that containers are not lightweight VMs -- they are regular processes with namespace-isolated views of the system, which is why they start in milliseconds and share the host's memory page cache.
Two containers in the same pod need to share localhost on port 8080 for sidecar communication, but giving them full shared access breaks filesystem security. Fully isolating them breaks the sidecar pattern entirely because they cannot reach each other over 127.0.0.1.
The solution lies in selective namespace sharing. Kubernetes shares one network namespace across all containers in a pod so they see the same IP and can communicate over localhost, while giving each container its own PID and mount namespaces for separate filesystems and process lists. The pause container holds the shared network namespace alive even when application containers restart.
Without the pause container, every application container crash would destroy the shared network namespace, causing a 2-3 second network reconfiguration gap. The lesson is that Kubernetes pods are not just groups of containers -- they are carefully orchestrated namespace-sharing arrangements where each namespace type is shared or isolated independently based on the communication and security requirements.
A compromised web server service reads /home/admin/.ssh/id_rsa and exfiltrates the private key. Without namespace isolation, every service on the host shares the same filesystem and network view, so any service can see every other service's files and sniff its traffic.
The underlying problem is that traditional Linux services all run in the host's single namespace. A vulnerability in any one of them gives the attacker the same view of the system as every other service. There is no isolation boundary between services unless one is explicitly created.
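systemd solves this without requiring Docker or any container runtime. PrivateNetwork= gives a service its own empty network stack with only loopback. ProtectHome=true makes /home, /root, and /run/user inaccessible and empty for the service (ProtectHome=tmpfs mounts an empty tmpfs over them instead). DynamicUser= runs the service under an ephemeral UID/GID that is allocated at start and released at stop; it does not create a user namespace (PrivateUsers= does that). Together, these three directives sharply reduce the blast radius of a service compromise.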
A malicious webpage exploits a renderer vulnerability and calls kill() on an SSH session, or reads /proc/1/environ to steal host secrets. Without PID namespace isolation, the compromised renderer process can enumerate and signal every process on the machine.
The core issue is that a renderer tab is a regular process that, without namespace isolation, shares the host's PID space. It can see every process in /proc, send signals to any process owned by the same user, and read environment variables that may contain credentials. A single tab exploit becomes a full host compromise.
Chrome puts each renderer tab in its own PID and user namespace, so the tab process sees only itself in /proc and its UID 0 maps to an unprivileged host UID like 100000. Even a full renderer exploit is confined to a view of exactly 1 process, which sharply limits what the attacker can enumerate or signal.
Same Concept Across Tech
| Technology | Which namespaces it uses | Purpose |
|---|---|---|
| Docker | pid, net, mnt, uts, ipc by default; cgroup on cgroup v2 hosts; user only with userns-remap or rootless mode | Full container isolation |
| Kubernetes | Inherits from container runtime, adds network namespace per pod | Pod-level networking, containers in a pod share net namespace |
| Chrome | User + PID + network namespaces for renderer processes | Sandboxing untrusted web content |
| systemd | PrivateNetwork, ProtectHome, PrivateTmp use mount and network namespaces | Service hardening without containers |
| Flatpak | User + mount + PID namespaces | Desktop app sandboxing |
Stack layer mapping (container networking issue):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app binding to the right interface/port? | Application logs, ss -tlnp |
| Container runtime | Is the network namespace correctly configured? | docker inspect, lsns -t net |
| Network namespace | Are routes and iptables correct inside the namespace? | nsenter -n -t PID ip route, iptables -L |
| Host kernel | Is IP forwarding enabled? Are veth pairs connected? | sysctl net.ipv4.ip_forward, ip link show |
| Hardware | Physical NIC configuration, MTU mismatches? | ethtool, ip link show |
Design Rationale
Duplicating the entire kernel per workload (VMs) wastes memory and adds seconds of boot time. A single global resource view lets any process see every PID, bind any port, read any file -- fundamentally unsafe for multi-tenant hosting. Splitting isolation into orthogonal namespace types lets runtimes compose exactly the boundaries they need. Kubernetes shares a network namespace within a pod (sidecar over localhost) while isolating mount namespaces (separate filesystems) -- a level of granularity that a monolithic VM-or-nothing model cannot express.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Container cannot reach the internet | Network namespace missing default route or NAT rule | nsenter -n -t PID ip route |
| Process sees PID 1 but host shows PID 47832 | Normal PID namespace behavior, not a bug | cat /proc/PID/status, check NSpid line |
| Container can see host processes | PID namespace not applied, or container running with --pid=host | ls /proc inside container, check docker run flags |
| Permission denied despite running as root inside container | User namespace mapping. Root inside container maps to unprivileged user on host | cat /proc/PID/uid_map |
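| mount/umount fails inside container | Container lacks CAP_SYS_ADMIN (dropped by default in Docker), the filesystem is mounted read-only, or propagation is set to slave | Check CapEff in /proc/self/status and propagation with findmnt -o TARGET,PROPAGATION |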
| Container escape vulnerability | Missing namespace + seccomp + capability restrictions | Audit all three layers, not just namespaces |
When to Use / Avoid
Use namespaces when:
- Building container runtimes (Docker, containerd, Podman all use namespaces)
- Isolating network stacks for testing (each namespace gets its own interfaces, routes, iptables)
- Running untrusted code that should not see host processes or filesystem
- Creating sandboxed build environments
Avoid when:
- Strong security isolation is required (namespaces are not a security boundary on their own, combine with seccomp and capabilities)
- Hardware-level isolation is needed (use VMs instead)
- The overhead of managing namespace lifecycle is not worth the isolation benefit
Try It Yourself
# List all namespaces on the system
lsns --output-all | head -20

# Create a minimal container with PID, mount, network, and user namespaces
# (--map-root-user maps the current user to UID 0 inside the new user namespace)
unshare --user --map-root-user --pid --fork --mount-proc --net bash -c 'echo PID=$$; hostname; ip link; ls /proc | head -5; exit'

# Check which namespaces a process belongs to
ls -la /proc/$$/ns/

# Enter the network namespace of a running Docker container, then clean up
docker run -d --name test-ns alpine sleep 3600; PID=$(docker inspect -f '{{.State.Pid}}' test-ns); echo "Container PID: $PID"; sudo nsenter -t "$PID" -n ip addr; docker rm -f test-ns

# Create a network namespace with veth pair
sudo ip netns add testns && sudo ip link add veth0 type veth peer name veth1 && sudo ip link set veth1 netns testns && sudo ip netns exec testns ip addr show; sudo ip netns del testns

# Show PID namespace hierarchy
grep -E '^(NSpid|NStgid|NSsid)' /proc/$$/status

Debug Checklist
1. List namespaces of a process: ls -la /proc/<pid>/ns/
2. Enter a container's namespace: nsenter -t <pid> -m -u -i -n -p
3. List network namespaces: ip netns list (named namespaces only; lsns -t net also shows unnamed ones such as Docker's)
4. Check PID namespace: grep NSpid /proc/<pid>/status
5. Run a command in a new namespace: unshare --pid --fork --mount-proc bash
6. Check user namespace mappings: cat /proc/<pid>/uid_map
Key Takeaways
- 8 namespace types as of kernel 5.6: mount, UTS (hostname), IPC, network, PID, user, cgroup, and time. The time namespace lets containers have different CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets -- useful for container migration.
- unshare() creates new namespaces for the calling process (for a PID namespace, the caller itself stays put and only its subsequently created children enter the new namespace). clone() creates them for a child. setns() joins an existing namespace by fd from /proc/PID/ns/<type>. That last one is how 'docker exec' enters a running container.
- If PID 1 in a PID namespace exits, every other process in that namespace gets SIGKILL. The entire container is torn down. That is why containers need an init process like tini or dumb-init for signal forwarding and zombie reaping.
- Mount namespaces plus pivot_root() are what make the container's rootfs appear as /. Docker uses overlayfs for layered images, then pivot_root to swap the root. The host filesystem becomes invisible.
- User namespaces are the key to rootless containers. An unprivileged user creates a user namespace, gets full capabilities inside it, and can then create all other namespace types. No real root needed.
Common Pitfalls
- Mistake: Expecting /proc to be isolated after entering a PID namespace. Reality: You must mount a new procfs (mount -t proc proc /proc) or /proc still shows the host's process list.
- Mistake: Wondering why network is broken in a new namespace. Reality: Network namespaces start with only loopback. You must create veth pairs, assign IPs, and set up routing -- or use a CNI plugin.
- Mistake: Thinking PID namespace isolation is absolute. Reality: PID namespaces are hierarchical. The parent can see all child PIDs. kill() from the host can target container processes by their host PID. This is by design.
- Mistake: Using chroot for container filesystem isolation. Reality: chroot is trivially escapable by a process that holds CAP_SYS_CHROOT (keep an fd open to a directory outside the new root, chroot to a subdirectory, fchdir back, then walk up with chdir("..")). pivot_root in a mount namespace has no such escape once the old root is unmounted.
Reference
In One Line
Containers are just processes with namespace-isolated views -- understand unshare, clone, and setns and the magic disappears.