Linux Namespaces (PID, NET, MNT, UTS, IPC, USER)

Docker, Kubernetes, systemd, Chrome

🧠

Mental Model

Apartments in a building. Each one has its own address, mailbox, door lock, and view of who lives inside. Shared foundation, plumbing, electrical grid. Tenants cannot see into each other's units even though the walls are thin and the building is one structure.

💡

The Problem

Multiple teams, one host. Without isolation, every process can see every other process, compete for the same ports, reach any file its user is allowed to read, and signal any process running under the same UID (or anything at all, if it runs as root). A misbehaving service kills another team's database. Port conflicts block deployments. A filesystem change in one service corrupts another.

Architecture

Docker containers are not virtual machines.

They are regular Linux processes. The only difference is that each container gets its own private view of the system -- its own process list, its own network stack, its own filesystem. The host sees the container as PID 47832. Inside, it thinks it is PID 1.

That illusion is namespaces.

What Actually Happens

When Docker starts a container, runc calls clone() with flags like CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS. The kernel creates a new nsproxy structure pointing to new namespace objects and attaches it to the child process.

PID namespace: The first process gets PID 1 and takes on init responsibilities -- it reaps orphaned processes and receives SIGCHLD. The process also has a PID in every ancestor namespace (visible via NSpid in /proc/PID/status). If PID 1 exits, every process in the namespace gets SIGKILL.
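
The NSpid line lists a process's PID in each nested PID namespace, outermost first. The sketch below pulls those values out of status text supplied on stdin, so it can be tried on sample data; the function name nspid_chain is illustrative, not a standard tool.

```shell
# nspid_chain: read /proc/<pid>/status text on stdin and print the NSpid values,
# one per line, outermost (host-side) PID namespace first.
nspid_chain() {
  awk '$1 == "NSpid:" { for (i = 2; i <= NF; i++) print $i }'
}

# Sample status excerpt for a containerized process:
# PID 47832 as seen from the host, PID 1 inside its own namespace.
printf 'Name:\tsleep\nPid:\t47832\nNSpid:\t47832\t1\n' | nspid_chain

# The innermost PID is the last value in the chain.
printf 'Name:\tsleep\nPid:\t47832\nNSpid:\t47832\t1\n' | nspid_chain | tail -n 1
```

On a live system the same function can be fed directly: nspid_chain < /proc/<pid>/status.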

Network namespace: Starts with only a loopback device. To connect it to the outside, a veth pair is created -- a virtual ethernet cable with one end in each namespace. Docker puts one end on the docker0 bridge, assigns an IP from 172.17.0.0/16, and sets up NAT rules for outbound traffic.

Mount namespace + pivot_root: The container runtime builds a filesystem (usually an overlayfs stack of image layers), enters a mount namespace, and calls pivot_root() to swap the root. The host filesystem becomes invisible. Bind mounts selectively expose host paths.

User namespace: Maps UID/GID ranges between namespaces. UID 0 inside can map to UID 100000 on the host. This is the foundation of rootless containers -- the container has "root" capabilities inside its namespace while remaining unprivileged on the host.
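
Each line of /proc/PID/uid_map has three fields: the first ID inside the namespace, the first ID outside, and the length of the range. A minimal sketch of the translation follows, assuming the map arrives on stdin; map_uid is a hypothetical helper name.

```shell
# map_uid INSIDE_UID: read uid_map-style lines on stdin
# ("<first-inside-id> <first-outside-id> <count>") and print the outside ID
# that INSIDE_UID maps to. Exits non-zero if the ID falls in no range.
map_uid() {
  awk -v uid="$1" '
    uid >= $1 && uid < $1 + $3 { print $2 + (uid - $1); found = 1; exit }
    END { exit (found ? 0 : 1) }
  '
}

# The mapping from the text: 65536 IDs starting at host UID 100000.
echo '0 100000 65536' | map_uid 0       # container root
echo '0 100000 65536' | map_uid 1000    # an ordinary user inside
echo '0 100000 65536' | map_uid 70000 || echo "unmapped"
```

An ID outside every range has no host-side identity; the kernel shows such owners as the overflow UID (usually 65534, "nobody").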

Under the Hood

Namespace lifecycle. A namespace exists as long as at least one process is a member, an open file descriptor refers to it, or a bind mount of /proc/PID/ns/<type> exists. When the last reference drops, the namespace and its resources are destroyed. That is why the Kubernetes "pause container" exists -- it holds the network namespace alive even when application containers restart.

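Mount propagation matters. Mount namespaces support propagation modes: shared, private, slave, unbindable. With shared propagation, mounts made in one namespace appear in its peers. The unshare(1) tool marks everything private in the new namespace by default, whereas a raw unshare(CLONE_NEWNS) or clone() call copies the parent's propagation settings, which on systemd hosts usually means shared. Getting this wrong leads to mount leaks or missing device access inside containers.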
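
The propagation mode of each mount is recorded in the optional fields of /proc/PID/mountinfo (between the sixth field and the "-" separator). As a rough sketch, the hypothetical helper below classifies mountinfo-style lines read from stdin; the sample lines are made up for illustration.

```shell
# propagation_of: read mountinfo-style lines on stdin and print
# "<mount point> <propagation>" for each one.
# Optional fields: shared:N -> shared, master:N -> slave, unbindable.
# A mount with none of these is private.
propagation_of() {
  awk '{
    prop = ""
    for (i = 7; i <= NF && $i != "-"; i++) {
      if ($i ~ /^shared:/)         prop = prop (prop ? "," : "") "shared"
      else if ($i ~ /^master:/)    prop = prop (prop ? "," : "") "slave"
      else if ($i == "unbindable") prop = prop (prop ? "," : "") "unbindable"
    }
    print $5, (prop ? prop : "private")
  }'
}

# Three invented mounts: one shared, one slave, one private.
printf '%s\n' \
  '36 35 98:0 / /mnt/data rw,noatime shared:12 - ext4 /dev/sda1 rw' \
  '37 35 0:40 / /run/secrets rw master:3 - tmpfs tmpfs rw' \
  '38 35 0:41 / /proc rw,nosuid - proc proc rw' | propagation_of
```

On a real host, propagation_of < /proc/self/mountinfo gives the same information as findmnt -o TARGET,PROPAGATION.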

setns() is the key to docker exec. setns() takes a file descriptor (from opening /proc/PID/ns/<type>) and moves the calling process into that namespace. docker exec opens the target container's namespace FDs and calls setns() for each one before exec'ing the command.

User namespaces have security implications. A process with UID 0 in a user namespace can exercise capabilities that interact with kernel code paths normally restricted to true root. Kernel vulnerabilities in these paths become exploitable by unprivileged users. That is why some distributions disable unprivileged user namespaces (for example kernel.unprivileged_userns_clone=0 on Debian and Ubuntu kernels that carry that sysctl, or user.max_user_namespaces=0 elsewhere).
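
Whether unprivileged user namespaces are switched off can be checked by reading the relevant sysctl files. The sketch below looks at two knobs: kernel.unprivileged_userns_clone (present only on kernels that carry that distribution patch) and the upstream user.max_user_namespaces limit. The function name and its optional directory argument are assumptions made so the check can be pointed at a test directory; other restrictions, such as LSM policy, are not covered.

```shell
# userns_status [PROC_SYS_DIR]: report whether unprivileged user namespaces
# appear to be disabled by sysctl. PROC_SYS_DIR defaults to /proc/sys.
# The body runs in a subshell so its variables stay local.
userns_status() (
  root=${1:-/proc/sys}
  clone="$root/kernel/unprivileged_userns_clone"   # distribution-specific knob
  max="$root/user/max_user_namespaces"             # upstream per-namespace limit
  if [ -r "$clone" ] && [ "$(cat "$clone")" = "0" ]; then
    echo "disabled (kernel.unprivileged_userns_clone=0)"
  elif [ -r "$max" ] && [ "$(cat "$max")" = "0" ]; then
    echo "disabled (user.max_user_namespaces=0)"
  else
    echo "enabled or not restricted by these sysctls"
  fi
)

userns_status
```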

Common Questions

How does Docker implement container networking?

Docker creates a network namespace per container, creates a veth pair, places one end in the container and the other on the docker0 bridge. It assigns an IP from a private subnet, sets up a default route via the bridge, and configures iptables MASQUERADE for outbound traffic. Port publishing (-p 8080:80) adds DNAT rules.
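
The steps above can be approximated by hand with ip and iptables. The following is a dry-run sketch, not Docker's actual implementation: every name and address in it (demo-ns, br-demo, veth-host, veth-ctr, 172.18.0.0/24, port 8080 to 80) is invented for illustration. By default it only prints the commands; setting DRY_RUN=0 would execute them and requires root.

```shell
# Print each command instead of running it unless DRY_RUN=0 is set.
DRY_RUN=${DRY_RUN:-1}
run() {
  if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

setup_demo_net() {
  # Namespace and bridge (the bridge plays the role of docker0).
  run ip netns add demo-ns
  run ip link add br-demo type bridge
  run ip addr add 172.18.0.1/24 dev br-demo
  run ip link set br-demo up
  # veth pair: one end on the bridge, the other moved into the namespace.
  run ip link add veth-host type veth peer name veth-ctr
  run ip link set veth-host master br-demo
  run ip link set veth-host up
  run ip link set veth-ctr netns demo-ns
  # Address, loopback, and default route inside the namespace.
  run ip netns exec demo-ns ip addr add 172.18.0.2/24 dev veth-ctr
  run ip netns exec demo-ns ip link set veth-ctr up
  run ip netns exec demo-ns ip link set lo up
  run ip netns exec demo-ns ip route add default via 172.18.0.1
  # Forwarding, outbound NAT, and one published port (host 8080 -> 80 inside).
  run sysctl -w net.ipv4.ip_forward=1
  run iptables -t nat -A POSTROUTING -s 172.18.0.0/24 ! -o br-demo -j MASQUERADE
  run iptables -t nat -A PREROUTING -p tcp --dport 8080 -j DNAT --to-destination 172.18.0.2:80
}

setup_demo_net
```

Printing first makes it easy to review the plan before touching host networking; cleanup (deleting the namespace, bridge, and iptables rules) is left out of the sketch.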

What happens when PID 1 in a namespace crashes?

The kernel sends SIGKILL to every process in that namespace -- the entire container is torn down. This is different from the host's PID 1: if the host init dies, the kernel panics. In a namespace, it is a contained shutdown. That is why container init processes (tini, dumb-init) are kept deliberately simple.

How do Kubernetes pods share networking but not filesystems?

All containers in a pod share one network namespace (same IP, same ports, same localhost), and typically the IPC and UTS namespaces as well. By default each container still gets its own PID and mount namespace (different process lists, different filesystem views); setting shareProcessNamespace: true in the pod spec shares the PID namespace too. The pause container creates and holds the shared network namespace. Application containers join it via setns().
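
Two processes share a namespace exactly when their /proc/PID/ns/<type> links point at the same object. A small sketch of that comparison follows; ns_compare and its optional proc-root argument are hypothetical, and the demo runs against a fake directory tree with made-up inode numbers, laid out the way two containers of one pod would typically look.

```shell
# ns_compare PID_A PID_B [PROC_ROOT]: for each namespace type, report whether
# the two processes share it, by comparing the ns symlink targets.
# PROC_ROOT defaults to /proc. Types that cannot be read are skipped.
ns_compare() (
  root=${3:-/proc}
  for t in net ipc uts pid mnt; do
    a=$(readlink "$root/$1/ns/$t") || continue
    b=$(readlink "$root/$2/ns/$t") || continue
    if [ "$a" = "$b" ]; then echo "$t shared"; else echo "$t isolated"; fi
  done
)

# Fake /proc tree: PIDs 100 and 200 stand in for two containers of one pod.
demo=$(mktemp -d)
mkdir -p "$demo/100/ns" "$demo/200/ns"
for t in net ipc uts; do
  ln -s "$t:[4026532001]" "$demo/100/ns/$t"
  ln -s "$t:[4026532001]" "$demo/200/ns/$t"
done
ln -s 'pid:[4026532002]' "$demo/100/ns/pid"
ln -s 'pid:[4026532003]' "$demo/200/ns/pid"
ln -s 'mnt:[4026532004]' "$demo/100/ns/mnt"
ln -s 'mnt:[4026532005]' "$demo/200/ns/mnt"

ns_compare 100 200 "$demo"
rm -rf "$demo"
```

Against the real /proc, reading another process's ns links needs permission to inspect it (same user or root), so the function is usually run with sudo on container PIDs.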

How Technologies Use This

Docker

Running 50 isolated workloads on one host, each believing it is PID 1 with its own IP address, sounds like a job for VMs. But every VM boots its own kernel and reserves its own memory: at an illustrative 500MB of overhead each, 50 VMs would cost about 25GB and take far longer to start. The workloads must be isolated from each other but share the host kernel.

The trick is that namespaces give each process its own private view of system resources without duplicating the kernel. Docker's runc calls clone() with CLONE_NEWPID, CLONE_NEWNET, and CLONE_NEWNS to create isolated views in a fraction of a second. Each container sees itself as PID 1 with its own /proc, its own 172.17.x.x address, and its own overlayfs root, while the host sees it as just PID 47832.

Per-container overhead is a fraction of a VM's: tens of megabytes rather than hundreds, as a rough illustration that varies with the workload. The takeaway is that containers are not lightweight VMs -- they are regular processes with namespace-isolated views of the system, which is why they start so quickly and share the host's memory page cache.

Kubernetes

Two containers in the same pod need to share localhost on port 8080 for sidecar communication, but giving them full shared access breaks filesystem security. Fully isolating them breaks the sidecar pattern entirely because they cannot reach each other over 127.0.0.1.

The solution lies in selective namespace sharing. Kubernetes shares one network namespace across all containers in a pod so they see the same IP and can communicate over localhost, while giving each container its own PID and mount namespaces for separate filesystems and process lists. The pause container holds the shared network namespace alive even when application containers restart.

Without the pause container, the shared network namespace would have no stable owner: if the container holding it crashed, the namespace could be destroyed, and the pod's interfaces, IP address, and routes would have to be rebuilt. The lesson is that Kubernetes pods are not just groups of containers -- they are carefully orchestrated namespace-sharing arrangements where each namespace type is shared or isolated independently based on the communication and security requirements.

systemd

A compromised web server service reads /home/admin/.ssh/id_rsa and exfiltrates the private key. Without namespace isolation, every service on the host shares the same filesystem and network view, so any service can see every other service's files and sniff its traffic.

The underlying problem is that traditional Linux services all run in the host's single namespace. A vulnerability in any one of them gives the attacker the same view of the system as every other service. There is no isolation boundary between services unless one is explicitly created.

systemd solves this without requiring Docker or any container runtime. PrivateNetwork= gives a service its own network namespace with only loopback, ProtectHome=true makes /home, /root, and /run/user inaccessible to the service, and PrivateUsers= runs it in its own user namespace. DynamicUser= complements these by allocating an ephemeral UID for the service's lifetime (it does not itself create a user namespace). Together these directives sharply reduce what a compromised service can see and reach.
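
Directives like these are usually applied through a drop-in file. The sketch below writes one into a scratch directory so it can be inspected without touching /etc; the unit name mywebapp.service and the helper write_hardening_dropin are hypothetical.

```shell
# write_hardening_dropin DIR: create DIR and write hardening.conf into it
# with the PrivateNetwork, ProtectHome, and DynamicUser directives.
write_hardening_dropin() {
  mkdir -p "$1" || return 1
  cat > "$1/hardening.conf" <<'EOF'
[Service]
PrivateNetwork=yes
ProtectHome=yes
DynamicUser=yes
EOF
}

# Write to a temporary location and show the result.
dropin_dir=$(mktemp -d)/mywebapp.service.d
write_hardening_dropin "$dropin_dir"
cat "$dropin_dir/hardening.conf"
```

For a real unit the directory would be /etc/systemd/system/<unit>.service.d/, followed by systemctl daemon-reload and a restart of the service. Note that PrivateNetwork=yes removes all external connectivity, so it only suits services that need no network, and systemd-analyze security <unit> can be used to review the remaining exposure.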

Chrome

A malicious webpage exploits a renderer vulnerability and calls kill() on an SSH session, or reads /proc/1/environ to steal host secrets. Without PID namespace isolation, the compromised renderer process can enumerate and signal every process on the machine.

The core issue is that a renderer tab is a regular process that, without namespace isolation, shares the host's PID space. It can see every process in /proc, send signals to any process owned by the same user, and read environment variables that may contain credentials. A single tab exploit becomes a full host compromise.

On Linux, Chrome's sandbox starts renderer processes in new user, PID, and network namespaces, so a renderer cannot enumerate or signal processes outside its sandbox and has no usable network interfaces of its own; a seccomp-bpf filter further restricts the system calls it may make. Even a full renderer exploit starts from a view of only the sandboxed processes rather than the whole machine.

Same Concept Across Tech

Technology	Which namespaces it uses	Purpose
Docker	pid, net, mnt, uts, ipc by default; cgroup on cgroup v2 hosts; user only when enabled (userns-remap or rootless mode); time not used by default	Full container isolation
Kubernetes	Inherits from container runtime, adds network namespace per pod	Pod-level networking, containers in a pod share net namespace
Chrome	User + PID + network for renderer processes (with seccomp-bpf on top)	Sandboxing untrusted web content
systemd	PrivateNetwork, ProtectHome, PrivateTmp use mount and network namespaces	Service hardening without containers
Flatpak	User + mount + PID namespaces	Desktop app sandboxing

Stack layer mapping (container networking issue):

Layer	What to check	Tool
Application	Is the app binding to the right interface/port?	Application logs, ss -tlnp
Container runtime	Is the network namespace correctly configured?	docker inspect, ip netns list
Network namespace	Are routes and iptables correct inside the namespace?	nsenter -n -t PID ip route, iptables -L
Host kernel	Is IP forwarding enabled? Are veth pairs connected?	sysctl net.ipv4.ip_forward, ip link show
Hardware	Physical NIC configuration, MTU mismatches?	ethtool, ip link show

Design Rationale

Duplicating the entire kernel per workload (VMs) wastes memory and adds seconds of boot time. A single global resource view lets any process see every PID, bind any port, read any file -- fundamentally unsafe for multi-tenant hosting. Splitting isolation into orthogonal namespace types lets runtimes compose exactly the boundaries they need. Kubernetes shares a network namespace within a pod (sidecar over localhost) while isolating mount namespaces (separate filesystems) -- a level of granularity that a monolithic VM-or-nothing model cannot express.

If You See This, Think This

Symptom	Likely cause	First check
Container cannot reach the internet	Network namespace missing default route or NAT rule	nsenter -n -t PID ip route
Process sees PID 1 but host shows PID 47832	Normal PID namespace behavior, not a bug	cat /proc/PID/status, check NSpid line
Container can see host processes	PID namespace not applied, or container running with --pid=host	ls /proc inside container, check docker run flags
Permission denied despite running as root inside container	User namespace mapping. Root inside container maps to unprivileged user on host	cat /proc/PID/uid_map
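mount/umount fails inside container	CAP_SYS_ADMIN is dropped (Docker's default), the target filesystem is mounted read-only, or propagation is set to slave	Check capabilities with grep CapEff /proc/self/status; check propagation with findmnt -o TARGET,PROPAGATION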
Container escape vulnerability	Missing namespace + seccomp + capability restrictions	Audit all three layers, not just namespaces

When to Use / Avoid

Use namespaces when:

Building container runtimes (Docker, containerd, Podman all use namespaces)
Isolating network stacks for testing (each namespace gets its own interfaces, routes, iptables)
Running untrusted code that should not see host processes or filesystem
Creating sandboxed build environments

Avoid when:

Strong security isolation is required and namespaces would be the only layer (they are not a security boundary on their own; combine them with seccomp and capability restrictions)
Hardware-level isolation is needed (use VMs instead)
The overhead of managing namespace lifecycle is not worth the isolation benefit

Try It Yourself

# List all namespaces on the system
lsns --output-all | head -20

# Create a minimal container with user, PID, mount, and network namespaces
# (--mount-proc implies a new mount namespace; --map-root-user maps the caller to UID 0 inside)
unshare --user --map-root-user --pid --fork --mount-proc --net bash -c 'echo PID=$$; hostname; ip link; ls /proc | head -5'

# Check which namespaces a process belongs to
ls -la /proc/$$/ns/

# Enter the network namespace of a running Docker container
docker run -d --name test-ns alpine sleep 3600
PID=$(docker inspect -f '{{.State.Pid}}' test-ns)
echo "Container PID: $PID"
sudo nsenter -t "$PID" -n ip addr show
docker rm -f test-ns

# Create a network namespace with a veth pair, inspect it, then clean up
sudo ip netns add testns
sudo ip link add veth0 type veth peer name veth1
sudo ip link set veth1 netns testns
sudo ip netns exec testns ip addr show
sudo ip netns del testns

# Show PID namespace hierarchy
grep -E '^(NSpid|NStgid|NSsid)' /proc/$$/status

Debug Checklist

1. List namespaces of a process: ls -la /proc/<pid>/ns/
2. Enter a container's namespace: nsenter -t <pid> -m -u -i -n -p
3. List all network namespaces: ip netns list
4. Check PID namespace: cat /proc/<pid>/status | grep NSpid
5. Run a command in a new namespace: unshare --pid --fork --mount-proc bash
6. Check user namespace mappings: cat /proc/<pid>/uid_map

Key Takeaways

✓ 8 namespace types as of kernel 5.6: mount, UTS (hostname), IPC, network, PID, user, cgroup, and time. The time namespace lets containers have different CLOCK_MONOTONIC and CLOCK_BOOTTIME offsets -- useful for container migration.
✓ unshare() creates new namespaces for the calling process. clone() creates them for a child. setns() joins an existing namespace by fd from /proc/PID/ns/<type>. That last one is how 'docker exec' enters a running container.
✓ If PID 1 in a PID namespace exits, every other process in that namespace gets SIGKILL. The entire container is torn down. That is why containers need an init process like tini or dumb-init for signal forwarding and zombie reaping.
✓ Mount namespaces plus pivot_root() are what make the container's rootfs appear as /. Docker uses overlayfs for layered images, then pivot_root to swap the root. The host filesystem becomes invisible.
✓ User namespaces are the key to rootless containers. An unprivileged user creates a user namespace, gets full capabilities inside it, and can then create all other namespace types. No real root needed.

Common Pitfalls

✗ Mistake: Expecting /proc to be isolated after entering a PID namespace. Reality: You must mount a new procfs (mount -t proc proc /proc) or /proc still shows the host's process list.
✗ Mistake: Wondering why network is broken in a new namespace. Reality: Network namespaces start with only loopback. You must create veth pairs, assign IPs, and set up routing -- or use a CNI plugin.
✗ Mistake: Thinking PID namespace isolation is absolute. Reality: PID namespaces are hierarchical. The parent can see all child PIDs. kill() from the host can target container processes by their host PID. This is by design.
✗ Mistake: Using chroot for container filesystem isolation. Reality: chroot is trivially escapable by any process that holds CAP_SYS_CHROOT (open fd to /, chroot to subdirectory, fchdir). pivot_root in a mount namespace has no such escape once the old root has been unmounted.

Reference

System Calls

unshare, clone, setns, pivot_root

Tools

unshare(1), nsenter, lsns

📌

In One Line

Containers are just processes with namespace-isolated views -- understand unshare, clone, and setns and the magic disappears.
