Container Runtime Internals (runc/containerd)
Mental Model
Building a container is like constructing a movie set on a studio lot. There is no single "build a set" operation -- the crew assembles it from standard parts. First, partition walls go up to create a separate space (namespaces). Then a fake street facade is erected so actors only see the set, not the warehouse behind it (pivot_root with overlay). The lighting budget and number of extras are capped by the production budget (cgroups). Prop weapons are rubber, not real -- the actors can hold them but they cannot actually fire (capabilities dropped). The script blocks certain actions entirely -- no real explosions allowed on set (seccomp). The director (runc) arranges all of this, then leaves. The stage manager (containerd-shim) stays behind to keep things running. From inside, the actors experience a complete world. From outside, it is just a partitioned corner of a warehouse using standard construction materials.
The Problem
A container takes 30 seconds to start because the overlay mount step re-extracts 847MB of image layers on every creation. PID 1 inside the container is bash, which swallows SIGTERM, so graceful shutdown never works and every pod termination hits the 30-second kill timeout before SIGKILL fires. Rootless containers fail with EPERM on clone() because user.max_user_namespaces is set to 0, and the error message says nothing about namespaces. None of these are container bugs -- they are kernel primitive misconfigurations that look like container platform failures.
Architecture
What actually happens between typing docker run nginx and a process responding to HTTP requests inside an isolated environment?
The word "container" does not appear anywhere in the Linux kernel source. The kernel provides primitives -- clone(), pivot_root(), cgroup pseudo-files, capset(), seccomp() -- and a container runtime assembles them in the right order. Every container problem is actually a kernel primitive problem. Slow starts are overlay mount problems. Signal failures are PID namespace problems. Permission errors are capability or user namespace problems.
The runc Create Sequence
When runc creates a container, it follows an exact 8-step sequence. Each step builds on the previous one.
Step 1: clone() with namespace flags. runc calls clone() with CLONE_NEWPID (child sees itself as PID 1), CLONE_NEWNS (separate mount table), CLONE_NEWNET (empty network stack), CLONE_NEWUTS (independent hostname), CLONE_NEWIPC (separate SysV IPC), and CLONE_NEWUSER (UID mapping for rootless). All six namespaces are created atomically in roughly 50-100 microseconds.
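These flags are plain bit constants OR-ed into a single clone() argument. A minimal sketch of the composition, using the flag values from <linux/sched.h> (Python here purely for illustration; runc does this in Go):

```python
# Namespace flag values from <linux/sched.h>; one clone() call
# carrying all six creates every namespace atomically.
CLONE_NEWNS   = 0x00020000  # separate mount table
CLONE_NEWUTS  = 0x04000000  # independent hostname
CLONE_NEWIPC  = 0x08000000  # separate SysV IPC
CLONE_NEWUSER = 0x10000000  # UID mapping for rootless
CLONE_NEWPID  = 0x20000000  # child sees itself as PID 1
CLONE_NEWNET  = 0x40000000  # empty network stack

flags = (CLONE_NEWNS | CLONE_NEWUTS | CLONE_NEWIPC
         | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET)
print(hex(flags))  # → 0x7c020000
```

The same values appear in strace output when you trace a container start, which is one way to confirm which namespaces a runtime actually requested.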
Step 2: Mount overlay filesystem. Image layers become lowerdirs (read-only), the container gets a read-write upperdir. Writing to an existing file triggers copy-up. This is where slow startup hides -- re-extracting an 847MB image with 12 layers can take 8-15 seconds on spinning disk.
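Once the layers are on disk, the mount itself is one syscall whose options string names every layer. A sketch of how a runtime might assemble it (note that overlayfs lists lowerdir entries topmost-first, so the manifest's base-first order gets reversed; paths are illustrative):

```python
def overlay_options(image_layers, upperdir, workdir):
    """Build an overlayfs mount options string. image_layers is
    ordered base-first, as in the image manifest; overlayfs wants
    the topmost lower layer FIRST, so the order is reversed."""
    lowerdir = ":".join(reversed(image_layers))
    return f"lowerdir={lowerdir},upperdir={upperdir},workdir={workdir}"

opts = overlay_options(
    ["/layers/base", "/layers/libs", "/layers/app"],
    "/containers/c1/upper", "/containers/c1/work")
print(opts)
# → lowerdir=/layers/app:/layers/libs:/layers/base,upperdir=/containers/c1/upper,workdir=/containers/c1/work
```

Running `mount | grep overlay` on a host with containers shows the real strings, which can run to dozens of lowerdir entries.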
Step 3: pivot_root(). Swaps the root mount to the overlay. The old root is moved to .pivot_old, then unmounted with MNT_DETACH and removed. Unlike chroot(), pivot_root() changes the actual mount tree, not just pathname resolution.
Step 4: Mount special filesystems. Fresh /proc for the new PID namespace (so ps only shows container processes), minimal /dev tmpfs (null, zero, random, urandom, tty), read-only /sys to prevent cgroup manipulation from inside.
Step 5: Apply cgroup limits. Write to memory.max, cpu.max, pids.max in the container's cgroup directory, then move the process into the cgroup via cgroup.procs. The kernel enforces limits transparently from this point.
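"Apply limits" is nothing more than writing strings into pseudo-files. A sketch against a temporary directory standing in for the container's cgroup directory -- the filenames are the real cgroup v2 interface files, but nothing here touches the kernel, and the values are illustrative:

```python
import os
import tempfile

def apply_limits(cgroup_dir, pid, mem_bytes, cpu_quota, cpu_period, max_pids):
    # Each limit is a plain write to a pseudo-file; the kernel parses
    # and enforces it. Moving the process in (cgroup.procs) comes last.
    writes = {
        "memory.max": str(mem_bytes),
        "cpu.max": f"{cpu_quota} {cpu_period}",  # e.g. "50000 100000" = half a CPU
        "pids.max": str(max_pids),
        "cgroup.procs": str(pid),
    }
    for name, value in writes.items():
        with open(os.path.join(cgroup_dir, name), "w") as f:
            f.write(value)

d = tempfile.mkdtemp()  # stand-in for /sys/fs/cgroup/mycontainer
apply_limits(d, pid=1234, mem_bytes=512 * 1024**2,
             cpu_quota=50000, cpu_period=100000, max_pids=256)
print(open(os.path.join(d, "memory.max")).read())  # → 536870912
```

Against the real /sys/fs/cgroup the same writes require appropriate permissions, which is exactly what cgroup delegation for rootless containers is about.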
Step 6: Drop capabilities. Keep only about 14 safe capabilities (CHOWN, SETUID, NET_BIND_SERVICE, etc.). Drop CAP_SYS_ADMIN from the bounding set -- permanent and irreversible. No process inside the container can ever regain it.
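Capability sets are 64-bit masks in which each capability is a bit index from <linux/capability.h>. A sketch of checking a bounding set the way /proc/<pid>/status reports it (the CapBnd line, in hex; the example mask is the commonly cited Docker default):

```python
# Bit indices from <linux/capability.h>
CAP_CHOWN            = 0
CAP_NET_BIND_SERVICE = 10
CAP_SYS_ADMIN        = 21

def has_cap(capbnd_hex, cap):
    """capbnd_hex is the CapBnd value from /proc/<pid>/status."""
    return bool(int(capbnd_hex, 16) >> cap & 1)

# Docker's default bounding set as it appears in CapBnd:
default_bnd = "00000000a80425fb"
print(has_cap(default_bnd, CAP_CHOWN))             # → True
print(has_cap(default_bnd, CAP_NET_BIND_SERVICE))  # → True
print(has_cap(default_bnd, CAP_SYS_ADMIN))         # → False
```

Comparing `grep CapBnd /proc/1/status` inside a container against the same command on the host makes the drop directly visible.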
Step 7: Install seccomp filter. A BPF filter blocks roughly 44 dangerous syscalls: mount(), reboot(), kexec_load(), init_module(), bpf(). Even if CAP_SYS_ADMIN is regained through a kernel exploit, the seccomp filter blocks the dangerous operations. Defense in depth.
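The default profile is allow-by-default with an explicit deny list, compiled into a BPF program that matches syscall numbers. A behavioral sketch of that policy -- the real filter matches numeric syscall IDs and can also inspect arguments, and the deny set shown here is only a sample:

```python
import errno

# A small slice of the deny list; the real default profile blocks
# roughly 44 syscalls, mostly returning EPERM to the caller.
DENIED = {"mount", "umount2", "reboot", "kexec_load", "init_module", "bpf"}

def filter_syscall(name):
    """Model of an allow-by-default seccomp policy: denied syscalls
    fail with EPERM, everything else proceeds into the kernel."""
    if name in DENIED:
        return -errno.EPERM   # caller sees errno 1 (Operation not permitted)
    return 0                  # SECCOMP_RET_ALLOW

print(filter_syscall("mount"))   # → -1
print(filter_syscall("openat"))  # → 0
```

This is why `mount` inside a default container fails with "Operation not permitted" even when run as container root: the filter, not the capability check, rejects it first.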
Step 8: execve() the entrypoint. runc's code is replaced by the container process. runc is gone. The containerd-shim remains as the parent, holding stdio pipes and tracking exit status.
The containerd Architecture
dockerd/kubelet translates high-level intent (image names, port mappings, restart policies) into containerd gRPC calls. containerd manages images (content-addressable blobs by SHA256), snapshots (overlay preparation), and tasks (lifecycle). containerd-shim is a per-container supervisor forked before runc runs -- it holds stdio, reaps zombies, and survives containerd restarts. This last point is critical: upgrading containerd does not kill running containers because each shim is an independent process. runc reads the OCI config.json, performs the 8 steps, and exits. It is a tool, not a daemon.
OCI Specs and Image Layers
The Runtime Spec (config.json) defines namespaces, mounts, cgroup limits, seccomp profile, capabilities, and entrypoint. runc spec generates a default. Because runc is driven entirely by this file, any program that implements it is a valid OCI runtime -- crun (C, faster startup) and youki (Rust) are drop-in replacements.
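Because the entire runtime contract is one JSON document, a slice of it fits in a few lines. A hypothetical subset covering only the fields discussed above -- field names follow the spec, but this is not a complete valid config.json (`runc spec` generates the real thing):

```python
import json

# Hypothetical subset of an OCI Runtime Spec config.json.
spec = {
    "process": {
        "args": ["/usr/sbin/nginx", "-g", "daemon off;"],
        "capabilities": {"bounding": ["CAP_NET_BIND_SERVICE", "CAP_CHOWN"]},
    },
    "root": {"path": "rootfs", "readonly": False},
    "linux": {
        "namespaces": [{"type": t} for t in
                       ("pid", "network", "ipc", "uts", "mount", "user")],
        "resources": {"memory": {"limit": 512 * 1024**2}},
    },
}
print(json.dumps(spec, indent=2)[:120])
```

Any runtime that consumes this shape -- runc, crun, youki -- produces the same container, which is the whole point of the spec.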
The Image Spec consists of a manifest (layer digests in order), a config blob (architecture, default command, diff IDs), and layer tarballs (tar.gz of filesystem changes, SHA256-identified). Two images sharing 5 of 7 layers store only 2 unique layers. containerd pulls by resolving the tag to a manifest digest, checking the content store for each layer SHA256, downloading only missing layers, extracting into the snapshotter, and recording the snapshot chain.
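The pull-time dedup falls out of content addressing: a layer's digest is its identity, so "already have it" is a set-membership test. A sketch of the per-layer decision (digests shortened for readability; real ones are full SHA256 hashes):

```python
def layers_to_download(manifest_layers, content_store):
    """Content-addressable pull: a layer digest already present in
    the local content store is never downloaded again."""
    return [d for d in manifest_layers if d not in content_store]

store = {"sha256:aa", "sha256:bb", "sha256:cc", "sha256:dd", "sha256:ee"}
manifest = ["sha256:aa", "sha256:bb", "sha256:cc",
            "sha256:dd", "sha256:ee", "sha256:ff", "sha256:99"]

# 5 of 7 layers are shared with an already-pulled image:
print(layers_to_download(manifest, store))  # → ['sha256:ff', 'sha256:99']
```

The same logic explains why pushing a rebuilt image after a one-line change uploads only the layers above the change.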
Kubernetes Pod Creation via CRI
Since Kubernetes 1.24, kubelet talks to containerd (or CRI-O) via CRI, a gRPC API. Pod creation follows a strict sequence:
RunPodSandbox creates the pause container (registry.k8s.io/pause:3.9, ~700KB binary that calls pause() in a loop). It is created with all namespace flags, and the CNI plugin configures its network namespace. The pause container IS the pod from the kernel's perspective.
CreateContainer prepares an overlay snapshot and OCI config.json for each app container. The config specifies joining the pause container's existing namespaces via setns() rather than creating new ones.
StartContainer has containerd-shim spawn runc, which joins the pause container's PID/NET/IPC namespaces. This is why all containers in a pod share localhost.
StopContainer sends SIGTERM, waits terminationGracePeriodSeconds (default 30), then SIGKILL.
The pause container serves three roles: namespace holder (survives app container crashes), zombie reaper (PID 1 that calls wait()), and lifecycle anchor (kubelet monitors it for pod health). The pause binary is a few dozen lines of C -- signal handlers plus a pause() loop -- consuming about 1MB of memory.
The PID 1 Problem
Inside a PID namespace, the kernel drops any signal sent to PID 1 unless PID 1 has registered a handler for it (only SIGKILL and SIGSTOP from the parent namespace get through regardless). When a container runs bash -c "my-app", bash becomes PID 1, installs no SIGTERM handler, and the signal is silently dropped -- and even if it were delivered, bash would not forward it to my-app. Kubernetes waits 30 seconds and sends SIGKILL. Solutions: use exec form CMD ["my-app"] so the app is PID 1 directly, or use tini/dumb-init as the entrypoint to forward signals and reap zombies.
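The failure mode is easy to reproduce outside a container: spawn a shell that ignores SIGTERM and watch the grace period expire. A sketch with the grace period shortened to one second (the Kubernetes default is 30):

```python
import signal
import subprocess
import time

# Spawn a child that ignores SIGTERM -- standing in for a shell
# PID 1 that neither handles nor forwards the signal.
proc = subprocess.Popen(["sh", "-c", 'trap "" TERM; sleep 30'])
time.sleep(0.2)                    # give the shell time to install the trap
proc.send_signal(signal.SIGTERM)   # kubelet's polite shutdown request
try:
    proc.wait(timeout=1)           # grace period (30s in Kubernetes, 1s here)
    print("exited gracefully")
except subprocess.TimeoutExpired:
    proc.kill()                    # grace expired: SIGKILL, no cleanup possible
    proc.wait()
    print("killed:", proc.returncode)
```

On Linux the negative return code is the terminating signal number; -9 confirms the process died to SIGKILL without ever running shutdown logic. Swap the command for `["sleep", "30"]` (no shell, no trap) and the SIGTERM path is taken instead.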
Rootless Containers
CLONE_NEWUSER creates a user namespace where the process has full capabilities despite being unprivileged on the host. The UID mapping (0 100000 65536 in /proc/<pid>/uid_map) means UID 0 inside maps to UID 100000 outside. Files owned by container root are owned by UID 100000 on the host filesystem. Rootless networking uses slirp4netns or pasta (userspace, 2-5 Gbps) instead of veth pairs (kernel, 10+ Gbps) because creating veth requires host-level CAP_NET_ADMIN.
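A uid_map line is three numbers: inside-start, outside-start, length. Translating an in-container UID to a host UID is simple range arithmetic, sketched here:

```python
def map_uid(uid_map_line, container_uid):
    """Translate a container UID to a host UID using one
    /proc/<pid>/uid_map entry: 'inside-start outside-start length'."""
    inside, outside, length = map(int, uid_map_line.split())
    if inside <= container_uid < inside + length:
        return outside + (container_uid - inside)
    return None  # unmapped: shows up as the overflow UID 65534 (nobody)

print(map_uid("0 100000 65536", 0))      # container root → 100000 on the host
print(map_uid("0 100000 65536", 1000))   # → 101000
print(map_uid("0 100000 65536", 70000))  # → None (outside the mapped range)
```

This is why files created by container root in a rootless bind mount appear owned by a high UID on the host, and why host files owned by unmapped UIDs show as "nobody" inside.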
Common Questions
What is the difference between Docker and containerd? Docker is the developer platform (CLI, build, compose). containerd is the core runtime (images, snapshots, lifecycle). Docker uses containerd internally. Kubernetes uses containerd directly, bypassing Docker. Both produce identical containers via runc and OCI specs.
Why does a container ignore SIGTERM? PID 1 signal semantics. The kernel drops signals for which PID 1 has not registered a handler. Shell wrappers as PID 1 silently drop SIGTERM. Use exec form CMD or tini.
How does Kubernetes create a pod? CRI RunPodSandbox (pause container with namespaces), then CreateContainer + StartContainer for each app container joining via setns().
Why did Kubernetes remove Docker support? It removed dockershim (the adapter), not OCI image support. The path kubelet -> dockershim -> dockerd -> containerd -> runc became kubelet -> containerd -> runc. Fewer moving parts.
How Technologies Use This
A Dockerized microservice takes 30 seconds to start instead of 2. The overlay mount step hangs because the lower layers reference a deleted image tag, causing containerd to re-extract 847MB of layers on every container creation. The developer adds --pull=always thinking it is an image freshness issue, which makes it worse -- now every start pulls from the registry too.
The root cause is misunderstanding the container startup sequence. When dockerd receives a run command, it calls containerd over gRPC, which prepares a snapshot (overlay mount), then delegates to containerd-shim, which spawns runc. runc performs the actual namespace creation, pivot_root, cgroup writes, capability drops, seccomp filter installation, and final exec. Each step is sequential. A slow snapshot prepare blocks everything downstream.
The fix is examining each phase independently. 'ctr snapshots list' shows stale snapshots consuming 12GB. Pruning them with 'docker system prune' and pinning base images by digest eliminates the re-extraction. Container start time drops to 1.8 seconds. For rootless Docker, writing 'user.max_user_namespaces=28633' to /etc/sysctl.d/userns.conf (and applying it with 'sysctl --system') prevents the EPERM failures on clone() with CLONE_NEWUSER -- an error message that never mentions namespaces.
Pods are stuck in ContainerCreating for 45 seconds on a 200-node cluster. kubelet logs show "RunPodSandbox" taking 40 seconds. The team suspects network plugin issues because the pause container is involved, but the actual bottleneck is containerd's image pull for the pause image on nodes with a cold image cache.
The problem is that Kubernetes creates every pod through a precise CRI sequence: RunPodSandbox (creates pause container holding the network namespace), then CreateContainer + StartContainer for each app container. The pause container must be running before any app container can join its namespaces via setns(). If the pause image (registry.k8s.io/pause:3.9, about 700KB) is not cached, containerd pulls it synchronously, and on air-gapped clusters this times out entirely.
Pre-pulling the pause image into containerd's content store on every node via a DaemonSet eliminates the delay. For CRI-O clusters, switching to the pinned-image strategy (--pause-image=localhost/pause:3.9) avoids registry dependency completely. Understanding that the pause container exists solely to hold namespaces (it runs /pause, an infinite sleep binary) clarifies that it is not optional -- it is the pod's identity from the kernel's perspective.
Same Concept Across Tech
| Concept | Docker | Kubernetes | Podman | containerd |
|---|---|---|---|---|
| Container creation | docker run (calls containerd) | kubelet CRI -> RunPodSandbox + CreateContainer | podman run (daemonless; forks conmon, which execs runc) | ctr run (native API) |
| Runtime | runc (default), crun (alternative) | runc via containerd or CRI-O | runc, crun | runc (default OCI runtime) |
| Image format | OCI Image Spec (layers + manifest) | Same OCI images | Same OCI images | Content-addressable store |
| Filesystem isolation | overlayfs snapshot | overlayfs snapshot via containerd | overlayfs or fuse-overlayfs (rootless) | Snapshotter plugin (overlay default) |
| Process supervision | containerd-shim | containerd-shim (per container) | conmon (container monitor) | containerd-shim-runc-v2 |
| Rootless mode | Docker rootless (userns + slirp4netns) | Usernetes (experimental) | Rootless by default (no daemon) | Rootless containerd (user namespace) |
Stack Layer Mapping
| Layer | Container Runtime Component |
|---|---|
| User CLI | docker, kubectl, podman, ctr, crictl |
| High-level runtime | containerd (image mgmt, snapshot, task lifecycle) |
| Shim layer | containerd-shim (per-container supervision, stdio, exit status) |
| Low-level runtime | runc (namespace creation, pivot_root, cgroup, caps, seccomp, exec) |
| Kernel primitives | clone(), pivot_root(), mount(), cgroup fs, capable(), seccomp() |
| Image storage | OCI content-addressable store (blobs by SHA256) |
Design Rationale
The container runtime stack is deliberately split into layers because each layer has a different lifecycle and failure domain. containerd can be upgraded without killing containers because containerd-shim keeps them alive. runc exits after setup because there is nothing left for it to do -- the container IS the process, not something runc manages. The OCI spec exists so that runc can be swapped for crun, youki, or gVisor without changing anything above it. This layering is not accidental overengineering -- it came from years of Docker's monolithic architecture causing upgrade-induced container deaths.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Container start takes 30+ seconds | Stale or missing overlay snapshots causing layer re-extraction | ctr snapshots list and docker system df |
| SIGTERM ignored, pod termination hits 30s timeout | PID 1 is shell that does not forward signals | docker exec <c> cat /proc/1/cmdline -- check if PID 1 is bash/sh |
| "permission denied" on clone() in rootless mode | user.max_user_namespaces sysctl is 0 or too low | sysctl user.max_user_namespaces |
| ContainerCreating stuck in Kubernetes | Pause image pull timeout or CNI plugin failure | crictl inspectp <pod-id> and journalctl -u containerd |
| "OCI runtime create failed" | Invalid config.json or missing rootfs | runc state <id> and check bundle directory |
| Container process is zombie | containerd-shim crashed, no parent to reap | `ps -ef \| grep defunct` and check the shim process is alive |
| Overlay mount fails with ENOSPC | Too many lower layers (default max ~128) or disk full | `mount \| grep overlay` and `df -h` |
| "exec format error" on container start | Image architecture mismatch (arm64 image on amd64 host) | `docker inspect <image> \| grep Architecture` |
When to Use / Avoid
Use when:
- Debugging slow container startup and needing to identify which phase (snapshot, namespace creation, mount setup) is the bottleneck
- Investigating container escape vulnerabilities to understand which isolation layer failed
- Building minimal container images and needing to understand how layers map to overlay mounts
- Diagnosing PID 1 signal handling problems in Kubernetes pod termination
- Implementing rootless containers and debugging user namespace UID mapping
- Troubleshooting CRI errors in Kubernetes where kubelet cannot create pod sandboxes
Avoid when:
- Simply running containers in development where Docker/Podman CLI abstractions are sufficient
- The problem is purely application-level (wrong config, missing env vars) and not runtime-related
- Working with serverless container platforms (Fargate, Cloud Run) where the runtime is fully managed
Try It Yourself
```shell
# Generate a default OCI runtime spec
mkdir -p /tmp/mycontainer && cd /tmp/mycontainer && runc spec 2>/dev/null && cat config.json | head -30 || echo 'runc not installed'

# List running containerd containers and tasks
ctr -n moby containers list 2>/dev/null || ctr containers list 2>/dev/null || echo 'containerd not running'

# Show namespace inodes for a process (compare container vs host)
ls -la /proc/$$/ns/ 2>/dev/null

# Inspect which cgroup a process belongs to
cat /proc/$$/cgroup 2>/dev/null

# Show PID namespace mapping (host PID vs container PID)
cat /proc/$$/status 2>/dev/null | grep -E 'NSpid|NStgid|NSsid' || echo 'Not in a PID namespace'

# List CRI pods and containers on a Kubernetes node
crictl pods 2>/dev/null | head -10 || echo 'crictl not installed or CRI not configured'

# Enter all namespaces of a container process (replace PID)
echo 'nsenter -t <container-pid> -m -p -n -u -i -- /bin/sh'

# Examine overlay mount details for running containers
mount 2>/dev/null | grep overlay | head -5 || echo 'No overlay mounts found'

# Show containerd content store (image layers)
ctr content list 2>/dev/null | head -10 || echo 'containerd not running'

# Check container runtime and version used by Kubernetes
crictl version 2>/dev/null || echo 'crictl not available'
```
Debug Checklist
1. ls -la /proc/<pid>/ns/ -- show all namespace inodes for a container process
2. cat /proc/<pid>/cgroup -- show cgroup membership and limits
3. cat /proc/<pid>/status | grep -E 'NSpid|NStgid' -- show PID mapping across namespaces
4. runc state <container-id> -- dump container state including status and PID
5. ctr -n moby tasks list -- list running containerd tasks (Docker uses the moby namespace)
6. crictl inspectp <pod-id> | jq '.status.linux.namespaces' -- show pod namespace paths
7. nsenter -t <pid> -p -m cat /proc/1/status -- inspect PID 1 inside the container
8. mount | grep overlay -- show active overlay mounts and their layer composition
Key Takeaways
- ✓There is no container syscall. A container is assembled from clone() (namespaces), pivot_root (filesystem), cgroup writes (resource limits), prctl/capset (capability drops), and seccomp() (syscall filtering). runc orchestrates these in a precise sequence, and if any step fails, the container does not start.
- ✓containerd-shim is the unsung hero of container reliability. Because each container has its own shim process, containerd itself can be restarted or upgraded without killing running containers. The shim holds stdio, tracks the exit code, and reaps zombies. Without it, a containerd upgrade would kill every container on the node.
- ✓The Kubernetes pause container is not overhead -- it is the pod's identity. It is created first via RunPodSandbox, holds the network namespace, and all app containers join it with setns(). If the pause container dies, every container in the pod loses its network identity. It runs /pause (an infinite sleep), consuming about 1MB of memory.
- ✓Image layers are content-addressable. Each layer is a tar.gz identified by its SHA256 hash. containerd stores them in a content store and assembles them using snapshots (overlayfs by default). Pulling an image that shares layers with an already-pulled image skips the shared layers entirely. A 500MB image that shares 450MB with an existing image only downloads 50MB.
- ✓PID 1 in a container has special signal semantics. The kernel does not deliver signals to PID 1 unless PID 1 has explicitly registered a handler for that signal. If the entrypoint is bash (which does not handle SIGTERM by default), graceful shutdown is impossible. Tini or dumb-init solves this by acting as a proper init that forwards signals and reaps zombies.
Common Pitfalls
- ✗Mistake: Using a shell as PID 1 (CMD ["bash", "-c", "my-app"]) and expecting SIGTERM to reach the application. Reality: bash becomes PID 1, does not forward SIGTERM, and the app never receives the shutdown signal. Kubernetes waits terminationGracePeriodSeconds (default 30s) then sends SIGKILL. Use exec form CMD ["my-app"] or tini as the entrypoint.
- ✗Mistake: Assuming "container restart" means a fast operation. Reality: runc must redo the entire setup sequence -- clone namespaces, prepare overlay mount, pivot_root, apply cgroups, drop capabilities, install seccomp filter, exec. If the snapshot is stale or the image layers need re-extraction, startup takes seconds, not milliseconds.
- ✗Mistake: Running containers with --privileged "because it is easier." Reality: --privileged disables ALL isolation -- all capabilities granted, all devices accessible, AppArmor/SELinux disabled, seccomp disabled, /proc and /sys writable. A process inside a privileged container can mount the host filesystem and modify the host kernel.
- ✗Mistake: Not understanding that Kubernetes removed dockershim in 1.24 and thinking Docker no longer works. Reality: Docker images still work everywhere because they are OCI images. What was removed is kubelet talking to dockerd. Kubernetes now talks to containerd directly via CRI, which is what Docker itself uses internally.
Reference
In One Line
A container is clone() + pivot_root + cgroups + capability drops + seccomp, orchestrated by runc and supervised by containerd-shim.