Container Runtime & Docker
Why It Exists
VMs provide strong isolation but at a cost. A full guest OS means gigabytes of memory, minutes to boot, and real CPU overhead from the hypervisor. Most of the time, none of that is necessary.
Containers share the host kernel. They isolate processes, filesystems, and network stacks at the OS level instead of virtualizing hardware. The result is millisecond startup, megabyte images, and near-native performance. That tradeoff is why containers became the default unit of deployment.
The isolation is weaker than a VM, though. If the threat model includes hostile tenants running arbitrary code, containers alone won't cut it. Reach for gVisor, Firecracker, or actual VMs. Know the boundaries.
How It Works
Linux Primitives
There is no "container" feature in the Linux kernel. Containers are a user-space abstraction built from several independent kernel mechanisms stitched together:
- PID namespace: Process isolation. PID 1 inside the container is a completely separate process tree. The container can't see or signal host processes.
- NET namespace: Each container gets its own network stack (interfaces, routing tables, iptables rules). Container-to-container communication needs an explicit bridge or overlay network.
- MNT namespace: Filesystem isolation. The container sees its own root filesystem from the image layers, set up via pivot_root or chroot.
- UTS namespace: Hostname isolation. Each container can have its own hostname.
- IPC namespace: Isolates System V IPC and POSIX message queues between containers.
- USER namespace: Maps container UIDs to unprivileged host UIDs. Root inside the container (UID 0) maps to a non-root UID on the host. This is what makes rootless containers possible.
cgroups v2 handle resource limits. Namespaces control what a process can see. cgroups control how much it can consume. The key controllers: cpu (time shares and bandwidth), memory (hard and soft limits, OOM priority), io (block device bandwidth), and pids (max process count, which stops fork bombs).
OCI Specification
The Open Container Initiative defines two standards. The Image Spec covers how to package a container as a layered tarball with a manifest and config JSON. The Runtime Spec covers how to execute a container given a root filesystem and configuration. Because these are separate specs, images built with Docker run on containerd, CRI-O, or any OCI-compliant runtime without changes.
Image Layer Mechanics
Images are a stack of read-only layers. Each Dockerfile instruction creates a new layer. At runtime, a thin writable layer sits on top using a union filesystem (overlay2 on Linux). Layers are content-addressable by SHA256 hash, so identical layers across different images get stored and transferred only once.
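The union-filesystem lookup can be pictured as a top-down search through the layer stack: the writable layer shadows the read-only image layers beneath it. A toy model (dicts standing in for directories):

```python
# Toy model of union-filesystem lookup: the writable layer shadows
# read-only image layers, and a path is resolved top-down.
def resolve(path: str, layers: list[dict]) -> str:
    """layers[0] is the topmost (writable) layer; each maps path -> contents."""
    for layer in layers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

base = {"/etc/os-release": "debian", "/bin/sh": "shell-binary"}
app = {"/server": "app-binary"}
writable = {"/etc/os-release": "patched"}   # copy-on-write edit at runtime

stack = [writable, app, base]
print(resolve("/etc/os-release", stack))  # "patched" -- upper layer wins
print(resolve("/bin/sh", stack))          # "shell-binary" -- falls through to base
```

Real overlay2 adds whiteout files for deletions and copies files up on first write, but the shadowing rule is the same.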
```dockerfile
# Multi-stage build example
FROM golang:1.22 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /server /server
ENTRYPOINT ["/server"]
```
The builder stage pulls in the full Go toolchain at roughly 800MB. The final image has only the static binary and a minimal base, typically under 15MB. This is the single most impactful Dockerfile optimization most teams can make.
Production Considerations
- Image scanning: Scan images for CVEs in CI using Trivy or Grype. Block deployments with critical vulnerabilities. Scan base images and application dependencies, not just one or the other.
- Rootless execution: Run containers as a non-root user (`USER 1001` in the Dockerfile). In Kubernetes, set `securityContext.runAsNonRoot: true` so the admission controller rejects anything that tries to run as root.
- Read-only root filesystem: Set `readOnlyRootFilesystem: true` and mount writable tmpfs only where the app actually needs it. This limits what an attacker can do after getting in.
- Image signing: Use Cosign or Notary to sign images. Enforce signature verification at deploy time with an admission controller like Kyverno or OPA Gatekeeper. Skip this, and anyone who can push to the registry can push to production.
- Layer caching: Order Dockerfile instructions from least to most frequently changing. Copy dependency files before source code so `go mod download` or `npm install` layers survive across builds. Getting this wrong means rebuilding everything on every commit.
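The caching rule above can be modeled simply: a changed instruction invalidates itself and every layer after it. A toy simulation (the instruction list mirrors the multi-stage example; the helper is illustrative):

```python
# Toy model of Dockerfile layer caching: a change invalidates the first
# affected instruction and every later layer, so frequently-changing
# steps belong last.
def layers_to_rebuild(instructions: list[str], changed: set[str]) -> list[str]:
    """Return the instructions that must re-run, given which ones changed."""
    first_dirty = min(
        (i for i, ins in enumerate(instructions) if ins in changed),
        default=len(instructions),
    )
    return instructions[first_dirty:]

good_order = ["COPY go.mod go.sum ./", "RUN go mod download", "COPY . .", "RUN go build"]

# Editing source code only re-runs the last two steps:
print(layers_to_rebuild(good_order, {"COPY . ."}))
# Editing go.mod also re-runs the dependency download:
print(layers_to_rebuild(good_order, {"COPY go.mod go.sum ./"}))
```

With the dependency copy first, a source-only change keeps the expensive `go mod download` layer cached.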
Failure Scenarios
Scenario 1: Container Runtime Socket Unresponsive. The containerd daemon hangs because zombie containerd-shim-runc-v2 processes pile up (a shim leak). The kubelet can't start, stop, or health-check containers on that node. Existing containers keep running but nobody can manage them. The node goes NotReady after the kubelet heartbeat timeout (40s default), and all pods get evicted after pod-eviction-timeout (5 minutes). Detection: watch kubelet_runtime_operations_errors_total and kubelet_runtime_operations_duration_seconds{operation="container_status"} exceeding 30s. Recovery: restart containerd with systemctl restart containerd. It reattaches to running containers via shim PIDs, so workloads are preserved. Prevention: set max-container-log-size and configure shim cleanup. Running a containerd watchdog process on every node is worth the small overhead.
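A node-local watchdog like the one described above can be as simple as probing the runtime's Unix socket with a deadline: a healthy daemon accepts the connection quickly, while a hung one times out. A hypothetical sketch; the socket path is containerd's conventional location, and the timeout value is an assumption:

```python
# Hypothetical liveness probe for the containerd socket. A healthy daemon
# accepts the connection fast; a hung daemon (e.g. shim leak) times out.
# The 5s timeout is an assumed threshold, not a containerd default.
import socket

def runtime_socket_healthy(path: str = "/run/containerd/containerd.sock",
                           timeout: float = 5.0) -> bool:
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return True
    except OSError:
        # Missing socket, refused connection, or timeout all count as unhealthy.
        return False
    finally:
        s.close()

print(runtime_socket_healthy("/nonexistent.sock"))  # False
```

In practice you would run this on a timer (systemd timer or a privileged DaemonSet) and restart containerd after consecutive failures.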
Scenario 2: Overlay Filesystem Corruption. The overlay2 storage driver corrupts a layer's metadata after a host kernel panic or unclean shutdown. Containers referencing that layer fail to start with mount: invalid argument. Every pod sharing that base layer on the affected node breaks. Detection: containerd_operations_errors_total{operation="prepare"} spikes, and inspecting the affected containers shows broken mounts. Recovery: remove the corrupted layer from /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/, then pull fresh layers. Prevention: back overlay2 with XFS formatted with ftype=1 (so the filesystem reports d_type support). ext4 works but XFS handles unclean shutdowns better. Monitor disk I/O errors before they cascade.
Scenario 3: Image Pull Backoff Cascade. A registry outage or rate limit (Docker Hub allows 100 pulls per 6 hours for unauthenticated users) triggers ImagePullBackOff across the cluster during a rolling deployment. New pods can't start, the Deployment controller terminates old pods, and the service loses capacity. This one catches people off guard. Detection: kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} rises. Recovery: use image pull-through caches (Harbor, Zot), pre-pull images to nodes via DaemonSets, or mirror critical images to a private registry. Depending on Docker Hub for production without a mirror makes it a matter of when, not if.
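The "backoff cascade" shape comes from the kubelet's exponential image-pull backoff, which to my understanding defaults to an initial 10s delay, doubling per failure, capped at 5 minutes. A sketch of that schedule (the defaults are stated as assumptions, not read from a running kubelet):

```python
# Sketch of the kubelet's image pull backoff: exponential, starting at 10s,
# doubling each failure, capped at 300s (assumed defaults behind
# the ImagePullBackOff status).
def backoff_schedule(attempts: int, initial: int = 10, cap: int = 300) -> list[int]:
    delays, delay = [], initial
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

print(backoff_schedule(7))  # [10, 20, 40, 80, 160, 300, 300]
```

After a few failures every retry is 5 minutes apart, which is why a short registry outage can stall a rollout for much longer than the outage itself.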
Capacity Planning
| Resource | Guideline | Real-World Reference |
|---|---|---|
| Image size target | < 100MB for microservices; < 50MB ideal | Google: distroless base images average 2-20MB |
| Registry storage | 1TB per 10K image tags (with deduplication) | GitHub Container Registry: petabytes with cross-repo layer sharing |
| Image pull throughput | 1 Gbps registry bandwidth per 100 concurrent node pulls | Uber: in-cluster P2P distribution (Kraken) for 1M+ container images/day |
| Build time target | < 5 min with cache; < 15 min cold | Shopify: 4-minute average with BuildKit distributed caching |
| Container density | 30-50 containers per node (4 vCPU/16GB) | Spotify: 40+ containers per node with strict memory limits |
| Layer count | < 10 layers per image (reduces pull and startup time) | Netflix: 3-5 layers per microservice image via multi-stage builds |
Key formulas: Registry bandwidth = concurrent_deployments * avg_image_size * nodes_pulling / deployment_window. For a 200-node cluster pulling 500MB images across a 50-node rolling update: 50 * 500MB = 25GB of registry egress per deployment. At a 1 Gbps link, that takes about 3.3 minutes per batch. When the cluster exceeds 500 nodes, P2P distribution (Dragonfly, Kraken) becomes worth the operational complexity. Uber's Kraken cut image distribution time from 20+ minutes to under 30 seconds across 2,000-node clusters.
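The arithmetic above can be packaged as a small helper for planning rollout batch sizes (the function and its parameter names are illustrative, not from any tool):

```python
# The registry-egress arithmetic from the text as a helper:
# time to push one rolling-update batch of image pulls over a given link.
def pull_window_seconds(nodes_pulling: int, image_size_gb: float,
                        link_gbps: float) -> float:
    """Seconds to transfer nodes_pulling * image_size_gb over link_gbps."""
    total_gbits = nodes_pulling * image_size_gb * 8  # GB -> Gb
    return total_gbits / link_gbps

# 50-node batch, 500MB image, 1 Gbps registry link:
secs = pull_window_seconds(nodes_pulling=50, image_size_gb=0.5, link_gbps=1)
print(f"{secs / 60:.1f} minutes")  # 3.3 minutes
```

Layer dedup and node-local caches only improve on this; the formula is the worst case where every node pulls every byte.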
Architecture Decision Record
ADR: Container Runtime Selection
Context: Choosing between containerd, CRI-O, Docker Engine, or Podman for a Kubernetes-based platform.
| Criteria (Weight) | containerd | CRI-O | Docker Engine | Podman |
|---|---|---|---|---|
| K8s compatibility (25%) | Native CRI, default in EKS/GKE/AKS | Native CRI, OpenShift default | Requires cri-dockerd shim since K8s 1.24 | Not a K8s runtime (dev only) |
| Resource overhead (20%) | Low (~50MB RSS) | Lowest (~30MB RSS) | High (~150MB RSS, dockerd + containerd) | N/A (daemonless) |
| Developer experience (15%) | No built-in CLI (use ctr/nerdctl) | Minimal tooling | Excellent (docker CLI, Compose, Desktop) | Docker CLI-compatible, rootless by default |
| Image build support (15%) | Via BuildKit (nerdctl build) | Delegates to Buildah | Native (docker build) | Via Buildah (podman build) |
| Security posture (15%) | Good; rootless support available | Good; minimal attack surface | Root daemon required by default | Excellent; daemonless, rootless-native |
| Community/ecosystem (10%) | CNCF graduated, broad adoption | Smaller community, Red Hat-driven | Largest ecosystem, declining in K8s context | Growing, RHEL/Fedora default |
Decision guidance: For production Kubernetes, pick containerd. It ships with EKS, GKE, and AKS, which means the most documentation and the fewest surprises. Pick CRI-O for OpenShift environments or when minimizing runtime attack surface is a hard security requirement. Keep Docker Desktop on developer workstations. Nothing else matches its local development experience, and fighting that just creates friction. Use Podman in CI pipelines that need rootless builds or where security policy prohibits running a Docker daemon. Docker Engine for Kubernetes clusters is a legacy choice at this point. The dockershim removal in K8s 1.24 made that clear.
Key Points
- Containers isolate processes using Linux namespaces and cgroups. They are not VMs.
- OCI defines the image format and runtime spec. Docker is just one implementation.
- containerd and CRI-O are the real runtimes in Kubernetes. Docker (dockershim) was removed in K8s 1.24.
- Image layers use copy-on-write. Shared base layers save disk space and speed up pulls.
- Multi-stage builds keep final images small by throwing away build-time dependencies.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| containerd | Open Source | Kubernetes default runtime, CNCF graduated | Medium-Enterprise |
| Docker Engine | Open Source | Developer experience, Docker Compose, build tooling | Small-Large |
| CRI-O | Open Source | Minimal K8s-focused runtime, OpenShift default | Medium-Enterprise |
| Podman | Open Source | Rootless containers, daemonless, Docker CLI-compatible | Small-Medium |
Common Mistakes
- Running containers as root. A compromised container gets host-level access.
- Using bloated base images (ubuntu:latest is 77MB) when distroless or alpine will do at 5MB.
- Storing secrets in image layers. They persist in layer history even after deletion in a later layer.
- Not pinning base image versions. `FROM python:3` pulls different images over time, so builds eventually break without any change to your code.
- Ignoring .dockerignore. The build context drags in junk files and builds slow down for no reason.