Container Runtime & Docker
Why It Exists
VMs provide strong isolation but at a cost. A full guest OS means gigabytes of memory, minutes to boot, and real CPU overhead from the hypervisor. Most of the time, none of that is necessary.
Containers share the host kernel. They isolate processes, filesystems, and network stacks at the OS level instead of virtualizing hardware. The result is millisecond startup, megabyte images, and near-native performance. That tradeoff is why containers became the default unit of deployment.
The isolation is weaker than a VM, though. If the threat model includes hostile tenants running arbitrary code, containers alone won't cut it. Reach for gVisor, Firecracker, or actual VMs. Know the boundaries.
How It Works
Linux Primitives
There is no "container" feature in the Linux kernel. Containers are a user-space abstraction built from several independent kernel mechanisms stitched together:
- PID namespace: Process isolation. PID 1 inside the container is a completely separate process tree. The container can't see or signal host processes.
- NET namespace: Each container gets its own network stack (interfaces, routing tables, iptables rules). Container-to-container communication needs an explicit bridge or overlay network.
- MNT namespace: Filesystem isolation. The container sees its own root filesystem from the image layers, set up via pivot_root or chroot.
- UTS namespace: Hostname isolation. Each container can have its own hostname.
- IPC namespace: Isolates System V IPC and POSIX message queues between containers.
- USER namespace: Maps container UIDs to unprivileged host UIDs. Root inside the container (UID 0) maps to a non-root UID on the host. This is what makes rootless containers possible.
cgroups v2 handle resource limits. Namespaces control what a process can see. cgroups control how much it can consume. The key controllers: cpu (time shares and bandwidth), memory (hard and soft limits, OOM priority), io (block device bandwidth), and pids (max process count, which stops fork bombs).
OCI Specification
The Open Container Initiative defines two standards. The Image Spec covers how to package a container as a layered tarball with a manifest and config JSON. The Runtime Spec covers how to execute a container given a root filesystem and configuration. Because these are separate specs, images built with Docker run on containerd, CRI-O, or any OCI-compliant runtime without changes.
Image Layer Mechanics
Images are a stack of read-only layers. Each Dockerfile instruction creates a new layer. At runtime, a thin writable layer sits on top using a union filesystem (overlay2 on Linux). Layers are content-addressable by SHA256 hash, so identical layers across different images get stored and transferred only once.
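The union-filesystem lookup can be pictured as a top-down search through the layer stack: the writable layer shadows the read-only image layers beneath it. A toy model (dicts standing in for directories):

```python
# Toy model of union-filesystem lookup: the writable layer shadows
# read-only image layers, and a path is resolved top-down.
def resolve(path: str, layers: list[dict]) -> str:
    """layers[0] is the topmost (writable) layer; each maps path -> contents."""
    for layer in layers:
        if path in layer:
            return layer[path]
    raise FileNotFoundError(path)

base = {"/etc/os-release": "debian", "/bin/sh": "shell-binary"}
app = {"/server": "app-binary"}
writable = {"/etc/os-release": "patched"}   # copy-on-write edit at runtime

stack = [writable, app, base]
print(resolve("/etc/os-release", stack))  # "patched" -- upper layer wins
print(resolve("/bin/sh", stack))          # "shell-binary" -- falls through to base
```

Real overlay2 adds whiteout files for deletions and copies files up on first write, but the shadowing rule is the same.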
```dockerfile
# Multi-stage build example
FROM golang:1.22 AS builder
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN CGO_ENABLED=0 go build -o /server

FROM gcr.io/distroless/static-debian12
COPY --from=builder /server /server
ENTRYPOINT ["/server"]
```
The builder stage pulls in the full Go toolchain at roughly 800MB. The final image has only the static binary and a minimal base, typically under 15MB. This is the single most impactful Dockerfile optimization most teams can make.
Production Considerations
- Image scanning: Scan images for CVEs in CI using Trivy or Grype. Block deployments with critical vulnerabilities. Scan base images and application dependencies, not just one or the other.
- Rootless execution: Run containers as a non-root user (`USER 1001` in the Dockerfile). In Kubernetes, set `securityContext.runAsNonRoot: true` so the admission controller rejects anything that tries to run as root.
- Read-only root filesystem: Set `readOnlyRootFilesystem: true` and mount writable tmpfs only where the app actually needs it. This limits what an attacker can do after getting in.
- Image signing: Use Cosign or Notary to sign images. Enforce signature verification at deploy time with an admission controller like Kyverno or OPA Gatekeeper. Skip this, and anyone who can push to the registry can push to production.
- Layer caching: Order Dockerfile instructions from least to most frequently changing. Copy dependency files before source code so `go mod download` or `npm install` layers survive across builds. Getting this wrong means rebuilding everything on every commit.
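The caching rule above can be modeled simply: a changed instruction invalidates itself and every layer after it. A toy simulation (the instruction list mirrors the multi-stage example; the helper is illustrative):

```python
# Toy model of Dockerfile layer caching: a change invalidates the first
# affected instruction and every later layer, so frequently-changing
# steps belong last.
def layers_to_rebuild(instructions: list[str], changed: set[str]) -> list[str]:
    """Return the instructions that must re-run, given which ones changed."""
    first_dirty = min(
        (i for i, ins in enumerate(instructions) if ins in changed),
        default=len(instructions),
    )
    return instructions[first_dirty:]

good_order = ["COPY go.mod go.sum ./", "RUN go mod download", "COPY . .", "RUN go build"]

# Editing source code only re-runs the last two steps:
print(layers_to_rebuild(good_order, {"COPY . ."}))
# Editing go.mod also re-runs the dependency download:
print(layers_to_rebuild(good_order, {"COPY go.mod go.sum ./"}))
```

With the dependency copy first, a source-only change keeps the expensive `go mod download` layer cached.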
Failure Scenarios
Scenario 1: Container Runtime Socket Unresponsive. The containerd daemon hangs because zombie containerd-shim-runc-v2 processes pile up (a shim leak). The kubelet can't start, stop, or health-check containers on that node. Existing containers keep running but nobody can manage them. The node goes NotReady after the kubelet heartbeat timeout (40s default), and all pods get evicted after pod-eviction-timeout (5 minutes). Detection: watch kubelet_runtime_operations_errors_total and kubelet_runtime_operations_duration_seconds{operation="container_status"} exceeding 30s. Recovery: restart containerd with systemctl restart containerd. It reattaches to running containers via shim PIDs, so workloads are preserved. Prevention: set max-container-log-size and configure shim cleanup. Running a containerd watchdog process on every node is worth the small overhead.
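A node-local watchdog like the one described above can be as simple as probing the runtime's Unix socket with a deadline: a healthy daemon accepts the connection quickly, while a hung one times out. A hypothetical sketch; the socket path is containerd's conventional location, and the timeout value is an assumption:

```python
# Hypothetical liveness probe for the containerd socket. A healthy daemon
# accepts the connection fast; a hung daemon (e.g. shim leak) times out.
# The 5s timeout is an assumed threshold, not a containerd default.
import socket

def runtime_socket_healthy(path: str = "/run/containerd/containerd.sock",
                           timeout: float = 5.0) -> bool:
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        return True
    except OSError:
        # Missing socket, refused connection, or timeout all count as unhealthy.
        return False
    finally:
        s.close()

print(runtime_socket_healthy("/nonexistent.sock"))  # False
```

In practice you would run this on a timer (systemd timer or a privileged DaemonSet) and restart containerd after consecutive failures.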
Scenario 2: Overlay Filesystem Corruption. The overlay2 storage driver corrupts a layer's metadata after a host kernel panic or unclean shutdown. Containers referencing that layer fail to start with mount: invalid argument. Every pod sharing that base layer on the affected node breaks. Detection: containerd_operations_errors_total{operation="prepare"} spikes, and inspecting the affected containers shows broken mounts. Recovery: remove the corrupted layer from /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/, then pull fresh layers. Prevention: back overlay2 with XFS formatted with ftype=1 (so the filesystem reports d_type support). ext4 works but XFS handles unclean shutdowns better. Monitor disk I/O errors before they cascade.
Scenario 3: Image Pull Backoff Cascade. A registry outage or rate limit (Docker Hub allows 100 pulls per 6 hours for unauthenticated users) triggers ImagePullBackOff across the cluster during a rolling deployment. New pods can't start, the Deployment controller terminates old pods, and the service loses capacity. This one catches people off guard. Detection: kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} rises. Recovery: use image pull-through caches (Harbor, Zot), pre-pull images to nodes via DaemonSets, or mirror critical images to a private registry. Depending on Docker Hub for production without a mirror makes it a matter of when, not if.
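The "backoff cascade" shape comes from the kubelet's exponential image-pull backoff, which to my understanding defaults to an initial 10s delay, doubling per failure, capped at 5 minutes. A sketch of that schedule (the defaults are stated as assumptions, not read from a running kubelet):

```python
# Sketch of the kubelet's image pull backoff: exponential, starting at 10s,
# doubling each failure, capped at 300s (assumed defaults behind
# the ImagePullBackOff status).
def backoff_schedule(attempts: int, initial: int = 10, cap: int = 300) -> list[int]:
    delays, delay = [], initial
    for _ in range(attempts):
        delays.append(min(delay, cap))
        delay *= 2
    return delays

print(backoff_schedule(7))  # [10, 20, 40, 80, 160, 300, 300]
```

After a few failures every retry is 5 minutes apart, which is why a short registry outage can stall a rollout for much longer than the outage itself.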
Capacity Planning
| Resource | Guideline | Real-World Reference |
|---|---|---|
| Image size target | < 100MB for microservices; < 50MB ideal | Google: distroless base images average 2-20MB |
| Registry storage | 1TB per 10K image tags (with deduplication) | GitHub Container Registry: petabytes with cross-repo layer sharing |
| Image pull throughput | 1 Gbps registry bandwidth per 100 concurrent node pulls | Uber: in-cluster P2P distribution (Kraken) for 1M+ container images/day |
| Build time target | < 5 min with cache; < 15 min cold | Shopify: 4-minute average with BuildKit distributed caching |
| Container density | 30-50 containers per node (4 vCPU/16GB) | Spotify: 40+ containers per node with strict memory limits |
| Layer count | < 10 layers per image (reduces pull and startup time) | Netflix: 3-5 layers per microservice image via multi-stage builds |
Key formulas: Registry bandwidth = concurrent_deployments * avg_image_size * nodes_pulling / deployment_window. For a 200-node cluster pulling 500MB images across a 50-node rolling update: 50 * 500MB = 25GB of registry egress per deployment. At a 1 Gbps link, that takes about 3.3 minutes per batch. When the cluster exceeds 500 nodes, P2P distribution (Dragonfly, Kraken) becomes worth the operational complexity. Uber's Kraken cut image distribution time from 20+ minutes to under 30 seconds across 2,000-node clusters.
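The arithmetic above can be packaged as a small helper for planning rollout batch sizes (the function and its parameter names are illustrative, not from any tool):

```python
# The registry-egress arithmetic from the text as a helper:
# time to push one rolling-update batch of image pulls over a given link.
def pull_window_seconds(nodes_pulling: int, image_size_gb: float,
                        link_gbps: float) -> float:
    """Seconds to transfer nodes_pulling * image_size_gb over link_gbps."""
    total_gbits = nodes_pulling * image_size_gb * 8  # GB -> Gb
    return total_gbits / link_gbps

# 50-node batch, 500MB image, 1 Gbps registry link:
secs = pull_window_seconds(nodes_pulling=50, image_size_gb=0.5, link_gbps=1)
print(f"{secs / 60:.1f} minutes")  # 3.3 minutes
```

Layer dedup and node-local caches only improve on this; the formula is the worst case where every node pulls every byte.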
Architecture Decision Record
ADR: Container Runtime Selection
Context: Choosing between containerd, CRI-O, Docker Engine, or Podman for a Kubernetes-based platform.
| Criteria (Weight) | containerd | CRI-O | Docker Engine | Podman |
|---|---|---|---|---|
| K8s compatibility (25%) | Native CRI, default in EKS/GKE/AKS | Native CRI, OpenShift default | Requires cri-dockerd shim since K8s 1.24 | Not a K8s runtime (dev only) |
| Resource overhead (20%) | Low (~50MB RSS) | Lowest (~30MB RSS) | High (~150MB RSS, dockerd + containerd) | N/A (daemonless) |
| Developer experience (15%) | No built-in CLI (use ctr/nerdctl) | Minimal tooling | Excellent (docker CLI, Compose, Desktop) | Docker CLI-compatible, rootless by default |
| Image build support (15%) | Via BuildKit (nerdctl build) | Delegates to Buildah | Native (docker build) | Via Buildah (podman build) |
| Security posture (15%) | Good; rootless support available | Good; minimal attack surface | Root daemon required by default | Excellent; daemonless, rootless-native |
| Community/ecosystem (10%) | CNCF graduated, broad adoption | Smaller community, Red Hat-driven | Largest ecosystem, declining in K8s context | Growing, RHEL/Fedora default |
Decision guidance: For production Kubernetes, pick containerd. It ships with EKS, GKE, and AKS, which means the most documentation and the fewest surprises. Pick CRI-O for OpenShift environments or when minimizing runtime attack surface is a hard security requirement. Keep Docker Desktop on developer workstations. Nothing else matches its local development experience, and fighting that just creates friction. Use Podman in CI pipelines that need rootless builds or where security policy prohibits running a Docker daemon. Docker Engine for Kubernetes clusters is a legacy choice at this point. The dockershim removal in K8s 1.24 made that clear.
Key Points
- Containers isolate processes using Linux namespaces and cgroups. They are not VMs.
- OCI defines the image format and runtime spec. Docker is just one implementation.
- containerd and CRI-O are the real runtimes in Kubernetes. Docker (dockershim) was removed in K8s 1.24.
- Image layers use copy-on-write. Shared base layers save disk space and speed up pulls.
- Multi-stage builds keep final images small by throwing away build-time dependencies.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| containerd | Open Source | Kubernetes default runtime, CNCF graduated | Medium-Enterprise |
| Docker Engine | Open Source | Developer experience, Docker Compose, build tooling | Small-Large |
| CRI-O | Open Source | Minimal K8s-focused runtime, OpenShift default | Medium-Enterprise |
| Podman | Open Source | Rootless containers, daemonless, Docker CLI-compatible | Small-Medium |
Common Mistakes
- Running containers as root. A compromised container gets host-level access.
- Using bloated base images (ubuntu:latest is 77MB) when distroless or alpine will do at 5MB.
- Storing secrets in image layers. They persist in layer history even after deletion in a later layer.
- Not pinning base image versions. `FROM python:3` pulls different images over time, so builds eventually break without any change to your code.
- Ignoring .dockerignore. The build context drags in junk files and builds slow down for no reason.