Linux Capabilities

Mental Model

A commercial kitchen used to hand out one master key ring -- knives, cash register, liquor cabinet, chemical storage, walk-in freezer, all on the same ring. Any cook who needed the knife drawer got the liquor cabinet too. Now each cook carries individual labeled keys. Pastry chef: sugar and flour drawers. Line cook: knife drawer. Nobody gets the cash register key unless they actually handle money. The head chef can permanently confiscate a key, and once it is gone, no copy can be made. A cook might carry keys in a pocket (permitted), but only the key actively in a lock (effective) opens anything.

The Problem

Binding port 80 traditionally requires root, and root carries everything: mount filesystems, load kernel modules, read /etc/shadow, kill any process, reboot the machine -- 40+ capabilities at once. One nginx vulnerability with full root hands an attacker the entire host. Scale that to a 500-container cluster where every container starts as root, and there are 500 places for that to happen. CVE-2019-5736 showed how far container root can reach: a root process inside a container overwrote the host runc binary through /proc/self/exe and escaped containment entirely.

Architecture

Why can ping send raw network packets without being root?

Most engineers never question it. But raw sockets are a privileged operation. In the old days, ping was a setuid-root binary. It ran with full superuser privileges just to send ICMP packets. Full power to read every file, kill every process, reboot the machine -- all for a network diagnostic tool.

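Capabilities changed this. On distributions that ship ping with file capabilities, the binary holds exactly one: CAP_NET_RAW. It can create raw sockets, but it gains none of root's other powers. This is not a minor detail. It is the least-privilege philosophy Linux has applied to privileged binaries since file capabilities arrived in kernel 2.6.24.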

What Actually Happens

Every thread in Linux has five capability sets. They interact in specific ways.

Permitted is the ceiling. It defines the maximum capabilities this thread can raise into its effective set. A thread can only shrink its own permitted set -- once a capability is dropped from it, the thread cannot get it back short of executing a program that grants it again.

Effective is what the kernel actually checks. When a process calls bind() on port 80, the kernel calls capable(CAP_NET_BIND_SERVICE) and looks at the effective set. A well-written program keeps its effective set empty and raises specific capabilities only when needed.

Inheritable interacts with file capabilities across execve(). A capability passes to the new program only if it is in both the thread's inheritable set AND the file's inheritable set. This two-key requirement makes inheritable sets cumbersome in practice.

Ambient (kernel 4.3+) solves the inheritance problem. Capabilities in the ambient set automatically carry across execve() for non-setuid, non-file-capability binaries. This is what makes systemd's AmbientCapabilities= work cleanly.

Bounding is the one-way gate. Drop a capability from the bounding set with prctl(PR_CAPBSET_DROP), and no descendant process can ever gain it -- not through setuid binaries, not through file capabilities, not through anything. This is irreversible. Container runtimes use it to permanently lock dangerous capabilities out of reach.
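The five sets described above are visible for any thread in /proc. The following is a minimal sketch, assuming a Linux /proc filesystem and a kernel new enough (4.3+) to report CapAmb; the function name show_cap_sets is made up for illustration.

```shell
# show_cap_sets [STATUS_FILE]: print the five capability masks of a thread
# as NAME=HEXMASK lines. Defaults to the status file of the current shell.
# (show_cap_sets is an illustrative helper, not part of any standard tool.)
show_cap_sets() {
    status_file="${1:-/proc/$$/status}"
    # The kernel reports CapInh, CapPrm, CapEff, CapBnd and CapAmb.
    awk '/^Cap(Inh|Prm|Eff|Bnd|Amb):/ { sub(":", "", $1); print $1 "=" $2 }' "$status_file"
}

# Only call it where /proc is available.
if [ -r "/proc/$$/status" ]; then
    show_cap_sets
fi
```

For an unprivileged shell, CapPrm and CapEff are normally all zeros while CapBnd is still full: the bounding set limits what could be gained, not what is currently held.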

Under the Hood

When the kernel needs to authorize a privileged operation, it calls capable(CAP_xxx) or ns_capable(). This replaces the old if (current->euid == 0) check throughout the kernel. The function examines the calling thread's effective capability set and returns allow or deny.

File capabilities are stored as extended attributes (security.capability) on binary files. setcap cap_net_bind_service=ep /usr/bin/myapp adds CAP_NET_BIND_SERVICE to the file's permitted set (p) and turns on the file's effective flag (e). When this binary is executed, the new thread gains the capability in its permitted set. The e flag causes it to also appear in the effective set automatically, so legacy programs that do not manage capabilities themselves still work.

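The transformation formula on execve() is where the complexity lives. The kernel computes: new_P = (old_I & fI) | (fP & bounding) | new_ambient, where new_ambient is the old ambient set, or empty if the file is setuid, setgid, or carries file capabilities. In plain terms: new permitted comes from the intersection of the thread's old inheritable set with file inheritable, unioned with file permitted (masked by bounding), plus ambient. The new effective set is all of new_P if the file's effective flag is set, otherwise just the ambient set. The complexity of this formula is exactly why ambient capabilities were added -- they provide a simple, predictable path.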
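The execve() rules can be worked through with plain shell arithmetic. This sketch follows the rules documented in capabilities(7), where the first term intersects the thread's inheritable set with the file's inheritable set. It is a simplified model, not the kernel's code: it ignores set-user-ID files and the special handling of UID 0, and the function name exec_caps is hypothetical.

```shell
# exec_caps P_INH P_BND P_AMB F_INH F_PRM F_EFF
# Masks are hex strings without the 0x prefix; F_EFF is 0 or 1 (file effective flag).
# Simplified model: set-user-ID files and UID 0 special cases are not handled.
exec_caps() {
    p_inh=$((0x$1)); p_bnd=$((0x$2)); p_amb=$((0x$3))
    f_inh=$((0x$4)); f_prm=$((0x$5)); f_eff=$6

    # Ambient capabilities are cleared when the file carries capabilities.
    if [ $(( f_inh | f_prm )) -ne 0 ] || [ "$f_eff" -ne 0 ]; then
        new_amb=0
    else
        new_amb=$p_amb
    fi

    # P'(permitted) = (P(inh) & F(inh)) | (F(prm) & P(bnd)) | P'(amb)
    new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | new_amb ))

    # P'(effective) = F(eff) ? P'(permitted) : P'(amb)
    if [ "$f_eff" -ne 0 ]; then new_eff=$new_prm; else new_eff=$new_amb; fi

    printf 'permitted=%x effective=%x ambient=%x\n' "$new_prm" "$new_eff" "$new_amb"
}

# CAP_NET_BIND_SERVICE is bit 10 (0x400); 1ffffffffff is a full 41-bit bounding set.
exec_caps 400 1ffffffffff 400 0 0 0      # ambient capability, plain binary
exec_caps 0 1ffffffffff 0 0 400 1        # binary with cap_net_bind_service=ep
exec_caps 0 1fffffffbff 0 0 400 0        # same file cap, but bit 10 dropped from bounding
```

The third call shows the one-way gate: with the bit missing from the bounding set, the file's permitted capability is masked away and the new thread ends up with nothing.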

User namespaces add another layer. Inside a user namespace, a process can have full capabilities while being completely unprivileged on the host. ns_capable() verifies capabilities in the relevant namespace. Some operations -- loading kernel modules, accessing raw hardware -- require capabilities in the init (host) namespace, which user namespace root does not have.

Common Questions

How does Docker decide which capabilities to grant?

Docker starts with a whitelist of about 14 capabilities needed for basic container operation: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, KILL, SETGID, SETUID, SETPCAP, NET_BIND_SERVICE, NET_RAW, SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP. Everything else is dropped, including CAP_SYS_ADMIN, CAP_NET_ADMIN, and CAP_SYS_PTRACE. The recommended pattern is --cap-drop ALL --cap-add <specific>, adding back only what the application actually needs.
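The drop-all-then-add-back pattern can be wrapped in a small helper. This is a sketch only: docker_cap_flags is a hypothetical function name and my-web-image is a placeholder image, while --cap-drop and --cap-add are the Docker flags named above. The command is echoed rather than executed so the snippet runs without Docker installed.

```shell
# docker_cap_flags CAP...: print flags that drop every capability and add
# back only the ones named on the command line.
# (docker_cap_flags is an illustrative helper, not a Docker feature.)
docker_cap_flags() {
    flags="--cap-drop ALL"
    for cap in "$@"; do
        flags="$flags --cap-add $cap"
    done
    printf '%s\n' "$flags"
}

# Dry run: print the command instead of running it.
# "my-web-image" is a placeholder image name.
echo docker run --rm $(docker_cap_flags NET_BIND_SERVICE) my-web-image
```

Starting from an empty set and adding capabilities one at a time makes each grant a visible decision, instead of relying on whatever the runtime default happens to include.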

Why is CAP_SYS_ADMIN considered so dangerous?

It is the dumping ground for operations that never got their own capability. It controls: mount/umount, sethostname, quotas, BPF operations, namespace creation, and many ioctl calls. A process with CAP_SYS_ADMIN can mount filesystems, manipulate namespaces, load BPF programs, and generally escape most containment. The kernel developers keep splitting things out -- CAP_SYSLOG, CAP_BPF, CAP_PERFMON, and CAP_CHECKPOINT_RESTORE were all carved from CAP_SYS_ADMIN -- but it remains dangerously broad.
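Because CAP_SYS_ADMIN is so broad, it is worth knowing which processes hold it. A minimal audit sketch, assuming a Linux /proc layout; the function name list_sys_admin_pids is hypothetical, and the capability number 21 comes from linux/capability.h.

```shell
CAP_SYS_ADMIN=21   # capability number from <linux/capability.h>

# list_sys_admin_pids [PROC_ROOT]: print "PID NAME" for every process whose
# effective set includes CAP_SYS_ADMIN. PROC_ROOT defaults to /proc.
# (list_sys_admin_pids is an illustrative helper name.)
list_sys_admin_pids() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # Processes can exit mid-scan, so read errors are ignored.
        eff=$(awk '/^CapEff:/ { print $2 }' "$status" 2>/dev/null)
        [ -n "$eff" ] || continue
        if [ $(( (0x$eff >> $CAP_SYS_ADMIN) & 1 )) -eq 1 ]; then
            name=$(awk '/^Name:/ { print $2 }' "$status" 2>/dev/null)
            pid=${status%/status}
            pid=${pid##*/}
            printf '%s %s\n' "$pid" "$name"
        fi
    done
}
```

Running list_sys_admin_pids on a host lists mostly root daemons; inside a default Docker container it should print nothing, since CAP_SYS_ADMIN is not in the default set.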

How do ambient capabilities simplify things?

Before ambient capabilities, granting a capability to a non-root service required setuid root (overprivileged) or file capabilities on the binary (breaks on upgrades, requires xattr support). Ambient capabilities let a service manager set capabilities that automatically inherit across exec without any binary modification. systemd's AmbientCapabilities=CAP_NET_BIND_SERVICE makes this trivial.

How Technologies Use This

Docker

A root process inside a container calls mount() to attach the host filesystem and escapes isolation entirely. With all 40+ kernel capabilities including CAP_SYS_ADMIN, container root can mount filesystems, manipulate namespaces, and load BPF programs -- effectively owning the host.

The root cause is that traditional Linux root is all-or-nothing. Running as UID 0 grants every kernel privilege simultaneously, and there is no way to say a process should have root's file ownership abilities but not root's ability to mount filesystems. CAP_SYS_ADMIN alone controls mounting, namespace creation, BPF loading, and dozens of other operations.

Docker drops all capabilities at container start and adds back only about 14 lower-risk ones like CHOWN, SETUID, and NET_BIND_SERVICE. Critically, CAP_SYS_ADMIN is removed from the bounding set, meaning no process inside the container can ever regain it -- not through setuid binaries, not through file capabilities, not through anything. That leaves roughly a third of the 40+ capabilities available, so about two-thirds of root's privileged operations are out of reach even for container root.

Kubernetes

A compromised pod loads eBPF programs, manipulates network routing, and effectively owns the node. The pod was running a simple web application that never needed any of these privileges, but it ran privileged or with added capabilities such as CAP_SYS_ADMIN and CAP_NET_ADMIN, so the attacker got all of them.

The problem is that without explicitly dropping capabilities, every pod starts with more kernel privileges than it will ever use. Default container capabilities include CAP_NET_RAW, CAP_SYS_CHROOT, and others that most workloads never need. A compromised pod inherits all of them, turning a web application exploit into a node-level compromise.

The recommended Kubernetes pattern is securityContext.capabilities.drop: [ALL] followed by adding back only what the pod requires. Pod Security Standards enforce this at the cluster level: the restricted profile requires dropping ALL and permits adding back only NET_BIND_SERVICE, which rules out CAP_SYS_ADMIN and CAP_NET_RAW. Applying drop-all across a 200-pod cluster removes the roughly 14 default capabilities from every pod, leaving each one with only the few narrowly scoped capabilities it explicitly adds.
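A pod spec following the drop-all pattern might look like the manifest printed below. This is a sketch under assumptions: pod_manifest is a hypothetical helper, and the pod name and image are placeholders. The securityContext fields themselves (allowPrivilegeEscalation, capabilities.drop, capabilities.add) are standard Kubernetes fields.

```shell
# pod_manifest NAME IMAGE: print a Pod manifest whose container drops all
# capabilities and adds back only NET_BIND_SERVICE.
# (pod_manifest is an illustrative helper; NAME and IMAGE are placeholders.)
pod_manifest() {
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: $1
spec:
  containers:
  - name: $1
    image: $2
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"]
EOF
}

# Print the manifest; pipe it to "kubectl apply -f -" to create the pod.
pod_manifest web example.com/web:1.0
```

Kubernetes capability names omit the CAP_ prefix, so the entry is NET_BIND_SERVICE rather than CAP_NET_BIND_SERVICE.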

systemd

A web server needs to bind port 80, but doing so requires running as full root with access to reboot the machine, read /etc/shadow, and kill every process on the system. A vulnerability in the web server would give the attacker all 40+ root capabilities.

The underlying problem is that Linux port binding below 1024 traditionally requires root, and root is an all-or-nothing privilege. There is no built-in way to say a process needs only the ability to bind low ports without also granting it every other root capability.

systemd solves this with AmbientCapabilities=CAP_NET_BIND_SERVICE combined with User=www-data, granting the Nginx process exactly one permission: binding ports below 1024. No setuid binary, no root UID, no other capabilities in the effective set. If the service is compromised, the attacker has www-data privileges plus one network capability instead of all 40+ that come with full root.
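A unit built on this idea could be generated as follows. This is a sketch, not a drop-in unit: unit_file is a hypothetical helper and /usr/local/bin/myserver is a placeholder path. User=, AmbientCapabilities=, CapabilityBoundingSet= and NoNewPrivileges= are real systemd directives; the last two are an optional extra tightening beyond what the text describes.

```shell
# unit_file EXEC_PATH: print a service unit that runs EXEC_PATH as www-data
# with CAP_NET_BIND_SERVICE as its only capability.
# (unit_file is an illustrative helper; EXEC_PATH is a placeholder.)
unit_file() {
    cat <<EOF
[Unit]
Description=Web server on port 80 without root

[Service]
User=www-data
ExecStart=$1
AmbientCapabilities=CAP_NET_BIND_SERVICE
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=yes

[Install]
WantedBy=multi-user.target
EOF
}

# Print the unit; redirect to a file under /etc/systemd/system to install it.
unit_file /usr/local/bin/myserver
```

Restricting CapabilityBoundingSet= to the same single capability means that even a setuid binary launched by the service cannot pick up anything else.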

Same Concept Across Tech

Concept	Docker	JVM	Node.js	Go	K8s
Capability control	--cap-drop/--cap-add flags	SecurityManager (deprecated in 17+)	N/A (OS-level only)	unix.Prctl with PR_CAPBSET_DROP (golang.org/x/sys/unix)	securityContext.capabilities.drop/add
Least privilege default	Drops all but ~14 safe caps	No sandbox by default	No sandbox by default	No sandbox by default	Depends on Pod Security Standards
Privileged escape hatch	--privileged (all caps + devices)	N/A	N/A	N/A	privileged: true in securityContext
File capabilities	setcap on container binaries	N/A	N/A	setcap on Go binary	Init container with setcap
Bounding set control	Bounding set locked at start	N/A	N/A	prctl(PR_CAPBSET_DROP)	Inherited from container runtime

Stack Layer Mapping

Layer	Capability Mechanism
Hardware	N/A (capabilities are a kernel abstraction)
Kernel	capable() / ns_capable() checks in syscall paths
System libraries	libcap / libcap-ng wrap capget/capset syscalls
Container runtime	Drops capabilities before exec of entrypoint
Orchestrator	Pod Security Standards enforce capability policies
Application	Capability-aware programs raise/drop from effective set

Design Rationale

A web server that binds port 80 has no business loading kernel modules. Capabilities split the monolithic root privilege so it does not have to. The bounding set adds a hard ceiling -- once a container is running, no code path can claw back what was dropped. Ambient capabilities (kernel 4.3) came later because the inheritable + file capability dance was too convoluted for service managers to use in practice.

If You See This, Think This

Symptom	Likely Cause	First Check
EACCES on bind() to port 80	Missing CAP_NET_BIND_SERVICE	`getpcaps <pid>` to verify effective set
Container escape via mount()	CAP_SYS_ADMIN not dropped	`docker inspect --format '{{.HostConfig.CapAdd}}' <container>`
setcap silently fails	Filesystem mounted nosuid or no xattr support	`mount` to list mount options
Capabilities lost after execve()	Ambient set not configured, file caps missing	`cat /proc/<pid>/status` and compare the Cap* lines
Ping fails for non-root users	CAP_NET_RAW missing from ping binary	`getcap /usr/bin/ping`
Capabilities present but operation denied	SELinux or seccomp blocking the syscall	`ausearch -m AVC -ts recent` or check seccomp logs

When to Use / Avoid

Use when:

Running services that need one or two privileged operations but not full root
Hardening container security contexts in Docker or Kubernetes
Replacing setuid-root binaries with narrowly scoped file capabilities
Building systemd service units that bind low ports as non-root users
Auditing which processes hold dangerous capabilities like CAP_SYS_ADMIN

Avoid when:

The process genuinely needs full root (system installers, early boot init)
Running inside user namespaces where capabilities are already namespace-scoped
The binary is short-lived and the overhead of capability management exceeds the security benefit

Try It Yourself

# Show capabilities of the current shell
getpcaps $$ 2>/dev/null || grep -i '^cap' /proc/$$/status

# Print the current capability sets, bounding set, and securebits
command -v capsh >/dev/null && capsh --print | head -20 || grep -i '^cap' /proc/$$/status

# Find binaries with file capabilities under /usr
command -v getcap >/dev/null && getcap -r /usr 2>/dev/null | head -10 || echo 'getcap not available'

# Decode the hex effective set from /proc into capability names
capsh --decode="$(awk '/^CapEff:/ {print $2}' /proc/$$/status)" 2>/dev/null || echo 'capsh not available'

# Show capabilities of all running processes
command -v pscap >/dev/null && pscap 2>/dev/null | head -15 || echo 'pscap not available (install libcap-ng-utils)'

# Drop CAP_SYS_ADMIN from the bounding set and print the result
# (dropping from the bounding set needs CAP_SETPCAP, so run this as root)
capsh --drop=cap_sys_admin --print 2>/dev/null | grep -i 'bounding' || echo 'capsh not available or missing CAP_SETPCAP'
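When capsh is not installed, a hex mask can still be decoded with shell arithmetic alone. The sketch below assumes the capability numbering from linux/capability.h up to CAP_CHECKPOINT_RESTORE (bit 40); decode_caps is a hypothetical helper name. The sample mask 00000000a80425fb corresponds to the 14 default Docker capabilities listed earlier.

```shell
# Capability names in bit order (bit 0 first), per <linux/capability.h>.
CAP_NAMES="chown dac_override dac_read_search fowner fsetid kill setgid setuid
setpcap linux_immutable net_bind_service net_broadcast net_admin net_raw
ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace sys_pacct
sys_admin sys_boot sys_nice sys_resource sys_time sys_tty_config mknod lease
audit_write audit_control setfcap mac_override mac_admin syslog wake_alarm
block_suspend audit_read perfmon bpf checkpoint_restore"

# decode_caps HEXMASK: print one capability name per set bit.
# HEXMASK is a hex string without 0x, as shown in /proc/<pid>/status.
# (decode_caps is an illustrative stand-in for "capsh --decode".)
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in $CAP_NAMES; do
        if [ $(( ($mask >> $bit) & 1 )) -eq 1 ]; then
            printf 'cap_%s\n' "$name"
        fi
        bit=$((bit + 1))
    done
}

decode_caps 00000000a80425fb
```

Bits above 40 are ignored by this table, so a newer kernel that adds capabilities would need the list extended.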

Debug Checklist

1. getpcaps $$ -- show capabilities of current shell
2. cat /proc/<pid>/status | grep -i cap -- raw hex capability sets
3. capsh --decode=<hex> -- decode hex to human-readable names
4. getcap -r /usr 2>/dev/null -- find all binaries with file capabilities
5. pscap 2>/dev/null -- list capabilities of all running processes
6. grep NoNewPrivs /proc/<pid>/status -- check if no_new_privs bit is set

Key Takeaways

✓ There are 40+ capabilities in modern kernels, but a handful dominate real-world usage: CAP_NET_BIND_SERVICE (bind ports below 1024), CAP_NET_RAW (raw sockets for ping/tcpdump), CAP_SYS_ADMIN (the dangerous catch-all that is basically mini-root), CAP_DAC_OVERRIDE (bypass file permissions), and CAP_SETUID/CAP_SETGID (change identity).
✓ The bounding set is an irreversible ceiling. Drop CAP_SYS_ADMIN from it, and no child process can ever gain that capability again -- not through setuid binaries, not through file capabilities, not through anything. This is how container runtimes permanently lock the door on dangerous privileges.
✓ File capabilities replace setuid root for specific use cases. 'setcap cap_net_bind_service=ep /usr/bin/myserver' lets a binary bind port 80 without ever running as root. Much safer than chmod u+s, because the binary only gets the one permission it needs.
✓ CAP_SYS_ADMIN is the 'new root.' It controls mount, sethostname, BPF, quotas, namespaces, and dozens of other operations. A process with CAP_SYS_ADMIN can do almost anything root can. Container runtimes drop it by default for exactly this reason.
✓ When a setuid-root binary runs, the process gets ALL capabilities in its permitted and effective sets. When it then switches all of its UIDs to non-root, the kernel clears those sets unless the program first called prctl(PR_SET_KEEPCAPS). With that flag set, the permitted set survives the switch and the program can re-raise only what it needs. That is how a setuid-root ping can drop to your UID and still hold CAP_NET_RAW for raw sockets.

Common Pitfalls

✗ Mistake: Granting CAP_SYS_ADMIN to a container 'because it needs to mount filesystems.' Reality: CAP_SYS_ADMIN is nearly equivalent to full root. Use bind mounts from the host, or run the specific operation in an init container with a narrow capability set.
✗ Mistake: Setting file capabilities without understanding version semantics. Reality: File capabilities have a version field. v2 (Linux 2.6.25+) stores 64-bit permitted and inheritable masks plus the effective flag. v3 (Linux 4.14+) adds a namespace root user ID, and the capabilities take effect only in user namespaces whose root maps to that ID. Anywhere else they are ignored -- no error, just no capabilities.
✗ Mistake: Dropping from the effective set but not the permitted set, thinking the process is restricted. Reality: The process (or a compromised library) can raise the capability back into effective at any time. Drop from permitted for permanent restriction.
✗ Mistake: Forgetting that capabilities are per-thread, not per-process. Reality: A multithreaded program that drops capabilities in one thread still has them in all others. Each thread has its own effective/permitted/inheritable sets. Use prctl(PR_SET_KEEPCAPS) carefully across setuid transitions.

Reference

System Calls

capget, capset, prctl

Tools

getpcaps / capsh, setcap / getcap, pscap (libcap-ng-utils)

In One Line

Drop what the process does not need; drop from the bounding set to make it irreversible.
