Linux Capabilities
Mental Model
A commercial kitchen used to hand out one master key ring -- knives, cash register, liquor cabinet, chemical storage, walk-in freezer, all on the same ring. Any cook who needed the knife drawer got the liquor cabinet too. Now each cook carries individual labeled keys. Pastry chef: sugar and flour drawers. Line cook: knife drawer. Nobody gets the cash register key unless they actually handle money. The head chef can permanently confiscate a key, and once it is gone, no copy can be made. A cook might carry keys in a pocket (permitted), but only the key actively in a lock (effective) opens anything.
The Problem
Binding port 80 traditionally requires root, and root carries everything: mount filesystems, load kernel modules, read /etc/shadow, kill any process, reboot the machine -- 40+ capabilities at once. One nginx vulnerability with full root hands an attacker the entire host. Scale that to a 500-container cluster where every container starts as root, and there are 500 places for that to happen. CVE-2019-5736 showed what is at stake: a malicious process running as root inside a container overwrote the host runc binary and gained root on the host.
Architecture
Why can ping send raw network packets without being root?
Most engineers never question it. But raw sockets are a privileged operation. In the old days, ping was a setuid-root binary. It ran with full superuser privileges just to send ICMP packets. Full power to read every file, kill every process, reboot the machine -- all for a network diagnostic tool.
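Capabilities changed this. On many distributions, ping now carries exactly one file capability: CAP_NET_RAW. It can create raw sockets and nothing else. (Some newer distributions go further and let ping use unprivileged ICMP sockets, so it needs no capability at all.) This is not a minor detail. Capabilities have existed since kernel 2.2, and file capabilities, which make this per-binary grant possible, arrived in kernel 2.6.24.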
What Actually Happens
Every thread in Linux has five capability sets. They interact in specific ways.
Permitted is the ceiling. It defines the maximum capabilities this thread can ever use. The permitted set can only shrink -- once a capability is dropped from it, the thread cannot get it back, short of executing a setuid-root binary or a file whose capabilities grant it.
Effective is what the kernel actually checks. When a process calls bind() on port 80, the kernel calls capable(CAP_NET_BIND_SERVICE) and looks at the effective set. A well-written program keeps its effective set empty and raises specific capabilities only when needed.
Inheritable interacts with file capabilities across execve(). A capability passes to the new program only if it is in both the thread's inheritable set AND the file's inheritable set. This two-key requirement makes inheritable sets cumbersome in practice.
Ambient (kernel 4.3+) solves the inheritance problem. Capabilities in the ambient set automatically carry across execve() for non-setuid, non-file-capability binaries. This is what makes systemd's AmbientCapabilities= work cleanly.
Bounding is the one-way gate. Drop a capability from the bounding set with prctl(PR_CAPBSET_DROP), which itself requires CAP_SETPCAP, and no descendant process can gain it through setuid binaries or file permitted capabilities. This is irreversible. One caveat: the bounding set does not mask the inheritable set, so the capability should be cleared from inheritable as well. Container runtimes use the bounding set to permanently lock dangerous capabilities out of reach.
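The five sets described above are visible for any process in its /proc status file, as the fields CapInh, CapPrm, CapEff, CapBnd, and CapAmb. A minimal sketch for reading them follows; the helper name `cap_sets` is made up for illustration, and it assumes a kernel new enough (4.3+) to report CapAmb.

```shell
# Print the five capability sets recorded in a /proc status file.
# Usage: cap_sets [status_file]   (defaults to the calling shell's own status)
# cap_sets is a hypothetical helper name, not part of libcap.
cap_sets() {
    status_file="${1:-/proc/$$/status}"
    # Kernel field names: CapInh, CapPrm, CapEff, CapBnd, CapAmb
    awk '
        /^CapInh:/ { print "inheritable " $2 }
        /^CapPrm:/ { print "permitted   " $2 }
        /^CapEff:/ { print "effective   " $2 }
        /^CapBnd:/ { print "bounding    " $2 }
        /^CapAmb:/ { print "ambient     " $2 }
    ' "$status_file"
}

# Show the sets of the current shell, if /proc is available.
if [ -r "/proc/$$/status" ]; then
    cap_sets
fi
```

Note that /proc/<pid>/status reports the main thread. Since the sets are per-thread, other threads are found under /proc/<pid>/task/<tid>/status.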
Under the Hood
When the kernel needs to authorize a privileged operation, it calls capable(CAP_xxx) or ns_capable(). This replaces the old if (current->euid == 0) check throughout the kernel. The function examines the calling thread's effective capability set and returns allow or deny.
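From userspace, the same yes-or-no question the kernel asks can be approximated by testing one bit of the effective mask. The sketch below is only a model of that check, not the kernel's code; `has_cap` is a hypothetical helper, and the capability numbers are taken from <linux/capability.h>.

```shell
# Return success if capability number $2 is set in hex mask $1.
# The mask is a hex string as printed in /proc/<pid>/status (no 0x prefix).
# has_cap is a hypothetical helper name.
has_cap() {
    mask=$((0x$1))
    [ $(( (mask >> $2) & 1 )) -eq 1 ]
}

CAP_NET_BIND_SERVICE=10   # value from <linux/capability.h>

# Read the effective set of the current shell; empty if /proc is unavailable.
eff=$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status" 2>/dev/null)
if [ -n "$eff" ] && has_cap "$eff" "$CAP_NET_BIND_SERVICE"; then
    echo "CAP_NET_BIND_SERVICE is in the effective set"
else
    echo "CAP_NET_BIND_SERVICE is not in the effective set"
fi
```

This only inspects the effective set. It does not model ns_capable(), which also takes the relevant user namespace into account.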
File capabilities are stored as extended attributes on binary files. setcap cap_net_bind_service=ep /usr/bin/myapp adds CAP_NET_BIND_SERVICE to the file's permitted (p) and effective (e) sets. When this binary is executed, the new thread gains the capability in its permitted set. The e flag causes it to also appear in the effective set automatically, so legacy programs that do not manage capabilities themselves still work.
The transformation formula on execve() is where the complexity lives. The kernel computes: new_P = (old_I & fI) | (fP & bounding) | new_ambient. In plain terms: new permitted comes from the intersection of the thread's inheritable set with the file's inheritable set, unioned with file permitted (masked by bounding), plus the ambient set (which is cleared first if the file is setuid, setgid, or carries file capabilities). The complexity of this formula is exactly why ambient capabilities were added -- they provide a simple, predictable path.
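To make the execve() rules concrete, here is a small arithmetic model of them as documented in capabilities(7), where the inheritable term uses the thread's inheritable set. It is a sketch under simplifying assumptions: `exec_caps` is a made-up name, setuid/setgid files and the special handling of UID 0 are not modeled, and a file counts as "privileged" only if it has a non-empty inheritable or permitted set.

```shell
# Model of the execve() capability rules from capabilities(7).
# All masks are hex strings without a 0x prefix:
#   $1 thread inheritable  $2 thread bounding  $3 thread ambient
#   $4 file inheritable    $5 file permitted   $6 file effective flag (0 or 1)
# Prints "permitted=<hex> effective=<hex>" for the new program.
exec_caps() {
    p_inh=$((0x$1)); p_bnd=$((0x$2)); p_amb=$((0x$3))
    f_inh=$((0x$4)); f_prm=$((0x$5)); f_eff=$6

    # Ambient is cleared when the file carries capabilities (simplified test).
    if [ "$f_inh" -ne 0 ] || [ "$f_prm" -ne 0 ]; then
        new_amb=0
    else
        new_amb=$p_amb
    fi

    # new_P = (old_I & fI) | (fP & bounding) | new_ambient
    new_prm=$(( (p_inh & f_inh) | (f_prm & p_bnd) | new_amb ))

    # With the file effective flag set, effective mirrors permitted;
    # otherwise only ambient capabilities start out effective.
    if [ "$f_eff" -eq 1 ]; then
        new_eff=$new_prm
    else
        new_eff=$new_amb
    fi
    printf 'permitted=%x effective=%x\n' "$new_prm" "$new_eff"
}

# Binary with cap_net_bind_service=ep (bit 10 = 0x400), full bounding set:
exec_caps 0 1ffffffffff 0 0 400 1     # permitted=400 effective=400
# Same capability only in the thread's inheritable set, plain binary:
exec_caps 400 1ffffffffff 0 0 0 0     # permitted=0 effective=0
# Same capability in inheritable and ambient, plain binary:
exec_caps 400 1ffffffffff 400 0 0 0   # permitted=400 effective=400
```

The second and third calls show why the inheritable set alone is not enough for an ordinary binary, and why ambient is the simpler path. One difference from the real kernel: when the file effective flag is set and the bounding set strips a file permitted capability, the kernel fails the exec with EPERM rather than running the program with fewer capabilities.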
User namespaces add another layer. Inside a user namespace, a process can have full capabilities while being completely unprivileged on the host. ns_capable() verifies capabilities in the relevant namespace. Some operations -- loading kernel modules, accessing raw hardware -- require capabilities in the init (host) namespace, which user namespace root does not have.
Common Questions
How does Docker decide which capabilities to grant?
Docker starts with a whitelist of about 14 capabilities needed for basic container operation: CHOWN, DAC_OVERRIDE, FSETID, FOWNER, KILL, SETGID, SETUID, SETPCAP, NET_BIND_SERVICE, NET_RAW, SYS_CHROOT, MKNOD, AUDIT_WRITE, SETFCAP. Everything else is dropped, including CAP_SYS_ADMIN, CAP_NET_ADMIN, and CAP_SYS_PTRACE. The recommended pattern is --cap-drop ALL --cap-add <specific>, adding back only what the application actually needs.
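The 14 names above correspond to a specific bit mask, which is what CapEff in /proc/<pid>/status shows for a process in a container running with these defaults. The sketch below rebuilds that mask from the names; `cap_bit` and `caps_to_mask` are hypothetical helpers, and the bit numbers come from <linux/capability.h>.

```shell
# Map a capability name (without the CAP_ prefix, any case) to its bit number.
# Numbers are from <linux/capability.h>; only the names used here are listed.
cap_bit() {
    case "$(printf '%s' "$1" | tr 'A-Z' 'a-z')" in
        chown) echo 0 ;; dac_override) echo 1 ;; fowner) echo 3 ;;
        fsetid) echo 4 ;; kill) echo 5 ;; setgid) echo 6 ;;
        setuid) echo 7 ;; setpcap) echo 8 ;; net_bind_service) echo 10 ;;
        net_raw) echo 13 ;; sys_chroot) echo 18 ;; mknod) echo 27 ;;
        audit_write) echo 29 ;; setfcap) echo 31 ;;
        *) return 1 ;;
    esac
}

# Combine capability names into the 16-digit hex format used by /proc.
caps_to_mask() {
    mask=0
    for name in "$@"; do
        bit=$(cap_bit "$name") || { echo "unknown capability: $name" >&2; return 1; }
        mask=$(( mask | (1 << bit) ))
    done
    printf '%016x\n' "$mask"
}

# The default set listed above:
caps_to_mask CHOWN DAC_OVERRIDE FSETID FOWNER KILL SETGID SETUID SETPCAP \
    NET_BIND_SERVICE NET_RAW SYS_CHROOT MKNOD AUDIT_WRITE SETFCAP
# prints 00000000a80425fb
```

With --cap-drop ALL --cap-add NET_BIND_SERVICE the same calculation gives 0000000000000400, which makes it easy to confirm from inside a container that the narrower set actually took effect.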
Why is CAP_SYS_ADMIN considered so dangerous?
It is the dumping ground for operations that never got their own capability. It controls: mount/umount, sethostname, quotas, syslog, BPF operations, namespace creation, and many ioctl calls. A process with CAP_SYS_ADMIN can mount filesystems, manipulate namespaces, load BPF programs, and generally escape most containment. The kernel developers keep splitting things out -- CAP_BPF, CAP_PERFMON, CAP_CHECKPOINT_RESTORE were all carved from CAP_SYS_ADMIN -- but it remains dangerously broad.
How do ambient capabilities simplify things?
Before ambient capabilities, granting a capability to a non-root service required setuid root (overprivileged) or file capabilities on the binary (breaks on upgrades, requires xattr support). Ambient capabilities let a service manager set capabilities that automatically inherit across exec without any binary modification. systemd's AmbientCapabilities=CAP_NET_BIND_SERVICE makes this trivial.
How Technologies Use This
A root process inside a container calls mount() to attach the host filesystem and escapes isolation entirely. With all 40+ kernel capabilities including CAP_SYS_ADMIN, container root can mount filesystems, manipulate namespaces, and load BPF programs -- effectively owning the host.
The root cause is that traditional Linux root is all-or-nothing. Running as UID 0 grants every kernel privilege simultaneously, and there is no way to say a process should have root's file ownership abilities but not root's ability to mount filesystems. CAP_SYS_ADMIN alone controls mounting, namespace creation, BPF loading, and dozens of other operations.
Docker drops all capabilities at container start and adds back only about 14 lower-risk ones like CHOWN, SETUID, and NET_BIND_SERVICE. Critically, CAP_SYS_ADMIN is removed from the bounding set, meaning no process inside the container can regain it -- not through setuid binaries, not through file capabilities. That leaves roughly a third of the 40+ capabilities reachable from inside the container.
A compromised pod loads eBPF programs, manipulates network routing, and effectively owns the node. The pod was running a simple web application that never needed any of these privileges, but it had been deployed as privileged (or with extra capabilities added), which gave it all of them.
The problem is that without explicitly dropping capabilities, every pod starts with more kernel privileges than it will ever use. Default container capabilities include CAP_NET_RAW, CAP_SYS_CHROOT, and others that most workloads never need. A compromised pod inherits all of them, which gives an attacker more to work with when escalating from a web application exploit toward the node.
The recommended Kubernetes pattern is securityContext.capabilities.drop: ["ALL"] followed by adding back only what the pod requires. Pod Security Standards enforce this at the cluster level: the restricted profile requires dropping ALL and allows adding back only NET_BIND_SERVICE, which rules out CAP_SYS_ADMIN and CAP_NET_RAW. Applying drop-all removes the roughly 14 default capabilities from each pod, leaving it with none or with the one or two narrowly scoped ones it actually needs.
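As an illustration of the drop-all pattern, the helper below prints a container-level securityContext fragment that could be pasted into a pod spec. The function name `k8s_security_context` is invented for this sketch, and allowPrivilegeEscalation: false is an extra hardening field added here as an assumption, not something the text above prescribes.

```shell
# Print a container-level securityContext fragment that drops every
# capability and adds back only those named as arguments.
# k8s_security_context is a hypothetical helper name.
k8s_security_context() {
    printf 'securityContext:\n'
    printf '  allowPrivilegeEscalation: false\n'   # assumption: extra hardening
    printf '  capabilities:\n'
    printf '    drop: ["ALL"]\n'
    if [ "$#" -gt 0 ]; then
        printf '    add:\n'
        for cap in "$@"; do
            # Kubernetes expects names without the CAP_ prefix.
            printf '      - %s\n' "$cap"
        done
    fi
}

# A web server that only needs to bind a low port:
k8s_security_context NET_BIND_SERVICE
```

Called with no arguments, the helper emits the drop-all fragment with no add list, which suits workloads that listen on a port above 1024.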
A web server needs to bind port 80, but doing so requires running as full root with access to reboot the machine, read /etc/shadow, and kill every process on the system. A vulnerability in the web server would give the attacker all 40+ root capabilities.
The underlying problem is that Linux port binding below 1024 traditionally requires root, and root is an all-or-nothing privilege. There is no built-in way to say a process needs only the ability to bind low ports without also granting it every other root capability.
systemd solves this with AmbientCapabilities=CAP_NET_BIND_SERVICE combined with User=www-data, granting the Nginx process exactly one permission: binding ports below 1024. No setuid binary, no root UID, no other capabilities in the effective set. If the service is compromised, the attacker has www-data privileges plus one network capability instead of full root: one capability out of 40+.
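A sketch of what such a unit configuration might look like, printed by a small helper so it can be redirected into a drop-in file. The helper name `systemd_bind_dropin` is hypothetical. User= and AmbientCapabilities= come from the text above; CapabilityBoundingSet= and NoNewPrivileges= are additional systemd options included here as an assumption about sensible hardening.

```shell
# Print a [Service] drop-in that runs a service as an unprivileged user
# with only CAP_NET_BIND_SERVICE. systemd_bind_dropin is a hypothetical name.
systemd_bind_dropin() {
    user="$1"
    cat <<EOF
[Service]
User=$user
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Assumption: also cap the bounding set and forbid gaining privileges on exec.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
NoNewPrivileges=true
EOF
}

systemd_bind_dropin www-data
```

Limiting the bounding set to the same single capability means that even a setuid binary launched by the service cannot end up with more than the service itself was given.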
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Capability control | --cap-drop/--cap-add flags | SecurityManager (deprecated in 17+) | N/A (OS-level only) | unix.Prctl (golang.org/x/sys/unix) for PR_CAPBSET_DROP | securityContext.capabilities.drop/add |
| Least privilege default | Drops all but ~14 safe caps | No sandbox by default | No sandbox by default | No sandbox by default | Depends on Pod Security Standards |
| Privileged escape hatch | --privileged (all caps + devices) | N/A | N/A | N/A | privileged: true in securityContext |
| File capabilities | setcap on container binaries | N/A | N/A | setcap on Go binary | Init container with setcap |
| Bounding set control | Bounding set locked at start | N/A | N/A | prctl(PR_CAPBSET_DROP) | Inherited from container runtime |
Stack Layer Mapping
| Layer | Capability Mechanism |
|---|---|
| Hardware | N/A (capabilities are a kernel abstraction) |
| Kernel | capable() / ns_capable() checks in syscall paths |
| System libraries | libcap / libcap-ng wrap capget/capset syscalls |
| Container runtime | Drops capabilities before exec of entrypoint |
| Orchestrator | Pod Security Standards enforce capability policies |
| Application | Capability-aware programs raise/drop from effective set |
Design Rationale
A web server that binds port 80 has no business loading kernel modules. Capabilities split the monolithic root privilege so it does not have to. The bounding set adds a hard ceiling -- once a container is running, no code path can claw back what was dropped. Ambient capabilities (kernel 4.3) came later because the inheritable + file capability dance was too convoluted for service managers to use in practice.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| EPERM on bind() to port 80 | Missing CAP_NET_BIND_SERVICE | getpcaps <pid> to verify effective set |
| Container escape via mount() | CAP_SYS_ADMIN not dropped | docker inspect --format '{{.HostConfig.CapAdd}}' |
| setcap silently fails | Filesystem mounted nosuid or no xattr support | `mount` output for `nosuid` on the target filesystem |
| Capabilities lost after execve() | Ambient set not configured, file caps missing | `grep Cap /proc/<pid>/status` |
| Ping fails for non-root users | CAP_NET_RAW missing from ping binary | getcap /usr/bin/ping |
| Capabilities present but operation denied | SELinux or seccomp blocking the syscall | ausearch -m AVC -ts recent or check seccomp logs |
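Several of the checks in the table end with a raw hex mask from /proc. When capsh is not installed, the mask can be decoded with plain shell arithmetic. This is a minimal sketch: `decode_caps` is a hypothetical helper, and the name list follows the numbering in <linux/capability.h> up to CAP_CHECKPOINT_RESTORE (bit 40), so newer kernels may define bits it does not know.

```shell
# Decode a hex capability mask (no 0x prefix) into names, one per line.
# decode_caps is a hypothetical helper; names are ordered by bit number.
decode_caps() {
    mask=$((0x$1))
    bit=0
    for name in chown dac_override dac_read_search fowner fsetid kill setgid \
        setuid setpcap linux_immutable net_bind_service net_broadcast net_admin \
        net_raw ipc_lock ipc_owner sys_module sys_rawio sys_chroot sys_ptrace \
        sys_pacct sys_admin sys_boot sys_nice sys_resource sys_time \
        sys_tty_config mknod lease audit_write audit_control setfcap \
        mac_override mac_admin syslog wake_alarm block_suspend audit_read \
        perfmon bpf checkpoint_restore; do
        if [ $(( (mask >> bit) & 1 )) -eq 1 ]; then
            echo "cap_$name"
        fi
        bit=$((bit + 1))
    done
}

decode_caps 0000000000002400
# prints:
# cap_net_bind_service
# cap_net_raw
```

Piping the output through grep for cap_sys_admin gives a quick yes-or-no answer for the container escape row above.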
When to Use / Avoid
Use when:
- Running services that need one or two privileged operations but not full root
- Hardening container security contexts in Docker or Kubernetes
- Replacing setuid-root binaries with narrowly scoped file capabilities
- Building systemd service units that bind low ports as non-root users
- Auditing which processes hold dangerous capabilities like CAP_SYS_ADMIN
Avoid when:
- The process genuinely needs full root (system installers, early boot init)
- Running inside user namespaces where capabilities are already namespace-scoped
- The binary is short-lived and the overhead of capability management exceeds the security benefit
Try It Yourself
    # Show capabilities of the current shell
    getpcaps $$ 2>/dev/null || grep -i '^Cap' /proc/$$/status

    # Show the current capability state (capsh prints Current, Bounding set, securebits)
    { capsh --print 2>/dev/null || grep '^Cap' /proc/$$/status; } | head -20

    # Find binaries with file capabilities
    { getcap -r /usr 2>/dev/null || echo 'getcap not available'; } | head -10

    # Decode the effective set from /proc into capability names
    capsh --decode="$(awk '/^CapEff:/ {print $2}' /proc/$$/status)" 2>/dev/null || echo 'capsh not available'

    # Show capabilities of all running processes
    { pscap 2>/dev/null || echo 'pscap not available (install libcap-ng-utils)'; } | head -15

    # Drop CAP_SYS_ADMIN from the bounding set and print the result (needs CAP_SETPCAP, e.g. run as root)
    capsh --drop=cap_sys_admin --print 2>/dev/null | grep -i 'bounding' || echo 'capsh not available or not permitted'

Debug Checklist
1. getpcaps $$ -- show capabilities of current shell
2. cat /proc/<pid>/status | grep -i cap -- raw hex capability sets
3. capsh --decode=<hex> -- decode hex to human-readable names
4. getcap -r /usr 2>/dev/null -- find all binaries with file capabilities
5. pscap 2>/dev/null -- list capabilities of all running processes
6. grep NoNewPrivs /proc/<pid>/status -- check if no_new_privs bit is set
Key Takeaways
- There are 40+ capabilities in modern kernels, but a handful dominate real-world usage: CAP_NET_BIND_SERVICE (bind ports below 1024), CAP_NET_RAW (raw sockets for ping/tcpdump), CAP_SYS_ADMIN (the dangerous catch-all that is basically mini-root), CAP_DAC_OVERRIDE (bypass file permissions), and CAP_SETUID/CAP_SETGID (change identity).
- The bounding set is an irreversible ceiling. Drop CAP_SYS_ADMIN from it, and no child process can gain that capability again -- not through setuid binaries, not through file capabilities. This is how container runtimes permanently lock the door on dangerous privileges.
- File capabilities replace setuid root for specific use cases. 'setcap cap_net_bind_service=ep /usr/bin/myserver' lets a binary bind port 80 without ever running as root. Much safer than chmod u+s, because the binary only gets the one permission it needs.
- CAP_SYS_ADMIN is the 'new root.' It controls mount, pivot_root, sethostname, BPF, quotas, namespaces, and dozens of other operations. A process with CAP_SYS_ADMIN can do almost anything root can. Container runtimes drop it by default for exactly this reason.
- When a setuid-root binary runs, the process gets ALL capabilities in its permitted and effective sets. When it switches to a non-root UID, the kernel clears those sets by default; the process keeps its permitted capabilities only if it first sets prctl(PR_SET_KEEPCAPS), and it should then drop everything it does not need. That is how a setuid-root ping can drop to your UID and still hold CAP_NET_RAW for raw sockets.
Common Pitfalls
- Mistake: Granting CAP_SYS_ADMIN to a container 'because it needs to mount filesystems.' Reality: CAP_SYS_ADMIN is nearly equivalent to full root. Use bind mounts from the host, or run the specific operation in an init container with a narrow capability set.
- Mistake: Setting file capabilities without understanding version semantics. Reality: File capabilities have a version field. v2 (Linux 2.6.25+) supports only permitted/effective/inheritable. v3 (Linux 4.14+) adds namespace-aware root_id. Mismatched versions silently fail -- no error, just no capabilities.
- Mistake: Dropping from the effective set but not the permitted set, thinking the process is restricted. Reality: The process (or a compromised library) can raise the capability back into effective at any time. Drop from permitted for permanent restriction.
- Mistake: Forgetting that capabilities are per-thread, not per-process. Reality: A multithreaded program that drops capabilities in one thread still has them in all others. Each thread has its own effective/permitted/inheritable sets. Use prctl(PR_SET_KEEPCAPS) carefully across setuid transitions.
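The effective-versus-permitted pitfall above can be checked mechanically: any bit that is set in permitted but clear in effective is a capability the process could raise again. A minimal sketch, with `raisable_caps` as a hypothetical helper operating on the hex masks from /proc/<pid>/status:

```shell
# Print the capabilities that are permitted but not currently effective,
# i.e. the ones a process could still raise: permitted AND NOT effective.
# Arguments are hex masks without a 0x prefix. raisable_caps is a
# hypothetical helper name.
raisable_caps() {
    r_prm=$((0x$1)); r_eff=$((0x$2))
    printf '%016x\n' $(( r_prm & ~r_eff ))
}

# Permitted holds NET_BIND_SERVICE and NET_RAW, effective only the former:
raisable_caps 0000000000002400 0000000000000400   # 0000000000002000

# Check the current shell, if /proc is available.
status="/proc/$$/status"
if [ -r "$status" ]; then
    cur_prm=$(awk '/^CapPrm:/ { print $2 }' "$status")
    cur_eff=$(awk '/^CapEff:/ { print $2 }' "$status")
    raisable_caps "$cur_prm" "$cur_eff"
fi
```

A result of all zeros means nothing is waiting in the permitted set. Because the sets are per-thread, a thorough audit would repeat this for each entry under /proc/<pid>/task/.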
Reference
In One Line
Drop what the process does not need; drop from the bounding set to make it irreversible.