Seccomp: Sandboxing System Calls
Mental Model
A customs checkpoint at an airport. Every passenger -- each one a syscall -- must pass through the scanner before reaching the terminal gates. The scanner runs on a printed rulebook that nobody can edit once laminated, not even the airport director. Passenger not on the list? Turned away on the spot. Every new airline that opens at this airport inherits the same scanner with the same rules. More scanners can be added in series, but none are ever removed.
The Problem
A buffer overflow in a containerized service hands an attacker arbitrary code execution -- and without syscall filtering, all roughly 450 kernel entry points are wide open. kexec_load() can replace the host kernel, mount() can attach the host filesystem, ptrace() can read credentials from neighboring processes. Scale that to 500 pods on a single Kubernetes node, each one an independent gateway into the shared kernel. In Chrome, a compromised renderer with access to execve() and connect() spawns a reverse shell and exfiltrates session cookies in under 5 seconds.
Architecture
Every Docker container started with the default seccomp profile silently blocks about 44 system calls.
Most people never noticed. The application kept working. But if an attacker gained code execution inside that container and tried to call reboot(), kexec_load(), or mount(), the kernel would refuse the request before the syscall handler even started executing.
That is seccomp. It is not a firewall for network traffic. It is a firewall for kernel requests. And once installed, it is permanent.
What Actually Happens
A process installs a seccomp filter by calling prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog) or the newer seccomp() syscall. The filter is a BPF program -- a small array of instructions that examine a seccomp_data struct on every syscall.
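The installation call described above can be sketched from Python with ctypes. This is a minimal sketch, not a production loader: the class and function names (SockFilter, SockFprog, make_fprog, install_filter) are illustrative, the numeric constants are copied from the Linux UAPI headers, and the example program is a single "allow everything" instruction.

```python
import ctypes

# Constants from <linux/prctl.h>, <linux/seccomp.h> and <linux/bpf_common.h>
PR_SET_SECCOMP = 22
PR_SET_NO_NEW_PRIVS = 38
SECCOMP_MODE_FILTER = 2
SECCOMP_RET_ALLOW = 0x7FFF0000
BPF_RET_K = 0x06  # BPF_RET | BPF_K


class SockFilter(ctypes.Structure):
    """One cBPF instruction (struct sock_filter)."""
    _fields_ = [("code", ctypes.c_uint16), ("jt", ctypes.c_uint8),
                ("jf", ctypes.c_uint8), ("k", ctypes.c_uint32)]


class SockFprog(ctypes.Structure):
    """The program handed to the kernel (struct sock_fprog)."""
    _fields_ = [("len", ctypes.c_uint16),
                ("filter", ctypes.POINTER(SockFilter))]


def make_fprog(instructions):
    """Build a sock_fprog from (code, jt, jf, k) tuples."""
    array = (SockFilter * len(instructions))(
        *[SockFilter(*ins) for ins in instructions])
    return SockFprog(len(instructions), array)


def install_filter(fprog):
    """Install the filter on the calling thread. This cannot be undone."""
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    # no_new_privs lets an unprivileged process install a filter.
    if libc.prctl(PR_SET_NO_NEW_PRIVS, ctypes.c_ulong(1), zero, zero, zero):
        raise OSError(ctypes.get_errno(), "PR_SET_NO_NEW_PRIVS failed")
    if libc.prctl(PR_SET_SECCOMP, ctypes.c_ulong(SECCOMP_MODE_FILTER),
                  ctypes.byref(fprog), zero, zero):
        raise OSError(ctypes.get_errno(), "PR_SET_SECCOMP failed")


# A one-instruction program: return ALLOW for every syscall.
ALLOW_ALL = [(BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW)]
fprog = make_fprog(ALLOW_ALL)
```

Nothing here calls install_filter(). Doing so is harmless with ALLOW_ALL, but the filter stays attached to the process for the rest of its life, so try it in a throwaway interpreter.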
Every time the process (or any of its children) makes a syscall, the kernel hits this sequence:
1. Syscall entry. The CPU transitions to kernel mode via entry_SYSCALL_64.
2. Filter runs. Before looking up the syscall handler in sys_call_table, the kernel calls __seccomp_filter(). The BPF program receives the syscall number, architecture, instruction pointer, and all six arguments.
3. Verdict returned. The filter returns one of several actions:
- SECCOMP_RET_ALLOW -- proceed normally.
- SECCOMP_RET_KILL_PROCESS -- terminate the entire process (every thread) with SIGSYS. Core dump generated.
- SECCOMP_RET_ERRNO(val) -- return an error to userspace without executing the syscall. The process sees a normal errno.
- SECCOMP_RET_TRAP -- deliver SIGSYS to the process's signal handler for logging.
- SECCOMP_RET_LOG -- allow but log to the audit subsystem.
- SECCOMP_RET_USER_NOTIF (kernel 5.0+) -- pause the syscall and notify a supervisor process.
No work is done on a blocked syscall. The handler is never dispatched, so the kernel never acts on the arguments. This is faster and more reliable than ptrace-based sandboxing, which has to context-switch to a separate tracer process at every syscall stop.
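To make the filter's input and output concrete, here is a sketch of the seccomp_data layout and the verdict values, mirrored from <linux/seccomp.h>. The Python names (SeccompData, errno_verdict) are illustrative. The offsets matter because a cBPF filter addresses the struct by byte offset.

```python
import ctypes


class SeccompData(ctypes.Structure):
    """Mirror of struct seccomp_data, the only thing a filter can read."""
    _fields_ = [
        ("nr", ctypes.c_int32),                    # syscall number
        ("arch", ctypes.c_uint32),                 # AUDIT_ARCH_* value
        ("instruction_pointer", ctypes.c_uint64),  # IP at syscall entry
        ("args", ctypes.c_uint64 * 6),             # raw syscall arguments
    ]


# Verdict values from <linux/seccomp.h>
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_TRAP = 0x00030000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_USER_NOTIF = 0x7FC00000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_DATA = 0x0000FFFF  # low 16 bits carry the errno value


def errno_verdict(err):
    """Encode SECCOMP_RET_ERRNO(err): action in the high bits, errno in the low."""
    return SECCOMP_RET_ERRNO | (err & SECCOMP_RET_DATA)
```

A filter that wants the caller to see EPERM returns errno_verdict(1). The syscall number sits at offset 0 and the architecture at offset 4, which is where filters load them from.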
Under the Hood
Seccomp uses classic BPF (cBPF), not eBPF. This is deliberate.
Classic BPF has no maps, no loops, no function calls. It is a linear sequence of load/store/compare/jump instructions operating on a fixed-size struct. This extreme simplicity makes it possible to formally verify that a filter always terminates and runs in bounded time. The kernel converts cBPF to eBPF internally for JIT compilation, but the program submitted by userspace must be cBPF.
Filters are immutable and stackable. An installed filter cannot be modified or removed. More filters can be added, but never subtracted. When multiple filters are stacked, ALL of them must return ALLOW for the syscall to proceed. This means security can only increase, never decrease.
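The stacking rule can be modeled in a few lines. This is a simulation, not kernel code: it assumes the kernel's documented behavior of running every filter and keeping the most restrictive verdict, which it decides by comparing the action bits as signed 32-bit integers. The function names are illustrative.

```python
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO = 0x00050000
SECCOMP_RET_LOG = 0x7FFC0000
SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_ACTION_FULL = 0xFFFF0000  # mask that drops the 16 data bits


def action_rank(verdict):
    """Action bits as a signed 32-bit value; lower means more restrictive."""
    action = verdict & SECCOMP_RET_ACTION_FULL
    return action - (1 << 32) if action & 0x80000000 else action


def combine(verdicts):
    """Verdict that wins when every stacked filter has run."""
    return min(verdicts, key=action_rank)
```

With this model, combine([SECCOMP_RET_ALLOW, SECCOMP_RET_ERRNO | 1]) yields the ERRNO verdict: one dissenting filter is enough to block the call, and adding a filter can never turn a block into an allow.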
Filters are inherited. Every fork() and clone() creates a child with the same filter, and the filter stays attached across execve(). The no_new_privs bit, set with prctl(PR_SET_NO_NEW_PRIVS) and required to install a filter without CAP_SYS_ADMIN, ensures that execve() of setuid binaries does not gain elevated privileges. The sandbox invariant is maintained across process boundaries.
User notifications (SECCOMP_RET_USER_NOTIF) are a game-changer for container runtimes. When the filter returns USER_NOTIF, the syscall is paused and a notification is sent to a supervisor via a seccomp notification fd. The supervisor can read the syscall arguments, inspect user-space memory, and decide to allow, deny, or inject a synthetic return value. This lets container runtimes emulate syscalls like mount without giving the container real CAP_SYS_ADMIN.
This is where things break without careful attention: architecture checking. On x86-64, a process can use int 0x80 to invoke the 32-bit syscall table, where syscall numbers are completely different. A filter that only checks x86-64 numbers is trivially bypassed. The very first instruction of any serious seccomp filter must verify seccomp_data.arch.
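The bypass is easy to demonstrate with a toy model. The sketch below builds a small cBPF program that denies execve by its x86-64 number, with and without the architecture check, and runs it through a minimal interpreter that supports only the three opcodes used. The interpreter and helper names are illustrative. The opcodes, AUDIT_ARCH values, and syscall numbers are the real ones.

```python
import struct

# cBPF opcodes from <linux/bpf_common.h>
BPF_LD_W_ABS = 0x20  # BPF_LD | BPF_W | BPF_ABS
BPF_JEQ_K = 0x15     # BPF_JMP | BPF_JEQ | BPF_K
BPF_RET_K = 0x06     # BPF_RET | BPF_K

AUDIT_ARCH_X86_64 = 0xC000003E
AUDIT_ARCH_I386 = 0x40000003
SECCOMP_RET_KILL_PROCESS = 0x80000000
SECCOMP_RET_ERRNO_EPERM = 0x00050001
SECCOMP_RET_ALLOW = 0x7FFF0000

OFF_NR, OFF_ARCH = 0, 4   # byte offsets inside seccomp_data
NR_EXECVE_X86_64 = 59
NR_EXECVE_I386 = 11       # same syscall, different number via int 0x80


def build_filter(check_arch):
    """Deny execve by x86-64 number; optionally verify the arch first."""
    prog = []
    if check_arch:
        prog += [
            (BPF_LD_W_ABS, 0, 0, OFF_ARCH),
            (BPF_JEQ_K, 1, 0, AUDIT_ARCH_X86_64),  # match: skip the kill
            (BPF_RET_K, 0, 0, SECCOMP_RET_KILL_PROCESS),
        ]
    prog += [
        (BPF_LD_W_ABS, 0, 0, OFF_NR),
        (BPF_JEQ_K, 0, 1, NR_EXECVE_X86_64),       # no match: skip the deny
        (BPF_RET_K, 0, 0, SECCOMP_RET_ERRNO_EPERM),
        (BPF_RET_K, 0, 0, SECCOMP_RET_ALLOW),
    ]
    return prog


def run(prog, nr, arch):
    """Toy interpreter: evaluate prog against the first 8 bytes of seccomp_data."""
    data = struct.pack("=iI", nr, arch)
    acc, pc = 0, 0
    while True:
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD_W_ABS:
            acc = struct.unpack_from("=I", data, k)[0]
        elif code == BPF_JEQ_K:
            pc += jt if acc == k else jf
        elif code == BPF_RET_K:
            return k
        else:
            raise ValueError(f"unsupported opcode {code:#x}")


naive, checked = build_filter(False), build_filter(True)
```

Without the check, run(naive, NR_EXECVE_I386, AUDIT_ARCH_I386) returns ALLOW: the 32-bit execve walks straight past a filter that only knows number 59. The checked version kills the process as soon as it sees a foreign architecture.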
Common Questions
How does Docker's default seccomp profile work?
Docker ships a JSON profile (default.json) that ends up blocking about 44 syscalls out of roughly 450 total. The container runtime (runc) compiles this into BPF bytecode and installs it in the container's init process. Blocked syscalls include reboot, kexec_load, mount, keyctl, acct, ptrace, userfaultfd, and various namespace operations. The profile is written as an allowlist: its default action returns an error (SCMP_ACT_ERRNO) for any syscall that is not listed, and the listed syscalls are allowed. A custom profile can be supplied via --security-opt seccomp=/path/to/profile.json.
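A custom profile of the kind passed to --security-opt can be generated rather than hand-written. The sketch below emits the JSON shape Docker's profile format uses (defaultAction, architectures, syscalls). The helper name and the tiny syscall set are illustrative only, and a real workload needs a much longer list before a container will even start.

```python
import json


def make_profile(allowed_syscalls):
    """Allowlist profile: anything not named fails with an errno."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": ["SCMP_ARCH_X86_64"],
        "syscalls": [
            {"names": sorted(allowed_syscalls), "action": "SCMP_ACT_ALLOW"},
        ],
    }


# Hypothetical, far-too-small allowlist used only to show the structure.
profile = make_profile({"read", "write", "exit_group", "futex", "mmap"})
profile_json = json.dumps(profile, indent=2)
```

Writing profile_json to a file and pointing --security-opt seccomp= at it is the intended use. In practice the allowlist is built from observed behavior, for example an strace summary of the workload.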
Can a seccomp filter be bypassed?
Not directly. The filter runs in kernel space before syscall dispatch and cannot be removed. But there are indirect attacks: (1) if the filter does not check the architecture, x86-64 processes can use int 0x80 to invoke 32-bit syscalls with different numbers; (2) if the filter checks numbers but not arguments, allowed syscalls with dangerous arguments can still be abused; (3) if the filter allows ptrace, a traced child can be manipulated to make any syscall. A well-written filter checks arch, blocks ptrace, and validates arguments for sensitive syscalls.
What is the performance overhead?
Minimal. The BPF filter is JIT-compiled to native code and runs in roughly 100-200 nanoseconds per syscall. For a process making 10,000 syscalls per second, that is 1-2 ms of CPU time per second, or 0.1-0.2% of one core. Negligible. The overhead scales linearly with the number of stacked filters. Filter size matters -- a 100-instruction filter is faster than a 10,000-instruction one -- but both are fast compared to the syscall itself.
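The arithmetic behind that estimate, as a small helper. The function name is illustrative, and the per-syscall cost is the 100-200 ns range quoted above, not a measurement.

```python
def overhead_ms_per_second(syscalls_per_second, ns_per_filter_run, stacked_filters=1):
    """Filter CPU time (ms) spent per wall-clock second, assuming linear scaling."""
    return syscalls_per_second * ns_per_filter_run * stacked_filters / 1e6


low = overhead_ms_per_second(10_000, 100)   # 1.0 ms per second
high = overhead_ms_per_second(10_000, 200)  # 2.0 ms per second
```

Dividing by 1000 ms gives the share of one core: 0.1% to 0.2% at this syscall rate, and three stacked filters would roughly triple it.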
How to debug seccomp violations?
Use SECCOMP_RET_LOG instead of SECCOMP_RET_KILL during development. This allows the syscall but logs it to the audit log. The entry includes the syscall number, architecture, and PID. ausearch -m seccomp finds these entries. For Docker, docker run --security-opt seccomp=unconfined disables the filter to confirm it is the cause. strace -f on the failing process shows which syscall triggers the violation.
How Technologies Use This
An attacker gains code execution inside a container and calls kexec_load() to replace the host kernel, or mount() to attach the host filesystem. Without syscall filtering, the full kernel API of roughly 450 syscalls is available to every container process.
The fundamental problem is that containers share the host kernel, and every syscall is a potential attack vector against that shared kernel. Capabilities restrict what a process is authorized to do, but seccomp restricts what a process can even ask for. Without seccomp, a container escape exploit only needs to find one vulnerable syscall path among 450+ options.
Docker's default seccomp profile blocks about 44 dangerous syscalls including mount, reboot, kexec_load, and ptrace while allowing the 400+ that normal applications need. The BPF filter is compiled once and permanently attached to the container's task_struct. It cannot be removed, even by root inside the container. The overhead is roughly 100-200ns per syscall, negligible compared to the syscall itself, but it eliminates entire classes of container escape techniques.
Consider 500 pods on a node, every single one with access to the full syscall table. A single compromised pod can attempt mount, ptrace, or namespace manipulation against the shared host kernel. Writing individual seccomp profiles for each pod is not feasible at this scale.
The issue is that without cluster-wide defaults, seccomp is opt-in per pod. Most teams never configure it, leaving every pod with an unrestricted syscall surface. The attack surface scales linearly with pod count -- 500 pods means 500 unrestricted entry points into the kernel.
Enabling SeccompDefault in the kubelet applies the container runtime's default seccomp profile to all pods automatically, blocking about 44 dangerous syscalls without any per-pod configuration. Custom profiles can tighten the filter further for sensitive workloads, restricting a payment service to just 80 required syscalls and reducing its kernel attack surface by over 80% compared to the default.
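For the per-workload case, the profile is selected through securityContext.seccompProfile in the pod spec. The sketch below builds such a manifest as JSON, which kubectl accepts alongside YAML. The pod name, image, and profile path are hypothetical. A Localhost profile path is resolved relative to the kubelet's seccomp directory on the node.

```python
import json


def pod_with_seccomp(name, image, localhost_profile=None):
    """Pod manifest using the runtime default or a node-local custom profile."""
    if localhost_profile is None:
        seccomp = {"type": "RuntimeDefault"}
    else:
        seccomp = {"type": "Localhost", "localhostProfile": localhost_profile}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "securityContext": {"seccompProfile": seccomp},
            "containers": [{"name": name, "image": image}],
        },
    }


# Hypothetical names throughout.
default_pod = pod_with_seccomp("web", "example.com/web:1.0")
strict_pod = pod_with_seccomp("payment", "example.com/payment:1.0",
                              localhost_profile="profiles/payment.json")
manifest = json.dumps(strict_pod, indent=2)
```

The custom profile file has to exist on every node that may schedule the pod, which is the operational cost of going beyond the runtime default.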
A compromised renderer tab calls execve() to launch a reverse shell or connect() to exfiltrate session cookies to an external server. Without syscall restrictions, any code running in the renderer has access to the full kernel API of 450+ syscalls.
The renderer process needs very few syscalls to do its actual job -- memory mapping for rendering, read/write for IPC with the browser process, and a handful of others. But without a whitelist, the 430+ syscalls it does not need become an attacker's toolkit for file access, network connections, process spawning, and privilege escalation.
Chrome installs a seccomp-BPF filter that whitelists only a handful of syscalls: mmap, read, write, recvmsg for IPC, and a few others needed for rendering. Everything else including execve, open, connect, and fork is permanently blocked. A hacked tab can only communicate with the browser process through an IPC channel, reducing the exploitable kernel surface from 450 syscalls to roughly 20, a 95% reduction.
A compromised web server calls mount() to overlay /etc/shadow with an attacker-controlled file, then reads the modified shadow file to extract password hashes. Without syscall filtering, any service running on the host has access to every kernel operation its capabilities allow.
The gap is that capabilities control whether a process is authorized for broad categories of operations, but they do not restrict which specific syscalls a process can invoke. A service with no dangerous capabilities can still call mount-related syscalls -- the kernel will return EPERM, but only after the syscall handler has run far enough to perform the permission check. Seccomp blocks the syscall before the handler runs at all.
systemd's SystemCallFilter= directive compiles seccomp-BPF filters directly from unit file declarations. SystemCallFilter=@system-service allows the roughly 250 syscalls that normal services need, while SystemCallFilter=~@mount blocks all mount-related calls. Filter violations are logged to the journal with the offending syscall number and PID, adding less than 200ns of overhead per syscall while permanently closing off entire categories of attack.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Default filtering | ~44 syscalls blocked via default.json profile | No seccomp by default; container runtime applies it | No seccomp by default; container runtime applies it | No seccomp by default; static binaries use fewer syscalls | SeccompDefault kubelet flag applies runtime profile to all pods |
| Custom profiles | --security-opt seccomp=profile.json | Same container-level profile; JNI native calls may trigger violations | Same container-level profile; native addons may need extra syscalls | Minimal syscall footprint since Go does not use glibc | securityContext.seccompProfile per pod or container |
| Violation handling | Container process killed with SIGSYS; dmesg shows event | JVM crash with SIGSYS; hs_err_pid log created | Node process exits with SIGSYS; no graceful handler | Go runtime catches SIGSYS but process usually terminates | Pod enters CrashLoopBackOff; container terminates with reason Error (signal 31, SIGSYS) |
| Syscall audit | strace -c on container PID to build whitelist | strace -c -f on JVM PID (includes JNI and GC syscalls) | strace -c on node PID (includes libuv internals) | strace -c on Go binary (direct syscalls, no libc) | sysdig or Falco for cluster-wide syscall auditing |
| Stack Layer | Mechanism |
|---|---|
| Application | Calls prctl(PR_SET_SECCOMP) or seccomp() to install BPF filter |
| Container runtime | runc/crun compiles JSON profile to BPF bytecode; installs before exec |
| Kernel BPF engine | Runs cBPF filter (JIT-compiled to native) on every syscall entry before dispatch |
| Audit subsystem | Logs SECCOMP_RET_LOG and SECCOMP_RET_KILL events for monitoring |
| Supervisor (optional) | SECCOMP_RET_USER_NOTIF pauses syscall; supervisor process inspects and responds via notif fd |
Design rationale: Filtering at the syscall entry point with an immutable BPF program means nothing in userspace -- not even root -- can ever widen the allowed set. That immutability is the whole point, but it comes at a cost: the complete syscall whitelist must be known before the filter is installed, which makes profiling dynamic workloads genuinely hard.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Container process killed with SIGSYS | Seccomp filter blocked a required syscall | dmesg |
| Application works locally but crashes in Docker | Default seccomp profile blocks a syscall the app needs | docker run --security-opt seccomp=unconfined to confirm; then build custom profile |
| Cryptic segfault in containerized app | Filter blocks glibc internal syscalls (futex, mprotect, brk) | strace -c -f outside container to compare syscall set |
| systemd service fails with "Operation not permitted" | SystemCallFilter too restrictive for the service | journalctl -u $SERVICE; check for seccomp audit messages |
| 32-bit compat app bypasses seccomp filter | Filter does not check seccomp_data.arch; int 0x80 uses different syscall table | Verify first BPF instruction checks AUDIT_ARCH_X86_64 |
| Pod stuck in CrashLoopBackOff after seccomp profile change | New profile missing syscalls needed during startup (clone, execve, mmap) | kubectl describe pod; check events for SIGSYS or signal 31 |
When to Use / Avoid
- Sandboxing untrusted code that shares a kernel -- containers, browser renderers, plugin hosts
- Hardening services to a known syscall whitelist via systemd SystemCallFilter
- Layering defense-in-depth on top of capabilities and LSM policies
- Auditing syscall usage with SECCOMP_RET_LOG before locking down a strict profile
- Skip when plugins load dynamically and their syscall needs are unpredictable
- Skip during debugging -- either disable seccomp temporarily or switch to RET_LOG to find what is being blocked
Try It Yourself
# Check whether the current shell has seccomp enabled (0=off, 1=strict, 2=filter)
grep Seccomp /proc/$$/status

# List processes whose seccomp mode is not 0
for f in /proc/[0-9]*/status; do grep -H '^Seccomp:' "$f" 2>/dev/null; done | grep -v '[[:space:]]0$' | head -10

# Check whether the Docker daemon has seccomp enabled and which profile it uses
docker info 2>/dev/null | grep -i seccomp || echo 'Docker not available'

# Run a command under a systemd syscall filter that blocks the @mount group
systemd-run --pipe --wait --property='SystemCallFilter=~@mount' echo 'hello' 2>/dev/null || echo 'systemd-run not available'

# List syscalls used by a command (for building a whitelist)
strace -c -f ls /tmp 2>&1 | tail -20

# Check the seccomp mode of a running Docker container
docker run -d --name seccomp_test alpine sleep 60 2>/dev/null && PID=$(docker inspect --format '{{.State.Pid}}' seccomp_test 2>/dev/null); grep Seccomp /proc/$PID/status 2>/dev/null; docker rm -f seccomp_test 2>/dev/null
Debug Checklist
1. grep Seccomp /proc/$PID/status -- check seccomp mode (0=off, 1=strict, 2=filter)
2. strace -c -f $COMMAND 2>&1 | tail -20 -- profile all syscalls to build a whitelist
3. ausearch -m seccomp --start recent -- find seccomp violations in the audit log
4. seccomp-tools dump $BINARY -- disassemble the BPF filter attached to a binary
5. dmesg | grep seccomp -- check kernel ring buffer for seccomp kill events
6. docker inspect $CONTAINER | jq '.[0].HostConfig.SecurityOpt' -- verify container seccomp profile
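The first checklist item is easy to script. This sketch parses the Seccomp field of /proc/PID/status into a readable mode. The function names are illustrative and the sample text is made up for the example.

```python
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text):
    """Extract the seccomp mode from the contents of /proc/<pid>/status."""
    for line in status_text.splitlines():
        # "Seccomp_filters:" does not match because of the trailing colon.
        if line.startswith("Seccomp:"):
            return SECCOMP_MODES[int(line.split()[1])]
    return None  # kernel built without seccomp, or field missing


def seccomp_mode_of(pid="self"):
    """Read and parse the status file of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return seccomp_mode(f.read())


sample = "Name:\tsleep\nSeccomp:\t2\nSeccomp_filters:\t1\n"
mode = seccomp_mode(sample)
```

Calling seccomp_mode_of() with a container's host PID gives the same answer as the grep, in a form that is easier to use in a monitoring script.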
Key Takeaways
- ✓ Seccomp filters are permanent. They cannot be removed, only made more restrictive by stacking additional filters. ALL stacked filters must return ALLOW for a syscall to proceed. This guarantee holds even if the attacker gains code execution and root inside the sandbox.
- ✓ The filter runs BEFORE the syscall touches the kernel's dispatch table. A blocked syscall is killed before any work happens. This is faster and more reliable than LSM hooks (which run after initial setup) and ptrace-based sandboxes (which run in a separate process).
- ✓ Architecture checking is the most commonly missed security detail. On x86-64, a process can use int 0x80 to invoke 32-bit syscalls with completely different numbers. A filter that only checks x86-64 syscall numbers is trivially bypassed. Always check seccomp_data.arch first.
- ✓ Docker blocks about 44 of roughly 450 syscalls by default, including keyctl, reboot, mount, kexec_load, ptrace, and userfaultfd. The profile is a JSON file compiled to BPF instructions by the container runtime. Most applications never notice the missing syscalls.
- ✓ Seccomp uses classic BPF (cBPF), not eBPF. No maps, no loops, no function calls -- just linear load/store/compare/jump instructions on the seccomp_data struct. The kernel converts cBPF to eBPF internally for JIT compilation. The simplicity is deliberate: it guarantees termination and makes formal verification feasible.
Common Pitfalls
- ✗ Mistake: Blocking open() but not openat(). Reality: modern glibc uses openat(AT_FDCWD, ...) for all file opens. A filter that only blocks the open syscall number is useless -- you must also block openat, openat2, and the 32-bit compat versions.
- ✗ Mistake: Not checking the architecture field in the filter. Reality: an attacker can use int 0x80 on x86-64 to invoke the 32-bit syscall table where numbers are completely different. The filter MUST verify arch == AUDIT_ARCH_X86_64 before checking the syscall number.
- ✗ Mistake: Testing only the target application and missing library syscalls. Reality: glibc, libpthread, and the dynamic linker make syscalls (futex, mprotect, mmap, brk) that the application never calls directly. Blocking these causes cryptic segfaults, not clean error messages.
- ✗ Mistake: Using SECCOMP_RET_TRAP without installing a SIGSYS handler. Reality: the default action for SIGSYS is process termination. If you want RET_TRAP for logging, you must install a sigaction handler that catches SIGSYS and decides what to do.
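The open/openat pitfall lends itself to a simple lint over a denylist. The sketch below flags sibling syscalls that a profile forgot to block. The family table is a hypothetical stub holding only the open variants named above, and it would need extending for a real audit.

```python
# Hypothetical stub: syscalls that reach the same kernel functionality.
SYSCALL_FAMILIES = {
    "open": {"open", "openat", "openat2"},
}


def missing_siblings(blocked):
    """Return syscalls that should be blocked alongside those already listed."""
    missing = set()
    for family in SYSCALL_FAMILIES.values():
        if blocked & family:            # profile blocks part of the family
            missing |= family - blocked  # report the rest
    return missing


gaps = missing_siblings({"open", "mount"})
```

Here gaps contains openat and openat2, the two entry points that would leave the open() block ineffective.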
Reference
In One Line
Bolt a seccomp-BPF filter onto the process, whitelist only the syscalls it genuinely needs, and the other 400+ kernel entry points cease to exist for that process -- permanently.