System Calls: User to Kernel Transition
Mental Model
A bank vault. Customers cannot walk in and grab cash directly. They fill out a form at the counter, a teller verifies it, disappears into the vault, retrieves what is needed, and hands it back through the window. The form is the syscall number, the counter is the mode switch, the vault is kernel space. For one very common question -- "what time is it?" -- the answer is posted on a board in the lobby. Nobody visits the counter for that.
The Problem
An application fires 2 million write() calls per second, 100 bytes each. The CPU sits at 80%, but the application accounts for only 10% of that -- the rest is system time spent on mode switches at 200-500ns apiece. Down the hall, another application calls gettimeofday() 5 million times per second and pays essentially zero kernel overhead. Both are technically syscalls. Per call, the cost difference between them is more than an order of magnitude.
Architecture
Every time a program writes a byte to a file, sends a packet over the network, or checks the clock, it is asking the kernel for a favor.
Application code lives in user space. It cannot touch hardware. It cannot peek at another process's memory. It is, by design, powerless. The only way out is through a narrow gate called a system call.
What makes this interesting is not that the gate exists -- it is what happens when a program steps through it, and why some programs blow through it millions of times per second while others grind to a halt.
What Actually Happens
When write() is called, here is the precise sequence:
1. glibc loads the syscall number (1 for write) into RAX. Arguments go into RDI, RSI, RDX, R10, R8, R9.
2. The syscall instruction fires. The CPU saves the return address in RCX and the flags in R11, then loads the kernel entry point from MSR_LSTAR. Execution is now in ring 0.
3. The kernel entry stub (entry_SYSCALL_64) runs swapgs to access per-CPU data, then saves all registers into a pt_regs structure on the kernel stack.
4. The kernel dispatches to sys_call_table[RAX] -- in this case, __x64_sys_write.
5. The handler validates the arguments, copies data from user space via copy_from_user(), pushes it through the VFS layer down to the filesystem driver, and puts the result in RAX.
6. sysretq switches back to ring 3. Execution returns to user space. If RAX is between -4095 and -1, glibc interprets it as an error, negates it, stores it in errno, and returns -1.
The whole round trip takes 100-300ns.
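Step 6 is easy to observe directly. A minimal sketch: call write() on an invalid descriptor, so the kernel hands back -EBADF in RAX and glibc translates it into a -1 return with errno set.

```c
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    /* -1 is never a valid file descriptor, so the kernel returns -EBADF
       in RAX; glibc negates it into errno and returns -1. */
    ssize_t n = write(-1, "x", 1);
    printf("write returned %zd, errno=%d (%s)\n", n, errno, strerror(errno));
    return 0;
}
```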
Under the Hood
Why R10 instead of RCX for the 4th argument? Because the syscall instruction clobbers RCX (it stores the return address there). The hardware made this choice, and every syscall convention on x86-64 lives with the consequence.
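The convention is easiest to see in a raw syscall. Here is a minimal sketch of write() via inline assembly (GCC/Clang syntax, x86-64 Linux only -- not how glibc is actually written): note that RCX and R11 appear in the clobber list precisely because the syscall instruction destroys them.

```c
#include <sys/syscall.h>   /* __NR_write */

/* Raw 3-argument syscall on x86-64. The instruction itself clobbers
   RCX (return address) and R11 (saved RFLAGS). */
static long raw_write(int fd, const void *buf, unsigned long len) {
    long ret;
    __asm__ volatile (
        "syscall"
        : "=a"(ret)                  /* RAX: syscall number in, result out */
        : "0"(__NR_write),           /* 1 on x86-64 */
          "D"((long)fd),             /* RDI: arg 1 */
          "S"(buf),                  /* RSI: arg 2 */
          "d"(len)                   /* RDX: arg 3 */
        : "rcx", "r11", "memory");   /* clobbered by the instruction */
    return ret;                      /* negative values are -errno */
}

int main(void) {
    const char msg[] = "raw syscall\n";
    return raw_write(1, msg, sizeof(msg) - 1) < 0;
}
```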
The vDSO trick. For gettimeofday() and clock_gettime(), the kernel maps a tiny shared library into every process. This library reads a shared memory page that the kernel updates on each timer tick, then adds the TSC delta for sub-tick precision. No ring transition. ~20ns instead of ~200ns. It appears as [vdso] in /proc/PID/maps.
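A back-of-the-envelope benchmark makes the difference visible. This sketch times a tight loop of clock_gettime(CLOCK_MONOTONIC) calls, which resolve through the vDSO on most x86-64 Linux systems; the exact number varies by hardware, but it typically lands near the ~20ns figure above.

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    enum { N = 10 * 1000 * 1000 };
    struct timespec start, now, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < N; i++)
        clock_gettime(CLOCK_MONOTONIC, &now);   /* vDSO: no ring transition */
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9
              + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per clock_gettime call\n", ns / N);
    return 0;
}
```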
ABI stability is sacred. Linux guarantees that syscall numbers and semantics never change once assigned. A binary compiled for kernel 2.6 still works on kernel 6.x. New functionality comes through new syscalls or flag extensions (openat2 extends openat, clone3 extends clone), never by changing existing ones.
The vsyscall page is dead. It was the original fast-path mechanism -- fixed address, three functions. But that fixed address made ROP attacks trivial. The vDSO replaced it with an ASLR-randomized ELF object. The vsyscall page now traps into the kernel on every call, defeating its original purpose.
Syscall restart. When a signal interrupts a blocking syscall, the kernel can auto-restart it (if SA_RESTART is set) by rewinding the instruction pointer to the syscall instruction. Without SA_RESTART, the call returns -EINTR and must be retried manually. Some syscalls like nanosleep are never auto-restarted -- they use a restart_block to resume with the remaining time.
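Without SA_RESTART, robust code needs the classic manual-retry pattern. A minimal sketch:

```c
#include <errno.h>
#include <unistd.h>

/* Retry a read that may be interrupted by a signal. Without SA_RESTART,
   a signal landing mid-call makes read() return -1 with errno == EINTR. */
ssize_t read_retry(int fd, void *buf, size_t len) {
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```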
Common Questions
Why does Linux pass arguments in registers instead of on the stack?
The syscall instruction switches the stack pointer to the kernel stack, making the user stack inaccessible without explicit copying. Registers are the fastest way to move data across the privilege boundary. Six argument registers (RDI, RSI, RDX, R10, R8, R9) are enough -- no Linux syscall takes more than six arguments.
How does seccomp-bpf intercept syscalls?
Seccomp installs a BPF filter that runs before the syscall table dispatch. It examines the syscall number and arguments from seccomp_data (derived from pt_regs) and returns an action: ALLOW, KILL, ERRNO, TRAP, or LOG. This happens before any real work, making it an efficient sandboxing mechanism.
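A minimal filter is only a few BPF statements. This sketch installs a toy policy (hypothetical, x86-64 only) that fails getpid with EPERM and allows everything else:

```c
#include <stdio.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

int main(void) {
    struct sock_filter filter[] = {
        /* Load the syscall number from seccomp_data.nr.
           (A production filter would also check seccomp_data.arch.) */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* If it is getpid, fail it with EPERM instead of dispatching */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_getpid, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (EPERM & SECCOMP_RET_DATA)),
        /* Everything else proceeds to the real sys_call_table dispatch */
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = { .len = sizeof(filter) / sizeof(filter[0]),
                               .filter = filter };

    /* Required so an unprivileged process may install a filter */
    prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl(SECCOMP)");
        return 1;
    }
    /* Raw syscall, so no libc-side shortcut can hide the filter's effect */
    long pid = syscall(SYS_getpid);
    printf("getpid -> %ld, errno=%d\n", pid, errno);
    return 0;
}
```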
How fast is syscall vs int 0x80?
The syscall instruction takes ~50-100ns for the transition itself (no memory access, MSR-based). int 0x80 requires an IDT lookup, a stack switch via TSS, and is roughly 5-10x slower. On 64-bit systems, int 0x80 also enters the 32-bit compatibility layer -- it truncates registers and uses different syscall numbers. Never use it in 64-bit code.
Can a new syscall be added to a running kernel?
No. The sys_call_table lives in read-only memory (rodata), protected by Write Protect in CR0. While modules can theoretically disable WP and modify it, this is unsupported, breaks KASLR, and is blocked by Secure Boot. The proper approach for kernel extensions is ioctl, netlink, or eBPF.
How Technologies Use This
A container starts up and immediately leaks host resources because namespace creation was not atomic. For a brief window, the half-isolated process could see host PIDs, touch the host network, or access the host filesystem before all namespaces were in place.
The root cause is that creating each namespace with a separate syscall leaves gaps between transitions. If runc called clone() once for PID, then again for NET, then again for MNT, any failure or delay between calls produces a partially isolated process that violates every security assumption.
Docker solves this with clone3() using CLONE_NEWPID|CLONE_NEWNET|CLONE_NEWNS to create all namespaces atomically in a single syscall, then installs a seccomp-BPF filter permanently blocking roughly 44 dangerous syscalls like mount and reboot. This reduces per-container setup overhead to under 50ms while guaranteeing that every kernel request from inside passes through the syscall gate.
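Here is a rough sketch of what that atomic creation looks like at the syscall level. This is not runc's actual code -- just the pattern; it assumes Linux 5.3+ (for clone3) and typically CAP_SYS_ADMIN for the namespace flags.

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <signal.h>
#include <unistd.h>
#include <sys/wait.h>
#include <sys/syscall.h>
#include <linux/sched.h>    /* struct clone_args, CLONE_NEW* flags */

int main(void) {
    struct clone_args args;
    memset(&args, 0, sizeof(args));
    /* All three namespaces come into existence in one syscall --
       there is no half-isolated window between them. */
    args.flags = CLONE_NEWPID | CLONE_NEWNET | CLONE_NEWNS;
    args.exit_signal = SIGCHLD;

    long pid = syscall(SYS_clone3, &args, sizeof(args));
    if (pid < 0) { perror("clone3"); return 1; }
    if (pid == 0) {
        /* Child: isolated from its very first instruction. */
        printf("child sees itself as pid %ld\n", (long)getpid());
        _exit(0);
    }
    waitpid((pid_t)pid, NULL, 0);
    return 0;
}
```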
A Go microservice handling 50K goroutines freezes for hundreds of milliseconds whenever a handful of goroutines make blocking read() calls. Every goroutine sharing those OS threads stalls, and tail latency spikes to unacceptable levels.
The problem is that a blocking syscall parks the entire OS thread, and without coordination between the scheduler and the syscall layer, every goroutine queued on that thread waits for the kernel to return. The Go runtime has no way to reclaim the processor unless it knows a syscall is in progress.
Go fixes this by bypassing glibc entirely and wrapping every raw syscall with entersyscall()/exitsyscall() hooks, allowing the runtime to hand the processor to another goroutine instantly. For time.Now(), Go reads CLOCK_MONOTONIC via the vDSO at roughly 20ns per call instead of paying the 200ns cost of a real syscall, saving about 3 million unnecessary kernel transitions per second in a typical microservice.
A malicious webpage exploits a renderer vulnerability and calls connect() to exfiltrate session cookies to a remote server. Without syscall-level restrictions, the compromised renderer process has access to every one of the 450+ kernel operations the user can invoke.
The core issue is that the renderer runs as a regular process with full syscall access. Any code execution exploit immediately inherits the ability to open files, create network connections, spawn processes, and interact with the kernel in ways a tab should never need.
Chrome installs a seccomp-BPF filter that whitelists only about 50 syscalls for renderers, permanently blocking open(), connect(), execve(), and fork(). File and network access must go through IPC to the privileged browser process, adding roughly 5us of latency per proxied call but cutting the renderer attack surface by nearly 90%.
Same Concept Across Tech
| Technology | Syscall behavior | Key insight |
|---|---|---|
| Node.js | Async I/O reduces syscall blocking but each operation still needs a syscall | libuv batches I/O via epoll to reduce syscall count |
| Go | Runtime manages goroutine-to-thread mapping, netpoller reduces network syscalls | High goroutine count does not mean high syscall count |
| JVM | NIO uses fewer syscalls than blocking I/O by multiplexing via Selector | Direct ByteBuffers skip an extra heap-to-native copy on each I/O call |
| Docker | seccomp profile restricts which syscalls a container can make | Default Docker profile blocks ~44 dangerous syscalls |
| Kubernetes | Pod securityContext can set seccomp profiles per container | RuntimeDefault profile is a good starting point |
Stack layer mapping (high system CPU debugging):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the app making many small I/O calls instead of batching? | Application profiler, code review |
| Runtime | Is buffered I/O enabled? Is the runtime doing unnecessary syscalls? | strace -c to see syscall distribution |
| Syscall | Which syscalls dominate? How much time per call? | strace -c -S time, perf trace |
| Kernel | Is the kernel spending time in specific subsystems? | perf top, /proc/stat |
| Hardware | Is the disk slow? Network latency high? | iostat, ss -ti |
Design Rationale
Hardware-enforced privilege boundaries exist because no amount of library-level checking stops a compromised process from directly poking hardware or reading another process's memory. Running everything in one address space -- the unikernel approach -- trades security for speed and takes down the entire system when any component crashes. The ring 3/ring 0 split is the price of mutual suspicion between processes.
The vDSO exists for a narrower reason: gettimeofday() is a read-only query against kernel data, and paying 200ns for a ring transition millions of times per second on every server that logs or measures latency is a pointless tax. So the kernel exports a shared read-only page and lets user space help itself.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| High sy% in top, low us% | Too many syscalls per second | strace -c -p PID |
| strace shows millions of write() with small sizes | Unbuffered I/O, each write is a separate syscall | Use buffered writers, writev, or increase buffer size |
| gettimeofday is fast but clock_gettime is slow | vDSO fallback not working, or using CLOCK_MONOTONIC_RAW | Check vDSO availability: ldd binary |
| Application blocked in futex() calls | Waiting on mutex/lock contention, not a syscall issue per se | perf trace, look for lock contention |
| Container cannot make certain syscalls | seccomp profile blocking them | Check seccomp in /proc/PID/status, audit log |
| strace shows EPERM on socket/mount syscalls | Namespace or capability restrictions | Check capabilities with getpcaps PID |
When to Use / Avoid
Relevant when:
- High system time (sy) in top or mpstat indicates too many syscalls per second
- Optimizing I/O-heavy applications (batching writes, using writev instead of multiple write calls -- see the sketch after this list)
- Understanding strace output and what each syscall does
- Using vDSO-accelerated calls for high-frequency time reads
Watch out for:
- Excessive small writes (each write() is a mode switch, use buffered I/O)
- Forgetting that strace itself adds overhead (attach briefly, not permanently)
- Assuming all syscalls are equally expensive (gettimeofday via vDSO is near-zero, write() is not)
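As a concrete illustration of the batching advice above, this minimal sketch replaces three write() calls with a single writev(2), paying for one mode switch instead of three:

```c
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    struct iovec iov[3] = {
        { .iov_base = (void *)"one ",    .iov_len = 4 },
        { .iov_base = (void *)"two ",    .iov_len = 4 },
        { .iov_base = (void *)"three\n", .iov_len = 6 },
    };
    /* One syscall, three buffers, written in order. */
    return writev(STDOUT_FILENO, iov, 3) < 0;
}
```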
Try It Yourself
```bash
# Count syscalls made by a command, grouped by type
strace -c ls /tmp 2>&1 | tail -20

# Trace only read and write syscalls with timestamps
strace -e trace=read,write -t -p $$ 2>&1 | head -20

# Show the syscall number table for x86-64
ausyscall --dump | head -30

# Check that the vDSO is mapped (appears as [vdso])
grep vdso /proc/self/maps

# Count clock_gettime calls that actually enter the kernel
# (vDSO-resolved calls never hit this tracepoint)
perf stat -e syscalls:sys_enter_clock_gettime -a -- sleep 1

# View the raw syscall state of a process
cat /proc/$$/syscall
```
Debug Checklist
1. Count syscalls per second: timeout 5 strace -c -p <pid>
2. Check system vs user CPU: top, look at sy% vs us%
3. Trace specific syscalls: timeout 5 strace -e write,read -T -p <pid>
4. Profile syscall overhead: perf stat -e raw_syscalls:sys_enter -p <pid> -- sleep 5
5. Check vDSO usage: ldd /proc/<pid>/exe | grep vdso
6. Check seccomp restrictions: grep Seccomp /proc/<pid>/status
Key Takeaways
- ✓The 4th argument uses R10 instead of RCX because the hardware clobbers RCX -- the syscall instruction saves RIP into RCX and RFLAGS into R11. This one hardware quirk shapes the entire x86-64 calling convention for syscalls.
- ✓The kernel never trusts user pointers. copy_from_user()/copy_to_user() validate addresses and handle page faults gracefully. Dereferencing a user pointer directly from kernel code is a security hole -- SMEP/SMAP enforce this in hardware.
- ✓gettimeofday() almost never enters the kernel. The vDSO reads a shared page updated on each timer tick, making it a pure user-space call (~20ns vs ~200ns). Most 'syscall benchmarks' are actually measuring vDSO speed.
- ✓Linux has ~450 syscalls on x86-64. New ones are rare because each is a permanent ABI commitment. The kernel prefers extending existing syscalls with flags (openat2, clone3) or multiplexers (ioctl, prctl) over adding new entries.
- ✓Negative return values between -4095 and -1 encode errors. The C library translates -errno to -1 and sets errno. This is also why the kernel never maps the top 4095 bytes of the address space -- a valid pointer returned by mmap() must never collide with the error range.
Common Pitfalls
- ✗Mistake: Assuming glibc wrappers are thin pass-throughs. Reality: wrappers can add overhead, transform arguments, or skip the kernel entirely -- glibc's getpid() famously cached the PID in user space (until the cache was removed in glibc 2.25), issuing no syscall at all after the first call in a process.
- ✗Mistake: Using int 0x80 on a 64-bit system. Reality: This enters the 32-bit compatibility path, truncates arguments to 32 bits, and uses completely different syscall numbers. Programs mixing 64-bit code with int 0x80 get silent data corruption.
- ✗Mistake: Assuming syscalls are atomic. Reality: Most blocking syscalls (read, write, sleep) can return early with EINTR when a signal arrives. Not retrying on EINTR is one of the most common Unix programming bugs.
- ✗Mistake: Using inline assembly for raw syscalls without understanding the clobber list. Reality: The syscall instruction clobbers RCX and R11, and the kernel may clobber additional registers. Use the syscall() wrapper or explicit clobber declarations.
Reference
In One Line
High sy% in top almost always means too many small syscalls -- batch them, buffer them, or let the vDSO handle the ones that never needed to enter the kernel in the first place.