ptrace: Process Tracing & Debugging
Mental Model
A puppet theater. The puppet moves on stage performing its routine. Behind the curtain, a puppeteer can freeze it mid-motion, inspect exactly how the strings are positioned, adjust them to alter the next move, plant a tripwire on stage that halts the puppet when stepped on, or demand the puppet pause and report every time it walks through a doorway. One puppeteer per puppet -- no sharing.
The Problem
A production service stops responding. No logs, no errors, no crash dump -- engineers spend 4+ hours restarting and guessing without ever seeing the process's live state. In a different scenario, a compromised container ptrace-attaches to a neighbor and extracts database credentials from memory in 200 ms, leaving zero application-level trace. And strace on a high-throughput service (50,000 syscalls/sec) adds two context switches per syscall, slowing it 10-100x -- naive tracing in production is not an option.
Architecture
When a process goes silent like that, someone needs to look inside its brain. What syscall is it stuck on? What file is it trying to open? What arguments is it passing?
That is what ptrace does. Every time strace, GDB, or any other debugger runs on Linux, ptrace is doing the actual work.
What Actually Happens
ptrace establishes a tracer-tracee relationship between two processes. The typical flow:
For launching a new program (GDB run): fork() a child. The child calls ptrace(PTRACE_TRACEME), then execve(). The kernel stops the child at the first instruction of the new program and notifies the parent via waitpid(). GDB now has full control.
For attaching to a running process (strace -p PID): The tracer calls ptrace(PTRACE_ATTACH, PID) or PTRACE_SEIZE. The kernel sends SIGSTOP to the tracee (for ATTACH) or simply marks it as traced (for SEIZE). The tracee enters TASK_TRACED state, and the tracer can inspect everything.
For syscall tracing (strace): PTRACE_SYSCALL resumes the tracee and stops it at the next syscall boundary. Each syscall produces two stops: at entry (read syscall number from orig_rax, arguments from rdi/rsi/rdx/r10/r8/r9) and at exit (read return value from rax). This two-stop pattern is why strace can show both arguments and results -- and why it adds 10-100x overhead.
For breakpoints (GDB): GDB uses PTRACE_POKETEXT to overwrite the first byte of a target instruction with 0xCC (x86 INT3). When the CPU hits INT3, it generates SIGTRAP, stopping the tracee. GDB reads registers, restores the original byte, single-steps past it, re-inserts the breakpoint, and continues.
Under the Hood
Signal injection and suppression. When a tracee stops due to a signal, the tracer can suppress it (pass 0 to PTRACE_CONT) or inject a different one. GDB suppresses SIGTRAP from breakpoints because they were set intentionally. It can also intercept SIGSEGV, examine the fault, and decide whether to let the signal through.
PTRACE_O_TRACESYSGOOD solves ambiguity. By default, both signal stops and syscall stops deliver SIGTRAP, making them indistinguishable. Setting this option causes syscall stops to set bit 7 (delivering SIGTRAP | 0x80), so the tracer reliably knows "this stop is a syscall" vs "this is a real signal."
/proc/PID/mem is faster than PTRACE_PEEKDATA. Modern tracers read memory via /proc/PID/mem with a single pread() call for arbitrary sizes. PTRACE_PEEKDATA reads one word (8 bytes) per syscall. For a 4KB page, that is 1 syscall vs 512.
Anti-debugging via self-tracing. Since only one tracer can attach to a process, calling ptrace(PTRACE_TRACEME) early blocks later debugger attachments with EPERM. Common in commercial software and malware. Bypassed with LD_PRELOAD or CAP_SYS_PTRACE.
Common Questions
How does strace work internally?
It forks a child (or PTRACE_ATTACHes to a running process), sets PTRACE_O_TRACESYSGOOD, and loops: PTRACE_SYSCALL to resume, waitpid() to catch the next stop, PTRACE_GETREGS to read registers. At entry, decode orig_rax and arguments. At exit, read rax for the return value. The -f flag uses PTRACE_O_TRACEFORK to follow children.
Why is strace so slow?
Two context switches per syscall (entry stop + exit stop), plus ptrace syscall overhead for register reads. That is ~10-20 microseconds per traced syscall. Alternatives: perf trace (sampling-based, much lower overhead), eBPF tracing (bpftrace, sysdig -- runs in-kernel without stopping the target), and seccomp with SECCOMP_RET_LOG for audit logging.
How does GDB implement hardware watchpoints?
GDB writes the watched address into debug registers DR0-DR3 and sets type/length bits in DR7 via PTRACE_POKEUSER. The CPU monitors these addresses in hardware -- when accessed, it generates a debug exception delivered as SIGTRAP. Zero runtime overhead, but limited to 4 addresses on x86.
What are the security implications?
A process with ptrace access can read passwords, encryption keys, and API tokens from memory. It can modify code to bypass security checks, inject shellcode, and hijack control flow. That is why Ubuntu defaults to ptrace_scope=1, containers drop CAP_SYS_PTRACE, and Docker's seccomp profile blocks it entirely.
How Technologies Use This
A compromised container calls ptrace(PTRACE_ATTACH) on another container's process and reads database passwords straight from memory. Without restrictions, any process with the same UID can attach to another and dump its entire address space in milliseconds.
The danger is that ptrace grants complete read/write access to another process's memory and registers. An attacker who can attach to a database process can extract connection strings, encryption keys, and user data without ever touching the filesystem or network. The attack leaves no application-level log trail because it operates entirely through memory inspection.
Docker's default seccomp profile blocked the ptrace syscall outright for years (it has been permitted since Docker 19.03 on kernels 4.8+), and CAP_SYS_PTRACE is dropped from the container's capability bounding set, so even a root process inside the container cannot trace outside the restrictions. Need strace or GDB for debugging? Grant it explicitly with --cap-add SYS_PTRACE -- and never in production, where it widens the container's attack surface to every process it can reach.
A compromised renderer tab calls open() on /etc/passwd or connect() to an external server, and the syscalls go straight to the kernel with no chance for the browser process to inspect or reject them. seccomp blocks known-bad syscalls, but cannot make nuanced decisions about syscall arguments.
The gap is that seccomp-bpf filters make binary allow/deny decisions based on syscall numbers and arguments, but cannot perform complex policy checks like verifying that a file path is safe or that a network destination is authorized. Some syscalls need contextual validation that only a privileged supervisor process can provide.
Chrome's sandbox closes this gap with a two-layer design: the renderer's seccomp-bpf filter makes the cheap allow/deny decisions in-kernel, and syscalls that need contextual validation (file opens, network connections) are delegated to the privileged browser process, which performs them on the renderer's behalf after policy checks. seccomp can also escalate a flagged syscall to a ptrace-based supervisor via SECCOMP_RET_TRACE, letting the supervisor inspect arguments before deciding. Per-syscall interception of this kind costs on the order of 10-20us, so it is reserved for the rare operations that need it while the hot path stays in-kernel.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Debugging attach | --cap-add SYS_PTRACE required; default seccomp historically blocked ptrace | jcmd/jstack use the JVM attach mechanism, not ptrace; remote debug via JDWP | node --inspect uses V8 debug protocol, not ptrace | Delve debugger uses ptrace on Linux | ephemeral debug containers with SYS_PTRACE capability |
| Syscall tracing | strace inside container needs SYS_PTRACE cap | strace on JVM PID shows JNI and native syscalls | strace on node PID reveals libuv I/O patterns | strace on Go binary shows raw syscalls (no libc) | kubectl debug with strace image |
| Security restriction | Default seccomp + dropped CAP_SYS_PTRACE | Yama ptrace_scope limits who can attach | Same Yama restrictions apply | Same Yama restrictions apply | PodSecurityPolicy/PSA blocks SYS_PTRACE in production |
| Production alternative | eBPF sidecar for tracing | JFR (Java Flight Recorder) for in-process tracing | --perf-basic-prof + perf for sampling | runtime/trace + pprof for Go-native profiling | Pixie, Cilium Hubble for eBPF-based observability |

| Stack Layer | Mechanism |
|---|---|
| Application | GDB, LLDB, strace, ltrace -- all frontends to ptrace |
| Language runtime | Delve (Go), JDWP (Java), V8 Inspector (Node) may bypass ptrace with protocol-level debugging |
| Kernel ptrace subsystem | Validates CAP_SYS_PTRACE, Yama scope; manages TASK_TRACED state and signal routing |
| Security modules | Yama LSM enforces ptrace_scope policy; seccomp-bpf can block the ptrace syscall entirely |
| Hardware | x86 debug registers DR0-DR3 for hardware breakpoints/watchpoints; INT3 (0xCC) for software breakpoints |
Design rationale: Putting register access, memory inspection, syscall interception, and breakpoints behind a single syscall keeps the debugging interface simple and universal. The cost -- two context switches per traced syscall and exclusive single-tracer access -- is acceptable for debugging but pushed production tracing toward in-kernel alternatives like eBPF that run without stopping the target.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| strace fails with "Operation not permitted" | Yama ptrace_scope restricts non-parent tracing | cat /proc/sys/kernel/yama/ptrace_scope |
| Cannot attach debugger to containerized process | CAP_SYS_PTRACE dropped and seccomp blocks ptrace | docker inspect --format '{{.HostConfig.CapAdd}}' $CONTAINER |
| strace shows process stuck on futex() or epoll_wait() | Process waiting on lock or I/O event -- not a bug, just idle | strace -e trace=futex -c to measure wait frequency |
| GDB breakpoint causes SIGILL instead of stopping | Breakpoint set on non-executable memory or misaligned address | info breakpoints in GDB; check /proc/$PID/maps for segment permissions |
| Application 10x slower under strace | Two stops per syscall at high syscall rate | strace -c to count syscalls; switch to perf trace or bpftrace |
| "already being traced" error on ptrace attach | Another tracer (debugger, strace, security tool) already attached | grep TracerPid /proc/$PID/status |
When to Use / Avoid
- Use when debugging a hung process with no logs -- strace reveals the blocking syscall
- Use when profiling syscall patterns to build a seccomp whitelist for containers
- Use when setting breakpoints or inspecting memory in a live process via GDB
- Use when tracing library calls (ltrace) to diagnose dynamic linking issues
- Avoid in production for continuous monitoring -- 10-100x overhead; use eBPF or perf trace instead
- Avoid when multiple tools need simultaneous tracing -- only one tracer can attach at a time
Try It Yourself
# Trace all syscalls of a command with timing info
strace -T -f ls /tmp 2>&1 | head -30

# Count syscalls by type (summary mode)
strace -c -f curl -s https://example.com 2>&1 | tail -20

# Trace only file-related syscalls of a running process
strace -e trace=openat,read,write,close -p $$ 2>&1 &
sleep 1; kill %1

# Check Yama ptrace_scope setting
cat /proc/sys/kernel/yama/ptrace_scope
# 0=classic, 1=parent-only (Ubuntu default), 2=admin-only, 3=none

# Show which processes are being traced
grep -l TracerPid /proc/[0-9]*/status 2>/dev/null | while read f; do
  pid=$(echo $f | cut -d/ -f3)
  tracer=$(grep TracerPid $f | awk '{print $2}')
  [ "$tracer" != "0" ] && echo "PID $pid traced by $tracer"
done

# Read a process's memory map (useful for ptrace targets)
cat /proc/$$/maps | head -10

Debug Checklist
1. strace -p $PID -e trace=openat,read,write -T 2>&1 | head -50 -- see what files and I/O a process is doing
2. strace -c -f -p $PID -- aggregate syscall counts and time for a running process
3. cat /proc/sys/kernel/yama/ptrace_scope -- check ptrace security policy (0-3)
4. grep TracerPid /proc/$PID/status -- check if a process is already being traced
5. cat /proc/$PID/maps | head -10 -- view memory layout before attaching with ptrace
6. strace -e trace=network -f -p $PID 2>&1 | head -30 -- trace network syscalls across all threads
Key Takeaways
- ✓ strace stops the tracee at every syscall entry AND exit -- two stops per syscall. At entry it reads the number from orig_rax and arguments from registers. At exit it reads the return value from rax. This two-stop pattern is why strace slows programs by 10-100x.
- ✓ GDB sets breakpoints by overwriting instruction bytes with 0xCC (INT3). When the CPU hits INT3, it generates SIGTRAP. GDB restores the original byte, single-steps past it, re-inserts the breakpoint, and continues. Hardware breakpoints use debug registers DR0-DR3.
- ✓ Yama LSM restricts ptrace via /proc/sys/kernel/yama/ptrace_scope: 0 = any process can trace any other, 1 = only parent can trace child (Ubuntu default), 2 = only CAP_SYS_PTRACE, 3 = no ptrace at all. This prevents malware from reading secrets out of other processes' memory.
- ✓ PTRACE_SEIZE (Linux 3.4+) is preferred over PTRACE_ATTACH. It does not send SIGSTOP (avoids race conditions), enables PTRACE_EVENT_STOP, and allows PTRACE_INTERRUPT for on-demand stopping.
- ✓ Only one tracer can attach to a process at a time. You cannot strace a process that GDB is already debugging. A process can also self-trace via PTRACE_TRACEME to block later debugger attachments -- a common anti-debugging technique.
Common Pitfalls
- ✗ Mistake: Not calling waitpid() after PTRACE_ATTACH. Reality: The tracee does not stop synchronously. PTRACE_ATTACH sends SIGSTOP, and you must waitpid() for the stop before issuing other ptrace commands. Reading registers before the tracee stops gives stale data.
- ✗ Mistake: Reading the syscall number from rax at syscall exit. Reality: The return value overwrites rax. The original syscall number is in orig_rax (offset 120 in user_regs_struct). Use ORIG_RAX, not RAX.
- ✗ Mistake: Forgetting to handle PTRACE_EVENT_* stops after setting PTRACE_O_TRACEFORK. Reality: Fork events produce a PTRACE_EVENT_FORK stop, not a signal stop. Use PTRACE_GETEVENTMSG to get the child PID and PTRACE_CONT to resume. Missing these events hangs the tracee indefinitely.
- ✗ Mistake: Using ptrace for production monitoring. Reality: ptrace stops the tracee for each operation, adding ~10-20us per syscall. For production, use eBPF or seccomp with SECCOMP_RET_LOG -- they run in-kernel without stopping the target.
Reference
In One Line
strace for quick diagnosis, eBPF for production -- ptrace stops the target on every operation, which is fine for debugging but lethal for throughput.