Kernel Tracing with ftrace, kprobes, and tracepoints
Mental Model
A building with thousands of light switches on the walls, one at every doorway and junction. Every switch is wired to a central logging room. When all switches are off, electricity flows through the building normally with no overhead. Flipping a specific switch activates a sensor at that doorway that records who walked through and when. The building does not need to be rewired or shut down to install the sensors -- they were built into the walls at construction time (tracepoints) or can be clipped onto any wire at runtime (kprobes). The logging room has a separate notebook for each floor (per-CPU ring buffers) so that writers on different floors never fight over the same pen.
The Problem
A production server is showing unexplained latency spikes in the I/O path. Application-level metrics show requests occasionally taking 50ms instead of the expected 2ms. Standard tools like iostat and sar show nothing abnormal. The latency lives somewhere between the system call layer and the block device driver, but the exact function is unknown. The server cannot be rebooted, the kernel has no debug symbols installed, and attaching a traditional debugger is not an option on a production box handling 10k requests per second.
Architecture
The traditional approach -- add printk calls, recompile the kernel, reboot, reproduce the problem -- is not viable on a server handling live traffic. Attaching kgdb and halting a CPU is even less viable.
Linux has a better way. The kernel ships with thousands of instrumentation points that can be turned on and off at runtime, with no recompilation, no reboot, and near-zero overhead when disabled.
The Three Pillars of Kernel Tracing
ftrace: The Function Tracer
ftrace is the foundational tracing framework built into every modern Linux kernel. It works through a clever compiler trick: every kernel function is compiled with a call to mcount (or fentry on modern x86) at its entry point. At boot, dynamic ftrace patches all these calls to NOP instructions. The kernel runs at full speed with tracing compiled in but disabled.
When tracing is enabled for a specific function, ftrace patches just that function's NOP back to a call into the tracing infrastructure. Everything else remains untouched.
The interface is the filesystem at /sys/kernel/debug/tracing:
# Mount debugfs if not already mounted
mount -t debugfs none /sys/kernel/debug
# See what tracers are available
cat /sys/kernel/debug/tracing/available_tracers
# function function_graph nop
# Enable function tracer for a single function
echo function > /sys/kernel/debug/tracing/current_tracer
echo vfs_write > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/tracing_on
# Read the trace stream
cat /sys/kernel/debug/tracing/trace_pipe
# dd-12345 [002] 1234.567890: vfs_write <-ksys_write
# dd-12345 [002] 1234.567891: vfs_write <-ksys_write
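Two tracefs files expose this patching machinery directly, which is useful for confirming what is instrumentable before enabling anything:
# Every call site dynamic ftrace can patch on this kernel (tens of thousands)
wc -l < /sys/kernel/debug/tracing/available_filter_functions
# Functions that currently have a tracing callback attached (empty when idle)
cat /sys/kernel/debug/tracing/enabled_functions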
The function_graph tracer goes further. It hooks both function entry and return, producing indented call trees with per-function timing:
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo vfs_write > /sys/kernel/debug/tracing/set_graph_function
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe
Output resembles:
 2)               |  vfs_write() {
 2)               |    __vfs_write() {
 2)               |      ext4_file_write_iter() {
 2)               |        ext4_buffered_write_iter() {
 2)   0.451 us    |          down_write_trylock();
 2)               |          generic_perform_write() {
 2)   3.821 us    |            ext4_write_begin();
 2)   0.320 us    |            copy_page_from_iter_atomic();
 2)   1.205 us    |            ext4_write_end();
 2) + 12.440 us   |          }
 2) + 14.103 us   |        }
 2) + 15.220 us   |      }
 2) + 16.001 us   |    }
 2) + 16.832 us   |  }
This output immediately shows where time is spent. If ext4_write_begin takes 20ms instead of 4us, that function becomes the investigation target.
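When hunting outliers like that hypothetical 20ms write, reading the full tree is unnecessary: function_graph honors a global duration threshold. A minimal sketch (tracing_thresh is in microseconds; 0 disables it):
echo function_graph > /sys/kernel/debug/tracing/current_tracer
# Record only functions that took longer than 1ms
echo 1000 > /sys/kernel/debug/tracing/tracing_thresh
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe
# Reset the threshold when finished
echo 0 > /sys/kernel/debug/tracing/tracing_thresh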
kprobes: Dynamic Breakpoints Anywhere
Static tracepoints and ftrace function tracing cover most needs, but sometimes the instrumentation point needed does not exist. kprobes solve this by allowing a breakpoint at any instruction address in the kernel.
The mechanism on x86:
- Save the original instruction at the target address.
- Replace it with int3 (the breakpoint exception).
- When the CPU hits int3, the kprobe handler runs.
- Single-step the saved original instruction.
- Resume normal execution.
kprobes can be registered from kernel modules via register_kprobe(), or with no code at all via the ftrace interface:
# Register a kprobe on do_sys_openat2, capture the filename argument
echo 'p:myprobe do_sys_openat2 filename=+0(%si):string' \
> /sys/kernel/debug/tracing/kprobe_events
# Enable it
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
# Watch the output
cat /sys/kernel/debug/tracing/trace_pipe
# bash-1234 [001] 5678.901234: myprobe: (do_sys_openat2+0x0/0x...)
# filename="/etc/passwd"
kretprobes work similarly but fire on function return, enabling latency measurement:
# Register entry and return probes on vfs_read
echo 'p:vfs_read_entry vfs_read' \
>> /sys/kernel/debug/tracing/kprobe_events
echo 'r:vfs_read_return vfs_read $retval' \
>> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
cat /sys/kernel/debug/tracing/trace_pipe
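The entry/return pairs can be post-processed into per-call latency. A rough sketch, assuming the trace_pipe layout shown in the earlier examples (field 1 is COMM-PID, field 3 is the timestamp; some kernels insert a flags column, so adjust the field numbers to match your output):
cat /sys/kernel/debug/tracing/trace_pipe | awk '
  /vfs_read_entry/ { ts = $3; sub(/:$/, "", ts); start[$1] = ts }
  /vfs_read_return/ && ($1 in start) {
      ts = $3; sub(/:$/, "", ts)
      printf "%-20s %8.1f us\n", $1, (ts - start[$1]) * 1e6
      delete start[$1]
  }'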
Static Tracepoints: Zero-Cost When Off
The kernel source contains thousands of TRACE_EVENT macros at semantically meaningful locations. These compile to a static key check: when no tracer is attached, the branch is a NOP (patched at runtime using static key infrastructure). Enabling a tracepoint patches the NOP to a JMP into the tracing handler.
The overhead of a disabled tracepoint is one NOP instruction -- immeasurable in practice.
# List all block I/O tracepoints
ls /sys/kernel/debug/tracing/events/block/
# block_bio_backmerge block_bio_bounce block_bio_complete
# block_bio_frontmerge block_bio_queue block_bio_remap
# block_getrq block_plug block_rq_complete
# block_rq_insert block_rq_issue block_rq_requeue
# block_split block_unplug
# Enable I/O submission and completion tracepoints
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe
# kworker-890 [003] 9012.345678: block_rq_issue: 259,0 W 4096 () 12345 + 8
# kworker-890 [003] 9012.345890: block_rq_complete: 259,0 W () 12345 + 8 [0]
The 212us gap between issue and complete is the device latency. If that gap jumps to 50ms, the problem is in the hardware or driver, not the kernel block layer.
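Every tracepoint also publishes its record layout in a format file, and those field names are what event filters match against, so uninteresting events can be dropped at the source:
# Inspect the fields block_rq_issue records (dev, sector, nr_sector, ...)
cat /sys/kernel/debug/tracing/events/block/block_rq_issue/format
# Only record requests larger than 8 sectors
echo 'nr_sector > 8' > /sys/kernel/debug/tracing/events/block/block_rq_issue/filter
# Clear the filter
echo 0 > /sys/kernel/debug/tracing/events/block/block_rq_issue/filter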
Putting It Together: The Latency Investigation
Start at the system call boundary and work down:
# Step 1: Confirm latency is below the syscall layer
strace -T -e trace=write -p $(pgrep appname) 2>&1 | \
    awk -F'[<>]' '$(NF-1) + 0 > 0.010'   # -T appends <seconds>; show writes >10ms
# Step 2: Trace the block layer
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe > /tmp/blocktrace.log &
# Step 3: Narrow down with function_graph
trace-cmd record -p function_graph -g vfs_write \
-F dd if=/dev/zero of=/mnt/data/testfile bs=4k count=1000
trace-cmd report | awk '{ for (i = 1; i < NF; i++)
    if ($(i+1) == "us" && $i + 0 > 1000) { print; break } }'   # Functions taking >1ms
# Step 4: If the suspect is a specific driver function, attach entry and
# return probes to time it (the kretprobe fires when nvme_queue_rq returns,
# i.e. when submission finishes -- not when the device completes the I/O)
echo 'p:nvme_q_entry nvme_queue_rq' \
    >> /sys/kernel/debug/tracing/kprobe_events
echo 'r:nvme_q_return nvme_queue_rq $retval' \
    >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
# Step 5: Clean up
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo nop > /sys/kernel/debug/tracing/current_tracer
echo > /sys/kernel/debug/tracing/kprobe_events
echo 0 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 0 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
Under the Hood
Dynamic ftrace patching. At kernel compile time, the -pg flag (or -fpatchable-function-entry on modern GCC) inserts a call mcount or a 5-byte NOP at every function entry. During boot, the ftrace_init function records the location of every one of these call sites in a table. When tracing is disabled, all sites are patched to NOPs using text_poke_bp (which handles the SMP-safe instruction replacement problem). When tracing is enabled for function X, only X's call site is patched back to call ftrace_caller. The rest of the kernel never notices.
kprobe optimization. On x86, unoptimized kprobes use int3, which triggers a trap and context switch to the handler -- expensive at high frequency. Optimized kprobes (enabled by default on modern kernels) replace the target instruction with a JMP to a trampoline containing the handler and the original instruction. This avoids the trap overhead entirely. The optimization kicks in automatically after a kprobe has been hit a few times.
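Whether a given kprobe actually got optimized is visible at runtime:
# Active kprobes; optimized ones carry the [OPTIMIZED] tag
cat /sys/kernel/debug/kprobes/list
# Global on/off switch for the optimization (sysctl debug.kprobes-optimization)
cat /proc/sys/debug/kprobes-optimization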
Tracepoint static keys. The TRACE_EVENT macro expands to a branch using the static key infrastructure (static_branch_unlikely). This compiles to a JMP instruction that is patched at runtime. When the tracepoint is disabled, the JMP target is the very next instruction (effectively a NOP). When enabled, the JMP goes to the tracing handler. The patching uses text_poke_bp with stop-machine on older kernels or INT3-based patching on newer ones for SMP safety.
Per-CPU ring buffers. ftrace maintains a separate ring buffer for each CPU. Events are written to the local CPU's buffer without taking any locks. The buffer is a linked list of pages, each containing packed event records with timestamps. When a reader opens trace_pipe, the infrastructure merges events from all CPU buffers sorted by timestamp. The per-CPU design means tracing scales linearly with core count -- no contention, no cache bouncing.
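The per-CPU structure is visible directly in tracefs:
# Each CPU has its own buffer, statistics, and raw stream
ls /sys/kernel/debug/tracing/per_cpu/cpu0/
# Stats include entries written and overruns (events lost to wraparound)
cat /sys/kernel/debug/tracing/per_cpu/cpu0/stats
# Read one CPU's events without the cross-CPU timestamp merge
cat /sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe
# buffer_size_kb is per CPU: 16384 here means 16 MB on every CPU
echo 16384 > /sys/kernel/debug/tracing/buffer_size_kb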
Common Questions
How much overhead does ftrace add in production?
With tracing disabled, the overhead is unmeasurable. Dynamic ftrace patches all instrumentation to NOPs at boot. When tracing a single function, the overhead is the handler execution time (typically 100-500ns per call). Tracing a function called 100k times per second adds roughly 10-50ms of CPU time per second -- about 1-5% of one core. function_graph is roughly 2x the overhead of function tracing because it hooks both entry and return.
When should kprobes be used instead of tracepoints?
Tracepoints should be the first choice because they have stable APIs and near-zero disabled overhead. Use kprobes when: the needed instrumentation point has no tracepoint, tracing an internal function that is not exported, or prototyping a new tracepoint location before submitting a kernel patch. In BPF-based tooling, bpftrace and BCC abstract this choice -- both kprobe and tracepoint attachments use the same scripting syntax.
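For comparison, the same two attachment types in bpftrace, where only the probe prefix changes (assumes bpftrace is installed):
# kprobe attachment: count vfs_read calls per process
bpftrace -e 'kprobe:vfs_read { @calls[comm] = count(); }'
# tracepoint attachment: histogram of block I/O request sizes
bpftrace -e 'tracepoint:block:block_rq_issue { @bytes = hist(args->bytes); }'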
Can ftrace and kprobes be used simultaneously?
Yes. ftrace tracepoints, function tracing, and kprobes all feed into the same ring buffer infrastructure. Events from all sources are interleaved in timestamp order when reading trace_pipe. This is powerful for correlation: enable a tracepoint for block_rq_issue and a kprobe on an internal driver function, and the merged trace shows the exact sequence of events.
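A minimal sketch of such a correlation, combining a block tracepoint with a kprobe (nvme_queue_rq assumes an NVMe device; substitute your driver's submit function):
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 'p:nvme_q nvme_queue_rq' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/nvme_q/enable
# Both event types appear interleaved in timestamp order
cat /sys/kernel/debug/tracing/trace_pipe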
What is the difference between /sys/kernel/debug/tracing/trace and trace_pipe?
trace is a snapshot file. Reading it returns the current ring buffer contents without consuming them. Reading it again returns the same data (plus any new events). trace_pipe is a consuming read -- events are removed from the buffer as they are read, and the read blocks when the buffer is empty. Use trace for post-mortem analysis and trace_pipe for real-time monitoring.
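The difference is easy to verify:
# trace is non-consuming: reading it twice returns the same events
cat /sys/kernel/debug/tracing/trace | tail -3
cat /sys/kernel/debug/tracing/trace | tail -3   # same data again
# Writing to trace clears the buffer
echo > /sys/kernel/debug/tracing/trace
# trace_pipe consumes as it reads and blocks when the buffer is empty
cat /sys/kernel/debug/tracing/trace_pipe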
How do kprobes interact with kernel address space layout randomization (KASLR)?
kprobes register by symbol name, not raw address. The kernel resolves the symbol to its current address (which changes on every boot with KASLR) at registration time. This means kprobe-based tools work transparently with KASLR. The symbol resolution uses the same /proc/kallsyms data that debuggers use. If kptr_restrict is set to non-zero, unprivileged users cannot see kernel addresses, but root can still register kprobes.
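A quick way to see this (the address shown is illustrative and changes every boot):
# The symbol name is stable; the address is randomized by KASLR per boot
sudo grep ' do_sys_openat2$' /proc/kallsyms
# ffffffffa92c4de0 T do_sys_openat2
# With kptr_restrict set, unprivileged readers see zeroed addresses
cat /proc/sys/kernel/kptr_restrict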
How Technologies Use This
A container host running 60 Docker containers shows p99 syscall latency of 12ms for file operations inside specific containers, while the host-level I/O metrics look normal. The latency is suspected to originate in the overlayfs layer or the cgroup I/O throttling path, but standard tools like iostat cannot distinguish per-container kernel overhead.
Enabling the function_graph tracer scoped to a container's cgroup reveals the full kernel call chain for every file operation. Running `echo function_graph > /sys/kernel/debug/tracing/current_tracer` and filtering with `echo 'ovl_*' > /sys/kernel/debug/tracing/set_ftrace_filter` traces every overlayfs function with nanosecond timestamps and call depth indentation. The trace output shows exactly how long each layer lookup, whiteout check, and copy-up operation takes. On a container performing 500 file writes per second to an overlayfs mount, the trace might reveal that ovl_copy_up_one() takes 8ms per call due to copying a 4 MB file from the lower layer to the upper layer on first write.
Combining function_graph with cgroup-aware filtering via `trace-cmd record -p function_graph -O funcgraph-proc` and grepping for the container's PID namespace isolates the trace to a single container's kernel activity. This avoids the noise of 59 other containers and pinpoints whether the latency originates in overlayfs, the block layer, or the cgroup I/O controller.
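One way to build that scoping with plain tracefs, sketched under the assumption of a cgroup v2 layout (docker-abc123.scope is a hypothetical container unit; substitute the real one):
echo 'ovl_*' > /sys/kernel/debug/tracing/set_ftrace_filter
echo function_graph > /sys/kernel/debug/tracing/current_tracer
# Restrict tracing to the tasks in one container's cgroup
for pid in $(cat /sys/fs/cgroup/system.slice/docker-abc123.scope/cgroup.procs); do
    echo "$pid" >> /sys/kernel/debug/tracing/set_ftrace_pid
done
echo 1 > /sys/kernel/debug/tracing/tracing_on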
A Kubernetes node running 45 pods intermittently hangs for 5-10 seconds during pod creation. The kubelet logs show mount operations timing out, but the storage backend responds normally to direct I/O tests. The hang is suspected to be inside the kernel's mount syscall path, where overlayfs, cgroup, and namespace setup intersect.
Attaching a kprobe to the mount syscall entry point with `echo 'p:mount_probe do_mount' > /sys/kernel/debug/tracing/kprobe_events` and enabling it captures every mount call with its arguments and timestamp. Correlating slow mount calls with the kubelet's pod creation timeline reveals that the hang occurs specifically when do_mount() calls into the overlayfs code path while another mount operation holds the namespace_sem semaphore. The kprobe trace shows the blocked mount waiting 5-8 seconds for the semaphore.
Adding a kretprobe with `echo 'r:mount_ret do_mount ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events` captures the return value and duration of every mount call. Filtering the trace for calls exceeding 1 second isolates the problematic path. This approach requires no kernel recompilation, no debug symbols, and no node restart. The kprobes are removed by clearing the kprobe_events file, leaving zero residual overhead on the production node.
An Nginx proxy handling 30,000 connections reports that 0.5% of responses take over 200ms, but application-level timing shows the upstream responds within 10ms. The extra latency lives somewhere in the kernel TCP stack between receiving the upstream response and sending it to the client. Standard network tools like ss and netstat show no obvious congestion.
Attaching a kprobe to the send path with `echo 'p:tcp_send tcp_sendmsg' > /sys/kernel/debug/tracing/kprobe_events` captures every TCP send operation with the calling task and a timestamp (tcp_sendmsg is a plain kernel function, not a static tracepoint, so a kprobe is the right tool here). Filtering the trace output by Nginx worker PIDs and correlating timestamps with the slow responses reveals that certain sends stall for 150-180ms. Enabling the tcp:tcp_retransmit_skb tracepoint alongside shows that these stalls coincide with retransmission events on the client-facing connections, indicating packet loss on the downstream network segment.
This kernel-side view is superior to packet capture for this diagnosis because it shows the kernel's internal picture of the TCP state machine. The tcp:tcp_probe tracepoint, for example, records the congestion window, slow-start threshold, and window sizes at the moment of each event. Running `trace-cmd record -e tcp:tcp_probe -e tcp:tcp_retransmit_skb -f 'sport == 443' -- sleep 30` collects 30 seconds of filtered TCP activity with under 2% overhead, producing a definitive timeline of where the latency originates.
Same Concept Across Tech
| Technology | How it uses ftrace/kprobes | Key consideration |
|---|---|---|
| BCC/bpftrace | Attaches BPF programs to kprobes and tracepoints. bpftrace one-liners compile to BPF bytecode and attach via the same kprobe mechanism ftrace uses | Requires kernel 4.9+ for kprobe BPF attachment. BTF (5.2+) eliminates header dependency |
| trace-cmd | CLI wrapper around the ftrace debugfs interface. Manages per-CPU ring buffers and produces trace.dat files for KernelShark | Handles the raw file writes to debugfs so operators do not need to remember the exact file paths |
| perf | Consumes ftrace tracepoints via perf_event_open. perf ftrace subcommand wraps ftrace directly | perf adds symbol resolution and call graph recording on top of raw ftrace events |
| systemtap | Compiles tracing scripts into kernel modules that use kprobes internally | Requires kernel headers and a compiler on the target system. Heavier than BPF-based approaches |
| LTTng | Uses its own tracepoint infrastructure alongside kernel tracepoints for high-throughput tracing | Optimized for low-overhead continuous tracing. Uses per-CPU buffers similar to ftrace |
Stack layer mapping (unexplained I/O latency):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the latency in application code or below the syscall boundary? | strace -T to measure syscall time |
| Syscall | Which syscall is slow (read, write, fsync)? | perf trace -e syscalls:sys_enter_write |
| VFS | Is the VFS layer adding overhead (file locking, dentry lookup)? | ftrace function_graph on vfs_write |
| Block layer | Is the I/O scheduler reordering or delaying requests? | block:block_rq_issue / block:block_rq_complete tracepoints |
| Device driver | Is the driver firmware or DMA path slow? | kprobe on driver-specific submit function |
| Hardware | Is the device itself slow (power saving, thermal throttle)? | Compare tracepoint timestamp deltas with device specs |
Design Rationale
Traditional kernel debugging required either recompiling with printk statements (rebuild, reboot, reproduce, repeat) or attaching kgdb and halting the CPU. Neither works on production systems. ftrace solved this by embedding NOP stubs at every function entry during compilation, then patching them to tracing calls at runtime. The cost when disabled is a single NOP per function -- undetectable in benchmarks. kprobes extended this by allowing instrumentation at arbitrary addresses without recompilation. Static tracepoints completed the picture by providing stable, semantically meaningful instrumentation points that survive kernel upgrades. Together, these three mechanisms make every running Linux kernel a traceable system by default.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Latency spike visible in app metrics but not in iostat | Latency lives in kernel code between syscall and device | Enable block tracepoints: echo 1 > /sys/kernel/debug/tracing/events/block/enable |
| Function tracer shows no output for a known function | Function was inlined by the compiler or marked notrace | grep func_name /sys/kernel/debug/tracing/available_filter_functions |
| System becomes sluggish after enabling tracing | Global tracing without filter generates excessive events | Check current_tracer and set_ftrace_filter; add specific filters or set nop |
| kprobe fails to register on a valid symbol | Address is in a blacklisted region (.init, .exit, or ftrace itself) | dmesg for kprobe registration errors; check /sys/kernel/debug/kprobes/blacklist |
| trace_pipe shows nothing after enabling events | tracing_on is 0 (tracing paused) | echo 1 > /sys/kernel/debug/tracing/tracing_on |
| Ring buffer overflows and drops events | buffer_size_kb too small for the event rate | Increase: echo 16384 > /sys/kernel/debug/tracing/buffer_size_kb |
| function_graph shows ftrace trampolines in stack traces | Expected behavior; function_graph replaces return addresses | Use trace-cmd report which resolves trampolines back to real callers |
When to Use / Avoid
Relevant when:
- Diagnosing latency spikes that live below the application layer in kernel code paths
- Tracing specific kernel functions on production systems without rebooting or installing debug packages
- Understanding the call chain between a system call entry and the hardware driver
- Measuring per-function execution time inside a kernel code path (function_graph)
- Verifying that a kernel module follows the expected code path during initialization
Watch out for:
- Global function tracing without filters generates millions of events per second and can make the system unresponsive
- kprobe attachment points are not stable across kernel versions; prefer static tracepoints when available
- function_graph tracing modifies return addresses on the kernel stack, which can confuse crash dump analysis
- Ring buffer overflow silently drops oldest events in flight recorder mode; increase buffer_size_kb for high-frequency tracing
Try It Yourself
# Enable function_graph tracer for a specific function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo do_sys_openat2 > /sys/kernel/debug/tracing/set_graph_function
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe

# Trace block I/O request latency using tracepoints
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe | head -50

# Register a kprobe on a kernel function via ftrace interface
echo 'p:myprobe do_sys_openat2 filename=+0(%si):string' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
cat /sys/kernel/debug/tracing/trace_pipe

# Use trace-cmd to record function_graph for a command
trace-cmd record -p function_graph -g vfs_write -F dd if=/dev/zero of=/tmp/test bs=4k count=100
trace-cmd report | head -80

# List all available tracepoint events
ls /sys/kernel/debug/tracing/events/ | head -20
cat /sys/kernel/debug/tracing/available_events | wc -l

# Filter function tracer to a specific module
echo ':mod:ext4' > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace_pipe

# Measure scheduler latency with function_graph
trace-cmd record -p function_graph -g schedule -F stress --cpu 2 --timeout 5
trace-cmd report --cpu 0 | grep "}" | sort -k2 -rn | head -20

# Clean up: disable all tracing
echo nop > /sys/kernel/debug/tracing/current_tracer
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo > /sys/kernel/debug/tracing/set_ftrace_filter
echo > /sys/kernel/debug/tracing/kprobe_events

Debug Checklist
1. Verify debugfs is mounted: mount | grep debugfs (should show /sys/kernel/debug)
2. Check available tracers: cat /sys/kernel/debug/tracing/available_tracers
3. Verify the target function exists: grep target_func /proc/kallsyms
4. Check if the function is traceable: grep target_func /sys/kernel/debug/tracing/available_filter_functions
5. List available tracepoint events: ls /sys/kernel/debug/tracing/events/
6. Check current tracing status: cat /sys/kernel/debug/tracing/tracing_on
7. Check ring buffer usage: cat /sys/kernel/debug/tracing/per_cpu/cpu0/stats
8. Reset tracing state: echo nop > /sys/kernel/debug/tracing/current_tracer && echo > /sys/kernel/debug/tracing/set_ftrace_filter
Key Takeaways
- ✓ ftrace has near-zero overhead when disabled. Dynamic ftrace patches function entry points to NOPs at boot. Enabling tracing for a specific function patches just that NOP back to a call instruction. The rest of the kernel runs at full speed. This is why ftrace can be compiled into production kernels without fear.
- ✓ kprobes can instrument any kernel function, but the instrumented address is not part of any stable API. A kprobe attached to an internal function may break on the next kernel update if the function is renamed, inlined, or removed. Static tracepoints have stable interfaces across kernel versions. Prefer tracepoints when they exist; fall back to kprobes when they do not.
- ✓ function_graph tracing replaces return addresses on the kernel stack. If the traced function triggers an exception or oops, the stack trace may show ftrace trampoline addresses instead of the real callers. This is a known limitation. The ftrace infrastructure saves the real return addresses in a shadow stack, but crash dump tools may not decode them.
- ✓ Per-CPU ring buffers are the reason ftrace scales on multi-core systems. Each CPU writes to its own buffer without taking any locks. The only synchronization happens when reading the merged trace output. For high-frequency events (100k+ per second), increasing buffer_size_kb prevents data loss.
- ✓ trace-cmd and KernelShark are the standard tools for working with ftrace. trace-cmd handles the raw debugfs interface, manages per-CPU buffers, and produces trace.dat files. KernelShark provides a GUI timeline view. For scripted analysis, trace-cmd report produces text output that can be piped through standard Unix tools.
Common Pitfalls
- ✗ Enabling the function tracer globally without set_ftrace_filter. Tracing every kernel function generates millions of events per second and can make the system unusable. Always filter to specific functions or use function_graph with a max_graph_depth limit. Start with a single function and widen the scope incrementally.
- ✗ Forgetting to disable tracing after a debug session. ftrace stays active until explicitly stopped. A forgotten function tracer with no filter can silently degrade system performance for days. Always run echo nop > /sys/kernel/debug/tracing/current_tracer when done.
- ✗ Attaching kprobes to functions that are called with interrupts disabled or while holding spinlocks. The kprobe handler itself must not sleep or take locks that could deadlock with the interrupted context. The handler runs in atomic context with preemption disabled. Allocating memory or calling printk from a kprobe handler can cause lockups.
- ✗ Assuming kprobe attachment points are stable across kernel versions. Internal function names change between releases. A kprobe on __blk_mq_run_hw_queue in kernel 5.15 may need to target blk_mq_run_hw_queue in 6.1. Always verify function availability in /proc/kallsyms or /sys/kernel/debug/tracing/available_filter_functions before deploying kprobe-based monitoring.
- ✗ Reading /sys/kernel/debug/tracing/trace instead of trace_pipe for live monitoring. The trace file is a snapshot that does not consume events; reading it repeatedly shows stale data. trace_pipe is a consuming read that blocks until new events arrive, making it suitable for real-time monitoring pipelines.
Reference
In One Line
ftrace, kprobes, and tracepoints turn the running kernel into a live instrumentation lab -- no reboot, no debug kernel, no recompilation -- with near-zero cost when disabled and surgical precision when enabled.