Kernel Tracing with ftrace, kprobes, and tracepoints
Mental Model
A building with thousands of light switches on the walls, one at every doorway and junction. Every switch is wired to a central logging room. When all switches are off, electricity flows through the building normally with no overhead. Flipping a specific switch activates a sensor at that doorway that records who walked through and when. The building does not need to be rewired or shut down to install the sensors -- they were built into the walls at construction time (tracepoints) or can be clipped onto any wire at runtime (kprobes). The logging room has a separate notebook for each floor (per-CPU ring buffers) so that writers on different floors never fight over the same pen.
The Problem
A production server is showing unexplained latency spikes in the I/O path. Application-level metrics show requests occasionally taking 50ms instead of the expected 2ms. Standard tools like iostat and sar show nothing abnormal. The latency lives somewhere between the system call layer and the block device driver, but the exact function is unknown. The server cannot be rebooted, the kernel has no debug symbols installed, and attaching a traditional debugger is not an option on a production box handling 10k requests per second.
Architecture
The traditional approach -- add printk calls, recompile the kernel, reboot, reproduce the problem -- is not viable on a server handling live traffic. Attaching kgdb and halting a CPU is even less viable.
Linux has a better way. The kernel ships with thousands of instrumentation points that can be turned on and off at runtime, with no recompilation, no reboot, and near-zero overhead when disabled.
The Three Pillars of Kernel Tracing
ftrace: The Function Tracer
ftrace is the foundational tracing framework built into every modern Linux kernel. It works through a clever compiler trick: every kernel function is compiled with a call to mcount (or fentry on modern x86) at its entry point. At boot, dynamic ftrace patches all these calls to NOP instructions. The kernel runs at full speed with tracing compiled in but disabled.
When tracing is enabled for a specific function, ftrace patches just that function's NOP back to a call into the tracing infrastructure. Everything else remains untouched.
The interface is the filesystem at /sys/kernel/debug/tracing:
# Mount debugfs if not already mounted
mount -t debugfs none /sys/kernel/debug
# See what tracers are available
cat /sys/kernel/debug/tracing/available_tracers
# function function_graph nop
# Enable function tracer for a single function
echo function > /sys/kernel/debug/tracing/current_tracer
echo vfs_write > /sys/kernel/debug/tracing/set_ftrace_filter
echo 1 > /sys/kernel/debug/tracing/tracing_on
# Read the trace stream
cat /sys/kernel/debug/tracing/trace_pipe
# dd-12345 [002] 1234.567890: vfs_write <-ksys_write
# dd-12345 [002] 1234.567891: vfs_write <-ksys_write
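Two tracefs files expose this patching machinery directly, which is useful for confirming what is instrumentable before enabling anything:
# Every call site dynamic ftrace can patch on this kernel (tens of thousands)
wc -l < /sys/kernel/debug/tracing/available_filter_functions
# Functions that currently have a tracing callback attached (empty when idle)
cat /sys/kernel/debug/tracing/enabled_functions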
The function_graph tracer goes further. It hooks both function entry and return, producing indented call trees with per-function timing:
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo vfs_write > /sys/kernel/debug/tracing/set_graph_function
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe
Output resembles:
 2)               |  vfs_write() {
 2)               |    __vfs_write() {
 2)               |      ext4_file_write_iter() {
 2)               |        ext4_buffered_write_iter() {
 2)   0.451 us    |          down_write_trylock();
 2)               |          generic_perform_write() {
 2)   3.821 us    |            ext4_write_begin();
 2)   0.320 us    |            copy_page_from_iter_atomic();
 2)   1.205 us    |            ext4_write_end();
 2) + 12.440 us   |          }
 2) + 14.103 us   |        }
 2) + 15.220 us   |      }
 2) + 16.001 us   |    }
 2) + 16.832 us   |  }
This output immediately shows where time is spent. If ext4_write_begin takes 20ms instead of 4us, that function becomes the investigation target.
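When hunting outliers like that hypothetical 20ms write, reading the full tree is unnecessary: function_graph honors a global duration threshold. A minimal sketch (tracing_thresh is in microseconds; 0 disables it):
echo function_graph > /sys/kernel/debug/tracing/current_tracer
# Record only functions that took longer than 1ms
echo 1000 > /sys/kernel/debug/tracing/tracing_thresh
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe
# Reset the threshold when finished
echo 0 > /sys/kernel/debug/tracing/tracing_thresh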
kprobes: Dynamic Breakpoints Anywhere
Static tracepoints and ftrace function tracing cover most needs, but sometimes the instrumentation point needed does not exist. kprobes solve this by allowing a breakpoint at any instruction address in the kernel.
The mechanism on x86:
- Save the original instruction at the target address.
- Replace it with int3 (the breakpoint exception).
- When the CPU hits int3, the kprobe handler runs.
- Single-step the saved original instruction.
- Resume normal execution.
kprobes can be registered from kernel modules via register_kprobe(), or with no code at all via the ftrace interface:
# Register a kprobe on do_sys_openat2, capture the filename argument
echo 'p:myprobe do_sys_openat2 filename=+0(%si):string' \
> /sys/kernel/debug/tracing/kprobe_events
# Enable it
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
# Watch the output
cat /sys/kernel/debug/tracing/trace_pipe
# bash-1234 [001] 5678.901234: myprobe: (do_sys_openat2+0x0/0x...)
# filename="/etc/passwd"
kretprobes work similarly but fire on function return, enabling latency measurement:
# Register entry and return probes on vfs_read
echo 'p:vfs_read_entry vfs_read' \
>> /sys/kernel/debug/tracing/kprobe_events
echo 'r:vfs_read_return vfs_read $retval' \
>> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
cat /sys/kernel/debug/tracing/trace_pipe
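The entry/return pairs can be post-processed into per-call latency. A rough sketch, assuming the trace_pipe layout shown in the earlier examples (field 1 is COMM-PID, field 3 is the timestamp; some kernels insert a flags column, so adjust the field numbers to match your output):
cat /sys/kernel/debug/tracing/trace_pipe | awk '
  /vfs_read_entry/ { ts = $3; sub(/:$/, "", ts); start[$1] = ts }
  /vfs_read_return/ && ($1 in start) {
      ts = $3; sub(/:$/, "", ts)
      printf "%-20s %8.1f us\n", $1, (ts - start[$1]) * 1e6
      delete start[$1]
  }'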
Static Tracepoints: Zero-Cost When Off
The kernel source contains thousands of TRACE_EVENT macros at semantically meaningful locations. These compile to a static key check: when no tracer is attached, the branch is a NOP (patched at runtime using static key infrastructure). Enabling a tracepoint patches the NOP to a JMP into the tracing handler.
The overhead of a disabled tracepoint is one NOP instruction -- immeasurable in practice.
# List all block I/O tracepoints
ls /sys/kernel/debug/tracing/events/block/
# block_bio_backmerge block_bio_bounce block_bio_complete
# block_bio_frontmerge block_bio_queue block_bio_remap
# block_getrq block_plug block_rq_complete
# block_rq_insert block_rq_issue block_rq_requeue
# block_split block_unplug
# Enable I/O submission and completion tracepoints
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe
# kworker-890 [003] 9012.345678: block_rq_issue: 259,0 W 4096 () 12345 + 8
# kworker-890 [003] 9012.345890: block_rq_complete: 259,0 W () 12345 + 8 [0]
The 212us gap between issue and complete is the device latency. If that gap jumps to 50ms, the problem is in the hardware or driver, not the kernel block layer.
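Every tracepoint also publishes its record layout in a format file, and those field names are what event filters match against, so uninteresting events can be dropped at the source:
# Inspect the fields block_rq_issue records (dev, sector, nr_sector, ...)
cat /sys/kernel/debug/tracing/events/block/block_rq_issue/format
# Only record requests larger than 8 sectors
echo 'nr_sector > 8' > /sys/kernel/debug/tracing/events/block/block_rq_issue/filter
# Clear the filter
echo 0 > /sys/kernel/debug/tracing/events/block/block_rq_issue/filter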
Putting It Together: The Latency Investigation
Start at the system call boundary and work down:
# Step 1: Confirm latency is below the syscall layer
strace -T -e trace=write -p $(pgrep appname) 2>&1 | \
    awk -F'[<>]' '$(NF-1) + 0 > 0.010'   # -T appends <seconds>; show writes >10ms
# Step 2: Trace the block layer
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe > /tmp/blocktrace.log &
# Step 3: Narrow down with function_graph
trace-cmd record -p function_graph -g vfs_write \
-F dd if=/dev/zero of=/mnt/data/testfile bs=4k count=1000
trace-cmd report | awk '{ for (i = 1; i < NF; i++)
    if ($(i+1) == "us" && $i + 0 > 1000) { print; break } }'   # Functions taking >1ms
# Step 4: If the suspect is a specific driver function, attach entry and
# return probes to time it (the kretprobe fires when nvme_queue_rq returns,
# i.e. when submission finishes -- not when the device completes the I/O)
echo 'p:nvme_q_entry nvme_queue_rq' \
    >> /sys/kernel/debug/tracing/kprobe_events
echo 'r:nvme_q_return nvme_queue_rq $retval' \
    >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable
# Step 5: Clean up
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo nop > /sys/kernel/debug/tracing/current_tracer
echo > /sys/kernel/debug/tracing/kprobe_events
echo 0 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 0 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
Under the Hood
Dynamic ftrace patching. At kernel compile time, the -pg flag (or -fpatchable-function-entry on modern GCC) inserts a call mcount or a 5-byte NOP at every function entry. During boot, the ftrace_init function records the location of every one of these call sites in a table. When tracing is disabled, all sites are patched to NOPs using text_poke_bp (which handles the SMP-safe instruction replacement problem). When tracing is enabled for function X, only X's call site is patched back to call ftrace_caller. The rest of the kernel never notices.
kprobe optimization. On x86, unoptimized kprobes use int3, which triggers a trap and context switch to the handler -- expensive at high frequency. Optimized kprobes (enabled by default on modern kernels) replace the target instruction with a JMP to a trampoline containing the handler and the original instruction. This avoids the trap overhead entirely. The optimization kicks in automatically after a kprobe has been hit a few times.
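Whether a given kprobe actually got optimized is visible at runtime:
# Active kprobes; optimized ones carry the [OPTIMIZED] tag
cat /sys/kernel/debug/kprobes/list
# Global on/off switch for the optimization (sysctl debug.kprobes-optimization)
cat /proc/sys/debug/kprobes-optimization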
Tracepoint static keys. The TRACE_EVENT macro expands to a branch using the static key infrastructure (static_branch_unlikely). This compiles to a JMP instruction that is patched at runtime. When the tracepoint is disabled, the JMP target is the very next instruction (effectively a NOP). When enabled, the JMP goes to the tracing handler. The patching uses text_poke_bp with stop-machine on older kernels or INT3-based patching on newer ones for SMP safety.
Per-CPU ring buffers. ftrace maintains a separate ring buffer for each CPU. Events are written to the local CPU's buffer without taking any locks. The buffer is a linked list of pages, each containing packed event records with timestamps. When a reader opens trace_pipe, the infrastructure merges events from all CPU buffers sorted by timestamp. The per-CPU design means tracing scales linearly with core count -- no contention, no cache bouncing.
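The per-CPU structure is visible directly in tracefs:
# Each CPU has its own buffer, statistics, and raw stream
ls /sys/kernel/debug/tracing/per_cpu/cpu0/
# Stats include entries written and overruns (events lost to wraparound)
cat /sys/kernel/debug/tracing/per_cpu/cpu0/stats
# Read one CPU's events without the cross-CPU timestamp merge
cat /sys/kernel/debug/tracing/per_cpu/cpu0/trace_pipe
# buffer_size_kb is per CPU: 16384 here means 16 MB on every CPU
echo 16384 > /sys/kernel/debug/tracing/buffer_size_kb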
Common Questions
How much overhead does ftrace add in production?
With tracing disabled, the overhead is unmeasurable. Dynamic ftrace patches all instrumentation to NOPs at boot. When tracing a single function, the overhead is the handler execution time (typically 100-500ns per call). Tracing a function called 100k times per second adds roughly 10-50ms of CPU time per second -- about 1-5% of one core. function_graph is roughly 2x the overhead of function tracing because it hooks both entry and return.
When should kprobes be used instead of tracepoints?
Tracepoints should be the first choice because they have stable APIs and near-zero disabled overhead. Use kprobes when: the needed instrumentation point has no tracepoint, tracing an internal function that is not exported, or prototyping a new tracepoint location before submitting a kernel patch. In BPF-based tooling, bpftrace and BCC abstract this choice -- both kprobe and tracepoint attachments use the same scripting syntax.
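For comparison, the same two attachment types in bpftrace, where only the probe prefix changes (assumes bpftrace is installed):
# kprobe attachment: count vfs_read calls per process
bpftrace -e 'kprobe:vfs_read { @calls[comm] = count(); }'
# tracepoint attachment: histogram of block I/O request sizes
bpftrace -e 'tracepoint:block:block_rq_issue { @bytes = hist(args->bytes); }'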
Can ftrace and kprobes be used simultaneously?
Yes. ftrace tracepoints, function tracing, and kprobes all feed into the same ring buffer infrastructure. Events from all sources are interleaved in timestamp order when reading trace_pipe. This is powerful for correlation: enable a tracepoint for block_rq_issue and a kprobe on an internal driver function, and the merged trace shows the exact sequence of events.
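A minimal sketch of such a correlation, combining a block tracepoint with a kprobe (nvme_queue_rq assumes an NVMe device; substitute your driver's submit function):
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 'p:nvme_q nvme_queue_rq' >> /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/nvme_q/enable
# Both event types appear interleaved in timestamp order
cat /sys/kernel/debug/tracing/trace_pipe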
What is the difference between /sys/kernel/debug/tracing/trace and trace_pipe?
trace is a snapshot file. Reading it returns the current ring buffer contents without consuming them. Reading it again returns the same data (plus any new events). trace_pipe is a consuming read -- events are removed from the buffer as they are read, and the read blocks when the buffer is empty. Use trace for post-mortem analysis and trace_pipe for real-time monitoring.
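The difference is easy to verify:
# trace is non-consuming: reading it twice returns the same events
cat /sys/kernel/debug/tracing/trace | tail -3
cat /sys/kernel/debug/tracing/trace | tail -3   # same data again
# Writing to trace clears the buffer
echo > /sys/kernel/debug/tracing/trace
# trace_pipe consumes as it reads and blocks when the buffer is empty
cat /sys/kernel/debug/tracing/trace_pipe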
How do kprobes interact with kernel address space layout randomization (KASLR)?
kprobes register by symbol name, not raw address. The kernel resolves the symbol to its current address (which changes on every boot with KASLR) at registration time. This means kprobe-based tools work transparently with KASLR. The symbol resolution uses the same /proc/kallsyms data that debuggers use. If kptr_restrict is set to non-zero, unprivileged users cannot see kernel addresses, but root can still register kprobes.
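A quick way to see this (the address shown is illustrative and changes every boot):
# The symbol name is stable; the address is randomized by KASLR per boot
sudo grep ' do_sys_openat2$' /proc/kallsyms
# ffffffffa92c4de0 T do_sys_openat2
# With kptr_restrict set, unprivileged readers see zeroed addresses
cat /proc/sys/kernel/kptr_restrict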
How Technologies Use This
A container host running 60 Docker containers shows p99 syscall latency of 12ms for file operations inside specific containers, while the host-level I/O metrics look normal. The latency is suspected to originate in the overlayfs layer or the cgroup I/O throttling path, but standard tools like iostat cannot distinguish per-container kernel overhead.
Enabling the function_graph tracer scoped to a container's cgroup reveals the full kernel call chain for every file operation. Running `echo function_graph > /sys/kernel/debug/tracing/current_tracer` and filtering with `echo 'ovl_*' > /sys/kernel/debug/tracing/set_ftrace_filter` traces every overlayfs function with nanosecond timestamps and call depth indentation. The trace output shows exactly how long each layer lookup, whiteout check, and copy-up operation takes. On a container performing 500 file writes per second to an overlayfs mount, the trace might reveal that ovl_copy_up_one() takes 8ms per call due to copying a 4 MB file from the lower layer to the upper layer on first write.
Combining function_graph with cgroup-aware filtering via `trace-cmd record -p function_graph -O funcgraph-proc` and grepping for the container's PID namespace isolates the trace to a single container's kernel activity. This avoids the noise of 59 other containers and pinpoints whether the latency originates in overlayfs, the block layer, or the cgroup I/O controller.
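One way to build that scoping with plain tracefs, sketched under the assumption of a cgroup v2 layout (docker-abc123.scope is a hypothetical container unit; substitute the real one):
echo 'ovl_*' > /sys/kernel/debug/tracing/set_ftrace_filter
echo function_graph > /sys/kernel/debug/tracing/current_tracer
# Restrict tracing to the tasks in one container's cgroup
for pid in $(cat /sys/fs/cgroup/system.slice/docker-abc123.scope/cgroup.procs); do
    echo "$pid" >> /sys/kernel/debug/tracing/set_ftrace_pid
done
echo 1 > /sys/kernel/debug/tracing/tracing_on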
A Kubernetes node running 45 pods intermittently hangs for 5-10 seconds during pod creation. The kubelet logs show mount operations timing out, but the storage backend responds normally to direct I/O tests. The hang is suspected to be inside the kernel's mount syscall path, where overlayfs, cgroup, and namespace setup intersect.
Attaching a kprobe to the mount syscall entry point with `echo 'p:mount_probe do_mount' > /sys/kernel/debug/tracing/kprobe_events` and enabling it captures every mount call with its arguments and timestamp. Correlating slow mount calls with the kubelet's pod creation timeline reveals that the hang occurs specifically when do_mount() calls into the overlayfs code path while another mount operation holds the namespace_sem semaphore. The kprobe trace shows the blocked mount waiting 5-8 seconds for the semaphore.
Adding a kretprobe with `echo 'r:mount_ret do_mount ret=$retval' >> /sys/kernel/debug/tracing/kprobe_events` captures the return value and duration of every mount call. Filtering the trace for calls exceeding 1 second isolates the problematic path. This approach requires no kernel recompilation, no debug symbols, and no node restart. The kprobes are removed by clearing the kprobe_events file, leaving zero residual overhead on the production node.
An Nginx proxy handling 30,000 connections reports that 0.5% of responses take over 200ms, but application-level timing shows the upstream responds within 10ms. The extra latency lives somewhere in the kernel TCP stack between receiving the upstream response and sending it to the client. Standard network tools like ss and netstat show no obvious congestion.
Attaching a kprobe to the send path with `echo 'p:tcp_send tcp_sendmsg' > /sys/kernel/debug/tracing/kprobe_events` captures every TCP send operation with the calling task and a timestamp (tcp_sendmsg is a plain kernel function, not a static tracepoint, so a kprobe is the right tool here). Filtering the trace output by Nginx worker PIDs and correlating timestamps with the slow responses reveals that certain sends stall for 150-180ms. Enabling the tcp:tcp_retransmit_skb tracepoint alongside shows that these stalls coincide with retransmission events on the client-facing connections, indicating packet loss on the downstream network segment.
This kernel-side view is superior to packet capture for this diagnosis because it shows the kernel's internal picture of the TCP state machine. The tcp:tcp_probe tracepoint, for example, records the congestion window, slow-start threshold, and window sizes at the moment of each event. Running `trace-cmd record -e tcp:tcp_probe -e tcp:tcp_retransmit_skb -f 'sport == 443' -- sleep 30` collects 30 seconds of filtered TCP activity with under 2% overhead, producing a definitive timeline of where the latency originates.
Same Concept Across Tech
| Technology | How it uses ftrace/kprobes | Key consideration |
|---|---|---|
| BCC/bpftrace | Attaches BPF programs to kprobes and tracepoints. bpftrace one-liners compile to BPF bytecode and attach via the same kprobe mechanism ftrace uses | Requires kernel 4.9+ for kprobe BPF attachment. BTF (5.2+) eliminates header dependency |
| trace-cmd | CLI wrapper around the ftrace debugfs interface. Manages per-CPU ring buffers and produces trace.dat files for KernelShark | Handles the raw file writes to debugfs so operators do not need to remember the exact file paths |
| perf | Consumes ftrace tracepoints via perf_event_open. perf ftrace subcommand wraps ftrace directly | perf adds symbol resolution and call graph recording on top of raw ftrace events |
| systemtap | Compiles tracing scripts into kernel modules that use kprobes internally | Requires kernel headers and a compiler on the target system. Heavier than BPF-based approaches |
| LTTng | Uses its own tracepoint infrastructure alongside kernel tracepoints for high-throughput tracing | Optimized for low-overhead continuous tracing. Uses per-CPU buffers similar to ftrace |
Stack layer mapping (unexplained I/O latency):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is the latency in application code or below the syscall boundary? | strace -T to measure syscall time |
| Syscall | Which syscall is slow (read, write, fsync)? | perf trace -e syscalls:sys_enter_write |
| VFS | Is the VFS layer adding overhead (file locking, dentry lookup)? | ftrace function_graph on vfs_write |
| Block layer | Is the I/O scheduler reordering or delaying requests? | block:block_rq_issue / block:block_rq_complete tracepoints |
| Device driver | Is the driver firmware or DMA path slow? | kprobe on driver-specific submit function |
| Hardware | Is the device itself slow (power saving, thermal throttle)? | Compare tracepoint timestamp deltas with device specs |
Design Rationale
Traditional kernel debugging required either recompiling with printk statements (rebuild, reboot, reproduce, repeat) or attaching kgdb and halting the CPU. Neither works on production systems. ftrace solved this by embedding NOP stubs at every function entry during compilation, then patching them to tracing calls at runtime. The cost when disabled is a single NOP per function -- undetectable in benchmarks. kprobes extended this by allowing instrumentation at arbitrary addresses without recompilation. Static tracepoints completed the picture by providing stable, semantically meaningful instrumentation points that survive kernel upgrades. Together, these three mechanisms make every running Linux kernel a traceable system by default.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| Latency spike visible in app metrics but not in iostat | Latency lives in kernel code between syscall and device | Enable block tracepoints: echo 1 > /sys/kernel/debug/tracing/events/block/enable |
| Function tracer shows no output for a known function | Function was inlined by the compiler or marked notrace | grep func_name /sys/kernel/debug/tracing/available_filter_functions |
| System becomes sluggish after enabling tracing | Global tracing without filter generates excessive events | Check current_tracer and set_ftrace_filter; add specific filters or set nop |
| kprobe fails to register on a valid symbol | Address is in a blacklisted region (.init, .exit, or ftrace itself) | dmesg for kprobe registration errors; check /sys/kernel/debug/kprobes/blacklist |
| trace_pipe shows nothing after enabling events | tracing_on is 0 (tracing paused) | echo 1 > /sys/kernel/debug/tracing/tracing_on |
| Ring buffer overflows and drops events | buffer_size_kb too small for the event rate | Increase: echo 16384 > /sys/kernel/debug/tracing/buffer_size_kb |
| function_graph shows ftrace trampolines in stack traces | Expected behavior; function_graph replaces return addresses | Use trace-cmd report which resolves trampolines back to real callers |
When to Use / Avoid
Relevant when:
- Diagnosing latency spikes that live below the application layer in kernel code paths
- Tracing specific kernel functions on production systems without rebooting or installing debug packages
- Understanding the call chain between a system call entry and the hardware driver
- Measuring per-function execution time inside a kernel code path (function_graph)
- Verifying that a kernel module follows the expected code path during initialization
Watch out for:
- Global function tracing without filters generates millions of events per second and can make the system unresponsive
- kprobe attachment points are not stable across kernel versions; prefer static tracepoints when available
- function_graph tracing modifies return addresses on the kernel stack, which can confuse crash dump analysis
- Ring buffer overflow silently drops oldest events in flight recorder mode; increase buffer_size_kb for high-frequency tracing
Try It Yourself
# Enable function_graph tracer for a specific function
echo function_graph > /sys/kernel/debug/tracing/current_tracer
echo do_sys_openat2 > /sys/kernel/debug/tracing/set_graph_function
echo 1 > /sys/kernel/debug/tracing/tracing_on
cat /sys/kernel/debug/tracing/trace_pipe

# Trace block I/O request latency using tracepoints
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_issue/enable
echo 1 > /sys/kernel/debug/tracing/events/block/block_rq_complete/enable
cat /sys/kernel/debug/tracing/trace_pipe | head -50

# Register a kprobe on a kernel function via ftrace interface
echo 'p:myprobe do_sys_openat2 filename=+0(%si):string' > /sys/kernel/debug/tracing/kprobe_events
echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable
cat /sys/kernel/debug/tracing/trace_pipe

# Use trace-cmd to record function_graph for a command
trace-cmd record -p function_graph -g vfs_write -F dd if=/dev/zero of=/tmp/test bs=4k count=100
trace-cmd report | head -80

# List all available tracepoint events
ls /sys/kernel/debug/tracing/events/ | head -20
cat /sys/kernel/debug/tracing/available_events | wc -l

# Filter function tracer to a specific module
echo ':mod:ext4' > /sys/kernel/debug/tracing/set_ftrace_filter
echo function > /sys/kernel/debug/tracing/current_tracer
cat /sys/kernel/debug/tracing/trace_pipe

# Measure scheduler latency with function_graph
trace-cmd record -p function_graph -g schedule -F stress --cpu 2 --timeout 5
trace-cmd report --cpu 0 | grep "}" | sort -k2 -rn | head -20

# Clean up: disable all tracing
echo nop > /sys/kernel/debug/tracing/current_tracer
echo 0 > /sys/kernel/debug/tracing/tracing_on
echo > /sys/kernel/debug/tracing/set_ftrace_filter
echo > /sys/kernel/debug/tracing/kprobe_events

Debug Checklist
1. Verify debugfs is mounted: mount | grep debugfs (should show /sys/kernel/debug)
2. Check available tracers: cat /sys/kernel/debug/tracing/available_tracers
3. Verify the target function exists: grep target_func /proc/kallsyms
4. Check if the function is traceable: grep target_func /sys/kernel/debug/tracing/available_filter_functions
5. List available tracepoint events: ls /sys/kernel/debug/tracing/events/
6. Check current tracing status: cat /sys/kernel/debug/tracing/tracing_on
7. Check ring buffer usage: cat /sys/kernel/debug/tracing/per_cpu/cpu0/stats
8. Reset tracing state: echo nop > /sys/kernel/debug/tracing/current_tracer && echo > /sys/kernel/debug/tracing/set_ftrace_filter
Key Takeaways
- ✓ ftrace has near-zero overhead when disabled. Dynamic ftrace patches function entry points to NOPs at boot. Enabling tracing for a specific function patches just that NOP back to a call instruction. The rest of the kernel runs at full speed. This is why ftrace can be compiled into production kernels without fear.
- ✓ kprobes can instrument any kernel function, but the instrumented address is not part of any stable API. A kprobe attached to an internal function may break on the next kernel update if the function is renamed, inlined, or removed. Static tracepoints have stable interfaces across kernel versions. Prefer tracepoints when they exist; fall back to kprobes when they do not.
- ✓ function_graph tracing replaces return addresses on the kernel stack. If the traced function triggers an exception or oops, the stack trace may show ftrace trampoline addresses instead of the real callers. This is a known limitation. The ftrace infrastructure saves the real return addresses in a shadow stack, but crash dump tools may not decode them.
- ✓ Per-CPU ring buffers are the reason ftrace scales on multi-core systems. Each CPU writes to its own buffer without taking any locks. The only synchronization happens when reading the merged trace output. For high-frequency events (100k+ per second), increasing buffer_size_kb prevents data loss.
- ✓ trace-cmd and KernelShark are the standard tools for working with ftrace. trace-cmd handles the raw debugfs interface, manages per-CPU buffers, and produces trace.dat files. KernelShark provides a GUI timeline view. For scripted analysis, trace-cmd report produces text output that can be piped through standard Unix tools.
Common Pitfalls
- ✗ Enabling the function tracer globally without set_ftrace_filter. Tracing every kernel function generates millions of events per second and can make the system unusable. Always filter to specific functions or use function_graph with a max_graph_depth limit. Start with a single function and widen the scope incrementally.
- ✗ Forgetting to disable tracing after a debug session. ftrace stays active until explicitly stopped. A forgotten function tracer with no filter can silently degrade system performance for days. Always run echo nop > /sys/kernel/debug/tracing/current_tracer when done.
- ✗ Attaching kprobes to functions that are called with interrupts disabled or while holding spinlocks. The kprobe handler itself must not sleep or take locks that could deadlock with the interrupted context. The handler runs in atomic context with preemption disabled. Allocating memory or calling printk from a kprobe handler can cause lockups.
- ✗ Assuming kprobe attachment points are stable across kernel versions. Internal function names change between releases. A kprobe on __blk_mq_run_hw_queue in kernel 5.15 may need to target blk_mq_run_hw_queue in 6.1. Always verify function availability in /proc/kallsyms or /sys/kernel/debug/tracing/available_filter_functions before deploying kprobe-based monitoring.
- ✗ Reading /sys/kernel/debug/tracing/trace instead of trace_pipe for live monitoring. The trace file is a snapshot that does not consume events; reading it repeatedly shows stale data. trace_pipe is a consuming read that blocks until new events arrive, making it suitable for real-time monitoring pipelines.
Reference
In One Line
ftrace, kprobes, and tracepoints turn the running kernel into a live instrumentation lab -- no reboot, no debug kernel, no recompilation -- with near-zero cost when disabled and surgical precision when enabled.