Perf Events & Performance Counters
Mental Model
A car's engine already has sensors for RPM, fuel injection rate, turbo pressure, exhaust temperature -- all ticking in real time with zero drag. Most drivers never look at them. Counting mode reads the gauges after the drive and reports totals. Sampling mode mounts a camera that snaps a photo every few seconds, capturing what the engine was doing at each instant. Stack enough photos and patterns jump out: the turbo ran 80% of the time, the fuel injector was the bottleneck. Flame graphs turn those photos into a heat map of where the engine spent its effort.
The Problem
An application is pegged at 100% CPU. Top confirms it -- then what? Is it doing useful compute or stalling on cache misses? pprof spreads blame evenly across 20 functions without distinguishing real work from memory waits. A Java service profiled under perf shows nothing but hex addresses because JIT code has no symbol table. IPC sits below 0.3 -- a clear memory bottleneck -- but no application-level profiler captures that metric. Crank the sample frequency to 100K Hz and the profiling overhead itself distorts the results by 10-20%.
Architecture
Top can say the CPU is saturated, but that is where its information ends.
Which function is the bottleneck? Is the CPU doing useful work or stalling on memory? Are the culprits cache misses, branch mispredictions, or simply too many instructions?
Without perf, it is all guesswork. With it, flame graphs show exactly where cycles burn -- and the optimization targets become obvious.
What Actually Happens
The perf_event subsystem handles three event classes:
Hardware events come from the CPU's PMU (Performance Monitoring Unit) -- dedicated registers that increment on micro-architectural events. The kernel programs MSR registers (IA32_PERFEVTSELx for event selection, IA32_PMCx for the count) and reads them at context switch.
Software events are counted by the kernel at instrumentation points: context switches in schedule(), page faults in the fault handler, CPU migrations in the scheduler.
Tracepoints are 1,500+ static instrumentation points compiled into the kernel. They fire on syscall entry/exit, scheduler decisions, I/O completion, packet transmission, and more.
In counting mode (perf stat), the kernel reads counter values at start and end. No interrupts. No sampling. Overhead is less than 0.1%.
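A counting-mode sketch that mixes all three event classes in one run (sleep 1 is just a placeholder workload; tracepoint access may require root, and the raw r00c0 encoding assumes an Intel core -- verify against your CPU's manual):

# Hardware event (PMU), software event (kernel counter), and
# tracepoint (static hook) counted side by side
perf stat -e cycles,page-faults,sched:sched_switch -- sleep 1

# Raw PMU events bypass the symbolic names: the hex code carries the
# umask and event-select bytes written into IA32_PERFEVTSELx
perf stat -e r00c0 -- sleep 1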
In sampling mode (perf record), the kernel programs the PMU to generate an NMI every N events. The NMI handler captures the instruction pointer, call chain, timestamp, CPU, and PID, then writes the sample into a memory-mapped ring buffer. perf record reads this buffer and writes perf.data for offline analysis.
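A minimal sampling session, assuming permission to profile system-wide (sleep 10 only sets the measurement window):

# Sample every CPU at 99 Hz with call chains for 10 seconds;
# -m enlarges the mmap ring buffer (in pages) if perf warns
# about lost samples
perf record -F 99 -a -g -m 512 -- sleep 10
perf report --stdio | head -30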
Under the Hood
IPC is the single most important metric. Instructions Per Cycle = instructions retired / CPU cycles. Modern CPUs can retire 4-6 instructions per cycle. An IPC of 1.0 is typical for general code. IPC below 0.5 means the CPU is stalling -- usually on memory. IPC above 2.0 means well-optimized, compute-bound code. perf stat -e instructions,cycles gives IPC instantly.
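A sketch of extracting IPC from perf stat's machine-readable output; -x, switches to CSV, and the field layout assumed here (value first, event name third) matches recent perf versions but is worth verifying on yours:

# Write counter values as CSV, then divide instructions by cycles
perf stat -x, -e instructions,cycles -o stat.csv -- sleep 1
instr=$(awk -F, '$3 ~ /instructions/ { print $1 }' stat.csv)
cyc=$(awk -F, '$3 ~ /cycles/ { print $1 }' stat.csv)
awk -v i="$instr" -v c="$cyc" 'BEGIN { printf "IPC = %.2f\n", i / c }'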
Sampling vs counting serve different purposes. Counting reveals the total: "1.5 million cache misses during execution." Sampling reveals WHERE: "40% of cache misses happen in hash_lookup(), 30% in btree_search()." Use counting first to characterize the problem, then sampling to find the cause.
Three ways to collect call stacks. (1) Frame pointer (--call-graph fp) -- fast but requires -fno-omit-frame-pointer compilation. (2) DWARF (--call-graph dwarf) -- works with optimized code but slower and uses more storage. (3) LBR (--call-graph lbr) -- uses Intel's Last Branch Record hardware, fast and accurate but limited to ~32 frames.
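The three modes as concrete invocations, with <pid> as a placeholder; the number after dwarf is the per-sample stack snapshot size in bytes:

# Frame pointers: cheapest, but needs -fno-omit-frame-pointer builds
perf record --call-graph fp -F 99 -p <pid> -- sleep 30
# DWARF: copies a stack snapshot per sample and unwinds offline
perf record --call-graph dwarf,8192 -F 99 -p <pid> -- sleep 30
# LBR: Intel's hardware branch records, limited to ~32 frames
perf record --call-graph lbr -F 99 -p <pid> -- sleep 30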
The flame graph workflow. perf record -g -F 99 -p PID -- sleep 30 captures samples. perf script > out.perf dumps raw data. stackcollapse-perf.pl out.perf > out.folded collapses stacks. flamegraph.pl out.folded > flame.svg generates the visualization. The x-axis is alphabetically sorted (NOT a timeline). Width is proportional to sample count. The widest bars at the top are the optimization targets.
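The pipeline end to end, assuming a local checkout of Brendan Gregg's FlameGraph repository and <pid> as a stand-in for the target process:

git clone https://github.com/brendangregg/FlameGraph
perf record -g -F 99 -p <pid> -- sleep 30
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg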
Common Questions
What does a high cache-miss rate indicate?
The working set does not fit in cache, or the access pattern is hostile (random). Fixes: reduce working set size, improve spatial locality (arrays over linked lists, struct-of-arrays over array-of-structs), improve temporal locality (process in cache-sized chunks), or add prefetch hints. perf record -e LLC-load-misses -g shows WHERE the misses happen.
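Counting both loads and misses gives a rate rather than a bare count; ./my_benchmark is a placeholder binary:

# Miss counts are only meaningful against a load denominator
perf stat -e LLC-loads,LLC-load-misses -- ./my_benchmark
# Then attribute the misses to functions
perf record -e LLC-load-misses -g -- ./my_benchmark
perf report --stdio | head -20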
How do flame graphs differ from traditional profilers?
Traditional profilers (gprof, callgrind) instrument every function call, adding overhead proportional to frequency and distorting the profile. Flame graphs use statistical sampling at 99 Hz with negligible overhead, capturing actual production behavior. The visualization shows the full call stack hierarchy at once, making it easy to spot which paths consume the most CPU.
What is perf_event_paranoid?
/proc/sys/kernel/perf_event_paranoid controls access: -1 = no restrictions; 0 = allow everything except raw tracepoint access for unprivileged users; 1 = additionally restrict kernel profiling to root (many distros' default); 2 = unprivileged users may profile only their own user-space code; 3 = unprivileged use of perf disabled entirely (a Debian/Android patch, not in the mainline kernel). In production, restricting it prevents unprivileged users from mounting side-channel attacks via profiling.
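Checking and adjusting it, sketched assuming a Linux 5.8+ kernel for CAP_PERFMON (note that on some distros /usr/bin/perf is a wrapper script, on which setcap has no effect):

# Inspect the current restriction level
cat /proc/sys/kernel/perf_event_paranoid
# Relax it for a profiling session (does not persist across reboot)
sudo sysctl kernel.perf_event_paranoid=1
# Or grant just the perf binary the capability instead of loosening globally
sudo setcap cap_perfmon+ep "$(which perf)"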
How are JIT-compiled languages profiled?
JIT compilers generate code at runtime without symbol tables. The fix: JIT symbol maps. The runtime writes /tmp/perf-PID.map mapping address ranges to function names. For Java, use perf-map-agent or -XX:+PreserveFramePointer. For Node.js, use --perf-basic-prof. perf reads these maps automatically during report and script.
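For Node.js the whole loop looks like this, with app.js as a placeholder script:

# V8 appends JIT symbols to /tmp/perf-<pid>.map as it compiles
node --perf-basic-prof app.js &
perf record -F 99 -g -p $! -- sleep 30
# perf script consults the map file automatically when resolving
perf script | head -20
kill $!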
How Technologies Use This
A Java service is burning 90% CPU, but perf top shows nothing but a wall of hex addresses with zero useful method names. JVM-only profilers like async-profiler show timer-based samples, but cannot explain whether the CPU is doing useful compute or stalling on memory.
The problem is that JIT-compiled bytecode has no static symbol table, making hardware counter profiles completely opaque. Without symbol resolution, perf captures instruction pointers that map to dynamically generated code with no function name metadata. The CPU's hardware performance counters are counting events perfectly, but there is no way to attribute them to Java methods.
perf-map-agent solves this by writing /tmp/perf-PID.map with address-to-method mappings on the fly, and -XX:+PreserveFramePointer lets perf record -g walk the Java call stack accurately. Together they turn unreadable hex dumps into flame graphs that pinpoint the exact Java method burning cycles. Netflix uses this approach to identify 15-20% CPU savings in production services by finding hot loops invisible to JVM-only profilers.
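A sketch of that workflow, assuming a local perf-map-agent checkout (its create-java-perf-map.sh helper dumps the symbol map for a running JVM) and app.jar as a placeholder:

java -XX:+PreserveFramePointer -jar app.jar &
jvm_pid=$!
# Writes /tmp/perf-<pid>.map for the running JVM
./perf-map-agent/bin/create-java-perf-map.sh "$jvm_pid"
perf record -F 99 -g -p "$jvm_pid" -- sleep 30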
A Go service is using 100% CPU, but pprof shows even distribution across 20 functions with no obvious hotspot. The team has optimized every function individually, but the service remains CPU-bound with no single bottleneck to attack.
The hidden issue is that pprof only does timer-based sampling at 100Hz and cannot distinguish compute-bound code from code stalled on cache misses or branch mispredictions. A function that appears to consume 5% of CPU time might actually be spending 90% of that time waiting for memory fetches, but pprof reports it the same as a function doing useful computation.
perf fills this gap by reading hardware counters directly. Since Go 1.21 preserves frame pointers by default, perf record -g captures accurate Go call stacks without any build flags. Running perf stat -e cache-misses,instructions on a Go binary can reveal an IPC below 0.3, pointing to a data structure with poor cache locality that pprof would never surface.
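Counting first, then sampling, against a hypothetical go-service binary:

# Characterize: low IPC plus a high miss rate means memory-bound
perf stat -e instructions,cycles,cache-misses -p "$(pidof go-service)" -- sleep 10
# Locate: frame-pointer unwinding works out of the box on Go >= 1.21
perf record -F 99 -g -p "$(pidof go-service)" -- sleep 30
perf report --stdio | head -20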
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| CPU profiling | perf from host (needs CAP_PERFMON) | async-profiler (uses perf_event_open) | --prof flag + tick processor | pprof (timer-based, 100 Hz) | kubectl cp perf.data from pod |
| Hardware counters | perf stat from host namespace | perf stat -p <jvm_pid> | perf stat -p <node_pid> | perf stat ./go-binary | Host-level perf targeting container PIDs |
| Flame graphs | perf script + flamegraph.pl | async-profiler --flame | 0x (Node.js flamegraph tool) | pprof -http=:8080 (web UI) | Continuous profiling (Pyroscope, Parca) |
| Symbol resolution | Requires debug symbols in image | perf-map-agent + PreserveFramePointer | N/A (V8 JIT symbols) | Default since Go 1.21 (frame pointers) | Mount debuginfo from host |
| Overhead | <1% at 99 Hz sampling | <1% with async-profiler | ~3% with --prof | ~1% with pprof default | Depends on profiling agent |
Stack Layer Mapping
| Layer | Perf Mechanism |
|---|---|
| CPU hardware | PMU registers count events (cycles, cache-misses, branches) |
| Local APIC | Delivers PMI (Performance Monitoring Interrupt) as NMI on overflow |
| Kernel perf_event | NMI handler captures IP + callchain, writes to ring buffer |
| perf tool | Reads ring buffer, writes perf.data, generates reports |
| Visualization | stackcollapse-perf.pl + flamegraph.pl produce SVG flame graphs |
| Application | Optimizes hot paths identified by flame graph width |
Design Rationale
Software instrumentation perturbs timing and is blind to micro-architectural effects like cache behavior and branch prediction -- hardware counters exist to fill that gap. The ring buffer uses mmap so the kernel writes samples directly into shared memory and userspace reads without any syscall per sample. Sampling fires through NMI rather than regular interrupts because regular interrupts can be masked, and losing a performance monitoring interrupt inside a critical section would skew the profile.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| perf shows only hex addresses, no function names | Missing debug symbols or JIT symbol map | Install debuginfo package or use perf-map-agent for JVM |
| IPC below 0.3 | Memory-stalled workload (cache misses dominating) | perf stat -e cache-misses,instructions to confirm miss rate |
| perf stat shows "not supported" for events | PMU not available (VM without passthrough) | perf list to check available events |
| Flame graph shows wide kernel bars | Excessive syscall overhead or kernel lock contention | perf record -g --call-graph dwarf for accurate kernel stacks |
| Multiplexing warning in perf stat | More events requested than PMU counters | Reduce event count or run separate perf stat passes |
| Sampling overhead above 5% | Sample frequency too high | Lower -F value (use 99 Hz as default) |
When to Use / Avoid
Use when:
- Diagnosing CPU-bound workloads where top shows 100% but the hotspot is unknown
- Distinguishing compute-bound code from memory-stalled code (IPC analysis)
- Building flame graphs for production profiling with minimal overhead (~1% at 99 Hz)
- Measuring cache miss rates, branch mispredictions, or TLB misses for optimization
- Profiling JIT-compiled code (Java, Go) with symbol resolution
Avoid when:
- The bottleneck is I/O-bound (use iostat, biolatency, or strace instead)
- Application-level profilers (pprof, async-profiler) already pinpoint the issue
- Running on VMs without PMU passthrough (hardware counters may not be available)
Try It Yourself
# Count hardware events for a command
perf stat -e cycles,instructions,cache-references,cache-misses ls /tmp 2>&1 | tail -15

# Measure IPC (instructions per cycle); ideal > 1.0
perf stat -e instructions,cycles -r 3 -- sleep 1 2>&1 | tail -10

# Profile a process for 30 seconds with call graphs
# perf record -g -F 99 -p $(pidof myapp) -- sleep 30

# Generate a flame graph from perf data
# perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Live top-like profiling of a process
# perf top -p $(pidof myapp) -e cycles

# List all available hardware events on this CPU
perf list hw 2>&1 | head -20

# List available tracepoints
perf list tracepoint 2>&1 | head -20

# Count context switches and page faults (software events)
perf stat -e context-switches,page-faults,cpu-migrations ls /tmp 2>&1 | tail -12

# Trace a specific kernel function
# perf probe -a tcp_sendmsg
# perf record -e probe:tcp_sendmsg -aR -- sleep 5
# perf probe -d tcp_sendmsg

# Show cache miss rate per function
# perf record -e cache-misses -g -- ./my_benchmark
# perf report --sort=dso,symbol

Debug Checklist
1. perf stat -e cycles,instructions,cache-misses,branch-misses ./program -- basic hardware counter summary
2. perf record -g -F 99 -p <pid> -- sleep 30 -- capture 30s of profiling data
3. perf report -- interactive TUI with per-function overhead
4. perf top -p <pid> -- live view of hottest functions
5. perf list -- show all available events on this CPU
6. perf stat -e L1-dcache-load-misses,LLC-load-misses ./program -- cache hierarchy analysis
Key Takeaways
- ✓Counting mode (perf stat) has near-zero overhead. Hardware counters tick in dedicated CPU registers with no interrupts. The kernel reads them at context switch. 'perf stat -e cache-misses,instructions ./myprogram' costs less than 0.1%.
- ✓Sampling mode (perf record) fires an NMI every N events and captures the instruction pointer plus call chain. At 99 Hz, overhead is ~1%. At 100K Hz, it is 10-20%. The key insight: sampling tells you WHERE events occur, not just HOW MANY.
- ✓Flame graphs are built from callchain samples. The x-axis is alphabetical (NOT time). Width is proportional to sample count. The y-axis is stack depth. Wide bars at the top are your optimization targets.
- ✓Hardware counter multiplexing kicks in when you request more events than PMU counters (typically 4-8). The kernel time-slices and extrapolates. You see a percentage indicator in perf stat output. For precise measurements, stay within the counter limit.
- ✓perf traces kernel functions (kprobes) and user-space functions (uprobes) without recompilation. Combined with 'perf record -e probe:*', this gives function-level tracing with far less overhead than strace because it runs in-kernel.
Common Pitfalls
- ✗Mistake: Using perf record without -g (call graph). Reality: Without stack traces, you see which functions are hot but not WHY they are hot (which callers lead to them). Always use 'perf record -g'. Use --call-graph dwarf for user-space or --call-graph fp for kernel.
- ✗Mistake: Comparing raw cache-miss counts across different CPUs. Reality: A 'cache miss' on Intel Skylake refers to a different cache level than on AMD Zen. Compare cache-misses/instructions (miss rate) instead, and verify which level the event maps to.
- ✗Mistake: Profiling without debug symbols. Reality: perf captures instruction pointers and needs symbol tables to map them to function names. Without debuginfo, you see hex addresses. Install debuginfo packages or build with -g -O2.
- ✗Mistake: Setting sample frequency too high. Reality: 'perf record -F 99999' generates ~100K NMIs/sec, consuming significant CPU and perturbing the workload. Use -F 99 as the default -- enough for statistical significance with minimal observer effect.
Reference
In One Line
perf stat tells how much; perf record -g -F 99 tells where -- start with counters, then sample for flame graphs.