Perf Events & Performance Counters
Mental Model
A car's engine already has sensors for RPM, fuel injection rate, turbo pressure, exhaust temperature -- all ticking in real time with zero drag. Most drivers never look at them. Counting mode reads the gauges after the drive and reports totals. Sampling mode mounts a camera that snaps a photo every few seconds, capturing what the engine was doing at each instant. Stack enough photos and patterns jump out: the turbo ran 80% of the time, the fuel injector was the bottleneck. Flame graphs turn those photos into a heat map of where the engine spent its effort.
The Problem
An application is pegged at 100% CPU. Top confirms it -- then what? Is it doing useful compute or stalling on cache misses? pprof spreads blame evenly across 20 functions without distinguishing real work from memory waits. A Java service profiled under perf shows nothing but hex addresses because JIT code has no symbol table. IPC sits below 0.3 -- a clear memory bottleneck -- but no application-level profiler captures that metric. Crank the sample frequency to 100K Hz and the profiling overhead itself distorts the results by 10-20%.
Architecture
Top can say the CPU is saturated, but that is where its information ends.
Which function is the bottleneck? Is the CPU doing useful work or stalling on memory? Are the culprits cache misses, branch mispredictions, or simply too many instructions?
Without perf, it is all guesswork. With it, flame graphs show exactly where cycles burn -- and the optimization targets become obvious.
What Actually Happens
The perf_event subsystem handles three event classes:
Hardware events come from the CPU's PMU (Performance Monitoring Unit) -- dedicated registers that increment on micro-architectural events. The kernel programs MSR registers (IA32_PERFEVTSELx for event selection, IA32_PMCx for the count) and reads them at context switch.
Software events are counted by the kernel at instrumentation points: context switches in schedule(), page faults in the fault handler, CPU migrations in the scheduler.
Tracepoints are 1,500+ static instrumentation points compiled into the kernel. They fire on syscall entry/exit, scheduler decisions, I/O completion, packet transmission, and more.
In counting mode (perf stat), the kernel reads counter values at start and end. No interrupts. No sampling. Overhead is less than 0.1%.
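A counting-mode sketch that mixes all three event classes in one run (sleep 1 is just a placeholder workload; tracepoint access may require root, and the raw r00c0 encoding assumes an Intel core -- verify against your CPU's manual):

# Hardware event (PMU), software event (kernel counter), and
# tracepoint (static hook) counted side by side
perf stat -e cycles,page-faults,sched:sched_switch -- sleep 1

# Raw PMU events bypass the symbolic names: the hex code carries the
# umask and event-select bytes written into IA32_PERFEVTSELx
perf stat -e r00c0 -- sleep 1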
In sampling mode (perf record), the kernel programs the PMU to generate an NMI every N events. The NMI handler captures the instruction pointer, call chain, timestamp, CPU, and PID, then writes the sample into a memory-mapped ring buffer. perf record reads this buffer and writes perf.data for offline analysis.
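A minimal sampling session, assuming permission to profile system-wide (sleep 10 only sets the measurement window):

# Sample every CPU at 99 Hz with call chains for 10 seconds;
# -m enlarges the mmap ring buffer (in pages) if perf warns
# about lost samples
perf record -F 99 -a -g -m 512 -- sleep 10
perf report --stdio | head -30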
Under the Hood
IPC is the single most important metric. Instructions Per Cycle = instructions retired / CPU cycles. Modern CPUs can retire 4-6 instructions per cycle. An IPC of 1.0 is typical for general code. IPC below 0.5 means the CPU is stalling -- usually on memory. IPC above 2.0 means well-optimized, compute-bound code. perf stat -e instructions,cycles gives IPC instantly.
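A sketch of extracting IPC from perf stat's machine-readable output; -x, switches to CSV, and the field layout assumed here (value first, event name third) matches recent perf versions but is worth verifying on yours:

# Write counter values as CSV, then divide instructions by cycles
perf stat -x, -e instructions,cycles -o stat.csv -- sleep 1
instr=$(awk -F, '$3 ~ /instructions/ { print $1 }' stat.csv)
cyc=$(awk -F, '$3 ~ /cycles/ { print $1 }' stat.csv)
awk -v i="$instr" -v c="$cyc" 'BEGIN { printf "IPC = %.2f\n", i / c }'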
Sampling vs counting serve different purposes. Counting reveals the total: "1.5 million cache misses during execution." Sampling reveals WHERE: "40% of cache misses happen in hash_lookup(), 30% in btree_search()." Use counting first to characterize the problem, then sampling to find the cause.
Three ways to collect call stacks. (1) Frame pointer (--call-graph fp) -- fast but requires -fno-omit-frame-pointer compilation. (2) DWARF (--call-graph dwarf) -- works with optimized code but slower and uses more storage. (3) LBR (--call-graph lbr) -- uses Intel's Last Branch Record hardware, fast and accurate but limited to ~32 frames.
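The three modes as concrete invocations, with <pid> as a placeholder; the number after dwarf is the per-sample stack snapshot size in bytes:

# Frame pointers: cheapest, but needs -fno-omit-frame-pointer builds
perf record --call-graph fp -F 99 -p <pid> -- sleep 30
# DWARF: copies a stack snapshot per sample and unwinds offline
perf record --call-graph dwarf,8192 -F 99 -p <pid> -- sleep 30
# LBR: Intel's hardware branch records, limited to ~32 frames
perf record --call-graph lbr -F 99 -p <pid> -- sleep 30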
The flame graph workflow. perf record -g -F 99 -p PID -- sleep 30 captures samples. perf script > out.perf dumps raw data. stackcollapse-perf.pl out.perf > out.folded collapses stacks. flamegraph.pl out.folded > flame.svg generates the visualization. The x-axis is alphabetically sorted (NOT a timeline). Width is proportional to sample count. The widest bars at the top are the optimization targets.
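The pipeline end to end, assuming a local checkout of Brendan Gregg's FlameGraph repository and <pid> as a stand-in for the target process:

git clone https://github.com/brendangregg/FlameGraph
perf record -g -F 99 -p <pid> -- sleep 30
perf script > out.perf
./FlameGraph/stackcollapse-perf.pl out.perf > out.folded
./FlameGraph/flamegraph.pl out.folded > flame.svg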
Common Questions
What does a high cache-miss rate indicate?
The working set does not fit in cache, or the access pattern is hostile (random). Fixes: reduce working set size, improve spatial locality (arrays over linked lists, struct-of-arrays over array-of-structs), improve temporal locality (process in cache-sized chunks), or add prefetch hints. perf record -e LLC-load-misses -g shows WHERE the misses happen.
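Counting both loads and misses gives a rate rather than a bare count; ./my_benchmark is a placeholder binary:

# Miss counts are only meaningful against a load denominator
perf stat -e LLC-loads,LLC-load-misses -- ./my_benchmark
# Then attribute the misses to functions
perf record -e LLC-load-misses -g -- ./my_benchmark
perf report --stdio | head -20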
How do flame graphs differ from traditional profilers?
Traditional profilers (gprof, callgrind) instrument every function call, adding overhead proportional to frequency and distorting the profile. Flame graphs use statistical sampling at 99 Hz with negligible overhead, capturing actual production behavior. The visualization shows the full call stack hierarchy at once, making it easy to spot which paths consume the most CPU.
What is perf_event_paranoid?
/proc/sys/kernel/perf_event_paranoid controls access: -1 = no restrictions; 0 = allow everything except raw tracepoint access for unprivileged users; 1 = additionally restrict kernel profiling to root (many distros' default); 2 = unprivileged users may profile only their own user-space code; 3 = unprivileged use of perf disabled entirely (a Debian/Android patch, not in the mainline kernel). In production, restricting it prevents unprivileged users from mounting side-channel attacks via profiling.
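Checking and adjusting it, sketched assuming a Linux 5.8+ kernel for CAP_PERFMON (note that on some distros /usr/bin/perf is a wrapper script, on which setcap has no effect):

# Inspect the current restriction level
cat /proc/sys/kernel/perf_event_paranoid
# Relax it for a profiling session (does not persist across reboot)
sudo sysctl kernel.perf_event_paranoid=1
# Or grant just the perf binary the capability instead of loosening globally
sudo setcap cap_perfmon+ep "$(which perf)"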
How are JIT-compiled languages profiled?
JIT compilers generate code at runtime without symbol tables. The fix: JIT symbol maps. The runtime writes /tmp/perf-PID.map mapping address ranges to function names. For Java, use perf-map-agent or -XX:+PreserveFramePointer. For Node.js, use --perf-basic-prof. perf reads these maps automatically during report and script.
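For Node.js the whole loop looks like this, with app.js as a placeholder script:

# V8 appends JIT symbols to /tmp/perf-<pid>.map as it compiles
node --perf-basic-prof app.js &
perf record -F 99 -g -p $! -- sleep 30
# perf script consults the map file automatically when resolving
perf script | head -20
kill $!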
How Technologies Use This
A Java service is burning 90% CPU, but perf top shows nothing but a wall of hex addresses with zero useful method names. JVM-only profilers like async-profiler show timer-based samples, but cannot explain whether the CPU is doing useful compute or stalling on memory.
The problem is that JIT-compiled bytecode has no static symbol table, making hardware counter profiles completely opaque. Without symbol resolution, perf captures instruction pointers that map to dynamically generated code with no function name metadata. The CPU's hardware performance counters are counting events perfectly, but there is no way to attribute them to Java methods.
perf-map-agent solves this by writing /tmp/perf-PID.map with address-to-method mappings on the fly, and -XX:+PreserveFramePointer lets perf record -g walk the Java call stack accurately. Together they turn unreadable hex dumps into flame graphs that pinpoint the exact Java method burning cycles. Netflix uses this approach to identify 15-20% CPU savings in production services by finding hot loops invisible to JVM-only profilers.
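A sketch of that workflow, assuming a local perf-map-agent checkout (its create-java-perf-map.sh helper dumps the symbol map for a running JVM) and app.jar as a placeholder:

java -XX:+PreserveFramePointer -jar app.jar &
jvm_pid=$!
# Writes /tmp/perf-<pid>.map for the running JVM
./perf-map-agent/bin/create-java-perf-map.sh "$jvm_pid"
perf record -F 99 -g -p "$jvm_pid" -- sleep 30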
A Go service is using 100% CPU, but pprof shows even distribution across 20 functions with no obvious hotspot. The team has optimized every function individually, but the service remains CPU-bound with no single bottleneck to attack.
The hidden issue is that pprof only does timer-based sampling at 100Hz and cannot distinguish compute-bound code from code stalled on cache misses or branch mispredictions. A function that appears to consume 5% of CPU time might actually be spending 90% of that time waiting for memory fetches, but pprof reports it the same as a function doing useful computation.
perf fills this gap by reading hardware counters directly. Since Go 1.21 preserves frame pointers by default, perf record -g captures accurate Go call stacks without any build flags. Running perf stat -e cache-misses,instructions on a Go binary can reveal an IPC below 0.3, pointing to a data structure with poor cache locality that pprof would never surface.
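Counting first, then sampling, against a hypothetical go-service binary:

# Characterize: low IPC plus a high miss rate means memory-bound
perf stat -e instructions,cycles,cache-misses -p "$(pidof go-service)" -- sleep 10
# Locate: frame-pointer unwinding works out of the box on Go >= 1.21
perf record -F 99 -g -p "$(pidof go-service)" -- sleep 30
perf report --stdio | head -20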
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| CPU profiling | perf from host (needs CAP_PERFMON) | async-profiler (uses perf_event_open) | --prof flag + tick processor | pprof (timer-based, 100 Hz) | kubectl cp perf.data from pod |
| Hardware counters | perf stat from host namespace | perf stat -p <jvm_pid> | perf stat -p <node_pid> | perf stat ./go-binary | Host-level perf targeting container PIDs |
| Flame graphs | perf script + flamegraph.pl | async-profiler --flame | 0x (Node.js flamegraph tool) | pprof -http=:8080 (web UI) | Continuous profiling (Pyroscope, Parca) |
| Symbol resolution | Requires debug symbols in image | perf-map-agent + PreserveFramePointer | N/A (V8 JIT symbols) | Default since Go 1.21 (frame pointers) | Mount debuginfo from host |
| Overhead | <1% at 99 Hz sampling | <1% with async-profiler | ~3% with --prof | ~1% with pprof default | Depends on profiling agent |
Stack Layer Mapping
| Layer | Perf Mechanism |
|---|---|
| CPU hardware | PMU registers count events (cycles, cache-misses, branches) |
| Local APIC | Delivers PMI (Performance Monitoring Interrupt) as NMI on overflow |
| Kernel perf_event | NMI handler captures IP + callchain, writes to ring buffer |
| perf tool | Reads ring buffer, writes perf.data, generates reports |
| Visualization | stackcollapse-perf.pl + flamegraph.pl produce SVG flame graphs |
| Application | Optimizes hot paths identified by flame graph width |
Design Rationale
Software instrumentation perturbs timing and is blind to micro-architectural effects like cache behavior and branch prediction -- hardware counters exist to fill that gap. The ring buffer uses mmap so the kernel writes samples directly into shared memory and userspace reads without any syscall per sample. Sampling fires through NMI rather than regular interrupts because regular interrupts can be masked, and losing a performance monitoring interrupt inside a critical section would skew the profile.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| perf shows only hex addresses, no function names | Missing debug symbols or JIT symbol map | Install debuginfo package or use perf-map-agent for JVM |
| IPC below 0.3 | Memory-stalled workload (cache misses dominating) | perf stat -e cache-misses,instructions to confirm miss rate |
| perf stat shows "not supported" for events | PMU not available (VM without passthrough) | perf list to check available events |
| Flame graph shows wide kernel bars | Excessive syscall overhead or kernel lock contention | perf record -g --call-graph dwarf for accurate kernel stacks |
| Multiplexing warning in perf stat | More events requested than PMU counters | Reduce event count or run separate perf stat passes |
| Sampling overhead above 5% | Sample frequency too high | Lower -F value (use 99 Hz as default) |
When to Use / Avoid
Use when:
- Diagnosing CPU-bound workloads where top shows 100% but the hotspot is unknown
- Distinguishing compute-bound code from memory-stalled code (IPC analysis)
- Building flame graphs for production profiling with minimal overhead (~1% at 99 Hz)
- Measuring cache miss rates, branch mispredictions, or TLB misses for optimization
- Profiling JIT-compiled code (Java, Go) with symbol resolution
Avoid when:
- The bottleneck is I/O-bound (use iostat, biolatency, or strace instead)
- Application-level profilers (pprof, async-profiler) already pinpoint the issue
- Running on VMs without PMU passthrough (hardware counters may not be available)
Try It Yourself
# Count hardware events for a command
perf stat -e cycles,instructions,cache-references,cache-misses ls /tmp 2>&1 | tail -15

# Measure IPC (instructions per cycle); ideal > 1.0
perf stat -e instructions,cycles -r 3 -- sleep 1 2>&1 | tail -10

# Profile a process for 30 seconds with call graphs
# perf record -g -F 99 -p $(pidof myapp) -- sleep 30

# Generate a flame graph from perf data
# perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

# Live top-like profiling of a process
# perf top -p $(pidof myapp) -e cycles

# List all available hardware events on this CPU
perf list hw 2>&1 | head -20

# List available tracepoints
perf list tracepoint 2>&1 | head -20

# Count context switches and page faults (software events)
perf stat -e context-switches,page-faults,cpu-migrations ls /tmp 2>&1 | tail -12

# Trace a specific kernel function
# perf probe -a tcp_sendmsg
# perf record -e probe:tcp_sendmsg -aR -- sleep 5
# perf probe -d tcp_sendmsg

# Show cache miss rate per function
# perf record -e cache-misses -g -- ./my_benchmark
# perf report --sort=dso,symbol

Debug Checklist
1. perf stat -e cycles,instructions,cache-misses,branch-misses ./program -- basic hardware counter summary
2. perf record -g -F 99 -p <pid> -- sleep 30 -- capture 30s of profiling data
3. perf report -- interactive TUI with per-function overhead
4. perf top -p <pid> -- live view of hottest functions
5. perf list -- show all available events on this CPU
6. perf stat -e L1-dcache-load-misses,LLC-load-misses ./program -- cache hierarchy analysis
Key Takeaways
- ✓Counting mode (perf stat) has near-zero overhead. Hardware counters tick in dedicated CPU registers with no interrupts. The kernel reads them at context switch. 'perf stat -e cache-misses,instructions ./myprogram' costs less than 0.1%.
- ✓Sampling mode (perf record) fires an NMI every N events and captures the instruction pointer plus call chain. At 99 Hz, overhead is ~1%. At 100K Hz, it is 10-20%. The key insight: sampling tells you WHERE events occur, not just HOW MANY.
- ✓Flame graphs are built from callchain samples. The x-axis is alphabetical (NOT time). Width is proportional to sample count. The y-axis is stack depth. Wide bars at the top are your optimization targets.
- ✓Hardware counter multiplexing kicks in when you request more events than PMU counters (typically 4-8). The kernel time-slices and extrapolates. You see a percentage indicator in perf stat output. For precise measurements, stay within the counter limit.
- ✓perf traces kernel functions (kprobes) and user-space functions (uprobes) without recompilation. Combined with 'perf record -e probe:*', this gives function-level tracing with far less overhead than strace because it runs in-kernel.
Common Pitfalls
- ✗Mistake: Using perf record without -g (call graph). Reality: Without stack traces, you see which functions are hot but not WHY they are hot (which callers lead to them). Always use 'perf record -g'. Use --call-graph dwarf for user-space or --call-graph fp for kernel.
- ✗Mistake: Comparing raw cache-miss counts across different CPUs. Reality: A 'cache miss' on Intel Skylake refers to a different cache level than on AMD Zen. Compare cache-misses/instructions (miss rate) instead, and verify which level the event maps to.
- ✗Mistake: Profiling without debug symbols. Reality: perf captures instruction pointers and needs symbol tables to map them to function names. Without debuginfo, you see hex addresses. Install debuginfo packages or build with -g -O2.
- ✗Mistake: Setting sample frequency too high. Reality: 'perf record -F 99999' generates ~100K NMIs/sec, consuming significant CPU and perturbing the workload. Use -F 99 as the default -- enough for statistical significance with minimal observer effect.
Reference
In One Line
perf stat tells how much; perf record -g -F 99 tells where -- start with counters, then sample for flame graphs.