eBPF: Programmable Kernel
Mental Model
An app store for the kernel. Instead of modifying the operating system itself -- dangerous, voids the warranty -- install small verified apps that run in a sandbox. They can observe everything (tracing), manage network traffic (firewall), enforce rules (security). But they cannot crash the system. The store verifies each app before installation. Anything unsafe gets rejected before it ever runs.
The Problem
A production server needs to trace every file open, filter 10 million packets per second, and enforce custom security policies -- all at the same time. Recompiling the kernel on a live system is out. A kernel module could do it, but one bug panics the machine. strace stops the target process on every syscall, slowing syscall-heavy workloads by 10-100x. iptables evaluates rules linearly and buckles once rule counts climb into the tens of thousands. There is no safe, fast, dynamic way to extend the kernel at runtime. Or there was not, until eBPF.
Architecture
What if custom code could run inside the kernel -- without risking a crash?
Not a kernel module that could panic the whole system. Not a recompilation that requires a reboot. Something that the kernel itself verifies is safe before executing, then JIT-compiles to native machine code.
That is eBPF. And it has quietly become the most important innovation in the Linux kernel in the last decade.
What Actually Happens
A developer writes a BPF program in restricted C. Compiles it with clang -target bpf to BPF bytecode. Loads it into the kernel via the bpf() syscall.
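A minimal sketch of that workflow, assuming clang and the libbpf headers are installed; the file name and function names are illustrative:

```c
// hello_open.bpf.c -- build with:
//   clang -O2 -g -target bpf -c hello_open.bpf.c -o hello_open.bpf.o
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// SEC() places the program in a named ELF section; the loader reads the
// section name to decide which hook to attach to -- here, the tracepoint
// that fires on every openat syscall.
SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(void *ctx)
{
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_printk("openat by %s", comm);  // lands in tracefs trace_pipe
    return 0;
}

// Every BPF program declares a license; "GPL" unlocks most helpers.
char LICENSE[] SEC("license") = "GPL";
```

Loading is then a few libbpf calls in user space (bpf_object__open_file, bpf_object__load, bpf_program__attach).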
Before the program can execute, it must pass the verifier. This static analyzer walks every possible execution path, tracking register states as a lattice (known value, bounded range, pointer type). It checks memory bounds, prevents null pointer dereferences, ensures no unbounded loops, and verifies that the program only uses allowed helper functions. If anything looks unsafe, the program is rejected with a detailed error log.
Once verified, the JIT compiler translates BPF bytecode to native x86-64 (or ARM64) instructions. The overhead is near zero.
The program then attaches to a hook point. A kprobe fires when a specific kernel function is called. A tracepoint fires on stable kernel events. An XDP program processes packets at the earliest possible point -- in the NIC driver's receive path, before any sk_buff allocation. TC programs filter at L3/L4. LSM programs make security decisions at file open, socket connect, and other sensitive operations.
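Two of those hook types side by side, as a sketch; do_unlinkat is a real kernel function commonly used in examples, and vmlinux.h is the BTF-generated header described under CO-RE below:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Fires on entry to the kernel function do_unlinkat (file deletion).
SEC("kprobe/do_unlinkat")
int BPF_KPROBE(on_unlink, int dfd, struct filename *name)
{
    bpf_printk("unlink, dfd=%d", dfd);
    return 0;
}

// Runs in the NIC driver's receive path, before sk_buff allocation.
SEC("xdp")
int xdp_observe(struct xdp_md *ctx)
{
    return XDP_PASS;  // let every packet continue up the stack
}

char LICENSE[] SEC("license") = "GPL";
```

The section name is the only thing that differs -- the loader maps it to the right attach mechanism.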
BPF programs talk to user space through maps -- key-value stores in kernel memory. Hash maps, arrays, ring buffers, per-CPU variants. Maps persist independently of programs and can be pinned to /sys/fs/bpf/ for long-lived state.
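A sketch of a BTF-defined map, the declaration style modern libbpf expects; the map and program names are illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

// Hash map: PID -> number of openat calls. Lives in kernel memory,
// readable from user space by name or file descriptor.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);    // PID
    __type(value, u64);  // event count
} counts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int count_opens(void *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 one = 1, *val;

    val = bpf_map_lookup_elem(&counts, &pid);
    if (val)
        __sync_fetch_and_add(val, 1);  // atomic in-place increment
    else
        bpf_map_update_elem(&counts, &pid, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space reads the same map through libbpf, or pins it (bpftool map pin name counts /sys/fs/bpf/counts) so the data outlives the loading process.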
Under the Hood
The verifier is the magic. It models each of the 11 BPF registers as a state and tracks how every instruction changes them. A bpf_map_lookup_elem() return is typed as "pointer to map value OR NULL" -- the verifier requires a NULL check before dereference. This is why eBPF programs cannot crash the kernel: safety is proved statically, before execution.
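The pattern in miniature -- a sketch where the commented-out line is exactly the kind of dereference the verifier rejects:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} total SEC(".maps");

SEC("kprobe/vfs_read")
int count_reads(void *ctx)
{
    u32 key = 0;
    u64 *val = bpf_map_lookup_elem(&total, &key);

    // *val += 1;  // REJECTED: on this path the register may be NULL
    if (val)       // the branch proves val != NULL for what follows
        __sync_fetch_and_add(val, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

Even though key 0 always exists in an array map, the verifier reasons about types, not runtime facts -- the check is mandatory either way.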
CO-RE solves the portability problem. Without CO-RE, a BPF program compiled on kernel 5.10 might break on 5.15 because struct field offsets changed. CO-RE uses BTF (BPF Type Format) embedded in the kernel to relocate field accesses at load time. Compile once, run on any kernel with BTF support.
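A sketch, assuming vmlinux.h was generated from the running kernel's BTF (bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/do_exit")
int on_exit(void *ctx)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    pid_t ppid;

    // BPF_CORE_READ emits a relocation rather than a fixed offset;
    // libbpf patches in the real offset at load time from kernel BTF,
    // so this works even if task_struct's layout differs per kernel.
    ppid = BPF_CORE_READ(task, real_parent, tgid);
    bpf_printk("exit, parent tgid=%d", ppid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```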
Helpers vs kfuncs. Helpers (bpf_map_lookup_elem, bpf_probe_read_kernel, bpf_get_current_pid_tgid) are stable ABI. Kfuncs (kernel 5.13+) let BPF programs call kernel functions marked with BTF_ID -- more flexible but less stable, as signatures can change between versions.
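The difference in code, sketched. The helper has a fixed, documented signature; the kfuncs are declared as externs with __ksym and resolved against kernel BTF at load time. These particular task kfuncs appeared around kernel 6.2 -- treat their availability as an assumption to verify on your target kernel:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// kfuncs: kernel functions exposed to BPF, resolved via BTF.
extern struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;

SEC("fentry/vfs_open")  // a TRACING-type program, where task kfuncs are allowed
int BPF_PROG(on_open)
{
    u64 id = bpf_get_current_pid_tgid();  // helper: stable ABI

    struct task_struct *t = bpf_task_from_pid(id >> 32);  // kfunc
    if (t)
        bpf_task_release(t);  // acquired references must be released
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```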
Tail calls enable composability. bpf_tail_call() replaces the current program's execution context with another (up to 33 levels). Cilium uses this for composable policy: base -> L3 filter -> L4 filter -> L7 proxy, each a separate program.
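A sketch of the mechanism: a program array map holds file descriptors of other programs, and bpf_tail_call() jumps by index. Names and indices are illustrative, and user space must populate the slots:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, u32);
    __type(value, u32);
} jump_table SEC(".maps");

SEC("xdp")
int l4_stage(struct xdp_md *ctx)
{
    bpf_printk("reached L4 stage");
    return XDP_PASS;
}

SEC("xdp")
int l3_stage(struct xdp_md *ctx)
{
    // On success this never returns: execution continues in whatever
    // program user space stored at index 1 (e.g., l4_stage above).
    bpf_tail_call(ctx, &jump_table, 1);
    return XDP_DROP;  // only reached if slot 1 is empty
}

char LICENSE[] SEC("license") = "GPL";
```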
Common Questions
How does eBPF differ from kernel modules?
Safety and portability. A buggy kernel module can panic the system. eBPF programs are verified before execution -- they cannot crash the kernel, access arbitrary memory, or loop forever. They also do not require kernel headers at runtime (with CO-RE). The tradeoff: eBPF programs have restricted capabilities (limited helpers, small stack, no sleeping in most program types).
What is the overhead?
For kprobe/tracepoint programs, typically less than 1% aggregate. Each invocation takes 50-200ns. XDP adds 50-100ns per packet. The bottleneck is usually ring buffer I/O for event streaming. Use per-CPU maps and sampling to manage it.
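The per-CPU pattern looks like this -- a sketch in which each CPU increments its own slot, so there is no atomic contention or cache-line bouncing; user space sums the slots when it reads:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} events SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int count_all(void *ctx)
{
    u32 key = 0;
    u64 *val = bpf_map_lookup_elem(&events, &key);
    if (val)
        (*val)++;  // safe without atomics: this slot is CPU-local
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```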
Can eBPF modify kernel behavior, or only observe?
Both. XDP drops, redirects, or modifies packets. TC rewrites headers. bpf_override_return() changes function return values for fault injection. LSM programs return allow/deny decisions. struct_ops programs replace kernel function pointer tables (e.g., TCP congestion control). But BPF programs cannot call arbitrary functions, allocate persistent memory, or sleep (except in sleepable types added in kernel 5.10).
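A sketch of the active side: an XDP program that drops IPv4 ICMP (ping) at the driver. Note the bounds checks -- the verifier rejects any packet access it cannot prove in range:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define ETH_P_IP     0x0800  /* from linux/if_ether.h */
#define IPPROTO_ICMP 1       /* from linux/in.h */

SEC("xdp")
int drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)  // bounds check, or no load
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)   // bounds check, or no load
        return XDP_PASS;

    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;  // the packet never reaches the stack
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```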
How Technologies Use This
Service routing latency jumps from 0.5ms to 5ms as a Kubernetes cluster grows past 5000 services. kube-proxy becomes the bottleneck, and CPU usage on nodes climbs steadily with each new service added to the cluster.
The root cause is that kube-proxy uses iptables, and every service creates rules that the kernel evaluates linearly on each packet. At 10K services the rule count exceeds 40,000, making packet routing O(n) per packet where n is the number of services. Every packet traverses the entire rule chain before reaching its destination.
Cilium replaces kube-proxy entirely with eBPF programs attached to tc and XDP hooks that use hash-map lookups for O(1) service resolution regardless of cluster size. Load balancing, network policy enforcement, and transparent encryption all execute as verified BPF code in the kernel data path. The result is roughly 10x lower latency at scale and 40% less CPU consumed by packet processing compared to iptables.
Detecting container escape attempts in real time on production hosts is critical, but strace adds 10-100x overhead by stopping the process on every syscall, and loading a custom kernel module risks panicking the entire system on a single bug. Neither option is acceptable in production.
The fundamental dilemma is that syscall-level monitoring traditionally requires either stopping the target process (ptrace/strace) or running unverified code in kernel space (kernel modules). Both approaches trade safety for visibility, and on a production host running hundreds of containers, that tradeoff is unacceptable.
eBPF breaks this tradeoff entirely. Falco attaches eBPF programs to kprobes and tracepoints that observe every container's syscall activity at less than 1% CPU overhead. The verifier guarantees the monitoring code cannot crash the kernel, and BPF ring buffers stream 50,000+ security events per second to userspace for real-time alerting on suspicious behavior.
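The kernel side of that event stream, sketched with the BPF ring buffer (kernel 5.8+); the event struct is illustrative, and user space consumes it with libbpf's ring_buffer__poll():

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct event {
    u32 pid;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // bytes; power-of-two multiple of page size
} rb SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int on_exec(void *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0;  // buffer full: drop the event rather than block

    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);  // visible to user space without a copy
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```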
Same Concept Across Tech
| Technology | How it uses eBPF | Key detail |
|---|---|---|
| Cilium | Kubernetes CNI plugin. Replaces iptables with eBPF for pod networking and network policy | Bypasses kube-proxy entirely |
| Falco | Runtime security monitoring. eBPF programs detect suspicious syscalls (exec, open, connect) | Alternative to kernel module-based Falco |
| Tetragon | Security observability. Traces process execution, file access, network calls via eBPF | Deeper than Falco, can enforce policies |
| bpftrace | One-liner tracing scripts (like awk for kernel tracing) | bpftrace -e 'kprobe:vfs_read { @[comm] = count(); }' |
| BCC | Python/C toolkit for writing eBPF programs. Includes 100+ ready-made tools | opensnoop, execsnoop, tcpconnect |
| Datadog/Grafana | APM agents use eBPF for network monitoring without application instrumentation | Zero-code instrumentation for services |
Stack layer mapping (production tracing):
| Layer | What to check | eBPF tool |
|---|---|---|
| Application | Which functions are slow? What files are opened? | funclatency, opensnoop |
| Syscall | Which syscalls dominate? Latency per syscall? | syscount, bpftrace tracepoints |
| Kernel | Which kernel functions take the longest? Lock contention? | funclatency on kernel functions |
| Network | TCP retransmits? Connection latency? Packet drops? | tcpretrans, tcpconnlat |
| Storage | I/O latency distribution? Which files are hottest? | biolatency, fileslower |
Design Rationale
Two options existed for extending kernel behavior, and both had fatal problems. Kernel modules run with full privileges -- one bug panics everything, unacceptable for production observability. Ptrace-based tracing (strace) stops the target on every event, imposing 10-100x overhead that rules out production use. eBPF broke this deadlock with a static verifier that proves safety before execution: the crash risk of modules vanishes, while in-kernel speed stays. The JIT compiler followed because interpreted bytecode was too slow for the networking fast path, where even 100ns per packet matters at millions of packets per second.
If You See This, Think This
| Symptom | Likely cause | eBPF tool to use |
|---|---|---|
| Mystery process opening sensitive files | Unexpected file access, potential security issue | opensnoop (filter output for /etc/shadow) |
| TCP retransmissions causing latency | Network congestion or misconfigured buffers | tcpretrans |
| Short-lived processes consuming CPU | Processes spawning and dying too fast for top to catch | execsnoop |
| Disk I/O latency outliers | Occasional slow disk requests mixed with fast ones | biolatency (shows histogram) |
| Unknown network connections from container | Container making unexpected outbound calls | tcpconnect with PID filtering |
| High system CPU but unclear which subsystem | Kernel functions consuming time not visible in user-space profilers | profile (eBPF-based CPU profiler) |
When to Use / Avoid
Use eBPF when:
- Tracing system behavior in production without restarts or meaningful overhead (bpftrace, BCC tools)
- High-performance packet filtering or load balancing (Cilium, XDP)
- Custom security policies beyond what AppArmor/SELinux provide (Falco, Tetragon)
- Profiling CPU, memory, or I/O at the kernel level without modifying application code
Avoid when:
- Simple tracing needs are met by strace or perf (no need for eBPF complexity)
- Running on kernels older than 4.15 (limited eBPF support)
- The problem can be solved in user space (eBPF adds complexity)
- Needing to modify data structures (eBPF is read-mostly, writing is restricted)
Try It Yourself
```bash
# List all loaded BPF programs
sudo bpftool prog list 2>/dev/null || echo 'bpftool not installed'

# List BPF maps
sudo bpftool map list 2>/dev/null || echo 'bpftool not installed'

# Trace all file opens system-wide using bpftrace
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }' 2>/dev/null & sleep 2; kill %1 2>/dev/null

# Check if BTF is available (required for CO-RE)
ls -la /sys/kernel/btf/vmlinux 2>/dev/null || echo 'BTF not available'

# Count syscalls per process using bpftrace
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' 2>/dev/null & sleep 3; kill %1 2>/dev/null

# Show available BPF program types
sudo bpftool feature probe kernel 2>/dev/null | grep program_type | head -10
```
Debug Checklist
1. List loaded BPF programs: bpftool prog list
2. List BPF maps: bpftool map list
3. Check that the BPF filesystem is mounted: ls /sys/fs/bpf/
4. Quick one-liner tracing: bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
5. Check BPF-related kernel config: grep BPF /boot/config-$(uname -r)
6. Monitor BPF program overhead: bpftool prog show (check run_time_ns)
Key Takeaways
- ✓ The verifier is the gatekeeper. It tracks register states as a lattice (known value, bounded range, pointer type), walks every execution path, and prunes equivalent states to handle path explosion. Limit: ~1 million verified instructions.
- ✓ XDP processes packets before sk_buff allocation -- in the driver's receive path, right after DMA completion. Cloudflare uses XDP to drop DDoS traffic at line rate (~100M pps) before it ever reaches the network stack.
- ✓ Tail calls let one BPF program chain into another (up to 33 deep). Cilium uses this for composable network policy: base program -> L3 filter -> L4 filter -> L7 proxy, each a separate BPF program.
- ✓ BPF ring buffer (kernel 5.8+) replaces perf buffers for event streaming. A single ring shared across CPUs, variable-length records, no extra data copy. This is how modern observability tools get events out of the kernel efficiently.
- ✓ libbpf is the canonical library. It handles ELF parsing, CO-RE relocation, map creation, program loading, and attachment. libbpf-bootstrap provides skeleton code for new projects.
Common Pitfalls
- ✗ Mistake: Hardcoding kernel struct field offsets. Reality: This breaks on different kernel versions. Use CO-RE with BTF and __builtin_preserve_access_index for portable programs.
- ✗ Mistake: Using bpf_probe_read() for everything. Reality: Since kernel 5.5, use bpf_probe_read_kernel() or bpf_probe_read_user() explicitly. The generic version cannot distinguish pointer types on all architectures -- see the sketch after this list.
- ✗ Mistake: Trying to call arbitrary kernel functions. Reality: Only BPF helpers and kfuncs (marked with BTF_ID) are callable. The verifier rejects everything else.
- ✗ Mistake: Dereferencing bpf_map_lookup_elem() without a NULL check. Reality: The lookup returns NULL if the key does not exist. Skipping the NULL check is the most common verifier rejection for beginners.
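A sketch of the kernel/user split from the second pitfall: the openat filename pointer comes from the calling process, so it must be read with the _user variant:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_openat")
int on_openat(struct trace_event_raw_sys_enter *ctx)
{
    char path[64];
    // args[1] is the filename pointer the *user* process passed in.
    // Reading it with bpf_probe_read_kernel_str() would return junk
    // (or fail) on architectures with split address spaces.
    const char *upath = (const char *)ctx->args[1];
    bpf_probe_read_user_str(path, sizeof(path), upath);
    bpf_printk("openat: %s", path);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```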
Reference
In One Line
Verified programs running at native speed inside the kernel -- replacing modules for tracing, iptables for networking, and enabling security policies that used to require kernel patches.