eBPF: Programmable Kernel
Mental Model
An app store for the kernel. Instead of modifying the operating system itself -- dangerous, voids the warranty -- install small verified apps that run in a sandbox. They can observe everything (tracing), manage network traffic (firewall), enforce rules (security). But they cannot crash the system. The store verifies each app before installation. Anything unsafe gets rejected before it ever runs.
The Problem
A production server needs to trace every file open, filter 10 million packets per second, and enforce custom security policies -- all at the same time. Recompiling the kernel on a live system is out. A kernel module could do it, but one bug panics the machine. strace stops the target process on every syscall, slowing syscall-heavy workloads by 10-100x. iptables evaluates rules linearly and buckles once rule counts climb into the tens of thousands. There is no safe, fast, dynamic way to extend the kernel at runtime. Or there was not, until eBPF.
Architecture
What if custom code could run inside the kernel -- without risking a crash?
Not a kernel module that could panic the whole system. Not a recompilation that requires a reboot. Something that the kernel itself verifies is safe before executing, then JIT-compiles to native machine code.
That is eBPF. And it has quietly become the most important innovation in the Linux kernel in the last decade.
What Actually Happens
A developer writes a BPF program in restricted C. Compiles it with clang -target bpf to BPF bytecode. Loads it into the kernel via the bpf() syscall.
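A minimal sketch of that workflow, assuming clang and the libbpf headers are installed; the file name and function names are illustrative:

```c
// hello_open.bpf.c -- build with:
//   clang -O2 -g -target bpf -c hello_open.bpf.c -o hello_open.bpf.o
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// SEC() places the program in a named ELF section; the loader reads the
// section name to decide which hook to attach to -- here, the tracepoint
// that fires on every openat syscall.
SEC("tracepoint/syscalls/sys_enter_openat")
int trace_openat(void *ctx)
{
    char comm[16];
    bpf_get_current_comm(&comm, sizeof(comm));
    bpf_printk("openat by %s", comm);  // lands in tracefs trace_pipe
    return 0;
}

// Every BPF program declares a license; "GPL" unlocks most helpers.
char LICENSE[] SEC("license") = "GPL";
```

Loading is then a few libbpf calls in user space (bpf_object__open_file, bpf_object__load, bpf_program__attach).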
Before the program can execute, it must pass the verifier. This static analyzer walks every possible execution path, tracking register states as a lattice (known value, bounded range, pointer type). It checks memory bounds, prevents null pointer dereferences, ensures no unbounded loops, and verifies that the program only uses allowed helper functions. If anything looks unsafe, the program is rejected with a detailed error log.
Once verified, the JIT compiler translates BPF bytecode to native x86-64 (or ARM64) instructions. The overhead is near zero.
The program then attaches to a hook point. A kprobe fires when a specific kernel function is called. A tracepoint fires on stable kernel events. An XDP program processes packets at the earliest possible point -- in the NIC driver's receive path, before any sk_buff allocation. TC programs filter at L3/L4. LSM programs make security decisions at file open, socket connect, and other sensitive operations.
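Two of those hook types side by side, as a sketch; do_unlinkat is a real kernel function commonly used in examples, and vmlinux.h is the BTF-generated header described under CO-RE below:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// Fires on entry to the kernel function do_unlinkat (file deletion).
SEC("kprobe/do_unlinkat")
int BPF_KPROBE(on_unlink, int dfd, struct filename *name)
{
    bpf_printk("unlink, dfd=%d", dfd);
    return 0;
}

// Runs in the NIC driver's receive path, before sk_buff allocation.
SEC("xdp")
int xdp_observe(struct xdp_md *ctx)
{
    return XDP_PASS;  // let every packet continue up the stack
}

char LICENSE[] SEC("license") = "GPL";
```

The section name is the only thing that differs -- the loader maps it to the right attach mechanism.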
BPF programs talk to user space through maps -- key-value stores in kernel memory. Hash maps, arrays, ring buffers, per-CPU variants. Maps persist independently of programs and can be pinned to /sys/fs/bpf/ for long-lived state.
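A sketch of a BTF-defined map, the declaration style modern libbpf expects; the map and program names are illustrative:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

// Hash map: PID -> number of openat calls. Lives in kernel memory,
// readable from user space by name or file descriptor.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 10240);
    __type(key, u32);    // PID
    __type(value, u64);  // event count
} counts SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_openat")
int count_opens(void *ctx)
{
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 one = 1, *val;

    val = bpf_map_lookup_elem(&counts, &pid);
    if (val)
        __sync_fetch_and_add(val, 1);  // atomic in-place increment
    else
        bpf_map_update_elem(&counts, &pid, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

User space reads the same map through libbpf, or pins it (bpftool map pin name counts /sys/fs/bpf/counts) so the data outlives the loading process.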
Under the Hood
The verifier is the magic. It models each of the 11 BPF registers as a state and tracks how every instruction changes them. A bpf_map_lookup_elem() return is typed as "pointer to map value OR NULL" -- the verifier requires a NULL check before dereference. This is why eBPF programs cannot crash the kernel: safety is proved statically, before execution.
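The pattern in miniature -- a sketch where the commented-out line is exactly the kind of dereference the verifier rejects:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} total SEC(".maps");

SEC("kprobe/vfs_read")
int count_reads(void *ctx)
{
    u32 key = 0;
    u64 *val = bpf_map_lookup_elem(&total, &key);

    // *val += 1;  // REJECTED: on this path the register may be NULL
    if (val)       // the branch proves val != NULL for what follows
        __sync_fetch_and_add(val, 1);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

Even though key 0 always exists in an array map, the verifier reasons about types, not runtime facts -- the check is mandatory either way.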
CO-RE solves the portability problem. Without CO-RE, a BPF program compiled on kernel 5.10 might break on 5.15 because struct field offsets changed. CO-RE uses BTF (BPF Type Format) embedded in the kernel to relocate field accesses at load time. Compile once, run on any kernel with BTF support.
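A sketch, assuming vmlinux.h was generated from the running kernel's BTF (bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h):

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

SEC("kprobe/do_exit")
int on_exit(void *ctx)
{
    struct task_struct *task = (void *)bpf_get_current_task();
    pid_t ppid;

    // BPF_CORE_READ emits a relocation rather than a fixed offset;
    // libbpf patches in the real offset at load time from kernel BTF,
    // so this works even if task_struct's layout differs per kernel.
    ppid = BPF_CORE_READ(task, real_parent, tgid);
    bpf_printk("exit, parent tgid=%d", ppid);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```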
Helpers vs kfuncs. Helpers (bpf_map_lookup_elem, bpf_probe_read_kernel, bpf_get_current_pid_tgid) are stable ABI. Kfuncs (kernel 5.13+) let BPF programs call kernel functions marked with BTF_ID -- more flexible but less stable, as signatures can change between versions.
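The difference in code, sketched. The helper has a fixed, documented signature; the kfuncs are declared as externs with __ksym and resolved against kernel BTF at load time. These particular task kfuncs appeared around kernel 6.2 -- treat their availability as an assumption to verify on your target kernel:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

// kfuncs: kernel functions exposed to BPF, resolved via BTF.
extern struct task_struct *bpf_task_from_pid(s32 pid) __ksym;
extern void bpf_task_release(struct task_struct *p) __ksym;

SEC("fentry/vfs_open")  // a TRACING-type program, where task kfuncs are allowed
int BPF_PROG(on_open)
{
    u64 id = bpf_get_current_pid_tgid();  // helper: stable ABI

    struct task_struct *t = bpf_task_from_pid(id >> 32);  // kfunc
    if (t)
        bpf_task_release(t);  // acquired references must be released
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```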
Tail calls enable composability. bpf_tail_call() replaces the current program's execution context with another (up to 33 levels). Cilium uses this for composable policy: base -> L3 filter -> L4 filter -> L7 proxy, each a separate program.
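A sketch of the mechanism: a program array map holds file descriptors of other programs, and bpf_tail_call() jumps by index. Names and indices are illustrative, and user space must populate the slots:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 4);
    __type(key, u32);
    __type(value, u32);
} jump_table SEC(".maps");

SEC("xdp")
int l4_stage(struct xdp_md *ctx)
{
    bpf_printk("reached L4 stage");
    return XDP_PASS;
}

SEC("xdp")
int l3_stage(struct xdp_md *ctx)
{
    // On success this never returns: execution continues in whatever
    // program user space stored at index 1 (e.g., l4_stage above).
    bpf_tail_call(ctx, &jump_table, 1);
    return XDP_DROP;  // only reached if slot 1 is empty
}

char LICENSE[] SEC("license") = "GPL";
```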
Common Questions
How does eBPF differ from kernel modules?
Safety and portability. A buggy kernel module can panic the system. eBPF programs are verified before execution -- they cannot crash the kernel, access arbitrary memory, or loop forever. They also do not require kernel headers at runtime (with CO-RE). The tradeoff: eBPF programs have restricted capabilities (limited helpers, small stack, no sleeping in most program types).
What is the overhead?
For kprobe/tracepoint programs, typically less than 1% aggregate. Each invocation takes 50-200ns. XDP adds 50-100ns per packet. The bottleneck is usually ring buffer I/O for event streaming. Use per-CPU maps and sampling to manage it.
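The per-CPU pattern looks like this -- a sketch in which each CPU increments its own slot, so there is no atomic contention or cache-line bouncing; user space sums the slots when it reads:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} events SEC(".maps");

SEC("tracepoint/raw_syscalls/sys_enter")
int count_all(void *ctx)
{
    u32 key = 0;
    u64 *val = bpf_map_lookup_elem(&events, &key);
    if (val)
        (*val)++;  // safe without atomics: this slot is CPU-local
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```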
Can eBPF modify kernel behavior, or only observe?
Both. XDP drops, redirects, or modifies packets. TC rewrites headers. bpf_override_return() changes function return values for fault injection. LSM programs return allow/deny decisions. struct_ops programs replace kernel function pointer tables (e.g., TCP congestion control). But BPF programs cannot call arbitrary functions, allocate persistent memory, or sleep (except in sleepable types added in kernel 5.10).
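A sketch of the active side: an XDP program that drops IPv4 ICMP (ping) at the driver. Note the bounds checks -- the verifier rejects any packet access it cannot prove in range:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define ETH_P_IP     0x0800  /* from linux/if_ether.h */
#define IPPROTO_ICMP 1       /* from linux/in.h */

SEC("xdp")
int drop_icmp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)  // bounds check, or no load
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)   // bounds check, or no load
        return XDP_PASS;

    if (ip->protocol == IPPROTO_ICMP)
        return XDP_DROP;  // the packet never reaches the stack
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```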
How Technologies Use This
Service routing latency jumps from 0.5ms to 5ms as a Kubernetes cluster grows past 5000 services. kube-proxy becomes the bottleneck, and CPU usage on nodes climbs steadily with each new service added to the cluster.
The root cause is that kube-proxy uses iptables, and every service creates rules that the kernel evaluates linearly on each packet. At 10K services the rule count exceeds 40,000, making packet routing O(n) per packet where n is the number of services. Every packet traverses the entire rule chain before reaching its destination.
Cilium replaces kube-proxy entirely with eBPF programs attached to tc and XDP hooks that use hash-map lookups for O(1) service resolution regardless of cluster size. Load balancing, network policy enforcement, and transparent encryption all execute as verified BPF code in the kernel data path. The result is roughly 10x lower latency at scale and 40% less CPU consumed by packet processing compared to iptables.
Detecting container escape attempts in real time on production hosts is critical, but strace adds 10-100x overhead by stopping the process on every syscall, and loading a custom kernel module risks panicking the entire system on a single bug. Neither option is acceptable in production.
The fundamental dilemma is that syscall-level monitoring traditionally requires either stopping the target process (ptrace/strace) or running unverified code in kernel space (kernel modules). Both approaches trade safety for visibility, and on a production host running hundreds of containers, that tradeoff is unacceptable.
eBPF breaks this tradeoff entirely. Falco attaches eBPF programs to kprobes and tracepoints that observe every container's syscall activity at less than 1% CPU overhead. The verifier guarantees the monitoring code cannot crash the kernel, and BPF ring buffers stream 50,000+ security events per second to userspace for real-time alerting on suspicious behavior.
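The kernel side of that event stream, sketched with the BPF ring buffer (kernel 5.8+); the event struct is illustrative, and user space consumes it with libbpf's ring_buffer__poll():

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct event {
    u32 pid;
    char comm[16];
};

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 256 * 1024);  // bytes; power-of-two multiple of page size
} rb SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_execve")
int on_exec(void *ctx)
{
    struct event *e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0);
    if (!e)
        return 0;  // buffer full: drop the event rather than block

    e->pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&e->comm, sizeof(e->comm));
    bpf_ringbuf_submit(e, 0);  // visible to user space without a copy
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```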
Same Concept Across Tech
| Technology | How it uses eBPF | Key detail |
|---|---|---|
| Cilium | Kubernetes CNI plugin. Replaces iptables with eBPF for pod networking and network policy | Bypasses kube-proxy entirely |
| Falco | Runtime security monitoring. eBPF programs detect suspicious syscalls (exec, open, connect) | Alternative to kernel module-based Falco |
| Tetragon | Security observability. Traces process execution, file access, network calls via eBPF | Deeper than Falco, can enforce policies |
| bpftrace | One-liner tracing scripts (like awk for kernel tracing) | bpftrace -e 'kprobe:vfs_read { @[comm] = count(); }' |
| BCC | Python/C toolkit for writing eBPF programs. Includes 100+ ready-made tools | opensnoop, execsnoop, tcpconnect |
| Datadog/Grafana | APM agents use eBPF for network monitoring without application instrumentation | Zero-code instrumentation for services |
Stack layer mapping (production tracing):
| Layer | What to check | eBPF tool |
|---|---|---|
| Application | Which functions are slow? What files are opened? | funclatency, opensnoop |
| Syscall | Which syscalls dominate? Latency per syscall? | syscount, bpftrace tracepoints |
| Kernel | Which kernel functions take the longest? Lock contention? | funclatency on kernel functions |
| Network | TCP retransmits? Connection latency? Packet drops? | tcpretrans, tcpconnlat |
| Storage | I/O latency distribution? Which files are hottest? | biolatency, fileslower |
Design Rationale
Two options existed for extending kernel behavior, and both had fatal problems. Kernel modules run with full privileges -- one bug panics everything, unacceptable for production observability. Ptrace-based tracing (strace) stops the target on every event, imposing 10-100x overhead that rules out production use. eBPF broke this deadlock with a static verifier that proves safety before execution: the crash risk of modules vanishes, while in-kernel speed stays. The JIT compiler followed because interpreted bytecode was too slow for the networking fast path, where even 100ns per packet matters at millions of packets per second.
If You See This, Think This
| Symptom | Likely cause | eBPF tool to use |
|---|---|---|
| Mystery process opening sensitive files | Unexpected file access, potential security issue | opensnoop (filter output for /etc/shadow) |
| TCP retransmissions causing latency | Network congestion or misconfigured buffers | tcpretrans |
| Short-lived processes consuming CPU | Processes spawning and dying too fast for top to catch | execsnoop |
| Disk I/O latency outliers | Occasional slow disk requests mixed with fast ones | biolatency (shows histogram) |
| Unknown network connections from container | Container making unexpected outbound calls | tcpconnect with PID filtering |
| High system CPU but unclear which subsystem | Kernel functions consuming time not visible in user-space profilers | profile (eBPF-based CPU profiler) |
When to Use / Avoid
Use eBPF when:
- Tracing system behavior in production without restarts or meaningful overhead (bpftrace, BCC tools)
- High-performance packet filtering or load balancing (Cilium, XDP)
- Custom security policies beyond what AppArmor/SELinux provide (Falco, Tetragon)
- Profiling CPU, memory, or I/O at the kernel level without modifying application code
Avoid when:
- Simple tracing needs are met by strace or perf (no need for eBPF complexity)
- Running on kernels older than 4.15 (limited eBPF support)
- The problem can be solved in user space (eBPF adds complexity)
- Needing to modify data structures (eBPF is read-mostly, writing is restricted)
Try It Yourself
```bash
# List all loaded BPF programs
sudo bpftool prog list 2>/dev/null || echo 'bpftool not installed'

# List BPF maps
sudo bpftool map list 2>/dev/null || echo 'bpftool not installed'

# Trace all file opens system-wide using bpftrace
sudo bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }' 2>/dev/null & sleep 2; kill %1 2>/dev/null

# Check if BTF is available (required for CO-RE)
ls -la /sys/kernel/btf/vmlinux 2>/dev/null || echo 'BTF not available'

# Count syscalls per process using bpftrace
sudo bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }' 2>/dev/null & sleep 3; kill %1 2>/dev/null

# Show available BPF program types
sudo bpftool feature probe kernel 2>/dev/null | grep program_type | head -10
```
Debug Checklist
1. List loaded BPF programs: bpftool prog list
2. List BPF maps: bpftool map list
3. Check that the BPF filesystem is mounted: ls /sys/fs/bpf/
4. Quick one-liner tracing: bpftrace -e 'tracepoint:syscalls:sys_enter_openat { printf("%s %s\n", comm, str(args->filename)); }'
5. Check BPF-related kernel config: grep BPF /boot/config-$(uname -r)
6. Monitor BPF program overhead: bpftool prog show (check run_time_ns)
Key Takeaways
- ✓ The verifier is the gatekeeper. It tracks register states as a lattice (known value, bounded range, pointer type), walks every execution path, and prunes equivalent states to handle path explosion. Limit: ~1 million verified instructions.
- ✓ XDP processes packets before sk_buff allocation -- in the driver's receive path, right after DMA completion. Cloudflare uses XDP to drop DDoS traffic at line rate (~100M pps) before it ever reaches the network stack.
- ✓ Tail calls let one BPF program chain into another (up to 33 deep). Cilium uses this for composable network policy: base program -> L3 filter -> L4 filter -> L7 proxy, each a separate BPF program.
- ✓ BPF ring buffer (kernel 5.8+) replaces perf buffers for event streaming. A single ring shared across CPUs, variable-length records, no extra data copy. This is how modern observability tools get events out of the kernel efficiently.
- ✓ libbpf is the canonical library. It handles ELF parsing, CO-RE relocation, map creation, program loading, and attachment. libbpf-bootstrap provides skeleton code for new projects.
Common Pitfalls
- ✗ Mistake: Hardcoding kernel struct field offsets. Reality: This breaks on different kernel versions. Use CO-RE with BTF and __builtin_preserve_access_index for portable programs.
- ✗ Mistake: Using bpf_probe_read() for everything. Reality: Since kernel 5.5, use bpf_probe_read_kernel() or bpf_probe_read_user() explicitly. The generic version cannot distinguish pointer types on all architectures -- see the sketch after this list.
- ✗ Mistake: Trying to call arbitrary kernel functions. Reality: Only BPF helpers and kfuncs (marked with BTF_ID) are callable. The verifier rejects everything else.
- ✗ Mistake: Dereferencing bpf_map_lookup_elem() without a NULL check. Reality: The lookup returns NULL if the key does not exist. Skipping the NULL check is the most common verifier rejection for beginners.
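A sketch of the kernel/user split from the second pitfall: the openat filename pointer comes from the calling process, so it must be read with the _user variant:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_openat")
int on_openat(struct trace_event_raw_sys_enter *ctx)
{
    char path[64];
    // args[1] is the filename pointer the *user* process passed in.
    // Reading it with bpf_probe_read_kernel_str() would return junk
    // (or fail) on architectures with split address spaces.
    const char *upath = (const char *)ctx->args[1];
    bpf_probe_read_user_str(path, sizeof(path), upath);
    bpf_printk("openat: %s", path);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```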
Reference
In One Line
Verified programs running at native speed inside the kernel -- replacing modules for tracing, iptables for networking, and enabling security policies that used to require kernel patches.