XDP & AF_XDP: Kernel-Bypass Networking
Mental Model
A warehouse receiving dock. Normally, every truck gets unloaded, every package scanned, logged into inventory, and carried through the warehouse before anyone checks whether it was even ordered. Rejected packages have already consumed the full processing effort. Put an inspector at the dock gate instead: one glance at the shipping label, and a truck with the wrong address turns around before it even backs in. No unloading, no scanning, no warehouse space consumed. The dock stays clear for legitimate deliveries.
The Problem
A 5 Mpps SYN flood hits a Kafka broker and pegs CPU at 100% -- not because the traffic is being processed, but because the kernel allocates a 240-byte sk_buff for every attack packet, runs it through conntrack, processes TCP, and only then drops it. That is 3,000-5,000 cycles spent rejecting each packet. A Kubernetes cluster with 5,000 Services burns 20% of node CPU walking 20,000+ iptables rules linearly per packet, adding 5-10ms of latency per hop. At 14 Mpps -- 10 GbE line rate with small packets -- the kernel stack simply cannot keep up, and legitimate traffic starves.
Architecture
A packet arrives at the server. The normal path: the kernel allocates ~240 bytes of memory (sk_buff), walks through netfilter hooks, does a routing table lookup, runs through the protocol stack. Two thousand to five thousand CPU cycles. Per packet.
At 14 million packets per second (line rate on 10 GbE with small packets), that is an impossible amount of work for the kernel to keep up with. The server drowns in the overhead of simply touching each packet.
XDP short-circuits the entire thing. The eBPF program runs inside the NIC driver, before the kernel allocates a single byte. One hundred to five hundred cycles per packet. Drop rate: 10 million packets per second. On one core.
What Actually Happens
When a NIC receives a packet, the driver's NAPI poll function runs. Normally, it allocates an sk_buff, fills in the headers, and pushes the packet into the kernel's receive path.
With XDP attached, the driver first calls the XDP eBPF program. The program gets a lightweight xdp_buff structure -- just pointers to the raw packet data (start and end), the ingress interface, and the RX queue index. No sk_buff. No protocol parsing. No memory allocation.
The program inspects the packet -- parsing Ethernet, IP, and TCP headers manually -- and returns one of five actions:
XDP_DROP -- discard the packet immediately. The driver frees the DMA buffer. The packet never existed as far as the kernel is concerned. This is the fast path for DDoS mitigation.
XDP_PASS -- continue through the normal kernel stack. sk_buff gets allocated, netfilter runs, routing happens. Business as usual.
XDP_TX -- transmit the packet back out the same NIC. The program can rewrite headers first. This is how Facebook's Katran L4 load balancer works: rewrite destination MAC/IP and bounce the packet back out. Millions of connections per second, one server.
XDP_REDIRECT -- send the packet to another NIC, another CPU, or an AF_XDP socket. The Swiss Army knife of XDP.
XDP_ABORTED -- error path. Increments a trace counter.
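To make the return codes concrete, here is a minimal sketch of an XDP program in restricted C (compiled with clang to eBPF bytecode). It parses the Ethernet and IPv4 headers with the bounds checks the verifier demands and drops everything that is not TCP -- an illustrative policy chosen for brevity, not a recommendation:

```c
// xdp_drop_non_tcp.c -- minimal illustrative sketch, not a production filter
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_drop_non_tcp(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Every access must be bounds-checked, or the verifier rejects the program.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_ABORTED;               // malformed frame: error path

    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;                  // non-IPv4: let the kernel stack handle it

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_ABORTED;

    if (ip->protocol != IPPROTO_TCP)
        return XDP_DROP;                  // dropped before any sk_buff exists

    return XDP_PASS;                      // TCP continues up the normal stack
}

char _license[] SEC("license") = "GPL";
```

Built with something like `clang -O2 -g -target bpf -c xdp_drop_non_tcp.c -o xdp_drop_non_tcp.o`, then attached with the ip link or bpftool commands shown later in Try It Yourself.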
Under the Hood
Why XDP is safe enough for production. XDP programs are eBPF, which means the kernel's BPF verifier checks every program before loading. It proves the program can't crash the kernel, access memory out of bounds, or loop infinitely. Unlike kernel modules (which can panic the system), a verified eBPF program is as safe as user-space code with kernel-level performance.
Three operating modes. Native mode runs in the NIC driver's NAPI poll -- this is the fast path. Supported drivers include ixgbe, i40e, mlx5, bnxt, virtio_net, and veth. Generic mode (xdpgeneric) runs at netif_receive_skb() after sk_buff allocation -- it works on any NIC but is 5-10x slower because it doesn't avoid the expensive part. Offloaded mode programs the NIC hardware directly -- only Netronome SmartNICs support this, achieving line rate with zero CPU.
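The mode is selected at attach time. A sketch using libbpf's bpf_xdp_attach() (available in recent libbpf; older releases used bpf_set_link_xdp_fd()), with the interface name and object file as placeholders:

```c
// attach_xdp.c -- sketch of attaching an XDP object in native mode via libbpf
#include <bpf/libbpf.h>
#include <linux/if_link.h>
#include <net/if.h>
#include <stdio.h>

int main(void)
{
    struct bpf_object *obj = bpf_object__open_file("xdp_drop_non_tcp.o", NULL);
    if (!obj || bpf_object__load(obj)) {
        fprintf(stderr, "failed to open/load BPF object\n");
        return 1;
    }

    struct bpf_program *prog = bpf_object__find_program_by_name(obj, "xdp_drop_non_tcp");
    if (!prog)
        return 1;
    int prog_fd = bpf_program__fd(prog);
    int ifindex = if_nametoindex("eth0");   // placeholder interface

    // XDP_FLAGS_DRV_MODE = native, XDP_FLAGS_SKB_MODE = generic, XDP_FLAGS_HW_MODE = offload
    if (bpf_xdp_attach(ifindex, prog_fd, XDP_FLAGS_DRV_MODE, NULL) < 0) {
        fprintf(stderr, "native attach failed -- driver may lack XDP support\n");
        return 1;
    }
    return 0;
}
```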
BPF maps for state. XDP programs are stateless per packet. All persistent state lives in BPF maps: hash maps for IP blocklists, per-CPU arrays for counters, LRU hash maps for rate limiting, bloom filters for set membership. User-space control programs create and update maps; the XDP program reads them via bpf_map_lookup_elem(). Updates are atomic.
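A hedged sketch of what that looks like on the XDP side: a hash map keyed by IPv4 source address acting as a blocklist, with a per-entry drop counter. The map name, key layout, and sizing are illustrative:

```c
// xdp_blocklist.c -- illustrative sketch: drop packets whose source IP is in a BPF hash map
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1000000);
    __type(key, __u32);    // IPv4 source address, network byte order
    __type(value, __u64);  // packets dropped from this address
} blocklist SEC(".maps");

SEC("xdp")
int xdp_blocklist(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    // One O(1) lookup per packet; user space owns inserts and deletes.
    __u64 *hits = bpf_map_lookup_elem(&blocklist, &ip->saddr);
    if (hits) {
        __sync_fetch_and_add(hits, 1);    // atomic per-entry drop counter
        return XDP_DROP;
    }
    return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
```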
AF_XDP: DPDK performance without leaving the kernel. AF_XDP creates a shared memory region (UMEM) between kernel and user space. The NIC DMAs packets directly into UMEM frames. The XDP program redirects them to the AF_XDP socket. User space reads frame descriptors from the receive ring and accesses packet data directly in shared memory. Zero copies, end to end.
Four ring buffers manage this: fill ring (user gives empty frames to kernel), receive ring (kernel gives full frames to user), transmit ring (user gives frames to send), completion ring (kernel confirms sent frames). On 25+ Gbps NICs, this reaches line rate.
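A sketch of the user-space receive loop using the xsk helpers from libxdp/libbpf. UMEM and socket setup via xsk_umem__create() and xsk_socket__create() are assumed to have happened already; header paths and exact behavior vary between libbpf and libxdp versions:

```c
// af_xdp_rx.c -- sketch of the AF_XDP receive path (UMEM/socket setup assumed done elsewhere)
#include <xdp/xsk.h>        // <bpf/xsk.h> in older libbpf releases
#include <stdint.h>
#include <stdio.h>

#define BATCH 64

void rx_loop(struct xsk_ring_cons *rx, struct xsk_ring_prod *fill, void *umem_area)
{
    for (;;) {
        uint32_t idx_rx = 0, idx_fill = 0;

        // RX ring: the kernel hands us descriptors for frames the NIC DMA'd into UMEM.
        uint32_t rcvd = xsk_ring_cons__peek(rx, BATCH, &idx_rx);
        if (!rcvd)
            continue;                        // a real app would poll() instead of spinning

        // Fill ring: reserve slots so these frames can be handed back for reuse
        // (a real app must handle a partial reservation).
        xsk_ring_prod__reserve(fill, rcvd, &idx_fill);

        for (uint32_t i = 0; i < rcvd; i++) {
            const struct xdp_desc *desc = xsk_ring_cons__rx_desc(rx, idx_rx + i);
            uint8_t *pkt = xsk_umem__get_data(umem_area, desc->addr);

            // Packet bytes are read in place -- no copy out of UMEM.
            printf("got %u-byte packet, first byte 0x%02x\n", desc->len, pkt[0]);

            // Recycle the frame by putting its address back on the fill ring.
            *xsk_ring_prod__fill_addr(fill, idx_fill + i) = desc->addr;
        }

        xsk_ring_cons__release(rx, rcvd);    // done with these RX descriptors
        xsk_ring_prod__submit(fill, rcvd);   // hand the empty frames back to the kernel
    }
}
```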
XDP vs DPDK. DPDK completely bypasses the kernel -- binds the NIC to a user-space driver, polls in a busy loop, processes packets entirely in user space. Lowest possible per-packet latency (~1 microsecond). But the NIC disappears from the kernel: no ifconfig, no tcpdump, no kernel routing. XDP runs in the kernel, coexists with the normal stack, and uses standard tools. The tradeoff: executing the eBPF program in the driver adds roughly 50 cycles per packet (the verifier itself runs only once, at load time), and eBPF has constraints (limited stack, no unbounded loops).
Common Questions
How does Cloudflare use XDP for DDoS mitigation?
XDP programs on every NIC RX queue inspect packet headers against BPF hash maps of attack signatures. Matching packets return XDP_DROP at the driver level -- before sk_buff allocation. One server drops 10+ million attack packets per second per core. BPF maps are updated in real time as the control plane detects new patterns.
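The control-plane side is ordinary user-space code. A sketch that adds one attacker IP to a pinned blocklist map, assuming the loader pinned the map from the earlier sketch at a hypothetical bpffs path:

```c
// block_ip.c -- sketch: add a source IP to a pinned XDP blocklist map from user space
#include <bpf/bpf.h>
#include <linux/types.h>
#include <arpa/inet.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <ipv4-address>\n", argv[0]);
        return 1;
    }

    // Hypothetical pin path; must match wherever the loader pinned the map.
    int map_fd = bpf_obj_get("/sys/fs/bpf/xdp/blocklist");
    if (map_fd < 0) {
        perror("bpf_obj_get");
        return 1;
    }

    __u32 addr;
    if (inet_pton(AF_INET, argv[1], &addr) != 1) {
        fprintf(stderr, "bad address\n");
        return 1;
    }

    __u64 zero = 0;
    // Atomic from the XDP program's point of view: it either sees the entry or it doesn't.
    if (bpf_map_update_elem(map_fd, &addr, &zero, BPF_ANY) < 0) {
        perror("bpf_map_update_elem");
        return 1;
    }
    return 0;
}
```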
How does AF_XDP achieve zero-copy?
The NIC DMAs directly into UMEM. The XDP program redirects the packet to the AF_XDP socket, placing a frame descriptor (address + length) on the receive ring. User space reads the descriptor and accesses packet data in shared memory. No copy. For transmit, user space writes into a UMEM frame and puts the descriptor on the TX ring. The NIC DMAs from UMEM to the wire. End-to-end: zero CPU copies.
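The transmit side mirrors the receive side. A sketch, again assuming UMEM and socket setup happened elsewhere and that the frame to send already sits in UMEM:

```c
// af_xdp_tx.c -- sketch of the AF_XDP transmit path (setup assumed done elsewhere)
#include <xdp/xsk.h>        // <bpf/xsk.h> in older libbpf releases
#include <sys/socket.h>
#include <stdint.h>

// Send one frame that already sits in UMEM at 'addr' with 'len' valid bytes.
int send_frame(struct xsk_socket *xsk, struct xsk_ring_prod *tx,
               struct xsk_ring_cons *comp, uint64_t addr, uint32_t len)
{
    uint32_t idx;
    if (xsk_ring_prod__reserve(tx, 1, &idx) != 1)
        return -1;                               // TX ring full

    struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);
    desc->addr = addr;                           // frame lives in UMEM -- no copy
    desc->len  = len;
    xsk_ring_prod__submit(tx, 1);

    // Kick the kernel so the driver picks up the descriptor (needed unless busy-polling).
    sendto(xsk_socket__fd(xsk), NULL, 0, MSG_DONTWAIT, NULL, 0);

    // Completion ring: the kernel returns the frame address once the NIC has sent it.
    uint32_t idx_comp;
    uint32_t done = xsk_ring_cons__peek(comp, 1, &idx_comp);
    if (done) {
        uint64_t completed = *xsk_ring_cons__comp_addr(comp, idx_comp);
        (void)completed;                         // this frame can now be reused
        xsk_ring_cons__release(comp, done);
    }
    return 0;
}
```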
Why did Facebook build Katran?
Traditional L4 load balancers (IPVS) operate after sk_buff allocation and netfilter. At Facebook's scale (billions of connections), that overhead is too much. Katran uses XDP_TX: look up destination in a BPF map, rewrite headers, bounce the packet back out the NIC. No sk_buff. No conntrack. No iptables. 10M+ pps per server.
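A bare-bones sketch of the XDP_TX pattern -- not Katran's actual code, which also rewrites IPs, handles encapsulation, and maintains consistent hashing -- just enough to show a map lookup followed by a header rewrite and bounce:

```c
// xdp_tx_bounce.c -- illustrative XDP_TX sketch; real L4 LBs do far more
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <bpf/bpf_helpers.h>

struct backend {
    unsigned char mac[ETH_ALEN];   // next-hop MAC of the chosen backend
};

struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);        // a single hard-wired backend for the sketch
    __type(key, __u32);
    __type(value, struct backend);
} backends SEC(".maps");

SEC("xdp")
int xdp_tx_bounce(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_ABORTED;

    __u32 key = 0;
    struct backend *be = bpf_map_lookup_elem(&backends, &key);
    if (!be)
        return XDP_PASS;           // no backend configured: fall through to the kernel

    // Rewrite the destination MAC and bounce the frame back out the same NIC.
    __builtin_memcpy(eth->h_dest, be->mac, ETH_ALEN);
    return XDP_TX;
}

char _license[] SEC("license") = "GPL";
```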
What are the limitations of XDP programs?
(1) 1 million verified instruction limit, 512-byte stack. (2) No unbounded loops -- the verifier must prove termination. (3) No dynamic memory allocation -- use pre-sized BPF maps. (4) Limited helper functions -- no arbitrary kernel calls. (5) Originally single-buffer only -- multi-buffer support arrived in kernel 6.0. (6) No sk_buff fields -- no timestamps, no conntrack, no socket info. Only raw packet data.
How Technologies Use This
A Kubernetes cluster with 5,000 Services shows 20% of node CPU consumed by packet processing alone. Every packet walks through 20,000+ iptables rules linearly, adding 5-10ms of latency per hop. In Meta-scale clusters, the iptables overhead means running five nodes for a workload that should fit on three.
The kube-proxy iptables mode creates O(n) rules per Service. Each packet triggers a linear walk through the entire chain to find the right DNAT target. The cost grows linearly with the number of services, and every packet pays this cost regardless of its destination. The overhead happens after sk_buff allocation, after netfilter hooks, deep in the kernel stack where it is most expensive.
Cilium replaces kube-proxy entirely by attaching XDP programs at the host NIC. Service destinations are resolved via O(1) BPF hash map lookups inside the NIC driver, before sk_buff allocation, before netfilter, before routing. The result is 2-5x lower service routing latency and 30-50% less CPU spent on packet processing compared to iptables mode.
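A hedged sketch of the data structure behind that claim -- a composite key of virtual IP, port, and protocol mapping to a backend in a single hash lookup. This is illustrative only and not Cilium's actual map layout:

```c
// service_lookup.c -- hedged sketch of O(1) service routing in XDP
// (illustrative only; Cilium's real maps and datapath are more involved)
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct svc_key {
    __be32 vip;      // Service ClusterIP
    __be16 port;     // Service port
    __u8   proto;    // IPPROTO_TCP / IPPROTO_UDP
    __u8   pad;
};

struct svc_backend {
    __be32 ip;       // chosen backend pod IP
    __be16 port;
    __u16  pad;
};

struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 65536);
    __type(key, struct svc_key);
    __type(value, struct svc_backend);
} services SEC(".maps");

// Called from the XDP program after parsing headers: one hash lookup instead of
// a linear walk through thousands of iptables rules.
static __always_inline struct svc_backend *lookup_service(__be32 vip, __be16 port, __u8 proto)
{
    struct svc_key key = { .vip = vip, .port = port, .proto = proto, .pad = 0 };
    return bpf_map_lookup_elem(&services, &key);
}
```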
A Kafka broker under a 5 million packets per second SYN flood becomes unresponsive to legitimate producers and consumers. CPU is pegged at 100% even though the attack traffic is ultimately dropped. Real Kafka connections starve for resources and begin timing out.
The kernel allocates an sk_buff (240 bytes) for each attack packet, runs it through conntrack, processes the TCP handshake, and only then drops it. Each dropped packet costs 3,000-5,000 CPU cycles through the full kernel stack. At 5 million packets per second, the overhead of merely rejecting attack traffic saturates the CPU, leaving nothing for legitimate Kafka operations.
XDP programs attached to the broker NICs short-circuit this entirely. Malformed packets and unauthorized source IPs are dropped inside the NIC driver, before sk_buff allocation, before conntrack, before any TCP processing. Each XDP drop costs roughly 200 CPU cycles. A single core can drop 10 million attack packets per second via XDP while legitimate Kafka traffic flows through unaffected.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Packet filtering | Host XDP protects all containers on the NIC | N/A (XDP is kernel/NIC level) | N/A (XDP is kernel/NIC level) | AF_XDP sockets via golang.org/x/sys/unix | Cilium replaces kube-proxy with XDP + BPF maps |
| DDoS defense | Host XDP drops attack traffic before container network | N/A | N/A | N/A | XDP on node NICs protects entire cluster |
| Load balancing | N/A (Docker uses iptables DNAT) | N/A | N/A | Katran-style XDP_TX LB in Go control plane | Cilium XDP L4 LB replaces kube-proxy |
| Zero-copy I/O | N/A | N/A | N/A | AF_XDP UMEM for high-speed packet capture | N/A (handled at node level) |
Stack Layer Mapping
| Layer | Component |
|---|---|
| NIC hardware | DMA ring buffers, optional XDP offload (Netronome) |
| NIC driver | NAPI poll -- XDP hook runs here (native mode) |
| XDP program | eBPF bytecode: parse headers, lookup BPF maps, return action |
| AF_XDP | UMEM shared memory + fill/RX/TX/completion rings |
| Kernel stack | Only reached on XDP_PASS; sk_buff alloc, netfilter, routing |
| Userspace | bpftool for management, AF_XDP sockets for packet I/O |
Design Rationale: sk_buff allocation and netfilter processing are the two most expensive per-packet operations in the kernel, so XDP hooks in before both of them -- the earliest possible point where code can run. eBPF verification is what makes this production-safe: unlike kernel modules, a verified eBPF program cannot crash the system or loop forever. AF_XDP splits the difference between XDP and DPDK -- DPDK-class performance, but the NIC remains visible to ip, tcpdump, and every other standard kernel tool.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| CPU 100% during DDoS but all traffic is dropped | Kernel processing attack packets through full stack before drop | Attach XDP_DROP program; check ip link show \| grep xdp |
| XDP attached but performance matches iptables | Running in generic mode instead of native | ip link show dev eth0 -- look for xdpgeneric vs xdpdrv |
| XDP_ABORTED counter incrementing | BPF program returning error (bounds check failure) | bpftool prog show id <ID> and check trace_pipe |
| AF_XDP ring stalls, packets dropped | Fill ring empty -- user space not returning frames fast enough | Check fill ring occupancy and processing loop |
| XDP program load rejected by verifier | Unbounded loop, out-of-bounds access, or stack overflow | Read verifier output from bpf() syscall error |
| Jumbo frame packets bypass XDP | Multi-buffer XDP not supported (pre-kernel 6.0) | Check kernel version; use standard MTU or upgrade |
When to Use / Avoid
- Use when DDoS mitigation requires dropping millions of packets per second without CPU saturation
- Use when replacing iptables/kube-proxy for O(1) service routing in large Kubernetes clusters
- Use when building L4 load balancers that need to rewrite headers and bounce packets at line rate (XDP_TX)
- Use AF_XDP when user-space packet processing needs DPDK-like performance without kernel bypass
- Avoid when full conntrack/netfilter state is needed -- XDP has no access to sk_buff or connection tracking
- Avoid when the NIC driver lacks native XDP support -- generic mode is 5-10x slower and defeats the purpose
Try It Yourself
# Check if NIC driver supports native XDP
ethtool -i eth0 | grep driver
# Drivers with native XDP support: ixgbe, i40e, mlx5, virtio_net, etc.

# Load an XDP program (native mode)
ip link set dev eth0 xdpdrv obj xdp_drop.o sec xdp

# Verify XDP is attached
ip link show dev eth0 | grep xdp

# Show all loaded BPF programs
bpftool prog list

# Inspect XDP program stats
bpftool prog show id 42

# Remove XDP program
ip link set dev eth0 xdpdrv off

Debug Checklist
- ip link show dev eth0 | grep xdp
- bpftool prog list
- bpftool net show
- bpftool map dump id <map_id>
- ethtool -i eth0 | grep driver
- cat /sys/kernel/debug/tracing/trace_pipe  # for XDP_ABORTED traces
Key Takeaways
- ✓ XDP runs at the earliest possible point in the stack: inside the NIC driver's NAPI poll, before sk_buff allocation, before netfilter, before routing. This skips 90% of per-packet kernel overhead, enabling 10M+ packets/sec drop rates on a single core.
- ✓ XDP programs are eBPF -- verified by the kernel to be safe. They cannot crash the kernel, access arbitrary memory, or loop infinitely. This is what makes XDP production-ready, unlike kernel modules, which can take down the entire system.
- ✓ Three modes: native (in the NIC driver, fastest, needs driver support), generic (after sk_buff allocation, works everywhere, 5-10x slower), and offloaded (on NIC hardware, only Netronome SmartNICs, zero CPU).
- ✓ AF_XDP delivers DPDK-like performance without leaving the kernel. The NIC DMAs packets into shared UMEM frames, the XDP program redirects them to the AF_XDP socket, and user space reads from the ring buffer. Zero copies. Line rate on 25+ Gbps NICs.
- ✓ XDP vs DPDK: XDP coexists with the kernel stack (XDP_PASS falls through), doesn't require dedicated NICs, and uses standard tools (ip, ethtool). DPDK has lower per-packet latency but takes over the NIC entirely. Choose XDP unless you need sub-microsecond per-packet latency.
Common Pitfalls
- ✗ Mistake: expecting sk_buff fields in XDP programs. Reality: XDP operates on raw packet data (xdp_buff), not sk_buff. No socket info, no conntrack state, no pre-parsed headers. The BPF program must parse everything manually.
- ✗ Mistake: using jumbo frames without multi-buffer support. Reality: XDP originally only handled single-buffer packets. Multi-buffer support arrived in kernel 6.0. Without it, jumbo frames fall back to generic (slow) processing.
- ✗ Mistake: using generic XDP and expecting native performance. Reality: generic XDP runs after sk_buff allocation. It doesn't skip the expensive part. Performance is 5-10x worse than native. Always use native mode with a supported driver.
- ✗ Mistake: not pinning AF_XDP threads to the correct CPU/RX-queue. Reality: XDP programs run on the CPU handling the NIC's RX queue interrupt. The AF_XDP socket must be bound to the same queue. Without CPU pinning, cross-CPU access to shared maps causes cache bouncing.
Reference
In One Line
Move packet decisions into the NIC driver via XDP and the cost drops from 3,000-5,000 cycles per packet to 100-500 -- a single core can drop 10M pps.