XDP (eXpress Data Path)
Why It Exists
The Linux kernel networking stack is great for general-purpose networking. It handles TCP, UDP, routing, firewalling, connection tracking, and a hundred other things. But all that flexibility comes with overhead. Every packet that arrives at the NIC goes through a long journey: driver, socket buffer allocation, protocol processing, netfilter hooks, routing lookup, and eventually reaches the application through a syscall.
For most workloads, this is fine. For high packet-per-second workloads (telemetry pipelines, DDoS mitigation, load balancers, packet brokers), the kernel stack becomes the bottleneck. Not the NIC. Not the wire. The kernel's own processing overhead per packet.
The numbers tell the story. A typical telemetry data point, a metric sample or trace span, is under 1KB after serialization. Small packets are the worst case for kernel networking because the per-packet overhead dominates. The CPU spends more time managing each packet than actually transmitting it. A standard Linux box tops out around 1M small packets/sec through the normal send() path.
XDP solves this by running an eBPF program at the NIC driver level, before packets enter the kernel networking stack. Packets are intercepted at the earliest possible point, and the program decides what to do with each one: pass it through to the stack, drop it, bounce it back out the NIC, or redirect it to another interface, CPU, or userspace socket. The kernel stack never sees the packets handled in XDP.
How It Works
When a packet arrives at the NIC, it normally goes through this path:
NIC → Driver → Allocate SKB → Netfilter → Routing → TCP/IP → Socket → Userspace App
With XDP attached, the packet hits the eBPF program right after the driver, before any of the expensive steps:
NIC → Driver → XDP eBPF Program → (one of four actions)
The four XDP actions:
- XDP_PASS lets the packet continue through the normal kernel stack. The program inspected it and decided it is fine.
- XDP_DROP silently drops the packet. The kernel never allocates memory for it. This is why XDP is so effective for DDoS mitigation. Cloudflare drops millions of malicious packets per second this way.
- XDP_TX bounces the packet back out the same NIC it came in on. Useful for implementing a load balancer where the response goes back the same way.
- XDP_REDIRECT sends the packet to a different NIC, a different CPU, or an AF_XDP socket in userspace. This is how high-performance packet forwarding is built.
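As a concrete illustration, here is a minimal sketch of an XDP program (C, compiled with clang and loaded via libbpf; the file name and port number are illustrative, not from the text above) that drops UDP packets destined to one port and passes everything else:

```c
// xdp_filter.bpf.c — minimal sketch: drop UDP to an example port, pass the rest
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/in.h>
#include <linux/udp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

SEC("xdp")
int xdp_filter(struct xdp_md *ctx)
{
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    // Every header access must be bounds-checked or the verifier rejects the program.
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end || eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end || ip->protocol != IPPROTO_UDP)
        return XDP_PASS;

    struct udphdr *udp = (void *)ip + ip->ihl * 4;
    if ((void *)(udp + 1) > data_end)
        return XDP_PASS;

    if (udp->dest == bpf_htons(9999))   // example port, purely illustrative
        return XDP_DROP;                // dropped before any SKB is allocated

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

The same skeleton extends to the other actions: rewrite the MAC addresses and return XDP_TX to bounce the packet, or return the result of bpf_redirect()/bpf_redirect_map() to steer it elsewhere.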
Real Example: Telemetry Fast Path
An observability agent on a production server collects 5,000 metric samples per second via eBPF. Without XDP, shipping those metrics to the collector goes through the full kernel stack: 5,000 send() syscalls, 5,000 socket buffer allocations, 5,000 trips through TCP/IP processing.
With XDP, the agent batches metrics into packets and XDP redirects them to the NIC at driver level. The kernel networking stack is never involved. The same box that struggled at 1M packets/sec now handles 5-10M.
eBPF (collects metrics in kernel) → Agent batches into packets → XDP redirects at NIC driver → Wire → Collector
The agent does not need to be rewritten. XDP sits below it, accelerating the packet path transparently.
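On the collector's receive side, the usual pattern for this is XDP_REDIRECT into AF_XDP sockets. A minimal sketch, assuming one AF_XDP socket bound per RX queue and illustrative map and program names:

```c
// xsk_redirect.bpf.c — sketch: steer frames to AF_XDP sockets, bypassing the stack
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
    __uint(type, BPF_MAP_TYPE_XSKMAP);
    __uint(max_entries, 64);            // one slot per RX queue (assumed upper bound)
    __type(key, __u32);
    __type(value, __u32);
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_redirect_xsk(struct xdp_md *ctx)
{
    __u32 queue = ctx->rx_queue_index;

    // If userspace bound an AF_XDP socket to this queue, hand the frame straight
    // to it; otherwise fall back to the normal kernel stack.
    if (bpf_map_lookup_elem(&xsks_map, &queue))
        return bpf_redirect_map(&xsks_map, queue, XDP_PASS);

    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

The collector process then consumes raw frames directly from the socket's RX ring, with no SKB allocation or protocol processing in between.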
Three Execution Modes
| Mode | How It Works | Performance | Requirement |
|---|---|---|---|
| Native | XDP runs inside the NIC driver itself | Best. 5-10M pps/core. | NIC driver must support XDP (most modern drivers do: i40e, mlx5, ixgbe, virtio-net) |
| Offloaded | XDP runs on the NIC hardware (SmartNIC) | Best possible. Zero CPU. | SmartNIC with eBPF offload support (Netronome, some Mellanox) |
| Generic | XDP runs after SKB allocation, faking the early hook | Slow. Defeats the purpose. Only useful for testing. | Any NIC, but the performance benefit is lost |
Always deploy in native mode. If the NIC does not support native XDP, upgrade the NIC before deploying. Generic mode is a trap.
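One way to enforce this (a loader sketch using libbpf 1.0+; the interface and object-file arguments are illustrative) is to attach with XDP_FLAGS_DRV_MODE, so the kernel returns an error instead of silently falling back to generic mode:

```c
// attach_native.c — loader sketch: insist on native (driver-mode) XDP
#include <stdio.h>
#include <net/if.h>
#include <bpf/libbpf.h>
#include <linux/if_link.h>

int main(int argc, char **argv)
{
    if (argc < 3) {
        fprintf(stderr, "usage: %s <ifname> <prog.o>\n", argv[0]);
        return 1;
    }

    int ifindex = if_nametoindex(argv[1]);
    struct bpf_object *obj = bpf_object__open_file(argv[2], NULL);
    if (!obj || bpf_object__load(obj)) {
        fprintf(stderr, "failed to open/load %s\n", argv[2]);
        return 1;
    }

    struct bpf_program *prog = bpf_object__next_program(obj, NULL);
    int prog_fd = bpf_program__fd(prog);

    // XDP_FLAGS_DRV_MODE: fail loudly if the driver lacks native XDP support,
    // rather than degrading to generic mode behind our back.
    if (bpf_xdp_attach(ifindex, prog_fd, XDP_FLAGS_DRV_MODE, NULL) < 0) {
        fprintf(stderr, "no native XDP on %s, refusing to run\n", argv[1]);
        return 1;
    }
    return 0;
}
```

The same check doubles as the startup guard described under Failure Scenarios below.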
Production Considerations
- Program size limits. eBPF verifier enforces a maximum instruction count (currently 1M instructions). Keep XDP programs small and focused. For complex logic, redirect the packet to userspace via AF_XDP and process it there.
- Maps for state. XDP programs use eBPF maps (hash tables, arrays, ring buffers) to share state between the XDP program and userspace. Use per-CPU maps to avoid lock contention at high packet rates (see the sketch after this list).
- Testing. Use xdp-tools and bpftool to test XDP programs. Load them in generic mode first to validate correctness, then switch to native mode for production performance.
- Monitoring. Track xdp_actions counters (pass/drop/tx/redirect) and xdp_errors. A sudden spike in errors means the program is hitting edge cases.
- Kernel version. XDP has been in mainline Linux since 4.8, but features like XDP_REDIRECT and AF_XDP require 4.18+. Use 5.10+ for the best experience.
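A minimal sketch of the per-CPU map pattern from the list above (the map name and five-slot layout are illustrative; slots are indexed by the xdp_action value):

```c
// xdp_stats.bpf.c — sketch: lock-free per-CPU counters shared with userspace
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// One slot per xdp_action value (XDP_ABORTED=0 .. XDP_REDIRECT=4); a
// PERCPU_ARRAY gives every CPU its own copy, so the hot path never contends.
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 5);
    __type(key, __u32);
    __type(value, __u64);
} action_count SEC(".maps");

static __always_inline int count_and_return(__u32 action)
{
    __u64 *val = bpf_map_lookup_elem(&action_count, &action);
    if (val)
        (*val)++;            // per-CPU slot, so no atomic or lock is needed
    return action;
}

SEC("xdp")
int xdp_stats(struct xdp_md *ctx)
{
    // Real filtering logic would go here; this sketch just counts and passes.
    return count_and_return(XDP_PASS);
}

char LICENSE[] SEC("license") = "GPL";
```

Userspace reads the same map and sums the per-CPU slots to produce the pass/drop/tx/redirect counters mentioned under Monitoring.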
Failure Scenarios
Scenario 1: XDP Program Bug Drops Valid Traffic. An XDP program has a logic error in its filtering rules. Instead of dropping malicious packets, it drops 10% of legitimate traffic. Users see packet loss and connection timeouts. Nobody suspects XDP because the application and kernel metrics look fine. Detection: monitor xdp_drop counters alongside application error rates; if drops increase while application traffic is normal, the XDP program is the culprit. Recovery: detach the XDP program (ip link set dev eth0 xdp off) and traffic immediately flows normally again. Prevention: extensive testing with production traffic replays before deployment.
Scenario 2: NIC Driver Does Not Support Native XDP. An XDP program is deployed and gets native mode... on the development box. In production, the NIC driver does not have XDP support, so the program silently falls back to generic mode. Performance is no better than the regular stack, but nobody notices because only functionality was checked, not throughput. The telemetry pipeline falls behind during peak. Detection: check ip link show; a generic-mode attachment is reported as xdpgeneric rather than xdp. Prevention: verify native XDP support on production NICs before deployment, and add a startup check that refuses to run in generic mode.
Scenario 3: eBPF Map Size Exhaustion. An XDP program uses a hash map to track per-connection state. The map has a fixed max_entries of 100K. During a traffic spike, connections exceed 100K and new insertions fail silently. The XDP program cannot look up state for new connections and falls through to XDP_PASS, bypassing the filtering logic. Detection: monitor map update and lookup failure counters exported by the program. Prevention: size maps for peak traffic, not average, or use LRU maps that evict old entries automatically.
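A sketch of the LRU prevention (type names and the size are illustrative): with BPF_MAP_TYPE_LRU_HASH, inserting into a full map evicts the least recently used entry instead of failing, so the filter keeps state for the newest connections during a spike.

```c
// flow_state.bpf.c — sketch: LRU hash for per-connection state
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct flow_key {
    __u32 saddr, daddr;
    __u16 sport, dport;
};

struct flow_state {
    __u64 packets;
    __u64 last_seen_ns;
};

struct {
    __uint(type, BPF_MAP_TYPE_LRU_HASH);
    __uint(max_entries, 1 << 20);     // sized for peak traffic, not the average
    __type(key, struct flow_key);
    __type(value, struct flow_state);
} flows SEC(".maps");
```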
Capacity Planning
| Metric | Native XDP | Generic XDP | No XDP (kernel stack) |
|---|---|---|---|
| Packets/sec per core | 5-10M | 1-2M | 0.5-1M |
| Latency per packet | 1-5 μs | 10-20 μs | 20-50 μs |
| CPU overhead | Minimal (in-line processing) | Moderate | High (full stack traversal) |
| Memory per packet | Near zero (no SKB) | Full SKB allocation | Full SKB allocation |
Real-world reference numbers: Cloudflare handles 10M+ packets/sec of DDoS traffic per server using XDP. Facebook's Katran load balancer serves billions of requests per day across their fleet using XDP. Cilium (the Kubernetes CNI) uses XDP to accelerate its eBPF-based load balancing, keeping packet forwarding close to wire speed.
Sizing formula for telemetry pipelines: required_cores = (total_packets_per_sec / 6M). A 3,000-node fleet generating 5,000 telemetry packets/sec per node = 15M packets/sec total. With XDP on the collector, about 3 cores are needed for packet processing. Without XDP, 15+ cores are needed. That is a 5x reduction in CPU for the same throughput.
Architecture Decision Record
ADR: When to Use XDP vs tc/BPF vs iptables
Context: Packets need to be processed at high speed. Three options exist at different layers of the Linux networking stack.
| Criteria (Weight) | XDP | tc/BPF | iptables/nftables |
|---|---|---|---|
| Packet rate (30%) | 5-10M pps | 2-4M pps | 0.5-1M pps |
| Ease of use (20%) | Medium (eBPF required) | Medium (eBPF required) | Easy (rule syntax) |
| Stateful processing (20%) | Limited (eBPF maps) | Better (after stack) | Full (conntrack) |
| Feature richness (15%) | Minimal (4 actions) | Moderate | Rich (NAT, mangle, etc.) |
| Kernel version (15%) | 4.8+ (5.10+ ideal) | 4.1+ | 2.4+ |
Decision framework:
- Less than 1M pps with NAT, conntrack, or complex rules needed. Use iptables/nftables. No reason to add eBPF complexity.
- 1-5M pps with classification, shaping, or post-stack processing needed. Use tc/BPF. It runs after SKB allocation, providing access to parsed protocol headers.
- Over 5M pps or need to drop/redirect packets before the kernel stack. Use XDP. DDoS mitigation, high-throughput forwarding, telemetry fast path.
- Over 20M pps or full kernel bypass is needed. XDP is not enough. Look at DPDK.
Key Points
- Processes packets at the NIC driver level before the kernel networking stack even sees them
- Runs as an eBPF program, using the same toolchain and deployment model teams already know
- No dedicated CPU cores needed. Lightweight enough to run on every production server
- 5-10M packets/sec per core vs roughly 1M with normal send() syscalls
- Used by Cloudflare for DDoS mitigation, Facebook for load balancing, and Cilium for Kubernetes networking
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| XDP + eBPF | Open Source | Packet filtering, forwarding, and sampling at NIC driver level | Medium-Enterprise |
| tc/BPF | Open Source | Traffic shaping and classification after the kernel stack | Small-Enterprise |
| iptables / nftables | Open Source | Traditional firewall rules, simpler setups | Small-Medium |
| AF_XDP | Open Source | Zero-copy packet delivery from NIC to userspace applications | Medium-Enterprise |
Common Mistakes
- Assuming XDP bypasses the kernel entirely. It does not. It hooks into the NIC driver inside the kernel, just before the networking stack. Full kernel bypass is DPDK territory.
- Writing complex stateful logic in XDP programs. eBPF programs have size limits and restricted loops. Keep XDP programs simple: filter, forward, or redirect. Do complex processing in userspace.
- Not checking NIC driver support. XDP works best in native mode (driver support). Generic mode (no driver support) is much slower and defeats the purpose.
- Forgetting that XDP runs per-packet. At 10M packets/sec, even a small per-packet overhead adds up fast. Profile XDP programs.
- Deploying without a fallback. If the XDP program crashes or has a bug, packets get dropped. Always have a health check that detaches the program if it misbehaves.
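A minimal sketch of such a fallback (libbpf assumed; the interface name, threshold, and drop-ratio source are placeholders):

```c
// xdp_watchdog.c — hypothetical fallback sketch: detach the XDP program when
// its drop behaviour looks wrong, so traffic falls back to the normal stack
#include <stdio.h>
#include <unistd.h>
#include <net/if.h>
#include <bpf/libbpf.h>
#include <linux/if_link.h>

// Placeholder: a real deployment would sum the program's per-CPU drop counter
// map and compare it against the packet volume the application expects.
static double read_drop_ratio(void)
{
    return 0.0;
}

int main(void)
{
    int ifindex = if_nametoindex("eth0");    // assumed interface name

    for (;;) {
        if (read_drop_ratio() > 0.10) {      // assumed threshold: >10% drops
            fprintf(stderr, "drop ratio anomalous, detaching XDP program\n");
            // Equivalent to `ip link set dev eth0 xdp off` for a native-mode program.
            return bpf_xdp_detach(ifindex, XDP_FLAGS_DRV_MODE, NULL) ? 1 : 0;
        }
        sleep(5);
    }
}
```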