DPDK: User-Space Networking
Mental Model
Imagine a factory receiving raw materials. The kernel networking path is like having every delivery truck stop at a security checkpoint (interrupt), get unpacked into a standard shipping container (sk_buff), pass through customs (netfilter), go through a central warehouse (socket buffer), and then get delivered to the factory floor (context switch to application). DPDK is like building a private loading dock where trucks back up directly into the factory. Materials go straight from the truck onto the assembly line. No checkpoint, no repackaging, no customs, no warehouse. The factory runs its own dock workers (poll-mode drivers) who check the dock continuously instead of waiting for a doorbell (interrupt). The dock workers never stop checking, even when no trucks are there -- that is the cost of guaranteed instant response.
The Problem
A software load balancer running on a 16-core server with dual 10 GbE NICs hits a ceiling at 2 million packets per second using kernel networking. CPU profiling shows 60% of cycles consumed inside the kernel: sk_buff allocation and freeing (22%), softirq processing (18%), netfilter hook traversal (12%), and context switches between interrupt handlers and the application (8%). The application itself -- consistent hashing and header rewriting -- uses only 25% of CPU. Scaling up means more cores doing kernel work, not more useful packet processing. Switching to DPDK removes the kernel from the data path entirely. The same 16-core server, same NICs, same application logic reaches 14 million packets per second -- seven times the throughput -- with 85% of CPU spent on actual packet processing. The difference is not algorithmic. It is architectural: the application was never slow; the kernel networking stack was the bottleneck.
How DPDK Works
DPDK is a set of userspace libraries that enable applications to interact with NIC hardware directly, bypassing the kernel networking stack completely. The approach has four pillars:
1. Unbind the NIC from the kernel. Before DPDK can access a NIC, the device must be detached from its kernel driver (e.g., ixgbe, i40e) and attached to a DPDK-compatible driver. Two options exist:
- UIO (uio_pci_generic): Maps PCI BAR (Base Address Register) regions into userspace. Simple but dangerous -- with no IOMMU in the path, the device (and therefore the application driving it) can DMA to any physical memory.
- VFIO (vfio-pci): Adds IOMMU-based DMA remapping. The NIC can only DMA into regions explicitly mapped by the application. Preferred for production.
# Show current driver bindings
dpdk-devbind.py --status
# Bind NIC to VFIO (requires IOMMU enabled in BIOS and kernel)
modprobe vfio-pci
dpdk-devbind.py --bind=vfio-pci 0000:03:00.0
# Verify
dpdk-devbind.py --status
Once bound, the kernel no longer sees the NIC. ip link show will not list it. tcpdump cannot capture on it. The NIC belongs entirely to DPDK.
2. Allocate hugepage-backed memory for packet buffers. NICs perform DMA -- Direct Memory Access -- writing incoming packets directly to RAM without CPU involvement. DMA requires physically contiguous memory at known physical addresses. Standard 4 KB pages can be swapped out, and their physical addresses can change. Hugepages solve both problems: they are pinned in physical memory and provide large contiguous regions.
# Check current hugepage state
grep -i huge /proc/meminfo
# Allocate 8 x 1 GB hugepages (best done at boot time)
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# Mount the hugetlbfs filesystem
mkdir -p /dev/hugepages
mount -t hugetlbfs nodev /dev/hugepages
DPDK's rte_pktmbuf_pool_create() carves mbuf pools from this hugepage memory. Each mbuf is a fixed-size structure holding one packet's data plus metadata. The pool is allocated once at startup -- no per-packet allocation or freeing during operation.
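For concreteness, a minimal sketch of pool creation inside the startup path -- the pool name and sizing constants here are illustrative, not taken from any particular application:
#include <stdlib.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define NUM_MBUFS       8191 /* (2^n - 1) sizes fit the ring allocator best */
#define MBUF_CACHE_SIZE  250 /* per-core cache avoids pool lock contention */

struct rte_mempool *pool = rte_pktmbuf_pool_create(
        "rx_pool",                  /* pool name, must be unique */
        NUM_MBUFS,
        MBUF_CACHE_SIZE,
        0,                          /* no per-mbuf private area */
        RTE_MBUF_DEFAULT_BUF_SIZE,  /* 2048 B of data room plus headroom */
        rte_socket_id());           /* allocate on the caller's NUMA node */
if (pool == NULL)
        rte_exit(EXIT_FAILURE, "mbuf pool creation failed\n");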
3. Initialize the Environment Abstraction Layer (EAL). Every DPDK application begins with rte_eal_init(). The EAL:
- Parses EAL arguments: -l 0,2,4 (core list), -n 4 (memory channels), --socket-mem=1024,1024 (per-NUMA-node hugepage allocation)
- Maps hugepage memory
- Scans the PCI bus for DPDK-bound devices
- Pins each DPDK thread to its assigned CPU core via sched_setaffinity()
- Initializes per-core data structures to avoid cross-core locking
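A minimal sketch of the resulting startup sequence -- error handling is trimmed, and the port setup that would follow is omitted:
#include <stdio.h>
#include <rte_eal.h>
#include <rte_lcore.h>
#include <rte_debug.h>

int main(int argc, char **argv)
{
        /* Consumes the EAL arguments (everything before "--") and
           returns how many it parsed, negative on failure. */
        int ret = rte_eal_init(argc, argv);
        if (ret < 0)
                rte_panic("EAL init failed\n");

        /* Skip past the EAL arguments to the application's own. */
        argc -= ret;
        argv += ret;

        printf("EAL ready: %u lcores, main lcore on socket %u\n",
               rte_lcore_count(), rte_socket_id());

        /* NIC configuration and the poll loop would go here. */

        return rte_eal_cleanup();
}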
4. Poll for packets in a tight loop. This is the fundamental architectural difference. Kernel networking is interrupt-driven: the NIC raises an IRQ, the kernel runs a handler, schedules a softirq, allocates an sk_buff, processes the packet through the stack, and eventually delivers it to userspace via a context switch. Each step adds latency and CPU overhead.
DPDK inverts this model. The application calls rte_eth_rx_burst() in an infinite loop, checking the NIC's RX descriptor ring for completed DMA transfers:
struct rte_mbuf *bufs[BURST_SIZE];

while (running) {
        /* No interrupt. No sleep. No context switch. */
        uint16_t nb_rx = rte_eth_rx_burst(port, queue, bufs, BURST_SIZE);
        if (nb_rx == 0)
                continue; /* Immediately poll again */

        /* Process packets entirely in userspace */
        process_packets(bufs, nb_rx);

        /* Transmit results directly to the NIC TX ring */
        uint16_t nb_tx = rte_eth_tx_burst(port, queue, bufs, nb_rx);

        /* TX may accept fewer packets than offered; free the
           leftovers or the mbufs leak */
        while (nb_tx < nb_rx)
                rte_pktmbuf_free(bufs[nb_tx++]);
}
The poll-mode driver (PMD) reads the RX descriptor ring, which the NIC updates via DMA as packets arrive. If new descriptors are ready, the PMD returns pointers to the corresponding mbufs. If not, it returns zero, and the loop immediately polls again. The thread never sleeps, never yields, never gets interrupted. This is why DPDK cores show 100% CPU utilization regardless of traffic load.
DPDK vs Kernel Networking: Where the Cycles Go
For a single 64-byte packet flowing through kernel networking:
| Stage | Cost | Notes |
|---|---|---|
| Hardware interrupt (IRQ) | 200-500 cycles | Mode switch to kernel, save registers |
| sk_buff allocation | 300-600 cycles | slab allocator, metadata init |
| NAPI poll / softirq | 200-400 cycles | Scheduling overhead |
| Netfilter hooks | 200-1000 cycles | Depends on rule count |
| Protocol processing | 500-1000 cycles | IP lookup, checksum verification |
| Socket buffer copy | 300-500 cycles | Copy to userspace buffer |
| Context switch | 500-1000 cycles | Return to userspace process |
| Total | 3000-5000 cycles | |
For the same packet through DPDK:
| Stage | Cost | Notes |
|---|---|---|
| PMD polls descriptor ring | 50-100 cycles | Cache-hot memory read |
| mbuf pointer dereference | 20-50 cycles | Preallocated, no allocation |
| Application processing | 100-200 cycles | Depends on logic |
| TX descriptor write | 50-100 cycles | Cache-hot memory write |
| Total | 200-400 cycles | |
The 10-15x reduction in per-packet cycles is the source of DPDK's throughput advantage. At 14 million 64-byte packets per second on a 3 GHz core, that is roughly 214 cycles per packet -- achievable only because every kernel overhead source has been eliminated.
DPDK vs XDP: Two Philosophies
Both DPDK and XDP accelerate packet processing, but they approach the problem from opposite directions.
DPDK: the application owns the NIC. The kernel driver is replaced. Packets never enter the kernel. The application has full control and full responsibility. Protocol stacks, monitoring, configuration -- all reimplemented in userspace.
XDP: the kernel owns the NIC, but runs custom code early. XDP attaches eBPF programs at the NIC driver's NAPI poll handler, before sk_buff allocation. The eBPF program inspects raw packet data and returns a verdict: DROP, PASS to kernel stack, TX back out the NIC, or REDIRECT to another interface or AF_XDP socket. The kernel remains in control; XDP just adds a programmable fast path.
| Dimension | DPDK | XDP |
|---|---|---|
| Architecture | Userspace owns NIC | Kernel runs eBPF in driver |
| Throughput (64B pkts) | 14M pps/core | 4-6M pps/core |
| Latency | Sub-microsecond, consistent | 1-5 microseconds |
| Kernel tools (tcpdump, iptables) | Not available | Still work for PASS traffic |
| Protocol stack | Must implement in userspace | Full kernel TCP/IP available |
| Deployment | Replace NIC drivers, dedicate cores | Load eBPF program, no driver changes |
| CPU when idle | 100% (polling) | 0% (interrupt-driven) |
| Safety | Application has raw hardware access | eBPF verifier enforces safety |
| Use case sweet spot | High-pps forwarding, virtual switches | Filtering, load balancing, DDoS mitigation |
The choice depends on the workload. If every packet must be processed at the lowest possible latency and the application can manage its own networking, DPDK wins. If the workload needs fast filtering or redirection while keeping kernel integration, XDP wins. AF_XDP bridges the gap -- it uses XDP to redirect selected packets to a userspace ring buffer, providing DPDK-like zero-copy access for specific flows while leaving other traffic in the kernel.
Pipeline Architectures
DPDK applications at scale rarely run as a single RX-process-TX loop. Instead, they use pipeline stages connected by rte_ring queues:
Run-to-completion model: Each core handles the full pipeline for its assigned RX queue. Simple, good locality, but limits flexibility -- if one processing stage is heavier than others, cores are unevenly loaded.
Pipeline model: Dedicated cores for each stage -- RX, classification, processing, TX -- with rte_ring queues between them. Better load balancing but adds ring enqueue/dequeue overhead (20-50 cycles per operation).
Core 0 (RX) → rte_ring → Core 1 (classify) → rte_ring → Core 2 (process) → rte_ring → Core 3 (TX)
rte_ring is a fixed-size, lock-free FIFO. In single-producer/single-consumer mode, it uses no atomic operations at all -- just memory barriers. Cache-line alignment of the producer and consumer indices prevents false sharing. A properly sized ring adds negligible overhead to the pipeline.
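A sketch of that handoff between two stages, assuming bufs and nb_rx come from the RX loop shown earlier; the ring name and sizes are illustrative, and overflow handling is reduced to dropping:
#include <rte_lcore.h>
#include <rte_mbuf.h>
#include <rte_ring.h>

/* Created once at startup; ring size must be a power of two. */
struct rte_ring *stage_ring = rte_ring_create(
        "classify_to_process", 1024, rte_socket_id(),
        RING_F_SP_ENQ | RING_F_SC_DEQ); /* single producer, single consumer */

/* Producer core: push the received burst downstream. */
unsigned sent = rte_ring_enqueue_burst(stage_ring,
        (void **)bufs, nb_rx, NULL);
while (sent < nb_rx)                    /* ring full: drop the overflow */
        rte_pktmbuf_free(bufs[sent++]);

/* Consumer core: pull up to a burst of work. */
struct rte_mbuf *work[32];
unsigned got = rte_ring_dequeue_burst(stage_ring,
        (void **)work, 32, NULL);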
Common Questions
What happens to packets destined for the host itself (management traffic, SSH)?
A common pattern is to leave one NIC bound to the kernel for management traffic and bind only the data-plane NICs to DPDK. Alternatively, DPDK applications can use KNI (Kernel NIC Interface) to inject selected packets back into the kernel stack, though this adds the overhead the application was trying to avoid.
How does DPDK handle RSS (Receive Side Scaling)?
NICs with RSS hash incoming packets across multiple RX queues based on the flow 5-tuple. DPDK exposes this directly: configure N RX queues per port, assign each to a different core, and each core polls its own queue independently. No locking, no shared state. This is the primary horizontal scaling mechanism.
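A sketch of a four-queue RSS configuration -- the RTE_ETH_-prefixed macros are the names used in recent DPDK releases (older releases use ETH_MQ_RX_RSS and ETH_RSS_* equivalents), and port_id and the queue count are placeholders:
#include <rte_ethdev.h>

struct rte_eth_conf port_conf = {
        .rxmode = { .mq_mode = RTE_ETH_MQ_RX_RSS },
        .rx_adv_conf = {
                .rss_conf = {
                        .rss_key = NULL, /* use the NIC's default hash key */
                        .rss_hf  = RTE_ETH_RSS_IP | RTE_ETH_RSS_TCP |
                                   RTE_ETH_RSS_UDP,
                },
        },
};

/* Four RX and four TX queues; each worker core then polls exactly
   one RX queue via rte_eth_rx_burst(port_id, its_own_queue, ...). */
int ret = rte_eth_dev_configure(port_id, 4, 4, &port_conf);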
Can DPDK run in a virtual machine?
Yes, via SR-IOV virtual functions (VFs) or virtio-user. SR-IOV passes a hardware NIC partition directly to the VM with near-native performance. virtio-user connects to a vhost-user backend (often OVS-DPDK on the host) via shared memory, avoiding the overhead of emulated network devices.
What is the minimum hardware requirement?
At minimum: a NIC supported by a DPDK PMD (Intel, Mellanox, Broadcom, and others), IOMMU support (Intel VT-d or AMD-Vi) for VFIO, and enough RAM for hugepage allocation. Two CPU cores per NIC port (one for RX, one for TX) is a practical starting point. The dpdk-devbind.py --status command shows current driver bindings and which NICs are attached to DPDK-compatible drivers.
How Technologies Use This
An e-commerce platform serves 3 million HTTP requests per second through a fleet of 20 Nginx reverse proxy servers. Each Nginx instance handles 150,000 requests per second using kernel TCP sockets, but the kernel networking stack consumes 40% of each server's 32 CPU cores on interrupt handling, sk_buff allocation, and context switches between softirq and process context. At peak traffic during flash sales, the kernel networking layer becomes the bottleneck before Nginx worker processes saturate.
A DPDK-based L4 load balancer sits in front of the Nginx fleet, distributing TCP connections across the Nginx instances using Direct Server Return (DSR). The DPDK application runs on 4 dedicated cores per server, polling 25 GbE NICs through poll-mode drivers. Incoming packets are read directly from NIC RX descriptor rings into hugepage-backed mbuf pools without triggering any hardware interrupts. The load balancer performs consistent hashing on the TCP 4-tuple in userspace, rewrites the destination MAC address, and transmits through the NIC TX ring. The entire forwarding path takes 2 to 3 microseconds per packet with zero kernel involvement.
With DPDK handling L4 distribution at 14 million packets per second per server, the Nginx instances behind it receive pre-balanced traffic and can dedicate all CPU cores to HTTP parsing, TLS termination, and upstream connection management. The 4 cores consumed by DPDK poll-mode loops are unavailable for other work (they spin at 100% utilization continuously), but this tradeoff frees 40% of CPU on each Nginx server that was previously spent on kernel networking overhead. Standard kernel tools like tcpdump and netstat cannot observe DPDK-managed traffic, so the team uses DPDK's built-in pdump library and custom telemetry hooks for debugging.
An HAProxy instance running on a dual-socket server with 64 cores terminates 400,000 TCP connections per second for a financial trading platform. Each connection carries small, latency-sensitive messages averaging 200 bytes. Under kernel networking, the per-packet path through the socket layer, netfilter hooks, and TCP stack adds 15 microseconds of processing latency. At 800,000 packets per second (request plus response), kernel softirq processing alone occupies 12 cores, and P99 latency sits at 400 microseconds.
Integrating DPDK into the HAProxy forwarding path moves L4 TCP processing into userspace. DPDK's EAL (Environment Abstraction Layer) pins worker threads to isolated CPU cores configured with isolcpus at boot. Hugepages (typically 1 GB pages on each NUMA node) back all packet buffers, eliminating TLB misses during mbuf access. The poll-mode driver reads batches of 32 packets from the NIC RX ring in a single operation, and the TCP state machine runs in userspace using a lightweight stack like TLDK (Transport Layer Development Kit). No interrupts fire, no context switches occur, and no sk_buff structures are allocated in kernel memory.
On the same hardware, DPDK-accelerated HAProxy processes 4 million connections per second with P99 latency below 50 microseconds. The wire-speed L4 processing path handles 14.8 million minimum-size (64-byte) packets per second on a 10 GbE link, matching the theoretical line rate. The operational cost is that DPDK-managed NICs are invisible to the kernel: iptables rules do not apply, /proc/net/tcp shows no connections, and debugging requires DPDK-native tools. HAProxy deployments using DPDK maintain a separate management NIC on the kernel stack for SSH access and monitoring traffic.
A telecom service provider runs Envoy as a service mesh proxy for 5G core network functions, handling GTP-U tunnel encapsulation and decapsulation at 25 Gbps per node. Standard kernel networking processes GTP-U packets through the UDP socket layer, netfilter, and the kernel's GTP tunnel module, achieving a ceiling of 2 million packets per second per node. The 5G specification requires processing 10 million packets per second at sustained line rate with sub-100-microsecond latency for user-plane traffic.
DPDK integration with Envoy in this NFV (Network Function Virtualization) deployment bypasses the kernel entirely for data-plane traffic. Envoy's DPDK worker threads poll 25 GbE NICs through Mellanox mlx5 poll-mode drivers, reading packets directly from descriptor rings into hugepage-backed mbuf pools. GTP-U encapsulation and decapsulation happen in userspace on the raw packet buffers. Envoy's hot restart mechanism allows deploying new configurations without dropping active GTP tunnels: a new Envoy process inherits the DPDK EAL memory regions and NIC queue ownership from the old process through shared hugepage mappings and file descriptor passing over a Unix domain socket.
A single node with 8 cores dedicated to DPDK poll-mode processing sustains 14 million packets per second at 25 GbE line rate, with P99 latency at 30 microseconds. The hot restart completes in under 50 milliseconds, during which the NIC hardware queues buffer packets in the ring descriptors without loss. The remaining 24 cores on the 32-core server handle control-plane Envoy processing (xDS configuration updates, health checks, metrics export) through standard kernel networking. This split architecture requires careful NUMA-aware core allocation: DPDK cores and NIC queues must reside on the same NUMA node to avoid cross-socket memory access penalties that would add 80 to 100 nanoseconds per packet.
Same Concept Across Tech
| Technology | How it uses DPDK | Key gotcha |
|---|---|---|
| OVS-DPDK | Replaces kernel datapath with userspace PMD-based forwarding. 10x throughput improvement over kernel OVS | Dedicated cores spin at 100%. Must pin vhost-user ports to correct NUMA node |
| VPP (fd.io) | Uses DPDK PMDs for NIC access, adds vector (batch) processing for cache efficiency. 40M pps on 4 cores | VPP graph node model has a learning curve. Debugging vectorized pipelines requires VPP-specific tools |
| SPDK | Applies the DPDK model to storage: NVMe drives accessed via userspace poll-mode drivers instead of kernel block layer | Same tradeoffs as DPDK -- dedicated cores, loss of kernel block layer tools (iostat, blktrace) |
| F-Stack | Full TCP/IP stack running in userspace on top of DPDK, using FreeBSD's network stack ported to Linux | Provides socket-compatible API but with different edge-case behavior than Linux kernel TCP |
| mTCP | Research userspace TCP stack on DPDK. Achieves 3-10x better small-message throughput than kernel TCP | Not production-hardened. Lacks features like ECN, SACK, and modern congestion control algorithms |
Stack layer mapping (packet processing performance ceiling):
| Layer | What to check | Tool |
|---|---|---|
| Application | Is per-packet processing logic efficient? Batch operations where possible | perf record on DPDK cores, rte_rdtsc() instrumentation |
| DPDK | Are mbuf pools sized correctly? Are rings backing up? | dpdk-proc-info, rte_mempool_avail_count() |
| Memory | Is hugepage allocation sufficient? Any TLB misses? | /proc/meminfo (HugePages_Free), perf stat -e dTLB-load-misses |
| NUMA | Are cores and memory on the same NUMA node as the NIC? | lstopo, numactl --hardware, /sys/bus/pci/devices/*/numa_node |
| NIC | Is the NIC the bottleneck? Check for RX/TX descriptor ring overflows | ethtool -S, dpdk-testpmd show port stats |
| PCIe | Is the PCIe bus saturated? Gen3 x16 = 128 Gbps theoretical | lspci -vvv for link width/speed, PCIe bandwidth counters |
Design Rationale
The Linux kernel networking stack was designed for generality: support every protocol, integrate with firewalling, provide fair scheduling across all applications, and handle connection-oriented workloads where per-packet cost is amortized over large transfers. This design adds 3,000-5,000 CPU cycles of overhead per packet. For workloads that process millions of small packets per second -- virtual switching, load balancing, telecom tunneling -- this overhead becomes the dominant cost. DPDK takes the radical position that for these workloads, the kernel should not be involved at all. Map the hardware into userspace, let the application manage everything, and accept the operational cost of losing kernel integration. The 7-10x throughput improvement validates this tradeoff for the specific class of high-pps, low-latency workloads.
If You See This, Think This
| Symptom | Likely cause | First check |
|---|---|---|
| DPDK app fails to start with EAL error | Hugepages not allocated or not mounted | grep -i huge /proc/meminfo, check /dev/hugepages mount |
| Zero packets received despite traffic arriving | NIC still bound to kernel driver, not DPDK driver | dpdk-devbind.py --status |
| Throughput 40-60% below expected | NUMA misalignment between NIC and processing cores | Compare NIC NUMA node (lspci) with core NUMA node (lstopo) |
| Intermittent latency spikes of 10-100us | DPDK cores not isolated from kernel scheduler | Check isolcpus in /proc/cmdline, look for involuntary context switches |
| RX packet drops at NIC level | mbuf pool exhausted, no buffers for incoming DMA | dpdk-proc-info stats, increase mbuf pool size or add RX queues |
| Application crashes on startup | Insufficient hugepage memory for requested mbuf pools | Increase hugepage reservation in GRUB or /proc/sys/vm/nr_hugepages |
| Performance degrades over hours | Memory fragmentation if using 2 MB hugepages allocated late | Switch to 1 GB hugepages reserved at boot via kernel command line |
| tcpdump shows no traffic on NIC | Expected. Kernel cannot see DPDK-bound interfaces | Use dpdk-pdump or application-level telemetry instead |
When to Use / Avoid
Relevant when:
- Packet processing rates exceed 1-2 million packets per second and kernel overhead dominates CPU profiles
- Sub-microsecond latency consistency matters more than kernel integration
- Building virtual switches, load balancers, firewalls, or network functions that process every packet
- Telecom or NFV workloads need line-rate processing on commodity hardware
Watch out for:
- Dedicated CPU cores spin at 100% even with zero traffic, reducing available compute
- All kernel networking tools (tcpdump, iptables, ss, netstat) stop working for DPDK-bound interfaces
- Applications must implement their own protocol stacks for anything beyond L2/L3 forwarding
- NUMA misalignment between NICs and processing cores causes 40-60% throughput loss
Try It Yourself
# Check available hugepages
grep -i huge /proc/meminfo

# Allocate 1 GB hugepages at runtime (prefer boot-time allocation)
echo 8 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages

# Mount hugetlbfs if not already mounted
mkdir -p /dev/hugepages && mount -t hugetlbfs nodev /dev/hugepages

# Load VFIO kernel module (preferred over UIO for security)
modprobe vfio-pci

# Show current NIC driver bindings
dpdk-devbind.py --status

# Unbind NIC from kernel driver and bind to VFIO
dpdk-devbind.py --bind=vfio-pci 0000:03:00.0

# Run testpmd to validate DPDK NIC access (2 cores, 1 GB hugepages)
dpdk-testpmd -l 0,1 -n 4 --socket-mem=1024 -- -i --portmask=0x1

# Inside testpmd: start forwarding and show stats
testpmd> start
testpmd> show port stats all

# Check NUMA node for a PCI device
cat /sys/bus/pci/devices/0000:03:00.0/numa_node

# Reserve hugepages at boot (add to GRUB_CMDLINE_LINUX)
GRUB_CMDLINE_LINUX="default_hugepagesz=1G hugepagesz=1G hugepages=8 isolcpus=2-7"

Debug Checklist
1. Verify hugepages are allocated: grep -i huge /proc/meminfo
2. Check NIC driver binding: dpdk-devbind.py --status
3. Verify IOMMU is enabled (for VFIO): dmesg | grep -i iommu
4. Check NUMA affinity of NIC: cat /sys/bus/pci/devices/<bdf>/numa_node
5. Verify core isolation: cat /proc/cmdline | grep isolcpus
6. Monitor mbuf pool exhaustion: dpdk-proc-info -- --stats
7. Check for packet drops at NIC level: ethtool -S <iface> | grep -i drop
8. Validate hugepage mount: mount | grep hugetlbfs
Key Takeaways
- ✓ The fundamental tradeoff: DPDK trades kernel integration for raw speed. Applications lose access to the entire kernel networking stack -- iptables, tc, conntrack, /proc/net, tcpdump, socket API. Everything the kernel provides must be reimplemented in userspace or abandoned.
- ✓ Poll-mode drivers spin at 100% CPU regardless of traffic load. A DPDK core handling zero packets per second and one handling 10 million packets per second both show 100% CPU utilization. This is by design -- the latency benefit comes from eliminating the interrupt-to-poll transition. Power-aware deployments can use rte_power_empty_poll_stat to detect idle periods and scale frequency, but this adds latency variance.
- ✓ Hugepages are not optional. DMA requires physically contiguous memory because NICs operate on physical addresses. Standard 4 KB pages can be swapped out or fragmented, breaking DMA. Hugepages are pinned, physically contiguous, and provide 512x fewer TLB entries for the same memory. A DPDK application that fails to allocate hugepages will not start.
- ✓ DPDK and XDP solve similar problems from opposite directions. DPDK pulls packets out of the kernel entirely -- the application owns the NIC. XDP pushes processing into the kernel, running eBPF programs at the earliest point in the NIC driver. XDP keeps kernel integration (tcpdump, iptables still work for non-XDP traffic), while DPDK maximizes raw throughput at the cost of kernel visibility. XDP is typically 2-5x slower than DPDK for pure forwarding but requires no application-level protocol stacks.
- ✓ NUMA awareness is critical. A DPDK application that allocates mbufs from NUMA node 0 but processes them on a core attached to NUMA node 1 pays a 40-60% latency penalty for cross-node memory access. The EAL's --socket-mem flag and rte_lcore_to_socket_id() exist specifically to prevent this. In production, each NIC should be handled by cores on the same NUMA node as the NIC's PCIe slot.
Common Pitfalls
- ✗ Running DPDK without isolating CPU cores from the kernel scheduler. If the kernel schedules other tasks on DPDK poll-mode cores, context switches destroy latency predictability. Use the isolcpus= boot parameter or cgroups cpuset to dedicate cores exclusively to DPDK. Without isolation, P99 latency can spike by 100x during scheduler preemptions.
- ✗ Allocating hugepages after boot instead of reserving them at boot time. Late allocation depends on physically contiguous free memory, which fragments over uptime. A server running for weeks may fail to allocate 1 GB hugepages even with plenty of free memory. Reserve hugepages via the kernel boot command line: hugepagesz=1G hugepages=8 default_hugepagesz=1G.
- ✗ Ignoring NUMA topology when assigning cores and memory. A NIC on a PCIe bus attached to NUMA node 1, with DPDK cores running on NUMA node 0, crosses the QPI/UPI interconnect for every packet buffer access. This adds 70-100ns per memory operation. Use lstopo or lspci -vvv to check NIC NUMA affinity, then match --socket-mem and -l core assignments accordingly.
- ✗ Expecting kernel tools to work with DPDK traffic. Once a NIC is bound to a DPDK driver (uio_pci_generic or vfio-pci), the kernel cannot see any traffic on that interface. tcpdump, ss, netstat, iptables -- none of them work. DPDK applications must implement their own monitoring: the pdump library for packet capture, rte_eth_stats_get() for counters, the telemetry library for runtime introspection.
- ✗ Using DPDK for workloads that do not need it. If the application processes fewer than 1 million packets per second, kernel networking with interrupt coalescing and SO_BUSY_POLL is usually sufficient. DPDK adds operational complexity: custom drivers, dedicated cores, loss of kernel tooling, and application-managed protocol stacks. The breakeven point is typically 2-5 million pps depending on per-packet processing cost.
Reference
In One Line
DPDK moves packet processing from kernel to userspace by mapping NIC hardware directly into the application -- 7x throughput gains at the cost of dedicated CPU cores and total loss of kernel networking visibility.