Kernel Network Stack
Mental Model
Picture a package delivery through a high-security office building. The delivery truck (NIC) drops the package at the loading dock (RX ring buffer). The security guard (hardware interrupt) buzzes the mailroom. The mailroom clerk (NAPI) scans the barcode (IP header), checks customs paperwork (Netfilter), verifies the tracking number matches the order (TCP sequence), looks up which office placed the order in the company directory (inet_hashtable 4-tuple lookup), places it in that office's inbox tray (socket receive buffer), and the office worker picks it up at the front desk window (copy_to_user). Sending a package reverses the process -- hand it to the mailroom, they add labels and stamps, and it goes out on the next truck.
The Problem
A containerized service receives 500,000 packets per second. Each packet crosses the NIC ring buffer, triggers a hardware interrupt, gets wrapped in an sk_buff, passes through IP routing and Netfilter hooks, hits TCP reassembly, gets matched to a socket via inet_hashtable, queues on the socket receive buffer, and finally gets copied to user space. That is seven kernel subsystems, two privilege transitions, and at least one memory copy per packet. When latency spikes, the question is always: which layer broke? Without understanding the full path, every tuning attempt is guesswork.
Architecture
Open a terminal and run ss -tnpi on a busy server. Every socket listed there -- with its congestion window, retransmission count, and buffer depth -- represents a struct sock in kernel memory. That struct sock exists because a packet arrived on the wire, traveled through seven layers of kernel code, and got matched to that specific socket by hashing four numbers.
This is the path every TCP packet takes, and every performance anomaly in networked software traces back to one of these layers.
The RX Path: Wire to Application
A packet arriving at the server follows this sequence. No shortcuts, no exceptions.
Step 1: NIC and DMA. The network card receives the frame and writes it directly into host memory via DMA (Direct Memory Access). The CPU is not involved at all. The NIC uses a ring buffer -- a circular array of descriptors that point to pre-allocated memory regions. The NIC fills the next descriptor, advances its pointer, and the frame is in RAM. If the ring fills up before the kernel drains it, subsequent frames are silently dropped at the hardware level. This is the most common packet loss point on high-throughput systems, and it shows up only in ethtool -S counters.
Step 2: Hardware interrupt and NAPI. The NIC raises a hardware interrupt. The interrupt handler (top-half) does almost nothing -- it disables further NIC interrupts and schedules a NAPI softirq. This is critical. Processing packets in interrupt context would block all other interrupts on that CPU. NAPI flips to polling mode: the softirq handler calls the driver's poll function, which drains packets from the ring buffer in batches. One interrupt can process dozens of packets. The NAPI budget (netdev_budget, default 300 packets, and netdev_budget_usecs, default 2000 microseconds) controls how much work each poll cycle does before yielding the CPU.
Step 3: sk_buff creation. For each frame, the driver allocates an sk_buff (socket buffer) -- the universal packet representation in the Linux kernel. The sk_buff has four pointers: head (start of allocated memory), data (start of current protocol header), tail (end of packet data), and end (end of allocated memory). As the packet moves up the stack, each layer calls skb_pull() to advance the data pointer past its header, exposing the next layer's header. No copying, just pointer arithmetic. The skb_shared_info structure at the end holds page fragment references for scatter-gather I/O and GSO (Generic Segmentation Offload) metadata.
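To make the pointer arithmetic concrete, here is a minimal user-space sketch of the same four-pointer scheme. The struct and function names are invented for illustration; the real sk_buff lives in include/linux/skbuff.h and carries far more state.

```c
#include <stdio.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel's four sk_buff pointers. */
struct pkt_buf {
    unsigned char *head;  /* start of allocated memory            */
    unsigned char *data;  /* start of the current protocol header */
    unsigned char *tail;  /* end of packet data                   */
    unsigned char *end;   /* end of allocated memory              */
};

/* Advance data past a consumed header, like skb_pull(): no copy,
 * just pointer arithmetic. */
static unsigned char *buf_pull(struct pkt_buf *b, size_t len) {
    b->data += len;
    return b->data;
}

int main(void) {
    unsigned char mem[2048];
    struct pkt_buf b = { mem, mem, mem + 1514, mem + sizeof(mem) };

    buf_pull(&b, 14);   /* Ethernet layer strips its 14-byte header */
    buf_pull(&b, 20);   /* IP layer strips a 20-byte IPv4 header    */
    printf("TCP header now at offset %ld\n", (long)(b.data - b.head));
    return 0;
}
```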
Step 4: IP layer. ip_rcv() validates the IP header checksum, makes a routing decision (is this packet for the local machine, or does it need forwarding?), and passes through the Netfilter PREROUTING hook. In container environments, this is where DNAT happens -- the destination IP gets rewritten from the host's external address to the container's internal address. This rewrite occurs before socket lookup, which is how 50 containers can all listen on port 80 with different internal IPs.
Step 5: TCP layer. tcp_v4_rcv() handles the heavy lifting. It validates the TCP checksum, processes the segment according to the connection's state machine, handles sequence number validation, generates ACKs, updates the congestion window, and performs segment reassembly if packets arrived out of order. The TCP receive path is one of the most complex pieces of code in the kernel, and for good reason -- it implements decades of RFCs governing reliability, flow control, and congestion avoidance.
Step 6: Socket lookup. inet_hashtable hashes the 4-tuple (source IP, source port, destination IP, destination port) to find the matching struct sock. For established connections, this is a hash table lookup -- effectively O(1). For incoming SYN packets, a separate inet_listening_hashtable finds the listening socket. With SO_REUSEPORT, multiple sockets can bind the same address and port; the kernel distributes incoming connections among them, optionally steered by an attached BPF program. Each network namespace has its own inet_hashtable instance. This is how containers isolate their socket spaces -- 50 containers binding port 80 means 50 separate hash tables, not one shared one (cross-ref network-namespaces.md).
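A toy model of the established-connection lookup shape -- hash the 4-tuple, index a bucket, walk the chain. The hash function here is a simple multiplicative mix chosen for illustration; the kernel uses a seeded jhash over the same four fields, and all names below are invented.

```c
#include <stddef.h>
#include <stdint.h>

struct four_tuple {
    uint32_t saddr, daddr;
    uint16_t sport, dport;
};

struct sock_node {
    struct four_tuple key;
    struct sock_node *next;
    /* the real struct sock carries all protocol state here */
};

#define NBUCKETS 1024
static struct sock_node *buckets[NBUCKETS];

static size_t hash4(const struct four_tuple *t) {
    uint32_t h = t->saddr ^ t->daddr ^ ((uint32_t)t->sport << 16 | t->dport);
    return (h * 2654435761u) % NBUCKETS;   /* multiplicative mix */
}

struct sock_node *lookup(const struct four_tuple *t) {
    for (struct sock_node *n = buckets[hash4(t)]; n; n = n->next)
        if (n->key.saddr == t->saddr && n->key.daddr == t->daddr &&
            n->key.sport == t->sport && n->key.dport == t->dport)
            return n;   /* established connection found */
    return NULL;        /* caller falls back to the listening table */
}
```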
Step 7: Socket receive buffer. The sk_buff gets queued on sk->sk_receive_queue. If the application has an epoll registration on this socket, ep_poll_callback() fires, adding the socket to epoll's ready list. The application's epoll_wait() returns, the application calls recv() or read(), and copy_to_user() copies the data from kernel memory to the user-space buffer. The CPU transitions from ring 0 back to ring 3. The packet journey is complete.
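From the application side, Step 7 is the familiar epoll loop. A minimal sketch, assuming sock_fd is an already-connected non-blocking TCP socket and epfd came from epoll_create1():

```c
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>

void drain(int epfd, int sock_fd) {
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sock_fd, &ev);

    struct epoll_event ready[16];
    /* blocks until ep_poll_callback() puts a socket on the ready list */
    int n = epoll_wait(epfd, ready, 16, -1);
    for (int i = 0; i < n; i++) {
        char buf[4096];
        ssize_t len;
        /* each recv() is one syscall plus one copy_to_user(); loop
         * until the socket receive buffer is empty */
        while ((len = recv(ready[i].data.fd, buf, sizeof(buf),
                           MSG_DONTWAIT)) > 0)
            ; /* process buf[0..len) here */
    }
}
```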
The TX Path: Application to Wire
Sending reverses the process with its own complexities.
Step 1: Syscall entry. The application calls send() or write(). The CPU transitions from ring 3 to ring 0. copy_from_user() copies the data from the application buffer into kernel memory.
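One detail worth showing in code: send() may accept fewer bytes than requested when the socket send buffer is full, and every call is a separate boundary crossing. A minimal sketch of the standard retry loop:

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Write all of buf, retrying on partial sends. Each send() is one
 * ring 3 -> ring 0 transition plus a copy_from_user() of however
 * many bytes the socket send buffer accepted. */
ssize_t send_all(int fd, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted: retry the same range */
            return -1;      /* EAGAIN etc.: caller waits for POLLOUT */
        }
        sent += (size_t)n;
    }
    return (ssize_t)sent;
}
```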
Step 2: TCP segmentation. The TCP layer segments the data into MSS-sized chunks, constructing an sk_buff for each segment. It adds TCP headers with sequence numbers, calculates checksums (or defers to hardware), and applies congestion control logic. If TSO (TCP Segmentation Offload) is available, the kernel creates one large sk_buff and the NIC hardware splits it into wire-sized segments -- saving CPU cycles for the most common operation in the send path.
Step 3: IP layer and Netfilter. The IP layer performs route lookup, adds the IP header, and passes through Netfilter OUTPUT and POSTROUTING hooks. In container environments, MASQUERADE in POSTROUTING rewrites the source address from the container's internal IP to the host's external IP (cross-ref netfilter.md).
Step 4: Qdisc and driver. The queueing discipline (qdisc) schedules the packet. Default fq_codel provides fair queuing with controlled delay. The driver enqueues the sk_buff's DMA address onto the NIC's TX ring. The NIC reads the descriptor via DMA and transmits the frame. A TX completion interrupt (or timer) signals the driver to free the sk_buff.
The User/Kernel Boundary
Every recv() and send() crosses the user/kernel boundary via the syscall instruction. This transition costs roughly 200 nanoseconds on modern hardware -- switching from ring 3 to ring 0, saving registers, entering kernel code. vDSO (virtual dynamic shared object) accelerates some syscalls like gettimeofday() by mapping kernel data into user space, but network I/O cannot benefit from this because it genuinely needs kernel code to manipulate socket buffers and protocol state.
Zero-copy techniques reduce the data copying cost. sendfile() tells the kernel to transfer data directly from a file's page cache into the socket without ever copying it to user space. splice() connects two kernel-side pipe buffers. MSG_ZEROCOPY (Linux 4.14+) lets send() reference user-space pages via the sk_buff's skb_shared_info frags, avoiding copy_from_user() entirely. The application must wait for a completion notification before reusing those pages.
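A minimal sendfile() loop in C, with error handling trimmed; the socket is assumed to be connected and blocking:

```c
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <unistd.h>

/* Stream a file into a socket without copying it to user space:
 * the kernel attaches page-cache pages to sk_buffs by reference. */
int serve_file(int sock_fd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    fstat(fd, &st);

    off_t off = 0;
    while (off < st.st_size) {
        /* sendfile() may transfer less than asked; off is advanced */
        ssize_t n = sendfile(sock_fd, fd, &off, st.st_size - off);
        if (n <= 0)
            break;
    }
    close(fd);
    return off == st.st_size ? 0 : -1;
}
```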
io_uring takes a different approach: instead of reducing per-syscall cost, it reduces the number of syscalls. A shared ring buffer between user space and kernel lets the application submit batches of network operations (connect, send, recv) with a single io_uring_enter() call. For workloads making thousands of small network operations per second, the syscall overhead savings are substantial.
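A sketch of the batching idea using liburing (compile with -luring). It assumes the ring was initialized elsewhere with io_uring_queue_init() and that fds[] holds connected sockets; the helper name and buffer layout are invented for illustration.

```c
#include <liburing.h>
#include <stdint.h>

int batch_recv(struct io_uring *ring, int *fds, char (*bufs)[4096], int nfds) {
    for (int i = 0; i < nfds; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            break;                         /* submission queue full */
        io_uring_prep_recv(sqe, fds[i], bufs[i], sizeof(bufs[i]), 0);
        io_uring_sqe_set_data(sqe, (void *)(intptr_t)i);  /* tag with index */
    }
    int submitted = io_uring_submit(ring); /* one io_uring_enter() syscall */

    for (int done = 0; done < submitted; done++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(ring, &cqe);
        /* cqe->res: bytes received or -errno; cqe->user_data: which fd */
        io_uring_cqe_seen(ring, cqe);
    }
    return submitted;
}
```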
Container Integration
Container networking adds extra hops to the packet path. A packet destined for a container traverses: host NIC, host IP stack, Netfilter PREROUTING (DNAT rewrites destination to container IP), bridge device, veth pair (virtual Ethernet tunnel into the container's network namespace), container IP stack, and finally the container's inet_hashtable for socket lookup.
Each network namespace has its own routing table, Netfilter rules, and inet_hashtable. This is fundamental to container isolation. When a container binds port 80, it binds in its own namespace's hash table. The host's port mapping works through DNAT: external traffic to host:8080 gets rewritten to container_ip:80 in the PREROUTING chain, then routed through the bridge and veth pair into the namespace where the socket lookup succeeds (cross-ref network-namespaces.md, netfilter.md).
Common Questions
Where do most packets get dropped?
In order of likelihood: (1) NIC RX ring buffer overflow when NAPI cannot drain fast enough, (2) softirq backlog when netdev_max_backlog is exceeded, (3) Netfilter rules explicitly dropping traffic, (4) TCP accept queue overflow on the listening socket, (5) socket receive buffer full when the application reads too slowly. Each drop point has a specific counter. The mistake most engineers make is looking at application logs first when the real answer is in ethtool -S or /proc/net/softnet_stat.
Why does container networking add latency?
Each veth pair traversal means the packet passes through the network stack twice -- once in the host namespace and once in the container namespace. The bridge device adds a forwarding decision. Netfilter DNAT/MASQUERADE rules add conntrack overhead. For latency-sensitive workloads, host networking (--network=host in Docker) eliminates all these extra hops by sharing the host's network namespace directly.
What is the difference between TCP segmentation offload (TSO) and generic segmentation offload (GSO)?
TSO lets the NIC hardware split one large buffer into wire-sized TCP segments. The kernel builds one sk_buff with up to 64 KB of data, and the NIC hardware creates individual Ethernet frames. GSO does the same splitting but in software, just before the driver hands packets to the NIC. GSO is the fallback when the NIC does not support TSO, but it still saves work by deferring segmentation as late as possible so that Netfilter and routing operate on one large packet instead of many small ones.
How does SO_REUSEPORT actually distribute connections?
Without SO_REUSEPORT, one socket binds a port and all connections funnel through it. With SO_REUSEPORT, multiple sockets (typically one per worker thread or process) bind the same address and port. The kernel hashes the incoming 4-tuple and distributes each connection to one of the sockets. An optional BPF program (attached via SO_ATTACH_REUSEPORT_CBPF or EBPF) can override the hash for custom steering logic -- routing specific clients to specific workers, for example.
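A minimal C setup for one worker's listener -- each worker calls this with the same port, and the kernel steers connections between them:

```c
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int reuseport_listener(uint16_t port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    /* must be set on every socket before bind(), in every worker */
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 1024) < 0) {
        close(fd);
        return -1;
    }
    return fd;   /* each worker runs its own accept() loop on this fd */
}
```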
How Technologies Use This
A Kafka cluster consuming 2 GB/s of replication traffic sees tail latencies spike above 50 ms during peak hours. Brokers show healthy CPU and disk, but network softirq time is consuming 40% of two cores and packet drops appear in /proc/net/softnet_stat.
The RX ring buffer on each NIC is sized at the default 256 entries. At 2 GB/s, the NIC fills the ring faster than a single NAPI poll cycle can drain it. Packets overflow, the NIC hardware drops them, and TCP retransmissions create the latency spikes. Meanwhile, all softirq processing is pinned to the core handling the NIC interrupt because RSS is disabled.
Increase the RX ring buffer with ethtool -G eth0 rx 4096, enable RSS to spread interrupts across cores, and tune net.core.netdev_budget to 600 so each NAPI poll cycle processes more sk_buffs before yielding. Kafka replication latency drops below 5 ms because packets stop being dropped at the ring buffer and softirq load is balanced across cores.
An Nginx reverse proxy handling 200,000 requests per second shows intermittent connection resets. The application logs show nothing wrong, but netstat reveals a listen queue overflow with thousands of SYN cookies being issued. Clients experience random 502 errors.
The kernel network stack has two queues for incoming connections -- the SYN queue (half-open) and the accept queue (fully established). The accept queue depth is controlled by the backlog argument to listen() and capped by net.core.somaxconn. At the longtime default of 128 (raised to 4096 in kernel 5.4), the accept queue fills in under a millisecond at this rate. Once full, the kernel either drops SYNs or falls back to SYN cookies, both of which cause visible client failures.
Set net.core.somaxconn to 65535 and configure Nginx backlog to match in the listen directive. Also increase net.ipv4.tcp_max_syn_backlog to 65535. The accept queue now has headroom for burst traffic, inet_csk_reqsk_queue_is_full stops returning true, and the connection reset rate drops to zero.
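The fix has two halves because the kernel silently caps the listen() backlog at net.core.somaxconn: raising the sysctl without changing the application (or vice versa) achieves nothing. A small sketch, assuming listen_fd is already bound:

```c
#include <stdio.h>
#include <sys/socket.h>

int listen_deep(int listen_fd) {
    int cap = 0;
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    if (f) {
        if (fscanf(f, "%d", &cap) != 1)
            cap = 0;
        fclose(f);
    }
    printf("requesting backlog 65535, kernel caps it at %d\n", cap);
    return listen(listen_fd, 65535);   /* silently clamped to somaxconn */
}
```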
A Go HTTP server using net/http shows 200 microsecond p99 latency for small responses, but when response bodies exceed 64 KB, latency jumps to 2 milliseconds. CPU profiling shows time spent in runtime.memmove inside the TCP send path, and the garbage collector is running more frequently.
Each send() syscall copies the user-space buffer into a kernel sk_buff via copy_from_user(). For large responses, the kernel allocates multiple sk_buffs and the Go runtime allocates temporary buffers that become GC pressure. The copy cost is real -- at 64 KB per response and 50,000 responses per second, that is 3.2 GB/s of memory copies across the user/kernel boundary.
Enable TCP_CORK to batch small writes into full-MSS segments, reducing sk_buff allocations. For file serving, use net/http.ServeFile which calls sendfile() internally, bypassing the user-space copy entirely. The kernel splices data directly from the page cache into the socket buffer using sk_buff page fragments. Latency drops to 400 microseconds at p99.
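The corking pattern, shown here in C for clarity (Go reaches the same socket option through its syscall packages); fd is a connected TCP socket:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cork, write header and body separately, uncork. The kernel
 * coalesces the writes into full-MSS segments instead of emitting
 * one small segment per send(). */
void send_corked(int fd, const void *hdr, size_t hlen,
                 const void *body, size_t blen) {
    int on = 1, off = 0;
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    send(fd, hdr, hlen, 0);
    send(fd, body, blen, 0);
    setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));  /* flush */
}
```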
A Node.js WebSocket server handling 30,000 concurrent connections shows increasing memory usage over hours until it crashes with an OOM kill. The heap dump shows memory is not leaking in JavaScript -- the RSS growth is in kernel space, visible only through /proc/meminfo showing rising slab allocation.
Every connected socket has a struct sock in kernel memory, and each sock holds receive and send buffers sized by net.ipv4.tcp_rmem and tcp_wmem. At default max values of 6 MB per socket, 30,000 connections can claim up to 180 GB of theoretical buffer space. The kernel auto-tunes within these bounds, and bursty WebSocket traffic causes buffers to grow toward the maximum and stay there because tcp_memory_pressure thresholds are set too high.
Lower net.ipv4.tcp_rmem and tcp_wmem max values to 1 MB for this workload. Set net.ipv4.tcp_mem thresholds to trigger memory pressure earlier. The kernel reclaims socket buffer memory more aggressively, and RSS stabilizes at 4 GB instead of growing unbounded. The sk_buff allocations stay within the SLUB cache budget.
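The sysctls above are host-global. When only one service misbehaves, per-socket caps are an alternative -- a sketch, with the caveat that an explicit SO_RCVBUF disables receive-buffer auto-tuning for that socket, and the kernel doubles the requested value internally for bookkeeping overhead:

```c
#include <sys/socket.h>

/* Cap buffers on one socket instead of tuning tcp_rmem/tcp_wmem
 * for the whole host. */
void cap_socket_buffers(int fd, int bytes) {
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
    setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes));
}
```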
Same Concept Across Tech
| Technology | Network Stack Integration | Key Optimization |
|---|---|---|
| Kafka | FileChannel.transferTo() maps to sendfile(), bypassing JVM heap entirely | Zero-copy from page cache to socket sk_buff via splice |
| Nginx | sendfile() + TCP_CORK batches file data with HTTP headers in kernel | Avoids user-space copy for static content, wire-speed on one core |
| Go | netpoller parks goroutines until sk_receive_queue has data via epoll | Goroutines decouple from OS threads -- millions of connections on a few threads |
| Node.js | libuv wraps epoll for socket readiness, the libuv thread pool for disk I/O | Single-threaded event loop avoids context switch overhead on the RX path |
| DPDK | Bypasses the entire kernel stack -- NIC maps directly to user-space memory | Eliminates all seven kernel layers, but requires dedicated cores and custom drivers |
| io_uring | Batches send/recv operations via shared ring buffers, reducing syscall count | One syscall submits dozens of network operations, amortizing ring 3-to-0 cost |
Design Rationale
The kernel network stack is general-purpose by design. It handles routing, firewalling, congestion control, and protocol compliance for every application. High-performance systems optimize by skipping layers they do not need. Kafka skips the user-space copy with sendfile(). DPDK skips the kernel entirely. io_uring amortizes the syscall boundary. Each optimization removes one or more of the seven hops in the packet path, trading generality for speed at the layer that matters most for that workload.
RX path layer-by-layer cost breakdown:
| Layer | Typical Cost | Bottleneck Indicator |
|---|---|---|
| NIC RX ring | ~0 (DMA, no CPU) | ethtool -S rx_missed_errors |
| Hardware interrupt | ~1 us | /proc/interrupts imbalance |
| NAPI softirq | ~2-5 us per batch | /proc/net/softnet_stat column 3 |
| IP + Netfilter | ~1-3 us (depends on rules) | iptables -L -v -n rule counters |
| TCP processing | ~2-5 us | ss retransmission counters |
| Socket lookup | ~0.1 us (hash O(1)) | Rarely a bottleneck |
| copy_to_user | ~0.5 us per 4 KB | perf showing copy_to_user in profile |
Stack layer mapping (debugging a latency spike):
| Layer | What to check | Tool |
|---|---|---|
| NIC hardware | Ring buffer overflow, RSS queue count | ethtool -S, ethtool -l |
| Interrupt/softirq | Per-CPU load imbalance, time squeeze | /proc/interrupts, /proc/net/softnet_stat |
| IP/Netfilter | Excessive iptables rules, conntrack table full | conntrack -C, iptables -L -v |
| TCP | Retransmissions, window shrink, congestion | ss -tnpi, /proc/net/snmp |
| Socket buffer | Receive queue backlog, memory pressure | ss Recv-Q, /proc/net/sockstat |
| Syscall boundary | copy_to_user time, syscall frequency | perf trace -e read,recv, strace -c |
| Application | Slow processing after recv returns | Application profiler |
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Packet drops at NIC, zero application errors | RX ring buffer overflow -- NAPI cannot drain fast enough | ethtool -S eth0 \| grep -i drop |
| Latency spikes correlated with CPU softirq time | NAPI budget exhausted, packets queued in ring buffer | cat /proc/net/softnet_stat column 3 (time_squeeze) |
| Connection resets under load, SYN cookies in dmesg | Accept queue overflow on listening socket | nstat -az TcpExtListenDrops |
| High retransmission rate, cwnd stuck at low value | Packet loss in the network or NIC drops causing TCP backoff | ss -tnpi and check retrans count and cwnd |
| Memory growing in kernel slab, not in application heap | Socket buffers auto-tuning to tcp_rmem max across many connections | cat /proc/net/sockstat and check mem column |
| sendfile() slower than expected | File not in page cache, causing disk reads in the send path | vmstat 1 to check block I/O during transfers |
| Container cannot reach host service on same port | DNAT rewrite in PREROUTING changes dst before inet_hashtable lookup | conntrack -L to verify NAT translation entries |
| All softirq on one CPU, other cores idle | RSS disabled or IRQ affinity pinned to single core | cat /proc/interrupts \| grep eth |
When to Use / Avoid
Understand the kernel network stack when:
- Debugging latency that application profiling cannot explain
- Tuning high-throughput services (above 100K packets per second)
- Diagnosing packet drops that show up in ethtool but not in application logs
- Working with container networking where DNAT and veth pairs add kernel hops
- Evaluating zero-copy techniques (sendfile, MSG_ZEROCOPY, io_uring) for data-intensive services
- Understanding why socket buffer tuning affects memory and latency simultaneously
Less relevant when:
- Application-level bottlenecks dominate (slow queries, algorithmic issues)
- Traffic is low enough that default kernel settings work fine (under 10K req/s)
- Working exclusively with UDP where TCP reassembly and congestion control do not apply
Try It Yourself
```sh
# Check NIC ring buffer sizes and packet drops
ethtool -g eth0 && ethtool -S eth0 | grep -i "drop\|miss\|error"

# Show per-CPU softirq stats (columns: processed, dropped, time_squeeze)
cat /proc/net/softnet_stat

# Monitor TCP socket internals: cwnd, retransmissions, buffer fill
ss -tnpi | head -20

# Check accept queue overflows on listening sockets
nstat -az TcpExtListenDrops TcpExtListenOverflows

# Trace the RX packet path in the kernel with perf
perf trace -e net:* -p $(pidof nginx) -- sleep 5 2>&1 | head -30

# Show socket memory usage across all TCP connections
cat /proc/net/sockstat

# Verify interrupt distribution across CPUs for network interfaces
cat /proc/interrupts | grep eth

# Check current TCP buffer tuning parameters
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max

# Watch conntrack table for NAT translations (container networking)
conntrack -L -p tcp --dport 80 2>/dev/null | head -10
```

Debug Checklist
1. ethtool -S eth0 | grep -i drop -- check NIC-level packet drops
2. cat /proc/net/softnet_stat -- check per-CPU softirq drops and time squeezes
3. ss -tnpi | grep -A2 <port> -- check socket buffer fill, cwnd, retransmissions
4. cat /proc/net/snmp | grep Tcp -- check TCP retransmissions, receive errors
5. sysctl net.ipv4.tcp_rmem -- verify receive buffer auto-tuning range
6. nstat -az TcpExtListenDrops -- check accept queue overflows
7. cat /proc/interrupts | grep eth -- verify interrupt distribution across CPUs
8. perf top -g -- look for time spent in net_rx_action, tcp_v4_rcv, copy_to_user
Key Takeaways
- ✓ The RX path crosses two privilege boundaries. The NIC writes packets via DMA (no CPU involvement), then a hardware interrupt transitions to kernel mode. NAPI softirq processes packets in kernel context. Finally, copy_to_user() copies data to the application buffer and the CPU returns to ring 3. Each boundary has real cost -- the syscall transition alone is ~200 nanoseconds.
- ✓ sk_buff is not a simple buffer. It has four pointers (head, data, tail, end) that allow protocols to push/pull headers without copying. When the TCP stack needs to prepend a header, it calls skb_push() which moves the data pointer backward. The actual packet data might span multiple pages via skb_shared_info's frags array, supporting scatter-gather DMA.
- ✓ Socket lookup is the bridge between the network and the application. For each incoming TCP segment, inet_hashtable hashes the 4-tuple and walks a hash chain to find the matching struct sock. SO_REUSEPORT creates multiple sockets on the same port, and the kernel (or an attached BPF program) selects which one receives each connection. In containers, DNAT rewrites the destination IP before this lookup happens.
- ✓ The TX path mirrors the RX path in reverse. send() copies data from user space into sk_buffs, TCP adds headers and applies congestion control, IP performs route lookup and passes through Netfilter OUTPUT and POSTROUTING hooks, the qdisc schedules the packet, and the driver enqueues it on the NIC TX ring for DMA transmission. TSO offloads TCP segmentation to the NIC hardware, sending one large sk_buff instead of many small ones.
- ✓ Zero-copy techniques bypass the user/kernel data copy. sendfile() splices data from the page cache directly into sk_buffs using page references instead of memcpy. MSG_ZEROCOPY (kernel 4.14+) lets send() reference user-space pages directly. io_uring can batch network operations to amortize syscall overhead. Each technique trades complexity for throughput at high data rates.
Common Pitfalls
- ✗ Mistake: Assuming packet drops are always a network problem. Reality: The most common drop point is the NIC RX ring buffer overflow, visible in ethtool -S as rx_missed_errors. The NIC filled the ring via DMA faster than NAPI could drain it. Fix with larger ring buffers (ethtool -G) and RSS to spread interrupt load.
- ✗ Mistake: Tuning TCP buffer sizes to maximum values for all workloads. Reality: Each socket can auto-tune up to tcp_rmem[2] (default 6 MB). With 50,000 connections, that is 300 GB of theoretical buffer allocation. The kernel enters tcp_memory_pressure and starts dropping segments. Set appropriate maximums for the workload.
- ✗ Mistake: Ignoring softirq processing time. Reality: NAPI processes packets in softirq context, which has a time budget (netdev_budget_usecs, default 2 ms). If the budget expires, remaining packets stay in the ring buffer until the next cycle. Under load, this adds latency that looks like application slowness but is entirely in the kernel RX path.
- ✗ Mistake: Expecting sendfile() to always be faster than read()+send(). Reality: sendfile() avoids one copy but only works from a file descriptor to a socket. If the data needs modification (compression, encryption in user space), the copy is unavoidable. TLS via kTLS can push encryption into the kernel to preserve the zero-copy path.
Reference
In One Line
Seven subsystems between the wire and the application -- ring buffer, NAPI, sk_buff, IP, TCP, socket lookup, copy_to_user -- and every tuning knob maps to exactly one of them.