Zero-Copy Networking (sendfile, splice)
Mental Model
A librarian asked to photocopy a chapter and fax it. Walk to the shelf, photocopy the pages, carry them back to the desk, feed them into the fax one by one. The content was never read or changed -- pure transport work. Now give the fax machine a long arm that reaches directly to the bookshelf, scans pages in place, and transmits them without any photocopying or carrying. The librarian never touches the pages. Content goes from shelf to wire in a single step.
The Problem
Nginx pushing 10 Gbps of static files burns 40% CPU -- not on request handling, but on copying. read() pulls data from the page cache into a user buffer, write() pushes it back into the socket buffer. Two CPU copies and two context switches per chunk, for data that was never inspected. A Kafka broker serving 2 GB/s of consumer reads through application buffers generates 4 GB/s of heap churn, triggering multi-second GC pauses that stall every consumer on the broker. HAProxy proxying TCP at 5 Gbps with read+write copies every byte into user space and back out again, for traffic it never even looks at.
Architecture
A web server is pushing gigabytes per second of static files. CPU usage is high. But not because the server is doing anything useful -- it is copying. The same bytes, over and over, from one buffer to another.
Disk to page cache. Page cache to user buffer. User buffer to socket buffer. Socket buffer to NIC. Four copies. Two of them are pure waste -- the user-space round trip where the application touches data it never looks at.
sendfile and splice exist to eliminate that round trip entirely.
What Actually Happens
In a traditional file-serving path with read() + write():
1. DMA copy: the disk controller DMAs data into the page cache. (Hardware, unavoidable.)
2. CPU copy: read() copies data from the page cache into the user-space buffer. (Waste.)
3. CPU copy: write() copies data from the user buffer into the socket buffer. (Waste.)
4. DMA copy: the NIC DMAs data from the socket buffer to the wire. (Hardware, unavoidable.)
That is 4 copies and 2 context switches for every chunk of file data. For a 1 MB file: 2 MB of unnecessary CPU copies, 2 context switches, and cache pollution from data the application never even looked at.
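For concreteness, here is the traditional path as a minimal C sketch (the function name and 64 KB buffer size are illustrative, and error handling is trimmed). Every loop iteration pays both CPU copies:

```c
#include <unistd.h>

/* Traditional file-to-socket transfer: every byte detours through buf. */
ssize_t copy_via_userspace(int file_fd, int sock_fd)
{
    char buf[64 * 1024];                    /* user-space bounce buffer */
    ssize_t n, total = 0;

    while ((n = read(file_fd, buf, sizeof buf)) > 0) {      /* copy 2: page cache -> buf */
        ssize_t off = 0;
        while (off < n) {
            ssize_t w = write(sock_fd, buf + off, n - off);  /* copy 3: buf -> socket buffer */
            if (w < 0)
                return -1;
            off += w;
        }
        total += n;
    }
    return n < 0 ? -1 : total;
}
```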
sendfile(out_fd, in_fd, offset, count) eliminates steps 2 and 3. It tells the kernel: "move data from this file directly to this socket, without going through my address space." The kernel transfers data from the page cache to the socket buffer internally. One syscall instead of two, and the payload never makes the round trip through user space.
With scatter-gather DMA (supported by most modern NICs), it gets even better. The page cache pages are not copied to the socket buffer at all. Instead, their physical addresses and lengths are placed in the NIC's DMA descriptor ring. The NIC reads directly from the page cache. Only the TCP/IP headers are copied into a small buffer. Zero CPU copies for the payload.
Check NIC support with: ethtool -k eth0 | grep scatter.
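A minimal sendfile() loop looks like this (the helper name is illustrative; like read and write, sendfile may transfer fewer bytes than requested, so it must be called in a loop):

```c
#include <sys/sendfile.h>
#include <sys/stat.h>

/* Send an entire regular file to a connected socket with no user-space buffer. */
int send_whole_file(int sock_fd, int file_fd)
{
    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    off_t offset = 0;                       /* the kernel advances this as it sends */
    while (offset < st.st_size) {
        ssize_t sent = sendfile(sock_fd, file_fd, &offset, st.st_size - offset);
        if (sent <= 0)
            return -1;                      /* production code would retry on EINTR/EAGAIN */
    }
    return 0;
}
```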
splice() generalizes zero-copy to any pair of file descriptors using a pipe as an intermediary. The pipe buffer holds references to pages, not data copies. So splice(file_fd, pipe_write_end) + splice(pipe_read_end, socket_fd) achieves the same result as sendfile.
But here is what splice can do that sendfile cannot: socket-to-socket transfer. sendfile requires a file input. splice works between any two fds as long as one is a pipe. HAProxy uses exactly this for zero-copy TCP proxying: splice from client socket into a pipe, splice from pipe to backend socket. The actual packet data never enters HAProxy's address space.
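Here is a sketch of that socket-to-socket pattern, assuming blocking sockets (the function name and 64 KB chunk size are illustrative): page references move from socket A into the pipe and from the pipe into socket B, while the payload stays in the kernel.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Forward bytes from one connected socket to another via a pipe, zero-copy. */
int forward(int from_sock, int to_sock)
{
    int pipefd[2];
    if (pipe(pipefd) < 0)
        return -1;

    for (;;) {
        /* Move up to 64 KB of page references from the source socket into the pipe. */
        ssize_t in = splice(from_sock, NULL, pipefd[1], NULL,
                            64 * 1024, SPLICE_F_MOVE | SPLICE_F_MORE);
        if (in <= 0)
            break;                          /* 0 = peer closed, <0 = error */

        /* Drain the pipe into the destination socket; may take several calls. */
        ssize_t left = in;
        while (left > 0) {
            ssize_t out = splice(pipefd[0], NULL, to_sock, NULL,
                                 left, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (out <= 0)
                goto done;
            left -= out;
        }
    }
done:
    close(pipefd[0]);
    close(pipefd[1]);
    return 0;
}
```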
tee() is the companion that duplicates pipe data without consuming it. The source pipe's data remains available for another splice. This enables fan-out: one input stream forwarded to multiple destinations.
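A sketch of tee()-based fan-out (helper name and chunk size illustrative; short transfers are not retried here for brevity): duplicate the page references sitting in the source pipe into a second pipe, then splice each pipe to its own destination.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Duplicate one pipe's contents to two destinations without copying the payload. */
int fan_out(int src_pipe_rd, int dst1_fd, int dst2_fd)
{
    int copy[2];
    ssize_t n;
    if (pipe(copy) < 0)
        return -1;

    /* tee() duplicates up to 64 KB of page references; the source pipe keeps its data. */
    while ((n = tee(src_pipe_rd, copy[1], 64 * 1024, 0)) > 0) {
        splice(src_pipe_rd, NULL, dst1_fd, NULL, n, SPLICE_F_MOVE);  /* consumes the source */
        splice(copy[0], NULL, dst2_fd, NULL, n, SPLICE_F_MOVE);      /* consumes the duplicate */
    }
    close(copy[0]);
    close(copy[1]);
    return n < 0 ? -1 : 0;
}
```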
Under the Hood
TCP_CORK (tcp_nopush) is the natural partner for sendfile. Without it, the HTTP headers (a small write()) and the file body (sendfile) might become separate TCP segments. TCP_CORK tells the kernel: "hold off sending until I have a full segment or I uncork." Nginx's tcp_nopush on enables this -- the kernel waits for sendfile data before flushing the segment containing the headers.
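The cork pattern, sketched at the syscall level (the response header and helper name are illustrative): cork the socket, write the small header, sendfile the body, then uncork so the kernel flushes full-sized segments.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <sys/sendfile.h>
#include <sys/socket.h>

/* Send a small HTTP header plus a file body as well-packed TCP segments. */
void send_response(int sock_fd, int file_fd, off_t file_size)
{
    int on = 1, off = 0;
    char hdr[128];
    int hlen = snprintf(hdr, sizeof hdr,
                        "HTTP/1.1 200 OK\r\nContent-Length: %lld\r\n\r\n",
                        (long long)file_size);

    setsockopt(sock_fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);   /* hold partial segments */
    send(sock_fd, hdr, hlen, 0);                                  /* small header write */

    off_t offset = 0;
    while (offset < file_size)                                    /* body straight from the page cache */
        if (sendfile(sock_fd, file_fd, &offset, file_size - offset) <= 0)
            break;

    setsockopt(sock_fd, IPPROTO_TCP, TCP_CORK, &off, sizeof off); /* uncork: flush the remainder */
}
```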
vmsplice maps user-space pages into a pipe buffer without copying. Combined with splice to a socket, this enables zero-copy send from user-allocated memory. The caveat: those pages must not be modified until the kernel finishes the DMA. Unlike MSG_ZEROCOPY, vmsplice provides no completion notification, leaving lifetime management entirely to the caller.
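A minimal vmsplice sketch under those caveats (helper name illustrative; the buffer must stay untouched until the data is on the wire, and len must fit within the pipe's capacity):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hand user pages to a pipe, then move the page references on to a socket. */
ssize_t zero_copy_send(int sock_fd, int pipe_rd, int pipe_wr,
                       const void *buf, size_t len)
{
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };

    ssize_t queued = vmsplice(pipe_wr, &iov, 1, 0);   /* map pages into the pipe, no copy */
    if (queued <= 0)
        return -1;

    /* Caller must not modify buf until the NIC has finished with these pages. */
    return splice(pipe_rd, NULL, sock_fd, NULL, queued, SPLICE_F_MOVE);
}
```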
copy_file_range (Linux 4.5+) handles file-to-file copies. On filesystems that support it (Btrfs, NFS, CIFS), the kernel can do a server-side copy with zero data passing through the client. An NFS server copies bytes directly on storage, with zero network traffic.
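A minimal sketch of the call (helper name illustrative; needs Linux 4.5+ and a glibc new enough to expose the wrapper):

```c
#define _GNU_SOURCE
#include <unistd.h>

/* Copy size bytes between two regular files without routing them through user space. */
int kernel_copy(int src_fd, int dst_fd, off_t size)
{
    off_t done = 0;
    while (done < size) {
        ssize_t n = copy_file_range(src_fd, NULL, dst_fd, NULL, size - done, 0);
        if (n <= 0)
            return -1;          /* real code falls back to read/write on EXDEV or ENOSYS */
        done += n;
    }
    return 0;
}
```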
MSG_ZEROCOPY (kernel 4.14+) eliminates even the user-to-kernel copy for send(). The kernel pins the user-space pages and puts them directly in the NIC's DMA ring. Completion notifications arrive on the socket's error queue. But the page pinning overhead makes it worthwhile only for sends above ~10 KB.
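A sketch of the MSG_ZEROCOPY flow, assuming Linux 4.14+ headers (helper name illustrative; real code would use poll() and batch notifications instead of spinning on the error queue):

```c
#include <errno.h>
#include <sys/socket.h>

/* Send a large buffer without the user-to-kernel copy, then wait for completion. */
int send_zerocopy(int sock_fd, const void *buf, size_t len)
{
    int one = 1;
    if (setsockopt(sock_fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof one) < 0)
        return -1;                                    /* kernel too old or unsupported socket */

    if (send(sock_fd, buf, len, MSG_ZEROCOPY) < 0)    /* kernel pins buf's pages for DMA */
        return -1;

    /* The completion arrives on the error queue; buf must stay untouched until then. */
    char ctrl[128];
    struct msghdr msg = { .msg_control = ctrl, .msg_controllen = sizeof ctrl };
    while (recvmsg(sock_fd, &msg, MSG_ERRQUEUE) < 0 && errno == EAGAIN)
        ;                                             /* busy-wait only for brevity */
    return 0;
}
```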
Common Questions
How many copies does sendfile actually save?
Traditional read+write: 4 copies (2 DMA + 2 CPU) and 2 syscalls per chunk. sendfile without scatter-gather: 3 copies (2 DMA + 1 CPU) in a single syscall. sendfile with scatter-gather: 2 copies (both DMA, zero CPU). The eliminated CPU copies and the halved syscall and context-switch count are where the real performance wins come from.
Why does Kafka use sendfile for consumer reads?
Kafka stores messages as append-only log files. When a consumer reads, Kafka calls FileChannel.transferTo() (Java's sendfile wrapper) to send log segments directly from the page cache to the consumer's socket. Since consumers typically read near the tail of the log, data is already hot in the page cache. sendfile avoids copying GB/s of data through the JVM heap, which would destroy garbage collector performance. This is one of Kafka's core performance secrets.
How does splice enable zero-copy proxying?
A reverse proxy receives data on socket A and forwards it to socket B. With read+write: socket A buffer to user buffer to socket B buffer (2 CPU copies, 2 syscalls). With splice: socket A buffer to pipe (page reference moved, not data) to socket B buffer (page reference moved again). The CPU never copies the actual bytes. HAProxy uses this in TCP mode. The pipe acts as a zero-copy transport layer between socket buffers.
When is sendfile NOT worth using?
Four cases. (1) When the data must be modified before sending (template rendering, compression) -- sendfile sends raw file content. (2) For very small files (< 1 KB) where syscall overhead dominates. (3) When the file is not in the page cache -- sendfile still triggers disk I/O and blocks; for cold files, io_uring is better. (4) For non-file sources -- sendfile requires a file input fd.
How Technologies Use This
A Kafka broker pushing 2 GB/s of consumer reads hits multi-second GC pauses that stall all consumers. Throughput collapses periodically as the JVM garbage collector struggles with roughly 4 GB/s of heap churn from copying log data through application buffers.
The problem is that the standard read-write path allocates, copies twice, and garbage collects every byte. Kafka avoids this entirely with FileChannel.transferTo, which maps to sendfile at the OS level. Log data moves directly from the page cache to the consumer socket without entering the JVM address space. Since consumers typically read near the tail of the log, over 95% of the data is already hot in the page cache.
Leave Kafka's default sendfile behavior enabled and ensure the page cache is large enough to hold recent log segments. The transfer becomes a DMA scatter-gather operation with zero CPU copies and zero GC overhead, eliminating the heap churn that causes pause storms.
An Nginx server doing nothing but serving static files at 10 Gbps is burning 40% CPU. The server is not processing requests or running application logic. It is just copying bytes between buffers it never inspects.
Without sendfile, each file chunk is copied from the page cache into a user-space buffer (CPU copy 1), then from the buffer into the socket buffer (CPU copy 2). That is two unnecessary copies and two context switches per chunk for data Nginx never even looks at. With the sendfile directive enabled, Nginx transfers data directly from the page cache to the NIC via scatter-gather DMA, with zero CPU copies for the payload.
Ensure sendfile is on in nginx.conf and pair it with tcp_nopush (TCP_CORK) to coalesce HTTP headers and file body into optimally sized TCP segments. The result is roughly 50% lower CPU usage at the same throughput, letting a single server saturate a 25 Gbps link.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| File-to-socket zero-copy | N/A (application-level concern) | FileChannel.transferTo() wraps sendfile | fs.createReadStream().pipe() does NOT use sendfile (copies through JS) | syscall.Sendfile() available | Nginx ingress uses sendfile for static assets |
| Socket-to-socket zero-copy | N/A | N/A (no splice wrapper in std lib) | N/A (no splice in Node.js) | syscall.Splice() available | HAProxy ingress uses splice for TCP proxy |
| Scatter-gather DMA | Host NIC capability shared by all containers | Transparent when OS supports it | Transparent when OS supports it | Transparent when OS supports it | Node NIC capability applies to all pods |
| MSG_ZEROCOPY | N/A | Netty supports SO_ZEROCOPY | N/A | golang.org/x/net supports MSG_ZEROCOPY | N/A |
Stack Layer Mapping
| Layer | Component |
|---|---|
| NIC hardware | DMA engine, scatter-gather descriptor ring |
| Kernel socket | Socket buffer (sk_buff) holds page references or header-only descriptors |
| Kernel VFS | Page cache pages referenced by sendfile, splice pipe_buffer |
| Syscall | sendfile(), splice(), tee(), vmsplice(), send(MSG_ZEROCOPY) |
| Userspace | Application chooses zero-copy path; data never enters process address space |
Design Rationale: read+write is the default because applications often need to inspect or transform data in flight. Zero-copy APIs exist for the surprisingly common case where they do not -- serving static files, proxying TCP streams, replicating log segments. In those scenarios, the user-space round trip is pure waste. The pipe buffer in splice is the clever generalization: a universal page-reference transport that connects any two kernel subsystems without caring what sits on either end.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| 40% CPU on static file serving with no processing | read+write copying data through user space unnecessarily | strace -e sendfile -p <PID> -- if no sendfile calls, enable it |
| GC pauses during high-throughput log consumption | Data copied through JVM heap instead of sendfile | Verify the FileChannel.transferTo() path is in use; TLS-encrypted listeners force data through the heap |
| sendfile enabled but CPU copies still high | NIC lacks scatter-gather DMA support | ethtool -k eth0 \| grep scatter |
| splice returns EINVAL | Neither fd is a pipe (splice requires at least one pipe fd) | Use sendfile for file-to-socket; create pipe intermediary for socket-to-socket |
| MSG_ZEROCOPY slower than regular send for small messages | Page pinning overhead exceeds copy cost under ~10 KB | Only use MSG_ZEROCOPY for payloads >10 KB |
| Context switches still high despite sendfile | Application making other syscalls between sendfile calls | perf stat -e context-switches and audit syscall pattern |
When to Use / Avoid
- Use sendfile() when serving static files from disk to a socket without modification
- Use splice() when proxying data between two sockets (reverse proxy) without inspecting the payload
- Use tee() when fanning out one input stream to multiple destinations
- Use MSG_ZEROCOPY for large sends (>10 KB) from user-space buffers where page pinning overhead is amortized
- Avoid when data must be modified before sending (compression, templating, encryption in user space)
- Avoid sendfile for very small files (<1 KB) where syscall overhead dominates the copy savings
Try It Yourself
```bash
# Check if Nginx is using sendfile
strace -e sendfile -p $(pidof nginx) -f 2>&1 | head -5

# Benchmark sendfile vs read+write
dd if=/dev/zero of=/tmp/testfile bs=1M count=100

# Use python to test sendfile
python3 -c "import os,socket; s=socket.socket(); s.connect(('localhost',80)); os.sendfile(s.fileno(), open('/tmp/testfile','rb').fileno(), 0, 100*1024*1024)"

# Check if NIC supports scatter-gather (for true zero-copy)
ethtool -k eth0 | grep scatter

# Monitor context switches during file serving
perf stat -e context-switches -p $(pidof nginx) -- sleep 10

# Check Kafka's use of sendfile (Java transferTo)
strace -e sendfile -p $(pidof java) -f 2>&1 | grep sendfile | head
```
Debug Checklist
1. strace -e sendfile,splice,tee,vmsplice -p <PID> -f 2>&1 | head -10
2. ethtool -k eth0 | grep scatter
3. perf stat -e context-switches -p <PID> -- sleep 10
4. cat /proc/<PID>/io
5. grep sendfile /etc/nginx/nginx.conf
6. ethtool -k eth0 | grep tx-checksum
Key Takeaways
- ✓Traditional read()+write() copies data FOUR times and requires TWO syscalls per chunk. sendfile eliminates the user-space round trip: ONE syscall, and the payload is never copied into or out of the process's address space.
- ✓With scatter-gather DMA (most modern NICs), sendfile achieves true zero CPU copy: page cache pages go directly into the NIC's DMA descriptor ring. Only TCP/IP headers are copied, not the payload. Check support with ethtool -k <iface> | grep scatter.
- ✓splice() is more powerful than sendfile -- it works between any two fds as long as one is a pipe. For socket-to-socket proxying (reverse proxy), splice from socket A into a pipe, then from the pipe to socket B. The actual data bytes never touch your process's address space.
- ✓tee() duplicates pipe data without consuming it -- the source pipe's data remains available for another splice. This enables fan-out patterns where one input stream is forwarded to multiple destinations.
- ✓MSG_ZEROCOPY (kernel 4.14+) eliminates even the user-to-kernel copy for send(). The kernel pins user-space pages and DMAs directly from them. But the overhead of page pinning makes it worthwhile only for sends above ~10 KB.
Common Pitfalls
- ✗Mistake: Trying to use sendfile() for file-to-file copies. Reality: On Linux < 2.6.33, sendfile requires the output fd to be a socket. For file-to-file copies, use splice() with a pipe as intermediate, or copy_file_range() (Linux 4.5+).
- ✗Mistake: Assuming sendfile works with all input types. Reality: sendfile requires a regular file (or mmap-able) input. It does not work with pipes or sockets as input. For socket-to-socket transfer, use splice().
- ✗Mistake: Not looping on splice with SPLICE_F_NONBLOCK. Reality: Like read/write, splice may transfer fewer bytes than requested. Always loop until the desired count or EOF.
- ✗Mistake: Using MSG_ZEROCOPY for small sends. Reality: Page pinning, reference counting, and error queue notification overhead exceeds the copy cost for payloads under ~10 KB. Only use it for bulk transfers.
Reference
In One Line
sendfile for file-to-socket, splice for socket-to-socket -- both cut out the user-space round trip where CPU was being burned copying data that nobody ever looked at.