TCP Tuning & Congestion Control
Mental Model
A garden hose stretched between two houses a block apart. Water pressure is bandwidth; hose length is latency. The total water the hose holds at any instant depends on both diameter and length -- that is the bandwidth-delay product. Open the faucet only halfway and the hose is never full, no matter how wide it is. Pinch the nozzle at the far end and water backs up regardless of pressure. Both ends have to match the hose capacity, or the pipe sits partially empty.
The Problem
A 10 Gbps cross-datacenter link pushes only 500 Mbps per connection because the socket buffers are too small for the path: the default 6 MB tcp_rmem max alone caps a 20ms RTT path at 2.4 Gbps, and an application-set SO_RCVBUF drags it far lower. Redis processes a GET in 0.1ms, but the 20-byte response reaches the client 200ms later -- Nagle buffering holding it hostage. CUBIC treats every drop on a 1% loss WAN link as congestion and cuts the congestion window each time, collapsing a 1 Gbps flow to 100 Mbps. And every new connection starts at initcwnd=10 (14.6 KB), needing three RTTs of slow start just to deliver 100 KB -- on a 100ms link, that is 400ms, handshake included, before the page even finishes loading.
Architecture
A 10 Gbps link between two datacenters gets provisioned. iperf3 runs. A single connection pushes 500 Mbps. Not 10 Gbps. Not even close.
Or this: a Redis command takes 0.1 milliseconds to process. The response reaches the client 200 milliseconds later. A 2000x penalty. Not from the application code. From the TCP stack.
The default Linux TCP configuration was designed for a different era. And on modern networks, it's leaving enormous performance on the table.
What Actually Happens
TCP throughput is governed by a simple formula: sending rate = min(cwnd, rwnd) / RTT.
The congestion window (cwnd) is how much unacknowledged data the sender is allowed to have in flight. The receive window (rwnd) is how much data the receiver can buffer. The smaller one wins. And neither can exceed the actual BDP of the link, or packets pile up in queues.
On a 10 Gbps link with 20ms RTT, BDP = 25 MB. The TCP window must be at least 25 MB to fill the pipe. Linux's default tcp_rmem max is 6 MB. That limits throughput to 6 MB / 0.02s = 2.4 Gbps. On a 10 Gbps link. Just from buffer sizing.
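To make the formula concrete, here is a minimal C sketch of the same calculation, using the numbers above (10 Gbps, 20ms RTT, 6 MB buffer):

```c
#include <stdio.h>

int main(void) {
    double bandwidth_bps = 10e9;  /* 10 Gbps link */
    double rtt_s         = 0.020; /* 20 ms RTT */
    double buf_bytes     = 6e6;   /* default tcp_rmem max, ~6 MB */

    /* BDP: bytes that must be in flight to keep the pipe full. */
    double bdp_bytes = bandwidth_bps / 8.0 * rtt_s;

    /* Ceiling imposed by the buffer: window / RTT. */
    double ceiling_bps = buf_bytes * 8.0 / rtt_s;

    printf("BDP: %.0f MB\n", bdp_bytes / 1e6);          /* 25 MB */
    printf("Ceiling: %.1f Gbps\n", ceiling_bps / 1e9);  /* 2.4 Gbps */
    return 0;
}
```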
This is where things break for most people.
Three layers of tuning. Buffer sizes (SO_SNDBUF/SO_RCVBUF, tcp_rmem/tcp_wmem) determine the maximum window. Congestion control (CUBIC vs BBR vs DCTCP) determines how fast the window grows and how it responds to signals. Latency knobs (TCP_NODELAY, TCP_QUICKACK, TCP_CORK) control when data actually leaves the buffer.
Under the Hood
Congestion control: CUBIC vs BBR. CUBIC (default since Linux 2.6.19) is loss-based. It grows cwnd following a cubic function until a packet drops, then cuts cwnd by 30%. This works well on clean links but falls apart on lossy networks like WiFi, cellular, or transcontinental paths, where random loss gets misinterpreted as congestion.
BBR (Google, Linux 4.9+) takes a fundamentally different approach. It periodically probes for maximum bandwidth (increases send rate until delivery rate plateaus) and minimum RTT (briefly reduces rate to drain queues). Then it sets cwnd = estimated BDP. BBR doesn't react to individual packet losses. On a 100 Mbps link with 1% random loss, CUBIC stabilizes around 10 Mbps. BBR maintains close to 100 Mbps.
Google deployed BBR on YouTube servers. The result: 4% bandwidth improvement globally, 14% in developing regions with lossy last-mile connections.
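The sysctls shown later switch the default algorithm for the whole host; a single socket can also opt in through the TCP_CONGESTION socket option. A minimal sketch, assuming a Linux host with the bbr module loaded and allowed:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>

/* Opt one socket into BBR. Fails (e.g. ENOENT) if bbr is not
 * loaded or not listed in tcp_allowed_congestion_control. */
int use_bbr(int fd)
{
    const char algo[] = "bbr";
    return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION,
                      algo, strlen(algo));
}
```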
The Nagle-delayed-ACK trap. Nagle's algorithm buffers small writes, waiting for an ACK before sending. Delayed ACKs hold ACKs for up to 200ms, hoping to piggyback them on data. Now put them together: the sender writes a small request. Nagle holds it, waiting for ACK. The receiver delays the ACK for 200ms, waiting for data. The request sits in a buffer for 200ms before it moves.
This is the most common cause of unexplained latency in interactive protocols. TCP_NODELAY on the sender disables Nagle. TCP_QUICKACK on the receiver disables delayed ACKs.
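Both fixes are one-line setsockopt calls on a connected Linux socket. A minimal sketch; note that TCP_QUICKACK is not a permanent flag:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Sender side: disable Nagle so small writes leave immediately. */
int disable_nagle(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}

/* Receiver side: disable delayed ACKs. On Linux the kernel can
 * clear this flag again, so latency-critical receivers typically
 * re-set it after each recv(). */
int enable_quickack(int fd)
{
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_QUICKACK, &one, sizeof(one));
}
```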
Slow start and initcwnd. New connections start cautiously. With initcwnd=10 and MSS=1460, the first RTT can send 14.6 KB. After that: 29.2 KB, 58.4 KB, 116.8 KB -- doubling each RTT. A 100 KB web page needs three RTTs to deliver, four counting the connection handshake. On a 100ms link, that's 400ms, most of it spent in slow start.
Google uses initcwnd=10 for web-facing traffic and initcwnd=32+ for internal datacenter traffic. On known-good links, increasing it safely is straightforward.
The autotuning trap. Linux dynamically adjusts receive buffer sizes based on measured bandwidth and RTT, scaling up to tcp_rmem[2]. This works great -- until SO_RCVBUF is manually set via setsockopt(). That single call disables autotuning for that socket, locking the buffer to the chosen value. When manual tuning is necessary, change the sysctl maximums and let autotuning do its job.
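A minimal sketch of the trap: reading SO_RCVBUF is harmless, but writing it is what disables autotuning for the socket:

```c
#include <stdio.h>
#include <sys/socket.h>

/* Reading is safe -- the kernel reports the (doubled) buffer size. */
void show_rcvbuf(int fd)
{
    int val;
    socklen_t len = sizeof(val);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &val, &len);
    printf("rcvbuf (doubled by kernel): %d bytes\n", val);
}

/* Writing locks the buffer: autotuning is off for this socket and
 * the window can never grow past this value, whatever the BDP. */
void lock_rcvbuf(int fd)
{
    int req = 1 << 20; /* ask for 1 MB; kernel books ~2 MB */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &req, sizeof(req));
}
```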
Common Questions
A 10 Gbps link shows only 500 Mbps for a single TCP connection. What's wrong?
Almost certainly a buffer problem. With 20ms RTT, BDP = 10 Gbps * 0.02s = 25 MB. Default tcp_rmem max is 6 MB, limiting throughput to 2.4 Gbps. If the application set SO_RCVBUF to 64 KB, throughput is capped at 25.6 Mbps. Fix: increase tcp_rmem max to 32 MB and make sure SO_RCVBUF isn't manually set.
Why does BBR require the 'fq' qdisc?
BBR paces packets precisely -- for example, one packet every 12 microseconds for 1 Gbps. The fq (Fair Queue) scheduler holds packets in per-flow queues and releases them at the specified rate. Without fq, packets go out in bursts (pfifo_fast just dequeues FIFO), overwhelming buffers and causing exactly the bufferbloat BBR is designed to prevent.
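Relatedly, fq honors a per-socket pacing cap through SO_MAX_PACING_RATE (bytes per second). A minimal sketch; the 1 Gbps cap is illustrative:

```c
#include <sys/socket.h>

/* Cap this socket's send pacing at ~1 Gbps (125 MB/s). The limit
 * is enforced by a pacing-aware qdisc such as fq. */
int cap_pacing(int fd)
{
    unsigned int rate = 125000000; /* bytes per second */
    return setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE,
                      &rate, sizeof(rate));
}
```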
Should TCP_NODELAY always be enabled?
For interactive protocols (HTTP, gRPC, Redis, databases): absolutely yes. Without it, small writes get buffered for up to 200ms. The only case where TCP_NODELAY hurts is bulk transfer with many tiny write() calls -- each becomes a separate TCP segment, increasing overhead. For bulk transfers, use writev() or TCP_CORK instead.
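A minimal sketch of the writev() approach -- hdr and body are illustrative buffers, not a real protocol:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* One writev() becomes one (or a few) full-size segments, instead
 * of a separate undersized segment per tiny write() when
 * TCP_NODELAY is on. */
ssize_t send_response(int fd, const char *hdr, const char *body)
{
    struct iovec iov[2] = {
        { .iov_base = (void *)hdr,  .iov_len = strlen(hdr)  },
        { .iov_base = (void *)body, .iov_len = strlen(body) },
    };
    return writev(fd, iov, 2);
}
```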
What is DCTCP and when is it the right choice?
DCTCP uses ECN (Explicit Congestion Notification) marks from switches instead of packet loss. When a switch queue exceeds a threshold, it marks packets. DCTCP reduces cwnd in proportion to the fraction of packets marked -- a 1% marking rate produces a cut on the order of 1%, not the fixed 30-50% cut of loss-based algorithms like CUBIC. This keeps datacenter queues short and latency low. Used in Azure, Google, and other cloud datacenter networks with ECN-capable switches.
How Technologies Use This
Cross-datacenter Kafka replication crawls at 200 Mbps on a 1 Gbps link with 50ms RTT. The link is not saturated, the brokers are not CPU-bound, and iperf3 between the same hosts shows the same bottleneck. A single random packet drop sharply cuts throughput, and recovery takes seconds.
The bandwidth-delay product is 1 Gbps times 50ms, which equals 6.25 MB, but the default tcp_rmem max is only 6 MB. The receiver cannot advertise a large enough window to keep the pipe full, so the sender stalls waiting for window updates. CUBIC congestion control makes it worse: it interprets any packet loss as congestion and cuts the congestion window by 30% each time, even though random loss on WAN links is inevitable and not a congestion signal.
Increase socket buffers to at least the BDP so the receiver can advertise a window large enough to fill the pipe. Switch from CUBIC to BBR, which measures actual bandwidth and RTT independently and maintains close to line rate even with 1-2% packet loss. After tuning, replication throughput typically jumps from 200 Mbps to 800+ Mbps on the same link.
Nginx serving 50,000 responses per second shows 200ms of unexplained latency on small HTTP redirects. The server processes each redirect in microseconds, but the 200-byte response takes 200ms to reach the client. Large file downloads, oddly, perform fine.
Nagle's algorithm is the culprit. It holds small writes in the send buffer, waiting for either an ACK or enough data to fill a full TCP segment before sending. A 200-byte redirect is too small to trigger a send, so it sits in the buffer for up to 200ms. Large file transfers are unaffected because they immediately fill segments.
Nginx enables TCP_NODELAY by default so every response leaves the buffer immediately, eliminating the Nagle delay. During sendfile() transfers of large files, Nginx temporarily enables TCP_CORK to batch the HTTP headers and the first chunk of file data into one optimal-size segment, avoiding the tiny-packet overhead of sending headers separately. This combination of TCP_NODELAY for interactive responses and TCP_CORK for bulk transfers gives Nginx both low latency and high throughput from the same socket.
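The cork-write-sendfile-uncork sequence maps directly onto the syscall API. The sketch below is a simplified illustration of the pattern, not Nginx's actual code:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Cork, write headers, stream the file, uncork: headers ride in
 * the same full-size segments as the first chunk of file data. */
int send_file_response(int sock, const char *headers,
                       int file_fd, off_t file_len)
{
    int on = 1, off = 0;
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    write(sock, headers, strlen(headers));
    off_t offset = 0;
    while (offset < file_len)
        if (sendfile(sock, file_fd, &offset, file_len - offset) <= 0)
            break;
    /* Uncorking flushes any final partial segment immediately. */
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}
```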
A web application issuing 2,000 small queries per second reports that every query takes 200ms at the client, even though PostgreSQL logs show 0.5ms execution time. The 400x discrepancy is consistent across all queries returning a few hundred bytes. Meanwhile, connection slots slowly leak as clients disappear without closing cleanly.
Nagle's algorithm is buffering each small result in the kernel send buffer, waiting for an ACK or enough additional data to fill a TCP segment. Since query results are small and no more data follows immediately, each result sits for up to 200ms before being sent. The leaking connection slots come from clients that crash or lose network without sending FIN, leaving backend processes blocked on recv() indefinitely while still counting against the default max_connections limit of 100.
PostgreSQL sets TCP_NODELAY on all client connections so results leave the buffer immediately, eliminating the Nagle penalty entirely. It also enables TCP keepalive probes with configurable intervals so that dead sessions are detected automatically and their connection slots reclaimed rather than held indefinitely.
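On Linux those probes are SO_KEEPALIVE plus three TCP-level options, which PostgreSQL exposes as tcp_keepalives_idle, tcp_keepalives_interval, and tcp_keepalives_count. A minimal sketch with illustrative timings:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Probe a silent peer after 60s idle, then every 10s, giving up
 * after 5 misses: a vanished client is detected in ~110s instead
 * of holding its connection slot forever. */
int enable_keepalive(int fd)
{
    int on = 1, idle = 60, intvl = 10, cnt = 5;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle,  sizeof(idle));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl));
    setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &cnt,   sizeof(cnt));
    return 0;
}
```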
Redis benchmarks show 200ms p99 latency on GET operations that the server processes in 0.1ms. The 2,000x discrepancy affects every command uniformly, regardless of key size or data type. At 100,000 operations per second, the cumulative wasted client wait time is staggering.
A 20-byte GET response is far too small to trigger an immediate TCP send. Without TCP_NODELAY, Nagle's algorithm holds it in the kernel send buffer, waiting for either an ACK from the previous segment or enough additional data to fill a full segment. Since Redis commands are independent and responses are tiny, neither condition is met quickly, and the response sits for up to 200ms.
Redis sets TCP_NODELAY unconditionally on every client connection, ensuring every response is pushed to the wire immediately. This is not a micro-optimization. It is the single setting that makes sub-millisecond p99 latency possible. At 100,000 operations per second, removing that 200ms Nagle delay saves 20,000 seconds of cumulative client wait time per second of real time.
A Node.js HTTP/2 server shows sluggish stream interleaving. Multiple streams on a single connection appear to stall for up to 200ms at a time, even though the server processes each frame in microseconds. Under peak connection bursts, new connections are silently dropped with no error visible in application logs.
HTTP/2 multiplexes many small HEADERS and DATA frames over a single connection. Nagle's algorithm batches these tiny frames together, waiting for ACKs before sending, which destroys the multiplexing benefit and adds up to 200ms of artificial delay per frame. The silent connection drops happen when the accept queue overflows: completed handshakes wait there until the single-threaded event loop calls accept(), and once the queue hits its limit (the listen() backlog, clamped by net.core.somaxconn -- 4096 by default on modern kernels) the kernel drops new connections without notifying the application.
Node.js enables TCP_NODELAY by default so each frame leaves the buffer immediately and streams interleave properly. For high-traffic servers, increase the listen backlog and net.core.somaxconn to match the peak connection burst rate, since connections beyond the limit are silently dropped with no application-level notification.
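The backlog itself is the second argument to listen(), and the kernel silently clamps it to net.core.somaxconn, so both must be raised together. A minimal sketch; the backlog value is illustrative:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Listener with a deep accept queue. Passing 65535 here does
 * nothing unless net.core.somaxconn is raised to match. */
int make_listener(unsigned short port, int backlog)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;
    return listen(fd, backlog) < 0 ? -1 : fd;
}
```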
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Buffer sizing | Inherits host tcp_rmem/tcp_wmem sysctls | Netty ChannelOption.SO_RCVBUF (disables autotuning) | socket.setRecvBufferSize() (disables autotuning) | net.Dialer.Control for setsockopt | Init containers can set sysctl if security context allows |
| Congestion control | Host kernel controls algorithm for all containers | N/A (OS-level) | N/A (OS-level) | Per-socket TCP_CONGESTION via Dialer.Control | Sidecar proxy (Envoy) inherits pod sysctl |
| Nagle disable | Application must set TCP_NODELAY | Netty auto-enables TCP_NODELAY | http.createServer sets TCP_NODELAY by default | net.TCPConn.SetNoDelay(true) default | Envoy enables TCP_NODELAY on all connections |
| Pacing / qdisc | Host qdisc applies to veth bridge | N/A (OS-level) | N/A (OS-level) | N/A (OS-level) | Pod-level tc qdisc via CNI plugin |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Hardware/NIC | Ring buffer size, interrupt coalescing, offload (TSO/GRO) |
| Kernel qdisc | fq (required for BBR pacing), pfifo_fast (legacy), fq_codel |
| Kernel TCP | tcp_congestion_ops, tcp_rmem/tcp_wmem autotuning, initcwnd |
| Syscall | setsockopt(TCP_NODELAY, TCP_CORK, TCP_CONGESTION, SO_RCVBUF) |
| Userspace | Connection pools, writev() for batched writes, sendfile() |
Design Rationale: Buffer autotuning was built to do the right thing without manual intervention -- it measures BDP and scales accordingly, which is why calling setsockopt to override it actually makes things worse unless the value is carefully measured. BBR decouples bandwidth estimation from loss response because random packet loss on a WAN link is not congestion, and treating it as congestion is what makes CUBIC collapse on lossy paths. Nagle made sense in the telnet era when every keystroke was a separate packet; for modern request-response protocols where each write is already a complete message, it is pure latency tax.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Single-connection throughput far below link speed | Receive buffer smaller than BDP | ss -ti dst <IP>; compare rwnd to the computed BDP |
| 200ms latency on small responses | Nagle buffering small writes | ss -ti for socket state; enable TCP_NODELAY |
| Throughput collapses after single loss event | CUBIC cutting cwnd on random loss | sysctl net.ipv4.tcp_congestion_control -- consider BBR |
| BBR enabled but no improvement | Missing fq qdisc for packet pacing | tc qdisc show dev eth0 -- switch to fq |
| New connections slow for first few RTTs | initcwnd=10 limits first flight to 14.6 KB | ip route show; check initcwnd, raise on trusted links |
| SO_RCVBUF set but throughput still low | Kernel doubles value but reserves half for metadata | getsockopt SO_RCVBUF -- actual usable is ~half of reported |
When to Use / Avoid
- Use when single-connection throughput is significantly below link capacity on high-latency paths
- Use when small-message latency is orders of magnitude higher than processing time (Nagle trap)
- Use when cross-datacenter or WAN replication runs far below iperf3 benchmarks
- Use when switching congestion control algorithms for lossy or datacenter networks
- Avoid when the bottleneck is application CPU, disk I/O, or upstream rate limiting -- TCP tuning cannot help there
- Avoid when connections are all local (loopback or same-rack) with sub-millisecond RTT -- defaults are fine
Try It Yourself
```sh
# Check current congestion control algorithm
sysctl net.ipv4.tcp_congestion_control

# Switch to BBR (requires fq scheduler)
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Show TCP buffer sizes (min, default, max in bytes)
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem

# View per-connection TCP details
ss -ti 'dst 10.0.0.5'

# Show available congestion control algorithms
sysctl net.ipv4.tcp_available_congestion_control

# Increase receive buffer max for high-BDP links
sysctl -w net.ipv4.tcp_rmem='4096 131072 16777216'
```
Debug Checklist
1. ss -ti dst <PEER_IP> | head -20
2. sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
3. sysctl net.ipv4.tcp_congestion_control net.core.default_qdisc
4. tc qdisc show dev eth0
5. nstat -az | grep -i 'retrans\|ecn\|loss'
6. ethtool -S eth0 | grep -i 'err\|drop\|pause'
Key Takeaways
- ✓ The throughput ceiling for any TCP connection is min(cwnd, rwnd) / RTT. If your buffer is 6 MB and your RTT is 20ms, you max out at 2.4 Gbps -- on a 10 Gbps link. This is the bandwidth-delay product (BDP = bandwidth * RTT), and it determines how big your buffers need to be.
- ✓ BBR measures actual bandwidth and RTT instead of panicking at packet loss. On lossy links (WiFi, cellular, transcontinental), BBR delivers 2-10x more throughput than CUBIC because it doesn't mistake random loss for congestion.
- ✓ TCP_NODELAY disables Nagle's algorithm, which batches small writes until an ACK arrives. Without it, a tiny Redis response sits in the buffer for up to 200ms. Every interactive protocol (HTTP, Redis, gRPC) should set it.
- ✓ Linux autotuning dynamically adjusts receive buffers up to tcp_rmem[2]. But here's the trap: setting SO_RCVBUF manually DISABLES autotuning for that socket. Only override buffer sizes if you've measured the optimal value.
- ✓ The initial congestion window (initcwnd=10 since Linux 3.0) means a new connection can only send 14.6 KB in the first RTT. A 100 KB web page needs three RTTs of slow start to deliver. On datacenter links, increasing initcwnd to 32-64 cuts this dramatically.
Common Pitfalls
- ✗ Mistake: small receive buffers on high-BDP links. Reality: a 64 KB buffer on a 100 Mbps / 100ms link limits throughput to 5 Mbps (64 KB / 0.1s) regardless of bandwidth. Calculate BDP first, then size buffers.
- ✗ Mistake: setting SO_SNDBUF/SO_RCVBUF without knowing the kernel doubles them. Reality: the kernel doubles the requested value to leave room for metadata (sk_buff overhead), so getsockopt() reports twice what you set and only about half of that reported number holds payload. And the call disables autotuning.
- ✗ Mistake: enabling BBR without setting net.core.default_qdisc=fq. Reality: BBR requires the fq (Fair Queue) scheduler to pace packets properly. With default pfifo_fast, BBR can't control inter-packet timing and loses its advantage.
- ✗ Mistake: enabling TCP_NODELAY but forgetting TCP_QUICKACK. Reality: delayed ACKs (default 200ms wait) can still add latency on the receiver side, especially in bidirectional protocols where the ACK would piggyback on data that hasn't arrived yet.
Reference
In One Line
Start with the BDP calculation, size buffers to fill the pipe, pick a congestion algorithm that matches the link, and set TCP_NODELAY on every socket that cares about latency.