TCP Congestion Control
Congestion control algorithms dynamically adjust TCP's sending rate to maximize throughput without collapsing the network.
The Problem
How should a TCP sender determine its transmission rate when it has no direct visibility into network congestion, router queue depths, or competing traffic?
Mental Model
Like driving on a highway — start slow, speed up until brake lights appear, then back off. BBR is like having GPS traffic data instead of just watching brake lights.
How It Works
In 1986, the internet nearly died. Van Jacobson observed that throughput between Lawrence Berkeley Labs and UC Berkeley — just 400 yards apart — had collapsed from 32 kbit/s to 40 bit/s, roughly a thousandth of the link's capacity. Packets were being sent faster than routers could handle, causing massive loss, which triggered retransmissions, which caused more congestion, which caused more loss. This death spiral is called congestion collapse.
Jacobson's fix, published in 1988, gave TCP a congestion window (cwnd) — a sender-side limit on unacknowledged data, independent of the receiver's window. The sender transmits at most min(cwnd, rwnd) bytes before waiting for ACKs. The congestion control algorithm decides how cwnd evolves over time.
The Four Phases
1. Slow Start — Despite its name, slow start is exponential growth. cwnd starts at the initial window (typically 10 MSS on Linux, thanks to RFC 6928) and doubles every RTT. On a 100ms RTT link with 1460-byte MSS, that goes from ~14KB to 14MB in just 10 RTTs (1 second). Slow start continues until cwnd reaches ssthresh or a loss event occurs.
2. Congestion Avoidance — Once cwnd exceeds ssthresh, TCP switches to linear growth: add 1 MSS per RTT. This is the AIMD (Additive Increase, Multiplicative Decrease) phase — carefully probing for more bandwidth, one segment at a time.
3. Fast Retransmit — When the sender receives three duplicate ACKs (four ACKs for the same sequence number), it assumes a packet was lost and retransmits immediately without waiting for the RTO timer. This saves an entire timeout period.
4. Fast Recovery — After fast retransmit, instead of resetting cwnd to 1 (slow start), the sender sets ssthresh = cwnd/2 and cwnd = ssthresh + 3 MSS. Each additional duplicate ACK inflates cwnd by 1 MSS. When the retransmitted segment is ACKed, cwnd deflates back to ssthresh and enters congestion avoidance. This is multiplicative decrease without the catastrophic reset.
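The four phases above can be sketched as a small state machine. This is a simplified illustration of Reno-style behavior per RFC 5681, not the actual Linux implementation; window units are segments (MSS):

```python
# Simplified Reno congestion control state machine (illustrative sketch).
# Units are segments (MSS). Real stacks track far more state.

class RenoSender:
    def __init__(self, init_cwnd=10, ssthresh=64):
        self.cwnd = init_cwnd          # congestion window, in MSS
        self.ssthresh = ssthresh       # slow start threshold, in MSS
        self.dup_acks = 0
        self.in_recovery = False

    def on_new_ack(self):
        """A new (non-duplicate) cumulative ACK arrived."""
        if self.in_recovery:
            # Fast recovery ends: deflate cwnd back to ssthresh.
            self.cwnd = self.ssthresh
            self.in_recovery = False
        elif self.cwnd < self.ssthresh:
            self.cwnd += 1             # slow start: +1 MSS per ACK (~doubles per RTT)
        else:
            self.cwnd += 1 / self.cwnd # congestion avoidance: ~+1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        """Duplicate ACK: a segment ahead of this sequence number arrived."""
        self.dup_acks += 1
        if self.dup_acks == 3 and not self.in_recovery:
            # Fast retransmit + fast recovery: halve, then inflate by the
            # 3 segments the duplicate ACKs prove have left the network.
            self.ssthresh = max(int(self.cwnd) // 2, 2)
            self.cwnd = self.ssthresh + 3
            self.in_recovery = True
        elif self.in_recovery:
            self.cwnd += 1             # each further dup ACK inflates cwnd

    def on_timeout(self):
        """RTO fired: severe congestion, restart from a window of 1."""
        self.ssthresh = max(int(self.cwnd) // 2, 2)
        self.cwnd = 1
        self.dup_acks = 0
        self.in_recovery = False
```

Driving it with three duplicate ACKs from a cwnd of 10 gives ssthresh = 5 and an inflated cwnd of 8; the next new ACK deflates cwnd back to 5 and resumes congestion avoidance.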
# Watch congestion window changes in real-time
watch -n 0.5 "ss -ti dst 10.0.1.50 | grep -E 'cwnd|ssthresh|rtt'"
CUBIC: The Linux Default
Reno/NewReno's AIMD is conservative — it takes a long time to recover the window after a loss, especially on high-bandwidth, high-latency links. On a 10 Gbps link with 100ms RTT, the bandwidth-delay product is 125 MB. After a loss event halves the window, Reno takes ~43,000 RTTs (over an hour!) to recover.
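The recovery arithmetic is worth working through explicitly; a quick sanity check of the numbers above (illustrative values only):

```python
# Recovery time for Reno-style additive increase on a high-BDP path.
link_bps = 10e9          # 10 Gbps link
rtt = 0.100              # 100 ms round-trip time
mss = 1460               # bytes per segment

bdp_bytes = link_bps / 8 * rtt          # bandwidth-delay product: 125 MB
bdp_segments = bdp_bytes / mss          # ~85,600 segments

# After a loss halves the window, additive increase regains 1 MSS per
# RTT, so recovery takes roughly BDP/2 RTTs.
recovery_rtts = bdp_segments / 2
recovery_minutes = recovery_rtts * rtt / 60

print(f"BDP: {bdp_bytes / 1e6:.0f} MB ({bdp_segments:.0f} segments)")
print(f"Recovery: ~{recovery_rtts:.0f} RTTs = {recovery_minutes:.0f} minutes")
```

That works out to roughly 43,000 RTTs, or about 71 minutes, to climb back to full window.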
CUBIC (RFC 8312, Linux default since 2.6.19) replaces linear growth with a cubic function centered on the window size at the last loss event (W_max). The key insight: grow quickly when far from W_max, slow down when approaching it, then grow aggressively past it to probe for more bandwidth.
W(t) = C × (t - K)³ + W_max
Where:
C = scaling factor (0.4)
K = time to reach W_max from the reduced window
t = time since last loss event
CUBIC's behavior is independent of RTT, which means flows with different RTTs get fairer bandwidth allocation than Reno. This is critical for internet-scale fairness.
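Plugging in RFC 8312's constants (C = 0.4, multiplicative-decrease factor β = 0.7, which gives K = ∛(W_max·(1−β)/C)) makes the curve's shape concrete:

```python
# CUBIC window growth after a loss event, per RFC 8312 constants.
C = 0.4            # scaling factor
BETA = 0.7         # multiplicative decrease: window drops to 0.7 * W_max

def cubic_k(w_max):
    """K: time (seconds) to climb back to w_max from the reduced window."""
    return ((w_max * (1 - BETA)) / C) ** (1 / 3)

def cubic_window(t, w_max):
    """Congestion window (in MSS) t seconds after the last loss event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max

w_max = 100.0      # window at the last loss, in MSS
k = cubic_k(w_max)                   # ~4.2 s for these numbers
print(cubic_window(0, w_max))        # ~70 MSS: just after the 0.7x cut
print(cubic_window(k, w_max))        # 100 MSS: plateau at W_max
print(cubic_window(k + 3, w_max))    # ~110.8 MSS: probing past W_max
```

The concave-then-convex shape is the whole point: fast growth when far below W_max, a cautious plateau near it, then aggressive probing beyond it.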
BBR: The Paradigm Shift
Traditional algorithms (Reno, CUBIC) are loss-based — they increase the sending rate until they see packet loss, then back off. The problem: in modern networks with deep buffers, loss doesn't happen until the buffers are full. By that point, hundreds of milliseconds of queuing delay have accumulated. This is bufferbloat.
BBR (Bottleneck Bandwidth and Round-trip time), developed at Google, takes a fundamentally different approach. Instead of reacting to loss, BBR models the network by continuously estimating two parameters:
- BtlBw (bottleneck bandwidth): maximum delivery rate observed over a sliding window
- RTprop (minimum RTT): minimum RTT observed over a longer sliding window
The optimal operating point is BtlBw × RTprop — the bandwidth-delay product. BBR paces packets at exactly this rate, keeping the pipe full without filling buffers.
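The model reduces to a few lines. A sketch with the windowed max/min filters simplified to plain max and min (real BBR windows BtlBw over roughly 10 RTTs and RTprop over roughly 10 seconds):

```python
# Sketch of BBR's core model: estimate BtlBw and RTprop from delivery
# samples, then derive the BDP operating point. Simplified: real BBR
# uses sliding-window max/min filters, not all-time max/min.

def bbr_operating_point(samples):
    """samples: list of (delivery_rate_bps, rtt_seconds) measurements."""
    btl_bw = max(rate for rate, _ in samples)   # max delivery rate seen
    rt_prop = min(rtt for _, rtt in samples)    # min RTT seen
    bdp_bytes = btl_bw / 8 * rt_prop            # bandwidth-delay product
    return btl_bw, rt_prop, bdp_bytes

# Example: 50 Mbps bottleneck, 40 ms base RTT. Queuing inflates some
# RTT samples; loss deflates some rate samples — the filters reject both.
samples = [(48e6, 0.045), (50e6, 0.040), (43e6, 0.090), (49e6, 0.041)]
btl_bw, rt_prop, bdp = bbr_operating_point(samples)
print(f"BtlBw={btl_bw/1e6:.0f} Mbps, RTprop={rt_prop*1000:.0f} ms, BDP={bdp/1000:.0f} KB")
# → BtlBw=50 Mbps, RTprop=40 ms, BDP=250 KB
```

Pacing at BtlBw with roughly one BDP in flight keeps the pipe full while the bottleneck queue stays near empty.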
BBR Phases
- Startup: Like slow start but pacing-based. Roughly doubles the sending rate each RTT until the BtlBw estimate plateaus (less than 25% growth across 3 consecutive RTTs).
- Drain: Briefly sends below BtlBw to drain any queue built during startup.
- ProbeBW: Steady state. Cycles through 8 phases — one at 1.25× BtlBw (probe for more bandwidth), one at 0.75× (drain queue), six at 1.0× (cruise).
- ProbeRTT: If the RTprop estimate hasn't been refreshed within 10 seconds, reduces cwnd to 4 packets for at least 200ms (plus one round trip) to get a clean RTprop sample.
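The ProbeBW cycle above maps to a fixed table of pacing gains (values from the BBR v1 design; the average gain over a full cycle is 1.0, so steady state neither builds nor drains queue on balance):

```python
# BBR v1 ProbeBW pacing-gain cycle: 8 phases, each lasting ~one RTprop.
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

def pacing_rate(btl_bw_bps, phase):
    """Pacing rate during the given ProbeBW phase (cycles mod 8)."""
    return btl_bw_bps * PROBE_BW_GAINS[phase % 8]

btl_bw = 100e6   # assumed 100 Mbps bottleneck estimate
for phase in range(8):
    print(f"phase {phase}: pace at {pacing_rate(btl_bw, phase) / 1e6:.0f} Mbps")
```

The 1.25× phase probes for newly available bandwidth; the 0.75× phase immediately drains whatever queue the probe built; the six 1.0× phases cruise at the estimated bottleneck rate.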
Netflix's BBR Story
Netflix runs one of the world's largest CDNs (Open Connect). When they switched from CUBIC to BBR:
- 4-14% throughput improvement globally
- Biggest gains in developing countries where last-mile connections are lossy (cellular, DSL)
- CUBIC interpreted random wireless loss as congestion and backed off aggressively. BBR, not being loss-based, maintained throughput.
# Enable BBR on Linux (kernel 4.9+)
sysctl -w net.core.default_qdisc=fq # fq provides pacing; required before kernel 4.13, still recommended
sysctl -w net.ipv4.tcp_congestion_control=bbr
# Verify
sysctl net.ipv4.tcp_congestion_control
# Output: net.ipv4.tcp_congestion_control = bbr
BBR vs CUBIC vs Reno: When to Use What
| Characteristic | Reno/NewReno | CUBIC | BBR v2/v3 |
|---|---|---|---|
| Loss model | Loss = congestion | Loss = congestion | Loss ≠ congestion |
| Growth | Linear (AIMD) | Cubic function | Model-based pacing |
| Bufferbloat | Causes it | Causes it | Avoids it |
| Lossy links | Terrible | Bad | Excellent |
| High-BDP | Very slow recovery | Fast recovery | Optimal |
| Fairness | Fair with itself | Fair with itself | BBRv1 unfair to CUBIC |
| Data centers | Not recommended | OK | Not recommended (use DCTCP) |
Use CUBIC when: general-purpose server, no specific tuning requirements, mixed traffic. It's battle-tested and the default for a reason.
Use BBR when: serving video/large files, users on cellular/satellite links, high-latency paths. But deploy BBR v2/v3 — v1 has known fairness issues.
Use DCTCP when: data center networks where both endpoints are controlled and all switches support ECN. DCTCP maintains sub-millisecond queuing delay at 90%+ utilization.
Production Monitoring
Congestion control problems manifest as throughput issues, not errors. Metrics are essential:
# Per-connection congestion state
ss -ti dst 10.0.1.50
# Look for: cwnd, ssthresh, rtt, retrans, bytes_sent
# System-wide TCP statistics
nstat -az | grep -E "Tcp(Retrans|InSegs|OutSegs)"
# Simulate congestion for testing
tc qdisc add dev eth0 root netem delay 100ms loss 1%
# Real-time cwnd tracking with bpftrace
bpftrace -e 'kprobe:tcp_cong_avoid_ai {
printf("cwnd=%d ssthresh=%d\n",
((struct tcp_sock *)arg0)->snd_cwnd,
((struct tcp_sock *)arg0)->snd_ssthresh);
}'
The single most important metric is retransmission rate. Below 0.1% is healthy. Between 0.1-1% is concerning. Above 1% means significant throughput is being lost to retransmissions and congestion window resets.
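Those thresholds are easy to apply mechanically. A sketch using illustrative counter values; on a real host, feed it deltas of TcpRetransSegs and TcpOutSegs taken from nstat over a fixed interval:

```python
# Classify retransmission rate from the TCP MIB counters nstat reports.
# The counter values below are illustrative, not from a real host.

def retrans_rate(retrans_segs, out_segs):
    """Fraction of sent segments that were retransmissions."""
    return retrans_segs / out_segs if out_segs else 0.0

def classify(rate):
    if rate < 0.001:
        return "healthy (<0.1%)"
    if rate <= 0.01:
        return "concerning (0.1-1%)"
    return "significant loss (>1%)"

rate = retrans_rate(retrans_segs=1_200, out_segs=2_000_000)
print(f"{rate:.2%} -> {classify(rate)}")
# → 0.06% -> healthy (<0.1%)
```

Use deltas over an interval, not absolute counters since boot: a machine that suffered a loss storm last week can still show an alarming lifetime ratio.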
The Fairness Problem
Here's the uncomfortable truth: congestion control algorithms are only fair when competing with themselves. BBR v1 can consume 40% more bandwidth than CUBIC flows on the same bottleneck. CUBIC flows are unfair to Reno flows. DCTCP is catastrophically unfair to non-ECN flows.
This is why algorithm choice matters at the infrastructure level. Running BBR on a CDN while the ISPs use CUBIC-based traffic shaping can produce unexpected interactions. Always test in the actual deployment environment, not just in the lab.
Key Points
- Congestion control is about the NETWORK capacity, not the receiver's capacity — it prevents routers from dropping packets due to overloaded queues
- Without congestion control, TCP would cause congestion collapse — the internet literally stopped working in 1986 before Jacobson's fixes
- CUBIC is the default on Linux since 2.6.19 — it uses a cubic function to probe bandwidth more aggressively than Reno on high-BDP links
- BBR (Bottleneck Bandwidth and RTT) fundamentally changed the game by modeling the network instead of reacting to loss
- Congestion control algorithms are NOT interchangeable — BBR and CUBIC competing on the same bottleneck can cause unfairness
Key Components
| Component | Role |
|---|---|
| Congestion Window (cwnd) | Sender-side limit on unacknowledged bytes in flight, separate from receiver's flow control window |
| Slow Start | Exponentially grows cwnd from the initial window (historically 1 MSS, typically 10 today) until loss or ssthresh is reached, probing available bandwidth |
| Congestion Avoidance | Linearly grows cwnd after ssthresh, carefully probing for additional bandwidth one MSS per RTT |
| Fast Retransmit / Fast Recovery | Retransmits on triple duplicate ACK without waiting for timeout, halves cwnd instead of resetting to 1 |
| ssthresh (Slow Start Threshold) | Boundary between slow start (exponential) and congestion avoidance (linear) phases |
When to Use
Congestion control is always active — the question is which algorithm. Use CUBIC for general workloads, BBR for high-latency or lossy links, DCTCP for data centers with ECN. Never disable congestion control.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| CUBIC | Open Source | General-purpose default on Linux; good for most workloads without tuning | Any |
| BBR (v2/v3) | Open Source | High-latency links, lossy networks (cellular, satellite), video streaming | Large-Enterprise |
| Reno/NewReno | Open Source | Legacy systems, textbook reference implementation, low-BDP links | Any |
| DCTCP | Open Source | Data center networks with ECN support — maintains ultra-low latency at high utilization | Enterprise Data Centers |
Debug Checklist
- Check which congestion control algorithm is active: sysctl net.ipv4.tcp_congestion_control and ss -ti (shows algo per connection)
- Monitor cwnd over time with ss -ti — a cwnd that keeps resetting to small values indicates persistent packet loss
- Look at retransmission counts: nstat -az TcpRetransSegs — compare to total segments sent for loss rate
- Check for bufferbloat: measure RTT under load vs idle. If RTT jumps from 20ms to 200ms, intermediate buffers are full
- Verify ECN is enabled if using DCTCP: sysctl net.ipv4.tcp_ecn — both endpoints and all routers must support it
Common Mistakes
- Confusing cwnd with rwnd. Flow control (rwnd) protects the receiver; congestion control (cwnd) protects the network. The sender uses min(cwnd, rwnd)
- Thinking slow start is slow. It doubles cwnd every RTT — a 10 MSS initial window reaches 10,240 segments in just 10 RTTs. It's exponential growth.
- Deploying BBR without understanding its fairness implications. BBR v1 is known to starve CUBIC flows sharing the same bottleneck
- Ignoring initial cwnd tuning. Google's research behind RFC 6928 showed that raising initcwnd from 3 to 10 improved average page load times by roughly 10%; Linux has defaulted to initcwnd=10 since 2011
- Not monitoring congestion metrics. Optimization without measurement is guesswork — track retransmission rate, RTT variance, and cwnd over time
Real World Usage
- Netflix switched from CUBIC to BBR and saw 4-14% throughput improvement on their video delivery network, especially in regions with lossy last-mile connections
- Google deployed BBR on YouTube and saw 4% higher throughput globally, with 14% improvement in developing countries with lossy links
- Data center operators use DCTCP with ECN to keep tail latency low — traditional loss-based algorithms cause buffer bloat and latency spikes
- CDN providers like Cloudflare and Akamai tune congestion control per connection type — BBR for streaming, CUBIC for bulk downloads
- Cellular carriers see massive improvements with BBR because mobile networks have high loss rates that CUBIC interprets as congestion