TCP Congestion Control
Congestion control algorithms dynamically adjust TCP's sending rate to maximize throughput without collapsing the network.
The Problem
How should a TCP sender determine its transmission rate when it has no direct visibility into network congestion, router queue depths, or competing traffic?
Mental Model
Like driving on a highway — start slow, speed up until brake lights appear, then back off. BBR is like having GPS traffic data instead of just watching brake lights.
How It Works
In 1986, the internet nearly died. Van Jacobson observed that throughput between Lawrence Berkeley Labs and UC Berkeley — just 400 yards apart — had collapsed from 32 kbit/s to 40 bit/s, roughly a thousandth of the link's capacity. Packets were being sent faster than routers could handle, causing massive loss, which triggered retransmissions, which caused more congestion, which caused more loss. This death spiral is called congestion collapse.
Jacobson's fix, published in 1988, gave TCP a congestion window (cwnd) — a sender-side limit on unacknowledged data, independent of the receiver's window. The sender transmits at most min(cwnd, rwnd) bytes before waiting for ACKs. The congestion control algorithm decides how cwnd evolves over time.
The Four Phases
1. Slow Start — Despite its name, slow start is exponential growth. cwnd starts at the initial window (typically 10 MSS on Linux, thanks to RFC 6928) and doubles every RTT. On a 100ms RTT link with 1460-byte MSS, that goes from ~14KB to 14MB in just 10 RTTs (1 second). Slow start continues until cwnd reaches ssthresh or a loss event occurs.
2. Congestion Avoidance — Once cwnd exceeds ssthresh, TCP switches to linear growth: add 1 MSS per RTT. This is the AIMD (Additive Increase, Multiplicative Decrease) phase — carefully probing for more bandwidth, one segment at a time.
3. Fast Retransmit — When the sender receives three duplicate ACKs (four ACKs for the same sequence number), it assumes a packet was lost and retransmits immediately without waiting for the RTO timer. This saves an entire timeout period.
4. Fast Recovery — After fast retransmit, instead of resetting cwnd to 1 (slow start), the sender sets ssthresh = cwnd/2 and cwnd = ssthresh + 3 MSS. Each additional duplicate ACK inflates cwnd by 1 MSS. When the retransmitted segment is ACKed, cwnd deflates back to ssthresh and enters congestion avoidance. This is multiplicative decrease without the catastrophic reset.
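The four phases above can be sketched as a small state machine. This is a simplified illustration of Reno-style behavior per RFC 5681, not the actual Linux implementation; window units are segments (MSS):

```python
# Simplified Reno congestion control state machine (illustrative sketch).
# Units are segments (MSS). Real stacks track far more state.

class RenoSender:
    def __init__(self, init_cwnd=10, ssthresh=64):
        self.cwnd = init_cwnd          # congestion window, in MSS
        self.ssthresh = ssthresh       # slow start threshold, in MSS
        self.dup_acks = 0
        self.in_recovery = False

    def on_new_ack(self):
        """A new (non-duplicate) cumulative ACK arrived."""
        if self.in_recovery:
            # Fast recovery ends: deflate cwnd back to ssthresh.
            self.cwnd = self.ssthresh
            self.in_recovery = False
        elif self.cwnd < self.ssthresh:
            self.cwnd += 1             # slow start: +1 MSS per ACK (~doubles per RTT)
        else:
            self.cwnd += 1 / self.cwnd # congestion avoidance: ~+1 MSS per RTT
        self.dup_acks = 0

    def on_dup_ack(self):
        """Duplicate ACK: a segment ahead of this sequence number arrived."""
        self.dup_acks += 1
        if self.dup_acks == 3 and not self.in_recovery:
            # Fast retransmit + fast recovery: halve, then inflate by the
            # 3 segments the duplicate ACKs prove have left the network.
            self.ssthresh = max(int(self.cwnd) // 2, 2)
            self.cwnd = self.ssthresh + 3
            self.in_recovery = True
        elif self.in_recovery:
            self.cwnd += 1             # each further dup ACK inflates cwnd

    def on_timeout(self):
        """RTO fired: severe congestion, restart from a window of 1."""
        self.ssthresh = max(int(self.cwnd) // 2, 2)
        self.cwnd = 1
        self.dup_acks = 0
        self.in_recovery = False
```

Driving it with three duplicate ACKs from a cwnd of 10 gives ssthresh = 5 and an inflated cwnd of 8; the next new ACK deflates cwnd back to 5 and resumes congestion avoidance.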
# Watch congestion window changes in real-time
watch -n 0.5 "ss -ti dst 10.0.1.50 | grep -E 'cwnd|ssthresh|rtt'"
CUBIC: The Linux Default
Reno/NewReno's AIMD is conservative — it takes a long time to recover the window after a loss, especially on high-bandwidth, high-latency links. On a 10 Gbps link with 100ms RTT, the bandwidth-delay product is 125 MB. After a loss event halves the window, Reno takes ~43,000 RTTs (over an hour!) to recover.
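The recovery arithmetic is worth working through explicitly; a quick sanity check of the numbers above (illustrative values only):

```python
# Recovery time for Reno-style additive increase on a high-BDP path.
link_bps = 10e9          # 10 Gbps link
rtt = 0.100              # 100 ms round-trip time
mss = 1460               # bytes per segment

bdp_bytes = link_bps / 8 * rtt          # bandwidth-delay product: 125 MB
bdp_segments = bdp_bytes / mss          # ~85,600 segments

# After a loss halves the window, additive increase regains 1 MSS per
# RTT, so recovery takes roughly BDP/2 RTTs.
recovery_rtts = bdp_segments / 2
recovery_minutes = recovery_rtts * rtt / 60

print(f"BDP: {bdp_bytes / 1e6:.0f} MB ({bdp_segments:.0f} segments)")
print(f"Recovery: ~{recovery_rtts:.0f} RTTs = {recovery_minutes:.0f} minutes")
```

That works out to roughly 43,000 RTTs, or about 71 minutes, to climb back to full window.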
CUBIC (RFC 8312, Linux default since 2.6.19) replaces linear growth with a cubic function centered on the window size at the last loss event (W_max). The key insight: grow quickly when far from W_max, slow down when approaching it, then grow aggressively past it to probe for more bandwidth.
W(t) = C × (t - K)³ + W_max
Where:
C = scaling factor (0.4)
K = time to reach W_max from the reduced window
t = time since last loss event
CUBIC's behavior is independent of RTT, which means flows with different RTTs get fairer bandwidth allocation than Reno. This is critical for internet-scale fairness.
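Plugging in RFC 8312's constants (C = 0.4, multiplicative-decrease factor β = 0.7, which gives K = ∛(W_max·(1−β)/C)) makes the curve's shape concrete:

```python
# CUBIC window growth after a loss event, per RFC 8312 constants.
C = 0.4            # scaling factor
BETA = 0.7         # multiplicative decrease: window drops to 0.7 * W_max

def cubic_k(w_max):
    """K: time (seconds) to climb back to w_max from the reduced window."""
    return ((w_max * (1 - BETA)) / C) ** (1 / 3)

def cubic_window(t, w_max):
    """Congestion window (in MSS) t seconds after the last loss event."""
    return C * (t - cubic_k(w_max)) ** 3 + w_max

w_max = 100.0      # window at the last loss, in MSS
k = cubic_k(w_max)                   # ~4.2 s for these numbers
print(cubic_window(0, w_max))        # ~70 MSS: just after the 0.7x cut
print(cubic_window(k, w_max))        # 100 MSS: plateau at W_max
print(cubic_window(k + 3, w_max))    # ~110.8 MSS: probing past W_max
```

The concave-then-convex shape is the whole point: fast growth when far below W_max, a cautious plateau near it, then aggressive probing beyond it.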
BBR: The Paradigm Shift
Traditional algorithms (Reno, CUBIC) are loss-based — they increase the sending rate until they see packet loss, then back off. The problem: in modern networks with deep buffers, loss doesn't happen until the buffers are full. By that point, hundreds of milliseconds of queuing delay have accumulated. This is bufferbloat.
BBR (Bottleneck Bandwidth and Round-trip time), developed at Google, takes a fundamentally different approach. Instead of reacting to loss, BBR models the network by continuously estimating two parameters:
- BtlBw (bottleneck bandwidth): maximum delivery rate observed over a sliding window
- RTprop (minimum RTT): minimum RTT observed over a longer sliding window
The optimal operating point is BtlBw × RTprop — the bandwidth-delay product. BBR paces packets at exactly this rate, keeping the pipe full without filling buffers.
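The model reduces to a few lines. A sketch with the windowed max/min filters simplified to plain max and min (real BBR windows BtlBw over roughly 10 RTTs and RTprop over roughly 10 seconds):

```python
# Sketch of BBR's core model: estimate BtlBw and RTprop from delivery
# samples, then derive the BDP operating point. Simplified: real BBR
# uses sliding-window max/min filters, not all-time max/min.

def bbr_operating_point(samples):
    """samples: list of (delivery_rate_bps, rtt_seconds) measurements."""
    btl_bw = max(rate for rate, _ in samples)   # max delivery rate seen
    rt_prop = min(rtt for _, rtt in samples)    # min RTT seen
    bdp_bytes = btl_bw / 8 * rt_prop            # bandwidth-delay product
    return btl_bw, rt_prop, bdp_bytes

# Example: 50 Mbps bottleneck, 40 ms base RTT. Queuing inflates some
# RTT samples; loss deflates some rate samples — the filters reject both.
samples = [(48e6, 0.045), (50e6, 0.040), (43e6, 0.090), (49e6, 0.041)]
btl_bw, rt_prop, bdp = bbr_operating_point(samples)
print(f"BtlBw={btl_bw/1e6:.0f} Mbps, RTprop={rt_prop*1000:.0f} ms, BDP={bdp/1000:.0f} KB")
# → BtlBw=50 Mbps, RTprop=40 ms, BDP=250 KB
```

Pacing at BtlBw with roughly one BDP in flight keeps the pipe full while the bottleneck queue stays near empty.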
BBR Phases
- Startup: Like slow start but pacing-based. Roughly doubles the sending rate each RTT until the BtlBw estimate plateaus (less than 25% growth across 3 consecutive RTTs).
- Drain: Briefly sends below BtlBw to drain any queue built during startup.
- ProbeBW: Steady state. Cycles through 8 phases — one at 1.25× BtlBw (probe for more bandwidth), one at 0.75× (drain queue), six at 1.0× (cruise).
- ProbeRTT: If the RTprop estimate hasn't been refreshed within 10 seconds, reduces cwnd to 4 packets for at least 200ms (plus one round trip) to get a clean RTprop sample.
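The ProbeBW cycle above maps to a fixed table of pacing gains (values from the BBR v1 design; the average gain over a full cycle is 1.0, so steady state neither builds nor drains queue on balance):

```python
# BBR v1 ProbeBW pacing-gain cycle: 8 phases, each lasting ~one RTprop.
PROBE_BW_GAINS = [1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

def pacing_rate(btl_bw_bps, phase):
    """Pacing rate during the given ProbeBW phase (cycles mod 8)."""
    return btl_bw_bps * PROBE_BW_GAINS[phase % 8]

btl_bw = 100e6   # assumed 100 Mbps bottleneck estimate
for phase in range(8):
    print(f"phase {phase}: pace at {pacing_rate(btl_bw, phase) / 1e6:.0f} Mbps")
```

The 1.25× phase probes for newly available bandwidth; the 0.75× phase immediately drains whatever queue the probe built; the six 1.0× phases cruise at the estimated bottleneck rate.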
Netflix's BBR Story
Netflix runs one of the world's largest CDNs (Open Connect). When they switched from CUBIC to BBR:
- 4-14% throughput improvement globally
- Biggest gains in developing countries where last-mile connections are lossy (cellular, DSL)
- CUBIC interpreted random wireless loss as congestion and backed off aggressively. BBR, not being loss-based, maintained throughput.
# Enable BBR on Linux (kernel 4.9+)
sysctl -w net.core.default_qdisc=fq # fq provides pacing; required before kernel 4.13, still recommended
sysctl -w net.ipv4.tcp_congestion_control=bbr
# Verify
sysctl net.ipv4.tcp_congestion_control
# Output: net.ipv4.tcp_congestion_control = bbr
BBR vs CUBIC vs Reno: When to Use What
| Characteristic | Reno/NewReno | CUBIC | BBR v2/v3 |
|---|---|---|---|
| Loss model | Loss = congestion | Loss = congestion | Loss ≠ congestion |
| Growth | Linear (AIMD) | Cubic function | Model-based pacing |
| Bufferbloat | Causes it | Causes it | Avoids it |
| Lossy links | Terrible | Bad | Excellent |
| High-BDP | Very slow recovery | Fast recovery | Optimal |
| Fairness | Fair with itself | Fair with itself | BBRv1 unfair to CUBIC |
| Data centers | Not recommended | OK | Not recommended (use DCTCP) |
Use CUBIC when: general-purpose server, no specific tuning requirements, mixed traffic. It's battle-tested and the default for a reason.
Use BBR when: serving video/large files, users on cellular/satellite links, high-latency paths. But deploy BBR v2/v3 — v1 has known fairness issues.
Use DCTCP when: data center networks where both endpoints are controlled and all switches support ECN. DCTCP maintains sub-millisecond queuing delay at 90%+ utilization.
Production Monitoring
Congestion control problems manifest as throughput issues, not errors. Metrics are essential:
# Per-connection congestion state
ss -ti dst 10.0.1.50
# Look for: cwnd, ssthresh, rtt, retrans, bytes_sent
# System-wide TCP statistics
nstat -az | grep -E "Tcp(Retrans|InSegs|OutSegs)"
# Simulate congestion for testing
tc qdisc add dev eth0 root netem delay 100ms loss 1%
# Real-time cwnd tracking with bpftrace
bpftrace -e 'kprobe:tcp_cong_avoid_ai {
printf("cwnd=%d ssthresh=%d\n",
((struct tcp_sock *)arg0)->snd_cwnd,
((struct tcp_sock *)arg0)->snd_ssthresh);
}'
The single most important metric is retransmission rate. Below 0.1% is healthy. Between 0.1-1% is concerning. Above 1% means significant throughput is being lost to retransmissions and congestion window resets.
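Those thresholds are easy to apply mechanically. A sketch using illustrative counter values; on a real host, feed it deltas of TcpRetransSegs and TcpOutSegs taken from nstat over a fixed interval:

```python
# Classify retransmission rate from the TCP MIB counters nstat reports.
# The counter values below are illustrative, not from a real host.

def retrans_rate(retrans_segs, out_segs):
    """Fraction of sent segments that were retransmissions."""
    return retrans_segs / out_segs if out_segs else 0.0

def classify(rate):
    if rate < 0.001:
        return "healthy (<0.1%)"
    if rate <= 0.01:
        return "concerning (0.1-1%)"
    return "significant loss (>1%)"

rate = retrans_rate(retrans_segs=1_200, out_segs=2_000_000)
print(f"{rate:.2%} -> {classify(rate)}")
# → 0.06% -> healthy (<0.1%)
```

Use deltas over an interval, not absolute counters since boot: a machine that suffered a loss storm last week can still show an alarming lifetime ratio.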
The Fairness Problem
Here's the uncomfortable truth: congestion control algorithms are only fair when competing with themselves. BBR v1 can consume 40% more bandwidth than CUBIC flows on the same bottleneck. CUBIC flows are unfair to Reno flows. DCTCP is catastrophically unfair to non-ECN flows.
This is why algorithm choice matters at the infrastructure level. Running BBR on a CDN while the ISPs use CUBIC-based traffic shaping can produce unexpected interactions. Always test in the actual deployment environment, not just in the lab.
Key Points
- Congestion control is about the NETWORK capacity, not the receiver's capacity — it prevents routers from dropping packets due to overloaded queues
- Without congestion control, TCP would cause congestion collapse — the internet literally stopped working in 1986 before Jacobson's fixes
- CUBIC is the default on Linux since 2.6.19 — it uses a cubic function to probe bandwidth more aggressively than Reno on high-BDP links
- BBR (Bottleneck Bandwidth and RTT) fundamentally changed the game by modeling the network instead of reacting to loss
- Congestion control algorithms are NOT interchangeable — BBR and CUBIC competing on the same bottleneck can cause unfairness
Key Components
| Component | Role |
|---|---|
| Congestion Window (cwnd) | Sender-side limit on unacknowledged bytes in flight, separate from receiver's flow control window |
| Slow Start | Exponentially grows cwnd from the initial window (historically 1 MSS, typically 10 today) until loss or ssthresh is reached, probing available bandwidth |
| Congestion Avoidance | Linearly grows cwnd after ssthresh, carefully probing for additional bandwidth one MSS per RTT |
| Fast Retransmit / Fast Recovery | Retransmits on triple duplicate ACK without waiting for timeout, halves cwnd instead of resetting to 1 |
| ssthresh (Slow Start Threshold) | Boundary between slow start (exponential) and congestion avoidance (linear) phases |
When to Use
Congestion control is always active — the question is which algorithm. Use CUBIC for general workloads, BBR for high-latency or lossy links, DCTCP for data centers with ECN. Never disable congestion control.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| CUBIC | Open Source | General-purpose default on Linux; good for most workloads without tuning | Any |
| BBR (v2/v3) | Open Source | High-latency links, lossy networks (cellular, satellite), video streaming | Large-Enterprise |
| Reno/NewReno | Open Source | Legacy systems, textbook reference implementation, low-BDP links | Any |
| DCTCP | Open Source | Data center networks with ECN support — maintains ultra-low latency at high utilization | Enterprise Data Centers |
Debug Checklist
- Check which congestion control algorithm is active: sysctl net.ipv4.tcp_congestion_control and ss -ti (shows algo per connection)
- Monitor cwnd over time with ss -ti — a cwnd that keeps resetting to small values indicates persistent packet loss
- Look at retransmission counts: nstat -az TcpRetransSegs — compare to total segments sent for loss rate
- Check for bufferbloat: measure RTT under load vs idle. If RTT jumps from 20ms to 200ms, intermediate buffers are full
- Verify ECN is enabled if using DCTCP: sysctl net.ipv4.tcp_ecn — both endpoints and all routers must support it
Common Mistakes
- Confusing cwnd with rwnd. Flow control (rwnd) protects the receiver; congestion control (cwnd) protects the network. The sender uses min(cwnd, rwnd)
- Thinking slow start is slow. It doubles cwnd every RTT — a 10 MSS initial window reaches 10,240 segments in just 10 RTTs. It's exponential growth.
- Deploying BBR without understanding its fairness implications. BBR v1 is known to starve CUBIC flows sharing the same bottleneck
- Ignoring initial cwnd tuning. Google's research behind RFC 6928 showed that raising initcwnd from 3 to 10 improved average page load times by roughly 10%; Linux has defaulted to initcwnd=10 since 2011
- Not monitoring congestion metrics. Optimization without measurement is guesswork — track retransmission rate, RTT variance, and cwnd over time
Real World Usage
- Netflix switched from CUBIC to BBR and saw 4-14% throughput improvement on their video delivery network, especially in regions with lossy last-mile connections
- Google deployed BBR on YouTube and saw 4% higher throughput globally, with 14% improvement in developing countries with lossy links
- Data center operators use DCTCP with ECN to keep tail latency low — traditional loss-based algorithms cause buffer bloat and latency spikes
- CDN providers like Cloudflare and Akamai tune congestion control per connection type — BBR for streaming, CUBIC for bulk downloads
- Cellular carriers see massive improvements with BBR because mobile networks have high loss rates that CUBIC interprets as congestion