TCP Deep Dive
TCP turns an unreliable network into a reliable byte stream using handshakes, sequence numbers, ACKs, and flow control.
The Problem
How does a transport protocol deliver data reliably and in order across a network that can drop, reorder, duplicate, or corrupt packets at any point?
Mental Model
Like a phone call — establish a connection, confirm each message received, then hang up. If the other person says 'what?', repeat it.
How It Works
TCP is the protocol that makes the internet reliable. IP delivers packets on a best-effort basis — they can arrive out of order, get duplicated, or vanish entirely. TCP sits on top of IP and provides reliable, ordered, byte-stream delivery. Here's how every piece fits together.
The 3-Way Handshake
Before a single byte of application data flows, TCP establishes a synchronized connection:
- SYN: The client picks an initial sequence number (ISN), say 100, and sends a SYN segment. This ISN is randomized to prevent sequence prediction attacks.
- SYN-ACK: The server picks its own ISN (say 300), acknowledges the client's ISN by setting ack=101, and sends both back.
- ACK: The client acknowledges the server's ISN with ack=301. Connection is now ESTABLISHED on both sides.
This costs one full round-trip before data can flow. On a 100ms RTT link, that's 100ms of pure overhead — which is why connection reuse matters so much.
Client                                 Server
  |--- SYN (seq=100) ------------------->|
  |<-- SYN-ACK (seq=300, ack=101) -------|
  |--- ACK (seq=101, ack=301) ---------->|
  |        [Connection ESTABLISHED]      |
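The sequence and acknowledgment arithmetic in the handshake can be sketched in a few lines of Python. This is an illustrative model only, not a network implementation: `Segment` and `three_way_handshake` are hypothetical names, and the ISNs are passed in rather than randomized.

```python
from dataclasses import dataclass
from typing import List, Optional

MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around


@dataclass
class Segment:
    flags: str
    seq: int
    ack: Optional[int] = None


def three_way_handshake(client_isn: int, server_isn: int) -> List[Segment]:
    """Return the three segments exchanged during connection setup."""
    syn = Segment("SYN", seq=client_isn)
    # A SYN consumes one sequence number, so the server acks client_isn + 1.
    syn_ack = Segment("SYN-ACK", seq=server_isn, ack=(syn.seq + 1) % MOD)
    # The client's next sequence number is exactly what the server just acked.
    ack = Segment("ACK", seq=syn_ack.ack, ack=(syn_ack.seq + 1) % MOD)
    return [syn, syn_ack, ack]


for segment in three_way_handshake(client_isn=100, server_isn=300):
    print(segment)
```

With ISNs of 100 and 300 this reproduces the ack=101 and ack=301 values from the diagram.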
The TCP Header
The TCP header is 20 bytes minimum (up to 60 with options). The fields that matter most for debugging:
| Field | Size | Purpose |
|---|---|---|
| Source/Dest Port | 2+2 bytes | Identifies the connection endpoints |
| Sequence Number | 4 bytes | Position of first data byte in this segment |
| Acknowledgment Number | 4 bytes | Next byte the sender of this segment expects to receive |
| Window Size | 2 bytes | How many bytes the receiver can accept (flow control) |
| Flags | 6 bits (8 with ECN) | SYN, ACK, FIN, RST, PSH, URG (plus ECE and CWR, added by RFC 3168) |
| Checksum | 2 bytes | Integrity check over header + data + pseudo-header |
The options field carries critical extensions: MSS (maximum segment size), window scaling, timestamps, and SACK (selective acknowledgment).
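To make the layout concrete, here is a minimal sketch that decodes the fixed 20-byte part of a header with Python's `struct` module. The function name `parse_tcp_header` and the sample values (ports 443 and 51000) are made up for illustration; options are not parsed.

```python
import struct

# Flag names by bit position, lowest bit first (ECE and CWR come from RFC 3168).
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]


def parse_tcp_header(raw: bytes) -> dict:
    """Decode the fixed 20-byte portion of a TCP header."""
    if len(raw) < 20:
        raise ValueError("a TCP header is at least 20 bytes")
    # "!" selects network (big-endian) byte order.
    src, dst, seq, ack, off_flags, window, checksum, urgent = struct.unpack(
        "!HHIIHHHH", raw[:20]
    )
    flag_bits = off_flags & 0xFF
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,
        "ack": ack,
        "header_len": (off_flags >> 12) * 4,  # data offset counts 32-bit words
        "flags": [n for i, n in enumerate(FLAG_NAMES) if flag_bits & (1 << i)],
        "window": window,
        "checksum": checksum,
    }


# Hand-built SYN-ACK with seq=300, ack=101, data offset 5, flags SYN|ACK (0x12).
sample = struct.pack("!HHIIHHHH", 443, 51000, 300, 101, (5 << 12) | 0x12, 65535, 0, 0)
print(parse_tcp_header(sample))
```

A `header_len` above 20 means options are present between byte 20 and the start of the payload.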
Sequence Numbers and Acknowledgments
Every byte in a TCP stream has a sequence number. When the client sends 500 bytes starting at seq=101, the server ACKs with ack=601, meaning "I've received everything up to byte 600, send me 601 next."
This is a cumulative ACK scheme. If bytes 601-900 and 1001-1200 arrive but 901-1000 are lost, the server keeps ACKing 901 until the gap is filled, no matter how much later data it has buffered. With Selective ACK (SACK), the receiver can also report "I have 1001-1200" alongside ack=901, so the sender retransmits only 901-1000 instead of everything after the gap.
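The receiver-side bookkeeping can be sketched as follows. This is a simplified model under stated assumptions: `ack_state` is a hypothetical helper, ranges are inclusive byte positions as in the prose (real SACK blocks carry a left edge and an exclusive right edge), and sequence-number wraparound is ignored.

```python
from typing import List, Tuple

ByteRange = Tuple[int, int]  # inclusive (first_byte, last_byte)


def ack_state(next_expected: int, received: List[ByteRange]) -> Tuple[int, List[ByteRange]]:
    """Return (cumulative_ack, sack_blocks) for the byte ranges that have arrived."""
    ack = next_expected
    sack: List[ByteRange] = []
    for start, end in sorted(received):
        if start <= ack:
            # Contiguous with what we already have: advance the cumulative ACK.
            ack = max(ack, end + 1)
        elif sack and start <= sack[-1][1] + 1:
            # Touches the previous out-of-order block: merge the two.
            sack[-1] = (sack[-1][0], max(sack[-1][1], end))
        else:
            # Data beyond a gap: report it as a separate SACK block.
            sack.append((start, end))
    return ack, sack


# Bytes 901-1000 are missing, so the cumulative ACK is stuck at 901.
print(ack_state(601, [(601, 900), (1001, 1200)]))
# Once the gap is filled, the cumulative ACK jumps past all buffered data.
print(ack_state(601, [(601, 900), (1001, 1200), (901, 1000)]))
```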
Flow Control: The Sliding Window
TCP's flow control prevents a fast sender from overwhelming a slow receiver. The receive window (rwnd) tells the sender how much buffer space the receiver has available.
Sender's view:
[Already ACKed] [Sent, not ACKed] [Can send] [Cannot send yet]
                |<- flight size ->|<-usable->|
                |<----- min(rwnd, cwnd) ---->|
The sender can have at most min(rwnd, cwnd) bytes in flight, where cwnd is the congestion window (a separate concern). If the receiver's buffer fills up, it advertises rwnd=0 — a zero window — and the sender stops. It periodically sends window probes to check if space has opened up.
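The send limit reduces to one line of arithmetic. The sketch below uses a hypothetical helper, `usable_window`, and made-up byte counts; it models only the limit itself, not window probes or timers.

```python
def usable_window(rwnd: int, cwnd: int, bytes_in_flight: int) -> int:
    """Bytes the sender may still transmit right now."""
    # The effective window is the smaller of the receiver's and the network's limit;
    # whatever is already in flight counts against it.
    return max(0, min(rwnd, cwnd) - bytes_in_flight)


# Congestion-limited: cwnd (14600) is smaller than rwnd (65535).
print(usable_window(rwnd=65535, cwnd=14600, bytes_in_flight=8760))
# Zero window: the receiver's buffer is full, so the sender must stop.
print(usable_window(rwnd=0, cwnd=14600, bytes_in_flight=0))
```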
Window Scaling
The original TCP spec uses a 16-bit window field, capping the receive window at 65,535 bytes. On a high-bandwidth, high-latency link (say 1 Gbps, 100ms RTT), the bandwidth-delay product is 12.5 MB. A 64KB window means only 0.5% of the link capacity is usable.
Window scaling (RFC 7323), negotiated during the handshake, multiplies the advertised window by a power of 2. The shift count goes up to 14, which yields a window of about 1 GB. This is essential for any modern high-speed connection.
# Check window scaling on active connections
ss -ti | grep -E "wscale|rcv_space"
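The bandwidth-delay arithmetic behind these numbers is easy to check. The helpers below (`bdp_bytes`, `max_throughput_bps`, `min_window_shift`) are hypothetical names for a back-of-the-envelope sketch; they ignore headers, loss, and congestion control.

```python
MAX_UNSCALED_WINDOW = 65535  # largest value of the 16-bit window field
MAX_SHIFT = 14               # largest shift count allowed by RFC 7323


def bdp_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bandwidth-delay product: bytes needed in flight to fill the link."""
    return bandwidth_bps * rtt_ms // 8000  # /8 for bits to bytes, /1000 for ms to s


def max_throughput_bps(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on throughput when at most one window is sent per RTT."""
    return window_bytes * 8 * 1000 / rtt_ms


def min_window_shift(target_bytes: int) -> int:
    """Smallest window-scale shift whose maximum window covers target_bytes."""
    shift = 0
    while (MAX_UNSCALED_WINDOW << shift) < target_bytes and shift < MAX_SHIFT:
        shift += 1
    return shift


bdp = bdp_bytes(1_000_000_000, 100)
print(bdp)                                        # 12500000 bytes = 12.5 MB
print(max_throughput_bps(MAX_UNSCALED_WINDOW, 100))  # about 5.24 Mbps
print(min_window_shift(bdp))                      # shift needed to cover the BDP
```

For the 1 Gbps, 100ms example, an unscaled window caps throughput near 5.24 Mbps, and a shift of 8 is the smallest that covers the 12.5 MB product.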
Nagle's Algorithm and Delayed ACKs
Two optimizations that interact poorly when misunderstood:
Nagle's Algorithm (sender-side): If there's unacknowledged data in flight, buffer small writes and send them as one segment once the ACK arrives. This prevents the "small packet problem" where a telnet session sends 41-byte packets (40 bytes header + 1 byte data).
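Delayed ACKs (receiver-side): Instead of ACKing every segment immediately, wait briefly (up to 200ms on many stacks; RFC 1122 requires less than 500ms) hoping to piggyback the ACK on a response data segment.
The deadly interaction: the sender writes a small chunk, which goes out immediately, then writes a second small chunk that Nagle holds until the first is ACKed. The receiver, with nothing to send back yet, delays that ACK for up to 200ms. Result: up to 200ms of artificial latency whenever one message is split across several small writes.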
/* Disable Nagle on a socket (TCP_NODELAY); needs <sys/socket.h>, <netinet/in.h>, <netinet/tcp.h> */
int one = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
For request-response protocols where each write is a complete message, disable Nagle. For bulk transfers, leave it on.
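The same socket option is available from Python's standard `socket` module. A minimal sketch, where `open_low_latency_socket` is a hypothetical helper name and the socket is never connected:

```python
import socket


def open_low_latency_socket() -> socket.socket:
    """Create a TCP socket with Nagle's algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A non-zero TCP_NODELAY sends small writes immediately instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = open_low_latency_socket()
# Reading the option back confirms it took effect (any non-zero value means enabled).
print("TCP_NODELAY:", sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY))
sock.close()
```

Many HTTP and RPC client libraries set this option themselves, so it is worth checking before adding it by hand.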
Connection Teardown and TIME_WAIT
TCP connection teardown is a 4-way process (though it's often seen as a 3-way FIN/FIN-ACK/ACK when the server piggybacks its FIN with the ACK):
- Active closer sends FIN
- Passive closer sends ACK
- Passive closer sends FIN
- Active closer sends ACK and enters TIME_WAIT
TIME_WAIT lasts for 2 × MSL (Maximum Segment Lifetime). Linux does not derive it from a configurable MSL; it hardcodes the TIME_WAIT duration at 60 seconds. During this time, the socket tuple (src IP, src port, dst IP, dst port) cannot normally be reused for a new connection. This prevents delayed segments from a previous connection being misinterpreted as belonging to a new one.
The TIME_WAIT Problem
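On a busy host that opens thousands of short-lived outbound connections (a reverse proxy, or a service calling a backend), TIME_WAIT sockets pile up on the side that closes first and can exhaust ephemeral ports toward a given destination (default Linux range: 32768-60999, about 28K ports).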
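A rough estimate shows how quickly the port range runs out. The sketch below assumes every connection goes to the same destination IP and port, that the local side always closes first, and that each closed connection holds its port for a 60-second TIME_WAIT; the function names are hypothetical.

```python
def ephemeral_ports(low: int = 32768, high: int = 60999) -> int:
    """Number of ports in an inclusive ephemeral range."""
    return high - low + 1


def max_new_connections_per_second(ports: int, time_wait_s: float = 60.0) -> float:
    """Sustained connection rate before every port is stuck in TIME_WAIT."""
    return ports / time_wait_s


default_ports = ephemeral_ports()
print(default_ports)
print(round(max_new_connections_per_second(default_ports), 1))

# Widening the range to 1024-65535 raises the ceiling but does not remove it.
wide_ports = ephemeral_ports(1024, 65535)
print(round(max_new_connections_per_second(wide_ports), 1))
```

Under these assumptions the default range sustains roughly 470 new connections per second to a single destination.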
# Count TIME_WAIT sockets
ss -s | grep timewait
# Tune the kernel to reuse TIME_WAIT sockets (safe for clients)
sysctl -w net.ipv4.tcp_tw_reuse=1
# Increase ephemeral port range
sysctl -w net.ipv4.ip_local_port_range="1024 65535"
The real fix is connection pooling — reuse existing connections instead of creating new ones.
Retransmission
When a segment is lost, TCP detects it two ways:
- Timeout (RTO): No ACK within the retransmission timeout. The RTO is computed from smoothed RTT (SRTT) and RTT variance (RTTVAR) per RFC 6298. Timeouts are expensive: the sender sits idle for the whole RTO, then collapses the congestion window to one segment and restarts from slow start.
- Triple Duplicate ACK: Three duplicate ACKs for the same sequence number trigger fast retransmit without waiting for the timeout. This is much faster and preserves throughput.
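The RTO calculation from RFC 6298 can be sketched as a small estimator. `RtoEstimator` is a hypothetical class name. The constants (alpha = 1/8, beta = 1/4, K = 4, a 1-second initial and minimum RTO) follow the RFC. Real stacks differ in details, for example Linux uses a lower minimum, and they exclude samples from retransmitted segments (Karn's algorithm), which this sketch leaves to the caller.

```python
class RtoEstimator:
    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, clock_granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.granularity = clock_granularity
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.srtt = None
        self.rttvar = None
        self.rto = 1.0  # initial RTO before any measurement

    def on_rtt_sample(self, rtt):
        """Feed one RTT measurement (seconds) and update the RTO."""
        if self.srtt is None:
            # First measurement seeds both estimators.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the SRTT from before this sample.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        rto = self.srtt + max(self.granularity, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)

    def on_timeout(self):
        """Exponential backoff: double the RTO each time the timer fires."""
        self.rto = min(self.rto * 2, self.max_rto)


est = RtoEstimator(min_rto=0.2)  # 0.2 s floor is an assumption, not the RFC default
for sample in (0.100, 0.100, 0.180):
    est.on_rtt_sample(sample)
    print(f"rtt={sample:.3f} srtt={est.srtt:.3f} rttvar={est.rttvar:.3f} rto={est.rto:.3f}")
```

A single slow sample raises the RTO sharply through RTTVAR, which is why jittery links see long timeouts even when the average RTT is low.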
# Monitor retransmission stats
netstat -s | grep -i retransmit
# Or on modern Linux:
nstat -az TcpRetransSegs
Production Debugging
When TCP connections misbehave in production, here's the systematic approach:
# 1. Check connection states
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c | sort -rn
# 2. Capture traffic on the suspect port
tcpdump -i any port 5432 -w /tmp/pg_debug.pcap -c 10000
# 3. Check for socket buffer pressure
ss -tm # Shows memory usage per socket
# 4. Look at TCP metrics for a specific connection
ss -ti dst 10.0.1.50
# Output includes: rto, rtt, cwnd, retrans, rcv_space
Key things to look for in captures: retransmissions (packet loss), zero-window events (receiver too slow), RST packets (abrupt termination), and duplicate ACKs (a segment arrived out of order or was lost, often the first sign of congestion).
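When the state counts need to feed a script or an alert rather than a terminal, the same tally can be done in Python. This sketch parses text shaped like the output of `ss -tan`; `count_states` is a hypothetical helper, and the sample rows are invented for illustration, not captured from a real host.

```python
from collections import Counter


def count_states(ss_output: str) -> Counter:
    """Count sockets per TCP state from the text printed by `ss -tan`."""
    counts = Counter()
    for line in ss_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if fields:
            counts[fields[0]] += 1  # the first column is the state
    return counts


# Hypothetical sample input with the column layout of `ss -tan`.
SAMPLE = """\
State      Recv-Q Send-Q Local Address:Port  Peer Address:Port
ESTAB      0      0      10.0.0.5:44312      10.0.1.50:5432
ESTAB      0      0      10.0.0.5:44313      10.0.1.50:5432
TIME-WAIT  0      0      10.0.0.5:44290      10.0.1.50:5432
CLOSE-WAIT 1      0      10.0.0.5:8080       10.0.2.17:51234
"""

for state, n in count_states(SAMPLE).most_common():
    print(f"{n:5d} {state}")
```

On a live host the input could come from `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`.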
Why This Matters for System Design
TCP behavior directly impacts system design decisions:
- Microservices: Every service call rides on a TCP connection. Opening a new connection per request means the handshake overhead can dominate latency. Use connection pools.
- Database connections: A PostgreSQL connection is a TCP stream plus a protocol handshake. Creating one takes 3-10ms. Pooling with PgBouncer or HikariCP is non-negotiable in production.
- Global deployments: TCP performance degrades with distance (higher RTT = lower throughput for a single connection). This is why CDNs terminate TCP close to users and why QUIC's 0-RTT matters.
- Load balancers: L4 load balancers route TCP connections; they must track connection state tables. A SYN flood attack fills this table — that's why SYN cookies exist.
Understanding TCP isn't optional for backend engineers. Many of the performance problems worth debugging (slow APIs, database timeouts, connection resets) trace back to TCP behavior.
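The cost of skipping connection reuse can be estimated with simple arithmetic. The model below is deliberately crude: `total_latency_ms` is a hypothetical helper that assumes sequential requests, one RTT per handshake, one RTT per request, and no TLS or server processing time.

```python
def total_latency_ms(requests: int, rtt_ms: float, reuse_connection: bool) -> float:
    """Network time for a series of sequential request/response exchanges."""
    # With reuse, the handshake is paid once; without it, once per request.
    handshakes = 1 if reuse_connection else requests
    return handshakes * rtt_ms + requests * rtt_ms


per_request = total_latency_ms(100, rtt_ms=100, reuse_connection=False)
pooled = total_latency_ms(100, rtt_ms=100, reuse_connection=True)
print(per_request, pooled)
```

For 100 requests over a 100ms link, this gives 20 seconds with a new connection per request versus about 10.1 seconds over one reused connection. TLS would widen the gap further.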
Key Points
- TCP provides reliable, ordered, byte-stream delivery over an unreliable network; it is the workhorse of the internet
- The 3-way handshake costs one full RTT before any data flows, making connection setup the dominant cost for short-lived requests
- Flow control via the receive window prevents a fast sender from overwhelming a slow receiver; this is per-connection, not per-network
- Window scaling (RFC 7323) extends the 16-bit window field to support high-bandwidth, high-latency links like satellite or cross-continent
- TIME_WAIT exists for a reason: it prevents old duplicate segments from corrupting a new connection on the same port tuple
Key Components
| Component | Role |
|---|---|
| 3-Way Handshake | Establishes a synchronized connection between client and server using SYN, SYN-ACK, and ACK segments |
| Sequence Numbers | Tracks every byte sent so the receiver can reorder out-of-order segments and detect gaps |
| Acknowledgments (ACKs) | Confirms receipt of data so the sender knows what made it and what needs retransmission |
| Sliding Window | Flow control mechanism that limits how much unacknowledged data can be in flight at once |
| Retransmission Timer (RTO) | Detects lost segments and triggers retransmission when ACKs don't arrive in time |
When to Use
Use TCP whenever reliable, ordered delivery is required and the overhead of connection setup and acknowledgment is tolerable. This covers most internet traffic: web (HTTP/1.1 and HTTP/2), APIs, databases, email, file transfer.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| tcpdump | Open Source | Packet-level capture and TCP flag inspection on the wire | Any |
| Wireshark | Open Source | Visual TCP stream analysis, retransmission graphs, and expert info | Any |
| ss (iproute2) | Open Source | Fast socket statistics — connection states, window sizes, RTT estimates | Any |
| Packetbeat | Open Source | Real-time TCP flow monitoring integrated with Elasticsearch dashboards | Medium-Enterprise |
Debug Checklist
- Check connection state with ss -tanp (without -a, ss omits TIME-WAIT sockets). Look for SYN-SENT (can't reach server), TIME-WAIT accumulation (port exhaustion), or CLOSE-WAIT (app not closing sockets)
- Capture packets with tcpdump -i eth0 port 443 -w capture.pcap and analyze in Wireshark for retransmissions and zero-window events
- Monitor retransmission rate via netstat -s | grep -i retransmit. A retransmitted share above roughly 1% of segments sent suggests network issues or buffer problems
- Check receive window with ss -ti to see if the receiver is advertising a small window (flow control bottleneck)
- Verify MSS and window scale options in the SYN/SYN-ACK exchange — misconfigured middleboxes often strip TCP options
Common Mistakes
- Confusing flow control (receiver-driven, sliding window) with congestion control (network-driven, cwnd). They are independent mechanisms that both limit send rate
- Ignoring TIME_WAIT accumulation on busy servers. Thousands of sockets stuck in TIME_WAIT can exhaust ephemeral ports — tune net.ipv4.tcp_tw_reuse
- Disabling Nagle's algorithm blindly. Nagle reduces small-packet overhead; disable it only for latency-sensitive apps like gaming or real-time trading
- Not understanding delayed ACKs. The receiver waits up to 200ms hoping to piggyback the ACK on a data response — this interacts badly with Nagle
- Assuming TCP is 'fast enough' without measuring. A single TCP connection on a high-latency link will underperform due to the bandwidth-delay product
Real World Usage
- Every HTTP/1.1 and HTTP/2 connection rides on top of TCP, so understanding TCP is a prerequisite to debugging web performance
- Database connections (PostgreSQL, MySQL) are long-lived TCP streams where window sizing and keep-alive tuning matter enormously
- Microservice-to-microservice calls in Kubernetes traverse TCP connections managed by service meshes like Istio/Envoy
- SSH sessions are TCP connections where Nagle and delayed ACKs directly affect interactive latency
- Load balancers must track TCP connection state (SYN, ESTABLISHED, TIME_WAIT) to route correctly and avoid premature resets