TCP State Machine & Connection Lifecycle
Mental Model
Two people exchanging formal letters through a postal service. Starting a conversation takes three letters back and forth before anyone can send real content. Ending it also requires ceremony -- each side sends a goodbye letter and waits for acknowledgment. After the final goodbye, one side keeps the mailbox open for 60 days in case delayed letters from the old conversation wander in. And if one side sends a goodbye but the other never replies? That mailbox stays frozen, taking up space in the post office, indefinitely.
The Problem
A reverse proxy at 5,000 req/s burns through all 28,000 ephemeral ports in under a minute -- 300,000 TIME_WAIT sockets pile up and new connections fail with EADDRNOTAVAIL. On a different cluster, 12,000 CLOSE_WAIT sockets accumulate over 6 hours, eating file descriptors until the process hits the 65,535 fd limit and crashes. A third service drops 40% of inbound connections during a SYN flood: the SYN queue overflows at 128 entries before SYN cookies kick in, and ListenDrops climbs at 2,000 per second.
Architecture
Run ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn on any production server. The output tells a story.
Thousands of ESTABLISHED? Good -- that is healthy traffic. Tens of thousands of TIME_WAIT? Normal for a busy proxy, but dangerous if ports are running out. Hundreds of CLOSE_WAIT? There is a bug.
TCP's state machine has 11 states, and every connection traces a path through them from birth to death. Most of the time, nobody notices. But when things go wrong -- and in production, they always go wrong -- the state machine is the first place to look.
What Actually Happens
Opening a connection. The client sends SYN (enters SYN_SENT). The server receives it, sends SYN-ACK (enters SYN_RECV). The client receives SYN-ACK, sends ACK (enters ESTABLISHED). The server receives ACK (enters ESTABLISHED). Three packets, three state transitions per side.
Transferring data. Both sides sit in ESTABLISHED. This is where the actual work happens.
Closing a connection. This is where things get interesting. Whoever calls close() first is the "active closer." They send FIN and enter FIN_WAIT_1. The other side receives the FIN, sends ACK, and enters CLOSE_WAIT. When that ACK arrives, the active closer moves to FIN_WAIT_2 and waits for the other side to close as well. When the passive closer finally calls close(), it sends its own FIN and enters LAST_ACK. The active closer receives it, sends the final ACK, and enters TIME_WAIT for 60 seconds.
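To make the sequence concrete, here is a minimal Go sketch of the active-closer side (127.0.0.1:9000 is a hypothetical server); the comments map each call to the state transitions above.

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// Dial sends SYN (SYN_SENT); returning without error means the
	// three-way handshake completed and we are in ESTABLISHED.
	conn, err := net.Dial("tcp", "127.0.0.1:9000") // hypothetical server
	if err != nil {
		log.Fatal(err)
	}
	tcp := conn.(*net.TCPConn)

	// CloseWrite issues shutdown(SHUT_WR): our FIN goes out and we move
	// to FIN_WAIT_1, then FIN_WAIT_2 once the peer ACKs. Reads still work.
	if err := tcp.CloseWrite(); err != nil {
		log.Fatal(err)
	}

	// Drain until the peer's own FIN arrives as io.EOF. ACKing that FIN
	// is what drops this side into TIME_WAIT for 60 seconds.
	if _, err := io.Copy(io.Discard, tcp); err != nil {
		log.Println("read error:", err)
	}
	tcp.Close() // releases the fd; the kernel keeps the TIME_WAIT entry
}
```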
This is where production problems live.
Under the Hood
TIME_WAIT: the most misunderstood state. TIME_WAIT exists for a specific reason: to prevent old packets from corrupting new connections. If a server restarts and a client reconnects on the same 4-tuple (same IPs and ports), a delayed packet from the old connection could arrive and get mixed into the new one. TIME_WAIT's 60-second duration (2 * MSL, hardcoded in Linux) ensures all old packets have expired.
Here's what most people get wrong: TIME_WAIT is not a problem in itself. Each TIME_WAIT socket costs only ~160 bytes (the kernel uses a special lightweight tcp_timewait_sock instead of the full 2KB tcp_sock). Having 50,000 TIME_WAIT sockets uses about 8MB of memory -- nothing.
The real problem is port exhaustion. With the default ephemeral port range (32768-60999, about 28K ports) and a 60-second timeout, the maximum outbound connection rate to a single destination is ~470/second. Go higher and ports run out.
Fixes, in order of preference: (1) connection pooling -- reuse ESTABLISHED connections instead of creating new ones, (2) widen the port range with sysctl net.ipv4.ip_local_port_range="1024 65535", (3) enable tcp_tw_reuse=1 for outbound connections.
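A minimal sketch of fix (1) in Go, assuming a hypothetical upstream URL: share one http.Client so connections stay ESTABLISHED, and drain response bodies so the transport can return them to the pool.

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// One shared client means one shared connection pool. Creating a new
// client (or closing connections) per request is what feeds TIME_WAIT.
var client = &http.Client{}

func fetch(url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Draining the body lets the transport reuse this connection
	// instead of closing it.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	if err := fetch("http://upstream.internal/health"); err != nil { // hypothetical URL
		log.Fatal(err)
	}
}
```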
CLOSE_WAIT: always a bug. CLOSE_WAIT means the remote peer sent FIN (they are done) but the application has not called close(). The socket is waiting for the application to finish up. If CLOSE_WAIT sockets are accumulating, the code is leaking file descriptors.
The most common cause: recv() returns 0 (EOF, meaning the peer closed), but the error-handling path doesn't close the socket. Or a connection pool holds onto sockets whose peers have disconnected, without health-checking them.
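The correct pattern, as a Go sketch (process stands in for hypothetical application logic): the deferred Close() guarantees the socket leaves CLOSE_WAIT even on error paths.

```go
package main

import (
	"log"
	"net"
)

// process is a hypothetical stand-in for real application logic.
func process(b []byte) {}

// handle reads until the peer closes. Without the deferred Close(), a
// peer FIN (Read returning io.EOF) would park this socket in CLOSE_WAIT
// and leak the file descriptor.
func handle(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf)
		if n > 0 {
			process(buf[:n])
		}
		if err != nil {
			return // io.EOF here means the peer sent FIN
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", ":9000") // hypothetical port
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn)
	}
}
```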
RST: the emergency exit. An RST immediately destroys the connection. No TIME_WAIT, no graceful FIN exchange. The kernel sends RST when data arrives for a nonexistent connection, when the application closes a socket with unread data, or when SO_LINGER is set with timeout 0. Receiving RST isn't catastrophic, but it means something unexpected happened.
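The SO_LINGER(0) case is one line in Go -- a sketch of deliberately aborting a connection with RST:

```go
package main

import "net"

// abort tears a connection down with RST instead of the FIN handshake.
// SetLinger(0) sets SO_LINGER with a zero timeout, so Close() discards
// any unsent data and skips TIME_WAIT entirely. Reserve it for
// connections already known to be bad.
func abort(conn *net.TCPConn) error {
	if err := conn.SetLinger(0); err != nil {
		return err
	}
	return conn.Close()
}

func main() {} // sketch only; abort would be called on a live *net.TCPConn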
tcp_tw_reuse vs the removed tcp_tw_recycle. tcp_tw_reuse=1 lets the kernel reuse a TIME_WAIT socket for a new outbound connection if the TCP timestamp is strictly greater. This is safe because timestamps prevent old-packet confusion. tcp_tw_recycle tracked timestamps per source IP, which broke catastrophically when clients shared IPs through NAT. It was removed in kernel 4.12.
Common Questions
Why does TIME_WAIT last exactly 60 seconds?
It's 2 * MSL (Maximum Segment Lifetime), where Linux hardcodes MSL to 30 seconds. The 2x factor ensures both the last ACK and any retransmitted FIN have time to expire. Unlike BSD where MSL is tunable, Linux's 60 seconds is not configurable via sysctl. The only escape for outbound connections is tcp_tw_reuse.
How to debug CLOSE_WAIT accumulation?
Run ss -tanp state close-wait to find the offending process. Then check the code: the most common cause is not closing sockets when recv() returns 0 (EOF). In connection pools, stale connections whose peers have disconnected pile up in CLOSE_WAIT if the pool doesn't periodically health-check.
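One way a pool can health-check idle connections, as a sketch: a short-deadline read distinguishes a peer FIN (io.EOF, meaning the socket is sitting in CLOSE_WAIT) from a merely quiet connection. It assumes a strict request/response protocol where the server never sends unsolicited bytes.

```go
package main

import (
	"errors"
	"io"
	"net"
	"time"
)

// isAlive probes an idle pooled connection. It assumes the protocol is
// strictly request/response, so any readable byte here is unexpected.
func isAlive(conn net.Conn) bool {
	conn.SetReadDeadline(time.Now().Add(time.Millisecond))
	defer conn.SetReadDeadline(time.Time{}) // clear the deadline

	var one [1]byte
	_, err := conn.Read(one[:])
	if errors.Is(err, io.EOF) {
		return false // peer sent FIN; this socket is stuck in CLOSE_WAIT
	}
	var nerr net.Error
	if errors.As(err, &nerr) && nerr.Timeout() {
		return true // nothing to read: the connection is simply idle
	}
	return false // unexpected data or error: treat as unhealthy
}

func main() {} // sketch only
```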
A microservice fails with "cannot assign requested address." What's happening?
Ephemeral port exhaustion from TIME_WAIT. With the default range (32768-60999) and a single destination, at most ~28K connections can be in TIME_WAIT simultaneously. At 500 connections/second with a 60-second TIME_WAIT, that's 30,000 ports in use at steady state -- more than the range holds. Fix: (1) connection pooling (best), (2) widen the port range, (3) enable tcp_tw_reuse=1.
What is a SYN flood and how do SYN cookies protect against it?
A SYN flood sends millions of SYNs with spoofed source IPs, filling the server's SYN queue. Without SYN cookies, each SYN allocates ~256 bytes. SYN cookies encode the connection state (MSS, timestamp, hash) in the SYN-ACK's initial sequence number. No server memory allocated. When the real client's ACK arrives, the kernel reconstructs the connection from the ISN. Tradeoff: SYN cookies can't carry TCP options (window scaling, SACK) negotiated during the handshake.
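The encoding idea is easier to see in code. Below is a deliberately simplified model in Go -- not the kernel's exact bit layout or hash -- where the ISN is a keyed hash of the 4-tuple plus a coarse clock, with the low bits carrying an index into a fixed MSS table.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// synCookie derives a SYN-ACK initial sequence number from the 4-tuple,
// a coarse time counter, and a 3-bit MSS table index -- so the server
// stores nothing per SYN. Simplified model; the kernel's scheme differs.
func synCookie(secret []byte, saddr, daddr [4]byte, sport, dport uint16, t uint32, mssIdx uint8) uint32 {
	mac := hmac.New(sha256.New, secret)
	mac.Write(saddr[:])
	mac.Write(daddr[:])
	binary.Write(mac, binary.BigEndian, sport)
	binary.Write(mac, binary.BigEndian, dport)
	binary.Write(mac, binary.BigEndian, t)
	h := binary.BigEndian.Uint32(mac.Sum(nil))
	return (h &^ 0x7) | uint32(mssIdx&0x7) // low 3 bits: MSS index
}

func main() {
	isn := synCookie([]byte("rotating-secret"), [4]byte{10, 0, 0, 1},
		[4]byte{10, 0, 0, 2}, 54321, 443, 12345, 3)
	fmt.Printf("cookie ISN: %#x (MSS index %d)\n", isn, isn&0x7)
	// Validation recomputes the hash when the ACK returns and checks
	// that ack-1 matches for the current or previous time slot.
}
```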
How Technologies Use This
An Nginx reverse proxy at 5,000 requests per second suddenly fails to connect to backends. The error is EADDRNOTAVAIL. Running ss reveals 300,000 TIME_WAIT sockets consuming every ephemeral port in the default 32768-60999 range. CPU and memory are fine, but the server is effectively down.
Each backend connection that closes enters TIME_WAIT for 60 seconds. At 5,000 closures per second, that is 300,000 TIME_WAIT sockets piling up simultaneously, exhausting all 28,000 available ephemeral ports. The problem is not TIME_WAIT itself (each socket costs only 160 bytes) but port exhaustion: no free ports means no new outbound connections.
Enable upstream keepalive to reuse ESTABLISHED connections instead of creating new ones, cutting TIME_WAIT accumulation by 90% or more. For the remaining churn, enable tcp_tw_reuse=1 to let outbound connections reuse TIME_WAIT sockets safely via TCP timestamps. On the listener side, SO_REUSEADDR lets Nginx rebind port 443 and restart instantly instead of failing bind() while old connections on that port linger in TIME_WAIT.
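For services that manage their own listeners, the same instant-restart behavior comes from setting SO_REUSEADDR before bind(). A sketch using Go's net.ListenConfig (Go's net package already sets this option by default on Unix, so the Control hook is shown purely for illustration; the port is hypothetical):

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"
)

func main() {
	lc := net.ListenConfig{
		// Control runs after socket() but before bind() -- the only
		// window in which SO_REUSEADDR can take effect.
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				serr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
					syscall.SO_REUSEADDR, 1)
			}); err != nil {
				return err
			}
			return serr
		},
	}
	ln, err := lc.Listen(context.Background(), "tcp", ":8443") // hypothetical port
	if err != nil {
		log.Fatal(err) // without SO_REUSEADDR: "address already in use"
	}
	defer ln.Close()
	log.Println("listening on", ln.Addr())
}
```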
A Kafka consumer crashes, but its partitions go unprocessed for minutes. The consumer group cannot rebalance because the broker still sees the dead consumer's TCP connection as ESTABLISHED. On another broker, CLOSE_WAIT sockets pile up steadily, leaking file descriptors until the process hits its fd limit.
When a consumer process dies without sending FIN, the kernel has no reason to close the TCP connection. Without keepalive probes, the broker's recv() blocks indefinitely on a connection that will never produce data. The CLOSE_WAIT accumulation is a different bug: a consumer received FIN from the broker (via connections.max.idle.ms closing idle connections) but never called close() on its end, leaving the socket stuck in CLOSE_WAIT forever.
Kafka sets SO_KEEPALIVE on its sockets so the kernel probes idle connections and surfaces dead peers as socket errors instead of blocking forever, letting the group detect the failure and rebalance. Note that the kernel defaults are slow: the first probe waits tcp_keepalive_time (7,200 seconds by default), then retries every 75 seconds -- fast dead-peer detection requires tuning these down. The connections.max.idle.ms setting (default 600,000 ms, ten minutes) closes idle connections from the broker side. Monitor for CLOSE_WAIT accumulation on consumers, as it always indicates an application-side bug where received FINs are not followed by close().
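The same protection is available to any Go client through net.Dialer's KeepAlive field, which enables SO_KEEPALIVE and sets the probe period on the socket. A sketch (broker:9092 is a placeholder address):

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	d := net.Dialer{
		Timeout: 5 * time.Second,
		// Enables SO_KEEPALIVE and sets this socket's probe period;
		// the retry count still comes from tcp_keepalive_probes.
		KeepAlive: 30 * time.Second,
	}
	conn, err := d.Dial("tcp", "broker:9092") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// A peer that dies without sending FIN now fails keepalive probes,
	// and blocked reads return an error instead of hanging forever.
}
```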
A Go microservice making 500 concurrent HTTP requests to a single upstream suddenly fails with "cannot assign requested address." Running ss reveals tens of thousands of TIME_WAIT sockets to that one destination, consuming every ephemeral port in the 32768-60999 range. The service was working fine minutes ago.
The default http.Transport has MaxIdleConnsPerHost set to just 2. Every request beyond 2 concurrent opens a new TCP connection and immediately closes it after use. At 500 requests per second, that creates 498 TIME_WAIT sockets per second, exhausting all 28,000 ephemeral ports in under a minute. The connections are perfectly healthy while active, but the close-and-recreate pattern turns TIME_WAIT into a port exhaustion time bomb.
Raise MaxIdleConnsPerHost to match the expected concurrency so connections stay in ESTABLISHED state and get reused instead of closed. When a request is canceled via its context, Go still closes the connection gracefully with a FIN rather than an abrupt RST, so the peer sees a clean shutdown. And because tcp_tw_reuse operates in the kernel's connect() path, every outbound dial gets that safety net automatically once the sysctl is enabled.
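A sketch of the fix with illustrative numbers (upstream.internal is a placeholder): size the idle pool to the expected concurrency, and drain bodies so connections actually return to it.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		MaxIdleConns:        500,
		MaxIdleConnsPerHost: 500, // the default of 2 forces close-and-reopen churn
		IdleConnTimeout:     90 * time.Second,
	}
	client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

	resp, err := client.Get("http://upstream.internal/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// Fully draining the body is what hands the connection back to the
	// idle pool; an undrained body forces a close and a TIME_WAIT socket.
	io.Copy(io.Discard, resp.Body)
}
```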
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| TIME_WAIT accumulation | Bridge NAT doubles conntrack TTL to 120s | HttpClient connection pool prevents it | http.Agent keepAlive reuses sockets | http.Transport MaxIdleConnsPerHost (default 2 is too low) | Service mesh (Envoy) pools upstream connections |
| CLOSE_WAIT leak | Container restart masks the leak | Finalizers may delay close(); use try-with-resources | socket.destroy() required on error paths | defer conn.Close() in every handler | Sidecar proxy absorbs leaked app connections |
| SYN flood defense | Host kernel SYN cookies protect all containers | N/A (handled by OS) | N/A (handled by OS) | N/A (handled by OS) | Cloud LB SYN proxy offloads to edge |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Hardware/NIC | NIC interrupt coalescing affects SYN processing rate |
| Kernel TCP | tcp_states.h state machine, tcp_timewait_sock, SYN cookies |
| Syscall | connect(), accept(), shutdown(), close(), getsockopt(TCP_INFO) |
| Userspace | Connection pools (pgbouncer, Envoy, http.Transport) |
| Orchestration | K8s readinessProbe prevents routing to TIME_WAIT-saturated pods |
Design Rationale: TIME_WAIT is not a bug -- it is the guard against delayed packets from an old connection arriving on a recycled 4-tuple and corrupting a new one. CLOSE_WAIT exists so the application has time to finish processing before closing its end. SYN cookies are the clever compromise for flood defense: encode connection state in the sequence number itself, requiring zero server memory per SYN, at the cost of losing TCP option negotiation during the handshake.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| EADDRNOTAVAIL on connect() | Ephemeral port exhaustion from TIME_WAIT | ss -tan state time-wait \| wc -l and sysctl net.ipv4.ip_local_port_range |
| CLOSE_WAIT count growing over hours | Application not calling close() after peer FIN | ss -tanp state close-wait to find leaking process |
| Connection timeout on SYN (no SYN-ACK) | SYN queue overflow or SYN flood | nstat -az \| grep ListenDrop |
| Intermittent RST from server | SO_LINGER(0) or data arriving on closed socket | ss -tanp and check application linger settings |
| Service unreachable after restart | Old connections on the port in TIME_WAIT block bind() | ss -tan sport = :PORT and use SO_REUSEADDR |
| Conntrack table full in Docker/NAT | TIME_WAIT entries persist 120s in conntrack | conntrack -C and dmesg \| grep conntrack |
When to Use / Avoid
- Use when diagnosing connection failures, port exhaustion, or fd leaks on any TCP-based service
- Use when tuning high-throughput proxies or load balancers that open thousands of short-lived backend connections per second
- Use when investigating SYN flood attacks or accept queue drops on public-facing listeners
- Avoid when working with UDP-only services (DNS, QUIC) -- UDP has no connection state machine
- Avoid when the issue is clearly application-layer (HTTP 5xx, TLS handshake) rather than transport-layer
Try It Yourself
```bash
# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Find CLOSE_WAIT sockets (application bug)
ss -tanp state close-wait

# Count TIME_WAIT sockets per destination
ss -tan state time-wait | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head

# Check TIME_WAIT reuse setting
sysctl net.ipv4.tcp_tw_reuse

# Monitor accept queue drops in real-time
watch -n1 'nstat -az | grep -i listen'

# Show ephemeral port range (affects TIME_WAIT impact)
sysctl net.ipv4.ip_local_port_range
```

Debug Checklist
1. ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
2. ss -tanp state close-wait
3. ss -tan state time-wait | wc -l
4. nstat -az | grep -i 'ListenDrop\|ListenOver\|TWKill'
5. sysctl net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range net.ipv4.tcp_max_tw_buckets
6. cat /proc/net/sockstat | grep TCP
Key Takeaways
- ✓ TIME_WAIT is 60 seconds on Linux. Hardcoded. Not tunable. Its job is to absorb delayed packets from old connections that could corrupt a new one reusing the same 4-tuple. Each TIME_WAIT socket costs only ~160 bytes -- the real danger is port exhaustion, not memory.
- ✓ CLOSE_WAIT means the remote peer sent FIN but your application hasn't called close(). This is always an application bug -- leaked socket, missing error handling in a read loop, broken connection pool. CLOSE_WAIT sockets pile up until you hit fd limits.
- ✓ tcp_tw_reuse=1 lets outbound connections reuse TIME_WAIT sockets when the TCP timestamp is strictly increasing. Safe and effective for clients behind proxies. tcp_tw_recycle was removed in kernel 4.12 because it broke NAT.
- ✓ SYN cookies (tcp_syncookies=1, default on) defend against SYN floods without any server-side memory. The kernel encodes connection state into the SYN-ACK's initial sequence number. Valid ACKs reconstruct the connection from thin air.
- ✓ shutdown(SHUT_WR) sends FIN but keeps the fd open for reading -- a half-close. close() sends FIN and releases the fd, but if another process shares the fd (via fork/dup), close() only decrements the refcount.
Common Pitfalls
- ✗ Mistake: treating TIME_WAIT as a bug. Reality: TIME_WAIT is correct TCP behavior that prevents data corruption. The real fix is connection pooling, not hacks like tcp_tw_recycle (which was removed for breaking NAT).
- ✗ Mistake: ignoring CLOSE_WAIT accumulation. Reality: CLOSE_WAIT sockets mean your app received the peer's FIN but never closed its end. Common in HTTP clients that don't drain response bodies, or connection pools that can't detect dead connections.
- ✗ Mistake: setting SO_LINGER with timeout 0. Reality: this causes close() to send RST instead of FIN, destroying the connection without TIME_WAIT. It loses in-flight data and confuses the peer. Only acceptable for aborting known-bad connections.
- ✗ Mistake: not monitoring SYN_RECV queue. Reality: under SYN flood attacks, the SYN queue fills even with SYN cookies enabled. Monitor TcpExtListenDrops and TcpExtTCPReqQFullDrop via nstat to detect drops.
Reference
In One Line
ss state counts tell the whole story: lots of TIME_WAIT means tune connection pooling or widen the port range; any CLOSE_WAIT means the application has a socket leak.