TCP State Machine & Connection Lifecycle
Mental Model
Two people exchanging formal letters through a postal service. Starting a conversation takes three letters back and forth before anyone can send real content. Ending it also requires ceremony -- each side sends a goodbye letter and waits for acknowledgment. After the final goodbye, one side keeps the mailbox open for 60 days in case delayed letters from the old conversation wander in. And if one side sends a goodbye but the other never replies? That mailbox stays frozen, taking up space in the post office, indefinitely.
The Problem
A reverse proxy at 5,000 req/s burns through all 28,000 ephemeral ports in under a minute -- 300,000 TIME_WAIT sockets pile up and new connections fail with EADDRNOTAVAIL. On a different cluster, 12,000 CLOSE_WAIT sockets accumulate over 6 hours, eating file descriptors until the process hits the 65,535 fd limit and crashes. A third service drops 40% of inbound connections during a SYN flood: the SYN queue overflows at 128 entries before SYN cookies kick in, and ListenDrops climbs at 2,000 per second.
Architecture
Run ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn on any production server. The output tells a story.
Thousands of ESTABLISHED? Good -- that is healthy traffic. Tens of thousands of TIME_WAIT? Normal for a busy proxy, but dangerous if ports are running out. Hundreds of CLOSE_WAIT? There is a bug.
TCP's state machine has 11 states, and every connection traces a path through them from birth to death. Most of the time, nobody notices. But when things go wrong -- and in production, they always go wrong -- the state machine is the first place to look.
What Actually Happens
Opening a connection. The client sends SYN (enters SYN_SENT). The server receives it, sends SYN-ACK (enters SYN_RECV). The client receives SYN-ACK, sends ACK (enters ESTABLISHED). The server receives ACK (enters ESTABLISHED). Three packets, three state transitions per side.
Transferring data. Both sides sit in ESTABLISHED. This is where the actual work happens.
Closing a connection. This is where things get interesting. Whoever calls close() first is the "active closer." They send FIN and enter FIN_WAIT_1. The other side receives the FIN, sends ACK, and enters CLOSE_WAIT. When that ACK arrives, the active closer moves to FIN_WAIT_2 and waits for the other side to close as well. When the passive closer finally calls close(), it sends its own FIN and enters LAST_ACK. The active closer receives it, sends the final ACK, and enters TIME_WAIT for 60 seconds.
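To make the sequence concrete, here is a minimal Go sketch of the active-closer side (127.0.0.1:9000 is a hypothetical server); the comments map each call to the state transitions above.

```go
package main

import (
	"io"
	"log"
	"net"
)

func main() {
	// Dial sends SYN (SYN_SENT); returning without error means the
	// three-way handshake completed and we are in ESTABLISHED.
	conn, err := net.Dial("tcp", "127.0.0.1:9000") // hypothetical server
	if err != nil {
		log.Fatal(err)
	}
	tcp := conn.(*net.TCPConn)

	// CloseWrite issues shutdown(SHUT_WR): our FIN goes out and we move
	// to FIN_WAIT_1, then FIN_WAIT_2 once the peer ACKs. Reads still work.
	if err := tcp.CloseWrite(); err != nil {
		log.Fatal(err)
	}

	// Drain until the peer's own FIN arrives as io.EOF. ACKing that FIN
	// is what drops this side into TIME_WAIT for 60 seconds.
	if _, err := io.Copy(io.Discard, tcp); err != nil {
		log.Println("read error:", err)
	}
	tcp.Close() // releases the fd; the kernel keeps the TIME_WAIT entry
}
```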
This is where production problems live.
Under the Hood
TIME_WAIT: the most misunderstood state. TIME_WAIT exists for a specific reason: to prevent old packets from corrupting new connections. If a server restarts and a client reconnects on the same 4-tuple (same IPs and ports), a delayed packet from the old connection could arrive and get mixed into the new one. TIME_WAIT's 60-second duration (2 * MSL, hardcoded in Linux) ensures all old packets have expired.
Here's what most people get wrong: TIME_WAIT is not a problem in itself. Each TIME_WAIT socket costs only ~160 bytes (the kernel uses a special lightweight tcp_timewait_sock instead of the full 2KB tcp_sock). Having 50,000 TIME_WAIT sockets uses about 8MB of memory -- nothing.
The real problem is port exhaustion. With the default ephemeral port range (32768-60999, about 28K ports) and a 60-second timeout, the maximum outbound connection rate to a single destination is ~470/second. Go higher and ports run out.
Fixes, in order of preference: (1) connection pooling -- reuse ESTABLISHED connections instead of creating new ones, (2) widen the port range with sysctl net.ipv4.ip_local_port_range="1024 65535", (3) enable tcp_tw_reuse=1 for outbound connections.
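A minimal sketch of fix (1) in Go, assuming a hypothetical upstream URL: share one http.Client so connections stay ESTABLISHED, and drain response bodies so the transport can return them to the pool.

```go
package main

import (
	"io"
	"log"
	"net/http"
)

// One shared client means one shared connection pool. Creating a new
// client (or closing connections) per request is what feeds TIME_WAIT.
var client = &http.Client{}

func fetch(url string) error {
	resp, err := client.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// Draining the body lets the transport reuse this connection
	// instead of closing it.
	_, err = io.Copy(io.Discard, resp.Body)
	return err
}

func main() {
	if err := fetch("http://upstream.internal/health"); err != nil { // hypothetical URL
		log.Fatal(err)
	}
}
```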
CLOSE_WAIT: always a bug. CLOSE_WAIT means the remote peer sent FIN (they are done) but the application has not called close(). The socket is waiting for the application to finish up. If CLOSE_WAIT sockets are accumulating, the code is leaking file descriptors.
The most common cause: recv() returns 0 (EOF, meaning the peer closed), but the error-handling path doesn't close the socket. Or a connection pool holds onto sockets whose peers have disconnected, without health-checking them.
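The correct pattern, as a Go sketch (process stands in for hypothetical application logic): the deferred Close() guarantees the socket leaves CLOSE_WAIT even on error paths.

```go
package main

import (
	"log"
	"net"
)

// process is a hypothetical stand-in for real application logic.
func process(b []byte) {}

// handle reads until the peer closes. Without the deferred Close(), a
// peer FIN (Read returning io.EOF) would park this socket in CLOSE_WAIT
// and leak the file descriptor.
func handle(conn net.Conn) {
	defer conn.Close()
	buf := make([]byte, 4096)
	for {
		n, err := conn.Read(buf)
		if n > 0 {
			process(buf[:n])
		}
		if err != nil {
			return // io.EOF here means the peer sent FIN
		}
	}
}

func main() {
	ln, err := net.Listen("tcp", ":9000") // hypothetical port
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go handle(conn)
	}
}
```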
RST: the emergency exit. An RST immediately destroys the connection. No TIME_WAIT, no graceful FIN exchange. The kernel sends RST when data arrives for a nonexistent connection, when the application closes a socket with unread data, or when SO_LINGER is set with timeout 0. Receiving RST isn't catastrophic, but it means something unexpected happened.
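The SO_LINGER(0) case is one line in Go -- a sketch of deliberately aborting a connection with RST:

```go
package main

import "net"

// abort tears a connection down with RST instead of the FIN handshake.
// SetLinger(0) sets SO_LINGER with a zero timeout, so Close() discards
// any unsent data and skips TIME_WAIT entirely. Reserve it for
// connections already known to be bad.
func abort(conn *net.TCPConn) error {
	if err := conn.SetLinger(0); err != nil {
		return err
	}
	return conn.Close()
}

func main() {} // sketch only; abort would be called on a live *net.TCPConn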
tcp_tw_reuse vs the removed tcp_tw_recycle. tcp_tw_reuse=1 lets the kernel reuse a TIME_WAIT socket for a new outbound connection if the TCP timestamp is strictly greater. This is safe because timestamps prevent old-packet confusion. tcp_tw_recycle tracked timestamps per source IP, which broke catastrophically when clients shared IPs through NAT. It was removed in kernel 4.12.
Common Questions
Why does TIME_WAIT last exactly 60 seconds?
It's 2 * MSL (Maximum Segment Lifetime), where Linux hardcodes MSL to 30 seconds. The 2x factor ensures both the last ACK and any retransmitted FIN have time to expire. Unlike BSD where MSL is tunable, Linux's 60 seconds is not configurable via sysctl. The only escape for outbound connections is tcp_tw_reuse.
How to debug CLOSE_WAIT accumulation?
Run ss -tanp state close-wait to find the offending process. Then check the code: the most common cause is not closing sockets when recv() returns 0 (EOF). In connection pools, stale connections whose peers have disconnected pile up in CLOSE_WAIT if the pool doesn't periodically health-check.
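One way a pool can health-check idle connections, as a sketch: a short-deadline read distinguishes a peer FIN (io.EOF, meaning the socket is sitting in CLOSE_WAIT) from a merely quiet connection. It assumes a strict request/response protocol where the server never sends unsolicited bytes.

```go
package main

import (
	"errors"
	"io"
	"net"
	"time"
)

// isAlive probes an idle pooled connection. It assumes the protocol is
// strictly request/response, so any readable byte here is unexpected.
func isAlive(conn net.Conn) bool {
	conn.SetReadDeadline(time.Now().Add(time.Millisecond))
	defer conn.SetReadDeadline(time.Time{}) // clear the deadline

	var one [1]byte
	_, err := conn.Read(one[:])
	if errors.Is(err, io.EOF) {
		return false // peer sent FIN; this socket is stuck in CLOSE_WAIT
	}
	var nerr net.Error
	if errors.As(err, &nerr) && nerr.Timeout() {
		return true // nothing to read: the connection is simply idle
	}
	return false // unexpected data or error: treat as unhealthy
}

func main() {} // sketch only
```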
A microservice fails with "cannot assign requested address." What's happening?
Ephemeral port exhaustion from TIME_WAIT. With the default range (32768-60999) and a single destination, at most ~28K connections can be in TIME_WAIT simultaneously. At 500 connections/second with a 60-second TIME_WAIT, that's 30,000 ports in use at steady state -- more than the range holds. Fix: (1) connection pooling (best), (2) widen the port range, (3) enable tcp_tw_reuse=1.
What is a SYN flood and how do SYN cookies protect against it?
A SYN flood sends millions of SYNs with spoofed source IPs, filling the server's SYN queue. Without SYN cookies, each SYN allocates ~256 bytes. SYN cookies encode the connection state (MSS, timestamp, hash) in the SYN-ACK's initial sequence number. No server memory allocated. When the real client's ACK arrives, the kernel reconstructs the connection from the ISN. Tradeoff: SYN cookies can't carry TCP options (window scaling, SACK) negotiated during the handshake.
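The encoding idea is easier to see in code. Below is a deliberately simplified model in Go -- not the kernel's exact bit layout or hash -- where the ISN is a keyed hash of the 4-tuple plus a coarse clock, with the low bits carrying an index into a fixed MSS table.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// synCookie derives a SYN-ACK initial sequence number from the 4-tuple,
// a coarse time counter, and a 3-bit MSS table index -- so the server
// stores nothing per SYN. Simplified model; the kernel's scheme differs.
func synCookie(secret []byte, saddr, daddr [4]byte, sport, dport uint16, t uint32, mssIdx uint8) uint32 {
	mac := hmac.New(sha256.New, secret)
	mac.Write(saddr[:])
	mac.Write(daddr[:])
	binary.Write(mac, binary.BigEndian, sport)
	binary.Write(mac, binary.BigEndian, dport)
	binary.Write(mac, binary.BigEndian, t)
	h := binary.BigEndian.Uint32(mac.Sum(nil))
	return (h &^ 0x7) | uint32(mssIdx&0x7) // low 3 bits: MSS index
}

func main() {
	isn := synCookie([]byte("rotating-secret"), [4]byte{10, 0, 0, 1},
		[4]byte{10, 0, 0, 2}, 54321, 443, 12345, 3)
	fmt.Printf("cookie ISN: %#x (MSS index %d)\n", isn, isn&0x7)
	// Validation recomputes the hash when the ACK returns and checks
	// that ack-1 matches for the current or previous time slot.
}
```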
How Technologies Use This
An Nginx reverse proxy at 5,000 requests per second suddenly fails to connect to backends. The error is EADDRNOTAVAIL. Running ss reveals 300,000 TIME_WAIT sockets consuming every ephemeral port in the default 32768-60999 range. CPU and memory are fine, but the server is effectively down.
Each backend connection that closes enters TIME_WAIT for 60 seconds. At 5,000 closures per second, that is 300,000 TIME_WAIT sockets piling up simultaneously, exhausting all 28,000 available ephemeral ports. The problem is not TIME_WAIT itself (each socket costs only 160 bytes) but port exhaustion: no free ports means no new outbound connections.
Enable upstream keepalive to reuse ESTABLISHED connections instead of creating new ones, cutting TIME_WAIT accumulation by 90% or more. For the remaining churn, enable tcp_tw_reuse=1 to let outbound connections reuse TIME_WAIT sockets safely via TCP timestamps. On the listener side, SO_REUSEADDR lets Nginx rebind port 443 and restart instantly instead of failing bind() while old connections on that port linger in TIME_WAIT.
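For services that manage their own listeners, the same instant-restart behavior comes from setting SO_REUSEADDR before bind(). A sketch using Go's net.ListenConfig (Go's net package already sets this option by default on Unix, so the Control hook is shown purely for illustration; the port is hypothetical):

```go
package main

import (
	"context"
	"log"
	"net"
	"syscall"
)

func main() {
	lc := net.ListenConfig{
		// Control runs after socket() but before bind() -- the only
		// window in which SO_REUSEADDR can take effect.
		Control: func(network, address string, c syscall.RawConn) error {
			var serr error
			if err := c.Control(func(fd uintptr) {
				serr = syscall.SetsockoptInt(int(fd), syscall.SOL_SOCKET,
					syscall.SO_REUSEADDR, 1)
			}); err != nil {
				return err
			}
			return serr
		},
	}
	ln, err := lc.Listen(context.Background(), "tcp", ":8443") // hypothetical port
	if err != nil {
		log.Fatal(err) // without SO_REUSEADDR: "address already in use"
	}
	defer ln.Close()
	log.Println("listening on", ln.Addr())
}
```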
A Kafka consumer crashes, but its partitions go unprocessed for minutes. The consumer group cannot rebalance because the broker still sees the dead consumer's TCP connection as ESTABLISHED. On another broker, CLOSE_WAIT sockets pile up steadily, leaking file descriptors until the process hits its fd limit.
When a consumer process dies without sending FIN, the kernel has no reason to close the TCP connection. Without keepalive probes, the broker's recv() blocks indefinitely on a connection that will never produce data. The CLOSE_WAIT accumulation is a different bug: a consumer received FIN from the broker (via connections.max.idle.ms closing idle connections) but never called close() on its end, leaving the socket stuck in CLOSE_WAIT forever.
Kafka sets SO_KEEPALIVE on its sockets so the kernel probes idle connections and surfaces dead peers as socket errors instead of blocking forever, letting the group detect the failure and rebalance. Note that the kernel defaults are slow: the first probe waits tcp_keepalive_time (7,200 seconds by default), then retries every 75 seconds -- fast dead-peer detection requires tuning these down. The connections.max.idle.ms setting (default 600,000 ms, ten minutes) closes idle connections from the broker side. Monitor for CLOSE_WAIT accumulation on consumers, as it always indicates an application-side bug where received FINs are not followed by close().
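The same protection is available to any Go client through net.Dialer's KeepAlive field, which enables SO_KEEPALIVE and sets the probe period on the socket. A sketch (broker:9092 is a placeholder address):

```go
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	d := net.Dialer{
		Timeout: 5 * time.Second,
		// Enables SO_KEEPALIVE and sets this socket's probe period;
		// the retry count still comes from tcp_keepalive_probes.
		KeepAlive: 30 * time.Second,
	}
	conn, err := d.Dial("tcp", "broker:9092") // placeholder address
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()
	// A peer that dies without sending FIN now fails keepalive probes,
	// and blocked reads return an error instead of hanging forever.
}
```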
A Go microservice making 500 concurrent HTTP requests to a single upstream suddenly fails with "cannot assign requested address." Running ss reveals tens of thousands of TIME_WAIT sockets to that one destination, consuming every ephemeral port in the 32768-60999 range. The service was working fine minutes ago.
The default http.Transport has MaxIdleConnsPerHost set to just 2. Every request beyond 2 concurrent opens a new TCP connection and immediately closes it after use. At 500 requests per second, that creates 498 TIME_WAIT sockets per second, exhausting all 28,000 ephemeral ports in under a minute. The connections are perfectly healthy while active, but the close-and-recreate pattern turns TIME_WAIT into a port exhaustion time bomb.
Raise MaxIdleConnsPerHost to match the expected concurrency so connections stay in ESTABLISHED state and get reused instead of closed. When a request is canceled via its context, Go still closes the connection gracefully with a FIN rather than an abrupt RST, so the peer sees a clean shutdown. And because tcp_tw_reuse operates in the kernel's connect() path, every outbound dial gets that safety net automatically once the sysctl is enabled.
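A sketch of the fix with illustrative numbers (upstream.internal is a placeholder): size the idle pool to the expected concurrency, and drain bodies so connections actually return to it.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

func main() {
	transport := &http.Transport{
		MaxIdleConns:        500,
		MaxIdleConnsPerHost: 500, // the default of 2 forces close-and-reopen churn
		IdleConnTimeout:     90 * time.Second,
	}
	client := &http.Client{Transport: transport, Timeout: 10 * time.Second}

	resp, err := client.Get("http://upstream.internal/") // placeholder URL
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// Fully draining the body is what hands the connection back to the
	// idle pool; an undrained body forces a close and a TIME_WAIT socket.
	io.Copy(io.Discard, resp.Body)
}
```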
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| TIME_WAIT accumulation | Bridge NAT doubles conntrack TTL to 120s | HttpClient connection pool prevents it | http.Agent keepAlive reuses sockets | http.Transport MaxIdleConnsPerHost (default 2 is too low) | Service mesh (Envoy) pools upstream connections |
| CLOSE_WAIT leak | Container restart masks the leak | Finalizers may delay close(); use try-with-resources | socket.destroy() required on error paths | defer conn.Close() in every handler | Sidecar proxy absorbs leaked app connections |
| SYN flood defense | Host kernel SYN cookies protect all containers | N/A (handled by OS) | N/A (handled by OS) | N/A (handled by OS) | Cloud LB SYN proxy offloads to edge |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Hardware/NIC | NIC interrupt coalescing affects SYN processing rate |
| Kernel TCP | tcp_states.h state machine, tcp_timewait_sock, SYN cookies |
| Syscall | connect(), accept(), shutdown(), close(), getsockopt(TCP_INFO) |
| Userspace | Connection pools (pgbouncer, Envoy, http.Transport) |
| Orchestration | K8s readinessProbe prevents routing to TIME_WAIT-saturated pods |
Design Rationale: TIME_WAIT is not a bug -- it is the guard against delayed packets from an old connection arriving on a recycled 4-tuple and corrupting a new one. CLOSE_WAIT exists so the application has time to finish processing before closing its end. SYN cookies are the clever compromise for flood defense: encode connection state in the sequence number itself, requiring zero server memory per SYN, at the cost of losing TCP option negotiation during the handshake.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| EADDRNOTAVAIL on connect() | Ephemeral port exhaustion from TIME_WAIT | ss -tan state time-wait \| wc -l and sysctl net.ipv4.ip_local_port_range |
| CLOSE_WAIT count growing over hours | Application not calling close() after peer FIN | ss -tanp state close-wait to find leaking process |
| Connection timeout on SYN (no SYN-ACK) | SYN queue overflow or SYN flood | nstat -az \| grep ListenDrop |
| Intermittent RST from server | SO_LINGER(0) or data arriving on closed socket | ss -tanp and check application linger settings |
| Service unreachable after restart | Old connections on the port in TIME_WAIT block bind() | ss -tan sport = :PORT and use SO_REUSEADDR |
| Conntrack table full in Docker/NAT | TIME_WAIT entries persist 120s in conntrack | conntrack -C and dmesg \| grep conntrack |
When to Use / Avoid
- Use when diagnosing connection failures, port exhaustion, or fd leaks on any TCP-based service
- Use when tuning high-throughput proxies or load balancers that open thousands of short-lived backend connections per second
- Use when investigating SYN flood attacks or accept queue drops on public-facing listeners
- Avoid when working with UDP-only services (DNS, QUIC) -- UDP has no connection state machine
- Avoid when the issue is clearly application-layer (HTTP 5xx, TLS handshake) rather than transport-layer
Try It Yourself
```bash
# Count connections by state
ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn

# Find CLOSE_WAIT sockets (application bug)
ss -tanp state close-wait

# Count TIME_WAIT sockets per destination
ss -tan state time-wait | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head

# Check TIME_WAIT reuse setting
sysctl net.ipv4.tcp_tw_reuse

# Monitor accept queue drops in real-time
watch -n1 'nstat -az | grep -i listen'

# Show ephemeral port range (affects TIME_WAIT impact)
sysctl net.ipv4.ip_local_port_range
```

Debug Checklist
1. ss -tan | awk '{print $1}' | sort | uniq -c | sort -rn
2. ss -tanp state close-wait
3. ss -tan state time-wait | wc -l
4. nstat -az | grep -i 'ListenDrop\|ListenOver\|TWKill'
5. sysctl net.ipv4.tcp_tw_reuse net.ipv4.ip_local_port_range net.ipv4.tcp_max_tw_buckets
6. cat /proc/net/sockstat | grep TCP
Key Takeaways
- ✓ TIME_WAIT is 60 seconds on Linux. Hardcoded. Not tunable. Its job is to absorb delayed packets from old connections that could corrupt a new one reusing the same 4-tuple. Each TIME_WAIT socket costs only ~160 bytes -- the real danger is port exhaustion, not memory.
- ✓ CLOSE_WAIT means the remote peer sent FIN but your application hasn't called close(). This is always an application bug -- leaked socket, missing error handling in a read loop, broken connection pool. CLOSE_WAIT sockets pile up until you hit fd limits.
- ✓ tcp_tw_reuse=1 lets outbound connections reuse TIME_WAIT sockets when the TCP timestamp is strictly increasing. Safe and effective for clients behind proxies. tcp_tw_recycle was removed in kernel 4.12 because it broke NAT.
- ✓ SYN cookies (tcp_syncookies=1, default on) defend against SYN floods without any server-side memory. The kernel encodes connection state into the SYN-ACK's initial sequence number. Valid ACKs reconstruct the connection from thin air.
- ✓ shutdown(SHUT_WR) sends FIN but keeps the fd open for reading -- a half-close. close() sends FIN and releases the fd, but if another process shares the fd (via fork/dup), close() only decrements the refcount.
Common Pitfalls
- ✗ Mistake: treating TIME_WAIT as a bug. Reality: TIME_WAIT is correct TCP behavior that prevents data corruption. The real fix is connection pooling, not hacks like tcp_tw_recycle (which was removed for breaking NAT).
- ✗ Mistake: ignoring CLOSE_WAIT accumulation. Reality: CLOSE_WAIT sockets mean your app received the peer's FIN but never closed its end. Common in HTTP clients that don't drain response bodies, or connection pools that can't detect dead connections.
- ✗ Mistake: setting SO_LINGER with timeout 0. Reality: this causes close() to send RST instead of FIN, destroying the connection without TIME_WAIT. It loses in-flight data and confuses the peer. Only acceptable for aborting known-bad connections.
- ✗ Mistake: not monitoring SYN_RECV queue. Reality: under SYN flood attacks, the SYN queue fills even with SYN cookies enabled. Monitor TcpExtListenDrops and TcpExtTCPReqQFullDrop via nstat to detect drops.
Reference
In One Line
ss state counts tell the whole story: lots of TIME_WAIT means tune connection pooling or widen the port range; any CLOSE_WAIT means the application has a socket leak.