Socket Programming (TCP/UDP)
Mental Model
A reception desk at a busy office. The receptionist sits at one desk with one phone number. Visitors arrive and fill out a sign-in sheet -- that is the SYN queue. Once checked in, they wait in the lobby -- the accept queue -- until a meeting room opens up. Each visitor gets a private room (a fresh fd from accept) where they can talk freely. The receptionist never leaves the desk, never joins a meeting. Dozens of meetings run simultaneously in separate rooms while the desk keeps greeting new arrivals.
The Problem
A Kafka broker restarts and sits unable to bind its port for 60 seconds -- TIME_WAIT sockets own the address and nobody set SO_REUSEADDR. A burst of 5,000 producer reconnections overflows the accept queue and connections vanish with no server-side error. Sixteen Nginx workers sharing one listen socket burn 30-40% CPU just waking up and finding nothing -- classic thundering herd, 15 out of 16 context-switch for zero useful work. PostgreSQL exhausts its 100 connection slots because dead clients without SO_KEEPALIVE leave recv() blocking forever. And Redis p99 latency spikes to 200ms because Nagle buffering holds 20-byte responses hostage, waiting for an ACK that takes its time arriving.
Architecture
Every networked application ever built -- web browsers, databases, chat apps, game servers -- talks through sockets. They are so fundamental that most developers never think about them. Just call connect() or listen() and data flows.
But here is the thing. Between the application code and the network wire, the kernel is doing an enormous amount of work: segmenting data into packets, retransmitting lost ones, controlling flow to avoid overwhelming the receiver. Sockets are the thin abstraction layer that hides all this complexity behind file descriptors.
When that abstraction leaks -- and it will -- knowing what is actually happening underneath becomes essential.
What Actually Happens
A TCP server follows a strict lifecycle. Each step maps to a syscall:
socket() creates an unbound socket object in the kernel. At this point, it's just a data structure with no address, no port, no connections.
bind() assigns a local address and port. Now the kernel knows where to deliver packets destined for this socket.
listen() is where the magic happens. It transitions the socket from "active" to "passive" mode and creates two internal queues: the SYN queue (half-open connections that have started the handshake) and the accept queue (fully established connections waiting to be picked up by the application).
accept() dequeues the next completed connection from the accept queue and returns a brand new file descriptor. The original listening socket stays exactly as it was, continuing to accept new connections. This is the key insight: one listening socket can produce thousands of connected sockets, each identified by a unique 4-tuple.
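The whole server lifecycle fits in a page of C. A minimal sketch (port 8080 and the echo behavior are arbitrary choices; error handling is reduced to perror and exit):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

int main(void) {
    /* 1. socket(): an unbound kernel object, no address yet */
    int lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0) { perror("socket"); exit(1); }

    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    /* 2. bind(): attach a local address and port */
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }

    /* 3. listen(): go passive; backlog sizes the accept queue
       (effective value is min(backlog, net.core.somaxconn)) */
    if (listen(lfd, 1024) < 0) { perror("listen"); exit(1); }

    for (;;) {
        /* 4. accept(): dequeue one ESTABLISHED connection as a fresh fd */
        int cfd = accept(lfd, NULL, NULL);
        if (cfd < 0) continue;
        char buf[4096];
        ssize_t n = recv(cfd, buf, sizeof(buf), 0);
        if (n > 0) send(cfd, buf, (size_t)n, 0); /* echo; short writes ignored here */
        close(cfd); /* lfd is untouched and keeps listening */
    }
}
```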
For a client, it's simpler: socket(), connect() (triggers the three-way handshake), send()/recv(), close().
For UDP, it is even simpler: no connections, no handshake, no accept queue. After socket() and bind(), the process calls recvfrom() to grab individual datagrams and sendto() to fire them off. Each datagram is independent. That's why DNS, game servers, and anything latency-sensitive often chooses UDP.
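The UDP equivalent for contrast -- note the absence of listen() and accept() (again a minimal sketch; port 9000 is arbitrary):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <sys/types.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0); /* datagram socket: no handshake */
    if (fd < 0) { perror("socket"); exit(1); }

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(9000);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) { perror("bind"); exit(1); }

    for (;;) {
        char buf[1500];
        struct sockaddr_in peer;
        socklen_t plen = sizeof(peer);
        /* each recvfrom() returns exactly one datagram plus its sender's address */
        ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, (struct sockaddr *)&peer, &plen);
        if (n >= 0)
            sendto(fd, buf, (size_t)n, 0, (struct sockaddr *)&peer, plen); /* echo back */
    }
}
```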
Under the Hood
The two queues worth knowing. When a client SYN arrives, the kernel responds with SYN-ACK and puts the connection in the SYN queue. When the client's ACK comes back, it moves to the accept queue as ESTABLISHED. The application's accept() call pulls from the accept queue.
Here is the catch: if the accept queue is full (the application is too slow calling accept()), the kernel silently drops the final ACK. The client thinks the connection is established -- it got a SYN-ACK -- but the server never creates a socket. The client's first data send eventually times out. Monitor this with nstat -az | grep ListenOverflows.
SO_REUSEADDR vs SO_REUSEPORT. SO_REUSEADDR is defensive: it allows binding to a port stuck in TIME_WAIT so server restarts do not fail for 60 seconds. Every TCP server should set it. SO_REUSEPORT is offensive: it lets multiple sockets bind to the same port, and the kernel distributes connections across them via a hash. This is how Nginx (through its reuseport listen option) and Envoy eliminate the thundering herd problem; Go exposes the same option through net.ListenConfig's Control callback.
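In C, both are a single setsockopt call made between socket() and bind() (a sketch; SO_REUSEPORT requires Linux 3.9+):

```c
#include <sys/socket.h>

/* Call after socket() and before bind(). Returns 0 on success, -1 on error. */
int make_rebindable(int fd) {
    int one = 1;
    /* defensive: rebind a port still held by TIME_WAIT sockets */
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one)) < 0)
        return -1;
    /* offensive: let several sockets bind the same port, each with its own queue */
    return setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));
}
```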
Non-blocking sockets. By default, accept(), recv(), and send() block until they can do something useful. Setting O_NONBLOCK makes them return immediately with EAGAIN when nothing is ready. This is the foundation of every high-performance event-driven server -- combine non-blocking sockets with epoll and a single thread can handle hundreds of thousands of connections.
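A skeletal version of that pattern, assuming Linux (epoll and accept4 are Linux-specific; a real server adds write-readiness handling and per-connection buffering):

```c
#define _GNU_SOURCE  /* for accept4 and SOCK_NONBLOCK */
#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* lfd must already be listening and marked non-blocking. */
void event_loop(int lfd) {
    int ep = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

    for (;;) {
        struct epoll_event events[64];
        int n = epoll_wait(ep, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {
                /* drain the accept queue; each new fd starts non-blocking */
                for (;;) {
                    int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);
                    if (cfd < 0) break; /* EAGAIN: nothing left to accept */
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                    epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
                }
            } else {
                char buf[4096];
                ssize_t r = recv(fd, buf, sizeof(buf), 0);
                if (r == 0 || (r < 0 && errno != EAGAIN && errno != EINTR))
                    close(fd); /* peer closed or hard error; close drops it from epoll */
                /* else: process the r bytes in buf */
            }
        }
    }
}
```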
TCP_NODELAY. TCP's Nagle algorithm batches small writes until it receives an ACK, adding up to 200ms of latency. For request-response protocols like HTTP, Redis, or gRPC, this is unacceptable. TCP_NODELAY disables Nagle, sending data immediately.
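Disabling Nagle is a single option on the connected socket (sketch):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Send small writes immediately instead of waiting for an ACK. */
int set_nodelay(int fd) {
    int one = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
}
```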
Common Questions
Why can a server on port 80 handle millions of connections when there are only 65,535 ports?
The port limit applies per IP address. A TCP connection is identified by a 4-tuple: (src_ip, src_port, dst_ip, dst_port). The server uses one port (80), but each client connection has a unique (client_ip, client_port) pair. The theoretical limit is number_of_client_IPs * 65,535 per server port.
What happens when the accept queue fills up?
The kernel drops the client's final ACK. The client believes the connection succeeded (it got SYN-ACK), but no server-side socket exists. The client's first send() eventually times out. Check ListenOverflows in /proc/net/netstat. Fix: increase listen() backlog and net.core.somaxconn.
Why doesn't UDP need listen() or accept()?
UDP is connectionless. No state machine, no handshake, no per-connection socket. A single UDP socket receives datagrams from any sender via recvfrom(), with the sender's address returned alongside each datagram.
How does SO_REUSEPORT distribute connections?
The kernel hashes the incoming connection's source IP and port and assigns it to one of the sockets bound to the port. Same client, same socket -- useful for stateful protocols, though the mapping reshuffles when sockets are added or removed. Since kernel 4.6, a BPF program can be attached to customize the selection.
How Technologies Use This
A Kafka broker crashes and gets restarted, but it refuses to bind its listener port for 60 seconds. Meanwhile, a burst of producer reconnections overflows the accept queue, silently dropping new connections. Cross-datacenter replication crawls at a fraction of the link speed despite plenty of bandwidth.
The port refusal happens because TIME_WAIT sockets hold the address without SO_REUSEADDR. The accept queue overflow occurs because the default listen backlog is too small for reconnection storms. The replication bottleneck is a classic bandwidth-delay product mismatch: default TCP buffers are far too small for high-latency WAN links, so the sender stalls waiting for window updates.
Kafka sets SO_REUSEADDR before bind and uses a listen backlog of 1024+ to absorb connection storms. Its NIO Selectors backed by epoll multiplex 10,000 connections across a handful of threads at under 5% CPU overhead. For cross-datacenter replication, tuning socket send and receive buffers to match the bandwidth-delay product unlocks the full link capacity.
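The bandwidth-delay product math is simple enough to show. For a hypothetical 1 Gbps link with 80 ms RTT, the sender can have 125 MB/s x 0.08 s = 10 MB in flight, so the socket buffers need roughly that much. A sketch (the 10 MB constant is illustrative, and net.core.rmem_max / wmem_max cap what the kernel will actually grant):

```c
#include <sys/socket.h>

/* Size buffers for the link's bandwidth-delay product.
 * Example: 1 Gbps * 80 ms RTT = 125 MB/s * 0.08 s = 10 MB. */
int set_wan_buffers(int fd) {
    int bdp = 10 * 1024 * 1024; /* illustrative; derive from your link */
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bdp, sizeof(bdp)) < 0)
        return -1;
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bdp, sizeof(bdp));
}
```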
With 16 Nginx workers sharing a single listen socket, CPU usage spikes 30-40% above expected under high concurrency. Profiling reveals the overhead is not in request handling but in context switches from the thundering herd: every incoming connection wakes all 16 workers, and 15 of them do nothing.
The root cause is the shared accept queue. When one connection arrives, the kernel wakes every process blocked on accept() for that socket. Only one wins; the other 15 are woken, context-switched in, find nothing, and go back to sleep. At thousands of connections per second, those wasted wakeups dominate CPU time.
SO_REUSEPORT lets each worker bind its own socket on the same port, giving each its own accept queue. The kernel distributes incoming connections via a source-hash, eliminating cross-worker contention entirely. Combined with non-blocking sockets and epoll, each worker independently handles 10,000+ concurrent connections, and benchmarks show 20-30% lower request latency compared to the old accept-mutex approach.
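The shape of the fix in C, assuming Linux 3.9+ (a sketch: each forked worker creates and binds its own socket, so each gets a private accept queue and only that worker is woken):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

static int open_worker_listener(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    listen(fd, 1024);
    return fd;
}

int main(void) {
    for (int i = 0; i < 16; i++) {
        if (fork() == 0) {                         /* worker process */
            int lfd = open_worker_listener(8080);  /* own socket, own accept queue */
            for (;;) {
                int cfd = accept(lfd, NULL, NULL); /* no cross-worker wakeups */
                if (cfd >= 0) close(cfd);          /* ... handle cfd, then close ... */
            }
        }
    }
    for (;;) pause(); /* parent only supervises */
}
```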
An application suddenly cannot connect to PostgreSQL. The server reports all 100 connection slots are in use, yet only a handful of clients are actively querying. Several slots are held by backends stuck on recv(), waiting for data from clients whose networks died silently minutes ago.
The postmaster fork()s a dedicated backend per client that inherits the connected socket fd. When a client's network disappears without sending FIN, the backend has no way to know the peer is gone. Without SO_KEEPALIVE, recv() blocks indefinitely, and that connection slot is consumed forever. With only 100 default slots, a few dead clients can lock out the entire application.
PostgreSQL enables SO_KEEPALIVE with configurable probe intervals (tcp_keepalives_idle defaults to 0, meaning the OS default -- 2 hours on Linux -- but production deployments typically lower it to 60 seconds) to detect dead clients and reclaim their slots automatically. Local clients bypass TCP entirely via a Unix domain socket, cutting per-query latency by 30-50%.
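On Linux the same tuning is available per socket (a sketch; the 60 s / 10 s / 5-probe values mirror the common production choice mentioned above, not any standard default):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Declare a silent peer dead after roughly 60 + 5*10 seconds. */
int enable_keepalive(int fd) {
    int on = 1, idle = 60, intvl = 10, cnt = 5;
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle)) < 0)
        return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof(intvl)) < 0)
        return -1;
    return setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof(cnt));
}
```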
Redis clients report p99 latency spiking to 200ms even though the server processes each GET in 0.1ms. Worse, during peak load with 100,000 connected clients, random clients appear to freeze for seconds while others respond instantly. After a crash, restarting the server fails with EADDRINUSE for up to 60 seconds.
The 200ms spike comes from Nagle buffering: without TCP_NODELAY, a tiny 20-byte response sits in the kernel send buffer waiting for an ACK before being sent. The client freezes come from blocking sockets: if even one recv() blocks on a slow client, every other client on the single-threaded event loop starves. The restart failure happens because TIME_WAIT sockets hold the port without SO_REUSEADDR.
Redis creates sockets with SOCK_NONBLOCK and registers them with epoll, so the event loop cycles through accept, read, process, and write in microseconds per iteration. TCP_NODELAY is set unconditionally so every response hits the wire immediately. SO_REUSEADDR prevents restart failures. The result is sub-millisecond p99 latency at 80,000+ operations per second on a single core.
A service that naively spawns one OS thread per connection collapses at 100,000 concurrent connections: the OS reports 100,000 threads, memory consumption runs to gigabytes, the scheduler drowns in context-switch overhead, and memory pressure eventually kills the process.
The key insight is that Go's runtime translates every blocking net.Conn.Read() into a non-blocking epoll registration under the hood. When a goroutine calls Read(), it does not actually block an OS thread. The runtime parks the goroutine and registers the socket fd with epoll. When data arrives, epoll wakes the goroutine on whichever OS thread is available. This means 100,000 goroutines each waiting on their own socket require only 8-16 OS threads and under 200MB of memory.
Go's net.Listen sets SO_REUSEADDR by default so server restarts never fail on TIME_WAIT. The net.ListenConfig Control callback allows setting SO_REUSEPORT before bind for multi-process load balancing without any CGo or raw syscalls. The goroutine-to-epoll bridge delivers synchronous code with asynchronous performance.
A Node.js server on a 16-core machine hits a traffic spike and maxes out at roughly 10,000 requests per second. CPU monitoring shows one core pegged at 100% while the other 15 sit completely idle. 94% of the hardware is wasted because the single-threaded event loop cannot spread work across cores.
The single event loop is fundamentally bound to one CPU. No matter how efficient it is, one thread can only use one core. Without the cluster module, additional cores are unreachable. Without TCP_NODELAY, HTTP responses would also sit in Nagle buffers for up to 200ms before reaching clients, compounding the throughput ceiling.
The cluster module spawns worker processes that share the listen port: in its default round-robin mode, the primary process accepts connections and passes the fds to workers over the IPC channel (a Unix domain socket); the alternative mode shares the listening fd itself. Each worker runs its own epoll-based event loop with TCP_NODELAY enabled, so responses leave immediately. With 16 workers, throughput scales nearly linearly to 150,000+ requests per second, limited mainly by the listen backlog and somaxconn settings.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Socket creation | Container port mapping (-p) NATs to host socket; bridge network adds overhead | ServerSocket/SocketChannel in java.nio; Netty wraps epoll on Linux | net.createServer() wraps socket+bind+listen; cluster module for multi-process | net.Listen() sets SO_REUSEADDR by default; goroutine-per-connection with epoll bridge | Service ClusterIP/NodePort creates iptables/IPVS rules routing to pod sockets |
| Connection distribution | Docker Swarm uses IPVS for load balancing across containers | Netty's SO_REUSEPORT support for multi-threaded accept | cluster module passes accepted fds via IPC (round-robin) or shares the listen fd | net.ListenConfig Control callback for SO_REUSEPORT | kube-proxy distributes via iptables DNAT or IPVS across endpoints |
| Keep-alive / dead peer | Docker health checks at container level; TCP keepalive inside app | SO_KEEPALIVE via setOption on SocketChannel | server.keepAliveTimeout (default 5s); SO_KEEPALIVE set by default | net.Dialer.KeepAlive (default 15s) | readiness probes detect unresponsive pods; TCP liveness probes optional |
| Buffer tuning | --sysctl flag to set net.core.rmem_max inside container | -Djdk.net.tcp.* or SocketChannel.setOption for buffer sizes | Socket.bufferSize or raw setsockopt via native addon | net.Dialer with custom Control to set buffer sizes | sysctl tuning via init containers or securityContext |
| Non-blocking I/O | N/A -- handled by application inside container | NIO Selector wraps epoll; Netty's EpollEventLoopGroup for native | libuv event loop wraps epoll; single-threaded with worker threads for CPU | Go runtime netpoller wraps epoll; transparent to goroutines | N/A -- handled by application; Envoy/Istio sidecar manages connection pooling |

| Stack Layer | Mechanism |
|---|---|
| Application | socket(), bind(), listen(), accept(), connect(), send(), recv() syscalls |
| Socket layer | struct socket (VFS) + struct sock (protocol state, buffers, queues) |
| TCP/IP stack | SYN/accept queues, congestion control, retransmission, sk_buff packet chains |
| Network device | Driver queues (txqueuelen), interrupt coalescing, checksum offload |
| Hardware | NIC ring buffers, RSS (Receive Side Scaling) for multi-queue distribution |
Design rationale: Sockets hide an enormous amount of machinery -- segmentation, retransmission, flow control, congestion -- behind plain file descriptors so that applications can use read() and write() as if the network were a file. The cost of that abstraction is that the defaults are generic. Buffer sizes, keepalive intervals, and Nagle behavior all need per-workload tuning, and the accept queue must be sized for peak burst rates, not averages.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| EADDRINUSE on server restart | TIME_WAIT sockets holding the port; SO_REUSEADDR not set | ss -tan state time-wait sport = :$PORT |
| Clients connect but first send times out | Accept queue full; server too slow calling accept() | nstat -az (look for ListenOverflows) |
| 30-40% CPU overhead with multiple workers | Thundering herd on shared listen socket | Switch to SO_REUSEPORT; each worker gets own accept queue |
| p99 latency spikes on small responses | Nagle algorithm buffering small writes | Verify TCP_NODELAY is set on the socket |
| Dead clients hold connection slots indefinitely | No SO_KEEPALIVE; peer died without sending FIN | ss -to to check keepalive timer; enable SO_KEEPALIVE with short intervals |
| send() returns fewer bytes than requested | Normal TCP behavior; short write not handled in a loop | Wrap send() in a while loop retrying with remaining data |
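The last row comes up constantly, so here is the standard retry loop (a sketch that also retries on EINTR, which the Common Pitfalls below call out):

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send exactly len bytes; retries short writes and signal interruptions.
 * Returns 0 on success, -1 on a real error (EPIPE, ECONNRESET, ...). */
int send_all(int fd, const char *buf, size_t len) {
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted by a signal: just retry */
            return -1;
        }
        sent += (size_t)n; /* short write: continue with the remainder */
    }
    return 0;
}
```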
When to Use / Avoid
- Use TCP sockets when reliable, ordered delivery is required (HTTP, database protocols, message queues)
- Use UDP sockets when latency matters more than reliability (DNS, gaming, real-time media)
- Use SO_REUSEADDR on every TCP server to prevent EADDRINUSE on restart during TIME_WAIT
- Use SO_REUSEPORT to distribute connections across multiple worker processes and eliminate thundering herd
- Use TCP_NODELAY for request-response protocols where Nagle buffering adds unacceptable latency
- Avoid blocking sockets in high-concurrency servers -- use non-blocking + epoll or io_uring instead
Try It Yourself
```sh
# List all listening TCP sockets with process info
ss -tlnp

# Show socket statistics summary
ss -s

# Check accept queue overflow
nstat -az | grep -i listen

# Monitor TCP connection states
ss -tan state established | wc -l

# Check somaxconn (max accept queue)
sysctl net.core.somaxconn

# Trace new connections on port 80
tcpdump -i any 'tcp port 80 and tcp[tcpflags] & (tcp-syn) != 0' -nn
```
Debug Checklist
1. ss -tlnp -- list all listening TCP sockets with process info and backlog
2. ss -s -- summary of socket states (established, TIME_WAIT, etc.)
3. nstat -az | grep -i listen -- check for accept queue overflows (ListenOverflows, ListenDrops)
4. ss -tan state time-wait | wc -l -- count TIME_WAIT sockets
5. sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog -- check queue size limits
6. tcpdump -i any port $PORT -nn -c 20 -- capture initial packets to debug connection issues
Key Takeaways
- ✓ The backlog in listen(fd, backlog) controls the accept queue, not the SYN queue. Most people get this backwards. On modern kernels, the SYN queue is dynamically sized, and the backlog is capped by net.core.somaxconn (default 4096 since kernel 5.4).
- ✓ SO_REUSEADDR lets you bind to a port stuck in TIME_WAIT -- it's why every TCP server sets it before bind(). SO_REUSEPORT goes further: multiple sockets on the same port, kernel distributes connections via hash. This is how Nginx eliminated the thundering herd.
- ✓ accept() returns a NEW file descriptor every time. The listening socket never changes. That's how a server on port 80 handles thousands of connections: each one has a unique 4-tuple (src_ip, src_port, dst_ip, dst_port).
- ✓ UDP sockets can call connect() too -- it sets a default destination so send() works instead of sendto(), and the kernel filters incoming packets to only deliver from the connected peer. It also enables the kernel to return ICMP errors as socket errors.
- ✓ send() and recv() may transfer fewer bytes than you asked for. Always loop. MSG_WAITALL on recv() blocks until the full length arrives (TCP only), but for send(), there's no shortcut -- you must handle short writes yourself.
Common Pitfalls
- ✗ Mistake: not handling EINTR. Reality: system calls like accept(), recv(), and send() can be interrupted by signals and return -1 with errno=EINTR. Production code must retry the call.
- ✗ Mistake: ignoring send()'s return value. Reality: send() on a TCP socket may send fewer bytes than requested (short write). Always check the return value and retry with the remaining data.
- ✗ Mistake: forgetting SO_REUSEADDR before bind(). Reality: without it, restarting a server within TIME_WAIT (up to 60 seconds) fails with EADDRINUSE. Every TCP server should set this option.
- ✗ Mistake: using a small backlog in listen(). Reality: on high-connection-rate servers, a small backlog causes the accept queue to fill, dropping new connections silently. Set backlog to at least 1024 or SOMAXCONN.
Reference
In One Line
SO_REUSEADDR before bind, backlog sized for peak connection rate, TCP_NODELAY on every interactive socket, and always loop on short reads/writes -- the four things every TCP server gets wrong at least once.