Unix Domain Sockets
Mental Model
Two offices on the same floor with a mail slot cut into the shared wall. Instead of stuffing a letter into an envelope, stamping it, sending it down to the mailroom, having it sorted and routed and delivered back upstairs, one office slides paper straight through the slot. And the slot has a trick: an office can pass its actual key through, and the other side gets a working copy that opens the same door. No carrier involved, no postage required.
The Problem
Nginx proxying 50,000 req/s to a backend on the same machine over TCP loopback wastes 10-15 microseconds of kernel CPU per request -- checksums, routing lookups, congestion control -- for data that never touches a wire. Redis at 100,000 ops/s over loopback adds 0.05ms of pure kernel overhead per command, which accumulates to roughly 5 seconds of overhead for every second of wall-clock time. Docker exposed on a TCP port can be compromised in minutes when a single network ACL is misconfigured. PostgreSQL OLTP workloads show 30-50% lower throughput over TCP loopback versus the local Unix socket, and short-lived connections pile up TIME_WAIT sockets that eat ephemeral ports for no reason.
Architecture
Every Docker command talks through one. Every PostgreSQL query on localhost uses one. Every Nginx zero-downtime restart depends on one.
Unix domain sockets are everywhere in modern infrastructure, yet most developers never interact with them directly. They just see the speed improvement and move on. But understanding what they actually do -- and what they can do that TCP never could -- reveals why they are the backbone of local IPC.
Here's the short version: they're sockets that skip the network.
What Actually Happens
When two processes communicate over AF_UNIX, the data path is remarkably simple: sender's user buffer --> kernel socket buffer --> receiver's user buffer. That's it. No IP routing decisions. No TCP state machine. No checksum computation. No segmentation. No congestion control.
The result? 2-3x higher throughput and 50% lower latency compared to TCP loopback for small messages.
Unix domain sockets come in three flavors:
SOCK_STREAM -- a reliable, ordered byte stream. Works exactly like TCP, but faster.
SOCK_DGRAM -- datagrams with message boundaries. Unlike UDP, the kernel never silently drops them: when the receiver's buffer is full, the sender blocks (or gets EAGAIN) instead of losing data.
SOCK_SEQPACKET -- the best of both worlds. Reliable, ordered, connection-oriented, AND preserves message boundaries. Each send() maps to exactly one recv(). Perfect for control protocols where framing matters.
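To make the stream flavor concrete, here is a minimal sketch in Go (standard library only) of a SOCK_STREAM echo over a Unix socket. The path /tmp/echo.sock is a placeholder, and client and server are collapsed into one process for brevity.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	const path = "/tmp/echo.sock" // placeholder path
	os.Remove(path)               // unlink any stale socket file so bind() does not fail with EADDRINUSE

	// socket(AF_UNIX, SOCK_STREAM) + bind + listen, all behind one call.
	ln, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		buf := make([]byte, 64)
		n, _ := conn.Read(buf)
		conn.Write(buf[:n]) // echo the bytes straight back
	}()

	// connect() on AF_UNIX completes synchronously -- no three-way handshake.
	c, err := net.Dial("unix", path)
	if err != nil {
		panic(err)
	}
	defer c.Close()
	c.Write([]byte("hello"))
	reply := make([]byte, 64)
	n, _ := c.Read(reply)
	fmt.Println(string(reply[:n])) // "hello"
}
```

The same code with net.Listen("tcp", ...) and net.Dial("tcp", ...) would behave identically at the API level; only the kernel path underneath changes.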
Under the Hood
The killer feature: fd passing. When a file descriptor is sent via SCM_RIGHTS, the kernel calls fget() on the fd to grab the struct file, bumps the reference count, allocates a fresh fd number in the receiver's table via get_unused_fd_flags(), and installs the file reference. The receiver gets a completely independent fd pointing to the same underlying kernel object.
This works for any fd type: regular files, sockets, pipes, eventfds, timerfd, signalfd -- anything.
This is how Nginx does zero-downtime restarts. The old master passes listen socket fds to the new master over a Unix domain socket using SCM_RIGHTS. The new master fork()s its own workers with those fds. At no point is the listen socket closed. Zero dropped connections.
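Here is a sketch of that SCM_RIGHTS mechanism in Go's standard library, with a goroutine standing in for the second process. The socket path and the file being passed (/etc/hostname) are arbitrary placeholders.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
)

func main() {
	const sock = "/tmp/fdpass.sock" // placeholder path
	os.Remove(sock)
	ln, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	done := make(chan struct{})
	// "Receiver": accept, read the ancillary data, rebuild a working fd.
	go func() {
		defer close(done)
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		uc := conn.(*net.UnixConn)
		buf := make([]byte, 16)
		oob := make([]byte, syscall.CmsgSpace(4)) // room for exactly one 32-bit fd
		_, oobn, _, _, err := uc.ReadMsgUnix(buf, oob)
		if err != nil {
			panic(err)
		}
		msgs, _ := syscall.ParseSocketControlMessage(oob[:oobn])
		fds, _ := syscall.ParseUnixRights(&msgs[0])
		f := os.NewFile(uintptr(fds[0]), "received") // new fd number, same struct file
		defer f.Close()
		data := make([]byte, 64)
		n, _ := f.Read(data)
		fmt.Printf("receiver read via passed fd: %q\n", data[:n])
	}()

	// "Sender": open a file and push its descriptor through the socket.
	f, err := os.Open("/etc/hostname") // placeholder file to pass
	if err != nil {
		panic(err)
	}
	defer f.Close()
	conn, err := net.Dial("unix", sock)
	if err != nil {
		panic(err)
	}
	uc := conn.(*net.UnixConn)
	rights := syscall.UnixRights(int(f.Fd())) // build the SCM_RIGHTS control message
	uc.WriteMsgUnix([]byte("x"), rights, nil) // at least one data byte must ride along
	<-done
}
```

Nginx does the equivalent in C during binary upgrade; the cmsg plumbing is the same.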
Three ways to address a Unix socket:
Filesystem path (e.g., /var/run/docker.sock) -- creates a visible socket file, access controlled by file permissions.
Abstract namespace (sun_path starts with \0) -- exists only in memory within the network namespace, vanishes automatically when the last fd closes.
Unnamed (socketpair()) -- two connected sockets with no address at all, perfect for parent-child communication.
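The unnamed flavor is the easiest to see in code. A sketch using Go's syscall.Socketpair: two already-connected endpoints, no bind(), no filesystem entry. In real use one end would be handed to a forked child (for example via exec.Cmd.ExtraFiles); here both ends live in one process for illustration.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Two connected SOCK_STREAM sockets with no address at all.
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	parent := os.NewFile(uintptr(fds[0]), "parent-end")
	child := os.NewFile(uintptr(fds[1]), "child-end")
	defer parent.Close()
	defer child.Close()

	parent.Write([]byte("ping"))
	buf := make([]byte, 16)
	n, _ := child.Read(buf)
	fmt.Println(string(buf[:n])) // "ping"
}
```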
Why it's faster than TCP loopback. TCP on localhost still does the full dance: three-way handshake (kernel processing even if instantaneous), checksums on every packet, Nagle algorithm buffering, congestion window management, and TIME_WAIT on close. AF_UNIX skips all of this. connect() completes synchronously. The kernel doesn't allocate inet_sock or tcp_sock structures -- just the lightweight unix_sock.
Security model. For filesystem sockets, the security story is simple: on Linux, connecting requires write permission on the socket file (plus search permission on its parent directories). For abstract sockets, there are no permission checks -- any process in the same network namespace can connect. Plan accordingly.
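For filesystem sockets there is a second layer available beyond file permissions: the server can ask the kernel who connected. A sketch using golang.org/x/sys/unix and SO_PEERCRED; the socket path is a placeholder.

```go
package main

import (
	"fmt"
	"net"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const path = "/tmp/cred.sock" // placeholder path
	os.Remove(path)
	ln, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// A stand-in local client; its PID/UID/GID are what the server will see.
	go func() {
		if c, err := net.Dial("unix", path); err == nil {
			defer c.Close()
		}
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	raw, err := conn.(*net.UnixConn).SyscallConn()
	if err != nil {
		panic(err)
	}
	var cred *unix.Ucred
	var credErr error
	raw.Control(func(fd uintptr) {
		// The kernel recorded the peer's credentials at connect() time.
		cred, credErr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
	})
	if credErr != nil {
		panic(credErr)
	}
	fmt.Printf("peer pid=%d uid=%d gid=%d\n", cred.Pid, cred.Uid, cred.Gid)
}
```

This is the same mechanism PostgreSQL's peer authentication relies on.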
Common Questions
Why is PostgreSQL faster over Unix sockets than TCP for local connections?
TCP loopback still runs the full TCP stack on every query: handshake overhead, checksums, Nagle buffering, congestion control, TIME_WAIT on close. For OLTP workloads where queries are small and frequent, TCP overhead per query is a significant fraction of total latency. Unix sockets typically deliver a 30-50% throughput improvement by cutting all of that out.
How does Nginx pass listen sockets to new workers during hot restart?
The master opens listen sockets (socket/bind/listen), then fork()s workers that inherit the fds. During binary upgrade, the old master sends listen fds to the new master via SCM_RIGHTS over a Unix domain socket. The new master fork()s its own workers. The listen socket is never closed, so connections keep flowing seamlessly.
What's the difference between SOCK_DGRAM and SOCK_SEQPACKET for AF_UNIX?
Both preserve message boundaries. The difference is connection semantics. SOCK_DGRAM is connectionless -- each message is independent, and the destination must be specified with sendto(). SOCK_SEQPACKET is connection-oriented -- connect/accept establish a link, messages arrive in order, and the kernel notifies the process when the peer disconnects. Use SOCK_SEQPACKET for control channels that need both framing and connection management.
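Go exposes SOCK_SEQPACKET as the "unixpacket" network type (Linux only). A short sketch showing the framing guarantee -- each Write on the client arrives as exactly one Read on the server; the socket path is a placeholder.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	const path = "/tmp/ctrl.sock" // placeholder path
	os.Remove(path)
	ln, err := net.Listen("unixpacket", path) // SOCK_SEQPACKET: connected + framed
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		c, err := net.Dial("unixpacket", path)
		if err != nil {
			return
		}
		defer c.Close()
		c.Write([]byte("START"))       // one send...
		c.Write([]byte("STOP worker")) // ...and another, never coalesced
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	buf := make([]byte, 64)
	for i := 1; i <= 2; i++ {
		n, err := conn.Read(buf) // each Read returns exactly one message
		if err != nil {
			return
		}
		fmt.Printf("message %d: %q\n", i, buf[:n])
	}
}
```

With "unix" (SOCK_STREAM) instead, the two writes could arrive as a single 16-byte read, and the application would need its own framing.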
Can epoll be used with Unix domain sockets?
Yes. Both AF_UNIX and AF_INET sockets implement the poll file operation that epoll hooks into. Redis, for example, serves both TCP and Unix socket clients in the exact same epoll event loop.
How Technologies Use This
A CI pipeline exposes the Docker daemon on a TCP port for remote builds. A misconfigured network ACL leaves it open to the internet, and an attacker gains full root control of the host within minutes. Alternatively, a developer mounts /var/run/docker.sock into a container for Docker-in-Docker, unknowingly granting every process in that container unrestricted root access to the host.
The core issue is securing a daemon API that grants root-level power. TCP requires network ACLs, TLS certificates, and careful firewall rules, any of which can be misconfigured. A Unix domain socket at /var/run/docker.sock sidesteps the network entirely. Security reduces to filesystem permissions: root ownership with 0660 means only users in the docker group can connect, and there is zero TCP overhead on the hundreds of API calls per minute a busy CI pipeline generates.
Use the Unix socket for all local Docker communication and never expose the TCP listener unless absolutely necessary. When mounting the socket into containers for Docker-in-Docker, understand that any process with write access to that socket effectively has full root control over the host, bypassing all container isolation.
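Talking to the daemon over the socket from code just means routing an ordinary HTTP client through a Unix-socket dialer. A sketch using Go's net/http -- the API version in the URL is illustrative, and the caller needs read/write access to /var/run/docker.sock (root or the docker group).

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// Ignore the host:port from the URL and dial the Unix socket instead.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
	// "localhost" is a dummy host; routing happens entirely in DialContext.
	resp, err := client.Get("http://localhost/v1.41/containers/json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```

This is the same request the curl --unix-socket example in Try It Yourself makes.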
Nginx proxying 50,000 requests per second to a backend on the same machine shows unexpectedly high CPU usage in the kernel. Profiling reveals 10-15 microseconds of overhead per request on TCP checksums, routing lookups, congestion window management, and Nagle buffering, all for data that never leaves the host.
TCP loopback runs the full network stack even when both endpoints share the same kernel. Every round trip pays for protocol processing that protects against network problems that cannot occur locally. The overhead is invisible on a per-request basis but adds up to a major throughput ceiling at scale.
Switching proxy_pass to a Unix domain socket eliminates all of that, delivering 2-3x higher throughput because data simply memcpys between kernel buffers. Beyond raw performance, Nginx uses SCM_RIGHTS fd passing over a Unix socket during binary upgrades: the old master sends listen socket file descriptors to the new master, which forks its own workers. The listen socket never closes, achieving zero dropped connections during deployment.
A co-located application issues 100,000 Redis commands per second over TCP loopback. Each GET takes 0.1ms on the server, but the client sees 0.15ms per command -- a third of the observed latency is kernel overhead from a network stack the data never needs. That 0.05ms per command accumulates to roughly 5 seconds of overhead for every second of wall-clock time.
TCP loopback still runs handshakes, checksums, Nagle buffering, and congestion control on every command, even though both processes share the same kernel. For a protocol where commands and responses are tiny and frequent, the per-command TCP overhead becomes a significant fraction of total latency.
Switching to a Unix domain socket cuts per-command round-trip latency by 30-50% by skipping the entire TCP/IP stack. Redis registers both its TCP listener and Unix socket listener in the same epoll event loop, so clients can connect via whichever path suits them without any change to the server architecture. Use the Unix socket for all co-located clients and reserve TCP for remote access.
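The dual-listener idea is easy to replicate in any server. A sketch in Go serving one handler over both TCP (for remote clients) and a Unix socket (for co-located ones); the port and path are placeholders, and Go's runtime multiplexes both listeners onto the same epoll instance much as Redis does in its event loop.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "pong")
	})

	const sockPath = "/tmp/app.sock" // placeholder path for local clients
	os.Remove(sockPath)
	unixLn, err := net.Listen("unix", sockPath)
	if err != nil {
		panic(err)
	}
	tcpLn, err := net.Listen("tcp", "127.0.0.1:8080") // placeholder port for remote access
	if err != nil {
		panic(err)
	}

	go http.Serve(unixLn, mux) // same handler, Unix socket path
	http.Serve(tcpLn, mux)     // same handler, TCP path
}
```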
Two Go microservices on the same host communicate over TCP loopback. At 20,000 gRPC requests per second, p99 latency is higher than expected and CPU profiling shows significant time in kernel TCP processing: handshakes, checksums, congestion windows, and TIME_WAIT cleanup for connections that never touch a wire.
TCP loopback runs the full protocol stack even for co-located processes. Every connection pays for a three-way handshake, every byte gets checksummed, every close enters TIME_WAIT. These protections guard against network problems that cannot happen when both endpoints share the same kernel, making all of it pure overhead.
Changing net.Dial("tcp", "localhost:8080") to net.Dial("unix", "/tmp/svc.sock") returns the same net.Conn interface, making it a one-line code change. AF_UNIX skips the entire TCP/IP stack, delivering 2-3x throughput improvement, roughly 40% lower p99 latency, and measurably reduced CPU utilization for co-located services.
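For gRPC specifically, grpc-go ships a unix target resolver, so the switch is nearly as small. A sketch assuming google.golang.org/grpc; service registration is elided and the socket path is a placeholder.

```go
package main

import (
	"net"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const path = "/tmp/svc.sock" // placeholder path
	os.Remove(path)

	// Server: bind the gRPC server to a Unix listener instead of a TCP one.
	ln, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	// pb.RegisterYourServiceServer(srv, &impl{}) // hypothetical service registration
	go srv.Serve(ln)
	defer srv.Stop()

	// Client: the built-in "unix" scheme dials the socket path directly.
	conn, err := grpc.Dial("unix://"+path,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close() // generated stubs for your .proto would wrap this conn
}
```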
A Node.js application behind Nginx on the same machine shows elevated kernel CPU on every proxied request. Each request through TCP loopback incurs handshake overhead, checksum computation, and Nagle buffering for data that never touches a wire. At high request rates, this overhead becomes a meaningful fraction of total server cost.
TCP loopback treats every local request as if it were crossing the internet. The checksums protect against bit flips that cannot occur in a memory copy. The Nagle buffering optimizes for network conditions that do not exist. The handshake overhead accumulates with every new connection from the reverse proxy.
Calling net.createServer().listen("/tmp/app.sock") eliminates all of that, reducing per-request kernel overhead by roughly 60% with a one-line change. Node.js also relies on Unix socketpairs internally for the cluster and child_process modules, where the master and workers communicate through connected AF_UNIX sockets for IPC messages, handle distribution, and graceful shutdown coordination.
An OLTP workload issuing thousands of small queries per second over TCP loopback shows higher latency than expected. Each query pays for checksums, congestion control, Nagle delays, and TIME_WAIT accumulation on short-lived connections. The overhead makes zero sense when client and server share the same kernel.
TCP loopback processes every local query through the full network stack. For small, frequent queries where execution time is a fraction of a millisecond, the TCP overhead per round trip becomes a measurable fraction of total latency. Short-lived connections also pile up TIME_WAIT sockets, consuming ephemeral ports unnecessarily.
PostgreSQL defaults to a Unix domain socket at /var/run/postgresql/.s.PGSQL.5432, delivering 30-50% higher throughput by skipping the entire TCP/IP stack. There is also a security benefit: peer authentication over the Unix socket verifies the client's OS username via kernel credentials (SO_PEERCRED) without any password exchange, eliminating an entire class of credential-theft attacks for local applications.
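On the client side the switch is just a connection-string change. A sketch assuming the github.com/lib/pq driver, where a host value beginning with "/" is treated as a socket directory; the database name is a placeholder.

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
	// Dials /var/run/postgresql/.s.PGSQL.5432 over AF_UNIX instead of TCP loopback.
	// With peer auth in pg_hba.conf, no password is exchanged at all.
	db, err := sql.Open("postgres", "host=/var/run/postgresql dbname=app sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
		panic(err)
	}
	fmt.Println(version)
}
```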
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Local IPC socket | /var/run/docker.sock for CLI and runtime | java.net.UnixDomainSocketAddress (built into NIO since Java 16; JNI libraries before that) | net.createServer().listen('/tmp/app.sock') | net.Dial("unix", "/tmp/svc.sock") | CRI socket /var/run/containerd/containerd.sock |
| fd passing | containerd passes shim fds via SCM_RIGHTS | Not directly supported in pure Java | child_process module uses socketpair internally | syscall.Sendmsg with Cmsg for SCM_RIGHTS | Container runtime passes fds to shims |
| Security model | Socket file permissions (0660, docker group) | File permissions on socket path | File permissions on socket path | File permissions on socket path | RBAC + socket mount restrictions in pod spec |
| Performance vs TCP | Avoids TCP overhead for container API calls | 30-50% faster than loopback for JDBC local | ~60% lower kernel overhead per request | 2-3x throughput, ~40% lower p99 | Kubelet to CRI runtime uses Unix socket |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Kernel socket | unix_sock structure (lightweight, no inet_sock/tcp_sock overhead) |
| Address types | Filesystem path, abstract namespace (\0 prefix), unnamed (socketpair) |
| Data path | User buffer --> kernel sk_buff --> user buffer (memcpy, no network stack) |
| Ancillary data | SCM_RIGHTS (fd passing), SCM_CREDENTIALS (PID/UID/GID), SCM_SECURITY |
| Userspace | Same socket API as TCP (connect, send, recv, epoll-compatible) |
Design Rationale: Two processes on the same machine should not pay for checksums against bit flips that cannot happen in a memcpy, congestion control on a path with no network, or routing decisions for data that never leaves the kernel. AF_UNIX strips all of that away. SCM_RIGHTS exists because passing file descriptors between processes enables patterns -- systemd socket activation, Nginx hot restarts -- that are impossible over TCP. Abstract namespace sockets trade filesystem persistence for automatic cleanup: no stale socket files to worry about.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| bind() fails with EADDRINUSE | Stale socket file from previous run | ls -la <socket_path> -- unlink before bind |
| Permission denied on connect() | Socket file permissions too restrictive | stat <socket_path> -- check owner and mode |
| Passed fd is invalid in receiver | Ancillary data buffer too small, kernel silently dropped fd | Check CMSG_SPACE calculation in recvmsg buffer |
| Abstract socket accessible from any container | Abstract sockets have no permission checks | Switch to filesystem socket or use network namespace isolation |
| Socket path truncated silently | Path exceeds 108-byte sun_path limit | Shorten path or use abstract namespace |
| High CPU on TCP loopback between co-located services | Full TCP stack overhead for local traffic | ss -tlnp -- switch to AF_UNIX |
When to Use / Avoid
- Use when two processes on the same host need to communicate with minimal latency and overhead
- Use when passing open file descriptors between processes (SCM_RIGHTS) for zero-downtime restarts
- Use when filesystem permissions provide sufficient access control for the daemon API
- Use when SOCK_SEQPACKET is needed for reliable message-boundary-preserving control channels
- Avoid when processes are on different hosts -- AF_UNIX is strictly local
- Avoid abstract namespace sockets when access control matters -- they have no permission checks; use filesystem-path sockets instead
Try It Yourself
# List all Unix domain sockets
ss -xlnp

# Connect to Docker socket and query API
curl --unix-socket /var/run/docker.sock http://localhost/v1.41/containers/json

# Create a simple Unix socket server with socat
socat UNIX-LISTEN:/tmp/test.sock,fork EXEC:/bin/cat &

# Connect and send data
echo 'hello' | socat - UNIX-CONNECT:/tmp/test.sock

# Check socket file permissions
ls -la /var/run/docker.sock

# Monitor Unix socket traffic
strace -e sendmsg,recvmsg -p $(pidof dockerd) -f 2>&1 | head -20

Debug Checklist
1. ss -xlnp
2. ls -la /var/run/*.sock /tmp/*.sock 2>/dev/null
3. lsof -U
4. strace -e connect,sendmsg,recvmsg -p <PID> 2>&1 | head -20
5. stat /var/run/docker.sock
6. ss -xp state connected
Key Takeaways
- ✓ 2-3x faster than TCP loopback, and the reason is what it skips: no IP routing, no TCP checksums, no segmentation, no congestion control, no TIME_WAIT. Data is just memcpy'd between socket buffers in the kernel.
- ✓ SCM_RIGHTS lets you pass open file descriptors between unrelated processes. The kernel duplicates the fd in the receiver's table -- new number, same underlying struct file. This is how Nginx hands listen sockets to new workers during hot restart without dropping connections.
- ✓ Abstract namespace sockets (sun_path[0] = '\0') live only in memory, not the filesystem. They vanish automatically when the last fd closes -- no stale socket files to clean up. Docker's containerd uses them for shim communication.
- ✓ SOCK_SEQPACKET gives you the best of both worlds: reliable, ordered delivery like a stream, but with message boundaries preserved. Each send() becomes exactly one recv(). Ideal for control protocols where framing matters.
- ✓ Filesystem-path sockets use file permissions for access control -- only processes with write permission to the socket path can connect. Abstract sockets have no permissions: anything in the same network namespace can connect, so plan your authentication accordingly.
Common Pitfalls
- ✗ Mistake: not unlinking the socket file before bind(). Reality: if the file exists from a previous run, bind() fails with EADDRINUSE. Always unlink(path) before bind(), or use abstract sockets that self-clean.
- ✗ Mistake: using long socket paths. Reality: sun_path is limited to 108 bytes including the null terminator. Long paths like /run/containers/storage/overlay-containers/../attach silently truncate.
- ✗ Mistake: undersizing the ancillary data buffer. Reality: if recvmsg()'s msg_control buffer is too small, the control data is truncated (the MSG_CTRUNC flag is set) and the kernel silently discards the descriptors that did not fit. The passed fd never arrives, and the receiver sees no error unless it checks the flag.
- ✗ Mistake: using read()/write() for fd passing. Reality: only sendmsg()/recvmsg() support ancillary data (cmsg). read/write work for data but cannot pass file descriptors or credentials.
Reference
In One Line
AF_UNIX for co-located processes gives 2-3x the throughput of TCP loopback with the same API, and fd passing enables zero-downtime patterns that TCP simply cannot replicate.