Unix Domain Sockets
Mental Model
Two offices on the same floor with a mail slot cut into the shared wall. Instead of stuffing a letter into an envelope, stamping it, sending it down to the mailroom, having it sorted and routed and delivered back upstairs, one office slides paper straight through the slot. And the slot has a trick: an office can pass its actual key through, and the other side gets a working copy that opens the same door. No carrier involved, no postage required.
The Problem
Nginx proxying 50,000 req/s to a backend on the same machine over TCP loopback wastes 10-15 microseconds of kernel CPU per request -- checksums, routing lookups, congestion control -- for data that never touches a wire. Redis at 100,000 ops/s over loopback adds 0.05ms of pure kernel overhead per command, which accumulates to roughly 5 seconds of overhead for every second of wall-clock time. Docker exposed on a TCP port can be compromised in minutes when a single network ACL is misconfigured. PostgreSQL OLTP workloads show 30-50% lower throughput over TCP loopback versus the local Unix socket, and short-lived connections pile up TIME_WAIT sockets that eat ephemeral ports for no reason.
Architecture
Every Docker command talks through one. Every PostgreSQL query on localhost uses one. Every Nginx zero-downtime restart depends on one.
Unix domain sockets are everywhere in modern infrastructure, yet most developers never interact with them directly. They just see the speed improvement and move on. But understanding what they actually do -- and what they can do that TCP never could -- reveals why they are the backbone of local IPC.
Here's the short version: they're sockets that skip the network.
What Actually Happens
When two processes communicate over AF_UNIX, the data path is remarkably simple: sender's user buffer --> kernel socket buffer --> receiver's user buffer. That's it. No IP routing decisions. No TCP state machine. No checksum computation. No segmentation. No congestion control.
The result? 2-3x higher throughput and 50% lower latency compared to TCP loopback for small messages.
Unix domain sockets come in three flavors:
SOCK_STREAM -- a reliable, ordered byte stream. Works exactly like TCP, but faster.
SOCK_DGRAM -- datagrams with message boundaries. Unlike UDP, the kernel never silently drops them: when the receiver's buffer is full, the sender blocks (or gets EAGAIN) instead of losing data.
SOCK_SEQPACKET -- the best of both worlds. Reliable, ordered, connection-oriented, AND preserves message boundaries. Each send() maps to exactly one recv(). Perfect for control protocols where framing matters.
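To make the stream flavor concrete, here is a minimal sketch in Go (standard library only) of a SOCK_STREAM echo over a Unix socket. The path /tmp/echo.sock is a placeholder, and client and server are collapsed into one process for brevity.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	const path = "/tmp/echo.sock" // placeholder path
	os.Remove(path)               // unlink any stale socket file so bind() does not fail with EADDRINUSE

	// socket(AF_UNIX, SOCK_STREAM) + bind + listen, all behind one call.
	ln, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		defer conn.Close()
		buf := make([]byte, 64)
		n, _ := conn.Read(buf)
		conn.Write(buf[:n]) // echo the bytes straight back
	}()

	// connect() on AF_UNIX completes synchronously -- no three-way handshake.
	c, err := net.Dial("unix", path)
	if err != nil {
		panic(err)
	}
	defer c.Close()
	c.Write([]byte("hello"))
	reply := make([]byte, 64)
	n, _ := c.Read(reply)
	fmt.Println(string(reply[:n])) // "hello"
}
```

The same code with net.Listen("tcp", ...) and net.Dial("tcp", ...) would behave identically at the API level; only the kernel path underneath changes.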
Under the Hood
The killer feature: fd passing. When a file descriptor is sent via SCM_RIGHTS, the kernel calls fget() on the fd to grab the struct file, bumps the reference count, allocates a fresh fd number in the receiver's table via get_unused_fd_flags(), and installs the file reference. The receiver gets a completely independent fd pointing to the same underlying kernel object.
This works for any fd type: regular files, sockets, pipes, eventfds, timerfd, signalfd -- anything.
This is how Nginx does zero-downtime restarts. The old master passes listen socket fds to the new master over a Unix domain socket using SCM_RIGHTS. The new master fork()s its own workers with those fds. At no point is the listen socket closed. Zero dropped connections.
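Here is a sketch of that SCM_RIGHTS mechanism in Go's standard library, with a goroutine standing in for the second process. The socket path and the file being passed (/etc/hostname) are arbitrary placeholders.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"syscall"
)

func main() {
	const sock = "/tmp/fdpass.sock" // placeholder path
	os.Remove(sock)
	ln, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	done := make(chan struct{})
	// "Receiver": accept, read the ancillary data, rebuild a working fd.
	go func() {
		defer close(done)
		conn, err := ln.Accept()
		if err != nil {
			return
		}
		uc := conn.(*net.UnixConn)
		buf := make([]byte, 16)
		oob := make([]byte, syscall.CmsgSpace(4)) // room for exactly one 32-bit fd
		_, oobn, _, _, err := uc.ReadMsgUnix(buf, oob)
		if err != nil {
			panic(err)
		}
		msgs, _ := syscall.ParseSocketControlMessage(oob[:oobn])
		fds, _ := syscall.ParseUnixRights(&msgs[0])
		f := os.NewFile(uintptr(fds[0]), "received") // new fd number, same struct file
		defer f.Close()
		data := make([]byte, 64)
		n, _ := f.Read(data)
		fmt.Printf("receiver read via passed fd: %q\n", data[:n])
	}()

	// "Sender": open a file and push its descriptor through the socket.
	f, err := os.Open("/etc/hostname") // placeholder file to pass
	if err != nil {
		panic(err)
	}
	defer f.Close()
	conn, err := net.Dial("unix", sock)
	if err != nil {
		panic(err)
	}
	uc := conn.(*net.UnixConn)
	rights := syscall.UnixRights(int(f.Fd())) // build the SCM_RIGHTS control message
	uc.WriteMsgUnix([]byte("x"), rights, nil) // at least one data byte must ride along
	<-done
}
```

Nginx does the equivalent in C during binary upgrade; the cmsg plumbing is the same.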
Three ways to address a Unix socket:
Filesystem path (e.g., /var/run/docker.sock) -- creates a visible socket file, access controlled by file permissions.
Abstract namespace (sun_path starts with \0) -- exists only in memory within the network namespace, vanishes automatically when the last fd closes.
Unnamed (socketpair()) -- two connected sockets with no address at all, perfect for parent-child communication.
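The unnamed flavor is the easiest to see in code. A sketch using Go's syscall.Socketpair: two already-connected endpoints, no bind(), no filesystem entry. In real use one end would be handed to a forked child (for example via exec.Cmd.ExtraFiles); here both ends live in one process for illustration.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Two connected SOCK_STREAM sockets with no address at all.
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		panic(err)
	}
	parent := os.NewFile(uintptr(fds[0]), "parent-end")
	child := os.NewFile(uintptr(fds[1]), "child-end")
	defer parent.Close()
	defer child.Close()

	parent.Write([]byte("ping"))
	buf := make([]byte, 16)
	n, _ := child.Read(buf)
	fmt.Println(string(buf[:n])) // "ping"
}
```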
Why it's faster than TCP loopback. TCP on localhost still does the full dance: three-way handshake (kernel processing even if instantaneous), checksums on every packet, Nagle algorithm buffering, congestion window management, and TIME_WAIT on close. AF_UNIX skips all of this. connect() completes synchronously. The kernel doesn't allocate inet_sock or tcp_sock structures -- just the lightweight unix_sock.
Security model. For filesystem sockets, the security story is simple: on Linux, connecting requires write permission on the socket file (plus search permission on its parent directories). For abstract sockets, there are no permission checks -- any process in the same network namespace can connect. Plan accordingly.
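For filesystem sockets there is a second layer available beyond file permissions: the server can ask the kernel who connected. A sketch using golang.org/x/sys/unix and SO_PEERCRED; the socket path is a placeholder.

```go
package main

import (
	"fmt"
	"net"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	const path = "/tmp/cred.sock" // placeholder path
	os.Remove(path)
	ln, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	// A stand-in local client; its PID/UID/GID are what the server will see.
	go func() {
		if c, err := net.Dial("unix", path); err == nil {
			defer c.Close()
		}
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	raw, err := conn.(*net.UnixConn).SyscallConn()
	if err != nil {
		panic(err)
	}
	var cred *unix.Ucred
	var credErr error
	raw.Control(func(fd uintptr) {
		// The kernel recorded the peer's credentials at connect() time.
		cred, credErr = unix.GetsockoptUcred(int(fd), unix.SOL_SOCKET, unix.SO_PEERCRED)
	})
	if credErr != nil {
		panic(credErr)
	}
	fmt.Printf("peer pid=%d uid=%d gid=%d\n", cred.Pid, cred.Uid, cred.Gid)
}
```

This is the same mechanism PostgreSQL's peer authentication relies on.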
Common Questions
Why is PostgreSQL faster over Unix sockets than TCP for local connections?
TCP loopback still runs the full TCP stack on every query: handshake overhead, checksums, Nagle buffering, congestion control, TIME_WAIT on close. For OLTP workloads where queries are small and frequent, TCP overhead per query is a significant fraction of total latency. Unix sockets typically deliver a 30-50% throughput improvement by cutting all of that out.
How does Nginx pass listen sockets to new workers during hot restart?
The master opens listen sockets (socket/bind/listen), then fork()s workers that inherit the fds. During binary upgrade, the old master sends listen fds to the new master via SCM_RIGHTS over a Unix domain socket. The new master fork()s its own workers. The listen socket is never closed, so connections keep flowing seamlessly.
What's the difference between SOCK_DGRAM and SOCK_SEQPACKET for AF_UNIX?
Both preserve message boundaries. The difference is connection semantics. SOCK_DGRAM is connectionless -- each message is independent, and the destination must be specified with sendto(). SOCK_SEQPACKET is connection-oriented -- connect/accept establish a link, messages arrive in order, and the kernel notifies the process when the peer disconnects. Use SOCK_SEQPACKET for control channels that need both framing and connection management.
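Go exposes SOCK_SEQPACKET as the "unixpacket" network type (Linux only). A short sketch showing the framing guarantee -- each Write on the client arrives as exactly one Read on the server; the socket path is a placeholder.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	const path = "/tmp/ctrl.sock" // placeholder path
	os.Remove(path)
	ln, err := net.Listen("unixpacket", path) // SOCK_SEQPACKET: connected + framed
	if err != nil {
		panic(err)
	}
	defer ln.Close()

	go func() {
		c, err := net.Dial("unixpacket", path)
		if err != nil {
			return
		}
		defer c.Close()
		c.Write([]byte("START"))       // one send...
		c.Write([]byte("STOP worker")) // ...and another, never coalesced
	}()

	conn, err := ln.Accept()
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	buf := make([]byte, 64)
	for i := 1; i <= 2; i++ {
		n, err := conn.Read(buf) // each Read returns exactly one message
		if err != nil {
			return
		}
		fmt.Printf("message %d: %q\n", i, buf[:n])
	}
}
```

With "unix" (SOCK_STREAM) instead, the two writes could arrive as a single 16-byte read, and the application would need its own framing.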
Can epoll be used with Unix domain sockets?
Yes. Both AF_UNIX and AF_INET sockets implement the poll file operation that epoll hooks into. Redis, for example, serves both TCP and Unix socket clients in the exact same epoll event loop.
How Technologies Use This
A CI pipeline exposes the Docker daemon on a TCP port for remote builds. A misconfigured network ACL leaves it open to the internet, and an attacker gains full root control of the host within minutes. Alternatively, a developer mounts /var/run/docker.sock into a container for Docker-in-Docker, unknowingly granting every process in that container unrestricted root access to the host.
The core issue is securing a daemon API that grants root-level power. TCP requires network ACLs, TLS certificates, and careful firewall rules, any of which can be misconfigured. A Unix domain socket at /var/run/docker.sock sidesteps the network entirely. Security reduces to filesystem permissions: root ownership with 0660 means only users in the docker group can connect, and there is zero TCP overhead on the hundreds of API calls per minute a busy CI pipeline generates.
Use the Unix socket for all local Docker communication and never expose the TCP listener unless absolutely necessary. When mounting the socket into containers for Docker-in-Docker, understand that any process with write access to that socket effectively has full root control over the host, bypassing all container isolation.
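Talking to the daemon over the socket from code just means routing an ordinary HTTP client through a Unix-socket dialer. A sketch using Go's net/http -- the API version in the URL is illustrative, and the caller needs read/write access to /var/run/docker.sock (root or the docker group).

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

func main() {
	client := &http.Client{
		Transport: &http.Transport{
			// Ignore the host:port from the URL and dial the Unix socket instead.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
			},
		},
	}
	// "localhost" is a dummy host; routing happens entirely in DialContext.
	resp, err := client.Get("http://localhost/v1.41/containers/json")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```

This is the same request the curl --unix-socket example in Try It Yourself makes.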
Nginx proxying 50,000 requests per second to a backend on the same machine shows unexpectedly high CPU usage in the kernel. Profiling reveals 10-15 microseconds of overhead per request on TCP checksums, routing lookups, congestion window management, and Nagle buffering, all for data that never leaves the host.
TCP loopback runs the full network stack even when both endpoints share the same kernel. Every round trip pays for protocol processing that protects against network problems that cannot occur locally. The overhead is invisible on a per-request basis but adds up to a major throughput ceiling at scale.
Switching proxy_pass to a Unix domain socket eliminates all of that, delivering 2-3x higher throughput because data simply memcpys between kernel buffers. Beyond raw performance, Nginx uses SCM_RIGHTS fd passing over a Unix socket during binary upgrades: the old master sends listen socket file descriptors to the new master, which forks its own workers. The listen socket never closes, achieving zero dropped connections during deployment.
A co-located application issues 100,000 Redis commands per second over TCP loopback. Each GET takes 0.1ms on the server, but the client sees 0.15ms per command -- a third of the observed latency is kernel overhead from a network stack the data never needs. That 0.05ms per command accumulates to roughly 5 seconds of overhead for every second of wall-clock time.
TCP loopback still runs handshakes, checksums, Nagle buffering, and congestion control on every command, even though both processes share the same kernel. For a protocol where commands and responses are tiny and frequent, the per-command TCP overhead becomes a significant fraction of total latency.
Switching to a Unix domain socket cuts per-command round-trip latency by 30-50% by skipping the entire TCP/IP stack. Redis registers both its TCP listener and Unix socket listener in the same epoll event loop, so clients can connect via whichever path suits them without any change to the server architecture. Use the Unix socket for all co-located clients and reserve TCP for remote access.
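The dual-listener idea is easy to replicate in any server. A sketch in Go serving one handler over both TCP (for remote clients) and a Unix socket (for co-located ones); the port and path are placeholders, and Go's runtime multiplexes both listeners onto the same epoll instance much as Redis does in its event loop.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"os"
)

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "pong")
	})

	const sockPath = "/tmp/app.sock" // placeholder path for local clients
	os.Remove(sockPath)
	unixLn, err := net.Listen("unix", sockPath)
	if err != nil {
		panic(err)
	}
	tcpLn, err := net.Listen("tcp", "127.0.0.1:8080") // placeholder port for remote access
	if err != nil {
		panic(err)
	}

	go http.Serve(unixLn, mux) // same handler, Unix socket path
	http.Serve(tcpLn, mux)     // same handler, TCP path
}
```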
Two Go microservices on the same host communicate over TCP loopback. At 20,000 gRPC requests per second, p99 latency is higher than expected and CPU profiling shows significant time in kernel TCP processing: handshakes, checksums, congestion windows, and TIME_WAIT cleanup for connections that never touch a wire.
TCP loopback runs the full protocol stack even for co-located processes. Every connection pays for a three-way handshake, every byte gets checksummed, every close enters TIME_WAIT. These protections guard against network problems that cannot happen when both endpoints share the same kernel, making all of it pure overhead.
Changing net.Dial("tcp", "localhost:8080") to net.Dial("unix", "/tmp/svc.sock") returns the same net.Conn interface, making it a one-line code change. AF_UNIX skips the entire TCP/IP stack, delivering 2-3x throughput improvement, roughly 40% lower p99 latency, and measurably reduced CPU utilization for co-located services.
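For gRPC specifically, grpc-go ships a unix target resolver, so the switch is nearly as small. A sketch assuming google.golang.org/grpc; service registration is elided and the socket path is a placeholder.

```go
package main

import (
	"net"
	"os"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	const path = "/tmp/svc.sock" // placeholder path
	os.Remove(path)

	// Server: bind the gRPC server to a Unix listener instead of a TCP one.
	ln, err := net.Listen("unix", path)
	if err != nil {
		panic(err)
	}
	srv := grpc.NewServer()
	// pb.RegisterYourServiceServer(srv, &impl{}) // hypothetical service registration
	go srv.Serve(ln)
	defer srv.Stop()

	// Client: the built-in "unix" scheme dials the socket path directly.
	conn, err := grpc.Dial("unix://"+path,
		grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		panic(err)
	}
	defer conn.Close() // generated stubs for your .proto would wrap this conn
}
```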
A Node.js application behind Nginx on the same machine shows elevated kernel CPU on every proxied request. Each request through TCP loopback incurs handshake overhead, checksum computation, and Nagle buffering for data that never touches a wire. At high request rates, this overhead becomes a meaningful fraction of total server cost.
TCP loopback treats every local request as if it were crossing the internet. The checksums protect against bit flips that cannot occur in a memory copy. The Nagle buffering optimizes for network conditions that do not exist. The handshake overhead accumulates with every new connection from the reverse proxy.
Calling net.createServer().listen("/tmp/app.sock") eliminates all of that, reducing per-request kernel overhead by roughly 60% with a one-line change. Node.js also relies on Unix socketpairs internally for the cluster and child_process modules, where the master and workers communicate through connected AF_UNIX sockets for IPC messages, handle distribution, and graceful shutdown coordination.
An OLTP workload issuing thousands of small queries per second over TCP loopback shows higher latency than expected. Each query pays for checksums, congestion control, Nagle delays, and TIME_WAIT accumulation on short-lived connections. The overhead makes zero sense when client and server share the same kernel.
TCP loopback processes every local query through the full network stack. For small, frequent queries where execution time is a fraction of a millisecond, the TCP overhead per round trip becomes a measurable fraction of total latency. Short-lived connections also pile up TIME_WAIT sockets, consuming ephemeral ports unnecessarily.
PostgreSQL defaults to a Unix domain socket at /var/run/postgresql/.s.PGSQL.5432, delivering 30-50% higher throughput by skipping the entire TCP/IP stack. There is also a security benefit: peer authentication over the Unix socket verifies the client's OS username via kernel credentials (SO_PEERCRED) without any password exchange, eliminating an entire class of credential-theft attacks for local applications.
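On the client side the switch is just a connection-string change. A sketch assuming the github.com/lib/pq driver, where a host value beginning with "/" is treated as a socket directory; the database name is a placeholder.

```go
package main

import (
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // registers the "postgres" driver
)

func main() {
	// Dials /var/run/postgresql/.s.PGSQL.5432 over AF_UNIX instead of TCP loopback.
	// With peer auth in pg_hba.conf, no password is exchanged at all.
	db, err := sql.Open("postgres", "host=/var/run/postgresql dbname=app sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	var version string
	if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
		panic(err)
	}
	fmt.Println(version)
}
```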
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Local IPC socket | /var/run/docker.sock for CLI and runtime | java.net.UnixDomainSocketAddress (built into NIO since Java 16; JNI libraries before that) | net.createServer().listen('/tmp/app.sock') | net.Dial("unix", "/tmp/svc.sock") | CRI socket /var/run/containerd/containerd.sock |
| fd passing | containerd passes shim fds via SCM_RIGHTS | Not directly supported in pure Java | child_process module uses socketpair internally | syscall.Sendmsg with Cmsg for SCM_RIGHTS | Container runtime passes fds to shims |
| Security model | Socket file permissions (0660, docker group) | File permissions on socket path | File permissions on socket path | File permissions on socket path | RBAC + socket mount restrictions in pod spec |
| Performance vs TCP | Avoids TCP overhead for container API calls | 30-50% faster than loopback for JDBC local | ~60% lower kernel overhead per request | 2-3x throughput, ~40% lower p99 | Kubelet to CRI runtime uses Unix socket |
Stack Layer Mapping
| Layer | Component |
|---|---|
| Kernel socket | unix_sock structure (lightweight, no inet_sock/tcp_sock overhead) |
| Address types | Filesystem path, abstract namespace (\0 prefix), unnamed (socketpair) |
| Data path | User buffer --> kernel sk_buff --> user buffer (memcpy, no network stack) |
| Ancillary data | SCM_RIGHTS (fd passing), SCM_CREDENTIALS (PID/UID/GID), SCM_SECURITY |
| Userspace | Same socket API as TCP (connect, send, recv, epoll-compatible) |
Design Rationale: Two processes on the same machine should not pay for checksums against bit flips that cannot happen in a memcpy, congestion control on a path with no network, or routing decisions for data that never leaves the kernel. AF_UNIX strips all of that away. SCM_RIGHTS exists because passing file descriptors between processes enables patterns -- systemd socket activation, Nginx hot restarts -- that are impossible over TCP. Abstract namespace sockets trade filesystem persistence for automatic cleanup: no stale socket files to worry about.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| bind() fails with EADDRINUSE | Stale socket file from previous run | ls -la <socket_path> -- unlink before bind |
| Permission denied on connect() | Socket file permissions too restrictive | stat <socket_path> -- check owner and mode |
| Passed fd is invalid in receiver | Ancillary data buffer too small, kernel silently dropped fd | Check CMSG_SPACE calculation in recvmsg buffer |
| Abstract socket accessible from any container | Abstract sockets have no permission checks | Switch to filesystem socket or use network namespace isolation |
| Socket path truncated silently | Path exceeds 108-byte sun_path limit | Shorten path or use abstract namespace |
| High CPU on TCP loopback between co-located services | Full TCP stack overhead for local traffic | ss -tlnp -- switch to AF_UNIX |
When to Use / Avoid
- Use when two processes on the same host need to communicate with minimal latency and overhead
- Use when passing open file descriptors between processes (SCM_RIGHTS) for zero-downtime restarts
- Use when filesystem permissions provide sufficient access control for the daemon API
- Use when SOCK_SEQPACKET is needed for reliable message-boundary-preserving control channels
- Avoid when processes are on different hosts -- AF_UNIX is strictly local
- Avoid abstract namespace sockets when access control matters -- they have no permission checks; use filesystem-path sockets instead
Try It Yourself
# List all Unix domain sockets
ss -xlnp

# Connect to Docker socket and query API
curl --unix-socket /var/run/docker.sock http://localhost/v1.41/containers/json

# Create a simple Unix socket server with socat
socat UNIX-LISTEN:/tmp/test.sock,fork EXEC:/bin/cat &

# Connect and send data
echo 'hello' | socat - UNIX-CONNECT:/tmp/test.sock

# Check socket file permissions
ls -la /var/run/docker.sock

# Monitor Unix socket traffic
strace -e sendmsg,recvmsg -p $(pidof dockerd) -f 2>&1 | head -20

Debug Checklist
1. ss -xlnp
2. ls -la /var/run/*.sock /tmp/*.sock 2>/dev/null
3. lsof -U
4. strace -e connect,sendmsg,recvmsg -p <PID> 2>&1 | head -20
5. stat /var/run/docker.sock
6. ss -xp state connected
Key Takeaways
- ✓ 2-3x faster than TCP loopback, and the reason is what it skips: no IP routing, no TCP checksums, no segmentation, no congestion control, no TIME_WAIT. Data is just memcpy'd between socket buffers in the kernel.
- ✓ SCM_RIGHTS lets you pass open file descriptors between unrelated processes. The kernel duplicates the fd in the receiver's table -- new number, same underlying struct file. This is how Nginx hands listen sockets to new workers during hot restart without dropping connections.
- ✓ Abstract namespace sockets (sun_path[0] = '\0') live only in memory, not the filesystem. They vanish automatically when the last fd closes -- no stale socket files to clean up. Docker's containerd uses them for shim communication.
- ✓ SOCK_SEQPACKET gives you the best of both worlds: reliable, ordered delivery like a stream, but with message boundaries preserved. Each send() becomes exactly one recv(). Ideal for control protocols where framing matters.
- ✓ Filesystem-path sockets use file permissions for access control -- only processes with write permission to the socket path can connect. Abstract sockets have no permissions: anything in the same network namespace can connect, so plan your authentication accordingly.
Common Pitfalls
- ✗ Mistake: not unlinking the socket file before bind(). Reality: if the file exists from a previous run, bind() fails with EADDRINUSE. Always unlink(path) before bind(), or use abstract sockets that self-clean.
- ✗ Mistake: using long socket paths. Reality: sun_path is limited to 108 bytes including the null terminator. Long paths like /run/containers/storage/overlay-containers/../attach silently truncate.
- ✗ Mistake: undersizing the ancillary data buffer. Reality: if recvmsg()'s msg_control buffer is too small, the control data is truncated (the MSG_CTRUNC flag is set) and the kernel silently discards the descriptors that did not fit. The passed fd never arrives, and the receiver sees no error unless it checks the flag.
- ✗ Mistake: using read()/write() for fd passing. Reality: only sendmsg()/recvmsg() support ancillary data (cmsg). read/write work for data but cannot pass file descriptors or credentials.
Reference
In One Line
AF_UNIX for co-located processes gives 2-3x the throughput of TCP loopback with the same API, and fd passing enables zero-downtime patterns that TCP simply cannot replicate.