Network Namespaces & veth Pairs
Mental Model
Row of aquariums, each self-contained: own water, own fish, own filter. Narrow tubes (veth pairs) connect each one to a central tank (the bridge). Fish swim through the tube to the central tank, then out to the ocean. Each aquarium has its own water chemistry -- IPs, routes, firewall rules. Drain one and the others do not notice. A pump at the exit (MASQUERADE) relabels every fish heading to the ocean so return traffic finds the right tank.
The Problem
Fifty containers on one host, all binding port 80. Without isolation the second call to bind() fails with EADDRINUSE -- 49 out of 50 containers refuse to start. Shared routing table, shared firewall rules, shared conntrack: a bad iptables rule in one service breaks networking for every other. Bridge networking adds 1-2 us per veth hop; at three hops per packet, that is 3-6 us of added latency on p99-sensitive services. Past 1,000 pods per node, ARP flooding from bridge-based networking saturates the forwarding table.
Architecture
Run docker run nginx and something remarkable happens. The container gets its own IP address. It binds port 80 -- even though 20 other containers on the same host are also binding port 80. It has its own routing table, its own firewall rules, its own view of the network.
No magic. No virtualization tricks. Just a Linux kernel feature that's been around since 2.6.24: network namespaces.
Ever wondered how containers actually get their own network? This is the answer. And it's simpler than most people think.
What Actually Happens
A network namespace is a complete, independent copy of the Linux network stack. Each one has its own:
- Network interfaces
- IP addresses
- Routing tables
- ARP tables
- iptables/nftables rules
- conntrack entries
- /proc/net data
- Socket state
Processes in different namespaces can't see each other's network resources. They're completely isolated unless explicitly connected.
The kernel creates one namespace at boot -- the "init" or "default" namespace -- containing all physical interfaces. New namespaces are created via clone(CLONE_NEWNET), unshare(CLONE_NEWNET), or ip netns add. They start empty: just a loopback interface, no routes, no rules.
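You can see the empty starting state with a throwaway namespace (the name demo is arbitrary):

# A fresh namespace holds only a down loopback -- no routes, no rules
ip netns add demo
ip netns exec demo ip link show    # only lo, state DOWN
ip netns exec demo ip route        # prints nothing
ip netns del demo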
veth pairs are virtual Ethernet cables. One end goes in the host namespace, the other goes in the container namespace. Any packet sent on one end instantly appears on the other. Multiple veth endpoints connect to a Linux bridge -- a software L2 switch -- which handles MAC learning and forwarding.
Under the Hood
How Docker wires it all up. Step by step:
1. docker run creates a new network namespace via clone(CLONE_NEWNET).
2. A veth pair is created: vethXXXXXX stays in the host namespace; the other end moves into the container and is renamed eth0.
3. The host end (vethXXXXXX) attaches to the docker0 bridge.
4. The container's eth0 gets an IP from docker0's subnet (typically 172.17.0.0/16).
5. A default route inside the container points to 172.17.0.1 (docker0's IP).
6. An iptables MASQUERADE rule in the host namespace rewrites source addresses for outbound traffic.
Six steps. That's all container networking is.
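You can replay the six steps by hand with ip and iptables. This is a rough sketch, not what Docker literally runs -- the namespace name, bridge name, and 10.10.0.0/24 subnet are illustrative:

# 1. namespace   2-3. veth pair + bridge   4. IP   5. default route   6. MASQUERADE
ip netns add c1
ip link add br-demo type bridge && ip link set br-demo up
ip addr add 10.10.0.1/24 dev br-demo
ip link add veth-host type veth peer name veth-c1
ip link set veth-host master br-demo && ip link set veth-host up
ip link set veth-c1 netns c1
ip netns exec c1 ip link set veth-c1 name eth0
ip netns exec c1 ip addr add 10.10.0.2/24 dev eth0
ip netns exec c1 ip link set lo up && ip netns exec c1 ip link set eth0 up
ip netns exec c1 ip route add default via 10.10.0.1
sysctl -w net.ipv4.ip_forward=1    # host must forward the NATed traffic
iptables -t nat -A POSTROUTING -s 10.10.0.0/24 ! -o br-demo -j MASQUERADE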
Kubernetes takes it further. The Kubernetes networking model requires that every pod has a routable IP -- no NAT between pods. The "pause" container creates the shared network namespace for the pod. Application containers join it via setns(). The CNI plugin (Calico, Flannel, Cilium) handles the veth pair, assigns the pod IP, and programs routes so every pod can reach every other pod directly.
Calico uses L3 routing with BGP instead of bridges, which scales better by avoiding ARP flooding. Cilium uses eBPF programs attached to veth interfaces to bypass both bridging and iptables entirely.
The performance cost. Each veth hop adds ~1-2 microseconds of latency (memcpy between sk_buffs). A container with bridge networking has 3 hops: container eth0 --> veth host --> bridge --> host eth0. Host networking has 0 hops.
For most workloads, this is negligible. For latency-sensitive data planes, Docker's --network=host or Kubernetes' hostNetwork: true bypasses the container network stack entirely.
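For example, with Docker (nginx here is just an illustrative image):

# No netns, no veth, no NAT -- nginx binds the host's port 80 directly
docker run --rm --network=host nginx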
macvlan and ipvlan: the alternatives. macvlan creates virtual interfaces with unique MAC addresses directly on the physical NIC -- each container appears on the host network with its own MAC and IP. No bridge, no NAT. ipvlan shares the parent's MAC but assigns different IPs, useful when the switch limits MAC addresses per port. Both skip the bridge forwarding path and are faster than the default bridge setup.
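A sketch of the macvlan variant with Docker -- the parent interface eth0 and the 192.168.1.0/24 LAN subnet are assumptions about the local network:

# Containers on this network get their own MAC and a LAN-routable IP
docker network create -d macvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=eth0 lan-direct
docker run --rm --network=lan-direct alpine ip addr

One common gotcha: the host itself cannot reach macvlan containers through the parent interface unless you add a macvlan sub-interface on the host as well.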
Common Questions
How does Docker's bridge networking work, step by step?
(1) New netns via clone(CLONE_NEWNET). (2) veth pair created: vethXXXXXX in host, eth0 in container. (3) Host end attached to docker0 bridge. (4) Container eth0 gets IP from 172.17.0.0/16. (5) Default route: via 172.17.0.1. (6) iptables MASQUERADE rule: iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE.
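Each piece is easy to verify on a running host (the container id is a placeholder, and the last command assumes the image ships iproute2):

ip link show master docker0              # veth host ends attached to the bridge
iptables -t nat -S POSTROUTING           # the MASQUERADE rule
docker exec <container_id> ip route      # default via 172.17.0.1 dev eth0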
Why does Kubernetes require routable pod IPs?
It simplifies everything. Pod IPs become stable identifiers. No NAT means no conntrack exhaustion, no port conflicts, no asymmetric routing. CNI plugins implement this via overlay networks (VXLAN), BGP routing (Calico), or eBPF routing (Cilium).
What's the performance difference between bridge and host networking?
Bridge networking adds per-packet overhead: veth memcpy (~1us per hop), bridge MAC lookup, iptables traversal, conntrack. For iperf3 benchmarks, bridge networking is ~10-20% slower than host. For latency-sensitive workloads, the 3-5us additional per-packet overhead impacts p99. Use host networking for critical data-plane services, or Cilium's eBPF datapath to short-circuit the bridge.
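A rough way to measure the gap yourself -- networkstatic/iperf3 is one commonly used image, and absolute numbers vary with kernel version and NIC:

iperf3 -s &                                                         # server in the host namespace
docker run --rm --network=host networkstatic/iperf3 -c 127.0.0.1    # host-network path
docker run --rm networkstatic/iperf3 -c 172.17.0.1                  # bridge path via docker0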
How does nsenter work at the syscall level?
It opens /proc/<pid>/ns/net to get a file descriptor to the target namespace, then calls setns(fd, CLONE_NEWNET) to switch the process's network stack. After setns(), the process uses the target's interfaces, routes, and iptables rules. Each namespace type (PID, mount, network) is independently switchable.
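The same file-descriptor dance is visible from a shell (the container id is a placeholder; the inode number will differ):

pid=$(docker inspect -f '{{.State.Pid}}' <container_id>)
readlink /proc/$pid/ns/net                     # e.g. net:[4026532621] -- the namespace's identity
nsenter --net=/proc/$pid/ns/net -- ip route    # runs ip route after setns() into that namespace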
How Technologies Use This
Deploying 50 containers on the same host where all of them need to bind port 80 creates an immediate problem. Without network namespaces, the second container to call bind() fails with EADDRINUSE and 49 out of 50 containers refuse to start. A latency-sensitive service behind container networking also shows 1-2 microseconds of added delay per packet compared to running on bare metal.
All 50 containers share one kernel, so without isolation they share one network stack, one set of ports, and one routing table. Docker solves this with clone(CLONE_NEWNET), creating a separate network namespace per container. Each gets its own private eth0 interface, its own IP from the 172.17.0.0/16 subnet, its own routing table, and its own iptables rules. A veth pair connects each namespace to the docker0 bridge in the host namespace, and a MASQUERADE rule rewrites source addresses for outbound traffic.
The 1-2 microsecond overhead comes from the veth memcpy and bridge forwarding. For most workloads it is negligible. For latency-sensitive data-plane services, use --network=host to bypass the container namespace entirely, giving the container direct access to the host's network stack at zero overhead.
A sidecar container in a pod needs to communicate with the main application container over localhost, but containers in different pods must remain fully isolated. Without shared network namespaces, co-located containers would need TCP or Unix sockets with explicit configuration, adding latency and complexity for processes that should feel like they run on the same machine.
The pause container creates a network namespace that all application containers in the pod join via the setns() syscall. Containers within a pod share the same loopback, so they communicate over 127.0.0.1 at loopback speed. Pod-to-pod traffic flows through veth pairs with routable IPs assigned by the CNI plugin. Unlike Docker's bridge model, Kubernetes requires no NAT between pods, meaning every pod gets a cluster-routable IP.
Calico achieves this with L3 BGP routing instead of bridges, avoiding ARP flooding that degrades at 1,000+ pods per node. Cilium goes further by replacing iptables entirely with eBPF programs on the veth interfaces, reducing per-packet overhead by up to 40% compared to kube-proxy's iptables mode. Use shared namespaces for co-located communication and CNI routing for pod-to-pod isolation.
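Outside Kubernetes, Docker exposes the same join-an-existing-namespace pattern via --network container:<name>, which makes the pause-container behavior easy to try (image names are illustrative):

docker run -d --name app nginx
docker run --rm --network container:app busybox wget -qO- http://127.0.0.1
# The second container shares app's loopback, so 127.0.0.1:80 reaches nginx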
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Network isolation | clone(CLONE_NEWNET) per container | N/A (OS-level) | N/A (OS-level) | N/A (OS-level) | Pause container creates shared netns per pod |
| Virtual networking | veth + docker0 bridge + MASQUERADE | N/A | N/A | N/A | CNI plugin (Calico/Cilium/Flannel) manages veth + routing |
| Port binding | -p flag with DNAT rules | ServerSocket.bind() | net.createServer().listen() | net.Listen() | Service + NodePort/ClusterIP |
| Host networking | --network=host (bypasses netns) | N/A | N/A | N/A | hostNetwork: true in pod spec |
| Namespace entry | docker exec (uses setns) | N/A | N/A | N/A | kubectl exec (uses CRI -> setns) |
Stack Layer Mapping
| Layer | Network Namespace Mechanism |
|---|---|
| Physical NIC | Lives in init namespace, shared via bridges or macvlan |
| Linux bridge | Software L2 switch connecting veth endpoints (docker0) |
| veth pair | Virtual cable: memcpy between sk_buffs, ~1-2us per hop |
| Network namespace | struct net holds isolated interfaces, routes, iptables, conntrack |
| Container runtime | Creates netns via clone(CLONE_NEWNET), configures veth + IP |
| CNI plugin | Assigns pod IP, programs routes (L3 BGP, VXLAN overlay, or eBPF) |
Design Rationale
Userspace port remapping cannot prevent raw socket access or iptables rule conflicts -- isolation has to happen at the kernel level. veth pairs use memcpy rather than real packet transmission because both ends share the same kernel and no NIC hardware is involved. The bridge model works well up to hundreds of containers per host, but L2 bridging hits ARP flooding and MAC table limits at scale. Calico sidesteps this with L3 BGP routing, and Cilium goes further by replacing both the bridge and iptables with eBPF programs attached directly to veth interfaces.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Container cannot reach external network | Missing default route or MASQUERADE rule | nsenter --net=... -- ip route and iptables -t nat -L POSTROUTING |
| EADDRINUSE on container port bind | Port conflict in shared namespace (not using netns) | Verify container has its own netns: ls -la /proc/<pid>/ns/net |
| Localhost connections fail inside container | Loopback not brought up in namespace | nsenter --net=... -- ip link show lo (should be UP) |
| Stale veth interfaces on host after container exit | Namespace bind-mounted and not cleaned up | ip link show type veth and check for orphaned interfaces |
| High latency on container networking | veth + bridge + iptables overhead (3-6us per packet) | Use --network=host for latency-sensitive services |
| ARP flooding at 1000+ containers | Bridge-based networking saturating forwarding table | Switch to L3 CNI (Calico BGP) or eBPF (Cilium) |
When to Use / Avoid
Use when:
- Running multiple containers that need port isolation on the same host
- Building container networking with per-pod routable IPs (Kubernetes CNI)
- Testing network configurations without affecting the host network stack
- Implementing VPN routing where only VPN traffic goes through a separate interface
- Debugging container networking by entering the namespace with nsenter
Avoid when:
- Latency-sensitive data-plane services need bare-metal performance (use --network=host)
- A single application owns the host and no isolation is needed
- macvlan/ipvlan provide a simpler model (direct NIC attachment without bridges)
Try It Yourself
# Create a network namespace and configure networking
ip netns add ns1
ip link add veth-host type veth peer name veth-ns1
ip link set veth-ns1 netns ns1
ip addr add 10.0.0.1/24 dev veth-host
ip link set veth-host up
ip netns exec ns1 ip addr add 10.0.0.2/24 dev veth-ns1
ip netns exec ns1 ip link set veth-ns1 up
ip netns exec ns1 ip link set lo up

# Test connectivity
ip netns exec ns1 ping -c1 10.0.0.1

# Inspect Docker container networking
docker inspect -f '{{.State.Pid}}' <container_id>
nsenter --net=/proc/<pid>/ns/net -- ip addr

Debug Checklist
1. ip netns list -- list named network namespaces
2. nsenter --net=/proc/<pid>/ns/net -- ip addr -- inspect container interfaces
3. nsenter --net=/proc/<pid>/ns/net -- ip route -- inspect container routing table
4. bridge link show -- show interfaces attached to bridges
5. ip link show master docker0 -- list veth endpoints on docker0 bridge
6. ethtool -S <veth> -- show veth pair packet counters and drops
Key Takeaways
- ✓Every namespace starts empty -- just a loopback interface and nothing else. The init namespace (the one PID 1 runs in) owns the physical NICs. You must explicitly move or create interfaces in new namespaces via 'ip link set dev vethX netns <ns>'.
- ✓A veth pair is a virtual Ethernet cable: packets in one end, out the other. No loss, no reordering, no MTU surprises. Throughput is limited by CPU (it's memcpy between sk_buffs), not by any physical link.
- ✓Docker's bridge mode in five words: veth pair, docker0 bridge, MASQUERADE. Each container gets its own netns, a veth connects it to the bridge, and iptables rewrites source addresses for outbound traffic.
- ✓Kubernetes pods share a network namespace. The pause container creates it; app containers join via setns(). The CNI plugin (Calico, Flannel, Cilium) handles the veth pair and assigns a cluster-routable pod IP.
- ✓setns(fd, CLONE_NEWNET) switches a process into an existing namespace. This is how nsenter and 'docker exec' work. The fd comes from /proc/<pid>/ns/net or a bind-mounted namespace file.
Common Pitfalls
- ✗Mistake: forgetting to bring up loopback in a new namespace. Reality: without 'ip link set lo up', localhost connections fail inside the namespace. Every namespace needs this.
- ✗Mistake: assigning an IP but no route. Reality: even after giving the veth an IP inside the namespace, traffic can't reach external networks without a default route pointing to the bridge's IP.
- ✗Mistake: confusing 'ip netns' with Docker namespaces. Reality: 'ip netns' creates bind mounts in /var/run/netns/. Docker doesn't use named namespaces -- it creates them via clone(CLONE_NEWNET) and identifies them through /proc/<pid>/ns/net.
- ✗Mistake: assuming stale veth pairs are cleaned up. Reality: if the container's netns was bind-mounted and not unmounted, stale interfaces persist even after the container exits. Normal exit (all processes terminated, no bind mount) does clean up automatically.
Reference
In One Line
One network namespace per container, one veth pair per connection to the host -- and nsenter into the namespace is always the first debugging step.