Network Namespaces & veth Pairs
Mental Model
Row of aquariums, each self-contained: own water, own fish, own filter. Narrow tubes (veth pairs) connect each one to a central tank (the bridge). Fish swim through the tube to the central tank, then out to the ocean. Each aquarium has its own water chemistry -- IPs, routes, firewall rules. Drain one and the others do not notice. A pump at the exit (MASQUERADE) relabels every fish heading to the ocean so return traffic finds the right tank.
The Problem
Fifty containers on one host, all binding port 80. Without isolation the second call to bind() fails with EADDRINUSE -- 49 out of 50 containers refuse to start. Shared routing table, shared firewall rules, shared conntrack: a bad iptables rule in one service breaks networking for every other. Bridge networking adds 1-2 us per veth hop; at three hops per packet, that is 3-6 us of added latency on p99-sensitive services. Past 1,000 pods per node, ARP flooding from bridge-based networking saturates the forwarding table.
Architecture
Run docker run nginx and something remarkable happens. The container gets its own IP address. It binds port 80 -- even though 20 other containers on the same host are also binding port 80. It has its own routing table, its own firewall rules, its own view of the network.
No magic. No virtualization tricks. Just a Linux kernel feature that's been around since 2.6.24: network namespaces.
Ever wondered how containers actually get their own network? This is the answer. And it's simpler than most people think.
What Actually Happens
A network namespace is a complete, independent copy of the Linux network stack. Each one has its own:
- Network interfaces
- IP addresses
- Routing tables
- ARP tables
- iptables/nftables rules
- conntrack entries
- /proc/net data
- Socket state
Processes in different namespaces can't see each other's network resources. They're completely isolated unless explicitly connected.
The kernel creates one namespace at boot -- the "init" or "default" namespace -- containing all physical interfaces. New namespaces are created via clone(CLONE_NEWNET), unshare(CLONE_NEWNET), or ip netns add. They start empty: just a loopback interface, no routes, no rules.
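You can see the empty starting state with a throwaway namespace (the name demo is arbitrary):

# A fresh namespace holds only a down loopback -- no routes, no rules
ip netns add demo
ip netns exec demo ip link show    # only lo, state DOWN
ip netns exec demo ip route        # prints nothing
ip netns del demo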
veth pairs are virtual Ethernet cables. One end goes in the host namespace, the other goes in the container namespace. Any packet sent on one end instantly appears on the other. Multiple veth endpoints connect to a Linux bridge -- a software L2 switch -- which handles MAC learning and forwarding.
Under the Hood
How Docker wires it all up. Step by step:
1. docker run creates a new network namespace via clone(CLONE_NEWNET).
2. A veth pair is created: vethXXXXXX stays in the host namespace; the other end moves into the container and is renamed eth0.
3. The host end (vethXXXXXX) attaches to the docker0 bridge.
4. The container's eth0 gets an IP from docker0's subnet (typically 172.17.0.0/16).
5. A default route inside the container points to 172.17.0.1 (docker0's IP).
6. An iptables MASQUERADE rule in the host namespace rewrites source addresses for outbound traffic.
Six steps. That's all container networking is.
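You can replay the six steps by hand with ip and iptables. This is a rough sketch, not what Docker literally runs -- the namespace name, bridge name, and 10.10.0.0/24 subnet are illustrative:

# 1. namespace   2-3. veth pair + bridge   4. IP   5. default route   6. MASQUERADE
ip netns add c1
ip link add br-demo type bridge && ip link set br-demo up
ip addr add 10.10.0.1/24 dev br-demo
ip link add veth-host type veth peer name veth-c1
ip link set veth-host master br-demo && ip link set veth-host up
ip link set veth-c1 netns c1
ip netns exec c1 ip link set veth-c1 name eth0
ip netns exec c1 ip addr add 10.10.0.2/24 dev eth0
ip netns exec c1 ip link set lo up && ip netns exec c1 ip link set eth0 up
ip netns exec c1 ip route add default via 10.10.0.1
sysctl -w net.ipv4.ip_forward=1    # host must forward the NATed traffic
iptables -t nat -A POSTROUTING -s 10.10.0.0/24 ! -o br-demo -j MASQUERADE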
Kubernetes takes it further. The Kubernetes networking model requires that every pod has a routable IP -- no NAT between pods. The "pause" container creates the shared network namespace for the pod. Application containers join it via setns(). The CNI plugin (Calico, Flannel, Cilium) handles the veth pair, assigns the pod IP, and programs routes so every pod can reach every other pod directly.
Calico uses L3 routing with BGP instead of bridges, which scales better by avoiding ARP flooding. Cilium uses eBPF programs attached to veth interfaces to bypass both bridging and iptables entirely.
The performance cost. Each veth hop adds ~1-2 microseconds of latency (memcpy between sk_buffs). A container with bridge networking has 3 hops: container eth0 --> veth host --> bridge --> host eth0. Host networking has 0 hops.
For most workloads, this is negligible. For latency-sensitive data planes, Docker's --network=host or Kubernetes' hostNetwork: true bypasses the container network stack entirely.
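For example, with Docker (nginx here is just an illustrative image):

# No netns, no veth, no NAT -- nginx binds the host's port 80 directly
docker run --rm --network=host nginx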
macvlan and ipvlan: the alternatives. macvlan creates virtual interfaces with unique MAC addresses directly on the physical NIC -- each container appears on the host network with its own MAC and IP. No bridge, no NAT. ipvlan shares the parent's MAC but assigns different IPs, useful when the switch limits MAC addresses per port. Both skip the bridge forwarding path and are faster than the default bridge setup.
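A sketch of the macvlan variant with Docker -- the parent interface eth0 and the 192.168.1.0/24 LAN subnet are assumptions about the local network:

# Containers on this network get their own MAC and a LAN-routable IP
docker network create -d macvlan \
  --subnet=192.168.1.0/24 --gateway=192.168.1.1 \
  -o parent=eth0 lan-direct
docker run --rm --network=lan-direct alpine ip addr

One common gotcha: the host itself cannot reach macvlan containers through the parent interface unless you add a macvlan sub-interface on the host as well.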
Common Questions
How does Docker's bridge networking work, step by step?
(1) New netns via clone(CLONE_NEWNET). (2) veth pair created: vethXXXXXX in host, eth0 in container. (3) Host end attached to docker0 bridge. (4) Container eth0 gets IP from 172.17.0.0/16. (5) Default route: via 172.17.0.1. (6) iptables MASQUERADE rule: iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE.
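Each piece is easy to verify on a running host (the container id is a placeholder, and the last command assumes the image ships iproute2):

ip link show master docker0              # veth host ends attached to the bridge
iptables -t nat -S POSTROUTING           # the MASQUERADE rule
docker exec <container_id> ip route      # default via 172.17.0.1 dev eth0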
Why does Kubernetes require routable pod IPs?
It simplifies everything. Pod IPs become stable identifiers. No NAT means no conntrack exhaustion, no port conflicts, no asymmetric routing. CNI plugins implement this via overlay networks (VXLAN), BGP routing (Calico), or eBPF routing (Cilium).
What's the performance difference between bridge and host networking?
Bridge networking adds per-packet overhead: veth memcpy (~1us per hop), bridge MAC lookup, iptables traversal, conntrack. For iperf3 benchmarks, bridge networking is ~10-20% slower than host. For latency-sensitive workloads, the 3-5us additional per-packet overhead impacts p99. Use host networking for critical data-plane services, or Cilium's eBPF datapath to short-circuit the bridge.
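A rough way to measure the gap yourself -- networkstatic/iperf3 is one commonly used image, and absolute numbers vary with kernel version and NIC:

iperf3 -s &                                                         # server in the host namespace
docker run --rm --network=host networkstatic/iperf3 -c 127.0.0.1    # host-network path
docker run --rm networkstatic/iperf3 -c 172.17.0.1                  # bridge path via docker0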
How does nsenter work at the syscall level?
It opens /proc/<pid>/ns/net to get a file descriptor to the target namespace, then calls setns(fd, CLONE_NEWNET) to switch the process's network stack. After setns(), the process uses the target's interfaces, routes, and iptables rules. Each namespace type (PID, mount, network) is independently switchable.
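The same file-descriptor dance is visible from a shell (the container id is a placeholder; the inode number will differ):

pid=$(docker inspect -f '{{.State.Pid}}' <container_id>)
readlink /proc/$pid/ns/net                     # e.g. net:[4026532621] -- the namespace's identity
nsenter --net=/proc/$pid/ns/net -- ip route    # runs ip route after setns() into that namespace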
How Technologies Use This
Deploying 50 containers on the same host where all of them need to bind port 80 creates an immediate problem. Without network namespaces, the second container to call bind() fails with EADDRINUSE and 49 out of 50 containers refuse to start. A latency-sensitive service behind container networking also shows 1-2 microseconds of added delay per packet compared to running on bare metal.
All 50 containers share one kernel, so without isolation they share one network stack, one set of ports, and one routing table. Docker solves this with clone(CLONE_NEWNET), creating a separate network namespace per container. Each gets its own private eth0 interface, its own IP from the 172.17.0.0/16 subnet, its own routing table, and its own iptables rules. A veth pair connects each namespace to the docker0 bridge in the host namespace, and a MASQUERADE rule rewrites source addresses for outbound traffic.
The 1-2 microsecond overhead comes from the veth memcpy and bridge forwarding. For most workloads it is negligible. For latency-sensitive data-plane services, use --network=host to bypass the container namespace entirely, giving the container direct access to the host's network stack at zero overhead.
A sidecar container in a pod needs to communicate with the main application container over localhost, but containers in different pods must remain fully isolated. Without shared network namespaces, co-located containers would need TCP or Unix sockets with explicit configuration, adding latency and complexity for processes that should feel like they run on the same machine.
The pause container creates a network namespace that all application containers in the pod join via the setns() syscall. Containers within a pod share the same loopback, so they communicate over 127.0.0.1 at loopback speed. Pod-to-pod traffic flows through veth pairs with routable IPs assigned by the CNI plugin. Unlike Docker's bridge model, Kubernetes requires no NAT between pods, meaning every pod gets a cluster-routable IP.
Calico achieves this with L3 BGP routing instead of bridges, avoiding ARP flooding that degrades at 1,000+ pods per node. Cilium goes further by replacing iptables entirely with eBPF programs on the veth interfaces, reducing per-packet overhead by up to 40% compared to kube-proxy's iptables mode. Use shared namespaces for co-located communication and CNI routing for pod-to-pod isolation.
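Outside Kubernetes, Docker exposes the same join-an-existing-namespace pattern via --network container:<name>, which makes the pause-container behavior easy to try (image names are illustrative):

docker run -d --name app nginx
docker run --rm --network container:app busybox wget -qO- http://127.0.0.1
# The second container shares app's loopback, so 127.0.0.1:80 reaches nginx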
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Network isolation | clone(CLONE_NEWNET) per container | N/A (OS-level) | N/A (OS-level) | N/A (OS-level) | Pause container creates shared netns per pod |
| Virtual networking | veth + docker0 bridge + MASQUERADE | N/A | N/A | N/A | CNI plugin (Calico/Cilium/Flannel) manages veth + routing |
| Port binding | -p flag with DNAT rules | ServerSocket.bind() | net.createServer().listen() | net.Listen() | Service + NodePort/ClusterIP |
| Host networking | --network=host (bypasses netns) | N/A | N/A | N/A | hostNetwork: true in pod spec |
| Namespace entry | docker exec (uses setns) | N/A | N/A | N/A | kubectl exec (uses CRI -> setns) |
Stack Layer Mapping
| Layer | Network Namespace Mechanism |
|---|---|
| Physical NIC | Lives in init namespace, shared via bridges or macvlan |
| Linux bridge | Software L2 switch connecting veth endpoints (docker0) |
| veth pair | Virtual cable: memcpy between sk_buffs, ~1-2us per hop |
| Network namespace | struct net holds isolated interfaces, routes, iptables, conntrack |
| Container runtime | Creates netns via clone(CLONE_NEWNET), configures veth + IP |
| CNI plugin | Assigns pod IP, programs routes (L3 BGP, VXLAN overlay, or eBPF) |
Design Rationale
Userspace port remapping cannot prevent raw socket access or iptables rule conflicts -- isolation has to happen at the kernel level. veth pairs use memcpy rather than real packet transmission because both ends share the same kernel and no NIC hardware is involved. The bridge model works well up to hundreds of containers per host, but L2 bridging hits ARP flooding and MAC table limits at scale. Calico sidesteps this with L3 BGP routing, and Cilium goes further by replacing both the bridge and iptables with eBPF programs attached directly to veth interfaces.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Container cannot reach external network | Missing default route or MASQUERADE rule | nsenter --net=... -- ip route and iptables -t nat -L POSTROUTING |
| EADDRINUSE on container port bind | Port conflict in shared namespace (not using netns) | Verify container has its own netns: ls -la /proc/<pid>/ns/net |
| Localhost connections fail inside container | Loopback not brought up in namespace | nsenter --net=... -- ip link show lo (should be UP) |
| Stale veth interfaces on host after container exit | Namespace bind-mounted and not cleaned up | ip link show type veth and check for orphaned interfaces |
| High latency on container networking | veth + bridge + iptables overhead (3-6us per packet) | Use --network=host for latency-sensitive services |
| ARP flooding at 1000+ containers | Bridge-based networking saturating forwarding table | Switch to L3 CNI (Calico BGP) or eBPF (Cilium) |
When to Use / Avoid
Use when:
- Running multiple containers that need port isolation on the same host
- Building container networking with per-pod routable IPs (Kubernetes CNI)
- Testing network configurations without affecting the host network stack
- Implementing VPN routing where only VPN traffic goes through a separate interface
- Debugging container networking by entering the namespace with nsenter
Avoid when:
- Latency-sensitive data-plane services need bare-metal performance (use --network=host)
- A single application owns the host and no isolation is needed
- macvlan/ipvlan provide a simpler model (direct NIC attachment without bridges)
Try It Yourself
# Create a network namespace and configure networking
ip netns add ns1
ip link add veth-host type veth peer name veth-ns1
ip link set veth-ns1 netns ns1
ip addr add 10.0.0.1/24 dev veth-host
ip link set veth-host up
ip netns exec ns1 ip addr add 10.0.0.2/24 dev veth-ns1
ip netns exec ns1 ip link set veth-ns1 up
ip netns exec ns1 ip link set lo up

# Test connectivity
ip netns exec ns1 ping -c1 10.0.0.1

# Inspect Docker container networking
docker inspect -f '{{.State.Pid}}' <container_id>
nsenter --net=/proc/<pid>/ns/net -- ip addr

Debug Checklist
1. ip netns list -- list named network namespaces
2. nsenter --net=/proc/<pid>/ns/net -- ip addr -- inspect container interfaces
3. nsenter --net=/proc/<pid>/ns/net -- ip route -- inspect container routing table
4. bridge link show -- show interfaces attached to bridges
5. ip link show master docker0 -- list veth endpoints on docker0 bridge
6. ethtool -S <veth> -- show veth pair packet counters and drops
Key Takeaways
- ✓Every namespace starts empty -- just a loopback interface and nothing else. The init namespace (the one PID 1 runs in) owns the physical NICs. You must explicitly move or create interfaces in new namespaces via 'ip link set dev vethX netns <ns>'.
- ✓A veth pair is a virtual Ethernet cable: packets in one end, out the other. No loss, no reordering, no MTU surprises. Throughput is limited by CPU (it's memcpy between sk_buffs), not by any physical link.
- ✓Docker's bridge mode in five words: veth pair, docker0 bridge, MASQUERADE. Each container gets its own netns, a veth connects it to the bridge, and iptables rewrites source addresses for outbound traffic.
- ✓Kubernetes pods share a network namespace. The pause container creates it; app containers join via setns(). The CNI plugin (Calico, Flannel, Cilium) handles the veth pair and assigns a cluster-routable pod IP.
- ✓setns(fd, CLONE_NEWNET) switches a process into an existing namespace. This is how nsenter and 'docker exec' work. The fd comes from /proc/<pid>/ns/net or a bind-mounted namespace file.
Common Pitfalls
- ✗Mistake: forgetting to bring up loopback in a new namespace. Reality: without 'ip link set lo up', localhost connections fail inside the namespace. Every namespace needs this.
- ✗Mistake: assigning an IP but no route. Reality: even after giving the veth an IP inside the namespace, traffic can't reach external networks without a default route pointing to the bridge's IP.
- ✗Mistake: confusing 'ip netns' with Docker namespaces. Reality: 'ip netns' creates bind mounts in /var/run/netns/. Docker doesn't use named namespaces -- it creates them via clone(CLONE_NEWNET) and identifies them through /proc/<pid>/ns/net.
- ✗Mistake: assuming stale veth pairs are cleaned up. Reality: if the container's netns was bind-mounted and not unmounted, stale interfaces persist even after the container exits. Normal exit (all processes terminated, no bind mount) does clean up automatically.
Reference
In One Line
One network namespace per container, one veth pair per connection to the host -- and nsenter into the namespace is always the first debugging step.