Container Networking & Namespaces
Container networking layers namespaces, veth pairs, bridges, and overlays to give every container an isolated network stack while maintaining seamless cross-host communication.
The Problem
Containers need isolated network stacks but must communicate across hosts seamlessly. A single Linux host can run hundreds of containers, each requiring its own IP, routing, and firewall rules — yet cross-host communication must appear as if all containers share one flat network. Solving this without per-container hardware NICs requires virtual networking primitives layered on top of the kernel.
Mental Model
Like apartments — each has its own address (namespace), connected by hallways (veth) to the lobby (bridge), with postal service (overlay) connecting buildings (nodes).
How It Works
Container networking solves a deceptively hard problem: give every container an isolated network stack while making cross-host communication transparent. The solution layers several Linux kernel primitives — network namespaces, virtual ethernet pairs, bridges, and overlay protocols — into a coherent system that the container runtime manages automatically.
Understanding this stack from the bottom up is essential for diagnosing pod network failures, choosing the right CNI plugin, and reasoning about performance at scale.
Network Namespaces: Isolation at the Kernel Level
A network namespace is a kernel construct that creates an independent copy of the network stack. Each namespace has its own:
- Network interfaces (lo, eth0)
- IP addresses and routing tables
- iptables/nftables rules
- Socket tables and connection tracking
When Docker or Kubernetes creates a container, the first step is always unshare(CLONE_NEWNET) — creating a fresh network namespace. At this point, the container has only a loopback interface. It is completely disconnected from the outside world.
# Create a namespace manually (this is what container runtimes do)
ip netns add container1
# The new namespace has only loopback
ip netns exec container1 ip link
# Output: lo (DOWN)
# List all namespaces on the host
ip netns list
The critical insight: processes inside the namespace cannot see or interact with network interfaces outside it. A container cannot sniff traffic from another container or the host. This is not an application-level sandbox — it is kernel-enforced isolation.
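To make the isolation visible, compare the host's view with the view from inside the namespace created above (a minimal sketch; interface names and routes will differ on your host):
# Host view: physical NICs, bridges, and the full routing table
ip link show
ip route
# Inside the namespace: only loopback, and an empty routing table
ip netns exec container1 ip link show
ip netns exec container1 ip route
# (no output: the namespace starts with no routes at all)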
veth Pairs: The Hallway Between Namespaces
A veth (virtual ethernet) pair is a bidirectional pipe. Packets sent into one end emerge from the other. Container runtimes create a veth pair, place one end inside the container namespace (typically renamed eth0), and leave the other in the host namespace (typically named something like vethXXXXXX).
# Create a veth pair
ip link add veth-host type veth peer name veth-container
# Move one end into the container namespace
ip link set veth-container netns container1
# Assign IPs
ip addr add 10.244.1.1/24 dev veth-host
ip netns exec container1 ip addr add 10.244.1.2/24 dev veth-container
# Bring both ends up
ip link set veth-host up
ip netns exec container1 ip link set veth-container up
At this point, the host can reach 10.244.1.2 and the container can reach 10.244.1.1. But containers on the same host also need to reach each other. Enter the Linux bridge.
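A quick sanity check of the pair, assuming the addresses assigned above:
# Bring up loopback inside the namespace (easy to forget)
ip netns exec container1 ip link set lo up
# Host to container
ping -c 1 10.244.1.2
# Container to host
ip netns exec container1 ping -c 1 10.244.1.1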
Linux Bridge: The Virtual Switch
A Linux bridge operates as a Layer 2 switch inside the kernel. Docker creates docker0; Kubernetes CNIs typically create cbr0 or cni0. All container veth endpoints on the host side connect to this bridge.
When Pod A (10.244.1.2) sends a packet to Pod B (10.244.1.3) on the same node:
- Packet exits Pod A's eth0 (the container end of the veth pair)
- Emerges on the host-side veth, which is connected to the bridge
- Bridge performs MAC learning and forwards to Pod B's host-side veth
- Packet enters Pod B's namespace via its veth pair
This is pure L2 forwarding — no routing, no NAT. Latency is microseconds. Same-host pod communication is effectively free.
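As a minimal sketch of that wiring (the bridge name br-demo is illustrative, not what Docker or any particular CNI uses; note that in the bridged model the gateway address moves from the veth end onto the bridge):
# The gateway IP lives on the bridge, not on the individual veth ends
ip addr del 10.244.1.1/24 dev veth-host
ip link add br-demo type bridge
ip addr add 10.244.1.1/24 dev br-demo
ip link set br-demo up
# Attach the host side of each container's veth pair to the bridge
ip link set veth-host master br-demo
# Give the container a default route via the bridge address
ip netns exec container1 ip route add default via 10.244.1.1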
Cross-Host Communication: Overlays and Routing
The interesting challenge is cross-node traffic. Pod A on Node 1 (10.244.1.2) needs to reach Pod C on Node 2 (10.244.2.2). The underlying physical network knows nothing about pod IPs — it only routes node IPs (192.168.1.x). Two approaches solve this.
Overlay Networks (VXLAN)
VXLAN wraps the original pod-to-pod L2 frame inside a UDP packet addressed from Node 1's IP to Node 2's IP. Node 2's VTEP (VXLAN Tunnel Endpoint) decapsulates and delivers the inner frame to the destination pod's bridge.
Original: [Pod A 10.244.1.2] → [Pod C 10.244.2.2]
On wire: [Node1 192.168.1.10] → [Node2 192.168.1.11] UDP:4789
Inside: [Pod A 10.244.1.2] → [Pod C 10.244.2.2]
The tradeoff: ~50 bytes of encapsulation overhead per packet and additional CPU for encap/decap. For most workloads this is negligible. For high-throughput workloads pushing 10+ Gbps, it matters.
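For intuition, this is roughly the plumbing an overlay CNI programs on each node, shown here as a manually created point-to-point VXLAN device between two nodes (the VNI, device names, and addresses are illustrative):
# On Node 1 (192.168.1.10): VXLAN device tunneling to Node 2 over UDP 4789
ip link add vxlan100 type vxlan id 100 dstport 4789 \
    local 192.168.1.10 remote 192.168.1.11 dev eth0
# Attach it to the pod bridge so bridged frames get encapsulated
ip link set vxlan100 master cni0
ip link set vxlan100 up
# Node 2 mirrors this with local/remote swapped; real CNIs maintain a
# forwarding database of peer VTEPs instead of a single static remote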
Direct Routing (BGP)
Calico in BGP mode takes a different approach. Each node runs a BGP daemon (BIRD) that advertises its pod CIDR to peer nodes or to the physical network's routers. No encapsulation. The physical network learns that 10.244.1.0/24 is reachable via 192.168.1.10 and routes natively.
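The visible result on each node is ordinary kernel routes rather than tunnel devices. A hypothetical routing table on Node 1 (192.168.1.10) after BIRD has learned Node 2's pod CIDR:
# ip route (illustrative output)
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1    # local pods
10.244.2.0/24 via 192.168.1.11 dev eth0 proto bird               # Node 2's pods, learned via BGP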
| Approach | Overhead | Network Requirement | Complexity |
|---|---|---|---|
| VXLAN | ~50 bytes/packet | Any L3 network | Low |
| IP-in-IP | ~20 bytes/packet | Any L3 network | Low |
| BGP (no overlay) | 0 bytes | BGP-capable routers | Medium |
| VPC-native (AWS/GKE) | 0 bytes | Cloud VPC | Low (cloud-managed) |
Kubernetes Pod Networking Model
Kubernetes imposes three requirements on the network:
- Every pod gets a unique, routable IP — no NAT between pods.
- Pods on one node can communicate with pods on any other node — without NAT.
- The IP a pod sees for itself is the same IP others use to reach it — no address translation surprises.
The CNI plugin implements these requirements. When the kubelet creates a pod:
- kubelet calls the CNI binary (/opt/cni/bin/calico, /opt/cni/bin/cilium-cni, etc.)
- CNI allocates an IP from the node's pod CIDR (via IPAM)
- CNI creates the veth pair and attaches one end to the pod namespace
- CNI configures routes — either adds the pod to the bridge or sets up direct routing
- CNI returns the IP to the kubelet, which reports it to the API server
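For a concrete, if simplified, picture of this contract, here is a minimal CNI configuration using the reference bridge and host-local IPAM plugins; Calico, Cilium, and other production CNIs install their own configs, so treat this as an illustrative sketch:
# Minimal CNI config using the reference plugins (not a production setup)
cat > /etc/cni/net.d/10-bridge.conf <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
EOF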
Service Networking: kube-proxy
Kubernetes Services provide stable virtual IPs (ClusterIPs) that load-balance across pods. kube-proxy implements this by programming packet-level rules on every node.
iptables mode (default): Creates DNAT rules in the KUBE-SERVICES chain. For a Service with 3 endpoints, kube-proxy creates a probabilistic chain that randomly distributes connections. At 5,000+ services, the linear iptables chain-walk becomes measurable.
IPVS mode: Uses the Linux Virtual Server kernel module with hash-based connection tracking. Service lookup is O(1) regardless of scale. Supports weighted round-robin, least connections, and source hash algorithms.
# iptables mode: inspect service rules
iptables -t nat -L KUBE-SERVICES -n | head -20
# IPVS mode: list virtual servers
ipvsadm -Ln
External Traffic: NodePort, LoadBalancer, Ingress
Traffic from outside the cluster follows a different path:
- NodePort: kube-proxy opens a port (30000-32767) on every node. External traffic hits any node's IP:NodePort and gets DNAT'd to a pod endpoint. The pod might be on a different node, causing an extra hop.
- LoadBalancer: Cloud provider provisions an external LB (AWS NLB/ALB, GCP LB) that distributes to NodePorts. The externalTrafficPolicy: Local setting avoids the extra hop by only forwarding to nodes running the target pods (see the sketch after this list).
- Ingress: An in-cluster reverse proxy (NGINX, Traefik, Istio Gateway) that receives all external HTTP traffic on a single LoadBalancer and routes by hostname/path to backend Services.
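As a sketch of the LoadBalancer path, assuming a hypothetical Deployment named web on a cloud-managed cluster:
# Expose the deployment through a cloud load balancer
kubectl expose deployment web --type=LoadBalancer --port=80 --target-port=8080
# Avoid the extra NodePort hop: only nodes running web pods receive traffic
kubectl patch svc web -p '{"spec":{"externalTrafficPolicy":"Local"}}'
# Check the allocated NodePort and the external IP once the LB is provisioned
kubectl get svc web -o wide
A useful side effect of Local is that the client source IP is preserved end to end, which matters for IP allow-lists and access logs.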
Performance Considerations
Container networking overhead varies dramatically by configuration:
- Same-node pod-to-pod: A few microseconds via bridge forwarding. Effectively zero overhead.
- Cross-node with VXLAN: 5-15% throughput reduction due to encapsulation and MTU reduction.
- Cross-node with BGP/native routing: Near-bare-metal performance — no encapsulation tax.
- Service routing via iptables: Adds measurable latency at 5,000+ services due to linear rule evaluation.
- Service routing via IPVS or eBPF: O(1) lookup — service count does not affect per-packet latency.
For latency-sensitive workloads, a CNI like Calico in BGP mode (no overlay) or Cilium with its eBPF datapath (bypassing iptables) delivers performance within 2-3% of bare metal. This is why production Kubernetes operators care deeply about CNI selection — the default configuration can leave 15-20% of available network throughput on the table.
Key Points
- Every container gets its own network namespace with a dedicated network stack — isolated interfaces, routes, and iptables rules that cannot see the host's or other containers' stacks.
- veth pairs are the plumbing: one end sits inside the container namespace (eth0), the other connects to a bridge or directly to the host routing table. Deleting either end destroys both.
- VXLAN encapsulates L2 frames inside UDP packets (port 4789) to stretch a flat L2 network across L3 boundaries — the overlay tax is roughly 50 bytes of header per packet.
- Kubernetes requires that every pod gets a routable IP and that pods communicate without NAT. The CNI plugin enforces this contract regardless of the underlying network topology.
- kube-proxy in IPVS mode uses hash tables for O(1) service routing, supporting 10,000+ services without the linear chain-walk penalty of iptables mode.
Key Components
| Component | Role |
|---|---|
| Network Namespaces | Isolated network stacks within a single Linux host — each namespace has its own interfaces, routing table, iptables rules, and sockets |
| veth Pairs | Virtual Ethernet cables that connect a container's network namespace to the host namespace, always created in pairs with one end in each namespace |
| Linux Bridge (docker0 / cbr0) | A virtual L2 switch inside the host kernel that connects all container veth endpoints, enabling same-host container communication |
| CNI Plugin | The Container Network Interface plugin that Kubernetes calls to attach a pod to the network — responsible for IP allocation, route setup, and overlay encapsulation |
| kube-proxy | Kubernetes component that programs iptables or IPVS rules to implement Service ClusterIPs, translating virtual IPs to actual pod endpoints |
When to Use
Container networking is not optional — any Docker or Kubernetes deployment uses these primitives. Choose Calico or Cilium for production Kubernetes. Use Flannel for simple clusters that do not need network policy. Prefer BGP mode over VXLAN when the underlying network supports it, to avoid encapsulation overhead.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Calico | Open Source | BGP-based pod networking with no overlay overhead, strong NetworkPolicy enforcement, and support for both iptables and eBPF data planes | Medium-Enterprise |
| Flannel | Open Source | Simple VXLAN overlay networking with minimal configuration — good for small clusters where advanced policy is not needed | Small-Medium |
| Cilium | Open Source | eBPF-native CNI that replaces kube-proxy, provides L7 network policy, and includes built-in observability via Hubble | Medium-Enterprise |
| WeaveNet | Open Source | Mesh overlay with automatic encryption and multicast support, easy setup for development and smaller clusters | Small-Medium |
Debug Checklist
- Inspect the container's network namespace: nsenter -t <PID> -n ip addr shows interfaces and IPs inside the container's isolated stack.
- Check bridge connectivity: brctl show (or bridge link) lists which veth endpoints are connected to which bridge.
- Verify pod CIDR allocation: kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}' shows the subnet assigned to each node.
- Test cross-node connectivity: kubectl exec into a pod on node A and ping a pod IP on node B — if it fails, the overlay or BGP peering is broken.
- Examine kube-proxy rules: iptables -t nat -L KUBE-SERVICES shows the DNAT rules mapping ClusterIPs to pod endpoints.
Common Mistakes
- Assuming containers on different hosts can communicate without an overlay or direct routing setup. Without VXLAN, IP-in-IP, or BGP-advertised routes, cross-host pod traffic is black-holed.
- Running Docker's default bridge mode in production Kubernetes. The docker0 bridge uses NAT and port mapping, violating Kubernetes' flat-network requirement.
- Ignoring MTU mismatches when using overlays. VXLAN adds 50 bytes of header — if the underlying network MTU is 1500, the container MTU must be 1450 or fragmentation kills throughput (a quick check follows this list).
- Not setting resource limits on kube-proxy. In iptables mode with 5,000+ services, kube-proxy can consume significant CPU regenerating rules on every endpoint change.
- Debugging container networking from the host namespace. The container has a different routing table — always exec into the container or use nsenter to enter its network namespace.
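A quick way to catch the MTU mismatch flagged above is to compare the physical NIC, the pod bridge or overlay device, and a pod's own interface (device and pod names vary by CNI and cluster):
# Physical NIC (typically 1500, or 9000 with jumbo frames)
ip link show eth0 | grep -o 'mtu [0-9]*'
# Pod bridge / overlay device (should be ~50 bytes smaller when using VXLAN)
ip link show cni0 | grep -o 'mtu [0-9]*'
# Inside a pod (substitute a real pod name)
kubectl exec <pod> -- cat /sys/class/net/eth0/mtu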
Real World Usage
- Google GKE uses a VPC-native CNI that assigns pod IPs from the VPC subnet directly, eliminating overlay overhead entirely with alias IP ranges on each node.
- AWS EKS defaults to the AWS VPC CNI plugin, which attaches real ENI (Elastic Network Interface) secondary IPs to pods — giving each pod a routable VPC address.
- Docker Desktop uses a user-space network stack (vpnkit) to bridge the Linux VM's container network to the macOS/Windows host, which is why Docker networking behaves differently locally.
- Shopify runs thousands of pods across multiple clusters using Calico in BGP mode, peering directly with their Top-of-Rack switches to advertise pod CIDRs without overlays.