Container Networking & Namespaces
Container networking layers namespaces, veth pairs, bridges, and overlays to give every container an isolated network stack while maintaining seamless cross-host communication.
The Problem
Containers need isolated network stacks but must communicate across hosts seamlessly. A single Linux host can run hundreds of containers, each requiring its own IP, routing, and firewall rules — yet cross-host communication must appear as if all containers share one flat network. Solving this without per-container hardware NICs requires virtual networking primitives layered on top of the kernel.
Mental Model
Like apartments — each has its own address (namespace), connected by hallways (veth) to the lobby (bridge), with postal service (overlay) connecting buildings (nodes).
How It Works
Container networking solves a deceptively hard problem: give every container an isolated network stack while making cross-host communication transparent. The solution layers several Linux kernel primitives — network namespaces, virtual ethernet pairs, bridges, and overlay protocols — into a coherent system that the container runtime manages automatically.
Understanding this stack from the bottom up is essential for diagnosing pod network failures, choosing the right CNI plugin, and reasoning about performance at scale.
Network Namespaces: Isolation at the Kernel Level
A network namespace is a kernel construct that creates an independent copy of the network stack. Each namespace has its own:
- Network interfaces (lo, eth0)
- IP addresses and routing tables
- iptables/nftables rules
- Socket tables and connection tracking
When Docker or Kubernetes creates a container, the first step is always unshare(CLONE_NEWNET) — creating a fresh network namespace. At this point, the container has only a loopback interface. It is completely disconnected from the outside world.
# Create a namespace manually (this is what container runtimes do)
ip netns add container1
# The new namespace has only loopback
ip netns exec container1 ip link
# Output: lo (DOWN)
# List all namespaces on the host
ip netns list
The critical insight: processes inside the namespace cannot see or interact with network interfaces outside it. A container cannot sniff traffic from another container or the host. This is not an application-level sandbox — it is kernel-enforced isolation.
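To make the isolation visible, compare the host's view with the view from inside the namespace created above (a minimal sketch; interface names and routes will differ on your host):
# Host view: physical NICs, bridges, and the full routing table
ip link show
ip route
# Inside the namespace: only loopback, and an empty routing table
ip netns exec container1 ip link show
ip netns exec container1 ip route
# (no output: the namespace starts with no routes at all)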
veth Pairs: The Hallway Between Namespaces
A veth (virtual ethernet) pair is a bidirectional pipe. Packets sent into one end emerge from the other. Container runtimes create a veth pair, place one end inside the container namespace (typically renamed eth0), and leave the other in the host namespace (typically named something like vethXXXXXX).
# Create a veth pair
ip link add veth-host type veth peer name veth-container
# Move one end into the container namespace
ip link set veth-container netns container1
# Assign IPs
ip addr add 10.244.1.1/24 dev veth-host
ip netns exec container1 ip addr add 10.244.1.2/24 dev veth-container
# Bring both ends up
ip link set veth-host up
ip netns exec container1 ip link set veth-container up
At this point, the host can reach 10.244.1.2 and the container can reach 10.244.1.1. But containers on the same host also need to reach each other. Enter the Linux bridge.
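A quick sanity check of the pair, assuming the addresses assigned above:
# Bring up loopback inside the namespace (easy to forget)
ip netns exec container1 ip link set lo up
# Host to container
ping -c 1 10.244.1.2
# Container to host
ip netns exec container1 ping -c 1 10.244.1.1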
Linux Bridge: The Virtual Switch
A Linux bridge operates as a Layer 2 switch inside the kernel. Docker creates docker0; Kubernetes CNIs typically create cbr0 or cni0. All container veth endpoints on the host side connect to this bridge.
When Pod A (10.244.1.2) sends a packet to Pod B (10.244.1.3) on the same node:
- Packet exits Pod A's eth0 (the container end of the veth pair)
- Emerges on the host-side veth, which is connected to the bridge
- Bridge performs MAC learning and forwards to Pod B's host-side veth
- Packet enters Pod B's namespace via its veth pair
This is pure L2 forwarding — no routing, no NAT. Latency is microseconds. Same-host pod communication is effectively free.
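As a minimal sketch of that wiring (the bridge name br-demo is illustrative, not what Docker or any particular CNI uses; note that in the bridged model the gateway address moves from the veth end onto the bridge):
# The gateway IP lives on the bridge, not on the individual veth ends
ip addr del 10.244.1.1/24 dev veth-host
ip link add br-demo type bridge
ip addr add 10.244.1.1/24 dev br-demo
ip link set br-demo up
# Attach the host side of each container's veth pair to the bridge
ip link set veth-host master br-demo
# Give the container a default route via the bridge address
ip netns exec container1 ip route add default via 10.244.1.1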
Cross-Host Communication: Overlays and Routing
The interesting challenge is cross-node traffic. Pod A on Node 1 (10.244.1.2) needs to reach Pod C on Node 2 (10.244.2.2). The underlying physical network knows nothing about pod IPs — it only routes node IPs (192.168.1.x). Two approaches solve this.
Overlay Networks (VXLAN)
VXLAN wraps the original pod-to-pod L2 frame inside a UDP packet addressed from Node 1's IP to Node 2's IP. Node 2's VTEP (VXLAN Tunnel Endpoint) decapsulates and delivers the inner frame to the destination pod's bridge.
Original: [Pod A 10.244.1.2] → [Pod C 10.244.2.2]
On wire: [Node1 192.168.1.10] → [Node2 192.168.1.11] UDP:4789
Inside: [Pod A 10.244.1.2] → [Pod C 10.244.2.2]
The tradeoff: ~50 bytes of encapsulation overhead per packet and additional CPU for encap/decap. For most workloads this is negligible. For high-throughput workloads pushing 10+ Gbps, it matters.
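For intuition, this is roughly the plumbing an overlay CNI programs on each node, shown here as a manually created point-to-point VXLAN device between two nodes (the VNI, device names, and addresses are illustrative):
# On Node 1 (192.168.1.10): VXLAN device tunneling to Node 2 over UDP 4789
ip link add vxlan100 type vxlan id 100 dstport 4789 \
    local 192.168.1.10 remote 192.168.1.11 dev eth0
# Attach it to the pod bridge so bridged frames get encapsulated
ip link set vxlan100 master cni0
ip link set vxlan100 up
# Node 2 mirrors this with local/remote swapped; real CNIs maintain a
# forwarding database of peer VTEPs instead of a single static remote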
Direct Routing (BGP)
Calico in BGP mode takes a different approach. Each node runs a BGP daemon (BIRD) that advertises its pod CIDR to peer nodes or to the physical network's routers. No encapsulation. The physical network learns that 10.244.1.0/24 is reachable via 192.168.1.10 and routes natively.
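The visible result on each node is ordinary kernel routes rather than tunnel devices. A hypothetical routing table on Node 1 (192.168.1.10) after BIRD has learned Node 2's pod CIDR:
# ip route (illustrative output)
10.244.1.0/24 dev cni0 proto kernel scope link src 10.244.1.1    # local pods
10.244.2.0/24 via 192.168.1.11 dev eth0 proto bird               # Node 2's pods, learned via BGP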
| Approach | Overhead | Network Requirement | Complexity |
|---|---|---|---|
| VXLAN | ~50 bytes/packet | Any L3 network | Low |
| IP-in-IP | ~20 bytes/packet | Any L3 network | Low |
| BGP (no overlay) | 0 bytes | BGP-capable routers | Medium |
| VPC-native (AWS/GKE) | 0 bytes | Cloud VPC | Low (cloud-managed) |
Kubernetes Pod Networking Model
Kubernetes imposes three requirements on the network:
- Every pod gets a unique, routable IP — no NAT between pods.
- Pods on one node can communicate with pods on any other node — without NAT.
- The IP a pod sees for itself is the same IP others use to reach it — no address translation surprises.
The CNI plugin implements these requirements. When the kubelet creates a pod:
- kubelet calls the CNI binary (/opt/cni/bin/calico, /opt/cni/bin/cilium-cni, etc.)
- CNI allocates an IP from the node's pod CIDR (via IPAM)
- CNI creates the veth pair and attaches one end to the pod namespace
- CNI configures routes — either adds the pod to the bridge or sets up direct routing
- CNI returns the IP to the kubelet, which reports it to the API server
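For a concrete, if simplified, picture of this contract, here is a minimal CNI configuration using the reference bridge and host-local IPAM plugins; Calico, Cilium, and other production CNIs install their own configs, so treat this as an illustrative sketch:
# Minimal CNI config using the reference plugins (not a production setup)
cat > /etc/cni/net.d/10-bridge.conf <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
EOF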
Service Networking: kube-proxy
Kubernetes Services provide stable virtual IPs (ClusterIPs) that load-balance across pods. kube-proxy implements this by programming packet-level rules on every node.
iptables mode (default): Creates DNAT rules in the KUBE-SERVICES chain. For a Service with 3 endpoints, kube-proxy creates a probabilistic chain that randomly distributes connections. At 5,000+ services, the linear iptables chain-walk becomes measurable.
IPVS mode: Uses the Linux Virtual Server kernel module with hash-based connection tracking. Service lookup is O(1) regardless of scale. Supports weighted round-robin, least connections, and source hash algorithms.
# iptables mode: inspect service rules
iptables -t nat -L KUBE-SERVICES -n | head -20
# IPVS mode: list virtual servers
ipvsadm -Ln
External Traffic: NodePort, LoadBalancer, Ingress
Traffic from outside the cluster follows a different path:
- NodePort: kube-proxy opens a port (30000-32767) on every node. External traffic hits any node's IP:NodePort and gets DNAT'd to a pod endpoint. The pod might be on a different node, causing an extra hop.
- LoadBalancer: Cloud provider provisions an external LB (AWS NLB/ALB, GCP LB) that distributes to NodePorts. The externalTrafficPolicy: Local setting avoids the extra hop by only forwarding to nodes running the target pods (see the sketch after this list).
- Ingress: An in-cluster reverse proxy (NGINX, Traefik, Istio Gateway) that receives all external HTTP traffic on a single LoadBalancer and routes by hostname/path to backend Services.
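As a sketch of the LoadBalancer path, assuming a hypothetical Deployment named web on a cloud-managed cluster:
# Expose the deployment through a cloud load balancer
kubectl expose deployment web --type=LoadBalancer --port=80 --target-port=8080
# Avoid the extra NodePort hop: only nodes running web pods receive traffic
kubectl patch svc web -p '{"spec":{"externalTrafficPolicy":"Local"}}'
# Check the allocated NodePort and the external IP once the LB is provisioned
kubectl get svc web -o wide
A useful side effect of Local is that the client source IP is preserved end to end, which matters for IP allow-lists and access logs.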
Performance Considerations
Container networking overhead varies dramatically by configuration:
- Same-node pod-to-pod: A few microseconds via bridge forwarding. Effectively zero overhead.
- Cross-node with VXLAN: 5-15% throughput reduction due to encapsulation and MTU reduction.
- Cross-node with BGP/native routing: Near-bare-metal performance — no encapsulation tax.
- Service routing via iptables: Adds measurable latency at 5,000+ services due to linear rule evaluation.
- Service routing via IPVS or eBPF: O(1) lookup — service count does not affect per-packet latency.
For latency-sensitive workloads, a CNI like Calico in BGP mode (no overlay) or Cilium with its eBPF datapath (bypassing iptables) delivers performance within 2-3% of bare metal. This is why production Kubernetes operators care deeply about CNI selection — the default configuration can leave 15-20% of available network throughput on the table.
Key Points
- Every container gets its own network namespace with a dedicated network stack — isolated interfaces, routes, and iptables rules that cannot see the host's or other containers' stacks.
- veth pairs are the plumbing: one end sits inside the container namespace (eth0), the other connects to a bridge or directly to the host routing table. Deleting either end destroys both.
- VXLAN encapsulates L2 frames inside UDP packets (port 4789) to stretch a flat L2 network across L3 boundaries — the overlay tax is roughly 50 bytes of header per packet.
- Kubernetes requires that every pod gets a routable IP and that pods communicate without NAT. The CNI plugin enforces this contract regardless of the underlying network topology.
- kube-proxy in IPVS mode uses hash tables for O(1) service routing, supporting 10,000+ services without the linear chain-walk penalty of iptables mode.
Key Components
| Component | Role |
|---|---|
| Network Namespaces | Isolated network stacks within a single Linux host — each namespace has its own interfaces, routing table, iptables rules, and sockets |
| veth Pairs | Virtual Ethernet cables that connect a container's network namespace to the host namespace, always created in pairs with one end in each namespace |
| Linux Bridge (docker0 / cbr0) | A virtual L2 switch inside the host kernel that connects all container veth endpoints, enabling same-host container communication |
| CNI Plugin | The Container Network Interface plugin that Kubernetes calls to attach a pod to the network — responsible for IP allocation, route setup, and overlay encapsulation |
| kube-proxy | Kubernetes component that programs iptables or IPVS rules to implement Service ClusterIPs, translating virtual IPs to actual pod endpoints |
When to Use
Container networking is not optional — any Docker or Kubernetes deployment uses these primitives. Choose Calico or Cilium for production Kubernetes. Use Flannel for simple clusters that do not need network policy. Prefer BGP mode over VXLAN when the underlying network supports it, to avoid encapsulation overhead.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Calico | Open Source | BGP-based pod networking with no overlay overhead, strong NetworkPolicy enforcement, and support for both iptables and eBPF data planes | Medium-Enterprise |
| Flannel | Open Source | Simple VXLAN overlay networking with minimal configuration — good for small clusters where advanced policy is not needed | Small-Medium |
| Cilium | Open Source | eBPF-native CNI that replaces kube-proxy, provides L7 network policy, and includes built-in observability via Hubble | Medium-Enterprise |
| WeaveNet | Open Source | Mesh overlay with automatic encryption and multicast support, easy setup for development and smaller clusters | Small-Medium |
Debug Checklist
- Inspect the container's network namespace: nsenter -t <PID> -n ip addr shows interfaces and IPs inside the container's isolated stack.
- Check bridge connectivity: brctl show (or bridge link) lists which veth endpoints are connected to which bridge.
- Verify pod CIDR allocation: kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}' shows the subnet assigned to each node.
- Test cross-node connectivity: kubectl exec into a pod on node A and ping a pod IP on node B — if it fails, the overlay or BGP peering is broken.
- Examine kube-proxy rules: iptables -t nat -L KUBE-SERVICES shows the DNAT rules mapping ClusterIPs to pod endpoints.
Common Mistakes
- Assuming containers on different hosts can communicate without an overlay or direct routing setup. Without VXLAN, IP-in-IP, or BGP-advertised routes, cross-host pod traffic is black-holed.
- Running Docker's default bridge mode in production Kubernetes. The docker0 bridge uses NAT and port mapping, violating Kubernetes' flat-network requirement.
- Ignoring MTU mismatches when using overlays. VXLAN adds 50 bytes of header — if the underlying network MTU is 1500, the container MTU must be 1450 or fragmentation kills throughput (a quick check follows this list).
- Not setting resource limits on kube-proxy. In iptables mode with 5,000+ services, kube-proxy can consume significant CPU regenerating rules on every endpoint change.
- Debugging container networking from the host namespace. The container has a different routing table — always exec into the container or use nsenter to enter its network namespace.
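A quick way to catch the MTU mismatch flagged above is to compare the physical NIC, the pod bridge or overlay device, and a pod's own interface (device and pod names vary by CNI and cluster):
# Physical NIC (typically 1500, or 9000 with jumbo frames)
ip link show eth0 | grep -o 'mtu [0-9]*'
# Pod bridge / overlay device (should be ~50 bytes smaller when using VXLAN)
ip link show cni0 | grep -o 'mtu [0-9]*'
# Inside a pod (substitute a real pod name)
kubectl exec <pod> -- cat /sys/class/net/eth0/mtu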
Real World Usage
- Google GKE uses a VPC-native CNI that assigns pod IPs from the VPC subnet directly, eliminating overlay overhead entirely with alias IP ranges on each node.
- AWS EKS defaults to the AWS VPC CNI plugin, which attaches real ENI (Elastic Network Interface) secondary IPs to pods — giving each pod a routable VPC address.
- Docker Desktop uses a user-space network stack (vpnkit) to bridge the Linux VM's container network to the macOS/Windows host, which is why Docker networking behaves differently locally.
- Shopify runs thousands of pods across multiple clusters using Calico in BGP mode, peering directly with their Top-of-Rack switches to advertise pod CIDRs without overlays.