VXLAN & Overlay Networking
Mental Model
Imagine two office buildings in different cities. Employees in each building pass paper memos to coworkers on the same floor directly -- just walk over and hand it off. But to reach someone in the other building, a memo goes to the mailroom (VTEP), gets stuffed into a larger envelope (encapsulation) with the other building's street address (outer IP) written on it, and dropped into the postal system (physical network). The postal service only reads the outer envelope and does not care what is inside. When the envelope arrives at the other building's mailroom (remote VTEP), the clerk opens the outer envelope (decapsulation), reads the inner memo's recipient name (inner MAC), checks the building directory (FDB), and delivers it to the right desk (pod). The floor number on the inner memo is the VNI -- it tells the mailroom which department the memo belongs to, keeping finance memos separate from engineering memos even though they travel in the same postal system.
The Problem
Pod A on node 1 sends an HTTP request to pod B on node 2. The connection times out. Both pods work fine talking to local pods on their own node. The physical network between the nodes is healthy -- node-to-node ping works. Tcpdump on the destination node shows nothing arriving. The cluster runs 200 pods across 15 nodes, and roughly 40% of cross-node connections fail. Worse, some connections work for small payloads but silently drop large responses. A 1400-byte ping succeeds; a 1500-byte ping with DF bit set gets black-holed. The ops team spends three days blaming the application before discovering that VXLAN encapsulation adds 50 bytes to every packet, the inner MTU was never reduced from 1500 to 1450, and the network silently drops oversized encapsulated frames with no ICMP error reaching the sender.
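A quick way to confirm this kind of MTU black hole from the command line, before blaming the application (a diagnostic sketch; the destination pod IP 10.244.2.8 is a placeholder):

# 1472 bytes of payload + 28 bytes of ICMP/IP headers = a 1500-byte packet with DF set.
# Over VXLAN on a 1500-byte physical network this should fail or time out.
ping -M do -s 1472 -c 3 10.244.2.8

# 1422 + 28 = 1450 bytes, which leaves room for the 50-byte VXLAN overhead -- this should succeed.
ping -M do -s 1422 -c 3 10.244.2.8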
Architecture
How does a packet from a container on one physical machine reach a container on a completely different machine when the two machines are on separate IP subnets? The physical network has no idea that containers exist. It only knows how to route packets between host machines.
This is the fundamental problem that VXLAN solves.
Why Overlay Networks Exist
A Kubernetes cluster with 50 nodes running 40 pods each has 2,000 pod IPs. The physical network knows about 50 node IPs and nothing about the 2,000 pod IPs.
There are two solutions. The first is teaching the physical network about pod IPs by injecting routes via BGP. This is what Calico does in BGP mode. It works when the network team cooperates, but many data centers and most cloud environments do not allow tenants to peer BGP with infrastructure routers.
The second is building a virtual network on top of the physical one -- encapsulate pod-to-pod traffic inside node-to-node UDP packets so the physical network only sees traffic between nodes it already knows how to route. This is an overlay network, and VXLAN (Virtual Extensible LAN, RFC 7348) is the dominant protocol for it.
VXLAN provides Layer 2 adjacency over a Layer 3 physical network. From the pod perspective, every pod is on the same Ethernet segment. They can ARP for each other and communicate as if plugged into the same switch. In reality, every frame is encapsulated inside a UDP packet, sent across the physical network, and decapsulated on the destination node.
The VXLAN Header Format
Every VXLAN-encapsulated packet has the following structure, from outer to inner:
+----------------------------+
| Outer Ethernet header      | 14 bytes  (src/dst MAC of physical NICs)
+----------------------------+
| Outer IPv4 header          | 20 bytes  (src=local VTEP IP, dst=remote VTEP IP)
+----------------------------+
| Outer UDP header           |  8 bytes  (src=hash-based, dst=4789)
+----------------------------+
| VXLAN header               |  8 bytes  (flags + 24-bit VNI)
+----------------------------+
| Inner Ethernet header      | 14 bytes  (src/dst MAC of pods)
+----------------------------+
| Inner IPv4 header          | 20 bytes  (src/dst IP of pods)
+----------------------------+
| Inner payload              |  N bytes  (TCP/HTTP/application data)
+----------------------------+

Total overhead: 50 bytes (14 + 20 + 8 + 8)
The outer Ethernet header uses the MACs of the physical NICs. The outer IP header carries VTEP-to-VTEP tunnel endpoints -- the physical network routes this like any other IP packet.
The outer UDP header uses destination port 4789 (IANA-assigned). The source port is computed from a hash of the inner 5-tuple (src IP, dst IP, protocol, src port, dst port). Physical switches use the outer 5-tuple for ECMP hashing, so varying the source port distributes VXLAN traffic across multiple physical links.
The VXLAN header is 8 bytes. The I flag must be set to indicate a valid VNI. A 24-bit field (followed by a reserved byte) holds the VNI, supporting up to 16,777,216 unique segments -- solving the 4,094-segment limitation of 802.1Q VLANs.
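These fields can be observed directly on the wire: tcpdump decodes UDP port 4789 as VXLAN and prints the VNI along with the inner headers (a sketch; eth0 and the standard port are assumed):

# Outer 5-tuple, then "VXLAN, flags [I] ..., vni N", then the inner Ethernet/IP headers
tcpdump -i eth0 -nn -c 5 udp port 4789

# The third field of each line is the outer source IP.port -- watch the port vary per inner flow,
# which is what spreads overlay traffic across ECMP paths
tcpdump -i eth0 -nn -c 20 udp dst port 4789 | awk '{print $3}'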
The Full Packet Path
Here is the complete journey of a packet from Pod A on Node 1 to Pod B on Node 2:
1. Pod A writes to a socket. The kernel builds an inner IP packet (src=10.244.1.5, dst=10.244.2.8), wraps it in an inner Ethernet frame with Pod B's MAC, and sends it through the pod's eth0.
2. Veth pair to bridge. Pod A's eth0 is one end of a veth pair; the other end is on the cni0 bridge. The frame exits the pod's network namespace into the host namespace.
3. Bridge to VXLAN device. The bridge sees the destination MAC is not local and forwards the frame to the flannel.1 VXLAN device.
4. VTEP encapsulates. The VTEP looks up the inner destination MAC in its FDB to find the outer destination IP (10.0.0.2). It prepends outer Ethernet + outer IP (src=10.0.0.1, dst=10.0.0.2) + outer UDP (dst=4789) + VXLAN header (VNI=1). The inner frame is now a UDP payload.
5. Kernel routes the outer packet to eth0 (physical NIC) and transmits it as regular UDP.
6. Physical network forwards based on the outer IP header. No knowledge of pod addresses.
7. Node 2 receives. The kernel sees UDP port 4789 and hands it to the VXLAN module.
8. Remote VTEP decapsulates -- strips outer headers, checks VNI, delivers the inner frame to the local cni0 bridge.
9. Bridge to Pod B. The bridge forwards via the veth pair into Pod B's namespace.
10. Pod B reads from socket. From Pod B's perspective, Pod A is on the same local network.
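Each step above can be verified with a capture on the corresponding interface of the sending node (a tracing sketch using the Flannel device names and example addresses from the steps):

# Steps 2-3: the inner frame on the cni0 bridge (pod IPs visible)
tcpdump -i cni0 -nn -c 5 host 10.244.2.8

# Step 4: the frame entering the VTEP before encapsulation (still pod IPs)
tcpdump -i flannel.1 -nn -c 5 host 10.244.2.8

# Steps 5-6: the encapsulated packet on the physical NIC (only node IPs and UDP 4789)
tcpdump -i eth0 -nn -c 5 udp port 4789 and host 10.0.0.2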
VTEP and FDB
The VTEP (VXLAN Tunnel Endpoint) is the VXLAN device on each node. In Flannel it is flannel.1, in Calico it is vxlan.calico, in Docker Swarm it is created automatically per overlay. Each VTEP has its own MAC and IP, separate from the physical NIC. It connects to the local bridge (cni0 in Flannel) -- traffic the bridge cannot deliver locally gets forwarded to the VTEP for encapsulation. On the receiving end, the VTEP decapsulates and feeds frames back into the bridge.
The FDB (Forwarding Database) maps inner destination MACs to outer destination IPs (remote VTEP addresses). Without correct entries, the VTEP does not know where to send encapsulated frames. In multicast-based VXLAN, entries are learned by flooding BUM traffic to a multicast group. In controller-based VXLAN (Flannel, Calico, Docker Swarm), entries are programmed statically by the CNI plugin via netlink. Stale FDB entries are a common problem -- when a node leaves the cluster, entries on other nodes still point to the old VTEP. Always inspect with bridge fdb show dev flannel.1.
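A minimal cleanup sketch for the stale-entry case, assuming flannel.1 as the VTEP, a placeholder MAC, and 10.0.0.99 as the departed node's VTEP IP:

# Each line maps an inner/VTEP MAC (or all-zeros for BUM traffic) to a remote VTEP IP
bridge fdb show dev flannel.1

# Remove an entry that still points at a node no longer in the cluster
bridge fdb del aa:bb:cc:dd:ee:ff dev flannel.1 dst 10.0.0.99

# Restarting the CNI daemon re-programs entries for the nodes that are actually live
systemctl restart flanneld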
ARP Suppression
In a VXLAN overlay with hundreds of pods, ARP broadcasts would flood every VTEP on every node. ARP suppression (proxy flag on the VXLAN device) solves this -- the local VTEP answers ARP requests on behalf of remote pods using cached IP-to-MAC mappings from its neighbor table. The ARP request never leaves the local node. Flannel programs neighbor entries (ip neigh add) alongside FDB entries, eliminating virtually all broadcast traffic from the overlay.
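Whether suppression is active can be checked on the node (a sketch using Flannel's device name):

# The neighbor cache the VTEP answers ARP from -- one entry per remote pod subnet/VTEP
ip neigh show dev flannel.1

# The vxlan details should include the "proxy" flag when ARP suppression is enabled
ip -d link show flannel.1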
Flannel VXLAN Mode
Each node runs flanneld, which watches the Kubernetes API for node events. When a new node joins, flanneld on every other node programs three things: (1) a route for the new node's pod subnet via the VTEP, (2) an FDB entry mapping the new node's VTEP MAC to its VTEP IP, (3) a neighbor entry mapping the VTEP IP to its VTEP MAC. The result is a fully meshed overlay -- no multicast, no dynamic learning, every entry programmed deterministically.
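Those three entries correspond roughly to the following commands, shown here as a sketch of what flanneld programs via netlink when a node with pod subnet 10.244.2.0/24, node IP 10.0.0.2, and VTEP MAC ee:ff:00:11:22:33 joins (all values illustrative):

# (1) Route the new node's pod subnet through the local VTEP device
ip route add 10.244.2.0/24 via 10.244.2.0 dev flannel.1 onlink

# (2) FDB entry: remote VTEP MAC -> remote node IP, consulted at encapsulation time
bridge fdb append ee:ff:00:11:22:33 dev flannel.1 dst 10.0.0.2

# (3) Neighbor entry: remote VTEP IP -> remote VTEP MAC, so no ARP ever crosses the overlay
ip neigh add 10.244.2.0 lladdr ee:ff:00:11:22:33 dev flannel.1 nud permanent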
The downside of that simplicity is limited capability: Flannel does not support network policies, and VXLAN is its only widely used overlay backend. For clusters needing policies or alternative routing, Calico or Cilium are better choices.
Calico: VXLAN Mode vs BGP Mode
Calico offers three data plane modes:
BGP mode (no overlay): routes pod subnets via node IPs, peers with physical network via BGP. Zero overhead, zero MTU reduction. Fastest option. Requires BGP peering, which many cloud environments block.
IP-in-IP mode: encapsulates in IP-in-IP tunnels (protocol 4). 20 bytes overhead (just outer IP header). Some cloud networks block protocol 4.
VXLAN mode: identical encapsulation to Flannel but adds network policies via iptables or eBPF. VTEP is vxlan.calico.
Performance (approximate, 10Gbps NIC, MTU 1500): BGP ~9.4 Gbps / ~15 us latency. IP-in-IP ~9.0 Gbps / ~18 us. VXLAN ~8.5 Gbps / ~20 us. With jumbo frames the differences narrow significantly.
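Switching between these modes is a configuration change rather than a reinstall; a hedged sketch using calicoctl (the pool name default-ipv4-ippool and the field names follow Calico v3 resources -- verify against your version):

# Move the default pool from IP-in-IP to VXLAN, encapsulating only across subnet boundaries
calicoctl patch ippool default-ipv4-ippool --patch '{"spec":{"ipipMode":"Never","vxlanMode":"CrossSubnet"}}'

# Keep the VXLAN MTU consistent with the 50-byte overhead
calicoctl patch felixconfiguration default --patch '{"spec":{"vxlanMTU":1450}}'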
Geneve: The Successor to VXLAN
Geneve (RFC 8926) evolves VXLAN by replacing the fixed 8-byte header with a variable-length header carrying TLV (Type-Length-Value) options. OVN (Open Virtual Network) uses Geneve because it can embed security context, flow identifiers, and policy information directly in the tunnel header; Cilium can be configured to use it for the same reason.
Geneve uses UDP port 6081. Base overhead is the same as VXLAN (50 bytes) but grows with options. Kernel support has been stable since Linux 3.18. Cilium supports Geneve as an alternative to its default VXLAN tunneling. The choice rarely matters for pure container networking, but Geneve's extensibility becomes important for service mesh and policy-aware networking.
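For comparison with the VXLAN commands in Try It Yourself below, a point-to-point Geneve device is created the same way (a sketch with placeholder addresses):

# Geneve defaults to UDP port 6081; "remote" makes this a point-to-point tunnel
ip link add geneve0 type geneve id 42 remote 10.0.0.2
ip link set geneve0 up mtu 1450

# Shows id 42, remote 10.0.0.2, dstport 6081
ip -d link show geneve0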
MTU: The Silent Killer
MTU misconfiguration causes more overlay networking incidents than any other issue. VXLAN adds 50 bytes. A 1500-byte inner frame becomes 1550 on the wire. Physical MTU is 1500. TCP sets DF (Don't Fragment) by default. The router drops the packet and sends ICMP "Fragmentation Needed" -- but many firewalls block ICMP, so the sender never learns and retransmits forever. This is a PMTUD black hole.
The symptoms are deceptive: small packets work (pings, DNS), large transfers fail silently (file downloads, HTTP responses over ~1400 bytes). The application team blames the code. The network team says ping works. Nobody checks MTU.
Fix: set inner MTU to 1450 on all VXLAN devices and pod interfaces. Flannel: net-conf.json ConfigMap. Calico: vxlanMTU in FelixConfiguration. Docker: --opt com.docker.network.driver.mtu=1450.
Better fix: jumbo frames. Physical MTU 9000, inner MTU 8950. AWS, GCP, Azure all support jumbo frames within a VPC.
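Applying and verifying the jumbo-frame variant on a single node, as a sketch (assumes the underlay really carries 9000-byte frames end to end; device names and the pod IP follow the Flannel examples):

# Raise the physical MTU, then leave 50 bytes of headroom on the VTEP
ip link set eth0 mtu 9000
ip link set flannel.1 mtu 8950

# Verify the path: 8922 bytes of payload + 28 bytes ICMP/IP = an 8950-byte inner packet
ping -M do -s 8922 -c 3 10.244.2.8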
Performance Considerations
Without hardware offload, encap/decap costs approximately 2-5 microseconds per packet. A node handling 100,000 pps spends 200-500 ms of CPU per second on encapsulation alone -- roughly 20-50% of a single core.
Hardware offload eliminates most of this. Modern NICs (Intel X710, Mellanox ConnectX-5+, Broadcom NetXtreme) handle VXLAN encap/decap, checksum, and TSO entirely in hardware. Check with ethtool -k eth0 | grep vxlan -- look for tx-udp_tnl-segmentation: on. RSS (Receive Side Scaling) uses the outer UDP source port to distribute packets across CPU cores; without it, all VXLAN traffic hits a single core.
Docker Overlay and WireGuard Encryption
Docker Swarm uses VXLAN for its overlay driver. docker network create --driver overlay --opt com.docker.network.driver.mtu=1450 my-overlay creates the network; Swarm's gossip protocol (SWIM-based) distributes FDB entries automatically. The --opt encrypted flag adds IPSec ESP on top of VXLAN. Each overlay gets a unique VNI for isolation.
WireGuard is replacing IPSec for encrypted overlay traffic -- simpler, faster, and built into the kernel since 5.6. Calico and Cilium both support WireGuard encryption, establishing tunnels between every node pair. The overhead is roughly 60 bytes and 1-3 us per packet. With ChaCha20Poly1305, a single core encrypts at 3-5 Gbps. The advantage over IPSec is operational simplicity: one UDP port, no IKE negotiation, key management handled by the CNI plugin.
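Enabling and verifying WireGuard in Calico, as a sketch (the wireguardEnabled field and the wireguard.cali device name follow current Calico documentation; confirm for your release):

# Turn on node-to-node WireGuard encryption cluster-wide
calicoctl patch felixconfiguration default --type merge --patch '{"spec":{"wireguardEnabled":true}}'

# On any node: the WireGuard interface should list one peer per other node
wg show
ip -d link show wireguard.cali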
Common Questions
Why can pods on the same node talk but not across nodes?
Local traffic goes through the cni0 bridge directly without VXLAN. Cross-node traffic requires: (a) VXLAN device up and configured, (b) FDB entries mapping remote MACs to VTEP IPs, (c) UDP 4789 open between nodes, (d) physical MTU accommodating 50-byte overhead. Any one failing breaks cross-node while local stays fine.
Why are large packets being dropped?
VXLAN adds 50 bytes. A 1500-byte inner frame becomes 1550 on the wire, exceeding the physical 1500 MTU. With DF set (TCP default), the packet is dropped. If ICMP is blocked, the sender never learns and retransmits forever. Fix: inner MTU 1450 or jumbo frames.
VXLAN vs Calico BGP -- which is faster?
Calico BGP has zero encap overhead. Throughput difference is 5-10% on 10Gbps with MTU 1500. Latency difference is 2-5 us without offload, under 1 us with offload. If BGP peering is available, use it. If not, VXLAN works everywhere.
How does a VTEP learn where to send encapsulated packets?
In controller-based VXLAN (Flannel, Calico, Docker Swarm), the CNI plugin programs FDB entries via netlink using cluster topology from the Kubernetes API. In multicast-based VXLAN, the VTEP floods unknown unicast and learns from responses. Container networking almost universally uses controller-based mode.
How Technologies Use This
Two containers on different Swarm worker nodes cannot reach each other over an overlay network. Connection timeouts occur, but containers on the same host communicate fine. The host firewall is open, and docker network ls shows the overlay as healthy.
Docker overlay networking relies on VXLAN encapsulation over UDP port 4789. Each Swarm node runs a VTEP that wraps inner container frames inside outer UDP packets. If the physical network blocks UDP 4789, or if the MTU is 1500 and the inner MTU is not reduced to 1450, oversized encapsulated packets get silently dropped.
Swarm creates a VXLAN tunnel mesh when `docker network create --driver overlay` is used. Each node gets a VTEP bound to a VNI, and the FDB is populated by Swarm's gossip protocol. The fix is ensuring UDP 4789 is open between all nodes and setting `--opt com.docker.network.driver.mtu=1450` on the overlay network.
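Putting the two fixes together on a Swarm cluster (a sketch; the network name my-overlay is a placeholder, and the firewall rule assumes iptables rather than nftables or a cloud security group):

# Allow VXLAN between Swarm nodes
iptables -A INPUT -p udp --dport 4789 -j ACCEPT

# Create the overlay with the inner MTU already reduced for the 50-byte VXLAN overhead
docker network create --driver overlay --opt com.docker.network.driver.mtu=1450 my-overlay

# Verify: encapsulated traffic should appear on the physical NIC during cross-node requests
tcpdump -i eth0 -nn -c 5 udp port 4789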
Pods on node A reach other pods on node A, but cannot reach pods on node B. Flannel VXLAN mode shows no errors, nodes can ping each other. Large HTTP responses fail intermittently while small pings succeed.
Two problems. First, FDB entries on the VXLAN device may be stale or missing after flannel restarts, so the VTEP does not know which outer IP to send to. Second, VXLAN adds 50 bytes overhead, dropping effective MTU from 1500 to 1450. Packets larger than 1450 bytes get dropped when DF is set. Small pings fit; large HTTP payloads fail silently.
Fix: verify FDB entries with `bridge fdb show dev flannel.1` and set pod MTU to 1450 in the Flannel ConfigMap. On cloud providers with jumbo frames (MTU 9001 on AWS), set physical MTU 9000 and pod MTU 8950. Calico BGP mode avoids encapsulation entirely but requires BGP peering.
Same Concept Across Tech
| Concept | Docker Swarm | Kubernetes (Flannel) | Kubernetes (Calico) | AWS VPC CNI | Cilium |
|---|---|---|---|---|---|
| Overlay mechanism | VXLAN with gossip FDB | VXLAN with static FDB entries | VXLAN or IP-in-IP or BGP | No overlay (native VPC IPs) | VXLAN or Geneve |
| VTEP device | auto-created per overlay | flannel.1 | vxlan.calico | N/A | cilium_vxlan |
| FDB programming | Swarm gossip protocol | flanneld daemon | calico-node (Felix) | N/A | Cilium agent (eBPF) |
| Default inner MTU | 1450 | 1450 (manual config) | 1450 (auto-detected) | 9001 (instance MTU) | 1450 or auto |
| Encryption option | IPSec over VXLAN | WireGuard (via Calico) | WireGuard built-in | VPC encryption | WireGuard or IPSec |
Stack Layer Mapping
| Layer | VXLAN Mechanism |
|---|---|
| Hardware (NIC) | VXLAN TX/RX offload, outer checksum offload, RSS on outer UDP src port |
| Kernel (datapath) | vxlan kernel module performs encap/decap, FDB lookup, ARP proxy |
| CNI plugin (control plane) | Programs FDB entries, routes, and ARP entries via netlink |
| Orchestrator | Assigns pod subnets to nodes, distributes network topology |
| Pod / Container | Sees a normal Ethernet interface with reduced MTU, unaware of encapsulation |
Design Rationale
Physical networks are routed L3 for scalability, but containers need L2 adjacency. VXLAN tunnels L2 over L3 UDP. The 24-bit VNI provides 16M segments (solving VLAN exhaustion), and the outer UDP source port hash enables ECMP. Tradeoff: 50 bytes overhead, ~2-5 us encap/decap CPU cost, MTU management complexity.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Pods talk locally but not across nodes | VXLAN tunnel not established or FDB entries missing | bridge fdb show dev flannel.1 for remote VTEP entries |
| Large packets dropped, small packets work | MTU mismatch -- inner MTU not reduced for 50-byte VXLAN overhead | ip link show flannel.1 to check MTU, ping -M do -s 1422 |
| All cross-node traffic fails after node reboot | flannel/calico-node not running, FDB entries not reprogrammed | systemctl status flanneld and ip -d link show type vxlan |
| Intermittent timeouts on cross-node connections | Physical firewall blocking UDP 4789 or ECMP hashing imbalance | tcpdump -i eth0 udp port 4789 on both nodes |
| High CPU on nodes during heavy cross-node traffic | VXLAN encap/decap in software without NIC offload | ethtool -k eth0 and check tx-udp_tnl-segmentation |
| ARP storms flooding the overlay network | ARP suppression not enabled on VTEP, broadcast flooding | ip -d link show vxlan0 and look for the proxy flag |
When to Use / Avoid
Use when:
- Pods or containers on different nodes need L2 adjacency over an L3 physical network
- The physical network cannot be reconfigured to add routes for pod subnets (no BGP peering available)
- Multi-tenant isolation is required -- each tenant gets a unique VNI (up to 16M segments)
- Running Flannel, Calico VXLAN mode, or Docker Swarm overlay networking
- The cluster spans multiple subnets or availability zones with only IP reachability between them
Avoid when:
- The physical network supports BGP and can advertise pod routes directly (Calico BGP mode is faster)
- Performance is critical and the 50-byte overhead or ~2-5 us encap/decap latency is unacceptable
- The environment already uses AWS VPC CNI or similar native IP assignment (no encapsulation needed)
- Multicast is unavailable and the CNI plugin does not handle unicast FDB programming
- The application is sensitive to MTU issues and cannot be configured for 1450-byte packets
Try It Yourself
# Create a VXLAN interface with VNI 42, bound to eth0, remote port 4789
ip link add vxlan0 type vxlan id 42 dev eth0 dstport 4789 local 10.0.0.1 nolearning proxy

# Bring up the VXLAN device and set inner MTU to 1450
ip link set vxlan0 up mtu 1450

# Add a static FDB entry mapping a remote MAC to a remote VTEP IP
bridge fdb append aa:bb:cc:dd:ee:ff dev vxlan0 dst 10.0.0.2

# Add a default gateway entry for BUM (broadcast/unknown/multicast) traffic
bridge fdb append 00:00:00:00:00:00 dev vxlan0 dst 10.0.0.2

# Show all FDB entries on the VXLAN device
bridge fdb show dev vxlan0

# Display detailed VXLAN device parameters (VNI, port, learning mode)
ip -d link show vxlan0

# Capture VXLAN-encapsulated packets on the physical interface
tcpdump -i eth0 udp port 4789 -nn -c 20

# Capture inner overlay traffic on the VTEP device (decapsulated)
tcpdump -i flannel.1 -nn -c 20

# Check VXLAN hardware offload support on the NIC
ethtool -k eth0 | grep -i vxlan

# Show VXLAN device statistics (TX/RX packets, errors, drops)
ip -s link show flannel.1

Debug Checklist
1. ip -d link show type vxlan -- list all VXLAN devices with VNI, port, and local IP
2. bridge fdb show dev flannel.1 -- verify FDB entries map inner MACs to correct outer IPs
3. tcpdump -i eth0 udp port 4789 -c 10 -nn -- capture encapsulated VXLAN traffic on physical NIC
4. tcpdump -i flannel.1 -c 10 -nn -- capture decapsulated overlay traffic on VTEP
5. ip -s link show flannel.1 -- check TX/RX counters and error counts on VTEP
6. ping -M do -s 1422 <remote-pod-ip> -- test path MTU (1422 + 28 ICMP/IP = 1450 inner)
7. ethtool -k eth0 | grep vxlan -- check hardware offload status for VXLAN encap/decap
8. iptables -L -n | grep 4789 -- verify firewall allows UDP 4789 between nodes
9. cat /sys/class/net/flannel.1/mtu -- confirm VXLAN device MTU is 1450
Key Takeaways
- ✓VXLAN encapsulation adds exactly 50 bytes: 14 (outer Ethernet header) + 20 (outer IPv4 header) + 8 (UDP header) + 8 (VXLAN header with 24-bit VNI). This means inner MTU must be reduced from 1500 to 1450 on standard networks. Failing to adjust the MTU is the single most common cause of mysterious connectivity failures in overlay networks.
- ✓The FDB is the control plane of VXLAN. Without correct FDB entries, the VTEP does not know which remote node to send encapsulated frames to. In multicast-based VXLAN, the FDB learns entries by flooding BUM (Broadcast, Unknown unicast, Multicast) traffic to a multicast group. In controller-based VXLAN (Flannel, Docker Swarm), entries are programmed directly via netlink, eliminating the need for multicast on the physical network.
- ✓The outer UDP source port is not random -- it is derived from a hash of the inner packet headers. This is critical for performance. Physical switches and routers use the outer 5-tuple for ECMP hashing, so varying the source port distributes VXLAN traffic across multiple physical links. Without this, all overlay traffic between two nodes follows a single path.
- ✓VXLAN operates at Layer 2 inside Layer 3. This means the overlay provides L2 adjacency between pods on different nodes, even though the physical network between those nodes is purely L3 routed. Broadcast, ARP, and multicast all work inside the overlay as if every pod were on the same Ethernet segment.
- ✓Hardware offload (Intel X710, Mellanox ConnectX-5+) handles VXLAN encap/decap entirely in the NIC, including outer checksums and TSO. Check with 'ethtool -k eth0 | grep vxlan' -- look for tx-udp_tnl-segmentation and rx-udp_tnl-segmentation.
Common Pitfalls
- ✗Mistake: Leaving inner MTU at 1500 when physical MTU is also 1500. Reality: VXLAN adds 50 bytes, so 1500-byte inner frames become 1550 outer frames exceeding physical MTU. DF-set packets are silently dropped. Set inner MTU to 1450, or physical MTU to 9000 and inner to 8950.
- ✗Mistake: Blocking UDP 4789 in the host firewall while expecting overlay networking to work. Reality: all cross-node overlay traffic fails silently. Node-to-node ping still works, making it look like an application bug.
- ✗Mistake: Assuming stale FDB entries clean up automatically. Reality: during restarts or upgrades, entries can point to non-existent VTEPs. Inspect with 'bridge fdb show dev flannel.1' and remove with 'bridge fdb del'.
- ✗Mistake: Using VXLAN multicast mode in the cloud. Reality: AWS, GCP, Azure do not support IP multicast. BUM flooding fails silently. Use unicast mode or let the CNI plugin manage FDB entries directly.
Reference
In One Line
VXLAN wraps L2 frames inside UDP packets so pods on different L3 nodes share a virtual Ethernet segment; set inner MTU to 1450 or everything breaks.