Netfilter & nftables/iptables
Mental Model
An airport with five checkpoints. The entrance (PREROUTING) checks the ticket and may redirect a traveler to a different terminal before anyone decides where they go. The routing desk reads the destination: staying here (INPUT) or connecting to another airport (FORWARD). The arrivals gate (INPUT) only lets ticketed passengers through. The transit lounge (FORWARD) screens connecting travelers. Departures originating here board through their own gate (OUTPUT). The exit (POSTROUTING) stamps a return address on the passport so replies find their way back. A logbook at the entrance (conntrack) tracks every traveler -- but the logbook has a fixed number of pages. Once full, new arrivals are turned away at the door.
The Problem
Ten million packets per second across 5,000 Kubernetes Services. kube-proxy in iptables mode maintains 20,000+ rules walked linearly on every packet -- 5-10 ms of added latency, 15-20% of node CPU. During traffic spikes the conntrack table fills at 262,144 entries and silently drops new connections with no application-level error. Docker containers on private 172.17.0.x addresses are invisible to external clients without NAT rules most operators never inspect. And flushing firewall rules does not clear existing conntrack entries, so "blocked" connections keep flowing.
Architecture
Every packet that enters a Linux server passes through a gauntlet. Five checkpoints, each one a chance to inspect, modify, redirect, or drop it. This gauntlet is Netfilter, and most systems are already using it -- whether the operator knows it or not.
Docker uses it to make container ports reachable. Kubernetes uses it to load-balance traffic across pods. The host firewall is built on it. And when any of those things mysteriously break, the answer is almost always hiding in the Netfilter rules.
The hardest part isn't the rules themselves. It's understanding the order they run in.
What Actually Happens
When a packet arrives, it hits PREROUTING first. The raw table can mark it for NOTRACK (skip connection tracking). Then conntrack creates or updates a connection entry. Then mangle can modify headers. Then NAT applies DNAT -- changing the destination address before the routing decision happens.
The routing subsystem checks: is this packet for me (local delivery) or for somewhere else (forwarding)?
Local packets go to INPUT, where filter rules decide whether to accept or drop. Forwarded packets go to FORWARD for the same decision. Locally generated packets start at OUTPUT.
Everything leaving the system passes through POSTROUTING, where SNAT and MASQUERADE rewrite source addresses.
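This ordering is why even a simple port forward needs rules in three different chains. A minimal sketch, assuming an external interface eth0 and illustrative addresses (10.0.0.5 as the internal server, 10.0.0.1 as this host's internal address):

```shell
# Forward host port 8080 to an internal web server at 10.0.0.5:80.

# DNAT in PREROUTING: rewrite the destination *before* routing, so the
# kernel routes toward 10.0.0.5 instead of delivering locally.
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 8080 \
    -j DNAT --to-destination 10.0.0.5:80

# The rewritten packet now traverses FORWARD, not INPUT -- allow it there.
iptables -A FORWARD -d 10.0.0.5 -p tcp --dport 80 -j ACCEPT

# SNAT in POSTROUTING: rewrite the source *after* routing, so replies from
# 10.0.0.5 return through this host instead of going straight to the client.
iptables -t nat -A POSTROUTING -d 10.0.0.5 -p tcp --dport 80 \
    -j SNAT --to-source 10.0.0.1
```

Put the DNAT rule anywhere after the routing decision and it silently never matches.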
Connection tracking is the engine underneath. Conntrack maintains a hash table of every active connection. For TCP, it tracks the full state machine. For UDP and ICMP, it creates pseudo-connections based on timeouts. This is what makes stateful firewalling possible -- instead of writing separate rules for request and response packets, a single rule handles it and conntrack automatically allows the return traffic.
But conntrack is also the most expensive part of Netfilter. Every packet triggers a hash table lookup and potential update. At high packet rates, this adds 5-10% CPU overhead.
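State matching is what keeps stateful rulesets small. A baseline sketch (not a complete firewall policy):

```shell
# One rule covers all return traffic for every connection conntrack knows.
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT

# With return traffic handled, only NEW connections need explicit rules.
iptables -A INPUT -p tcp --dport 22 -m conntrack --ctstate NEW -j ACCEPT

# Everything else is dropped. A stateless firewall would instead need
# mirror rules for every request/response direction.
iptables -A INPUT -j DROP
```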
Under the Hood
iptables vs nftables. iptables (since Linux 2.4) uses a fixed structure: five tables (filter, nat, mangle, raw, security), each with built-in chains at specific hooks. Rules in a chain are matched linearly -- first match wins. A chain with 10,000 rules checks each one sequentially per packet.
nftables (since Linux 3.13) rethinks this. User-defined tables with arbitrary chains. A register-based virtual machine for matching. Native support for sets and maps -- meaning O(1) hash lookups instead of O(n) linear scans. And atomic ruleset updates: the entire ruleset swaps in a single transaction, eliminating the window of inconsistency that iptables has during multi-rule changes.
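The difference shows up directly in ruleset shape: where iptables needs one rule per allowed port, an nftables set matches them all in a single lookup. A sketch of the syntax (table and set names are illustrative):

```shell
nft add table inet demo
nft add chain inet demo input '{ type filter hook input priority 0; policy drop; }'

# A named set of ports -- membership is a hash lookup, not a rule walk
nft add set inet demo allowed_ports '{ type inet_service; }'
nft add element inet demo allowed_ports '{ 22, 80, 443, 8080 }'

# One rule replaces N linear iptables rules
nft add rule inet demo input tcp dport @allowed_ports accept
```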
The conntrack table is finite. nf_conntrack_max defaults to 262,144 entries. When it fills up, new connections are silently dropped and dmesg shows "nf_conntrack: table full, dropping packet." Each entry costs roughly 350 bytes, so 1 million entries is about 350 MB. Conventional sizing keeps max/buckets at 4, the kernel's historical default ratio.
This is one of the most common causes of mysterious connection failures in Docker and Kubernetes environments. Conntrack entries from TIME_WAIT connections persist for 120 seconds (double the TCP TIME_WAIT), which makes the problem worse under high connection rates.
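The memory math is worth doing before raising the limit. A back-of-the-envelope check using the ~350-byte figure above (actual entry size varies by kernel version):

```shell
entry_bytes=350
default_max=262144

echo "default limit: $(( default_max * entry_bytes / 1000 / 1000 )) MB"   # ~91 MB
echo "1M entries:    $(( 1000000 * entry_bytes / 1000 / 1000 )) MB"       # ~350 MB

# Bucket count for the conventional max/buckets = 4 ratio
echo "buckets for 1M entries: $(( 1000000 / 4 ))"
```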
MASQUERADE vs SNAT. Both rewrite the source address in POSTROUTING. SNAT uses a fixed source IP -- efficient, predictable. MASQUERADE dynamically looks up the outgoing interface's IP per connection -- necessary when the host IP can change (DHCP, cloud instances). Docker and Kubernetes use MASQUERADE for container egress.
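Side by side (interface name and public address are illustrative):

```shell
# SNAT: fixed source address, no per-connection lookup -- use when the
# host's public IP is static.
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 -o eth0 \
    -j SNAT --to-source 203.0.113.10

# MASQUERADE: resolves eth0's current address per connection -- survives a
# DHCP lease change or cloud re-IP, at a small extra cost per connection.
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 -o eth0 -j MASQUERADE
```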
Common Questions
Why does Docker need iptables?
Docker creates four types of rules: (1) MASQUERADE in nat/POSTROUTING so container traffic uses the host's IP, (2) DNAT in nat/PREROUTING for published ports (-p 8080:80), (3) FORWARD rules in filter to control container-to-container communication, (4) a DOCKER chain for isolation. Each container with published ports adds ~6 iptables rules.
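The rules Docker installs for -p 8080:80 look roughly like this (the container address is illustrative; inspect the real rules with iptables -t nat -L -n -v):

```shell
# Egress NAT: container traffic leaves with the host's IP
iptables -t nat -A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE

# Published port: traffic to host:8080 is DNATed to the container
iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport 8080 \
    -j DNAT --to-destination 172.17.0.2:80

# Filter: allow the forwarded traffic into the bridge
iptables -A DOCKER -d 172.17.0.2 ! -i docker0 -o docker0 \
    -p tcp --dport 80 -j ACCEPT
```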
How does kube-proxy implement Services with iptables?
For each ClusterIP Service, kube-proxy creates a DNAT chain reached from nat/PREROUTING. It randomly selects an endpoint using iptables -m statistic --mode random --probability rules. A Service with 3 endpoints uses 3 rules with probabilities 1/3, 1/2, and 1/1, evaluated in sequence so each endpoint receives an equal share of traffic. At 5,000+ Services, the O(n) rule traversal per packet becomes a real bottleneck -- which is why IPVS mode and eBPF (Cilium) exist.
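The sequential probabilities look uneven but balance out, because each rule only sees the packets that earlier rules passed over. A quick check of the effective share per endpoint:

```shell
awk 'BEGIN {
    p1 = 1.0 / 3               # rule 1: 1/3 of all packets
    p2 = (1 - p1) * 1.0 / 2    # rule 2: 1/2 of the remaining 2/3
    p3 = 1 - p1 - p2           # rule 3: whatever is left
    printf "%.3f %.3f %.3f\n", p1, p2, p3
}'
# Each endpoint receives an equal 1/3 share.
```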
What happens when the conntrack table fills up?
New connections get silently dropped. SYN packets disappear. dmesg shows "nf_conntrack: table full, dropping packet." conntrack -S shows an incrementing drop counter. Fix: increase nf_conntrack_max and buckets, reduce timeouts (nf_conntrack_tcp_timeout_established defaults to 5 days), or bypass conntrack for high-volume flows using the raw table's NOTRACK target.
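Those fixes translate to a handful of sysctls and one raw-table rule. The values here are illustrative starting points, not universal recommendations:

```shell
# Raise the limit; resize buckets to keep the conventional max/4 ratio
sysctl -w net.netfilter.nf_conntrack_max=1048576
echo 262144 > /sys/module/nf_conntrack/parameters/hashsize

# Shrink the established-flow timeout from its 5-day default to 1 day
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400

# Bypass conntrack entirely for a high-volume stateless flow (e.g. DNS)
iptables -t raw -A PREROUTING -p udp --dport 53 -j NOTRACK
iptables -t raw -A OUTPUT -p udp --dport 53 -j NOTRACK
```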
How to debug a packet being dropped by Netfilter?
Four approaches: (1) Check rule hit counters with iptables -L -n -v -- zero-count rules never matched. (2) Use TRACE: iptables -t raw -A PREROUTING -p tcp --dport 80 -j TRACE logs every table/chain/rule the packet traverses to dmesg. (3) nftables trace: nft add rule .. meta nftrace set 1 with nft monitor trace. (4) Check conntrack state: conntrack -L -d <dst_ip>.
How Technologies Use This
A container port is published with -p 8080:80, but external clients cannot reach it. The container runs on a private 172.17.0.x IP that no external router knows about. At scale, a host with 200+ containers shows measurably higher per-packet latency as the kernel walks a growing chain of iptables rules on every incoming packet.
Without a DNAT rule in the nat/PREROUTING chain, the packet destination remains the host's public IP and the routing decision delivers it to a local process, never reaching the container. Docker inserts a DNAT rule that rewrites the destination to the container's internal IP before routing. A MASQUERADE rule in POSTROUTING rewrites the container's source address on outbound traffic so return packets find their way back. Each container with published ports adds roughly 6 iptables rules.
For small deployments, the iptables approach works fine. At 200+ containers, the linear O(n) rule walk per packet adds measurable latency. Docker now supports nftables as a backend, replacing linear chain evaluation with O(1) set-based lookups. Always verify rules with iptables -t nat -L -n -v when container ports appear unreachable.
A Kubernetes cluster grows to 5,000 Services and node CPU spikes to 15-20% spent entirely on packet processing. Per-packet latency jumps by 5-10ms. Profiling reveals the kernel walking through 20,000+ iptables rules on every single packet to find the right DNAT target for service routing.
The kube-proxy iptables mode creates O(n) rules per Service, each using probability-based random selection for load balancing. At 5,000 Services, that is 20,000+ rules evaluated linearly per packet. This is the iptables scaling wall: the cost grows linearly with the number of services, and there is no way to optimize the chain walk within the iptables framework.
IPVS mode replaces the linear chain with a kernel-level hash table that resolves service routing in O(1), cutting per-packet overhead to microseconds regardless of service count. Cilium goes further by replacing kube-proxy entirely with eBPF programs attached at the TC and XDP hooks, eliminating conntrack overhead for east-west traffic and reducing service routing latency by 40-60% compared to iptables mode.
A Go application inside a Kubernetes pod dials a ClusterIP service address and the connection times out. The ClusterIP is a virtual IP that no real pod owns; without DNAT rules translating it to an actual pod endpoint, the packet enters a black hole. The Go application never touches Netfilter directly, yet its connectivity depends entirely on it.
Every outbound packet from the pod hits DNAT rules in PREROUTING that rewrite the ClusterIP to an actual pod endpoint, and MASQUERADE rules in POSTROUTING that handle egress NAT. In iptables mode, each rule is evaluated linearly, so 1,000+ services means 1,000+ rules walked per packet. The Go application sees this as elevated connection latency and increased CPU usage on the node, without any visibility into the cause.
Cilium, itself written in Go, replaces these iptables rules by attaching eBPF programs at TC ingress/egress (and optionally XDP) hooks. BPF hash map lookups resolve service destinations in O(1) instead of O(n) linear chain walks, cutting per-packet CPU cost by 30-50% in clusters with 1,000+ services.
Same Concept Across Tech
| Concept | Docker | JVM | Node.js | Go | K8s |
|---|---|---|---|---|---|
| Port mapping | -p flag creates DNAT + MASQUERADE rules | N/A (OS-level) | N/A (OS-level) | N/A (OS-level) | NodePort/LoadBalancer create DNAT via kube-proxy |
| Firewall rules | docker network + iptables DOCKER chain | SecurityManager (deprecated) | N/A | N/A | NetworkPolicy (CNI enforces via iptables/eBPF) |
| Connection tracking | Host conntrack shared across containers | N/A | N/A | N/A | conntrack per node, shared across all pods |
| Service routing | N/A (single host) | N/A | N/A | N/A | kube-proxy iptables/IPVS/eBPF modes |
| NAT bypass | Host networking mode (--net=host) | N/A | N/A | N/A | hostNetwork: true in pod spec |
Stack Layer Mapping
| Layer | Netfilter Mechanism |
|---|---|
| NIC driver | Packet received, passed to network stack |
| PREROUTING hook | Raw table, conntrack, mangle, NAT DNAT |
| Routing decision | Local delivery (INPUT) vs forwarding (FORWARD) |
| INPUT/FORWARD hooks | Filter rules, security module checks |
| OUTPUT hook | Locally generated packets: raw, conntrack, filter |
| POSTROUTING hook | SNAT, MASQUERADE for egress rewriting |
Design Rationale
Five hooks because NAT has to happen at specific points relative to the routing decision -- DNAT before routing so the destination is correct when the route is looked up, SNAT after routing so the source matches the outgoing interface. Filtering belongs at the input, forward, and output stages. Connection tracking was separated out because stateful matching (ESTABLISHED, RELATED) eliminates over 90% of explicit rules a stateless firewall would need. nftables replaced iptables because linear chain evaluation is O(n) by design and cannot be fixed -- hash-based sets make large rulesets viable at the scale Kubernetes demands.
If You See This, Think This
| Symptom | Likely Cause | First Check |
|---|---|---|
| Published container port unreachable | Missing DNAT rule in nat/PREROUTING | iptables -t nat -L -n -v |
| "nf_conntrack: table full" in dmesg | Conntrack table exhausted (default 262,144) | conntrack -C and sysctl net.netfilter.nf_conntrack_max |
| Connections persist after rule flush | Conntrack entries not cleared | conntrack -F to flush conntrack alongside rule changes |
| High CPU on packet processing | Linear iptables rule evaluation at scale | iptables -L -n -v --line-numbers (count rules) |
| Container cannot reach external network | Missing MASQUERADE rule in POSTROUTING | iptables -t nat -L POSTROUTING -n -v |
| Service routing latency spikes in K8s | kube-proxy iptables mode with 5000+ Services | Check kube-proxy mode: kubectl get cm -n kube-system kube-proxy |
When to Use / Avoid
Use when:
- Building host-level firewalls to control inbound/outbound traffic
- Debugging Docker published ports or Kubernetes Service connectivity
- Implementing NAT for container egress (MASQUERADE) or port forwarding (DNAT)
- Diagnosing connection drops from conntrack table exhaustion
- Auditing security rules before/after deployment changes
Avoid when:
- Packet processing at 10M+ pps requires kernel-bypass (use XDP or DPDK instead)
- Running 5,000+ Kubernetes Services (switch to IPVS mode or Cilium eBPF)
- Simple application-level filtering suffices (use socket-level ACLs, not iptables)
Try It Yourself
# List all iptables rules with counters
iptables -L -n -v --line-numbers

# Show NAT table rules (Docker port mappings)
iptables -t nat -L -n -v

# Check conntrack table size and current usage
sysctl net.netfilter.nf_conntrack_max
conntrack -C

# Show conntrack statistics (drops = table full)
conntrack -S

# List all nftables rules
nft list ruleset

# Trace packet through netfilter (nftables)
nft add rule inet filter input meta nftrace set 1
nft monitor trace
Debug Checklist
1. iptables -L -n -v --line-numbers -- show filter rules with hit counters
2. iptables -t nat -L -n -v -- show NAT table rules (Docker/K8s DNAT)
3. conntrack -C -- count total conntrack entries
4. conntrack -S -- show per-CPU stats including drops
5. sysctl net.netfilter.nf_conntrack_max -- check conntrack table limit
6. dmesg | grep conntrack -- check for 'table full' messages
Key Takeaways
- ✓Order matters more than anything else. PREROUTING runs DNAT before the routing decision. INPUT catches locally-bound traffic. FORWARD handles transit. OUTPUT intercepts locally-generated packets. POSTROUTING does SNAT after routing. Put a rule in the wrong chain and it silently never matches.
- ✓Connection tracking (conntrack) is the most expensive part of Netfilter -- a hash table lookup and update on every packet adds 5-10% overhead at high packet rates. The raw table's NOTRACK target bypasses conntrack for specific flows when you don't need stateful tracking.
- ✓The conntrack table has a hard limit (nf_conntrack_max, default 262144). When it's full, new connections are silently dropped. Each entry costs ~350 bytes. You'll see 'nf_conntrack: table full' in dmesg -- one of the most common causes of mysterious connection failures in containerized environments.
- ✓iptables checks rules linearly -- 10,000 rules means 10,000 checks per packet. nftables uses hash-based sets and maps for O(1) lookups, making large rulesets orders of magnitude faster. This is why Kubernetes at scale can't use iptables mode.
- ✓Kubernetes kube-proxy in iptables mode creates O(n) rules per Service. At 5000+ services, rule evaluation adds measurable latency to every packet. IPVS mode and eBPF (Cilium) avoid this scaling wall entirely.
Common Pitfalls
- ✗Mistake: adding DNAT rules in the FORWARD chain. Reality: DNAT must happen in PREROUTING, before the routing decision. By the time a packet reaches FORWARD, the destination is already resolved and DNAT is ignored.
- ✗Mistake: mixing iptables and nftables without understanding shared state. Reality: both use the same kernel conntrack subsystem. Mixed rules on the same system cause unexpected interactions.
- ✗Mistake: ignoring conntrack entries from TIME_WAIT connections. Reality: conntrack entries persist for tcp_timeout_time_wait (default 120 seconds) -- double the TCP TIME_WAIT. This exacerbates table exhaustion under high connection rates.
- ✗Mistake: flushing iptables rules without flushing conntrack. Reality: existing connections continue through old conntrack entries even after rules are removed. Use 'conntrack -F' alongside rule changes.
Reference
In One Line
DNAT goes in PREROUTING, SNAT in POSTROUTING -- get the hook wrong and the rule silently never matches -- and always watch conntrack -C before blaming the application for dropped connections.