DNS Failure Patterns
Why DNS Fails
DNS looks simple until it breaks. You have a hierarchy of caches, each with its own TTL, each operated by a different party, and none of them coordinated. Your authoritative nameserver goes down and you're at the mercy of whatever TTL you set last week.
The most common DNS failure patterns fall into three buckets: provider outages (your DNS host goes down), propagation delays (you made a change and it hasn't reached everyone), and resolver bugs (the thing doing the lookup is broken). Each one requires a different response.
The NXDOMAIN Problem
When a DNS record disappears, resolvers don't just fail. They cache the negative response. Per RFC 2308, NXDOMAIN results are cached for the lesser of the SOA record's own TTL and its MINIMUM field, values that many zone templates set to an hour or more. This means even after you fix the root cause, clients keep getting "domain not found" long after the record is back.
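You can check what negative-caching TTL your zone advertises by reading its SOA record. A minimal sketch with dnspython, assuming a placeholder zone name; note that a recursive resolver reports the remaining cached TTL, so point the resolver at your authoritative servers for the configured value:

```python
import dns.resolver

def negative_cache_ttl(zone: str) -> int:
    """Negative-caching TTL per RFC 2308: the lesser of the SOA record's
    own TTL and its MINIMUM field."""
    answer = dns.resolver.resolve(zone, "SOA")
    soa = answer[0]
    return min(answer.rrset.ttl, soa.minimum)

# Placeholder zone name for illustration.
print(negative_cache_ttl("example.com"))
```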
Aggressive negative caching in ISP resolvers makes this worse. Some ISPs cache NXDOMAIN for up to 24 hours regardless of what your SOA says. There's nothing you can do about this from your side.
Multi-Provider DNS
Running two DNS providers is the single most effective mitigation. Tools like OctoDNS (GitHub's tool) and DNSControl (Stack Overflow's tool) sync zone files across providers automatically. The setup takes a day. The alternative is hoping your single provider never goes down.
The catch: NS record TTLs at the registrar level are typically 48 hours. If your primary provider fails and you need to remove its NS records, that change takes days to propagate. You need both providers active and answering queries at all times, not as a cold standby.
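With both providers live, it's worth continuously checking that they answer identically. A rough sketch using dnspython, with placeholder nameserver IPs standing in for each provider's authoritative servers:

```python
import dns.resolver

# Placeholder authoritative nameserver IPs for each provider.
PROVIDERS = {
    "provider-a": ["198.51.100.53"],
    "provider-b": ["203.0.113.53"],
}

def answers_match(name: str, rdtype: str = "A") -> bool:
    """Query each provider's servers directly and compare the record sets."""
    seen = {}
    for provider, nameservers in PROVIDERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = nameservers
        answer = resolver.resolve(name, rdtype, lifetime=3.0)
        seen[provider] = sorted(str(rr) for rr in answer)
    values = list(seen.values())
    return all(v == values[0] for v in values)

print(answers_match("www.example.com"))  # placeholder record
```

Running a check like this from the same pipeline that pushes zone changes catches drift at deploy time rather than during an incident.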
Internal DNS Failures
Kubernetes clusters run CoreDNS, and it's a single point of failure that people forget about. A CoreDNS pod OOM kill, a bad ConfigMap update, or ndots:5 causing excessive queries can bring down every service in the cluster. Monitor CoreDNS memory usage, query rates, and error rates separately from your external DNS monitoring.
Set ndots: 2 in your pod DNS config. The default ndots: 5 makes the resolver walk the full search path (typically namespace.svc.cluster.local, svc.cluster.local, cluster.local, plus the node's search domains) before trying an external name as-is, so one external lookup fans out into four or more queries, each doubled for A and AAAA. The sketch below illustrates the amplification.
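This is standard resolv.conf behavior rather than anything Kubernetes-specific, so it's easy to model. A small illustration, where the search path and namespace are placeholders and real resolvers also retry each candidate for both A and AAAA:

```python
# Typical in-cluster search path; "myns" stands in for the pod's namespace.
SEARCH_PATH = [
    "myns.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

def candidate_queries(name: str, ndots: int) -> list[str]:
    """Names a resolv.conf-style resolver tries, in order, for a non-absolute name."""
    as_given = name + "."
    suffixed = [f"{name}.{domain}." for domain in SEARCH_PATH]
    if name.count(".") >= ndots:
        return [as_given] + suffixed   # enough dots: try the name as-is first
    return suffixed + [as_given]       # too few dots: walk the search path first

print(candidate_queries("api.example.com", ndots=5)[0])  # a cluster.local suffix: wasted lookups before the real one
print(candidate_queries("api.example.com", ndots=2)[0])  # 'api.example.com.': resolved on the first try
```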
Recovery Patterns
When DNS fails, you need parallel recovery tracks. Track one: fix the DNS issue (restore the provider, correct the record, restart the resolver). Track two: bypass DNS entirely for critical paths using IP addresses, /etc/hosts entries, or service mesh direct routing. Track two buys you time while track one deals with propagation delays.
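Track two can be as blunt as a pinned-address map that critical clients consult when resolution fails. A sketch of the idea; the hostnames and addresses are placeholders, and a real list would live in regularly audited config rather than in code:

```python
import socket

# Placeholder pinned addresses for critical internal endpoints.
FALLBACK_IPS = {
    "payments.internal.example.com": "10.0.12.34",
    "auth.internal.example.com": "10.0.12.35",
}

def resolve_with_fallback(host: str, port: int = 443) -> str:
    """Resolve via DNS, falling back to a pinned address if resolution fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return infos[0][4][0]          # first resolved address
    except socket.gaierror:
        if host in FALLBACK_IPS:
            return FALLBACK_IPS[host]  # track two: bypass DNS entirely
        raise
```

If you then connect to a pinned IP over TLS, keep passing the original hostname for SNI and certificate verification; the bypass replaces the lookup, not the identity check.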
After recovery, audit every TTL in your zones. Any record with a TTL above 300 seconds for a critical service is a risk you're choosing to accept.
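The audit is easy to script. A sketch with dnspython, assuming a hand-maintained list of critical records; a recursive resolver reports the remaining cached TTL, so aim it at your authoritative servers to see the configured value:

```python
import dns.resolver

# Placeholder list of critical records; in practice, generate it from your zone data.
CRITICAL_RECORDS = [("api.example.com", "A"), ("example.com", "MX")]
MAX_TTL = 300

def audit_ttls() -> None:
    resolver = dns.resolver.Resolver()   # point .nameservers at your authoritative servers
    for name, rdtype in CRITICAL_RECORDS:
        answer = resolver.resolve(name, rdtype)
        if answer.rrset.ttl > MAX_TTL:
            print(f"WARN {name} {rdtype}: TTL {answer.rrset.ttl}s exceeds {MAX_TTL}s")

audit_ttls()
```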
Incident Timeline
- T+0m: DNS provider starts returning SERVFAIL for primary domain. External monitoring catches it before internal alerts fire.
- T+2m: Cached DNS records still serving traffic. Services with short TTLs (30s) begin failing first. API gateway health checks start failing.
- T+5m: Customer-facing error rates spike to 40%. Support tickets flood in. On-call SRE paged via PagerDuty.
- T+10m: Team identifies DNS as root cause. Attempts to switch to secondary DNS provider begin. TTL on NS records is 48 hours, blocking fast failover.
- T+15m: Hotfix deployed: hardcoded IP addresses in critical service configs as temporary bypass. Partial traffic recovery for known endpoints.
- T+30m: Secondary DNS provider fully propagated for zones with low TTLs. Full recovery takes 4-6 hours due to cached NXDOMAIN responses in ISP resolvers.
Detection Signals
- NXDOMAIN response rate exceeding baseline by 10x in resolver logs
- Spike in connection timeout errors across multiple unrelated services simultaneously
- External synthetic monitoring failures from multiple geographic regions
- DNS query latency exceeding 500ms from internal resolvers (see the probe sketch after this list)
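A lightweight synthetic probe can emit the last two signals directly. A sketch with dnspython, using placeholder resolver IPs and probe name and the 500ms budget from above:

```python
import time
import dns.exception
import dns.resolver

RESOLVERS = ["10.0.0.2", "10.0.0.3"]   # placeholder internal resolver IPs
LATENCY_BUDGET = 0.5                   # seconds, matching the signal above

def probe(name: str = "health.example.com") -> None:
    for ip in RESOLVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        start = time.monotonic()
        try:
            resolver.resolve(name, "A", lifetime=2.0)
            status = "ok"
        except dns.resolver.NXDOMAIN:
            status = "nxdomain"        # feeds the NXDOMAIN-rate signal
        except dns.exception.DNSException:
            status = "error"
        elapsed = time.monotonic() - start
        if status != "ok" or elapsed > LATENCY_BUDGET:
            print(f"ALERT resolver={ip} status={status} latency={elapsed:.3f}s")

probe()
```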
Prevention
- Run at least two DNS providers (e.g., Route 53 + Cloudflare) with automated zone sync using tools like OctoDNS or DNSControl
- Set TTLs to 60 seconds for critical records. The performance cost of low TTLs is negligible compared to hours-long outages
- Maintain a list of hardcoded IP fallbacks for internal service-to-service communication that bypasses DNS
- Monitor DNS resolution from multiple vantage points using ThousandEyes or Catchpoint, not just from your own infrastructure
- Run internal DNS (CoreDNS, Unbound) with aggressive caching and serve-stale configured
Key Points
- DNS failures are uniquely catastrophic because they affect every service simultaneously, unlike most infrastructure failures that are scoped to a single component
- NXDOMAIN responses get cached by recursive resolvers, meaning recovery takes far longer than the original outage
- The 2016 Dyn DDoS attack took down Twitter, GitHub, Netflix, and Reddit because they relied on a single DNS provider
- Internal service mesh DNS (Kubernetes CoreDNS) fails differently than external DNS. A CoreDNS OOM kill can take down an entire cluster
- DNS-over-HTTPS and DNS-over-TLS add new failure modes that traditional monitoring misses
Common Mistakes
- ✗ Setting DNS TTLs to 24 hours 'for performance' and then being unable to failover during an incident
- ✗ Testing DNS failover only in staging where TTLs and caching behavior differ from production
- ✗ Assuming your cloud provider's DNS is redundant enough on its own without a secondary provider
- ✗ Forgetting that certificate validation depends on DNS, so DNS failures also break TLS handshakes for new connections