DNS Failure Patterns
Why DNS Fails
DNS looks simple until it breaks. You have a hierarchy of caches, each with its own TTL, each operated by a different party, and none of them coordinated. Your authoritative nameserver goes down and you're at the mercy of whatever TTL you set last week.
The most common DNS failure patterns fall into three buckets: provider outages (your DNS host goes down), propagation delays (you made a change and it hasn't reached everyone), and resolver bugs (the thing doing the lookup is broken). Each one requires a different response.
The NXDOMAIN Problem
When a DNS record disappears, resolvers don't just fail. They cache the negative response. Per RFC 2308, NXDOMAIN results are cached for the lesser of the SOA record's own TTL and its MINIMUM field, values that many zone templates set to an hour or more. This means even after you fix the root cause, clients keep getting "domain not found" long after the record is back.
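You can check what negative-caching TTL your zone advertises by reading its SOA record. A minimal sketch with dnspython, assuming a placeholder zone name; note that a recursive resolver reports the remaining cached TTL, so point the resolver at your authoritative servers for the configured value:

```python
import dns.resolver

def negative_cache_ttl(zone: str) -> int:
    """Negative-caching TTL per RFC 2308: the lesser of the SOA record's
    own TTL and its MINIMUM field."""
    answer = dns.resolver.resolve(zone, "SOA")
    soa = answer[0]
    return min(answer.rrset.ttl, soa.minimum)

# Placeholder zone name for illustration.
print(negative_cache_ttl("example.com"))
```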
Aggressive negative caching in ISP resolvers makes this worse. Some ISPs cache NXDOMAIN for up to 24 hours regardless of what your SOA says. There's nothing you can do about this from your side.
Multi-Provider DNS
Running two DNS providers is the single most effective mitigation. Tools like OctoDNS (GitHub's tool) and DNSControl (Stack Overflow's tool) sync zone files across providers automatically. The setup takes a day. The alternative is hoping your single provider never goes down.
The catch: NS record TTLs at the registrar level are typically 48 hours. If your primary provider fails and you need to remove its NS records, that change takes days to propagate. You need both providers active and answering queries at all times, not as a cold standby.
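With both providers live, it's worth continuously checking that they answer identically. A rough sketch using dnspython, with placeholder nameserver IPs standing in for each provider's authoritative servers:

```python
import dns.resolver

# Placeholder authoritative nameserver IPs for each provider.
PROVIDERS = {
    "provider-a": ["198.51.100.53"],
    "provider-b": ["203.0.113.53"],
}

def answers_match(name: str, rdtype: str = "A") -> bool:
    """Query each provider's servers directly and compare the record sets."""
    seen = {}
    for provider, nameservers in PROVIDERS.items():
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = nameservers
        answer = resolver.resolve(name, rdtype, lifetime=3.0)
        seen[provider] = sorted(str(rr) for rr in answer)
    values = list(seen.values())
    return all(v == values[0] for v in values)

print(answers_match("www.example.com"))  # placeholder record
```

Running a check like this from the same pipeline that pushes zone changes catches drift at deploy time rather than during an incident.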
Internal DNS Failures
Kubernetes clusters run CoreDNS, and it's a single point of failure that people forget about. A CoreDNS pod OOM kill, a bad ConfigMap update, or ndots:5 causing excessive queries can bring down every service in the cluster. Monitor CoreDNS memory usage, query rates, and error rates separately from your external DNS monitoring.
Set ndots: 2 in your pod DNS config. The default ndots: 5 makes the resolver walk the full search path (typically namespace.svc.cluster.local, svc.cluster.local, cluster.local, plus the node's search domains) before trying an external name as-is, so one external lookup fans out into four or more queries, each doubled for A and AAAA. The sketch below illustrates the amplification.
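This is standard resolv.conf behavior rather than anything Kubernetes-specific, so it's easy to model. A small illustration, where the search path and namespace are placeholders and real resolvers also retry each candidate for both A and AAAA:

```python
# Typical in-cluster search path; "myns" stands in for the pod's namespace.
SEARCH_PATH = [
    "myns.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]

def candidate_queries(name: str, ndots: int) -> list[str]:
    """Names a resolv.conf-style resolver tries, in order, for a non-absolute name."""
    as_given = name + "."
    suffixed = [f"{name}.{domain}." for domain in SEARCH_PATH]
    if name.count(".") >= ndots:
        return [as_given] + suffixed   # enough dots: try the name as-is first
    return suffixed + [as_given]       # too few dots: walk the search path first

print(candidate_queries("api.example.com", ndots=5)[0])  # a cluster.local suffix: wasted lookups before the real one
print(candidate_queries("api.example.com", ndots=2)[0])  # 'api.example.com.': resolved on the first try
```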
Recovery Patterns
When DNS fails, you need parallel recovery tracks. Track one: fix the DNS issue (restore the provider, correct the record, restart the resolver). Track two: bypass DNS entirely for critical paths using IP addresses, /etc/hosts entries, or service mesh direct routing. Track two buys you time while track one deals with propagation delays.
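Track two can be as blunt as a pinned-address map that critical clients consult when resolution fails. A sketch of the idea; the hostnames and addresses are placeholders, and a real list would live in regularly audited config rather than in code:

```python
import socket

# Placeholder pinned addresses for critical internal endpoints.
FALLBACK_IPS = {
    "payments.internal.example.com": "10.0.12.34",
    "auth.internal.example.com": "10.0.12.35",
}

def resolve_with_fallback(host: str, port: int = 443) -> str:
    """Resolve via DNS, falling back to a pinned address if resolution fails."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
        return infos[0][4][0]          # first resolved address
    except socket.gaierror:
        if host in FALLBACK_IPS:
            return FALLBACK_IPS[host]  # track two: bypass DNS entirely
        raise
```

If you then connect to a pinned IP over TLS, keep passing the original hostname for SNI and certificate verification; the bypass replaces the lookup, not the identity check.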
After recovery, audit every TTL in your zones. Any record with a TTL above 300 seconds for a critical service is a risk you're choosing to accept.
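The audit is easy to script. A sketch with dnspython, assuming a hand-maintained list of critical records; a recursive resolver reports the remaining cached TTL, so aim it at your authoritative servers to see the configured value:

```python
import dns.resolver

# Placeholder list of critical records; in practice, generate it from your zone data.
CRITICAL_RECORDS = [("api.example.com", "A"), ("example.com", "MX")]
MAX_TTL = 300

def audit_ttls() -> None:
    resolver = dns.resolver.Resolver()   # point .nameservers at your authoritative servers
    for name, rdtype in CRITICAL_RECORDS:
        answer = resolver.resolve(name, rdtype)
        if answer.rrset.ttl > MAX_TTL:
            print(f"WARN {name} {rdtype}: TTL {answer.rrset.ttl}s exceeds {MAX_TTL}s")

audit_ttls()
```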
Incident Timeline
- T+0m: DNS provider starts returning SERVFAIL for primary domain. External monitoring catches it before internal alerts fire.
- T+2m: Cached DNS records still serving traffic. Services with short TTLs (30s) begin failing first. API gateway health checks start failing.
- T+5m: Customer-facing error rates spike to 40%. Support tickets flood in. On-call SRE paged via PagerDuty.
- T+10m: Team identifies DNS as root cause. Attempts to switch to secondary DNS provider begin. TTL on NS records is 48 hours, blocking fast failover.
- T+15m: Hotfix deployed: hardcoded IP addresses in critical service configs as temporary bypass. Partial traffic recovery for known endpoints.
- T+30m: Secondary DNS provider fully propagated for zones with low TTLs. Full recovery takes 4-6 hours due to cached NXDOMAIN responses in ISP resolvers.
Detection Signals
- NXDOMAIN response rate exceeding baseline by 10x in resolver logs
- Spike in connection timeout errors across multiple unrelated services simultaneously
- External synthetic monitoring failures from multiple geographic regions
- DNS query latency exceeding 500ms from internal resolvers (see the probe sketch after this list)
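A lightweight synthetic probe can emit the last two signals directly. A sketch with dnspython, using placeholder resolver IPs and probe name and the 500ms budget from above:

```python
import time
import dns.exception
import dns.resolver

RESOLVERS = ["10.0.0.2", "10.0.0.3"]   # placeholder internal resolver IPs
LATENCY_BUDGET = 0.5                   # seconds, matching the signal above

def probe(name: str = "health.example.com") -> None:
    for ip in RESOLVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [ip]
        start = time.monotonic()
        try:
            resolver.resolve(name, "A", lifetime=2.0)
            status = "ok"
        except dns.resolver.NXDOMAIN:
            status = "nxdomain"        # feeds the NXDOMAIN-rate signal
        except dns.exception.DNSException:
            status = "error"
        elapsed = time.monotonic() - start
        if status != "ok" or elapsed > LATENCY_BUDGET:
            print(f"ALERT resolver={ip} status={status} latency={elapsed:.3f}s")

probe()
```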
Prevention
- Run at least two DNS providers (e.g., Route 53 + Cloudflare) with automated zone sync using tools like OctoDNS or DNSControl
- Set TTLs to 60 seconds for critical records. The performance cost of low TTLs is negligible compared to hours-long outages
- Maintain a list of hardcoded IP fallbacks for internal service-to-service communication that bypasses DNS
- Monitor DNS resolution from multiple vantage points using ThousandEyes or Catchpoint, not just from your own infrastructure
- Run internal DNS (CoreDNS, Unbound) with aggressive caching and serve-stale configured
Key Points
- DNS failures are uniquely catastrophic because they affect every service simultaneously, unlike most infrastructure failures that are scoped to a single component
- NXDOMAIN responses get cached by recursive resolvers, meaning recovery takes far longer than the original outage
- The 2016 Dyn DDoS attack took down Twitter, GitHub, Netflix, and Reddit because they relied on a single DNS provider
- Internal service mesh DNS (Kubernetes CoreDNS) fails differently than external DNS. A CoreDNS OOM kill can take down an entire cluster
- DNS-over-HTTPS and DNS-over-TLS add new failure modes that traditional monitoring misses
Common Mistakes
- ✗ Setting DNS TTLs to 24 hours 'for performance' and then being unable to failover during an incident
- ✗ Testing DNS failover only in staging where TTLs and caching behavior differ from production
- ✗ Assuming your cloud provider's DNS is redundant enough on its own without a secondary provider
- ✗ Forgetting that certificate validation depends on DNS, so DNS failures also break TLS handshakes for new connections