DNS Protocol Deep Dive
DNS translates domain names to IP addresses through a hierarchical caching system of recursive and authoritative servers.
The Problem
Every internet request starts with a DNS lookup. How does a human-readable domain name get translated into an IP address, and what happens when this critical system fails or gets attacked?
Mental Model
Like looking up a phone number: first check the contacts list, then the local directory, then the national registry.
How It Works
Every time a URL is typed, a link is clicked, or an API call is made, DNS is the first thing that happens. Before the browser can open a TCP connection, it needs an IP address. DNS — the Domain Name System — is the internet's directory service that translates example.com into 93.184.216.34.
The process involves multiple caching layers and up to four network hops (recursive resolver, root, TLD, authoritative). Understanding this chain matters because DNS failures are a frequent cause of "the internet is down" experiences.
The Resolution Walk
When a browser requests www.example.com, here's what actually happens:
- Browser cache: Chrome and Firefox maintain their own DNS cache (typically 60 seconds). If the site was visited recently, the lookup ends here.
- OS cache: The operating system's stub resolver checks its cache. On macOS, flush this cache with sudo dscacheutil -flushcache.
- Recursive resolver: The configured DNS server (the ISP's resolver, 8.8.8.8, or 1.1.1.1) checks its cache.
- Root servers: If the recursive resolver has no cached answer, it asks a root server. There are 13 logical root servers (A through M), operated by organizations like Verisign, ICANN, and the US military. They don't know the answer but know who does.
- TLD servers: The root server says "for .com, ask these TLD servers." The TLD server for .com knows which authoritative servers handle example.com.
- Authoritative server: The final authority. It has the actual records and responds with the IP address (plus a TTL).
# Watch the entire resolution walk
dig +trace www.example.com
# Output (simplified):
# . IN NS a.root-servers.net. ← Root
# com. IN NS a.gtld-servers.net. ← TLD
# example.com. IN NS ns1.example.com. ← Authoritative
# www.example.com. IN A 93.184.216.34 ← Answer!
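The referral chain in that trace can be sketched offline. The following is a minimal simulation, not a real resolver: the SERVERS table is hypothetical data that mirrors the simplified dig output above, and the resolve function is an illustrative name.

```python
# Hypothetical delegation data mirroring the simplified dig +trace output.
# Each server either refers a zone to another server or answers a name directly.
SERVERS = {
    "a.root-servers.net.": {"referrals": {"com.": "a.gtld-servers.net."}, "answers": {}},
    "a.gtld-servers.net.": {"referrals": {"example.com.": "ns1.example.com."}, "answers": {}},
    "ns1.example.com.": {"referrals": {}, "answers": {"www.example.com.": "93.184.216.34"}},
}

def resolve(name, start="a.root-servers.net.", max_hops=10):
    """Follow referrals from the root until a server answers; return (address, path)."""
    server = start
    path = [server]
    for _ in range(max_hops):
        data = SERVERS[server]
        if name in data["answers"]:
            return data["answers"][name], path
        # Pick the most specific zone this server can refer us to.
        matches = [z for z in data["referrals"] if name == z or name.endswith("." + z)]
        if not matches:
            return None, path  # no delegation known for this name
        server = data["referrals"][max(matches, key=len)]
        path.append(server)
    return None, path

address, path = resolve("www.example.com.")
print(address, "via", " -> ".join(path))
```

A real recursive resolver also caches every referral along the way, which is why most lookups never reach the root.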
DNS Record Types
DNS isn't just about IP addresses. Different record types serve different purposes:
| Type | Purpose | Example |
|---|---|---|
| A | Maps name to IPv4 address | example.com → 93.184.216.34 |
| AAAA | Maps name to IPv6 address | example.com → 2606:2800:220:1:... |
| CNAME | Alias pointing to another name | www.example.com → example.com |
| MX | Mail server for the domain (with priority) | example.com → 10 mail.example.com |
| TXT | Arbitrary text, used for SPF, DKIM, verification | v=spf1 include:_spf.google.com ~all |
| SRV | Service location (host + port) | _http._tcp.example.com → 0 5 80 www.example.com |
| NS | Delegates a zone to authoritative name servers | example.com → ns1.example.com |
| SOA | Zone authority info (serial, refresh, retry) | Primary NS, admin email, serial number |
# Query specific record types
dig A example.com # IPv4 address
dig AAAA example.com # IPv6 address
dig MX example.com # Mail servers
dig TXT example.com # SPF, DKIM, verification records
dig NS example.com # Name servers
dig SOA example.com # Zone authority
dig SRV _http._tcp.example.com # Service records
# Short output format
dig +short A example.com
# 93.184.216.34
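Under the hood, each of those dig commands sends a small binary message whose question section carries a numeric type code. As a rough sketch of that wire format (function names and the default query ID are illustrative assumptions, and only the header and question are built):

```python
import struct

# Numeric codes for the record types in the table above (IANA-assigned values).
QTYPES = {"A": 1, "NS": 2, "CNAME": 5, "SOA": 6, "MX": 15, "TXT": 16, "AAAA": 28, "SRV": 33}

def encode_name(name):
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    out = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError(f"invalid label: {label!r}")
        out += bytes([len(raw)]) + raw
    return out + b"\x00"

def build_query(name, rtype, query_id=0x1234):
    """Build a single-question query with the recursion-desired flag set."""
    # Header fields: ID, flags (RD = 0x0100), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_name(name) + struct.pack("!HH", QTYPES[rtype], 1)  # class IN = 1
    return header + question

print(build_query("example.com", "MX").hex())
```

The only difference between an A query and an MX query for the same name is the two-byte type field at the end of the question.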
TTL Strategy — The Art of Cache Timing
TTL (Time to Live) is the number of seconds a resolver should cache a record. Getting it right is a balancing act:
High TTL (86400 / 24 hours):
- Fewer DNS queries → faster page loads
- Less load on authoritative servers
- Slower to update — changes take up to 24 hours to propagate
- Good for: stable records that rarely change
Low TTL (60 seconds):
- Changes propagate quickly
- More DNS queries → slightly slower page loads
- More load on authoritative servers
- Good for: records that change frequently, pre-migration preparation
The migration pattern: When planning a server migration, lower the TTL to 60 seconds at least 48 hours BEFORE the change. Once the old 24-hour TTL has expired, every cache that refreshes picks up the new low TTL, so no resolver holds the record for more than 60 seconds. Make the IP change at that point, verify the migration, then raise the TTL back.
# Check current TTL
dig +nocmd +noall +answer example.com
# example.com. 3600 IN A 93.184.216.34
# ↑ TTL in seconds (1 hour)
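To make the caching behaviour concrete, here is a minimal sketch of a TTL-respecting cache. The TTLCache class and its injectable clock are assumptions for illustration; real resolvers add negative caching, TTL caps, and eviction.

```python
import time

class TTLCache:
    """Minimal resolver-style cache: entries expire TTL seconds after insertion."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # name -> (address, expires_at)

    def put(self, name, address, ttl):
        self._entries[name] = (address, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the caller must query upstream again
            return None
        return address

# Simulated clock: the record is served for 60 seconds, then dropped.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("example.com", "93.184.216.34", ttl=60)
now[0] = 59.0
print(cache.get("example.com"))
now[0] = 60.0
print(cache.get("example.com"))
```

This is why the TTL at the moment a record was cached, not the TTL currently published, bounds how long a stale answer can survive.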
DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT)
Traditional DNS is unencrypted. Anyone on the network path — the ISP, coffee shop WiFi operator, or government — can see every domain being looked up. DoH and DoT fix this.
DNS-over-TLS (DoT): Wraps DNS queries in TLS on port 853. Simple but easily blocked by network operators (just block port 853).
DNS-over-HTTPS (DoH): Encapsulates DNS queries in HTTPS on port 443. On the wire it looks like regular HTTPS traffic, so it is hard to block without either blocking known DoH endpoints or blocking all HTTPS.
# DoH query using curl
curl -s -H 'Accept: application/dns-json' \
'https://1.1.1.1/dns-query?name=example.com&type=A' | jq
# Response:
# {
# "Answer": [
# { "name": "example.com", "type": 1, "TTL": 3600, "data": "93.184.216.34" }
# ]
# }
Both Chrome and Firefox now support DoH. Configure the browser or OS to use Cloudflare (1.1.1.1) or Google (8.8.8.8) for encrypted DNS.
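The JSON response shape shown in the curl example is easy to consume programmatically. A minimal sketch, assuming the payload has the same "Answer" structure as that sample (extract_records is an illustrative name, and the sample string is copied from the response above):

```python
import json

def extract_records(payload, rtype=1):
    """Return (data, ttl) pairs for answers of the given numeric type (1 = A)."""
    doc = json.loads(payload)
    # A response with no answers may omit the "Answer" key entirely.
    return [(a["data"], a["TTL"]) for a in doc.get("Answer", []) if a.get("type") == rtype]

sample = '{"Answer": [{"name": "example.com", "type": 1, "TTL": 3600, "data": "93.184.216.34"}]}'
print(extract_records(sample))  # [('93.184.216.34', 3600)]
```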
DNS as Infrastructure — Beyond Name Resolution
Modern DNS is far more than a phone book. It's an active infrastructure component:
Health-Checked DNS Failover
AWS Route 53 and Cloudflare DNS can monitor servers and automatically remove unhealthy IPs from DNS responses:
example.com → Health check fails for 93.184.216.34
→ Remove from DNS response
→ Serve only healthy IP: 93.184.216.35
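The filtering step can be sketched in a few lines. This is an illustration of the idea only: healthy_answers and the is_healthy callback are hypothetical names, and returning the full pool when every check fails is a design choice of this sketch rather than documented provider behaviour.

```python
def healthy_answers(pool, is_healthy):
    """Return only the addresses whose health check passes.

    If every check fails, fall back to the full pool so the answer is never empty.
    """
    alive = [ip for ip in pool if is_healthy(ip)]
    return alive or list(pool)

# Health status as in the example above: .34 is failing, .35 is healthy.
status = {"93.184.216.34": False, "93.184.216.35": True}
print(healthy_answers(list(status), lambda ip: status[ip]))  # ['93.184.216.35']
```

Clients that cached the old answer keep using it until the TTL expires, so failover speed is bounded by the record's TTL.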
Geographic Routing
Return different IPs based on the resolver's location:
User in US → dig example.com → 52.0.14.116 (us-east-1)
User in Europe → dig example.com → 54.72.14.116 (eu-west-1)
User in Asia → dig example.com → 13.230.14.16 (ap-northeast-1)
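As a rough sketch, the lookup on the authoritative side reduces to a region-keyed table. The region labels and the DEFAULT_REGION fallback are assumptions for illustration; the addresses are the ones from the example above. How a provider maps a resolver to a region (resolver IP, EDNS Client Subnet) is outside this sketch.

```python
REGION_ANSWERS = {
    "us": "52.0.14.116",     # us-east-1
    "eu": "54.72.14.116",    # eu-west-1
    "asia": "13.230.14.16",  # ap-northeast-1
}
DEFAULT_REGION = "us"  # hypothetical fallback when the region is unknown

def answer_for(resolver_region):
    """Pick the address to return for a query arriving from the given region."""
    return REGION_ANSWERS.get(resolver_region, REGION_ANSWERS[DEFAULT_REGION])

print(answer_for("eu"))  # 54.72.14.116
```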
Weighted Routing
Distribute traffic by percentage — useful for canary deployments:
example.com → 90% → 10.0.1.1 (stable)
→ 10% → 10.0.2.1 (canary)
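A weighted answer can be approximated with a weighted random choice per query. This is a sketch under that assumption (pick_target is an illustrative name; real providers may implement weighting differently), using a seeded generator so the run is repeatable:

```python
import random
from collections import Counter

def pick_target(weights, rng):
    """Choose one address with probability proportional to its weight."""
    targets = list(weights)
    return rng.choices(targets, weights=[weights[t] for t in targets], k=1)[0]

weights = {"10.0.1.1": 90, "10.0.2.1": 10}  # stable vs canary, as above
rng = random.Random(42)
counts = Counter(pick_target(weights, rng) for _ in range(10_000))
print(counts)  # roughly 9:1 in favour of the stable address
```

Because resolvers cache answers, the split seen by end users only approaches the configured weights when TTLs are short and traffic comes through many resolvers.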
Service Discovery in Kubernetes
Kubernetes uses CoreDNS to provide automatic service discovery. Every Service gets a DNS name:
my-service.my-namespace.svc.cluster.local → 10.96.0.42 (ClusterIP)
Pods can reach services by name without knowing IPs. When services scale up or down, DNS updates automatically.
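The naming scheme, and why a short name like my-service resolves from inside a pod, can be sketched as follows. The search list and ndots value below assume the resolv.conf that Kubernetes typically writes for pods using the default cluster DNS policy; both function names are illustrative.

```python
def service_fqdn(service, namespace, cluster_domain="cluster.local"):
    """Build the DNS name of a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"

def search_candidates(name, pod_namespace, cluster_domain="cluster.local", ndots=5):
    """Names a stub resolver would try, in order, for a lookup made inside a pod."""
    if name.endswith("."):
        return [name]  # already fully qualified, no search-list expansion
    search = [f"{pod_namespace}.svc.{cluster_domain}", f"svc.{cluster_domain}", cluster_domain]
    expanded = [f"{name}.{suffix}." for suffix in search]
    absolute = name + "."
    # Fewer dots than ndots: try the search list first, then the name as given.
    return expanded + [absolute] if name.count(".") < ndots else [absolute] + expanded

print(service_fqdn("my-service", "my-namespace"))
for candidate in search_candidates("my-service", "my-namespace"):
    print(candidate)
```

One consequence of this expansion is that external names with few dots generate several failed cluster-local lookups before the real query goes out; a trailing dot avoids that.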
DNS Attacks and Defenses
DNS is a high-value attack target because compromising it redirects all traffic:
DNS Cache Poisoning: Attacker sends forged responses to a recursive resolver, inserting fake records. The resolver caches them and serves them to all clients.
- Defense: DNSSEC (DNS Security Extensions) — cryptographically signs records so resolvers can verify authenticity.
DNS Amplification DDoS: Attacker sends small queries with a spoofed source IP (the victim's). DNS servers send large responses to the victim.
- Defense: Response Rate Limiting (RRL) on authoritative servers, BCP38 (ingress filtering).
DNS Hijacking: Attacker changes the domain's NS records (via compromised registrar account) to point to their servers.
- Defense: Registrar lock, DNSSEC, multi-factor auth on registrar accounts.
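# Request DNSSEC records (RRSIG) along with the answer
dig +dnssec example.com
# On a validating resolver, the "ad" flag in the response header means the answer was validated
# Repeat the query with validation disabled
dig +cd example.com # cd = checking disabled
# If the normal query returns SERVFAIL but the +cd query returns an answer, DNSSEC validation is failing for the domain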
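Without DNSSEC, a resolver's protection against forged responses rests on matching each response to an outstanding query. The sketch below illustrates that check; the Pending record and both function names are assumptions for illustration, not any resolver's actual code.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Pending:
    query_id: int      # 16-bit transaction ID sent in the query
    source_port: int   # UDP port the query was sent from
    server_ip: str     # server the query was sent to
    question: tuple    # (name, type), e.g. ("example.com.", "A")

def accepts(pending, resp_id, dest_port, from_ip, question):
    """True only if every field of the response matches the outstanding query."""
    return (
        resp_id == pending.query_id
        and dest_port == pending.source_port
        and from_ip == pending.server_ip
        and question == pending.question
    )

def blind_guess_space(random_ports=1):
    """Number of (ID, port) combinations an off-path attacker must guess among."""
    return 65536 * random_ports

pending = Pending(0x1A2B, 53124, "192.0.2.53", ("example.com.", "A"))
print(accepts(pending, 0x1A2B, 53124, "192.0.2.53", ("example.com.", "A")))
print(blind_guess_space(), blind_guess_space(random_ports=60000))
```

A fixed source port leaves only the 16-bit ID to guess, which is why source port randomization matters even where DNSSEC is not deployed.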
DNS is invisible when it works and catastrophic when it fails. Every production system should monitor DNS resolution time, have redundant authoritative servers, and understand their TTL strategy.
Key Points
- DNS resolution involves up to 4 hops: browser cache → OS cache → recursive resolver → authoritative server (via root → TLD)
- TTL (Time to Live) controls how long each cache layer holds a record; too short wastes bandwidth, too long delays changes
- DNS uses UDP by default for speed (single packet query/response) but retries over TCP when a response is truncated: classically anything over 512 bytes, or over the larger size a client advertises via EDNS0
- DNS-over-HTTPS (DoH) encrypts DNS queries inside HTTPS, preventing ISPs and networks from snooping on browsing activity
- A single DNS lookup failure can cascade into a complete outage, which makes DNS one of the most critical shared dependencies on the internet
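A client decides whether to retry over TCP by checking the truncation (TC) flag in the 12-byte response header. A minimal sketch of that check (is_truncated is an illustrative name, and the sample headers are hand-built rather than captured traffic):

```python
import struct

TC_BIT = 0x0200  # "truncated" flag in the DNS header flags field

def is_truncated(response):
    """Return True if the server set the TC flag, meaning the answer did not fit."""
    if len(response) < 12:
        raise ValueError("a DNS message starts with a 12-byte header")
    _query_id, flags = struct.unpack("!HH", response[:4])
    return bool(flags & TC_BIT)

# Hand-built headers: flags 0x8380 = QR + TC + RD + RA, flags 0x8180 = QR + RD + RA.
truncated = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 0, 0, 0)
complete = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 1, 0, 0)
print(is_truncated(truncated), is_truncated(complete))  # True False
```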
Key Components
| Component | Role |
|---|---|
| Stub Resolver | The DNS client library in the OS that sends queries to the configured recursive resolver |
| Recursive Resolver | Walks the DNS tree on behalf of the client, querying root → TLD → authoritative servers |
| Authoritative Name Server | The source of truth for a domain's DNS records — answers definitively for zones it owns |
| Root Name Servers | 13 logical root servers (A through M) that know where to find every TLD's name servers |
| DNS Cache | Multiple caching layers (browser, OS, resolver, CDN) that store responses to avoid repeated lookups |
When to Use
DNS is involved in every internet connection — it is not optional. The choices are DNS provider (authoritative), resolver (recursive), TTL strategy, and whether to use DNS-based traffic management (weighted, geolocation, failover).
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Cloudflare DNS (1.1.1.1) | Managed | Fast public recursive resolver with a privacy focus; malware filtering is offered on the separate 1.1.1.2 address | Global anycast, sub-15ms median response time |
| AWS Route 53 | Managed | Authoritative DNS integrated with AWS ecosystem, health checks, and traffic routing | Enterprise DNS with 100% SLA |
| BIND | Open Source | The reference DNS implementation — full-featured authoritative and recursive server | Runs root servers and large ISP resolvers |
| CoreDNS | Open Source | Kubernetes-native DNS with plugin architecture for service discovery | Cloud-native clusters |
Debug Checklist
- Use dig +trace domain.com to walk the entire resolution chain from root to authoritative
- Check TTL values with dig domain.com — low TTLs mean frequent re-resolution, high TTLs delay changes
- Verify propagation across resolvers: dig @8.8.8.8 vs dig @1.1.1.1 vs dig @208.67.222.222
- Check for DNSSEC validation issues with dig +dnssec domain.com
- Monitor DNS resolution latency — add DNS timing to the APM dashboards
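For the last item, resolution time can be sampled from application code before wiring it into a dashboard. A minimal sketch using the OS resolver (time_resolution and its injectable resolve parameter are assumptions for illustration; the measurement includes any OS-level caching):

```python
import socket
import time

def time_resolution(hostname, resolve=socket.getaddrinfo):
    """Return (elapsed_ms, addresses) for one lookup through the OS resolver."""
    start = time.perf_counter()
    results = resolve(hostname, None)
    elapsed_ms = (time.perf_counter() - start) * 1000
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    addresses = sorted({r[4][0] for r in results})
    return elapsed_ms, addresses

if __name__ == "__main__":
    ms, addrs = time_resolution("localhost")
    print(f"{ms:.2f} ms -> {addrs}")
```

Repeated calls for the same name mostly measure cache hits, so sampling a mix of names or tracking the slowest percentiles gives a more honest picture.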
Common Mistakes
- Setting TTLs too high before a migration — there is no way to force clients to drop cached records, so lower TTLs BEFORE the change
- Not understanding CNAME flattening: a CNAME at the zone apex (example.com) violates RFC 1034, so some providers resolve the CNAME target themselves and return the resulting A/AAAA records at the apex instead
- Forgetting that DNS propagation isn't instant — different resolvers cache records for different durations based on TTL
- Using A records when CNAME would be better — A records hardcode IPs, CNAMEs follow name changes automatically
- Not monitoring DNS resolution time — slow DNS adds latency to every single request end users make
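The CNAME flattening item can be illustrated with a small chain-following sketch. The ZONE data and the flatten function are hypothetical; a provider doing this would query the live target rather than a dictionary.

```python
# Hypothetical records: the apex points at a load balancer name via a CNAME chain.
ZONE = {
    "example.com.": ("CNAME", "lb.provider.example."),
    "lb.provider.example.": ("CNAME", "edge-1.provider.example."),
    "edge-1.provider.example.": ("A", "93.184.216.34"),
}

def flatten(name, zone, max_depth=8):
    """Follow CNAMEs from `name` until an A record is found; return its address."""
    seen = set()
    current = name
    while current in zone and len(seen) < max_depth:
        if current in seen:
            raise ValueError(f"CNAME loop at {current}")
        seen.add(current)
        rtype, value = zone[current]
        if rtype == "A":
            return value  # this is what gets published as an A record at the apex
        current = value
    return None  # chain ended without an address, or was too long

print(flatten("example.com.", ZONE))  # 93.184.216.34
```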
Real World Usage
- Cloudflare's 1.1.1.1 handles over 1 trillion DNS queries per day using global anycast routing
- AWS Route 53's weighted routing distributes traffic across regions based on configurable percentages
- Google's 8.8.8.8 pioneered public DNS resolvers, providing an alternative to ISP-provided DNS
- Kubernetes uses CoreDNS internally for service discovery: every pod resolves service-name.namespace.svc.cluster.local
- Akamai's authoritative DNS serves records for many of the world's largest websites with sub-millisecond latency