DNS & Service Discovery
Why It Exists
In the old world of static infrastructure, teams hardcoded IPs in config files and called it a day. That does not work in cloud-native environments. Instances are ephemeral. IPs change on every deploy, every scale event, every failure recovery. Service discovery provides a dynamic registry so services find each other by name, not by address. Still managing IP addresses in config files for anything beyond a handful of machines? Stop. That's a maintenance nightmare in the making.
How It Works
Traditional DNS
- Client queries a recursive resolver, which walks the chain: root nameserver, TLD nameserver, authoritative nameserver.
- Each hop has its own TTL-based cache. Total resolution takes 10-100ms uncached, under 1ms cached.
- The record types that matter most: A (IPv4), AAAA (IPv6), CNAME (alias), SRV (service + port), TXT (metadata).
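A minimal sketch of these lookups using Go's standard resolver. The hostnames are placeholders, and which record types actually exist for them is an assumption; the point is only to show the record types above in use.

```go
// Sketch: resolving common DNS record types with Go's net.Resolver.
// Hostnames are illustrative placeholders, not real services.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	r := net.DefaultResolver

	// A / AAAA: name -> IPv4/IPv6 addresses.
	ips, err := r.LookupIPAddr(ctx, "api.example.com")
	fmt.Println("A/AAAA:", ips, err)

	// CNAME: follow the alias chain to the canonical name.
	cname, err := r.LookupCNAME(ctx, "www.example.com")
	fmt.Println("CNAME:", cname, err)

	// TXT: free-form metadata attached to a name.
	txts, err := r.LookupTXT(ctx, "example.com")
	fmt.Println("TXT:", txts, err)
}
```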
This system has worked since the 1980s. It is battle-tested, universally supported, and honestly pretty elegant for what it does. The problem is that it was designed for things that do not move around much. Modern infrastructure moves around constantly.
Service Discovery Patterns
Client-Side Discovery. The client queries the service registry directly and picks an instance. The client gets full control over load balancing. Netflix Eureka and gRPC name resolution work this way. The downside: every client needs the discovery logic baked in, coupling clients to the registry implementation.
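A rough sketch of the client-side pattern, with a hypothetical Registry interface standing in for Eureka or any other backend; real clients layer in health filtering, caching, and smarter balancing than a random pick.

```go
// Sketch of client-side discovery: the client asks a registry for live
// instances and does its own load balancing. The Registry interface and
// the random pick are illustrative, not a specific product's API.
package discovery

import (
	"errors"
	"math/rand"
)

type Instance struct {
	Host string
	Port int
}

// Registry abstracts whatever backs the lookup (Eureka, Consul, a cache...).
type Registry interface {
	Instances(service string) ([]Instance, error)
}

// Pick resolves a service and chooses one instance at random.
func Pick(r Registry, service string) (Instance, error) {
	instances, err := r.Instances(service)
	if err != nil {
		return Instance{}, err
	}
	if len(instances) == 0 {
		return Instance{}, errors.New("no instances registered for " + service)
	}
	return instances[rand.Intn(len(instances))], nil
}
```

This is exactly the coupling the text warns about: every client that embeds this logic also embeds an opinion about what the registry looks like.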
Server-Side Discovery. The client sends requests to a load balancer or router, which queries the registry and forwards traffic. The client stays simple, but it adds a network hop. AWS ELB + ECS and Kubernetes Services both follow this pattern. For most teams, this is the right default. Simpler clients are worth the extra hop.
DNS-Based Discovery. Services register themselves as DNS records (A or SRV). Clients just use standard DNS resolution. It is simple and requires zero client changes, but it is also limited. DNS does not support health-aware routing out of the box. Something like Consul DNS needs to be layered on top to get that.
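A sketch of layering Consul DNS under a standard SRV lookup. It assumes a local Consul agent serving DNS on its default port 8600 and a registered service named "payments"; both are assumptions, not givens.

```go
// Sketch: query a local Consul agent's DNS interface for SRV records.
// 127.0.0.1:8600 is Consul's default DNS port; "payments" is a placeholder.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Point a resolver at the Consul agent instead of the system resolver.
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: time.Second}
			return d.DialContext(ctx, network, "127.0.0.1:8600")
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// SRV answers carry both host and port; Consul filters out instances
	// with failing health checks, which plain DNS alone does not do.
	_, addrs, err := r.LookupSRV(ctx, "", "", "payments.service.consul")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, srv := range addrs {
		fmt.Printf("%s:%d\n", srv.Target, srv.Port)
	}
}
```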
Production Considerations
- Failure detection. Heartbeat intervals (typically 10-30s) control how fast dead instances get deregistered. Faster detection means more network overhead. Pick the trade-off based on how much stale routing is tolerable.
- Consistency model. Consul uses Raft consensus (CP). Eureka uses peer-to-peer replication (AP). The question is: what is worse for the system, getting a stale endpoint list, or the registry being temporarily unavailable? Most teams should pick AP unless they have a strong reason not to.
- DNS TTL in Kubernetes. CoreDNS with ndots:5 (the default) causes excessive DNS queries because every short hostname triggers five lookup attempts. Tune ndots to 2-3 for most workloads. This is one of those things almost nobody configures, and almost everybody should.
- Graceful shutdown. Services need to deregister from the registry before stopping. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds, then sends SIGKILL. Make sure the app actually handles SIGTERM and deregisters, or it will serve traffic to a process that is shutting down. A sketch of this flow follows the list.
- Multi-cluster discovery. If services span multiple Kubernetes clusters, look at Consul mesh gateways, Istio multi-cluster, or DNS-based federation. None of these are simple. Budget real engineering time for this.
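A minimal Go sketch of that SIGTERM flow. The deregister call is a placeholder for whatever registry the service uses, and the 20-second drain window is an assumed value that must stay below the pod's terminationGracePeriodSeconds.

```go
// Sketch: deregister first, then drain in-flight requests before the
// grace period expires. deregister() stands in for the real registry call.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func deregister() {
	// Placeholder: remove this instance from the service registry here.
	log.Println("deregistered from service registry")
}

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Kubernetes sends SIGTERM, then SIGKILL after terminationGracePeriodSeconds.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	deregister()

	// Drain in-flight requests; keep this below the pod's grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```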
Failure Scenarios
Scenario 1: DNS Resolution Cascade During Provider Outage. The authoritative DNS provider (say Route 53 or Dyn) goes down. All DNS queries for those domains start failing as cached TTLs expire. With a 60s TTL, the service is 100% unreachable within 60 seconds. This is not hypothetical. In the 2016 Dyn attack, Twitter, GitHub, Spotify, and Netflix all went offline because they depended on a single DNS provider. Detection: run external synthetic DNS monitors (hosted on a different provider) that query from multiple regions. Alert if resolution latency exceeds 500ms or NXDOMAIN appears for records that should exist. Recovery: use dual-provider DNS. Set up both Route 53 and Cloudflare DNS with NS records pointing to both. Set SOA refresh and retry intervals to 3600s and 900s. Keep a cold-standby zone file ready to export to a backup provider within minutes. This is cheap insurance that most teams skip until they get burned.
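A sketch of the synthetic-monitor idea in the detection step, assuming an illustrative record name and the 500ms threshold from the text; a real probe would run from multiple regions, on a different provider than the one it is checking, and page rather than log.

```go
// Sketch: periodically resolve a record that should always exist and flag
// slow answers or NXDOMAIN. Domain and interval are illustrative values.
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"time"
)

func probe(resolver *net.Resolver, name string) {
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	_, err := resolver.LookupIPAddr(ctx, name)
	elapsed := time.Since(start)

	var dnsErr *net.DNSError
	switch {
	case errors.As(err, &dnsErr) && dnsErr.IsNotFound:
		log.Printf("ALERT: NXDOMAIN for %s (record should exist)", name)
	case err != nil:
		log.Printf("ALERT: resolution error for %s: %v", name, err)
	case elapsed > 500*time.Millisecond:
		log.Printf("ALERT: slow resolution for %s: %v", name, elapsed)
	default:
		log.Printf("ok: %s resolved in %v", name, elapsed)
	}
}

func main() {
	for {
		probe(net.DefaultResolver, "api.example.com")
		time.Sleep(30 * time.Second)
	}
}
```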
Scenario 2: Service Registry Split-Brain. In a Consul cluster, a network partition isolates 2 of 5 server nodes. The minority side (2 nodes) cannot elect a leader and stops serving writes. If services registered against that partition crash and restart, they cannot re-register. The majority side (3 nodes) keeps accepting registrations but has no visibility into services on the other side. Clients get inconsistent service lists depending on which Consul agent they hit. Detection: monitor consul.raft.leader and alert on leader election frequency above 1/hour. Track consul.members.alive divergence between datacenters. Recovery: set leave_on_terminate = true so partitioned nodes deregister cleanly. Drop the anti-entropy sync interval to 60s instead of the default 300s. Implement client-side circuit breakers that fall back to a cached known-good endpoint list when the registry is unavailable.
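The cached-fallback idea in those recovery steps might look roughly like this; the lookup function is a stand-in for a Consul (or any registry) query, and serving a possibly stale list during a partition is a deliberate trade-off, not an accident.

```go
// Sketch: prefer a live registry lookup, fall back to the last known-good
// endpoint list when the registry is partitioned or down.
package discovery

import (
	"errors"
	"sync"
)

type Catalog struct {
	mu     sync.RWMutex
	cached map[string][]string // service -> last known-good endpoints
	lookup func(service string) ([]string, error)
}

func NewCatalog(lookup func(string) ([]string, error)) *Catalog {
	return &Catalog{cached: make(map[string][]string), lookup: lookup}
}

// Endpoints returns fresh registry data when possible and serves the cached
// list (possibly stale) when the registry is unreachable.
func (c *Catalog) Endpoints(service string) ([]string, error) {
	eps, err := c.lookup(service)
	if err == nil {
		c.mu.Lock()
		c.cached[service] = eps
		c.mu.Unlock()
		return eps, nil
	}

	c.mu.RLock()
	cached, ok := c.cached[service]
	c.mu.RUnlock()
	if ok {
		return cached, nil // stale beats empty during a partition
	}
	return nil, errors.New("registry unavailable and no cached endpoints for " + service)
}
```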
Scenario 3: DNS TTL Caching Prevents Failover. A critical service fails over from the primary region (us-east-1) to DR (us-west-2). The DNS record gets updated, and it has a 300s TTL. Should be fine, right? Except intermediate resolvers (ISP DNS, corporate recursive resolvers) cache aggressively. Some honor the TTL, others hold onto records for up to 24 hours. The result: 15% of traffic still hits the dead primary for hours after failover. Detection: compare per-region origin traffic against expected ratios post-failover. Monitor dns_query_count at both authoritative nameservers. Recovery: for planned failovers, lower the TTL to 30s at least 24 hours before the event. For unplanned failovers, accept the TTL propagation delay and focus on keeping the old IP responding, even if it just returns a redirect. For truly critical paths, use Anycast routing instead of DNS failover. Anycast reroutes in seconds via BGP, not minutes via DNS propagation.
Capacity Planning
CoreDNS in Kubernetes handles roughly 30,000 queries/second per instance on 2 vCPU. A typical Kubernetes cluster with 500 pods generates around 5,000 DNS queries/second. Route 53 supports effectively unlimited queries but charges $0.40 per million. Consul can handle about 10,000 service registrations per datacenter with sub-second health checking.
| Metric | Healthy Range | Warning | Action |
|---|---|---|---|
| DNS resolution P99 | < 5ms (cached), < 50ms (uncached) | > 100ms | Check resolver capacity, enable caching |
| CoreDNS pod CPU | < 60% | > 75% | Scale CoreDNS replicas |
| Registry sync latency | < 2s | > 10s | Check network, increase Raft throughput |
| Stale endpoint ratio | 0% | > 1% | Reduce health check interval |
| DNS query error rate | < 0.01% | > 0.1% | Investigate SERVFAIL responses |
Real-world numbers for context: Google runs its own global DNS infrastructure serving roughly 1.2 trillion queries/day through 8.8.8.8/8.8.4.4. Uber runs Consul clusters with about 8,000 services and 200,000 endpoints across multiple datacenters, processing around 50K registrations per minute during peak deploys. Netflix uses Eureka with roughly 600 microservices and handles about 100K registration events per hour using AP (availability-prioritized) semantics. Quick capacity formula: dns_instances = (pods * avg_queries_per_second_per_pod * 2) / per_instance_qps. A 1,000-pod cluster at 10 QPS/pod needs: (1000 * 10 * 2) / 30000 = ~1 instance (always run at least 3 for HA).
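The sizing formula as a small worked function; the constants mirror the numbers above and should be swapped for measured per-pod query rates.

```go
// Worked version of the capacity formula from the text.
package main

import (
	"fmt"
	"math"
)

func corednsReplicas(pods, qpsPerPod, perInstanceQPS float64) int {
	// dns_instances = (pods * avg_queries_per_second_per_pod * 2) / per_instance_qps,
	// rounded up, never below 3 replicas for HA.
	n := int(math.Ceil(pods * qpsPerPod * 2 / perInstanceQPS))
	if n < 3 {
		n = 3
	}
	return n
}

func main() {
	// 1,000 pods at 10 QPS each against ~30,000 QPS per CoreDNS instance.
	fmt.Println(corednsReplicas(1000, 10, 30000)) // formula gives 1, HA floor raises it to 3
}
```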
Architecture Decision Record
ADR: Service Discovery Strategy
Context: Choosing between DNS-based, registry-based, and mesh-based service discovery affects system reliability, latency, and how much operational work the team takes on.
| Criteria (Weight) | DNS-Based (CoreDNS) | Registry (Consul/Eureka) | Mesh-Based (Istio/Linkerd) |
|---|---|---|---|
| Ops complexity (25%) | Low | Medium | High |
| Failover speed (25%) | 30s-5min (TTL bound) | 1-10s (health check) | < 1s (proxy-level) |
| Client changes (20%) | None (standard DNS) | SDK/sidecar required | None (transparent proxy) |
| Multi-platform (15%) | Any (universal protocol) | VMs + containers | Kubernetes-centric |
| Observability (15%) | Minimal | Service health dashboard | Full L7 metrics |
Decision framework:
- Kubernetes-only, fewer than 50 services, team under 30 engineers. Use Kubernetes-native DNS (CoreDNS + Services). It works out of the box, requires zero additional infrastructure, and handles 90% of use cases. Tune ndots to 2 in pod DNS config to reduce query amplification. Seriously, do not overcomplicate this.
- Kubernetes + VMs, 50-200 services. Deploy Consul as a service registry with a DNS interface. It bridges the Kubernetes and VM worlds, provides health checking, and the KV store is useful for dynamic configuration. For teams already in the HashiCorp ecosystem (Nomad, Vault), this is the natural fit.
- Multi-region, more than 200 services, strong consistency required. Use Consul with WAN federation across datacenters. Configure translate_wan_addrs = true for cross-datacenter communication. The cost is added latency, but the payoff is consistent service catalogs across regions.
- Already running a service mesh, more than 100 services. Use the mesh's built-in discovery (Istio's Pilot, Linkerd's Destination). Do not run a separate discovery system on top of it. The mesh already maintains real-time endpoint tables, and layering another system creates confusion about which is the source of truth.
Key Points
- DNS maps human-readable names to IP addresses. It is, quite literally, the internet's phone book.
- Service discovery lets microservices find each other at runtime instead of relying on hardcoded addresses.
- Client-side vs server-side discovery: different trade-offs in complexity and load distribution.
- TTL management matters more than people think. Too short and it hammers the DNS servers; too long and failover stalls.
- Health-aware DNS (Route 53, Consul) pulls unhealthy endpoints out of responses automatically.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Consul | Open Source | Service mesh, health checks, KV store | Medium-Enterprise |
| CoreDNS | Open Source | Kubernetes DNS, plugin-based | Medium-Enterprise |
| AWS Route 53 | Managed | Global DNS, health checks, failover routing | Small-Enterprise |
| etcd | Open Source | Kubernetes backing store, strong consistency | Medium-Large |
Common Mistakes
- Caching DNS results forever in application code. Ignoring the TTL means routing traffic to dead hosts.
- Not retrying with fresh DNS resolution after a connection failure. The cached IP might be the problem. A sketch of the fix follows this list.
- Using DNS for load balancing without realizing that most clients cache the first resolved IP and stick with it.
- Running the service registry as a single instance, which turns it into a single point of failure for the entire fleet.
- Forgetting to monitor DNS resolution latency. Slow DNS adds hidden latency to every single request.
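A sketch of the re-resolution fix for the second mistake above, with an illustrative hostname and retry count; production code would also rotate across the returned addresses and back off between attempts.

```go
// Sketch: resolve the name on every connection attempt so a failover that
// changed the A record is picked up instead of reusing a stale cached IP.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func dialService(host, port string, attempts int) (net.Conn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		ips, err := net.DefaultResolver.LookupIPAddr(ctx, host)
		cancel()
		if err != nil {
			lastErr = err
			time.Sleep(time.Second)
			continue
		}
		if len(ips) == 0 {
			lastErr = fmt.Errorf("no addresses for %s", host)
			time.Sleep(time.Second)
			continue
		}
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(ips[0].IP.String(), port), 2*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		time.Sleep(time.Second)
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	conn, err := dialService("api.example.com", "443", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```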