DNS & Service Discovery
Why It Exists
In the old world of static infrastructure, teams hardcoded IPs in config files and called it a day. That does not work in cloud-native environments. Instances are ephemeral. IPs change on every deploy, every scale event, every failure recovery. Service discovery provides a dynamic registry so services find each other by name, not by address. Still managing IP addresses in config files for anything beyond a handful of machines? Stop. That's a maintenance nightmare in the making.
How It Works
Traditional DNS
- Client queries a recursive resolver, which walks the chain: root nameserver, TLD nameserver, authoritative nameserver.
- Each hop has its own TTL-based cache. Total resolution takes 10-100ms uncached, under 1ms cached.
- The record types that matter most: A (IPv4), AAAA (IPv6), CNAME (alias), SRV (service + port), TXT (metadata).
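A minimal sketch of these lookups using Go's standard resolver. The hostnames are placeholders, and which record types actually exist for them is an assumption; the point is only to show the record types above in use.

```go
// Sketch: resolving common DNS record types with Go's net.Resolver.
// Hostnames are illustrative placeholders, not real services.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	r := net.DefaultResolver

	// A / AAAA: name -> IPv4/IPv6 addresses.
	ips, err := r.LookupIPAddr(ctx, "api.example.com")
	fmt.Println("A/AAAA:", ips, err)

	// CNAME: follow the alias chain to the canonical name.
	cname, err := r.LookupCNAME(ctx, "www.example.com")
	fmt.Println("CNAME:", cname, err)

	// TXT: free-form metadata attached to a name.
	txts, err := r.LookupTXT(ctx, "example.com")
	fmt.Println("TXT:", txts, err)
}
```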
This system has worked since the 1980s. It is battle-tested, universally supported, and honestly pretty elegant for what it does. The problem is that it was designed for things that do not move around much. Modern infrastructure moves around constantly.
Service Discovery Patterns
Client-Side Discovery. The client queries the service registry directly and picks an instance. The client gets full control over load balancing. Netflix Eureka and gRPC name resolution work this way. The downside: every client needs the discovery logic baked in, coupling clients to the registry implementation.
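A rough sketch of the client-side pattern, with a hypothetical Registry interface standing in for Eureka or any other backend; real clients layer in health filtering, caching, and smarter balancing than a random pick.

```go
// Sketch of client-side discovery: the client asks a registry for live
// instances and does its own load balancing. The Registry interface and
// the random pick are illustrative, not a specific product's API.
package discovery

import (
	"errors"
	"math/rand"
)

type Instance struct {
	Host string
	Port int
}

// Registry abstracts whatever backs the lookup (Eureka, Consul, a cache...).
type Registry interface {
	Instances(service string) ([]Instance, error)
}

// Pick resolves a service and chooses one instance at random.
func Pick(r Registry, service string) (Instance, error) {
	instances, err := r.Instances(service)
	if err != nil {
		return Instance{}, err
	}
	if len(instances) == 0 {
		return Instance{}, errors.New("no instances registered for " + service)
	}
	return instances[rand.Intn(len(instances))], nil
}
```

This is exactly the coupling the text warns about: every client that embeds this logic also embeds an opinion about what the registry looks like.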
Server-Side Discovery. The client sends requests to a load balancer or router, which queries the registry and forwards traffic. The client stays simple, but it adds a network hop. AWS ELB + ECS and Kubernetes Services both follow this pattern. For most teams, this is the right default. Simpler clients are worth the extra hop.
DNS-Based Discovery. Services register themselves as DNS records (A or SRV). Clients just use standard DNS resolution. It is simple and requires zero client changes, but it is also limited. DNS does not support health-aware routing out of the box. Something like Consul DNS needs to be layered on top to get that.
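A sketch of layering Consul DNS under a standard SRV lookup. It assumes a local Consul agent serving DNS on its default port 8600 and a registered service named "payments"; both are assumptions, not givens.

```go
// Sketch: query a local Consul agent's DNS interface for SRV records.
// 127.0.0.1:8600 is Consul's default DNS port; "payments" is a placeholder.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func main() {
	// Point a resolver at the Consul agent instead of the system resolver.
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			d := net.Dialer{Timeout: time.Second}
			return d.DialContext(ctx, network, "127.0.0.1:8600")
		},
	}

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// SRV answers carry both host and port; Consul filters out instances
	// with failing health checks, which plain DNS alone does not do.
	_, addrs, err := r.LookupSRV(ctx, "", "", "payments.service.consul")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	for _, srv := range addrs {
		fmt.Printf("%s:%d\n", srv.Target, srv.Port)
	}
}
```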
Production Considerations
- Failure detection. Heartbeat intervals (typically 10-30s) control how fast dead instances get deregistered. Faster detection means more network overhead. Pick the trade-off based on how much stale routing is tolerable.
- Consistency model. Consul uses Raft consensus (CP). Eureka uses peer-to-peer replication (AP). The question is: what is worse for the system, getting a stale endpoint list, or the registry being temporarily unavailable? Most teams should pick AP unless they have a strong reason not to.
- DNS TTL in Kubernetes. CoreDNS with ndots:5 (the default) causes excessive DNS queries because every short hostname triggers five lookup attempts. Tune ndots to 2-3 for most workloads. This is one of those things almost nobody configures, and almost everybody should.
- Graceful shutdown. Services need to deregister from the registry before stopping. Kubernetes sends SIGTERM, waits terminationGracePeriodSeconds, then sends SIGKILL. Make sure the app actually handles SIGTERM and deregisters, or it will serve traffic to a process that is shutting down. A sketch of this flow follows the list.
- Multi-cluster discovery. If services span multiple Kubernetes clusters, look at Consul mesh gateways, Istio multi-cluster, or DNS-based federation. None of these are simple. Budget real engineering time for this.
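A minimal Go sketch of that SIGTERM flow. The deregister call is a placeholder for whatever registry the service uses, and the 20-second drain window is an assumed value that must stay below the pod's terminationGracePeriodSeconds.

```go
// Sketch: deregister first, then drain in-flight requests before the
// grace period expires. deregister() stands in for the real registry call.
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func deregister() {
	// Placeholder: remove this instance from the service registry here.
	log.Println("deregistered from service registry")
}

func main() {
	srv := &http.Server{Addr: ":8080"}
	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatal(err)
		}
	}()

	// Kubernetes sends SIGTERM, then SIGKILL after terminationGracePeriodSeconds.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)
	<-stop

	deregister()

	// Drain in-flight requests; keep this below the pod's grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("forced shutdown: %v", err)
	}
}
```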
Failure Scenarios
Scenario 1: DNS Resolution Cascade During Provider Outage. The authoritative DNS provider (say Route 53 or Dyn) goes down. All DNS queries for those domains start failing as cached TTLs expire. With a 60s TTL, the service is 100% unreachable within 60 seconds. This is not hypothetical. In the 2016 Dyn attack, Twitter, GitHub, Spotify, and Netflix all went offline because they depended on a single DNS provider. Detection: run external synthetic DNS monitors (hosted on a different provider) that query from multiple regions. Alert if resolution latency exceeds 500ms or NXDOMAIN appears for records that should exist. Recovery: use dual-provider DNS. Set up both Route 53 and Cloudflare DNS with NS records pointing to both. Set SOA refresh and retry intervals to 3600s and 900s. Keep a cold-standby zone file ready to export to a backup provider within minutes. This is cheap insurance that most teams skip until they get burned.
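A sketch of the synthetic-monitor idea in the detection step, assuming an illustrative record name and the 500ms threshold from the text; a real probe would run from multiple regions, on a different provider than the one it is checking, and page rather than log.

```go
// Sketch: periodically resolve a record that should always exist and flag
// slow answers or NXDOMAIN. Domain and interval are illustrative values.
package main

import (
	"context"
	"errors"
	"log"
	"net"
	"time"
)

func probe(resolver *net.Resolver, name string) {
	start := time.Now()
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()

	_, err := resolver.LookupIPAddr(ctx, name)
	elapsed := time.Since(start)

	var dnsErr *net.DNSError
	switch {
	case errors.As(err, &dnsErr) && dnsErr.IsNotFound:
		log.Printf("ALERT: NXDOMAIN for %s (record should exist)", name)
	case err != nil:
		log.Printf("ALERT: resolution error for %s: %v", name, err)
	case elapsed > 500*time.Millisecond:
		log.Printf("ALERT: slow resolution for %s: %v", name, elapsed)
	default:
		log.Printf("ok: %s resolved in %v", name, elapsed)
	}
}

func main() {
	for {
		probe(net.DefaultResolver, "api.example.com")
		time.Sleep(30 * time.Second)
	}
}
```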
Scenario 2: Service Registry Split-Brain. In a Consul cluster, a network partition isolates 2 of 5 server nodes. The minority side (2 nodes) cannot elect a leader and stops serving writes. If services registered against that partition crash and restart, they cannot re-register. The majority side (3 nodes) keeps accepting registrations but has no visibility into services on the other side. Clients get inconsistent service lists depending on which Consul agent they hit. Detection: monitor consul.raft.leader and alert on leader election frequency above 1/hour. Track consul.members.alive divergence between datacenters. Recovery: set leave_on_terminate = true so partitioned nodes deregister cleanly. Drop the anti-entropy sync interval to 60s instead of the default 300s. Implement client-side circuit breakers that fall back to a cached known-good endpoint list when the registry is unavailable.
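The cached-fallback idea in those recovery steps might look roughly like this; the lookup function is a stand-in for a Consul (or any registry) query, and serving a possibly stale list during a partition is a deliberate trade-off, not an accident.

```go
// Sketch: prefer a live registry lookup, fall back to the last known-good
// endpoint list when the registry is partitioned or down.
package discovery

import (
	"errors"
	"sync"
)

type Catalog struct {
	mu     sync.RWMutex
	cached map[string][]string // service -> last known-good endpoints
	lookup func(service string) ([]string, error)
}

func NewCatalog(lookup func(string) ([]string, error)) *Catalog {
	return &Catalog{cached: make(map[string][]string), lookup: lookup}
}

// Endpoints returns fresh registry data when possible and serves the cached
// list (possibly stale) when the registry is unreachable.
func (c *Catalog) Endpoints(service string) ([]string, error) {
	eps, err := c.lookup(service)
	if err == nil {
		c.mu.Lock()
		c.cached[service] = eps
		c.mu.Unlock()
		return eps, nil
	}

	c.mu.RLock()
	cached, ok := c.cached[service]
	c.mu.RUnlock()
	if ok {
		return cached, nil // stale beats empty during a partition
	}
	return nil, errors.New("registry unavailable and no cached endpoints for " + service)
}
```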
Scenario 3: DNS TTL Caching Prevents Failover. A critical service fails over from the primary region (us-east-1) to DR (us-west-2). The DNS record gets updated, and it has a 300s TTL. Should be fine, right? Except intermediate resolvers (ISP DNS, corporate recursive resolvers) cache aggressively. Some honor the TTL, others hold onto records for up to 24 hours. The result: 15% of traffic still hits the dead primary for hours after failover. Detection: compare per-region origin traffic against expected ratios post-failover. Monitor dns_query_count at both authoritative nameservers. Recovery: for planned failovers, lower the TTL to 30s at least 24 hours before the event. For unplanned failovers, accept the TTL propagation delay and focus on keeping the old IP responding, even if it just returns a redirect. For truly critical paths, use Anycast routing instead of DNS failover. Anycast reroutes in seconds via BGP, not minutes via DNS propagation.
Capacity Planning
CoreDNS in Kubernetes handles roughly 30,000 queries/second per instance on 2 vCPU. A typical Kubernetes cluster with 500 pods generates around 5,000 DNS queries/second. Route 53 supports effectively unlimited queries but charges $0.40 per million. Consul can handle about 10,000 service registrations per datacenter with sub-second health checking.
| Metric | Healthy Range | Warning | Action |
|---|---|---|---|
| DNS resolution P99 | < 5ms (cached), < 50ms (uncached) | > 100ms | Check resolver capacity, enable caching |
| CoreDNS pod CPU | < 60% | > 75% | Scale CoreDNS replicas |
| Registry sync latency | < 2s | > 10s | Check network, increase Raft throughput |
| Stale endpoint ratio | 0% | > 1% | Reduce health check interval |
| DNS query error rate | < 0.01% | > 0.1% | Investigate SERVFAIL responses |
Real-world numbers for context: Google runs its own global DNS infrastructure serving roughly 1.2 trillion queries/day through 8.8.8.8/8.8.4.4. Uber runs Consul clusters with about 8,000 services and 200,000 endpoints across multiple datacenters, processing around 50K registrations per minute during peak deploys. Netflix uses Eureka with roughly 600 microservices and handles about 100K registration events per hour using AP (availability-prioritized) semantics. Quick capacity formula: dns_instances = (pods * avg_queries_per_second_per_pod * 2) / per_instance_qps. A 1,000-pod cluster at 10 QPS/pod needs: (1000 * 10 * 2) / 30000 = ~1 instance (always run at least 3 for HA).
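The sizing formula as a small worked function; the constants mirror the numbers above and should be swapped for measured per-pod query rates.

```go
// Worked version of the capacity formula from the text.
package main

import (
	"fmt"
	"math"
)

func corednsReplicas(pods, qpsPerPod, perInstanceQPS float64) int {
	// dns_instances = (pods * avg_queries_per_second_per_pod * 2) / per_instance_qps,
	// rounded up, never below 3 replicas for HA.
	n := int(math.Ceil(pods * qpsPerPod * 2 / perInstanceQPS))
	if n < 3 {
		n = 3
	}
	return n
}

func main() {
	// 1,000 pods at 10 QPS each against ~30,000 QPS per CoreDNS instance.
	fmt.Println(corednsReplicas(1000, 10, 30000)) // formula gives 1, HA floor raises it to 3
}
```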
Architecture Decision Record
ADR: Service Discovery Strategy
Context: Choosing between DNS-based, registry-based, and mesh-based service discovery affects system reliability, latency, and how much operational work the team takes on.
| Criteria (Weight) | DNS-Based (CoreDNS) | Registry (Consul/Eureka) | Mesh-Based (Istio/Linkerd) |
|---|---|---|---|
| Ops complexity (25%) | Low | Medium | High |
| Failover speed (25%) | 30s-5min (TTL bound) | 1-10s (health check) | < 1s (proxy-level) |
| Client changes (20%) | None (standard DNS) | SDK/sidecar required | None (transparent proxy) |
| Multi-platform (15%) | Any (universal protocol) | VMs + containers | Kubernetes-centric |
| Observability (15%) | Minimal | Service health dashboard | Full L7 metrics |
Decision framework:
- Kubernetes-only, fewer than 50 services, team under 30 engineers. Use Kubernetes-native DNS (CoreDNS + Services). It works out of the box, requires zero additional infrastructure, and handles 90% of use cases. Tune ndots to 2 in pod DNS config to reduce query amplification. Seriously, do not overcomplicate this.
- Kubernetes + VMs, 50-200 services. Deploy Consul as a service registry with a DNS interface. It bridges the Kubernetes and VM worlds, provides health checking, and the KV store is useful for dynamic configuration. For teams already in the HashiCorp ecosystem (Nomad, Vault), this is the natural fit.
- Multi-region, more than 200 services, strong consistency required. Use Consul with WAN federation across datacenters. Configure translate_wan_addrs = true for cross-datacenter communication. The cost is added latency, but the payoff is consistent service catalogs across regions.
- Already running a service mesh, more than 100 services. Use the mesh's built-in discovery (Istio's Pilot, Linkerd's Destination). Do not run a separate discovery system on top of it. The mesh already maintains real-time endpoint tables, and layering another system creates confusion about which is the source of truth.
Key Points
- DNS maps human-readable names to IP addresses. It is, quite literally, the internet's phone book.
- Service discovery lets microservices find each other at runtime instead of relying on hardcoded addresses.
- Client-side vs server-side discovery: different trade-offs in complexity and load distribution.
- TTL management matters more than people think. Too short and it hammers the DNS servers; too long and failover stalls.
- Health-aware DNS (Route 53, Consul) pulls unhealthy endpoints out of responses automatically.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Consul | Open Source | Service mesh, health checks, KV store | Medium-Enterprise |
| CoreDNS | Open Source | Kubernetes DNS, plugin-based | Medium-Enterprise |
| AWS Route 53 | Managed | Global DNS, health checks, failover routing | Small-Enterprise |
| etcd | Open Source | Kubernetes backing store, strong consistency | Medium-Large |
Common Mistakes
- Caching DNS results forever in application code. Ignoring the TTL means routing traffic to dead hosts.
- Not retrying with fresh DNS resolution after a connection failure. The cached IP might be the problem. A sketch of the fix follows this list.
- Using DNS for load balancing without realizing that most clients cache the first resolved IP and stick with it.
- Running the service registry as a single instance, which turns it into a single point of failure for the entire fleet.
- Forgetting to monitor DNS resolution latency. Slow DNS adds hidden latency to every single request.
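A sketch of the re-resolution fix for the second mistake above, with an illustrative hostname and retry count; production code would also rotate across the returned addresses and back off between attempts.

```go
// Sketch: resolve the name on every connection attempt so a failover that
// changed the A record is picked up instead of reusing a stale cached IP.
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

func dialService(host, port string, attempts int) (net.Conn, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		ips, err := net.DefaultResolver.LookupIPAddr(ctx, host)
		cancel()
		if err != nil {
			lastErr = err
			time.Sleep(time.Second)
			continue
		}
		if len(ips) == 0 {
			lastErr = fmt.Errorf("no addresses for %s", host)
			time.Sleep(time.Second)
			continue
		}
		conn, err := net.DialTimeout("tcp", net.JoinHostPort(ips[0].IP.String(), port), 2*time.Second)
		if err == nil {
			return conn, nil
		}
		lastErr = err
		time.Sleep(time.Second)
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	conn, err := dialService("api.example.com", "443", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()
	fmt.Println("connected to", conn.RemoteAddr())
}
```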