Service Discovery & mDNS
Service discovery replaces hardcoded addresses with a dynamic registry where services register on startup, deregister on shutdown, and clients look up healthy instances at call time.
The Problem
In a dynamic environment where service instances start, stop, scale, and move across hosts, hardcoded IP addresses and port numbers break immediately. Services need a mechanism to find each other at runtime.
Mental Model
mDNS is like shouting in a room — everyone nearby hears the announcement. A service registry is like a hotel front desk — check in on arrival, the desk tells visitors which room to find the guest in, and checkout removes the entry.
How It Works
Every networked system faces the same fundamental problem: service A needs to talk to service B, and service B's network location is not known in advance. In a world of containers, autoscaling, and rolling deployments, IP addresses and ports change constantly. Service discovery is the infrastructure that lets services find each other at runtime without hardcoded configuration.
The Two Patterns
Service discovery implementations split into two categories, and the choice has architectural consequences.
Client-side discovery means the client queries a service registry directly, receives a list of healthy instances, and picks one using its own load balancing logic (round-robin, random, weighted). The client is responsible for instance selection and failure handling. Netflix Eureka with Ribbon is the canonical example — the Java client fetches a registry snapshot and makes routing decisions locally.
Server-side discovery means the client sends a request to a well-known endpoint (a load balancer or DNS name), and an intermediary routes the request to a healthy instance. The client neither knows nor cares about individual instances. Kubernetes ClusterIP Services work this way — the client resolves the Service DNS name to a single virtual IP, and kube-proxy or IPVS routes the packet to a backend pod.
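To make the split concrete, here is a minimal Go sketch of the client-side pattern. The registry lookup itself is elided (in Consul it would be a health API query, in Eureka a registry fetch); the instance list and addresses are illustrative. The point is that instance selection, here a simple round-robin, lives inside the caller:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Instance is one healthy backend as returned by a registry lookup.
type Instance struct {
	Host string
	Port int
}

// RoundRobin picks instances in order. The counter lives in the client
// process, which is exactly what "client-side discovery" means.
type RoundRobin struct {
	n         uint64
	instances []Instance
}

func (r *RoundRobin) Next() Instance {
	i := atomic.AddUint64(&r.n, 1)
	return r.instances[int(i)%len(r.instances)]
}

func main() {
	// In a real client this list comes from the registry and is
	// refreshed periodically; here it is hardcoded for illustration.
	lb := &RoundRobin{instances: []Instance{
		{"10.0.0.5", 8080}, {"10.0.0.6", 8080}, {"10.0.0.7", 8080},
	}}
	for i := 0; i < 4; i++ {
		inst := lb.Next()
		fmt.Printf("routing request %d to %s:%d\n", i, inst.Host, inst.Port)
	}
}
```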
| Aspect | Client-Side | Server-Side |
|---|---|---|
| Load balancing logic | In the client library | In the intermediary |
| Failure handling | Client retries another instance | Intermediary retries or returns error |
| Language coupling | Requires a discovery-aware client library per language | Language-agnostic — any HTTP client works |
| Network hops | Direct to instance (no middlebox) | Extra hop through LB or proxy |
| Operational complexity | Client libraries must stay in sync | Central point to monitor and configure |
mDNS and DNS-SD: Zero-Configuration Discovery
Before cloud-native service meshes existed, Apple solved local network discovery with Bonjour — a combination of Multicast DNS (mDNS) and DNS-Based Service Discovery (DNS-SD).
mDNS eliminates the need for a DNS server on local networks. Every device runs an mDNS responder that listens on multicast address 224.0.0.251 port 5353. When a device wants to resolve myprinter.local, it sends a multicast query. The device named myprinter hears the query and responds directly — no central authority required.
The .local TLD is reserved for mDNS by RFC 6762. Any device can claim a name in this namespace by responding to queries for it. Name conflicts are handled through a probe-and-announce mechanism: before claiming a name, a device sends probe queries to check if the name is already taken. If a conflict is detected, the device appends a number and tries again.
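You can watch this resolution happen with a plain DNS client: on most systems, pointing dig at the multicast group elicits a direct answer from the named device (the hostname below is illustrative):

```bash
# Send a DNS query for an A record straight to the mDNS multicast group
dig @224.0.0.251 -p 5353 myprinter.local A
```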
DNS-SD layers service type discovery on top of standard DNS records. A device advertising an HTTP server publishes:
```
_http._tcp.local.                PTR  My Web Server._http._tcp.local.
My Web Server._http._tcp.local.  SRV  0 0 8080 myhost.local.
My Web Server._http._tcp.local.  TXT  "path=/api" "version=2.1"
```
The PTR record lists instances of a service type. The SRV record provides the host and port. The TXT record carries arbitrary key-value metadata. Any device on the network can discover all HTTP servers by querying _http._tcp.local. — no configuration file, no central registry, no manual setup.
```bash
# Browse for services on macOS
dns-sd -B _http._tcp local.

# Browse for services on Linux (Avahi)
avahi-browse -art

# Resolve a specific service instance
dns-sd -L "My Web Server" _http._tcp local.
```
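The same browse operation is available programmatically. A sketch using the third-party grandcat/zeroconf Go library, one of several mDNS/DNS-SD implementations; treat the API usage as illustrative rather than canonical:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/grandcat/zeroconf"
)

func main() {
	// A resolver that sends multicast queries on the default interfaces.
	resolver, err := zeroconf.NewResolver(nil)
	if err != nil {
		log.Fatal(err)
	}

	// Each discovered instance arrives as a ServiceEntry carrying the
	// SRV host/port and TXT metadata described above.
	entries := make(chan *zeroconf.ServiceEntry)
	go func() {
		for entry := range entries {
			log.Printf("found %q at %s:%d %v",
				entry.Instance, entry.HostName, entry.Port, entry.Text)
		}
	}()

	// Browse for all _http._tcp instances for five seconds.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := resolver.Browse(ctx, "_http._tcp", "local.", entries); err != nil {
		log.Fatal(err)
	}
	<-ctx.Done()
}
```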
mDNS is limited to the local broadcast domain — multicast does not cross routers without an mDNS reflector. This makes it ideal for local development, IoT devices, printers, and media streaming (AirPlay, Chromecast), but it does not scale to datacenter service discovery.
Consul: Production-Grade Discovery
HashiCorp Consul is the most widely deployed standalone service discovery system. Every node in the infrastructure runs a Consul agent. Services register with their local agent, specifying a name, port, and one or more health checks.
```json
{
  "service": {
    "name": "payment-api",
    "port": 8080,
    "tags": ["v2", "production"],
    "check": {
      "http": "http://localhost:8080/health",
      "interval": "10s",
      "timeout": "2s",
      "deregister_critical_service_after": "30s"
    }
  }
}
```
Consul agents form a gossip-based cluster using the Serf protocol. Server nodes (typically 3 or 5 per datacenter) maintain a strongly consistent service catalog using Raft consensus. Client agents forward registration and queries to the servers.
Discovery happens through two interfaces: DNS (query payment-api.service.consul and receive A records for healthy instances) or HTTP API (query /v1/health/service/payment-api?passing=true and receive a JSON list with full metadata including tags, addresses, and health status).
Consul's multi-datacenter architecture is where it differentiates. Each datacenter has its own Raft quorum, but datacenters link through WAN gossip. A service in dc1 can discover services in dc2 by querying payment-api.service.dc2.consul. This cross-datacenter awareness makes Consul the standard for hybrid-cloud and multi-region service discovery.
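Both interfaces are easy to exercise against a local agent (default ports: 8600 for DNS, 8500 for HTTP):

```bash
# DNS interface: SRV records include the port, A records only the address
dig @127.0.0.1 -p 8600 payment-api.service.consul SRV

# HTTP interface: only instances with passing health checks
curl 'http://127.0.0.1:8500/v1/health/service/payment-api?passing=true'

# Cross-datacenter query via the WAN-federated servers
dig @127.0.0.1 -p 8600 payment-api.service.dc2.consul SRV
```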
Kubernetes Service Discovery
Kubernetes has service discovery built into the platform through CoreDNS and the Service abstraction.
ClusterIP Services (server-side discovery) create a virtual IP address. CoreDNS resolves payment-service.production.svc.cluster.local to this VIP. kube-proxy programs iptables or IPVS rules to distribute traffic from the VIP to healthy pod IPs. The client connects to one stable DNS name and never knows which pod handles the request.
Headless Services (client-side discovery) set clusterIP: None. CoreDNS returns A records for every ready pod IP instead of a single VIP. The client receives the full list and implements its own selection logic — critical for stateful workloads like databases where the client needs to connect to a specific replica (primary vs read replica).
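The difference between the two is a single field. A sketch of a headless Service manifest, matching the headless-db name used in the commands below:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: headless-db
  namespace: production
spec:
  clusterIP: None        # headless: DNS returns every ready pod IP
  selector:
    app: headless-db
  ports:
    - port: 5432
      targetPort: 5432
```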
EndpointSlices replaced the Endpoints API for scalability. A single Endpoints object for a Service with 5,000 pods produced a multi-megabyte object rewritten in full on every pod change. EndpointSlices split the list into chunks of 100 endpoints each, enabling incremental updates and reducing API server and etcd load dramatically.
```bash
# Resolve a ClusterIP service
nslookup payment-service.production.svc.cluster.local

# Resolve a headless service — returns all pod IPs
nslookup headless-db.production.svc.cluster.local

# Inspect EndpointSlices
kubectl get endpointslices -l kubernetes.io/service-name=payment-service -o wide
```
Health Checking: The Non-Negotiable Layer
A service registry without health checking is a liability. Dead instances stay registered, callers waste time connecting to them, and the system degrades to worse-than-no-discovery behavior. At least with hardcoded addresses, the failure is immediate and obvious.
Health checks come in three tiers:
Liveness — "Is the process alive?" A TCP connect succeeds or HTTP returns 200. Kubernetes liveness probes restart containers that fail this check.
Readiness — "Can the process handle traffic?" The service may be alive but still warming caches, loading ML models, or waiting for a database connection pool to fill. Kubernetes readiness probes remove unready pods from Service endpoints until they pass.
Deep health — "Are all critical dependencies reachable?" The health endpoint checks database connectivity, cache reachability, and downstream service availability. This prevents routing traffic to an instance that will fail every request because its database connection is severed.
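A deep health endpoint is a few lines in practice. A minimal Go sketch, assuming a service backed by a database/sql pool; the /healthz path matches the readiness probe below, and the Postgres driver and connection string are illustrative:

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"net/http"
	"time"

	_ "github.com/lib/pq" // Postgres driver; any database/sql driver works
)

// newHealthHandler reports healthy only when the database answers a
// ping within the deadline, so an instance with a severed DB connection
// is pulled from rotation instead of failing every request.
func newHealthHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()
		if err := db.PingContext(ctx); err != nil {
			http.Error(w, "db unreachable", http.StatusServiceUnavailable)
			return
		}
		w.Write([]byte("ok"))
	}
}

func main() {
	db, err := sql.Open("postgres", "postgres://localhost/app?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	http.HandleFunc("/healthz", newHealthHandler(db))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```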
```yaml
# Kubernetes readiness probe
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```
etcd and ZooKeeper: The Registry Backends
Both etcd and ZooKeeper serve as distributed, strongly consistent stores suitable for service registration.
etcd is the backing store for Kubernetes itself. Services register by writing keys with a lease (TTL). If the service fails to renew the lease, etcd automatically deletes the key. Watches on key prefixes provide real-time notification when instances come and go. The gRPC-based API is efficient and well-supported across languages.
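A registration sketch against the official go.etcd.io/etcd/client/v3 API; the key path, value, and TTL are illustrative choices:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a 15-second lease; the registration lives only as long as
	// the lease keeps being renewed.
	lease, err := cli.Grant(ctx, 15)
	if err != nil {
		log.Fatal(err)
	}

	// Register under a per-service prefix so lookups and watches can
	// list all instances with a single prefix query
	// (e.g. cli.Watch(ctx, "/services/payment-api/", clientv3.WithPrefix())).
	_, err = cli.Put(ctx, "/services/payment-api/10.0.0.7:8080",
		`{"addr":"10.0.0.7:8080","tags":["v2"]}`, clientv3.WithLease(lease.ID))
	if err != nil {
		log.Fatal(err)
	}

	// KeepAlive renews the lease in the background; if this process
	// dies, renewals stop and etcd deletes the key automatically.
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		log.Fatal(err)
	}
	for range ch { // drain keepalive acknowledgements
	}
}
```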
ZooKeeper uses ephemeral nodes tied to client sessions. A service registers by creating an ephemeral znode; when the service process dies, its ZooKeeper session expires and the node is automatically deleted. Kafka, HBase, and the Hadoop ecosystem have relied on this pattern for over a decade.
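The equivalent sketch with the go-zookeeper/zk client; the znode path is illustrative, and the ephemeral flag is what ties the registration to the session:

```go
package main

import (
	"log"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 10*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	// FlagEphemeral ties the znode to this session: if the process dies
	// and the session expires, ZooKeeper deletes the node automatically.
	// Parent znodes (/services/payment-api) are assumed to already exist.
	path, err := conn.Create("/services/payment-api/instance-1",
		[]byte(`{"addr":"10.0.0.7:8080"}`),
		zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	if err != nil {
		log.Fatal(err)
	}
	log.Println("registered at", path)
	select {} // stay alive; the registration vanishes when we don't
}
```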
The trade-off: etcd is simpler, API-friendly, and the Kubernetes standard. ZooKeeper has richer primitives (sequential nodes, version-checked updates, recipes for leader election) but is operationally heavier and tied to the Java ecosystem. New systems overwhelmingly choose etcd or Consul; ZooKeeper remains relevant where Kafka or Hadoop dependencies already exist.
Failure Modes Worth Designing For
Service discovery introduces its own failure modes beyond the services it discovers:
Registry unavailability. If Consul servers lose quorum or etcd's Raft cluster partitions, new registrations and lookups fail. The mitigation is client-side caching — cache the last-known-good instance list and use it as a fallback. Consul agents cache locally; Envoy xDS caches the last snapshot.
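A sketch of the fallback in Go, where queryRegistry is a hypothetical stand-in for whatever lookup the client library performs (Consul HTTP API, etcd prefix read, and so on):

```go
package registry

import "sync/atomic"

// Instance is one discovered backend.
type Instance struct{ Addr string }

// queryRegistry stands in for the real registry call; assign a concrete
// implementation before use.
var queryRegistry func(service string) ([]Instance, error)

// lastGood holds the most recent successful lookup result.
var lastGood atomic.Value // stores []Instance

// Lookup prefers fresh registry data but falls back to the last
// known-good instance list when the registry is unreachable.
func Lookup(service string) ([]Instance, error) {
	insts, err := queryRegistry(service)
	if err != nil {
		if cached, ok := lastGood.Load().([]Instance); ok {
			return cached, nil // stale, but better than failing outright
		}
		return nil, err
	}
	lastGood.Store(insts)
	return insts, nil
}
```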
Stale registrations. An instance crashes without deregistering. The registry continues to include it until the health check TTL expires. Mitigation: aggressive health check intervals (5-10s) combined with short deregistration delays. Implement graceful shutdown hooks that deregister before process exit, but never rely on graceful shutdown alone — processes get OOM-killed and SIGKILL'd.
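Graceful deregistration is a signal handler plus one API call. A sketch using the official Consul Go client, where the service ID payment-api-1 is illustrative and must match the ID used at registration:

```go
package main

import (
	"log"
	"os"
	"os/signal"
	"syscall"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// ... register the service, serve traffic ...

	// On SIGTERM or Ctrl-C, deregister before exiting. This is
	// best-effort only: deregister_critical_service_after remains the
	// safety net for SIGKILL and OOM kills.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM, os.Interrupt)
	<-sigs
	if err := client.Agent().ServiceDeregister("payment-api-1"); err != nil {
		log.Printf("deregister failed: %v", err)
	}
}
```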
Thundering herd on recovery. When a registry comes back after a partition, every client simultaneously re-fetches and reconnects. This can overload the registry or the newly discovered services. Mitigation: add jitter to retry intervals and connection attempts.
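Jitter is cheap to add. A sketch of full-jitter exponential backoff, with illustrative base and cap values:

```go
package retry

import (
	"math/rand"
	"time"
)

// Backoff returns a full-jitter delay for the given attempt: a random
// duration in [0, min(maxDelay, base<<attempt)). Randomizing over the
// whole interval spreads reconnects after a registry recovers instead
// of letting every client stampede at the same instant.
func Backoff(attempt int) time.Duration {
	const (
		base     = 100 * time.Millisecond
		maxDelay = 30 * time.Second
	)
	d := base << uint(attempt)
	if d <= 0 || d > maxDelay { // d <= 0 guards against shift overflow
		d = maxDelay
	}
	return time.Duration(rand.Int63n(int64(d)))
}
```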
Understanding these failure modes is what separates a discovery setup that works in demos from one that holds up during a 3 AM production incident.
Key Points
- Client-side discovery (the client queries a registry and picks an instance) gives maximum flexibility but pushes load balancing logic into every consumer
- Server-side discovery (a load balancer or DNS sits between client and registry) centralizes routing but adds a hop and a potential single point of failure
- mDNS uses multicast UDP on 224.0.0.251:5353 and the .local TLD — no infrastructure required, but limited to the local broadcast domain
- Kubernetes combines server-side discovery (ClusterIP Services resolved by CoreDNS) with client-side patterns (headless Services returning all pod IPs)
- Health checking is not optional — a registry full of dead instances is worse than no registry at all, because callers waste time connecting to corpses
Key Components
| Component | Role |
|---|---|
| Service Registry | A central or distributed data store that maintains the current set of available service instances and their network locations |
| Health Check | A periodic probe (HTTP, TCP, or gRPC) that verifies a service instance is alive and capable of handling requests |
| mDNS Responder | A daemon that answers multicast DNS queries on the .local domain without requiring a central DNS server — every device is its own authority |
| DNS-SD (Service Discovery) | A protocol layered on DNS that uses PTR, SRV, and TXT records to advertise service type, instance name, host, port, and metadata |
| Endpoint Slice | A Kubernetes API object that tracks the set of ready pod IPs for a Service, replacing the older monolithic Endpoints object for scalability |
When to Use
Use mDNS/DNS-SD for local network zero-configuration discovery (printers, IoT, development environments). Use Consul or etcd for multi-service, multi-datacenter production discovery with health checking. Use Kubernetes-native CoreDNS discovery when running entirely on Kubernetes.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Consul | Open Source | Multi-datacenter service discovery with built-in health checking, KV store, and service mesh (Connect) | Enterprise, multi-cloud |
| CoreDNS | Open Source | Kubernetes-native DNS-based service discovery with a plugin architecture for extensibility | Cloud-native clusters |
| etcd | Open Source | Distributed key-value store used as the backing store for Kubernetes and as a service registry for custom discovery | Cluster-level, strongly consistent |
| ZooKeeper | Open Source | Mature coordination service with ephemeral nodes for automatic deregistration — battle-tested in Hadoop and Kafka ecosystems | Enterprise, Java-centric |
Debug Checklist
- Check if the service is registered: consul catalog services or kubectl get endpoints <service-name> — missing endpoints mean the registration or readiness probe failed
- Verify health check status: query the agent's /v1/health/checks/<service-name> endpoint — an instance with a critical check stays in the catalog but is dropped from DNS answers and passing=true queries
- Test DNS resolution inside the cluster: kubectl run -it --rm debug --image=busybox -- nslookup my-service.my-namespace.svc.cluster.local
- For mDNS, verify the responder is running: avahi-browse -a (Linux) or dns-sd -B _http._tcp (macOS) to list all advertised services
- Check for split-brain in the registry: compare service listings from multiple Consul agents or etcd members — inconsistency indicates quorum loss or network partition
Common Mistakes
- Registering a service instance without a health check. The instance crashes, the registry still routes traffic to it, and callers see connection refused for the full TTL.
- Using DNS-based discovery with high TTLs for rapidly scaling services. DNS caches stale records — a service that scaled from 3 to 30 instances still gets traffic to only the original 3.
- Confusing Kubernetes ClusterIP Services with headless Services. ClusterIP gives a single virtual IP (server-side discovery). Headless returns all pod IPs (client-side discovery). The load balancing behavior is fundamentally different.
- Running mDNS across subnets without a reflector. mDNS is multicast-scoped to the local link — it does not cross routers without an explicit mDNS reflector or gateway.
- Treating service discovery as fire-and-forget. Registry data goes stale when instances fail to deregister on shutdown. Implement graceful deregistration in the shutdown hook AND rely on TTL-based expiry as a safety net.
Real World Usage
- Apple's Bonjour (mDNS + DNS-SD) powers AirPlay, AirPrint, and zero-configuration device discovery across every macOS and iOS device
- Kubernetes runs CoreDNS as the cluster DNS, resolving service-name.namespace.svc.cluster.local to ClusterIP or pod IPs for every service
- HashiCorp Consul provides multi-datacenter service discovery for organizations running heterogeneous infrastructure across cloud providers
- Netflix built Eureka for client-side discovery in the JVM ecosystem, registering thousands of microservice instances across AWS regions