Service Discovery & Registration
Why It Exists
Anyone who has hard-coded an IP address into a config file and had it work for years was running traditional infrastructure. That approach falls apart the moment containers enter the picture.
Pods get created and destroyed during scaling events, deployments, and node failures. IPs change constantly. Maintaining a static list of endpoints is impossible when the infrastructure treats instances as disposable. Service discovery provides a stable way for services to find each other dynamically, even as everything underneath is churning.
This is not optional in a microservices world. Without it, one deployment can route traffic into a void.
How It Works
Registration Patterns
Self-registration. The service instance registers itself with the registry on startup and sends periodic heartbeats. If the heartbeats stop, the registry drops the instance. Eureka works this way: each service bundles a Eureka client library that handles registration and heartbeats. Simple to set up, but it couples every service to the registry client.
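A minimal sketch of the pattern against Consul's agent HTTP API, assuming an agent reachable on localhost:8500 and the `requests` library; the service name, instance ID, address, and port are illustrative placeholders:

```python
import time
import requests

CONSUL = "http://localhost:8500"   # assumption: Consul agent on this host
SERVICE_ID = "orders-api-1"        # placeholder instance ID
HEARTBEAT_SECONDS = 10             # must be comfortably under the TTL below

def register():
    # Register with a TTL check: if heartbeats stop, Consul marks the
    # instance critical and, after 90s, deregisters it automatically.
    resp = requests.put(f"{CONSUL}/v1/agent/service/register", json={
        "Name": "orders-api",          # placeholder service name
        "ID": SERVICE_ID,
        "Address": "10.0.1.17",        # placeholder instance IP
        "Port": 8080,
        "Check": {"TTL": "30s", "DeregisterCriticalServiceAfter": "90s"},
    })
    resp.raise_for_status()

if __name__ == "__main__":
    register()
    try:
        while True:
            # Passing the TTL check is the heartbeat.
            requests.put(f"{CONSUL}/v1/agent/check/pass/service:{SERVICE_ID}")
            time.sleep(HEARTBEAT_SECONDS)
    finally:
        # Graceful deregistration on shutdown (see Production Considerations).
        requests.put(f"{CONSUL}/v1/agent/service/deregister/{SERVICE_ID}")
```

The TTL check inverts control: instead of the registry probing the service, the service proves liveness by checking in, and the deregistration timeout cleans up instances that die without saying goodbye.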
Third-party registration. An external agent watches the infrastructure and handles registration automatically. In Kubernetes, the kubelet and controller manager play this role. When a pod starts and passes readiness checks, its IP gets added to the Service's EndpointSlices. The service itself knows nothing about the registry. Cleaner separation of concerns, but infrastructure-level support is needed to pull it off.
Discovery Patterns
Client-side discovery. The client queries the registry, gets back a list of available instances, and picks one using its own load-balancing logic (round robin, random, least connections, whatever). Netflix Ribbon and gRPC's built-in resolver use this pattern. The result is fine-grained control, but every service needs a discovery-aware client library. That is real overhead to maintain.
Server-side discovery. The client just sends requests to a load balancer or router. The load balancer queries the registry and forwards the request to a healthy instance. The client never touches the registry. Kubernetes Services, AWS ALB, and Consul Connect all work this way. Operationally simpler, but it adds an extra network hop.
In practice, most teams should start with server-side discovery. Client-side discovery makes sense when there are specific load-balancing needs that a generic LB cannot handle.
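To make the client-side pattern concrete, here is a sketch against Consul's `/v1/health/service` endpoint, assuming a local agent; `orders-api` is a placeholder name. The `passing=true` filter is what makes the routing health-aware:

```python
import itertools
import requests

CONSUL = "http://localhost:8500"  # assumption: local Consul agent

def healthy_instances(service: str) -> list[str]:
    # passing=true filters out instances with failing health checks,
    # so callers never receive endpoints the registry knows are dead.
    resp = requests.get(f"{CONSUL}/v1/health/service/{service}",
                        params={"passing": "true"})
    resp.raise_for_status()
    return [f"{e['Service']['Address']}:{e['Service']['Port']}"
            for e in resp.json()]

# Round-robin over a snapshot; a real client re-resolves periodically
# instead of holding one list forever.
endpoints = itertools.cycle(healthy_instances("orders-api"))
print(next(endpoints))
```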
Kubernetes Services Deep-Dive
Kubernetes provides four Service types, plus the headless variant of ClusterIP. Each solves a different problem:
| Type | Behavior | Use Case |
|---|---|---|
| ClusterIP | Virtual IP accessible only within the cluster | Default internal service communication |
| NodePort | Exposes service on each node's IP at a static port (30000-32767) | Development, non-production access |
| LoadBalancer | Provisions a cloud load balancer (e.g., AWS ELB, Google Cloud Load Balancing) pointing to NodePorts | Production external traffic |
| ExternalName | CNAME alias to an external DNS name | Bridging to external services (RDS, SaaS APIs) |
| Headless (`clusterIP: None`) | No virtual IP; DNS returns all pod IPs directly | StatefulSets, client-side load balancing |
When a Service is created, CoreDNS creates a DNS record: `<service>.<namespace>.svc.cluster.local`. For ClusterIP services, this resolves to the virtual IP. For headless services, it returns A records for each individual pod, so clients can connect directly. EndpointSlices (which replaced the older Endpoints API) track which pod IPs back each service and update within seconds of pod readiness changes.
Headless services are underused. For sticky sessions or talking to specific pods in a StatefulSet (like individual Kafka brokers), they are exactly the right tool.
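A quick illustration of the difference, as a Python sketch that only works from inside a cluster; the `kafka` service and `infra` namespace names are hypothetical:

```python
import random
import socket

# Hypothetical headless Service "kafka" in namespace "infra"; this name
# only resolves from inside the cluster.
HEADLESS = "kafka.infra.svc.cluster.local"

def pod_ips(hostname: str, port: int) -> list[str]:
    # For a headless Service, DNS returns one A record per ready pod
    # instead of a single virtual ClusterIP.
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

ips = pod_ips(HEADLESS, 9092)
print(ips)                    # e.g. ['10.0.1.4', '10.0.2.9', '10.0.3.1']
target = random.choice(ips)   # client-side load balancing across pods
```

With a regular ClusterIP service, the same lookup would return a single virtual IP; the headless form hands the full pod list to the client.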
Production Considerations
- Graceful shutdown. Services must deregister before stopping. In Kubernetes, set `terminationGracePeriodSeconds` and handle SIGTERM to finish in-flight requests. Use a `preStop` hook to add a short delay. This gives kube-proxy time to remove the pod from iptables rules before the pod actually shuts down. Skip this step and intermittent 502s during deployments are guaranteed (see the sketch after this list).
- DNS TTL tuning. Default DNS TTL in CoreDNS is 30 seconds. For fast failover, drop it to 5 seconds, but know that this increases DNS query volume. Java applications are especially nasty here. The JVM caches DNS forever by default. Set `networkaddress.cache.ttl=10` in `java.security` or expect hours spent debugging why Java services keep hitting dead pods.
- Registry high availability. Consul needs a quorum of 3+ servers. Eureka is AP (eventually consistent), meaning it keeps serving stale data during partitions instead of failing. Know the registry's consistency model before picking it. This matters more than most teams expect during an actual outage.
- Cross-cluster discovery. For multi-cluster setups, look at Consul's WAN federation, Istio's multi-cluster mesh, or external DNS (Route53, Cloud DNS) with health-check-based record management. Each approach has tradeoffs around latency, consistency, and operational burden.
- Health check design. Readiness probes should verify downstream dependencies (database connectivity, cache availability). Liveness probes should check only the process itself. A liveness probe that checks the database will cause cascading restarts during database outages. I have watched this take down an entire cluster. Do not make this mistake. The sketch after this list shows the split.
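Tying the graceful-shutdown and health-check bullets together, here is a minimal Python sketch of a pod-friendly process: separate liveness and readiness endpoints, readiness failing during drain, and a SIGTERM handler that waits out endpoint propagation before exiting. The endpoint paths, port, and the `check_database` helper are placeholders:

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

draining = False  # flipped by the SIGTERM handler

def check_database() -> bool:
    # Placeholder: a real readiness check verifies downstream
    # dependencies (database connectivity, cache availability).
    return True

class Probes(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: process-only. Never check the database here, or a
            # database outage turns into cascading restarts.
            self.send_response(200)
        elif self.path == "/readyz":
            # Readiness: fail while draining so traffic stops being routed
            # here, and fail when a downstream dependency is unavailable.
            self.send_response(200 if not draining and check_database() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep high-frequency probe traffic out of the logs

def on_sigterm(signum, frame):
    global draining
    draining = True  # /readyz now returns 503

signal.signal(signal.SIGTERM, on_sigterm)

server = ThreadingHTTPServer(("", 8080), Probes)
threading.Thread(target=server.serve_forever, daemon=True).start()

while not draining:      # serve until SIGTERM arrives
    time.sleep(1)
time.sleep(10)           # grace window: in-flight requests finish while
                         # kube-proxy removes this pod from its rules
server.shutdown()
```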
Failure Scenarios
Scenario 1: Split-Brain Registry During Network Partition. In a Consul cluster spanning two data centers, a network partition isolates the two sites. Each side tries to elect its own leader. Services in DC-1 cannot discover services in DC-2, and new registrations on one side are invisible to the other. When the partition heals, conflicting state needs reconciliation. Detection: `consul.raft.leader.lastContact` exceeds 200ms, `consul.serf.member.flap` spikes, cross-DC service resolution failures in application logs. Recovery: Consul's Raft consensus actually prevents true split-brain because the minority side loses quorum and becomes read-only. But WAN-federated clusters may still serve stale data. Prevention: run 5 Consul servers (tolerates 2 failures), deploy across 3 AZs, and put circuit breakers on cross-DC calls.
Scenario 2: Stale DNS Caching from Infinite TTLs. After a deployment scales down from 10 to 3 pods, CoreDNS updates within seconds. But Java services caching DNS with the JVM's default infinite TTL keep routing to terminated pod IPs. Requests to dead IPs time out after 30 seconds, and P99 latency spikes from 50ms to 30s for affected callers. Detection: client-side connection timeout errors; `kube_endpoint_address_available` shows 3 but the client is hitting 10 IPs. Recovery: restart affected Java services (which flushes the DNS cache) and set `networkaddress.cache.ttl=5` in `java.security`. Prevention: enforce DNS TTL settings via JVM arguments in the base Docker images. Bake it into the image so no team can forget it. Some orgs use sidecar-based service discovery (Envoy) to sidestep DNS caching entirely.
Scenario 3: Thundering Herd on Registry Recovery. The Consul cluster goes down for 3 minutes during maintenance. All services lose heartbeat confirmation and start aggressive reconnection attempts. When Consul comes back, 5,000 services simultaneously re-register and re-fetch the full service catalog. The registry gets crushed, enters a degraded state, and takes another 10 minutes to stabilize. Detection: `consul.client.rpc.exceeded` errors, `consul.runtime.alloc_bytes` spikes, leader election churn. Recovery: implement exponential backoff with jitter on client reconnection (Consul supports `retry_join` with backoff). This is the kind of thing to test before it bites in production. Netflix's Eureka avoids this by design with AP semantics. Clients cache the last-known registry and keep operating during outages, then re-sync incrementally on recovery. Stale data beats no data.
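A sketch of the client-side fix, with capped exponential backoff and full jitter; `connect` stands in for whatever registry call the client makes on reconnect:

```python
import random
import time

def connect_with_backoff(connect, base=1.0, cap=60.0, max_attempts=10):
    """Retry `connect` with capped exponential backoff and full jitter.

    Sleeping uniformly in [0, backoff] spreads thousands of clients'
    reconnects across the window instead of synchronizing them into
    a single stampede when the registry comes back.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            backoff = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))
    raise ConnectionError("registry still unreachable after retries")
```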
Capacity Planning
| Component | Threshold / Guideline | Real-World Reference |
|---|---|---|
| Registry cluster size | 3 servers (small), 5 servers (large); always odd | HashiCorp recommends 5 Consul servers for production |
| Services per registry | < 10K services with < 100K total instances | Alibaba Nacos: handles 1M+ service instances in production |
| Health check interval | 10-30s (shorter = faster failover, more load on registry) | Netflix Eureka: 30s heartbeat, 90s eviction timeout |
| DNS query volume | CoreDNS: ~10K QPS per replica; scale replicas with cluster size | Google: internal DNS handles 10M+ QPS across fleet |
| EndpointSlice updates | < 1s propagation for pod readiness changes | Kubernetes: EndpointSlices reduce API load by 100x vs Endpoints at scale |
| Cross-cluster discovery | < 5s for service registration propagation across regions | Stripe: < 3s cross-region with dedicated Consul WAN gossip ring |
Key formulas: Registry QPS = services * (1/heartbeat_interval) + discovery_queries_per_sec. For 5,000 services with 30s heartbeats and 500 lookups/sec: 5000/30 + 500 = 667 QPS. A single Consul server handles ~5K QPS without breaking a sweat. DNS planning: CoreDNS replicas = ceil(cluster_pods / 5000) as a baseline. Each replica handles ~10K QPS, and each pod generates roughly 2 QPS of DNS queries during normal operation. If services talk to each other heavily, double that estimate.
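The same arithmetic as a small calculator, using only the numbers from the paragraph above:

```python
import math

def registry_qps(services: int, heartbeat_interval_s: float,
                 lookups_per_sec: float) -> float:
    # Registry QPS = services * (1 / heartbeat_interval) + discovery queries
    return services / heartbeat_interval_s + lookups_per_sec

def coredns_replicas(cluster_pods: int, pods_per_replica: int = 5000) -> int:
    # Baseline from the text: one CoreDNS replica (~10K QPS) per 5,000 pods.
    return math.ceil(cluster_pods / pods_per_replica)

print(round(registry_qps(5000, 30, 500)))  # 667, matching the worked example
print(coredns_replicas(12_000))            # 3 replicas for a 12K-pod cluster
```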
Architecture Decision Record
ADR: Service Discovery Approach Selection
Context: Choosing between Kubernetes-native Services, a dedicated registry (Consul/Eureka), or service mesh-based discovery for a microservices platform.
| Criteria (Weight) | K8s Services + CoreDNS | Consul | Eureka | Service Mesh (Istio/Linkerd) |
|---|---|---|---|---|
| Multi-platform support (20%) | K8s only | VMs, containers, bare metal, multi-cloud | Primarily JVM/Spring | K8s only (Istio); K8s-focused (Linkerd) |
| Operational complexity (20%) | None (built into K8s) | Medium (Consul cluster ops) | Low (AP, self-healing) | High (control plane, sidecar injection) |
| Health checking (15%) | Readiness/liveness probes | Native TCP/HTTP/gRPC checks + scripts | Client heartbeat only | L7 health checking via sidecar |
| Consistency model (15%) | Strongly consistent (etcd-backed) | CP by default (Raft consensus) | AP (eventually consistent, self-preservation) | Strongly consistent (control plane) |
| Advanced routing (15%) | Basic (round-robin via kube-proxy/IPVS) | DNS-based + Connect for L7 | Client-side load balancing (Ribbon) | Full L7: retries, circuit breaking, canary, mTLS |
| Configuration management (10%) | ConfigMaps/Secrets (separate concern) | Integrated KV store | Not included | Not included |
| Team expertise required (5%) | K8s platform knowledge | HashiCorp ecosystem | Spring Cloud ecosystem | Deep networking + K8s expertise |
Decision guidance: For pure Kubernetes environments, start with native Services. It requires zero additional infrastructure and covers 80% of discovery needs. Do not add complexity until it's actually needed.
Add Consul when running a mixed fleet (VMs + containers), when cross-cluster or multi-cloud discovery is needed, or when integrated key-value configuration is wanted. Consul earns its keep in heterogeneous environments.
Pick Eureka only if the team is deeply invested in Spring Cloud. It works well (Netflix ran 100K+ instances on it), but it is a narrow solution tied to one ecosystem. For teams not already in that ecosystem, do not adopt it just for discovery.
Invest in a service mesh (Istio or Linkerd) when mTLS everywhere, advanced traffic management (canary releases, fault injection), or L7 observability is needed. But be honest about the cost: at least 2 dedicated platform engineers are needed to run a mesh properly. Linkerd is significantly simpler than Istio and is the right starting point if the team is new to service mesh.
Many organizations at scale run a hybrid: K8s Services for intra-cluster communication, a service mesh for cross-service policy, and Consul for VM-based legacy services that are still migrating. That is fine. Purity is overrated. Pick what works for each layer and move on.
Key Points
- Services need to find each other dynamically when IPs change constantly in containerized environments
- Self-registration means the service handles its own registry updates. Third-party registration offloads that to an external agent.
- Client-side discovery gives callers control over load balancing. Server-side is simpler but adds a hop.
- Health-aware routing pulls unhealthy instances out of the pool automatically
- In Kubernetes, Services and EndpointSlices are the built-in mechanism, backed by kube-proxy and CoreDNS
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Kubernetes Services | Open Source | Built-in K8s discovery, ClusterIP/NodePort/LoadBalancer | Medium-Enterprise |
| Consul | Open Source | Multi-platform, health checks, KV config | Medium-Enterprise |
| Eureka | Open Source | Spring Cloud ecosystem, AP semantics | Medium-Large |
| Nacos | Open Source | Service discovery + config management, popular in Java | Medium-Large |
Common Mistakes
- Hardcoding service endpoints in config files. This breaks the moment instances change.
- Skipping graceful deregistration, so clients keep routing to terminated instances
- Ignoring DNS TTL caching. Stale DNS records send requests to dead instances.
- Not separating liveness from readiness. A service that is still booting should not receive traffic.
- Running a single registry instance, which turns the registry itself into a single point of failure