Service Discovery & Registration
Why It Exists
Anyone who has hard-coded an IP address into a config file and had it work for years was running traditional infrastructure. That approach falls apart the moment containers enter the picture.
Pods get created and destroyed during scaling events, deployments, and node failures. IPs change constantly. Maintaining a static list of endpoints is impossible when the infrastructure treats instances as disposable. Service discovery provides a stable way for services to find each other dynamically, even as everything underneath is churning.
This is not optional in a microservices world. Without it, one deployment can route traffic into a void.
How It Works
Registration Patterns
Self-registration. The service instance registers itself with the registry on startup and sends periodic heartbeats. If the heartbeats stop, the registry drops the instance. Eureka works this way: each service bundles a Eureka client library that handles registration and heartbeats. Simple to set up, but it couples every service to the registry client.
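A minimal sketch of the pattern against Consul's agent HTTP API, assuming an agent reachable on localhost:8500 and the `requests` library; the service name, instance ID, address, and port are illustrative placeholders:

```python
import time
import requests

CONSUL = "http://localhost:8500"   # assumption: Consul agent on this host
SERVICE_ID = "orders-api-1"        # placeholder instance ID
HEARTBEAT_SECONDS = 10             # must be comfortably under the TTL below

def register():
    # Register with a TTL check: if heartbeats stop, Consul marks the
    # instance critical and, after 90s, deregisters it automatically.
    resp = requests.put(f"{CONSUL}/v1/agent/service/register", json={
        "Name": "orders-api",          # placeholder service name
        "ID": SERVICE_ID,
        "Address": "10.0.1.17",        # placeholder instance IP
        "Port": 8080,
        "Check": {"TTL": "30s", "DeregisterCriticalServiceAfter": "90s"},
    })
    resp.raise_for_status()

if __name__ == "__main__":
    register()
    try:
        while True:
            # Passing the TTL check is the heartbeat.
            requests.put(f"{CONSUL}/v1/agent/check/pass/service:{SERVICE_ID}")
            time.sleep(HEARTBEAT_SECONDS)
    finally:
        # Graceful deregistration on shutdown (see Production Considerations).
        requests.put(f"{CONSUL}/v1/agent/service/deregister/{SERVICE_ID}")
```

The TTL check inverts control: instead of the registry probing the service, the service proves liveness by checking in, and the deregistration timeout cleans up instances that die without saying goodbye.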
Third-party registration. An external agent watches the infrastructure and handles registration automatically. In Kubernetes, the kubelet and controller manager play this role. When a pod starts and passes readiness checks, its IP gets added to the Service's EndpointSlices. The service itself knows nothing about the registry. Cleaner separation of concerns, but infrastructure-level support is needed to pull it off.
Discovery Patterns
Client-side discovery. The client queries the registry, gets back a list of available instances, and picks one using its own load-balancing logic (round robin, random, least connections, whatever). Netflix Ribbon and gRPC's built-in resolver use this pattern. The result is fine-grained control, but every service needs a discovery-aware client library. That is real overhead to maintain.
Server-side discovery. The client just sends requests to a load balancer or router. The load balancer queries the registry and forwards the request to a healthy instance. The client never touches the registry. Kubernetes Services, AWS ALB, and Consul Connect all work this way. Operationally simpler, but it adds an extra network hop.
In practice, most teams should start with server-side discovery. Client-side discovery makes sense when there are specific load-balancing needs that a generic LB cannot handle.
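To make the client-side pattern concrete, here is a sketch against Consul's `/v1/health/service` endpoint, assuming a local agent; `orders-api` is a placeholder name. The `passing=true` filter is what makes the routing health-aware:

```python
import itertools
import requests

CONSUL = "http://localhost:8500"  # assumption: local Consul agent

def healthy_instances(service: str) -> list[str]:
    # passing=true filters out instances with failing health checks,
    # so callers never receive endpoints the registry knows are dead.
    resp = requests.get(f"{CONSUL}/v1/health/service/{service}",
                        params={"passing": "true"})
    resp.raise_for_status()
    return [f"{e['Service']['Address']}:{e['Service']['Port']}"
            for e in resp.json()]

# Round-robin over a snapshot; a real client re-resolves periodically
# instead of holding one list forever.
endpoints = itertools.cycle(healthy_instances("orders-api"))
print(next(endpoints))
```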
Kubernetes Services Deep-Dive
Kubernetes provides four Service types, plus the headless variant of ClusterIP. Each solves a different problem:
| Type | Behavior | Use Case |
|---|---|---|
| ClusterIP | Virtual IP accessible only within the cluster | Default internal service communication |
| NodePort | Exposes service on each node's IP at a static port (30000-32767) | Development, non-production access |
| LoadBalancer | Provisions a cloud load balancer (e.g., AWS ELB, Google Cloud Load Balancing) pointing to NodePorts | Production external traffic |
| ExternalName | CNAME alias to an external DNS name | Bridging to external services (RDS, SaaS APIs) |
| Headless (`clusterIP: None`) | No virtual IP; DNS returns all pod IPs directly | StatefulSets, client-side load balancing |
When a Service is created, CoreDNS creates a DNS record: `<service>.<namespace>.svc.cluster.local`. For ClusterIP services, this resolves to the virtual IP. For headless services, it returns A records for each individual pod, so clients can connect directly. EndpointSlices (which replaced the older Endpoints API) track which pod IPs back each service and update within seconds of pod readiness changes.
Headless services are underused. For sticky sessions or talking to specific pods in a StatefulSet (like individual Kafka brokers), they are exactly the right tool.
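A quick illustration of the difference, as a Python sketch that only works from inside a cluster; the `kafka` service and `infra` namespace names are hypothetical:

```python
import random
import socket

# Hypothetical headless Service "kafka" in namespace "infra"; this name
# only resolves from inside the cluster.
HEADLESS = "kafka.infra.svc.cluster.local"

def pod_ips(hostname: str, port: int) -> list[str]:
    # For a headless Service, DNS returns one A record per ready pod
    # instead of a single virtual ClusterIP.
    infos = socket.getaddrinfo(hostname, port, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})

ips = pod_ips(HEADLESS, 9092)
print(ips)                    # e.g. ['10.0.1.4', '10.0.2.9', '10.0.3.1']
target = random.choice(ips)   # client-side load balancing across pods
```

With a regular ClusterIP service, the same lookup would return a single virtual IP; the headless form hands the full pod list to the client.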
Production Considerations
- Graceful shutdown. Services must deregister before stopping. In Kubernetes, set `terminationGracePeriodSeconds` and handle SIGTERM to finish in-flight requests. Use a `preStop` hook to add a short delay. This gives kube-proxy time to remove the pod from iptables rules before the pod actually shuts down. Skip this step and intermittent 502s during deployments are guaranteed (see the sketch after this list).
- DNS TTL tuning. Default DNS TTL in CoreDNS is 30 seconds. For fast failover, drop it to 5 seconds, but know that this increases DNS query volume. Java applications are especially nasty here. The JVM caches DNS forever by default. Set `networkaddress.cache.ttl=10` in `java.security` or expect hours spent debugging why Java services keep hitting dead pods.
- Registry high availability. Consul needs a quorum of 3+ servers. Eureka is AP (eventually consistent), meaning it keeps serving stale data during partitions instead of failing. Know the registry's consistency model before picking it. This matters more than most teams expect during an actual outage.
- Cross-cluster discovery. For multi-cluster setups, look at Consul's WAN federation, Istio's multi-cluster mesh, or external DNS (Route53, Cloud DNS) with health-check-based record management. Each approach has tradeoffs around latency, consistency, and operational burden.
- Health check design. Readiness probes should verify downstream dependencies (database connectivity, cache availability). Liveness probes should check only the process itself. A liveness probe that checks the database will cause cascading restarts during database outages. I have watched this take down an entire cluster. Do not make this mistake. The sketch after this list shows the split.
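Tying the graceful-shutdown and health-check bullets together, here is a minimal Python sketch of a pod-friendly process: separate liveness and readiness endpoints, readiness failing during drain, and a SIGTERM handler that waits out endpoint propagation before exiting. The endpoint paths, port, and the `check_database` helper are placeholders:

```python
import signal
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

draining = False  # flipped by the SIGTERM handler

def check_database() -> bool:
    # Placeholder: a real readiness check verifies downstream
    # dependencies (database connectivity, cache availability).
    return True

class Probes(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: process-only. Never check the database here, or a
            # database outage turns into cascading restarts.
            self.send_response(200)
        elif self.path == "/readyz":
            # Readiness: fail while draining so traffic stops being routed
            # here, and fail when a downstream dependency is unavailable.
            self.send_response(200 if not draining and check_database() else 503)
        else:
            self.send_response(404)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep high-frequency probe traffic out of the logs

def on_sigterm(signum, frame):
    global draining
    draining = True  # /readyz now returns 503

signal.signal(signal.SIGTERM, on_sigterm)

server = ThreadingHTTPServer(("", 8080), Probes)
threading.Thread(target=server.serve_forever, daemon=True).start()

while not draining:      # serve until SIGTERM arrives
    time.sleep(1)
time.sleep(10)           # grace window: in-flight requests finish while
                         # kube-proxy removes this pod from its rules
server.shutdown()
```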
Failure Scenarios
Scenario 1: Split-Brain Registry During Network Partition. In a Consul cluster spanning two data centers, a network partition isolates the two sites. Each side tries to elect its own leader. Services in DC-1 cannot discover services in DC-2, and new registrations on one side are invisible to the other. When the partition heals, conflicting state needs reconciliation. Detection: `consul.raft.leader.lastContact` exceeds 200ms, `consul.serf.member.flap` spikes, cross-DC service resolution failures in application logs. Recovery: Consul's Raft consensus actually prevents true split-brain because the minority side loses quorum and becomes read-only. But WAN-federated clusters may still serve stale data. Prevention: run 5 Consul servers (tolerates 2 failures), deploy across 3 AZs, and put circuit breakers on cross-DC calls.
Scenario 2: Stale DNS Caching from Infinite TTLs. After a deployment scales down from 10 to 3 pods, CoreDNS updates within seconds. But Java services caching DNS with the JVM's default infinite TTL keep routing to terminated pod IPs. Requests to dead IPs time out after 30 seconds, and P99 latency spikes from 50ms to 30s for affected callers. Detection: client-side connection timeout errors; `kube_endpoint_address_available` shows 3 but the client is hitting 10 IPs. Recovery: restart affected Java services (which flushes the DNS cache) and set `networkaddress.cache.ttl=5` in `java.security`. Prevention: enforce DNS TTL settings via JVM arguments in the base Docker images. Bake it into the image so no team can forget it. Some orgs use sidecar-based service discovery (Envoy) to sidestep DNS caching entirely.
Scenario 3: Thundering Herd on Registry Recovery. The Consul cluster goes down for 3 minutes during maintenance. All services lose heartbeat confirmation and start aggressive reconnection attempts. When Consul comes back, 5,000 services simultaneously re-register and re-fetch the full service catalog. The registry gets crushed, enters a degraded state, and takes another 10 minutes to stabilize. Detection: `consul.client.rpc.exceeded` errors, `consul.runtime.alloc_bytes` spikes, leader election churn. Recovery: implement exponential backoff with jitter on client reconnection (Consul supports `retry_join` with backoff). This is the kind of thing to test before it bites in production. Netflix's Eureka avoids this by design with AP semantics. Clients cache the last-known registry and keep operating during outages, then re-sync incrementally on recovery. Stale data beats no data.
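A sketch of the client-side fix, with capped exponential backoff and full jitter; `connect` stands in for whatever registry call the client makes on reconnect:

```python
import random
import time

def connect_with_backoff(connect, base=1.0, cap=60.0, max_attempts=10):
    """Retry `connect` with capped exponential backoff and full jitter.

    Sleeping uniformly in [0, backoff] spreads thousands of clients'
    reconnects across the window instead of synchronizing them into
    a single stampede when the registry comes back.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            backoff = min(cap, base * 2 ** attempt)
            time.sleep(random.uniform(0, backoff))
    raise ConnectionError("registry still unreachable after retries")
```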
Capacity Planning
| Component | Threshold / Guideline | Real-World Reference |
|---|---|---|
| Registry cluster size | 3 servers (small), 5 servers (large); always odd | HashiCorp recommends 5 Consul servers for production |
| Services per registry | < 10K services with < 100K total instances | Alibaba Nacos: handles 1M+ service instances in production |
| Health check interval | 10-30s (shorter = faster failover, more load on registry) | Netflix Eureka: 30s heartbeat, 90s eviction timeout |
| DNS query volume | CoreDNS: ~10K QPS per replica; scale replicas with cluster size | Google: internal DNS handles 10M+ QPS across fleet |
| EndpointSlice updates | < 1s propagation for pod readiness changes | Kubernetes: EndpointSlices reduce API load by 100x vs Endpoints at scale |
| Cross-cluster discovery | < 5s for service registration propagation across regions | Stripe: < 3s cross-region with dedicated Consul WAN gossip ring |
Key formulas: Registry QPS = services * (1/heartbeat_interval) + discovery_queries_per_sec. For 5,000 services with 30s heartbeats and 500 lookups/sec: 5000/30 + 500 = 667 QPS. A single Consul server handles ~5K QPS without breaking a sweat. DNS planning: CoreDNS replicas = ceil(cluster_pods / 5000) as a baseline. Each replica handles ~10K QPS, and each pod generates roughly 2 QPS of DNS queries during normal operation. If services talk to each other heavily, double that estimate.
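The same arithmetic as a small calculator, using only the numbers from the paragraph above:

```python
import math

def registry_qps(services: int, heartbeat_interval_s: float,
                 lookups_per_sec: float) -> float:
    # Registry QPS = services * (1 / heartbeat_interval) + discovery queries
    return services / heartbeat_interval_s + lookups_per_sec

def coredns_replicas(cluster_pods: int, pods_per_replica: int = 5000) -> int:
    # Baseline from the text: one CoreDNS replica (~10K QPS) per 5,000 pods.
    return math.ceil(cluster_pods / pods_per_replica)

print(round(registry_qps(5000, 30, 500)))  # 667, matching the worked example
print(coredns_replicas(12_000))            # 3 replicas for a 12K-pod cluster
```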
Architecture Decision Record
ADR: Service Discovery Approach Selection
Context: Choosing between Kubernetes-native Services, a dedicated registry (Consul/Eureka), or service mesh-based discovery for a microservices platform.
| Criteria (Weight) | K8s Services + CoreDNS | Consul | Eureka | Service Mesh (Istio/Linkerd) |
|---|---|---|---|---|
| Multi-platform support (20%) | K8s only | VMs, containers, bare metal, multi-cloud | Primarily JVM/Spring | K8s only (Istio); K8s-focused (Linkerd) |
| Operational complexity (20%) | None (built into K8s) | Medium (Consul cluster ops) | Low (AP, self-healing) | High (control plane, sidecar injection) |
| Health checking (15%) | Readiness/liveness probes | Native TCP/HTTP/gRPC checks + scripts | Client heartbeat only | L7 health checking via sidecar |
| Consistency model (15%) | Strongly consistent (etcd-backed) | CP by default (Raft consensus) | AP (eventually consistent, self-preservation) | Strongly consistent (control plane) |
| Advanced routing (15%) | Basic (round-robin via kube-proxy/IPVS) | DNS-based + Connect for L7 | Client-side load balancing (Ribbon) | Full L7: retries, circuit breaking, canary, mTLS |
| Configuration management (10%) | ConfigMaps/Secrets (separate concern) | Integrated KV store | Not included | Not included |
| Team expertise required (5%) | K8s platform knowledge | HashiCorp ecosystem | Spring Cloud ecosystem | Deep networking + K8s expertise |
Decision guidance: For pure Kubernetes environments, start with native Services. It requires zero additional infrastructure and covers 80% of discovery needs. Do not add complexity until it's actually needed.
Add Consul when running a mixed fleet (VMs + containers), when cross-cluster or multi-cloud discovery is needed, or when integrated key-value configuration is wanted. Consul earns its keep in heterogeneous environments.
Pick Eureka only if the team is deeply invested in Spring Cloud. It works well (Netflix ran 100K+ instances on it), but it is a narrow solution tied to one ecosystem. For teams not already in that ecosystem, do not adopt it just for discovery.
Invest in a service mesh (Istio or Linkerd) when mTLS everywhere, advanced traffic management (canary releases, fault injection), or L7 observability is needed. But be honest about the cost: at least 2 dedicated platform engineers are needed to run a mesh properly. Linkerd is significantly simpler than Istio and is the right starting point if the team is new to service mesh.
Many organizations at scale run a hybrid: K8s Services for intra-cluster communication, a service mesh for cross-service policy, and Consul for VM-based legacy services that are still migrating. That is fine. Purity is overrated. Pick what works for each layer and move on.
Key Points
- Services need to find each other dynamically when IPs change constantly in containerized environments
- Self-registration means the service handles its own registry updates. Third-party registration offloads that to an external agent.
- Client-side discovery gives callers control over load balancing. Server-side is simpler but adds a hop.
- Health-aware routing pulls unhealthy instances out of the pool automatically
- In Kubernetes, Services and EndpointSlices are the built-in mechanism, backed by kube-proxy and CoreDNS
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Kubernetes Services | Open Source | Built-in K8s discovery, ClusterIP/NodePort/LoadBalancer | Medium-Enterprise |
| Consul | Open Source | Multi-platform, health checks, KV config | Medium-Enterprise |
| Eureka | Open Source | Spring Cloud ecosystem, AP semantics | Medium-Large |
| Nacos | Open Source | Service discovery + config management, popular in Java | Medium-Large |
Common Mistakes
- Hardcoding service endpoints in config files. This breaks the moment instances change.
- Skipping graceful deregistration, so clients keep routing to terminated instances
- Ignoring DNS TTL caching. Stale DNS records send requests to dead instances.
- Not separating liveness from readiness. A service that is still booting should not receive traffic.
- Running a single registry instance, which turns the registry itself into a single point of failure