Service Mesh
Architecture Diagram
Why It Exists
Once an organization goes from 5 services to 500, everything about inter-service communication gets painful. Every team needs mTLS, retries, circuit breaking, rate limiting, and observability. Without a mesh, each team builds these differently. Some don't build them at all. A service mesh pulls all of that networking logic out of application code and into infrastructure, so it works the same way everywhere.
The honest truth: most teams adopt a mesh too early. With fewer than 50 services, teams can probably get by with shared libraries and a good cert-manager setup. The mesh shines when the cross-cutting concerns start drowning teams in boilerplate.
How It Works
Data Plane
Every service pod gets an Envoy sidecar proxy injected automatically through a Kubernetes admission webhook. All inbound and outbound traffic routes through this proxy. What the proxy actually does:
- mTLS: Handles automatic certificate rotation and identity verification. No need to touch a certificate in application code.
- Load balancing: power-of-two-choices (P2C), least-request, or round-robin across upstream instances. Pick the algorithm that fits the traffic pattern.
- Circuit breaking: Opens the circuit after N consecutive failures, moves to half-open after a configurable timeout.
- Retries: Set retry budgets with backoff and choose which status codes trigger retries. Get this wrong and the consequences are painful (see failure scenarios below).
- Observability: Emits latency, error rate, and throughput metrics, plus access logs and trace spans. This is the feature that sells people on meshes, honestly.
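A minimal sketch of how a few of these knobs surface in Istio's DestinationRule API; the checkout service name, namespace, and thresholds are hypothetical and would need tuning against real traffic:

```yaml
# Hypothetical example: load balancing + circuit breaking for a "checkout" service.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: checkout
spec:
  host: checkout.shop.svc.cluster.local
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST          # P2C-style least-request balancing
    connectionPool:
      http:
        http1MaxPendingRequests: 100 # queue depth before requests are rejected
        http2MaxRequests: 1000
    outlierDetection:                # circuit breaking: eject hosts that keep failing
      consecutive5xxErrors: 5
      interval: 30s                  # how often hosts are evaluated
      baseEjectionTime: 60s          # how long an ejected host stays out
      maxEjectionPercent: 50
```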
Control Plane
Istiod (Istio's control plane) takes high-level routing rules, compiles them into Envoy-native xDS configuration, and pushes them to every proxy over gRPC streaming. It also acts as the Certificate Authority for mTLS certificates. One process doing a lot of work, which is why it needs careful monitoring.
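As a sketch of what a "high-level rule" looks like from the user's side: a single mesh-wide PeerAuthentication policy asks every sidecar to require mTLS, and Istiod issues and rotates the workload certificates behind the scenes. This assumes the default root namespace of istio-system:

```yaml
# Mesh-wide strict mTLS; applies everywhere because it lives in the root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # reject plaintext traffic between mesh workloads
```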
Decision Criteria
| Factor | With Mesh | Without Mesh |
|---|---|---|
| mTLS everywhere | Automatic | Manual per-service |
| Observability | Free L7 metrics | Instrument each service |
| Retries/circuit breakers | Mesh-wide policy | Per-service library |
| Latency overhead | +2-5ms per hop | None |
| Operational complexity | High (control plane) | Lower |
| Memory overhead | +50-100MB per pod | None |
If the reaction to this table is "we don't need most of this yet," trust that instinct. Come back when the need arises.
Production Considerations
- Canary deployments: Use traffic splitting (90/10) to safely roll out new versions. The mesh handles weighted routing at the proxy level (see the sketch after this list). This alone can justify the mesh for some teams.
- Rate limiting: Mesh-level rate limiting complements application-level limits but doesn't replace them. Use it for cross-cutting global limits.
- Multi-cluster: Istio supports multi-cluster meshes with shared or replicated control planes. This matters for disaster recovery and regional isolation, but it doubles the operational complexity. Plan accordingly.
- Ambient mesh: Istio's newer ambient mode removes sidecars and uses per-node ztunnel proxies instead. Cuts resource overhead significantly, but it's still maturing. I wouldn't bet production on it without thorough testing.
- Debugging: Learn istioctl proxy-config and the Envoy admin API (localhost:15000) before deploying to production. Not after. These are the tools that save the on-call engineer at 3 AM.
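The canary traffic split mentioned above is a short piece of Istio config. A hedged sketch, assuming a hypothetical reviews service whose v1 and v2 subsets are already defined in a DestinationRule:

```yaml
# 90/10 canary split: the sidecars do weighted routing, no application changes needed.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: reviews
spec:
  hosts:
  - reviews.prod.svc.cluster.local
  http:
  - route:
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v1
      weight: 90
    - destination:
        host: reviews.prod.svc.cluster.local
        subset: v2
      weight: 10
```

Shifting the weights (75/25, 50/50, 0/100) is a one-line change, which is what makes progressive rollouts cheap once the mesh is in place.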
Failure Scenarios
Scenario 1: Control Plane Outage (Istiod Crash Loop). Istiod enters a crash loop because of a malformed VirtualService resource. New proxy configurations stop flowing. Existing sidecars keep running with their last known configuration (stale xDS cache), so live traffic still works. But new pod deployments break because the sidecar injector (part of Istiod) is unavailable. Pods either start without proxies or fail admission entirely. During a rolling deploy, new pods without sidecars skip mTLS and can't talk to mesh-enabled services. Detection: monitor istiod_uptime, pilot_xds_push_errors, and sidecar_injection_failure_total. Alert if Istiod restarts more than 2 times in 10 minutes. Recovery: the stale xDS cache buys hours, not minutes. Find the malformed resource (istioctl analyze --all-namespaces), fix it, restart Istiod. Long-term: add admission webhook validation that rejects invalid Istio CRDs, and run istioctl analyze in CI before applying any mesh config.
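One way to wire up the detection side is a Prometheus alerting rule. A sketch that assumes kube-state-metrics is being scraped and that the Istiod container uses its default name, discovery:

```yaml
groups:
- name: istiod-health
  rules:
  - alert: IstiodCrashLooping
    # Fires when the istiod container restarts more than twice in 10 minutes.
    expr: increase(kube_pod_container_status_restarts_total{namespace="istio-system", container="discovery"}[10m]) > 2
    labels:
      severity: critical
    annotations:
      summary: "Istiod is crash looping; xDS pushes and sidecar injection are at risk"
```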
Scenario 2: Retry Storm Amplification. This one bites almost everyone eventually. Service A calls Service B, which calls Service C. Default retry policy: 3 retries. Service C gets slow (P99 jumps from 50ms to 5s). Service B retries 3x to C. Service A retries 3x to B. Total requests hitting C: 3 * 3 = 9x amplification. With 4 service hops, that's 3^4 = 81x. The slow service drowns under 81 times its normal load and crashes completely. Detection: monitor per-service upstream_rq_retry in Envoy. Alert when retry ratio exceeds 10% for any service. Track upstream_rq_retry_overflow to see when the retry budget or circuit-breaker ceiling is being hit. Recovery: implement retry budgets that cap total retry traffic at roughly 20% of baseline per service (Envoy supports this via retry_budget in the cluster's circuit breakers; Istio exposes it only through an EnvoyFilter). Set retry limits globally to attempts: 2 instead of 3. Add circuit breakers: open after 50% error rate over 30s, half-open after 60s.
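A sketch of the tightened retry policy in VirtualService form; the service-c hostname is hypothetical. The 20% retry budget itself is an Envoy cluster setting that Istio only exposes through an EnvoyFilter, so it isn't shown here:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: service-c
spec:
  hosts:
  - service-c.prod.svc.cluster.local
  http:
  - route:
    - destination:
        host: service-c.prod.svc.cluster.local
    retries:
      attempts: 2                          # cap at 2 to limit amplification
      perTryTimeout: 2s                    # fail fast instead of waiting out a slow upstream
      retryOn: 5xx,reset,connect-failure   # only retry conditions that are safe to retry
```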
Scenario 3: mTLS Certificate Rotation Failure. Istiod's CA certificate expires, or the certificate rotation process fails silently. Existing connections keep working because TLS sessions are already established. But new connections fail the mTLS handshake. During a rolling deploy, new pods get fresh certificates signed by a CA that existing pods don't trust anymore. Symptom: gradual increase in upstream_cx_connect_fail that correlates with deploy activity. Services that haven't restarted talk to each other just fine, but freshly deployed pods are isolated. Detection: monitor citadel_server_root_cert_expiry_seconds and alert at less than 30 days remaining. Track ssl_handshake_errors per service. Recovery: check istioctl proxy-config secret <pod> to verify cert chain validity. If the root CA has expired, follow Istio's root CA rotation procedure (plug-in CA with new root, rotate intermediates). Prevention: automate cert expiry monitoring and run annual rotation drills. Don't wait for the cert to actually expire to find out the rotation process is broken.
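The cert-expiry detection can be a one-line Prometheus rule. A sketch, assuming Istiod's metrics endpoint is already being scraped:

```yaml
groups:
- name: istio-certificates
  rules:
  - alert: IstioRootCertExpiringSoon
    # citadel_server_root_cert_expiry_seconds is exported by istiod.
    expr: citadel_server_root_cert_expiry_seconds < 30 * 24 * 3600
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "Istio root CA certificate expires in under 30 days"
```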
Capacity Planning
Each Envoy sidecar consumes 50-100MB RAM and 0.1-0.5 vCPU at baseline. Under load (1K RPS through the proxy), expect roughly 100-150MB RAM and 0.5-1.0 vCPU. Istiod uses about 1GB RAM per 1000 sidecars for xDS configuration distribution.
| Metric | Healthy | Warning | Critical |
|---|---|---|---|
| Sidecar memory per pod | < 120MB | > 200MB | > 300MB |
| Sidecar CPU per pod | < 0.3 vCPU | > 0.7 vCPU | > 1.0 vCPU |
| Istiod memory | < 2GB | > 4GB | > 8GB |
| xDS push latency (P99) | < 5s | > 15s | > 60s |
| mTLS handshake failures | 0/min | > 10/min | > 100/min |
What this looks like at scale: Uber runs one of the largest service meshes, with around 4,000 microservices and tens of thousands of Envoy sidecars. At that scale, sidecar overhead adds roughly 15% to cluster-wide compute costs (about $2-3M/year). Lyft, which created Envoy, pushes around 3M RPS through the mesh with less than 2ms of added P99 latency per hop.

Quick capacity formulas: mesh_memory_overhead = pod_count * 100MB, mesh_cpu_overhead = pod_count * 0.3 vCPU, istiod_memory = (pod_count / 1000) * 1GB. For a 2,000-pod cluster, that's 200GB of additional memory (around $800/mo on EC2), 600 vCPU of additional compute (around $3,000/mo), and 2GB for Istiod.
Architecture Decision Record
ADR: Service Mesh Adoption Strategy
Context: Deciding whether, when, and which service mesh to adopt. This is one of the most consequential infrastructure decisions a platform team makes, and it's hard to reverse once deeply committed.
| Criteria (Weight) | No Mesh (Libraries) | Linkerd | Istio | Istio Ambient |
|---|---|---|---|---|
| Ops complexity (25%) | Low (per-team) | Medium | High | Medium |
| Resource overhead (20%) | None | ~30MB/proxy | ~100MB/proxy | ~20MB/node (ztunnel) |
| Feature completeness (20%) | Custom | Core features | Full (traffic mgmt, authz) | Full (maturing) |
| Adoption effort (20%) | Gradual (per-service) | Low (auto-inject) | Medium (CRD learning curve) | Low (no sidecars) |
| Multi-cluster (15%) | Manual | Supported | Mature | Supported |
Decision framework:
- < 20 services AND team < 30 engineers: Skip the service mesh. Use shared libraries for retries and circuit breaking (Resilience4j for Java, a retry library such as retry-go for Go). Handle mTLS with cert-manager and application-level TLS. The operational overhead of running a mesh will cost more than it saves. Revisit at 50 services.
- 20-100 services AND Kubernetes-only AND the team values simplicity: Go with Linkerd. Single CLI install, auto-inject sidecars, mTLS by default, roughly 30MB per sidecar. Linkerd's operational model is 10x simpler than Istio's. Buoyant (the company behind Linkerd) reports a median install-to-production time of 2 weeks.
- 100-500 services AND multi-cluster OR advanced traffic management needs: Deploy Istio with a dedicated platform team of 2-3 engineers. The full feature set (VirtualService, DestinationRule, AuthorizationPolicy, WasmPlugin) justifies the complexity at this scale. Budget 3-6 months for a full rollout, adopting namespace by namespace.
- > 500 services AND cost-sensitive AND willing to run newer technology: Evaluate Istio Ambient mode. Replacing per-pod sidecars with per-node ztunnel (L4) and optional waypoint proxies (L7) cuts resource overhead by 60-80%. At 5,000 pods, that saves roughly 400GB RAM ($1,600/mo). The trade-off: ambient mode has less production mileage than sidecar mode as of 2025.
- Multi-platform (Kubernetes + VMs + bare metal): Consul Connect has the strongest multi-platform story of the major meshes. It runs on VMs without Kubernetes, integrates with Nomad, and bundles a service registry with the mesh. For a hybrid-cloud transition, it's the best option.
Key Points
- A dedicated infrastructure layer that handles service-to-service communication through sidecar proxies
- Handles mTLS, retries, circuit breaking, and observability without touching application code
- Data plane (Envoy sidecars) moves traffic; control plane (Istiod) manages configuration
- Adds roughly 2-5ms latency per hop because of the sidecar proxy. Know the trade-off before committing.
- Starts paying for itself around 50+ services. Adopting it too early just adds complexity for no real gain.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Istio | Open Source | Full-featured mesh, Envoy-based, large community | Large-Enterprise |
| Linkerd | Open Source | Lightweight, Rust proxy, simpler operations | Medium-Large |
| Consul Connect | Open Source | Multi-platform (K8s + VMs), integrated service discovery | Medium-Enterprise |
| AWS App Mesh | Managed | AWS-native, ECS/EKS integration | Medium-Enterprise |
Common Mistakes
- Adopting a service mesh with fewer than 20 services. The operational overhead just isn't worth it at that scale.
- Forgetting the per-pod resource cost. Each Envoy sidecar eats 50-100MB RAM, and that adds up fast.
- Thinking the mesh handles all security. It provides transport encryption, not application-level authorization.
- Leaving retry policies at defaults. Untuned retries will amplify failures into a retry storm.
- Treating the control plane as a fire-and-forget component. If Istiod goes down, new proxy configs stop flowing.