Zero Trust & Network Security
Why It Exists
Traditional perimeter security trusts anything inside the network and blocks anything outside. That model falls apart the moment microservices run across shared Kubernetes clusters, span multiple clouds, or scale to thousands of pods, any one of which could be compromised. At that point the perimeter is a fiction.
Zero trust throws out implicit trust entirely. Every service-to-service call has to prove its identity and show it has authorization. No exceptions, no shortcuts. The 2020 SolarWinds breach showed exactly what happens when lateral movement inside a "trusted" network goes unchecked. Attackers sat in that environment for months because nothing challenged them once they were past the front door.
The key takeaway: the internal network is not a safe zone. Treat it like the public internet.
How It Works
Identity with SPIFFE/SPIRE
SPIFFE (Secure Production Identity Framework For Everyone) provides a standard for workload identity. Each service gets a cryptographic identity document (SVID), either an X.509 certificate or a JWT, that says "I am service-A in namespace-production in cluster-us-east." SPIRE is the reference implementation that issues and rotates these identities.
Service meshes like Istio implement SPIFFE natively, issuing short-lived certificates (24h TTL) through the mesh's built-in CA. The short TTL matters. If a cert gets compromised, it stops working within hours instead of sitting valid for a year.
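Istio expresses the "no plaintext inside the mesh" rule as a PeerAuthentication resource. A minimal sketch, assuming the control plane runs in the usual istio-system namespace so the policy applies mesh-wide:

```yaml
# Mesh-wide policy: reject plaintext service-to-service traffic, so every
# workload must present its SPIFFE identity (X.509 SVID) over mTLS.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace: the policy covers the whole mesh
spec:
  mtls:
    mode: STRICT
```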
Micro-Segmentation with Network Policies
Kubernetes NetworkPolicy resources control pod-to-pod traffic at L3/L4:
- Default deny ingress -- No pod can receive traffic unless explicitly allowed
- Allow by label selector -- `spec.ingress.from.podSelector.matchLabels: {app: api-gateway}` lets only the gateway reach the service (see the sketch below)
- Namespace isolation -- `spec.ingress.from.namespaceSelector` restricts which namespaces can reach the service, blocking all other cross-namespace traffic
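A minimal sketch of the first two rules; the namespace and the api-gateway/payments labels are illustrative, not fixed names:

```yaml
# Default-deny: select every pod in the namespace and allow no ingress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}           # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
---
# Carve-out: only pods labeled app: api-gateway may reach the payments pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-payments
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payments
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
```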
CNI plugins (Calico, Cilium) actually enforce these policies. Without a CNI that supports them, NetworkPolicy objects are just YAML that does nothing, so verify enforcement before relying on them. Cilium goes further with L7 policies, enabling rules like "allow only GET /health from the monitoring namespace."
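That L7 rule looks roughly like the following CiliumNetworkPolicy; the payments label, port 8080, and the monitoring namespace name are assumptions for the example:

```yaml
# Allow only GET /health on port 8080, and only from pods in the monitoring
# namespace; all other ingress to the selected pods is dropped at L7.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-health-from-monitoring
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: payments
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: monitoring
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/health"
```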
Policy-as-Code with OPA
Open Policy Agent evaluates authorization decisions using Rego policies. Deploy it as a Kubernetes admission webhook and it can block: privileged containers, images from unapproved registries, missing required labels, unset resource limits, and hostNetwork access.
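One common way to wire this up is OPA Gatekeeper, which ships the Rego inside a ConstraintTemplate; a minimal sketch that blocks privileged containers (the template name and message are illustrative):

```yaml
# Gatekeeper ConstraintTemplate: the embedded Rego rejects any pod requesting
# a privileged container. A K8sDenyPrivileged constraint object then enables
# the rule for the namespaces it selects.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyprivileged
spec:
  crd:
    spec:
      names:
        kind: K8sDenyPrivileged
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenyprivileged

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged
          msg := sprintf("privileged container not allowed: %v", [container.name])
        }
```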
For runtime authorization, services query OPA as a sidecar for fine-grained decisions: "Can service-A hit /payments with role=admin?" This is where the real power is, but writing Rego is an acquired taste. Budget time for the team to learn it properly.
Production Considerations
- Certificate rotation -- Use short-lived certificates (1-24 hours) with automatic rotation. Long-lived certs are a ticking time bomb. When they expire, every service-to-service call fails at once, and the result is a very bad day.
- Policy testing -- Rego policies need unit tests (`opa test`) and integration tests in CI. A bad deny-all policy in production blocks all traffic instantly. I have seen this take down entire platforms in under a minute.
- Gradual enforcement -- Roll out network policies in audit mode first (Calico `Log` action), study the traffic patterns, then switch to `Deny`. Skipping the observation phase causes outages. Every time. A sketch of an audit-mode policy follows this list.
- Defense in depth layers -- Combine WAF at the edge, mTLS between services, NetworkPolicy at the CNI, OPA for authorization, and application-level RBAC. No single layer is enough on its own. Each one catches what the others miss.
- East-west visibility -- Use Cilium Hubble or Istio Kiali to visualize actual service-to-service traffic flows. Writing correct network policies is impossible without understanding the real communication graph. Guessing here is a recipe for outages or security holes.
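The audit-mode rollout uses Calico's own policy CRD, since the Log action is a Calico extension rather than part of Kubernetes NetworkPolicy. A sketch, with the namespace assumed:

```yaml
# Observation phase: log every ingress flow, then allow it. Once the real
# traffic graph is understood, the trailing Allow is replaced with explicit
# allow rules plus a default Deny.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: audit-ingress
  namespace: production
spec:
  selector: all()
  types:
    - Ingress
  ingress:
    - action: Log
    - action: Allow
```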
Failure Scenarios
Scenario 1: Certificate Authority Compromise Breaks All mTLS -- The Istio root CA private key leaks through a misconfigured backup sitting in an unencrypted S3 bucket. An attacker can now mint valid SPIFFE SVIDs for any service identity in the mesh. All mTLS authentication becomes worthless. The attacker can impersonate any service and pull data across service boundaries. The blast radius is every service in the mesh. Detection: Monitor certificate issuance rates via istio_agent_cert_issuance_count. A spike from an unknown source means trouble. Implement Certificate Transparency logs for internal CAs. Recovery: Rotate the root CA immediately, which forces re-issuance of all workload certificates (expect 5-10 minutes of service disruption). Switch to an intermediate CA architecture so a root CA compromise only requires rotating the intermediate. Store root CA keys in an HSM (AWS CloudHSM, Google Cloud HSM). Never in software. This is non-negotiable.
Scenario 2: Network Policy Misconfiguration Causes Production Outage -- A platform engineer applies a default-deny-all NetworkPolicy to the production namespace to harden security. Good intention, bad execution: the corresponding allow rules for DNS (kube-dns) egress are missing. All pods lose DNS resolution. Service discovery breaks. Every HTTP call returns connection errors. The entire production namespace goes dark within 30 seconds. Detection: Build a "canary NetworkPolicy" test in CI that applies policies to a test namespace and validates DNS resolution and critical service connectivity before promoting to production. Monitor coredns_dns_request_count for sudden drops. Recovery: Delete the overly restrictive policy immediately (kubectl delete networkpolicy default-deny -n production). Create a standard policy template that always includes DNS egress (port 53, protocol UDP/TCP to kube-system). Require dry-run validation with kubectl diff before any NetworkPolicy change hits production.
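As an illustration of that template, a default-deny egress policy that still permits DNS; the namespace is an assumption and the kube-dns labels are the upstream defaults:

```yaml
# Deny all egress from the namespace except DNS lookups against kube-dns.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress-allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```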
Scenario 3: OPA Policy Evaluation Latency Spikes Under Load -- OPA runs as an admission webhook for all Kubernetes API requests. A Rego policy with nested iterations over all pods in the cluster (O(n^2) complexity) works fine with 200 pods but falls apart at 2,000. Webhook response time blows past the 10-second timeout in the ValidatingWebhookConfiguration. All kubectl apply commands and Deployment rollouts fail. Auto-scaling cannot create new pods. The cluster is effectively frozen. Detection: Monitor opa_decision_latency_seconds p99 and alert when it passes 500ms. Track webhook timeout rates via apiserver_admission_webhook_rejection_count. Recovery: Set failurePolicy: Ignore temporarily on the webhook to unfreeze cluster operations (yes, this means accepting a security gap, but the alternative is a dead cluster). Optimize the Rego policy using indexing and pre-computed sets instead of iteration. Add OPA bundle caching and shard policies across multiple OPA instances.
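The failurePolicy and timeout knobs live on the webhook registration itself; a trimmed sketch, with the service reference and rules kept illustrative:

```yaml
# Fail-open admission webhook: if OPA is slow or down, requests are admitted
# instead of freezing the cluster. Tighten back to Fail once the policy is fixed.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-validating-webhook
webhooks:
  - name: validating-webhook.openpolicyagent.org
    failurePolicy: Ignore          # temporary fail-open during the incident
    timeoutSeconds: 5              # default is 10s; keep decisions well under this
    clientConfig:
      service:
        name: opa
        namespace: opa
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods", "deployments"]
    sideEffects: None
    admissionReviewVersions: ["v1"]
```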
Capacity Planning
mTLS certificate issuance scale: Each pod in an Istio mesh needs a workload certificate, renewed before expiry (default 24h TTL). With 5,000 pods, the mesh CA handles roughly 5,000 CSR/sign operations per 24 hours, well under one per second in steady state, plus bursts from pod restarts. Istiod can handle about 500 CSR/sec on a single instance, so there's headroom, but keep an eye on it during large rollouts.
| Scale Tier | Pods | Services | Network Policies | mTLS Certs/Day | Reference |
|---|---|---|---|---|---|
| Startup | 50 | 10 | 20 | 100 | Single cluster |
| Mid-scale | 500 | 50 | 200 | 1,500 | Multi-team platform |
| Large-scale | 5,000 | 300 | 2,000 | 15,000 | Cloudflare-scale |
| Hyper-scale | 50,000+ | 2,000+ | 20,000+ | 200,000+ | Google BeyondCorp |
Key thresholds: Calico processes NetworkPolicy updates in O(n) time per policy. Above 5,000 policies per cluster, look into namespace-level policy aggregation. Cilium eBPF maps have a default limit of 65,536 entries, so monitor cilium_bpf_map_pressure. OPA decision latency should stay under 5ms for admission webhooks. If a policy takes over 100ms, it needs optimization -- treat that as a bug. At 10,000+ pods, run dedicated Istiod instances per namespace group to avoid control plane bottlenecks. For reference, Stripe processes 500M+ API requests per day with per-request authorization, and their policy engine evaluates in under 1ms.
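If Prometheus is watching those signals, the alerts might look like this sketch; the metric names follow the ones used in this section and can vary by version and setup:

```yaml
# Illustrative Prometheus alerting rules for the thresholds above.
groups:
  - name: zero-trust-thresholds
    rules:
      - alert: OpaAdmissionLatencyHigh
        expr: histogram_quantile(0.99, rate(opa_decision_latency_seconds_bucket[5m])) > 0.005
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OPA p99 admission decision latency above 5ms"
      - alert: CiliumBpfMapPressureHigh
        expr: max(cilium_bpf_map_pressure) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cilium eBPF map utilization above 90% of capacity"
```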
Architecture Decision Record
Decision: Choosing a Zero Trust Implementation Approach
| Criteria (Weight) | Istio (Service Mesh) | Cilium (eBPF) | Calico + OPA | Cloud-Native (AWS VPC Lattice) |
|---|---|---|---|---|
| mTLS automation (25%) | 5 - Automatic, transparent | 3 - WireGuard-based encryption | 2 - Manual cert management | 3 - Managed TLS |
| L7 policy control (20%) | 5 - Full HTTP/gRPC authz | 4 - L7 visibility + basic policy | 3 - OPA for authz, separate from network | 3 - Basic path-based routing |
| Performance overhead (20%) | 2 - Sidecar proxy adds 2-3ms latency | 5 - Kernel-level, <0.5ms overhead | 4 - Minimal overhead, no sidecar | 4 - Managed, optimized |
| Operational complexity (15%) | 2 - Sidecar injection, control plane ops | 3 - eBPF kernel compatibility | 3 - Multiple components | 5 - Fully managed |
| Observability (10%) | 5 - Kiali service map, Prometheus metrics | 4 - Hubble flow visibility | 3 - Separate observability setup | 3 - CloudWatch integration |
| Multi-cloud support (10%) | 4 - Any K8s cluster | 4 - Any K8s with modern kernel | 4 - Any K8s cluster | 1 - AWS only |
When to choose what:
- Team < 20, Kubernetes-native: Cilium. eBPF-based networking with zero sidecar overhead, built-in network policies, and Hubble for observability. Lowest operational burden by far.
- Team 20-100, need full L7 control: Istio. The most mature service mesh with solid mTLS, authorization policies, and traffic management. The cost is sidecar latency, but the tradeoff is usually worth it.
- Regulated industry, need policy-as-code: Calico + OPA. Separating network policy (Calico) from authorization logic (OPA Rego) provides auditable, version-controlled security policies that compliance teams can actually review.
- AWS-only, minimize ops: VPC Lattice. Fully managed service-to-service connectivity. Best for zero trust without owning the infrastructure. The lock-in is real, though.
- Hybrid cloud, high-security: Istio with SPIRE. SPIFFE-based identity works across clouds. Pair it with Calico for network-level enforcement as defense-in-depth. This is the most complex setup, but it covers the most ground.
Key Points
- Every request gets authenticated and authorized regardless of where it originates. Being 'inside' the network means nothing.
- Micro-segmentation replaces the old network perimeter. Each service-to-service call is policy-controlled.
- mTLS provides both encryption and identity verification between services.
- Kubernetes network policies restrict pod-to-pod communication at the CNI level.
- Defense in depth matters. Combine network policies, service mesh mTLS, and application-level authz together.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Istio | Open Source | mTLS, authorization policies, service identity (SPIFFE) | Large-Enterprise |
| Calico | Open Source | K8s network policies, eBPF dataplane | Medium-Enterprise |
| Open Policy Agent | Open Source | Policy-as-code, admission control, authz decisions | Medium-Enterprise |
| Cilium | Open Source | eBPF networking, L7 visibility, network policies | Medium-Enterprise |
Common Mistakes
- Relying only on network perimeter security. Once inside, attackers move laterally with zero resistance.
- Deploying mTLS without certificate rotation. Expired certs cause outages, and they always expire at the worst time.
- Leaving network policies wide open. An 'allow all' default defeats the entire purpose.
- Skipping audit trails on policy changes. Security policies need the same review process as application code.
- Ignoring east-west traffic. Most real attacks exploit service-to-service communication, not north-south.