Zero Trust & Network Security
Why It Exists
Traditional perimeter security trusts anything inside the network and blocks anything outside. That model falls apart the moment microservices run across shared Kubernetes clusters, span multiple clouds, or scale to thousands of pods, any one of which could be compromised. At that point the perimeter is a fiction.
Zero trust throws out implicit trust entirely. Every service-to-service call has to prove its identity and show it has authorization. No exceptions, no shortcuts. The 2020 SolarWinds breach showed exactly what happens when lateral movement inside a "trusted" network goes unchecked. Attackers sat in that environment for months because nothing challenged them once they were past the front door.
The key takeaway: the internal network is not a safe zone. Treat it like the public internet.
How It Works
Identity with SPIFFE/SPIRE
SPIFFE (Secure Production Identity Framework For Everyone) provides a standard for workload identity. Each service gets a cryptographic identity document (SVID), either an X.509 certificate or a JWT, that says "I am service-A in namespace-production in cluster-us-east." SPIRE is the reference implementation that issues and rotates these identities.
Service meshes like Istio implement SPIFFE natively, issuing short-lived certificates (24h TTL) through the mesh's built-in CA. The short TTL matters. If a cert gets compromised, it stops working within hours instead of sitting valid for a year.
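Istio expresses the "no plaintext inside the mesh" rule as a PeerAuthentication resource. A minimal sketch, assuming the control plane runs in the usual istio-system namespace so the policy applies mesh-wide:

```yaml
# Mesh-wide policy: reject plaintext service-to-service traffic, so every
# workload must present its SPIFFE identity (X.509 SVID) over mTLS.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # root namespace: the policy covers the whole mesh
spec:
  mtls:
    mode: STRICT
```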
Micro-Segmentation with Network Policies
Kubernetes NetworkPolicy resources control pod-to-pod traffic at L3/L4:
- Default deny ingress -- No pod can receive traffic unless explicitly allowed
- Allow by label selector -- `spec.ingress.from.podSelector.matchLabels: {app: api-gateway}` lets only the gateway reach the service (see the sketch below)
- Namespace isolation -- `spec.ingress.from.namespaceSelector` restricts which namespaces can reach the service, blocking all other cross-namespace traffic
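A minimal sketch of the first two rules; the namespace and the api-gateway/payments labels are illustrative, not fixed names:

```yaml
# Default-deny: select every pod in the namespace and allow no ingress.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}           # empty selector = all pods in the namespace
  policyTypes:
    - Ingress
---
# Carve-out: only pods labeled app: api-gateway may reach the payments pods.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-gateway-to-payments
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: payments
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api-gateway
```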
CNI plugins (Calico, Cilium) actually enforce these policies. Without a CNI that supports them, NetworkPolicy objects are just YAML that does nothing, so verify enforcement before relying on them. Cilium goes further with L7 policies, enabling rules like "allow only GET /health from the monitoring namespace."
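That L7 rule looks roughly like the following CiliumNetworkPolicy; the payments label, port 8080, and the monitoring namespace name are assumptions for the example:

```yaml
# Allow only GET /health on port 8080, and only from pods in the monitoring
# namespace; all other ingress to the selected pods is dropped at L7.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: allow-health-from-monitoring
  namespace: production
spec:
  endpointSelector:
    matchLabels:
      app: payments
  ingress:
    - fromEndpoints:
        - matchLabels:
            k8s:io.kubernetes.pod.namespace: monitoring
      toPorts:
        - ports:
            - port: "8080"
              protocol: TCP
          rules:
            http:
              - method: "GET"
                path: "/health"
```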
Policy-as-Code with OPA
Open Policy Agent evaluates authorization decisions using Rego policies. Deploy it as a Kubernetes admission webhook and it can block: privileged containers, images from unapproved registries, missing required labels, unset resource limits, and hostNetwork access.
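One common way to wire this up is OPA Gatekeeper, which ships the Rego inside a ConstraintTemplate; a minimal sketch that blocks privileged containers (the template name and message are illustrative):

```yaml
# Gatekeeper ConstraintTemplate: the embedded Rego rejects any pod requesting
# a privileged container. A K8sDenyPrivileged constraint object then enables
# the rule for the namespaces it selects.
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8sdenyprivileged
spec:
  crd:
    spec:
      names:
        kind: K8sDenyPrivileged
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8sdenyprivileged

        violation[{"msg": msg}] {
          container := input.review.object.spec.containers[_]
          container.securityContext.privileged
          msg := sprintf("privileged container not allowed: %v", [container.name])
        }
```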
For runtime authorization, services query OPA as a sidecar for fine-grained decisions: "Can service-A hit /payments with role=admin?" This is where the real power is, but writing Rego is an acquired taste. Budget time for the team to learn it properly.
Production Considerations
- Certificate rotation -- Use short-lived certificates (1-24 hours) with automatic rotation. Long-lived certs are a ticking time bomb. When they expire, every service-to-service call fails at once, and the result is a very bad day.
- Policy testing -- Rego policies need unit tests (`opa test`) and integration tests in CI. A bad deny-all policy in production blocks all traffic instantly. I have seen this take down entire platforms in under a minute.
- Gradual enforcement -- Roll out network policies in audit mode first (Calico `Log` action), study the traffic patterns, then switch to `Deny`. Skipping the observation phase causes outages. Every time. A sketch of an audit-mode policy follows this list.
- Defense in depth layers -- Combine WAF at the edge, mTLS between services, NetworkPolicy at the CNI, OPA for authorization, and application-level RBAC. No single layer is enough on its own. Each one catches what the others miss.
- East-west visibility -- Use Cilium Hubble or Istio Kiali to visualize actual service-to-service traffic flows. Writing correct network policies is impossible without understanding the real communication graph. Guessing here is a recipe for outages or security holes.
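The audit-mode rollout uses Calico's own policy CRD, since the Log action is a Calico extension rather than part of Kubernetes NetworkPolicy. A sketch, with the namespace assumed:

```yaml
# Observation phase: log every ingress flow, then allow it. Once the real
# traffic graph is understood, the trailing Allow is replaced with explicit
# allow rules plus a default Deny.
apiVersion: projectcalico.org/v3
kind: NetworkPolicy
metadata:
  name: audit-ingress
  namespace: production
spec:
  selector: all()
  types:
    - Ingress
  ingress:
    - action: Log
    - action: Allow
```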
Failure Scenarios
Scenario 1: Certificate Authority Compromise Breaks All mTLS -- The Istio root CA private key leaks through a misconfigured backup sitting in an unencrypted S3 bucket. An attacker can now mint valid SPIFFE SVIDs for any service identity in the mesh. All mTLS authentication becomes worthless. The attacker can impersonate any service and pull data across service boundaries. The blast radius is every service in the mesh. Detection: Monitor certificate issuance rates via istio_agent_cert_issuance_count. A spike from an unknown source means trouble. Implement Certificate Transparency logs for internal CAs. Recovery: Rotate the root CA immediately, which forces re-issuance of all workload certificates (expect 5-10 minutes of service disruption). Switch to an intermediate CA architecture so a root CA compromise only requires rotating the intermediate. Store root CA keys in an HSM (AWS CloudHSM, Google Cloud HSM). Never in software. This is non-negotiable.
Scenario 2: Network Policy Misconfiguration Causes Production Outage -- A platform engineer applies a default-deny-all NetworkPolicy to the production namespace to harden security. Good intention, bad execution: the corresponding allow rules for DNS (kube-dns) egress are missing. All pods lose DNS resolution. Service discovery breaks. Every HTTP call returns connection errors. The entire production namespace goes dark within 30 seconds. Detection: Build a "canary NetworkPolicy" test in CI that applies policies to a test namespace and validates DNS resolution and critical service connectivity before promoting to production. Monitor coredns_dns_request_count for sudden drops. Recovery: Delete the overly restrictive policy immediately (kubectl delete networkpolicy default-deny -n production). Create a standard policy template that always includes DNS egress (port 53, protocol UDP/TCP to kube-system). Require dry-run validation with kubectl diff before any NetworkPolicy change hits production.
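As an illustration of that template, a default-deny egress policy that still permits DNS; the namespace is an assumption and the kube-dns labels are the upstream defaults:

```yaml
# Deny all egress from the namespace except DNS lookups against kube-dns.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress-allow-dns
  namespace: production
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```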
Scenario 3: OPA Policy Evaluation Latency Spikes Under Load -- OPA runs as an admission webhook for all Kubernetes API requests. A Rego policy with nested iterations over all pods in the cluster (O(n^2) complexity) works fine with 200 pods but falls apart at 2,000. Webhook response time blows past the 10-second timeout in the ValidatingWebhookConfiguration. All kubectl apply commands and Deployment rollouts fail. Auto-scaling cannot create new pods. The cluster is effectively frozen. Detection: Monitor opa_decision_latency_seconds p99 and alert when it passes 500ms. Track webhook timeout rates via apiserver_admission_webhook_rejection_count. Recovery: Set failurePolicy: Ignore temporarily on the webhook to unfreeze cluster operations (yes, this means accepting a security gap, but the alternative is a dead cluster). Optimize the Rego policy using indexing and pre-computed sets instead of iteration. Add OPA bundle caching and shard policies across multiple OPA instances.
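The failurePolicy and timeout knobs live on the webhook registration itself; a trimmed sketch, with the service reference and rules kept illustrative:

```yaml
# Fail-open admission webhook: if OPA is slow or down, requests are admitted
# instead of freezing the cluster. Tighten back to Fail once the policy is fixed.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: opa-validating-webhook
webhooks:
  - name: validating-webhook.openpolicyagent.org
    failurePolicy: Ignore          # temporary fail-open during the incident
    timeoutSeconds: 5              # default is 10s; keep decisions well under this
    clientConfig:
      service:
        name: opa
        namespace: opa
    rules:
      - apiGroups: ["*"]
        apiVersions: ["*"]
        operations: ["CREATE", "UPDATE"]
        resources: ["pods", "deployments"]
    sideEffects: None
    admissionReviewVersions: ["v1"]
```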
Capacity Planning
mTLS certificate issuance scale: Each pod in an Istio mesh needs a workload certificate, renewed before expiry (default 24h TTL). With 5,000 pods, the mesh CA handles roughly 5,000 CSR/sign operations per 24 hours, well under one per second in steady state, plus bursts from pod restarts. Istiod can handle about 500 CSR/sec on a single instance, so there's headroom, but keep an eye on it during large rollouts.
| Scale Tier | Pods | Services | Network Policies | mTLS Certs/Day | Reference |
|---|---|---|---|---|---|
| Startup | 50 | 10 | 20 | 100 | Single cluster |
| Mid-scale | 500 | 50 | 200 | 1,500 | Multi-team platform |
| Large-scale | 5,000 | 300 | 2,000 | 15,000 | Cloudflare-scale |
| Hyper-scale | 50,000+ | 2,000+ | 20,000+ | 200,000+ | Google BeyondCorp |
Key thresholds: Calico processes NetworkPolicy updates in O(n) time per policy. Above 5,000 policies per cluster, look into namespace-level policy aggregation. Cilium eBPF maps have a default limit of 65,536 entries, so monitor cilium_bpf_map_pressure. OPA decision latency should stay under 5ms for admission webhooks. If a policy takes over 100ms, it needs optimization -- treat that as a bug. At 10,000+ pods, run dedicated Istiod instances per namespace group to avoid control plane bottlenecks. For reference, Stripe processes 500M+ API requests per day with per-request authorization, and their policy engine evaluates in under 1ms.
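If Prometheus is watching those signals, the alerts might look like this sketch; the metric names follow the ones used in this section and can vary by version and setup:

```yaml
# Illustrative Prometheus alerting rules for the thresholds above.
groups:
  - name: zero-trust-thresholds
    rules:
      - alert: OpaAdmissionLatencyHigh
        expr: histogram_quantile(0.99, rate(opa_decision_latency_seconds_bucket[5m])) > 0.005
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OPA p99 admission decision latency above 5ms"
      - alert: CiliumBpfMapPressureHigh
        expr: max(cilium_bpf_map_pressure) > 0.9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Cilium eBPF map utilization above 90% of capacity"
```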
Architecture Decision Record
Decision: Choosing a Zero Trust Implementation Approach
| Criteria (Weight) | Istio (Service Mesh) | Cilium (eBPF) | Calico + OPA | Cloud-Native (AWS VPC Lattice) |
|---|---|---|---|---|
| mTLS automation (25%) | 5 - Automatic, transparent | 3 - WireGuard-based encryption | 2 - Manual cert management | 3 - Managed TLS |
| L7 policy control (20%) | 5 - Full HTTP/gRPC authz | 4 - L7 visibility + basic policy | 3 - OPA for authz, separate from network | 3 - Basic path-based routing |
| Performance overhead (20%) | 2 - Sidecar proxy adds 2-3ms latency | 5 - Kernel-level, <0.5ms overhead | 4 - Minimal overhead, no sidecar | 4 - Managed, optimized |
| Operational complexity (15%) | 2 - Sidecar injection, control plane ops | 3 - eBPF kernel compatibility | 3 - Multiple components | 5 - Fully managed |
| Observability (10%) | 5 - Kiali service map, Prometheus metrics | 4 - Hubble flow visibility | 3 - Separate observability setup | 3 - CloudWatch integration |
| Multi-cloud support (10%) | 4 - Any K8s cluster | 4 - Any K8s with modern kernel | 4 - Any K8s cluster | 1 - AWS only |
When to choose what:
- Team < 20, Kubernetes-native: Cilium. eBPF-based networking with zero sidecar overhead, built-in network policies, and Hubble for observability. Lowest operational burden by far.
- Team 20-100, need full L7 control: Istio. The most mature service mesh with solid mTLS, authorization policies, and traffic management. The cost is sidecar latency, but the tradeoff is usually worth it.
- Regulated industry, need policy-as-code: Calico + OPA. Separating network policy (Calico) from authorization logic (OPA Rego) provides auditable, version-controlled security policies that compliance teams can actually review.
- AWS-only, minimize ops: VPC Lattice. Fully managed service-to-service connectivity. Best for zero trust without owning the infrastructure. The lock-in is real, though.
- Hybrid cloud, high-security: Istio with SPIRE. SPIFFE-based identity works across clouds. Pair it with Calico for network-level enforcement as defense-in-depth. This is the most complex setup, but it covers the most ground.
Key Points
- Every request gets authenticated and authorized regardless of where it originates. Being 'inside' the network means nothing.
- Micro-segmentation replaces the old network perimeter. Each service-to-service call is policy-controlled.
- mTLS provides both encryption and identity verification between services.
- Kubernetes network policies restrict pod-to-pod communication at the CNI level.
- Defense in depth matters. Combine network policies, service mesh mTLS, and application-level authz together.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Istio | Open Source | mTLS, authorization policies, service identity (SPIFFE) | Large-Enterprise |
| Calico | Open Source | K8s network policies, eBPF dataplane | Medium-Enterprise |
| Open Policy Agent | Open Source | Policy-as-code, admission control, authz decisions | Medium-Enterprise |
| Cilium | Open Source | eBPF networking, L7 visibility, network policies | Medium-Enterprise |
Common Mistakes
- Relying only on network perimeter security. Once inside, attackers move laterally with zero resistance.
- Deploying mTLS without certificate rotation. Expired certs cause outages, and they always expire at the worst time.
- Leaving network policies wide open. An 'allow all' default defeats the entire purpose.
- Skipping audit trails on policy changes. Security policies need the same review process as application code.
- Ignoring east-west traffic. Most real attacks exploit service-to-service communication, not north-south.