Certificate Expiry Incidents
The Anatomy of a Cert Expiry Outage
A certificate expiry outage is pure embarrassment. You had months of warning. The expiry date was encoded in the certificate itself. Monitoring tools exist. Automation exists. And yet, cert expiry incidents keep happening at companies of every size.
The failure mode is binary. At 23:59:59 everything works. At 00:00:00 every new TLS handshake fails. There's no gradual degradation, no canary signal, no slow rollout. One second you're serving traffic, the next you're not. This cliff-edge behavior is what makes cert expiry so dangerous despite being so preventable.
Why Automation Fails
cert-manager handles 90% of the certificate lifecycle in Kubernetes. But the other 10% is where incidents happen. Certificates on legacy load balancers that aren't managed by Kubernetes. Wildcard certificates shared across teams. Internal CA certificates with 10-year lifetimes that nobody tracks. Partner-issued certificates where renewal requires a manual process.
Let's Encrypt renewals can fail silently. The DNS-01 challenge requires API access to your DNS provider. If those credentials rotate, renewals stop working and nobody notices until the cert expires. HTTP-01 challenges fail if your ingress routing changes. Always monitor successful renewal events, not just expiry dates.
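One way to catch this is to alert on the age of the certificate actually being served, not just its expiry date. A minimal sketch, assuming Python's standard ssl module; the hostnames and the age threshold are placeholders:

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical endpoints: the hosts your renewal automation is supposed to cover.
ENDPOINTS = [("example.com", 443), ("api.example.com", 443)]

# A 90-day Let's Encrypt cert renewed ~30 days before expiry is replaced when it's
# about 60 days old; anything much older suggests renewals have stalled.
MAX_CERT_AGE_DAYS = 65

def issued_at(host: str, port: int) -> datetime:
    """Return the notBefore timestamp of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notBefore"]), tz=timezone.utc
    )

for host, port in ENDPOINTS:
    age_days = (datetime.now(timezone.utc) - issued_at(host, port)).days
    # An old notBefore means renewals silently stopped, even if the
    # expiry date is still weeks away.
    status = "WARN: renewal may be failing" if age_days > MAX_CERT_AGE_DAYS else "ok"
    print(f"{host}: cert issued {age_days} days ago [{status}]")
```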
The mTLS Trap
Service-to-service mTLS certificates are the hidden risk. Public-facing certificates get attention because customers see the errors. Internal certificates fail silently or with cryptic connection errors that look like network issues.
Istio service mesh certificates rotate automatically with short lifetimes (24 hours by default). But the root CA certificate that signs them has a longer lifetime, and when that expires, every certificate in the mesh becomes invalid simultaneously. This is how a single expired root CA can take down hundreds of microservices at once.
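It's worth checking the root CA's expiry directly. A minimal sketch, assuming the default self-signed Istio CA (a stock install keeps its root certificate in the istio-ca-secret Secret in istio-system); the secret and key names, kubectl access, and the cryptography package are all assumptions here:

```python
import base64
import json
import subprocess
from datetime import datetime, timezone

from cryptography import x509  # pip install cryptography

# Default locations for the self-signed Istio CA; these names are assumptions
# taken from a stock install and will differ if you plug in your own CA.
NAMESPACE = "istio-system"
SECRET = "istio-ca-secret"
KEY = "ca-cert.pem"

raw = subprocess.check_output(
    ["kubectl", "get", "secret", SECRET, "-n", NAMESPACE, "-o", "json"]
)
pem = base64.b64decode(json.loads(raw)["data"][KEY])
root_ca = x509.load_pem_x509_certificate(pem)

# not_valid_after_utc needs cryptography >= 42; older versions use not_valid_after.
remaining = root_ca.not_valid_after_utc - datetime.now(timezone.utc)
print(f"Istio root CA expires {root_ca.not_valid_after_utc:%Y-%m-%d} ({remaining.days} days)")
if remaining.days < 90:
    print("WARN: plan the root CA rotation now, before every workload cert breaks at once")
```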
Building a Certificate Inventory
You can't monitor what you don't know about. Run a weekly scan of all your endpoints:
- Scan external endpoints with openssl s_client or tools like ssl-cert-check
- Scan internal endpoints through your service mesh
- Check cloud provider certificate stores (ACM, GCP Certificate Manager)
- Check Kubernetes secrets of type kubernetes.io/tls
- Check CI/CD systems that use certificates for code signing
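A minimal sketch of the external-endpoint leg of that scan, with a hard-coded target list standing in for whatever DNS zones and load balancer configs you'd really pull from; the internal, cloud provider, and Kubernetes checks would append rows to the same file:

```python
import csv
import socket
import ssl
from datetime import datetime, timezone

from cryptography import x509  # pip install cryptography

# Hypothetical target list; in practice this comes from DNS zones,
# load balancer configs, and service discovery, not a hard-coded list.
TARGETS = ["example.com:443", "api.example.com:443", "vpn.example.com:443"]

def fetch_leaf_cert(host: str, port: int) -> x509.Certificate:
    """Grab the leaf certificate an endpoint serves, even if it is
    already expired or otherwise untrusted."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return x509.load_der_x509_certificate(der)

with open("cert-inventory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["endpoint", "subject", "issuer", "not_after", "days_left"])
    for target in TARGETS:
        host, _, port = target.partition(":")
        cert = fetch_leaf_cert(host, int(port or 443))
        not_after = cert.not_valid_after_utc  # cryptography >= 42
        days_left = (not_after - datetime.now(timezone.utc)).days
        writer.writerow([target, cert.subject.rfc4514_string(),
                         cert.issuer.rfc4514_string(), not_after.isoformat(), days_left])
```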
Put every certificate into a single inventory with owner, expiry date, renewal method, and deployment locations. This inventory is your source of truth.
Recovery When It Happens
When a cert expires, you need the fastest path to a valid certificate. Keep a runbook with the exact commands for each certificate type. Have backup certificates from a different CA pre-generated and stored in a vault (not expired, obviously). Know which services need a restart vs a reload after cert replacement. Nginx reloads gracefully. Some Java applications require a full restart because they cache the keystore at startup.
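The runbook's verification step is worth automating too. A small sketch (hostname and file path are placeholders) that compares what an endpoint is actually serving against the replacement certificate on disk, so a reload that silently kept the old keystore gets caught:

```python
import hashlib
import socket
import ssl

# Placeholders: the replacement cert you just deployed and the endpoint
# that should now be serving it.
NEW_CERT_PATH = "/etc/nginx/tls/backup-cert.pem"  # leaf certificate only, not the full chain
HOST, PORT = "example.com", 443

def sha256_fingerprint(der: bytes) -> str:
    return hashlib.sha256(der).hexdigest()

# Fingerprint of the cert we expect to be live.
with open(NEW_CERT_PATH) as f:
    expected = sha256_fingerprint(ssl.PEM_cert_to_DER_cert(f.read()))

# Fingerprint of what the endpoint is actually serving right now.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        served = sha256_fingerprint(tls.getpeercert(binary_form=True))

if served == expected:
    print("OK: endpoint is serving the replacement certificate")
else:
    print("STILL STALE: reload did not pick up the new cert; a restart may be needed")
```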
After recovery, don't just fix the expired cert. Fix the process that let it expire.
Incident Timeline
- T+0m: TLS certificate expires at 00:00 UTC. New connections start failing with SSL handshake errors. Existing keep-alive connections continue working temporarily.
- T+2m: External monitoring detects SSL errors. Error rate climbs as connection pools recycle. Mobile apps are hit harder than web clients due to stricter certificate pinning.
- T+5m: On-call paged. Initial investigation checks application logs and sees connection reset errors. It takes 3 minutes to identify certificate expiry as the root cause.
- T+10m: Team attempts manual certificate renewal. Let's Encrypt rate limits hit because someone ran renewals in a loop during testing earlier that week.
- T+15m: Backup certificate from a different CA (DigiCert) deployed manually. Nginx reloaded. Traffic partially recovers for services behind the load balancer.
- T+30m: Full recovery after updating certificates on all edge nodes, internal mTLS certs, and CDN configuration. Post-incident review scheduled.
Detection Signals
- SSL handshake failure rate exceeding 1% in load balancer metrics
- "x509: certificate has expired" errors appearing in application logs
- Spike in HTTP 502/503 errors at the reverse proxy layer
- Client-side certificate pinning failures reported through crash analytics
Prevention
- Deploy cert-manager in Kubernetes clusters with automated Let's Encrypt renewal at 30 days before expiry, not the default 7 days
- Set up certificate expiry monitoring in Prometheus with alerting at 30, 14, and 7 days before expiration using blackbox_exporter (the threshold ladder is sketched after this list)
- Maintain a certificate inventory spreadsheet or use tools like Keychecker that scan all endpoints weekly
- Use short-lived certificates (90 days or less) to force automation. If renewal only happens annually, the process rots and nobody remembers how
- Test certificate renewal in a staging environment monthly, including the full path from issuance to deployment
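If the inventory scan feeds your own notifier instead of blackbox_exporter, the 30/14/7-day ladder from the alerting item above might look something like this (the severity names are placeholders):

```python
# Severity ladder mirroring the 30/14/7-day alerting thresholds above.
# days_left would come from the inventory scan; routing is up to your pager.
def expiry_severity(days_left: int) -> str | None:
    if days_left <= 7:
        return "page"      # wake someone up: this is about to become a P0
    if days_left <= 14:
        return "ticket"    # must be renewed this sprint
    if days_left <= 30:
        return "warning"   # renewal automation should already have fired
    return None

assert expiry_severity(3) == "page"
assert expiry_severity(21) == "warning"
assert expiry_severity(60) is None
```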
Key Points
- Certificate expiry is the most preventable P0 incident. Every single one is a process failure, never a technical surprise
- Microsoft Teams went down for roughly three hours in February 2020 because an authentication certificate expired. One of the largest engineering organizations in the world forgot to renew a cert
- Internal mTLS certificates are more dangerous than public-facing ones because they're less visible and often have longer lifetimes
- Certificate expiry doesn't cause gradual degradation. It's a cliff: everything works until the exact second of expiry, then nothing works
Common Mistakes
- Relying on calendar reminders instead of automated monitoring for certificate renewals
- Renewing the certificate but forgetting to deploy it to all endpoints (CDN edge nodes, internal services, partner API gateways)
- Using certificate pinning in mobile apps without a rotation mechanism, turning a cert renewal into a forced app update