Certificate Expiry Incidents
The Anatomy of a Cert Expiry Outage
A certificate expiry outage is pure embarrassment. You had months of warning. The expiry date was encoded in the certificate itself. Monitoring tools exist. Automation exists. And yet, cert expiry incidents keep happening at companies of every size.
The failure mode is binary. At 23:59:59 everything works. At 00:00:00 every new TLS handshake fails. There's no gradual degradation, no canary signal, no slow rollout. One second you're serving traffic, the next you're not. This cliff-edge behavior is what makes cert expiry so dangerous despite being so preventable.
Why Automation Fails
cert-manager handles 90% of the certificate lifecycle in Kubernetes. But the other 10% is where incidents happen. Certificates on legacy load balancers that aren't managed by Kubernetes. Wildcard certificates shared across teams. Internal CA certificates with 10-year lifetimes that nobody tracks. Partner-issued certificates where renewal requires a manual process.
Let's Encrypt renewals can fail silently. The DNS-01 challenge requires API access to your DNS provider. If those credentials rotate, renewals stop working and nobody notices until the cert expires. HTTP-01 challenges fail if your ingress routing changes. Always monitor successful renewal events, not just expiry dates.
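One way to catch this is to alert on the age of the certificate actually being served, not just its expiry date. A minimal sketch, assuming Python's standard ssl module; the hostnames and the age threshold are placeholders:

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical endpoints: the hosts your renewal automation is supposed to cover.
ENDPOINTS = [("example.com", 443), ("api.example.com", 443)]

# A 90-day Let's Encrypt cert renewed ~30 days before expiry is replaced when it's
# about 60 days old; anything much older suggests renewals have stalled.
MAX_CERT_AGE_DAYS = 65

def issued_at(host: str, port: int) -> datetime:
    """Return the notBefore timestamp of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notBefore"]), tz=timezone.utc
    )

for host, port in ENDPOINTS:
    age_days = (datetime.now(timezone.utc) - issued_at(host, port)).days
    # An old notBefore means renewals silently stopped, even if the
    # expiry date is still weeks away.
    status = "WARN: renewal may be failing" if age_days > MAX_CERT_AGE_DAYS else "ok"
    print(f"{host}: cert issued {age_days} days ago [{status}]")
```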
The mTLS Trap
Service-to-service mTLS certificates are the hidden risk. Public-facing certificates get attention because customers see the errors. Internal certificates fail silently or with cryptic connection errors that look like network issues.
Istio service mesh certificates rotate automatically with short lifetimes (24 hours by default). But the root CA certificate that signs them has a longer lifetime, and when that expires, every certificate in the mesh becomes invalid simultaneously. This is how a single expired root CA can take down hundreds of microservices at once.
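It's worth checking the root CA's expiry directly. A minimal sketch, assuming the default self-signed Istio CA (a stock install keeps its root certificate in the istio-ca-secret Secret in istio-system); the secret and key names, kubectl access, and the cryptography package are all assumptions here:

```python
import base64
import json
import subprocess
from datetime import datetime, timezone

from cryptography import x509  # pip install cryptography

# Default locations for the self-signed Istio CA; these names are assumptions
# taken from a stock install and will differ if you plug in your own CA.
NAMESPACE = "istio-system"
SECRET = "istio-ca-secret"
KEY = "ca-cert.pem"

raw = subprocess.check_output(
    ["kubectl", "get", "secret", SECRET, "-n", NAMESPACE, "-o", "json"]
)
pem = base64.b64decode(json.loads(raw)["data"][KEY])
root_ca = x509.load_pem_x509_certificate(pem)

# not_valid_after_utc needs cryptography >= 42; older versions use not_valid_after.
remaining = root_ca.not_valid_after_utc - datetime.now(timezone.utc)
print(f"Istio root CA expires {root_ca.not_valid_after_utc:%Y-%m-%d} ({remaining.days} days)")
if remaining.days < 90:
    print("WARN: plan the root CA rotation now, before every workload cert breaks at once")
```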
Building a Certificate Inventory
You can't monitor what you don't know about. Run a weekly scan of all your endpoints:
- Scan external endpoints with openssl s_client or tools like ssl-cert-check
- Scan internal endpoints through your service mesh
- Check cloud provider certificate stores (ACM, GCP Certificate Manager)
- Check Kubernetes secrets of type kubernetes.io/tls
- Check CI/CD systems that use certificates for code signing
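A minimal sketch of the external-endpoint leg of that scan, with a hard-coded target list standing in for whatever DNS zones and load balancer configs you'd really pull from; the internal, cloud provider, and Kubernetes checks would append rows to the same file:

```python
import csv
import socket
import ssl
from datetime import datetime, timezone

from cryptography import x509  # pip install cryptography

# Hypothetical target list; in practice this comes from DNS zones,
# load balancer configs, and service discovery, not a hard-coded list.
TARGETS = ["example.com:443", "api.example.com:443", "vpn.example.com:443"]

def fetch_leaf_cert(host: str, port: int) -> x509.Certificate:
    """Grab the leaf certificate an endpoint serves, even if it is
    already expired or otherwise untrusted."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            der = tls.getpeercert(binary_form=True)
    return x509.load_der_x509_certificate(der)

with open("cert-inventory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["endpoint", "subject", "issuer", "not_after", "days_left"])
    for target in TARGETS:
        host, _, port = target.partition(":")
        cert = fetch_leaf_cert(host, int(port or 443))
        not_after = cert.not_valid_after_utc  # cryptography >= 42
        days_left = (not_after - datetime.now(timezone.utc)).days
        writer.writerow([target, cert.subject.rfc4514_string(),
                         cert.issuer.rfc4514_string(), not_after.isoformat(), days_left])
```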
Put every certificate into a single inventory with owner, expiry date, renewal method, and deployment locations. This inventory is your source of truth.
Recovery When It Happens
When a cert expires, you need the fastest path to a valid certificate. Keep a runbook with the exact commands for each certificate type. Have backup certificates from a different CA pre-generated and stored in a vault (not expired, obviously). Know which services need a restart vs a reload after cert replacement. Nginx reloads gracefully. Some Java applications require a full restart because they cache the keystore at startup.
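The runbook's verification step is worth automating too. A small sketch (hostname and file path are placeholders) that compares what an endpoint is actually serving against the replacement certificate on disk, so a reload that silently kept the old keystore gets caught:

```python
import hashlib
import socket
import ssl

# Placeholders: the replacement cert you just deployed and the endpoint
# that should now be serving it.
NEW_CERT_PATH = "/etc/nginx/tls/backup-cert.pem"  # leaf certificate only, not the full chain
HOST, PORT = "example.com", 443

def sha256_fingerprint(der: bytes) -> str:
    return hashlib.sha256(der).hexdigest()

# Fingerprint of the cert we expect to be live.
with open(NEW_CERT_PATH) as f:
    expected = sha256_fingerprint(ssl.PEM_cert_to_DER_cert(f.read()))

# Fingerprint of what the endpoint is actually serving right now.
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        served = sha256_fingerprint(tls.getpeercert(binary_form=True))

if served == expected:
    print("OK: endpoint is serving the replacement certificate")
else:
    print("STILL STALE: reload did not pick up the new cert; a restart may be needed")
```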
After recovery, don't just fix the expired cert. Fix the process that let it expire.
Incident Timeline
- T+0m: TLS certificate expires at 00:00 UTC. New connections start failing with SSL handshake errors. Existing keep-alive connections continue working temporarily.
- T+2m: External monitoring detects SSL errors. Error rate climbs as connection pools recycle. Mobile apps are hit harder than web clients due to stricter certificate pinning.
- T+5m: On-call paged. Initial investigation checks application logs and sees connection reset errors. It takes 3 minutes to identify certificate expiry as the root cause.
- T+10m: Team attempts manual certificate renewal. Let's Encrypt rate limits hit because someone ran renewals in a loop during testing earlier that week.
- T+15m: Backup certificate from a different CA (DigiCert) deployed manually. Nginx reloaded. Traffic partially recovers for services behind the load balancer.
- T+30m: Full recovery after updating certificates on all edge nodes, internal mTLS certs, and CDN configuration. Post-incident review scheduled.
Detection Signals
- SSL handshake failure rate exceeding 1% in load balancer metrics
- "x509: certificate has expired" errors appearing in application logs
- Spike in HTTP 502/503 errors at the reverse proxy layer
- Client-side certificate pinning failures reported through crash analytics
Prevention
- Deploy cert-manager in Kubernetes clusters with automated Let's Encrypt renewal at 30 days before expiry, not the default 7 days
- Set up certificate expiry monitoring in Prometheus with alerting at 30, 14, and 7 days before expiration using blackbox_exporter (the threshold ladder is sketched after this list)
- Maintain a certificate inventory spreadsheet or use tools like Keychecker that scan all endpoints weekly
- Use short-lived certificates (90 days or less) to force automation. If renewal only happens annually, the process rots and nobody remembers how
- Test certificate renewal in a staging environment monthly, including the full path from issuance to deployment
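If the inventory scan feeds your own notifier instead of blackbox_exporter, the 30/14/7-day ladder from the alerting item above might look something like this (the severity names are placeholders):

```python
# Severity ladder mirroring the 30/14/7-day alerting thresholds above.
# days_left would come from the inventory scan; routing is up to your pager.
def expiry_severity(days_left: int) -> str | None:
    if days_left <= 7:
        return "page"      # wake someone up: this is about to become a P0
    if days_left <= 14:
        return "ticket"    # must be renewed this sprint
    if days_left <= 30:
        return "warning"   # renewal automation should already have fired
    return None

assert expiry_severity(3) == "page"
assert expiry_severity(21) == "warning"
assert expiry_severity(60) is None
```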
Key Points
- Certificate expiry is the most preventable P0 incident. Every single one is a process failure, never a technical surprise
- Microsoft Teams went down for roughly three hours in February 2020 because an authentication certificate expired. One of the largest engineering organizations in the world forgot to renew a cert
- Internal mTLS certificates are more dangerous than public-facing ones because they're less visible and often have longer lifetimes
- Certificate expiry doesn't cause gradual degradation. It's a cliff: everything works until the exact second of expiry, then nothing works
Common Mistakes
- Relying on calendar reminders instead of automated monitoring for certificate renewals
- Renewing the certificate but forgetting to deploy it to all endpoints (CDN edge nodes, internal services, partner API gateways)
- Using certificate pinning in mobile apps without a rotation mechanism, turning a cert renewal into a forced app update