AI Model Failure Patterns (P1)
Difficulty: Advanced
Key Points for AI Model Failure Patterns
- AI failures are silent failures: the system returns 200 OK with confidently wrong answers, and nothing in your infrastructure monitoring catches it.
- Training-serving skew is the most common root cause of AI production issues: the model sees different data in production than it saw during training.
- Data quality failures cascade into model quality failures with a time delay, often days or weeks, which makes root-cause analysis harder.
- Canary deployments for models must include model quality metrics such as accuracy and prediction distribution, not just latency and error rate.
- Feedback loop delay makes AI incidents fundamentally harder to detect than traditional software failures: a bad model may run for days before anyone notices.
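The training-serving skew point above can be made concrete with a lightweight check: log per-feature summary statistics at training time, then compare them against a window of serving traffic. This is a minimal sketch; the function names and the 3-sigma threshold are illustrative, not from any particular library.

```python
import statistics

def feature_stats(rows, feature):
    """Mean and stdev of one feature across a batch of records."""
    vals = [r[feature] for r in rows if r.get(feature) is not None]
    return {"mean": statistics.mean(vals), "stdev": statistics.pstdev(vals)}

def check_serving_skew(training_stats, serving_rows, max_z=3.0):
    """Flag features whose serving-time mean drifts more than max_z
    training standard deviations away from the training-time mean."""
    skewed = []
    for feature, base in training_stats.items():
        live = feature_stats(serving_rows, feature)
        if base["stdev"] == 0:
            continue  # constant feature: z-score is undefined
        z = abs(live["mean"] - base["mean"]) / base["stdev"]
        if z > max_z:
            skewed.append((feature, round(z, 2)))
    return skewed

# Baseline captured when the model was fit.
baseline = {"age": {"mean": 40.0, "stdev": 10.0}}
# Serving traffic whose ages cluster far above the baseline.
traffic = [{"age": 90.0}, {"age": 95.0}, {"age": 100.0}]
print(check_serving_skew(baseline, traffic))  # [('age', 5.5)]
```

Running this comparison on a schedule (or per batch) surfaces skew before it shows up as degraded predictions.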
Incident Timeline for AI Model Failure Patterns
- T+0d: Data pipeline ingests corrupted data source with missing values and changed schema
- T+1d: Scheduled model retraining completes on corrupted data, model weights updated
- T+1d: Model registry promotes retrained model, canary deployment begins
- T+2d: Canary passes automated checks since latency and error rate look normal
- T+3d: Full rollout to 100% traffic, model quality degradation begins but goes undetected
- T+7d: Customer complaints spike as recommendation quality drops noticeably
- T+8d: Investigation reveals training data corruption, model rolled back to previous version
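The canary gate that passed at T+2d checked only latency and error rate. The sketch below adds a prediction-distribution check alongside the infra checks, so a model whose outputs have silently shifted fails promotion; all thresholds and names here are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def canary_gate(baseline_preds, canary_preds,
                canary_error_rate, canary_p99_ms,
                max_error_rate=0.01, max_p99_ms=200.0,
                max_mean_shift=0.1):
    """Return (passed, reasons). Infra checks alone would pass a model
    whose outputs have silently collapsed; the mean-shift check catches it."""
    reasons = []
    if canary_error_rate > max_error_rate:
        reasons.append("error rate too high")
    if canary_p99_ms > max_p99_ms:
        reasons.append("p99 latency too high")
    shift = abs(mean(canary_preds) - mean(baseline_preds))
    if shift > max_mean_shift:
        reasons.append(f"prediction mean shifted by {shift:.2f}")
    return (not reasons, reasons)

# Healthy infra metrics, but predictions collapsed toward zero.
ok, why = canary_gate(baseline_preds=[0.4, 0.5, 0.6],
                      canary_preds=[0.05, 0.04, 0.06],
                      canary_error_rate=0.001, canary_p99_ms=80.0)
print(ok, why)  # False ['prediction mean shifted by 0.45']
```

A comparison of full prediction distributions (e.g. a two-sample test per bucket) is stronger than a mean shift, but even this one-line check would have blocked the T+3d rollout.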
Detection Signals for AI Model Failure Patterns
- Model prediction distribution shift where outputs cluster differently than baseline
- Feature importance change where a previously unimportant feature suddenly dominates
- Online-offline metric divergence where model performs well on test data but poorly in production
- Business metric degradation that correlates with model deployment timing
- Increase in user-reported quality issues or feedback submission rate
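The first signal, prediction distribution shift, is often quantified with the population stability index (PSI): bucket baseline and live scores, then sum (live share − base share) · ln(live share / base share) over buckets. A common rule of thumb treats PSI above roughly 0.2 as significant drift. A minimal sketch for scores in [0, 1]:

```python
import math

def psi(baseline, live, buckets=10, eps=1e-6):
    """Population stability index between two score samples in [0, 1]."""
    def shares(xs):
        counts = [0] * buckets
        for x in xs:
            i = min(int(x * buckets), buckets - 1)  # clamp x == 1.0
            counts[i] += 1
        # eps avoids log(0) / division by zero for empty buckets
        return [c / len(xs) + eps for c in counts]
    b, l = shares(baseline), shares(live)
    return sum((lv - bv) * math.log(lv / bv) for bv, lv in zip(b, l))

baseline = [i / 100 for i in range(100)]  # roughly uniform scores
shifted  = [i / 200 for i in range(100)]  # scores squeezed into [0, 0.5)
print(round(psi(baseline, baseline), 4))  # 0.0: no drift
print(psi(baseline, shifted) > 0.2)       # True: large drift
```

The same function applied to individual feature columns covers the second signal (feature distribution change) as well.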
Prevention Strategies for AI Model Failure Patterns
- Implement data validation gates (Great Expectations, Deequ) at every pipeline boundary
- Run automated model quality tests (accuracy, bias, calibration) on a held-out evaluation set before every deployment
- Use shadow mode deployments that compare new model outputs against the production model before routing real traffic
- Monitor business-level metrics (conversion rate, user engagement, support tickets) alongside infrastructure metrics
- Maintain the ability to instantly roll back to the previous model version
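The first strategy above names Great Expectations and Deequ; when neither is available, a validation gate can be hand-rolled. The sketch below (schema format, thresholds, and function names are illustrative, not from either library) rejects a batch before retraining starts, which would have stopped the T+0d corruption at the pipeline boundary.

```python
def validate_batch(rows, schema, max_null_rate=0.01):
    """Reject a training batch whose schema or null rate is off.
    schema maps column name -> expected Python type."""
    errors = []
    for col, expected_type in schema.items():
        values = [r.get(col) for r in rows]
        nulls = sum(v is None for v in values)
        if nulls / len(rows) > max_null_rate:
            errors.append(f"{col}: null rate {nulls / len(rows):.0%} too high")
        bad = [v for v in values
               if v is not None and not isinstance(v, expected_type)]
        if bad:
            errors.append(f"{col}: unexpected type {type(bad[0]).__name__}")
    for r in rows:
        extra = set(r) - set(schema)
        if extra:
            errors.append(f"unexpected columns: {sorted(extra)}")
            break
    return errors

schema = {"user_id": int, "score": float}
good = [{"user_id": 1, "score": 0.9}, {"user_id": 2, "score": 0.4}]
bad = [{"user_id": 1, "score": None},
       {"user_id": "2", "score": 0.4, "ts": 0}]
print(validate_batch(good, schema))  # []
print(validate_batch(bad, schema))   # type, null-rate, and column violations
```

Wiring the gate so a non-empty error list aborts the retraining job makes the check fail-closed: corrupted data halts the pipeline instead of silently flowing into new model weights.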
Common Mistakes with AI Model Failure Patterns
- Relying solely on infrastructure monitoring (CPU, memory, latency) for AI systems when the real failures are in output quality
- Not validating training data schema and distribution before kicking off model retraining pipelines
- Deploying a retrained model without comparing its quality metrics against the currently running production version
- Having no rollback plan for model deployments, treating model updates like irreversible database migrations
- Expecting immediate detection of AI quality issues when feedback loops inherently introduce multi-day delays
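The third mistake, deploying a retrained model without comparing it to production, suggests a simple promotion gate: evaluate both models on the same held-out set and promote the candidate only if it is not meaningfully worse. The function names, the toy models, and the tolerance below are hypothetical.

```python
def accuracy(model, eval_set):
    """Fraction of held-out (input, label) pairs the model gets right."""
    correct = sum(model(x) == y for x, y in eval_set)
    return correct / len(eval_set)

def promote_if_better(candidate, production, eval_set, tolerance=0.01):
    """Promote the candidate only if its held-out accuracy is within
    `tolerance` of production's; otherwise keep the current model."""
    cand_acc = accuracy(candidate, eval_set)
    prod_acc = accuracy(production, eval_set)
    if cand_acc >= prod_acc - tolerance:
        return candidate, cand_acc
    return production, prod_acc

# Toy models over (input, label) pairs: production classifies all four
# examples correctly; the candidate, retrained on corrupted data,
# always predicts 0.
eval_set = [(1, 1), (2, 0), (3, 1), (4, 1)]
production = lambda x: 0 if x == 2 else 1   # accuracy 1.0
candidate = lambda x: 0                     # accuracy 0.25
chosen, acc = promote_if_better(candidate, production, eval_set)
print(acc)  # 1.0: the degraded candidate is rejected
```

Keeping the previous version registered and loadable is what makes the other half of the fix (instant rollback) possible: promotion and rollback are then both just a pointer swap in the model registry.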
Related to AI Model Failure Patterns
- Cascading Failure Patterns
- Deployment Rollback Patterns