Chaos Engineering Principles
Breaking Things on Purpose
Chaos engineering is the practice of experimenting on a system to build confidence that it can hold up under rough conditions in production. It is not about randomly killing servers and seeing what happens. There is a method to it: form a hypothesis, design an experiment, control the blast radius, observe the results, and learn from what you find.
The Steady-State Hypothesis
Every chaos experiment starts with a hypothesis about the system's steady state. Something like: "Our checkout flow keeps a success rate above 99.9% when one of three database replicas goes down." That is testable, measurable, and specific. If the experiment confirms the hypothesis, you have more confidence in the system. If it disproves it, you have found a weakness before it caused a real outage.
The steady state needs to be defined in terms of business metrics, not infrastructure metrics. "CPU stays below 80%" is not particularly useful. "Users can complete purchases within 2 seconds" is. The point is to verify that the system keeps delivering value to users even when infrastructure breaks underneath it.
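In practice, a steady-state check can be a small script that queries the business metric and asserts the threshold before, during, and after the experiment. Here is a minimal sketch assuming a Prometheus server and hypothetical checkout metrics; substitute whatever your observability stack exposes.

```python
"""Minimal steady-state check: assert the business metric holds.

Assumes a Prometheus server at PROM_URL and hypothetical checkout
metric names; adapt both to your own stack.
"""
import requests

PROM_URL = "http://prometheus.internal:9090"  # assumption: your Prometheus
QUERY = (
    "sum(rate(checkout_success_total[5m]))"
    " / sum(rate(checkout_attempts_total[5m]))"
)  # hypothetical metric names
THRESHOLD = 0.999  # steady-state hypothesis: success rate >= 99.9%

def steady_state_ok() -> bool:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=5
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return False  # no data is itself a violated hypothesis
    success_rate = float(result[0]["value"][1])
    return success_rate >= THRESHOLD

if __name__ == "__main__":
    print("steady state holds" if steady_state_ok() else "hypothesis violated")
```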
Controlling the Blast Radius
The most important operational rule in chaos engineering is keeping the blast radius small. Start with the smallest scope you can: one instance, in one availability zone, handling a thin slice of traffic. Watch closely. Only widen the scope after you have built confidence from smaller runs.
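In code, blast-radius control mostly means guard rails around target selection. A sketch along these lines, where `list_instances` is a hypothetical inventory lookup you would wire to your cloud provider's API (for example, boto3 on AWS):

```python
"""Blast-radius scoping sketch: pick the smallest possible target."""
import random

def list_instances(service: str, az: str) -> list[str]:
    """Hypothetical: return instance IDs for one service in one AZ."""
    raise NotImplementedError("wire this to your inventory or cloud API")

def pick_target(service: str, az: str) -> str:
    instances = list_instances(service, az)
    # Refuse to run if the fleet is too small to absorb the loss.
    if len(instances) < 3:
        raise RuntimeError("fleet too small: aborting to protect capacity")
    return random.choice(instances)  # exactly one instance, in one AZ
```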
Netflix's Chaos Monkey, the tool that put chaos engineering on the map, started by terminating individual EC2 instances during business hours. It took years of practice, not weeks, before Netflix moved on to Chaos Kong, which simulates an entire AWS region going down.
Game Days
Game days are scheduled exercises where a team intentionally injects failures and practices the incident response process. Unlike automated chaos experiments, game days are really about the human side. Can the team detect the failure? Do they know which runbook to grab? Can they communicate clearly under pressure?
The best game days feel indistinguishable from real incidents for the people responding. The facilitator injects a realistic failure, and the on-call team responds as though it were the real thing: paging, triaging, communicating, resolving. After the exercise, everyone debriefs on what went well and what needs work.
The Netflix Principles
Netflix laid out the foundational principles: (1) Build a hypothesis around steady-state behavior. (2) Inject realistic failures such as server crashes, network partitions, clock skew, and certificate expiry. (3) Run experiments in production, because staging environments are too different to give reliable results. (4) Automate experiments so they run continuously. (5) Keep the blast radius tight using traffic shadowing and gradual ramp-up.
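Several of these principles translate naturally into a control loop: verify steady state, inject the failure, keep checking, and always revert. A sketch of that loop, with placeholder injection and metric functions (none of this is Netflix's actual tooling):

```python
"""Sketch of an automated experiment loop.

inject_failure, revert_failure, and steady_state_ok are placeholders
for your own tooling; the control flow is the point.
"""
import time

def run_experiment(inject_failure, revert_failure, steady_state_ok,
                   duration_s: int = 300, check_every_s: int = 10) -> bool:
    # Never inject if the system is already unhealthy.
    if not steady_state_ok():
        raise RuntimeError("steady state violated before injection: abort")
    inject_failure()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if not steady_state_ok():
                return False  # hypothesis disproved: weakness found
            time.sleep(check_every_s)
        return True  # hypothesis held for the full window
    finally:
        revert_failure()  # always restore, even on abort or crash
```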
Tools of the Trade
Chaos Monkey (Netflix) terminates random instances. Litmus (CNCF) provides Kubernetes-native chaos experiments. Gremlin offers a commercial platform with safety controls and a library of pre-built experiments. Toxiproxy (Shopify) simulates network conditions between services. AWS Fault Injection Simulator gives you managed chaos experiments within the AWS ecosystem. The specific tool matters less than the practice itself. Start with manual experiments and adopt tooling as your practice matures.
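As an example of what simulating network conditions looks like in practice, the snippet below drives Toxiproxy's HTTP control API (default port 8474) to add latency between an app and its database. The proxy name, listen address, and upstream here are illustrative assumptions.

```python
"""Example: injecting latency between services with Toxiproxy."""
import requests

TOXIPROXY = "http://localhost:8474"  # Toxiproxy's default control port

# Route traffic destined for the database through a proxy.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "postgres",
    "listen": "127.0.0.1:21212",     # point your app at this address
    "upstream": "db.internal:5432",  # assumption: your real database
}).raise_for_status()

# Add 1s +/- 100ms of latency to all downstream traffic.
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 1000, "jitter": 100},
}).raise_for_status()

# Kill switch: deleting the toxic restores normal conditions
# (Toxiproxy names an unnamed toxic "<type>_<stream>" by default).
# requests.delete(f"{TOXIPROXY}/proxies/postgres/toxics/latency_downstream")
```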
Experiment Timeline
- Week 1: Define your steady-state hypothesis and decide on success metrics for the experiment
- Week 2: Design the experiment with a small blast radius (single AZ, low traffic)
- Week 3: Run the experiment during business hours with the team standing by to abort
- Week 3: Analyze results. Did the system behave the way you expected?
- Week 4: Document findings and create action items for any gaps that turned up
- Weeks 5-6: Implement fixes and re-run the experiment to confirm they worked
Detection Signals
- Systems that have never been tested for how they handle failures
- Runbooks that have not been validated against real production conditions
- Recovery procedures that only exist as documentation nobody has tried
- Teams that cannot clearly describe their system's failure domains
Prevention
- Start chaos experiments in non-production environments before touching prod
- Always have a kill switch so you can abort experiments immediately (a minimal sketch follows this list)
- Run experiments during business hours when the team is around to respond
- Begin with failure modes you already know about before exploring new ones
- Get organizational buy-in before running chaos experiments in production
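As promised above, here is a minimal kill-switch sketch: a flag file plus a signal handler, so anyone on the team can abort and the system is always restored before the runner exits. The flag path and function names are assumptions; use whatever your team can reach fastest under pressure.

```python
"""Kill-switch sketch: any operator can abort the experiment."""
import os
import signal
import sys

ABORT_FLAG = "/tmp/chaos-abort"  # assumption: touch this file to abort

def should_abort() -> bool:
    """Poll this inside the experiment loop between injections."""
    return os.path.exists(ABORT_FLAG)

def install_sigint_abort(revert_failure) -> None:
    """Ctrl-C reverts the injected failure before stopping the runner."""
    def handler(signum, frame):
        revert_failure()  # restore the system first
        sys.exit(1)       # then stop the experiment runner
    signal.signal(signal.SIGINT, handler)
```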
Key Points
- Chaos engineering is not about breaking things at random. It is about forming hypotheses about how the system behaves and testing them scientifically.
- The steady-state hypothesis spells out what 'working correctly' looks like in measurable terms, before you inject any failure.
- Blast radius control is non-negotiable. Start small and only expand once you have built confidence.
- Game days are the team-level version of chaos engineering: scheduled exercises where the team responds to simulated incidents.
- The value of chaos engineering compounds over time. Each experiment reveals and fixes weaknesses before they cause real outages.
Common Mistakes
- Running chaos experiments without a kill switch, so if things go sideways, you cannot stop them
- Starting with large blast-radius experiments instead of building confidence with small, contained tests first
- Treating chaos engineering as a one-time exercise rather than an ongoing practice that evolves alongside your system
- Not getting leadership buy-in before running experiments, since chaos engineering needs organizational support to actually work