Chaos Engineering Principles
Breaking Things on Purpose
Chaos engineering is the practice of experimenting on a system to build confidence that it can hold up under rough conditions in production. It is not about randomly killing servers and seeing what happens. There is a method to it: form a hypothesis, design an experiment, control the blast radius, observe the results, and learn from what you find.
The Steady-State Hypothesis
Every chaos experiment starts with a hypothesis about the system's steady state. Something like: "Our checkout flow keeps a success rate above 99.9% when one of three database replicas goes down." That is testable, measurable, and specific. If the experiment confirms the hypothesis, you have more confidence in the system. If it disproves it, you have found a weakness before it caused a real outage.
The steady state needs to be defined in terms of business metrics, not infrastructure metrics. "CPU stays below 80%" is not particularly useful. "Users can complete purchases within 2 seconds" is. The point is to verify that the system keeps delivering value to users even when infrastructure breaks underneath it.
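In practice, a steady-state check can be a small script that queries the business metric and asserts the threshold before, during, and after the experiment. Here is a minimal sketch assuming a Prometheus server and hypothetical checkout metrics; substitute whatever your observability stack exposes.

```python
"""Minimal steady-state check: assert the business metric holds.

Assumes a Prometheus server at PROM_URL and hypothetical checkout
metric names; adapt both to your own stack.
"""
import requests

PROM_URL = "http://prometheus.internal:9090"  # assumption: your Prometheus
QUERY = (
    "sum(rate(checkout_success_total[5m]))"
    " / sum(rate(checkout_attempts_total[5m]))"
)  # hypothetical metric names
THRESHOLD = 0.999  # steady-state hypothesis: success rate >= 99.9%

def steady_state_ok() -> bool:
    resp = requests.get(
        f"{PROM_URL}/api/v1/query", params={"query": QUERY}, timeout=5
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    if not result:
        return False  # no data is itself a violated hypothesis
    success_rate = float(result[0]["value"][1])
    return success_rate >= THRESHOLD

if __name__ == "__main__":
    print("steady state holds" if steady_state_ok() else "hypothesis violated")
```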
Controlling the Blast Radius
The most important operational rule in chaos engineering is keeping the blast radius small. Start with the smallest scope you can: one instance, in one availability zone, handling a thin slice of traffic. Watch closely. Only widen the scope after you have built confidence from smaller runs.
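In code, blast-radius control mostly means guard rails around target selection. A sketch along these lines, where `list_instances` is a hypothetical inventory lookup you would wire to your cloud provider's API (for example, boto3 on AWS):

```python
"""Blast-radius scoping sketch: pick the smallest possible target."""
import random

def list_instances(service: str, az: str) -> list[str]:
    """Hypothetical: return instance IDs for one service in one AZ."""
    raise NotImplementedError("wire this to your inventory or cloud API")

def pick_target(service: str, az: str) -> str:
    instances = list_instances(service, az)
    # Refuse to run if the fleet is too small to absorb the loss.
    if len(instances) < 3:
        raise RuntimeError("fleet too small: aborting to protect capacity")
    return random.choice(instances)  # exactly one instance, in one AZ
```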
Netflix's Chaos Monkey, the tool that put chaos engineering on the map, started by terminating individual EC2 instances during business hours. It took years of practice, not weeks, before Netflix moved on to Chaos Kong, which simulates an entire AWS region going down.
Game Days
Game days are scheduled exercises where a team intentionally injects failures and practices the incident response process. Unlike automated chaos experiments, game days are really about the human side. Can the team detect the failure? Do they know which runbook to grab? Can they communicate clearly under pressure?
The best game days feel indistinguishable from real incidents for the people responding. The facilitator injects a realistic failure, and the on-call team responds as though it were the real thing: paging, triaging, communicating, resolving. After the exercise, everyone debriefs on what went well and what needs work.
The Netflix Principles
Netflix laid out the foundational principles: (1) Build a hypothesis around steady-state behavior. (2) Inject realistic failures such as server crashes, network partitions, clock skew, and certificate expiry. (3) Run experiments in production, because staging environments are too different to give reliable results. (4) Automate experiments so they run continuously. (5) Keep the blast radius tight using traffic shadowing and gradual ramp-up.
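Several of these principles translate naturally into a control loop: verify steady state, inject the failure, keep checking, and always revert. A sketch of that loop, with placeholder injection and metric functions (none of this is Netflix's actual tooling):

```python
"""Sketch of an automated experiment loop.

inject_failure, revert_failure, and steady_state_ok are placeholders
for your own tooling; the control flow is the point.
"""
import time

def run_experiment(inject_failure, revert_failure, steady_state_ok,
                   duration_s: int = 300, check_every_s: int = 10) -> bool:
    # Never inject if the system is already unhealthy.
    if not steady_state_ok():
        raise RuntimeError("steady state violated before injection: abort")
    inject_failure()
    try:
        deadline = time.monotonic() + duration_s
        while time.monotonic() < deadline:
            if not steady_state_ok():
                return False  # hypothesis disproved: weakness found
            time.sleep(check_every_s)
        return True  # hypothesis held for the full window
    finally:
        revert_failure()  # always restore, even on abort or crash
```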
Tools of the Trade
Chaos Monkey (Netflix) terminates random instances. Litmus (CNCF) provides Kubernetes-native chaos experiments. Gremlin offers a commercial platform with safety controls and a library of pre-built experiments. Toxiproxy (Shopify) simulates network conditions between services. AWS Fault Injection Simulator gives you managed chaos experiments within the AWS ecosystem. The specific tool matters less than the practice itself. Start with manual experiments and adopt tooling as your practice matures.
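As an example of what simulating network conditions looks like in practice, the snippet below drives Toxiproxy's HTTP control API (default port 8474) to add latency between an app and its database. The proxy name, listen address, and upstream here are illustrative assumptions.

```python
"""Example: injecting latency between services with Toxiproxy."""
import requests

TOXIPROXY = "http://localhost:8474"  # Toxiproxy's default control port

# Route traffic destined for the database through a proxy.
requests.post(f"{TOXIPROXY}/proxies", json={
    "name": "postgres",
    "listen": "127.0.0.1:21212",     # point your app at this address
    "upstream": "db.internal:5432",  # assumption: your real database
}).raise_for_status()

# Add 1s +/- 100ms of latency to all downstream traffic.
requests.post(f"{TOXIPROXY}/proxies/postgres/toxics", json={
    "type": "latency",
    "stream": "downstream",
    "attributes": {"latency": 1000, "jitter": 100},
}).raise_for_status()

# Kill switch: deleting the toxic restores normal conditions
# (Toxiproxy names an unnamed toxic "<type>_<stream>" by default).
# requests.delete(f"{TOXIPROXY}/proxies/postgres/toxics/latency_downstream")
```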
Experiment Timeline
- Week 1: Define your steady-state hypothesis and decide on success metrics for the experiment
- Week 2: Design the experiment with a small blast radius (single AZ, low traffic)
- Week 3: Run the experiment during business hours with the team standing by to abort
- Week 3: Analyze results. Did the system behave the way you expected?
- Week 4: Document findings and create action items for any gaps that turned up
- Weeks 5-6: Implement fixes and re-run the experiment to confirm they worked
Detection Signals
- Systems that have never been tested for how they handle failures
- Runbooks that have not been validated against real production conditions
- Recovery procedures that only exist as documentation nobody has tried
- Teams that cannot clearly describe their system's failure domains
Prevention
- Start chaos experiments in non-production environments before touching prod
- Always have a kill switch so you can abort experiments immediately (a minimal sketch follows this list)
- Run experiments during business hours when the team is around to respond
- Begin with failure modes you already know about before exploring new ones
- Get organizational buy-in before running chaos experiments in production
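As promised above, here is a minimal kill-switch sketch: a flag file plus a signal handler, so anyone on the team can abort and the system is always restored before the runner exits. The flag path and function names are assumptions; use whatever your team can reach fastest under pressure.

```python
"""Kill-switch sketch: any operator can abort the experiment."""
import os
import signal
import sys

ABORT_FLAG = "/tmp/chaos-abort"  # assumption: touch this file to abort

def should_abort() -> bool:
    """Poll this inside the experiment loop between injections."""
    return os.path.exists(ABORT_FLAG)

def install_sigint_abort(revert_failure) -> None:
    """Ctrl-C reverts the injected failure before stopping the runner."""
    def handler(signum, frame):
        revert_failure()  # restore the system first
        sys.exit(1)       # then stop the experiment runner
    signal.signal(signal.SIGINT, handler)
```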
Key Points
- Chaos engineering is not about breaking things at random. It is about forming hypotheses about how the system behaves and testing them scientifically.
- The steady-state hypothesis spells out what 'working correctly' looks like in measurable terms, before you inject any failure.
- Blast radius control is non-negotiable. Start small and only expand once you have built confidence.
- Game days are the team-level version of chaos engineering: scheduled exercises where the team responds to simulated incidents.
- The value of chaos engineering compounds over time. Each experiment reveals and fixes weaknesses before they cause real outages.
Common Mistakes
- Running chaos experiments without a kill switch, so if things go sideways, you cannot stop them
- Starting with large blast-radius experiments instead of building confidence with small, contained tests first
- Treating chaos engineering as a one-time exercise rather than an ongoing practice that evolves alongside your system
- Not getting leadership buy-in before running experiments, since chaos engineering needs organizational support to actually work