Anatomy of a Production Incident
What Actually Happens When Things Break
Production incidents are about people and communication just as much as they are about technology. Your phone buzzes at 2 AM, the alert is real, and your heart rate goes up. What you do in the next 30 minutes is the difference between a short blip and a multi-hour outage that ends up on Hacker News.
People Make or Break the Response
The incident commander exists because somebody needs to keep the big picture in focus while everyone else is buried in logs and dashboards. Without that separation, you end up with five engineers trying five different things at once, sometimes making the problem worse. The communicator handles stakeholder updates so leadership is not flooding the on-call channel asking "any update?" every two minutes.
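One low-effort way to keep those stakeholder updates flowing is to script them. Here is a minimal sketch in Python that posts a structured update to a Slack incoming webhook; the webhook URL and message layout are placeholders, not a prescribed format:

```python
import json
import urllib.request

# Hypothetical incoming-webhook URL; swap in your own.
WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"

def post_status_update(summary: str, impact: str, next_update_min: int) -> None:
    """Post a structured incident update so stakeholders stop asking in-channel."""
    text = (
        f"*Incident update*: {summary}\n"
        f"*Impact*: {impact}\n"
        f"*Next update in*: {next_update_min} minutes"
    )
    req = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Committing to a stated cadence ("next update in 15 minutes") is the part that actually calms stakeholders; the tooling just makes that promise cheap to keep.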
The Hardest Call You Will Make
Rolling back versus pushing forward is the toughest decision in any incident. Engineers are wired to want to find and fix the root cause. It feels more complete, more satisfying. But when you are under pressure with incomplete information, rollback wins almost every time. You can dig into root cause once traffic is healthy again. Trying to fix forward introduces new risk at exactly the wrong moment.
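If your platform treats rollback as a first-class operation, the mechanical part takes seconds. A minimal sketch assuming a Kubernetes deployment with kubectl on the path; the `checkout-service` name and namespace are hypothetical:

```python
import subprocess

def rollback(deployment: str, namespace: str = "production") -> None:
    """Revert a deployment to its previous revision and wait for it to settle."""
    subprocess.run(
        ["kubectl", "rollout", "undo", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )
    # Block until the rolled-back pods are ready, so the incident channel
    # gets a clear "rollback complete" moment.
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{deployment}", "-n", namespace],
        check=True,
    )

if __name__ == "__main__":
    rollback("checkout-service")
```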
You Never Have the Full Picture
During a real incident, the information you want is never all there. Logs are lagging, dashboards show stale data, and the dependency graph is more tangled than anyone remembers. This is exactly why runbooks exist. They capture decisions that were made calmly, not in the middle of a live outage. Teams that run game days regularly build the muscle memory that turns a real incident from terrifying into manageable.
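One way to keep runbooks from rotting is to store them as structured data next to the code they cover, so they go through review like everything else. A minimal sketch; the incident type, steps, and command here are invented for illustration:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RunbookStep:
    action: str
    command: Optional[str] = None  # None for judgment calls with no command

@dataclass
class Runbook:
    incident_type: str
    steps: list[RunbookStep] = field(default_factory=list)

# Hypothetical entry for the elevated-error-rate scenario described above.
ELEVATED_ERROR_RATE = Runbook(
    incident_type="elevated-error-rate",
    steps=[
        RunbookStep("Check whether a deploy landed in the last 30 minutes"),
        RunbookStep(
            "Roll back if the timing lines up",
            command="kubectl rollout undo deployment/checkout-service",
        ),
        RunbookStep("Confirm error rate is back at baseline before the all-clear"),
    ],
)
```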
After Things Settle Down
The incident is not over when the error rate goes back to normal. The all-clear kicks off the post-mortem process, and that is where the real organizational learning happens. Every incident is a chance to make the system stronger, but only if the team actually follows through on the action items that come out of it.
Incident Timeline
- T+0m: Monitoring alert fires, error rate crosses the 5% threshold
- T+2m: On-call engineer picks up the alert and starts initial triage
- T+5m: Incident is declared, a comms channel goes up, stakeholders get notified
- T+10m: Team forms a hypothesis on root cause, suspects the recent deployment
- T+15m: Rollback kicks off to the last known good version
- T+20m: Error rate drops back to baseline, monitoring confirms recovery
- T+25m: All-clear goes out, incident moves into the post-mortem phase
Detection Signals
- Error rate jumps above baseline (more than 2x normal; see the sketch after this list)
- P99 latency blows past the SLO threshold
- Sudden drop in successful transaction volume
- Customer support tickets start piling up fast
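The first three of these are machine-checkable, which is what makes them good paging signals. A minimal sketch in Python; the baseline, SLO, and drop fraction are assumed values, not recommendations:

```python
# Thresholds are illustrative assumptions, not recommendations.
ERROR_RATE_BASELINE = 0.02   # assume ~2% errors is "normal"
P99_SLO_MS = 500             # assumed latency SLO
TXN_DROP_FRACTION = 0.5      # alert if successful volume halves

def should_page(
    error_rate: float,
    p99_ms: float,
    txn_volume: float,
    txn_baseline: float,
) -> bool:
    """Return True when any machine-checkable detection signal fires.

    The fourth signal (support tickets piling up) is a human one and is
    deliberately left out here.
    """
    if error_rate > 2 * ERROR_RATE_BASELINE:            # more than 2x normal
        return True
    if p99_ms > P99_SLO_MS:                             # P99 past the SLO
        return True
    if txn_volume < TXN_DROP_FRACTION * txn_baseline:   # sudden volume drop
        return True
    return False
```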
Prevention
- Use canary deployments so you catch problems before they hit all your traffic (see the sketch after this list)
- Set up automated rollback triggers tied to error rate thresholds
- Keep runbooks updated for the most common incident types
- Run game days regularly so the team practices incident response before it matters
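As one possible wiring of the first two items, here is a sketch that gates a paused Kubernetes rollout (a crude canary) on its error rate and undoes it automatically when the canary looks worse than baseline; the tolerance, names, and the paused-rollout approach are all assumptions:

```python
import subprocess

def evaluate_canary(
    deployment: str,
    canary_error_rate: float,
    baseline_error_rate: float,
    tolerance: float = 1.5,
) -> None:
    """Resume a paused rollout if the canary is healthy, otherwise undo it.

    Error rates are fractions (0.01 == 1%); tolerance is how many times the
    baseline the canary may reach before we abort.
    """
    if canary_error_rate <= tolerance * max(baseline_error_rate, 1e-6):
        # Healthy: continue the rollout to the rest of the fleet.
        subprocess.run(
            ["kubectl", "rollout", "resume", f"deployment/{deployment}"],
            check=True,
        )
    else:
        # Unhealthy: roll back before the bad version reaches all traffic.
        subprocess.run(
            ["kubectl", "rollout", "undo", f"deployment/{deployment}"],
            check=True,
        )
```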
Key Points
- Those first five minutes decide whether an incident spirals or gets boxed in. Fast acknowledgment matters more than anything.
- Assigning clear roles (incident commander, communicator, investigator) keeps people from stepping on each other.
- Every incident has a communication timeline that is just as important as the technical timeline.
- Rolling back is almost always faster than trying to fix things live. Default to rollback.
- Write things down during the incident, not after. You will forget details faster than you think.
Common Mistakes
- Too many people jumping in at once without coordination, which creates noise and conflicting actions
- Skipping the formal declaration and trying to quietly fix it, which delays awareness for everyone else
- Going for a forward fix under time pressure instead of just rolling back
- Not sending regular status updates, which makes stakeholders assume the worst