Incident Response Leadership
The Incident Commander Role
When an incident fires, the natural instinct is for the most senior engineer to start debugging. This is wrong. The most senior person should be coordinating, not coding. PagerDuty's incident response framework separates the Incident Commander (IC) from the technical responders for exactly this reason.
The IC's job: open the incident channel, set severity, assign roles, manage the timeline, coordinate external communication, and make escalation decisions. They don't need to understand the technical details at a deep level. They need to keep the response organized and make sure the right people are working on the right things.
Rotate IC duty across your senior engineers and managers. Train them explicitly. Run tabletop exercises quarterly where you simulate an incident and practice the response flow. Muscle memory matters when you're running on adrenaline at 2am.
Communication During Outages
Three audiences need updates during an incident: customers, executives, and the engineering team.
Customers get status page updates via Statuspage, Instatus, or your equivalent. Update within 15 minutes of detection. Use plain language. "We are investigating elevated error rates affecting checkout" is better than "We are experiencing issues with our distributed transaction processing subsystem." Update every 30 minutes until resolution.
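The cadence above (first update within 15 minutes of detection, then every 30 minutes) can be tracked with a small helper. This is a minimal sketch, not a Statuspage or Instatus integration: the function names `next_update_due` and `is_overdue` are hypothetical, and only the two intervals come from the text.

```python
from datetime import datetime, timedelta

# Intervals taken from the guidance above.
FIRST_UPDATE_WITHIN = timedelta(minutes=15)
UPDATE_INTERVAL = timedelta(minutes=30)


def next_update_due(detected_at, last_update_at=None):
    """Return the deadline for the next customer-facing update.

    Before any update has been posted, the deadline is measured from
    detection; afterwards it is measured from the most recent update.
    """
    if last_update_at is None:
        return detected_at + FIRST_UPDATE_WITHIN
    return last_update_at + UPDATE_INTERVAL


def is_overdue(now, detected_at, last_update_at=None):
    """True when the next update deadline has already passed."""
    return now > next_update_due(detected_at, last_update_at)
```

An IC or a chat bot could call `is_overdue` on a timer and nudge whoever owns external communication when it returns `True`.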
Executives want to know three things: what the customer impact is, what the estimated time to resolution is, and who is working on it. Set up a separate Slack channel or thread for executive updates. Push updates to them on a 30-minute cadence. Don't wait for them to ask.
The engineering team gets real-time updates in the incident channel. Keep a running timeline document. Every significant action, hypothesis, and finding gets timestamped. This becomes the foundation of your post-incident review.
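One way to keep the running timeline consistent is to record each entry as structured data and render it on demand. The sketch below is an illustration under assumptions: the class names, the three entry kinds (mirroring "action, hypothesis, and finding"), and the output format are all hypothetical, and a shared document or chat bot would serve equally well.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Entry kinds mirror the "action, hypothesis, and finding" wording above.
VALID_KINDS = {"action", "hypothesis", "finding"}


@dataclass
class TimelineEntry:
    at: datetime
    kind: str
    author: str
    note: str


@dataclass
class IncidentTimeline:
    entries: list = field(default_factory=list)

    def record(self, kind, author, note, at=None):
        """Append a timestamped entry; defaults to the current UTC time."""
        if kind not in VALID_KINDS:
            raise ValueError(f"unknown entry kind: {kind!r}")
        when = at if at is not None else datetime.now(timezone.utc)
        entry = TimelineEntry(when, kind, author, note)
        self.entries.append(entry)
        return entry

    def render(self):
        """Return the timeline as text, oldest entry first."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at:%H:%M} UTC [{e.kind}] {e.author}: {e.note}" for e in ordered
        )
```

Because entries are sorted at render time, responders can backfill something they forgot to log without breaking the chronological order that the post-incident review depends on.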
Post-Incident Reviews
Good post-incident reviews (PIRs) produce systemic improvements. Bad ones produce blame and anxiety. The difference is facilitation.
Start with the timeline. Walk through what happened, when, and what actions were taken. No judgment during the timeline review. Then ask: what went well, what could be improved, and what were we lucky about? The "lucky" question surfaces near-misses that reveal fragility.
Action items must be specific, assigned, and tracked. "Improve monitoring" is not an action item. "Add alerting for payment processing latency p99 exceeding 500ms, owned by Sarah, due by March 15" is an action item. Track completion rates. If your PIR action item completion rate is below 70%, the process is not working.
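The completion-rate check is simple arithmetic: closed items divided by total items, compared against the 70% threshold. A minimal sketch follows, assuming action items are exported from your tracker into a simple record; the `ActionItem` fields and the function names are hypothetical, and the year in the usage example is assumed because the text gives only "March 15".

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ActionItem:
    description: str  # specific, e.g. a concrete alert and its threshold
    owner: str        # one named person
    due: date
    done: bool = False


def completion_rate(items):
    """Fraction of action items marked done, or None if there are none."""
    if not items:
        return None
    return sum(1 for item in items if item.done) / len(items)


def needs_attention(items, threshold=0.70):
    """True when the completion rate is below the threshold (70% in the text)."""
    rate = completion_rate(items)
    return rate is not None and rate < threshold


# Usage: the example action item from the text (year is an assumption).
items = [
    ActionItem(
        "Add alerting for payment processing latency p99 exceeding 500ms",
        owner="Sarah",
        due=date(2025, 3, 15),
    ),
]
```

With the single open item above, `completion_rate(items)` is `0.0` and `needs_attention(items)` is `True`. Computing the rate across all PIRs in a quarter, instead of per incident, gives a steadier signal.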
Building an Early-Reporting Culture
The single most impactful thing you can do for incident response is make it safe to report problems. Engineers who fear blame will hide issues until they become catastrophes. Engineers who feel safe will flag anomalies when they're still small.
Publicly thank people who report incidents. Celebrate catches. Share examples in team meetings of times when early detection prevented a major outage. Etsy's "three-armed sweater" award, given to the engineer behind the most surprising failure, grew out of its blameless post-mortem culture and set a standard that many companies have adopted in their own way.
Never punish someone for an honest mistake that caused an incident. Punish concealment. Make this distinction explicit and repeat it often.
Key Points
- The Incident Commander role exists to coordinate the response, not to be the person who fixes the problem
- Customer-facing communication during an outage matters as much as the technical fix
- Post-incident reviews should focus on systemic improvements, not individual blame
- Teams that report incidents early have fewer severe outages because small problems get caught before they cascade
- Executive briefings during incidents should be on a fixed cadence, every 30 minutes until resolution
Common Mistakes
- ✗ Letting the most senior engineer become the de facto incident commander while also debugging
- ✗ Waiting until you fully understand the problem before communicating externally. Update early, update often
- ✗ Treating post-incident reviews as a formality with no follow-through on action items
- ✗ Punishing people who report incidents, which guarantees they'll hide the next one