Incident Response Leadership

The Incident Commander Role

When an incident fires, the natural instinct is for the most senior engineer to start debugging. This is wrong. The most senior person should be coordinating, not coding. PagerDuty's incident response framework separates the Incident Commander (IC) from the technical responders for exactly this reason.

The IC's job: open the incident channel, set severity, assign roles, manage the timeline, coordinate external communication, and make escalation decisions. They don't need to understand the technical details at a deep level. They need to keep the response organized and make sure the right people are working on the right things.
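The separation between coordinating and debugging can be made concrete in whatever tooling tracks the incident. The following is a minimal sketch, not PagerDuty's data model: the `Incident` class, the `Severity` levels, and the names used are all hypothetical, chosen only to show one IC per incident and a guard that keeps the IC out of the responder list.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Severity(Enum):
    # Hypothetical three-level scale; use whatever levels your organization defines.
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


@dataclass
class Incident:
    title: str
    severity: Severity
    commander: str  # the IC: coordinates, does not debug
    responders: List[str] = field(default_factory=list)

    def assign_responder(self, name: str) -> None:
        """Add a technical responder, refusing to double-book the IC."""
        if name == self.commander:
            raise ValueError("the incident commander coordinates and is not a technical responder")
        if name not in self.responders:
            self.responders.append(name)


incident = Incident("Elevated checkout errors", Severity.SEV2, commander="dana")
incident.assign_responder("alex")
```

Asking the IC to hand off command before picking up technical work is one way to enforce the role split; a team might reasonably choose a softer warning instead of an error.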

Rotate IC duty across your senior engineers and managers. Train them explicitly. Run tabletop exercises quarterly where you simulate an incident and practice the response flow. Muscle memory matters when you're running on adrenaline at 2am.

Communication During Outages

Three audiences need updates during an incident: customers, executives, and the engineering team.

Customers get status page updates via Statuspage, Instatus, or your equivalent. Post the first update within 15 minutes of detection. Use plain language. "We are investigating elevated error rates affecting checkout" is better than "We are experiencing issues with our distributed transaction processing subsystem." After that, update every 30 minutes until resolution.
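The two timing rules for customer updates are easy to lose track of mid-incident, so it can help to compute the next deadline rather than remember it. This is a small sketch under the assumption that you record the detection time and the time of the last posted update; the function names are hypothetical and it does not call any status page API.

```python
from datetime import datetime, timedelta
from typing import Optional

FIRST_UPDATE_WITHIN = timedelta(minutes=15)  # first update within 15 minutes of detection
UPDATE_INTERVAL = timedelta(minutes=30)      # then every 30 minutes until resolution


def next_update_due(detected_at: datetime, last_update_at: Optional[datetime]) -> datetime:
    """Return the deadline for the next customer-facing update."""
    if last_update_at is None:
        return detected_at + FIRST_UPDATE_WITHIN
    return last_update_at + UPDATE_INTERVAL


def is_overdue(detected_at: datetime, last_update_at: Optional[datetime], now: datetime) -> bool:
    """True once the deadline for the next update has passed."""
    return now > next_update_due(detected_at, last_update_at)
```

An IC or a chat bot could call `is_overdue` on a timer and nudge whoever owns external communication.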

Executives want to know three things: what the customer impact is, what the estimated time to resolution is, and who is working on it. Set up a separate Slack channel or thread for executive updates. Push updates to them on a 30-minute cadence. Don't wait for them to ask.

The engineering team gets real-time updates in the incident channel. Keep a running timeline document. Every significant action, hypothesis, and finding gets timestamped. This becomes the foundation of your post-incident review.
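A running timeline needs very little structure: a timestamp, what kind of entry it is, who wrote it, and the note itself. The sketch below is one possible shape, assuming entries are kept in memory and timestamps are timezone-aware; the `Timeline` and `TimelineEntry` names are hypothetical, and in practice the same fields could live in a shared document or a chat bot.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

KINDS = ("action", "hypothesis", "finding")  # the three entry types named in the text


@dataclass
class TimelineEntry:
    at: datetime
    kind: str
    author: str
    note: str


@dataclass
class Timeline:
    entries: List[TimelineEntry] = field(default_factory=list)

    def record(self, kind: str, author: str, note: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        """Append a timestamped entry; defaults to the current time in UTC."""
        if kind not in KINDS:
            raise ValueError(f"kind must be one of {KINDS}")
        when = at if at is not None else datetime.now(timezone.utc)
        if when.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware")
        entry = TimelineEntry(when, kind, author, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        """Return the entries in chronological order, one line each, in UTC."""
        ordered = sorted(self.entries, key=lambda e: e.at)
        return "\n".join(
            f"{e.at.astimezone(timezone.utc):%H:%M} UTC [{e.kind}] {e.author}: {e.note}"
            for e in ordered
        )
```

Sorting at render time means late entries ("at 02:05 we also tried X") still land in the right place when the timeline is read back during the review.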

Post-Incident Reviews

Good post-incident reviews (PIRs) produce systemic improvements. Bad ones produce blame and anxiety. The difference is facilitation.

Start with the timeline. Walk through what happened, when, and what actions were taken. No judgment during the timeline review. Then ask: what went well, what could be improved, and what were we lucky about? The "lucky" question surfaces near-misses that reveal fragility.

Action items must be specific, assigned, and tracked. "Improve monitoring" is not an action item. "Add alerting for payment processing latency p99 exceeding 500ms, owned by Sarah, due by March 15" is an action item. Track completion rates. If your PIR action item completion rate is below 70%, the process is not working.
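Tracking the completion rate is simple arithmetic once action items are stored with an owner and a due date. The sketch below assumes a hypothetical `ActionItem` record; the year in the example is made up, since the text gives only "March 15". Code can check that an item is assigned and dated, but whether it is specific still takes a human reader.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional

COMPLETION_THRESHOLD = 0.70  # the 70% figure from the text


@dataclass
class ActionItem:
    description: str
    owner: Optional[str]
    due: Optional[date]
    done: bool = False

    def is_trackable(self) -> bool:
        """Assigned and dated; specificity cannot be checked mechanically."""
        return bool(self.description.strip()) and self.owner is not None and self.due is not None


def completion_rate(items: List[ActionItem]) -> float:
    """Fraction of action items marked done."""
    if not items:
        raise ValueError("no action items to measure")
    return sum(1 for item in items if item.done) / len(items)


items = [
    ActionItem("Add alerting for payment processing latency p99 exceeding 500ms",
               owner="Sarah", due=date(2025, 3, 15), done=True),  # year is illustrative
    ActionItem("Improve monitoring", owner=None, due=None),  # not trackable: no owner, no date
]
process_is_working = completion_rate(items) >= COMPLETION_THRESHOLD
```

Rejecting items that fail `is_trackable` at the end of the review meeting is one way to keep vague entries out of the tracker in the first place.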

Building an Early-Reporting Culture

The single most impactful thing you can do for incident response is make it safe to report problems. Engineers who fear blame will hide issues until they become catastrophes. Engineers who feel safe will flag anomalies when they're still small.

Publicly thank people who report incidents. Celebrate catches. Share examples in team meetings of times when early detection prevented a major outage. Etsy's "three-armed sweater" award, given to the engineer behind the most surprising error as part of the company's blameless post-mortem culture, set a cultural standard that many companies have adopted in their own way.

Never punish someone for an honest mistake that caused an incident. Punish concealment. Make this distinction explicit and repeat it often.
