Incident Management Process
Why Process Matters
Without a defined incident management process, every production issue turns into an ad-hoc scramble. Engineers DM each other on Slack. Managers hover nervously. Customer support gets conflicting information. And the resolution takes three times longer than it should because half the effort goes into figuring out who's doing what instead of actually fixing the problem.
A solid incident management process doesn't add red tape. It removes chaos. It makes sure the right people are involved, communication flows in a predictable way, and the organization learns from each incident instead of making the same mistakes again.
Severity Levels
Clear severity definitions are the foundation. Every organization should tailor these to their context, but here's a common starting point (a configuration sketch follows the list):
SEV1, Critical: Complete service outage or data loss hitting all users. Revenue impact is immediate and significant. Response time: drop everything, all hands on deck. Example: payment processing is down, database corruption detected.
SEV2, Major: Significant degradation affecting a large chunk of users. Core functionality is impaired but workarounds might exist. Response time: within 15 minutes during business hours, within 30 minutes after hours. Example: search returning stale results, API latency at 10x normal.
SEV3, Minor: Limited impact on a subset of users or non-critical functionality. Response time: within 1 business hour. Example: a dashboard widget isn't loading, email notifications delayed by 30 minutes.
SEV4, Low: Cosmetic issues or small bugs with no real user impact. Response time: next business day. Example: a typo in an error message, a tooltip that renders oddly in one browser.
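These definitions are easiest to enforce when they live in code rather than on a wiki page. Below is a minimal Python sketch that encodes the severity table as configuration; the names (Severity, SEVERITIES, response_minutes) are illustrative assumptions, not tied to any particular paging tool.

    # Illustrative sketch only: the severity table above as data that
    # alerting/paging tooling could read. All names are assumptions.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Severity:
        label: str
        description: str
        response_minutes: int | None  # None = drop everything, page immediately

    SEVERITIES = {
        "SEV1": Severity("Critical", "Complete outage or data loss, all users", None),
        "SEV2": Severity("Major", "Significant degradation, large user impact", 15),  # 30 after hours
        "SEV3": Severity("Minor", "Limited impact, subset of users", 60),
        "SEV4": Severity("Low", "Cosmetic, no real user impact", 24 * 60),  # next business day
    }

Keeping the table in one place means paging rules, dashboards, and runbooks can't drift out of sync with one another.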
The Incident Commander Role
The Incident Commander (IC) is the single point of coordination during an incident. Their job is NOT to debug the problem. It's to make sure the response stays organized and effective. The IC:
- Declares the incident and assigns a severity level
- Pulls together the response team based on which systems are affected
- Sets the communication cadence: stakeholder updates every 30 minutes for SEV1, every hour for SEV2 (see the sketch after this list)
- Tracks action items to make sure nothing slips through
- Decides when to escalate the severity up or down
- Closes the incident and schedules the post-incident review
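To make the cadence concrete, here is a small sketch of how an IC (or a bot assisting one) could track when the next stakeholder update is owed. The Incident class and CADENCE table are hypothetical names for illustration, not a real incident-tooling API.

    # Sketch of cadence tracking for the IC role; real tooling would
    # post reminders to chat or a status page rather than just compute times.
    from dataclasses import dataclass, field
    from datetime import datetime, timedelta

    CADENCE = {"SEV1": timedelta(minutes=30), "SEV2": timedelta(hours=1)}

    @dataclass
    class Incident:
        severity: str
        declared_at: datetime
        last_update: datetime | None = None
        action_items: list[str] = field(default_factory=list)

        def next_update_due(self) -> datetime | None:
            """When the next stakeholder update is owed, per the cadence table."""
            interval = CADENCE.get(self.severity)
            if interval is None:
                return None  # SEV3/SEV4: no fixed stakeholder cadence in this sketch
            return (self.last_update or self.declared_at) + interval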
At growth-stage companies, the IC role should rotate among senior engineers and engineering managers. Anyone who might fill the IC role should get training and run through practice drills at least once a quarter.
Post-Incident Reviews
The post-incident review (PIR), sometimes called a postmortem or retrospective, is where the real organizational learning happens. A good PIR is blameless, thorough, and focused on action.
Template:
- Timeline: a minute-by-minute reconstruction of what happened, when it was detected, and how it got resolved
- Impact: users affected, revenue impact, SLA breach assessment
- Contributing factors: what conditions allowed this incident to happen? (Never just "human error")
- What went well: which parts of the response worked? Which detection or mitigation mechanisms fired correctly?
- What could improve: specific, actionable changes to prevent recurrence
- Action items: each one has an owner, a due date, and gets tracked to completion (see the sketch after this list)
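Action items are the part of a PIR most likely to evaporate, so it helps to represent them as structured records rather than bullets in a doc. A minimal sketch, assuming hypothetical ActionItem and overdue names:

    # Sketch: PIR action items with a required owner and due date, plus a
    # helper for a recurring "what's overdue" review. Names are assumptions.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ActionItem:
        description: str
        owner: str      # a named person, never "the team"
        due: date
        done: bool = False

    def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
        """Open action items past their due date, for follow-up in review."""
        return [item for item in items if not item.done and item.due < today]

    # Illustrative usage with made-up data:
    items = [ActionItem("Add alert on replication lag", "priya", date(2024, 7, 1))]
    print(overdue(items, today=date.today()))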
PIRs should be shared broadly, ideally with the entire engineering organization. Being transparent about failures builds trust and helps other teams learn from incidents they weren't part of.
Key Points
- Severity levels (SEV1-SEV4) give everyone a shared vocabulary and set expectations for response urgency. Without clear definitions, every incident either feels like an emergency or gets shrugged off
- The Incident Commander role keeps coordination separate from debugging. One person drives the process while engineers focus on the technical investigation
- Communication templates (status pages, stakeholder updates, customer notifications) stop the ad-hoc messaging that makes high-stress situations even more confusing (see the sketch after this list)
- Post-incident reviews should focus on systemic fixes, not blame. The point is to make the system more resilient, not to find someone to pin it on
- Running regular incident drills and game days builds the kind of muscle memory that turns real incidents into practiced responses instead of panicked scrambles
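The communication-template point above can be as simple as a fixed format string that every update fills in. A sketch, with field names as assumptions rather than any standard:

    # Sketch of a stakeholder-update template; fields and format are
    # illustrative. The point is that every update looks the same.
    STATUS_TEMPLATE = (
        "[{severity}] {title}\n"
        "Status: {status}\n"
        "Impact: {impact}\n"
        "Next update by: {next_update}\n"
    )

    def format_update(severity: str, title: str, status: str,
                      impact: str, next_update: str) -> str:
        """Fill the template so no update omits a field under pressure."""
        return STATUS_TEMPLATE.format(
            severity=severity, title=title, status=status,
            impact=impact, next_update=next_update,
        )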
Common Mistakes
- Skipping post-incident reviews for 'minor' incidents. SEV3 and SEV4 issues often reveal the kind of systemic weaknesses that eventually cause a SEV1
- Having the on-call engineer also serve as incident commander. Debugging and coordinating compete for the same mental bandwidth, and both suffer
- Writing post-incident reviews that chalk things up to 'human error.' Humans will always make mistakes. The system should be built to handle that
- Never practicing incident response. Teams that only run the process during real emergencies end up slow, confused, and error-prone when it counts