Incident Management Process
Why Process Matters
Without a defined incident management process, every production issue turns into an ad-hoc scramble. Engineers DM each other on Slack. Managers hover nervously. Customer support gets conflicting information. And the resolution takes three times longer than it should because half the effort goes into figuring out who's doing what instead of actually fixing the problem.
A solid incident management process doesn't add red tape. It removes chaos. It makes sure the right people are involved, communication flows in a predictable way, and the organization learns from each incident instead of making the same mistakes again.
Severity Levels
Clear severity definitions are the foundation. Every organization should tailor these to their context, but here's a common starting point (a configuration sketch follows the list):
SEV1, Critical: Complete service outage or data loss hitting all users. Revenue impact is immediate and significant. Response time: drop everything, all hands on deck. Example: payment processing is down, database corruption detected.
SEV2, Major: Significant degradation affecting a large chunk of users. Core functionality is impaired but workarounds might exist. Response time: within 15 minutes during business hours, within 30 minutes after hours. Example: search returning stale results, API latency at 10x normal.
SEV3, Minor: Limited impact on a subset of users or non-critical functionality. Response time: within 1 business hour. Example: a dashboard widget isn't loading, email notifications delayed by 30 minutes.
SEV4, Low: Cosmetic issues or small bugs with no real user impact. Response time: next business day. Example: a typo in an error message, a tooltip that renders oddly in one browser.
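These definitions are easiest to enforce when they live in code rather than on a wiki page. Below is a minimal Python sketch that encodes the severity table as configuration; the names (Severity, SEVERITIES, response_minutes) are illustrative assumptions, not tied to any particular paging tool.

    # Illustrative sketch only: the severity table above as data that
    # alerting/paging tooling could read. All names are assumptions.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Severity:
        label: str
        description: str
        response_minutes: int | None  # None = drop everything, page immediately

    SEVERITIES = {
        "SEV1": Severity("Critical", "Complete outage or data loss, all users", None),
        "SEV2": Severity("Major", "Significant degradation, large user impact", 15),  # 30 after hours
        "SEV3": Severity("Minor", "Limited impact, subset of users", 60),
        "SEV4": Severity("Low", "Cosmetic, no real user impact", 24 * 60),  # next business day
    }

Keeping the table in one place means paging rules, dashboards, and runbooks can't drift out of sync with one another.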
The Incident Commander Role
The Incident Commander (IC) is the single point of coordination during an incident. Their job is NOT to debug the problem. It's to make sure the response stays organized and effective. The IC:
- Declares the incident and assigns a severity level
- Pulls together the response team based on which systems are affected
- Sets the communication cadence: stakeholder updates every 30 minutes for SEV1, every hour for SEV2 (see the sketch after this list)
- Tracks action items to make sure nothing slips through
- Decides when to escalate the severity up or down
- Closes the incident and schedules the post-incident review
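To make the cadence concrete, here is a small sketch of how an IC (or a bot assisting one) could track when the next stakeholder update is owed. The Incident class and CADENCE table are hypothetical names for illustration, not a real incident-tooling API.

    # Sketch of cadence tracking for the IC role; real tooling would
    # post reminders to chat or a status page rather than just compute times.
    from dataclasses import dataclass, field
    from datetime import datetime, timedelta

    CADENCE = {"SEV1": timedelta(minutes=30), "SEV2": timedelta(hours=1)}

    @dataclass
    class Incident:
        severity: str
        declared_at: datetime
        last_update: datetime | None = None
        action_items: list[str] = field(default_factory=list)

        def next_update_due(self) -> datetime | None:
            """When the next stakeholder update is owed, per the cadence table."""
            interval = CADENCE.get(self.severity)
            if interval is None:
                return None  # SEV3/SEV4: no fixed stakeholder cadence in this sketch
            return (self.last_update or self.declared_at) + interval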
At growth-stage companies, the IC role should rotate among senior engineers and engineering managers. Anyone who might fill the IC role should get training and run through practice drills at least once a quarter.
Post-Incident Reviews
The post-incident review (PIR), sometimes called a postmortem or retrospective, is where the real organizational learning happens. A good PIR is blameless, thorough, and focused on action.
Template:
- Timeline: a minute-by-minute reconstruction of what happened, when it was detected, and how it got resolved
- Impact: users affected, revenue impact, SLA breach assessment
- Contributing factors: what conditions allowed this incident to happen? (Never just "human error")
- What went well: which parts of the response worked? Which detection or mitigation mechanisms fired correctly?
- What could improve: specific, actionable changes to prevent recurrence
- Action items: each one has an owner, a due date, and gets tracked to completion (see the sketch after this list)
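Action items are the part of a PIR most likely to evaporate, so it helps to represent them as structured records rather than bullets in a doc. A minimal sketch, assuming hypothetical ActionItem and overdue names:

    # Sketch: PIR action items with a required owner and due date, plus a
    # helper for a recurring "what's overdue" review. Names are assumptions.
    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ActionItem:
        description: str
        owner: str      # a named person, never "the team"
        due: date
        done: bool = False

    def overdue(items: list[ActionItem], today: date) -> list[ActionItem]:
        """Open action items past their due date, for follow-up in review."""
        return [item for item in items if not item.done and item.due < today]

    # Illustrative usage with made-up data:
    items = [ActionItem("Add alert on replication lag", "priya", date(2024, 7, 1))]
    print(overdue(items, today=date.today()))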
PIRs should be shared broadly, ideally with the entire engineering organization. Being transparent about failures builds trust and helps other teams learn from incidents they weren't part of.
Key Points
- Severity levels (SEV1-SEV4) give everyone a shared vocabulary and set expectations for response urgency. Without clear definitions, every incident either feels like an emergency or gets shrugged off
- The Incident Commander role keeps coordination separate from debugging. One person drives the process while engineers focus on the technical investigation
- Communication templates (status pages, stakeholder updates, customer notifications) stop the ad-hoc messaging that makes high-stress situations even more confusing (see the sketch after this list)
- Post-incident reviews should focus on systemic fixes, not blame. The point is to make the system more resilient, not to find someone to pin it on
- Running regular incident drills and game days builds the kind of muscle memory that turns real incidents into practiced responses instead of panicked scrambles
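The communication-template point above can be as simple as a fixed format string that every update fills in. A sketch, with field names as assumptions rather than any standard:

    # Sketch of a stakeholder-update template; fields and format are
    # illustrative. The point is that every update looks the same.
    STATUS_TEMPLATE = (
        "[{severity}] {title}\n"
        "Status: {status}\n"
        "Impact: {impact}\n"
        "Next update by: {next_update}\n"
    )

    def format_update(severity: str, title: str, status: str,
                      impact: str, next_update: str) -> str:
        """Fill the template so no update omits a field under pressure."""
        return STATUS_TEMPLATE.format(
            severity=severity, title=title, status=status,
            impact=impact, next_update=next_update,
        )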
Common Mistakes
- Skipping post-incident reviews for 'minor' incidents. SEV3 and SEV4 issues often reveal the kind of systemic weaknesses that eventually cause a SEV1
- Having the on-call engineer also serve as incident commander. Debugging and coordinating compete for the same mental bandwidth, and both suffer
- Writing post-incident reviews that chalk things up to 'human error.' Humans will always make mistakes. The system should be built to handle that
- Never practicing incident response. Teams that only run the process during real emergencies end up slow, confused, and error-prone when it counts