Incident Metrics: MTTR & MTTD
Breaking Down the MTT* Family
Everyone says "MTTR," but that one acronym hides three distinct phases, each with different improvement levers.
Mean Time to Detect (MTTD) is the gap between when a problem starts and when someone (or something) notices. This is purely a function of your monitoring and alerting quality. If your MTTD is 30 minutes, you're running blind for half an hour every time something breaks. Elite teams get MTTD under 5 minutes with well-tuned alerts on SLI metrics.
Mean Time to Respond (MTTR-respond) is the gap between alert firing and an engineer starting to investigate. This is about on-call processes, escalation paths, and alert routing. If pagers go to the wrong team or alerts are so noisy that people ignore them, response time suffers.
Mean Time to Resolve (MTTR-resolve) is the clock from investigation start to service recovery. This depends on runbooks, debugging tools, rollback capabilities, and institutional knowledge. Teams with good runbooks and one-click rollback can resolve in minutes. Teams debugging from scratch every time measure in hours.
What to Measure and How
Instrument your incident management tool (PagerDuty, Opsgenie, or whatever you use) to capture timestamps at each phase transition. Most tools already track alert time, acknowledgment time, and resolution time. The missing piece is usually detection delay: the gap between when monitoring data shows an anomaly and when an alert actually fires.
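As a sketch, assuming your tool's export includes a timestamp for each phase transition (the field names below are illustrative, not any particular vendor's schema), computing the three phase durations is straightforward:

```python
from datetime import datetime

# Illustrative incident record; map these keys to whatever your
# incident tool actually exports.
incident = {
    "anomaly_start": datetime(2024, 3, 1, 14, 0),   # first anomalous data point
    "alert_fired":   datetime(2024, 3, 1, 14, 12),  # monitoring alert triggered
    "acknowledged":  datetime(2024, 3, 1, 14, 18),  # engineer began investigating
    "resolved":      datetime(2024, 3, 1, 15, 5),   # service recovered
}

def minutes(start, end):
    return (end - start).total_seconds() / 60

detect  = minutes(incident["anomaly_start"], incident["alert_fired"])  # feeds MTTD
respond = minutes(incident["alert_fired"], incident["acknowledged"])   # feeds MTT-respond
resolve = minutes(incident["acknowledged"], incident["resolved"])      # feeds MTT-resolve
print(f"detect={detect:.0f}m respond={respond:.0f}m resolve={resolve:.0f}m")
```

The `anomaly_start` timestamp is the one most tools won't give you for free; backfilling it from your monitoring data during the retrospective is usually the practical option.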
Track incidents by severity and calculate MTT* metrics per severity tier. Your Sev-1 MTTR matters for SLA compliance and executive reporting. But Sev-3 MTTR matters for team health and toil reduction. A team that has two Sev-1s per quarter but fifteen Sev-3s per week has a different problem than a team with frequent Sev-1s.
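Continuing the sketch above, per-severity aggregation is a simple grouping step (field names are again assumptions):

```python
from collections import defaultdict
from statistics import mean

def mtt_by_severity(incidents):
    """Group incident records by severity and average each phase duration.

    Each record is assumed to carry a "severity" key plus precomputed
    phase durations in minutes (detect_min, respond_min, resolve_min).
    """
    buckets = defaultdict(list)
    for inc in incidents:
        buckets[inc["severity"]].append(inc)
    return {
        sev: {
            "count": len(incs),
            "mttd": mean(i["detect_min"] for i in incs),
            "mtt_respond": mean(i["respond_min"] for i in incs),
            "mtt_resolve": mean(i["resolve_min"] for i in incs),
        }
        for sev, incs in buckets.items()
    }
```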
Customer-minutes affected is the single best impact metric. Multiply the number of affected users by the duration of impact. An incident that affects 100 users for 60 minutes (6,000 customer-minutes) is worse than one that affects 10,000 users for 30 seconds (5,000 customer-minutes), but both are quantified in the same unit.
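The calculation itself is trivial; the value comes from applying it consistently. Encoding both examples from the paragraph above:

```python
def customer_minutes(affected_users: int, duration_minutes: float) -> float:
    # Impact in one comparable unit: users affected x minutes of impact.
    return affected_users * duration_minutes

customer_minutes(100, 60)      # 6,000: 100 users for an hour
customer_minutes(10_000, 0.5)  # 5,000: 10,000 users for 30 seconds
```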
Improving Detection
MTTD is where the highest-leverage improvements live. Start by auditing your alerts. What percentage of real incidents was caught by automated monitoring versus reported by customers or support? If customers find your problems before your monitoring does, that's the gap to close first.
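A minimal sketch of that audit, assuming each incident record carries a `detected_by` field (our naming, not a standard one):

```python
def monitoring_catch_rate(incidents) -> float:
    """Fraction of incidents first detected by automated monitoring."""
    if not incidents:
        return 0.0
    caught = sum(1 for i in incidents if i["detected_by"] == "monitoring")
    return caught / len(incidents)
```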
Build alerts on SLI metrics (error rate, latency percentiles, throughput) rather than infrastructure metrics (CPU, memory). A user does not care that your CPU is at 80%. They care that their API calls are timing out. SLI-based alerting catches problems that matter and fires faster because it's measuring the actual user experience.
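For illustration, a minimal SLI-style check on error rate; the threshold and the input counts are assumptions, not a recommendation for your service:

```python
ERROR_RATE_THRESHOLD = 0.01  # alert if >1% of requests fail over the window

def should_alert(total_requests: int, failed_requests: int) -> bool:
    """Fire on the user-facing failure rate, not on host CPU or memory."""
    if total_requests == 0:
        return False  # no traffic means nothing user-facing to alert on
    return failed_requests / total_requests > ERROR_RATE_THRESHOLD
```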
Tracking Trends
Plot your MTT* metrics on a rolling 90-day basis. Month-to-month variance is normal; look for quarterly trends. After an incident retrospective produces action items, you should see the impact in your metrics within one or two quarters. If you're doing retros but your metrics aren't improving, your action items either aren't getting prioritized or aren't addressing root causes.
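One way to produce that rolling view, sketched with pandas and assumed column names (`resolved_at`, `resolve_minutes`) in a per-incident export:

```python
import pandas as pd

df = pd.read_csv("incidents.csv", parse_dates=["resolved_at"])
trend = (
    df.set_index("resolved_at")
      .sort_index()["resolve_minutes"]
      .rolling("90D")  # time-based 90-day window, not a 90-row window
      .mean()
)
print(trend.tail())  # or plot it; quarterly direction is what matters
```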
Key Points
- MTTD (Mean Time to Detect) is the most actionable incident metric because it's directly tied to monitoring quality
- MTTR breaks down into detection, response, and resolution, and each phase needs separate measurement
- Customer-minutes affected is a better impact measure than raw incident count
- Incident frequency by severity follows a power law: track the distribution, not just the total
- Improvement trends matter more than absolute numbers; compare quarter over quarter
Common Mistakes
- Lumping detection, response, and resolution into one MTTR number, which hides where the bottleneck is
- Measuring only Sev-1 incidents and missing the pattern of recurring Sev-3s that add up to massive toil
- Setting MTTR targets without investing in the tooling and runbooks needed to actually hit them