Incident Metrics: MTTR & MTTD
Breaking Down the MTT* Family
Everyone says "MTTR," but that one acronym hides three distinct phases, each with different improvement levers.
Mean Time to Detect (MTTD) is the gap between when a problem starts and when someone (or something) notices. This is purely a function of your monitoring and alerting quality. If your MTTD is 30 minutes, you're running blind for half an hour every time something breaks. Elite teams get MTTD under 5 minutes with well-tuned alerts on SLI metrics.
Mean Time to Respond (MTTR-respond) is the gap between alert firing and an engineer starting to investigate. This is about on-call processes, escalation paths, and alert routing. If pagers go to the wrong team or alerts are so noisy that people ignore them, response time suffers.
Mean Time to Resolve (MTTR-resolve) is the clock from investigation start to service recovery. This depends on runbooks, debugging tools, rollback capabilities, and institutional knowledge. Teams with good runbooks and one-click rollback can resolve in minutes. Teams debugging from scratch every time measure in hours.
What to Measure and How
Instrument your incident management tool (PagerDuty, Opsgenie, or whatever you use) to capture timestamps at each phase transition. Most tools already track alert time, acknowledgment time, and resolution time. The missing piece is usually detection delay: the gap between when monitoring data shows an anomaly and when an alert actually fires.
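As a sketch, assuming your tool's export includes a timestamp for each phase transition (the field names below are illustrative, not any particular vendor's schema), computing the three phase durations is straightforward:

```python
from datetime import datetime

# Illustrative incident record; map these keys to whatever your
# incident tool actually exports.
incident = {
    "anomaly_start": datetime(2024, 3, 1, 14, 0),   # first anomalous data point
    "alert_fired":   datetime(2024, 3, 1, 14, 12),  # monitoring alert triggered
    "acknowledged":  datetime(2024, 3, 1, 14, 18),  # engineer began investigating
    "resolved":      datetime(2024, 3, 1, 15, 5),   # service recovered
}

def minutes(start, end):
    return (end - start).total_seconds() / 60

detect  = minutes(incident["anomaly_start"], incident["alert_fired"])  # feeds MTTD
respond = minutes(incident["alert_fired"], incident["acknowledged"])   # feeds MTT-respond
resolve = minutes(incident["acknowledged"], incident["resolved"])      # feeds MTT-resolve
print(f"detect={detect:.0f}m respond={respond:.0f}m resolve={resolve:.0f}m")
```

The `anomaly_start` timestamp is the one most tools won't give you for free; backfilling it from your monitoring data during the retrospective is usually the practical option.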
Track incidents by severity and calculate MTT* metrics per severity tier. Your Sev-1 MTTR matters for SLA compliance and executive reporting. But Sev-3 MTTR matters for team health and toil reduction. A team that has two Sev-1s per quarter but fifteen Sev-3s per week has a different problem than a team with frequent Sev-1s.
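Continuing the sketch above, per-severity aggregation is a simple grouping step (field names are again assumptions):

```python
from collections import defaultdict
from statistics import mean

def mtt_by_severity(incidents):
    """Group incident records by severity and average each phase duration.

    Each record is assumed to carry a "severity" key plus precomputed
    phase durations in minutes (detect_min, respond_min, resolve_min).
    """
    buckets = defaultdict(list)
    for inc in incidents:
        buckets[inc["severity"]].append(inc)
    return {
        sev: {
            "count": len(incs),
            "mttd": mean(i["detect_min"] for i in incs),
            "mtt_respond": mean(i["respond_min"] for i in incs),
            "mtt_resolve": mean(i["resolve_min"] for i in incs),
        }
        for sev, incs in buckets.items()
    }
```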
Customer-minutes affected is the single best impact metric. Multiply the number of affected users by the duration of impact. An incident that affects 100 users for 60 minutes (6,000 customer-minutes) is worse than one that affects 10,000 users for 30 seconds (5,000 customer-minutes), but both are quantified in the same unit.
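The calculation itself is trivial; the value comes from applying it consistently. Encoding both examples from the paragraph above:

```python
def customer_minutes(affected_users: int, duration_minutes: float) -> float:
    # Impact in one comparable unit: users affected x minutes of impact.
    return affected_users * duration_minutes

customer_minutes(100, 60)      # 6,000: 100 users for an hour
customer_minutes(10_000, 0.5)  # 5,000: 10,000 users for 30 seconds
```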
Improving Detection
MTTD is where the highest-leverage improvements live. Start by auditing your alerts. What percentage of real incidents was caught by automated monitoring versus reported by customers or support? If customers find your problems before your monitoring does, that's the gap to close first.
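A minimal sketch of that audit, assuming each incident record carries a `detected_by` field (our naming, not a standard one):

```python
def monitoring_catch_rate(incidents) -> float:
    """Fraction of incidents first detected by automated monitoring."""
    if not incidents:
        return 0.0
    caught = sum(1 for i in incidents if i["detected_by"] == "monitoring")
    return caught / len(incidents)
```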
Build alerts on SLI metrics (error rate, latency percentiles, throughput) rather than infrastructure metrics (CPU, memory). A user does not care that your CPU is at 80%. They care that their API calls are timing out. SLI-based alerting catches problems that matter and fires faster because it's measuring the actual user experience.
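For illustration, a minimal SLI-style check on error rate; the threshold and the input counts are assumptions, not a recommendation for your service:

```python
ERROR_RATE_THRESHOLD = 0.01  # alert if >1% of requests fail over the window

def should_alert(total_requests: int, failed_requests: int) -> bool:
    """Fire on the user-facing failure rate, not on host CPU or memory."""
    if total_requests == 0:
        return False  # no traffic means nothing user-facing to alert on
    return failed_requests / total_requests > ERROR_RATE_THRESHOLD
```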
Tracking Trends
Plot your MTT* metrics on a rolling 90-day basis. Month-to-month variance is normal; look for quarterly trends. After an incident retrospective produces action items, you should see the impact in your metrics within one or two quarters. If you're doing retros but your metrics aren't improving, your action items either aren't getting prioritized or aren't addressing root causes.
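One way to produce that rolling view, sketched with pandas and assumed column names (`resolved_at`, `resolve_minutes`) in a per-incident export:

```python
import pandas as pd

df = pd.read_csv("incidents.csv", parse_dates=["resolved_at"])
trend = (
    df.set_index("resolved_at")
      .sort_index()["resolve_minutes"]
      .rolling("90D")  # time-based 90-day window, not a 90-row window
      .mean()
)
print(trend.tail())  # or plot it; quarterly direction is what matters
```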
Key Points
- MTTD (Mean Time to Detect) is the most actionable incident metric because it's directly tied to monitoring quality
- MTTR breaks down into detection, response, and resolution, and each phase needs separate measurement
- Customer-minutes affected is a better impact measure than raw incident count
- Incident frequency by severity follows a power law: track the distribution, not just the total
- Improvement trends matter more than absolute numbers; compare quarter over quarter
Common Mistakes
- Lumping detection, response, and resolution into one MTTR number, which hides where the bottleneck is
- Measuring only Sev-1 incidents and missing the pattern of recurring Sev-3s that add up to massive toil
- Setting MTTR targets without investing in the tooling and runbooks needed to actually hit them