AI Cost & Unit Economics — Cost & FinOps
Difficulty: Advanced. Audience: VP of Engineering.
Key Points for AI Cost & Unit Economics
- AI unit economics differ fundamentally from traditional SaaS. Inference costs scale linearly with usage, while most infrastructure costs grow sublinearly thanks to economies of scale.
- Track cost per inference, cost per AI-enabled feature, and cost per user. These three metrics give you the full picture from infrastructure to business.
- Model selection is an economic decision as much as a technical one. A fine-tuned smaller model at 1/15th the cost often outperforms a frontier model for specific tasks.
- Token optimization is the AI equivalent of database query optimization. Reducing prompt length, caching common queries, and batching requests can cut costs 60-80%.
- Build dashboards that connect AI spend directly to business outcomes. 'We spent $45K on inference this month and it resolved 12,000 support tickets' is a defensible number.
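To make the unit-economics math concrete, here is a minimal sketch in Python. All prices, token counts, and volumes are hypothetical placeholders, not real vendor rates:

```python
# Hedged sketch: unit economics for an AI support feature.
# All prices, token counts, and volumes are hypothetical.

def inference_cost(prompt_tokens: int, completion_tokens: int,
                   price_in_per_1k: float, price_out_per_1k: float) -> float:
    """Cost of one request given per-1K-token prices."""
    return (prompt_tokens / 1000) * price_in_per_1k \
         + (completion_tokens / 1000) * price_out_per_1k

# Example request: 1,200 prompt tokens, 300 completion tokens
per_request = inference_cost(1200, 300, price_in_per_1k=0.01, price_out_per_1k=0.03)

# Roll up to the business metric: cost per resolved ticket,
# assuming ~3 model calls per ticket on average.
cost_per_ticket = per_request * 3
monthly_spend = cost_per_ticket * 12_000  # 12,000 resolved tickets/month

print(round(per_request, 4), round(cost_per_ticket, 4), round(monthly_spend, 2))
```

The same three numbers map directly onto cost per inference, cost per feature outcome, and total feature spend.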
Common Mistakes with AI Cost & Unit Economics
- Not tracking AI costs at the feature level. A single monthly AWS bill tells you nothing about which AI features are worth keeping.
- Ignoring the variable cost structure when setting pricing. Traditional SaaS has near-zero marginal cost per user, but AI features have real per-request costs.
- Optimizing only for accuracy without considering cost. A 2% accuracy improvement that triples your inference bill is rarely worth it.
- Failing to forecast how AI costs will grow as your user base grows. Linear cost scaling can destroy margins at scale if you don't plan for it.
Related to AI Cost & Unit Economics
Cloud Cost Optimization, FinOps Practices
AI System Quality & Reliability Metrics — Reliability Metrics
Difficulty: Advanced. Audience: Engineering Manager.
Key Points for AI System Quality & Reliability Metrics
- Traditional reliability metrics like uptime, latency, and error rate are necessary but insufficient for AI systems. Your service can be 100% available while producing wrong answers.
- Define SLOs for AI quality: accuracy thresholds, hallucination rates, and confidence calibration. These deserve the same rigor as your infrastructure SLOs.
- Data drift monitoring is the leading indicator of quality degradation. By the time accuracy drops, the underlying data distribution has already shifted.
- Human evaluation sampling is essential and should happen weekly. Automated metrics catch known failure modes, but humans catch the ones you haven't thought of yet.
- AI system reliability is the product of three factors: infrastructure reliability, data quality, and model quality. A weakness in any one of them brings down the whole system.
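One common way to quantify the data drift mentioned above is the Population Stability Index (PSI). A minimal sketch, with illustrative bins and distributions:

```python
# Hedged sketch: Population Stability Index (PSI) as a drift signal.
# Bins and distributions are illustrative; real systems usually bin
# on quantiles of the training data.
import math

def psi(expected: list[float], actual: list[float]) -> float:
    """PSI over pre-binned distributions (each list sums to 1.0).
    Rule of thumb: <0.1 stable, 0.1-0.25 moderate drift, >0.25 major drift."""
    eps = 1e-6  # avoid log(0) on empty bins
    return sum((a - e) * math.log((a + eps) / (e + eps))
               for e, a in zip(expected, actual))

baseline = [0.25, 0.25, 0.25, 0.25]  # feature distribution at training time
today    = [0.10, 0.20, 0.30, 0.40]  # feature distribution in production

drift = psi(baseline, today)
print(round(drift, 4))
```

A PSI in the 0.1-0.25 band is the early warning to investigate before accuracy visibly drops.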
Common Mistakes with AI System Quality & Reliability Metrics
- Only monitoring infrastructure metrics while the AI layer quietly produces wrong answers. A 200 OK response that contains a hallucinated answer is worse than a 500 error.
- Using only offline evaluation metrics without monitoring production performance. A model that scores 95% on your test set can score 80% on real traffic.
- Not establishing quality baselines before deploying a new model. Without a baseline, you cannot tell whether a new version is better or worse.
- Setting accuracy targets without understanding the business impact of different error types. A false positive in fraud detection has a very different cost than a false negative.
Related to AI System Quality & Reliability Metrics
SLO, SLA & SLI Budgeting, DORA Metrics Deep Dive
API Performance Metrics — Reliability Metrics
Difficulty: Advanced. Audience: Staff Engineer.
Key Points for API Performance Metrics
- P99 latency matters more than averages because averages hide tail latency that affects real users
- Apdex scores translate raw latency into a 0-1 satisfaction index that non-engineers can understand
- Performance budgets should be allocated across the call chain, not just set at the edge
- SLIs for APIs typically combine latency (P95 < threshold) and error rate (< threshold) into a composite
- Real User Monitoring captures what synthetic checks miss: geographic variance, device differences, network conditions
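The gap between averages and percentiles, and the Apdex calculation, can both be sketched directly. The 500 ms threshold and the latency samples below are illustrative:

```python
# Hedged sketch: P99 vs mean, and Apdex, from raw latency samples.
# The 500 ms "tolerable" threshold is illustrative, not a standard value.
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (no interpolation)."""
    ordered = sorted(samples)
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

def apdex(samples: list[float], t: float) -> float:
    """Apdex: satisfied (<= t) count 1, tolerating (<= 4t) count 0.5."""
    satisfied = sum(1 for s in samples if s <= t)
    tolerating = sum(1 for s in samples if t < s <= 4 * t)
    return (satisfied + tolerating / 2) / len(samples)

latencies_ms = [120] * 90 + [450] * 8 + [3000] * 2  # 100 requests
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(mean_ms)                           # the mean hides the tail
print(percentile(latencies_ms, 99))      # the P99 exposes it
print(apdex(latencies_ms, 500))
```

Here the mean looks healthy at 204 ms while the P99 sits at 3 seconds, which is exactly the point of the first bullet.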
Common Mistakes with API Performance Metrics
- Reporting average latency instead of percentiles, which hides the experience of your worst-affected users
- Setting latency SLOs without measuring the actual user-facing call chain end to end
- Monitoring only from a single region and missing latency problems that affect users in other geographies
- Ignoring throughput changes when analyzing latency, since latency often degrades under load
Related to API Performance Metrics
SLO, SLA & SLI Budgeting, DORA Metrics Deep Dive
Capacity Planning Metrics — Cost & FinOps
Difficulty: Advanced. Audience: SRE.
Key Points for Capacity Planning Metrics
- Resource utilization baselines (CPU, memory, storage) measured at P95 over 30 days give you the true usage picture
- Headroom calculation: provision for P95 usage + 30% buffer to handle traffic spikes without degradation
- Cost per transaction normalizes infrastructure spend against actual business value delivered
- Capacity cliffs happen when a single resource hits its limit; identify the binding constraint before it breaks
- Rightsizing recommendations based on utilization data typically save 20-40% on compute spend
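The headroom rule above reduces to a few lines of arithmetic. The utilization figures below are illustrative:

```python
# Hedged sketch: P95-based headroom calculation.
# The 30% buffer and the core counts are illustrative assumptions.

def required_capacity(p95_usage: float, buffer: float = 0.30) -> float:
    """Provision for P95 usage plus a spike buffer."""
    return p95_usage * (1 + buffer)

p95_cpu_cores = 48.0                       # measured at P95 over 30 days
target = required_capacity(p95_cpu_cores)  # ~62.4 cores to provision
provisioned = 64.0

headroom = (provisioned - p95_cpu_cores) / provisioned
print(round(target, 1), round(headroom, 3))
```

If `provisioned` drops below `target`, you are inside the buffer and a traffic spike can push you off the capacity cliff.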
Common Mistakes with Capacity Planning Metrics
- Planning capacity based on average utilization instead of peak utilization, which causes outages during traffic spikes
- Forecasting growth linearly when your traffic pattern is actually seasonal or event-driven
- Over-provisioning everything by 3x 'just in case' without tracking actual utilization to validate the buffer
- Treating capacity planning as a quarterly exercise instead of continuous monitoring with automated alerts
Related to Capacity Planning Metrics
Cloud Cost Optimization, FinOps Practices
Cloud Cost Optimization — Cost & FinOps
Difficulty: Intermediate. Audience: VP of Engineering.
Key Points for Cloud Cost Optimization
- Compute typically accounts for 60-70% of cloud spend. Right-sizing instances is the highest-leverage optimization
- Reserved Instances and Savings Plans can reduce compute costs by 30-60% with 1-3 year commitments
- Spot/preemptible instances offer 60-90% discounts for fault-tolerant workloads like batch processing and CI/CD
- Storage lifecycle policies automatically move infrequently accessed data to cheaper tiers, saving 40-80%
- Cost allocation tags are foundational. You cannot optimize what you cannot attribute to a team or service
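A naive rightsizing estimate can be sketched from P95 utilization alone. The 60% target utilization and the costs below are assumptions for illustration, not provider guidance:

```python
# Hedged sketch: estimating rightsizing savings from utilization.
# The 60% target and $800/month cost are illustrative assumptions.

def rightsizing_savings(monthly_cost: float, p95_util: float,
                        target_util: float = 0.60) -> float:
    """Estimated monthly savings if the instance were resized so the same
    absolute load lands at target_util. Zero if already well utilized."""
    if p95_util >= target_util:
        return 0.0
    ideal_cost = monthly_cost * (p95_util / target_util)
    return monthly_cost - ideal_cost

# An instance running at 15% P95 CPU, costing $800/month
savings = rightsizing_savings(800.0, p95_util=0.15)
print(round(savings, 2))
```

Run this across a tagged inventory and the 20-40% savings figure from the key points becomes a concrete, per-service number.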
Common Mistakes with Cloud Cost Optimization
- Over-provisioning resources out of fear. Most production instances run at 10-20% CPU utilization
- Buying Reserved Instances before right-sizing, locking in waste for 1-3 years
- Ignoring data transfer costs, which can silently become 10-15% of your total bill
- Treating cost optimization as a one-time project instead of an ongoing practice
Related to Cloud Cost Optimization
FinOps Practices, SLO, SLA & SLI Budgeting
Code Review Metrics — Delivery Metrics
Difficulty: Intermediate. Audience: Engineering Manager.
Key Points for Code Review Metrics
- Time-to-first-review is the single highest-leverage metric for unblocking developer flow
- Code review research (notably SmartBear's study of reviews at Cisco) shows PRs under 400 lines get reviewed faster and have fewer defects
- Reviewer load balancing prevents bottlenecks and reduces burnout on senior engineers
- Review cycle time (open to merge) is a strong predictor of overall lead time for changes
- Automated checks should handle style and formatting so human reviewers focus on logic and design
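Time-to-first-review is easy to compute once you have PR event timestamps. The records below are made up; real data would come from your Git host's API:

```python
# Hedged sketch: median time-to-first-review from PR event timestamps.
# The PR records are illustrative, not real data.
from datetime import datetime, timedelta

prs = [
    {"opened": datetime(2024, 5, 1, 9, 0),  "first_review": datetime(2024, 5, 1, 11, 30)},
    {"opened": datetime(2024, 5, 1, 14, 0), "first_review": datetime(2024, 5, 2, 10, 0)},
    {"opened": datetime(2024, 5, 2, 8, 0),  "first_review": datetime(2024, 5, 2, 8, 45)},
]

waits = [pr["first_review"] - pr["opened"] for pr in prs]
median_wait = sorted(waits)[len(waits) // 2]
print(median_wait)
```

Median (not mean) is the right summary here, because a single PR that sat over a weekend would otherwise dominate the number.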
Common Mistakes with Code Review Metrics
- Mandating review speed targets without addressing root causes like PR size or reviewer capacity
- Counting approvals without measuring review depth, which rewards rubber-stamping
- Ignoring reviewer load distribution, letting the same two people review everything
- Treating review time as idle time rather than recognizing it as skilled engineering work
Related to Code Review Metrics
DORA Metrics Deep Dive, Engineering Productivity Measurement
Developer Satisfaction & DevEx — Productivity Measurement
Difficulty: Intermediate. Audience: Engineering Manager.
Key Points for Developer Satisfaction & DevEx
- DX Core 4 measures speed, effectiveness, quality, and impact as perceived by developers themselves
- Quarterly developer experience surveys catch friction points that system metrics miss entirely
- Build times, environment setup, and documentation quality are the top three friction sources in most organizations
- Internal platform NPS below +20 signals serious tooling problems that will eventually affect retention
- DevEx scores correlate strongly with retention: teams with low satisfaction have 2-3x higher attrition
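The platform NPS figure above comes from standard promoter/detractor arithmetic. A minimal sketch with illustrative survey responses:

```python
# Hedged sketch: NPS from 0-10 platform survey responses.
# The responses are illustrative.

def nps(scores: list[int]) -> float:
    """NPS = %promoters (9-10) minus %detractors (0-6), on a -100..+100 scale."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

responses = [9, 10, 8, 7, 6, 9, 3, 10, 7, 5]
score = nps(responses)
print(score)
```

A score of +10, as here, falls below the +20 threshold from the key points and would signal tooling problems worth investigating.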
Common Mistakes with Developer Satisfaction & DevEx
- Running surveys without acting on results, which destroys trust and tanks future response rates
- Measuring developer productivity through output metrics (lines of code, commits) instead of developer-reported friction
- Assuming all developers have the same experience when tenure, team, and tech stack create wildly different realities
Related to Developer Satisfaction & DevEx
SPACE Framework, Engineering Productivity Measurement
DORA Metrics Deep Dive — Delivery Metrics
Difficulty: Intermediate. Audience: Engineering Manager.
Key Points for DORA Metrics Deep Dive
- Four key metrics: deployment frequency, lead time for changes, change failure rate, time to restore
- Elite performers deploy on demand with <1 hour lead time and <5% change failure rate
- DORA metrics measure team capability, not individual performance
- Improving deployment frequency usually improves all four metrics simultaneously
- Measure trends over time, not absolute values. Context matters more than benchmarks
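Two of the four metrics fall out directly from a deploy log. The records below are illustrative:

```python
# Hedged sketch: deployment frequency and change failure rate from a deploy log.
# The deploy records are illustrative; real pipelines would emit these events.
from datetime import date

deploys = [
    {"day": date(2024, 6, 3), "failed": False},
    {"day": date(2024, 6, 4), "failed": True},
    {"day": date(2024, 6, 5), "failed": False},
    {"day": date(2024, 6, 6), "failed": False},
    {"day": date(2024, 6, 7), "failed": False},
]

weeks = 1
deploy_frequency = len(deploys) / weeks  # deploys per week
change_failure_rate = sum(d["failed"] for d in deploys) / len(deploys)
print(deploy_frequency, change_failure_rate)
```

Lead time and time to restore need two more timestamps per record (commit time, incident-resolved time) but follow the same pattern.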
Common Mistakes with DORA Metrics Deep Dive
- Using DORA metrics to compare unrelated teams with different contexts and codebases
- Optimizing for deployment frequency without investing in automated testing
- Measuring at the organization level instead of the team level where it's actionable
- Treating DORA as a goal rather than a diagnostic tool
Related to DORA Metrics Deep Dive
SPACE Framework, Engineering Productivity Measurement
Engineering Productivity Measurement — Productivity Measurement
Difficulty: Expert. Audience: CTO.
Key Points for Engineering Productivity Measurement
- Developer productivity is multidimensional. No single metric captures it, and trying to creates perverse incentives
- Combine system metrics (CI/CD data, code review stats) with developer surveys (satisfaction, friction points) for a complete picture
- Proxy measures like PR cycle time and build reliability correlate with productivity but do not define it
- McKinsey's 2023 developer productivity framework was widely criticized for over-indexing on activity metrics
- The best productivity investment is usually removing friction (faster builds, fewer meetings, better tooling) rather than measuring output
Common Mistakes with Engineering Productivity Measurement
- Measuring lines of code, commit counts, or story points as productivity indicators. All are trivially gameable
- Building elaborate dashboards before understanding what questions you are trying to answer
- Comparing productivity across teams without accounting for codebase age, technical debt, and domain complexity
- Treating developer experience improvements as overhead rather than productivity multipliers
Related to Engineering Productivity Measurement
DORA Metrics Deep Dive, SPACE Framework
FinOps Practices — Cost & FinOps
Difficulty: Advanced. Audience: VP of Engineering.
Key Points for FinOps Practices
- FinOps is a cultural practice, not a tool. It makes cost a first-class engineering concern alongside performance and reliability
- Chargeback/showback models attribute cloud spend to the teams consuming it, creating accountability
- Unit economics (cost per transaction, cost per user) are more actionable than raw spend numbers
- FinOps maturity progresses through Inform, Optimize, and Operate phases. Crawl before you run
- Cross-functional FinOps teams include engineering, finance, and product to balance cost against business value
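A showback rollup is just a group-by over tagged line items, divided by a business denominator. The team names, costs, and transaction counts below are illustrative:

```python
# Hedged sketch: showback by cost-allocation tag, rolled up to unit economics.
# Team names, costs, and transaction counts are illustrative.

line_items = [
    {"team": "checkout", "cost": 12_000.0},
    {"team": "checkout", "cost": 3_000.0},
    {"team": "search",   "cost": 9_000.0},
]
transactions = {"checkout": 1_500_000, "search": 600_000}

showback: dict[str, float] = {}
for item in line_items:
    showback[item["team"]] = showback.get(item["team"], 0.0) + item["cost"]

cost_per_txn = {team: showback[team] / transactions[team] for team in showback}
print(showback, cost_per_txn)
```

"Checkout costs $0.01 per transaction" is the kind of number a cross-functional FinOps team can actually act on, where "$24K total spend" is not.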
Common Mistakes with FinOps Practices
- Making FinOps purely a finance initiative. Without engineering ownership, cost optimization recommendations get ignored
- Implementing chargeback without giving teams the tooling or autonomy to actually reduce their costs
- Focusing only on rate optimization (reservations, discounts) while ignoring usage optimization (right-sizing, waste elimination)
- Setting cost reduction targets without connecting them to business metrics. Saving money by degrading user experience is not optimization
Related to FinOps Practices
Cloud Cost Optimization, Engineering Productivity Measurement
Incident Metrics: MTTR & MTTD — Reliability Metrics
Difficulty: Intermediate. Audience: SRE.
Key Points for Incident Metrics: MTTR & MTTD
- MTTD (Mean Time to Detect) is the most actionable incident metric because it's directly tied to monitoring quality
- MTTR breaks down into detection, response, and resolution, and each phase needs separate measurement
- Customer-minutes affected is a better impact measure than raw incident count
- Incident frequency by severity follows a power law: track the distribution, not just the total
- Improvement trends matter more than absolute numbers; compare quarter over quarter
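Separating the MTTR phases is a matter of capturing three timestamps per incident. The values below are illustrative:

```python
# Hedged sketch: splitting MTTR into detect / respond / resolve phases.
# The incident timestamps are illustrative.
from datetime import datetime

incident = {
    "start":    datetime(2024, 7, 1, 2, 0),   # fault begins
    "detected": datetime(2024, 7, 1, 2, 25),  # alert fires
    "engaged":  datetime(2024, 7, 1, 2, 40),  # responder online
    "resolved": datetime(2024, 7, 1, 4, 0),   # service restored
}

detect_min  = (incident["detected"] - incident["start"]).total_seconds() / 60
respond_min = (incident["engaged"] - incident["detected"]).total_seconds() / 60
resolve_min = (incident["resolved"] - incident["engaged"]).total_seconds() / 60
total_mttr  = detect_min + respond_min + resolve_min
print(detect_min, respond_min, resolve_min, total_mttr)
```

Here a 120-minute MTTR decomposes into 25 minutes of detection, 15 of response, and 80 of resolution, so the bottleneck is clearly the fix itself, not the monitoring.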
Common Mistakes with Incident Metrics: MTTR & MTTD
- Lumping detection, response, and resolution into one MTTR number, which hides where the bottleneck is
- Measuring only Sev-1 incidents and missing the pattern of recurring Sev-3s that add up to massive toil
- Setting MTTR targets without investing in the tooling and runbooks needed to actually hit them
Related to Incident Metrics: MTTR & MTTD
SLO, SLA & SLI Budgeting, DORA Metrics Deep Dive
On-Call Health Metrics — Reliability Metrics
Difficulty: Intermediate. Audience: SRE.
Key Points for On-Call Health Metrics
- Pages per shift is the primary load metric: more than 2 pages per on-call shift signals unsustainable alert volume
- Google SRE recommends a maximum of 50% operational work; the rest should be engineering projects
- Sleep disruption (pages between 10pm and 7am) is the strongest predictor of on-call burnout
- On-call load distribution should be tracked per-person to ensure fairness across the team
- Escalation frequency indicates gaps in runbooks, tooling, or on-call engineer confidence
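Pages-per-shift and sleep-disruption counts fall out of a raw page log. The timestamps below are illustrative:

```python
# Hedged sketch: pages per shift and night pages from a page log.
# The timestamps and shift count are illustrative.
from datetime import datetime

pages = [
    datetime(2024, 8, 1, 3, 15),   # night page
    datetime(2024, 8, 1, 14, 0),
    datetime(2024, 8, 2, 23, 30),  # night page
    datetime(2024, 8, 3, 10, 5),
]
shifts = 3  # three 24-hour shifts in the window

pages_per_shift = len(pages) / shifts
# Night window per the key points: 10pm-7am
night_pages = sum(1 for p in pages if p.hour >= 22 or p.hour < 7)
print(round(pages_per_shift, 2), night_pages)
```

Tracking the same two numbers per person, not just per rotation, is what surfaces unfair load distribution.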
Common Mistakes with On-Call Health Metrics
- Measuring only incident count while ignoring alert noise, false positives, and duplicate pages
- Distributing on-call by calendar rotation without accounting for page volume differences across shifts
- Treating toil reduction as optional cleanup instead of a reliability investment
- Not tracking on-call health until someone burns out and quits
Related to On-Call Health Metrics
SLO, SLA & SLI Budgeting, SPACE Framework
Release Quality Metrics — Delivery Metrics
Difficulty: Intermediate. Audience: Engineering Manager.
Key Points for Release Quality Metrics
- Defect escape rate (bugs found in production vs total bugs) is the clearest signal of pre-release quality
- Rollback frequency above 5% of deploys indicates systemic gaps in testing or review processes
- Test coverage is a lagging indicator; defect density per release is a leading indicator of quality trends
- Canary deployment success rate measures how often canaries promote without rollback or intervention
- Release confidence scoring combines automated signals into a go/no-go number before each deploy
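Defect escape rate and rollback frequency are simple ratios. The counts below are illustrative:

```python
# Hedged sketch: defect escape rate and rollback frequency for a release window.
# All counts are illustrative.

bugs_found_pre_release = 38
bugs_found_in_production = 7
deploys = 60
rollbacks = 4

escape_rate = bugs_found_in_production / (bugs_found_pre_release + bugs_found_in_production)
rollback_rate = rollbacks / deploys
print(round(escape_rate, 3), round(rollback_rate, 3))
```

In this example the rollback rate of ~6.7% sits above the 5% threshold from the key points, flagging a gap in testing or review.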
Common Mistakes with Release Quality Metrics
- Treating zero rollbacks as the goal, which discourages fast rollback when problems do occur
- Using test coverage percentage as a quality proxy without measuring what those tests actually validate
- Skipping release retrospectives when things go well, missing the chance to understand why they went well
- Measuring only production bugs and ignoring the cost of hotfixes and emergency patches
Related to Release Quality Metrics
DORA Metrics Deep Dive, Engineering Productivity Measurement
Security Metrics Dashboard — Reliability Metrics
Difficulty: Advanced. Audience: Director.
Key Points for Security Metrics Dashboard
- Mean time to remediate by severity class is the single most revealing metric for your AppSec program. If your critical MTTR is 5 days and trending upward, no amount of training or policy will fix the underlying capacity and tooling problems driving the delay
- Vulnerability counts without severity breakdown are actively misleading. A dashboard showing '847 open vulnerabilities' tells you nothing. A dashboard showing '3 critical, 12 high, 832 low' tells you exactly where to focus
- AppSec pipeline coverage is your biggest blind spot indicator. If 40% of your repositories ship without automated security scanning, those are the repositories where your next incident lives. Prioritize internet-facing services and repositories handling PII first
- Container image scanning should be a deployment gate, not a weekly report that someone glances at. Netflix and Shopify both enforce hard blocks on images that fail scanning. If it fails, it doesn't deploy. Period
- Present security metrics to leadership as trends paired with investments, not snapshots. 'Critical MTTR improved from 5 days to 2 days after we deployed Snyk and added an AppSec rotation' builds the case for continued investment in a way that raw numbers never will
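MTTR by severity class is a group-by over finding age. The findings list below is illustrative:

```python
# Hedged sketch: mean time to remediate, grouped by severity class.
# The findings list is illustrative, not real scanner output.

findings = [
    {"severity": "critical", "days_open": 2},
    {"severity": "critical", "days_open": 4},
    {"severity": "high",     "days_open": 10},
    {"severity": "high",     "days_open": 14},
    {"severity": "low",      "days_open": 90},
]

by_severity: dict[str, list[int]] = {}
for f in findings:
    by_severity.setdefault(f["severity"], []).append(f["days_open"])

mttr = {sev: sum(days) / len(days) for sev, days in by_severity.items()}
print(mttr)
```

Trend this number per quarter alongside the investments made, as the last key point recommends, rather than reporting a single snapshot.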
Common Mistakes with Security Metrics Dashboard
- Setting remediation SLAs without giving teams the tooling to meet them. A 24-hour critical SLA is useless if engineers have to manually triage Dependabot alerts across 200 repos. Automate the triage, auto-create tickets, and route to the right team, or the SLA is just a source of frustration
- Measuring security training completion as a proxy for security posture. Completion rates tell you who clicked through a 30-minute video. They tell you nothing about whether your codebase is more secure. Replace this with metrics that measure actual outcomes: fewer vulnerabilities introduced per PR, faster remediation times
- Treating all vulnerability scanners as equal. SAST tools generate high false-positive rates that train engineers to ignore findings. Pair SAST with DAST and SCA, tune your rulesets aggressively, and track the false-positive rate as a metric itself. A scanner that cries wolf is worse than no scanner
- Reporting patch compliance without distinguishing between internet-facing systems and internal tools. An unpatched internal wiki is a different risk profile than an unpatched API gateway. Weight your metrics by exposure
Related to Security Metrics Dashboard
SLO, SLA & SLI Budgeting, Cloud Cost Optimization
SLO, SLA & SLI Budgeting — Reliability Metrics
Difficulty: Advanced. Audience: Platform Team.
Key Points for SLO, SLA & SLI Budgeting
- SLIs are the measurements, SLOs are the targets, SLAs are the contracts. Do not confuse them
- Error budgets quantify how much unreliability you can tolerate before pausing feature work
- A 99.9% SLO allows 43.2 minutes of downtime per month. Know your budget in real time
- Burn rate alerts detect when you are consuming error budget faster than expected
- SLOs should be set based on user expectations, not on what your system currently achieves
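The 43.2-minute budget and the burn-rate math can be sketched directly. The consumed-minutes figure below is illustrative:

```python
# Hedged sketch: monthly error budget and burn rate for a 99.9% SLO.
# The 20 consumed minutes are an illustrative scenario.

slo = 0.999
minutes_per_month = 30 * 24 * 60                  # 43,200 minutes
budget_minutes = minutes_per_month * (1 - slo)    # 43.2 minutes of allowed downtime

# Burn rate: budget consumed relative to the fraction of the window elapsed.
# Here: 20 bad minutes in the first 5 days of a 30-day window.
consumed = 20.0
elapsed_fraction = 5 / 30
burn_rate = (consumed / budget_minutes) / elapsed_fraction
print(round(budget_minutes, 1), round(burn_rate, 2))
```

A burn rate above 1 means the budget will be exhausted before the window ends; here the ~2.8x rate would trip a burn-rate alert well before the month is over.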
Common Mistakes with SLO, SLA & SLI Budgeting
- Setting SLOs at 99.99% when your users would be perfectly happy with 99.9%, wasting engineering effort
- Defining SLIs that do not reflect actual user experience (measuring server uptime instead of request success rate)
- Having SLOs without error budget policies. The budget is meaningless if nobody acts when it is exhausted
- Treating SLAs and SLOs as the same thing. SLAs have financial penalties, SLOs are internal targets
Related to SLO, SLA & SLI Budgeting
DORA Metrics Deep Dive, Cloud Cost Optimization
SPACE Framework — Productivity Measurement
Difficulty: Advanced. Audience: VP of Engineering.
Key Points for SPACE Framework
- Five dimensions: Satisfaction & well-being, Performance, Activity, Communication & collaboration, Efficiency & flow
- No single metric captures developer productivity. SPACE requires measuring across multiple dimensions
- Satisfaction surveys are a leading indicator; declining satisfaction predicts future attrition and velocity drops
- Activity metrics (PRs, commits) are only valid when combined with outcome metrics to avoid Goodhart's Law
- Developed by Nicole Forsgren, Margaret-Anne Storey, and colleagues at GitHub, Microsoft Research, and the University of Victoria
Common Mistakes with SPACE Framework
- Cherry-picking only the Activity dimension because it is easiest to measure automatically
- Running satisfaction surveys but never acting on the results, creating survey fatigue
- Measuring individual developers instead of teams. SPACE explicitly warns against this
- Treating SPACE as a replacement for DORA instead of a complementary framework
Related to SPACE Framework
DORA Metrics Deep Dive, Engineering Productivity Measurement
Technical Debt Measurement — Productivity Measurement
Difficulty: Advanced. Audience: Staff Engineer.
Key Points for Technical Debt Measurement
- Technical debt compounds like financial debt: the interest rate matters more than the principal
- Martin Fowler's quadrant classifies debt as reckless/prudent and deliberate/inadvertent
- Static analysis tools like SonarQube and CodeClimate give you a baseline, but they miss architectural debt
- Debt paydown velocity should be tracked alongside feature velocity in sprint planning
- A debt budget (10-20% of engineering capacity) keeps debt from growing unchecked
Common Mistakes with Technical Debt Measurement
- Treating all technical debt as equally urgent instead of prioritizing by interest rate
- Relying solely on static analysis metrics and ignoring structural or architectural debt
- Pitching debt paydown as 'engineering wants to refactor' instead of framing it with business impact data
- Waiting for a dedicated 'tech debt sprint' instead of continuously paying down high-interest debt
Related to Technical Debt Measurement
Engineering Productivity Measurement, SPACE Framework
Test Coverage & Effectiveness — Delivery Metrics
Difficulty: Intermediate. Audience: Staff Engineer.
Key Points for Test Coverage & Effectiveness
- Coverage percentage tells you what code is executed during tests, not whether the tests actually catch bugs
- Mutation testing measures real test effectiveness by injecting faults and checking if tests detect them
- Flaky test rate above 2-3% visibly degrades developer trust in the test suite and slows down merges
- Test pyramid ratios (70% unit / 20% integration / 10% e2e) keep execution fast and maintenance manageable
- Test execution time is a developer experience metric: suites over 10 minutes break the feedback loop
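Mutation score and flaky rate are both simple ratios over suite results. The counts below are illustrative:

```python
# Hedged sketch: mutation score and flaky-test rate from suite results.
# All counts are illustrative; real numbers would come from a mutation
# testing tool and your CI retry logs.

mutants_generated = 200
mutants_killed = 150          # tests failed when the fault was injected
mutation_score = mutants_killed / mutants_generated

runs = 500
flaky_failures = 14           # failures that pass on retry with no code change
flaky_rate = flaky_failures / runs
print(mutation_score, round(flaky_rate, 3))
```

Here a 75% mutation score is a far more honest effectiveness signal than an 80% coverage number, and the 2.8% flaky rate sits inside the trust-eroding band named in the key points.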
Common Mistakes with Test Coverage & Effectiveness
- Chasing a coverage number (like 80%) without considering what the tests actually assert
- Writing tests after the fact to hit a coverage gate, which produces low-value tests that test implementation details
- Ignoring flaky tests until the suite is so unreliable that engineers stop trusting green builds
- Building a test-diamond (heavy integration, light unit) which makes the suite slow and brittle
Related to Test Coverage & Effectiveness
DORA Metrics Deep Dive, Engineering Productivity Measurement