Metrics & Monitoring
Why It Exists
Once a distributed system grows past a handful of services, reasoning about what's happening by looking at logs alone becomes impossible. Numbers are needed. Metrics provide those numbers, and they're the foundation for everything else: alerting, capacity planning, SLOs.
I've watched teams try to debug production issues by SSH-ing into boxes and tailing log files. That works with three servers. It falls apart completely at thirty. And it's impossible at three hundred.
Two frameworks worth internalizing early. The USE method (Utilization, Saturation, Errors) covers what to measure for infrastructure resources like CPU, memory, and disk. The RED method (Rate, Errors, Duration) covers what to measure for services. These aren't theoretical. They provide a concrete checklist instead of building dashboards based on vibes.
How It Works
Metric Types
Prometheus defines four core metric types. The first three get used constantly; the fourth is best avoided.
- Counter: A value that only goes up (e.g., `http_requests_total`). On its own it's useless. What matters is the rate: `rate(http_requests_total[5m])`.
- Gauge: A point-in-time snapshot that can go up or down (e.g., `node_memory_available_bytes`, `active_connections`). Simple and intuitive.
- Histogram: Buckets observations to compute percentiles (p50, p95, p99). This is what latency SLIs need, as the query sketch after this list shows. Pick bucket boundaries carefully, because changing them later means losing historical comparability.
- Summary: Similar idea to histograms, but it calculates quantiles on the client side. The problem? Summaries can't be aggregated across instances. Just use histograms.
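To make the first three concrete, here is a minimal PromQL sketch (each expression is a separate query; the counter, gauge, and histogram names are illustrative, not from any particular codebase):

```promql
# Counter: per-second request rate over the last 5 minutes
rate(http_requests_total[5m])

# Gauge: just read the current value, no rate() needed
active_connections

# Histogram: p99 latency, aggregated across instances by summing bucket rates
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```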
Prometheus Architecture
Prometheus uses a pull-based model. It scrapes HTTP /metrics endpoints at whatever interval is configured (usually 15-30s). Service discovery through Kubernetes, Consul, or DNS finds scrape targets automatically, which is one of the things that makes Prometheus work so well in dynamic environments.
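A minimal scrape configuration sketch, assuming Kubernetes service discovery and pods that opt in via a `prometheus.io/scrape` annotation (a common convention, not a requirement):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: kubernetes-pods
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod          # discover every pod in the cluster
    relabel_configs:
      # keep only pods that explicitly ask to be scraped
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```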
Scraped samples land in a local time-series database built for append-heavy writes with solid compression (roughly 1.5 bytes per sample). For anything beyond about 15 days of retention, remote write should push data into a durable backend like Grafana Mimir, Thanos, or VictoriaMetrics. The local TSDB was never meant to be a long-term store.
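A remote write fragment along those lines; the Mimir URL and queue numbers are illustrative, not recommendations:

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: https://mimir.example.internal/api/v1/push
    queue_config:
      max_shards: 50             # parallel senders; raise if the queue backs up
      capacity: 10000            # samples buffered per shard before blocking
      max_samples_per_send: 2000 # batch size per request
```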
SLI/SLO/Error Budget Framework
An SLI (Service Level Indicator) is a specific measurement of service behavior. Usually request latency at p99 or success rate. Straightforward.
An SLO (Service Level Objective) puts a target on it: "99.9% of requests complete successfully within 300ms over a 30-day window."
The error budget is the gap between perfection and the SLO. At 99.9%, that's 0.1% of requests as the budget, which works out to roughly 43 minutes of downtime per month. When that budget runs dry, teams stop shipping features and fix reliability. This is the part most teams struggle with politically, but it's the whole point of the framework.
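Here is a sketch of how that framework lands in Prometheus rules for a 99.9% availability SLO. The service name, metric, and the single-window fast-burn threshold are illustrative; production setups usually pair multiple windows:

```yaml
# rules.yml (fragment)
groups:
  - name: checkout-slo
    rules:
      # SLI: fraction of requests that failed over the last hour
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="checkout", code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="checkout"}[1h]))
      # Page when the budget burns ~14x faster than the SLO allows:
      # at that rate a 30-day budget is gone in roughly 2 days
      - alert: CheckoutErrorBudgetFastBurn
        expr: job:slo_errors:ratio_rate1h > 14.4 * 0.001
        for: 5m
        labels:
          severity: page
```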
Production Considerations
- Federation: Run a global Prometheus that scrapes aggregated metrics from regional instances. Do not federate raw high-cardinality data. I've seen this bring down the global instance.
- Cardinality management: Use `metric_relabel_configs` to enforce label value limits. One metric with a `user_id` label can generate millions of series and OOM the server in hours.
- Recording rules: Pre-compute expensive PromQL queries (like multi-window burn rates) as recording rules. Dashboards and alerts will evaluate much faster, and the TSDB won't get hammered on every page load.
- Retention vs. cost: Keep raw metrics at 15s resolution for about 15 days. Downsample to 5m resolution for long-term storage. Nobody needs second-level granularity from three months ago.
- Alertmanager grouping: Group related alerts by `service` and `severity` (see the sketch after this list). Without this, a single cascading failure generates 200 individual pages and the on-call engineer's phone becomes unusable.
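A sketch of that Alertmanager grouping; the receiver name and timings are illustrative:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: oncall-pagerduty
  group_by: ['service', 'severity']   # one notification per service/severity pair
  group_wait: 30s                     # wait briefly so related alerts batch together
  group_interval: 5m
  repeat_interval: 4h
```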
Failure Scenarios
Scenario 1: Cardinality Explosion Causes Prometheus OOM
A developer adds a `user_id` label to a request duration histogram. Seems harmless. But with 10M active users and 10 histogram buckets, that's 100M+ new time series overnight. Prometheus memory jumps from 8 GB to 120 GB and gets OOM-killed. Now all alerting is dead. SLO burn rate alerts go silent. A latency regression in the payment service goes unnoticed for 90 minutes.
Detection: Monitor `prometheus_tsdb_head_series` and alert when series count grows more than 20% in 1 hour.
Recovery: Find the offending metric with `topk(10, count by (__name__)({__name__=~".+"}))`, add a `metric_relabel_config` to drop the label (see the fragment below), and restart Prometheus. Then have a conversation with that developer about why label cardinality matters.
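The relabel fix might look like the fragment below. The `user_id` regex matches this scenario; if dropping the label leaves otherwise-identical series colliding, dropping the whole metric (`action: drop` on `__name__`) is the blunter alternative:

```yaml
# prometheus.yml, inside the affected scrape job (fragment)
metric_relabel_configs:
  - action: labeldrop
    regex: user_id        # strip the high-cardinality label before ingestion
```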
Scenario 2: TSDB Corruption After Disk Full
Prometheus fills up its 500 GB volume. The write-ahead log (WAL) corrupts during the disk-full event. After expanding the volume, Prometheus can't replay the WAL and 2 hours of data are lost. Every recording rule and alert using a [2h] window returns no data, which triggers hundreds of false alert resolutions. The team thinks everything is fine when it isn't.
Detection: Alert on `node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15` with a 30-minute `for` clause.
Recovery: Restore from the most recent TSDB snapshot, accept the data gap, and resize the volume to 3x current usage. Set up automated PVC expansion to avoid this happening again.
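That detection rule as a full alert, assuming the TSDB volume is mounted at /prometheus (adjust the mountpoint selector to match reality):

```yaml
# rules.yml (fragment)
- alert: PrometheusVolumeAlmostFull
  expr: |
    node_filesystem_avail_bytes{mountpoint="/prometheus"}
      / node_filesystem_size_bytes{mountpoint="/prometheus"} < 0.15
  for: 30m
  labels:
    severity: page
  annotations:
    summary: "Prometheus TSDB volume below 15% free space"
```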
Scenario 3: Metric Pipeline Lag During Traffic Surge
Black Friday hits with 4x normal traffic. The remote write queue to Grafana Mimir backs up because the ingester can't keep pace. `prometheus_remote_storage_samples_pending` climbs past 5M. Dashboards show stale data (15 minutes behind), and SLO calculations undercount errors because recent samples haven't arrived yet.
Detection: Alert on `prometheus_remote_storage_highest_timestamp_in_seconds - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300`.
Recovery: Scale Mimir ingesters horizontally, increase Prometheus `max_shards` for remote write, and turn on WAL-based queueing to prevent sample loss. Ideally, load-test the monitoring pipeline before the traffic spike, not during it.
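The detection threshold above, written as an alert rule (the 10-minute `for` is an illustrative debounce, not a recommendation):

```yaml
# rules.yml (fragment)
- alert: RemoteWriteFallingBehind
  expr: |
    prometheus_remote_storage_highest_timestamp_in_seconds
      - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Remote write is more than 5 minutes behind ingestion"
```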
Capacity Planning
Storage estimation formula: `storage_bytes = num_series * bytes_per_sample * samples_per_day * retention_days`, where `samples_per_day` is per series (86,400 / scrape interval). With Prometheus at ~1.5 bytes/sample compressed, 1M active series at a 15s scrape interval (5,760 samples per series per day) produces: 1,000,000 * 1.5 * 5,760 * 15 ≈ 130 GB for 15 days of retention.
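The formula's inputs can also be measured on a running server instead of estimated; both expressions below are standard Prometheus self-metrics:

```promql
# Active series currently in the head block (num_series in the formula)
prometheus_tsdb_head_series

# Samples ingested per second across all series; multiplied by ~1.5 bytes and
# 86,400 seconds/day, this approximates daily ingest directly
rate(prometheus_tsdb_head_samples_appended_total[1h])
```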
| Scale Tier | Active Series | Scrape Interval | Daily Ingest | 15-Day Storage | Reference |
|---|---|---|---|---|---|
| Startup | 50K | 30s | 2.5 GB | 37 GB | Series A SaaS |
| Mid-scale | 500K | 15s | 50 GB | 750 GB | 50-service platform |
| Large-scale | 5M | 15s | 500 GB | 7.5 TB | Netflix-scale |
| Hyper-scale | 50M+ | 15s | 5 TB+ | 75 TB+ | Datadog (40T+ data points/day) |
Key thresholds: A single Prometheus node tops out around 10M active series with 64 GB RAM. Beyond that, shard with Thanos or Mimir. Query latency gets noticeably worse when PromQL touches more than 1M series in a single query, so use recording rules to pre-aggregate. Keep Grafana dashboard load times under 3s. If a panel queries over 500K series, refactor it to use recorded metrics. As a rough planning number, budget 2 GB RAM per 1M active series.
Architecture Decision Record
Decision: Choosing a Metrics Stack for Production
| Criteria (Weight) | Prometheus + Thanos | Grafana Mimir | Datadog | VictoriaMetrics |
|---|---|---|---|---|
| Cost at scale (25%) | 4: Free, infra cost only | 4: Free, infra cost only | 2: $23/host/mo adds up | 4: Free, efficient storage |
| Operational complexity (20%) | 2: Sidecar + store + compactor | 3: Fewer components, read/write path | 5: Fully managed | 4: Single binary option |
| Multi-tenancy (15%) | 3: External labels per tenant | 5: Native tenant isolation | 4: Org-level separation | 3: Label-based isolation |
| Query performance (15%) | 3: Store gateway latency | 4: Optimized read path | 4: Proprietary optimizations | 5: Fastest PromQL engine |
| Ecosystem integration (15%) | 5: De facto standard | 4: PromQL + Grafana native | 3: Proprietary query language | 4: PromQL-compatible |
| Compliance / data residency (10%) | 5: Self-hosted, full control | 5: Self-hosted, full control | 2: SaaS, limited regions | 5: Self-hosted, full control |
When to choose what:
- Team < 20 engineers, <500 services: Datadog. The operational simplicity is worth the cost. Teams are up and running in days, not weeks.
- Team 20-100, cost-conscious: VictoriaMetrics. Single-binary deployment, great performance, low resource footprint. Honestly underrated.
- Team 100+, multi-tenant platform: Grafana Mimir. Native multi-tenancy, tight Grafana integration, strong community.
- Regulated industry (finance, healthcare): Self-hosted Prometheus + Thanos or Mimir. Data stays on internal infrastructure with full audit control over metric access. No negotiating with a SaaS vendor about where data lives.
Key Points
- Of the three observability pillars (metrics, logs, traces), metrics are where alerting starts
- USE method (Utilization, Saturation, Errors) for infrastructure; RED method (Rate, Errors, Duration) for services
- Prometheus pulls from /metrics endpoints; push-based systems (Datadog, StatsD) receive data from agents
- Cardinality explosion will kill the metric system. Never use unbounded label values like user IDs or request IDs.
- SLIs, SLOs, and error budgets turn raw numbers into reliability commitments the business can actually understand
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Prometheus | Open Source | Kubernetes-native, PromQL, pull-based | Medium-Enterprise |
| Datadog | Commercial | Unified observability, APM, easy setup | Small-Enterprise |
| Grafana + Mimir | Open Source | Long-term storage, multi-tenant Prometheus | Large-Enterprise |
| Victoria Metrics | Open Source | High-performance TSDB, PromQL-compatible | Medium-Enterprise |
Common Mistakes
- Building 50 dashboards before writing a single alert. Dashboards help investigate. Alerts detect problems.
- Alerting on resource symptoms like high CPU instead of user impact like elevated error rates or high latency.
- Skipping SLO definition before building monitoring. Without defining 'healthy,' there's nothing to measure.
- High-cardinality labels causing Prometheus OOM. Every unique label combo creates a new time series.
- Forgetting to monitor the monitoring system. Prometheus needs health checks and alerts on itself.