Metrics & Monitoring
Why It Exists
Once a distributed system grows past a handful of services, reasoning about what's happening by looking at logs alone becomes impossible. Numbers are needed. Metrics provide those numbers, and they're the foundation for everything else: alerting, capacity planning, SLOs.
I've watched teams try to debug production issues by SSH-ing into boxes and tailing log files. That works with three servers. It falls apart completely at thirty. And it's impossible at three hundred.
Two frameworks worth internalizing early. The USE method (Utilization, Saturation, Errors) covers what to measure for infrastructure resources like CPU, memory, and disk. The RED method (Rate, Errors, Duration) covers what to measure for services. These aren't theoretical. They provide a concrete checklist instead of building dashboards based on vibes.
How It Works
Metric Types
Prometheus defines four core metric types. The first three get used constantly; the fourth is best avoided.
- Counter: A value that only goes up (e.g., `http_requests_total`). On its own it's useless. What matters is the rate: `rate(http_requests_total[5m])`.
- Gauge: A point-in-time snapshot that can go up or down (e.g., `node_memory_available_bytes`, `active_connections`). Simple and intuitive.
- Histogram: Buckets observations to compute percentiles (p50, p95, p99). This is what latency SLIs need, as the query sketch after this list shows. Pick bucket boundaries carefully, because changing them later means losing historical comparability.
- Summary: Similar idea to histograms, but it calculates quantiles on the client side. The problem? Summaries can't be aggregated across instances. Just use histograms.
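To make the first three concrete, here is a minimal PromQL sketch (each expression is a separate query; the counter, gauge, and histogram names are illustrative, not from any particular codebase):

```promql
# Counter: per-second request rate over the last 5 minutes
rate(http_requests_total[5m])

# Gauge: just read the current value, no rate() needed
active_connections

# Histogram: p99 latency, aggregated across instances by summing bucket rates
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```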
Prometheus Architecture
Prometheus uses a pull-based model. It scrapes HTTP /metrics endpoints at whatever interval is configured (usually 15-30s). Service discovery through Kubernetes, Consul, or DNS finds scrape targets automatically, which is one of the things that makes Prometheus work so well in dynamic environments.
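A minimal scrape configuration sketch, assuming Kubernetes service discovery and pods that opt in via a `prometheus.io/scrape` annotation (a common convention, not a requirement):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: kubernetes-pods
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod          # discover every pod in the cluster
    relabel_configs:
      # keep only pods that explicitly ask to be scraped
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```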
Scraped samples land in a local time-series database built for append-heavy writes with solid compression (roughly 1.5 bytes per sample). For anything beyond about 15 days of retention, remote write should push data into a durable backend like Grafana Mimir, Thanos, or VictoriaMetrics. The local TSDB was never meant to be a long-term store.
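A remote write fragment along those lines; the Mimir URL and queue numbers are illustrative, not recommendations:

```yaml
# prometheus.yml (fragment)
remote_write:
  - url: https://mimir.example.internal/api/v1/push
    queue_config:
      max_shards: 50             # parallel senders; raise if the queue backs up
      capacity: 10000            # samples buffered per shard before blocking
      max_samples_per_send: 2000 # batch size per request
```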
SLI/SLO/Error Budget Framework
An SLI (Service Level Indicator) is a specific measurement of service behavior. Usually request latency at p99 or success rate. Straightforward.
An SLO (Service Level Objective) puts a target on it: "99.9% of requests complete successfully within 300ms over a 30-day window."
The error budget is the gap between perfection and the SLO. At 99.9%, that's 0.1% of requests as the budget, which works out to roughly 43 minutes of downtime per month. When that budget runs dry, teams stop shipping features and fix reliability. This is the part most teams struggle with politically, but it's the whole point of the framework.
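Here is a sketch of how that framework lands in Prometheus rules for a 99.9% availability SLO. The service name, metric, and the single-window fast-burn threshold are illustrative; production setups usually pair multiple windows:

```yaml
# rules.yml (fragment)
groups:
  - name: checkout-slo
    rules:
      # SLI: fraction of requests that failed over the last hour
      - record: job:slo_errors:ratio_rate1h
        expr: |
          sum(rate(http_requests_total{job="checkout", code=~"5.."}[1h]))
          /
          sum(rate(http_requests_total{job="checkout"}[1h]))
      # Page when the budget burns ~14x faster than the SLO allows:
      # at that rate a 30-day budget is gone in roughly 2 days
      - alert: CheckoutErrorBudgetFastBurn
        expr: job:slo_errors:ratio_rate1h > 14.4 * 0.001
        for: 5m
        labels:
          severity: page
```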
Production Considerations
- Federation: Run a global Prometheus that scrapes aggregated metrics from regional instances. Do not federate raw high-cardinality data. I've seen this bring down the global instance.
- Cardinality management: Use `metric_relabel_configs` to enforce label value limits. One metric with a `user_id` label can generate millions of series and OOM the server in hours.
- Recording rules: Pre-compute expensive PromQL queries (like multi-window burn rates) as recording rules. Dashboards and alerts will evaluate much faster, and the TSDB won't get hammered on every page load.
- Retention vs. cost: Keep raw metrics at 15s resolution for about 15 days. Downsample to 5m resolution for long-term storage. Nobody needs second-level granularity from three months ago.
- Alertmanager grouping: Group related alerts by `service` and `severity` (see the sketch after this list). Without this, a single cascading failure generates 200 individual pages and the on-call engineer's phone becomes unusable.
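A sketch of that Alertmanager grouping; the receiver name and timings are illustrative:

```yaml
# alertmanager.yml (fragment)
route:
  receiver: oncall-pagerduty
  group_by: ['service', 'severity']   # one notification per service/severity pair
  group_wait: 30s                     # wait briefly so related alerts batch together
  group_interval: 5m
  repeat_interval: 4h
```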
Failure Scenarios
Scenario 1: Cardinality Explosion Causes Prometheus OOM
A developer adds a `user_id` label to a request duration histogram. Seems harmless. But with 10M active users and 10 histogram buckets, that's 100M+ new time series overnight. Prometheus memory jumps from 8 GB to 120 GB and gets OOM-killed. Now all alerting is dead. SLO burn rate alerts go silent. A latency regression in the payment service goes unnoticed for 90 minutes.
Detection: Monitor `prometheus_tsdb_head_series` and alert when series count grows more than 20% in 1 hour.
Recovery: Find the offending metric with `topk(10, count by (__name__)({__name__=~".+"}))`, add a `metric_relabel_config` to drop the label (see the fragment below), and restart Prometheus. Then have a conversation with that developer about why label cardinality matters.
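The relabel fix might look like the fragment below. The `user_id` regex matches this scenario; if dropping the label leaves otherwise-identical series colliding, dropping the whole metric (`action: drop` on `__name__`) is the blunter alternative:

```yaml
# prometheus.yml, inside the affected scrape job (fragment)
metric_relabel_configs:
  - action: labeldrop
    regex: user_id        # strip the high-cardinality label before ingestion
```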
Scenario 2: TSDB Corruption After Disk Full
Prometheus fills up its 500 GB volume. The write-ahead log (WAL) corrupts during the disk-full event. After expanding the volume, Prometheus can't replay the WAL and 2 hours of data are lost. Every recording rule and alert using a [2h] window returns no data, which triggers hundreds of false alert resolutions. The team thinks everything is fine when it isn't.
Detection: Alert on `node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.15` with a 30-minute `for` clause.
Recovery: Restore from the most recent TSDB snapshot, accept the data gap, and resize the volume to 3x current usage. Set up automated PVC expansion to avoid this happening again.
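That detection rule as a full alert, assuming the TSDB volume is mounted at /prometheus (adjust the mountpoint selector to match reality):

```yaml
# rules.yml (fragment)
- alert: PrometheusVolumeAlmostFull
  expr: |
    node_filesystem_avail_bytes{mountpoint="/prometheus"}
      / node_filesystem_size_bytes{mountpoint="/prometheus"} < 0.15
  for: 30m
  labels:
    severity: page
  annotations:
    summary: "Prometheus TSDB volume below 15% free space"
```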
Scenario 3: Metric Pipeline Lag During Traffic Surge
Black Friday hits with 4x normal traffic. The remote write queue to Grafana Mimir backs up because the ingester can't keep pace. `prometheus_remote_storage_samples_pending` climbs past 5M. Dashboards show stale data (15 minutes behind), and SLO calculations undercount errors because recent samples haven't arrived yet.
Detection: Alert on `prometheus_remote_storage_highest_timestamp_in_seconds - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300`.
Recovery: Scale Mimir ingesters horizontally, increase Prometheus `max_shards` for remote write, and turn on WAL-based queueing to prevent sample loss. Ideally, load-test the monitoring pipeline before the traffic spike, not during it.
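The detection threshold above, written as an alert rule (the 10-minute `for` is an illustrative debounce, not a recommendation):

```yaml
# rules.yml (fragment)
- alert: RemoteWriteFallingBehind
  expr: |
    prometheus_remote_storage_highest_timestamp_in_seconds
      - prometheus_remote_storage_queue_highest_sent_timestamp_seconds > 300
  for: 10m
  labels:
    severity: page
  annotations:
    summary: "Remote write is more than 5 minutes behind ingestion"
```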
Capacity Planning
Storage estimation formula: `storage_bytes = num_series * bytes_per_sample * samples_per_day * retention_days`, where `samples_per_day` is per series (86,400 / scrape interval). With Prometheus at ~1.5 bytes/sample compressed, 1M active series at a 15s scrape interval (5,760 samples per series per day) produces: 1,000,000 * 1.5 * 5,760 * 15 ≈ 130 GB for 15 days of retention.
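The formula's inputs can also be measured on a running server instead of estimated; both expressions below are standard Prometheus self-metrics:

```promql
# Active series currently in the head block (num_series in the formula)
prometheus_tsdb_head_series

# Samples ingested per second across all series; multiplied by ~1.5 bytes and
# 86,400 seconds/day, this approximates daily ingest directly
rate(prometheus_tsdb_head_samples_appended_total[1h])
```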
| Scale Tier | Active Series | Scrape Interval | Daily Ingest | 15-Day Storage | Reference |
|---|---|---|---|---|---|
| Startup | 50K | 30s | 2.5 GB | 37 GB | Series A SaaS |
| Mid-scale | 500K | 15s | 50 GB | 750 GB | 50-service platform |
| Large-scale | 5M | 15s | 500 GB | 7.5 TB | Netflix-scale |
| Hyper-scale | 50M+ | 15s | 5 TB+ | 75 TB+ | Datadog (40T+ data points/day) |
Key thresholds: A single Prometheus node tops out around 10M active series with 64 GB RAM. Beyond that, shard with Thanos or Mimir. Query latency gets noticeably worse when PromQL touches more than 1M series in a single query, so use recording rules to pre-aggregate. Keep Grafana dashboard load times under 3s. If a panel queries over 500K series, refactor it to use recorded metrics. As a rough planning number, budget 2 GB RAM per 1M active series.
Architecture Decision Record
Decision: Choosing a Metrics Stack for Production
| Criteria (Weight) | Prometheus + Thanos | Grafana Mimir | Datadog | VictoriaMetrics |
|---|---|---|---|---|
| Cost at scale (25%) | 4: Free, infra cost only | 4: Free, infra cost only | 2: $23/host/mo adds up | 4: Free, efficient storage |
| Operational complexity (20%) | 2: Sidecar + store + compactor | 3: Fewer components, read/write path | 5: Fully managed | 4: Single binary option |
| Multi-tenancy (15%) | 3: External labels per tenant | 5: Native tenant isolation | 4: Org-level separation | 3: Label-based isolation |
| Query performance (15%) | 3: Store gateway latency | 4: Optimized read path | 4: Proprietary optimizations | 5: Fastest PromQL engine |
| Ecosystem integration (15%) | 5: De facto standard | 4: PromQL + Grafana native | 3: Proprietary query language | 4: PromQL-compatible |
| Compliance / data residency (10%) | 5: Self-hosted, full control | 5: Self-hosted, full control | 2: SaaS, limited regions | 5: Self-hosted, full control |
When to choose what:
- Team < 20 engineers, <500 services: Datadog. The operational simplicity is worth the cost. Teams are up and running in days, not weeks.
- Team 20-100, cost-conscious: VictoriaMetrics. Single-binary deployment, great performance, low resource footprint. Honestly underrated.
- Team 100+, multi-tenant platform: Grafana Mimir. Native multi-tenancy, tight Grafana integration, strong community.
- Regulated industry (finance, healthcare): Self-hosted Prometheus + Thanos or Mimir. Data stays on internal infrastructure with full audit control over metric access. No negotiating with a SaaS vendor about where data lives.
Key Points
- Of the three observability pillars (metrics, logs, traces), metrics are where alerting starts
- USE method (Utilization, Saturation, Errors) for infrastructure; RED method (Rate, Errors, Duration) for services
- Prometheus pulls from /metrics endpoints; push-based systems (Datadog, StatsD) receive data from agents
- Cardinality explosion will kill the metric system. Never use unbounded label values like user IDs or request IDs.
- SLIs, SLOs, and error budgets turn raw numbers into reliability commitments the business can actually understand
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Prometheus | Open Source | Kubernetes-native, PromQL, pull-based | Medium-Enterprise |
| Datadog | Commercial | Unified observability, APM, easy setup | Small-Enterprise |
| Grafana + Mimir | Open Source | Long-term storage, multi-tenant Prometheus | Large-Enterprise |
| Victoria Metrics | Open Source | High-performance TSDB, PromQL-compatible | Medium-Enterprise |
Common Mistakes
- Building 50 dashboards before writing a single alert. Dashboards help investigate. Alerts detect problems.
- Alerting on resource symptoms like high CPU instead of user impact like elevated error rates or high latency.
- Skipping SLO definition before building monitoring. Without defining 'healthy,' there's nothing to measure.
- High-cardinality labels causing Prometheus OOM. Every unique label combo creates a new time series.
- Forgetting to monitor the monitoring system. Prometheus needs health checks and alerts on itself.