Prometheus
The pull-based metrics system that actually works for containers and Kubernetes
Why It Exists
Anyone who has run Nagios or Zabbix knows those tools were built for a world where servers had fixed hostnames and IP addresses that stuck around. That world is gone. Containers spin up and die in seconds, services auto-scale based on load, and yesterday's IP means nothing today.
Prometheus was built at SoundCloud in 2012 specifically for this reality. The team drew heavy inspiration from Google's internal Borgmon system. It graduated from the CNCF in 2018, only the second project to do so after Kubernetes, and at this point it is the default monitoring choice for cloud-native infrastructure. Its data model, PromQL, and exposition format have shaped every monitoring tool that came after it.
How It Works
Data Model: Every metric is a name plus a set of key-value labels. http_requests_total{method="GET", handler="/api/users", status="200"} is a time series, which is just a sequence of (timestamp, value) pairs. Labels enable slicing data across multiple dimensions: by method, by handler, by status, or any combination. This is far more flexible than the old hierarchical naming style (like servers.web01.http.get.200.count), and once teams adopt it, the old style feels unworkable.
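To make the model concrete, here is a minimal sketch using the official Go client, client_golang (the metric and label names mirror the example above; promauto registers the metric with the default registry):

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Each distinct combination of label values becomes its own time series,
// e.g. http_requests_total{method="GET", handler="/api/users", status="200"}.
var httpRequestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests served.",
	},
	[]string{"method", "handler", "status"},
)

func main() {
	// Two increments, two different label sets: two separate series.
	httpRequestsTotal.WithLabelValues("GET", "/api/users", "200").Inc()
	httpRequestsTotal.WithLabelValues("POST", "/api/users", "500").Inc()
}
```

At query time, PromQL can sum or slice these series along any subset of those labels.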
Four Metric Types: Counter is a monotonically increasing value, good for total requests or total errors. Gauge goes up and down freely, useful for things like temperature or concurrent connections. Histogram buckets observations into configurable ranges and tracks sum and count, which is the right choice for request duration. Summary is similar but calculates quantiles on the client side. In practice, histograms are almost always the better choice: client-side summary quantiles cannot be aggregated across instances, while histogram buckets can be summed across services and fed to histogram_quantile() at query time.
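A sketch of the gauge and histogram types with the same Go client; the metric names and bucket choice here are illustrative, not prescriptive:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Gauge: a value that moves in both directions, read at scrape time.
	activeConnections = promauto.NewGauge(prometheus.GaugeOpts{
		Name: "app_active_connections",
		Help: "Current number of open connections.",
	})

	// Histogram: each observation is counted into cumulative buckets, and
	// _sum and _count come along for free, so histogram_quantile() works
	// at query time.
	requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency in seconds.",
		Buckets: prometheus.DefBuckets, // defaults span 5ms to 10s
	})
)

func main() {
	activeConnections.Inc()
	defer activeConnections.Dec()

	// Time a unit of work and observe the duration into the histogram.
	timer := prometheus.NewTimer(requestDuration)
	defer timer.ObserveDuration()
}
```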
Scraping: Prometheus pulls metrics by hitting each target's /metrics endpoint via HTTP GET at a configured interval (15 seconds is the common convention; the shipped default is 1 minute). Targets expose metrics in the Prometheus exposition format or OpenMetrics format. Service discovery finds targets automatically from Kubernetes, Consul, DNS, EC2, Azure, or plain file-based config.
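Serving that endpoint takes only a few lines with the Go client's promhttp package; the port is an arbitrary choice for this sketch:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler() renders the default registry in the Prometheus
	// exposition format (content negotiation handles OpenMetrics clients).
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

From there, a static scrape config or a service-discovery rule pointing at port 8080 is all Prometheus needs to start collecting.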
Architecture Deep Dive
TSDB (Time Series Database): Prometheus ships with a custom TSDB built for write-heavy, append-mostly workloads. Incoming data lands in an in-memory "head block" backed by a write-ahead log (WAL) for crash safety. Every 2 hours, the head block gets compacted into an immutable on-disk block containing compressed chunks, an inverted index mapping label sets to series, and metadata. Older blocks get merged into larger ones over time. The compression is genuinely impressive, typically 1-2 bytes per sample.
Query Engine: PromQL evaluation is lazy and iterator-based. The engine builds an execution tree from the query AST where each node is an iterator. Range vector selectors pull data from the TSDB, functions like rate() and increase() transform it, and aggregation operators like sum() and avg() combine series across label dimensions. This design handles queries touching millions of samples without choking.
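The same engine is exposed over Prometheus's HTTP API. A minimal sketch with the Go client's api/prometheus/v1 package, assuming a server at localhost:9090 and the counter from earlier:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// rate() turns the raw counter into per-second throughput; sum by
	// (handler) aggregates every other label dimension away.
	result, warnings, err := promAPI.Query(ctx,
		`sum by (handler) (rate(http_requests_total[5m]))`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```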
Alertmanager: Alert rules are defined as PromQL expressions that get evaluated on a regular interval. When an expression returns a non-empty result, the alert fires. Alertmanager then takes over and handles routing (critical alerts go to PagerDuty, warnings go to Slack), grouping (batch related alerts into one notification), inhibition (suppress lower-severity alerts when a higher-severity one is active), and silencing (manually mute alerts during maintenance windows). In HA setups with multiple Prometheus instances, Alertmanager deduplicates so nobody gets paged twice for the same problem.
Scaling with Thanos: A single Prometheus instance handles 5-10 million active time series, which is a lot. But to go beyond that, or to get visibility across multiple clusters, Thanos is the usual answer. Thanos Sidecar uploads Prometheus blocks to object storage like S3 or GCS, providing practically unlimited retention at low cost. Thanos Query provides a single PromQL endpoint that fans out to all Prometheus instances and deduplicates results. Thanos Compactor downsamples old data (to 5-minute and then 1-hour resolution as blocks age) so queries over long time ranges stay fast.
GitLab runs its entire SaaS platform on Prometheus, scraping over 25 million time series across thousands of services, with Thanos layered on top for long-term storage and cross-cluster querying. That is real-world proof the stack works at serious scale.
Instrumentation Best Practices
Use the official client libraries (Go, Java, Python, Ruby, .NET) to instrument application code. At minimum, expose RED metrics for every service: http_requests_total (counter) for Rate, http_request_duration_seconds (histogram) for Duration, and http_requests_errors_total (counter) for Errors. Then add USE metrics for resources: Utilization, Saturation, and Errors for CPU, memory, disk, and network. Keep label names consistent across services: if one service calls it http_method and another calls it method, cross-service queries become painful.
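As one way to wire this up, here is a sketch of an HTTP middleware using client_golang; the metric names follow the RED convention above, and the handler name and port are placeholders:

```go
package main

import (
	"net/http"
	"strconv"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "http_requests_total",
		Help: "Total HTTP requests (Rate).",
	}, []string{"method", "handler", "status"})

	requestDuration = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "http_request_duration_seconds",
		Help:    "HTTP request latency (Duration).",
		Buckets: prometheus.DefBuckets,
	}, []string{"method", "handler"})
)

// statusRecorder captures the response code for the "status" label.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// instrument wraps any handler so every request is counted and timed
// with the same label names, keeping queries consistent across services.
func instrument(name string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		next.ServeHTTP(rec, r)

		requestsTotal.WithLabelValues(r.Method, name, strconv.Itoa(rec.status)).Inc()
		requestDuration.WithLabelValues(r.Method, name).Observe(time.Since(start).Seconds())
	})
}

func main() {
	http.Handle("/api/users", instrument("/api/users", http.HandlerFunc(
		func(w http.ResponseWriter, r *http.Request) { w.Write([]byte("ok")) })))
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```

This sketch derives errors at query time from the status label (e.g. status=~"5..") rather than keeping a separate errors counter; either approach works as long as it is consistent across services.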
Pros
- • Pull model makes service discovery and health detection straightforward
- • PromQL is hands-down the best metrics query language available
- • Kubernetes integration with automatic service discovery works out of the box
- • Huge exporter ecosystem with 500+ integrations
- • CNCF graduated project, widely adopted, battle-tested
Cons
- • Single-node by default. You need Thanos or Cortex to scale horizontally
- • Local storage is not durable. A disk failure wipes your metrics
- • High cardinality labels will blow up memory and kill query performance
- • Pull model means Prometheus needs network access to every target
- • No built-in dashboards. You will need Grafana
When to use
- • Cloud-native environments, especially anything running on Kubernetes
- • You want a proven, standards-based monitoring stack
- • Your team practices SRE and needs SLI/SLO tracking
- • You need multi-dimensional metrics with label-based querying
When NOT to use
- • Log aggregation or distributed tracing (reach for Loki or Jaeger instead)
- • Billing or accounting metrics where 100% accuracy matters (Prometheus can drop data)
- • Environments where Prometheus cannot reach targets (use a push-based alternative)
- • Very long retention (years) without adding Thanos or Cortex for remote storage
Key Points
- • The pull model means Prometheus scrapes targets at configured intervals. If a target is down, Prometheus knows immediately because the scrape fails. Push-based systems cannot tell the difference between 'no data' and 'target is healthy but idle.' That distinction matters more than people realize.
- • PromQL supports rate(), histogram_quantile(), and aggregation across label dimensions. For example, rate(http_requests_total{status='500'}[5m]) / rate(http_requests_total[5m]) produces the error rate over 5 minutes. This is the bread and butter of SLI calculation.
- • The local TSDB uses a compressed, append-only block format that achieves 1-2 bytes per sample. A Prometheus instance watching 1 million time series at a 15-second scrape interval uses roughly 6GB of RAM and 30GB of disk per day.
- • Thanos provides a global query view across clusters, unlimited retention through object storage (S3/GCS), and downsampling. It runs a sidecar on each Prometheus that uploads blocks, plus a query component that deduplicates and merges results across instances.
- • Recording rules pre-compute expensive expressions and store results as new time series. This matters for dashboards. A dashboard querying rate() over 30 days of raw data can take minutes. A recording rule that reduces to 5-minute aggregates makes it instant.
Common Mistakes
- ✗ Unbounded label cardinality. Adding user_id, request_id, or IP as labels creates millions of time series and causes OOM. Labels need bounded, low-cardinality values: method, status_code, and endpoint are fine; user_id is not (see the sketch after this list).
- ✗ Skipping recording rules for dashboard queries. Every Grafana panel re-evaluates its PromQL on every refresh. Complex queries over large time ranges will bring Prometheus to its knees. Pre-compute with recording rules.
- ✗ Scraping too often. 15-second intervals are standard for a reason. Running 1-second scrapes on 10,000 targets that each expose around 60 series generates 600K samples per second, and that will overwhelm the TSDB. Match the scrape interval to the granularity actually needed.
- ✗ Inconsistent metric naming. Metrics should follow the pattern namespace_subsystem_name_unit (e.g., http_server_request_duration_seconds). Sloppy naming makes queries and alerts a nightmare to maintain.
- ✗ Running Prometheus without Alertmanager. Prometheus can evaluate alert rules, but it cannot route, deduplicate, silence, or group notifications. Every production deployment needs Alertmanager. No exceptions.
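To illustrate the cardinality mistake, here is a sketch of unbounded versus bounded labels with client_golang; the metric names are hypothetical:

```go
package main

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Anti-pattern: user_id is unbounded, so every distinct user mints a new
// time series and memory grows without limit.
var requestsByUser = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "api_requests_by_user_total",
	Help: "Do not do this: one series per user.",
}, []string{"user_id"})

// Bounded alternative: every label value comes from a small fixed set, so
// the series count stays flat regardless of traffic.
var requests = promauto.NewCounterVec(prometheus.CounterOpts{
	Name: "api_requests_total",
	Help: "Requests by method and status.",
}, []string{"method", "status"})

func main() {
	requestsByUser.WithLabelValues("user-8675309").Inc() // new series per user
	requests.WithLabelValues("GET", "200").Inc()         // one of a few series
}
```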