System Design: Observability Platform at Scale - Metrics, Traces, and Logs
Goal: Build a unified observability platform that ingests 500 million metrics per second, collects distributed traces across 500K+ services, and centralizes logs from every container in the fleet. Support all three pillars of observability: metrics (counters, gauges, histograms, summaries), traces (per-request call trees across service boundaries), and logs (structured application output with full-text search). Provide real-time dashboards, alerting with SLO (Service Level Objective)-based burn rates, multi-tenant isolation, and long-term retention with automatic downsampling.
Reading guide: This is a long, detailed deep dive. You don't need to read it linearly.
- Sections 1-8: Metrics pipeline in full depth (the largest and most complex signal)
- Section 9: Distributed tracing at scale
- Section 10: Centralized logging
- Sections 11-15: Cross-cutting concerns (bottlenecks, failures, meta-monitoring, deployment, security)
All three signals share the same OTel Collector fleet and Kafka cluster but diverge at the storage layer.
- New to observability? Start with Sections 1-7 for the core architecture and sizing. Skim Section 8.3 (cardinality) for the scariest problem in metrics. Skip to Section 12 for real failure stories.
- Building something similar? Sections 7-8 have the sizing math and deep dives you need.
- Preparing for a system design interview? Sections 1-8 cover what interviewers expect. Section 11 (bottlenecks) and Section 12 (failures) are common follow-up questions.
TL;DR: Three pipelines sharing one OTel + Kafka collection layer. Metrics: Flink stream aggregation into VictoriaMetrics (Gorilla compression, 1.37 bytes/sample). Traces: tail-based sampling (99.5% volume reduction) into Grafana Tempo on S3. Logs: VictoriaLogs with bloom filter indexing (3x Loki throughput). Grafana correlates all three via exemplars and trace_id. The hardest problems: cardinality control for metrics, sampling strategy for traces, volume control for logs.
1. Problem Statement
A few clarifications before getting into the architecture:
Scale clarification: This design targets extreme hyperscale. Most organizations ingest tens of thousands to low single-digit millions of metrics/sec. Only the largest enterprises (thousands of microservices, hundreds of thousands of hosts) reach 10M-50M. The major SaaS platforms operate well beyond that: Datadog processes 100+ trillion events per day across all telemetry types, with billions of data points per second. Uber's M3 aggregates 500M metrics/sec (pre-aggregation) across 6.6 billion active time series, persisting 20M metrics/sec after rollup. Grafana Mimir benchmarked 1 billion active series at ~50M samples/sec on a single cluster. The 500M metrics/sec figure in this design represents a stress-test upper bound, roughly matching Uber's pre-aggregation rate, to pressure-test every layer of the architecture. Section 7.3.1 shows how the same architecture scales down linearly to 10-50M metrics/sec with fewer components.
Assumptions:
- This is a multi-tenant platform (like Datadog, Grafana Cloud, or New Relic). Multiple teams and organizations push metrics into the same infrastructure.
- "Ingest" means accepted into durable storage. Query availability follows within 2 seconds of ingestion.
- Each metric is a (name, label set, timestamp, float64 value) tuple. Labels are key-value pairs like {service="checkout", env="prod", region="us-east-1"}.
- Services expose metrics via Prometheus exposition format or push via OTLP (OpenTelemetry Protocol).
Operating model:
- Responsibility split: The platform team operates collection infrastructure (OTel Collectors, Kafka, storage, dashboards). Tenant teams instrument their applications using OpenTelemetry SDKs, Prometheus client libraries, or auto-instrumentation agents (similar to Datadog/New Relic agents but vendor-neutral). Tenants either expose a /metrics endpoint (pull) or push OTLP to the platform's ingestion endpoint (push). Tenants do not run their own Prometheus servers or storage. The platform handles everything after the application boundary.
- Data transport: For tenants on the same Kubernetes cluster, the platform deploys OTel Collectors as a DaemonSet (one per node) that scrapes tenant pods locally over mTLS with no cross-network hop. For tenants on separate infrastructure, the platform exposes an HTTPS ingestion endpoint (OTLP or Prometheus remote_write) authenticated via per-tenant API key. Both paths converge into the same Kafka-backed pipeline.
Scope: Three Pillars of Observability
A production monitoring platform handles three signals that look nothing alike:
- Metrics answer "how much?" Small numbers (CPU at 45%, 200 req/sec), arriving constantly.
- Traces answer "what happened to this one request?" A call tree across service boundaries with timing data.
- Logs answer "what did the system say?" Structured text messages with full-text search.
Each requires a different storage engine. Metrics are aggregated (average CPU over 5 minutes), traces are looked up by ID (show me request abc123), and logs are searched by keywords (find all "timeout" errors):
| Signal | Data Shape | Size per Event | Storage Engine | Query Pattern |
|---|---|---|---|---|
| Metrics | (name, labels, timestamp, float64) | 1.5-2 bytes (Gorilla compression, Section 8.2) | TSDB (VictoriaMetrics) | Aggregate: rate(x[5m]) |
| Traces | (trace_id, span_id, parent_id, service, duration, attributes) | 500-2000 bytes/span | Columnar/object store (Grafana Tempo) | By trace ID, filter by service+latency |
| Logs | (timestamp, severity, message, structured_fields) | 200-1000 bytes | Log store (VictoriaLogs) | Full-text search, filter by severity |
Trying to shove all three signals into a single TSDB is a design mistake that surfaces at scale.
What NOT to do:
A single Prometheus instance typically handles 3-10 million active series depending on available RAM and scrape interval. At 500M metrics/sec with potentially billions of unique series, that setup will OOM (Out of Memory crash) before lunch. Running 50,000 standalone Prometheus instances isn't a solution either. That's 50,000 things to manage, 50,000 things to fail, and no global query capability.
The other common mistake: storing raw metrics forever. At 500M samples/sec, that's 43.2 trillion samples/day. Even at 1.37 bytes per sample (Gorilla compression), that's ~59 TB/day of hot storage. Without downsampling, the storage bill alone kills the project. Downsampling applies to metrics only. Traces and logs retain full resolution or are discarded entirely (Section 9.6 explains why).
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Ingest metrics via pull (Prometheus scrape) and push (OTLP or Prometheus remote_write) | P0 |
| FR-02 | Support four metric types: counter, gauge, histogram, summary | P0 |
| FR-03 | Real-time dashboard queries with sub-second latency over 15-day window | P0 |
| FR-04 | Continuous alert rule evaluation with configurable thresholds | P0 |
| FR-05 | SLO (Service Level Objective)-based alerting with multi-window burn rate detection | P0 |
| FR-06 | Automatic metric downsampling: raw (15d) → 5-min aggregates (90d) → 1-hour aggregates (forever). Traces and logs are not downsampled. | P0 |
| FR-07 | Multi-tenant data isolation with per-tenant cardinality limits | P0 |
| FR-08 | Recording rules for pre-computing expensive queries | P1 |
| FR-09 | Service discovery (Kubernetes, Consul) for automatic scrape target registration | P1 |
| FR-10 | Label-based access control (RBAC) on dashboards and queries | P1 |
| FR-11 | Alertmanager integration: deduplication, grouping, silencing, routing | P1 |
| FR-12 | Cross-region query federation for global dashboards | P1 |
| FR-13 | Metric relabeling and filtering at ingestion time | P2 |
| FR-14 | Cost attribution per tenant based on active series count and query volume | P2 |
| FR-15 | Distributed trace collection via OTLP with tail-based sampling (keep errors, p99, 0.1% baseline) | P0 |
| FR-16 | Trace-to-metric correlation via exemplars (click from metric panel to trace waterfall) | P1 |
| FR-17 | Centralized log ingestion with full-text search across all fields (50M lines/sec) | P0 |
| FR-18 | Log-to-trace correlation via trace_id field (three-way: metric to trace to log) | P1 |
3. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Ingestion throughput | 500M data points/sec sustained |
| Query latency (15-day window) | p50 < 50ms, p99 < 200ms |
| Query latency (90-day window, downsampled) | p50 < 200ms, p99 < 500ms |
| Ingestion-to-dashboard visibility | < 2 seconds |
| Alert rule evaluation interval | 15 seconds (configurable) |
| Alert firing latency (ingestion to notification) | < 30 seconds |
| Availability | 99.99% (52 min downtime/year) |
| Data durability | 99.999% (no data loss on single node failure) |
| Active time series capacity | 10 billion+ |
| Retention | Raw: 15 days, 5-min: 90 days, 1-hour: indefinite |
| Tenant isolation | No cross-tenant data leakage, noisy neighbor protection |
4. High-Level Approach & Technology Selection
4.1 Why This Architecture
A metrics monitoring system at 500M/sec is write-heavy and read-light, but when reads happen, they need to be fast. The write-to-read ratio is roughly 1000:1. Millions of data points flow in every second. A handful of dashboards and alert rules query that data.
- Append-only writes. Metrics are immutable. "At 3:14pm, CPU was 45%" never changes. The storage engine only appends, never seeks and updates, which is far faster.
- Parallel everything. 500M writes per second on one machine is impossible. Spread the work across thousands of machines, each handling a slice of the data.
- No read-before-write. At 500M/sec, checking whether a row exists before writing would be the bottleneck. Skip the lookup entirely.
- Shard by metric identity. Assign each metric to a specific storage node based on its fingerprint (a hash of the metric name + labels). All samples for one metric always go to the same node. This keeps related data together, which is critical for compression and fast queries.
- Compress aggressively. Time-series data has unique properties (timestamps arrive at regular intervals, values often repeat or change slowly) that enable 12x compression with the right algorithm.
- Touch minimal data on reads. Inverted indexes map label selectors to series IDs. Recording rules pre-compute expensive aggregations so dashboards never scan raw data.
- Tier retention automatically. Hot data (raw, 15 days) on fast storage. Warm data (5-min aggregates, 90 days) on object storage. Cold data (1-hour aggregates) archived indefinitely.
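A toy illustration of the "compress aggressively" point (this is not Gorilla's actual bit-level encoding, which also XOR-compresses values; it only shows the timestamp half of the idea): when scrapes arrive at a regular interval, second-order timestamp deltas collapse to zeros, which encode in roughly one bit each.

```python
def delta_of_deltas(timestamps):
    """Second-order differences: regular intervals collapse to zeros."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# Scrapes every 15s (millisecond timestamps): deltas are constant,
# so every delta-of-delta is 0 and costs almost nothing to encode.
ts = [1709827200000 + i * 15000 for i in range(6)]
print(delta_of_deltas(ts))  # [0, 0, 0, 0]
```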
4.2 Core Components
The pipeline has five components:
- Collection layer - OpenTelemetry Collectors scrape metrics from services (pull) or accept pushed metrics (OTLP). Service discovery automates target registration.
- Ingestion buffer - Kafka absorbs write bursts, decouples collection from storage, and provides replay capability for recovery.
- Stream aggregation - Flink pre-computes recording rules, rollups, and SLO burn rates on the ingest path. Raw metrics become queryable aggregates at this stage.
- Time-series storage - VictoriaMetrics cluster stores hot data with Gorilla compression. S3 + Parquet stores downsampled data for long-term retention.
- Query and alerting - MetricsQL engine serves dashboard queries. Distributed rule evaluators continuously check alert conditions.
4.3 Technology Selection
| Component | Technology | Why This Choice |
|---|---|---|
| Collection | OpenTelemetry Collector | CNCF standard. Pull (Prometheus scrape) and push (OTLP) in one agent. 700+ integrations. Vendor-neutral. |
| Ingestion buffer | Apache Kafka | Replay on storage failure (24h retention), fan-out to Flink + vminsert, burst absorption at 500M/sec. RF=3, acks=all. Partitioned by metric fingerprint for locality. |
| Stream aggregation | Apache Flink | Native streaming (not micro-batch like Spark). Managed state with RocksDB backend. Exactly-once via distributed snapshots. Handles 50M+ keys without GC pressure. |
| TSDB (hot, 0-15d) | VictoriaMetrics Cluster | 2.2M data points/sec per node baseline (5M+ on production-grade hardware, see Section 7.3). PromQL-compatible. Gorilla compression built-in. Shared-nothing architecture. |
| Long-term storage (warm/cold) | S3 + Apache Parquet | Columnar format enables partition pruning. ZSTD compression. Massively lower storage footprint vs hot TSDB. Inspired by Thanos and InfluxDB 3.0 FDAP stack. |
| Inverted index | VictoriaMetrics mergeset | Maps label combinations to series IDs. Handles 100M+ series without OOM. Disk-backed, not in-memory like Prometheus. |
| Query engine | MetricsQL | PromQL superset. Keeps metric names after aggregations. Better high-cardinality handling. WITH templates for complex queries. Drop-in replacement for existing PromQL users. |
| Alert evaluation | vmalert + Alertmanager | vmalert evaluates rules against vmselect, sharded across pods for scale (similar pattern to Grafana Mimir ruler). Alertmanager handles dedup, grouping, routing to PagerDuty/Slack/Opsgenie. |
| Service discovery | Kubernetes API + Consul | Auto-discover scrape targets. No manual configuration when services scale. Label-based relabeling for metric filtering. |
| Visualization | Grafana | Industry standard. Multi-datasource. Dashboard load target <3s. Variable templating for multi-service views. |
| Compression | Gorilla + ZSTD | Not stacked: different compressors for different tiers. Gorilla compresses raw samples in vmstorage (hot, 0-15 days): 12x ratio, 1.37 bytes/sample. When data ages out, Flink produces aggregated doubles (min/max/avg/sum/count) that are no longer sequential time-series samples, so Gorilla doesn't apply. ZSTD compresses these Parquet blocks on S3 (warm/cold, 15+ days): ~3x ratio, tunable levels (1-22). |
| Multi-tenancy | Tenant ID header + per-tenant series limits | Inject tenant_id on remote write. Reject writes exceeding cardinality budget. Shuffle-sharding for isolation. |
| Trace backend | Grafana Tempo | S3-native Parquet storage. No index to maintain (bloom filters for trace ID lookup). TraceQL query language. Fraction of Jaeger+Elasticsearch infrastructure at scale. |
| Log backend | VictoriaLogs | Bloom filter per-token index on all fields. LogsQL for full-text search. vlinsert/vlselect/vlstorage cluster scales horizontally. Section 10.2 benchmarks against Loki and Elasticsearch. |
| Context propagation | W3C Trace Context | Industry standard traceparent header. OTel SDK auto-injects for HTTP/gRPC. Manual injection required for Kafka message headers. |
Why Kafka in the metrics path: Most open-source metrics systems (Prometheus, Thanos, Mimir) skip Kafka entirely, writing directly from collectors to storage via remote_write. That works well below ~50M metrics/sec. At 500M metrics/sec, Kafka provides three capabilities a direct pipeline cannot:
- Replay after storage failures. If vmstorage crashes, Kafka retains 24 hours of raw metrics for re-ingestion without data loss.
- Fan-out to multiple consumers. Both Flink (for recording rules) and vminsert (for raw storage) consume the same topic independently.
- Burst absorption during scrape storms. When thousands of targets align their scrape intervals, Kafka absorbs the microsecond-level burst while vminsert consumes at a steady rate.
At smaller scales (<50M metrics/sec), a direct OTel Collector to vminsert pipeline via remote_write is simpler, lower latency, and recommended.
4.4 Why VictoriaMetrics Over Alternatives
| Dimension | VictoriaMetrics | InfluxDB 3.0 | TimescaleDB | M3DB (Uber) |
|---|---|---|---|---|
| Ingestion rate (single node) | 2.2M pts/sec | 330K pts/sec | 480K pts/sec | ~500K pts/sec |
| RAM usage (4M series benchmark) | 6 GB | 20.5 GB | 2.5 GB | ~12 GB |
| Disk usage (4M series) | 3 GB | 18.4 GB | 52 GB | ~8 GB |
| Query language | MetricsQL (PromQL superset) | SQL (via DataFusion) | SQL | M3QL + PromQL |
| Compression | Gorilla (1.37 bytes/sample) | Arrow + Parquet + ZSTD | PostgreSQL TOAST | M3TSZ (Gorilla variant) |
| Architecture | Shared-nothing (vminsert/vmselect/vmstorage) | Monolithic rewrite (Rust) | PostgreSQL extension | Quorum-based (3-way replication) |
| Operational complexity | Low (stateless query/insert, stateful storage only) | Medium (newer, less battle-tested) | Medium (PostgreSQL tuning required) | High (etcd dependency, complex topology) |
Why VictoriaMetrics wins here:
VictoriaMetrics ingests 6.6x faster than InfluxDB at one-third the RAM. The shared-nothing architecture means vminsert nodes are stateless and horizontally scalable. Each vmstorage node owns its shard independently. No quorum overhead, no coordination on writes. For a 500M/sec write-heavy workload, that absence of coordination is critical.
The three components: VictoriaMetrics cluster splits into three specialized roles:
- vminsert (write router): Receives metrics and uses consistent hashing on the metric fingerprint to route each sample to the correct vmstorage node. Stateless (no local data), scales horizontally.
- vmstorage (storage engine): Each node owns a shard of time series data on local NVMe. Handles Gorilla compression, maintains the inverted index, and runs compaction. The only stateful component.
- vmselect (query engine): Receives MetricsQL queries, fans them out to all vmstorage nodes in parallel, merges results, and applies functions like rate(). Also stateless.
Each dimension (write throughput, storage capacity, query concurrency) scales independently by adding nodes to the relevant component.
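To make the fingerprint-to-node routing concrete, here is Jump Consistent Hash (Lamping & Veach), one well-known consistent-hashing scheme whose key property is that growing the node count by one remaps only ~1/n of series. Whether VictoriaMetrics uses this exact algorithm internally is an assumption for illustration, not a claim:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump Consistent Hash: maps a 64-bit key to a bucket in
    [0, num_buckets) with minimal remapping when buckets are added."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int(float(b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

fingerprint = 0x7A3F2B1C
node = jump_hash(fingerprint, 200)  # pick one of 200 vmstorage nodes
```

Adding a 201st node moves roughly 0.5% of series to the new node; the other 99.5% keep their compression-friendly locality on the node they already live on.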
The PromQL compatibility matters too. The entire alerting ecosystem (Alertmanager, Grafana, recording rules) speaks PromQL. Choosing a SQL-native TSDB like InfluxDB 3.0 means rebuilding that tooling from scratch. MetricsQL extends PromQL without breaking it.
M3DB is battle-proven at Uber (6.6B series), but the operational complexity is much higher. etcd cluster management, complex topology, and 3-way quorum writes add latency and failure modes. VictoriaMetrics achieves comparable scale with a simpler operational model. Use Prometheus as the format and query language standard. Replace the storage layer with VictoriaMetrics.
4.5 Why Flink for Stream Aggregation
| Dimension | Apache Flink | Spark Structured Streaming | Kafka Streams |
|---|---|---|---|
| Processing model | True streaming (event-by-event) | Micro-batch (100ms+ intervals) | True streaming |
| State management | RocksDB backend (spills to disk) | Heap-based (GC pressure) | RocksDB or in-memory |
| Exactly-once | Distributed snapshots (Chandy-Lamport) | Offset tracking | Kafka transactions |
| Scalability | Independent of Kafka (separate cluster) | Independent cluster | Tied to Kafka partitions |
| Windowing | Event-time, processing-time, session | Event-time, processing-time | Event-time, session |
| State size at 500M/sec | 50M+ keys, multi-GB, no GC pauses | Multi-second GC pauses at scale | Limited by Kafka partition count |
For this system: 500M metrics/sec flowing through recording rules, rollup aggregations, and SLO burn rate calculations. At peak, that's 50 million active aggregation keys in state. Flink's RocksDB state backend handles this without heap pressure. Spark's JVM heap would cause multi-second GC pauses that violate the 2-second ingestion-to-visibility target.
Why not the Prometheus/Mimir ruler? The ruler evaluates recording rules by querying the TSDB every 15-30 seconds. At moderate scale that works. At 500M metrics/sec, each evaluation cycle competes with dashboard and alert queries for vmselect capacity. Flink processes metrics inline during ingestion with zero query overhead. The trade-off: Flink adds operational complexity. Below ~50M metrics/sec, the ruler is simpler and sufficient.
Scope: Flink processes metrics only. Traces and logs bypass it entirely (Sections 9 and 10); they don't need pre-aggregation or downsampling.
5. High-Level Architecture
5.1 Bird's-Eye View
Component glossary (numbered steps match the diagram):
(1) Collection (two models):
- Pull (same K8s cluster): The platform's OTel Collector DaemonSet runs one collector per K8s node (~3,000 collectors total). It scrapes local pods' /metrics endpoints over mTLS. No cross-network hop. Service Discovery (K8s API, Consul) automatically finds new scrape targets as services scale.
- Push (external infrastructure): Services on separate clusters, on-prem, or serverless can't be scraped. They push metrics, traces, and logs via OTLP to the OTel Collector Gateway over HTTPS, authenticated by per-tenant API key.
(2) Kafka (ingestion buffer):
- metrics-ingestion: 10,000 partitions, 24h retention, partitioned by metric fingerprint. Enables replay if storage crashes.
- traces-raw: partitioned by trace_id so all spans from one trace land together.
- logs-raw: partitioned by service_name for locality.
(3)→(4) Processing:
- Flink computes recording rules, 5-minute rollups (20x data reduction), and SLO burn rate calculations on the metrics stream.
- Tail-Based Sampler keeps errors, p99 latency, and a small random baseline (Section 9.2 covers the full sampling strategy).
- Log Parser extracts structured fields (service name, log level, trace_id) and routes by severity.
(5) Storage:
- vminsert → vmstorage → S3 Parquet: Stateless write router hashes metrics to storage nodes. vmstorage compresses with Gorilla (1.37 bytes/sample) on NVMe (15-day hot tier). Aged data exports to S3 as ZSTD-compressed Parquet (warm 90 days, cold 1 year).
- Tempo Ingesters → S3 Parquet: Batches sampled spans into Parquet blocks on S3. Bloom filters enable fast trace ID lookup.
- vlinsert → vlstorage: Distributes logs across storage nodes. Bloom filter index on every field. Local NVMe storage.
(6) Query:
- vmselect: Fans out MetricsQL queries to all vmstorage nodes, merges results.
- Tempo Query: TraceQL queries against S3 Parquet blocks.
- vlselect: LogsQL full-text search across vlstorage nodes.
- Grafana: Unified dashboard querying all three backends. Exemplars link metrics to traces. trace_id links traces to logs.
The architecture splits into a control plane (tenant configs, alert/recording rules, cardinality budgets, RBAC, managed via API) and a data plane (OTel Collectors, Kafka, Flink, VictoriaMetrics, Tempo, VictoriaLogs, the telemetry pipeline). They scale independently and deploy separately.
All three signals share the same OTel Collectors and Kafka cluster, then split at the storage layer: each signal routes to a storage engine built for its access pattern. Section 6.3 details the full Kafka topic layout.
Pull vs Push collection. The OTel Collector normalizes both models into the same protobuf format.
How each signal reaches the platform:
| Signal | Same K8s cluster | External infrastructure | Protocol |
|---|---|---|---|
| Metrics | Pull: Collector scrapes /metrics (local, mTLS) | Push: App sends OTLP to collector gateway (HTTPS) | Prometheus text (pull) or OTLP protobuf (push) |
| Traces | Push: OTel SDK sends spans to local collector (gRPC) | Push: OTel SDK sends spans to collector gateway (gRPC/HTTPS) | OTLP protobuf over gRPC (default) or HTTP |
| Logs | File tail: Collector reads /var/log/pods/*/ (no network) | Push: App sends OTLP logs to collector gateway (gRPC/HTTPS) | Local file I/O (same cluster) or OTLP protobuf (external) |
After Kafka, each signal diverges to its specialized storage pipeline (see numbered steps above). Grafana correlates all three via exemplars and trace_id fields.
6. Data Model
6.1 Metric Wire Format
Every metric sample follows the Prometheus data model:
metric_name{label1="value1", label2="value2"} float64_value unix_timestamp_ms
Example:
http_requests_total{service="checkout", method="POST", status="200", region="us-east-1"} 142857 1709827200000
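A minimal parser for a line in that shape (a sketch only: the full Prometheus text format also handles escapes, comments, exemplars, and NaN/Inf values, which this ignores):

```python
import re

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional {key="value", ...}
    r'\s+(?P<value>\S+)'                    # float64 value
    r'(?:\s+(?P<ts>\d+))?$'                 # optional ms timestamp
)

def parse_sample(line: str):
    m = LINE_RE.match(line.strip())
    if not m:
        raise ValueError(f"unparseable sample: {line!r}")
    labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels") or ""))
    ts = int(m.group("ts")) if m.group("ts") else None
    return m.group("name"), labels, float(m.group("value")), ts

name, labels, value, ts = parse_sample(
    'http_requests_total{service="checkout", method="POST"} 142857 1709827200000'
)
```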
Why fingerprints? At 500M metrics/sec, the system needs deterministic routing: each metric must always land on the same Kafka partition and storage node. A fingerprint hash of the metric name + labels provides this.
The series identity (fingerprint) is computed as:
fingerprint = FNV-1a(sorted(metric_name + labels)) // FNV-1a: a fast, non-cryptographic hash function
↑ Labels sorted alphabetically by key name before hashing.
Different producers may emit labels in different order
(e.g., {service, method} vs {method, service}).
Sorting guarantees the same series always produces the same fingerprint.
Example:
metric: http_requests_total
labels: {method="POST", region="us-east-1", service="checkout", status="200"}
fingerprint: hash("http_requests_total\x00method\x00POST\x00region\x00us-east-1\x00service\x00checkout\x00status\x00200")
→ 0x7a3f2b1c (used for Kafka partitioning and storage sharding)
Why the fingerprint matters: This single hash drives data locality at every layer.
- Kafka uses it as the partition key. All samples for one metric land on the same partition, so Flink can process that metric's recording rules without fetching data from other partitions.
- vminsert uses it to route all data for one metric to the same storage node. Keeping data together on one node is critical for compression (the compressor needs consecutive values to find patterns) and for queries (look up one node, not all 200).
- S3 Parquet files are keyed by fingerprint range. When querying cold data, the system skips entire files whose fingerprint range doesn't match, instead of scanning everything.
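The fingerprint computation above can be sketched as follows. The FNV-1a constants are the standard 64-bit ones and the `\x00`-separated byte layout follows the example; a production implementation may differ in byte layout, but the sort-then-hash structure is the point:

```python
FNV_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3        # standard 64-bit FNV prime

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def fingerprint(metric_name: str, labels: dict) -> int:
    # Sort labels by key so producers emitting different label
    # orders still yield the same series identity.
    parts = [metric_name]
    for k in sorted(labels):
        parts.extend([k, labels[k]])
    return fnv1a_64("\x00".join(parts).encode())

a = fingerprint("http_requests_total", {"service": "checkout", "method": "POST"})
b = fingerprint("http_requests_total", {"method": "POST", "service": "checkout"})
assert a == b                # label order no longer matters
partition = a % 10_000       # deterministic Kafka partition (Section 6.3)
```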
6.2 Metric Types
| Type | Description | Storage Behavior |
|---|---|---|
| Counter | Monotonically increasing value (resets on process restart) | Store raw value. rate() computed at query time. |
| Gauge | Arbitrary value that goes up and down | Store raw value. |
| Histogram | Observations bucketed into configurable ranges. Each bucket counts how many observations fell at or below that boundary. | Expands to multiple series: _bucket{le="X"} (one per bucket, where le = "less than or equal to", the upper boundary), plus _sum and _count. A histogram with 10 buckets = 12 time series per unique label combination. |
| Summary | Pre-computed quantiles on the client side | Expands to {quantile="X"}, _sum, _count. |
Why histograms explode into multiple series:
A counter metric (http_requests_total) is one number: "142,857 requests." One series. A histogram metric (http_request_duration_seconds) is a distribution: "How many requests took 0-10ms? How many took 10-50ms? 50-100ms?" Each bucket is a separate series. With 10 buckets plus a _sum and _count, one histogram becomes 12 series. Now multiply by every unique label combination and this is where cardinality explodes. A single histogram metric with 10 buckets, exposed by 1,000 pods, with 5 label dimensions (service, method, status, region, zone) each having moderate cardinality:
Series count = buckets × pods × label_combinations
= 12 × 1,000 × (5 services × 4 methods × 5 statuses × 3 regions × 2 zones)
= 12 × 1,000 × 600
= 7,200,000 series from ONE histogram metric
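The same arithmetic as a reusable helper (the label dimensions are the assumed ones from the example, not universal constants):

```python
from math import prod

def histogram_series_count(buckets: int, pods: int,
                           label_cardinalities: dict) -> int:
    # One histogram expands to buckets + _sum + _count series
    # per unique label combination, per pod.
    series_per_combo = buckets + 2
    return series_per_combo * pods * prod(label_cardinalities.values())

n = histogram_series_count(
    buckets=10,
    pods=1_000,
    label_cardinalities={"service": 5, "method": 4, "status": 5,
                         "region": 3, "zone": 2},
)
print(n)  # 7200000
```

Running this kind of calculation before shipping a new histogram is cheap; discovering the 7.2M-series bill in the TSDB afterwards is not.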
6.3 Kafka Topic Layout
Topic: metrics-ingestion
Producer: OTel Collectors (scrape + OTLP push)
Consumer: Flink (recording rules, rollups) + vminsert (raw metric writes)
Partitions: 10,000
Replication Factor: 3
Retention: 24 hours (replay buffer for recovery)
Partitioning key: metric fingerprint (FNV-1a hash)
Value format: Protobuf (TimeSeries message)
Topic: metrics-aggregated
Producer: Flink (processed recording rules and rollups)
Consumer: vminsert (aggregated metric writes)
Partitions: 5,000
Replication Factor: 3
Retention: 12 hours
Topic: metrics-alerts
Producer: Flink (alert events when burn rate thresholds breach)
Consumer: Alertmanager (routes and deduplicates alerts)
Partitions: 100
Replication Factor: 3
Retention: 7 days
6.4 Downsampled Parquet Schema (Warm/Cold Tier)
When raw metrics age past 15 days, they don't stay in vmstorage. Flink writes 5-minute rollups to S3 as Parquet files during ingestion (Section 8.6). A nightly batch job further aggregates those into 1-hour rollups. vmselect reads these files when a query spans beyond the hot tier, using partition pruning on tenant, resolution, and time range to skip irrelevant files. Each row stores five aggregate values (min, max, avg, sum, count) because different query functions need different aggregates: max() reads value_max, rate() uses value_sum / value_count, and so on.
-- Parquet file layout on S3
-- Path: s3://metrics-archive/{tenant_id}/{resolution}/{year}/{month}/{day}/{fingerprint_range}.parquet
-- Schema:
CREATE TABLE downsampled_metrics (
fingerprint UINT64, -- Metric identity
metric_name STRING,
labels MAP<STRING, STRING>,
timestamp_ms INT64, -- Aligned to resolution boundary
value_min DOUBLE,
value_max DOUBLE,
value_avg DOUBLE,
value_sum DOUBLE,
value_count INT64 -- Number of raw samples in this window
)
-- Hive-style partition columns (resolution is '5m' or '1h'),
-- encoded in the S3 path above rather than stored per row
PARTITIONED BY (tenant_id STRING, resolution STRING, year INT, month INT);
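A sketch of the rollup that produces those five aggregate columns (window alignment and field names are taken from the schema above; the real Flink job is windowed and incremental rather than batch over a list):

```python
from collections import defaultdict

def rollup(samples, resolution_ms=300_000):
    """Aggregate (fingerprint, timestamp_ms, value) samples into
    per-window min/max/avg/sum/count rows, aligned to 5-min boundaries."""
    windows = defaultdict(list)
    for fp, ts, value in samples:
        windows[(fp, ts - ts % resolution_ms)].append(value)
    return [
        {
            "fingerprint": fp,
            "timestamp_ms": window_start,
            "value_min": min(vals),
            "value_max": max(vals),
            "value_avg": sum(vals) / len(vals),
            "value_sum": sum(vals),
            "value_count": len(vals),
        }
        for (fp, window_start), vals in windows.items()
    ]
```

This is why the schema keeps all five aggregates: max() over downsampled data reads value_max, avg() reads value_sum / value_count, and neither can be reconstructed from the other.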
7. Back-of-the-Envelope Estimation
7.1 Ingestion Volume
Target: 500M metrics/sec (peak)
Source split (assumed):
Infrastructure metrics (CPU, memory, disk, network): 30% → 150M/sec
Application metrics (request rate, latency, errors): 40% → 200M/sec
Business metrics (orders, revenue, conversions): 10% → 50M/sec
Custom/user-defined metrics: 20% → 100M/sec
Daily volume:
500M/sec × 86,400 sec/day ← 86,400 = 24 × 60 × 60
= 43.2 trillion samples/day ← ~43T samples
At 40% average utilization (not peak all day):
200M/sec × 86,400 = 17.28T samples/day ← actual sustained rate
7.2 Storage Sizing
Per-sample size:
Raw (uncompressed): 16 bytes (8B timestamp + 8B float64 value)
Gorilla compressed: 1.37 bytes/sample ← 16 / 1.37 = 12x compression
With labels + index overhead: ~2 bytes/sample ← use this for all estimates below
─── Hot tier (raw, 15 days) ───
500M/sec × 2 bytes = 1 GB/sec ← raw ingestion rate, easy anchor
1 GB/sec × 86,400 = 86.4 TB/day ← one day of hot storage
86.4 TB × 15 days = 1,296 TB (~1.3 PB) ← 15 days at peak
At 40% average utilization:
1.3 PB × 0.4 = ~518 TB ← actual provisioned capacity
─── Warm tier (5-min aggregates in Parquet on S3, 90 days) ───
500M/sec ÷ 20 = 25M aggregated samples/sec ← 20 raw samples per 5-min window become 1
25M × 5 values = 125M values/sec ← each aggregate stores min, max, avg, sum, count
125M × 8 bytes = 1 GB/sec ← raw doubles, no Gorilla (not sequential anymore)
1 GB/sec ÷ 3 (ZSTD) = 333 MB/sec ← ZSTD compresses ~3x on float64 arrays
333 MB/sec × 86,400 = 28.8 TB/day ← one day of warm storage
28.8 TB × 90 days = 2,592 TB (~2.6 PB) ← 90 days of warm
─── Cold tier (1-hour aggregates in Parquet on S3, 1 year) ───
25M/sec ÷ 12 = ~2.08M aggregated samples/sec ← twelve 5-min windows per hour
2.08M × 5 × 8 bytes = 83 MB/sec ← same 5-value pattern as warm
83 MB/sec ÷ 3 (ZSTD) = ~28 MB/sec ← after compression
28 MB/sec × 86,400 = 2.4 TB/day ← one day of cold storage
2.4 TB × 365 days = 870 TB ← one year of cold
─── Total ───
Hot: 518 TB (vmstorage NVMe, 15 days at 40% avg)
Warm: 2,592 TB (S3 Parquet, ~2.6 PB, 90 days)
Cold: 870 TB (S3 Parquet, 1 year)
Total: ~4.0 PB
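The tier arithmetic above, reproduced as a script so the assumptions (2 bytes/sample hot, 20:1 then 12:1 rollup ratios, 5 float64 aggregates per window, ~3x ZSTD) can be varied:

```python
DAY = 86_400          # seconds per day
rate = 500e6          # peak samples/sec

# Hot: 2 bytes/sample (Gorilla + label/index overhead), 15 days, 40% avg util
hot_tb = rate * 2 * DAY * 15 * 0.4 / 1e12

# Warm: 5-min rollups (20:1), 5 float64 aggregates each, ZSTD ~3x, 90 days
warm_tb = (rate / 20) * 5 * 8 / 3 * DAY * 90 / 1e12

# Cold: 1-hour rollups (a further 12:1), same aggregate layout, 1 year
cold_tb = (rate / 20 / 12) * 5 * 8 / 3 * DAY * 365 / 1e12

print(f"hot={hot_tb:.0f} TB  warm={warm_tb:.0f} TB  cold={cold_tb:.0f} TB  "
      f"total={(hot_tb + warm_tb + cold_tb) / 1000:.1f} PB")
```

The interesting sensitivity: warm dominates total storage, so the 5-min retention window (90 days) is the knob that moves the bill most.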
7.3 Component Sizing
OTel Collectors:
Each collector: ~170K metrics/sec on c6g.xlarge (4 vCPU, 8GB)
500M / 170K = ~3,000 collectors ← Go-based, CPU-efficient for protobuf
Deployed as DaemonSet (1 per node) + dedicated fleet for high-volume targets
Kafka:
Per-broker sustained: ~250K msg/sec (RF=3, acks=all) ← LinkedIn runs ~20K/sec in prod;
Confluent shows 500K ideal, RF=3 is lower
500M / 250K = 2,000 brokers
Partitions: 10,000 ← enough for parallelism without overhead
Flink:
Each TaskManager: ~1M events/sec (4 slots × 250K each) ← Amazon Managed Flink: ~28K/sec per KPU;
dedicated r6g.2xlarge has much more headroom
500M / 1M = 500 TaskManagers ← 2,000 parallel tasks (4 slots each)
VictoriaMetrics (vmstorage):
Benchmark: 100M on 46 nodes = 2.17M/node
On i3.4xlarge (16 vCPU, 122GB, 3.8TB NVMe): ~5M/sec ← ~2.3x benchmark with 4x hardware
500M / 5M = 100 vmstorage nodes
× 2 (RF) = 200 vmstorage nodes
200 × 3.8 TB = 760 TB ← covers 518 TB hot + WAL/compaction headroom
vminsert (stateless):
500M / 12.5M per node = 40 vminsert nodes
vmselect (stateless):
Start with 20 nodes. Scale based on dashboard concurrency.
7.3.1 Right-Sizing for Smaller Scale
The numbers above target 500M metrics/sec. The architecture scales down linearly:
| Scale | OTel Collectors | Kafka Brokers | Flink TaskManagers | vmstorage | vminsert |
|---|---|---|---|---|---|
| 10M/sec | 60 | 40 | 10 | 4 | 2 |
| 50M/sec | 300 | 200 | 50 | 20 | 8 |
| 100M/sec | 600 | 400 | 100 | 40 | 16 |
| 500M/sec | 3,000 | 2,000 | 500 | 200 | 40 |
At 10-50M metrics/sec, Kafka and Flink can be dropped in favor of a direct OTel Collector → vminsert pipeline using Prometheus remote_write, with recording rules moving to the VictoriaMetrics vmalert component (query-based evaluation, similar to the Mimir ruler). The jump to Kafka + Flink is justified when direct remote_write cannot absorb burst traffic or when stream-based recording rules outperform query-based evaluation at the target scale.
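The linear scale-down can be expressed as a small sizing helper. The per-unit throughputs are the document's own Section 7.3 numbers; `size_fleet` is a hypothetical helper, and small deployments will diverge from pure linearity:

```python
import math

# Per-unit throughput assumptions from Section 7.3.
CAPACITY = {
    "otel_collectors": 170_000,     # metrics/sec per c6g.xlarge collector
    "kafka_brokers": 250_000,       # sustained msg/sec per broker at RF=3
    "flink_taskmanagers": 1_000_000,
    "vmstorage": 5_000_000,         # per i3.4xlarge node, before replication
    "vminsert": 12_500_000,
}

def size_fleet(metrics_per_sec: float, storage_rf: int = 2) -> dict:
    """Scale the reference architecture linearly to a target ingest rate."""
    sizes = {name: math.ceil(metrics_per_sec / cap) for name, cap in CAPACITY.items()}
    sizes["vmstorage"] *= storage_rf  # vmstorage nodes are replicated (RF=2)
    return sizes

full = size_fleet(500e6)  # kafka_brokers=2000, vmstorage=200, vminsert=40
```

The raw division gives ~2,942 collectors; the table rounds up to 3,000 for headroom.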
7.4 Query Layer Sizing
Dashboard query assumptions:
50 concurrent Grafana users
20 panels per dashboard
30-second refresh interval
Average query touches 50K series (with recording rules)
Query throughput:
50 × 20 / 30 = 33 queries/sec ← steady-state
Peak (all refresh at once): ~200 queries/sec
vmselect fan-out:
33 queries/sec × 200 nodes = 6,600 vmstorage req/sec ← every query hits every node
Peak: 200 × 200 = 40,000 req/sec
vmselect sizing:
Each node: ~100 concurrent queries
20 nodes = 2,000 concurrent queries ← 10x headroom over 200 peak
Memory: 32 GB per node for result merging + caching
Result cache:
Key: (query_hash, time_range, step)
10 GB per node (LRU eviction)
Hit rate: 60-80% ← dashboard auto-refresh is repetitive
Effective query reduction: 3-5x
7.5 Trace and Log Volume Estimation
Traces and logs add incremental infrastructure on the same shared collection and buffer layers.
Tracing:
500K services × 50 RPS = 25M requests/sec
25M × 8 spans/trace = 200M spans/sec
200M × 1 KB/span = 200 GB/sec ← raw trace volume
200 GB/sec × 86,400 = 17.3 PB/day ← not viable to store 100%
× 0.005 (0.5% sampling) = 1 GB/sec = 86 TB/day ← after tail-based sampling
Tiered storage (180 days): ~15.5 PB total (Section 9.7)
Tempo ingesters: 20, Compactors: 5, +4 Kafka brokers
Logging:
500K services × 100 lines/sec = 50M lines/sec
50M × 500 bytes = 25 GB/sec ← raw log volume
25 GB/sec × 86,400 = 2.16 PB/day
÷ 8 (VictoriaLogs compression) = ~270 TB/day stored ← columnar + LZ4/ZSTD
Tiered storage: ~9.5 PB across active tiers
vlinsert: 17, vlstorage: ~140 (hot + warm), vlselect: 10, +20 Kafka brokers
The same OTel Collectors, Kafka cluster, and Grafana deployment serve all three signals. Only the storage layer diverges.
7.6 Network Bandwidth
Network traffic adds up fast.
Ingestion path (metrics only):
OTel Collectors → Kafka: 1 GB/sec (compressed protobuf)
Kafka replication (RF=3): × 3 = 3 GB/sec cross-broker
Kafka → vminsert: 1 GB/sec
Kafka → Flink: 1 GB/sec
vminsert → vmstorage: 1 GB/sec (distributed across 200 nodes)
Total metrics bandwidth: ~7 GB/sec aggregate
Trace + log paths:
Traces: 1 GB/sec (post-sampling) + 3 GB/sec Kafka replication
Logs: 25 GB/sec raw + 75 GB/sec Kafka replication
Total trace + log bandwidth: ~104 GB/sec aggregate
Query path (bursty):
vmselect → vmstorage fan-out: variable, 1-5 GB/sec during peak dashboard load
vmselect → S3 (warm queries): 50-200 MB/sec (mostly partition-pruned)
Cross-AZ traffic: Kafka brokers, vmstorage, and vlstorage should be spread across 3 availability zones for durability. Cross-AZ data transfer costs ~$0.01/GB on AWS. At 7+ GB/sec for metrics alone, that's ~$6K/day in cross-AZ transfer. Minimize this by co-locating Kafka partition leaders with their primary consumers (vminsert, Flink) in the same AZ where possible, and using AZ-aware rack assignment in Kafka's broker.rack configuration.
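The cross-AZ cost estimate is simple rate arithmetic; a quick check under the quoted ~$0.01/GB assumption:

```python
def cross_az_cost_per_day(gb_per_sec: float, price_per_gb: float = 0.01) -> float:
    """Daily cross-AZ transfer cost in dollars at the quoted ~$0.01/GB rate."""
    return gb_per_sec * 86_400 * price_per_gb

metrics_daily = cross_az_cost_per_day(7)  # the ~7 GB/sec metrics aggregate
print(f"${metrics_daily:,.0f}/day")       # $6,048/day -> "~$6K/day"
```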
Log pipeline dominates bandwidth. At 25 GB/sec raw log volume, the logging pipeline accounts for over 70% of total network traffic. Compressing logs at the OTel Collector (gzip or ZSTD before sending to Kafka) and using Kafka's end-to-end compression (producer compresses, broker stores compressed, consumer decompresses) reduces wire traffic by 3-5x.
8. Deep Dives: Metrics Pipeline
8.1 Ingestion Pipeline
The full ingestion flow:
Key design decisions in the ingestion path:
Fingerprint-based partitioning. All samples for a given time series land on the same Kafka partition (see Section 6.1 for why this matters). Flink gets partition-local state, vmstorage gets sequential writes optimal for Gorilla compression.
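A minimal sketch of fingerprint-based partitioning (the hash is illustrative; VictoriaMetrics and Kafka each use their own hash functions in practice):

```python
import hashlib

def series_fingerprint(name: str, labels: dict) -> int:
    """Stable 64-bit fingerprint over sorted label pairs (illustrative hash)."""
    canonical = name + "".join(f",{k}={labels[k]}" for k in sorted(labels))
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:8], "big")

def kafka_partition(name: str, labels: dict, num_partitions: int = 10_000) -> int:
    """Same series -> same partition, so Flink state stays partition-local and
    vmstorage sees the sequential per-series writes Gorilla wants."""
    return series_fingerprint(name, labels) % num_partitions

p1 = kafka_partition("http_requests_total", {"service": "checkout", "env": "prod"})
p2 = kafka_partition("http_requests_total", {"env": "prod", "service": "checkout"})
assert p1 == p2  # label order never changes the partition
```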
Two Kafka topics. Raw metrics go to metrics-ingestion, Flink outputs go to metrics-aggregated. That way, vmstorage only consumes pre-processed data. If Flink needs to be restarted or redeployed, raw metrics buffer safely in Kafka (24-hour retention) while Flink catches up.
Why not write directly to VictoriaMetrics? A direct OTel → vminsert pipeline is simpler, lower latency, and works well up to ~50M metrics/sec. Beyond that scale, three problems emerge: (1) burst absorption: a deploy that doubles metric volume for 5 minutes overwhelms vminsert with no buffer; (2) replay: if vmstorage goes down and recovers, there's no way to replay the lost window without Kafka retention; (3) fan-out: recording rules, downsampling, and metrics-to-logs correlation all need the same raw stream, and tapping vmstorage's write path is fragile. Kafka solves all three. Cost: operational complexity and ~2 seconds of additional end-to-end latency. Below 50M/sec, skip Kafka and write directly.
Backpressure handling. If vmstorage falls behind, Kafka consumer lag increases. OTel Collectors detect backpressure via Kafka producer latency and can temporarily batch more aggressively (larger batches, less frequent sends). The 24-hour Kafka retention gives enough buffer for most recovery scenarios.
Tenant validation at the edge. OTel Collectors inject the tenant_id header and validate it against a local cache of known tenants. Invalid tenants get rejected before hitting Kafka, which prevents unknown tenants from consuming pipeline resources.
Scrape endpoint security. The /metrics endpoint on tenant pods is not open to arbitrary callers. OTel Collectors and application endpoints authenticate each other via mTLS, with certificates issued by cert-manager in Kubernetes. A collector without a valid certificate cannot scrape, and a pod without a valid certificate is rejected as a scrape target. For push-model ingestion, the platform's OTLP endpoint requires a per-tenant API key in the Authorization header, validated at the OTel Collector gateway before data reaches Kafka.
8.2 Gorilla Compression
VictoriaMetrics uses Gorilla compression from Facebook's 2015 paper, the foundation of time-series compression in every modern TSDB (VictoriaMetrics, M3DB, Prometheus TSDB, InfluxDB all use variants). The algorithm exploits the regularity of time-series data: timestamps arrive at fixed intervals (delta-of-delta compresses to near zero), and consecutive values are often identical or structurally similar (XOR encoding). Net result: 1.37 bytes per sample vs 16 bytes raw, a 12x compression ratio. Without Gorilla, 500M samples/sec at 16 bytes would require ~691 TB/day; with it, ~59 TB/day. In production, chunk headers and label overhead push the effective rate to 1.5-2 bytes per sample (the capacity estimates in Section 7.2 use 2 bytes/sample).
For the full bit-level walkthrough (IEEE 754 XOR encoding, end-to-end 8-sample compression, late scrape impact, and chunk structure), see the Gorilla Compression deep dive.
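The XOR half of the algorithm can be illustrated in a few lines. This is a simplified sketch of the value-compression step only; the real encoder also does delta-of-delta timestamps and bit-level packing:

```python
import struct

def float_bits(x: float) -> int:
    """Raw IEEE 754 bits of a float64."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each sample's bits against the previous sample's (Gorilla's value step).
    A repeated value XORs to 0 (stored as a single bit); a similar value shares
    sign/exponent bits, leaving only a short run of meaningful bits to store."""
    prev = float_bits(values[0])
    out = []
    for v in values[1:]:
        bits = float_bits(v)
        out.append(prev ^ bits)
        prev = bits
    return out

xors = xor_stream([12.0, 12.0, 12.0, 24.0, 12.0])
print([hex(x) for x in xors])  # first two are 0x0: repeated values are nearly free
```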
8.3 High Cardinality: The Silent Killer
Cardinality explosion is the single biggest operational risk in a metrics monitoring system, bigger than disk failures, network partitions, or Kafka lag. At 500M metrics/sec, the vast majority of data points come from infrastructure exporters and framework auto-instrumentation, not custom application code. But it's the custom labels on application metrics that cause cardinality explosions.
What actually happens:
An engineer adds a user_id label to a request latency histogram. It seems reasonable: they want to track per-user latency for debugging.
http_request_duration_seconds_bucket{
service="checkout",
method="POST",
status="200",
user_id="u_abc123", ← This label
le="0.1"
}
The math:
Before user_id label:
1 metric × 12 series (10 buckets + sum + count) × 5 services × 4 methods × 5 statuses = 1,200 series
Memory: ~8 KB per series × 1,200 = 9.6 MB (fine)
After user_id label (10M users):
1 metric × 12 series (10 buckets + sum + count) × 10M users = 120,000,000 series
Memory: ~8 KB per series × 120M = 960 GB
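The before/after arithmetic can be reproduced directly, using the section's ~8 KB-per-series assumption (taken here as 8,000 bytes, matching the 960 GB figure):

```python
def histogram_series(buckets: int, label_cardinalities) -> int:
    """Series per histogram = (buckets + _sum + _count) x label-value product."""
    total = buckets + 2
    for c in label_cardinalities:
        total *= c
    return total

before = histogram_series(10, [5, 4, 5])    # 5 services x 4 methods x 5 statuses
after = histogram_series(10, [10_000_000])  # user_id label alone

print(before, after)        # 1200 120000000
print(after * 8_000 / 1e9)  # 960.0 GB at ~8 KB (8,000 B) per series
```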
Timeline:
00:00 Deploy goes out with user_id label
00:15 First scrape. 120M new series created.
00:16 vmstorage memory jumps from 10 MB to 960 GB
00:17 OOM. vmstorage killed.
00:17 All dashboards go blank. All alerts stop evaluating.
00:18 The monitoring system is dead. Nobody gets paged about it being dead.
Detection:
Monitor the monitoring system's own cardinality metrics:
# Alert: cardinality growing too fast
rate(vm_new_timeseries_created_total[1h]) > 100000
# Alert: approaching tenant series limit
vm_active_timeseries{tenant_id="acme"} / vm_series_limit{tenant_id="acme"} > 0.8
# Dashboard: top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))
Prevention (layered):
- Ingestion-time cardinality limits. vminsert checks the active series count per tenant and rejects new series once the tenant exceeds its budget, returning HTTP 429 with an X-Series-Limit header.
- Label allowlists. The OTel Collector's metric_relabel_configs can drop labels before they reach Kafka. Block known high-cardinality labels (user_id, request_id, trace_id, pod_id) from metric labels.
# OTel Collector relabel config
# labeldrop matches label *names* against regex; it ignores source_labels
metric_relabel_configs:
  - regex: 'user_id|request_id'
    action: labeldrop
- Cardinality analysis at deploy time. A CI/CD check scans metric instrumentation for unbounded label values. If a label's value set is unbounded (user IDs, UUIDs), the pipeline blocks the deploy.
- The "Metrics without Limits" pattern (Datadog-inspired). Decouple ingestion from indexing. Ingest all raw data into Kafka (cheap, append-only), but only index the label combinations that are actually queried. If user_id is never used in a dashboard or alert rule, it's stored but not indexed, and the inverted index stays small. Storage cost increases, but memory and query performance are protected.
8.3.1 Schema Evolution: Label Changes
Metric schemas change. A service renames status to http_status, adds a new region label, or drops an unused version label. Each of these creates a new set of time series because a different label set produces a different fingerprint, which means a different series.
What breaks:
- Dashboards referencing the old label name return empty results
- Alert rules using the old label stop evaluating correctly
- The old series goes stale (5-minute staleness marker), but historical data remains under the old name
- Recording rules that aggregate by the old label silently stop including the renamed service
Migration strategy:
- Dual-emit during transition. The application emits metrics with both old and new label names for one retention period (15 days). Dashboards and alert rules are updated to use the new name. After the transition window, the old label is dropped.
- Relabeling at the collector. If the application can't dual-emit, the OTel Collector's metric_relabel_configs can copy the old label to the new name during the transition:
metric_relabel_configs:
- source_labels: [status]
target_label: http_status
- Dashboard migration tooling. Grafana's provisioning API can batch-update dashboard JSON to replace label references. Script this as part of the label rename runbook.
There is no way to retroactively rename labels in historical data stored in vmstorage or S3 Parquet files. Plan label names carefully. Breaking changes should be coordinated across the team with a defined cutover window.
8.4 Inverted Index at Scale
The inverted index is how the query engine turns a label selector like {service="checkout", env="prod"} into a list of matching time series IDs.
How it works:
VictoriaMetrics mergeset structure:
VictoriaMetrics uses mergeset for its inverted index, an LSM-tree variant optimized for sorted key lookups. New label-to-TSID mappings buffer in memory, flush to immutable sorted parts on disk, and background merges keep the part count small. Queries look up each label selector across parts, then intersect posting lists to find matching series.
High-cardinality labels degrade index performance:
The inverted index for service=checkout might have 1,000 TSIDs in its posting list. Fast to intersect. But user_id=u_abc123 has exactly 1 TSID, and there are 10 million such entries. The index now stores 10 million separate posting lists. Each label lookup scans a massive key range. Intersection becomes trivial (single entry), but the index scan to find it is expensive.
Worse: the mergeset's key space grows linearly with cardinality. More keys means more disk I/O during compaction, more memory for bloom filters, and longer startup times when rebuilding the index.
Bloom filters for segment skipping:
VictoriaMetrics uses bloom filters on each index part to quickly determine if a label value might exist in that part. If the bloom filter says "definitely not here," the part is skipped entirely. For low-cardinality labels (service, env, region), bloom filters skip 95%+ of index parts on first check. For high-cardinality labels, the bloom filter's false positive rate increases and the benefit diminishes.
When a query has multiple label selectors, the engine intersects posting lists using size-adaptive algorithms (galloping search for skewed lists, linear merge for similar-sized lists).
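A simplified stand-in for the skewed-list case: binary-search each entry of the smaller posting list within the larger one. This is illustrative; the real engine chooses between this and a linear merge based on relative list sizes:

```python
from bisect import bisect_left

def intersect_small_large(small, large):
    """Intersect two sorted TSID posting lists by binary-searching each entry
    of the smaller list in the larger one: O(|small| log |large|), which wins
    when the lists are heavily skewed (e.g. a rare label vs env="prod")."""
    out, lo = [], 0
    for tsid in small:
        lo = bisect_left(large, tsid, lo)  # resume from the last match position
        if lo < len(large) and large[lo] == tsid:
            out.append(tsid)
    return out

service_checkout = [3, 7, 19, 42, 77]  # posting list for service="checkout"
env_prod = list(range(0, 100, 7))      # larger posting list for env="prod"
matches = intersect_small_large(service_checkout, env_prod)
print(matches)  # [7, 42, 77]
```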
8.5 Query Engine and Execution
MetricsQL query execution flow:
Query splitting for large time ranges:
When a dashboard requests a 30-day range, the query spans both hot (vmstorage, raw data for last 15 days) and warm (S3, 5-min aggregates for days 16-30). vmselect splits the query:
- Days 1-15: Fan out to vmstorage nodes. Raw data, full resolution.
- Days 16-30: Read from S3 Parquet files. 5-min aggregated data. Partition pruning by tenant_id + time range skips irrelevant files.
- Merge: vmselect stitches the two result sets together, aligning timestamps at the step boundary.
Dashboard experience across tiers:
Querying beyond the hot tier is slower and lower resolution.
| Time Range | Storage Tier | Resolution | Typical Latency | What You Lose |
|---|---|---|---|---|
| Last 15 days | Hot (vmstorage) | 15-sec raw | p99 < 200ms | Nothing, full fidelity |
| 15-90 days | Warm (S3 Parquet) | 5-min aggregates | p99 < 500ms | Short spikes (<5 min) disappear. Dashboard lines are smoother. A 3-minute error burst that was visible in hot tier becomes a single averaged data point in warm. |
| 90+ days | Cold (S3 Parquet) | 1-hour aggregates | 500ms-2s | Only useful for trend analysis ("did error rate increase this quarter?"), not incident investigation. |
Over 95% of dashboard queries are last-15-minutes or last-1-hour (real-time operations). Long-range queries (30-day, 90-day) are infrequent capacity-planning views where 5-min resolution and 200-500ms extra latency are fine. Recording rules help here too. Pre-computed aggregates like service:http_requests:rate5m exist at hot tier resolution regardless of time range, so dashboards built on recording rules never hit S3 for common panels.
Result caching:
vmselect caches query results in its own internal LRU cache, keyed by (query_hash, time_range, step). For dashboard auto-refresh (every 30s), the cache hit rate is high because only the latest data point changes. The cache uses a time-bounded eviction policy: cached results for time ranges that end in the past are valid indefinitely. Results for ranges that include "now" expire after the scrape interval (15s).
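That eviction policy can be sketched in a few lines (illustrative only, not vmselect's actual implementation):

```python
from collections import OrderedDict
import time

class ResultCache:
    """LRU result cache sketch keyed by (query_hash, start, end, step).
    Ranges that end in the past never expire; ranges that include "now"
    expire after one scrape interval."""

    def __init__(self, max_entries: int = 1024, scrape_interval: float = 15.0):
        self._store = OrderedDict()
        self.max_entries = max_entries
        self.scrape_interval = scrape_interval

    def put(self, query_hash, start, end, step, value):
        key = (query_hash, start, end, step)
        touches_now = end >= time.time() - self.scrape_interval
        self._store[key] = (value, time.time(), touches_now)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

    def get(self, query_hash, start, end, step):
        key = (query_hash, start, end, step)
        entry = self._store.get(key)
        if entry is None:
            return None
        value, cached_at, touches_now = entry
        if touches_now and time.time() - cached_at > self.scrape_interval:
            del self._store[key]
            return None  # stale: range includes "now" and is older than one scrape
        self._store.move_to_end(key)  # LRU: mark as recently used
        return value
```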
Tempo uses a separate approach: an external Memcached cluster caches trace query results (Section 11.9 covers this in detail). This keeps repeated trace ID lookups from hitting S3 on every request, which matters because Tempo's query path is entirely S3-backed with no local query cache.
Query priority scheduling. Not all queries are equal. Alert rule evaluation from vmalert is the highest priority. A delayed alert is a missed incident. Dashboard auto-refresh is medium priority. Ad-hoc exploration queries from engineers are lowest priority. vmselect enforces this via per-tenant query concurrency slots: alert rules get dedicated slots that are never preempted, dashboard queries share a pool, and ad-hoc queries use whatever capacity remains.
Recording rules to keep dashboards fast:
Without recording rules, a dashboard panel querying across 500 services scans millions of raw series on every refresh. Recording rules pre-compute these aggregations and store the result as a few hundred series instead of millions. Section 8.8 covers recording rules in detail, including when to use them and their cost.
8.6 Downsampling and Retention
Three storage tiers with automatic lifecycle management:
Stream aggregation (Flink) vs batch rollup:
Two approaches to downsampling, each with tradeoffs:
| Approach | How | Pros | Cons |
|---|---|---|---|
| Stream aggregation | Flink aggregates on the ingest path. Every 5 minutes, emit min/max/avg/sum/count for each series. Write aggregates to warm tier. | No read penalty. Aggregates available immediately. No backfill needed. | Uses Flink state (memory + checkpoint). Must handle late-arriving data. |
| Batch rollup | Background job reads raw data from vmstorage, computes aggregates, writes to S3. | Simple. No ingest-path overhead. Can recompute if logic changes. | Reads data twice (once to store, once to rollup). Creates I/O contention on vmstorage. Rollup delay (hours). |
This design uses stream aggregation for hot-to-warm (Flink computes 5-min rollups on ingest) and batch rollup for warm-to-cold (nightly job reads 5-min Parquet files, re-aggregates to 1-hour, writes new Parquet files).
Each tier reduces volume dramatically: hot-to-warm cuts 20x (raw to 5-min aggregates), warm-to-cold cuts another 12x (5-min to 1-hour). Total storage across all tiers: ~4.0 PB. See Section 7.2 for the full derivation.
Handling late-arriving data:
Flink's 5-min aggregation window uses event-time processing (the sample's own timestamp, not arrival time) with a 2-minute watermark. The watermark signals "no events older than this will arrive," letting Flink close windows. Data arriving up to 2 minutes late lands in the correct window. Data arriving later than 2 minutes is either:
- Dropped (if it falls within the hot tier window and raw data is still available)
- Counted toward the next window (if it falls in the warm tier)
For metrics, late arrivals are rare. Scrape intervals are clock-driven. Delays beyond 2 minutes indicate network failures.
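A toy model of the window-closing rule above (a sketch of the event-time semantics with a single global watermark, not Flink's actual API):

```python
WINDOW = 300    # 5-minute aggregation windows, in seconds
LATENESS = 120  # 2-minute watermark allowance

def assign_window(event_ts: int, watermark: int):
    """Return the window start for an event, or None if the watermark has
    already closed that window."""
    window_start = event_ts - event_ts % WINDOW
    if watermark > window_start + WINDOW + LATENESS:
        return None  # window already emitted; sample is too late
    return window_start

# A sample timestamped at t=270 falls in the [0, 300) window:
assert assign_window(270, watermark=360) == 0     # 90s behind: still accepted
assert assign_window(270, watermark=600) is None  # beyond 2 min: dropped
```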
8.7 Alert Evaluation at Scale
Thousands of alert rules across billions of time series, evaluated every 15 seconds. A single evaluator can't keep up, so the rules are sharded.
Rule format: vmalert reads rules from YAML files on disk, the same format Prometheus uses. Each file contains one or more rule groups. Each group has its own evaluation interval. Two types: alerting rules fire notifications when a condition holds (e.g., error rate > 5% for 2 minutes), and recording rules pre-compute expensive queries and save the result as a new metric so dashboards stay fast.
Alerting rule example:
# /etc/vmalert/rules/shard-0/checkout-alerts.yml
groups:
- name: checkout-alerts
interval: 15s
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 2m
labels:
severity: critical
tenant_id: acme
annotations:
description: "Error rate above 5% for checkout service"
runbook: "https://wiki.internal/runbooks/checkout-errors"
- alert: SLOBurnRateFast
expr: slo:error_budget:burn_rate_5m > 14.4
for: 1m
labels:
severity: critical
annotations:
description: "Fast burn: error budget exhausts in ~2 days at the current rate"
Recording rule example:
# /etc/vmalert/rules/shard-0/checkout-recording.yml
groups:
- name: checkout-recording
interval: 30s
rules:
- record: service:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service)
- record: slo:error_budget:burn_rate_5m
expr: |
1 - (
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) / (1 - 0.999)
Rule management at scale: The platform's /api/v1/rules endpoint provides CRUD operations for rules. Behind the scenes, it stores rule metadata in a database, generates these YAML files, and writes them to the correct vmalert shard's directory. A hot reload via vmalert's /-/reload endpoint picks up changes without restart.
Rule sharding strategy:
Each alert rule is assigned to exactly one vmalert shard. The rule management API uses consistent hashing on the rule name to decide which shard directory (/etc/vmalert/rules/shard-{N}/). When vmalert pods scale up or down, the API reassigns rules and triggers reloads on affected instances.
Each evaluator shard, every 15 seconds:
- Loads its assigned rules from the rule store
- Executes each rule's MetricsQL expression against vmselect
- Compares results against thresholds
- For rules that have been firing for longer than their for duration, sends an alert to Alertmanager
- Updates local state (firing duration, last evaluation time)
SLO-based burn rate alerting:
Traditional threshold alerts ("error rate > 5%") are noisy. SLO-based alerting asks a better question: "At the current error rate, will the error budget run out before the SLO period ends?"
How burn rates work: A 99.9% SLO gives you ~43 minutes of error budget over a 30-day window.
A burn rate measures how fast that budget gets consumed:
- Burn rate 1x: Budget consumed exactly over 30 days. No action needed.
- Burn rate 14.4x: Budget consumed 14.4x faster. Runs out in ~2 days. Page immediately.
- Burn rate 6x: Runs out in ~5 days. Urgent, but hours remain, not minutes.
The formula: burn_rate = current_error_rate / allowed_error_rate. With a 0.1% error allowance and a current rate of 1.44%, the burn rate is 1.44 / 0.1 = 14.4x.
Multi-window burn rates:
Fast burn (last 5 min): error_rate / (1 - 0.999) > 14.4 → page immediately
Medium burn (last 30 min): error_rate / (1 - 0.999) > 6 → page (slower burn)
Slow burn (last 6 hours): error_rate / (1 - 0.999) > 1 → ticket (not a page)
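The burn-rate arithmetic in code, assuming the 99.9% SLO and 30-day window used above:

```python
SLO = 0.999        # 99.9% over a 30-day window (~43 min of error budget)
WINDOW_DAYS = 30

def burn_rate(error_rate: float) -> float:
    """Multiples of the 'exactly on budget' consumption rate."""
    return error_rate / (1 - SLO)

def days_to_exhaustion(error_rate: float) -> float:
    """How long the error budget lasts at the current burn rate."""
    return WINDOW_DAYS / burn_rate(error_rate)

print(round(burn_rate(0.0144), 1))           # 14.4x
print(round(days_to_exhaustion(0.0144), 1))  # ~2.1 days -> page immediately
```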
Alertmanager deduplication and grouping:
When 500 pods fire the same "HighErrorRate" alert, that's 500 PagerDuty notifications without grouping. Alertmanager collapses them:
route:
group_by: ['alertname', 'service', 'env']
group_wait: 30s # Wait 30s to batch alerts in the same group
group_interval: 5m # Wait 5m between re-notifications
repeat_interval: 4h # Re-notify every 4h if still firing
receiver: 'pagerduty-critical'
routes:
- match:
severity: warning
receiver: 'slack-warnings'
- match:
severity: info
receiver: 'slack-info'
500 pod-level alerts get grouped into 1 notification: "HighErrorRate firing for service=checkout, env=prod (500 instances)."
Alert storm protection:
If a widespread failure causes 100,000 alerts to fire in 1 minute (datacenter-level event), even grouped alerts can overwhelm on-call engineers. Protections:
- Inhibition rules. If a datacenter-level alert is firing, suppress all service-level alerts in that datacenter.
- Rate limiting. Alertmanager limits notification sends to 100/minute per receiver. Excess alerts are queued.
- Escalation. If more than 50 unique alert groups fire within 5 minutes, trigger a "mass incident" page to incident command, not individual service owners.
8.8 Recording Rules and Pre-Computation
Recording rules are pre-computed queries that run on the Flink aggregation layer and write results back as new time series. Two reasons to use them:
- Performance. A dashboard querying
sum(rate(http_requests_total[5m])) by (service)across 10M raw series takes seconds. The same query against 500 pre-computed series takes milliseconds. - Consistency. Alert rules and dashboard panels that need the same aggregation use the same recording rule. No divergence from slightly different query syntax.
Where recording rules execute:
Flink receives raw metrics from Kafka. For each recording rule, it maintains a windowed aggregation in state. Every evaluation interval (30s default), it emits the result as a new time series with the recording rule's name.
When to use recording rules vs raw queries:
| Scenario | Use Recording Rule | Use Raw Query |
|---|---|---|
| Dashboard panel queried by multiple users | Yes | No |
| Alert rule evaluated every 15s | Yes | No |
| One-off debugging query | No | Yes |
| Query touches >100K series | Yes | No |
| Query touches <1K series | No | Yes |
| SLO burn rate calculation | Yes (always) | No |
Cost of recording rules: Each rule produces new time series (roughly 2 MB/day for a 500-series rule, negligible). But rules with high-cardinality by clauses (e.g., by (service, endpoint, region, zone)) can produce millions of new series. Estimate rule cardinality before deployment.
Backfill when recording rules change:
Recording rules computed by Flink are forward-only. When you modify a recording rule expression (changing the aggregation logic, adding a new by dimension), the new output starts from the moment the updated Flink job deploys. Historical data under the old expression stays as-is.
If historical backfill is needed (e.g., a new SLO burn rate calculation that should show 30 days of history on day one), run a one-off batch job that reads raw metrics from the S3 warm tier, applies the new recording rule expression, and writes the backfilled series to vmstorage or S3. The Flink job template can be reused with a bounded source (S3 Parquet reader instead of Kafka consumer) for this.
In practice, most recording rule changes don't need backfill. Dashboards showing the new metric simply have no data before the change date, and that's fine. Reserve backfill for SLO calculations where a full window of historical data is required for accurate burn rate alerting from the start.
8.9 Multi-Tenancy and Isolation
At 500M metrics/sec from hundreds of organizations, one tenant's cardinality explosion must never take down another tenant's dashboards.
Tenant identification:
Every request (remote write, query, rule evaluation) carries an X-Scope-OrgID header. This tenant_id propagates through the entire pipeline:
OTel Collector → Kafka (message header) → Flink (state key prefix)
→ vminsert (routing) → vmstorage (directory isolation) → vmselect (query filter)
Per-tenant limits:
| Limit | Default | Purpose |
|---|---|---|
| Max active series | 50M | Prevent cardinality explosion |
| Max ingestion rate | 5M samples/sec | Prevent single tenant from saturating Kafka |
| Max query series | 500K | Prevent dashboard query from scanning too many series |
| Max query duration | 60s | Kill runaway queries |
| Max alert rules | 10,000 | Prevent rule evaluator overload |
| Max label name length | 128 chars | Index size control |
| Max label value length | 1,024 chars | Index size control |
| Max labels per series | 30 | Cardinality control |
Limits are enforced at the earliest possible point:
- Ingestion rate: OTel Collector checks before sending to Kafka
- Series limit: vminsert checks before routing to vmstorage. When a tenant exceeds their series limit, the remote write endpoint returns 429 Too Many Requests with X-Series-Limit and X-Series-Current headers so clients can track cardinality usage against their budget.
- Query limits: vmselect checks before fanning out to vmstorage
Shuffle-sharding for noisy neighbor protection: assign each tenant to a random subset of nodes rather than all of them, so one tenant's failure blast radius is limited to its subset, not the entire cluster.
Each tenant is assigned a subset of vmstorage nodes (4 out of 12 in this example). If Tenant A causes a node to OOM, only nodes 1, 4, 7, 10 are affected. Tenants B and C are on different nodes and see no impact.
The shard assignment uses consistent hashing with the tenant_id as input. When nodes are added or removed, only a fraction of tenant assignments change.
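A sketch of the subset assignment using rendezvous-style hashing, which has the stability property described above (illustrative; systems like Cortex/Mimir implement their own shuffle-shard algorithm):

```python
import hashlib

def tenant_shard(tenant_id: str, nodes, shard_size: int = 4):
    """Pick a stable subset of vmstorage nodes for a tenant by ranking every
    node on hash(tenant_id, node) and keeping the top shard_size. Adding or
    removing one node changes only a fraction of assignments."""
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{tenant_id}/{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(sorted(nodes, key=score)[:shard_size])

nodes = [f"vmstorage-{i}" for i in range(12)]
shard_a = tenant_shard("tenant-a", nodes)
shard_b = tenant_shard("tenant-b", nodes)
print(shard_a)  # 4 of 12 nodes, deterministic per tenant
```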
Cost attribution:
Per-tenant billing based on:
- Active series count (sampled hourly, billed at max)
- Ingestion rate (samples/sec, billed at p95 over the billing period)
- Query volume (number of series touched per query, summed monthly)
- Storage consumed (bytes across all tiers)
8.10 Service Discovery and Scrape
The collection layer must automatically discover new services as they deploy and stop scraping services that disappear. Manual target configuration doesn't work at 500K+ services.
Scope: pull model only. Service discovery applies to same-cluster tenants (see Section 5.1). External tenants push directly and don't need discovery.
Scrape interval tuning: 15-second scrape interval is the default. It aligns with alert evaluation intervals (also 15s), so each evaluation sees fresh data. Going faster rarely improves alerting because the for duration on most alerts is 1-5 minutes anyway. Use 30-60s only for low-priority infrastructure metrics.
Target churn handling:
Kubernetes pods churn constantly, and each new pod creates new series (the pod label changes). When a target disappears, the OTel Collector sends a stale marker (NaN with a staleness bit). VictoriaMetrics stops considering the series "active" after 5 minutes, so stale series don't count against cardinality limits or appear in queries.
Relabeling for metric filtering:
Not every metric from every service is worth storing. The OTel Collector's relabeling pipeline can drop, rename, or filter metrics before they reach Kafka:
metric_relabel_configs:
# Drop Go runtime metrics (noisy, rarely useful)
- source_labels: [__name__]
regex: 'go_(gc|memstats|threads)_.*'
action: drop
# Drop high-cardinality labels
- regex: 'request_id|trace_id|user_id'
action: labeldrop
# Rename for consistency
- source_labels: [__name__]
regex: 'node_cpu_seconds_total'
target_label: __name__
replacement: 'cpu_seconds_total'
Fleet management at scale:
3,000 OTel Collectors need coordinated configuration updates: scrape targets, relabel rules, sampling policies, and receiver settings. Two approaches:
- GitOps pipeline. Collector configuration lives in a Git repository. Changes go through PR review. A CI/CD pipeline renders per-node configs (using templates for node-specific settings like node_name and local pod discovery) and deploys via Kubernetes rolling update of the DaemonSet. Safe, auditable, but slow (minutes for a full rollout).
- OpAMP (Open Agent Management Protocol). The CNCF standard for remote agent management. A central OpAMP server pushes configuration updates to collectors in real time. Collectors report health, effective configuration, and capabilities back to the server. Useful for emergency changes (e.g., adding a labeldrop rule during a cardinality explosion). Faster than GitOps (seconds, not minutes), but requires running and securing the OpAMP server infrastructure.
Most teams use GitOps for planned changes and keep OpAMP as a fast-path override for incidents.
9. Distributed Tracing at Scale
Traces are expensive. From the Section 7 capacity estimate: 200M spans/sec at ~1 KB each produces 17.3 PB/day. At ~$23/TB/month on S3, keeping even a single day of raw traces costs roughly $400K/month. Not viable.
Tail-based sampling (Section 9.2) keeps errors, slow requests, and a 0.1% random baseline, dropping the rest to 86 TB/day. First, the trace span format.
9.1 Trace Span Format
Each trace span captures a single operation within a distributed request. A complete trace is a directed acyclic graph of spans connected by parent-child relationships.
```json
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "e457b5a2e4d86bd1",
  "operationName": "POST /checkout",
  "serviceName": "checkout-service",
  "startTime": 1709827200000,
  "duration": 145,
  "attributes": {
    "http.method": "POST",
    "http.status_code": 200,
    "db.system": "postgresql"
  },
  "events": [{"name": "retry", "timestamp": 1709827200045}]
}
```
500-2000 bytes per span. Variable structure. Rich attributes. A single request generates one metric increment but 5-20 spans across service boundaries. This 100-1000x size difference per event is exactly why traces need a different storage engine than metrics.
9.2 Tech Stack Selection
| Layer | Technology | Why This Choice |
|---|---|---|
| Instrumentation | OpenTelemetry SDK | Same SDKs already deployed for metrics. Produce trace spans with zero new dependencies. CNCF standard. |
| Collection | OTel Collector (same fleet) | Add otlp trace receiver to existing DaemonSet configuration. Same fleet, no new infrastructure for collection. |
| Sampling | OTel Collector tail-based sampling | Buffer spans 60s, keep errors + p99 + 0.1% baseline. Cuts 99.5%. Head-based misses rare errors. |
| Buffer | Kafka (topic: traces-raw) | Same Kafka cluster, separate topic. Isolates trace volume from metric ingestion. Fingerprint-partitioned by trace_id for span co-location. |
| Trace Backend | Grafana Tempo | S3-native storage (no index to maintain). TraceQL for queries. Stateless query layer scales horizontally. Bloom filters for trace ID lookup. |
| Long-term Storage | S3 (Parquet blocks) | Same S3 infrastructure used for metrics warm/cold tiers. Tempo writes trace blocks as Parquet files. S3 lifecycle policies manage tier transitions. |
| Correlation | Exemplars | Link metric data points to the specific trace that caused a spike. Click from a Grafana metric panel directly to the trace waterfall. |
| Context Propagation | W3C Trace Context (traceparent) | Industry standard. OTel SDK auto-injects for HTTP/gRPC. Manual injection required for Kafka message headers. |
| Visualization | Grafana (already deployed) | Native Tempo datasource. Trace waterfall view. Auto-generated service dependency maps from trace data. |
Why tail-based sampling over head-based: Head-based sampling decides at the start of a request (e.g., "keep 1% of all traces") before knowing whether the request will fail or be slow. This inevitably drops error and high-latency traces that are the most valuable for debugging. Tail-based sampling buffers all spans for a configurable window (60 seconds in this design), waits for the complete trace, then makes a keep/drop decision based on outcome: all errors kept, all p99 latency kept, 0.1% random baseline. The cost is memory. Each OTel Collector must buffer ~60 seconds of spans in RAM (see Section 11.8 for sizing).
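As a sketch of how this maps onto the tail_sampling processor from opentelemetry-collector-contrib (thresholds and policy names here are illustrative, and the latency policy is a fixed threshold standing in for "p99"):

```yaml
processors:
  tail_sampling:
    decision_wait: 60s       # buffer window before the keep/drop decision
    num_traces: 4000000      # cap on in-flight traces (memory bound, see Section 11.8)
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: random-baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 0.1}
```

Policies are OR'd: a trace kept by any one policy is kept.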
Why Tempo over Jaeger:
| Dimension | Grafana Tempo | Jaeger |
|---|---|---|
| Storage backend | S3/GCS (object storage) | Elasticsearch or Cassandra |
| Index maintenance | No index. Bloom filters + object keys. | Elasticsearch index management, ILM policies |
| Query language | TraceQL (filter by attributes, duration, status) | Tag-based search (limited) |
| Operational complexity | Stateless query + S3. Minimal ops. | ES/Cassandra cluster management |
| Grafana integration | Native (same vendor stack) | Requires datasource plugin |
Tempo stores traces as Parquet blocks on S3 with bloom filters for trace ID lookups. No inverted index to maintain, no Elasticsearch cluster to babysit. At this scale (hundreds of terabytes per day after sampling), that operational simplicity matters more than Jaeger's slightly more mature UI.
9.3 Data Flow
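A sketch of the trace data flow, assembled from the components in Section 9.2 (same shape as the metrics and log pipelines):

```text
App (OTel SDK, W3C traceparent) --> OTel Collector (otlp receiver)
  --> Tail-based sampling: buffer 60s, keep errors + slow + 0.1% baseline
  --> Kafka topic: traces-raw (partitioned by trace_id)
  --> Tempo ingesters (local SSD) --> Parquet blocks + bloom filters --> S3
  --> Compactor: merge small blocks, apply age-based tiering

Query: Grafana --> Tempo (TraceQL) --> bloom filter lookup --> S3 blocks
```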
9.4 Query Patterns: Metrics vs Traces
Metrics and traces answer different questions. Knowing which to reach for saves time during incidents.
| Question | Metrics (MetricsQL) | Traces (TraceQL) |
|---|---|---|
| "Is checkout slow?" | histogram_quantile(0.99, rate(http_duration_bucket{service="checkout"}[5m])) | {resource.service.name="checkout" && duration > 2s} |
| "Which service causes 500s?" | sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) | {status = error} then inspect span waterfall for root cause |
| "What happened to user X?" | Cannot answer. Metrics are pre-aggregated. | {span.user_id="user_123"} shows the full request call tree |
| "Is the DB the bottleneck?" | rate(db_duration_sum[5m]) / rate(db_duration_count[5m]) | {span.db.system="postgresql" && duration > 500ms} shows exact slow queries |
| "Correlate a metric spike to a trace" | Click exemplar link on Grafana panel | Opens the exact trace that caused the spike |
Exemplars bridge the two: click from a dashboard panel showing a latency spike directly to the trace waterfall that caused it.
9.5 Exemplar Storage and Mechanics
Exemplars link metrics to traces: when the OTel SDK records a histogram observation, it can attach the current trace_id, giving the metric data point a direct pointer to the trace that produced it.
How exemplars flow through the pipeline:
- The OTel SDK records a histogram observation and attaches {trace_id: "abc123", span_id: "def456"} as exemplar metadata.
- The OTel Collector forwards the exemplar alongside the metric sample to Kafka.
- vminsert writes the exemplar to vmstorage. VictoriaMetrics stores exemplars in a separate data structure alongside the time series, not inline with samples.
- When Grafana renders a histogram panel, it queries vmselect for both the metric data and associated exemplars.
- Exemplars appear as dots on the graph. Clicking a dot opens the linked trace in Tempo via the trace_id.
Storage overhead: VictoriaMetrics limits exemplars to 128 per series by default (configurable via -search.maxExemplars). Each exemplar is ~100 bytes (trace_id + span_id + timestamp + labels). At 10 billion active series, even if 1% carry exemplars, that's 100M exemplars x 100 bytes = ~10 GB. Negligible compared to the metric data itself.
Sampling alignment matters. If the trace was dropped by tail-based sampling but the exemplar was kept, clicking the exemplar link leads to a "trace not found" error in Tempo. The sampling decision does not retroactively remove exemplars from metric storage since they're already written. At a 0.5% trace keep rate, roughly 99.5% of exemplar links will be dead ends. In practice, this is acceptable because the exemplars that matter most (errors, p99 latency) correspond to traces that the tail-based sampler keeps by policy. The random 0.1% baseline exemplars are mostly dead links, but they still serve as timestamps showing when specific traces occurred.
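The Section 9.5 storage overhead estimate reduces to one line of arithmetic; a quick check:

```python
# Sanity check of the exemplar overhead estimate (Section 9.5 numbers).
active_series = 10_000_000_000   # 10B active series
exemplar_fraction = 0.01         # assume 1% of series carry an exemplar
bytes_per_exemplar = 100         # trace_id + span_id + timestamp + labels

exemplar_count = active_series * exemplar_fraction
total_gb = exemplar_count * bytes_per_exemplar / 1e9
print(f"{exemplar_count:,.0f} exemplars, ~{total_gb:.0f} GB")  # 100,000,000 exemplars, ~10 GB
```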
9.6 Tiered Storage for Traces
Same hot/warm/cold philosophy as the metrics pipeline (Section 8.6). Traces have different access patterns at different ages.
| Tier | Retention | Storage | Resolution | Query Latency |
|---|---|---|---|---|
| Hot | 0-3 days | Tempo ingesters (local SSD) + S3 Standard | Full spans, all attributes | < 500ms by trace ID |
| Warm | 3-30 days | S3 Standard (compacted Parquet blocks) | Full spans, all attributes | 1-3s (S3 reads + bloom filter lookup) |
| Cold | 30-180 days | S3 Glacier Instant Retrieval | Full spans, all attributes | 5-10s (Glacier retrieval) |
| Archive | 180d+ | S3 Glacier Deep Archive | Full spans (compliance/audit) | Minutes (restore required) |
Unlike metrics, traces are never downsampled. A trace is either useful in full detail or not at all. Cost savings come from tiering across S3 storage classes, not reducing resolution.
Tempo's compactor merges small blocks into larger ones and applies age-based tiering. compactor.compaction.block_retention controls block expiration; S3 lifecycle rules handle Standard-to-Glacier transitions automatically.
9.7 Capacity Estimate
From Section 7.5: raw trace volume is 17.3 PB/day. After tail-based sampling (0.5% keep rate), ingested volume is 86 TB/day.
Tiered storage (180-day retention):
Hot (3 days): 86 TB x 3 = 258 TB on S3 Standard
Warm (27 days): 86 TB x 27 = 2.32 PB on S3 Standard
Cold (150 days): 86 TB x 150 = 12.9 PB on Glacier Instant
Total trace storage: ~15.5 PB
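The tier math above can be verified directly from the 86 TB/day figure:

```python
# Recompute the Section 9.7 trace storage tiers from 86 TB/day.
daily_tb = 86  # TB/day ingested after 0.5% tail-based sampling

hot_tb = daily_tb * 3      # 258 TB on S3 Standard
warm_tb = daily_tb * 27    # 2,322 TB ~ 2.32 PB on S3 Standard
cold_tb = daily_tb * 150   # 12,900 TB = 12.9 PB on Glacier Instant

total_pb = (hot_tb + warm_tb + cold_tb) / 1000
print(f"total ~ {total_pb:.1f} PB")  # total ~ 15.5 PB
```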
Tempo ingester fleet:
1 GB/sec ingest rate. Each ingester handles ~50 MB/sec.
20 ingesters (r6g.2xlarge)
Tempo compactor:
5 compactor nodes (c6g.xlarge)
Kafka additional load:
1 GB/sec trace data on existing cluster
~4 additional brokers (250K msg/sec each)
OTel Collector overhead:
Tail-based sampling requires 4 GB RAM per 100K spans/sec in the decision buffer
200M raw spans/sec across 3,000 collectors = ~67K spans/sec per collector
~2.7 GB additional RAM per collector (fits within c6g.xlarge 8GB with headroom)
17.3 PB/day raw, reduced to 86 TB/day after tail-based sampling. The sampling layer is not optional at this scale. Without it, the storage volume is 200x higher. With it, the entire tracing infrastructure is a fraction of the metrics pipeline.
For deeper coverage of sampling strategies and context propagation patterns, see the Distributed Tracing reference. Failure scenarios for the trace pipeline are covered in Section 12.
10. Centralized Logging
The trace shows which service failed. Logs show why: "connection refused to stripe-api.com:443, certificate expired." The logging pipeline reuses the same OTel Collector fleet and Kafka cluster, diverging only at the storage layer.
10.1 Log Entry Format
Structured logs follow a consistent JSON schema with fields auto-indexed by VictoriaLogs.
```json
{
  "timestamp": "2026-03-07T14:30:00.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "connection refused to stripe-api.com:443, certificate expired",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "a1b2c3d4e5f60718",
  "host": "payment-pod-7b8f9",
  "region": "us-east-1",
  "error.type": "ConnectionRefused",
  "retry_count": 3
}
```
200-1000 bytes per log line. The trace_id field enables log-to-trace correlation: click from a log line in Grafana to open the full trace waterfall in Tempo. VictoriaLogs auto-indexes every field using bloom filters (2 bytes per unique token), so both trace_id lookups and full-text search across the message field are index-assisted.
10.2 Tech Stack Selection
Three serious options exist for log storage at this scale: Elasticsearch, Grafana Loki, and VictoriaLogs. Each takes a very different approach to indexing.
| Dimension | VictoriaLogs | Grafana Loki | Elasticsearch |
|---|---|---|---|
| Indexing strategy | Bloom filters on all tokens. Index-assisted search on every field automatically. | Labels only (service, env, level). Log content is NOT indexed. Brute-force grep on chunks. | Full-text inverted index on all fields. |
| Ingestion throughput | 3x higher than Loki on same hardware. 66 MB/s vs 20 MB/s on 4 vCPU / 8 GiB (benchmark). | Bottlenecks on high-volume ingestion. Requires more nodes. | High throughput but index building is CPU-intensive. |
| Query speed (content search) | Fast. Bloom filter token lookup skips non-matching blocks. Needle-in-haystack: 900ms. | Slow. Brute-force scan after label filter. Needle-in-haystack: 12s. | Fast. Inverted index on all content. |
| Resource efficiency | 72% less CPU, 87% less RAM than Loki. Up to 30x less RAM than Elasticsearch. | Memory-hungry ingesters. 6-7 GiB RAM on same workload. | Highest resource consumption. Large JVM heaps. |
| Storage efficiency | 40% less disk than Loki. Columnar LSM-style storage with compression. | S3-native chunk storage. Label index is small. | Largest footprint. Inverted index adds 2-5x overhead. |
| Best for | High-volume operational logs with content search at scale | Moderate-volume logs with label-first query patterns | Full-text analytics, log-based dashboards, compliance search |
At 25 GB/sec: Elasticsearch requires the most infrastructure due to full-text inverted index overhead on 2.16 PB/day raw volume. Between Loki and VictoriaLogs, the deciding factor is what happens when queries go beyond label filtering. At 500K services, operators frequently search log content across broad service groups ("find all connection timeout errors in the payment domain"). Loki handles this by brute-force scanning chunks after label filtering. VictoriaLogs uses bloom filter token indexes to skip non-matching blocks entirely (see query speed row above).
The tradeoff: Loki stores data natively on S3, making tiered storage simpler. VictoriaLogs uses local disk, similar to how VictoriaMetrics vmstorage operates. For cold/archive tiers, VictoriaLogs relies on vmbackup snapshots to S3 rather than native lifecycle policies. At this scale, the ingestion and memory advantages from the table above outweigh the simpler S3 tiering from Loki.
Chosen stack:
| Layer | Technology | Why |
|---|---|---|
| Collection | OTel Collector (filelog receiver) | Same DaemonSet fleet. Tail container stdout/stderr via Kubernetes log paths. Parse structured JSON logs. |
| Buffer | Kafka (topic: logs-raw) | Same cluster, separate topic. Partitioned by service_name for consumer locality. |
| Storage & Indexing | VictoriaLogs (cluster mode) | Bloom filter per-token index on all fields. Auto-indexes everything. Same ingestion and memory advantages as comparison above. |
| Long-term | Local NVMe on vlstorage + vmbackup to S3 | Hot/warm on local disk. Cold/archive snapshots to S3. Retention managed per log level. |
| Query | LogsQL | Filter and search: _time:5m AND service:checkout AND error AND _stream:{level="error"}. SQL-like syntax with full-text search. |
| Correlation | trace_id field in structured logs | Click from log line to open trace in Tempo. Click from trace span to see logs for that service during that time window. |
| Visualization | Grafana (already deployed) | VictoriaLogs datasource plugin for Grafana. Log panels alongside metric and trace panels in the same dashboard. |
10.3 Data Flow
```text
Container stdout --> Kubernetes log file --> OTel Collector (filelog receiver)
  --> Parse JSON --> Extract fields (service, level, trace_id)
  --> Kafka topic: logs-raw
  --> vlinsert --> Distribute to vlstorage nodes
  --> vlstorage --> Tokenize, build bloom filters, compress, write to local NVMe

Query: Grafana --> vlselect --> Fan out to vlstorage nodes --> Return results
```
This is the same fan-out pattern used by vmselect in the metrics pipeline.
10.4 Tiered Storage for Logs
Logs follow the same tiered storage strategy, with one important difference: unlike traces, partial log retention is viable. DEBUG and INFO logs older than 30 days rarely get queried. ERROR and WARN logs are the ones that matter for post-incident reviews and compliance audits. Dropping low-severity logs after the warm tier cuts cold/archive storage by 90%.
Dynamic log level adjustment. In production, most services run at INFO level. During incidents, operators can temporarily increase specific services to DEBUG via the OTel Collector's configuration reload (or OpAMP for fleet-wide changes). This keeps baseline ingestion volume manageable while preserving the ability to get full detail when it matters. Some teams go further: automatically elevate log level when error rate exceeds a threshold, and drop back to INFO once the incident resolves.
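The automatic elevation idea can be sketched as a small control loop with hysteresis. Everything here is hypothetical (a real version would push the level change through a collector config reload or OpAMP):

```python
class LogLevelController:
    """Elevate a service to DEBUG when its error rate crosses a threshold,
    and restore INFO only after it falls below a lower bound, so the level
    does not flap at the boundary (hysteresis)."""

    def __init__(self, elevate_at=0.05, restore_at=0.01):
        self.elevate_at = elevate_at  # error ratio that triggers DEBUG
        self.restore_at = restore_at  # error ratio that restores INFO
        self.level = "INFO"

    def observe(self, error_rate):
        if self.level == "INFO" and error_rate >= self.elevate_at:
            self.level = "DEBUG"  # incident: capture full detail
        elif self.level == "DEBUG" and error_rate < self.restore_at:
            self.level = "INFO"   # recovered: cut baseline volume again
        return self.level

ctl = LogLevelController()
print(ctl.observe(0.02))   # INFO
print(ctl.observe(0.08))   # DEBUG
print(ctl.observe(0.03))   # DEBUG (still above the restore bound)
print(ctl.observe(0.005))  # INFO
```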
VictoriaLogs uses local disk storage (not S3-native like Tempo). The tiering model mirrors how VictoriaMetrics handles metrics storage: hot and warm data lives on local NVMe across vlstorage nodes, while cold and archive data moves to S3 via vmbackup snapshots.
| Tier | Retention | Storage | What Gets Stored | Query Latency |
|---|---|---|---|---|
| Hot | 0-3 days | vlstorage local NVMe | All log levels, full content | < 500ms (bloom filter + local disk) |
| Warm | 3-30 days | vlstorage local disk | All log levels, full content | 500ms-2s (local disk reads) |
| Cold | 30-90 days | S3 Standard (vmbackup snapshots) | ERROR/WARN only (drop DEBUG/INFO) | 5-15s (restore from snapshot) |
| Archive | 90-365 days | S3 Glacier Deep Archive | ERROR/WARN only (compliance/audit) | Minutes (restore required) |
VictoriaLogs manages retention natively via the -retentionPeriod flag on vlstorage. Logs older than the configured period are deleted automatically at partition granularity (daily UTC partitions). For per-level retention (keeping ERROR longer than DEBUG), separate vlstorage instances or stream-based routing at vlinsert level can partition logs by severity, each with different retention periods.
Compared to Loki's S3-native tiering (automatic lifecycle via S3 policies), VictoriaLogs requires vmbackup to move cold data to S3. What you get: sub-second query latency on local disk vs multi-second on S3, plus 40% storage reduction.
10.5 Log-to-Trace Correlation
Structured logs include a trace_id field. VictoriaLogs auto-indexes this field like every other field, so querying by trace_id is a direct token lookup, not a brute-force scan. In Grafana, clicking a log line and selecting "View Trace" opens the full trace waterfall in Tempo. The reverse direction works too: from a Tempo trace span, clicking "View Logs" runs a VictoriaLogs query for trace_id:"abc123" AND service:checkout scoped to the span's time window.
With all three signals connected, the investigation goes from "p99 latency spike" to the exact error message ("connection timeout to the read replica") in three clicks. Without that correlation, the same investigation takes 30 minutes of manual work across different tools.
10.6 Capacity Estimate
From Section 7.5: 50M log lines/sec at 500 bytes = 25 GB/sec = 2.16 PB/day raw.
VictoriaLogs compression + bloom filter overhead:
~8x compression on structured JSON logs (columnar storage + LZ4/ZSTD)
VictoriaLogs uses 40% less storage than Loki on the same data (benchmark)
Compressed: ~260 TB/day stored on local disk
Log level distribution (typical): 70% DEBUG, 20% INFO, 8% WARN, 2% ERROR
Tiered storage (365-day retention for errors, 30-day for all):
Hot (3 days): 260 TB x 3 = 780 TB on local NVMe
Warm (27 days): 260 TB x 27 = 7.0 PB on local SSD
Cold (60 days): 260 TB x 0.10 x 60 = 1.56 PB on S3 Standard
(only ERROR+WARN = 10% of volume)
Archive (275d): 260 TB x 0.10 x 275 = 7.15 PB on Glacier Deep
Total log storage: ~9.5 PB across active tiers (+ 7.15 PB Glacier Deep Archive)
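A quick check of the tier math from the 260 TB/day compressed figure:

```python
# Recompute the Section 10.6 log storage tiers from 260 TB/day compressed.
daily_tb = 260
error_warn_tb = daily_tb // 10    # ERROR + WARN = 10% of volume -> 26 TB/day

hot_tb = daily_tb * 3             # 780 TB on local NVMe
warm_tb = daily_tb * 27           # 7,020 TB ~ 7.0 PB on local disk
cold_tb = error_warn_tb * 60      # 1,560 TB on S3 Standard
archive_tb = error_warn_tb * 275  # 7,150 TB on Glacier Deep Archive

active_pb = (hot_tb + warm_tb + cold_tb) / 1000  # 9.36 PB, ~9.5 in the text's rounding
print(f"active ~ {active_pb:.2f} PB, archive ~ {archive_tb / 1000:.2f} PB")
```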
vlinsert fleet:
25 GB/sec ingest rate (see Section 10.2 comparison table for benchmark numbers).
Each vlinsert handles ~1.5 GB/sec. The 66 MB/s benchmark measures the full single-node stack; in cluster mode vlinsert only parses and routes (tokenizing and bloom filter construction happen on vlstorage, per the Section 10.3 flow), so its per-node throughput is far higher.
17 vlinsert nodes (c6g.4xlarge, 16 vCPU)
vlstorage fleet:
Storage-optimized instances for local NVMe.
Hot tier: 780 TB across i3en.6xlarge nodes (2x 7.5 TB NVMe each = 15 TB usable)
~52 nodes for hot tier
Warm tier: 7.0 PB across d3en.8xlarge nodes (8x 10 TB HDD each = 80 TB usable)
~88 nodes for warm tier
vlselect fleet:
Fan-out query layer. Stateless. Scale based on query concurrency.
10 nodes (c6g.2xlarge)
Kafka additional load:
25 GB/sec log data on existing cluster
~20 additional brokers
Failure scenarios for the logging pipeline are covered in Section 12.
11. Identify Bottlenecks
11.1 Cardinality Ceiling
The inverted index is the first component to break under cardinality pressure. At 10 billion active series cluster-wide, the mergeset index consumes significant disk I/O and memory for bloom filters. As per-node series counts climb past the ~500M target (Section 11.7), lookup latency degrades. Symptoms: dashboard queries that used to return in 50ms start taking 500ms+. The metric vm_index_search_duration_seconds creeps up. Alert rule evaluations miss their 15-second window. The fix: more vmstorage nodes (horizontal sharding) and stricter cardinality limits per tenant.
11.2 Query Fan-Out
Every MetricsQL query fans out to all vmstorage nodes. With 200 nodes (RF=2), that's 200 parallel requests per query. If a dashboard has 20 panels, each refreshing every 30 seconds, that's 20 × 200 = 4,000 vmstorage requests per dashboard refresh. With 50 concurrent dashboard users, that's 200K requests/30s to vmstorage. Symptoms: Grafana dashboards show "query timeout" errors. vm_concurrent_select_requests on vmselect stays at max. Users complain dashboards are blank or loading forever.
To handle this: recording rules reduce the number of series each panel queries, result caching in vmselect cuts redundant reads, and query coalescing merges identical concurrent queries into one.
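Query coalescing is the classic singleflight pattern. A minimal sketch (illustrative, not vmselect's actual implementation):

```python
import threading
import time

class SingleFlight:
    """Merge concurrent calls that share a key into one execution;
    every caller receives the single shared result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = fn()  # only the leader runs the query
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()                 # followers reuse the leader's result
        return holder["value"]

sf = SingleFlight()
executions = []
results = []
barrier = threading.Barrier(8)

def query():
    executions.append(1)  # count how often the expensive fan-out really runs
    time.sleep(0.5)       # simulate the 200-node vmstorage fan-out
    return "resultset"

def caller():
    barrier.wait()        # 8 identical dashboard queries arrive together
    results.append(sf.do('sum(rate(http_requests_total[5m]))', query))

threads = [threading.Thread(target=caller) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(executions)} execution for {len(results)} callers")
```

Error handling is omitted for brevity; a production version would also propagate the leader's exception to followers.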
11.3 Kafka Consumer Lag
If Flink or vminsert falls behind, Kafka consumer lag grows. With 500M/sec ingestion and 24-hour retention, Kafka stores up to 43.2T samples. If a consumer falls 1 hour behind, it needs to process 1.8T extra samples to catch up while also handling current load. Symptoms: kafka_consumergroup_lag climbs steadily. Metrics appear on dashboards with increasing delay. The 2-second ingestion-to-visibility target starts slipping to 10+ seconds.
Monitor consumer lag (kafka_consumergroup_lag) and alert if lag exceeds 5 minutes. Auto-scale Flink TaskManagers based on lag. The 24-hour Kafka retention gives headroom for recovery.
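One step worth making explicit: catch-up time depends on how much spare capacity the consumer has beyond the live ingest rate. Using the Section 11.3 numbers:

```python
# Catch-up math for a consumer that fell one hour behind (Section 11.3).
ingest_rate = 500e6    # samples/sec arriving continuously
lag_samples = 1.8e12   # one hour of backlog at 500M/sec

def catchup_hours(headroom):
    """headroom = spare capacity beyond live load; 0.2 means the consumer
    can process 600M/sec total, leaving 100M/sec for the backlog."""
    return lag_samples / (ingest_rate * headroom) / 3600

print(f"{catchup_hours(0.2):.0f} h at 20% headroom")  # 5 h
print(f"{catchup_hours(0.5):.0f} h at 50% headroom")  # 2 h
```

This is why the 24-hour Kafka retention matters: modest headroom still leaves hours of recovery time well within the retention window.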
11.4 Flink Checkpoint Size
Flink's recording rules maintain state (current window aggregations) in RocksDB. At 50M active aggregation keys, checkpoints can be multi-GB. Checkpoint duration affects recovery time (longer checkpoint = longer restart + replay). Symptoms: Flink's checkpoint duration metric (flink_jobmanager_job_lastCheckpointDuration) grows from seconds to minutes. If checkpoints take longer than the checkpoint interval, they start overlapping and Flink falls behind.
Mitigation: incremental checkpointing (only write changed state), tuning RocksDB block cache and compaction, and enabling unaligned checkpoints so barriers can overtake in-flight buffers under backpressure instead of stalling until full alignment.
11.5 vmstorage Compaction Storms
VictoriaMetrics periodically merges small parts into larger ones (similar to LSM-tree compaction). During compaction, disk I/O spikes. If compaction falls behind ingestion, parts accumulate, and read latency increases. Symptoms: vm_parts_count on vmstorage keeps climbing instead of staying flat after merges. Disk I/O utilization stays pegged at 100%. Query latency gradually worsens as more parts need scanning.
Provision vmstorage nodes with 30% I/O headroom for compaction and use NVMe SSDs (not EBS gp3). Monitor vm_merge_need_free_disk_space and scale storage before compaction bottlenecks.
11.6 S3 Query Latency
Queries spanning both hot and cold tiers bottleneck on the S3 leg. vmstorage responds in single-digit milliseconds. S3 GET latency is 50-200ms per request. A query touching 100 Parquet files on S3 waits for the slowest one. Symptoms: any dashboard query with a time range beyond 15 days loads noticeably slower than a same-duration query within 15 days. vmselect_s3_request_duration_seconds shows a long tail.
At 2.6 PB in the warm tier, the file count is substantial. With 256 MB Parquet files, that's ~10 million files. Without partition pruning, a single 90-day query could scan thousands of files. The Parquet path structure (s3://metrics-archive/{tenant_id}/{resolution}/{year}/{month}/{day}/{fingerprint_range}.parquet) enables three levels of pruning: tenant_id eliminates other organizations' data, time range skips irrelevant days, and fingerprint range narrows to the specific series. A well-pruned query touches 10-50 files, not thousands.
To reduce this: pre-fetch Parquet row group indexes into vmselect memory on startup, cache frequently accessed Parquet blocks in a local SSD cache on vmselect nodes, and partition Parquet files by time range so pruning eliminates most files before any S3 request is made.
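The pruning logic reduces to prefix construction over the path scheme above. A sketch (the helper name is hypothetical; fingerprint-range pruning within each day is omitted):

```python
from datetime import date, timedelta

def pruned_prefixes(tenant_id, resolution, start, end):
    """S3 prefixes a query must touch: tenant_id and resolution pin the
    top of the path, the time range limits the day directories. The
    fingerprint-range filename component narrows further (not shown)."""
    prefixes = []
    d = start
    while d <= end:
        prefixes.append(
            f"s3://metrics-archive/{tenant_id}/{resolution}/"
            f"{d.year}/{d.month:02d}/{d.day:02d}/")
        d += timedelta(days=1)
    return prefixes

# A 3-day query touches 3 day-prefixes instead of the whole archive.
p = pruned_prefixes("tenant-42", "5m", date(2026, 3, 1), date(2026, 3, 3))
print(len(p), p[0])  # 3 s3://metrics-archive/tenant-42/5m/2026/03/01/
```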
11.7 Inverted Index Rebuild Time
When a vmstorage node restarts, it replays the WAL (Write-Ahead Log, a sequential on-disk record of all recent writes) and rebuilds the in-memory portion of the mergeset index. With billions of series, this can take 10-30 minutes. During this window, the node accepts writes (buffered to WAL) but cannot serve queries efficiently. Symptoms: after a vmstorage restart, vmselect logs show timeouts for queries involving that node. Partial results appear on dashboards until the index rebuild completes.
Periodic index snapshots reduce WAL replay size. Faster NVMe storage speeds up replay. Keep the hot tier series count manageable per node (target: <500M active series per vmstorage node). If rebuild time is critical, enable replication factor 2 so the replica serves queries while the primary rebuilds.
11.8 Tail-Based Sampling Memory
The OTel Collector's tail-based sampling processor buffers all spans for 60 seconds before making keep/drop decisions. At 67K spans/sec per collector (200M raw spans/sec across 3,000 collectors), each collector holds ~4M spans in memory, consuming ~2.7 GB RAM. If a burst of high-cardinality traces arrives (e.g., a load test generating 10x normal traffic), this buffer can exceed the collector's memory limit and trigger OOM. Symptoms: collector pods get OOMKilled by Kubernetes. otelcol_processor_tail_sampling_count_traces_dropped spikes. Trace coverage drops and exemplar links break.
Set max_traces on the tail-based sampling processor to cap memory usage. If the limit is reached, new traces pass through unsampled rather than crashing the collector. Monitor otelcol_processor_tail_sampling_count_traces_dropped and scale the collector fleet if drops exceed 1%. For high-volume services, use a dedicated collector pool rather than routing everything through the DaemonSet fleet.
11.9 Tempo S3 Query Latency
Tempo stores trace blocks as Parquet files on S3 with bloom filters for trace ID lookup. When a bloom filter returns a false positive, Tempo must fetch and scan the full Parquet block from S3 (50-200ms per GET). For queries that search by attributes rather than trace ID (e.g., {resource.service.name="checkout" && duration > 2s}), Tempo may scan dozens of blocks. Symptoms: TraceQL queries by trace ID return in <1s, but attribute-based searches take 10-30s. tempo_query_frontend_queries_total{status="timeout"} increases.
Keep the Tempo compactor healthy. Well-compacted blocks have smaller, more effective bloom filters. Configure the compactor's compaction_window and max_block_bytes to balance compaction frequency against S3 write volume. A memcached query result cache avoids repeated S3 reads for popular queries.
11.10 VictoriaLogs Bloom Filter Cache
VictoriaLogs uses bloom filters on every log field for index-assisted search. High-cardinality queries (e.g., searching for a specific trace_id across all services) hit bloom filters on every vlstorage node. If the bloom filter working set exceeds the OS page cache, queries trigger disk reads instead of cache hits, increasing latency from sub-second to multi-second. Symptoms: log search queries that usually return in 200ms start taking 5-10s. The vlstorage node's disk read IOPS spikes while memory usage stays flat (the bloom filters no longer fit in cache).
Monitor bloom filter cache hit rate via vlstorage metrics and provision vlstorage nodes with enough RAM for the OS page cache to hold the active bloom filter set (see Section 10.6 for sizing). If cache thrashing occurs, add more vlstorage nodes to reduce the per-node token count.
11.11 Scaling Triggers
The Section 7 sizing gives static node counts for 500M/sec. In production, you need to know when to add capacity before hitting a wall.
| Component | Scaling Metric | Trigger | Action |
|---|---|---|---|
| vmstorage | vm_active_timeseries per node | > 400M series (80% of 500M target) | Add vmstorage nodes, rebalance hash ring |
| vmstorage | Disk utilization | > 70% on NVMe | Add nodes or increase retention export frequency |
| vminsert | CPU utilization | > 70% sustained | Add vminsert pods (stateless, instant) |
| vmselect | vm_concurrent_select_requests | > 80% of max | Add vmselect pods |
| Kafka | Bytes in per broker | > 200 MB/sec per broker | Add brokers, rebalance partitions |
| Flink | kafka_consumergroup_lag | Lag > 5 minutes sustained | Add TaskManagers |
| Flink | Checkpoint duration | > 50% of checkpoint interval | Tune RocksDB, add parallelism |
| vlstorage | Disk utilization | > 70% | Add vlstorage nodes |
| vlstorage | Bloom filter cache miss rate | > 20% | Add RAM or add nodes to reduce per-node token count |
| OTel Collectors | otelcol_exporter_send_failed_metric_points | Failure rate > 1% | Scale DaemonSet fleet or increase per-collector resources |
Use Kubernetes HPA for stateless components (vminsert, vmselect, vlinsert, vlselect) with custom metrics from the meta-monitoring stack. Stateful components (vmstorage, vlstorage, Kafka) require manual scaling with planned rebalancing to avoid data movement storms.
12. Failure Scenarios
What happens when components fail or scaling limits are exceeded, and how the platform recovers.
12.1 VictoriaMetrics Node Loss
Scenario: A vmstorage node loses its disk (NVMe failure) and all data on it.
Impact: The series sharded to that node become unavailable for queries. Ingestion continues (vminsert detects the failed node and redistributes to remaining nodes). Queries return partial results for affected series.
Recovery:
- vminsert's health check detects the failed node within 10 seconds
- vminsert's consistent hashing ring updates. New writes for affected series go to the next node in the ring.
- The failed node's data is lost (unless replication was enabled). With replication factor 2, the replica node has a copy. vmselect queries both the primary and replica, so queries still return complete data.
- A replacement node joins the cluster. vmstorage rebalancing migrates data from overloaded nodes.
Data gap: Without replication, there's a gap in the affected series equal to the time between the last S3 export and the failure. With daily exports, that's up to 24 hours of lost raw data (downsampled data on S3 is unaffected).
Why replication factor 2 is enough: Unlike databases, metrics data is append-only and immutable. There are no updates, no transactions, no read-before-write. Losing recent data is tolerable (the system will re-scrape and backfill). Losing old data is mitigated by S3 exports.
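The hash-ring rerouting in the recovery steps above can be sketched with a minimal consistent-hash ring (illustrative; vminsert's actual algorithm differs in detail):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: a key maps to the first node point at
    or after its hash (wrapping around); removing a node reroutes only the
    keys that node owned, to the next point on the ring."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[i][1]

nodes = [f"vmstorage-{i}" for i in range(6)]
ring = HashRing(nodes)
owner = ring.node_for("series-fp-1234")

# Simulate losing the owning node: rebuild without it. Only series owned
# by the failed node move; every other series keeps its placement.
healthy = HashRing([n for n in nodes if n != owner])
print(owner, "->", healthy.node_for("series-fp-1234"))
```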
12.2 Kafka Broker Failure
Scenario: A Kafka broker in the metrics-ingestion cluster crashes.
Impact: Partitions led by that broker become temporarily unavailable. With RF=3, the controller elects a new leader from in-sync replicas. Producers retry. Consumers rebalance.
Recovery:
- Kafka controller detects broker failure (ZooKeeper session timeout or KRaft heartbeat)
- Leader election for affected partitions (~1-5 seconds)
- Producers retry buffered messages to the new leader
- Consumers detect new partition assignment and resume from last committed offset
- No data loss (RF=3 with acks=all means at least 2 replicas have every message)
During failover: 1-5 seconds of elevated producer latency. OTel Collectors buffer in memory during this window. At 500M/sec, that's 500M-2.5B buffered samples. OTel Collectors need sufficient memory for this burst (configured via sending_queue settings).
12.3 Cardinality Explosion (OOM)
Scenario: A tenant adds user_id as a histogram label (see the full cardinality explosion walkthrough in Section 8.3). 10M users × 12 histogram series = 120M new series.
Impact: vmstorage memory spikes. If per-tenant limits aren't enforced, OOM kills the process. All tenants on that node lose access.
Recovery with per-tenant limits (correct configuration):
- vminsert detects that the tenant has exceeded its 50M series limit
- New series creation is rejected (HTTP 429)
- Existing series continue receiving data
- Alert fires: vm_active_timeseries > 0.8 * vm_series_limit
- The tenant investigates and removes the problematic label from their application code (they added user_id to their SDK instrumentation, so they must remove it, or the platform's OTel Collector relabel rules can drop it as a stop-gap)
- Over the next staleness period (5 minutes), the 120M stale series are marked inactive
- Memory gradually returns to normal as stale series are garbage collected
Recovery without per-tenant limits (disaster):
- vmstorage OOM-killed
- vmselect can't reach that node. Queries return partial results.
- Alert evaluators can't evaluate rules for series on that node. Alerts go silent.
- Operator manually identifies the problematic tenant (by checking vm_new_timeseries_created_total per tenant)
- Operator adds an emergency cardinality limit for that tenant
- vmstorage restarts, replays WAL, rebuilds in-memory index
- Recovery time: 10-30 minutes depending on WAL size
Per-tenant cardinality limits are P0 for exactly this reason.
12.4 Split-Brain in VictoriaMetrics Cluster
Scenario: Network partition isolates vmstorage nodes 1-3 from nodes 4-6.
Impact: vminsert can only reach half the nodes. Writes for series sharded to unreachable nodes fail. vmselect can only query reachable nodes. Results are partial.
Recovery:
VictoriaMetrics is AP (availability over consistency). During a partition, reads and writes continue but may return temporarily inconsistent data:
- vminsert routes writes for unreachable nodes to the next available node in the hash ring
- This creates temporary duplicate series on different nodes
- vmselect queries whatever nodes it can reach, returning partial data
- When the partition heals, deduplication on vmselect merges overlapping data from both sides
- The temporarily duplicated data is reconciled during the next compaction cycle
No manual intervention required. Data completeness may be reduced during the partition, but no data is permanently lost.
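The reroute-to-next-node behavior can be illustrated with a minimal hash ring (a sketch, not VictoriaMetrics' actual routing code):

```python
import hashlib

# Minimal hash ring: pick the node for a series, falling back to the next
# reachable node when the primary is on the wrong side of a partition.
NODES = [f"vmstorage-{i}" for i in range(1, 7)]

def ring_position(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def route(series: str, reachable: set[str]) -> str:
    start = ring_position(series) % len(NODES)
    for i in range(len(NODES)):                  # walk the ring clockwise
        node = NODES[(start + i) % len(NODES)]
        if node in reachable:
            return node
    raise RuntimeError("no reachable vmstorage nodes")

all_nodes = set(NODES)
primary = route('http_requests_total{pod="xyz"}', all_nodes)
# During a partition that isolates the primary, writes land on the next node,
# creating the temporary duplicate that vmselect deduplication later merges.
fallback = route('http_requests_total{pod="xyz"}', all_nodes - {primary})
assert fallback != primary
```

The duplicate series created by the fallback route is exactly what makes the system AP: writes never block on the partition, at the cost of a reconciliation pass afterward.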
12.5 Alert Storm
Scenario: A core dependency (database, DNS, load balancer) fails. 5,000 services lose connectivity. 100,000 alert rules fire simultaneously.
Impact: Without protections, 100,000 simultaneous notifications bury the on-call engineers, causing alert fatigue and making root cause identification impossible.
Recovery:
- Grouping: Alertmanager groups alerts by `alertname + service + env`. 100,000 pod-level alerts become ~500 service-level notifications.
- Inhibition: A top-level "dependency-down" alert inhibits all downstream service alerts:

      inhibit_rules:
        - source_match:
            alertname: CoreDependencyDown
          target_match_re:
            alertname: '.*'
          equal: ['env', 'region']

- Mass incident trigger: If >50 unique alert groups fire within 5 minutes, the system triggers a "mass incident" page to incident command rather than individual team on-calls.
- Rate limiting: Alertmanager caps at 100 notifications/minute per receiver. Excess alerts are queued and sent as a digest.
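The mass-incident trigger described above can be sketched as a sliding-window counter of unique alert groups (thresholds taken from the scenario; the class itself is hypothetical):

```python
from collections import deque

# Page incident command instead of individual teams when more than N unique
# alert groups fire inside a 5-minute window (>50 in the scenario above).
class MassIncidentDetector:
    def __init__(self, threshold: int = 50, window_s: int = 300):
        self.threshold = threshold
        self.window_s = window_s
        self.events: deque[tuple[float, str]] = deque()   # (timestamp, group_key)

    def observe(self, group_key: str, now: float) -> bool:
        """Record a firing alert group; return True if this is a mass incident."""
        self.events.append((now, group_key))
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()                          # expire old events
        unique_groups = {g for _, g in self.events}
        return len(unique_groups) > self.threshold

det = MassIncidentDetector(threshold=3, window_s=300)
t0 = 1_000.0
assert not det.observe("svc-a/HighLatency", t0)
assert not det.observe("svc-b/HighLatency", t0 + 1)
assert not det.observe("svc-c/HighLatency", t0 + 2)
assert det.observe("svc-d/HighLatency", t0 + 3)   # 4th unique group in window
```

Counting unique groups rather than raw notifications matters: a single flapping alert should never escalate to incident command on its own.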
12.6 Flink Job Failure
Scenario: A Flink TaskManager crashes mid-checkpoint while processing recording rules. State is partially written to the checkpoint store.
Impact: Recording rules stop producing pre-computed metrics. Dashboards using recording rule output show stale data. SLO burn rate calculations freeze. Raw metric ingestion into vmstorage continues unaffected (direct path from Kafka to vminsert).
Recovery:
- Flink JobManager detects TaskManager heartbeat loss
- Reschedules failed tasks on available TaskManagers
- Tasks restore from the last successful checkpoint (not the partial one)
- Consumer offsets reset to the checkpoint's Kafka position
- Flink replays events from that offset, rebuilding state
- Recording rule output resumes once replay catches up to real-time
Gap: During replay (typically 30 seconds to 2 minutes), recording rule output is stale. Dashboards using raw queries are unaffected. Alert rules using recording rule output may miss a single evaluation cycle.
12.7 Kafka Partition Rebalance
Adding a Flink TaskManager triggers a consumer group rebalance: Kafka redistributes partition ownership across the group, pausing all consumers for 5-30 seconds. Consumers resume from the last committed offset and process the backlog at full throughput. Use Kafka's cooperative rebalancing protocol, where only affected partitions pause rather than the entire consumer group, to reduce the disruption to near-zero for unaffected partitions.
12.8 S3 Export Failure
If the hot-to-warm export job fails repeatedly, vmstorage disk fills and new writes fail. Alert on time() - last_s3_export_timestamp > 86400. The export runs every 6 hours with a 24-hour alert threshold, so there are 4 chances to succeed before anyone gets paged. Each run is idempotent (Parquet files are keyed by time range + fingerprint range). If disk pressure is immediate, temporarily extend hot tier retention and add vmstorage nodes.
12.9 Tenant Isolation Breach Attempt
Scenario: A tenant crafts a MetricsQL query attempting to access another tenant's data: http_requests_total{tenant_id="other-tenant"}.
Impact: None, if the isolation layer is implemented correctly.
Prevention:
- vmselect strips any user-provided `tenant_id` label from the query
- vmselect injects the authenticated tenant's ID from the `X-Scope-OrgID` header
- The query becomes `http_requests_total{tenant_id="authenticated-tenant"}` regardless of what the user submitted
- vmstorage data is physically separated by tenant directory (`/data/{tenant_id}/`)
Detection: Audit log captures queries where the user-provided tenant_id differs from the authenticated one. This flags potential reconnaissance attempts. Three such attempts within an hour trigger a security alert.
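The strip-and-inject step can be sketched as a label rewrite (illustrative only: a real implementation would operate on the parsed MetricsQL AST, not a regex over the raw query string):

```python
import re

# Sketch: drop any user-supplied tenant_id matcher, then inject the
# authenticated tenant taken from the X-Scope-OrgID header.
def enforce_tenant(query: str, authenticated_tenant: str) -> str:
    # 1. Strip user-provided tenant_id matchers (=, !=, =~, !~) from selectors
    query = re.sub(r'tenant_id\s*(=|!=|=~|!~)\s*"[^"]*"\s*,?\s*', "", query)

    # 2. Inject the authenticated tenant into every label selector
    def inject(m: re.Match) -> str:
        body = m.group(1).strip().rstrip(",")
        sep = ", " if body else ""
        return '{tenant_id="%s"%s%s}' % (authenticated_tenant, sep, body)

    return re.sub(r"\{([^{}]*)\}", inject, query)

q = 'http_requests_total{tenant_id="other-tenant", env="prod"}'
assert enforce_tenant(q, "acme") == 'http_requests_total{tenant_id="acme", env="prod"}'
```

The key property, whatever the implementation: the tenant filter the storage layer sees comes only from the authenticated identity, never from query text.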
12.10 Clock Skew Between Collectors
NTP misconfiguration causes clock drift: future timestamps confuse dashboards and break rate() calculations, while past timestamps outside VictoriaMetrics' 30-minute window are silently dropped. Monitor vm_out_of_order_samples_total and vm_too_far_in_future_samples_total. Prevention: validate timestamps at the collector (current time ± 5 minutes) and overwrite out-of-range timestamps.
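The collector-side guard is a simple clamp; a minimal sketch, assuming the ±5-minute window stated above:

```python
SKEW_WINDOW_S = 300  # accept timestamps within +/- 5 minutes of collector time

def sanitize_timestamp(sample_ts: float, now: float) -> float:
    """Clamp out-of-range sample timestamps to the collector's own clock."""
    if abs(sample_ts - now) > SKEW_WINDOW_S:
        return now          # overwrite: the emitting host's clock has drifted
    return sample_ts        # in range: trust the original timestamp

now = 1_700_000_000.0
assert sanitize_timestamp(now - 60, now) == now - 60     # 1 min old: kept
assert sanitize_timestamp(now + 3600, now) == now        # 1 h in future: overwritten
assert sanitize_timestamp(now - 7200, now) == now        # 2 h in past: overwritten
```

Overwriting (rather than dropping) trades a small timestamp error for not losing the sample entirely, which is usually the right call for monitoring data.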
12.11 Cascading Failure from Query Overload
Scenario: A tenant opens a dashboard with 20 panels, each running an unoptimized query that touches 2 million series. The dashboard auto-refreshes every 10 seconds.
Impact: 20 queries × 2M series × 200 vmstorage nodes = 8 billion series lookups every 10 seconds from a single user. This saturates vmstorage CPU. All other tenants' queries slow down. Alert evaluation latency increases. SLO burn rate detection is delayed.
Recovery:
- Per-tenant query concurrency limit (default: 20 concurrent queries) throttles the offending tenant
- Per-query series limit (default: 500K series) rejects individual queries that are too broad
- vmselect returns HTTP 422 with an `X-Query-Series-Exceeded` header
- Other tenants' queries continue at normal latency
Prevention: Query analysis at dashboard save time. If a panel's query is estimated to touch >500K series, warn the user and suggest adding recording rules. The per-tenant concurrency and series limits from above also apply proactively. Monitor vm_concurrent_queries{tenant_id="X"} and page if any tenant consistently hits their limit.
12.12 Trace Sampling Pipeline Failure
Scenario: The tail-based sampling processor in the OTel Collector crashes or hits its max_traces limit. Spans either get dropped entirely or pass through unsampled at 100% keep rate.
Impact (spans dropped): Gaps in trace data. Trace-to-metric correlation via exemplars breaks for affected traces. Service dependency maps show missing edges.
Impact (100% pass-through): Tempo ingesters receive 200x normal volume (200 GB/sec instead of 1 GB/sec). Kafka topic traces-raw saturates. Tempo ingesters OOM or fall behind. Storage volume spikes 200x if sustained.
Recovery:
- Monitor `otelcol_processor_tail_sampling_count_traces_sampled` and `otelcol_processor_tail_sampling_count_traces_dropped`
- Alert if the sampled/raw ratio deviates from the expected 0.5% by more than 10x
- If pass-through: the OTel Collector's memory limiter processor acts as a circuit breaker, dropping spans before OOM
- Restart affected collectors. Tail-based sampling state is in-memory only, so restart resets the decision buffer
- Kafka retention (24 hours) buffers the excess. Once sampling resumes, Tempo catches up
12.13 Trace Context Propagation Break
Async boundaries (Kafka messages, SQS queues, background jobs) break trace context if traceparent isn't injected into message headers. Downstream spans become orphaned root spans, and service dependency maps miss edges. Monitor the root-span ratio per service. A sudden increase indicates broken propagation. The fix is always code-level: inject traceparent at the producer, extract at the consumer using OTel SDK's TextMapPropagator.
12.14 VictoriaLogs Disk Full
Scenario: A noisy service starts emitting 10x normal log volume (e.g., a misconfigured debug logging flag in production). The vlstorage nodes receiving logs from that service fill their local NVMe storage.
Impact: Once disk is full, vlstorage stops accepting new writes. vlinsert detects the failure and either buffers or drops logs. Queries to the affected vlstorage node return partial results. Other vlstorage nodes are unaffected (logs are distributed by service name, not broadcast).
Recovery:
- Alert fires: `vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes > 0.85` (warning at 85%)
- Identify the noisy service: query `sum(rate(vlinsert_ingested_bytes_total[5m])) by (service)` and sort descending
- Apply per-service rate limiting at vlinsert to throttle the offending service
- Fix the logging configuration in the application (disable debug logging, add sampling)
- VictoriaLogs retention policy (`-retentionPeriod`) automatically deletes old partitions, freeing disk space within one partition cycle (daily granularity)
Prevention: Monitor per-service ingestion rates. Set per-service ingestion quotas at the vlinsert layer. Use the OTel Collector's filter processor to drop DEBUG logs from high-volume services before they reach Kafka.
12.15 Network Partition Scenarios
OTel-to-Kafka and Flink-to-VictoriaMetrics partitions share the same recovery model: upstream components buffer locally (OTel Collectors have a 500 MB retry queue; Flink backpressures to Kafka), and Kafka's 24-hour retention prevents data loss once connectivity restores. Monitor otelcol_exporter_send_failed_metric_points and kafka_consumergroup_lag for detection. Deploy Kafka brokers across multiple AZs for prevention.
13. Observability (Meta-Monitoring)
The meta-problem: if the metrics platform itself goes down, how does anyone find out? Dashboards are blank. Alerts aren't evaluating. PagerDuty is silent.
Solution: A separate, minimal monitoring stack.
A single Prometheus instance on separate infrastructure (different Kubernetes cluster, different cloud region, different failure domain) monitors four things:
- Is vmstorage responding? Scrape `/metrics`. If the scrape fails 3 times, page.
- Is ingestion flowing? Check the `vm_rows_inserted_total` rate. If it drops to 0, page.
- Is the alert evaluator running? Check the `rule_evaluations_total` rate. If it drops to 0, page.
- Is Alertmanager healthy? Check `alertmanager_alerts_received_total`. If it flatlines, page.
Total footprint: one t3.medium instance.
13.1 Critical Metrics
What to watch, by component:
# === INGESTION ===
rate(vm_rows_inserted_total[5m]) # Should match 500M/sec ± 20%
kafka_consumergroup_lag{group="flink-recording-rules"} # Flink consumer lag
kafka_consumergroup_lag{group="vminsert-consumer"} # vminsert consumer lag
otelcol_scraper_scraped_metric_points / (otelcol_scraper_scraped_metric_points + otelcol_scraper_errored_metric_points) # Scrape success rate
rate(vm_rows_rejected_total{reason="series_limit"}[5m]) # Tenant limit rejections
# === STORAGE ===
vm_active_timeseries # Global active series
rate(vm_new_timeseries_created_total[5m]) # Series churn rate
vm_data_size_bytes{type="indexdb"} # Index size
vm_data_size_bytes{type="storage"} # Data size
vm_merge_need_free_disk_space # Compaction backlog
# === QUERY ===
histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) # p99 query latency
vm_concurrent_queries # Active queries
rate(vm_slow_queries_total[5m]) # Slow queries (>10s)
vm_cache_hits_total / (vm_cache_hits_total + vm_cache_misses_total) # Cache hit rate
# === ALERTING ===
histogram_quantile(0.99, rate(rule_evaluation_duration_seconds_bucket[5m])) # Rule eval duration
rate(rule_evaluation_failures_total[5m]) # Missed evaluations
rate(alertmanager_notifications_failed_total[5m]) # Notification failures
# === FLINK ===
flink_job_lastCheckpointDuration # Should be < checkpoint interval
rate(flink_taskmanager_job_task_numRecordsInPerSecond[5m]) # Should match Kafka rate
flink_taskmanager_job_task_backPressuredTimeMsPerSecond # High = can't keep up
flink_taskmanager_job_task_operator_state_size # RocksDB state size
# === TRACE PIPELINE ===
tempo_ingester_flush_duration_seconds # Ingester lag
rate(otelcol_processor_tail_sampling_count_traces_dropped[5m]) # Span drop rate
rate(otelcol_processor_tail_sampling_count_traces_sampled[5m]) /
rate(otelcol_processor_tail_sampling_count_traces_received[5m]) # Sampling ratio (~0.5%)
tempo_compactor_outstanding_blocks # Compactor health
# === LOG PIPELINE ===
rate(vlinsert_ingested_rows_total[5m]) # Log ingestion rate
vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes # Disk usage
vlstorage_bloom_filter_cache_hits /
(vlstorage_bloom_filter_cache_hits + vlstorage_bloom_filter_cache_misses) # Bloom filter cache
sum(rate(vlinsert_ingested_bytes_total[5m])) by (service) # Detect noisy services
13.2 Alert Rules for Meta-Monitoring
| Alert | Expression | Severity | For |
|---|---|---|---|
| IngestionDrop | rate(vm_rows_inserted_total[5m]) < 400000000 | critical | 2m |
| KafkaLagHigh | kafka_consumergroup_lag > 50000000 | warning | 5m |
| KafkaLagCritical | kafka_consumergroup_lag > 500000000 | critical | 2m |
| CardinalitySpike | rate(vm_new_timeseries_created_total[1h]) > 100000 | warning | 0m |
| VmstorageDown | up{job="vmstorage"} == 0 | critical | 30s |
| QueryLatencyHigh | histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) > 1 | warning | 5m |
| RuleEvalFailing | rate(rule_evaluation_failures_total[5m]) > 0 | critical | 2m |
| FlinkBackpressure | flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500 | warning | 5m |
| CompactionBehind | vm_merge_need_free_disk_space > 0 | warning | 10m |
| S3ExportFailed | time() - last_s3_export_timestamp > 86400 | critical | 0m |
| TempoIngesterLag | tempo_ingester_flush_duration_seconds > 30 | warning | 5m |
| TraceSamplingAnomaly | rate(otelcol_processor_tail_sampling_count_traces_sampled[5m]) / rate(otelcol_processor_tail_sampling_count_traces_received[5m]) > 0.05 | warning | 2m |
| VictoriaLogsStorageFull | vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes > 0.85 | warning | 5m |
| VictoriaLogsStorageCritical | vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes > 0.95 | critical | 1m |
| LogIngestionDrop | rate(vlinsert_ingested_rows_total[5m]) < 40000000 | critical | 5m |
| NoisyServiceLog | sum(rate(vlinsert_ingested_bytes_total[5m])) by (service) > 1073741824 | warning | 5m |
14. Deployment Strategy
The platform deploys on Kubernetes across multiple regions. Each region runs a complete, independent pipeline. Kubernetes provides DaemonSet-based collector deployment, service discovery for scrape targets, and horizontal pod autoscaling for stateless components (vminsert, vmselect, vlinsert, vlselect).
14.1 Multi-Region Topology
Active-active ingestion: Each region runs its own complete pipeline (OTel → Kafka → Flink → VictoriaMetrics). Metrics stay region-local, no cross-region ingestion traffic.
Cross-region query federation: A global vmselect gateway fans out queries to VictoriaMetrics clusters in all regions. For global dashboards ("total request rate across all regions"), the gateway merges results from both regions. For region-specific dashboards, the gateway routes to a single region.
S3 replication: Each region's S3 bucket has cross-region replication enabled. If us-east-1 goes down entirely, warm/cold data is still accessible from us-west-2.
14.2 Rolling Upgrades
VictoriaMetrics: Upgrade one component at a time. vminsert first (stateless, instant restart). vmselect next (stateless). vmstorage last (stateful, needs graceful drain).
vmstorage graceful drain:
- Mark node as "draining" in the hash ring
- vminsert stops sending new writes to the draining node
- Wait for in-flight writes to complete (30-second timeout)
- Stop the process, upgrade, restart
- Re-join the hash ring
Flink: Blue-green deployment (run the new version alongside the old, switch traffic once the new version is healthy, then stop the old) for Flink jobs.
- Deploy new Flink job version alongside the existing one (different consumer group)
- New job starts consuming from Kafka and building state
- Once the new job has caught up (consumer lag = 0), switch traffic
- Stop the old job
This avoids state migration, which is the riskiest part of Flink upgrades.
14.3 Disaster Recovery
RPO/RTO targets:
| Component | RPO (data loss tolerance) | RTO (recovery time) | Mechanism |
|---|---|---|---|
| Metrics (hot) | 0 (with RF=2) | < 5 minutes | Replica serves queries while failed node recovers. vminsert reroutes writes. |
| Metrics (warm/cold) | 0 | < 30 minutes | S3 cross-region replication. Global vmselect reads from either region. |
| Kafka | 0 (RF=3, acks=all) | < 1 minute | Leader election on surviving replicas. |
| Flink | Last checkpoint (seconds to minutes) | < 5 minutes | Restart from checkpoint, replay from Kafka offsets. |
| Traces (Tempo) | ~60s of in-flight spans | < 10 minutes | Tempo query frontend is stateless. Ingesters lose their buffer (~60s of spans). |
| Logs (VictoriaLogs) | Since last vmbackup (hours) | 15-30 minutes | Restore from S3 snapshot. Hot data since last backup is lost unless Kafka retention covers it. |
Full region failure:
If us-east-1 goes down completely, the recovery path depends on the signal:
- Metrics: The global vmselect gateway detects us-east-1 VictoriaMetrics as unreachable and routes all queries to us-west-2. Dashboards show data only from us-west-2 services until recovery. Warm/cold data from us-east-1 is accessible via S3 cross-region replication.
- Traces: Tempo's query frontend in us-west-2 can read us-east-1 trace blocks from the replicated S3 bucket. No trace data is lost (beyond the ~60s in-flight buffer).
- Logs: VictoriaLogs data on us-east-1 local NVMe is unavailable until the region recovers. Logs from us-east-1 services are lost for the duration. If log durability across regions is critical, stream a copy of the `logs-raw` Kafka topic to us-west-2 via MirrorMaker.
Ingestion during regional failure: Services in the failed region can't emit telemetry. Services in the surviving region continue unaffected. If services in us-east-1 fail over to us-west-2 (application-level DR), their telemetry automatically flows into the us-west-2 pipeline with no configuration change, since OTel Collectors run as a DaemonSet on every node.
15. Security
15.1 Security Fundamentals
- Transport: mTLS between all components (OTel Collectors, Kafka, Flink, VictoriaMetrics). TLS 1.3 for external-facing APIs. Certificates managed by cert-manager.
- Tenant isolation: Separate data directories per tenant on vmstorage (`/data/{tenant_id}/`). vmselect injects a `tenant_id` filter into every query. Alert rules, recording rules, and dashboards are tenant-scoped with RBAC at the API gateway.
- Encryption at rest: AES-256 disk encryption for vmstorage (EBS/NVMe). SSE-S3 or SSE-KMS for S3. Broker-side encryption for Kafka.
- Audit logging: All administrative actions (tenant CRUD, alert rule changes, cardinality limit overrides, emergency tenant disables) logged to a separate append-only store. 1-year minimum retention for compliance.
15.2 Query Injection Prevention
MetricsQL expressions are user-provided strings. Without sanitization, a crafted query could attempt to:
- Access another tenant's data: `http_requests_total{tenant_id="other-tenant"}`
- Execute resource-intensive queries: `count({__name__=~".+"})` (scans all series)
Protections:
- vmselect forcibly prepends `{tenant_id="<authenticated_tenant>"}` to every query. The user-provided `tenant_id` label is stripped.
- Query complexity limits: maximum number of series touched (500K default), maximum query duration (60s), maximum number of subqueries (10).
- Regex complexity limits: MetricsQL regex matchers are checked for backtracking patterns before execution.
Explore the Technologies
The technologies and infrastructure patterns referenced throughout this design:
Core Technologies
| Technology | Role in This System | Learn More |
|---|---|---|
| Kafka | Ingestion buffer, 500M metrics/sec, fingerprint-partitioned topics | Kafka |
| VictoriaMetrics | Hot TSDB, Gorilla compression, shared-nothing cluster (vminsert/vmselect/vmstorage) | VictoriaMetrics |
| Flink | Stream aggregation, recording rules, SLO burn rate calculation | Flink |
| Prometheus | Scrape format standard, MetricsQL foundation, meta-monitoring stack | Prometheus |
| Grafana | Dashboard visualization, multi-datasource queries, variable templating | Grafana |
| InfluxDB | Alternative TSDB evaluated in Section 4.4 (FDAP stack, Arrow + Parquet) | InfluxDB |
| RocksDB | Flink state backend, VictoriaMetrics mergeset index foundation | RocksDB |
| gRPC | vminsert-to-vmstorage communication, protobuf metric serialization | gRPC |
| Envoy | Service mesh sidecar, mTLS for scrape traffic, load balancing | Envoy |
| Memcached | Query result cache for Tempo trace queries and vmselect overflow caching | Memcached |
| Gorilla Compression | Delta-of-delta + XOR encoding, 12x compression ratio, 1.37 bytes/sample | Gorilla Compression |
| ZSTD Compression | General-purpose compression for cold-tier Parquet blocks on S3, Kafka messages, RocksDB SST files. ~3x ratio, 1500 MB/s decompression | ZSTD Compression |
| Grafana Tempo | Distributed trace storage, S3-backed Parquet blocks, TraceQL queries, exemplar correlation | Grafana Tempo |
| VictoriaLogs | Log storage, bloom filter per-token index, LogsQL queries, vlinsert/vlselect/vlstorage cluster | VictoriaLogs |
| VictoriaTraces | Alternative trace backend (evaluated alongside Tempo), VictoriaLogs engine, vtinsert/vtselect/vtstorage | VictoriaTraces |
Infrastructure Patterns
| Pattern | Role in This System | Learn More |
|---|---|---|
| Metrics & Monitoring | Core domain. Gorilla compression, TSDB internals, cardinality management | Metrics & Monitoring |
| Message Queues & Event Streaming | Kafka topic design, fingerprint partitioning, backpressure handling | Event Streaming |
| Alerting & On-Call | Distributed rule evaluation, Alertmanager, SLO burn rates, alert storms | Alerting & On-Call |
| Kubernetes Architecture | Service discovery, DaemonSet collectors, pod-level scraping | Kubernetes Architecture |
| Rate Limiting & Throttling | Per-tenant ingestion limits, query concurrency limits, cardinality budgets | Rate Limiting |
| Circuit Breaker & Resilience | vmstorage failure handling, Kafka retry policies, graceful degradation | Circuit Breaker & Resilience |
| Auto-Scaling Patterns | vminsert/vmselect horizontal scaling, Flink TaskManager auto-scaling | Auto-Scaling |
| Distributed Tracing | Trace collection via OTel SDK, tail-based sampling, Tempo on S3, trace-to-metric exemplars | Distributed Tracing |
| Distributed Logging | Log collection via OTel filelog receiver, VictoriaLogs bloom filter indexing, per-level retention, trace-to-log correlation | Distributed Logging |
| Database Sharding | VictoriaMetrics consistent hashing, tenant shuffle-sharding | Database Sharding |
| Service Discovery & Registration | Kubernetes SD, Consul for automatic scrape target registration | Service Discovery |
| Deployment Strategies | Blue-green for Flink jobs, rolling upgrades for VictoriaMetrics | Deployment Strategies |
| Caching Strategies | vmselect query cache, recording rule pre-computation, Parquet block cache | Caching Strategies |
| Object Storage & Data Lake | S3 + Parquet warm/cold tiers, ZSTD compression, partition pruning | Object Storage |
| Secrets Management | mTLS certificates, tenant API keys, S3 encryption keys | Secrets Management |
Further Reading
- Facebook Gorilla Paper: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015)
- Datadog Monocle: Evolving Real-Time Timeseries Storage in Rust
- Uber M3: Open Source Large-Scale Metrics Platform
- High-Cardinality TSDB Benchmarks: VictoriaMetrics vs TimescaleDB vs InfluxDB
- Datadog Metrics without Limits
- Google SRE Book: Alerting on SLOs
- Grafana Mimir Architecture
- Honeycomb: Why Observability Requires a Distributed Column Store
A unified observability platform at this scale is three pipelines sharing a single collection layer. Kafka buffers all three signal firehoses. VictoriaMetrics stores metrics with 12x Gorilla compression. Tempo stores traces as Parquet blocks on S3 after tail-based sampling cuts 99.5% of raw volume. VictoriaLogs stores logs with bloom filter indexing at 3x the throughput of Loki. Grafana ties all three together with exemplars and trace_id correlation: three clicks from "something is slow" to "here is the root cause." The hardest problem remains cardinality for metrics, sampling strategy for traces, and volume control for logs. Per-tenant limits, meta-monitoring, and a separate watchdog stack keep the whole thing running.