Observability Platform at Scale - Metrics, Traces, Logs, and Profiles
Goal: Build a unified observability platform that ingests 500 million metrics per second, collects distributed traces across 500K+ services, centralizes logs from every container in the fleet, and continuously profiles application performance. Support all four pillars of observability: metrics (counters, gauges, histograms, summaries), traces (per-request call trees across service boundaries), logs (structured application output with full-text search), and profiles (CPU, memory, and lock contention snapshots correlated to traces). Provide real-time dashboards, alerting with SLO (Service Level Objective)-based burn rates and ML-assisted anomaly detection, multi-tenant isolation, and long-term retention with automatic downsampling.
Reading guide: This is a long, detailed deep dive. You don't need to read it linearly.
- Sections 1-8: Metrics pipeline in full depth (the largest and most complex signal)
- Section 9: Distributed tracing at scale
- Section 10: Centralized logging
- Section 11: Continuous profiling (the fourth pillar)
- Sections 12-16: Cross-cutting concerns (bottlenecks, failures, meta-monitoring, deployment, security)
All four signals share the same OTel Collector fleet and Kafka cluster but diverge at the storage layer.
- New to observability? Start with Sections 1-7 for the core architecture and sizing. Skim Section 8.3 (cardinality) for the scariest problem in metrics. Skip to Section 13 for real failure stories.
- Building something similar? Sections 7-8 have the sizing math and deep dives you need. Section 11 covers the profiling pipeline most teams skip.
- Preparing for a system design interview? Sections 1-8 cover what interviewers expect. Section 12 (bottlenecks) and Section 13 (failures) are common follow-up questions.
TL;DR: A unified observability platform handling 500M metrics/sec, 200M spans/sec, 50M log lines/sec, and 50K profiles/sec across 500K+ services. Gorilla compression achieves 1.3–2 bytes/sample (12x reduction). Tail-based trace sampling reduces volume by 98-99.5% while keeping 100% of errors. Value-based routing drops 15-30% of low-value data before storage. ML anomaly detection supplements SLO burn rates for alerting. The hardest problems: cardinality explosions in metrics, sampling strategy for traces, volume spikes in logs, and profile-to-trace correlation for profiling.
Architecture in One Minute
At a high level the platform consists of five layers:
- Two-tier collection layer. eBPF baseline telemetry via Grafana Beyla (zero code changes) plus OpenTelemetry SDK instrumentation for deep visibility.
- Observability pipeline. OTel Collector processors perform metadata enrichment, value-based routing, and deduplication before data reaches Kafka.
- Streaming ingestion. Apache Kafka buffers telemetry and supports replay. Apache Flink performs recording rules, rollups, and SLO burn-rate calculations.
- Signal-specific storage engines. Metrics → VictoriaMetrics, Traces → Grafana Tempo, Logs → VictoriaLogs, Profiles → Grafana Pyroscope. Each engine is optimized for its signal's access pattern.
- Unified query and correlation. Grafana correlates signals using exemplars, trace_id, and span_id. Optional ClickHouse analytics gives you cross-signal exploration.
1. Problem Statement
A few clarifications before getting into the architecture:
Scale clarification: This design targets extreme hyperscale, a stress-test upper bound rather than a typical deployment. Most organizations ingest thousands to low millions of metrics/sec; only the largest enterprises (thousands of microservices, hundreds of thousands of hosts) reach 10M-50M. At the frontier: Uber's M3 ingests 500M metrics/sec pre-aggregation across 6.6B active series, Datadog processes 100+ trillion events/day, and Grafana Mimir benchmarked 1B active series at ~50M samples/sec. The 500M figure here matches Uber's pre-aggregation rate to pressure-test every layer. Section 7.3.1 shows how the architecture scales down linearly to 10-50M with fewer components.
Assumptions:
- This is a multi-tenant platform (like Datadog, Grafana Cloud, or New Relic). Multiple teams and organizations push metrics into the same infrastructure.
- A metric is "ingested" once it's in durable storage and queryable, which takes under 2 seconds.
- Each metric is a (name, label set, timestamp, float64 value) tuple. Labels are key-value pairs like {service="checkout", env="prod", region="us-east-1"}.
- Services expose metrics via Prometheus exposition format or push via OTLP (OpenTelemetry Protocol).
Operating model:
- Responsibility split: The platform team operates collection infrastructure (OTel Collectors, eBPF agents, Kafka, storage, dashboards). Tenant teams instrument their applications using OpenTelemetry SDKs, Prometheus client libraries, or rely on eBPF-based auto-instrumentation for baseline telemetry. Tenants either expose a /metrics endpoint (pull) or push OTLP to the platform's ingestion endpoint (push). Tenants do not run their own Prometheus servers or storage. The platform handles everything after the application boundary.
- Two-tier instrumentation: The platform deploys Grafana Beyla (eBPF-based) as a DaemonSet alongside the OTel Collector DaemonSet. Beyla auto-instruments HTTP/gRPC/SQL at the kernel level with zero application code changes, giving you RED metrics (Rate, Errors, Duration) and basic trace spans for every service.
The split matters because eBPF and the OTel SDK see different things. eBPF watches network boundaries: it knows "POST /checkout took 500ms" and "a SQL call to port 5432 took 200ms," but it has no idea what happened inside the service. It can't tell you which function was slow, what the actual SQL query text was, or why the request failed. OTel SDK instrumentation lives inside the application code, so it captures the internal breakdown: which method took how long, the exact query text, custom business metrics like orders_placed or payment_amount, span attributes like user_id and cart_size, and CPU/memory profiles showing which line of code is the bottleneck. Profiles come exclusively from SDK instrumentation, not from eBPF.
Why not just use the SDK for everything? Because it requires code changes: importing libraries, adding instrumentation, redeploying. Across 500+ services in five languages, that takes months. eBPF covers every service in an afternoon: deploy one DaemonSet and you have RED metrics and basic trace spans for the entire cluster. When a team eventually adds SDK instrumentation to a service, the OTel Collector deduplicates the overlap (keeping the richer SDK version) and eBPF quietly steps back for those endpoints.
- Data transport: For tenants on the same Kubernetes cluster, the platform deploys OTel Collectors as a DaemonSet (one per node) that scrapes tenant pods locally over mTLS with no cross-network hop. For tenants on separate infrastructure, the platform exposes an HTTPS ingestion endpoint (OTLP or Prometheus remote_write) authenticated via per-tenant API key. Both paths converge into the same Kafka-backed pipeline.
Scope: Four Pillars of Observability
A production monitoring platform handles four signals that look nothing alike:
- Metrics answer "how much?" Small numbers (CPU at 45%, 200 req/sec), arriving constantly.
- Traces answer "what happened to this one request?" A call tree across service boundaries with timing data.
- Logs answer "what did the system say?" Structured text messages with full-text search.
- Profiles answer "why is this code slow?" CPU flame graphs, memory allocations, lock contention snapshots showing exactly which function is the bottleneck.
Each needs a different storage engine because the query patterns are completely different:
| Signal | Data Shape | Size per Event | Storage Engine | Query Pattern |
|---|---|---|---|---|
| Metrics | (name, labels, timestamp, float64) | 1.5-2 bytes (Gorilla compression, Section 8.2) | TSDB (VictoriaMetrics) | Aggregate: rate(x[5m]) |
| Traces | (trace_id, span_id, parent_id, service, duration, attributes) | 500-2000 bytes/span | Columnar/object store (Grafana Tempo) | By trace ID, filter by service+latency |
| Logs | (timestamp, severity, message, structured_fields) | 200-1000 bytes | Log store (VictoriaLogs) | Full-text search, filter by severity |
| Profiles | (profile_id, span_id, service, type, stacktrace, values) | 50-200 KB/snapshot | Profile store (Grafana Pyroscope) | By span ID, by service+time range |
What NOT to do:
A single Prometheus instance typically handles 3-10 million active series depending on available RAM and scrape interval. At 500M metrics/sec with potentially billions of unique series, that setup will OOM (Out of Memory crash) before lunch. Running 50,000 standalone Prometheus instances isn't a solution either. That's 50,000 things to manage, 50,000 things to fail, and no global query capability.
The other common mistake: storing raw metrics forever. At 500M samples/sec, that's 43.2 trillion samples/day. Even at 1.37 bytes per sample (Gorilla compression), that's 54 TB/day of hot storage. Without downsampling, the storage bill alone kills the project. Downsampling applies to metrics only. Traces, logs, and profiles retain full resolution or are discarded entirely (Sections 9-11 explain why).
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Ingest metrics via pull (Prometheus scrape) and push (OTLP or Prometheus remote_write) | P0 |
| FR-02 | Support four metric types: counter, gauge, histogram, summary | P0 |
| FR-03 | Real-time dashboard queries with sub-second latency over 15-day window | P0 |
| FR-04 | Continuous alert rule evaluation with configurable thresholds | P0 |
| FR-05 | SLO-based alerting with multi-window burn rate detection | P0 |
| FR-06 | Automatic metric downsampling: raw (15d) → 5-min aggregates (90d) → 1-hour aggregates (forever) | P0 |
| FR-07 | Multi-tenant data isolation with per-tenant cardinality limits | P0 |
| FR-08 | Distributed trace collection via OTLP with tail-based sampling | P0 |
| FR-09 | Centralized log ingestion with full-text search (50M lines/sec) | P0 |
| FR-10 | eBPF-based baseline collection (Grafana Beyla) for automatic RED metrics and trace spans | P0 |
| FR-11 | Value-based data routing: full fidelity for errors/SLO-critical, sample/drop health checks and debug noise | P0 |
| FR-12 | ML-assisted anomaly detection with seasonal baselines, adaptive thresholds, and cross-service correlation | P0 |
P1 requirements (recording rules, RBAC, exemplar correlation, cross-region federation, continuous profiling) and P2 requirements (metric relabeling, cost attribution) are addressed in their respective deep-dive sections.
3. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Ingestion throughput | 500M data points/sec sustained |
| Query latency (15-day window) | p50 < 50ms, p99 < 200ms |
| Query latency (90-day window, downsampled) | p50 < 200ms, p99 < 500ms |
| Ingestion-to-dashboard visibility | < 2 seconds |
| Alert rule evaluation interval | 15 seconds (configurable) |
| Alert firing latency (ingestion to notification) | < 30 seconds |
| ML anomaly detection latency | < 60 seconds from ingestion to anomaly signal |
| Profile ingestion throughput | 50K profiles/sec sustained |
| Availability | 99.99% (52 min downtime/year) |
| Data durability | 99.999% (no data loss on single node failure) |
| Active time series capacity | 10 billion+ |
| Retention | Raw: 15 days, 5-min: 90 days, 1-hour: indefinite |
| Tenant isolation | No cross-tenant data leakage, noisy neighbor protection |
Design Principles
Several principles shape the architecture:
1. Specialized engines beat general-purpose systems at scale. Metrics, traces, logs, and profiles have different query patterns and storage requirements. Using signal-specific storage engines reduces cost and improves performance compared to forcing all signals through a single system like ClickHouse or Elasticsearch.
2. Observability must work on day one. Grafana Beyla (eBPF) gives every service RED metrics and trace spans the moment it's deployed, before anyone writes instrumentation code. Without this baseline, what actually happens is: teams deprioritize SDK work, services run dark for months, and when an incident hits an uninstrumented service you're debugging with kubectl logs and guesswork.
3. Cost control must happen before storage. Value-based routing drops low-value signals (health checks, readiness probes, debug noise) before they reach Kafka. This reduces storage costs by 15-30% without losing data that matters during incidents.
4. The write path must tolerate burst traffic. Kafka buffers between collection and storage so each layer scales independently. Details in Section 8.1.
5. Queries should touch minimal data. Recording rules pre-compute expensive aggregations. Downsampling reduces query scan ranges by 20x for older data. The result: sub-second dashboard loads even over 90-day windows.
4. High-Level Approach & Technology Selection
4.1 Why This Architecture
Observability workloads are extremely write-heavy but require fast queries when incidents occur. The write-to-read ratio is roughly 1000:1. Millions of data points flow in every second. A handful of dashboards and alert rules query that data.
- Append-only writes. Metrics are immutable. "At 3:14pm, CPU was 45%" never changes. The storage engine only appends, never seeks and updates, which is far faster.
- Parallel everything. 500M writes per second on one machine is impossible. Spread the work across thousands of machines, each handling a slice of the data.
- No read-before-write. At 500M/sec, checking whether a row exists before writing would be the bottleneck. Skip the lookup entirely.
- Shard by metric identity. Assign each metric to a specific storage node based on its fingerprint (a hash of the metric name + labels). All samples for one metric always go to the same node. This keeps related data together, which is critical for compression and fast queries.
- Compress aggressively. Time-series data has unique properties (timestamps arrive at regular intervals, values often repeat or change slowly) that enable 12x compression with the right algorithm.
- Touch minimal data on reads. Inverted indexes map label selectors to series IDs, so queries skip irrelevant data.
- Tier retention automatically. Hot data (raw, 15 days) on fast storage. Warm data (5-min aggregates, 90 days) on object storage. Cold data (1-hour aggregates) archived indefinitely.
- Route by value, not just by time. Not all data is equally valuable. Errors and SLO-critical signals get full fidelity. Health checks, readiness probes, and debug noise get sampled or dropped before reaching storage. This reduces stored volume by 15-30% without losing anything that matters during an incident.
4.2 Technology Selection
The pipeline flows through seven stages -- collection, processing, buffering, aggregation, storage, query, and analytics. Each stage uses a purpose-built technology:
| Component | Technology | Why This Choice |
|---|---|---|
| Collection (baseline) | Grafana Beyla (eBPF) | Zero-config HTTP/gRPC/SQL instrumentation at kernel level. ~200 MB RAM per node. Produces RED metrics and basic trace spans without any SDK changes. |
| Collection (deep) | OpenTelemetry SDK + OTel Collector | CNCF standard. Pull (Prometheus scrape) and push (OTLP) in one agent. 700+ integrations. Vendor-neutral. Custom business metrics, detailed span attributes, profile collection. |
| Pipeline processing | OTel Collector processor chain | Value-based routing, metadata enrichment, deduplication, and cost-aware sampling as processor stages in the existing OTel Collector fleet. No separate product needed. |
| Ingestion buffer | Apache Kafka | Replay on storage failure (24h retention), fan-out to Flink + vminsert, burst absorption at 500M/sec. RF=3, acks=all. Partitioned by metric fingerprint for locality. |
| Stream aggregation | Apache Flink | Native streaming (not micro-batch like Spark). Managed state with RocksDB backend. Exactly-once via distributed snapshots. Handles 50M+ keys without GC pressure. |
| TSDB (hot, 0-15d) | VictoriaMetrics Cluster | 2.2M data points/sec per node baseline (5M+ on production-grade hardware, see Section 7.3). PromQL-compatible. Gorilla compression built-in. Shared-nothing architecture. |
| Long-term storage (warm/cold) | S3 + Apache Parquet | Columnar format supports partition pruning. ZSTD compression. Massively lower storage footprint vs hot TSDB. Inspired by Thanos and InfluxDB 3.0 FDAP stack. |
| Inverted index | VictoriaMetrics mergeset | Maps label combinations to series IDs. Handles 100M+ series without OOM. Disk-backed, not in-memory like Prometheus. |
| Query engine | MetricsQL | PromQL superset. Keeps metric names after aggregations. Better high-cardinality handling. WITH templates for complex queries. Drop-in replacement for existing PromQL users. |
| Alert evaluation | vmalert + Alertmanager | vmalert evaluates rules against vmselect, sharded across pods for scale (similar pattern to Grafana Mimir ruler). Alertmanager handles dedup, grouping, routing to PagerDuty/Slack/Opsgenie. |
| ML anomaly detection | Custom Kafka consumer + RocksDB | Seasonal baseline per metric (STL decomposition), anomaly scoring, cross-signal correlation. Lightweight: 5-10 nodes. Anomalies go to Slack, SLO burn rates remain the paging mechanism. |
| Service discovery | Kubernetes API + Consul | Auto-discover scrape targets. No manual configuration when services scale. Label-based relabeling for metric filtering. |
| Visualization | Grafana | Industry standard. Multi-datasource. Dashboard load target <3s. Variable templating for multi-service views. |
| Compression | Gorilla + ZSTD | Not sequential. Different compressors for different tiers. Gorilla compresses raw samples in vmstorage (hot, 0-15 days): 12x ratio, 1.37 bytes/sample. When data ages out, Flink produces aggregated doubles (min/max/avg/sum/count) that are no longer sequential time-series samples, so Gorilla doesn't apply. ZSTD compresses these Parquet blocks on S3 (warm/cold, 15+ days): ~3x ratio, tunable levels (1-22). |
| Multi-tenancy | Tenant ID header + per-tenant series limits | Inject tenant_id on remote write. Reject writes exceeding cardinality budget. Shuffle-sharding for isolation. |
| Trace backend | Grafana Tempo | S3-native Parquet storage. No index to maintain (bloom filters for trace ID lookup). TraceQL query language. Fraction of Jaeger+Elasticsearch infrastructure at scale. |
| Log backend | VictoriaLogs | Bloom filter per-token index on all fields. LogsQL for full-text search. vlinsert/vlselect/vlstorage cluster scales horizontally. Section 10.2 benchmarks against Loki and Elasticsearch. |
| Profile backend | Grafana Pyroscope | S3-native storage. pprof format ingestion. Span-to-profile correlation via OTel SDK. Same Grafana ecosystem as Tempo. Flame graph visualization built into Grafana. |
| Analytics (optional) | ClickHouse | Complementary wide-event store for cross-signal exploration. Not the primary store (see Section 4.6 for why). 10-20 nodes. |
| Context propagation | W3C Trace Context | Industry standard traceparent header. OTel SDK auto-injects for HTTP/gRPC. Manual injection required for Kafka message headers. |
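The last row above notes that the OTel SDK auto-injects the traceparent header for HTTP/gRPC but that Kafka message headers need manual injection. Below is a minimal Go sketch of that manual step using the OTel propagation API; the resulting map is what you would copy into the record headers of whichever Kafka client you use (that final copy is left as a comment because it is client-specific, and the hard-coded span context stands in for a real active span).

```go
package main

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/trace"
)

// injectTraceContext copies the active span context from ctx into a plain
// map; the caller attaches these key/value pairs as Kafka record headers.
func injectTraceContext(ctx context.Context) map[string]string {
	carrier := propagation.MapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)
	return carrier // e.g. {"traceparent": "00-<trace_id>-<span_id>-01"}
}

func main() {
	// Use the W3C Trace Context propagator (traceparent / tracestate headers).
	otel.SetTextMapPropagator(propagation.TraceContext{})

	// Stand-in for an active span; in a real producer this comes from the tracer.
	tid, _ := trace.TraceIDFromHex("4bf92f3577b34da6a3ce929d0e0e4736")
	sid, _ := trace.SpanIDFromHex("00f067aa0ba902b7")
	ctx := trace.ContextWithSpanContext(context.Background(), trace.NewSpanContext(
		trace.SpanContextConfig{TraceID: tid, SpanID: sid, TraceFlags: trace.FlagsSampled}))

	for k, v := range injectTraceContext(ctx) {
		// Attach these to the produced Kafka record (client-specific API).
		fmt.Printf("kafka header %s=%s\n", k, v)
	}
}
```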
When to add Kafka: Above ~50M metrics/sec, Kafka becomes necessary for replay, fan-out, and burst buffering. Below that, write directly via remote_write. Full cost-benefit analysis in Section 8.1.
4.3 Why VictoriaMetrics Over Alternatives
| Dimension | VictoriaMetrics | InfluxDB 3.0 | TimescaleDB | M3DB (Uber) | ClickHouse |
|---|---|---|---|---|---|
| Ingestion rate (single node) | 2.2M pts/sec | 330K pts/sec | 480K pts/sec | ~500K pts/sec | ~500K-1M pts/sec |
| RAM usage (4M series benchmark) | 6 GB | 20.5 GB | 2.5 GB | ~12 GB | ~1-2 TB per 1B series |
| Disk usage (4M series) | 3 GB | 18.4 GB | 52 GB | ~8 GB | ~4-6 bytes/sample |
| Query language | MetricsQL (PromQL superset) | SQL (via DataFusion) | SQL | M3QL + PromQL | SQL |
| Compression | Gorilla (1.37 bytes/sample) | Arrow + Parquet + ZSTD | PostgreSQL TOAST | M3TSZ (Gorilla variant) | LZ4/ZSTD on Float64 columns |
| Architecture | Shared-nothing (vminsert/vmselect/vmstorage) | Monolithic rewrite (Rust) | PostgreSQL extension | Quorum-based (3-way replication) | MergeTree, columnar |
| Operational complexity | Low (stateless query/insert, stateful storage only) | Medium (newer, less battle-tested) | Medium (PostgreSQL tuning required) | High (etcd dependency, complex topology) | Medium (general-purpose, not TSDB-optimized) |
Why VictoriaMetrics wins here:
VictoriaMetrics ingests 4–7x faster than InfluxDB at one-third the RAM (the exact multiplier depends on hardware and label cardinality; the 2.2M baseline is from public benchmarks on modest instances, while production-grade nodes with NVMe and tuned OS settings reach 5M+). The shared-nothing architecture means vminsert nodes are stateless and horizontally scalable. Each vmstorage node owns its shard independently. No quorum overhead, no coordination on writes. For a 500M/sec write-heavy workload, that absence of coordination is critical.
The three components: VictoriaMetrics cluster splits into three specialized roles:
- vminsert (write router): Receives metrics and uses consistent hashing on the metric fingerprint to route each sample to the correct vmstorage node. Stateless (no local data), scales horizontally.
- vmstorage (storage engine): Each node owns a shard of time series data on local NVMe. Handles Gorilla compression, maintains the inverted index, and runs compaction. The only stateful component.
- vmselect (query engine): Receives MetricsQL queries, fans them out to all vmstorage nodes in parallel, merges results, and applies functions like rate(). Also stateless.
Each dimension (write throughput, storage capacity, query concurrency) scales independently by adding nodes to the relevant component.
The PromQL compatibility matters too. The entire alerting ecosystem (Alertmanager, Grafana, recording rules) speaks PromQL. Choosing a SQL-native TSDB like InfluxDB 3.0 means rebuilding that tooling from scratch. MetricsQL extends PromQL without breaking it.
M3DB is battle-proven at Uber (6.6B series), but the operational complexity is much higher. etcd cluster management, complex topology, and 3-way quorum writes add latency and failure modes. VictoriaMetrics achieves comparable scale with a simpler operational model. Use Prometheus as the format and query language standard. Replace the storage layer with VictoriaMetrics.
Licensing note: VictoriaMetrics single-node is Apache 2.0, fully open source. The cluster version is also open source for core functionality, but enterprise features like downsampling, multi-tenant rate limiting, deduplication across HA pairs, and advanced RBAC require a commercial license. For most teams the open-source cluster version covers what you need. If licensing flexibility is a hard constraint, Prometheus paired with Thanos or Cortex for long-term storage is the safer path. You lose some ingestion performance, but the licensing picture is cleaner.
4.4 Why Flink for Stream Aggregation
| Dimension | Apache Flink | Spark Structured Streaming | Kafka Streams |
|---|---|---|---|
| Processing model | True streaming (event-by-event) | Micro-batch (100ms+ intervals) | True streaming |
| State management | RocksDB backend (spills to disk) | Heap-based (GC pressure) | RocksDB or in-memory |
| Exactly-once | Distributed snapshots (Chandy-Lamport) | Offset tracking | Kafka transactions |
| Scalability | Independent of Kafka (separate cluster) | Independent cluster | Tied to Kafka partitions |
| Windowing | Event-time, processing-time, session | Event-time, processing-time | Event-time, session |
| State size at 500M/sec | 50M+ keys, multi-GB, no GC pauses | Multi-second GC pauses at scale | Limited by Kafka partition count |
For this system: 500M metrics/sec flowing through recording rules, rollup aggregations, and SLO burn rate calculations. At peak, that's 50 million active aggregation keys in state. Flink's RocksDB state backend handles this without heap pressure. Spark's JVM heap would cause multi-second GC pauses that violate the 2-second ingestion-to-visibility target.
Why not the Prometheus/Mimir ruler? The ruler evaluates recording rules by querying the TSDB every 15-30 seconds. At moderate scale that works. At 500M metrics/sec, each evaluation cycle competes with dashboard and alert queries for vmselect capacity. Flink processes metrics inline during ingestion with zero query overhead. The trade-off: Flink adds operational complexity. Below ~50M metrics/sec, the ruler is simpler and sufficient.
Scope: metrics only. Flink processes metrics only. Traces, logs, and profiles bypass Flink entirely (Sections 9-11). They don't need pre-aggregation or downsampling.
How Grafana queries raw vs aggregated data: Flink writes aggregated metrics back into VictoriaMetrics as separate metric names. For example, Flink computes a 1-minute rolling average and writes it as http_response_time_avg_1m{service="checkout"}. The raw metric http_response_time_seconds still flows into VictoriaMetrics separately. Both live in the same VictoriaMetrics instance, different metric names. Grafana doesn't route automatically. The dashboard author picks the metric name in each panel's PromQL query. A 6-hour ops panel uses: avg(http_response_time_avg_1m{service="checkout"}). An incident investigation panel queries raw data: histogram_quantile(0.99, rate(http_response_time_seconds_bucket{service="checkout"}[5m])). Same data source, different query.
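To make the raw-vs-aggregated split concrete, here is a small Go sketch of the kind of keyed tumbling-window average the Flink job computes before writing the result back under the _avg_1m name. The real job runs as a Flink operator with RocksDB-backed state and keys by metric fingerprint; this only illustrates the per-window aggregation logic, and the field names are illustrative.

```go
package main

import "fmt"

// sample is one raw data point from the metrics-ingestion topic.
type sample struct {
	service string  // in production the key is the metric fingerprint
	tsMs    int64   // event timestamp in milliseconds
	value   float64 // e.g. http_response_time_seconds
}

// windowKey identifies one (series, 1-minute window) aggregation bucket.
type windowKey struct {
	service     string
	windowStart int64
}

const windowMs = 60_000 // 1-minute tumbling window

// aggregate computes per-window averages and returns them keyed by the
// aggregated series, mirroring how the job emits
// http_response_time_avg_1m{service="..."} back into VictoriaMetrics.
func aggregate(samples []sample) map[string]float64 {
	type acc struct {
		sum   float64
		count int64
	}
	state := map[windowKey]*acc{}
	for _, s := range samples {
		k := windowKey{s.service, s.tsMs - s.tsMs%windowMs}
		a, ok := state[k]
		if !ok {
			a = &acc{}
			state[k] = a
		}
		a.sum += s.value
		a.count++
	}
	out := map[string]float64{}
	for k, a := range state {
		series := fmt.Sprintf(`http_response_time_avg_1m{service=%q} @ %d`, k.service, k.windowStart)
		out[series] = a.sum / float64(a.count)
	}
	return out
}

func main() {
	samples := []sample{
		{"checkout", 1_709_827_200_000, 0.120},
		{"checkout", 1_709_827_210_000, 0.480},
		{"checkout", 1_709_827_265_000, 0.090}, // falls into the next 1-minute window
	}
	for series, avg := range aggregate(samples) {
		fmt.Printf("%s = %.3f\n", series, avg)
	}
}
```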
4.5 Observability Pipeline: Value-Based Data Routing
Traditional observability pipelines treat all data equally: every metric, span, and log line gets the same treatment from collection to storage. At 500M metrics/sec, this is wasteful. Health check endpoints, Kubernetes readiness probes, and debug-level application noise account for 15-30% of ingestion volume but are almost never queried during incidents.
Value-based routing makes cost-aware decisions at the OTel Collector processor chain level, before data reaches Kafka:
# OTel Collector pipeline configuration (simplified)
processors:
# Step 1: Enrich with ownership metadata from platform registry
metadata_enrichment:
endpoint: "http://metadata-service:8080/v1/lookup"
cache_ttl: 5m
enrich_fields:
- team_owner
- cost_center
- slo_tier # gold, silver, bronze
# Step 2: Value-based routing decisions
routing:
rules:
# Full fidelity: errors, SLO-critical signals
- condition: 'attributes["http.status_code"] >= 500'
action: pass_through
- condition: 'resource["slo_tier"] == "gold"'
action: pass_through
# Drop: health checks, readiness probes
- condition: 'attributes["http.route"] == "/healthz"'
action: drop
- condition: 'attributes["http.route"] == "/readyz"'
action: drop
# Sample: debug-level signals, low-priority services
- condition: 'resource["slo_tier"] == "bronze"'
action: probabilistic_sample
sampling_percentage: 10
# Default: pass through
- condition: 'true'
action: pass_through
# Step 3: Deduplicate overlapping scrape targets
deduplication:
# When both Beyla (eBPF) and OTel SDK produce the same metric,
# prefer the SDK version (richer labels) and drop the Beyla duplicate
strategy: prefer_sdk_over_ebpf
match_on: [metric_name, service_name]

Why this lives in the OTel Collector, not a separate product: The processor chain already handles batching, filtering, and transformation. Adding routing decisions here avoids an extra network hop and keeps the pipeline simple. Products like Cribl, Mezmo Pipeline, and Edge Delta solve the same problem for organizations that prefer a managed pipeline. At this scale, the OTel Collector's native processors are sufficient and avoid vendor lock-in.
Impact on storage costs: The 15-30% volume reduction from value-based routing (Design Principle #3) translates to 75-150M fewer samples/sec reaching storage. Net effect: slightly lower costs despite adding profiling as a fourth signal.
Metadata enrichment service: A lightweight gRPC service backed by an in-memory cache (refreshed from the platform's service registry every 5 minutes). It maps (service_name, namespace) to (team_owner, cost_center, slo_tier). The OTel Collector caches lookups locally (5-minute TTL) so the metadata service handles only ~600 lookups/sec across the fleet, not per-sample lookups. If the metadata service is unreachable and the local cache is cold (e.g., a freshly started collector), the pipeline defaults to slo_tier=silver and routes data through at standard fidelity. No data is dropped due to enrichment failures.
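A minimal sketch of the collector-side behavior described above: a local TTL cache in front of the metadata service, falling back to slo_tier=silver when the service is unreachable and the cache is cold. The Enricher type and lookup function are illustrative, not an existing OTel Collector extension.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Ownership is the enrichment payload attached to every signal.
type Ownership struct {
	TeamOwner  string
	CostCenter string
	SLOTier    string // gold, silver, bronze
}

type cacheEntry struct {
	value     Ownership
	fetchedAt time.Time
}

// Enricher caches (service, namespace) -> Ownership lookups for cacheTTL so
// the metadata service sees fleet-level, not per-sample, traffic.
type Enricher struct {
	mu       sync.Mutex
	cache    map[string]cacheEntry
	cacheTTL time.Duration
	lookup   func(service, namespace string) (Ownership, error) // remote call
}

func (e *Enricher) Enrich(service, namespace string) Ownership {
	key := service + "/" + namespace
	e.mu.Lock()
	defer e.mu.Unlock()
	if entry, ok := e.cache[key]; ok && time.Since(entry.fetchedAt) < e.cacheTTL {
		return entry.value
	}
	owner, err := e.lookup(service, namespace)
	if err != nil {
		// Metadata service unreachable and cache cold: never drop data,
		// default to standard fidelity (silver tier).
		return Ownership{SLOTier: "silver"}
	}
	e.cache[key] = cacheEntry{value: owner, fetchedAt: time.Now()}
	return owner
}

func main() {
	e := &Enricher{
		cache:    map[string]cacheEntry{},
		cacheTTL: 5 * time.Minute,
		lookup: func(service, namespace string) (Ownership, error) {
			if service == "checkout" {
				return Ownership{TeamOwner: "payments", CostCenter: "cc-42", SLOTier: "gold"}, nil
			}
			return Ownership{}, errors.New("unknown service")
		},
	}
	fmt.Printf("%+v\n", e.Enrich("checkout", "prod"))   // cached gold-tier lookup
	fmt.Printf("%+v\n", e.Enrich("mystery-svc", "dev")) // falls back to silver
}
```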
4.6 Why Not a Unified ClickHouse Store?
A reasonable question: why not store everything in ClickHouse? Several production observability platforms do exactly this. SigNoz stores metrics, traces, and logs in ClickHouse. Highlight.io uses ClickHouse for session replay, traces, and logs. Uptrace uses ClickHouse for all signals. The approach works and has real advantages.
The case for ClickHouse as a unified store:
ClickHouse excels at high-cardinality analytical queries. Its columnar storage with LZ4/ZSTD compression achieves excellent compression ratios. MergeTree engine handles append-heavy workloads efficiently. SQL is the query language, which most engineers already know. A single storage engine means simpler operations: one cluster to manage, one set of failure modes to understand, one backup strategy.
SigNoz demonstrates this at moderate scale: metrics, traces, and logs in ClickHouse, with a custom query layer on top that speaks PromQL for metrics and a trace query language for spans. It works well for organizations processing up to ~10-50M metrics/sec.
The case against at 500M metrics/sec:
At hyperscale, specialized engines win on efficiency (see Section 4.3 comparison table for the full benchmark). The crossover point is roughly 50M metrics/sec. Below that, ClickHouse as a unified store is simpler and the efficiency gap doesn't matter. Above that, the 3-4x compression advantage and 2-5x ingestion throughput advantage of a specialized TSDB compounds into hundreds of nodes and petabytes of storage difference.
For traces and logs, the gap is narrower. Tempo on S3 is mostly an operational simplicity choice (no cluster to manage for trace storage). VictoriaLogs' bloom filter approach is competitive with ClickHouse for log search but more memory-efficient at this scale.
The compromise: ClickHouse as a complementary analytics layer.
Instead of replacing the specialized stores, add ClickHouse as an optional wide-event exploration layer:
Metrics → VictoriaMetrics (primary, full fidelity)
Traces → Tempo (primary, full fidelity)
Logs → VictoriaLogs (primary, full fidelity)
Profiles → Pyroscope (primary, full fidelity)
↓
All signals → ClickHouse (secondary, sampled wide events for cross-signal exploration)
The Flink pipeline exports a sampled stream of wide events to ClickHouse: denormalized rows that combine metric context, trace attributes, log snippets, and profile metadata into a single queryable record. This opens up queries that no single pillar store can answer alone:
- "Show me all requests where latency > 2s AND the CPU profile shows GC pressure AND the downstream DB logs show lock contention." This requires joining traces, profiles, and logs.
- "Which team's services have the worst error-budget burn this week, broken down by endpoint?" This requires joining metrics with service ownership metadata.
ClickHouse handles these exploratory, ad-hoc queries well. The primary pillar stores handle the high-throughput, latency-sensitive operational queries (dashboards, alerts, trace lookups).
Sizing: 10-20 ClickHouse nodes for the analytics layer. Receives ~1-5% of total signal volume (sampled). Retention: 30-90 days. This is a "nice to have" capability, not a critical path component. The platform works without it.
5. High-Level Architecture
5.1 Bird's-Eye View
Component glossary (numbered steps match the diagram):
(1) Collection (two-tier model):
- eBPF baseline (Grafana Beyla): Deployed as a DaemonSet (~3,000 nodes). Beyla attaches eBPF probes to the kernel's network and syscall layers. It automatically produces RED metrics (http_server_request_duration_seconds, http_server_request_body_size_bytes) and basic trace spans for every HTTP/gRPC/SQL call without any application code changes. ~200 MB RAM per node. Beyla forwards its output to the local OTel Collector for enrichment and routing.
- Pull (same K8s cluster): The platform's OTel Collector DaemonSet runs one collector per K8s node (~3,000 collectors total). It scrapes local pods' /metrics endpoints over mTLS. No cross-network hop. Service Discovery (K8s API, Consul) automatically finds new scrape targets as services scale. It also receives Beyla's output and deduplicates where SDK instrumentation overlaps.
- Push (external infrastructure): Services on separate clusters, on-prem, or serverless can't be scraped. They push metrics, traces, logs, and profiles via OTLP to the OTel Collector Gateway over HTTPS, authenticated by per-tenant API key.
(1b) Pipeline processing:
- The OTel Collector processor chain applies value-based routing (Section 4.5) before forwarding to Kafka: metadata enrichment from the platform registry, health check/readiness probe filtering, SLO-tier-based sampling, and Beyla/SDK deduplication. This reduces stored volume by 15-30%.
(2) Kafka (ingestion buffer):
- metrics-ingestion: 10,000 partitions, 24h retention, partitioned by metric fingerprint. Enables replay if storage crashes.
- traces-raw: partitioned by trace_id so all spans from one trace land together.
- logs-raw: partitioned by service_name for locality.
- profiles-raw: partitioned by service_name. Lower volume than other signals (~1 GB/sec).
(3)→(4) Processing:
- Flink computes recording rules, 5-minute rollups (20x data reduction), and SLO burn rate calculations on the metrics stream.
- Tail-Based Sampler keeps errors, p99 latency, and a small random baseline (Section 9.2 covers the full sampling strategy).
- Log Parser extracts structured fields (service name, log level, trace_id) and routes by severity.
- ML Anomaly Detector consumes the metrics stream from Kafka, maintains seasonal baselines per metric in RocksDB, and emits anomaly signals to Alertmanager when deviations exceed thresholds (Section 8.7.1).
(5) Storage:
- vminsert → vmstorage → S3 Parquet: Stateless write router hashes metrics to storage nodes. vmstorage compresses with Gorilla (1.37 bytes/sample) on NVMe (15-day hot tier). Aged data exports to S3 as ZSTD-compressed Parquet (warm 90 days, cold 1 year).
- Tempo Ingesters → S3 Parquet: Batches sampled spans into Parquet blocks on S3. Bloom filters enable fast trace ID lookup.
- vlinsert → vlstorage: Distributes logs across storage nodes. Bloom filter index on every field. Local NVMe storage.
- Pyroscope Ingesters → S3: Batches profile data into S3 blocks. Indexed by service + time range + profile type. Span-to-profile correlation via span_id.
(6) Query:
- vmselect: Fans out MetricsQL queries to all vmstorage nodes, merges results.
- Tempo Query: TraceQL queries against S3 Parquet blocks.
- vlselect: LogsQL full-text search across vlstorage nodes.
- Pyroscope Query: Profile queries by service, time range, and span_id. Flame graph rendering.
- Grafana: Unified dashboard querying all four backends. Exemplars link metrics to traces. trace_id links traces to logs. span_id links traces to profiles.
- ClickHouse Analytics (optional): Cross-signal wide-event queries for exploratory analysis.
Control plane (tenant configs, rules, budgets, RBAC) and data plane (everything that touches telemetry) scale and deploy independently.
All four signals share the same OTel Collectors and Kafka cluster, then split at the storage layer: each signal routes to a storage engine built for its access pattern. Section 6.3 details the full Kafka topic layout.
Pull vs Push collection. The OTel Collector normalizes both models into the same protobuf format.
How each signal reaches the platform:
| Signal | Same K8s cluster | External infrastructure | Protocol |
|---|---|---|---|
| Metrics | Pull: Collector scrapes /metrics (local, mTLS). eBPF: Beyla auto-generates RED metrics. | Push: App sends OTLP to collector gateway (HTTPS) | Prometheus text (pull), OTLP protobuf (push), or eBPF (auto) |
| Traces | Push: OTel SDK sends spans to local collector (gRPC). eBPF: Beyla generates basic spans. | Push: OTel SDK sends spans to collector gateway (gRPC/HTTPS) | OTLP protobuf over gRPC (default) or HTTP |
| Logs | File tail: Collector reads /var/log/pods/*/ (no network) | Push: App sends OTLP logs to collector gateway (gRPC/HTTPS) | Local file I/O (same cluster) or OTLP protobuf (external) |
| Profiles | Push: OTel SDK sends profiles to local collector (gRPC) | Push: OTel SDK sends profiles to collector gateway (gRPC/HTTPS) | OTLP profiles protobuf (experimental) or Pyroscope SDK push |
After Kafka, each signal diverges to its specialized storage pipeline (see numbered steps above). Grafana correlates all four via exemplars, trace_id, and span_id fields.
6. Data Model
6.1 Metric Wire Format
Every metric sample follows the Prometheus data model:
metric_name{label1="value1", label2="value2"} float64_value unix_timestamp_ms
Example:
http_requests_total{service="checkout", method="POST", status="200", region="us-east-1"} 142857 1709827200000
Why fingerprints? At 500M metrics/sec, the system needs deterministic routing: each metric must always land on the same Kafka partition and storage node. A fingerprint hash of the metric name + labels provides this.
The series identity (fingerprint) is computed as:
fingerprint = FNV-1a(sorted(metric_name + labels)) // FNV-1a: a fast, non-cryptographic hash function
↑ Labels sorted alphabetically by key name before hashing.
Different producers may emit labels in different order
(e.g., {service, method} vs {method, service}).
Sorting guarantees the same series always produces the same fingerprint.
Example:
metric: http_requests_total
labels: {method="POST", region="us-east-1", service="checkout", status="200"}
fingerprint: hash("http_requests_total\x00method\x00POST\x00region\x00us-east-1\x00service\x00checkout\x00status\x00200")
→ 0x7a3f2b1c (used for Kafka partitioning and storage sharding)
Why the fingerprint matters: This single hash drives data locality at every layer.
- Kafka uses it as the partition key. All samples for one metric land on the same partition, so Flink can process that metric's recording rules without fetching data from other partitions.
- vminsert uses it to route all data for one metric to the same storage node. Keeping data together on one node is critical for compression (the compressor needs consecutive values to find patterns) and for queries (look up one node, not all 200).
- S3 Parquet files are keyed by fingerprint range. When querying cold data, the system skips entire files whose fingerprint range doesn't match, instead of scanning everything.
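A small Go sketch of the fingerprint computation and the Kafka partition it implies, assuming FNV-1a over the metric name and alphabetically sorted labels with the \x00 separators shown above (the 10,000-partition count matches the metrics-ingestion topic in Section 6.3).

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint hashes metric name + alphabetically sorted labels with FNV-1a.
// Sorting makes the fingerprint independent of the label order producers emit.
func fingerprint(name string, labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	h.Write([]byte(name))
	for _, k := range keys {
		h.Write([]byte{0}) // \x00 separator between fields
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write([]byte(labels[k]))
	}
	return h.Sum64()
}

func main() {
	fp := fingerprint("http_requests_total", map[string]string{
		"method": "POST", "region": "us-east-1", "service": "checkout", "status": "200",
	})
	const partitions = 10_000 // metrics-ingestion topic (Section 6.3)
	fmt.Printf("fingerprint=0x%x  kafka_partition=%d\n", fp, fp%partitions)
	// vminsert applies the same fingerprint with consistent hashing to pick a vmstorage node.
}
```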
6.2 Metric Types
| Type | Description | Storage Behavior |
|---|---|---|
| Counter | Monotonically increasing value (resets on process restart) | Store raw value. rate() computed at query time. |
| Gauge | Arbitrary value that goes up and down | Store raw value. |
| Histogram | Observations bucketed into configurable ranges. Each bucket counts how many observations fell at or below that boundary. | Expands to multiple series: _bucket{le="X"} (one per bucket, where le = "less than or equal to", the upper boundary), plus _sum and _count. A histogram with 10 buckets = 12 time series per unique label combination. |
| Summary | Pre-computed quantiles on the client side | Expands to {quantile="X"}, _sum, _count. |
Why histograms explode into multiple series:
A counter metric (http_requests_total) is one number: "142,857 requests." One series. A histogram metric (http_request_duration_seconds) is a distribution: "How many requests took 0-10ms? How many took 10-50ms? 50-100ms?" Each bucket is a separate series. With 10 buckets plus a _sum and _count, one histogram becomes 12 series. Now multiply by every unique label combination and this is where cardinality explodes. A single histogram metric with 10 buckets, exposed by 1,000 pods, with 5 label dimensions (service, method, status, region, zone) each having moderate cardinality:
Series count = buckets × pods × label_combinations
= 12 × 1,000 × (5 services × 4 methods × 5 statuses × 3 regions × 2 zones)
= 12 × 1,000 × 600
= 7,200,000 series from ONE histogram metric
The modern fix: native/exponential histograms. OpenTelemetry exponential histograms (and Prometheus native histograms, which share the same underlying format) solve this problem by encoding the entire distribution into a single time series instead of expanding into 12+ separate _bucket series. The histogram uses exponentially-spaced bucket boundaries determined by a scale parameter, so there are no fixed le labels to fan out. A single series stores the full distribution, and the receiver merges buckets on ingestion. At the storage layer, VictoriaMetrics v1.100+ supports native histogram ingestion, and Grafana can render them natively. For the 7.2M-series example above, exponential histograms reduce the series count to 1,000 pods × 600 label combinations = 600,000 series -- a 12x reduction. The trade-off: older PromQL functions like histogram_quantile() work differently on native histograms, and client libraries need explicit opt-in. But for any new instrumentation at this scale, exponential histograms should be the default.
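For Go services, opting a latency histogram into the exponential representation is an SDK view change rather than new instrumentation. A sketch using the go.opentelemetry.io/otel/sdk/metric views API follows; the instrument name mirrors the example above, and since the views API has shifted across SDK releases, treat this as an outline rather than a drop-in snippet.

```go
package main

import (
	"context"

	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	// Route the fixed-bucket latency histogram to the base-2 exponential
	// (native) representation: one series per label set instead of 12+
	// _bucket series.
	expView := sdkmetric.NewView(
		sdkmetric.Instrument{
			Name: "http_request_duration_seconds",
			Kind: sdkmetric.InstrumentKindHistogram,
		},
		sdkmetric.Stream{
			Aggregation: sdkmetric.AggregationBase2ExponentialHistogram{
				MaxSize:  160, // max bucket count; the scale auto-adjusts to fit observations
				MaxScale: 20,
			},
		},
	)

	reader := sdkmetric.NewManualReader()
	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(reader),
		sdkmetric.WithView(expView),
	)
	defer provider.Shutdown(context.Background())

	hist, _ := provider.Meter("checkout").Float64Histogram("http_request_duration_seconds")
	hist.Record(context.Background(), 0.123) // recorded into the exponential histogram
}
```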
6.3 Kafka Topic Layout
Topic: metrics-ingestion
Producer: OTel Collectors (scrape + OTLP push + Beyla eBPF)
Consumer: Flink (recording rules, rollups) + vminsert (raw metric writes) + ML Anomaly Detector
Partitions: 10,000
Replication Factor: 3
Retention: 24 hours (replay buffer for recovery)
Partitioning key: metric fingerprint (FNV-1a hash)
Value format: Protobuf (TimeSeries message)
Topic: metrics-aggregated
Producer: Flink (processed recording rules and rollups)
Consumer: vminsert (aggregated metric writes)
Partitions: 5,000
Replication Factor: 3
Retention: 12 hours
Topic: metrics-alerts
Producer: Flink (alert events when burn rate thresholds breach) + ML Anomaly Detector
Consumer: Alertmanager (routes and deduplicates alerts)
Partitions: 100
Replication Factor: 3
Retention: 7 days
Topic: traces-raw
Producer: OTel Collectors (tail-based sampler output)
Consumer: Tempo Ingesters
Partitions: 2,000
Replication Factor: 3
Retention: 24 hours
Partitioning key: trace_id (all spans from one trace co-locate)
Value format: Protobuf (OTLP ExportTraceServiceRequest)
Topic: logs-raw
Producer: OTel Collectors (filelog receiver + pipeline processors)
Consumer: vlinsert (VictoriaLogs write nodes)
Partitions: 1,000
Replication Factor: 3
Retention: 24 hours
Partitioning key: service_name (locality for per-service indexing)
Value format: Protobuf (OTLP ExportLogsServiceRequest)
Topic: profiles-raw
Producer: OTel Collectors (profile signal from SDKs)
Consumer: Pyroscope Ingesters
Partitions: 500
Replication Factor: 3
Retention: 12 hours
Partitioning key: service_name
Value format: Protobuf (pprof-encoded profiles)
6.4 Downsampled Parquet Schema (Warm/Cold Tier)
When raw metrics age past 15 days, they don't stay in vmstorage. Flink writes 5-minute rollups to S3 as Parquet files during ingestion (Section 8.6). A nightly batch job further aggregates those into 1-hour rollups. vmselect reads these files when a query spans beyond the hot tier, using partition pruning on tenant, resolution, and time range to skip irrelevant files. Each row stores five aggregate values (min, max, avg, sum, count) because different query functions need different aggregates: max() reads value_max, rate() uses value_sum / value_count, and so on.
-- Parquet file layout on S3
-- Path: s3://metrics-archive/{tenant_id}/{resolution}/{year}/{month}/{day}/{fingerprint_range}.parquet
-- Schema:
CREATE TABLE downsampled_metrics (
fingerprint UINT64, -- Metric identity
metric_name STRING,
labels MAP<STRING, STRING>,
timestamp_ms INT64, -- Aligned to resolution boundary
value_min DOUBLE,
value_max DOUBLE,
value_avg DOUBLE,
value_sum DOUBLE,
value_count INT64, -- Number of raw samples in this window
resolution STRING -- '5m' or '1h'
)
PARTITIONED BY (tenant_id, resolution, year, month);

6.5 Wide-Event Export Schema (ClickHouse Analytics Layer)
When the optional ClickHouse analytics layer is enabled, Flink exports a sampled stream of wide events: denormalized records that combine context from multiple signals into a single queryable row:
-- ClickHouse table for cross-signal exploration (key columns shown)
CREATE TABLE wide_events (
event_time DateTime64(3),
trace_id String,
span_id String,
service_name LowCardinality(String),
slo_tier LowCardinality(String),
error_rate_5m Float64, -- metric context
span_duration_ms Float64, -- trace context
log_message String, -- log snippet
has_profile UInt8 -- profile context
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (service_name, event_time)
TTL event_time + INTERVAL 90 DAY;

This table receives ~1-5% of total events (sampled at the Flink export stage). It's not the source of truth for any signal. The pillar-native stores are. It exists for exploratory queries that span multiple signals.
7. Back-of-the-Envelope Estimation
7.1 Ingestion Volume
Target: 500M metrics/sec (peak)
Source split (assumed):
Infrastructure metrics (CPU, memory, disk, network): 30% → 150M/sec
Application metrics (request rate, latency, errors): 40% → 200M/sec
Business metrics (orders, revenue, conversions): 10% → 50M/sec
Custom/user-defined metrics: 20% → 100M/sec
Daily volume:
500M/sec × 86,400 sec/day ← 86,400 = 24 × 60 × 60
= 43.2 trillion samples/day ← ~43T samples
At 40% average utilization (not peak all day):
200M/sec × 86,400 = 17.28T samples/day ← actual sustained rate
Pipeline reduction (value-based routing):
~20% of raw volume dropped/sampled before Kafka ← health checks, probes, debug
Effective ingestion: ~400M/sec peak, ~160M/sec sustained
7.2 Storage Sizing
Per-sample size:
Raw (uncompressed): 16 bytes (8B timestamp + 8B float64 value)
Gorilla compressed: 1.37 bytes/sample ← 16 / 1.37 = 12x compression
With labels + index overhead: ~2 bytes/sample ← use this for all estimates below
─── Hot tier (raw, 15 days) ───
400M/sec × 2 bytes = 800 MB/sec ← post-pipeline ingestion rate
800 MB/sec × 86,400 = 69 TB/day ← one day of hot storage
69 TB × 15 days = 1,035 TB (~1.0 PB) ← 15 days at peak
At 40% average utilization:
1.0 PB × 0.4 = ~414 TB ← actual provisioned capacity
─── Warm tier (5-min aggregates in Parquet on S3, 90 days) ───
400M/sec ÷ 20 = 20M aggregated samples/sec ← 20 raw samples per 5-min window become 1
20M × 5 values = 100M values/sec ← each aggregate stores min, max, avg, sum, count
100M × 8 bytes = 800 MB/sec ← raw doubles, no Gorilla (not sequential anymore)
800 MB/sec ÷ 3 (ZSTD) = 267 MB/sec ← ZSTD compresses ~3x on float64 arrays
267 MB/sec × 86,400 = 23 TB/day ← one day of warm storage
23 TB × 90 days = 2,070 TB (~2.1 PB) ← 90 days of warm
─── Cold tier (1-hour aggregates in Parquet on S3, 1 year) ───
20M/sec ÷ 12 = ~1.67M aggregated samples/sec ← twelve 5-min windows per hour
1.67M × 5 × 8 bytes = 67 MB/sec ← same 5-value pattern as warm
67 MB/sec ÷ 3 (ZSTD) = ~22 MB/sec ← after compression
22 MB/sec × 86,400 = 1.9 TB/day ← one day of cold storage
1.9 TB × 365 days = 694 TB ← one year of cold
─── Total ───
Hot: 414 TB (vmstorage NVMe, 15 days at 40% avg)
Warm: 2,070 TB (S3 Parquet, ~2.1 PB, 90 days)
Cold: 694 TB (S3 Parquet, 1 year)
Total: ~3.2 PB
Note: The pipeline's value-based routing reduces total storage by ~20% compared to storing everything. The savings more than offset the new profiling signal's storage requirements (Section 7.5).
7.3 Component Sizing
Beyla eBPF Agents:
Deployed as DaemonSet: 1 per node, ~3,000 agents
~200 MB RAM per agent (eBPF maps + metric aggregation)
Minimal CPU overhead: eBPF programs run in kernel space
OTel Collectors:
Each collector: ~100-200K metrics/sec on c6g.xlarge (4 vCPU, 8GB)
Range depends on pipeline complexity: bare receiver+exporter ≈ 200K,
full chain (enrichment, routing, dedup, tail sampling) ≈ 100-150K
500M / 170K = ~3,000 collectors ← Go-based, CPU-efficient for protobuf
Deployed as DaemonSet (1 per node) + dedicated fleet for high-volume targets
Additional ~200 MB RAM for pipeline processors (routing, enrichment, dedup)
Kafka:
OTel Collectors batch ~2,000 samples per Kafka message (batch processor)
400M samples/sec ÷ 2,000 = 200K messages/sec ← message rate, not sample rate
Byte throughput: 400M × ~50 bytes/sample (compressed batch) = ~20 GB/sec
With RF=3: ~60 GB/sec cross-cluster replication
Per-broker sustained: ~300 MB/sec (RF=3, acks=all) ← byte throughput is the real bottleneck
60,000 MB / 300 MB = 200 brokers ← sized on bytes, not message count
Deployed as multiple regional clusters (e.g., 100 per region in a 2-region setup)
Partitions: 10,000 ← enough for parallelism without overhead
Flink:
Each TaskManager: ~1M events/sec (4 slots × 250K each) ← Amazon Managed Flink: ~28K/sec per KPU;
dedicated r6g.2xlarge has much more headroom
400M / 1M = 400 TaskManagers ← 1,600 parallel tasks (4 slots each)
VictoriaMetrics (vmstorage):
Benchmark: 100M on 46 nodes = 2.17M/node
On i3.4xlarge (16 vCPU, 122GB, 3.8TB NVMe): ~5M/sec ← ~2.3x benchmark with 4x hardware
400M / 5M = 80 vmstorage nodes
× 2 (RF) = 160 vmstorage nodes
160 × 3.8 TB = 608 TB ← covers 414 TB hot + WAL/compaction headroom
vminsert (stateless):
400M / 12.5M per node = 32 vminsert nodes
vmselect (stateless):
Start with 20 nodes. Scale based on dashboard concurrency.
ML Anomaly Detector:
Kafka consumer group, 5-10 nodes (c6g.2xlarge)
RocksDB state: ~50 GB per node (seasonal baselines for monitored metrics)
Processes a subset of metrics (top 100K series per tenant by query frequency)
7.3.1 Right-Sizing for Smaller Scale
The numbers above target 500M metrics/sec. The architecture scales down linearly:
| Scale | Beyla Agents | OTel Collectors | Kafka Brokers | Flink TaskManagers | vmstorage | vminsert |
|---|---|---|---|---|---|---|
| 10M/sec | 60 | 60 | 10 | 10 | 4 | 2 |
| 50M/sec | 300 | 300 | 25 | 50 | 20 | 8 |
| 100M/sec | 600 | 600 | 50 | 100 | 40 | 16 |
| 500M/sec | 3,000 | 3,000 | 200 | 400 | 160 | 32 |
At 10-50M metrics/sec, Kafka and Flink can be dropped entirely in favor of a direct OTel Collector to vminsert pipeline using Prometheus remote_write, with recording rules moving to the VictoriaMetrics vmalert component (query-based evaluation, similar to the Mimir ruler). The jump to Kafka + Flink is justified when direct remote_write cannot absorb burst traffic or when stream-based recording rules outperform query-based evaluation at the target scale. The ML anomaly detector and ClickHouse analytics layer are also optional at smaller scales.
7.4 Query Layer Sizing
Dashboard query assumptions:
50 concurrent Grafana users
20 panels per dashboard
30-second refresh interval
Average query touches 50K series (with recording rules)
Query throughput:
50 × 20 / 30 = 33 queries/sec ← steady-state
Peak (all refresh at once): ~200 queries/sec
vmselect fan-out:
33 queries/sec × 160 nodes = 5,280 vmstorage req/sec ← every query hits every node
Peak: 200 × 160 = 32,000 req/sec
vmselect sizing:
Each node: ~100 concurrent queries
20 nodes = 2,000 concurrent queries ← 10x headroom over 200 peak
Memory: 32 GB per node for result merging + caching
Result cache:
Key: (query_hash, time_range, step)
10 GB per node (LRU eviction)
Hit rate: 60-80% ← dashboard auto-refresh is repetitive
Effective query reduction: 3-5x
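The 60-80% hit rate depends on repeated dashboard refreshes mapping to the same cache entry. Below is a Go sketch of the key normalization that makes that happen: aligning the query time range down to step boundaries so refreshes 30 seconds apart over the same moving window collide on the cache key. The alignment scheme is illustrative, not vmselect's exact implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cacheKey identifies one cached query result: (query_hash, time_range, step).
type cacheKey struct {
	queryHash    uint64
	startAligned int64
	endAligned   int64
	stepMs       int64
}

func hashQuery(q string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(q))
	return h.Sum64()
}

// newCacheKey aligns start/end down to the step boundary. Two dashboard
// refreshes 30 seconds apart over "last 6 hours" with a 60s step then share
// the same aligned range, so the result is served from cache (a real cache
// also handles the freshly uncovered tail of the range).
func newCacheKey(query string, startMs, endMs, stepMs int64) cacheKey {
	return cacheKey{
		queryHash:    hashQuery(query),
		startAligned: startMs - startMs%stepMs,
		endAligned:   endMs - endMs%stepMs,
		stepMs:       stepMs,
	}
}

func main() {
	q := `avg(http_response_time_avg_1m{service="checkout"})`
	k1 := newCacheKey(q, 1_709_805_605_000, 1_709_827_205_000, 60_000) // refresh #1
	k2 := newCacheKey(q, 1_709_805_635_000, 1_709_827_235_000, 60_000) // 30 seconds later
	fmt.Println("same start boundary:", k1.startAligned == k2.startAligned)
	fmt.Println("same end boundary:  ", k1.endAligned == k2.endAligned)
}
```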
7.5 Trace, Log, and Profile Volume Estimation
Traces, logs, and profiles add incremental infrastructure on the same shared collection and buffer layers.
Tracing:
500K services × 50 RPS = 25M requests/sec
25M × 8 spans/trace = 200M spans/sec
200M × 1 KB/span = 200 GB/sec ← raw trace volume
200 GB/sec × 86,400 = 17.3 PB/day ← not viable to store 100%
× 0.005 (0.5% sampling) = 1 GB/sec = 86 TB/day ← after tail-based sampling
Tiered storage (180 days): ~15.5 PB total (Section 9.7)
Tempo ingesters: 20, Compactors: 5, +4 Kafka brokers
Logging:
500K services × 100 lines/sec = 50M lines/sec
50M × 500 bytes = 25 GB/sec ← raw log volume
25 GB/sec × 86,400 = 2.16 PB/day
÷ 8 (VictoriaLogs compression) = ~270 TB/day stored ← columnar + LZ4/ZSTD
Tiered storage: ~9.5 PB across active tiers
vlinsert: 17, vlstorage: ~140 (hot + warm), vlselect: 10, +20 Kafka brokers
Profiling:
500K services × (1 profile / 10 sec) = 50K profiles/sec ← one profile per service every 10 seconds
50K × 100 KB avg (compressed pprof) = 5 GB/sec ← raw profile volume
5 GB/sec ÷ 5 (Pyroscope compression) = 1 GB/sec stored ← after dedup + compression
1 GB/sec × 86,400 = 86 TB/day stored
Tiered storage (90 days): ~7.7 PB total
Pyroscope ingesters: 10 (r6g.2xlarge), +2 Kafka brokers
The same OTel Collectors, Kafka cluster, and Grafana deployment serve all four signals. Only the storage layer diverges.
7.6 Network Bandwidth
Metrics ingestion path:
OTel Collectors → Kafka: 800 MB/sec (compressed protobuf, post-pipeline)
Kafka replication (RF=3): × 3 = 2.4 GB/sec cross-broker
Kafka → consumers (vminsert, Flink, ML): ~1.7 GB/sec
vminsert → vmstorage: 800 MB/sec (distributed across 160 nodes)
Total metrics bandwidth: ~5.7 GB/sec aggregate
Other signals (traces + logs + profiles): ~108 GB/sec aggregate
(dominated by log pipeline: 25 GB/sec raw + 75 GB/sec Kafka replication)
Query path (bursty):
vmselect → vmstorage fan-out: 1-5 GB/sec during peak dashboard load
Log pipeline dominates bandwidth. At 25 GB/sec raw volume, logs account for over 70% of total network traffic. Compressing logs at the OTel Collector (gzip or ZSTD) and using Kafka's end-to-end compression reduces wire traffic by 3-5x. Co-locate Kafka partition leaders with primary consumers in the same AZ using broker.rack configuration to minimize cross-AZ transfer costs.
What Breaks First at Scale
Before diving into deep dives, it helps to know where this architecture is most likely to fail:
Cardinality explosions. A single user_id label can create 120M series and OOM the storage layer. Per-tenant cardinality limits and ingestion-time label validation prevent this (Section 8.3).
Log volume spikes. Verbose logging during incidents can multiply ingestion volume by 10x. Value-based routing and per-tenant rate limits prevent a single noisy service from overwhelming the log pipeline (Section 10).
Trace sampling mistakes. Sampling too aggressively hides critical latency problems. Tail-based sampling ensures 100% capture of errors and slow requests while reducing baseline volume by 98-99.5% (Section 9). The actual keep rate depends on your error rate and latency distribution -- environments with higher error rates retain more traces.
Noisy tenants. One team emitting high-cardinality metrics or excessive debug logs can degrade query performance for everyone. Shuffle-sharded storage isolation and per-tenant query concurrency limits contain the blast radius (Section 8.9).
Section 12 covers the full bottleneck analysis with scaling triggers and mitigation strategies for each component.
8. Deep Dives: Metrics Pipeline
8.1 Ingestion Pipeline
The full ingestion flow:
Key design decisions in the ingestion path:
Two-tier collection. Beyla eBPF agents auto-generate RED metrics and basic trace spans for every HTTP/gRPC/SQL call at the kernel level (the rationale for this split is covered in the operating model above). The OTel Collector receives Beyla's output alongside SDK telemetry and deduplicates where they overlap, keeping SDK versions for richer labels.
How eBPF binds to traces. Beyla attaches uprobes to language-specific HTTP handler entry points: Go's net/http.(*conn).serve, Java's javax.servlet.http.HttpServlet.service, Python's WSGI/ASGI handler functions, Node.js's HTTP module. When a request enters the function, the uprobe fires in kernel space and records the timestamp, connection metadata, and reads the incoming W3C traceparent header directly from the request buffer. If a traceparent exists, Beyla creates a child span with the correct parent_span_id and trace_id, linking the eBPF-generated span into the existing distributed trace. If no traceparent exists, Beyla creates a new root span. For languages without specific uprobe targets, Beyla falls back to kprobes on tcp_sendmsg/tcp_recvmsg, which still capture request/response pairs but with less protocol detail (no parsed HTTP method or path, just TCP-level timing).
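The traceparent header Beyla reads has a fixed W3C layout: version-traceid-parentid-flags. A minimal Go sketch of the join-or-start-new decision described above (Beyla itself implements this in its eBPF and userspace code; this only shows the logic, and the ID generation is a placeholder).

```go
package main

import (
	"fmt"
	"strings"
)

type spanContext struct {
	traceID      string // 32 hex chars
	parentSpanID string // 16 hex chars; empty for a new root span
}

// fromTraceparent mirrors the decision made on each incoming request: join the
// existing trace if a well-formed W3C traceparent header is present, otherwise
// start a new root trace. (Real code also validates hex digits and flags.)
func fromTraceparent(header string) spanContext {
	parts := strings.Split(header, "-") // "00-<trace-id>-<parent-id>-<flags>"
	if len(parts) == 4 && len(parts[1]) == 32 && len(parts[2]) == 16 {
		return spanContext{traceID: parts[1], parentSpanID: parts[2]}
	}
	return spanContext{traceID: newTraceID(), parentSpanID: ""} // new root span
}

func newTraceID() string {
	// Placeholder: production code generates 16 random bytes, hex-encoded.
	return "4bf92f3577b34da6a3ce929d0e0e4736"
}

func main() {
	withHeader := fromTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
	withoutHeader := fromTraceparent("")
	fmt.Printf("child span: %+v\n", withHeader)
	fmt.Printf("root span:  %+v\n", withoutHeader)
}
```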
What eBPF can and cannot see. Beyla sees every network call (HTTP method, path, status, duration) but not internal code execution — that requires SDK spans or continuous profiling. The operating model section above covers the full breakdown.
Deduplication logic. When both Beyla and OTel SDK instrument the same service, the OTel Collector sees two spans for the same HTTP request: one from eBPF (network-level, fewer attributes) and one from the SDK (application-level, richer labels like user_id, cart_size). The dedup processor matches them by {service_name, http.method, http.target, time_window} and keeps the SDK version (richer attributes). For metrics, the same matching drops the eBPF-generated http_server_request_duration_seconds when an SDK-generated equivalent exists for the same service+endpoint.
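A minimal sketch of that matching key, assuming a hypothetical one-second time bucket; the function and field names are illustrative, not the collector's actual dedup processor API:

```go
package main

import (
	"fmt"
	"time"
)

// dedupKey is the match key described above: a Beyla span and an SDK span that
// describe the same HTTP request collapse to the same key.
type dedupKey struct {
	Service    string
	HTTPMethod string
	HTTPTarget string
	Window     int64 // start of the time bucket, unix seconds
}

// keyFor buckets a span's start time into a fixed window (1s here, illustrative)
// so two observations of the same request match even when their timestamps
// differ by a few milliseconds.
func keyFor(service, method, target string, start time.Time, window time.Duration) dedupKey {
	return dedupKey{
		Service:    service,
		HTTPMethod: method,
		HTTPTarget: target,
		Window:     start.Truncate(window).Unix(),
	}
}

func main() {
	t0 := time.Date(2026, 3, 7, 14, 30, 0, 0, time.UTC)
	ebpf := keyFor("checkout", "POST", "/checkout", t0.Add(5*time.Millisecond), time.Second) // Beyla span
	sdk := keyFor("checkout", "POST", "/checkout", t0.Add(9*time.Millisecond), time.Second)  // SDK span
	// Same key -> keep the SDK version (richer attributes), drop the eBPF one.
	fmt.Println("duplicate:", ebpf == sdk)
}
```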
Pipeline processing before Kafka. The OTel Collector processor chain applies value-based routing (Section 4.5) before any data reaches Kafka. This is the key difference from a 2022-era architecture that treated all data equally. By dropping health checks, sampling debug noise, and enriching with ownership metadata at the edge, the pipeline reduces Kafka and storage load by 15-30%.
Kernel networking bottleneck. eBPF handles collection well — but the collected data still has to leave each box through the standard Linux kernel networking stack. And that stack was not designed for millions of tiny packets. Every telemetry packet takes the same expensive trip: send() syscall → context switch into the kernel → socket buffer allocation → full TCP/IP processing → NIC driver → wire. That is a lot of ceremony for a single metric sample.
Here is the thing most people miss: a typical metric sample, trace span, or log line is under 1KB after serialization. Small packets are the worst case for kernel networking. The per-packet overhead — syscalls, buffer copies, lock contention — dominates the actual transmission. The CPU spends more time managing each packet than sending it. At 500M metrics/sec across 3,000 nodes, each box is generating thousands of these tiny packets every second. A standard Linux box tops out at roughly 1M small packets/sec before the kernel becomes the bottleneck. Not the NIC. Not the network. The kernel itself.
Network acceleration: XDP + DPDK. Two techniques solve this, each at a different layer of the pipeline:
| Layer | Technology | Where | What It Does | Impact |
|---|---|---|---|---|
| Ship out | XDP (eXpress Data Path) | Every agent node (3,000 boxes) | Hooks into the NIC driver and forwards telemetry packets before they enter the full kernel networking stack. It runs as an eBPF program — same toolchain, same deployment model you already have. No dedicated CPU cores needed. | 5-10M packets/sec per core vs ~1M with normal send() syscalls |
| Receive at scale | DPDK (Data Plane Development Kit) | OTel Collector gateways (few boxes) | Complete kernel bypass. The NIC delivers packets directly into userspace memory via hugepages. Dedicated cores poll the NIC in a tight loop — no interrupts, no syscalls, no copies. | 15-20M+ packets/sec per core. Handles the aggregated fan-in traffic from all 3,000 agent nodes. |
Why two different tools? XDP is lightweight enough to deploy on every production server — no dedicated cores, no app changes, just an eBPF program attached to the NIC driver. You would not want DPDK there because it takes over entire CPU cores and requires rewriting your networking layer. But on the collector gateways, where thousands of agent streams converge into a handful of boxes and the kernel networking stack genuinely hits the wall, DPDK is worth the complexity. The full data plane becomes: eBPF (collect) → XDP (fast egress) → DPDK (high-throughput ingestion) → Kafka.
Fingerprint-based partitioning. All samples for a given time series land on the same Kafka partition (see Section 6.1 for why this matters). Flink gets partition-local state, vmstorage gets sequential writes optimal for Gorilla compression.
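A sketch of how a label-set fingerprint can map to a Kafka partition, assuming FNV-1a over the sorted label pairs; the production hash function and partition count may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint hashes the sorted label set so the same series always produces
// the same value regardless of label ordering at the source.
func fingerprint(labels map[string]string) uint64 {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	h := fnv.New64a()
	for _, k := range keys {
		h.Write([]byte(k))
		h.Write([]byte{0xff}) // separator avoids ambiguous concatenations
		h.Write([]byte(labels[k]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

// partitionFor routes every sample of a series to one Kafka partition, giving
// Flink partition-local state and vmstorage sequential writes.
func partitionFor(labels map[string]string, numPartitions uint64) uint64 {
	return fingerprint(labels) % numPartitions
}

func main() {
	series := map[string]string{
		"__name__": "http_request_duration_seconds_bucket",
		"service":  "checkout",
		"le":       "0.1",
	}
	fmt.Println("partition:", partitionFor(series, 512)) // 512 partitions is illustrative
}
```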
Two Kafka topics. Raw metrics go to metrics-ingestion, Flink outputs go to metrics-aggregated. That way, vmstorage only consumes pre-processed data. If Flink needs to be restarted or redeployed, raw metrics buffer safely in Kafka (24-hour retention) while Flink catches up.
Why not write directly to VictoriaMetrics? A direct OTel → vminsert pipeline is simpler, lower latency, and works well up to ~50M metrics/sec. Beyond that scale, three problems emerge: (1) burst absorption: a deploy that doubles metric volume for 5 minutes overwhelms vminsert with no buffer; (2) replay: if vmstorage goes down and recovers, there's no way to replay the lost window without Kafka retention; (3) fan-out: recording rules, downsampling, ML anomaly detection, and metrics-to-logs correlation all need the same raw stream, and tapping vmstorage's write path is fragile. Kafka solves all three. Cost: operational complexity and ~2 seconds of additional end-to-end latency. Below 50M/sec, skip Kafka and write directly.
Backpressure handling. If vmstorage falls behind, Kafka consumer lag increases. OTel Collectors detect backpressure via Kafka producer latency and can temporarily batch more aggressively (larger batches, less frequent sends). The 24-hour Kafka retention gives enough buffer for most recovery scenarios.
Tenant validation at the edge. OTel Collectors inject the tenant_id header and validate it against a local cache of known tenants. Invalid tenants get rejected before hitting Kafka, which prevents unknown tenants from consuming pipeline resources.
Scrape endpoint security. The /metrics endpoint on tenant pods is not open to arbitrary callers. OTel Collectors and application endpoints authenticate each other via mTLS, with certificates issued by cert-manager in Kubernetes. A collector without a valid certificate cannot scrape, and a pod without a valid certificate is rejected as a scrape target. For push-model ingestion, the platform's OTLP endpoint requires a per-tenant API key in the Authorization header, validated at the OTel Collector gateway before data reaches Kafka.
8.2 Gorilla Compression
VictoriaMetrics uses Gorilla compression from Facebook's 2015 paper, the foundation of time-series compression in every modern TSDB (VictoriaMetrics, M3DB, Prometheus TSDB, InfluxDB all use variants). The algorithm exploits the regularity of time-series data: timestamps arrive at fixed intervals (delta-of-delta compresses to near zero), and consecutive values are often identical or structurally similar (XOR encoding). Net result: 1.3–2 bytes per sample in production vs 16 bytes raw, an 8–12x compression ratio. The original Facebook paper achieved 1.37 bytes/sample on highly regular in-memory workloads; real production systems with chunk headers, label overhead, and variable scrape intervals typically land at 1.5–2 bytes/sample. The capacity estimates in Section 7.2 use 2 bytes/sample to be conservative.
For the full bit-level walkthrough (IEEE 754 XOR encoding, end-to-end 8-sample compression, late scrape impact, and chunk structure), see the Gorilla Compression deep dive.
8.3 High Cardinality: The Silent Killer
Cardinality explosion is the single biggest operational risk in a metrics monitoring system, bigger than disk failures, network partitions, or Kafka lag. At 500M metrics/sec, the vast majority of data points come from infrastructure exporters and framework auto-instrumentation, not custom application code. But it's the custom labels on application metrics that cause cardinality explosions.
What actually happens:
An engineer adds a user_id label to a request latency histogram. Seems reasonable. Want to track per-user latency for debugging.
http_request_duration_seconds_bucket{
service="checkout",
method="POST",
status="200",
user_id="u_abc123", ← This label
le="0.1"
}
The math:
Before user_id label:
1 metric × 12 series (10 buckets + sum + count) × 5 services × 4 methods × 5 statuses = 1,200 series
Memory: ~8 KB per series × 1,200 = 9.6 MB (fine)
After user_id label (10M users):
1 metric × 12 series (10 buckets + sum + count) × 10M users = 120,000,000 series
Memory: ~8 KB per series × 120M = 960 GB
Timeline:
00:00 Deploy goes out with user_id label
00:15 First scrape. 120M new series created.
00:16 vmstorage memory jumps from 10 MB to 960 GB
00:17 OOM. vmstorage killed.
00:17 All dashboards go blank. All alerts stop evaluating.
00:18 The monitoring system is dead. Nobody gets paged about it being dead.
Detection:
Monitor the monitoring system's own cardinality metrics:
# Alert: cardinality growing too fast
rate(vm_new_timeseries_created_total[1h]) > 100000
# Alert: approaching tenant series limit
vm_active_timeseries{tenant_id="acme"} / vm_series_limit{tenant_id="acme"} > 0.8
# Dashboard: top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))
Prevention (layered):
- Ingestion-time cardinality limits. vminsert checks active series count per tenant. Reject new series if the tenant exceeds its budget. Return HTTP 429 with an X-Series-Limit header (a sketch of this check follows the list).
- Label allowlists. The OTel Collector's metric_relabel_configs can drop labels before they reach Kafka. Block known high-cardinality labels (user_id, request_id, trace_id, pod_id) from metric labels:
# OTel Collector relabel config
metric_relabel_configs:
- regex: 'user_id|request_id'
  action: labeldrop
- Cardinality analysis at deploy time. A CI/CD check scans metric instrumentation for unbounded label values. If a label's value set is unbounded (user IDs, UUIDs), the pipeline blocks the deploy.
- The "Metrics without Limits" pattern (Datadog-inspired). Decouple ingestion from indexing. Ingest all raw data into Kafka (cheap, append-only). But only index the label combinations that are actually queried. If user_id is never used in a dashboard or alert rule, it's stored but not indexed. The inverted index stays small. Storage cost increases, but memory and query performance are protected.
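A minimal sketch of the ingestion-time check, assuming an in-memory per-tenant set of series fingerprints; vminsert's real accounting also handles staleness, churn, and per-node sharding, and the helper names here are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"sync"
)

// seriesLimiter tracks how many distinct series each tenant has created and
// rejects the creation of new series once the tenant's budget is exhausted.
type seriesLimiter struct {
	mu     sync.Mutex
	limit  int
	active map[string]map[uint64]struct{} // tenant -> set of series fingerprints
}

func newSeriesLimiter(limit int) *seriesLimiter {
	return &seriesLimiter{limit: limit, active: make(map[string]map[uint64]struct{})}
}

// admit returns true if the sample's series may be ingested. Samples for
// existing series are always admitted; only new series can be rejected.
func (l *seriesLimiter) admit(tenant string, fp uint64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	set, ok := l.active[tenant]
	if !ok {
		set = make(map[uint64]struct{})
		l.active[tenant] = set
	}
	if _, exists := set[fp]; exists {
		return true
	}
	if len(set) >= l.limit {
		return false
	}
	set[fp] = struct{}{}
	return true
}

// reject writes the 429 response with the headers described above.
func (l *seriesLimiter) reject(w http.ResponseWriter, tenant string) {
	l.mu.Lock()
	current := len(l.active[tenant])
	l.mu.Unlock()
	w.Header().Set("X-Series-Limit", strconv.Itoa(l.limit))
	w.Header().Set("X-Series-Current", strconv.Itoa(current))
	w.WriteHeader(http.StatusTooManyRequests)
}

func main() {
	lim := newSeriesLimiter(2) // tiny budget for demonstration
	fmt.Println(lim.admit("acme", 1), lim.admit("acme", 2), lim.admit("acme", 3)) // true true false
}
```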
8.3.1 Schema Evolution: Label Changes
Metric schemas change. A service renames status to http_status, adds a new region label, or drops an unused version label. Each of these creates a new set of time series because a different label set produces a different fingerprint, which means a different series.
What breaks:
- Dashboards referencing the old label name return empty results
- Alert rules using the old label stop evaluating correctly
- The old series goes stale (5-minute staleness marker), but historical data remains under the old name
- Recording rules that aggregate by the old label silently stop including the renamed service
Migration strategy:
- Dual-emit during transition. The application emits metrics with both old and new label names for one retention period (15 days). Dashboards and alert rules are updated to use the new name. After the transition window, the old label is dropped.
- Relabeling at the collector. If the application can't dual-emit, the OTel Collector's metric_relabel_configs can copy the old label to the new name during the transition:
metric_relabel_configs:
- source_labels: [status]
  target_label: http_status
- Dashboard migration tooling. Grafana's provisioning API can batch-update dashboard JSON to replace label references. Script this as part of the label rename runbook.
There is no way to retroactively rename labels in historical data stored in vmstorage or S3 Parquet files. Plan label names carefully. Breaking changes should be coordinated across the team with a defined cutover window.
8.4 Inverted Index at Scale
The inverted index is how the query engine turns a label selector like {service="checkout", env="prod"} into a list of matching time series IDs.
How it works:
VictoriaMetrics mergeset structure:
VictoriaMetrics uses mergeset for its inverted index, an LSM-tree variant optimized for sorted key lookups. New label-to-TSID mappings buffer in memory, flush to immutable sorted parts on disk, and background merges keep the part count small. Queries look up each label selector across parts, then intersect posting lists to find matching series.
High-cardinality labels degrade index performance:
The inverted index for service=checkout might have 1,000 TSIDs in its posting list. Fast to intersect. But user_id=u_abc123 has exactly 1 TSID, and there are 10 million such entries. The index now stores 10 million separate posting lists. Each label lookup scans a massive key range. Intersection becomes trivial (single entry), but the index scan to find it is expensive.
The mergeset's key space also grows linearly with cardinality. More keys means more disk I/O during compaction, more memory for bloom filters, and longer startup times when rebuilding the index.
Bloom filters for segment skipping:
VictoriaMetrics uses bloom filters on each index part to quickly determine if a label value might exist in that part. If the bloom filter says "definitely not here," the part is skipped entirely. For low-cardinality labels (service, env, region), bloom filters skip 95%+ of index parts on first check. For high-cardinality labels, the bloom filter's false positive rate increases and the benefit diminishes.
When a query has multiple label selectors, the engine intersects posting lists using size-adaptive algorithms (galloping search for skewed lists, linear merge for similar-sized lists).
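A sketch of posting-list intersection for the skewed case, assuming sorted slices of series IDs; VictoriaMetrics' actual implementation differs, but the shape of the algorithm is the same (walk the short list, search forward in the long one, never look behind the last match):

```go
package main

import (
	"fmt"
	"sort"
)

// intersect returns the series IDs present in both sorted posting lists. When
// one list is much shorter (service="checkout" vs a broad selector), searching
// the long list beats a linear merge.
func intersect(short, long []uint64) []uint64 {
	if len(short) > len(long) {
		short, long = long, short
	}
	out := make([]uint64, 0, len(short))
	for _, id := range short {
		// sort.Search is a binary search; a production version would gallop
		// (exponential probe) from the previous match before binary-searching.
		i := sort.Search(len(long), func(j int) bool { return long[j] >= id })
		if i < len(long) && long[i] == id {
			out = append(out, id)
		}
		long = long[i:] // never revisit anything before the last match
	}
	return out
}

func main() {
	checkout := []uint64{10, 42, 97, 530}                  // posting list for service="checkout"
	prod := []uint64{1, 10, 11, 42, 43, 97, 100, 530, 531} // posting list for env="prod"
	fmt.Println(intersect(checkout, prod))                 // [10 42 97 530]
}
```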
8.5 Query Engine and Execution
MetricsQL query execution flow:
Query splitting for large time ranges:
When a dashboard requests a 30-day range, the query spans both hot (vmstorage, raw data for last 15 days) and warm (S3, 5-min aggregates for days 16-30). vmselect splits the query:
- Days 1-15: Fan out to vmstorage nodes. Raw data, full resolution.
- Days 16-30: Read from S3 Parquet files. 5-min aggregated data. Partition pruning by tenant_id + time range skips irrelevant files.
- Merge: vmselect stitches the two result sets together, aligning timestamps at the step boundary.
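A minimal sketch of that time-range split, assuming a fixed 15-day hot-tier boundary; the merging of the two result sets is elided:

```go
package main

import (
	"fmt"
	"time"
)

type subQuery struct {
	Tier       string // "hot" (vmstorage, raw) or "warm" (S3 Parquet, 5-min aggregates)
	Start, End time.Time
}

// splitRange cuts a dashboard query range at the hot-tier retention boundary.
// Ranges entirely inside one tier produce a single sub-query.
func splitRange(start, end, hotBoundary time.Time) []subQuery {
	switch {
	case !start.Before(hotBoundary): // whole range is within the hot tier
		return []subQuery{{"hot", start, end}}
	case !end.After(hotBoundary): // whole range is older than the hot tier
		return []subQuery{{"warm", start, end}}
	default: // straddles the boundary: fan out to both and stitch at the step boundary
		return []subQuery{{"warm", start, hotBoundary}, {"hot", hotBoundary, end}}
	}
}

func main() {
	now := time.Now()
	boundary := now.AddDate(0, 0, -15) // hot tier keeps the last 15 days
	for _, q := range splitRange(now.AddDate(0, 0, -30), now, boundary) {
		fmt.Printf("%-4s %s -> %s\n", q.Tier, q.Start.Format("01-02"), q.End.Format("01-02"))
	}
}
```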
Dashboard experience across tiers:
Querying beyond the hot tier is slower and lower resolution.
| Time Range | Storage Tier | Resolution | Typical Latency | What You Lose |
|---|---|---|---|---|
| Last 15 days | Hot (vmstorage) | 15-sec raw | p99 < 200ms | Nothing, full fidelity |
| 15-90 days | Warm (S3 Parquet) | 5-min aggregates | p99 < 500ms | Short spikes (<5 min) disappear. Dashboard lines are smoother. A 3-minute error burst that was visible in hot tier becomes a single averaged data point in warm. |
| 90+ days | Cold (S3 Parquet) | 1-hour aggregates | 500ms-2s | Only useful for trend analysis ("did error rate increase this quarter?"), not incident investigation. |
Over 95% of dashboard queries are last-15-minutes or last-1-hour (real-time operations). Long-range queries (30-day, 90-day) are infrequent capacity-planning views where 5-min resolution and 200-500ms extra latency are fine. Recording rules help here too. Pre-computed aggregates like service:http_requests:rate5m exist at hot tier resolution regardless of time range, so dashboards built on recording rules never hit S3 for common panels.
Result caching:
vmselect caches query results in its own internal LRU cache, keyed by (query_hash, time_range, step). For dashboard auto-refresh (every 30s), the cache hit rate is high because only the latest data point changes. The cache uses a time-bounded eviction policy: cached results for time ranges that end in the past are valid indefinitely. Results for ranges that include "now" expire after the scrape interval (15s).
Tempo uses a separate approach: an external Memcached cluster caches trace query results. This keeps repeated trace ID lookups from hitting S3 on every request, which matters because Tempo's query path is entirely S3-backed with no local query cache.
Query priority scheduling. Not all queries are equal. Alert rule evaluation from vmalert is the highest priority. A delayed alert is a missed incident. Dashboard auto-refresh is medium priority. Ad-hoc exploration queries from engineers are lowest priority. vmselect enforces this via per-tenant query concurrency slots: alert rules get dedicated slots that are never preempted, dashboard queries share a pool, and ad-hoc queries use whatever capacity remains.
Recording rules to keep dashboards fast:
Without recording rules, a dashboard panel querying across 500 services scans millions of raw series on every refresh. Recording rules pre-compute these aggregations and store the result as a few hundred series instead of millions. Section 8.8 covers recording rules in detail, including when to use them and their cost.
8.6 Downsampling and Retention
Three storage tiers with automatic lifecycle management:
Stream aggregation (Flink) vs batch rollup:
Two approaches to downsampling, each with tradeoffs:
| Approach | How | Pros | Cons |
|---|---|---|---|
| Stream aggregation | Flink aggregates on the ingest path. Every 5 minutes, emit min/max/avg/sum/count for each series. Write aggregates to warm tier. | No read penalty. Aggregates available immediately. No backfill needed. | Uses Flink state (memory + checkpoint). Must handle late-arriving data. |
| Batch rollup | Background job reads raw data from vmstorage, computes aggregates, writes to S3. | Simple. No ingest-path overhead. Can recompute if logic changes. | Reads data twice (once to store, once to rollup). Creates I/O contention on vmstorage. Rollup delay (hours). |
Hot-to-warm uses stream aggregation (Flink computes 5-min rollups on the ingest path). Warm-to-cold uses a nightly batch job that reads 5-min Parquet files, re-aggregates to 1-hour granularity, and writes new Parquet files.
Each tier reduces volume dramatically: hot-to-warm cuts 20x (raw to 5-min aggregates), warm-to-cold cuts another 12x (5-min to 1-hour). Total storage across all tiers: ~3.2 PB. See Section 7.2 for the full derivation.
Value-based routing integration: The pipeline layer's routing decisions (Section 4.5) interact with downsampling. Signals marked as bronze SLO tier that survive the initial sampling can be downsampled more aggressively: 1-minute aggregates in the warm tier instead of 5-minute, reducing warm-tier storage further. Gold-tier signals always get full 5-minute resolution in warm and 1-hour in cold.
Handling late-arriving data:
Flink's 5-min aggregation window uses event-time processing (the sample's own timestamp, not arrival time) with a 2-minute watermark. The watermark signals "no events older than this will arrive," letting Flink close windows. Data arriving up to 2 minutes late lands in the correct window. Data arriving later than 2 minutes is either:
- Dropped (if it falls within the hot tier window and raw data is still available)
- Counted toward the next window (if it falls in the warm tier)
For metrics, late arrivals are rare. Scrape intervals are clock-driven. Delays beyond 2 minutes indicate network failures.
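The window/watermark mechanics can be shown without Flink's API; this is a minimal Go sketch of the same event-time logic, assuming a 5-minute window and a 2-minute watermark lag as above (real Flink state, triggers, and emission are not modeled):

```go
package main

import (
	"fmt"
	"time"
)

const (
	windowSize = 5 * time.Minute
	lateness   = 2 * time.Minute // watermark lag: "no events older than this will arrive"
)

type aggregator struct {
	maxEventTime time.Time
	sums         map[time.Time]float64 // window start -> running sum
}

// observe assigns the sample to its event-time window, dropping it if the
// watermark has already passed that window's end.
func (a *aggregator) observe(eventTime time.Time, value float64) {
	if eventTime.After(a.maxEventTime) {
		a.maxEventTime = eventTime
	}
	windowStart := eventTime.Truncate(windowSize)
	watermark := a.maxEventTime.Add(-lateness)
	if windowStart.Add(windowSize).Before(watermark) {
		fmt.Println("late sample dropped for window", windowStart.Format("15:04"))
		return
	}
	a.sums[windowStart] += value
}

func main() {
	a := &aggregator{sums: map[time.Time]float64{}}
	base := time.Date(2026, 3, 7, 14, 0, 0, 0, time.UTC)
	a.observe(base.Add(1*time.Minute), 1) // 14:01 -> window 14:00
	a.observe(base.Add(6*time.Minute), 1) // 14:06 -> window 14:05; watermark is now 14:04
	a.observe(base.Add(3*time.Minute), 1) // arrives out of order, but watermark < 14:05, so window 14:00 is still open
	a.observe(base.Add(8*time.Minute), 1) // 14:08 -> watermark 14:06 closes the 14:00 window
	a.observe(base.Add(2*time.Minute), 1) // dropped: the watermark has passed this window's end
	fmt.Println("14:00 window:", a.sums[base], " 14:05 window:", a.sums[base.Add(windowSize)])
}
```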
8.7 Alert Evaluation at Scale
Thousands of alert rules across billions of time series, evaluated every 15 seconds. A single evaluator can't keep up, so the rules are sharded.
Rule format: vmalert reads rules from YAML files on disk, the same format Prometheus uses. Each file contains one or more rule groups. Each group has its own evaluation interval. Two types: alerting rules fire notifications when a condition holds (e.g., error rate > 5% for 2 minutes), and recording rules pre-compute expensive queries and save the result as a new metric so dashboards stay fast.
Alerting rule example:
# /etc/vmalert/rules/shard-0/checkout-alerts.yml
groups:
- name: checkout-alerts
interval: 15s
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 2m
labels:
severity: critical
tenant_id: acme
annotations:
description: "Error rate above 5% for checkout service"
runbook: "https://wiki.internal/runbooks/checkout-errors"
- alert: SLOBurnRateFast
expr: slo:error_budget:burn_rate_5m > 14.4
for: 1m
labels:
severity: critical
annotations:
description: "Fast burn: error budget will exhaust within 1 hour at current rate"Recording rule example:
# /etc/vmalert/rules/shard-0/checkout-recording.yml
groups:
- name: checkout-recording
interval: 30s
rules:
- record: service:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service)
- record: slo:error_budget:burn_rate_5m
expr: |
(
  1 - (
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))
  )
) / (1 - 0.999)
Rule management at scale: The platform's /api/v1/rules endpoint provides CRUD operations for rules. Behind the scenes, it stores rule metadata in a database, generates these YAML files, and writes them to the correct vmalert shard's directory. A hot reload via vmalert's /-/reload endpoint picks up changes without restart.
Rule sharding strategy:
Each alert rule is assigned to exactly one vmalert shard. The rule management API uses consistent hashing on the rule name to decide which shard directory (/etc/vmalert/rules/shard-{N}/) a rule is written to. When vmalert pods scale up or down, the API reassigns rules and triggers reloads on affected instances.
Each evaluator shard, every 15 seconds:
- Loads its assigned rules from the rule store
- Executes each rule's MetricsQL expression against vmselect
- Compares results against thresholds
- For rules that have been firing for longer than their for duration, sends an alert to Alertmanager
- Updates local state (firing duration, last evaluation time)
SLO-based burn rate alerting:
Traditional threshold alerts ("error rate > 5%") are noisy. SLO-based alerting asks a better question: "At the current error rate, will the error budget run out before the SLO period ends?"
How burn rates work: A 99.9% SLO gives you ~43 minutes of error budget over a 30-day window.
A burn rate measures how fast that budget gets consumed:
- Burn rate 1x: Budget consumed exactly over 30 days. No action needed.
- Burn rate 14.4x: Budget consumed 14.4x faster. Runs out in ~2 days. Page immediately.
- Burn rate 6x: Runs out in ~5 days. Urgent, but hours remain, not minutes.
The formula: burn_rate = current_error_rate / allowed_error_rate. With a 0.1% error allowance and a current rate of 1.44%, the burn rate is 1.44% / 0.1% = 14.4x.
Multi-window burn rates:
Fast burn (last 5 min): error_rate / (1 - 0.999) > 14.4 → page immediately
Medium burn (last 30 min): error_rate / (1 - 0.999) > 6 → page (slower burn)
Slow burn (last 6 hours): error_rate / (1 - 0.999) > 1 → ticket (not a page)
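The same three-window policy expressed as code; a minimal sketch assuming the per-window error rates are already computed by the recording rules above:

```go
package main

import "fmt"

const slo = 0.999 // 99.9% availability objective

// burnRate is the current error rate divided by the allowed error rate (1 - SLO).
func burnRate(errorRate float64) float64 {
	return errorRate / (1 - slo)
}

// classify applies the multi-window thresholds from the text: fast and medium
// burns page, a slow burn files a ticket.
func classify(err5m, err30m, err6h float64) string {
	switch {
	case burnRate(err5m) > 14.4:
		return "page (fast burn)"
	case burnRate(err30m) > 6:
		return "page (medium burn)"
	case burnRate(err6h) > 1:
		return "ticket (slow burn)"
	default:
		return "ok"
	}
}

func main() {
	// 2% errors over the last 5 minutes -> burn rate 20x -> page.
	fmt.Println(classify(0.02, 0.004, 0.0009))
}
```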
Alertmanager deduplication and grouping:
When 500 pods fire the same "HighErrorRate" alert, that's 500 PagerDuty notifications without grouping. Alertmanager collapses them:
route:
group_by: ['alertname', 'service', 'env']
group_wait: 30s # Wait 30s to batch alerts in the same group
group_interval: 5m # Wait 5m between re-notifications
repeat_interval: 4h # Re-notify every 4h if still firing
receiver: 'pagerduty-critical'
routes:
- match:
severity: warning
receiver: 'slack-warnings'
- match:
severity: info
receiver: 'slack-info'
500 pod-level alerts get grouped into 1 notification: "HighErrorRate firing for service=checkout, env=prod (500 instances)."
Alert storm protection:
If a widespread failure causes 100,000 alerts to fire in 1 minute (datacenter-level event), even grouped alerts can overwhelm on-call engineers. Protections:
- Inhibition rules. If a datacenter-level alert is firing, suppress all service-level alerts in that datacenter.
- Rate limiting. Alertmanager limits notification sends to 100/minute per receiver. Excess alerts are queued.
- Escalation. If more than 50 unique alert groups fire within 5 minutes, trigger a "mass incident" page to incident command, not individual service owners.
8.7.1 ML-Assisted Anomaly Detection
SLO burn rates are the primary paging mechanism and remain so. ML anomaly detection is a supplementary signal that catches patterns burn rates miss: gradual degradations, seasonal deviations, and correlated anomalies across services that individually stay within thresholds.
Architecture:
The ML anomaly detector is a Kafka consumer group (5-10 nodes) that processes the metrics-ingestion topic. It doesn't process every metric. It watches the top ~100K series per tenant by query frequency (the metrics that actually appear in dashboards and alert rules). This keeps compute requirements modest.
Metric selection: how the monitored set is built. vmselect logs which metric names are queried by dashboards and alert rules. A recording rule aggregates query counts per metric name per tenant over a rolling 7-day window. Every 6 hours, the anomaly detector recalculates its monitored set by taking the top 100K series by query frequency. Metrics that drop out of all dashboards and alert rules for 7 days are removed from the monitored set. A new metric added to a dashboard enters the monitored set at the next 6-hour recalculation, not immediately. The key insight: the system only watches metrics that humans actually look at. If nobody queries node_entropy_available_bits, the anomaly detector ignores it. This is the single most effective mechanism for keeping false positives low -- most metrics are infrastructure noise that nobody monitors.
How it works:
- Seasonal baseline (STL decomposition). For each monitored metric, the service maintains a seasonal baseline using STL (Seasonal and Trend decomposition using Loess). This decomposes the metric into trend, seasonal, and residual components. The seasonal component captures daily/weekly patterns (e.g., traffic drops at 3am, spikes at 9am). Baselines are stored in RocksDB on each node's local SSD.
Cold start: New metrics with less than 7 days of history have anomaly detection suppressed entirely -- there's no value in alerting on a metric with no baseline. Metrics with 7-28 days of history use whatever data exists, but the anomaly threshold widens to 5σ (vs the steady-state 3σ) to prevent noisy alerting during ramp-up. After 28 days, the metric enters steady-state detection with the full seasonal model. New tenants onboarding start with all metrics in cold-start mode, so the detector is effectively silent for the first week. This is the same 5σ widening used during baseline rebuilds after major traffic pattern changes (Section 13.8).
- Anomaly scoring. Each new data point is compared against the expected value from the seasonal model. The score is based on how many standard deviations the residual is from zero. A threshold of 3σ (three standard deviations) triggers an anomaly signal. The threshold adapts based on the metric's historical volatility. Metrics that are naturally noisy (cache hit rates) get wider bands than stable metrics (CPU on idle hosts). A minimal scoring sketch follows this list.
Why STL over other approaches? STL was chosen for three reasons: it's computationally cheap (O(n) per series per update, which matters at 100K series), it requires no training data beyond the metric's own history, and it's interpretable -- you can render the trend and seasonal components on a Grafana dashboard and an engineer can see why the system thinks something is anomalous. Prophet fits a full Bayesian model per series, which is prohibitively expensive at this scale. ARIMA doesn't model seasonality natively (SARIMA does, but requires manually specifying seasonal order for each metric). Isolation forests work well for multivariate anomaly detection but require a feature vector, and this system monitors each series independently. The cross-service correlator (step 3) provides a lightweight version of multivariate detection without the model complexity.
- Cross-service correlation. When multiple services show simultaneous anomalies, the correlator groups them by time window and shared attributes (region, dependency). Instead of 50 individual "anomaly detected" alerts, it produces one "correlated anomaly: 50 services in us-east-1 showing elevated latency, likely caused by shared dependency X."
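A sketch of the scoring step under the stated assumptions: the seasonal expectation and the residual's standard deviation come from the stored STL baseline, and the threshold is the per-metric adaptive value (3σ by default, 5σ during cold start). The struct and numbers are illustrative:

```go
package main

import (
	"fmt"
	"math"
)

// baseline is what the detector keeps per monitored series: the expected value
// for this time-of-day/day-of-week slot and the historical stddev of the residual.
type baseline struct {
	expected    float64
	residualStd float64
	threshold   float64 // adaptive: 3.0 steady-state, 5.0 during cold start or rebuilds
}

// score returns how many standard deviations the observation sits from the
// seasonal expectation, and whether that crosses the metric's threshold.
func (b baseline) score(observed float64) (sigmas float64, anomalous bool) {
	if b.residualStd == 0 {
		return 0, false // no usable baseline yet (cold start)
	}
	sigmas = math.Abs(observed-b.expected) / b.residualStd
	return sigmas, sigmas > b.threshold
}

func main() {
	// p99 latency normally ~120ms at this hour with ~80ms of residual noise.
	b := baseline{expected: 0.120, residualStd: 0.080, threshold: 3.0}
	s, bad := b.score(0.450) // a 450ms observation
	fmt.Printf("deviation %.1fσ anomalous=%v\n", s, bad)
}
```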
Grounded expectations:
- ML anomaly alerts go to Slack only, not PagerDuty. They are informational signals for engineers already investigating or proactively monitoring.
- SLO burn rate alerts remain the only paging mechanism. ML never pages.
- False positive rate target: <5% of anomaly signals. Achieved by only monitoring high-query-frequency metrics and using adaptive thresholds.
- Detection latency: <60 seconds from metric ingestion to anomaly signal in Slack.
Feedback and threshold tuning:
Engineers dismiss false positives in Slack via a thumbs-down emoji reaction, which triggers a webhook back to the anomaly detector. Dismissals are tracked per metric. If a metric accumulates more than 3 dismissals in 7 days, the system auto-widens its threshold by 0.5σ (e.g., 3σ becomes 3.5σ). There is no real-time model retraining -- STL is a statistical decomposition, not a trained model in the gradient-descent sense. The "learning" is purely threshold adjustment based on operator feedback. An API endpoint (GET /api/v1/anomaly/tuning/{metric}) exposes the current effective threshold, dismissal count, and baseline age for any monitored metric, which helps on-call engineers understand why a particular metric is or isn't firing.
What ML catches that burn rates miss:
| Scenario | SLO Burn Rate | ML Anomaly Detection |
|---|---|---|
| Error rate suddenly jumps from 0.01% to 0.5% (still under 0.1% SLO) | No alert (within budget) | Detects: 50x deviation from baseline |
| Latency slowly increases 10% per day over 2 weeks | No alert until budget exhaustion | Detects: trend component diverging |
| 10 services simultaneously show 20% traffic drop at an unusual hour | No alert (availability is fine) | Detects: correlated anomaly, likely upstream issue |
| Cache hit rate drops from 95% to 85% | No alert (no SLO on cache) | Detects: significant deviation, Slack notification |
What the Slack alert looks like:
🔶 Anomaly detected: http_request_duration_seconds_p99
Service: checkout (us-east-1)
Current: 450ms
Expected: 120ms (seasonal baseline)
Deviation: 4.2σ (threshold: 3.0σ)
Trend: increasing over last 35 minutes
➜ View in Grafana: https://grafana.internal/d/svc-latency?service=checkout&from=now-1h
For correlated anomalies, the correlator groups individual signals into a single message:
🔴 Correlated anomaly: 12 services in us-east-1 showing elevated p99 latency
Top affected:
checkout 4.2σ (120ms → 450ms)
payment 3.8σ (85ms → 310ms)
inventory 3.5σ (60ms → 195ms)
Suspected shared dependency: redis-cluster-east
Window: 14:32 - 14:47 UTC
➜ View in Grafana: https://grafana.internal/d/correlated?region=us-east-1&from=now-1h
Sizing:
ML Anomaly Detector:
5-10 nodes (c6g.2xlarge: 8 vCPU, 16 GB RAM)
Kafka consumer: processes ~100K series per tenant
RocksDB state: ~50 GB per node (28-day seasonal baselines)
Total cluster: ~500 GB state, ~100K anomaly evaluations/sec
8.8 Recording Rules and Pre-Computation
Recording rules are pre-computed queries that run on the Flink aggregation layer and write results back as new time series. Two reasons to use them:
- Performance. A dashboard querying sum(rate(http_requests_total[5m])) by (service) across 10M raw series takes seconds. The same query against 500 pre-computed series takes milliseconds.
- Consistency. Alert rules and dashboard panels that need the same aggregation use the same recording rule. No divergence from slightly different query syntax.
Where recording rules execute:
Flink receives raw metrics from Kafka. For each recording rule, it maintains a windowed aggregation in state. Every evaluation interval (30s default), it emits the result as a new time series with the recording rule's name.
When to use recording rules vs raw queries:
| Scenario | Use Recording Rule | Use Raw Query |
|---|---|---|
| Dashboard panel queried by multiple users | Yes | No |
| Alert rule evaluated every 15s | Yes | No |
| One-off debugging query | No | Yes |
| Query touches >100K series | Yes | No |
| Query touches <1K series | No | Yes |
| SLO burn rate calculation | Yes (always) | No |
Cost of recording rules: Each rule produces new time series (roughly 2 MB/day for a 500-series rule, negligible). But rules with high-cardinality by clauses (e.g., by (service, endpoint, region, zone)) can produce millions of new series. Estimate rule cardinality before deployment.
Backfill when recording rules change:
Recording rules computed by Flink are forward-only. When you modify a recording rule expression (changing the aggregation logic, adding a new by dimension), the new output starts from the moment the updated Flink job deploys. Historical data under the old expression stays as-is.
If historical backfill is needed (e.g., a new SLO burn rate calculation that should show 30 days of history on day one), run a one-off batch job that reads raw metrics from the S3 warm tier, applies the new recording rule expression, and writes the backfilled series to vmstorage or S3. The Flink job template can be reused with a bounded source (S3 Parquet reader instead of Kafka consumer) for this.
In practice, most recording rule changes don't need backfill. Dashboards showing the new metric simply have no data before the change date, and that's fine. Reserve backfill for SLO calculations where a full window of historical data is required for accurate burn rate alerting from the start.
8.9 Multi-Tenancy and Isolation
At 500M metrics/sec from hundreds of organizations, one tenant's cardinality explosion must never take down another tenant's dashboards.
Tenant identification:
Every request (remote write, query, rule evaluation) carries an X-Scope-OrgID header. This tenant_id propagates through the entire pipeline:
OTel Collector → Pipeline Processors (enrichment) → Kafka (message header)
→ Flink (state key prefix) → vminsert (routing) → vmstorage (directory isolation)
→ vmselect (query filter)
Per-tenant limits:
| Limit | Default | Purpose |
|---|---|---|
| Max active series | 50M | Prevent cardinality explosion |
| Max ingestion rate | 5M samples/sec | Prevent single tenant from saturating Kafka |
| Max query series | 500K | Prevent dashboard query from scanning too many series |
| Max query duration | 60s | Kill runaway queries |
| Max alert rules | 10,000 | Prevent rule evaluator overload |
| Max label name length | 128 chars | Index size control |
| Max label value length | 1,024 chars | Index size control |
| Max labels per series | 30 | Cardinality control |
Limits are enforced at the earliest possible point:
- Ingestion rate: OTel Collector checks before sending to Kafka
- Series limit: vminsert checks before routing to vmstorage. When a tenant exceeds their series limit, the remote write endpoint returns 429 Too Many Requests with X-Series-Limit and X-Series-Current headers so clients can track cardinality usage against their budget.
- Query limits: vmselect checks before fanning out to vmstorage
Shuffle-sharding for noisy neighbor protection: Shuffle-sharding assigns each tenant to a random subset of nodes (not all nodes), ensuring one tenant's failure blast radius is limited to their subset, not the entire cluster.
Each tenant is assigned a subset of vmstorage nodes (4 out of 12 in this example). If Tenant A causes a node to OOM, only nodes 1, 4, 7, 10 are affected. Tenants B and C are on different nodes and see no impact.
The shard assignment uses consistent hashing with the tenant_id as input. When nodes are added or removed, only a fraction of tenant assignments change.
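A sketch of that assignment using rendezvous (highest-random-weight) hashing, which is one way to realize the consistent-hashing property described above; the node naming and shard size of 4 are illustrative, and a production version would also handle node join/leave and rebalancing:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// shardFor picks the subset of vmstorage nodes a tenant may use. Rendezvous
// hashing ranks every node by hash(tenant, node) and takes the top K, so adding
// or removing a node only reshuffles a fraction of tenant assignments.
func shardFor(tenantID string, nodes []string, shardSize int) []string {
	type scored struct {
		node  string
		score uint64
	}
	ranked := make([]scored, 0, len(nodes))
	for _, n := range nodes {
		h := fnv.New64a()
		h.Write([]byte(tenantID))
		h.Write([]byte{'/'})
		h.Write([]byte(n))
		ranked = append(ranked, scored{n, h.Sum64()})
	}
	sort.Slice(ranked, func(i, j int) bool { return ranked[i].score > ranked[j].score })
	shard := make([]string, 0, shardSize)
	for _, s := range ranked[:shardSize] {
		shard = append(shard, s.node)
	}
	return shard
}

func main() {
	nodes := make([]string, 12)
	for i := range nodes {
		nodes[i] = fmt.Sprintf("vmstorage-%02d", i)
	}
	// Tenant A's blast radius is limited to its 4 nodes; other tenants hash
	// to (mostly) different subsets.
	fmt.Println("tenant-a:", shardFor("tenant-a", nodes, 4))
	fmt.Println("tenant-b:", shardFor("tenant-b", nodes, 4))
}
```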
Cost attribution:
Per-tenant billing based on:
- Active series count (sampled hourly, billed at max)
- Ingestion rate (samples/sec, billed at p95 over the billing period)
- Query volume (number of series touched per query, summed monthly)
- Storage consumed (bytes across all tiers)
8.10 Service Discovery and Scrape
The collection layer must automatically discover new services as they deploy and stop scraping services that disappear. Manual target configuration doesn't work at 500K+ services.
Scope: pull model only. Service discovery applies to same-cluster tenants (see Section 5.1). External tenants push directly and don't need discovery.
eBPF auto-discovery. Beyla attaches to new pods automatically based on Kubernetes labels. When Service Discovery detects a new pod, Beyla instruments it at the kernel level within seconds. This means every new service has baseline RED metrics and trace spans before the first scrape interval completes, even if the team hasn't added any OTel SDK instrumentation yet. The OTel Collector receives Beyla's output and deduplicates it against any SDK-generated metrics for the same service.
Scrape interval tuning: 15-second scrape interval is the default. It aligns with alert evaluation intervals (also 15s), so each evaluation sees fresh data. Going faster rarely improves alerting because the for duration on most alerts is 1-5 minutes anyway. Use 30-60s only for low-priority infrastructure metrics.
Target churn handling:
Kubernetes pods churn constantly, and each new pod creates new series (the pod label changes). When a target disappears, the OTel Collector sends a stale marker (NaN with a staleness bit). VictoriaMetrics stops considering the series "active" after 5 minutes, so stale series don't count against cardinality limits or appear in queries.
Relabeling for metric filtering:
Not every metric from every service is worth storing. The OTel Collector's relabeling pipeline can drop, rename, or filter metrics before they reach Kafka:
metric_relabel_configs:
# Drop Go runtime metrics (noisy, rarely useful)
- source_labels: [__name__]
regex: 'go_(gc|memstats|threads)_.*'
action: drop
# Drop high-cardinality labels
- regex: 'request_id|trace_id|user_id'
action: labeldrop
# Rename for consistency
- source_labels: [__name__]
regex: 'node_cpu_seconds_total'
target_label: __name__
replacement: 'cpu_seconds_total'
Fleet management at scale:
3,000 OTel Collectors need coordinated configuration updates: scrape targets, relabel rules, sampling policies, and receiver settings. Two approaches:
- GitOps pipeline. Collector configuration lives in a Git repository. Changes go through PR review. A CI/CD pipeline renders per-node configs (using templates for node-specific settings like node_name and local pod discovery) and deploys via Kubernetes rolling update of the DaemonSet. Safe, auditable, but slow (minutes for a full rollout).
- OpAMP (Open Agent Management Protocol). The CNCF standard for remote agent management. A central OpAMP server pushes configuration updates to collectors in real time. Collectors report health, effective configuration, and capabilities back to the server. Useful for emergency changes (e.g., adding a labeldrop rule during a cardinality explosion). Faster than GitOps (seconds, not minutes), but requires running and securing the OpAMP server infrastructure.
9. Distributed Tracing at Scale
Traces are expensive. From the Section 7 capacity estimate: 200M spans/sec at ~1 KB each produces 17.3 PB/day. At ~$23/TB/month on S3, that's roughly $400K/month in storage alone. Not viable.
Tail-based sampling (Section 9.2) keeps errors, slow requests, and a 0.1% random baseline, dropping the rest to 86 TB/day. First, the trace span format.
9.1 Trace Span Format
Each trace span captures a single operation within a distributed request. A complete trace is a directed acyclic graph of spans connected by parent-child relationships.
{
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"parentSpanId": "e457b5a2e4d86bd1",
"operationName": "POST /checkout",
"serviceName": "checkout-service",
"startTime": 1709827200000,
"duration": 145,
"attributes": {
"http.method": "POST",
"http.status_code": 200,
"db.system": "postgresql"
},
"events": [{"name": "retry", "timestamp": 1709827200045}],
"links": [{"traceId": "...", "spanId": "...", "attributes": {"link.type": "profile"}}]
}
500-2000 bytes per span. Variable structure. Rich attributes. A single request generates one metric increment but 5-20 spans across service boundaries. This 100-1000x size difference per event is exactly why traces need a different storage engine than metrics.
9.2 Tech Stack Selection
| Layer | Technology | Why This Choice |
|---|---|---|
| Instrumentation | OpenTelemetry SDK + Beyla eBPF | SDK for deep instrumentation. Beyla provides baseline trace spans for all HTTP/gRPC/SQL without code changes. |
| Collection | OTel Collector (same fleet) | Add otlp trace receiver to existing DaemonSet configuration. Same fleet, no new infrastructure for collection. |
| Sampling | OTel Collector tail-based sampling | Buffer spans 60s, keep errors + p99 + 0.1% baseline. Cuts 98-99.5% (depends on error rate). Head-based misses rare errors. |
| Buffer | Kafka (topic: traces-raw) | Same Kafka cluster, separate topic. Isolates trace volume from metric ingestion. Fingerprint-partitioned by trace_id for span co-location. |
| Trace Backend | Grafana Tempo | S3-native storage (no index to maintain). TraceQL for queries. Stateless query layer scales horizontally. Bloom filters for trace ID lookup. |
| Long-term Storage | S3 (Parquet blocks) | Same S3 infrastructure used for metrics warm/cold tiers. Tempo writes trace blocks as Parquet files. S3 lifecycle policies manage tier transitions. |
| Correlation | Exemplars + span-to-profile links | Link metric data points to the specific trace that caused a spike. Link trace spans to CPU/memory profiles showing why the span was slow. |
| Context Propagation | W3C Trace Context (traceparent) | Industry standard. OTel SDK auto-injects for HTTP/gRPC. Manual injection required for Kafka message headers. |
| Visualization | Grafana (already deployed) | Native Tempo datasource. Trace waterfall view. Auto-generated service dependency maps from trace data. Integrated flame graph panel for span-linked profiles. |
Why tail-based sampling over head-based: Head-based sampling decides at the start of a request (e.g., "keep 1% of all traces") before knowing whether the request will fail or be slow. This inevitably drops error and high-latency traces that are the most valuable for debugging. Tail-based sampling buffers all spans for a configurable window (60 seconds in this design), waits for the complete trace, then makes a keep/drop decision based on outcome: all errors kept, all p99 latency kept, 0.1% random baseline. The cost is memory and an extra routing hop.
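A sketch of the keep/drop decision applied once a trace's buffer closes; a fixed slow-request threshold stands in for the p99 rule, and the deterministic trace-ID hash for the 0.1% baseline is an assumption, not the collector's actual sampler implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"time"
)

type span struct {
	Error    bool
	Duration time.Duration
}

type trace struct {
	ID    string
	Spans []span
}

// keep implements the tail-sampling policy sketched above: keep every trace
// containing an error or a slow span, plus a deterministic ~0.1% baseline keyed
// on trace ID (deterministic so every sampler instance agrees on the decision).
func keep(t trace, slowThreshold time.Duration) bool {
	for _, s := range t.Spans {
		if s.Error || s.Duration >= slowThreshold {
			return true
		}
	}
	h := fnv.New64a()
	h.Write([]byte(t.ID))
	return h.Sum64()%1000 == 0 // ~0.1% random baseline
}

func main() {
	slow := trace{ID: "a1", Spans: []span{{false, 900 * time.Millisecond}}}
	failed := trace{ID: "b2", Spans: []span{{true, 20 * time.Millisecond}}}
	boring := trace{ID: "c3", Spans: []span{{false, 15 * time.Millisecond}}}
	threshold := 500 * time.Millisecond
	fmt.Println(keep(slow, threshold), keep(failed, threshold), keep(boring, threshold))
}
```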
Trace-affinity routing for tail-based sampling: Tail-based sampling requires ALL spans from a single trace to arrive at the same sampler instance, since the keep/drop decision needs the complete trace. A trace crossing 8 services generates spans on 8 different nodes, hitting 8 different DaemonSet collectors. No single collector has the full picture. The solution is a two-tier collector architecture for traces:
- DaemonSet collectors (3,000 nodes) receive spans from local pods and forward them using the OTel Collector's load-balancing exporter, which routes by trace_id hash.
- Sampling tier (a dedicated pool of ~50 OTel Collector instances) receives all spans for a given trace on the same instance, buffers for 60 seconds, and makes the keep/drop decision.
This adds one network hop but ensures correct sampling decisions. The sampling tier instances each handle ~4M spans in the 60-second buffer, consuming ~2.7 GB RAM each (see Section 12.8 for sizing).
Why Tempo over Jaeger: Tempo stores traces as Parquet blocks on S3 with bloom filters for trace ID lookups. No inverted index to maintain, no Elasticsearch cluster to babysit. At this scale (hundreds of terabytes per day after sampling), that operational simplicity matters more than Jaeger's slightly more mature UI.
9.3 Data Flow
9.4 Exemplar Storage and Trace-to-Profile Correlation
Exemplars link metrics to traces: when the OTel SDK records a histogram observation, it can attach the current trace_id. Clicking an exemplar dot on a Grafana metric panel opens the linked trace in Tempo.
Sampling alignment matters. At a 0.5-2% trace keep rate, 98-99.5% of exemplar links will be dead ends. In practice, this is acceptable because the exemplars that matter most (errors, p99 latency) correspond to traces that the tail-based sampler keeps by policy.
Trace-to-profile correlation is the newer, higher-value link. When a trace span shows a slow database call (500ms), clicking "View Profile" shows the CPU flame graph for that exact span, showing whether the slowness was regex backtracking, GC pressure, lock contention, or actual I/O wait. This correlation is covered in depth in Section 11.4.
9.5 Tiered Storage for Traces
Same hot/warm/cold philosophy as the metrics pipeline (Section 8.6).
| Tier | Retention | Storage | Resolution | Query Latency |
|---|---|---|---|---|
| Hot | 0-3 days | Tempo ingesters (local SSD) + S3 Standard | Full spans, all attributes | < 500ms by trace ID |
| Warm | 3-30 days | S3 Standard (compacted Parquet blocks) | Full spans, all attributes | 1-3s (S3 reads + bloom filter lookup) |
| Cold | 30-180 days | S3 Glacier Instant Retrieval | Full spans, all attributes | 5-10s (Glacier retrieval) |
| Archive | 180d+ | S3 Glacier Deep Archive | Full spans (compliance/audit) | Minutes (restore required) |
Unlike metrics, traces are never downsampled. A trace is either useful in full detail or not at all. Cost savings come from tiering across S3 storage classes, not reducing resolution.
9.6 Capacity Estimate
From Section 7.5: raw trace volume is 17.3 PB/day. After tail-based sampling (0.5% keep rate), ingested volume is 86 TB/day.
Tiered storage (180-day retention):
Hot (3 days): 86 TB x 3 = 258 TB on S3 Standard
Warm (27 days): 86 TB x 27 = 2.32 PB on S3 Standard
Cold (150 days): 86 TB x 150 = 12.9 PB on Glacier Instant
Total trace storage: ~15.5 PB
Tempo ingester fleet:
1 GB/sec ingest rate. Each ingester handles ~50 MB/sec.
20 ingesters (r6g.2xlarge)
Tempo compactor:
10-15 compactor nodes (c6g.xlarge), auto-scaled by pending block count
Monitor: tempo_compactor_outstanding_blocks. Scale up if backlog > 500 blocks.
At 1 GB/sec ingestion, 5 compactors fall behind within hours during peak.
OTel Collector overhead:
Tail-based sampling requires 4 GB RAM per 100K spans/sec in the decision buffer
200M raw spans/sec across 3,000 collectors = ~67K spans/sec per collector
~2.7 GB additional RAM per collector (fits within c6g.xlarge 8GB with headroom)
For deeper coverage of sampling strategies and context propagation patterns, see the Distributed Tracing reference. Failure scenarios for the trace pipeline are covered in Section 13.
10. Centralized Logging
The trace shows which service failed. Logs show why: "connection refused to stripe-api.com:443, certificate expired." The logging pipeline reuses the same OTel Collector fleet and Kafka cluster, diverging only at the storage layer.
10.1 Log Entry Format
Structured logs follow a consistent JSON schema with fields auto-indexed by VictoriaLogs.
{
"timestamp": "2026-03-07T14:30:00.123Z",
"level": "ERROR",
"service": "payment-service",
"message": "connection refused to stripe-api.com:443, certificate expired",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "a1b2c3d4e5f60718",
"host": "payment-pod-7b8f9",
"region": "us-east-1",
"error.type": "ConnectionRefused",
"retry_count": 3
}
200-1000 bytes per log line. The trace_id field links logs back to traces. VictoriaLogs auto-indexes every field using bloom filters (2 bytes per unique token), so both trace_id lookups and full-text search across the message field are index-assisted.
10.2 Tech Stack Selection
Three serious options exist for log storage at this scale: Elasticsearch, Grafana Loki, and VictoriaLogs.
| Dimension | VictoriaLogs | Grafana Loki | Elasticsearch |
|---|---|---|---|
| Indexing strategy | Bloom filters on all tokens. Index-assisted search on every field automatically. | Labels only (service, env, level). Log content is NOT indexed. Brute-force grep on chunks. | Full-text inverted index on all fields. |
| Ingestion throughput | 3x higher than Loki on same hardware. 66 MB/s vs 20 MB/s on 4 vCPU / 8 GiB (benchmark). | Bottlenecks on high-volume ingestion. Requires more nodes. | High throughput but index building is CPU-intensive. |
| Query speed (content search) | Fast. Bloom filter token lookup skips non-matching blocks. | Slow. Brute-force scan after label filter. | Fast. Inverted index on all content. |
| Resource efficiency | 72% less CPU, 87% less RAM than Loki. Up to 30x less RAM than Elasticsearch. | Memory-hungry ingesters. | Highest resource consumption. Large JVM heaps. |
At 25 GB/sec: The deciding factor is what happens when queries go beyond label filtering. At 500K services, operators frequently search log content across broad service groups. VictoriaLogs uses bloom filter token indexes to skip non-matching blocks entirely.
Chosen stack:
| Layer | Technology | Why |
|---|---|---|
| Collection | OTel Collector (filelog receiver) | Same DaemonSet fleet. Tail container stdout/stderr via Kubernetes log paths. |
| Buffer | Kafka (topic: logs-raw) | Same cluster, separate topic. Partitioned by service_name. |
| Storage & Indexing | VictoriaLogs (cluster mode) | Bloom filter per-token index on all fields. Auto-indexes everything. |
| Long-term | Local NVMe on vlstorage + vmbackup to S3 | Hot/warm on local disk. Cold/archive snapshots to S3. |
| Query | LogsQL | Filter and search: _time:5m AND service:checkout AND error AND _stream:{level="error"}. |
| Correlation | trace_id field in structured logs | Click from log line to open trace in Tempo. Click from trace span to see logs. |
10.3 Data Flow
Container stdout --> Kubernetes log file --> OTel Collector (filelog receiver)
--> Parse JSON --> Extract fields (service, level, trace_id)
--> Pipeline processors (value-based routing: drop DEBUG from bronze-tier services)
--> Kafka topic: logs-raw
--> vlinsert --> Distribute to vlstorage nodes
--> vlstorage --> Tokenize, build bloom filters, compress, write to local NVMe
--> Query: Grafana --> vlselect --> Fan out to vlstorage nodes --> Return results
10.4 Tiered Storage for Logs
Logs follow the same tiered storage strategy. Unlike traces, partial log retention is viable. DEBUG and INFO logs older than 30 days rarely get queried. Dropping low-severity logs after the warm tier cuts cold/archive storage by 90%.
Dynamic log level adjustment. In production, most services run at INFO level. During incidents, operators can temporarily increase specific services to DEBUG via OpAMP. The pipeline's value-based routing (Section 4.5) integrates here: bronze-tier services have DEBUG logs dropped at the OTel Collector level, while gold-tier services retain all log levels.
| Tier | Retention | Storage | What Gets Stored | Query Latency |
|---|---|---|---|---|
| Hot | 0-3 days | vlstorage local NVMe | All log levels, full content | < 500ms |
| Warm | 3-30 days | vlstorage local disk | All log levels, full content | 500ms-2s |
| Cold | 30-90 days | S3 Standard (vmbackup snapshots) | ERROR/WARN only | 5-15s |
| Archive | 90-365 days | S3 Glacier Deep Archive | ERROR/WARN only (compliance/audit) | Minutes |
10.5 Capacity Estimate
From Section 7.5: 50M log lines/sec at 500 bytes = 25 GB/sec = 2.16 PB/day raw.
VictoriaLogs compression: ~8x on structured JSON logs
Compressed: ~260 TB/day stored on local disk
Tiered storage (365-day retention for errors, 30-day for all):
Hot (3 days): 260 TB x 3 = 780 TB on local NVMe
Warm (27 days): 260 TB x 27 = 7.0 PB on local SSD
Cold (60 days): 260 TB x 0.10 x 60 = 1.56 PB on S3 Standard
Archive (275d): 260 TB x 0.10 x 275 = 7.15 PB on Glacier Deep
Total log storage: ~9.5 PB across active tiers (+ 7.15 PB archive)
vlinsert: 17 nodes, vlstorage: ~140 (hot + warm), vlselect: 10, +20 Kafka brokers
Failure scenarios for the logging pipeline are covered in Section 13.
11. Continuous Profiling: The Fourth Pillar
Metrics tell you what is slow. Traces tell you where in the call chain it's slow. Profiles tell you why the code is slow: which function, which line, which allocation pattern. Without profiling, the investigation stops at "the checkout service has a slow span" and the engineer guesses. With profiling, they see the CPU flame graph showing 40% of time spent in regex compilation inside a hot loop.
11.1 Profile Format and Types
Profiles use the pprof format (Google's protocol buffer-based profile format), the de facto standard adopted by Go, Java (via async-profiler), Python (via py-spy), and Rust (via pprof-rs). Each profile is a snapshot of stack traces with associated values.
| Profile Type | What It Captures | When to Use | Typical Size |
|---|---|---|---|
| CPU | Time spent in each function | "This span is slow, but I don't know which function" | 50-200 KB |
| Memory (alloc) | Bytes allocated by each call stack | "GC pauses are killing latency" | 30-100 KB |
| Memory (inuse) | Current heap usage by call stack | "Memory is growing, what's leaking?" | 30-100 KB |
| Goroutine | Number of goroutines by creation stack (Go) | "Goroutine count is climbing" | 10-50 KB |
| Mutex | Time spent waiting on locks by call stack | "Contention is high but I don't know which lock" | 10-50 KB |
| Block | Time spent blocked on channel/mutex operations (Go) | "Something is blocking, what?" | 10-50 KB |
Profile structure (simplified pprof):
Profile {
sample_type: [("cpu", "nanoseconds")]
sample: [
{
location_id: [4, 3, 2, 1], // Stack trace (leaf to root)
value: [150000000], // 150ms of CPU time
label: {
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7" // Links to specific trace span
}
},
...
]
location: [
{id: 1, function: "main.handleCheckout", file: "checkout.go", line: 42},
{id: 2, function: "validator.ValidateCart", file: "validate.go", line: 87},
{id: 3, function: "regexp.Compile", file: "regexp.go", line: 172},
{id: 4, function: "regexp.compile", file: "compile.go", line: 94}
]
}
11.2 Tech Stack Selection
| Dimension | Grafana Pyroscope | Parca | Datadog Continuous Profiler |
|---|---|---|---|
| Storage backend | S3-native (object storage) | S3 or local disk | Proprietary (SaaS) |
| Profile format | pprof (native) | pprof (native) | Proprietary + pprof conversion |
| Query language | Label-based + time range | Label-based + time range | Proprietary UI |
| Grafana integration | Native (same vendor stack) | Community datasource plugin | Requires Datadog UI |
| Trace correlation | Span-to-profile via span_id in pprof labels | Experimental | Built-in (proprietary) |
| Maturity | Production-proven at Grafana Labs, acquired from Pyroscope project | Early production (CNCF sandbox) | Mature (SaaS) |
| OTel support | OTel profile signal (experimental) + Pyroscope SDK | OTel profile signal | Datadog agent only |
Why Pyroscope: Same vendor ecosystem as Tempo and Grafana. S3-native storage matches the trace pipeline's model. Span-to-profile correlation works via shared span_id labels in the pprof data, enabling the "click from slow trace span to CPU flame graph" workflow in Grafana. The OTel profile signal support (experimental as of early 2026) means profiles flow through the same OTel Collector fleet as other signals.
11.3 Data Flow
Key points:
- Beyla does NOT produce profiles. eBPF-based instrumentation generates RED metrics and basic trace spans, but CPU/memory/lock profiles require SDK-level instrumentation. Profiling is an opt-in capability that teams enable when they need deeper performance insight.
- Profile collection is per-service, not per-request. The SDK takes a CPU profile snapshot every 10 seconds (configurable). The snapshot covers all activity on the process during that window, not individual requests. Span-to-profile correlation works by matching the span's time window to the profile snapshot that covers it.
- Profiles flow through the same OTel Collector fleet and pipeline processors. Value-based routing applies: profiles from gold-tier services are always kept; profiles from bronze-tier services may be sampled during cost pressure.
11.4 Profile-to-Trace Correlation: The Key Value
This is where profiling transforms from "nice to have" to "changes how you debug." The four-click investigation flow:
How the correlation works technically: The OTel SDK labels each profile sample with the current span_id in pprof metadata. In Grafana, clicking "Profiles" on a trace span queries Pyroscope for profiles matching that span_id and time range, rendering the flame graph for that exact execution window.
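In Go this labeling can be done with the standard runtime/pprof label API; a minimal sketch, assuming a continuous profiler (or pprof.StartCPUProfile) is already running and that the SDK wiring which extracts the span and ships the profile to Pyroscope is handled elsewhere:

```go
package main

import (
	"context"
	"fmt"
	"runtime/pprof"
)

// handleCheckout runs the hot code under pprof labels carrying the current
// trace and span IDs, so CPU samples taken inside this block can later be
// filtered by span_id and rendered as that span's flame graph.
func handleCheckout(ctx context.Context, traceID, spanID string) {
	labels := pprof.Labels("trace_id", traceID, "span_id", spanID)
	pprof.Do(ctx, labels, func(ctx context.Context) {
		validateCart() // any CPU time spent here is attributed to this span
	})
}

// validateCart is a stand-in for real work.
func validateCart() {
	sum := 0
	for i := 0; i < 1e7; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	handleCheckout(context.Background(), "4bf92f3577b34da6a3ce929d0e0e4736", "00f067aa0ba902b7")
	fmt.Println("profiled span work done")
}
```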
What the flame graph shows that traces can't:
| Trace Shows | Profile Shows (the "why") |
|---|---|
| checkout-service span: 500ms | regexp.Compile called 10,000 times inside validateCart(), compiling the same regex per request instead of caching |
| payment-service span: 300ms | 60% time in runtime.mallocgc, excessive allocations causing GC pressure |
| db-proxy span: 200ms | 45% time in sync.Mutex.Lock, lock contention on connection pool |
| auth-service span: 150ms | 80% actual network I/O wait (expected), no code issue, downstream dependency is slow |
Without profiles, the engineer sees "checkout-service is slow" and starts guessing. With profiles, they see "regexp.Compile is called 10,000 times per request. Cache the compiled regex" and ship a fix in 20 minutes.
11.5 Capacity Estimate
Profile volume:
500K services in the fleet (profiling is opt-in via SDK, not enabled everywhere)
Assume 50% adoption: 250K services
10-second profile interval per service
250K / 10 = 25K profiles/sec (conservative)
At full adoption: 50K profiles/sec (target capacity)
Per-profile size:
CPU profile: 50-200 KB compressed (avg 100 KB)
Memory profile: 30-100 KB compressed
Avg across types: ~100 KB
Ingestion rate:
50K profiles/sec × 100 KB = 5 GB/sec raw
Pyroscope dedup + compression: ~5x reduction
Stored: ~1 GB/sec = 86 TB/day
Tiered storage (90-day retention):
Hot (7 days): 86 TB × 7 = 602 TB on S3 Standard
Warm (83 days): 86 TB × 83 = 7.1 PB on S3 Standard-IA
Total profile storage: ~7.7 PB
Pyroscope cluster:
Ingesters: 10 nodes (r6g.2xlarge, 8 vCPU, 64 GB RAM)
Query frontend: 5 nodes (c6g.2xlarge)
Compactor: 3 nodes (c6g.xlarge)
Kafka brokers: +2 (for profiles-raw topic)
Cost context: Profile storage is ~7.7 PB, compared to ~15.5 PB for traces and ~9.5 PB for logs. The incremental storage cost is meaningful but not dominant. The value (going from "this span is slow" to "this function is the bottleneck") justifies the cost for any organization where developer time spent debugging is expensive.
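The arithmetic above is simple enough to recompute when the assumptions change (adoption rate, profile size, retention). A sanity-check sketch using the figures from this section:

```go
package main

import "fmt"

func main() {
	const (
		profilesPerSec = 50_000 // full-adoption target rate
		avgProfileKB   = 100.0  // compressed pprof, averaged across types
		dedupFactor    = 5.0    // Pyroscope dedup + compression
		hotDays        = 7
		warmDays       = 83
	)
	rawGBps := profilesPerSec * avgProfileKB / 1e6 // ~5 GB/sec raw
	storedGBps := rawGBps / dedupFactor            // ~1 GB/sec after dedup
	tbPerDay := storedGBps * 86_400 / 1_000        // ~86 TB/day stored

	fmt.Printf("raw %.1f GB/s, stored %.1f GB/s, %.0f TB/day\n", rawGBps, storedGBps, tbPerDay)
	fmt.Printf("hot %.0f TB, warm %.1f PB, total %.1f PB\n",
		tbPerDay*hotDays, tbPerDay*warmDays/1000, tbPerDay*(hotDays+warmDays)/1000)
}
```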
12. Identify Bottlenecks
12.1 Cardinality Ceiling
The inverted index is the first component to break under cardinality pressure. At roughly 10 billion active series cluster-wide, the mergeset index consumes significant disk I/O and memory for bloom filters, and lookup latency degrades as individual vmstorage nodes climb toward the ~500M-active-series ceiling (Section 12.7). Symptoms: dashboard queries that used to return in 50ms start taking 500ms+. The metric vm_index_search_duration_seconds creeps up. Alert rule evaluations miss their 15-second window. The fix: more vmstorage nodes (horizontal sharding) and stricter cardinality limits per tenant.
12.2 Query Fan-Out
Every MetricsQL query fans out to all vmstorage nodes. With 160 nodes (RF=2), that's 160 parallel requests per query. If a dashboard has 20 panels, each refreshing every 30 seconds, that's 20 × 160 = 3,200 vmstorage requests per dashboard refresh. With 50 concurrent dashboard users, that's 160K requests/30s to vmstorage. Symptoms: Grafana dashboards show "query timeout" errors. vm_concurrent_select_requests on vmselect stays at max.
To handle this: recording rules reduce the number of series each panel queries, result caching in vmselect cuts redundant reads, and query coalescing merges identical concurrent queries into one.
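Of those three mitigations, query coalescing is the least obvious. A sketch of the idea using Go's golang.org/x/sync/singleflight package; the cache key and runner function are stand-ins for the real query frontend:

```go
package queryfrontend

import "golang.org/x/sync/singleflight"

var inflight singleflight.Group

// coalescedQuery collapses identical concurrent queries (same expression,
// same step, same time range) into a single backend execution, so 50
// dashboards refreshing the same panel cause one vmstorage fan-out.
func coalescedQuery(cacheKey string, run func() ([]byte, error)) ([]byte, error) {
	v, err, _ := inflight.Do(cacheKey, func() (interface{}, error) {
		return run() // one fan-out to all vmstorage nodes
	})
	if err != nil {
		return nil, err
	}
	return v.([]byte), nil
}
```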
12.3 Kafka Consumer Lag
If Flink or vminsert falls behind, Kafka consumer lag grows. With 400M/sec ingestion and 24-hour retention, Kafka stores up to ~34.6T samples. If a consumer falls 1 hour behind, it needs to process 1.44T extra samples to catch up while also handling current load. Symptoms: kafka_consumergroup_lag climbs steadily. Metrics appear on dashboards with increasing delay. The 2-second ingestion-to-visibility target starts slipping to 10+ seconds.
Monitor consumer lag (kafka_consumergroup_lag) and alert if lag exceeds 5 minutes. Auto-scale Flink TaskManagers based on lag. The 24-hour Kafka retention gives headroom for recovery.
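Catch-up time is worth a back-of-envelope check: a lagging consumer recovers only with whatever capacity it has beyond current inflow. A sketch assuming the consumer can sustain 25% above the ingest rate (the headroom figure is an assumption):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		ingestRate   = 400e6             // samples/sec flowing into Kafka
		consumerRate = 1.25 * ingestRate // assumed peak consumer throughput
		lagBehind    = 1 * time.Hour     // how far the consumer has fallen behind
	)
	backlog := ingestRate * lagBehind.Seconds() // ~1.44T samples
	catchUp := backlog / (consumerRate - ingestRate)
	fmt.Printf("backlog %.2fT samples, catch-up %.1f hours\n", backlog/1e12, catchUp/3600)
	// With only 25% headroom, a 1-hour lag takes ~4 hours to drain,
	// which is why 24-hour Kafka retention matters.
}
```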
12.4 Flink Checkpoint Size
Flink's recording rules maintain state (current window aggregations) in RocksDB. At 50M active aggregation keys, checkpoints can be multi-GB. Checkpoint duration affects recovery time (longer checkpoint = longer restart + replay). Symptoms: Flink's checkpoint duration metric (flink_jobmanager_job_lastCheckpointDuration) grows from seconds to minutes. If checkpoints take longer than the checkpoint interval, they start overlapping and Flink falls behind.
Mitigation: incremental checkpointing (only write changed state), tuning RocksDB block cache and compaction, and using aligned checkpoint barriers to reduce checkpoint duration.
12.5 vmstorage Compaction Storms
VictoriaMetrics periodically merges small parts into larger ones (similar to LSM-tree compaction). During compaction, disk I/O spikes. If compaction falls behind ingestion, parts accumulate, and read latency increases. Symptoms: vm_parts_count on vmstorage keeps climbing instead of staying flat after merges. Disk I/O utilization stays pegged at 100%.
Provision vmstorage nodes with 30% I/O headroom for compaction and use NVMe SSDs (not EBS gp3). Monitor vm_merge_need_free_disk_space and scale storage before compaction bottlenecks.
Tempo compactor backlog. Trace ingestion produces far more small blocks than metrics because of high cardinality (every request generates a new trace). At 1 GB/sec, ingesters flush blocks to S3 every 30 seconds. If the compactor fleet cannot merge blocks faster than they arrive, the block count grows, bloom filter lookups slow down, and trace queries degrade. Safeguards: auto-scale compactor instances when pending block count exceeds 500 per tenant (monitor tempo_compactor_outstanding_blocks). If compaction lag exceeds 2 hours, apply ingestion backpressure by rejecting low-priority traces (debug-level spans) rather than letting query performance degrade for all tenants.
12.6 S3 Query Latency
Queries spanning both hot and cold tiers bottleneck on the S3 leg. vmstorage responds in single-digit milliseconds. S3 GET latency is 50-200ms per request. A query touching 100 Parquet files on S3 waits for the slowest one. Symptoms: any dashboard query with a time range beyond 15 days loads noticeably slower.
The Parquet path structure (s3://metrics-archive/{tenant_id}/{resolution}/{year}/{month}/{day}/{fingerprint_range}.parquet) supports three levels of pruning: tenant_id eliminates other organizations' data, time range skips irrelevant days, and fingerprint range narrows to the specific series. A well-pruned query touches 10-50 files, not thousands.
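A sketch of how the query layer can expand that layout into a pruned listing; the bucket and path segments come from the structure above, the helper itself is illustrative:

```go
package coldquery

import (
	"fmt"
	"time"
)

// pruneParquetPrefixes expands a query's tenant, resolution, and time range
// into the minimal set of S3 prefixes to list. Fingerprint-range pruning then
// happens against file names under each prefix, so a well-scoped query
// touches tens of objects instead of thousands.
func pruneParquetPrefixes(tenantID, resolution string, start, end time.Time) []string {
	var prefixes []string
	for day := start.Truncate(24 * time.Hour); !day.After(end); day = day.Add(24 * time.Hour) {
		prefixes = append(prefixes, fmt.Sprintf(
			"s3://metrics-archive/%s/%s/%04d/%02d/%02d/",
			tenantID, resolution, day.Year(), day.Month(), day.Day()))
	}
	return prefixes
}
```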
12.7 Inverted Index Rebuild Time
When a vmstorage node restarts, it replays the WAL and rebuilds the in-memory portion of the mergeset index. With billions of series, this can take 10-30 minutes. During this window, the node accepts writes but cannot serve queries efficiently. Periodic index snapshots reduce WAL replay size. Keep the hot tier series count manageable per node (target: <500M active series per vmstorage node). Enable replication factor 2 so the replica serves queries while the primary rebuilds.
12.8 Tail-Based Sampling Memory
The OTel Collector's tail-based sampling processor buffers all spans for 60 seconds before making keep/drop decisions. At 67K spans/sec per collector (200M raw spans/sec across 3,000 collectors), each collector holds ~4M spans in memory, consuming ~2.7 GB RAM. If a burst of high-cardinality traces arrives, this buffer can exceed the collector's memory limit.
Set max_traces on the tail-based sampling processor to cap memory usage. If the limit is reached, new traces pass through unsampled rather than crashing the collector. Monitor otelcol_processor_tail_sampling_count_traces_dropped and scale the collector fleet if drops exceed 1%.
12.9 Tempo S3 Query Latency
Tempo stores trace blocks as Parquet files on S3 with bloom filters for trace ID lookup. When a bloom filter returns a false positive, Tempo must fetch the full Parquet block from S3. For attribute-based searches (e.g., {resource.service.name="checkout" && duration > 2s}), Tempo may scan dozens of blocks. Keep the Tempo compactor healthy. A memcached query result cache avoids repeated S3 reads for popular queries.
12.10 VictoriaLogs Bloom Filter Cache
VictoriaLogs uses bloom filters on every log field. High-cardinality queries hit bloom filters on every vlstorage node. If the bloom filter working set exceeds the OS page cache, queries trigger disk reads instead of cache hits, increasing latency from sub-second to multi-second. Monitor bloom filter cache hit rate and provision enough RAM for the OS page cache.
12.11 Pyroscope Ingester Memory
Pyroscope ingesters buffer profile data in memory before flushing to S3. At 50K profiles/sec with an average 100 KB per profile, each ingester (10 nodes) handles 5K profiles/sec = 500 MB/sec in-memory. With a 30-second flush interval, each ingester holds ~15 GB of profile data in RAM. Symptoms: ingester OOM during traffic spikes, profile data gaps in Grafana flame graph panels.
Mitigation: configure ingester.max-block-bytes to limit per-tenant buffer size. Use Kubernetes memory limits with 30% headroom above expected peak. During volume spikes, excess profiles are dropped with a metric (pyroscope_ingester_dropped_profiles_total) for alerting.
12.12 Scaling Triggers
| Component | Scaling Metric | Trigger | Action |
|---|---|---|---|
| vmstorage | vm_active_timeseries per node | > 400M series | Add vmstorage nodes, rebalance hash ring |
| vmstorage | Disk utilization | > 70% on NVMe | Add nodes or increase retention export frequency |
| vminsert | CPU utilization | > 70% sustained | Add vminsert pods (stateless, instant) |
| vmselect | vm_concurrent_select_requests | > 80% of max | Add vmselect pods |
| Kafka | Bytes in per broker | > 200 MB/sec per broker | Add brokers, rebalance partitions |
| Flink | kafka_consumergroup_lag | Lag > 5 minutes sustained | Add TaskManagers |
| Flink | Checkpoint duration | > 50% of checkpoint interval | Tune RocksDB, add parallelism |
| vlstorage | Disk utilization | > 70% | Add vlstorage nodes |
| vlstorage | Bloom filter cache miss rate | > 20% | Add RAM or add nodes |
| Pyroscope | Ingester memory utilization | > 70% | Add ingester nodes |
| ML Anomaly | Processing lag | > 2 minutes behind real-time | Add consumer nodes |
| Beyla | beyla_ebpf_attach_errors | > 0 sustained | Investigate kernel compatibility |
| OTel Collectors | otelcol_exporter_send_failed_metric_points | Failure rate > 1% | Scale fleet or increase per-collector resources |
Use Kubernetes HPA for stateless components (vminsert, vmselect, vlinsert, vlselect) with custom metrics from the meta-monitoring stack. Stateful components (vmstorage, vlstorage, Kafka) require manual scaling with planned rebalancing.
13. Failure Scenarios
What happens when components fail or scaling limits are exceeded, and how the platform recovers.
13.1 VictoriaMetrics Node Loss
Scenario: A vmstorage node loses its disk (NVMe failure) and all data on it.
Impact: The series sharded to that node become unavailable for queries. Ingestion continues (vminsert detects the failed node and redistributes to remaining nodes). Queries return partial results for affected series.
Recovery:
- vminsert's health check detects the failed node within 10 seconds
- vminsert's consistent hashing ring updates. New writes for affected series go to the next node in the ring.
- With replication factor 2, the replica node has a copy. vmselect queries both the primary and replica, so queries still return complete data.
- A replacement node joins the cluster. vmstorage rebalancing migrates data from overloaded nodes.
Why replication factor 2 is enough: Unlike a transactional database, metrics data is append-only and immutable. There are no updates, no transactions, no read-before-write. Losing recent data is tolerable (the system will re-scrape and backfill). Losing old data is mitigated by S3 exports.
13.2 Kafka Broker Failure
Scenario: A Kafka broker in the metrics-ingestion cluster crashes.
Impact: Partitions led by that broker become temporarily unavailable. With RF=3, the controller elects a new leader from in-sync replicas.
Recovery:
- Kafka controller detects broker failure (KRaft heartbeat)
- Leader election for affected partitions (~1-5 seconds)
- Producers retry buffered messages to the new leader
- No data loss (RF=3 with acks=all means at least 2 replicas have every message)
During failover: 1-5 seconds of elevated producer latency. OTel Collectors buffer in memory during this window (configured via sending_queue settings).
13.3 Cardinality Explosion (OOM)
Scenario: A tenant adds user_id as a histogram label (see Section 8.3). 10M users × 12 histogram series = 120M new series.
Recovery with per-tenant limits (correct configuration):
- vminsert detects that the tenant has exceeded its 50M series limit
- New series creation is rejected (HTTP 429)
- Existing series continue receiving data
- Alert fires: vm_active_timeseries > 0.8 * vm_series_limit
- Over the next staleness period (5 minutes), the 120M stale series are marked inactive
Per-tenant cardinality limits are P0 for exactly this reason.
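A sketch of what enforcement looks like at the write path; the in-memory maps stand in for vminsert's real per-tenant accounting:

```go
package ingest

import (
	"net/http"
	"sync"
)

// seriesLimiter tracks active series per tenant and rejects creation of new
// series once the tenant's budget is spent. Samples for existing series are
// always accepted, so an explosion degrades only the tenant that caused it.
type seriesLimiter struct {
	mu     sync.Mutex
	limit  map[string]int             // tenant -> max active series (e.g. 50M)
	active map[string]map[uint64]bool // tenant -> set of series fingerprints
}

func newSeriesLimiter(limits map[string]int) *seriesLimiter {
	return &seriesLimiter{limit: limits, active: map[string]map[uint64]bool{}}
}

// admit returns the HTTP status the write path should respond with.
func (l *seriesLimiter) admit(tenant string, fingerprint uint64) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	series := l.active[tenant]
	if series == nil {
		series = map[uint64]bool{}
		l.active[tenant] = series
	}
	if series[fingerprint] {
		return http.StatusOK // known series: keep accepting samples
	}
	if len(series) >= l.limit[tenant] {
		return http.StatusTooManyRequests // new series rejected (HTTP 429)
	}
	series[fingerprint] = true
	return http.StatusOK
}
```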
13.4 Split-Brain in VictoriaMetrics Cluster
Scenario: Network partition isolates vmstorage nodes 1-3 from nodes 4-6.
VictoriaMetrics is AP (availability over consistency). During a partition, reads and writes continue but may return temporarily inconsistent data. vminsert routes writes for unreachable nodes to the next available node. When the partition heals, deduplication on vmselect merges overlapping data. No manual intervention required.
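A sketch of the read-side idea: when two replicas return overlapping samples for the same series, keep one value per timestamp:

```go
package dedup

import "sort"

type Sample struct {
	TimestampMs int64
	Value       float64
}

// mergeDedup merges samples from two replicas of the same series, keeping a
// single value per timestamp. After a healed partition both replicas may hold
// the same windows, so overlap is expected and harmless.
func mergeDedup(a, b []Sample) []Sample {
	merged := append(append([]Sample{}, a...), b...)
	sort.Slice(merged, func(i, j int) bool {
		return merged[i].TimestampMs < merged[j].TimestampMs
	})
	out := merged[:0] // in-place filter: out never outruns the read index
	for _, s := range merged {
		if len(out) == 0 || out[len(out)-1].TimestampMs != s.TimestampMs {
			out = append(out, s)
		}
	}
	return out
}
```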
13.5 Flink Job Failure
Scenario: A Flink TaskManager crashes mid-checkpoint.
Impact: Recording rules stop producing pre-computed metrics. SLO burn rate calculations freeze. Raw metric ingestion into vmstorage continues unaffected.
Recovery: Flink JobManager reschedules failed tasks. Tasks restore from the last successful checkpoint. Consumer offsets reset to the checkpoint's Kafka position. Flink replays events, rebuilding state. Gap during replay: typically 30 seconds to 2 minutes.
13.6 eBPF Collector Failure
Scenario: A kernel upgrade breaks Beyla eBPF probes on a subset of nodes. Beyla pods enter CrashLoopBackOff.
Impact: Baseline RED metrics and auto-generated trace spans disappear for services on affected nodes. Services with OTel SDK instrumentation are completely unaffected. Their metrics, traces, logs, and profiles continue flowing normally. Only services relying exclusively on eBPF for observability lose visibility.
Recovery:
- Alert fires: beyla_ebpf_attach_errors > 0 on affected nodes
- Platform team identifies the kernel version causing the failure
- Short-term: roll back the kernel upgrade on affected nodes, or pin Beyla to a compatible version
- Long-term: update Beyla to a version compatible with the new kernel
- The graceful degradation to SDK-only is automatic. No manual intervention needed for SDK-instrumented services
Key design point: The two-tier instrumentation model (eBPF + SDK) means eBPF failures are always a partial degradation, never a total outage. Teams that have added SDK instrumentation are fully protected.
13.7 Pyroscope Ingester Failure
Scenario: A Pyroscope ingester node OOMs due to a traffic spike from a newly onboarded high-volume service.
Impact: Profile data for services assigned to that ingester is lost for the duration. Trace-to-profile links show "profile not found." Metrics, traces, and logs are completely unaffected.
Recovery:
- Kafka buffers profiles in the profiles-raw topic (12-hour retention)
- Remaining ingesters pick up partitions from the failed node (consumer group rebalance)
- Once the replacement ingester catches up from Kafka, profile coverage resumes
- Profiles during the gap are permanently lost (not re-collectable)
Prevention: Per-tenant profile ingestion rate limits at the OTel Collector. Monitor pyroscope_ingester_memory_utilization and alert at 70%.
13.8 ML Anomaly Detector Divergence
Scenario: The ML anomaly detector's seasonal baselines become stale after a major traffic pattern change (Black Friday, migration to new region, daylight saving time shift).
Impact: Every metric deviates from the stale baseline. Hundreds of false-positive anomaly alerts flood Slack. Engineers ignore the channel. When a real anomaly occurs simultaneously, it's buried in noise.
Recovery:
- On-call engineer uses the anomaly suppression API: POST /api/v1/anomaly/suppress?duration=4h&scope=all
- After the traffic pattern stabilizes, trigger a baseline reset: POST /api/v1/anomaly/reset-baseline?service=*
- The detector rebuilds baselines over the next 24-48 hours using the new traffic pattern
- During rebuild, anomaly detection operates with wider thresholds (5σ instead of 3σ) to reduce noise
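The widened-threshold fallback is simple in principle. A sketch of the z-score check with an adjustable sigma multiplier; the baseline struct is illustrative, not the detector's real schema:

```go
package anomaly

import "math"

// Baseline is the learned seasonal expectation for one series at one
// point in its weekly cycle.
type Baseline struct {
	Mean   float64
	StdDev float64
}

// isAnomalous flags an observation that deviates more than sigmaLimit
// standard deviations from the baseline. During a baseline rebuild the
// detector runs with sigmaLimit = 5 instead of the usual 3 to cut noise.
func isAnomalous(observed float64, b Baseline, sigmaLimit float64) bool {
	if b.StdDev == 0 {
		return false // no variance learned yet: never alert
	}
	return math.Abs(observed-b.Mean) > sigmaLimit*b.StdDev
}
```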
Key design point: ML anomaly alerts go to Slack only, never PagerDuty. This design decision means a false positive storm is annoying but not dangerous. SLO burn rate alerts (the paging mechanism) are unaffected by ML detector issues.
13.9 Trace Sampling Pipeline Failure
Scenario: The tail-based sampling processor crashes or hits its max_traces limit. Spans either get dropped entirely or pass through unsampled at 100% keep rate.
Impact (100% pass-through): Tempo ingesters receive 200x normal volume. Kafka topic saturates. Storage spikes.
Recovery: The OTel Collector's memory limiter processor acts as a circuit breaker, dropping spans before OOM. Kafka buffers excess until sampling resumes and Tempo catches up. Monitor otelcol_processor_tail_sampling_count_traces_sampled for anomalies.
13.10 Network Partition Scenarios
OTel-to-Kafka and Flink-to-VictoriaMetrics partitions share the same recovery model: upstream components buffer locally (OTel Collectors have a 500 MB retry queue; Flink backpressures to Kafka), and Kafka retention bridges the gap once connectivity restores. Deploy Kafka brokers across multiple AZs for prevention.
13.11 VictoriaLogs Disk Full
Scenario: A noisy service emits 10x normal log volume. vlstorage nodes fill their local NVMe.
Recovery: Alert at 85% disk. Identify noisy service via sum(rate(vlinsert_ingested_bytes_total[5m])) by (service). Apply per-service rate limiting at vlinsert. VictoriaLogs retention policy auto-deletes old partitions. The pipeline's value-based routing (Section 4.5) also helps: bronze-tier services can have their log volume capped at the OTel Collector level.
13.12 Clock Skew Between Collectors
NTP misconfiguration causes clock drift: future timestamps confuse dashboards and break rate() calculations, past timestamps outside VictoriaMetrics' 30-minute window are silently dropped. Monitor vm_out_of_order_samples_total. Prevention: validate timestamps at the collector (current time +/- 5 minutes) and overwrite out-of-range timestamps.
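A sketch of the collector-side guard; the window and the overwrite-rather-than-drop behavior follow the prevention rule above:

```go
package pipeline

import "time"

const maxSkew = 5 * time.Minute

// clampTimestamp overwrites sample timestamps that fall outside the accepted
// window around the collector's own clock. Future timestamps would corrupt
// rate() windows; old ones would be silently dropped by the TSDB's
// out-of-order cutoff, so both are snapped to "now".
func clampTimestamp(sampleTs, now time.Time) time.Time {
	if sampleTs.After(now.Add(maxSkew)) || sampleTs.Before(now.Add(-maxSkew)) {
		return now
	}
	return sampleTs
}
```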
14. Observability (Meta-Monitoring)
The meta-problem: if the metrics platform itself goes down, how does anyone find out? Dashboards are blank. Alerts aren't evaluating. PagerDuty is silent.
Solution: A separate, minimal monitoring stack.
A single Prometheus instance on separate infrastructure (different Kubernetes cluster, different cloud region, different failure domain) monitors five things:
- Is vmstorage responding? Scrape /metrics. If scrape fails 3 times, page.
- Is ingestion flowing? Check the vm_rows_inserted_total rate. If it drops to 0, page.
- Is the alert evaluator running? Check the rule_evaluations_total rate. If it drops to 0, page.
- Is Alertmanager healthy? Check alertmanager_alerts_received_total. If it flatlines, page.
- Is the profile pipeline healthy? Check pyroscope_ingester_appended_profiles_total. If it drops to 0, alert but don't page (profiles are P1, not P0).
Total footprint: one t3.medium instance.
14.1 Critical Metrics
What to watch, by component:
# === INGESTION ===
rate(vm_rows_inserted_total[5m]) # Should match 400M/sec ± 20%
kafka_consumergroup_lag{group="flink-recording-rules"} # Flink consumer lag
kafka_consumergroup_lag{group="vminsert-consumer"} # vminsert consumer lag
otelcol_scraper_scraped_metric_points / otelcol_scraper_errored_metric_points # Scrape success rate
rate(vm_rows_rejected_total{reason="series_limit"}[5m]) # Tenant limit rejections
# === eBPF COLLECTION ===
beyla_ebpf_attach_errors # eBPF probe failures
beyla_http_server_request_duration_seconds_count # Beyla generating metrics
# === PIPELINE ===
otelcol_processor_routing_dropped_total # Health checks/probes dropped
otelcol_processor_routing_sampled_total # Bronze-tier signals sampled
# === STORAGE ===
vm_active_timeseries # Global active series
rate(vm_new_timeseries_created_total[5m]) # Series churn rate
vm_data_size_bytes{type="indexdb"} # Index size
vm_merge_need_free_disk_space # Compaction backlog
# === QUERY ===
histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) # p99 query latency
vm_concurrent_queries # Active queries
rate(vm_slow_queries_total[5m]) # Slow queries (>10s)
# === ALERTING ===
histogram_quantile(0.99, rate(rule_evaluation_duration_seconds_bucket[5m])) # Rule eval duration
rate(rule_evaluation_failures_total[5m]) # Missed evaluations
rate(alertmanager_notifications_failed_total[5m]) # Notification failures
# === ML ANOMALY DETECTION ===
ml_anomaly_detector_lag_seconds # Processing lag behind real-time
ml_anomaly_detector_false_positive_rate # Should be < 5%
ml_anomaly_detector_baseline_age_hours # Stale baseline detection
# === FLINK ===
flink_jobmanager_job_lastCheckpointDuration # Should be < checkpoint interval
rate(flink_taskmanager_job_task_numRecordsInPerSecond[5m]) # Should match Kafka rate
flink_taskmanager_job_task_backPressuredTimeMsPerSecond # High = can't keep up
# === TRACE PIPELINE ===
tempo_ingester_flush_duration_seconds # Ingester lag
rate(otelcol_processor_tail_sampling_count_traces_dropped[5m]) # Span drop rate
# === LOG PIPELINE ===
rate(vlinsert_ingested_rows_total[5m]) # Log ingestion rate
vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes # Disk usage
# === PROFILE PIPELINE ===
rate(pyroscope_ingester_appended_profiles_total[5m]) # Profile ingestion rate
pyroscope_ingester_memory_utilization # Ingester memory pressure
pyroscope_compactor_outstanding_blocks # Compaction health
14.2 Alert Rules for Meta-Monitoring
| Alert | Expression | Severity | For |
|---|---|---|---|
| IngestionDrop | rate(vm_rows_inserted_total[5m]) < 320000000 | critical | 2m |
| KafkaLagHigh | kafka_consumergroup_lag > 50000000 | warning | 5m |
| KafkaLagCritical | kafka_consumergroup_lag > 500000000 | critical | 2m |
| CardinalitySpike | rate(vm_new_timeseries_created_total[1h]) > 100000 | warning | 0m |
| VmstorageDown | up{job="vmstorage"} == 0 | critical | 30s |
| QueryLatencyHigh | histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) > 1 | warning | 5m |
| RuleEvalFailing | rate(rule_evaluation_failures_total[5m]) > 0 | critical | 2m |
| FlinkBackpressure | flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500 | warning | 5m |
| CompactionBehind | vm_merge_need_free_disk_space > 0 | warning | 10m |
| S3ExportFailed | time() - last_s3_export_timestamp > 86400 | critical | 0m |
| BeylaAttachErrors | beyla_ebpf_attach_errors > 0 | warning | 5m |
| MLAnomalyLag | ml_anomaly_detector_lag_seconds > 120 | warning | 5m |
| MLBaselineStale | ml_anomaly_detector_baseline_age_hours > 672 | warning | 0m |
| PyroscopeIngesterHigh | pyroscope_ingester_memory_utilization > 0.7 | warning | 5m |
| ProfileIngestionDrop | rate(pyroscope_ingester_dropped_profiles_total[5m]) > 0 | warning | 2m |
| TraceSamplingAnomaly | rate(otelcol_processor_tail_sampling_count_traces_sampled[5m]) / rate(otelcol_processor_tail_sampling_count_traces_received[5m]) > 0.05 | warning | 2m |
| VictoriaLogsStorageFull | vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes > 0.85 | warning | 5m |
| LogIngestionDrop | rate(vlinsert_ingested_rows_total[5m]) < 40000000 | critical | 5m |
15. Deployment Strategy
The platform deploys on Kubernetes across multiple regions. Each region runs a complete, independent pipeline. Kubernetes provides DaemonSet-based collector deployment, service discovery for scrape targets, and horizontal pod autoscaling for stateless components.
15.1 Multi-Region Topology
Active-active ingestion: Each region runs its own complete pipeline (Beyla + OTel → Kafka → Flink → VictoriaMetrics + Pyroscope). Telemetry stays region-local, no cross-region ingestion traffic.
Cross-region query federation: A global vmselect gateway fans out queries to VictoriaMetrics clusters in all regions. For global dashboards ("total request rate across all regions"), the gateway merges results from every region.
S3 replication: Each region's S3 bucket has cross-region replication enabled. If us-east-1 goes down entirely, warm/cold data is still accessible from us-west-2.
15.2 Rolling Upgrades
VictoriaMetrics: Upgrade one component at a time. vminsert first (stateless, instant restart). vmselect next (stateless). vmstorage last (stateful, needs graceful drain: mark as draining, stop new writes, wait for in-flight to complete, upgrade, re-join).
Flink: Blue-green deployment. Deploy new Flink job version alongside the existing one (different consumer group). Once the new job has caught up (consumer lag = 0), switch traffic. Stop the old job.
Beyla: Rolling DaemonSet update. eBPF probes detach when the old pod stops and reattach when the new pod starts. Brief gap (~5-10 seconds per node) in eBPF-generated telemetry during the rollout.
15.3 Disaster Recovery
RPO/RTO targets:
| Component | RPO (data loss tolerance) | RTO (recovery time) | Mechanism |
|---|---|---|---|
| Metrics (hot) | 0 (with RF=2) | < 5 minutes | Replica serves queries while failed node recovers. |
| Metrics (warm/cold) | 0 | < 30 minutes | S3 cross-region replication. |
| Kafka | 0 (RF=3, acks=all) | < 1 minute | Leader election on surviving replicas. |
| Flink | Last checkpoint (seconds to minutes) | < 5 minutes | Restart from checkpoint. |
| Traces (Tempo) | ~60s of in-flight spans | < 10 minutes | Stateless query frontend. Ingesters lose buffer. |
| Logs (VictoriaLogs) | Since last vmbackup (hours) | 15-30 minutes | Restore from S3 snapshot. |
| Profiles (Pyroscope) | ~30s of in-flight profiles | < 10 minutes | Kafka buffers (12h retention). Ingesters lose buffer. |
| ML Anomaly Detector | Baseline state in RocksDB (hours to rebuild) | < 15 minutes | Restart, replay from Kafka. Baselines rebuild over 24-48h. |
16. Security
16.1 Security Fundamentals
- Transport: mTLS between all components (OTel Collectors, Beyla, Kafka, Flink, VictoriaMetrics, Pyroscope). TLS 1.3 for external-facing APIs. Certificates managed by cert-manager.
- Tenant isolation: Separate data directories per tenant on vmstorage (/data/{tenant_id}/). vmselect injects a tenant_id filter into every query. Alert rules, recording rules, and dashboards are tenant-scoped with RBAC at the API gateway.
- Encryption at rest: AES-256 disk encryption for vmstorage (EBS/NVMe). SSE-S3 or SSE-KMS for S3. Broker-side encryption for Kafka.
- eBPF security: Beyla requires CAP_SYS_ADMIN or CAP_BPF + CAP_PERFMON capabilities. These are restricted to the Beyla DaemonSet pods only. eBPF programs are verified by the kernel's BPF verifier before loading, preventing arbitrary kernel code execution.
- Profile data sensitivity: CPU and memory profiles can reveal internal function names, code paths, and potentially sensitive string values in stack frames. Profile data inherits the same tenant isolation and encryption-at-rest as other signals. RBAC restricts profile access to service owners and on-call engineers.
- Audit logging: All administrative actions (tenant CRUD, alert rule changes, cardinality limit overrides, anomaly suppression) logged to a separate append-only store. 1-year minimum retention for compliance.
16.2 Query Injection Prevention
MetricsQL expressions are user-provided strings. Without sanitization, a crafted query could attempt to access another tenant's data or execute resource-intensive queries.
Protections:
- vmselect forcibly prepends {tenant_id="<authenticated_tenant>"} to every query. The user-provided tenant_id label is stripped (see the sketch after this list).
- Query complexity limits: maximum number of series touched (500K default), maximum query duration (60s), maximum number of subqueries (10).
- Regex complexity limits: MetricsQL regex matchers are checked for backtracking patterns before execution.
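A sketch of the first protection; matchers are simplified to label/value pairs rather than a full MetricsQL parse:

```go
package queryauth

// Matcher is a simplified label selector extracted from a parsed query.
type Matcher struct {
	Label, Value string
}

// enforceTenant strips any tenant_id matcher the caller supplied and prepends
// the matcher for the authenticated tenant, so a query can never select
// another tenant's series regardless of what the expression says.
func enforceTenant(authenticatedTenant string, matchers []Matcher) []Matcher {
	out := []Matcher{{Label: "tenant_id", Value: authenticatedTenant}}
	for _, m := range matchers {
		if m.Label == "tenant_id" {
			continue // drop the user-provided tenant selector
		}
		out = append(out, m)
	}
	return out
}
```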
Explore the Technologies
The technologies and infrastructure patterns referenced throughout this design:
Core Technologies
| Technology | Role in This System | Learn More |
|---|---|---|
| OpenTelemetry | Vendor-neutral instrumentation standard. SDK for custom metrics, traces, and context propagation across all services | OpenTelemetry |
| OTel Collector | Telemetry pipeline engine. Receives from SDK and Beyla, processes (batch, filter, enrich, sample), exports to storage backends | OTel Collector |
| Grafana Beyla | eBPF-based auto-instrumentation, zero-config RED metrics and trace spans from day one without code changes | Grafana Beyla |
| Kafka | Ingestion buffer, 400M metrics/sec (post-pipeline), fingerprint-partitioned topics | Kafka |
| VictoriaMetrics | Hot TSDB, Gorilla compression, shared-nothing cluster (vminsert/vmselect/vmstorage) | VictoriaMetrics |
| Flink | Stream aggregation, recording rules, SLO burn rate calculation | Flink |
| Prometheus | Scrape format standard, MetricsQL foundation, meta-monitoring stack | Prometheus |
| Grafana | Dashboard visualization, multi-datasource queries, flame graph panels | Grafana |
| Grafana Pyroscope | Continuous profiling storage, S3-native, pprof format, span-to-profile correlation | Grafana Pyroscope |
| InfluxDB | Alternative TSDB evaluated in Section 4.3 (FDAP stack, Arrow + Parquet) | InfluxDB |
| ClickHouse | Optional analytics layer for cross-signal wide-event exploration | ClickHouse |
| RocksDB | Flink state backend, VictoriaMetrics mergeset index, ML anomaly detector baselines | RocksDB |
| gRPC | vminsert-to-vmstorage communication, protobuf metric serialization | gRPC |
| Envoy | Service mesh sidecar, mTLS for scrape traffic, load balancing | Envoy |
| Memcached | Query result cache for Tempo trace queries and vmselect overflow caching | Memcached |
| Gorilla Compression | Delta-of-delta + XOR encoding, 12x compression ratio, 1.37 bytes/sample | Gorilla Compression |
| ZSTD Compression | General-purpose compression for cold-tier Parquet blocks on S3, Kafka messages, RocksDB SST files | ZSTD Compression |
| Grafana Tempo | Distributed trace storage, S3-backed Parquet blocks, TraceQL queries, exemplar correlation | Grafana Tempo |
| VictoriaLogs | Log storage, bloom filter per-token index, LogsQL queries, vlinsert/vlselect/vlstorage cluster | VictoriaLogs |
Infrastructure Patterns
| Pattern | Role in This System | Learn More |
|---|---|---|
| Metrics & Monitoring | Core domain. Gorilla compression, TSDB internals, cardinality management | Metrics & Monitoring |
| Message Queues & Event Streaming | Kafka topic design, fingerprint partitioning, backpressure handling | Event Streaming |
| Alerting & On-Call | Distributed rule evaluation, Alertmanager, SLO burn rates, ML anomaly detection | Alerting & On-Call |
| Kubernetes Architecture | Service discovery, DaemonSet collectors, eBPF agents, pod-level scraping | Kubernetes Architecture |
| Rate Limiting & Throttling | Per-tenant ingestion limits, query concurrency limits, cardinality budgets | Rate Limiting |
| Circuit Breaker & Resilience | vmstorage failure handling, Kafka retry policies, graceful degradation | Circuit Breaker & Resilience |
| Auto-Scaling Patterns | vminsert/vmselect horizontal scaling, Flink TaskManager auto-scaling | Auto-Scaling |
| Distributed Tracing | Trace collection via OTel SDK + eBPF, tail-based sampling, Tempo on S3, profile correlation | Distributed Tracing |
| Distributed Logging | Log collection via OTel filelog receiver, VictoriaLogs bloom filter indexing, value-based routing | Distributed Logging |
| Database Sharding | VictoriaMetrics consistent hashing, tenant shuffle-sharding | Database Sharding |
| Service Discovery & Registration | Kubernetes SD, Consul, eBPF auto-discovery | Service Discovery |
| Deployment Strategies | Blue-green for Flink jobs, rolling upgrades for VictoriaMetrics | Deployment Strategies |
| Caching Strategies | vmselect query cache, recording rule pre-computation, Parquet block cache | Caching Strategies |
| Object Storage & Data Lake | S3 + Parquet warm/cold tiers, Pyroscope S3 storage, ZSTD compression | Object Storage |
| Secrets Management | mTLS certificates, tenant API keys, S3 encryption keys | Secrets Management |
Further Reading
- Facebook Gorilla Paper: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015)
- Datadog Monocle: Evolving Real-Time Timeseries Storage in Rust
- Uber M3: Open Source Large-Scale Metrics Platform
- High-Cardinality TSDB Benchmarks: VictoriaMetrics vs TimescaleDB vs InfluxDB
- Datadog Metrics without Limits
- Google SRE Book: Alerting on SLOs
- Grafana Mimir Architecture
- Honeycomb: Why Observability Requires a Distributed Column Store
- Grafana Pyroscope: Continuous Profiling at Scale
- Grafana Beyla: eBPF Auto-Instrumentation
- ClickHouse for Observability: SigNoz Architecture
- STL Decomposition for Anomaly Detection in Time Series
A unified observability platform at this scale is four pipelines sharing a two-tier collection layer. eBPF provides baseline telemetry for every service from day zero; OTel SDK adds depth when teams need it. A value-based routing pipeline drops noise before storage, reducing costs by 15-30%. Kafka buffers all four signal firehoses. VictoriaMetrics stores metrics with 12x Gorilla compression. Tempo stores traces as Parquet blocks on S3 after tail-based sampling cuts 98-99.5% of raw volume. VictoriaLogs stores logs with bloom filter indexing at 3x the throughput of Loki. Pyroscope stores profiles on S3 with span-level correlation, closing the gap between "this span is slow" and "this function is the bottleneck." ML anomaly detection supplements SLO burn rates with seasonal baseline comparison, catching gradual degradations that threshold-based alerts miss. Grafana ties all four together: four clicks from "something is slow" to "here is the exact function causing it." The hardest problems remain cardinality for metrics, sampling strategy for traces, volume control for logs, and adoption for profiling. Per-tenant limits, meta-monitoring, and a separate watchdog stack keep the whole thing running.