System Design: Observability Platform at Scale - Metrics, Traces, and Logs
Goal: Build a unified observability platform that ingests 500 million metrics per second, collects distributed traces across 500K+ services, and centralizes logs from every container in the fleet. Support all three pillars of observability: metrics (counters, gauges, histograms, summaries), traces (per-request call trees across service boundaries), and logs (structured application output with full-text search). Provide real-time dashboards, alerting with SLO (Service Level Objective)-based burn rates, multi-tenant isolation, and long-term retention with automatic downsampling.
Reading guide: This is a long, detailed deep dive. You don't need to read it linearly.
- Sections 1-8: Metrics pipeline in full depth (the largest and most complex signal)
- Section 9: Distributed tracing at scale
- Section 10: Centralized logging
- Sections 11-15: Cross-cutting concerns (bottlenecks, failures, meta-monitoring, deployment, security)
All three signals share the same OTel Collector fleet and Kafka cluster but diverge at the storage layer.
- New to observability? Start with Sections 1-7 for the core architecture and sizing. Skim Section 8.3 (cardinality) for the scariest problem in metrics. Skip to Section 12 for real failure stories.
- Building something similar? Sections 7-8 have the sizing math and deep dives you need.
- Preparing for a system design interview? Sections 1-8 cover what interviewers expect. Section 11 (bottlenecks) and Section 12 (failures) are common follow-up questions.
TL;DR: Three pipelines sharing one OTel + Kafka collection layer. Metrics: Flink stream aggregation into VictoriaMetrics (Gorilla compression, 1.37 bytes/sample). Traces: tail-based sampling (99.5% volume reduction) into Grafana Tempo on S3. Logs: VictoriaLogs with bloom filter indexing (3x Loki throughput). Grafana correlates all three via exemplars and trace_id. The hardest problems: cardinality control for metrics, sampling strategy for traces, volume control for logs.
1. Problem Statement
A few clarifications before getting into the architecture:
Scale clarification: This design targets extreme hyperscale. Most organizations ingest tens of thousands to low single-digit millions of metrics/sec. Only the largest enterprises (thousands of microservices, hundreds of thousands of hosts) reach 10M-50M. The major SaaS platforms operate well beyond that: Datadog processes 100+ trillion events per day across all telemetry types, with billions of data points per second. Uber's M3 aggregates 500M metrics/sec (pre-aggregation) across 6.6 billion active time series, persisting 20M metrics/sec after rollup. Grafana Mimir benchmarked 1 billion active series at ~50M samples/sec on a single cluster. The 500M metrics/sec figure in this design represents a stress-test upper bound, roughly matching Uber's pre-aggregation rate, to pressure-test every layer of the architecture. Section 7.3.1 shows how the same architecture scales down linearly to 10-50M metrics/sec with fewer components.
Assumptions:
- This is a multi-tenant platform (like Datadog, Grafana Cloud, or New Relic). Multiple teams and organizations push metrics into the same infrastructure.
- "Ingest" means accepted into durable storage. Query availability follows within 2 seconds of ingestion.
- Each metric is a (name, label set, timestamp, float64 value) tuple. Labels are key-value pairs like {service="checkout", env="prod", region="us-east-1"}.
- Services expose metrics via Prometheus exposition format or push via OTLP (OpenTelemetry Protocol).
Operating model:
- Responsibility split: The platform team operates collection infrastructure (OTel Collectors, Kafka, storage, dashboards). Tenant teams instrument their applications using OpenTelemetry SDKs, Prometheus client libraries, or auto-instrumentation agents (similar to Datadog/New Relic agents but vendor-neutral). Tenants either expose a /metrics endpoint (pull) or push OTLP to the platform's ingestion endpoint (push). Tenants do not run their own Prometheus servers or storage. The platform handles everything after the application boundary.
- Data transport: For tenants on the same Kubernetes cluster, the platform deploys OTel Collectors as a DaemonSet (one per node) that scrapes tenant pods locally over mTLS with no cross-network hop. For tenants on separate infrastructure, the platform exposes an HTTPS ingestion endpoint (OTLP or Prometheus remote_write) authenticated via per-tenant API key. Both paths converge into the same Kafka-backed pipeline.
Scope: Three Pillars of Observability
A production monitoring platform handles three signals that look nothing alike:
- Metrics answer "how much?" Small numbers (CPU at 45%, 200 req/sec), arriving constantly.
- Traces answer "what happened to this one request?" A call tree across service boundaries with timing data.
- Logs answer "what did the system say?" Structured text messages with full-text search.
Each requires a different storage engine. Metrics are aggregated (average CPU over 5 minutes), traces are looked up by ID (show me request abc123), and logs are searched by keywords (find all "timeout" errors):
| Signal | Data Shape | Size per Event | Storage Engine | Query Pattern |
|---|---|---|---|---|
| Metrics | (name, labels, timestamp, float64) | 1.5-2 bytes (Gorilla compression, Section 8.2) | TSDB (VictoriaMetrics) | Aggregate: rate(x[5m]) |
| Traces | (trace_id, span_id, parent_id, service, duration, attributes) | 500-2000 bytes/span | Columnar/object store (Grafana Tempo) | By trace ID, filter by service+latency |
| Logs | (timestamp, severity, message, structured_fields) | 200-1000 bytes | Log store (VictoriaLogs) | Full-text search, filter by severity |
Trying to shove all three signals into a single TSDB is a design mistake that surfaces at scale.
What NOT to do:
A single Prometheus instance typically handles 3-10 million active series depending on available RAM and scrape interval. At 500M metrics/sec with potentially billions of unique series, that setup will OOM (Out of Memory crash) before lunch. Running 50,000 standalone Prometheus instances isn't a solution either. That's 50,000 things to manage, 50,000 things to fail, and no global query capability.
The other common mistake: storing raw metrics forever. At 500M samples/sec, that's 43.2 trillion samples/day. Even at 1.37 bytes per sample (Gorilla compression), that's ~59 TB/day of hot storage. Without downsampling, the storage bill alone kills the project. Downsampling applies to metrics only. Traces and logs retain full resolution or are discarded entirely (Section 9.6 explains why).
2. Functional Requirements
| ID | Requirement | Priority |
|---|---|---|
| FR-01 | Ingest metrics via pull (Prometheus scrape) and push (OTLP or Prometheus remote_write) | P0 |
| FR-02 | Support four metric types: counter, gauge, histogram, summary | P0 |
| FR-03 | Real-time dashboard queries with sub-second latency over 15-day window | P0 |
| FR-04 | Continuous alert rule evaluation with configurable thresholds | P0 |
| FR-05 | SLO (Service Level Objective)-based alerting with multi-window burn rate detection | P0 |
| FR-06 | Automatic metric downsampling: raw (15d) → 5-min aggregates (90d) → 1-hour aggregates (forever). Traces and logs are not downsampled. | P0 |
| FR-07 | Multi-tenant data isolation with per-tenant cardinality limits | P0 |
| FR-08 | Recording rules for pre-computing expensive queries | P1 |
| FR-09 | Service discovery (Kubernetes, Consul) for automatic scrape target registration | P1 |
| FR-10 | Label-based access control (RBAC) on dashboards and queries | P1 |
| FR-11 | Alertmanager integration: deduplication, grouping, silencing, routing | P1 |
| FR-12 | Cross-region query federation for global dashboards | P1 |
| FR-13 | Metric relabeling and filtering at ingestion time | P2 |
| FR-14 | Cost attribution per tenant based on active series count and query volume | P2 |
| FR-15 | Distributed trace collection via OTLP with tail-based sampling (keep errors, p99, 0.1% baseline) | P0 |
| FR-16 | Trace-to-metric correlation via exemplars (click from metric panel to trace waterfall) | P1 |
| FR-17 | Centralized log ingestion with full-text search across all fields (50M lines/sec) | P0 |
| FR-18 | Log-to-trace correlation via trace_id field (three-way: metric to trace to log) | P1 |
3. Non-Functional Requirements
| Requirement | Target |
|---|---|
| Ingestion throughput | 500M data points/sec sustained |
| Query latency (15-day window) | p50 < 50ms, p99 < 200ms |
| Query latency (90-day window, downsampled) | p50 < 200ms, p99 < 500ms |
| Ingestion-to-dashboard visibility | < 2 seconds |
| Alert rule evaluation interval | 15 seconds (configurable) |
| Alert firing latency (ingestion to notification) | < 30 seconds |
| Availability | 99.99% (52 min downtime/year) |
| Data durability | 99.999% (no data loss on single node failure) |
| Active time series capacity | 10 billion+ |
| Retention | Raw: 15 days, 5-min: 90 days, 1-hour: indefinite |
| Tenant isolation | No cross-tenant data leakage, noisy neighbor protection |
4. High-Level Approach & Technology Selection
4.1 Why This Architecture
A metrics monitoring system at 500M/sec is write-heavy and read-light, but when reads happen, they need to be fast. The write-to-read ratio is roughly 1000:1. Millions of data points flow in every second. A handful of dashboards and alert rules query that data.
- Append-only writes. Metrics are immutable. "At 3:14pm, CPU was 45%" never changes. The storage engine only appends, never seeks and updates, which is far faster.
- Parallel everything. 500M writes per second on one machine is impossible. Spread the work across thousands of machines, each handling a slice of the data.
- No read-before-write. At 500M/sec, checking whether a row exists before writing would be the bottleneck. Skip the lookup entirely.
- Shard by metric identity. Assign each metric to a specific storage node based on its fingerprint (a hash of the metric name + labels). All samples for one metric always go to the same node. This keeps related data together, which is critical for compression and fast queries.
- Compress aggressively. Time-series data has unique properties (timestamps arrive at regular intervals, values often repeat or change slowly) that enable 12x compression with the right algorithm.
- Touch minimal data on reads. Inverted indexes map label selectors to series IDs. Recording rules pre-compute expensive aggregations so dashboards never scan raw data.
- Tier retention automatically. Hot data (raw, 15 days) on fast storage. Warm data (5-min aggregates, 90 days) on object storage. Cold data (1-hour aggregates) archived indefinitely.
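A toy illustration of the "compress aggressively" point (this is not Gorilla's actual bit-level encoding, which also XOR-compresses values; it only shows the timestamp half of the idea): when scrapes arrive at a regular interval, second-order timestamp deltas collapse to zeros, which encode in roughly one bit each.

```python
def delta_of_deltas(timestamps):
    """Second-order differences: regular intervals collapse to zeros."""
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    return [b - a for a, b in zip(deltas, deltas[1:])]

# Scrapes every 15s (millisecond timestamps): deltas are constant,
# so every delta-of-delta is 0 and costs almost nothing to encode.
ts = [1709827200000 + i * 15000 for i in range(6)]
print(delta_of_deltas(ts))  # [0, 0, 0, 0]
```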
4.2 Core Components
The pipeline has five components:
- Collection layer - OpenTelemetry Collectors scrape metrics from services (pull) or accept pushed metrics (OTLP). Service discovery automates target registration.
- Ingestion buffer - Kafka absorbs write bursts, decouples collection from storage, and provides replay capability for recovery.
- Stream aggregation - Flink pre-computes recording rules, rollups, and SLO burn rates on the ingest path. Raw metrics become queryable aggregates at this stage.
- Time-series storage - VictoriaMetrics cluster stores hot data with Gorilla compression. S3 + Parquet stores downsampled data for long-term retention.
- Query and alerting - MetricsQL engine serves dashboard queries. Distributed rule evaluators continuously check alert conditions.
4.3 Technology Selection
| Component | Technology | Why This Choice |
|---|---|---|
| Collection | OpenTelemetry Collector | CNCF standard. Pull (Prometheus scrape) and push (OTLP) in one agent. 700+ integrations. Vendor-neutral. |
| Ingestion buffer | Apache Kafka | Replay on storage failure (24h retention), fan-out to Flink + vminsert, burst absorption at 500M/sec. RF=3, acks=all. Partitioned by metric fingerprint for locality. |
| Stream aggregation | Apache Flink | Native streaming (not micro-batch like Spark). Managed state with RocksDB backend. Exactly-once via distributed snapshots. Handles 50M+ keys without GC pressure. |
| TSDB (hot, 0-15d) | VictoriaMetrics Cluster | 2.2M data points/sec per node baseline (5M+ on production-grade hardware, see Section 7.3). PromQL-compatible. Gorilla compression built-in. Shared-nothing architecture. |
| Long-term storage (warm/cold) | S3 + Apache Parquet | Columnar format enables partition pruning. ZSTD compression. Massively lower storage footprint vs hot TSDB. Inspired by Thanos and InfluxDB 3.0 FDAP stack. |
| Inverted index | VictoriaMetrics mergeset | Maps label combinations to series IDs. Handles 100M+ series without OOM. Disk-backed, not in-memory like Prometheus. |
| Query engine | MetricsQL | PromQL superset. Keeps metric names after aggregations. Better high-cardinality handling. WITH templates for complex queries. Drop-in replacement for existing PromQL users. |
| Alert evaluation | vmalert + Alertmanager | vmalert evaluates rules against vmselect, sharded across pods for scale (similar pattern to Grafana Mimir ruler). Alertmanager handles dedup, grouping, routing to PagerDuty/Slack/Opsgenie. |
| Service discovery | Kubernetes API + Consul | Auto-discover scrape targets. No manual configuration when services scale. Label-based relabeling for metric filtering. |
| Visualization | Grafana | Industry standard. Multi-datasource. Dashboard load target <3s. Variable templating for multi-service views. |
| Compression | Gorilla + ZSTD | Not stacked: different compressors for different tiers. Gorilla compresses raw samples in vmstorage (hot, 0-15 days): 12x ratio, 1.37 bytes/sample. When data ages out, Flink produces aggregated doubles (min/max/avg/sum/count) that are no longer sequential time-series samples, so Gorilla doesn't apply. ZSTD compresses these Parquet blocks on S3 (warm/cold, 15+ days): ~3x ratio, tunable levels (1-22). |
| Multi-tenancy | Tenant ID header + per-tenant series limits | Inject tenant_id on remote write. Reject writes exceeding cardinality budget. Shuffle-sharding for isolation. |
| Trace backend | Grafana Tempo | S3-native Parquet storage. No index to maintain (bloom filters for trace ID lookup). TraceQL query language. Fraction of Jaeger+Elasticsearch infrastructure at scale. |
| Log backend | VictoriaLogs | Bloom filter per-token index on all fields. LogsQL for full-text search. vlinsert/vlselect/vlstorage cluster scales horizontally. Section 10.2 benchmarks against Loki and Elasticsearch. |
| Context propagation | W3C Trace Context | Industry standard traceparent header. OTel SDK auto-injects for HTTP/gRPC. Manual injection required for Kafka message headers. |
Why Kafka in the metrics path: Most open-source metrics systems (Prometheus, Thanos, Mimir) skip Kafka entirely, writing directly from collectors to storage via remote_write. That works well below ~50M metrics/sec. At 500M metrics/sec, Kafka provides three capabilities a direct pipeline cannot:
- Replay after storage failures. If vmstorage crashes, Kafka retains 24 hours of raw metrics for re-ingestion without data loss.
- Fan-out to multiple consumers. Both Flink (for recording rules) and vminsert (for raw storage) consume the same topic independently.
- Burst absorption during scrape storms. When thousands of targets align their scrape intervals, Kafka absorbs the microsecond-level burst while vminsert consumes at a steady rate.
At smaller scales (<50M metrics/sec), a direct OTel Collector to vminsert pipeline via remote_write is simpler, lower latency, and recommended.
4.4 Why VictoriaMetrics Over Alternatives
| Dimension | VictoriaMetrics | InfluxDB 3.0 | TimescaleDB | M3DB (Uber) |
|---|---|---|---|---|
| Ingestion rate (single node) | 2.2M pts/sec | 330K pts/sec | 480K pts/sec | ~500K pts/sec |
| RAM usage (4M series benchmark) | 6 GB | 20.5 GB | 2.5 GB | ~12 GB |
| Disk usage (4M series) | 3 GB | 18.4 GB | 52 GB | ~8 GB |
| Query language | MetricsQL (PromQL superset) | SQL (via DataFusion) | SQL | M3QL + PromQL |
| Compression | Gorilla (1.37 bytes/sample) | Arrow + Parquet + ZSTD | PostgreSQL TOAST | M3TSZ (Gorilla variant) |
| Architecture | Shared-nothing (vminsert/vmselect/vmstorage) | Monolithic rewrite (Rust) | PostgreSQL extension | Quorum-based (3-way replication) |
| Operational complexity | Low (stateless query/insert, stateful storage only) | Medium (newer, less battle-tested) | Medium (PostgreSQL tuning required) | High (etcd dependency, complex topology) |
Why VictoriaMetrics wins here:
VictoriaMetrics ingests 6.6x faster than InfluxDB at one-third the RAM. The shared-nothing architecture means vminsert nodes are stateless and horizontally scalable. Each vmstorage node owns its shard independently. No quorum overhead, no coordination on writes. For a 500M/sec write-heavy workload, that absence of coordination is critical.
The three components: VictoriaMetrics cluster splits into three specialized roles:
- vminsert (write router): Receives metrics and uses consistent hashing on the metric fingerprint to route each sample to the correct vmstorage node. Stateless (no local data), scales horizontally.
- vmstorage (storage engine): Each node owns a shard of time series data on local NVMe. Handles Gorilla compression, maintains the inverted index, and runs compaction. The only stateful component.
- vmselect (query engine): Receives MetricsQL queries, fans them out to all vmstorage nodes in parallel, merges results, and applies functions like rate(). Also stateless.
Each dimension (write throughput, storage capacity, query concurrency) scales independently by adding nodes to the relevant component.
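To make the fingerprint-to-node routing concrete, here is Jump Consistent Hash (Lamping & Veach), one well-known consistent-hashing scheme whose key property is that growing the node count by one remaps only ~1/n of series. Whether VictoriaMetrics uses this exact algorithm internally is an assumption for illustration, not a claim:

```python
def jump_hash(key: int, num_buckets: int) -> int:
    """Jump Consistent Hash: maps a 64-bit key to a bucket in
    [0, num_buckets) with minimal remapping when buckets are added."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int(float(b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b

fingerprint = 0x7A3F2B1C
node = jump_hash(fingerprint, 200)  # pick one of 200 vmstorage nodes
```

Adding a 201st node moves roughly 0.5% of series to the new node; the other 99.5% keep their compression-friendly locality on the node they already live on.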
The PromQL compatibility matters too. The entire alerting ecosystem (Alertmanager, Grafana, recording rules) speaks PromQL. Choosing a SQL-native TSDB like InfluxDB 3.0 means rebuilding that tooling from scratch. MetricsQL extends PromQL without breaking it.
M3DB is battle-proven at Uber (6.6B series), but the operational complexity is much higher. etcd cluster management, complex topology, and 3-way quorum writes add latency and failure modes. VictoriaMetrics achieves comparable scale with a simpler operational model. Use Prometheus as the format and query language standard. Replace the storage layer with VictoriaMetrics.
4.5 Why Flink for Stream Aggregation
| Dimension | Apache Flink | Spark Structured Streaming | Kafka Streams |
|---|---|---|---|
| Processing model | True streaming (event-by-event) | Micro-batch (100ms+ intervals) | True streaming |
| State management | RocksDB backend (spills to disk) | Heap-based (GC pressure) | RocksDB or in-memory |
| Exactly-once | Distributed snapshots (Chandy-Lamport) | Offset tracking | Kafka transactions |
| Scalability | Independent of Kafka (separate cluster) | Independent cluster | Tied to Kafka partitions |
| Windowing | Event-time, processing-time, session | Event-time, processing-time | Event-time, session |
| State size at 500M/sec | 50M+ keys, multi-GB, no GC pauses | Multi-second GC pauses at scale | Limited by Kafka partition count |
For this system: 500M metrics/sec flowing through recording rules, rollup aggregations, and SLO burn rate calculations. At peak, that's 50 million active aggregation keys in state. Flink's RocksDB state backend handles this without heap pressure. Spark's JVM heap would cause multi-second GC pauses that violate the 2-second ingestion-to-visibility target.
Why not the Prometheus/Mimir ruler? The ruler evaluates recording rules by querying the TSDB every 15-30 seconds. At moderate scale that works. At 500M metrics/sec, each evaluation cycle competes with dashboard and alert queries for vmselect capacity. Flink processes metrics inline during ingestion with zero query overhead. The trade-off: Flink adds operational complexity. Below ~50M metrics/sec, the ruler is simpler and sufficient.
Scope: Flink processes metrics only. Traces and logs bypass it entirely (Sections 9 and 10); they don't need pre-aggregation or downsampling.
5. High-Level Architecture
5.1 Bird's-Eye View
Component glossary (numbered steps match the diagram):
(1) Collection (two models):
- Pull (same K8s cluster): The platform's OTel Collector DaemonSet runs one collector per K8s node (~3,000 collectors total). It scrapes local pods' /metrics endpoints over mTLS. No cross-network hop. Service Discovery (K8s API, Consul) automatically finds new scrape targets as services scale.
- Push (external infrastructure): Services on separate clusters, on-prem, or serverless can't be scraped. They push metrics, traces, and logs via OTLP to the OTel Collector Gateway over HTTPS, authenticated by per-tenant API key.
(2) Kafka (ingestion buffer):
- metrics-ingestion: 10,000 partitions, 24h retention, partitioned by metric fingerprint. Enables replay if storage crashes.
- traces-raw: partitioned by trace_id so all spans from one trace land together.
- logs-raw: partitioned by service_name for locality.
(3)→(4) Processing:
- Flink computes recording rules, 5-minute rollups (20x data reduction), and SLO burn rate calculations on the metrics stream.
- Tail-Based Sampler keeps errors, p99 latency, and a small random baseline (Section 9.2 covers the full sampling strategy).
- Log Parser extracts structured fields (service name, log level, trace_id) and routes by severity.
(5) Storage:
- vminsert → vmstorage → S3 Parquet: Stateless write router hashes metrics to storage nodes. vmstorage compresses with Gorilla (1.37 bytes/sample) on NVMe (15-day hot tier). Aged data exports to S3 as ZSTD-compressed Parquet (warm 90 days, cold 1 year).
- Tempo Ingesters → S3 Parquet: Batches sampled spans into Parquet blocks on S3. Bloom filters enable fast trace ID lookup.
- vlinsert → vlstorage: Distributes logs across storage nodes. Bloom filter index on every field. Local NVMe storage.
(6) Query:
- vmselect: Fans out MetricsQL queries to all vmstorage nodes, merges results.
- Tempo Query: TraceQL queries against S3 Parquet blocks.
- vlselect: LogsQL full-text search across vlstorage nodes.
- Grafana: Unified dashboard querying all three backends. Exemplars link metrics to traces. trace_id links traces to logs.
The architecture splits into a control plane (tenant configs, alert/recording rules, cardinality budgets, RBAC, managed via API) and a data plane (OTel Collectors, Kafka, Flink, VictoriaMetrics, Tempo, VictoriaLogs, the telemetry pipeline). They scale independently and deploy separately.
All three signals share the same OTel Collectors and Kafka cluster, then split at the storage layer: each signal routes to a storage engine built for its access pattern. Section 6.3 details the full Kafka topic layout.
Pull vs Push collection. The OTel Collector normalizes both models into the same protobuf format.
How each signal reaches the platform:
| Signal | Same K8s cluster | External infrastructure | Protocol |
|---|---|---|---|
| Metrics | Pull: Collector scrapes /metrics (local, mTLS) | Push: App sends OTLP to collector gateway (HTTPS) | Prometheus text (pull) or OTLP protobuf (push) |
| Traces | Push: OTel SDK sends spans to local collector (gRPC) | Push: OTel SDK sends spans to collector gateway (gRPC/HTTPS) | OTLP protobuf over gRPC (default) or HTTP |
| Logs | File tail: Collector reads /var/log/pods/*/ (no network) | Push: App sends OTLP logs to collector gateway (gRPC/HTTPS) | Local file I/O (same cluster) or OTLP protobuf (external) |
After Kafka, each signal diverges to its specialized storage pipeline (see numbered steps above). Grafana correlates all three via exemplars and trace_id fields.
6. Data Model
6.1 Metric Wire Format
Every metric sample follows the Prometheus data model:
metric_name{label1="value1", label2="value2"} float64_value unix_timestamp_ms
Example:
http_requests_total{service="checkout", method="POST", status="200", region="us-east-1"} 142857 1709827200000
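A minimal parser for a line in that shape (a sketch only: the full Prometheus text format also handles escapes, comments, exemplars, and NaN/Inf values, which this ignores):

```python
import re

LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'  # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'           # optional {key="value", ...}
    r'\s+(?P<value>\S+)'                    # float64 value
    r'(?:\s+(?P<ts>\d+))?$'                 # optional ms timestamp
)

def parse_sample(line: str):
    m = LINE_RE.match(line.strip())
    if not m:
        raise ValueError(f"unparseable sample: {line!r}")
    labels = dict(re.findall(r'(\w+)="([^"]*)"', m.group("labels") or ""))
    ts = int(m.group("ts")) if m.group("ts") else None
    return m.group("name"), labels, float(m.group("value")), ts

name, labels, value, ts = parse_sample(
    'http_requests_total{service="checkout", method="POST"} 142857 1709827200000'
)
```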
Why fingerprints? At 500M metrics/sec, the system needs deterministic routing: each metric must always land on the same Kafka partition and storage node. A fingerprint hash of the metric name + labels provides this.
The series identity (fingerprint) is computed as:
fingerprint = FNV-1a(sorted(metric_name + labels)) // FNV-1a: a fast, non-cryptographic hash function
↑ Labels sorted alphabetically by key name before hashing.
Different producers may emit labels in different order
(e.g., {service, method} vs {method, service}).
Sorting guarantees the same series always produces the same fingerprint.
Example:
metric: http_requests_total
labels: {method="POST", region="us-east-1", service="checkout", status="200"}
fingerprint: hash("http_requests_total\x00method\x00POST\x00region\x00us-east-1\x00service\x00checkout\x00status\x00200")
→ 0x7a3f2b1c (used for Kafka partitioning and storage sharding)
Why the fingerprint matters: This single hash drives data locality at every layer.
- Kafka uses it as the partition key. All samples for one metric land on the same partition, so Flink can process that metric's recording rules without fetching data from other partitions.
- vminsert uses it to route all data for one metric to the same storage node. Keeping data together on one node is critical for compression (the compressor needs consecutive values to find patterns) and for queries (look up one node, not all 200).
- S3 Parquet files are keyed by fingerprint range. When querying cold data, the system skips entire files whose fingerprint range doesn't match, instead of scanning everything.
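The fingerprint computation above can be sketched as follows. The FNV-1a constants are the standard 64-bit ones and the `\x00`-separated byte layout follows the example; a production implementation may differ in byte layout, but the sort-then-hash structure is the point:

```python
FNV_OFFSET = 0xCBF29CE484222325  # standard 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3        # standard 64-bit FNV prime

def fnv1a_64(data: bytes) -> int:
    h = FNV_OFFSET
    for byte in data:
        h ^= byte
        h = (h * FNV_PRIME) & 0xFFFFFFFFFFFFFFFF
    return h

def fingerprint(metric_name: str, labels: dict) -> int:
    # Sort labels by key so producers emitting different label
    # orders still yield the same series identity.
    parts = [metric_name]
    for k in sorted(labels):
        parts.extend([k, labels[k]])
    return fnv1a_64("\x00".join(parts).encode())

a = fingerprint("http_requests_total", {"service": "checkout", "method": "POST"})
b = fingerprint("http_requests_total", {"method": "POST", "service": "checkout"})
assert a == b                # label order no longer matters
partition = a % 10_000       # deterministic Kafka partition (Section 6.3)
```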
6.2 Metric Types
| Type | Description | Storage Behavior |
|---|---|---|
| Counter | Monotonically increasing value (resets on process restart) | Store raw value. rate() computed at query time. |
| Gauge | Arbitrary value that goes up and down | Store raw value. |
| Histogram | Observations bucketed into configurable ranges. Each bucket counts how many observations fell at or below that boundary. | Expands to multiple series: _bucket{le="X"} (one per bucket, where le = "less than or equal to", the upper boundary), plus _sum and _count. A histogram with 10 buckets = 12 time series per unique label combination. |
| Summary | Pre-computed quantiles on the client side | Expands to {quantile="X"}, _sum, _count. |
Why histograms explode into multiple series:
A counter metric (http_requests_total) is one number: "142,857 requests." One series. A histogram metric (http_request_duration_seconds) is a distribution: "How many requests took 0-10ms? How many took 10-50ms? 50-100ms?" Each bucket is a separate series. With 10 buckets plus a _sum and _count, one histogram becomes 12 series. Now multiply by every unique label combination and this is where cardinality explodes. A single histogram metric with 10 buckets, exposed by 1,000 pods, with 5 label dimensions (service, method, status, region, zone) each having moderate cardinality:
Series count = buckets × pods × label_combinations
= 12 × 1,000 × (5 services × 4 methods × 5 statuses × 3 regions × 2 zones)
= 12 × 1,000 × 600
= 7,200,000 series from ONE histogram metric
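The same arithmetic as a reusable helper (the label dimensions are the assumed ones from the example, not universal constants):

```python
from math import prod

def histogram_series_count(buckets: int, pods: int,
                           label_cardinalities: dict) -> int:
    # One histogram expands to buckets + _sum + _count series
    # per unique label combination, per pod.
    series_per_combo = buckets + 2
    return series_per_combo * pods * prod(label_cardinalities.values())

n = histogram_series_count(
    buckets=10,
    pods=1_000,
    label_cardinalities={"service": 5, "method": 4, "status": 5,
                         "region": 3, "zone": 2},
)
print(n)  # 7200000
```

Running this kind of calculation before shipping a new histogram is cheap; discovering the 7.2M-series bill in the TSDB afterwards is not.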
6.3 Kafka Topic Layout
Topic: metrics-ingestion
Producer: OTel Collectors (scrape + OTLP push)
Consumer: Flink (recording rules, rollups) + vminsert (raw metric writes)
Partitions: 10,000
Replication Factor: 3
Retention: 24 hours (replay buffer for recovery)
Partitioning key: metric fingerprint (FNV-1a hash)
Value format: Protobuf (TimeSeries message)
Topic: metrics-aggregated
Producer: Flink (processed recording rules and rollups)
Consumer: vminsert (aggregated metric writes)
Partitions: 5,000
Replication Factor: 3
Retention: 12 hours
Topic: metrics-alerts
Producer: Flink (alert events when burn rate thresholds breach)
Consumer: Alertmanager (routes and deduplicates alerts)
Partitions: 100
Replication Factor: 3
Retention: 7 days
6.4 Downsampled Parquet Schema (Warm/Cold Tier)
When raw metrics age past 15 days, they don't stay in vmstorage. Flink writes 5-minute rollups to S3 as Parquet files during ingestion (Section 8.6). A nightly batch job further aggregates those into 1-hour rollups. vmselect reads these files when a query spans beyond the hot tier, using partition pruning on tenant, resolution, and time range to skip irrelevant files. Each row stores five aggregate values (min, max, avg, sum, count) because different query functions need different aggregates: max() reads value_max, rate() uses value_sum / value_count, and so on.
-- Parquet file layout on S3
-- Path: s3://metrics-archive/{tenant_id}/{resolution}/{year}/{month}/{day}/{fingerprint_range}.parquet
-- Schema:
CREATE TABLE downsampled_metrics (
fingerprint UINT64, -- Metric identity
metric_name STRING,
labels MAP<STRING, STRING>,
timestamp_ms INT64, -- Aligned to resolution boundary
value_min DOUBLE,
value_max DOUBLE,
value_avg DOUBLE,
value_sum DOUBLE,
value_count INT64 -- Number of raw samples in this window
)
-- Hive-style partition columns (resolution is '5m' or '1h'),
-- encoded in the S3 path above rather than stored per row
PARTITIONED BY (tenant_id STRING, resolution STRING, year INT, month INT);
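A sketch of the rollup that produces those five aggregate columns (window alignment and field names are taken from the schema above; the real Flink job is windowed and incremental rather than batch over a list):

```python
from collections import defaultdict

def rollup(samples, resolution_ms=300_000):
    """Aggregate (fingerprint, timestamp_ms, value) samples into
    per-window min/max/avg/sum/count rows, aligned to 5-min boundaries."""
    windows = defaultdict(list)
    for fp, ts, value in samples:
        windows[(fp, ts - ts % resolution_ms)].append(value)
    return [
        {
            "fingerprint": fp,
            "timestamp_ms": window_start,
            "value_min": min(vals),
            "value_max": max(vals),
            "value_avg": sum(vals) / len(vals),
            "value_sum": sum(vals),
            "value_count": len(vals),
        }
        for (fp, window_start), vals in windows.items()
    ]
```

This is why the schema keeps all five aggregates: max() over downsampled data reads value_max, avg() reads value_sum / value_count, and neither can be reconstructed from the other.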
7. Back-of-the-Envelope Estimation
7.1 Ingestion Volume
Target: 500M metrics/sec (peak)
Source split (assumed):
Infrastructure metrics (CPU, memory, disk, network): 30% → 150M/sec
Application metrics (request rate, latency, errors): 40% → 200M/sec
Business metrics (orders, revenue, conversions): 10% → 50M/sec
Custom/user-defined metrics: 20% → 100M/sec
Daily volume:
500M/sec × 86,400 sec/day ← 86,400 = 24 × 60 × 60
= 43.2 trillion samples/day ← ~43T samples
At 40% average utilization (not peak all day):
200M/sec × 86,400 = 17.28T samples/day ← actual sustained rate
7.2 Storage Sizing
Per-sample size:
Raw (uncompressed): 16 bytes (8B timestamp + 8B float64 value)
Gorilla compressed: 1.37 bytes/sample ← 16 / 1.37 = 12x compression
With labels + index overhead: ~2 bytes/sample ← use this for all estimates below
─── Hot tier (raw, 15 days) ───
500M/sec × 2 bytes = 1 GB/sec ← raw ingestion rate, easy anchor
1 GB/sec × 86,400 = 86.4 TB/day ← one day of hot storage
86.4 TB × 15 days = 1,296 TB (~1.3 PB) ← 15 days at peak
At 40% average utilization:
1.3 PB × 0.4 = ~518 TB ← actual provisioned capacity
─── Warm tier (5-min aggregates in Parquet on S3, 90 days) ───
500M/sec ÷ 20 = 25M aggregated samples/sec ← 20 raw samples per 5-min window become 1
25M × 5 values = 125M values/sec ← each aggregate stores min, max, avg, sum, count
125M × 8 bytes = 1 GB/sec ← raw doubles, no Gorilla (not sequential anymore)
1 GB/sec ÷ 3 (ZSTD) = 333 MB/sec ← ZSTD compresses ~3x on float64 arrays
333 MB/sec × 86,400 = 28.8 TB/day ← one day of warm storage
28.8 TB × 90 days = 2,592 TB (~2.6 PB) ← 90 days of warm
─── Cold tier (1-hour aggregates in Parquet on S3, 1 year) ───
25M/sec ÷ 12 = ~2.08M aggregated samples/sec ← twelve 5-min windows per hour
2.08M × 5 × 8 bytes = 83 MB/sec ← same 5-value pattern as warm
83 MB/sec ÷ 3 (ZSTD) = ~28 MB/sec ← after compression
28 MB/sec × 86,400 = 2.4 TB/day ← one day of cold storage
2.4 TB × 365 days = 870 TB ← one year of cold
─── Total ───
Hot: 518 TB (vmstorage NVMe, 15 days at 40% avg)
Warm: 2,592 TB (S3 Parquet, ~2.6 PB, 90 days)
Cold: 870 TB (S3 Parquet, 1 year)
Total: ~4.0 PB
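The tier arithmetic above, reproduced as a script so the assumptions (2 bytes/sample hot, 20:1 then 12:1 rollup ratios, 5 float64 aggregates per window, ~3x ZSTD) can be varied:

```python
DAY = 86_400          # seconds per day
rate = 500e6          # peak samples/sec

# Hot: 2 bytes/sample (Gorilla + label/index overhead), 15 days, 40% avg util
hot_tb = rate * 2 * DAY * 15 * 0.4 / 1e12

# Warm: 5-min rollups (20:1), 5 float64 aggregates each, ZSTD ~3x, 90 days
warm_tb = (rate / 20) * 5 * 8 / 3 * DAY * 90 / 1e12

# Cold: 1-hour rollups (a further 12:1), same aggregate layout, 1 year
cold_tb = (rate / 20 / 12) * 5 * 8 / 3 * DAY * 365 / 1e12

print(f"hot={hot_tb:.0f} TB  warm={warm_tb:.0f} TB  cold={cold_tb:.0f} TB  "
      f"total={(hot_tb + warm_tb + cold_tb) / 1000:.1f} PB")
```

The interesting sensitivity: warm dominates total storage, so the 5-min retention window (90 days) is the knob that moves the bill most.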
7.3 Component Sizing
OTel Collectors:
Each collector: ~170K metrics/sec on c6g.xlarge (4 vCPU, 8GB)
500M / 170K = ~3,000 collectors ← Go-based, CPU-efficient for protobuf
Deployed as DaemonSet (1 per node) + dedicated fleet for high-volume targets
Kafka:
Per-broker sustained: ~250K msg/sec (RF=3, acks=all) ← LinkedIn runs ~20K/sec in prod;
Confluent shows 500K ideal, RF=3 is lower
500M / 250K = 2,000 brokers
Partitions: 10,000 ← enough for parallelism without overhead
Flink:
Each TaskManager: ~1M events/sec (4 slots × 250K each) ← Amazon Managed Flink: ~28K/sec per KPU;
dedicated r6g.2xlarge has much more headroom
500M / 1M = 500 TaskManagers ← 2,000 parallel tasks (4 slots each)
VictoriaMetrics (vmstorage):
Benchmark: 100M on 46 nodes = 2.17M/node
On i3.4xlarge (16 vCPU, 122GB, 3.8TB NVMe): ~5M/sec ← ~2.3x benchmark with 4x hardware
500M / 5M = 100 vmstorage nodes
× 2 (RF) = 200 vmstorage nodes
200 × 3.8 TB = 760 TB ← covers 518 TB hot + WAL/compaction headroom
vminsert (stateless):
500M / 12.5M per node = 40 vminsert nodes
vmselect (stateless):
Start with 20 nodes. Scale based on dashboard concurrency.
7.3.1 Right-Sizing for Smaller Scale
The numbers above target 500M metrics/sec. The architecture scales down linearly:
| Scale | OTel Collectors | Kafka Brokers | Flink TaskManagers | vmstorage | vminsert |
|---|---|---|---|---|---|
| 10M/sec | 60 | 40 | 10 | 4 | 2 |
| 50M/sec | 300 | 200 | 50 | 20 | 8 |
| 100M/sec | 600 | 400 | 100 | 40 | 16 |
| 500M/sec | 3,000 | 2,000 | 500 | 200 | 40 |
At 10-50M metrics/sec, Kafka and Flink can be dropped in favor of a direct OTel Collector → vminsert pipeline using Prometheus remote_write, with recording rules moving to the VictoriaMetrics vmalert component (query-based evaluation, similar to the Mimir ruler). The jump to Kafka + Flink is justified when direct remote_write cannot absorb burst traffic or when stream-based recording rules outperform query-based evaluation at the target scale.
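The linear scale-down can be expressed as a small sizing helper. The per-unit throughputs are the document's own Section 7.3 numbers; `size_fleet` is a hypothetical helper, and small deployments will diverge from pure linearity:

```python
import math

# Per-unit throughput assumptions from Section 7.3.
CAPACITY = {
    "otel_collectors": 170_000,     # metrics/sec per c6g.xlarge collector
    "kafka_brokers": 250_000,       # sustained msg/sec per broker at RF=3
    "flink_taskmanagers": 1_000_000,
    "vmstorage": 5_000_000,         # per i3.4xlarge node, before replication
    "vminsert": 12_500_000,
}

def size_fleet(metrics_per_sec: float, storage_rf: int = 2) -> dict:
    """Scale the reference architecture linearly to a target ingest rate."""
    sizes = {name: math.ceil(metrics_per_sec / cap) for name, cap in CAPACITY.items()}
    sizes["vmstorage"] *= storage_rf  # vmstorage nodes are replicated (RF=2)
    return sizes

full = size_fleet(500e6)  # kafka_brokers=2000, vmstorage=200, vminsert=40
```

The raw division gives ~2,942 collectors; the table rounds up to 3,000 for headroom.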
7.4 Query Layer Sizing
Dashboard query assumptions:
50 concurrent Grafana users
20 panels per dashboard
30-second refresh interval
Average query touches 50K series (with recording rules)
Query throughput:
50 × 20 / 30 = 33 queries/sec ← steady-state
Peak (all refresh at once): ~200 queries/sec
vmselect fan-out:
33 queries/sec × 200 nodes = 6,600 vmstorage req/sec ← every query hits every node
Peak: 200 × 200 = 40,000 req/sec
vmselect sizing:
Each node: ~100 concurrent queries
20 nodes = 2,000 concurrent queries ← 10x headroom over 200 peak
Memory: 32 GB per node for result merging + caching
Result cache:
Key: (query_hash, time_range, step)
10 GB per node (LRU eviction)
Hit rate: 60-80% ← dashboard auto-refresh is repetitive
Effective query reduction: 3-5x
7.5 Trace and Log Volume Estimation
Traces and logs add incremental infrastructure on the same shared collection and buffer layers.
Tracing:
500K services × 50 RPS = 25M requests/sec
25M × 8 spans/trace = 200M spans/sec
200M × 1 KB/span = 200 GB/sec ← raw trace volume
200 GB/sec × 86,400 = 17.3 PB/day ← not viable to store 100%
× 0.005 (0.5% sampling) = 1 GB/sec = 86 TB/day ← after tail-based sampling
Tiered storage (180 days): ~15.5 PB total (Section 9.7)
Tempo ingesters: 20, Compactors: 5, +4 Kafka brokers
Logging:
500K services × 100 lines/sec = 50M lines/sec
50M × 500 bytes = 25 GB/sec ← raw log volume
25 GB/sec × 86,400 = 2.16 PB/day
÷ 8 (VictoriaLogs compression) = ~270 TB/day stored ← columnar + LZ4/ZSTD
Tiered storage: ~9.5 PB across active tiers
vlinsert: 17, vlstorage: ~140 (hot + warm), vlselect: 10, +20 Kafka brokers
The same OTel Collectors, Kafka cluster, and Grafana deployment serve all three signals. Only the storage layer diverges.
7.6 Network Bandwidth
Network traffic adds up fast.
Ingestion path (metrics only):
OTel Collectors → Kafka: 1 GB/sec (compressed protobuf)
Kafka replication (RF=3): × 3 = 3 GB/sec cross-broker
Kafka → vminsert: 1 GB/sec
Kafka → Flink: 1 GB/sec
vminsert → vmstorage: 1 GB/sec (distributed across 200 nodes)
Total metrics bandwidth: ~7 GB/sec aggregate
Trace + log paths:
Traces: 1 GB/sec (post-sampling) + 3 GB/sec Kafka replication
Logs: 25 GB/sec raw + 75 GB/sec Kafka replication
Total trace + log bandwidth: ~104 GB/sec aggregate
Query path (bursty):
vmselect → vmstorage fan-out: variable, 1-5 GB/sec during peak dashboard load
vmselect → S3 (warm queries): 50-200 MB/sec (mostly partition-pruned)
Cross-AZ traffic: Kafka brokers, vmstorage, and vlstorage should be spread across 3 availability zones for durability. Cross-AZ data transfer costs ~$0.01/GB on AWS. At 7+ GB/sec for metrics alone, that's ~$6K/day in cross-AZ transfer. Minimize this by co-locating Kafka partition leaders with their primary consumers (vminsert, Flink) in the same AZ where possible, and using AZ-aware rack assignment in Kafka's broker.rack configuration.
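The cross-AZ cost estimate is simple rate arithmetic; a quick check under the quoted ~$0.01/GB assumption:

```python
def cross_az_cost_per_day(gb_per_sec: float, price_per_gb: float = 0.01) -> float:
    """Daily cross-AZ transfer cost in dollars at the quoted ~$0.01/GB rate."""
    return gb_per_sec * 86_400 * price_per_gb

metrics_daily = cross_az_cost_per_day(7)  # the ~7 GB/sec metrics aggregate
print(f"${metrics_daily:,.0f}/day")       # $6,048/day -> "~$6K/day"
```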
Log pipeline dominates bandwidth. At 25 GB/sec raw log volume, the logging pipeline accounts for over 70% of total network traffic. Compressing logs at the OTel Collector (gzip or ZSTD before sending to Kafka) and using Kafka's end-to-end compression (producer compresses, broker stores compressed, consumer decompresses) reduces wire traffic by 3-5x.
8. Deep Dives: Metrics Pipeline
8.1 Ingestion Pipeline
The full ingestion flow:
Key design decisions in the ingestion path:
Fingerprint-based partitioning. All samples for a given time series land on the same Kafka partition (see Section 6.1 for why this matters). Flink gets partition-local state, vmstorage gets sequential writes optimal for Gorilla compression.
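A minimal sketch of fingerprint-based partitioning (the hash is illustrative; VictoriaMetrics and Kafka each use their own hash functions in practice):

```python
import hashlib

def series_fingerprint(name: str, labels: dict) -> int:
    """Stable 64-bit fingerprint over sorted label pairs (illustrative hash)."""
    canonical = name + "".join(f",{k}={labels[k]}" for k in sorted(labels))
    return int.from_bytes(hashlib.sha256(canonical.encode()).digest()[:8], "big")

def kafka_partition(name: str, labels: dict, num_partitions: int = 10_000) -> int:
    """Same series -> same partition, so Flink state stays partition-local and
    vmstorage sees the sequential per-series writes Gorilla wants."""
    return series_fingerprint(name, labels) % num_partitions

p1 = kafka_partition("http_requests_total", {"service": "checkout", "env": "prod"})
p2 = kafka_partition("http_requests_total", {"env": "prod", "service": "checkout"})
assert p1 == p2  # label order never changes the partition
```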
Two Kafka topics. Raw metrics go to metrics-ingestion, Flink outputs go to metrics-aggregated. That way, vmstorage only consumes pre-processed data. If Flink needs to be restarted or redeployed, raw metrics buffer safely in Kafka (24-hour retention) while Flink catches up.
Why not write directly to VictoriaMetrics? A direct OTel → vminsert pipeline is simpler, lower latency, and works well up to ~50M metrics/sec. Beyond that scale, three problems emerge: (1) burst absorption: a deploy that doubles metric volume for 5 minutes overwhelms vminsert with no buffer; (2) replay: if vmstorage goes down and recovers, there's no way to replay the lost window without Kafka retention; (3) fan-out: recording rules, downsampling, and metrics-to-logs correlation all need the same raw stream, and tapping vmstorage's write path is fragile. Kafka solves all three. Cost: operational complexity and ~2 seconds of additional end-to-end latency. Below 50M/sec, skip Kafka and write directly.
Backpressure handling. If vmstorage falls behind, Kafka consumer lag increases. OTel Collectors detect backpressure via Kafka producer latency and can temporarily batch more aggressively (larger batches, less frequent sends). The 24-hour Kafka retention gives enough buffer for most recovery scenarios.
Tenant validation at the edge. OTel Collectors inject the tenant_id header and validate it against a local cache of known tenants. Invalid tenants get rejected before hitting Kafka, which prevents unknown tenants from consuming pipeline resources.
Scrape endpoint security. The /metrics endpoint on tenant pods is not open to arbitrary callers. OTel Collectors and application endpoints authenticate each other via mTLS, with certificates issued by cert-manager in Kubernetes. A collector without a valid certificate cannot scrape, and a pod without a valid certificate is rejected as a scrape target. For push-model ingestion, the platform's OTLP endpoint requires a per-tenant API key in the Authorization header, validated at the OTel Collector gateway before data reaches Kafka.
8.2 Gorilla Compression
VictoriaMetrics uses Gorilla compression from Facebook's 2015 paper, the foundation of time-series compression in every modern TSDB (VictoriaMetrics, M3DB, Prometheus TSDB, InfluxDB all use variants). The algorithm exploits the regularity of time-series data: timestamps arrive at fixed intervals (delta-of-delta compresses to near zero), and consecutive values are often identical or structurally similar (XOR encoding). Net result: 1.37 bytes per sample vs 16 bytes raw, a 12x compression ratio. Without Gorilla, 500M samples/sec at 16 bytes would require ~691 TB/day; with it, ~59 TB/day. In production, chunk headers and label overhead push the effective rate to 1.5-2 bytes per sample (the capacity estimates in Section 7.2 use 2 bytes/sample).
For the full bit-level walkthrough (IEEE 754 XOR encoding, end-to-end 8-sample compression, late scrape impact, and chunk structure), see the Gorilla Compression deep dive.
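The XOR half of the algorithm can be illustrated in a few lines. This is a simplified sketch of the value-compression step only; the real encoder also does delta-of-delta timestamps and bit-level packing:

```python
import struct

def float_bits(x: float) -> int:
    """Raw IEEE 754 bits of a float64."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def xor_stream(values):
    """XOR each sample's bits against the previous sample's (Gorilla's value step).
    A repeated value XORs to 0 (stored as a single bit); a similar value shares
    sign/exponent bits, leaving only a short run of meaningful bits to store."""
    prev = float_bits(values[0])
    out = []
    for v in values[1:]:
        bits = float_bits(v)
        out.append(prev ^ bits)
        prev = bits
    return out

xors = xor_stream([12.0, 12.0, 12.0, 24.0, 12.0])
print([hex(x) for x in xors])  # first two are 0x0: repeated values are nearly free
```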
8.3 High Cardinality: The Silent Killer
Cardinality explosion is the single biggest operational risk in a metrics monitoring system, bigger than disk failures, network partitions, or Kafka lag. At 500M metrics/sec, the vast majority of data points come from infrastructure exporters and framework auto-instrumentation, not custom application code. But it's the custom labels on application metrics that cause cardinality explosions.
What actually happens:
An engineer adds a user_id label to a request latency histogram. It seems reasonable: they want to track per-user latency for debugging.
http_request_duration_seconds_bucket{
service="checkout",
method="POST",
status="200",
user_id="u_abc123", ← This label
le="0.1"
}
The math:
Before user_id label:
1 metric × 12 series (10 buckets + sum + count) × 5 services × 4 methods × 5 statuses = 1,200 series
Memory: ~8 KB per series × 1,200 = 9.6 MB (fine)
After user_id label (10M users):
1 metric × 12 series (10 buckets + sum + count) × 10M users = 120,000,000 series
Memory: ~8 KB per series × 120M = 960 GB
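The before/after arithmetic can be reproduced directly, using the section's ~8 KB-per-series assumption (taken here as 8,000 bytes, matching the 960 GB figure):

```python
def histogram_series(buckets: int, label_cardinalities) -> int:
    """Series per histogram = (buckets + _sum + _count) x label-value product."""
    total = buckets + 2
    for c in label_cardinalities:
        total *= c
    return total

before = histogram_series(10, [5, 4, 5])    # 5 services x 4 methods x 5 statuses
after = histogram_series(10, [10_000_000])  # user_id label alone

print(before, after)        # 1200 120000000
print(after * 8_000 / 1e9)  # 960.0 GB at ~8 KB (8,000 B) per series
```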
Timeline:
00:00 Deploy goes out with user_id label
00:15 First scrape. 120M new series created.
00:16 vmstorage memory jumps from 10 MB to 960 GB
00:17 OOM. vmstorage killed.
00:17 All dashboards go blank. All alerts stop evaluating.
00:18 The monitoring system is dead. Nobody gets paged about it being dead.
Detection:
Monitor the monitoring system's own cardinality metrics:
# Alert: cardinality growing too fast
rate(vm_new_timeseries_created_total[1h]) > 100000
# Alert: approaching tenant series limit
vm_active_timeseries{tenant_id="acme"} / vm_series_limit{tenant_id="acme"} > 0.8
# Dashboard: top 10 metrics by series count
topk(10, count by (__name__) ({__name__=~".+"}))
Prevention (layered):
- Ingestion-time cardinality limits. vminsert checks the active series count per tenant and rejects new series once the tenant exceeds its budget, returning HTTP 429 with an X-Series-Limit header.
- Label allowlists. The OTel Collector's metric_relabel_configs can drop labels before they reach Kafka. Block known high-cardinality labels (user_id, request_id, trace_id, pod_id) from metric labels.
# OTel Collector relabel config
# labeldrop matches label *names* against regex; it ignores source_labels
metric_relabel_configs:
  - regex: 'user_id|request_id'
    action: labeldrop
- Cardinality analysis at deploy time. A CI/CD check scans metric instrumentation for unbounded label values. If a label's value set is unbounded (user IDs, UUIDs), the pipeline blocks the deploy.
- The "Metrics without Limits" pattern (Datadog-inspired). Decouple ingestion from indexing. Ingest all raw data into Kafka (cheap, append-only), but only index the label combinations that are actually queried. If user_id is never used in a dashboard or alert rule, it's stored but not indexed, and the inverted index stays small. Storage cost increases, but memory and query performance are protected.
8.3.1 Schema Evolution: Label Changes
Metric schemas change. A service renames status to http_status, adds a new region label, or drops an unused version label. Each of these creates a new set of time series because a different label set produces a different fingerprint, which means a different series.
What breaks:
- Dashboards referencing the old label name return empty results
- Alert rules using the old label stop evaluating correctly
- The old series goes stale (5-minute staleness marker), but historical data remains under the old name
- Recording rules that aggregate by the old label silently stop including the renamed service
Migration strategy:
- Dual-emit during transition. The application emits metrics with both old and new label names for one retention period (15 days). Dashboards and alert rules are updated to use the new name. After the transition window, the old label is dropped.
- Relabeling at the collector. If the application can't dual-emit, the OTel Collector's metric_relabel_configs can copy the old label to the new name during the transition:
metric_relabel_configs:
- source_labels: [status]
target_label: http_status
- Dashboard migration tooling. Grafana's provisioning API can batch-update dashboard JSON to replace label references. Script this as part of the label rename runbook.
There is no way to retroactively rename labels in historical data stored in vmstorage or S3 Parquet files. Plan label names carefully. Breaking changes should be coordinated across the team with a defined cutover window.
8.4 Inverted Index at Scale
The inverted index is how the query engine turns a label selector like {service="checkout", env="prod"} into a list of matching time series IDs.
How it works:
VictoriaMetrics mergeset structure:
VictoriaMetrics uses mergeset for its inverted index, an LSM-tree variant optimized for sorted key lookups. New label-to-TSID mappings buffer in memory, flush to immutable sorted parts on disk, and background merges keep the part count small. Queries look up each label selector across parts, then intersect posting lists to find matching series.
High-cardinality labels degrade index performance:
The inverted index for service=checkout might have 1,000 TSIDs in its posting list. Fast to intersect. But user_id=u_abc123 has exactly 1 TSID, and there are 10 million such entries. The index now stores 10 million separate posting lists. Each label lookup scans a massive key range. Intersection becomes trivial (single entry), but the index scan to find it is expensive.
Worse: the mergeset's key space grows linearly with cardinality. More keys means more disk I/O during compaction, more memory for bloom filters, and longer startup times when rebuilding the index.
Bloom filters for segment skipping:
VictoriaMetrics uses bloom filters on each index part to quickly determine if a label value might exist in that part. If the bloom filter says "definitely not here," the part is skipped entirely. For low-cardinality labels (service, env, region), bloom filters skip 95%+ of index parts on first check. For high-cardinality labels, the bloom filter's false positive rate increases and the benefit diminishes.
When a query has multiple label selectors, the engine intersects posting lists using size-adaptive algorithms (galloping search for skewed lists, linear merge for similar-sized lists).
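A simplified stand-in for the skewed-list case: binary-search each entry of the smaller posting list within the larger one. This is illustrative; the real engine chooses between this and a linear merge based on relative list sizes:

```python
from bisect import bisect_left

def intersect_small_large(small, large):
    """Intersect two sorted TSID posting lists by binary-searching each entry
    of the smaller list in the larger one: O(|small| log |large|), which wins
    when the lists are heavily skewed (e.g. a rare label vs env="prod")."""
    out, lo = [], 0
    for tsid in small:
        lo = bisect_left(large, tsid, lo)  # resume from the last match position
        if lo < len(large) and large[lo] == tsid:
            out.append(tsid)
    return out

service_checkout = [3, 7, 19, 42, 77]  # posting list for service="checkout"
env_prod = list(range(0, 100, 7))      # larger posting list for env="prod"
matches = intersect_small_large(service_checkout, env_prod)
print(matches)  # [7, 42, 77]
```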
8.5 Query Engine and Execution
MetricsQL query execution flow:
Query splitting for large time ranges:
When a dashboard requests a 30-day range, the query spans both hot (vmstorage, raw data for last 15 days) and warm (S3, 5-min aggregates for days 16-30). vmselect splits the query:
- Days 1-15: Fan out to vmstorage nodes. Raw data, full resolution.
- Days 16-30: Read from S3 Parquet files. 5-min aggregated data. Partition pruning by tenant_id + time range skips irrelevant files.
- Merge: vmselect stitches the two result sets together, aligning timestamps at the step boundary.
Dashboard experience across tiers:
Querying beyond the hot tier is slower and lower resolution.
| Time Range | Storage Tier | Resolution | Typical Latency | What You Lose |
|---|---|---|---|---|
| Last 15 days | Hot (vmstorage) | 15-sec raw | p99 < 200ms | Nothing, full fidelity |
| 15-90 days | Warm (S3 Parquet) | 5-min aggregates | p99 < 500ms | Short spikes (<5 min) disappear. Dashboard lines are smoother. A 3-minute error burst that was visible in hot tier becomes a single averaged data point in warm. |
| 90+ days | Cold (S3 Parquet) | 1-hour aggregates | 500ms-2s | Only useful for trend analysis ("did error rate increase this quarter?"), not incident investigation. |
Over 95% of dashboard queries are last-15-minutes or last-1-hour (real-time operations). Long-range queries (30-day, 90-day) are infrequent capacity-planning views where 5-min resolution and 200-500ms extra latency are fine. Recording rules help here too. Pre-computed aggregates like service:http_requests:rate5m exist at hot tier resolution regardless of time range, so dashboards built on recording rules never hit S3 for common panels.
Result caching:
vmselect caches query results in its own internal LRU cache, keyed by (query_hash, time_range, step). For dashboard auto-refresh (every 30s), the cache hit rate is high because only the latest data point changes. The cache uses a time-bounded eviction policy: cached results for time ranges that end in the past are valid indefinitely. Results for ranges that include "now" expire after the scrape interval (15s).
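That eviction policy can be sketched in a few lines (illustrative only, not vmselect's actual implementation):

```python
from collections import OrderedDict
import time

class ResultCache:
    """LRU result cache sketch keyed by (query_hash, start, end, step).
    Ranges that end in the past never expire; ranges that include "now"
    expire after one scrape interval."""

    def __init__(self, max_entries: int = 1024, scrape_interval: float = 15.0):
        self._store = OrderedDict()
        self.max_entries = max_entries
        self.scrape_interval = scrape_interval

    def put(self, query_hash, start, end, step, value):
        key = (query_hash, start, end, step)
        touches_now = end >= time.time() - self.scrape_interval
        self._store[key] = (value, time.time(), touches_now)
        self._store.move_to_end(key)
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # evict least recently used

    def get(self, query_hash, start, end, step):
        key = (query_hash, start, end, step)
        entry = self._store.get(key)
        if entry is None:
            return None
        value, cached_at, touches_now = entry
        if touches_now and time.time() - cached_at > self.scrape_interval:
            del self._store[key]
            return None  # stale: range includes "now" and is older than one scrape
        self._store.move_to_end(key)  # LRU: mark as recently used
        return value
```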
Tempo uses a separate approach: an external Memcached cluster caches trace query results (Section 11.9 covers this in detail). This keeps repeated trace ID lookups from hitting S3 on every request, which matters because Tempo's query path is entirely S3-backed with no local query cache.
Query priority scheduling. Not all queries are equal. Alert rule evaluation from vmalert is the highest priority. A delayed alert is a missed incident. Dashboard auto-refresh is medium priority. Ad-hoc exploration queries from engineers are lowest priority. vmselect enforces this via per-tenant query concurrency slots: alert rules get dedicated slots that are never preempted, dashboard queries share a pool, and ad-hoc queries use whatever capacity remains.
Recording rules to keep dashboards fast:
Without recording rules, a dashboard panel querying across 500 services scans millions of raw series on every refresh. Recording rules pre-compute these aggregations and store the result as a few hundred series instead of millions. Section 8.8 covers recording rules in detail, including when to use them and their cost.
8.6 Downsampling and Retention
Three storage tiers with automatic lifecycle management:
Stream aggregation (Flink) vs batch rollup:
Two approaches to downsampling, each with tradeoffs:
| Approach | How | Pros | Cons |
|---|---|---|---|
| Stream aggregation | Flink aggregates on the ingest path. Every 5 minutes, emit min/max/avg/sum/count for each series. Write aggregates to warm tier. | No read penalty. Aggregates available immediately. No backfill needed. | Uses Flink state (memory + checkpoint). Must handle late-arriving data. |
| Batch rollup | Background job reads raw data from vmstorage, computes aggregates, writes to S3. | Simple. No ingest-path overhead. Can recompute if logic changes. | Reads data twice (once to store, once to rollup). Creates I/O contention on vmstorage. Rollup delay (hours). |
This design uses stream aggregation for hot-to-warm (Flink computes 5-min rollups on ingest) and batch rollup for warm-to-cold (nightly job reads 5-min Parquet files, re-aggregates to 1-hour, writes new Parquet files).
Each tier reduces volume dramatically: hot-to-warm cuts 20x (raw to 5-min aggregates), warm-to-cold cuts another 12x (5-min to 1-hour). Total storage across all tiers: ~4.0 PB. See Section 7.2 for the full derivation.
Handling late-arriving data:
Flink's 5-min aggregation window uses event-time processing (the sample's own timestamp, not arrival time) with a 2-minute watermark. The watermark signals "no events older than this will arrive," letting Flink close windows. Data arriving up to 2 minutes late lands in the correct window. Data arriving later than 2 minutes is either:
- Dropped (if it falls within the hot tier window and raw data is still available)
- Counted toward the next window (if it falls in the warm tier)
For metrics, late arrivals are rare. Scrape intervals are clock-driven. Delays beyond 2 minutes indicate network failures.
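A toy model of the window-closing rule above (a sketch of the event-time semantics with a single global watermark, not Flink's actual API):

```python
WINDOW = 300    # 5-minute aggregation windows, in seconds
LATENESS = 120  # 2-minute watermark allowance

def assign_window(event_ts: int, watermark: int):
    """Return the window start for an event, or None if the watermark has
    already closed that window."""
    window_start = event_ts - event_ts % WINDOW
    if watermark > window_start + WINDOW + LATENESS:
        return None  # window already emitted; sample is too late
    return window_start

# A sample timestamped at t=270 falls in the [0, 300) window:
assert assign_window(270, watermark=360) == 0     # 90s behind: still accepted
assert assign_window(270, watermark=600) is None  # beyond 2 min: dropped
```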
8.7 Alert Evaluation at Scale
Thousands of alert rules across billions of time series, evaluated every 15 seconds. A single evaluator can't keep up, so the rules are sharded.
Rule format: vmalert reads rules from YAML files on disk, the same format Prometheus uses. Each file contains one or more rule groups. Each group has its own evaluation interval. Two types: alerting rules fire notifications when a condition holds (e.g., error rate > 5% for 2 minutes), and recording rules pre-compute expensive queries and save the result as a new metric so dashboards stay fast.
Alerting rule example:
# /etc/vmalert/rules/shard-0/checkout-alerts.yml
groups:
- name: checkout-alerts
interval: 15s
rules:
- alert: HighErrorRate
expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
for: 2m
labels:
severity: critical
tenant_id: acme
annotations:
description: "Error rate above 5% for checkout service"
runbook: "https://wiki.internal/runbooks/checkout-errors"
- alert: SLOBurnRateFast
expr: slo:error_budget:burn_rate_5m > 14.4
for: 1m
labels:
severity: critical
annotations:
description: "Fast burn: error budget exhausts in ~2 days at the current rate"
Recording rule example:
# /etc/vmalert/rules/shard-0/checkout-recording.yml
groups:
- name: checkout-recording
interval: 30s
rules:
- record: service:http_requests:rate5m
expr: sum(rate(http_requests_total[5m])) by (service)
- record: slo:error_budget:burn_rate_5m
expr: |
1 - (
sum(rate(http_requests_total{status!~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
) / (1 - 0.999)
Rule management at scale: The platform's /api/v1/rules endpoint provides CRUD operations for rules. Behind the scenes, it stores rule metadata in a database, generates these YAML files, and writes them to the correct vmalert shard's directory. A hot reload via vmalert's /-/reload endpoint picks up changes without restart.
Rule sharding strategy:
Each alert rule is assigned to exactly one vmalert shard. The rule management API uses consistent hashing on the rule name to decide which shard directory (/etc/vmalert/rules/shard-{N}/). When vmalert pods scale up or down, the API reassigns rules and triggers reloads on affected instances.
Each evaluator shard, every 15 seconds:
- Loads its assigned rules from the rule store
- Executes each rule's MetricsQL expression against vmselect
- Compares results against thresholds
- For rules that have been firing for longer than their for duration, sends an alert to Alertmanager
- Updates local state (firing duration, last evaluation time)
SLO-based burn rate alerting:
Traditional threshold alerts ("error rate > 5%") are noisy. SLO-based alerting asks a better question: "At the current error rate, will the error budget run out before the SLO period ends?"
How burn rates work: A 99.9% SLO gives you ~43 minutes of error budget over a 30-day window.
A burn rate measures how fast that budget gets consumed:
- Burn rate 1x: Budget consumed exactly over 30 days. No action needed.
- Burn rate 14.4x: Budget consumed 14.4x faster. Runs out in ~2 days. Page immediately.
- Burn rate 6x: Runs out in ~5 days. Urgent, but hours remain, not minutes.
The formula: burn_rate = current_error_rate / allowed_error_rate. With a 0.1% error allowance and a current rate of 1.44%, the burn rate is 1.44 / 0.1 = 14.4x.
Multi-window burn rates:
Fast burn (last 5 min): error_rate / (1 - 0.999) > 14.4 → page immediately
Medium burn (last 30 min): error_rate / (1 - 0.999) > 6 → page (slower burn)
Slow burn (last 6 hours): error_rate / (1 - 0.999) > 1 → ticket (not a page)
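The burn-rate arithmetic in code, assuming the 99.9% SLO and 30-day window used above:

```python
SLO = 0.999        # 99.9% over a 30-day window (~43 min of error budget)
WINDOW_DAYS = 30

def burn_rate(error_rate: float) -> float:
    """Multiples of the 'exactly on budget' consumption rate."""
    return error_rate / (1 - SLO)

def days_to_exhaustion(error_rate: float) -> float:
    """How long the error budget lasts at the current burn rate."""
    return WINDOW_DAYS / burn_rate(error_rate)

print(round(burn_rate(0.0144), 1))           # 14.4x
print(round(days_to_exhaustion(0.0144), 1))  # ~2.1 days -> page immediately
```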
Alertmanager deduplication and grouping:
When 500 pods fire the same "HighErrorRate" alert, that's 500 PagerDuty notifications without grouping. Alertmanager collapses them:
route:
group_by: ['alertname', 'service', 'env']
group_wait: 30s # Wait 30s to batch alerts in the same group
group_interval: 5m # Wait 5m between re-notifications
repeat_interval: 4h # Re-notify every 4h if still firing
receiver: 'pagerduty-critical'
routes:
- match:
severity: warning
receiver: 'slack-warnings'
- match:
severity: info
receiver: 'slack-info'
500 pod-level alerts get grouped into 1 notification: "HighErrorRate firing for service=checkout, env=prod (500 instances)."
Alert storm protection:
If a widespread failure causes 100,000 alerts to fire in 1 minute (datacenter-level event), even grouped alerts can overwhelm on-call engineers. Protections:
- Inhibition rules. If a datacenter-level alert is firing, suppress all service-level alerts in that datacenter.
- Rate limiting. Alertmanager limits notification sends to 100/minute per receiver. Excess alerts are queued.
- Escalation. If more than 50 unique alert groups fire within 5 minutes, trigger a "mass incident" page to incident command, not individual service owners.
8.8 Recording Rules and Pre-Computation
Recording rules are pre-computed queries that run on the Flink aggregation layer and write results back as new time series. Two reasons to use them:
- Performance. A dashboard querying
sum(rate(http_requests_total[5m])) by (service)across 10M raw series takes seconds. The same query against 500 pre-computed series takes milliseconds. - Consistency. Alert rules and dashboard panels that need the same aggregation use the same recording rule. No divergence from slightly different query syntax.
Where recording rules execute:
Flink receives raw metrics from Kafka. For each recording rule, it maintains a windowed aggregation in state. Every evaluation interval (30s default), it emits the result as a new time series with the recording rule's name.
When to use recording rules vs raw queries:
| Scenario | Use Recording Rule | Use Raw Query |
|---|---|---|
| Dashboard panel queried by multiple users | Yes | No |
| Alert rule evaluated every 15s | Yes | No |
| One-off debugging query | No | Yes |
| Query touches >100K series | Yes | No |
| Query touches <1K series | No | Yes |
| SLO burn rate calculation | Yes (always) | No |
Cost of recording rules: Each rule produces new time series (roughly 2 MB/day for a 500-series rule, negligible). But rules with high-cardinality by clauses (e.g., by (service, endpoint, region, zone)) can produce millions of new series. Estimate rule cardinality before deployment.
Backfill when recording rules change:
Recording rules computed by Flink are forward-only. When you modify a recording rule expression (changing the aggregation logic, adding a new by dimension), the new output starts from the moment the updated Flink job deploys. Historical data under the old expression stays as-is.
If historical backfill is needed (e.g., a new SLO burn rate calculation that should show 30 days of history on day one), run a one-off batch job that reads raw metrics from the S3 warm tier, applies the new recording rule expression, and writes the backfilled series to vmstorage or S3. The Flink job template can be reused with a bounded source (S3 Parquet reader instead of Kafka consumer) for this.
In practice, most recording rule changes don't need backfill. Dashboards showing the new metric simply have no data before the change date, and that's fine. Reserve backfill for SLO calculations where a full window of historical data is required for accurate burn rate alerting from the start.
8.9 Multi-Tenancy and Isolation
At 500M metrics/sec from hundreds of organizations, one tenant's cardinality explosion must never take down another tenant's dashboards.
Tenant identification:
Every request (remote write, query, rule evaluation) carries an X-Scope-OrgID header. This tenant_id propagates through the entire pipeline:
OTel Collector → Kafka (message header) → Flink (state key prefix)
→ vminsert (routing) → vmstorage (directory isolation) → vmselect (query filter)
Per-tenant limits:
| Limit | Default | Purpose |
|---|---|---|
| Max active series | 50M | Prevent cardinality explosion |
| Max ingestion rate | 5M samples/sec | Prevent single tenant from saturating Kafka |
| Max query series | 500K | Prevent dashboard query from scanning too many series |
| Max query duration | 60s | Kill runaway queries |
| Max alert rules | 10,000 | Prevent rule evaluator overload |
| Max label name length | 128 chars | Index size control |
| Max label value length | 1,024 chars | Index size control |
| Max labels per series | 30 | Cardinality control |
Limits are enforced at the earliest possible point:
- Ingestion rate: OTel Collector checks before sending to Kafka
- Series limit: vminsert checks before routing to vmstorage. When a tenant exceeds their series limit, the remote write endpoint returns 429 Too Many Requests with X-Series-Limit and X-Series-Current headers so clients can track cardinality usage against their budget.
- Query limits: vmselect checks before fanning out to vmstorage
Shuffle-sharding for noisy neighbor protection: assign each tenant to a random subset of nodes rather than all of them, so one tenant's failure blast radius is limited to its subset, not the entire cluster.
Each tenant is assigned a subset of vmstorage nodes (4 out of 12 in this example). If Tenant A causes a node to OOM, only nodes 1, 4, 7, 10 are affected. Tenants B and C are on different nodes and see no impact.
The shard assignment uses consistent hashing with the tenant_id as input. When nodes are added or removed, only a fraction of tenant assignments change.
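A sketch of the subset assignment using rendezvous-style hashing, which has the stability property described above (illustrative; systems like Cortex/Mimir implement their own shuffle-shard algorithm):

```python
import hashlib

def tenant_shard(tenant_id: str, nodes, shard_size: int = 4):
    """Pick a stable subset of vmstorage nodes for a tenant by ranking every
    node on hash(tenant_id, node) and keeping the top shard_size. Adding or
    removing one node changes only a fraction of assignments."""
    def score(node: str) -> int:
        digest = hashlib.sha256(f"{tenant_id}/{node}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return sorted(sorted(nodes, key=score)[:shard_size])

nodes = [f"vmstorage-{i}" for i in range(12)]
shard_a = tenant_shard("tenant-a", nodes)
shard_b = tenant_shard("tenant-b", nodes)
print(shard_a)  # 4 of 12 nodes, deterministic per tenant
```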
Cost attribution:
Per-tenant billing based on:
- Active series count (sampled hourly, billed at max)
- Ingestion rate (samples/sec, billed at p95 over the billing period)
- Query volume (number of series touched per query, summed monthly)
- Storage consumed (bytes across all tiers)
8.10 Service Discovery and Scrape
The collection layer must automatically discover new services as they deploy and stop scraping services that disappear. Manual target configuration doesn't work at 500K+ services.
Scope: pull model only. Service discovery applies to same-cluster tenants (see Section 5.1). External tenants push directly and don't need discovery.
Scrape interval tuning: 15-second scrape interval is the default. It aligns with alert evaluation intervals (also 15s), so each evaluation sees fresh data. Going faster rarely improves alerting because the for duration on most alerts is 1-5 minutes anyway. Use 30-60s only for low-priority infrastructure metrics.
Target churn handling:
Kubernetes pods churn constantly, and each new pod creates new series (the pod label changes). When a target disappears, the OTel Collector sends a stale marker (NaN with a staleness bit). VictoriaMetrics stops considering the series "active" after 5 minutes, so stale series don't count against cardinality limits or appear in queries.
Relabeling for metric filtering:
Not every metric from every service is worth storing. The OTel Collector's relabeling pipeline can drop, rename, or filter metrics before they reach Kafka:
metric_relabel_configs:
# Drop Go runtime metrics (noisy, rarely useful)
- source_labels: [__name__]
regex: 'go_(gc|memstats|threads)_.*'
action: drop
# Drop high-cardinality labels
- regex: 'request_id|trace_id|user_id'
action: labeldrop
# Rename for consistency
- source_labels: [__name__]
regex: 'node_cpu_seconds_total'
target_label: __name__
replacement: 'cpu_seconds_total'
Fleet management at scale:
3,000 OTel Collectors need coordinated configuration updates: scrape targets, relabel rules, sampling policies, and receiver settings. Two approaches:
- GitOps pipeline. Collector configuration lives in a Git repository. Changes go through PR review. A CI/CD pipeline renders per-node configs (using templates for node-specific settings like node_name and local pod discovery) and deploys via Kubernetes rolling update of the DaemonSet. Safe, auditable, but slow (minutes for a full rollout).
- OpAMP (Open Agent Management Protocol). The CNCF standard for remote agent management. A central OpAMP server pushes configuration updates to collectors in real time. Collectors report health, effective configuration, and capabilities back to the server. Useful for emergency changes (e.g., adding a labeldrop rule during a cardinality explosion). Faster than GitOps (seconds, not minutes), but requires running and securing the OpAMP server infrastructure.
Most teams use GitOps for planned changes and keep OpAMP as a fast-path override for incidents.
9. Distributed Tracing at Scale
Traces are expensive. From the Section 7 capacity estimate: 200M spans/sec at ~1 KB each produces 17.3 PB/day. At ~$23/TB/month on S3, keeping even a single day of raw traces costs roughly $400K/month. Not viable.
Tail-based sampling (Section 9.2) keeps errors, slow requests, and a 0.1% random baseline, dropping the rest to 86 TB/day. First, the trace span format.
9.1 Trace Span Format
Each trace span captures a single operation within a distributed request. A complete trace is a directed acyclic graph of spans connected by parent-child relationships.
```json
{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "e457b5a2e4d86bd1",
  "operationName": "POST /checkout",
  "serviceName": "checkout-service",
  "startTime": 1709827200000,
  "duration": 145,
  "attributes": {
    "http.method": "POST",
    "http.status_code": 200,
    "db.system": "postgresql"
  },
  "events": [{"name": "retry", "timestamp": 1709827200045}]
}
```
500-2000 bytes per span. Variable structure. Rich attributes. A single request generates one metric increment but 5-20 spans across service boundaries. This 100-1000x size difference per event is exactly why traces need a different storage engine than metrics.
9.2 Tech Stack Selection
| Layer | Technology | Why This Choice |
|---|---|---|
| Instrumentation | OpenTelemetry SDK | Same SDKs already deployed for metrics. Produce trace spans with zero new dependencies. CNCF standard. |
| Collection | OTel Collector (same fleet) | Add otlp trace receiver to existing DaemonSet configuration. Same fleet, no new infrastructure for collection. |
| Sampling | OTel Collector tail-based sampling | Buffer spans 60s, keep errors + p99 + 0.1% baseline. Cuts 99.5%. Head-based misses rare errors. |
| Buffer | Kafka (topic: traces-raw) | Same Kafka cluster, separate topic. Isolates trace volume from metric ingestion. Fingerprint-partitioned by trace_id for span co-location. |
| Trace Backend | Grafana Tempo | S3-native storage (no index to maintain). TraceQL for queries. Stateless query layer scales horizontally. Bloom filters for trace ID lookup. |
| Long-term Storage | S3 (Parquet blocks) | Same S3 infrastructure used for metrics warm/cold tiers. Tempo writes trace blocks as Parquet files. S3 lifecycle policies manage tier transitions. |
| Correlation | Exemplars | Link metric data points to the specific trace that caused a spike. Click from a Grafana metric panel directly to the trace waterfall. |
| Context Propagation | W3C Trace Context (traceparent) | Industry standard. OTel SDK auto-injects for HTTP/gRPC. Manual injection required for Kafka message headers. |
| Visualization | Grafana (already deployed) | Native Tempo datasource. Trace waterfall view. Auto-generated service dependency maps from trace data. |
Why tail-based sampling over head-based: Head-based sampling decides at the start of a request (e.g., "keep 1% of all traces") before knowing whether the request will fail or be slow. This inevitably drops error and high-latency traces that are the most valuable for debugging. Tail-based sampling buffers all spans for a configurable window (60 seconds in this design), waits for the complete trace, then makes a keep/drop decision based on outcome: all errors kept, all p99 latency kept, 0.1% random baseline. The cost is memory. Each OTel Collector must buffer ~60 seconds of spans in RAM (see Section 11.8 for sizing).
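As a sketch of how this maps onto the tail_sampling processor from opentelemetry-collector-contrib (thresholds and policy names here are illustrative, and the latency policy is a fixed threshold standing in for "p99"):

```yaml
processors:
  tail_sampling:
    decision_wait: 60s       # buffer window before the keep/drop decision
    num_traces: 4000000      # cap on in-flight traces (memory bound, see Section 11.8)
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-requests
        type: latency
        latency: {threshold_ms: 2000}
      - name: random-baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 0.1}
```

Policies are OR'd: a trace kept by any one policy is kept.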
Why Tempo over Jaeger:
| Dimension | Grafana Tempo | Jaeger |
|---|---|---|
| Storage backend | S3/GCS (object storage) | Elasticsearch or Cassandra |
| Index maintenance | No index. Bloom filters + object keys. | Elasticsearch index management, ILM policies |
| Query language | TraceQL (filter by attributes, duration, status) | Tag-based search (limited) |
| Operational complexity | Stateless query + S3. Minimal ops. | ES/Cassandra cluster management |
| Grafana integration | Native (same vendor stack) | Requires datasource plugin |
Tempo stores traces as Parquet blocks on S3 with bloom filters for trace ID lookups. No inverted index to maintain, no Elasticsearch cluster to babysit. At this scale (hundreds of terabytes per day after sampling), that operational simplicity matters more than Jaeger's slightly more mature UI.
9.3 Data Flow
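A sketch of the trace data flow, assembled from the components in Section 9.2 (same shape as the metrics and log pipelines):

```text
App (OTel SDK, W3C traceparent) --> OTel Collector (otlp receiver)
  --> Tail-based sampling: buffer 60s, keep errors + slow + 0.1% baseline
  --> Kafka topic: traces-raw (partitioned by trace_id)
  --> Tempo ingesters (local SSD) --> Parquet blocks + bloom filters --> S3
  --> Compactor: merge small blocks, apply age-based tiering

Query: Grafana --> Tempo (TraceQL) --> bloom filter lookup --> S3 blocks
```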
9.4 Query Patterns: Metrics vs Traces
Metrics and traces answer different questions. Knowing which to reach for saves time during incidents.
| Question | Metrics (MetricsQL) | Traces (TraceQL) |
|---|---|---|
| "Is checkout slow?" | histogram_quantile(0.99, rate(http_duration_bucket{service="checkout"}[5m])) | {resource.service.name="checkout" && duration > 2s} |
| "Which service causes 500s?" | sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) | {status = error} then inspect span waterfall for root cause |
| "What happened to user X?" | Cannot answer. Metrics are pre-aggregated. | {span.user_id="user_123"} shows the full request call tree |
| "Is the DB the bottleneck?" | rate(db_duration_sum[5m]) / rate(db_duration_count[5m]) | {span.db.system="postgresql" && duration > 500ms} shows exact slow queries |
| "Correlate a metric spike to a trace" | Click exemplar link on Grafana panel | Opens the exact trace that caused the spike |
Exemplars bridge the two: click from a dashboard panel showing a latency spike directly to the trace waterfall that caused it.
9.5 Exemplar Storage and Mechanics
Exemplars link metrics to traces: when the OTel SDK records a histogram observation, it can attach the current trace_id, giving the metric data point a direct pointer to the trace that produced it.
How exemplars flow through the pipeline:
- The OTel SDK records a histogram observation and attaches {trace_id: "abc123", span_id: "def456"} as exemplar metadata.
- The OTel Collector forwards the exemplar alongside the metric sample to Kafka.
- vminsert writes the exemplar to vmstorage. VictoriaMetrics stores exemplars in a separate data structure alongside the time series, not inline with samples.
- When Grafana renders a histogram panel, it queries vmselect for both the metric data and associated exemplars.
- Exemplars appear as dots on the graph. Clicking a dot opens the linked trace in Tempo via the trace_id.
Storage overhead: VictoriaMetrics limits exemplars to 128 per series by default (configurable via -search.maxExemplars). Each exemplar is ~100 bytes (trace_id + span_id + timestamp + labels). At 10 billion active series, even if 1% carry exemplars, that's 100M exemplars x 100 bytes = ~10 GB. Negligible compared to the metric data itself.
Sampling alignment matters. If the trace was dropped by tail-based sampling but the exemplar was kept, clicking the exemplar link leads to a "trace not found" error in Tempo. The sampling decision does not retroactively remove exemplars from metric storage since they're already written. At a 0.5% trace keep rate, roughly 99.5% of exemplar links will be dead ends. In practice, this is acceptable because the exemplars that matter most (errors, p99 latency) correspond to traces that the tail-based sampler keeps by policy. The random 0.1% baseline exemplars are mostly dead links, but they still serve as timestamps showing when specific traces occurred.
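The Section 9.5 storage overhead estimate reduces to one line of arithmetic; a quick check:

```python
# Sanity check of the exemplar overhead estimate (Section 9.5 numbers).
active_series = 10_000_000_000   # 10B active series
exemplar_fraction = 0.01         # assume 1% of series carry an exemplar
bytes_per_exemplar = 100         # trace_id + span_id + timestamp + labels

exemplar_count = active_series * exemplar_fraction
total_gb = exemplar_count * bytes_per_exemplar / 1e9
print(f"{exemplar_count:,.0f} exemplars, ~{total_gb:.0f} GB")  # 100,000,000 exemplars, ~10 GB
```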
9.6 Tiered Storage for Traces
Same hot/warm/cold philosophy as the metrics pipeline (Section 8.6). Traces have different access patterns at different ages.
| Tier | Retention | Storage | Resolution | Query Latency |
|---|---|---|---|---|
| Hot | 0-3 days | Tempo ingesters (local SSD) + S3 Standard | Full spans, all attributes | < 500ms by trace ID |
| Warm | 3-30 days | S3 Standard (compacted Parquet blocks) | Full spans, all attributes | 1-3s (S3 reads + bloom filter lookup) |
| Cold | 30-180 days | S3 Glacier Instant Retrieval | Full spans, all attributes | 5-10s (Glacier retrieval) |
| Archive | 180d+ | S3 Glacier Deep Archive | Full spans (compliance/audit) | Minutes (restore required) |
Unlike metrics, traces are never downsampled. A trace is either useful in full detail or not at all. Cost savings come from tiering across S3 storage classes, not reducing resolution.
Tempo's compactor merges small blocks into larger ones and applies age-based tiering. compactor.compaction.block_retention controls block expiration; S3 lifecycle rules handle Standard-to-Glacier transitions automatically.
9.7 Capacity Estimate
From Section 7.5: raw trace volume is 17.3 PB/day. After tail-based sampling (0.5% keep rate), ingested volume is 86 TB/day.
Tiered storage (180-day retention):
Hot (3 days): 86 TB x 3 = 258 TB on S3 Standard
Warm (27 days): 86 TB x 27 = 2.32 PB on S3 Standard
Cold (150 days): 86 TB x 150 = 12.9 PB on Glacier Instant
Total trace storage: ~15.5 PB
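The tier math above can be verified directly from the 86 TB/day figure:

```python
# Recompute the Section 9.7 trace storage tiers from 86 TB/day.
daily_tb = 86  # TB/day ingested after 0.5% tail-based sampling

hot_tb = daily_tb * 3      # 258 TB on S3 Standard
warm_tb = daily_tb * 27    # 2,322 TB ~ 2.32 PB on S3 Standard
cold_tb = daily_tb * 150   # 12,900 TB = 12.9 PB on Glacier Instant

total_pb = (hot_tb + warm_tb + cold_tb) / 1000
print(f"total ~ {total_pb:.1f} PB")  # total ~ 15.5 PB
```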
Tempo ingester fleet:
1 GB/sec ingest rate. Each ingester handles ~50 MB/sec.
20 ingesters (r6g.2xlarge)
Tempo compactor:
5 compactor nodes (c6g.xlarge)
Kafka additional load:
1 GB/sec trace data on existing cluster
~4 additional brokers (250K msg/sec each)
OTel Collector overhead:
Tail-based sampling requires 4 GB RAM per 100K spans/sec in the decision buffer
200M raw spans/sec across 3,000 collectors = ~67K spans/sec per collector
~2.7 GB additional RAM per collector (fits within c6g.xlarge 8GB with headroom)
17.3 PB/day raw, reduced to 86 TB/day after tail-based sampling. The sampling layer is not optional at this scale. Without it, the storage volume is 200x higher. With it, the entire tracing infrastructure is a fraction of the metrics pipeline.
For deeper coverage of sampling strategies and context propagation patterns, see the Distributed Tracing reference. Failure scenarios for the trace pipeline are covered in Section 12.
10. Centralized Logging
The trace shows which service failed. Logs show why: "connection refused to stripe-api.com:443, certificate expired." The logging pipeline reuses the same OTel Collector fleet and Kafka cluster, diverging only at the storage layer.
10.1 Log Entry Format
Structured logs follow a consistent JSON schema with fields auto-indexed by VictoriaLogs.
```json
{
  "timestamp": "2026-03-07T14:30:00.123Z",
  "level": "ERROR",
  "service": "payment-service",
  "message": "connection refused to stripe-api.com:443, certificate expired",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "a1b2c3d4e5f60718",
  "host": "payment-pod-7b8f9",
  "region": "us-east-1",
  "error.type": "ConnectionRefused",
  "retry_count": 3
}
```
200-1000 bytes per log line. The trace_id field enables log-to-trace correlation: click from a log line in Grafana to open the full trace waterfall in Tempo. VictoriaLogs auto-indexes every field using bloom filters (2 bytes per unique token), so both trace_id lookups and full-text search across the message field are index-assisted.
10.2 Tech Stack Selection
Three serious options exist for log storage at this scale: Elasticsearch, Grafana Loki, and VictoriaLogs. Each takes a very different approach to indexing.
| Dimension | VictoriaLogs | Grafana Loki | Elasticsearch |
|---|---|---|---|
| Indexing strategy | Bloom filters on all tokens. Index-assisted search on every field automatically. | Labels only (service, env, level). Log content is NOT indexed. Brute-force grep on chunks. | Full-text inverted index on all fields. |
| Ingestion throughput | 3x higher than Loki on same hardware. 66 MB/s vs 20 MB/s on 4 vCPU / 8 GiB (benchmark). | Bottlenecks on high-volume ingestion. Requires more nodes. | High throughput but index building is CPU-intensive. |
| Query speed (content search) | Fast. Bloom filter token lookup skips non-matching blocks. Needle-in-haystack: 900ms. | Slow. Brute-force scan after label filter. Needle-in-haystack: 12s. | Fast. Inverted index on all content. |
| Resource efficiency | 72% less CPU, 87% less RAM than Loki. Up to 30x less RAM than Elasticsearch. | Memory-hungry ingesters. 6-7 GiB RAM on same workload. | Highest resource consumption. Large JVM heaps. |
| Storage efficiency | 40% less disk than Loki. Columnar LSM-style storage with compression. | S3-native chunk storage. Label index is small. | Largest footprint. Inverted index adds 2-5x overhead. |
| Best for | High-volume operational logs with content search at scale | Moderate-volume logs with label-first query patterns | Full-text analytics, log-based dashboards, compliance search |
At 25 GB/sec: Elasticsearch requires the most infrastructure due to full-text inverted index overhead on 2.16 PB/day raw volume. Between Loki and VictoriaLogs, the deciding factor is what happens when queries go beyond label filtering. At 500K services, operators frequently search log content across broad service groups ("find all connection timeout errors in the payment domain"). Loki handles this by brute-force scanning chunks after label filtering. VictoriaLogs uses bloom filter token indexes to skip non-matching blocks entirely (see query speed row above).
The tradeoff: Loki stores data natively on S3, making tiered storage simpler. VictoriaLogs uses local disk, similar to how VictoriaMetrics vmstorage operates. For cold/archive tiers, VictoriaLogs relies on vmbackup snapshots to S3 rather than native lifecycle policies. At this scale, the ingestion and memory advantages from the table above outweigh the simpler S3 tiering from Loki.
Chosen stack:
| Layer | Technology | Why |
|---|---|---|
| Collection | OTel Collector (filelog receiver) | Same DaemonSet fleet. Tail container stdout/stderr via Kubernetes log paths. Parse structured JSON logs. |
| Buffer | Kafka (topic: logs-raw) | Same cluster, separate topic. Partitioned by service_name for consumer locality. |
| Storage & Indexing | VictoriaLogs (cluster mode) | Bloom filter per-token index on all fields. Auto-indexes everything. Same ingestion and memory advantages as comparison above. |
| Long-term | Local NVMe on vlstorage + vmbackup to S3 | Hot/warm on local disk. Cold/archive snapshots to S3. Retention managed per log level. |
| Query | LogsQL | Filter and search: _time:5m AND service:checkout AND error AND _stream:{level="error"}. SQL-like syntax with full-text search. |
| Correlation | trace_id field in structured logs | Click from log line to open trace in Tempo. Click from trace span to see logs for that service during that time window. |
| Visualization | Grafana (already deployed) | VictoriaLogs datasource plugin for Grafana. Log panels alongside metric and trace panels in the same dashboard. |
10.3 Data Flow
```text
Container stdout --> Kubernetes log file --> OTel Collector (filelog receiver)
  --> Parse JSON --> Extract fields (service, level, trace_id)
  --> Kafka topic: logs-raw
  --> vlinsert --> Distribute to vlstorage nodes
  --> vlstorage --> Tokenize, build bloom filters, compress, write to local NVMe

Query: Grafana --> vlselect --> Fan out to vlstorage nodes --> Return results
```
This is the same fan-out pattern used by vmselect in the metrics pipeline.
10.4 Tiered Storage for Logs
Logs follow the same tiered storage strategy, with one important difference: unlike traces, partial log retention is viable. DEBUG and INFO logs older than 30 days rarely get queried. ERROR and WARN logs are the ones that matter for post-incident reviews and compliance audits. Dropping low-severity logs after the warm tier cuts cold/archive storage by 90%.
Dynamic log level adjustment. In production, most services run at INFO level. During incidents, operators can temporarily increase specific services to DEBUG via the OTel Collector's configuration reload (or OpAMP for fleet-wide changes). This keeps baseline ingestion volume manageable while preserving the ability to get full detail when it matters. Some teams go further: automatically elevate log level when error rate exceeds a threshold, and drop back to INFO once the incident resolves.
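The automatic elevation idea can be sketched as a small control loop with hysteresis. Everything here is hypothetical (a real version would push the level change through a collector config reload or OpAMP):

```python
class LogLevelController:
    """Elevate a service to DEBUG when its error rate crosses a threshold,
    and restore INFO only after it falls below a lower bound, so the level
    does not flap at the boundary (hysteresis)."""

    def __init__(self, elevate_at=0.05, restore_at=0.01):
        self.elevate_at = elevate_at  # error ratio that triggers DEBUG
        self.restore_at = restore_at  # error ratio that restores INFO
        self.level = "INFO"

    def observe(self, error_rate):
        if self.level == "INFO" and error_rate >= self.elevate_at:
            self.level = "DEBUG"  # incident: capture full detail
        elif self.level == "DEBUG" and error_rate < self.restore_at:
            self.level = "INFO"   # recovered: cut baseline volume again
        return self.level

ctl = LogLevelController()
print(ctl.observe(0.02))   # INFO
print(ctl.observe(0.08))   # DEBUG
print(ctl.observe(0.03))   # DEBUG (still above the restore bound)
print(ctl.observe(0.005))  # INFO
```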
VictoriaLogs uses local disk storage (not S3-native like Tempo). The tiering model mirrors how VictoriaMetrics handles metrics storage: hot and warm data lives on local NVMe across vlstorage nodes, while cold and archive data moves to S3 via vmbackup snapshots.
| Tier | Retention | Storage | What Gets Stored | Query Latency |
|---|---|---|---|---|
| Hot | 0-3 days | vlstorage local NVMe | All log levels, full content | < 500ms (bloom filter + local disk) |
| Warm | 3-30 days | vlstorage local disk | All log levels, full content | 500ms-2s (local disk reads) |
| Cold | 30-90 days | S3 Standard (vmbackup snapshots) | ERROR/WARN only (drop DEBUG/INFO) | 5-15s (restore from snapshot) |
| Archive | 90-365 days | S3 Glacier Deep Archive | ERROR/WARN only (compliance/audit) | Minutes (restore required) |
VictoriaLogs manages retention natively via the -retentionPeriod flag on vlstorage. Logs older than the configured period are deleted automatically at partition granularity (daily UTC partitions). For per-level retention (keeping ERROR longer than DEBUG), separate vlstorage instances or stream-based routing at vlinsert level can partition logs by severity, each with different retention periods.
Compared to Loki's S3-native tiering (automatic lifecycle via S3 policies), VictoriaLogs requires vmbackup to move cold data to S3. What you get: sub-second query latency on local disk vs multi-second on S3, plus 40% storage reduction.
10.5 Log-to-Trace Correlation
Structured logs include a trace_id field. VictoriaLogs auto-indexes this field like every other field, so querying by trace_id is a direct token lookup, not a brute-force scan. In Grafana, clicking a log line and selecting "View Trace" opens the full trace waterfall in Tempo. The reverse direction works too: from a Tempo trace span, clicking "View Logs" runs a VictoriaLogs query for trace_id:"abc123" AND service:checkout scoped to the span's time window.
With all three signals connected, the investigation goes from "p99 latency spike" to the exact error message ("connection timeout to the read replica") in three clicks. Without that correlation, the same investigation takes 30 minutes of manual work across different tools.
10.6 Capacity Estimate
From Section 7.5: 50M log lines/sec at 500 bytes = 25 GB/sec = 2.16 PB/day raw.
VictoriaLogs compression + bloom filter overhead:
~8x compression on structured JSON logs (columnar storage + LZ4/ZSTD)
VictoriaLogs uses 40% less storage than Loki on the same data (benchmark)
Compressed: ~260 TB/day stored on local disk
Log level distribution (typical): 70% DEBUG, 20% INFO, 8% WARN, 2% ERROR
Tiered storage (365-day retention for errors, 30-day for all):
Hot (3 days): 260 TB x 3 = 780 TB on local NVMe
Warm (27 days): 260 TB x 27 = 7.0 PB on local SSD
Cold (60 days): 260 TB x 0.10 x 60 = 1.56 PB on S3 Standard
(only ERROR+WARN = 10% of volume)
Archive (275d): 260 TB x 0.10 x 275 = 7.15 PB on Glacier Deep
Total log storage: ~9.5 PB across active tiers (+ 7.15 PB Glacier Deep Archive)
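A quick check of the tier math from the 260 TB/day compressed figure:

```python
# Recompute the Section 10.6 log storage tiers from 260 TB/day compressed.
daily_tb = 260
error_warn_tb = daily_tb // 10    # ERROR + WARN = 10% of volume -> 26 TB/day

hot_tb = daily_tb * 3             # 780 TB on local NVMe
warm_tb = daily_tb * 27           # 7,020 TB ~ 7.0 PB on local disk
cold_tb = error_warn_tb * 60      # 1,560 TB on S3 Standard
archive_tb = error_warn_tb * 275  # 7,150 TB on Glacier Deep Archive

active_pb = (hot_tb + warm_tb + cold_tb) / 1000  # 9.36 PB, ~9.5 in the text's rounding
print(f"active ~ {active_pb:.2f} PB, archive ~ {archive_tb / 1000:.2f} PB")
```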
vlinsert fleet:
25 GB/sec ingest rate (see Section 10.2 comparison table for benchmark numbers).
Each vlinsert handles ~1.5 GB/sec. The 66 MB/s benchmark measures the full single-node stack; in cluster mode vlinsert only parses and routes (tokenizing and bloom filter construction happen on vlstorage, per the Section 10.3 flow), so its per-node throughput is far higher.
17 vlinsert nodes (c6g.4xlarge, 16 vCPU)
vlstorage fleet:
Storage-optimized instances for local NVMe.
Hot tier: 780 TB across i3en.6xlarge nodes (2x 7.5 TB NVMe each = 15 TB usable)
~52 nodes for hot tier
Warm tier: 7.0 PB across d3en.8xlarge nodes (8x 10 TB HDD each = 80 TB usable)
~88 nodes for warm tier
vlselect fleet:
Fan-out query layer. Stateless. Scale based on query concurrency.
10 nodes (c6g.2xlarge)
Kafka additional load:
25 GB/sec log data on existing cluster
~20 additional brokers
Failure scenarios for the logging pipeline are covered in Section 12.
11. Identify Bottlenecks
11.1 Cardinality Ceiling
The inverted index is the first component to break under cardinality pressure. At 10 billion active series cluster-wide, the mergeset index consumes significant disk I/O and memory for bloom filters. As per-node series counts climb past the ~500M target (Section 11.7), lookup latency degrades. Symptoms: dashboard queries that used to return in 50ms start taking 500ms+. The metric vm_index_search_duration_seconds creeps up. Alert rule evaluations miss their 15-second window. The fix: more vmstorage nodes (horizontal sharding) and stricter cardinality limits per tenant.
11.2 Query Fan-Out
Every MetricsQL query fans out to all vmstorage nodes. With 200 nodes (RF=2), that's 200 parallel requests per query. If a dashboard has 20 panels, each refreshing every 30 seconds, that's 20 × 200 = 4,000 vmstorage requests per dashboard refresh. With 50 concurrent dashboard users, that's 200K requests/30s to vmstorage. Symptoms: Grafana dashboards show "query timeout" errors. vm_concurrent_select_requests on vmselect stays at max. Users complain dashboards are blank or loading forever.
To handle this: recording rules reduce the number of series each panel queries, result caching in vmselect cuts redundant reads, and query coalescing merges identical concurrent queries into one.
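Query coalescing is the classic singleflight pattern. A minimal sketch (illustrative, not vmselect's actual implementation):

```python
import threading
import time

class SingleFlight:
    """Merge concurrent calls that share a key into one execution;
    every caller receives the single shared result."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> (done_event, result_holder)

    def do(self, key, fn):
        with self._lock:
            entry = self._inflight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._inflight[key] = entry
        done, holder = entry
        if leader:
            try:
                holder["value"] = fn()  # only the leader runs the query
            finally:
                with self._lock:
                    del self._inflight[key]
                done.set()
        else:
            done.wait()                 # followers reuse the leader's result
        return holder["value"]

sf = SingleFlight()
executions = []
results = []
barrier = threading.Barrier(8)

def query():
    executions.append(1)  # count how often the expensive fan-out really runs
    time.sleep(0.5)       # simulate the 200-node vmstorage fan-out
    return "resultset"

def caller():
    barrier.wait()        # 8 identical dashboard queries arrive together
    results.append(sf.do('sum(rate(http_requests_total[5m]))', query))

threads = [threading.Thread(target=caller) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(executions)} execution for {len(results)} callers")
```

Error handling is omitted for brevity; a production version would also propagate the leader's exception to followers.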
11.3 Kafka Consumer Lag
If Flink or vminsert falls behind, Kafka consumer lag grows. With 500M/sec ingestion and 24-hour retention, Kafka stores up to 43.2T samples. If a consumer falls 1 hour behind, it needs to process 1.8T extra samples to catch up while also handling current load. Symptoms: kafka_consumergroup_lag climbs steadily. Metrics appear on dashboards with increasing delay. The 2-second ingestion-to-visibility target starts slipping to 10+ seconds.
Monitor consumer lag (kafka_consumergroup_lag) and alert if lag exceeds 5 minutes. Auto-scale Flink TaskManagers based on lag. The 24-hour Kafka retention gives headroom for recovery.
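One step worth making explicit: catch-up time depends on how much spare capacity the consumer has beyond the live ingest rate. Using the Section 11.3 numbers:

```python
# Catch-up math for a consumer that fell one hour behind (Section 11.3).
ingest_rate = 500e6    # samples/sec arriving continuously
lag_samples = 1.8e12   # one hour of backlog at 500M/sec

def catchup_hours(headroom):
    """headroom = spare capacity beyond live load; 0.2 means the consumer
    can process 600M/sec total, leaving 100M/sec for the backlog."""
    return lag_samples / (ingest_rate * headroom) / 3600

print(f"{catchup_hours(0.2):.0f} h at 20% headroom")  # 5 h
print(f"{catchup_hours(0.5):.0f} h at 50% headroom")  # 2 h
```

This is why the 24-hour Kafka retention matters: modest headroom still leaves hours of recovery time well within the retention window.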
11.4 Flink Checkpoint Size
Flink's recording rules maintain state (current window aggregations) in RocksDB. At 50M active aggregation keys, checkpoints can be multi-GB. Checkpoint duration affects recovery time (longer checkpoint = longer restart + replay). Symptoms: Flink's checkpoint duration metric (flink_jobmanager_job_lastCheckpointDuration) grows from seconds to minutes. If checkpoints take longer than the checkpoint interval, they start overlapping and Flink falls behind.
Mitigation: incremental checkpointing (only write changed state), tuning RocksDB block cache and compaction, and enabling unaligned checkpoints so barriers can overtake in-flight buffers under backpressure instead of stalling until full alignment.
11.5 vmstorage Compaction Storms
VictoriaMetrics periodically merges small parts into larger ones (similar to LSM-tree compaction). During compaction, disk I/O spikes. If compaction falls behind ingestion, parts accumulate, and read latency increases. Symptoms: vm_parts_count on vmstorage keeps climbing instead of staying flat after merges. Disk I/O utilization stays pegged at 100%. Query latency gradually worsens as more parts need scanning.
Provision vmstorage nodes with 30% I/O headroom for compaction and use NVMe SSDs (not EBS gp3). Monitor vm_merge_need_free_disk_space and scale storage before compaction bottlenecks.
11.6 S3 Query Latency
Queries spanning both hot and cold tiers bottleneck on the S3 leg. vmstorage responds in single-digit milliseconds. S3 GET latency is 50-200ms per request. A query touching 100 Parquet files on S3 waits for the slowest one. Symptoms: any dashboard query with a time range beyond 15 days loads noticeably slower than a same-duration query within 15 days. vmselect_s3_request_duration_seconds shows a long tail.
At 2.6 PB in the warm tier, the file count is substantial. With 256 MB Parquet files, that's ~10 million files. Without partition pruning, a single 90-day query could scan thousands of files. The Parquet path structure (s3://metrics-archive/{tenant_id}/{resolution}/{year}/{month}/{day}/{fingerprint_range}.parquet) enables three levels of pruning: tenant_id eliminates other organizations' data, time range skips irrelevant days, and fingerprint range narrows to the specific series. A well-pruned query touches 10-50 files, not thousands.
To reduce this: pre-fetch Parquet row group indexes into vmselect memory on startup, cache frequently accessed Parquet blocks in a local SSD cache on vmselect nodes, and partition Parquet files by time range so pruning eliminates most files before any S3 request is made.
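The pruning logic reduces to prefix construction over the path scheme above. A sketch (the helper name is hypothetical; fingerprint-range pruning within each day is omitted):

```python
from datetime import date, timedelta

def pruned_prefixes(tenant_id, resolution, start, end):
    """S3 prefixes a query must touch: tenant_id and resolution pin the
    top of the path, the time range limits the day directories. The
    fingerprint-range filename component narrows further (not shown)."""
    prefixes = []
    d = start
    while d <= end:
        prefixes.append(
            f"s3://metrics-archive/{tenant_id}/{resolution}/"
            f"{d.year}/{d.month:02d}/{d.day:02d}/")
        d += timedelta(days=1)
    return prefixes

# A 3-day query touches 3 day-prefixes instead of the whole archive.
p = pruned_prefixes("tenant-42", "5m", date(2026, 3, 1), date(2026, 3, 3))
print(len(p), p[0])  # 3 s3://metrics-archive/tenant-42/5m/2026/03/01/
```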
11.7 Inverted Index Rebuild Time
When a vmstorage node restarts, it replays the WAL (Write-Ahead Log, a sequential on-disk record of all recent writes) and rebuilds the in-memory portion of the mergeset index. With billions of series, this can take 10-30 minutes. During this window, the node accepts writes (buffered to WAL) but cannot serve queries efficiently. Symptoms: after a vmstorage restart, vmselect logs show timeouts for queries involving that node. Partial results appear on dashboards until the index rebuild completes.
Periodic index snapshots reduce WAL replay size. Faster NVMe storage speeds up replay. Keep the hot tier series count manageable per node (target: <500M active series per vmstorage node). If rebuild time is critical, enable replication factor 2 so the replica serves queries while the primary rebuilds.
11.8 Tail-Based Sampling Memory
The OTel Collector's tail-based sampling processor buffers all spans for 60 seconds before making keep/drop decisions. At 67K spans/sec per collector (200M raw spans/sec across 3,000 collectors), each collector holds ~4M spans in memory, consuming ~2.7 GB RAM. If a burst of high-cardinality traces arrives (e.g., a load test generating 10x normal traffic), this buffer can exceed the collector's memory limit and trigger OOM. Symptoms: collector pods get OOMKilled by Kubernetes. otelcol_processor_tail_sampling_count_traces_dropped spikes. Trace coverage drops and exemplar links break.
Set max_traces on the tail-based sampling processor to cap memory usage. If the limit is reached, new traces pass through unsampled rather than crashing the collector. Monitor otelcol_processor_tail_sampling_count_traces_dropped and scale the collector fleet if drops exceed 1%. For high-volume services, use a dedicated collector pool rather than routing everything through the DaemonSet fleet.
11.9 Tempo S3 Query Latency
Tempo stores trace blocks as Parquet files on S3 with bloom filters for trace ID lookup. When a bloom filter returns a false positive, Tempo must fetch and scan the full Parquet block from S3 (50-200ms per GET). For queries that search by attributes rather than trace ID (e.g., {resource.service.name="checkout" && duration > 2s}), Tempo may scan dozens of blocks. Symptoms: TraceQL queries by trace ID return in <1s, but attribute-based searches take 10-30s. tempo_query_frontend_queries_total{status="timeout"} increases.
Keep the Tempo compactor healthy. Well-compacted blocks have smaller, more effective bloom filters. Configure the compactor's compaction_window and max_block_bytes to balance compaction frequency against S3 write volume. A memcached query result cache avoids repeated S3 reads for popular queries.
11.10 VictoriaLogs Bloom Filter Cache
VictoriaLogs uses bloom filters on every log field for index-assisted search. High-cardinality queries (e.g., searching for a specific trace_id across all services) hit bloom filters on every vlstorage node. If the bloom filter working set exceeds the OS page cache, queries trigger disk reads instead of cache hits, increasing latency from sub-second to multi-second. Symptoms: log search queries that usually return in 200ms start taking 5-10s. The vlstorage node's disk read IOPS spikes while memory usage stays flat (the bloom filters no longer fit in cache).
Monitor bloom filter cache hit rate via vlstorage metrics and provision vlstorage nodes with enough RAM for the OS page cache to hold the active bloom filter set (see Section 10.6 for sizing). If cache thrashing occurs, add more vlstorage nodes to reduce the per-node token count.
11.11 Scaling Triggers
The Section 7 sizing gives static node counts for 500M/sec. In production, you need to know when to add capacity before hitting a wall.
| Component | Scaling Metric | Trigger | Action |
|---|---|---|---|
| vmstorage | vm_active_timeseries per node | > 400M series (80% of 500M target) | Add vmstorage nodes, rebalance hash ring |
| vmstorage | Disk utilization | > 70% on NVMe | Add nodes or increase retention export frequency |
| vminsert | CPU utilization | > 70% sustained | Add vminsert pods (stateless, instant) |
| vmselect | vm_concurrent_select_requests | > 80% of max | Add vmselect pods |
| Kafka | Bytes in per broker | > 200 MB/sec per broker | Add brokers, rebalance partitions |
| Flink | kafka_consumergroup_lag | Lag > 5 minutes sustained | Add TaskManagers |
| Flink | Checkpoint duration | > 50% of checkpoint interval | Tune RocksDB, add parallelism |
| vlstorage | Disk utilization | > 70% | Add vlstorage nodes |
| vlstorage | Bloom filter cache miss rate | > 20% | Add RAM or add nodes to reduce per-node token count |
| OTel Collectors | otelcol_exporter_send_failed_metric_points | Failure rate > 1% | Scale DaemonSet fleet or increase per-collector resources |
Use Kubernetes HPA for stateless components (vminsert, vmselect, vlinsert, vlselect) with custom metrics from the meta-monitoring stack. Stateful components (vmstorage, vlstorage, Kafka) require manual scaling with planned rebalancing to avoid data movement storms.
12. Failure Scenarios
What happens when components fail or scaling limits are exceeded, and how the platform recovers.
12.1 VictoriaMetrics Node Loss
Scenario: A vmstorage node loses its disk (NVMe failure) and all data on it.
Impact: The series sharded to that node become unavailable for queries. Ingestion continues (vminsert detects the failed node and redistributes to remaining nodes). Queries return partial results for affected series.
Recovery:
- vminsert's health check detects the failed node within 10 seconds
- vminsert's consistent hashing ring updates. New writes for affected series go to the next node in the ring.
- The failed node's data is lost (unless replication was enabled). With replication factor 2, the replica node has a copy. vmselect queries both the primary and replica, so queries still return complete data.
- A replacement node joins the cluster. vmstorage rebalancing migrates data from overloaded nodes.
Data gap: Without replication, there's a gap in the affected series equal to the time between the last S3 export and the failure. With daily exports, that's up to 24 hours of lost raw data (downsampled data on S3 is unaffected).
Why replication factor 2 is enough: Unlike databases, metrics data is append-only and immutable. There are no updates, no transactions, no read-before-write. Losing recent data is tolerable (the system will re-scrape and backfill). Losing old data is mitigated by S3 exports.
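The hash-ring rerouting in the recovery steps above can be sketched with a minimal consistent-hash ring (illustrative; vminsert's actual algorithm differs in detail):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: a key maps to the first node point at
    or after its hash (wrapping around); removing a node reroutes only the
    keys that node owned, to the next point on the ring."""

    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self._points = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[i][1]

nodes = [f"vmstorage-{i}" for i in range(6)]
ring = HashRing(nodes)
owner = ring.node_for("series-fp-1234")

# Simulate losing the owning node: rebuild without it. Only series owned
# by the failed node move; every other series keeps its placement.
healthy = HashRing([n for n in nodes if n != owner])
print(owner, "->", healthy.node_for("series-fp-1234"))
```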
12.2 Kafka Broker Failure
Scenario: A Kafka broker in the metrics-ingestion cluster crashes.
Impact: Partitions led by that broker become temporarily unavailable. With RF=3, the controller elects a new leader from in-sync replicas. Producers retry. Consumers rebalance.
Recovery:
- Kafka controller detects broker failure (ZooKeeper session timeout or KRaft heartbeat)
- Leader election for affected partitions (~1-5 seconds)
- Producers retry buffered messages to the new leader
- Consumers detect new partition assignment and resume from last committed offset
- No data loss (RF=3 with acks=all means at least 2 replicas have every message)
During failover: 1-5 seconds of elevated producer latency. OTel Collectors buffer in memory during this window. At 500M/sec, that's 500M-2.5B buffered samples. OTel Collectors need sufficient memory for this burst (configured via sending_queue settings).
12.3 Cardinality Explosion (OOM)
Scenario: A tenant adds user_id as a histogram label (see the full cardinality explosion walkthrough in Section 8.3). 10M users × 12 histogram series = 120M new series.
Impact: vmstorage memory spikes. If per-tenant limits aren't enforced, OOM kills the process. All tenants on that node lose access.
Recovery with per-tenant limits (correct configuration):
- vminsert detects that the tenant has exceeded its 50M series limit
- New series creation is rejected (HTTP 429)
- Existing series continue receiving data
- Alert fires: vm_active_timeseries > 0.8 * vm_series_limit
- The tenant investigates and removes the problematic label from their application code (they added user_id to their SDK instrumentation, so they must remove it, or the platform's OTel Collector relabel rules can drop it as a stop-gap)
- Over the next staleness period (5 minutes), the 120M stale series are marked inactive
- Memory gradually returns to normal as stale series are garbage collected
Recovery without per-tenant limits (disaster):
- vmstorage OOM-killed
- vmselect can't reach that node. Queries return partial results.
- Alert evaluators can't evaluate rules for series on that node. Alerts go silent.
- Operator manually identifies the problematic tenant (by checking vm_new_timeseries_created_total per tenant)
- Operator adds an emergency cardinality limit for that tenant
- vmstorage restarts, replays WAL, rebuilds in-memory index
- Recovery time: 10-30 minutes depending on WAL size
Per-tenant cardinality limits are P0 for exactly this reason.
12.4 Split-Brain in VictoriaMetrics Cluster
Scenario: Network partition isolates vmstorage nodes 1-3 from nodes 4-6.
Impact: vminsert can only reach half the nodes. Writes for series sharded to unreachable nodes fail. vmselect can only query reachable nodes. Results are partial.
Recovery:
VictoriaMetrics is AP (availability over consistency). During a partition, reads and writes continue but may return temporarily inconsistent data:
- vminsert routes writes for unreachable nodes to the next available node in the hash ring
- This creates temporary duplicate series on different nodes
- vmselect queries whatever nodes it can reach, returning partial data
- When the partition heals, deduplication on vmselect merges overlapping data from both sides
- The temporarily duplicated data is reconciled during the next compaction cycle
No manual intervention required. Data completeness may be reduced during the partition, but no data is permanently lost.
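The reroute-to-next-node behavior can be illustrated with a minimal hash ring (a sketch, not VictoriaMetrics' actual routing code):

```python
import hashlib

# Minimal hash ring: pick the node for a series, falling back to the next
# reachable node when the primary is on the wrong side of a partition.
NODES = [f"vmstorage-{i}" for i in range(1, 7)]

def ring_position(key: str) -> int:
    return int(hashlib.sha256(key.encode()).hexdigest(), 16)

def route(series: str, reachable: set[str]) -> str:
    start = ring_position(series) % len(NODES)
    for i in range(len(NODES)):                  # walk the ring clockwise
        node = NODES[(start + i) % len(NODES)]
        if node in reachable:
            return node
    raise RuntimeError("no reachable vmstorage nodes")

all_nodes = set(NODES)
primary = route('http_requests_total{pod="xyz"}', all_nodes)
# During a partition that isolates the primary, writes land on the next node,
# creating the temporary duplicate that vmselect deduplication later merges.
fallback = route('http_requests_total{pod="xyz"}', all_nodes - {primary})
assert fallback != primary
```

The duplicate series created by the fallback route is exactly what makes the system AP: writes never block on the partition, at the cost of a reconciliation pass afterward.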
12.5 Alert Storm
Scenario: A core dependency (database, DNS, load balancer) fails. 5,000 services lose connectivity. 100,000 alert rules fire simultaneously.
Impact: Without protections, 100,000 simultaneous notifications bury the on-call engineers, causing alert fatigue and making root cause identification impossible.
Recovery:
- Grouping: Alertmanager groups alerts by `alertname + service + env`. 100,000 pod-level alerts become ~500 service-level notifications.
- Inhibition: A top-level "dependency-down" alert inhibits all downstream service alerts:

      inhibit_rules:
        - source_match:
            alertname: CoreDependencyDown
          target_match_re:
            alertname: '.*'
          equal: ['env', 'region']

- Mass incident trigger: If >50 unique alert groups fire within 5 minutes, the system triggers a "mass incident" page to incident command rather than individual team on-calls.
- Rate limiting: Alertmanager caps at 100 notifications/minute per receiver. Excess alerts are queued and sent as a digest.
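The mass-incident trigger described above can be sketched as a sliding-window counter of unique alert groups (thresholds taken from the scenario; the class itself is hypothetical):

```python
from collections import deque

# Page incident command instead of individual teams when more than N unique
# alert groups fire inside a 5-minute window (>50 in the scenario above).
class MassIncidentDetector:
    def __init__(self, threshold: int = 50, window_s: int = 300):
        self.threshold = threshold
        self.window_s = window_s
        self.events: deque[tuple[float, str]] = deque()   # (timestamp, group_key)

    def observe(self, group_key: str, now: float) -> bool:
        """Record a firing alert group; return True if this is a mass incident."""
        self.events.append((now, group_key))
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()                          # expire old events
        unique_groups = {g for _, g in self.events}
        return len(unique_groups) > self.threshold

det = MassIncidentDetector(threshold=3, window_s=300)
t0 = 1_000.0
assert not det.observe("svc-a/HighLatency", t0)
assert not det.observe("svc-b/HighLatency", t0 + 1)
assert not det.observe("svc-c/HighLatency", t0 + 2)
assert det.observe("svc-d/HighLatency", t0 + 3)   # 4th unique group in window
```

Counting unique groups rather than raw notifications matters: a single flapping alert should never escalate to incident command on its own.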
12.6 Flink Job Failure
Scenario: A Flink TaskManager crashes mid-checkpoint while processing recording rules. State is partially written to the checkpoint store.
Impact: Recording rules stop producing pre-computed metrics. Dashboards using recording rule output show stale data. SLO burn rate calculations freeze. Raw metric ingestion into vmstorage continues unaffected (direct path from Kafka to vminsert).
Recovery:
- Flink JobManager detects TaskManager heartbeat loss
- Reschedules failed tasks on available TaskManagers
- Tasks restore from the last successful checkpoint (not the partial one)
- Consumer offsets reset to the checkpoint's Kafka position
- Flink replays events from that offset, rebuilding state
- Recording rule output resumes once replay catches up to real-time
Gap: During replay (typically 30 seconds to 2 minutes), recording rule output is stale. Dashboards using raw queries are unaffected. Alert rules using recording rule output may miss a single evaluation cycle.
12.7 Kafka Partition Rebalance
Adding a Flink TaskManager triggers a consumer group rebalance: Kafka redistributes partition ownership across the group, pausing all consumers for 5-30 seconds. Consumers resume from the last committed offset and process the backlog at full throughput. Use Kafka's cooperative rebalancing protocol, where only affected partitions pause rather than the entire consumer group, to reduce the disruption to near-zero for unaffected partitions.
12.8 S3 Export Failure
If the hot-to-warm export job fails repeatedly, vmstorage disk fills and new writes fail. Alert on time() - last_s3_export_timestamp > 86400. The export runs every 6 hours with a 24-hour alert threshold, so there are 4 chances to succeed before anyone gets paged. Each run is idempotent (Parquet files are keyed by time range + fingerprint range). If disk pressure is immediate, temporarily extend hot tier retention and add vmstorage nodes.
12.9 Tenant Isolation Breach Attempt
Scenario: A tenant crafts a MetricsQL query attempting to access another tenant's data: http_requests_total{tenant_id="other-tenant"}.
Impact: None, if the isolation layer is implemented correctly.
Prevention:
- vmselect strips any user-provided `tenant_id` label from the query
- vmselect injects the authenticated tenant's ID from the `X-Scope-OrgID` header
- The query becomes `http_requests_total{tenant_id="authenticated-tenant"}` regardless of what the user submitted
- vmstorage data is physically separated by tenant directory (`/data/{tenant_id}/`)
Detection: Audit log captures queries where the user-provided tenant_id differs from the authenticated one. This flags potential reconnaissance attempts. Three such attempts within an hour trigger a security alert.
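The strip-and-inject step can be sketched as a label rewrite (illustrative only: a real implementation would operate on the parsed MetricsQL AST, not a regex over the raw query string):

```python
import re

# Sketch: drop any user-supplied tenant_id matcher, then inject the
# authenticated tenant taken from the X-Scope-OrgID header.
def enforce_tenant(query: str, authenticated_tenant: str) -> str:
    # 1. Strip user-provided tenant_id matchers (=, !=, =~, !~) from selectors
    query = re.sub(r'tenant_id\s*(=|!=|=~|!~)\s*"[^"]*"\s*,?\s*', "", query)

    # 2. Inject the authenticated tenant into every label selector
    def inject(m: re.Match) -> str:
        body = m.group(1).strip().rstrip(",")
        sep = ", " if body else ""
        return '{tenant_id="%s"%s%s}' % (authenticated_tenant, sep, body)

    return re.sub(r"\{([^{}]*)\}", inject, query)

q = 'http_requests_total{tenant_id="other-tenant", env="prod"}'
assert enforce_tenant(q, "acme") == 'http_requests_total{tenant_id="acme", env="prod"}'
```

The key property, whatever the implementation: the tenant filter the storage layer sees comes only from the authenticated identity, never from query text.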
12.10 Clock Skew Between Collectors
NTP misconfiguration causes clock drift: future timestamps confuse dashboards and break rate() calculations, while past timestamps outside VictoriaMetrics' 30-minute window are silently dropped. Monitor vm_out_of_order_samples_total and vm_too_far_in_future_samples_total. Prevention: validate timestamps at the collector (current time ± 5 minutes) and overwrite out-of-range timestamps.
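The collector-side guard is a simple clamp; a minimal sketch, assuming the ±5-minute window stated above:

```python
SKEW_WINDOW_S = 300  # accept timestamps within +/- 5 minutes of collector time

def sanitize_timestamp(sample_ts: float, now: float) -> float:
    """Clamp out-of-range sample timestamps to the collector's own clock."""
    if abs(sample_ts - now) > SKEW_WINDOW_S:
        return now          # overwrite: the emitting host's clock has drifted
    return sample_ts        # in range: trust the original timestamp

now = 1_700_000_000.0
assert sanitize_timestamp(now - 60, now) == now - 60     # 1 min old: kept
assert sanitize_timestamp(now + 3600, now) == now        # 1 h in future: overwritten
assert sanitize_timestamp(now - 7200, now) == now        # 2 h in past: overwritten
```

Overwriting (rather than dropping) trades a small timestamp error for not losing the sample entirely, which is usually the right call for monitoring data.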
12.11 Cascading Failure from Query Overload
Scenario: A tenant opens a dashboard with 20 panels, each running an unoptimized query that touches 2 million series. The dashboard auto-refreshes every 10 seconds.
Impact: 20 queries × 2M series × 200 vmstorage nodes = 8 billion series lookups every 10 seconds from a single user. This saturates vmstorage CPU. All other tenants' queries slow down. Alert evaluation latency increases. SLO burn rate detection is delayed.
Recovery:
- Per-tenant query concurrency limit (default: 20 concurrent queries) throttles the offending tenant
- Per-query series limit (default: 500K series) rejects individual queries that are too broad
- vmselect returns HTTP 422 with an `X-Query-Series-Exceeded` header
- Other tenants' queries continue at normal latency
Prevention: Query analysis at dashboard save time. If a panel's query is estimated to touch >500K series, warn the user and suggest adding recording rules. The per-tenant concurrency and series limits from above also apply proactively. Monitor vm_concurrent_queries{tenant_id="X"} and page if any tenant consistently hits their limit.
12.12 Trace Sampling Pipeline Failure
Scenario: The tail-based sampling processor in the OTel Collector crashes or hits its max_traces limit. Spans either get dropped entirely or pass through unsampled at 100% keep rate.
Impact (spans dropped): Gaps in trace data. Trace-to-metric correlation via exemplars breaks for affected traces. Service dependency maps show missing edges.
Impact (100% pass-through): Tempo ingesters receive 200x normal volume (200 GB/sec instead of 1 GB/sec). Kafka topic traces-raw saturates. Tempo ingesters OOM or fall behind. Storage volume spikes 200x if sustained.
Recovery:
- Monitor `otelcol_processor_tail_sampling_count_traces_sampled` and `otelcol_processor_tail_sampling_count_traces_dropped`
- Alert if the sampled/raw ratio deviates from the expected 0.5% by more than 10x
- If pass-through: the OTel Collector's memory limiter processor acts as a circuit breaker, dropping spans before OOM
- Restart affected collectors. Tail-based sampling state is in-memory only, so restart resets the decision buffer
- Kafka retention (24 hours) buffers the excess. Once sampling resumes, Tempo catches up
12.13 Trace Context Propagation Break
Async boundaries (Kafka messages, SQS queues, background jobs) break trace context if traceparent isn't injected into message headers. Downstream spans become orphaned root spans, and service dependency maps miss edges. Monitor the root-span ratio per service. A sudden increase indicates broken propagation. The fix is always code-level: inject traceparent at the producer, extract at the consumer using OTel SDK's TextMapPropagator.
12.14 VictoriaLogs Disk Full
Scenario: A noisy service starts emitting 10x normal log volume (e.g., a misconfigured debug logging flag in production). The vlstorage nodes receiving logs from that service fill their local NVMe storage.
Impact: Once disk is full, vlstorage stops accepting new writes. vlinsert detects the failure and either buffers or drops logs. Queries to the affected vlstorage node return partial results. Other vlstorage nodes are unaffected (logs are distributed by service name, not broadcast).
Recovery:
- Alert fires: `vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes > 0.85` (warning at 85%)
- Identify the noisy service: query `sum(rate(vlinsert_ingested_bytes_total[5m])) by (service)` and sort descending
- Apply per-service rate limiting at vlinsert to throttle the offending service
- Fix the logging configuration in the application (disable debug logging, add sampling)
- VictoriaLogs retention policy (`-retentionPeriod`) automatically deletes old partitions, freeing disk space within one partition cycle (daily granularity)
Prevention: Monitor per-service ingestion rates. Set per-service ingestion quotas at the vlinsert layer. Use the OTel Collector's filter processor to drop DEBUG logs from high-volume services before they reach Kafka.
12.15 Network Partition Scenarios
OTel-to-Kafka and Flink-to-VictoriaMetrics partitions share the same recovery model: upstream components buffer locally (OTel Collectors have a 500 MB retry queue; Flink backpressures to Kafka), and Kafka's 24-hour retention prevents data loss once connectivity restores. Monitor otelcol_exporter_send_failed_metric_points and kafka_consumergroup_lag for detection. Deploy Kafka brokers across multiple AZs for prevention.
13. Observability (Meta-Monitoring)
The meta-problem: if the metrics platform itself goes down, how does anyone find out? Dashboards are blank. Alerts aren't evaluating. PagerDuty is silent.
Solution: A separate, minimal monitoring stack.
A single Prometheus instance on separate infrastructure (different Kubernetes cluster, different cloud region, different failure domain) monitors four things:
- Is vmstorage responding? Scrape `/metrics`. If the scrape fails 3 times, page.
- Is ingestion flowing? Check the `vm_rows_inserted_total` rate. If it drops to 0, page.
- Is the alert evaluator running? Check the `rule_evaluations_total` rate. If it drops to 0, page.
- Is Alertmanager healthy? Check `alertmanager_alerts_received_total`. If it flatlines, page.
Total footprint: one t3.medium instance.
13.1 Critical Metrics
What to watch, by component:
# === INGESTION ===
rate(vm_rows_inserted_total[5m]) # Should match 500M/sec ± 20%
kafka_consumergroup_lag{group="flink-recording-rules"} # Flink consumer lag
kafka_consumergroup_lag{group="vminsert-consumer"} # vminsert consumer lag
otelcol_scraper_scraped_metric_points / (otelcol_scraper_scraped_metric_points + otelcol_scraper_errored_metric_points) # Scrape success rate
rate(vm_rows_rejected_total{reason="series_limit"}[5m]) # Tenant limit rejections
# === STORAGE ===
vm_active_timeseries # Global active series
rate(vm_new_timeseries_created_total[5m]) # Series churn rate
vm_data_size_bytes{type="indexdb"} # Index size
vm_data_size_bytes{type="storage"} # Data size
vm_merge_need_free_disk_space # Compaction backlog
# === QUERY ===
histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) # p99 query latency
vm_concurrent_queries # Active queries
rate(vm_slow_queries_total[5m]) # Slow queries (>10s)
vm_cache_hits_total / (vm_cache_hits_total + vm_cache_misses_total) # Cache hit rate
# === ALERTING ===
histogram_quantile(0.99, rate(rule_evaluation_duration_seconds_bucket[5m])) # Rule eval duration
rate(rule_evaluation_failures_total[5m]) # Missed evaluations
rate(alertmanager_notifications_failed_total[5m]) # Notification failures
# === FLINK ===
flink_job_lastCheckpointDuration # Should be < checkpoint interval
rate(flink_taskmanager_job_task_numRecordsInPerSecond[5m]) # Should match Kafka rate
flink_taskmanager_job_task_backPressuredTimeMsPerSecond # High = can't keep up
flink_taskmanager_job_task_operator_state_size # RocksDB state size
# === TRACE PIPELINE ===
tempo_ingester_flush_duration_seconds # Ingester lag
rate(otelcol_processor_tail_sampling_count_traces_dropped[5m]) # Span drop rate
rate(otelcol_processor_tail_sampling_count_traces_sampled[5m]) /
rate(otelcol_processor_tail_sampling_count_traces_received[5m]) # Sampling ratio (~0.5%)
tempo_compactor_outstanding_blocks # Compactor health
# === LOG PIPELINE ===
rate(vlinsert_ingested_rows_total[5m]) # Log ingestion rate
vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes # Disk usage
vlstorage_bloom_filter_cache_hits /
(vlstorage_bloom_filter_cache_hits + vlstorage_bloom_filter_cache_misses) # Bloom filter cache
sum(rate(vlinsert_ingested_bytes_total[5m])) by (service) # Detect noisy services
13.2 Alert Rules for Meta-Monitoring
| Alert | Expression | Severity | For |
|---|---|---|---|
| IngestionDrop | rate(vm_rows_inserted_total[5m]) < 400000000 | critical | 2m |
| KafkaLagHigh | kafka_consumergroup_lag > 50000000 | warning | 5m |
| KafkaLagCritical | kafka_consumergroup_lag > 500000000 | critical | 2m |
| CardinalitySpike | rate(vm_new_timeseries_created_total[1h]) > 100000 | warning | 0m |
| VmstorageDown | up{job="vmstorage"} == 0 | critical | 30s |
| QueryLatencyHigh | histogram_quantile(0.99, rate(vm_request_duration_seconds_bucket[5m])) > 1 | warning | 5m |
| RuleEvalFailing | rate(rule_evaluation_failures_total[5m]) > 0 | critical | 2m |
| FlinkBackpressure | flink_taskmanager_job_task_backPressuredTimeMsPerSecond > 500 | warning | 5m |
| CompactionBehind | vm_merge_need_free_disk_space > 0 | warning | 10m |
| S3ExportFailed | time() - last_s3_export_timestamp > 86400 | critical | 0m |
| TempoIngesterLag | tempo_ingester_flush_duration_seconds > 30 | warning | 5m |
| TraceSamplingAnomaly | rate(otelcol_processor_tail_sampling_count_traces_sampled[5m]) / rate(otelcol_processor_tail_sampling_count_traces_received[5m]) > 0.05 | warning | 2m |
| VictoriaLogsStorageFull | vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes > 0.85 | warning | 5m |
| VictoriaLogsStorageCritical | vlstorage_disk_usage_bytes / vlstorage_disk_total_bytes > 0.95 | critical | 1m |
| LogIngestionDrop | rate(vlinsert_ingested_rows_total[5m]) < 40000000 | critical | 5m |
| NoisyServiceLog | sum(rate(vlinsert_ingested_bytes_total[5m])) by (service) > 1073741824 | warning | 5m |
14. Deployment Strategy
The platform deploys on Kubernetes across multiple regions. Each region runs a complete, independent pipeline. Kubernetes provides DaemonSet-based collector deployment, service discovery for scrape targets, and horizontal pod autoscaling for stateless components (vminsert, vmselect, vlinsert, vlselect).
14.1 Multi-Region Topology
Active-active ingestion: Each region runs its own complete pipeline (OTel → Kafka → Flink → VictoriaMetrics). Metrics stay region-local, no cross-region ingestion traffic.
Cross-region query federation: A global vmselect gateway fans out queries to VictoriaMetrics clusters in all regions. For global dashboards ("total request rate across all regions"), the gateway merges results from both regions. For region-specific dashboards, the gateway routes to a single region.
S3 replication: Each region's S3 bucket has cross-region replication enabled. If us-east-1 goes down entirely, warm/cold data is still accessible from us-west-2.
14.2 Rolling Upgrades
VictoriaMetrics: Upgrade one component at a time. vminsert first (stateless, instant restart). vmselect next (stateless). vmstorage last (stateful, needs graceful drain).
vmstorage graceful drain:
- Mark node as "draining" in the hash ring
- vminsert stops sending new writes to the draining node
- Wait for in-flight writes to complete (30-second timeout)
- Stop the process, upgrade, restart
- Re-join the hash ring
Flink: Blue-green deployment (run the new version alongside the old, switch traffic once the new version is healthy, then stop the old) for Flink jobs.
- Deploy new Flink job version alongside the existing one (different consumer group)
- New job starts consuming from Kafka and building state
- Once the new job has caught up (consumer lag = 0), switch traffic
- Stop the old job
This avoids state migration, which is the riskiest part of Flink upgrades.
14.3 Disaster Recovery
RPO/RTO targets:
| Component | RPO (data loss tolerance) | RTO (recovery time) | Mechanism |
|---|---|---|---|
| Metrics (hot) | 0 (with RF=2) | < 5 minutes | Replica serves queries while failed node recovers. vminsert reroutes writes. |
| Metrics (warm/cold) | 0 | < 30 minutes | S3 cross-region replication. Global vmselect reads from either region. |
| Kafka | 0 (RF=3, acks=all) | < 1 minute | Leader election on surviving replicas. |
| Flink | Last checkpoint (seconds to minutes) | < 5 minutes | Restart from checkpoint, replay from Kafka offsets. |
| Traces (Tempo) | ~60s of in-flight spans | < 10 minutes | Tempo query frontend is stateless. Ingesters lose their buffer (~60s of spans). |
| Logs (VictoriaLogs) | Since last vmbackup (hours) | 15-30 minutes | Restore from S3 snapshot. Hot data since last backup is lost unless Kafka retention covers it. |
Full region failure:
If us-east-1 goes down completely, the recovery path depends on the signal:
- Metrics: The global vmselect gateway detects us-east-1 VictoriaMetrics as unreachable and routes all queries to us-west-2. Dashboards show data only from us-west-2 services until recovery. Warm/cold data from us-east-1 is accessible via S3 cross-region replication.
- Traces: Tempo's query frontend in us-west-2 can read us-east-1 trace blocks from the replicated S3 bucket. No trace data is lost (beyond the ~60s in-flight buffer).
- Logs: VictoriaLogs data on us-east-1 local NVMe is unavailable until the region recovers. Logs from us-east-1 services are lost for the duration. If log durability across regions is critical, stream a copy of the `logs-raw` Kafka topic to us-west-2 via MirrorMaker.
Ingestion during regional failure: Services in the failed region can't emit telemetry. Services in the surviving region continue unaffected. If services in us-east-1 fail over to us-west-2 (application-level DR), their telemetry automatically flows into the us-west-2 pipeline with no configuration change, since OTel Collectors run as a DaemonSet on every node.
15. Security
15.1 Security Fundamentals
- Transport: mTLS between all components (OTel Collectors, Kafka, Flink, VictoriaMetrics). TLS 1.3 for external-facing APIs. Certificates managed by cert-manager.
- Tenant isolation: Separate data directories per tenant on vmstorage (`/data/{tenant_id}/`). vmselect injects a `tenant_id` filter into every query. Alert rules, recording rules, and dashboards are tenant-scoped with RBAC at the API gateway.
- Encryption at rest: AES-256 disk encryption for vmstorage (EBS/NVMe). SSE-S3 or SSE-KMS for S3. Broker-side encryption for Kafka.
- Audit logging: All administrative actions (tenant CRUD, alert rule changes, cardinality limit overrides, emergency tenant disables) logged to a separate append-only store. 1-year minimum retention for compliance.
15.2 Query Injection Prevention
MetricsQL expressions are user-provided strings. Without sanitization, a crafted query could attempt to:
- Access another tenant's data: `http_requests_total{tenant_id="other-tenant"}`
- Execute resource-intensive queries: `count({__name__=~".+"})` (scans all series)
Protections:
- vmselect forcibly prepends `{tenant_id="<authenticated_tenant>"}` to every query. The user-provided `tenant_id` label is stripped.
- Query complexity limits: maximum number of series touched (500K default), maximum query duration (60s), maximum number of subqueries (10).
- Regex complexity limits: MetricsQL regex matchers are checked for backtracking patterns before execution.
Explore the Technologies
The technologies and infrastructure patterns referenced throughout this design:
Core Technologies
| Technology | Role in This System | Learn More |
|---|---|---|
| Kafka | Ingestion buffer, 500M metrics/sec, fingerprint-partitioned topics | Kafka |
| VictoriaMetrics | Hot TSDB, Gorilla compression, shared-nothing cluster (vminsert/vmselect/vmstorage) | VictoriaMetrics |
| Flink | Stream aggregation, recording rules, SLO burn rate calculation | Flink |
| Prometheus | Scrape format standard, MetricsQL foundation, meta-monitoring stack | Prometheus |
| Grafana | Dashboard visualization, multi-datasource queries, variable templating | Grafana |
| InfluxDB | Alternative TSDB evaluated in Section 4.4 (FDAP stack, Arrow + Parquet) | InfluxDB |
| RocksDB | Flink state backend, VictoriaMetrics mergeset index foundation | RocksDB |
| gRPC | vminsert-to-vmstorage communication, protobuf metric serialization | gRPC |
| Envoy | Service mesh sidecar, mTLS for scrape traffic, load balancing | Envoy |
| Memcached | Query result cache for Tempo trace queries and vmselect overflow caching | Memcached |
| Gorilla Compression | Delta-of-delta + XOR encoding, 12x compression ratio, 1.37 bytes/sample | Gorilla Compression |
| ZSTD Compression | General-purpose compression for cold-tier Parquet blocks on S3, Kafka messages, RocksDB SST files. ~3x ratio, 1500 MB/s decompression | ZSTD Compression |
| Grafana Tempo | Distributed trace storage, S3-backed Parquet blocks, TraceQL queries, exemplar correlation | Grafana Tempo |
| VictoriaLogs | Log storage, bloom filter per-token index, LogsQL queries, vlinsert/vlselect/vlstorage cluster | VictoriaLogs |
| VictoriaTraces | Alternative trace backend (evaluated alongside Tempo), VictoriaLogs engine, vtinsert/vtselect/vtstorage | VictoriaTraces |
Infrastructure Patterns
| Pattern | Role in This System | Learn More |
|---|---|---|
| Metrics & Monitoring | Core domain. Gorilla compression, TSDB internals, cardinality management | Metrics & Monitoring |
| Message Queues & Event Streaming | Kafka topic design, fingerprint partitioning, backpressure handling | Event Streaming |
| Alerting & On-Call | Distributed rule evaluation, Alertmanager, SLO burn rates, alert storms | Alerting & On-Call |
| Kubernetes Architecture | Service discovery, DaemonSet collectors, pod-level scraping | Kubernetes Architecture |
| Rate Limiting & Throttling | Per-tenant ingestion limits, query concurrency limits, cardinality budgets | Rate Limiting |
| Circuit Breaker & Resilience | vmstorage failure handling, Kafka retry policies, graceful degradation | Circuit Breaker & Resilience |
| Auto-Scaling Patterns | vminsert/vmselect horizontal scaling, Flink TaskManager auto-scaling | Auto-Scaling |
| Distributed Tracing | Trace collection via OTel SDK, tail-based sampling, Tempo on S3, trace-to-metric exemplars | Distributed Tracing |
| Distributed Logging | Log collection via OTel filelog receiver, VictoriaLogs bloom filter indexing, per-level retention, trace-to-log correlation | Distributed Logging |
| Database Sharding | VictoriaMetrics consistent hashing, tenant shuffle-sharding | Database Sharding |
| Service Discovery & Registration | Kubernetes SD, Consul for automatic scrape target registration | Service Discovery |
| Deployment Strategies | Blue-green for Flink jobs, rolling upgrades for VictoriaMetrics | Deployment Strategies |
| Caching Strategies | vmselect query cache, recording rule pre-computation, Parquet block cache | Caching Strategies |
| Object Storage & Data Lake | S3 + Parquet warm/cold tiers, ZSTD compression, partition pruning | Object Storage |
| Secrets Management | mTLS certificates, tenant API keys, S3 encryption keys | Secrets Management |
Further Reading
- Facebook Gorilla Paper: A Fast, Scalable, In-Memory Time Series Database (VLDB 2015)
- Datadog Monocle: Evolving Real-Time Timeseries Storage in Rust
- Uber M3: Open Source Large-Scale Metrics Platform
- High-Cardinality TSDB Benchmarks: VictoriaMetrics vs TimescaleDB vs InfluxDB
- Datadog Metrics without Limits
- Google SRE Book: Alerting on SLOs
- Grafana Mimir Architecture
- Honeycomb: Why Observability Requires a Distributed Column Store
A unified observability platform at this scale is three pipelines sharing a single collection layer. Kafka buffers all three signal firehoses. VictoriaMetrics stores metrics with 12x Gorilla compression. Tempo stores traces as Parquet blocks on S3 after tail-based sampling cuts 99.5% of raw volume. VictoriaLogs stores logs with bloom filter indexing at 3x the throughput of Loki. Grafana ties all three together with exemplars and trace_id correlation: three clicks from "something is slow" to "here is the root cause." The hardest problem remains cardinality for metrics, sampling strategy for traces, and volume control for logs. Per-tenant limits, meta-monitoring, and a separate watchdog stack keep the whole thing running.