Observability Platform Design
The Four Pillars and Why You Need All Four
Metrics tell you something is wrong. Traces tell you where in the call chain it's wrong. Profiles tell you why the code is slow. Logs tell you what the system said about it. You need all four. A spike in error rate (metric) leads you to the request trace showing a slow checkout span (trace), which links to a CPU flame graph revealing regex compilation in a hot loop (profile), with logs confirming a cart with 10,000 items triggered the issue (log).
Organizations that stop at three pillars (metrics, traces, logs) can identify slow services but can't pinpoint slow functions without reproducing the issue locally. Adding continuous profiling closes this gap: the production flame graph shows the bottleneck in seconds rather than after hours of local debugging.
Datadog vs Grafana Stack vs Honeycomb
Datadog is the all-in-one solution. Everything works together out of the box. The cost is brutal at scale. A company with 500 services, 100 engineers, and moderate traffic can easily spend $50-100k per month. Datadog's per-host pricing ($23/host/month for infrastructure, $15/host/month for APM) and per-GB log ingestion ($0.10/GB) compound fast.
The Grafana stack (Prometheus for metrics, Loki for logs, Tempo for traces) costs a fraction of that but requires operational expertise. You need to manage storage (S3/GCS backends), retention, and scaling yourself. Grafana Cloud offers a managed version at roughly 30-50% of Datadog pricing.
Honeycomb takes a different approach with high-cardinality event data. Instead of pre-aggregated metrics, you send structured events and query them in real time. It's exceptional for debugging complex distributed systems but has a learning curve. Pricing starts at $130/month for 200M events.
Pick Datadog if operational simplicity matters more than cost. Pick Grafana stack if you have platform engineers who can manage infrastructure. Pick Honeycomb if debugging speed for complex distributed systems is your priority.
VictoriaMetrics Licensing
This document references VictoriaMetrics for metrics storage. A note on licensing: the single-node version is Apache 2.0, fully open source. The cluster version is also open source for core functionality, but enterprise features like downsampling, multi-tenant rate limiting, deduplication across HA pairs, and advanced RBAC require a commercial license. For most teams starting out, the open-source cluster version covers what you need. If licensing flexibility is a hard constraint for your organization, Prometheus paired with Thanos or Cortex for long-term storage is the safer path. You lose some of VictoriaMetrics' ingestion performance, but the licensing picture is cleaner.
OpenTelemetry as the Standard
OpenTelemetry (OTel) is the CNCF project that standardizes instrumentation. It provides SDKs for every major language, automatic instrumentation for popular frameworks, and the OTLP protocol for exporting telemetry data.
The key insight is that OTel decouples instrumentation from backends. Your application code uses OTel APIs. The OTel Collector receives, processes, and exports data to whatever backend you choose. Switch from Datadog to Grafana without changing application code. This is why OTel should be your instrumentation standard regardless of which backend you use.
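The decoupling is visible in the instrumentation itself. A minimal sketch with the OTel Python SDK (the service name, endpoint, and attributes are illustrative): nothing below names a backend; the exporter speaks OTLP to the local Collector, which is where the backend decision lives.

```python
# Sketch: backend-neutral instrumentation with the OTel Python SDK.
# Swapping Datadog for Grafana happens in the Collector config, never here.
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "checkout"}))
provider.add_span_processor(
    # Export OTLP to the node-local Collector (endpoint is an assumption)
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout")
with tracer.start_as_current_span("apply-discounts") as span:
    span.set_attribute("cart.items", 10000)  # custom business attribute
```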
Deploy the OTel Collector as a DaemonSet in Kubernetes. Each node's collector receives telemetry from local pods, processes it (batching, filtering, sampling), and exports to your backends. This architecture keeps egress traffic efficient and gives you a central point for sampling decisions.
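One way to express this, assuming you run the OpenTelemetry Operator (the resource name and gateway endpoint below are assumptions):

```yaml
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: node-agent
spec:
  mode: daemonset              # one collector pod per node
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}             # local pods send OTLP here
    processors:
      memory_limiter:          # protect the node from telemetry bursts
        check_interval: 1s
        limit_percentage: 80
        spike_limit_percentage: 25
      batch: {}                # batch before export to keep egress efficient
    exporters:
      otlphttp:
        endpoint: https://otel-gateway.observability.svc:4318
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlphttp]
```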
Cost Management at Scale
Observability cost management boils down to three strategies: sampling, aggregation, and retention. For traces, do the sampling in the Collector: a probabilistic policy keeps 1-10% of normal traffic, while tail-based policies keep 100% of errors and slow requests. (Head-based sampling alone can't deliver this, because it decides before knowing whether a trace will error or run slow.) This gives you statistical accuracy for dashboards while keeping full fidelity for debugging.
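With the Collector's tail_sampling processor, that policy reads roughly like this (thresholds and percentages echo the numbers above rather than recommendations; a trace is kept if any policy matches):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s                 # buffer spans before deciding per trace
    policies:
      - name: keep-all-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow-requests
        type: latency
        latency: {threshold_ms: 2000}  # "slow" threshold is illustrative
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}  # the 1-10% band from above
```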
For logs, stop logging at DEBUG level in production. Set default log levels to WARN with the ability to dynamically increase to INFO or DEBUG for specific services during incidents. Structured logging (JSON) with consistent fields makes logs queryable without processing overhead.
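A minimal sketch of structured JSON logging with Python's standard library (the field names and the ctx convention are assumptions, not a standard):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with consistent field names."""
    def format(self, record):
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            **getattr(record, "ctx", {}),  # merge structured context, if any
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.WARNING, handlers=[handler])  # WARN default

log = logging.getLogger("checkout")
log.warning("cart size exceeded threshold", extra={"ctx": {"cart_items": 10000}})
```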
For metrics, set retention tiers. High-resolution (15-second intervals) for 7 days, downsampled to 1-minute for 30 days, 5-minute for 1 year. Most dashboards work fine with 1-minute resolution. Only real-time alerting needs 15-second data.
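In VictoriaMetrics those tiers map to startup flags along these lines. Note that downsampling is an enterprise feature, per the licensing section above, and the exact flag syntax is worth checking against the current docs:

```
-retentionPeriod=1y
-downsampling.period=7d:1m     # data older than 7 days: 1-minute resolution
-downsampling.period=30d:5m    # data older than 30 days: 5-minute resolution
```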
Stream Processing and How Grafana Queries Aggregated Data
If you use a stream processor like Apache Flink for real-time aggregation, the data flow works like this: Flink sits between ingestion and storage. It reads raw metrics from the OTel Collector (or directly from Kafka), computes rolling aggregates, and writes the results back into VictoriaMetrics as new metric series. For example, Flink computes a 1-minute rolling average response time and writes it as http_response_time_avg_1m{service="checkout"}. The raw metric http_response_time_seconds still flows into VictoriaMetrics separately.
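A sketch of that aggregation in PyFlink (the source and sink tables are hypothetical and assumed to be registered elsewhere; there is no official VictoriaMetrics connector, so the sink would typically go through Kafka or a remote-write bridge):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Compute a per-minute average per service and emit it as a new series.
t_env.execute_sql("""
    INSERT INTO metrics_sink
    SELECT
        'http_response_time_avg_1m' AS metric_name,
        service,
        TUMBLE_END(event_time, INTERVAL '1' MINUTE) AS ts,
        AVG(response_time_seconds) AS value
    FROM raw_http_metrics
    GROUP BY service, TUMBLE(event_time, INTERVAL '1' MINUTE)
""")
```

A tumbling window emits one point per minute; if you need a true sliding average, a HOP window does the same job at higher output volume.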
Grafana queries VictoriaMetrics either way. For a dashboard showing the last 6 hours of average response time, it reads the pre-aggregated _avg_1m metric. Fast query, low storage scan. For an ad-hoc investigation into the last 15 minutes of a specific endpoint, it queries the raw metric directly for full-resolution data.
There is no automatic routing. The person building the dashboard picks the metric name in each panel's PromQL query. A 6-hour ops dashboard panel uses the pre-aggregated metric: avg(http_response_time_avg_1m{service="checkout"}). An incident investigation panel queries raw data for full resolution: histogram_quantile(0.99, rate(http_response_time_seconds_bucket{service="checkout"}[5m])). Same VictoriaMetrics instance, same Grafana data source, different metric name in the query box. No magic, just a naming convention.
Standardized Dashboards
Every service should launch with a pre-configured golden signals dashboard covering request rate, error rate, latency percentiles (p50, p95, p99), and resource saturation (CPU, memory, connections). It should be auto-generated from service metadata. When a team registers a service in Backstage, the observability platform creates the dashboard automatically. No manual Grafana configuration. Custom dashboards come later for service-specific needs, but the golden signals dashboard is the starting point for every incident.
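A sketch of what that generation step can look like (the metric names and the trimmed-down Grafana dashboard JSON are assumptions; a real generator would also set layout, thresholds, and data source IDs):

```python
import json

def golden_signal_dashboard(service: str) -> dict:
    """Build minimal Grafana dashboard JSON for one service's golden signals."""
    panels = [
        ("Request rate",
         f'sum(rate(http_requests_total{{service="{service}"}}[5m]))'),
        ("Error rate",
         f'sum(rate(http_requests_total{{service="{service}",code=~"5.."}}[5m]))'),
        ("p99 latency",
         f'histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket'
         f'{{service="{service}"}}[5m])) by (le))'),
        ("CPU saturation",
         f'sum(rate(container_cpu_usage_seconds_total{{pod=~"{service}-.*"}}[5m]))'),
    ]
    return {
        "title": f"{service}: golden signals",
        "panels": [
            {"id": i, "title": title, "type": "timeseries",
             "targets": [{"expr": expr}]}
            for i, (title, expr) in enumerate(panels, start=1)
        ],
    }

print(json.dumps(golden_signal_dashboard("checkout"), indent=2))
```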
Key Points
- Two-tier instrumentation: eBPF (Grafana Beyla) provides baseline RED metrics and trace spans for every service with zero code changes, while the OTel SDK adds depth for custom business metrics and profiling
- Value-based data routing in the OTel Collector pipeline drops health checks, samples debug noise, and enriches with ownership metadata before storage — reducing costs 15-30% (see the sketch after this list)
- Four pillars (metrics, traces, logs, profiles) with pillar-native storage: VictoriaMetrics, Tempo, VictoriaLogs, Pyroscope — each optimized for its signal's access pattern
- Profile-to-trace correlation is the key differentiator: click from a slow trace span to a CPU flame graph showing exactly which function is the bottleneck
- ML anomaly detection supplements SLO burn rates for alerting — ML goes to Slack for awareness, burn rates page via PagerDuty
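The value-based routing above is ordinary Collector pipeline configuration. A sketch of the drop-and-enrich steps (routes, severity cutoff, and the team value are illustrative; a real pipeline would derive ownership from service metadata rather than hard-coding it):

```yaml
processors:
  filter/drop-noise:
    error_mode: ignore
    traces:
      span:
        - 'attributes["url.path"] == "/healthz"'    # drop health-check spans
    logs:
      log_record:
        - 'severity_number < SEVERITY_NUMBER_INFO'  # drop debug-level noise
  attributes/ownership:
    actions:
      - key: team          # enrich before storage so queries can group by owner
        value: payments
        action: insert
```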
Common Mistakes
- Collecting everything at full resolution and getting a $200k/month Datadog bill before realizing that 90% of the data is never queried. Value-based routing fixes this at the pipeline level
- Treating observability as a project with an end date instead of an ongoing platform capability that evolves with the organization
- Stopping at three pillars and missing profiling — without profiles, the investigation ends at 'this service is slow' instead of 'this function is the bottleneck'
- Not correlating signals: metrics to traces to profiles to logs. Each click should deepen the investigation, not require switching tools