Grafana Pyroscope
S3-native continuous profiling with span-to-profile correlation and flame graph visualization
Why It Exists
Distributed tracing shows which service is slow. Metrics show how slow. But neither reveals why the code is slow. A trace shows "checkout-service took 500ms on the database span." The engineer's next step is to reproduce the issue locally, attach a profiler, run load tests, and hope the same code path triggers. That process takes hours.
Continuous profiling eliminates the reproduction step. Pyroscope captures CPU, memory, goroutine, and mutex profiles from production at regular intervals (every 10 seconds by default). When a trace shows a slow span, the engineer clicks through to the flame graph and sees the exact functions consuming time: regexp.Compile called 10,000 times per request, runtime.mallocgc dominating due to excessive allocations, sync.Mutex.Lock showing connection pool contention.
Grafana Labs acquired the original open-source Pyroscope project and integrated it into the Grafana ecosystem alongside Tempo (traces), Loki (logs), and Mimir (metrics). The result is a unified investigation flow where each click deepens the analysis: metric spike → trace waterfall → flame graph → log context.
How It Works
Profile Collection: The profiling SDK (OTel profile signal, Pyroscope SDK, or async-profiler for Java) takes periodic snapshots of the application's execution state. A CPU profile captures stack traces at regular intervals (typically every 10ms), counting how many samples land on each function. After the snapshot window (default 10 seconds), the SDK serializes the samples into pprof format and sends them to the OTel Collector.
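For a Go service, continuous profiling is typically enabled once at startup. A minimal sketch, assuming the github.com/grafana/pyroscope-go SDK and a reachable Pyroscope (or collector) endpoint; the application name and address are placeholders:

```go
package main

import (
	"log"

	"github.com/grafana/pyroscope-go"
)

func main() {
	// Start the continuous profiler. The SDK samples the process and uploads
	// a pprof payload for each snapshot window (~10 seconds by default).
	_, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "checkout-service",      // placeholder service name
		ServerAddress:   "http://pyroscope:4040", // placeholder; could also be an OTel Collector that forwards profiles
		ProfileTypes: []pyroscope.ProfileType{
			pyroscope.ProfileCPU,        // start with CPU only (see Best Practices)
			pyroscope.ProfileAllocSpace, // add memory profiling once the team reads flame graphs comfortably
		},
	})
	if err != nil {
		log.Fatalf("failed to start profiler: %v", err)
	}

	// ... application code ...
}
```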
pprof Format: Each profile contains three key structures. Samples are stack traces with associated values (CPU time in nanoseconds, allocation bytes, lock wait duration). Locations map stack frames to functions, files, and line numbers. Labels attach metadata like service_name, span_id, pod_name, and profile type. The span_id label is what enables trace-to-profile correlation.
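A short sketch of what those structures look like when read back with the github.com/google/pprof/profile package; the file name is a placeholder for any captured profile:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/google/pprof/profile"
)

func main() {
	f, err := os.Open("cpu.pprof") // placeholder: a pprof file captured by the SDK
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	p, err := profile.Parse(f) // handles gzip-compressed pprof transparently
	if err != nil {
		log.Fatal(err)
	}

	for _, s := range p.Sample {
		// Values: e.g. sample count and CPU nanoseconds for a CPU profile.
		fmt.Println("values:", s.Value)

		// Labels: metadata such as span_id attached by the profiling SDK.
		if ids, ok := s.Label["span_id"]; ok {
			fmt.Println("span_id:", ids)
		}

		// Locations: stack frames resolved to functions, files, and line numbers.
		for _, loc := range s.Location {
			for _, line := range loc.Line {
				if line.Function != nil {
					fmt.Printf("  %s (%s:%d)\n", line.Function.Name, line.Function.Filename, line.Line)
				}
			}
		}
	}
}
```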
Ingestion Path: Profiles arrive at the OTel Collector DaemonSet via OTLP (experimental) or Pyroscope SDK push. The collector applies the same pipeline processing as other signals: value-based routing (gold-tier services keep all profiles, bronze-tier may be sampled), metadata enrichment (team ownership, cost center), and batching. Profiles then flow to Kafka (profiles-raw topic) for buffering.
Storage: Pyroscope ingesters consume from Kafka, buffer profiles in memory, and periodically flush them to S3 as compressed blocks. Each block covers a time window and contains profiles indexed by service name, profile type, and time range. A compactor runs in the background merging small blocks into larger ones — identical to how Tempo handles trace blocks.
Architecture Deep Dive
Ingester Sizing: Ingesters are the stateful component. Each ingester buffers incoming profiles in memory before flushing to S3. At 50K profiles/sec across 10 ingester nodes, each handles 5K profiles/sec. At 100 KB average per profile, that is 500 MB/sec per ingester in memory throughput. With a 30-second flush interval, each ingester holds approximately 15 GB of profile data in RAM before flushing. Size instances at 64 GB RAM minimum (r6g.2xlarge or equivalent) to handle peak traffic with headroom.
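The same sizing arithmetic as a small sketch; the inputs mirror the figures above and are planning assumptions, not measurements:

```go
package main

import "fmt"

func main() {
	const (
		profilesPerSec   = 50_000     // cluster-wide ingest rate (assumption)
		ingesters        = 10         // ingester node count
		avgProfileBytes  = 100 * 1024 // ~100 KB per profile (assumption)
		flushIntervalSec = 30         // seconds buffered before flushing to S3
	)

	perIngesterRate := profilesPerSec / ingesters    // 5K profiles/sec per ingester
	bytesPerSec := perIngesterRate * avgProfileBytes // ~500 MB/sec in-memory throughput
	bufferedBytes := bytesPerSec * flushIntervalSec  // ~15 GB held in RAM between flushes

	fmt.Printf("per-ingester: %d profiles/sec, %.1f MB/sec, %.1f GB buffered\n",
		perIngesterRate,
		float64(bytesPerSec)/(1024*1024),
		float64(bufferedBytes)/(1024*1024*1024))
}
```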
S3 Block Structure:
```
s3://pyroscope-profiles/
  <tenant-id>/
    <block-id>/
      profiles.parquet   # Profile data in columnar format
      index.tsdb         # Time-series index (service + type + time)
      meta.json          # Block metadata (time range, profile count)
```
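A hedged sketch of browsing that layout with the AWS SDK for Go v2; the bucket, object key, and meta.json field names are illustrative assumptions rather than Pyroscope's exact schema:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

// blockMeta holds a few illustrative fields; the real meta.json schema may differ.
type blockMeta struct {
	MinTime      int64 `json:"minTime"`
	MaxTime      int64 `json:"maxTime"`
	ProfileCount int64 `json:"profileCount"`
}

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	client := s3.NewFromConfig(cfg)

	// Fetch one block's metadata (bucket, tenant, and block ID are placeholders).
	out, err := client.GetObject(ctx, &s3.GetObjectInput{
		Bucket: aws.String("pyroscope-profiles"),
		Key:    aws.String("tenant-1/<block-id>/meta.json"),
	})
	if err != nil {
		log.Fatal(err)
	}
	defer out.Body.Close()

	var meta blockMeta
	if err := json.NewDecoder(out.Body).Decode(&meta); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("block covers [%d, %d], %d profiles\n", meta.MinTime, meta.MaxTime, meta.ProfileCount)
}
```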
Compactor: The compactor merges small blocks into larger, better-compressed blocks and enforces retention policies. Without compaction, the block count grows unbounded, and query latency degrades as the query frontend must scan more blocks. Monitor pyroscope_compactor_outstanding_blocks and alert when it exceeds 2x the expected count.
Query Path: When a user views a flame graph in Grafana, the query specifies service name, profile type (CPU, memory, mutex), time range, and optionally span_id. The query frontend fans out to read relevant blocks from S3, merges the profile samples, and returns a flame graph where each node's width represents its share of total samples. Differential queries compare two time ranges and highlight changes.
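A hedged sketch of issuing such a query from Go over HTTP; the /pyroscope/render endpoint, the profile-type/selector syntax, and the host are assumptions that may differ per deployment and version:

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/url"
	"time"
)

func main() {
	now := time.Now()
	q := url.Values{}
	// Profile type plus label selector; a span_id label could be added to the
	// selector to narrow the flame graph to one span's execution window.
	q.Set("query", `process_cpu:cpu:nanoseconds:cpu:nanoseconds{service_name="checkout-service"}`)
	q.Set("from", fmt.Sprint(now.Add(-15*time.Minute).Unix()))
	q.Set("until", fmt.Sprint(now.Unix()))
	q.Set("format", "json")

	// Host and path are placeholders for the query frontend.
	resp, err := http.Get("http://pyroscope:4040/pyroscope/render?" + q.Encode())
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("status=%s, %d bytes of flame graph data\n", resp.Status, len(body))
}
```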
Profile-to-Trace Correlation
This is the feature that makes Pyroscope more than just another profiling tool. The correlation works through shared span_id labels; the steps below and the sketch that follows them walk through it:
- The OTel SDK records a trace span with span_id: "abc123" and trace_id: "xyz789".
- During that span's execution window, the profiling SDK captures a CPU snapshot. It labels the profile with span_id: "abc123" in the pprof metadata.
- The profile flows through the OTel Collector to Pyroscope with the span_id label preserved.
- In Grafana, when viewing the trace waterfall in Tempo, the "Profiles" tab on a span queries Pyroscope: service_name=checkout AND span_id=abc123 AND time_range=[span_start, span_end].
- The flame graph renders, showing exactly which functions consumed CPU time during that specific span.
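A minimal sketch of wiring this up in a Go service, assuming the github.com/grafana/otel-profiling-go wrapper (which exposes the active span's span_id to the profiler as a pprof label) alongside the Pyroscope Go SDK; names and endpoints are placeholders:

```go
package main

import (
	"log"

	otelpyroscope "github.com/grafana/otel-profiling-go"
	"github.com/grafana/pyroscope-go"
	"go.opentelemetry.io/otel"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	// 1. Start continuous profiling (endpoint and name are placeholders).
	if _, err := pyroscope.Start(pyroscope.Config{
		ApplicationName: "checkout-service",
		ServerAddress:   "http://pyroscope:4040",
	}); err != nil {
		log.Fatal(err)
	}

	// 2. Wrap the tracer provider so samples taken while a span is active
	//    carry that span's span_id as a pprof label.
	tp := sdktrace.NewTracerProvider( /* exporters, samplers, ... */ )
	otel.SetTracerProvider(otelpyroscope.NewTracerProvider(tp))

	// 3. From here on, spans created via otel.Tracer(...) are correlated:
	//    Tempo's "Profiles" tab can query Pyroscope by span_id and time range.
}
```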
What this reveals that traces alone cannot:
| Trace Shows | Profile Shows (the "why") |
|---|---|
| checkout-service span: 500ms | regexp.Compile called 10,000 times in validateCart() — compiling the same regex per request |
| payment-service span: 300ms | 60% time in runtime.mallocgc — excessive allocations causing GC pressure |
| db-proxy span: 200ms | 45% time in sync.Mutex.Lock — lock contention on connection pool |
| auth-service span: 150ms | 80% actual I/O wait (expected) — downstream dependency is slow, no code issue |
Without profiles, the investigation stops at "checkout is slow." With profiles, the root cause is visible in seconds.
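The regexp.Compile row above is a typical flame graph finding. A sketch of the before/after in Go; validateCart and the pattern are illustrative, not taken from a real codebase:

```go
package main

import "regexp"

// Before: the flame graph shows regexp.Compile dominating validateCart,
// because the same pattern is recompiled on every request.
func validateCartSlow(sku string) bool {
	re := regexp.MustCompile(`^[A-Z]{3}-\d{6}$`) // recompiled per call: shows up as a wide frame
	return re.MatchString(sku)
}

// After: compile once at package init; the regexp.Compile frame disappears
// from the flame graph and the span duration drops accordingly.
var skuPattern = regexp.MustCompile(`^[A-Z]{3}-\d{6}$`)

func validateCartFast(sku string) bool {
	return skuPattern.MatchString(sku)
}

func main() {
	_ = validateCartSlow("ABC-123456")
	_ = validateCartFast("ABC-123456")
}
```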
Capacity Planning
```
Profile volume:
  500K services × 50% adoption = 250K services profiled
  10-second snapshot interval: 250K / 10 = 25K profiles/sec (conservative)
  Full-adoption target: 50K profiles/sec

Per-profile size:
  CPU profile: 50-200 KB compressed (avg 100 KB)
  Memory profile: 30-100 KB compressed
  Average: ~100 KB across types

Ingestion rate:
  50K profiles/sec × 100 KB = 5 GB/sec raw
  Pyroscope dedup + compression: ~5x reduction
  Stored: ~1 GB/sec = 86 TB/day

Tiered storage (90-day retention):
  Hot (7 days): 86 TB × 7 = 602 TB on S3 Standard
  Warm (83 days): 86 TB × 83 = 7.1 PB on S3 Standard-IA
  Total: ~7.7 PB

Pyroscope cluster:
  Ingesters: 10 nodes (r6g.2xlarge, 8 vCPU, 64 GB RAM)
  Query frontend: 5 nodes (c6g.2xlarge)
  Compactor: 3 nodes (c6g.xlarge)
  Kafka: +2 brokers for profiles-raw topic
```
Best Practices
Start with CPU profiling only. CPU profiles are the easiest to interpret and cover the most common performance issues (hot functions, algorithmic inefficiency). Add memory, goroutine, and mutex profiling once the team is comfortable reading flame graphs.
Always configure span-to-profile correlation. Standalone flame graphs show global hotspots. Flame graphs linked to trace spans show why a specific request was slow. The span_id label in pprof metadata is what makes this work — verify it is set in the SDK configuration.
Set retention policies from day one. Profiles are large and cannot be downsampled. Keep 7 days on S3 Standard for active investigation, 83 days on S3 Standard-IA for post-incident reviews, and drop after 90 days. Most profiling investigations happen within 48 hours.
Use differential flame graphs for deploy validation. After each deployment, compare the flame graph from the previous version's time range against the new version. Functions highlighted in red got slower; functions in green got faster. This catches performance regressions before they trigger alerts.
Monitor ingester memory utilization and alert at 70%. Ingester OOM is the most common Pyroscope failure mode. At 15 GB of buffered profiles per ingester on a 64 GB RAM instance, utilization should hover around 25-30%. A traffic spike that doubles profile volume pushes it to 50-60%. Alerting at 70% leaves enough headroom to scale out or shed load before ingesters start OOMing.
Pros
- S3-native storage means profile cost scales with object storage pricing. Petabytes of profiles at a fraction of local-disk cost
- Span-to-profile correlation via shared span_id labels in pprof data. Click from a Tempo trace span to the CPU flame graph for that exact execution window
- Native Grafana integration. Flame graph panel is built-in. Differential flame graphs compare two time ranges to spot regressions
- pprof format is the industry standard. Go, Java (async-profiler), Python (py-spy), Ruby, Rust, and .NET all produce pprof-compatible output
- Same architecture as Tempo: ingesters buffer in memory, flush to S3, compactors merge blocks. One operational model for traces and profiles
- OTel profile signal support (experimental) means profiles flow through the same OTel Collector fleet as metrics, traces, and logs
Cons
- Profiles require SDK-level instrumentation. eBPF (Grafana Beyla) does not produce profiles — only RED metrics and basic trace spans
- Per-profile size is large (50-200 KB per snapshot) compared to metrics (~2 bytes per sample) or trace spans (~1 KB per span). Storage adds up at scale
- Ingester memory consumption is significant. Buffering 50K profiles/sec at 100 KB average requires careful node sizing
- OTel profile signal is still experimental as of early 2026. Most deployments use the Pyroscope SDK or async-profiler agent directly
- Flame graph interpretation requires performance engineering skills. Without training, teams may struggle to act on profile data
When to use
- You already run Grafana + Tempo and want profile-to-trace correlation in the same UI
- Debugging production performance issues requires knowing which function is the bottleneck, not just which service
- S3-native storage with automatic lifecycle tiering is a requirement for cost control
- Multiple language runtimes need profiling under a single system (Go, Java, Python, Ruby)
When NOT to use
- You only need RED metrics and basic traces. Grafana Beyla covers that without profiles
- Your services run on Windows or platforms without pprof-compatible profilers
- Budget does not allow the incremental S3 storage cost for continuous profiles at scale
- Team lacks performance engineering skills to interpret flame graphs (invest in training first)
Key Points
- Pyroscope stores profile data as blocks on S3, following the same architecture as Grafana Tempo. Ingesters buffer incoming profiles in memory, flush them to S3 as compressed blocks, and a compactor merges small blocks into larger ones. This means teams already running Tempo understand Pyroscope's operational model immediately
- Span-to-profile correlation is the killer feature. The OTel SDK records a trace span with span_id abc123. During that span's execution, the profiling SDK captures a CPU snapshot and labels it with the same span_id. In Grafana, the Profiles tab on a trace span queries Pyroscope for profiles matching that span_id and time window. The result: a flame graph showing exactly which functions consumed CPU during the slow span
- pprof is the standard profile format, originally from Go but now supported across languages. A pprof file contains stack traces with sample counts and values (CPU time, allocations, lock wait time). Each sample maps to a location (function, file, line number). Pyroscope stores pprof natively without format conversion
- Profile types cover different performance dimensions: CPU profiles show where compute time is spent, heap/memory profiles show allocation hotspots and GC pressure, goroutine profiles (Go) show concurrency bottlenecks, and mutex profiles show lock contention (for Go, see the runtime sketch after this list). Each type answers a different 'why is this slow?' question
- Differential flame graphs compare two time ranges side by side, highlighting functions that got slower (red) or faster (green) between deployments. This turns 'did the last deploy cause a regression?' from a manual investigation into a single visual comparison
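For the Go-specific profile types above, note that the runtime records mutex and block events only when sampling is switched on explicitly; a minimal sketch, with illustrative sampling rates:

```go
package main

import "runtime"

func init() {
	// Without these calls, Go's mutex and block profiles stay empty,
	// regardless of what the profiling SDK requests.
	runtime.SetMutexProfileFraction(5) // sample ~1 in 5 mutex contention events (illustrative rate)
	runtime.SetBlockProfileRate(5)     // aim for ~1 sample per 5 ns spent blocked (illustrative rate)
}

func main() {
	// ... start the profiling SDK and the application as usual ...
}
```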
Common Mistakes
- ✗ Expecting Grafana Beyla to produce profiles. Beyla uses eBPF for network-level instrumentation (HTTP/gRPC/SQL) and generates RED metrics and basic trace spans. CPU, memory, and mutex profiles require SDK-level instrumentation because they need stack trace sampling inside the application process, not just network call observation
- ✗ Setting profile interval too aggressively. A 1-second CPU profile interval generates 10x the data of a 10-second interval with minimal additional insight. The default 10-second interval captures enough samples for statistically meaningful flame graphs. Only reduce the interval for short-lived performance investigations
- ✗ Not sizing Pyroscope ingesters for peak load. At 50K profiles/sec with 100 KB average, each of the 10 ingesters buffers ~500 MB/sec. With a 30-second flush interval, each ingester holds ~15 GB of profiles in RAM. OOM during traffic spikes is the most common failure mode
- ✗ Storing profiles forever at full resolution. Unlike metrics, profiles cannot be downsampled meaningfully. Set retention policies: 7 days on S3 Standard, 83 days on S3 Standard-IA, drop after 90 days. Most profile investigations happen within 48 hours of an incident
- ✗ Ignoring the four-click investigation flow. Profiles are most valuable when correlated to traces. A standalone flame graph shows what is slow globally. A flame graph linked to a specific trace span shows why that specific request was slow. Always configure span-to-profile correlation via span_id labels