Grafana Tempo
S3-native distributed tracing with no index to maintain and TraceQL for structural queries
Why It Exists
Before Tempo, distributed tracing at scale meant running Elasticsearch or Cassandra as the storage backend for Jaeger. Elasticsearch requires JVM tuning, shard management, ILM policies, and a dedicated team to operate. Cassandra requires topology planning, compaction tuning, and consistent hash ring management. Both are expensive to run at petabyte scale.
Grafana Labs built Tempo around a single insight: traces are write-heavy, read-light, and almost always queried by trace ID. If 90% of trace lookups are by a known trace ID (clicked from a metric exemplar or a log line), a full inverted index is unnecessary. A bloom filter per block is enough to locate the right Parquet file on S3. The remaining 10% of queries (search by attribute) are slower, but the operational and cost savings from eliminating the index are massive.
The result is a trace backend where storage cost scales with S3 pricing (~$0.023/GB/month for Standard, ~$0.004/GB/month for Glacier Instant Retrieval), not with compute cluster sizing. A petabyte of traces on Glacier Instant Retrieval costs ~$4,000/month. The same volume on Elasticsearch would require dozens of data nodes at 10-50x the cost.
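A quick sanity check of those figures (a back-of-the-envelope sketch in Python using only the per-GB list prices quoted above; request, transfer, and compute charges are ignored):

```python
# Storage-only cost for 1 PB of traces at the per-GB prices quoted above.
PB_IN_GB = 1_000_000

s3_standard_per_gb = 0.023   # $/GB/month
glacier_ir_per_gb = 0.004    # $/GB/month

print(f"S3 Standard:               ${PB_IN_GB * s3_standard_per_gb:>9,.0f}/month")  # ~$23,000
print(f"Glacier Instant Retrieval: ${PB_IN_GB * glacier_ir_per_gb:>9,.0f}/month")   # ~$4,000
```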
How the Storage Engine Works
Tempo writes trace data as Apache Parquet columnar files on S3. Each file is called a "block" and contains all spans ingested during a time window.
Why Parquet? Parquet stores data column-by-column, not row-by-row. A trace span has dozens of attributes (service_name, operation, duration, http.method, http.status_code, custom attributes). A query filtering on service_name reads only the service_name column from the Parquet file, skipping all other columns entirely. This column pruning reduces the bytes read from S3 by 10-50x compared to reading full rows.
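The effect is easy to demonstrate with a generic Parquet library. A minimal sketch using pyarrow (a toy file with made-up span attributes, not Tempo's actual schema or reader):

```python
# Column pruning with pyarrow: read_table() with `columns=` fetches only the
# requested column chunks; the other columns in the file are never touched.
import pyarrow as pa
import pyarrow.parquet as pq

# A toy "block" with a handful of span attributes (hypothetical schema).
spans = pa.table({
    "trace_id":         ["abc123", "abc123", "def456"],
    "service_name":     ["checkout", "payments", "checkout"],
    "duration_ns":      [1_200_000, 85_000_000, 430_000],
    "http_status_code": [200, 500, 200],
})
pq.write_table(spans, "block.parquet", compression="zstd")

# A query filtering on service_name only needs that one column.
names = pq.read_table("block.parquet", columns=["service_name"])
print(names.column("service_name").to_pylist())
```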
Block structure on S3:
```
s3://tempo-traces/
  single-tenant/
    <block-id>/
      data.parquet    # Span data in columnar format
      bloom-0.bloom   # Bloom filter shard (trace IDs)
      bloom-1.bloom   # Bloom filter shard
      meta.json       # Block metadata (time range, span count, size)
```
Bloom filter lookup flow:
1. A user queries `{traceid="abc123"}`
2. Tempo reads `meta.json` for all blocks in the queried time range
3. For each block, Tempo checks the bloom filter: "Could trace abc123 be in this block?"
4. The bloom filter says "definitely not" for 9,997 out of 10,000 blocks → those blocks are skipped
5. The bloom filter says "maybe" for 3 blocks → Tempo reads `data.parquet` from S3 for those 3
6. Parquet column pruning: only the columns needed for the response are read
7. The matching trace is returned
The bloom filter eliminates 99.97% of S3 reads in this example. The false positive rate depends on block size and bloom filter configuration. Well-compacted blocks (fewer, larger blocks) have more efficient bloom filters.
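The same skip-before-fetch logic can be sketched with a toy bloom filter (purely illustrative; the block names, filter size, and hash count are made up and much smaller than Tempo's real filters):

```python
# A tiny bloom filter per "block": lookups only fetch blocks whose filter
# answers "maybe"; a "definitely not" answer skips the S3 GET entirely.
import hashlib

class TinyBloom:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size, self.num_hashes, self.bits = size_bits, num_hashes, 0

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def maybe_contains(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# One filter per block, built from the trace IDs that block contains.
blocks = {}
for n in range(10_000):
    bf = TinyBloom()
    bf.add(f"trace-{n}")                 # pretend each block holds one trace
    blocks[f"block-{n:05d}"] = bf

# Only blocks whose filter says "maybe" would be read from S3.
target = "trace-42"
candidates = [bid for bid, bf in blocks.items() if bf.maybe_contains(target)]
print(f"{len(candidates)} of {len(blocks)} blocks need an S3 read")
```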
Ingester and Write Path
The write path follows this sequence:
1. The distributor receives spans via OTLP (gRPC or HTTP) from OTel Collectors
2. The distributor hashes each span's `trace_id` and routes it to the correct ingester (consistent hashing ring)
3. The ingester appends the span to an in-memory buffer and writes it to the local WAL (Write-Ahead Log) on SSD
4. When the buffer reaches the flush threshold (configurable: time-based or size-based), the ingester:
   - Sorts spans by `trace_id`
   - Builds a Parquet file with columnar encoding
   - Constructs bloom filter shards for all trace IDs in the block
   - Uploads `data.parquet`, the bloom filter files, and `meta.json` to S3
   - Deletes the corresponding WAL segments
The trace_id-based routing in the distributor ensures all spans from a single trace arrive at the same ingester. This means the ingester has the complete trace (or most of it) when it builds the Parquet block, producing better compression and more accurate bloom filters.
WAL replay on crash: If an ingester crashes before flushing, the WAL on local SSD survives. On restart, the ingester replays the WAL, reconstructs the in-memory buffer, and flushes normally. No spans are lost unless the local SSD itself fails (mitigated by running multiple ingesters with replication factor 2-3).
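A compressed sketch of that ingest-buffer-flush-replay cycle, with local directories standing in for the SSD WAL and the S3 bucket (the paths, threshold, and span fields are hypothetical, and pyarrow stands in for Tempo's own Parquet writer):

```python
# WAL-then-flush: every span goes to the in-memory buffer AND the WAL, so a
# crash before flush loses nothing that replay_wal() cannot rebuild.
import json, os, uuid
import pyarrow as pa
import pyarrow.parquet as pq

WAL_DIR, S3_DIR = "wal", "s3-bucket"     # stand-ins for local SSD and S3
os.makedirs(WAL_DIR, exist_ok=True)
os.makedirs(S3_DIR, exist_ok=True)

buffer = []                              # in-memory span buffer
FLUSH_THRESHOLD = 1000                   # real Tempo also flushes on a timer

def ingest(span: dict):
    """Append the span to the buffer and to the WAL segment on 'SSD'."""
    buffer.append(span)
    with open(f"{WAL_DIR}/segment.jsonl", "a") as wal:
        wal.write(json.dumps(span) + "\n")
    if len(buffer) >= FLUSH_THRESHOLD:
        flush()

def flush():
    """Sort by trace_id, write one columnar block, upload, drop the WAL."""
    buffer.sort(key=lambda s: s["trace_id"])
    block = pa.table({key: [s[key] for s in buffer] for key in buffer[0]})
    pq.write_table(block, f"{S3_DIR}/{uuid.uuid4().hex}.parquet", compression="zstd")
    buffer.clear()
    os.remove(f"{WAL_DIR}/segment.jsonl")

def replay_wal():
    """On restart, rebuild the buffer from any WAL segment left behind."""
    path = f"{WAL_DIR}/segment.jsonl"
    if os.path.exists(path):
        with open(path) as wal:
            buffer.extend(json.loads(line) for line in wal)
```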
Compactor
The compactor is Tempo's background maintenance process. Without it, query performance degrades over time.
What it does:
- Merges small blocks. Frequent ingester flushes create many small Parquet files on S3. The compactor reads multiple small blocks, merges them into a single larger block, and deletes the originals. Fewer blocks = fewer bloom filter checks per query = faster queries.
- Rebuilds bloom filters. Merged blocks get new, more efficient bloom filters. A bloom filter covering 1 million traces in one block is more space-efficient than 10 bloom filters covering 100K traces each.
- Enforces retention. Blocks older than the configured retention period are deleted from S3.
- Manages block lifecycle. Tracks block state (active, compacted, deleted) in a per-tenant block list stored on S3.
Compactor sizing: The compactor reads blocks from S3, merges them in memory, and writes the result back to S3. Memory usage scales with block size and concurrency. A single compactor handling a few hundred GB of daily traces needs 4-8 GB RAM and 2-4 CPU cores. At petabyte scale, run multiple compactor instances with sharded tenants.
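The core merge step can be sketched in a few lines, reusing the toy local-directory layout from the ingester sketch above (the smallest-first selection and the eight-block cap are arbitrary choices for illustration, not Tempo's actual compaction strategy):

```python
# Compaction sketch: merge the smallest blocks into one larger, re-sorted
# block, then delete the originals. Fewer blocks = fewer filter checks, fewer GETs.
import glob, os, uuid
import pyarrow as pa
import pyarrow.parquet as pq

S3_DIR = "s3-bucket"            # same stand-in bucket as the ingester sketch
MAX_INPUTS = 8                  # arbitrary cap on blocks merged per pass

def compact_once():
    blocks = sorted(glob.glob(f"{S3_DIR}/*.parquet"), key=os.path.getsize)
    blocks = blocks[:MAX_INPUTS]
    if len(blocks) < 2:
        return                                   # nothing worth merging
    merged = pa.concat_tables([pq.read_table(b) for b in blocks])
    merged = merged.sort_by("trace_id")          # keep trace IDs clustered
    out_path = f"{S3_DIR}/{uuid.uuid4().hex}-compacted.parquet"
    pq.write_table(merged, out_path, compression="zstd")
    for b in blocks:                             # originals are now redundant
        os.remove(b)
```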
TraceQL
TraceQL is Tempo's query language for trace analysis. It operates on spans and traces, not on time series.
Basic attribute queries:
```
# Find all traces from the checkout service
{resource.service.name = "checkout"}

# Find error spans longer than 2 seconds
{status = error && duration > 2s}

# Find database spans with slow queries
{span.db.system = "postgresql" && duration > 500ms}
```
Structural queries (unique to TraceQL):
```
# Find traces where a parent HTTP span contains a slow child DB span
{span.http.method = "POST"} >> {span.db.system = "postgresql" && duration > 1s}

# Find traces with more than 20 spans (complex request flows)
{ } | count() > 20
```
The >> operator means "is an ancestor of" — this finds traces where an HTTP POST span eventually calls a PostgreSQL query that takes over 1 second. No other trace query language supports this kind of structural query. Jaeger supports tag-based search. VictoriaTraces supports LogsQL flat field search. Only TraceQL can express parent-child span relationships.
Query execution: The query frontend receives a TraceQL query, splits it by time range, and fans it out to querier pods. Each querier checks bloom filters, reads relevant Parquet blocks from S3, evaluates the query against the columnar data, and returns matching spans. The query frontend merges results from all queriers.
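A minimal sketch of that fan-out and merge, with in-memory dicts standing in for meta.json and the Parquet blocks (the block metadata, span fields, and match predicate are all hypothetical):

```python
# Query fan-out: split the time range, let each "querier" scan only blocks
# whose metadata overlaps its sub-range, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor

def split_range(start, end, parts):
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]

def querier(sub_range, blocks, predicate):
    lo, hi = sub_range
    hits = []
    for block in blocks:
        if block["max_time"] < lo or block["min_time"] > hi:
            continue                 # meta.json says out of range: skip block
        hits.extend(span for span in block["spans"] if predicate(span))
    return hits

blocks = [
    {"min_time": 1,  "max_time": 4,  "spans": [{"service": "checkout", "dur_ms": 2300}]},
    {"min_time": 16, "max_time": 19, "spans": [{"service": "payments", "dur_ms": 40}]},
]
predicate = lambda s: s["service"] == "checkout" and s["dur_ms"] > 2000

with ThreadPoolExecutor() as pool:
    sub_ranges = split_range(0, 20, parts=4)
    partials = list(pool.map(querier, sub_ranges, [blocks] * 4, [predicate] * 4))

print([span for part in partials for span in part])
```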
Single-Node vs Distributed
Monolithic mode: A single binary runs all components (distributor, ingester, querier, query-frontend, compactor). Suitable for development, testing, and small deployments up to ~50K spans/sec. Simple to deploy: one process, one config file, one S3 bucket.
Microservices mode: Each component runs as a separate process (or Kubernetes deployment). Scale each dimension independently:
| Component | Role | Scales with | Stateful? |
|---|---|---|---|
| Distributor | Accepts OTLP spans, routes by trace_id | Ingestion rate | No |
| Ingester | Buffers spans, writes Parquet to S3 | Ingestion rate + trace cardinality | Yes (WAL on local SSD) |
| Querier | Reads Parquet from S3, evaluates TraceQL | Query concurrency | No |
| Query Frontend | Query scheduling, splitting, caching | Query concurrency | No (cache is external) |
| Compactor | Merges blocks, rebuilds bloom filters | Block count + retention volume | No (reads/writes S3) |
The ingester is the only stateful component (local WAL). Everything else is stateless and horizontally scalable.
Decision Criteria
| Criteria | Grafana Tempo | Jaeger | VictoriaTraces |
|---|---|---|---|
| Storage backend | S3/GCS (object storage) | Elasticsearch, Cassandra, or ClickHouse | Local NVMe/SSD |
| Index strategy | No index. Bloom filters for trace_id. | Full inverted index (ES) or column index (ClickHouse) | Bloom filters on all span fields |
| Query language | TraceQL (structural + attribute) | Tag-based search (limited) | Jaeger APIs + LogsQL |
| Storage cost (1 PB) | ~$23K/month (S3 Standard) | ~$200-500K/month (ES cluster) | Local NVMe provisioning |
| Resource efficiency | 1.35 vCPU, 4.26 GiB at 10K spans/sec | Depends on backend | 0.50 vCPU, 1.15 GiB at 10K spans/sec |
| Grafana integration | Native Tempo datasource | Jaeger datasource plugin | Jaeger datasource (+ experimental Tempo API) |
| Tiered storage | S3 lifecycle policies (automatic) | ES ILM or Cassandra TTL | NVMe → HDD → S3 via vmbackup (manual) |
| Operational complexity | Medium (S3 config, compactor monitoring) | High (external DB management) | Low (local disk, single binary or 3-component cluster) |
| Attribute search speed | Slow on large ranges (no index, Parquet scan) | Fast (inverted index) | Medium (bloom filter assisted) |
| Community | Large (Grafana Labs ecosystem) | Largest (CNCF graduated) | Growing (VictoriaMetrics ecosystem) |
Choosing between them:
- TraceQL structural queries are required: Grafana Tempo. Only option.
- S3-native with automatic lifecycle tiering: Grafana Tempo. Zero local disk management.
- Resource efficiency is the priority: VictoriaTraces. 3.7x less RAM, 2.6x less CPU.
- Fast attribute search on large time ranges: Jaeger with Elasticsearch. Full inverted index.
- Already running VictoriaMetrics + VictoriaLogs: VictoriaTraces. Same operational model.
- Air-gapped environments: VictoriaTraces. No external storage dependencies.
Capacity Planning
Ingester sizing:
| Spans/sec | Ingesters | CPU per ingester | RAM per ingester | Local SSD (WAL) |
|---|---|---|---|---|
| 50,000 | 3 | 2 cores | 8 GiB | 50 GiB |
| 200,000 | 10 | 4 cores | 16 GiB | 100 GiB |
| 1,000,000 | 30 | 4 cores | 16 GiB | 100 GiB |
S3 storage estimates (after tail-based sampling at 0.5% keep rate):
| Raw spans/sec | After sampling | Daily S3 writes | Monthly S3 cost (Standard) |
|---|---|---|---|
| 10M | 50K | ~4 TB | ~$92 |
| 100M | 500K | ~40 TB | ~$920 |
| 200M | 1M | ~86 TB | ~$1,978 |
S3 request costs add 10-30% on top of storage costs at high query volume. Use query-frontend caching (memcached) to reduce S3 GETs.
Compactor sizing: 1 compactor instance per ~500 GB of daily ingestion. 4-8 GiB RAM, 2-4 CPU cores each. Monitor tempodb_compaction_outstanding_blocks — if this grows continuously, add compactor capacity.
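As one way to wire up that check, a small sketch that polls a Prometheus-compatible endpoint scraping Tempo via the requests library (the Prometheus URL and the expected-block baseline are assumptions; only the metric name comes from above):

```python
# Compare outstanding compaction blocks against 2x an expected baseline.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"   # assumed Prometheus address
EXPECTED_BLOCKS = 500                              # assumed healthy baseline

resp = requests.get(PROM_URL, params={"query": "sum(tempodb_compaction_outstanding_blocks)"})
result = resp.json()["data"]["result"]
outstanding = float(result[0]["value"][1]) if result else 0.0

if outstanding > 2 * EXPECTED_BLOCKS:
    print(f"Compactor backlog: {outstanding:.0f} outstanding blocks, add capacity")
```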
Failure Scenarios
Scenario 1: Ingester Crash — WAL Replay
Trigger: An ingester runs out of memory during a traffic spike and is OOM-killed.
Impact: In-memory spans not yet flushed to S3 are lost from memory, but the WAL on local SSD has a copy. New spans for trace_ids hashed to this ingester are temporarily unavailable until it restarts.
Recovery: The ingester restarts, replays the WAL from local SSD, reconstructs the in-memory buffer, and flushes to S3. With replication factor 2+, the replica ingester continues accepting spans during the outage. Recovery time: 30 seconds to 2 minutes depending on WAL size.
Scenario 2: Compactor Backlog
Trigger: Compactor is under-resourced or temporarily down. Small blocks accumulate on S3 faster than the compactor can merge them.
Impact: Block count grows. Each query must check more bloom filters and potentially read more Parquet blocks. Query latency degrades gradually — from sub-second to multi-second for trace ID lookups, from seconds to tens of seconds for attribute searches.
Detection: Monitor tempodb_compaction_outstanding_blocks. Alert when outstanding blocks exceed 2x the expected count for the current ingestion rate.
Recovery: Scale up compactor resources (more instances, more memory). The compactor will catch up by merging the backlog. Once block count returns to normal, query latency recovers. No data loss occurs — the blocks are all valid, just not optimally merged.
Scenario 3: S3 Query Latency Spike
Trigger: A broad TraceQL query ({span.http.status_code = 500} across 7 days) fans out into thousands of S3 GET requests. S3 throttles the request rate or the querier exhausts its connection pool.
Impact: The broad query dominates querier resources. Other queries (including targeted trace ID lookups) queue behind it. Overall query latency spikes.
Recovery: Configure query-frontend limits: max_bytes_per_trace, max_search_duration (limit attribute search time range), and per-tenant query concurrency limits. Use a dedicated querier pool for broad analytical queries so they don't affect operational trace lookups. Add memcached query-frontend cache to avoid repeated S3 reads for the same blocks.
Pros
- No index to maintain. Bloom filters for trace ID lookup, Parquet columnar format for attribute search. Zero index management overhead.
- S3-native storage means trace cost scales with object storage pricing, not compute. Petabytes of traces for a fraction of Elasticsearch cost.
- TraceQL is the only trace query language with structural operators (find traces where span A is a parent of span B with duration > 2s).
- Native Grafana integration. Tempo datasource is built-in. Trace waterfall, service maps, and exemplar links work out of the box.
- Automatic tiered storage via S3 lifecycle policies. Hot to Glacier with zero application-level logic.
- Stateless query layer scales horizontally. Add querier pods to handle more concurrent queries without touching storage.
- Apache Parquet columnar format enables column pruning — queries that filter on one attribute skip all other columns.
Cons
- S3 query latency is higher than local disk. Non-cached attribute searches can take 2-10 seconds depending on block count.
- Uses 4.26 GiB RAM at 10K spans/sec vs VictoriaTraces' 1.15 GiB. Ingesters buffer spans in memory before flushing to S3.
- Compactor is critical infrastructure. If it falls behind, block count grows, bloom filters fragment, and query latency degrades.
- No local-disk-only option. S3 or compatible object storage is required. Cannot run in air-gapped environments without MinIO or similar.
- Attribute-based queries (not by trace ID) can be slow on large time ranges because there is no inverted index — Tempo must scan Parquet blocks.
- Bloom filter false positives increase with block count. Well-compacted blocks are essential for query performance.
When to use
- S3-native storage with automatic lifecycle tiering is a requirement
- TraceQL queries for structural trace analysis are needed
- Already running Grafana and want native datasource integration
- Trace volume is high and object storage economics make more sense than local disk provisioning
- Team prefers operational simplicity over resource efficiency (stateless query, no local state to manage)
When NOT to use
- Air-gapped or no-external-dependency environments (VictoriaTraces uses local disk only)
- Resource efficiency is the top priority (VictoriaTraces uses 3.7x less RAM)
- Already running VictoriaMetrics + VictoriaLogs and want the same operational model for traces
- Need sub-second attribute search on large time ranges (Elasticsearch-backed Jaeger has inverted indexes)
- Budget does not allow S3 costs for trace storage
Key Points
- Tempo stores trace data as Apache Parquet columnar files on S3. Each block contains spans for a time window (default: 30 minutes to 2 hours depending on ingestion rate). Parquet's columnar format means a query filtering on service_name reads only the service_name column, skipping all other span attributes. ZSTD compression on top of Parquet typically achieves 5-10x compression on trace data.
- Bloom filters eliminate unnecessary S3 reads for trace ID lookups. Each Parquet block has an associated bloom filter containing all trace IDs in that block. When a user queries by trace_id, Tempo checks bloom filters first. If the bloom filter says 'definitely not in this block', the block is skipped entirely — no S3 GET required. For a cluster with 10,000 blocks, a single trace ID lookup typically reads 1-3 blocks instead of all 10,000.
- Ingesters buffer incoming spans in memory and write to a local Write-Ahead Log (WAL) on SSD. When the buffer reaches a configurable size or time threshold, the ingester batches the buffered spans into a Parquet block and flushes it to S3. If an ingester crashes, the WAL is replayed on restart, ensuring no span loss. This buffer-then-flush pattern is similar to how VictoriaMetrics' vmstorage handles metrics.
- The compactor is Tempo's background maintenance process. It periodically merges small Parquet blocks into larger, better-compressed blocks. This is critical for query performance — fewer blocks means fewer bloom filter checks and fewer S3 GETs per query. The compactor also rebuilds bloom filters for merged blocks and deletes blocks that have exceeded the retention period. A healthy compactor is the single most important factor in Tempo query performance.
- TraceQL is a purpose-built query language for structural trace analysis. Basic filters: {resource.service.name="checkout" && duration > 2s}. Structural queries: find traces where a specific parent span contains a child span matching certain criteria. This is something no other trace backend offers — Jaeger has tag-based search, VictoriaTraces has LogsQL, but neither supports structural trace queries.
- Tiered storage is handled entirely by S3 lifecycle policies, requiring zero application-level logic. Configure S3 lifecycle rules to transition blocks: S3 Standard (0-3 days) to S3 Standard-IA (3-30 days) to Glacier Instant Retrieval (30-180 days) to Glacier Deep Archive (180+ days). Tempo reads transparently from any storage class with immediate access; Deep Archive objects must be restored first. Glacier Instant Retrieval provides millisecond access time, so even cold traces are queryable without restore delays.
- Tempo runs in two deployment modes. Monolithic mode (single binary) handles everything — suitable for development and small deployments up to ~50K spans/sec. Microservices mode splits into distributor (accepts spans, hashes by trace_id), ingester (buffers + flushes to S3), querier (reads from S3), query-frontend (query scheduling and caching), and compactor (block maintenance). Each component scales independently.
Common Mistakes
- ✗ Neglecting the compactor. If the compactor falls behind or is under-resourced, block count grows unbounded. More blocks means more bloom filter checks per query, more S3 GETs, and progressively slower queries. Monitor compactor lag and block count. Alert when outstanding blocks exceed 2x the expected count.
- ✗ Setting ingester flush period too short. Frequent flushes create many small Parquet blocks on S3. Small blocks have higher per-block overhead (bloom filter, block metadata) and degrade query performance. The compactor must work harder to merge them. Default flush period (30s-2min) works for most deployments. Only shorten it if you need lower ingestion-to-query latency.
- ✗ Running attribute-based queries on large time ranges without understanding the cost. A query like {span.http.status_code = 500} across 7 days may scan thousands of Parquet blocks. Tempo has no inverted index — it must check bloom filters and potentially read blocks. Always add time range constraints and service_name filters to narrow the search.
- ✗ Not configuring query-frontend result caching. Without caching, repeated TraceQL queries (common during incident investigation when multiple engineers query the same traces) each hit S3. A memcached or Redis cache on the query-frontend eliminates redundant S3 reads and dramatically improves query latency for repeated lookups.
- ✗ Ignoring S3 request costs at scale. Tempo generates S3 GET requests for every Parquet block read during queries and PUT requests during ingester flushes and compaction. At high query volume, S3 request costs can exceed storage costs. Use the query-frontend cache to reduce GETs. Monitor S3 request metrics per bucket.
- ✗ Using Glacier Deep Archive for traces that may need operational access. Deep Archive has a 12-hour restore time. If a compliance audit or post-incident review needs 6-month-old traces, Glacier Instant Retrieval is a better default cold tier (millisecond access, slightly higher storage cost).