Grafana Tempo
S3-native distributed tracing with no index to maintain and TraceQL for structural queries
Why It Exists
Before Tempo, distributed tracing at scale meant running Elasticsearch or Cassandra as the storage backend for Jaeger. Elasticsearch requires JVM tuning, shard management, ILM policies, and a dedicated team to operate. Cassandra requires topology planning, compaction tuning, and consistent hash ring management. Both are expensive to run at petabyte scale.
Grafana Labs built Tempo around a single insight: traces are write-heavy, read-light, and almost always queried by trace ID. If 90% of trace lookups are by a known trace ID (clicked from a metric exemplar or a log line), a full inverted index is unnecessary. A bloom filter per block is enough to locate the right Parquet file on S3. The remaining 10% of queries (search by attribute) are slower, but the operational and cost savings from eliminating the index are massive.
The result is a trace backend where storage cost scales with S3 pricing (~$0.023/GB/month for Standard, ~$0.004/GB/month for Glacier Instant Retrieval), not with compute cluster sizing. A petabyte of traces on Glacier Instant Retrieval costs ~$4,000/month. The same volume on Elasticsearch would require dozens of data nodes at 10-50x the cost.
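A quick sanity check of those figures (a back-of-the-envelope sketch in Python using only the per-GB list prices quoted above; request, transfer, and compute charges are ignored):

```python
# Storage-only cost for 1 PB of traces at the per-GB prices quoted above.
PB_IN_GB = 1_000_000

s3_standard_per_gb = 0.023   # $/GB/month
glacier_ir_per_gb = 0.004    # $/GB/month

print(f"S3 Standard:               ${PB_IN_GB * s3_standard_per_gb:>9,.0f}/month")  # ~$23,000
print(f"Glacier Instant Retrieval: ${PB_IN_GB * glacier_ir_per_gb:>9,.0f}/month")   # ~$4,000
```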
How the Storage Engine Works
Tempo writes trace data as Apache Parquet columnar files on S3. Each file is called a "block" and contains all spans ingested during a time window.
Why Parquet? Parquet stores data column-by-column, not row-by-row. A trace span has dozens of attributes (service_name, operation, duration, http.method, http.status_code, custom attributes). A query filtering on service_name reads only the service_name column from the Parquet file, skipping all other columns entirely. This column pruning reduces the bytes read from S3 by 10-50x compared to reading full rows.
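The effect is easy to demonstrate with a generic Parquet library. A minimal sketch using pyarrow (a toy file with made-up span attributes, not Tempo's actual schema or reader):

```python
# Column pruning with pyarrow: read_table() with `columns=` fetches only the
# requested column chunks; the other columns in the file are never touched.
import pyarrow as pa
import pyarrow.parquet as pq

# A toy "block" with a handful of span attributes (hypothetical schema).
spans = pa.table({
    "trace_id":         ["abc123", "abc123", "def456"],
    "service_name":     ["checkout", "payments", "checkout"],
    "duration_ns":      [1_200_000, 85_000_000, 430_000],
    "http_status_code": [200, 500, 200],
})
pq.write_table(spans, "block.parquet", compression="zstd")

# A query filtering on service_name only needs that one column.
names = pq.read_table("block.parquet", columns=["service_name"])
print(names.column("service_name").to_pylist())
```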
Block structure on S3:
```
s3://tempo-traces/
  single-tenant/
    <block-id>/
      data.parquet    # Span data in columnar format
      bloom-0.bloom   # Bloom filter shard (trace IDs)
      bloom-1.bloom   # Bloom filter shard
      meta.json       # Block metadata (time range, span count, size)
```
Bloom filter lookup flow:
1. A user queries `{traceid="abc123"}`
2. Tempo reads `meta.json` for all blocks in the queried time range
3. For each block, Tempo checks the bloom filter: "Could trace abc123 be in this block?"
4. The bloom filter says "definitely not" for 9,997 out of 10,000 blocks → those blocks are skipped
5. The bloom filter says "maybe" for 3 blocks → Tempo reads `data.parquet` from S3 for those 3
6. Parquet column pruning: only the columns needed for the response are read
7. The matching trace is returned
The bloom filter eliminates 99.97% of S3 reads in this example. The false positive rate depends on block size and bloom filter configuration. Well-compacted blocks (fewer, larger blocks) have more efficient bloom filters.
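The same skip-before-fetch logic can be sketched with a toy bloom filter (purely illustrative; the block names, filter size, and hash count are made up and much smaller than Tempo's real filters):

```python
# A tiny bloom filter per "block": lookups only fetch blocks whose filter
# answers "maybe"; a "definitely not" answer skips the S3 GET entirely.
import hashlib

class TinyBloom:
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size, self.num_hashes, self.bits = size_bits, num_hashes, 0

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def maybe_contains(self, item: str) -> bool:
        return all(self.bits >> pos & 1 for pos in self._positions(item))

# One filter per block, built from the trace IDs that block contains.
blocks = {}
for n in range(10_000):
    bf = TinyBloom()
    bf.add(f"trace-{n}")                 # pretend each block holds one trace
    blocks[f"block-{n:05d}"] = bf

# Only blocks whose filter says "maybe" would be read from S3.
target = "trace-42"
candidates = [bid for bid, bf in blocks.items() if bf.maybe_contains(target)]
print(f"{len(candidates)} of {len(blocks)} blocks need an S3 read")
```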
Ingester and Write Path
The write path follows this sequence:
1. The distributor receives spans via OTLP (gRPC or HTTP) from OTel Collectors
2. The distributor hashes each span's `trace_id` and routes it to the correct ingester (consistent hashing ring)
3. The ingester appends the span to an in-memory buffer and writes it to the local WAL (Write-Ahead Log) on SSD
4. When the buffer reaches the flush threshold (configurable: time-based or size-based), the ingester:
   - Sorts spans by `trace_id`
   - Builds a Parquet file with columnar encoding
   - Constructs bloom filter shards for all trace IDs in the block
   - Uploads `data.parquet`, the bloom filter files, and `meta.json` to S3
   - Deletes the corresponding WAL segments
The trace_id-based routing in the distributor ensures all spans from a single trace arrive at the same ingester. This means the ingester has the complete trace (or most of it) when it builds the Parquet block, producing better compression and more accurate bloom filters.
WAL replay on crash: If an ingester crashes before flushing, the WAL on local SSD survives. On restart, the ingester replays the WAL, reconstructs the in-memory buffer, and flushes normally. No spans are lost unless the local SSD itself fails (mitigated by running multiple ingesters with replication factor 2-3).
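A compressed sketch of that ingest-buffer-flush-replay cycle, with local directories standing in for the SSD WAL and the S3 bucket (the paths, threshold, and span fields are hypothetical, and pyarrow stands in for Tempo's own Parquet writer):

```python
# WAL-then-flush: every span goes to the in-memory buffer AND the WAL, so a
# crash before flush loses nothing that replay_wal() cannot rebuild.
import json, os, uuid
import pyarrow as pa
import pyarrow.parquet as pq

WAL_DIR, S3_DIR = "wal", "s3-bucket"     # stand-ins for local SSD and S3
os.makedirs(WAL_DIR, exist_ok=True)
os.makedirs(S3_DIR, exist_ok=True)

buffer = []                              # in-memory span buffer
FLUSH_THRESHOLD = 1000                   # real Tempo also flushes on a timer

def ingest(span: dict):
    """Append the span to the buffer and to the WAL segment on 'SSD'."""
    buffer.append(span)
    with open(f"{WAL_DIR}/segment.jsonl", "a") as wal:
        wal.write(json.dumps(span) + "\n")
    if len(buffer) >= FLUSH_THRESHOLD:
        flush()

def flush():
    """Sort by trace_id, write one columnar block, upload, drop the WAL."""
    buffer.sort(key=lambda s: s["trace_id"])
    block = pa.table({key: [s[key] for s in buffer] for key in buffer[0]})
    pq.write_table(block, f"{S3_DIR}/{uuid.uuid4().hex}.parquet", compression="zstd")
    buffer.clear()
    os.remove(f"{WAL_DIR}/segment.jsonl")

def replay_wal():
    """On restart, rebuild the buffer from any WAL segment left behind."""
    path = f"{WAL_DIR}/segment.jsonl"
    if os.path.exists(path):
        with open(path) as wal:
            buffer.extend(json.loads(line) for line in wal)
```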
Compactor
The compactor is Tempo's background maintenance process. Without it, query performance degrades over time.
What it does:
- Merges small blocks. Frequent ingester flushes create many small Parquet files on S3. The compactor reads multiple small blocks, merges them into a single larger block, and deletes the originals. Fewer blocks = fewer bloom filter checks per query = faster queries.
- Rebuilds bloom filters. Merged blocks get new, more efficient bloom filters. A bloom filter covering 1 million traces in one block is more space-efficient than 10 bloom filters covering 100K traces each.
- Enforces retention. Blocks older than the configured retention period are deleted from S3.
- Manages block lifecycle. Tracks block state (active, compacted, deleted) in a per-tenant block list stored on S3.
Compactor sizing: The compactor reads blocks from S3, merges them in memory, and writes the result back to S3. Memory usage scales with block size and concurrency. A single compactor handling a few hundred GB of daily traces needs 4-8 GB RAM and 2-4 CPU cores. At petabyte scale, run multiple compactor instances with sharded tenants.
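The core merge step can be sketched in a few lines, reusing the toy local-directory layout from the ingester sketch above (the smallest-first selection and the eight-block cap are arbitrary choices for illustration, not Tempo's actual compaction strategy):

```python
# Compaction sketch: merge the smallest blocks into one larger, re-sorted
# block, then delete the originals. Fewer blocks = fewer filter checks, fewer GETs.
import glob, os, uuid
import pyarrow as pa
import pyarrow.parquet as pq

S3_DIR = "s3-bucket"            # same stand-in bucket as the ingester sketch
MAX_INPUTS = 8                  # arbitrary cap on blocks merged per pass

def compact_once():
    blocks = sorted(glob.glob(f"{S3_DIR}/*.parquet"), key=os.path.getsize)
    blocks = blocks[:MAX_INPUTS]
    if len(blocks) < 2:
        return                                   # nothing worth merging
    merged = pa.concat_tables([pq.read_table(b) for b in blocks])
    merged = merged.sort_by("trace_id")          # keep trace IDs clustered
    out_path = f"{S3_DIR}/{uuid.uuid4().hex}-compacted.parquet"
    pq.write_table(merged, out_path, compression="zstd")
    for b in blocks:                             # originals are now redundant
        os.remove(b)
```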
TraceQL
TraceQL is Tempo's query language for trace analysis. It operates on spans and traces, not on time series.
Basic attribute queries:
```
# Find all traces from the checkout service
{resource.service.name = "checkout"}

# Find error spans longer than 2 seconds
{status = error && duration > 2s}

# Find database spans with slow queries
{span.db.system = "postgresql" && duration > 500ms}
```
Structural queries (unique to TraceQL):
```
# Find traces where a parent HTTP span contains a slow child DB span
{span.http.method = "POST"} >> {span.db.system = "postgresql" && duration > 1s}

# Find traces with more than 20 spans (complex request flows)
{ } | count() > 20
```
The >> operator means "is an ancestor of" — this finds traces where an HTTP POST span eventually calls a PostgreSQL query that takes over 1 second. No other trace query language supports this kind of structural query. Jaeger supports tag-based search. VictoriaTraces supports LogsQL flat field search. Only TraceQL can express parent-child span relationships.
Query execution: The query frontend receives a TraceQL query, splits it by time range, and fans it out to querier pods. Each querier checks bloom filters, reads relevant Parquet blocks from S3, evaluates the query against the columnar data, and returns matching spans. The query frontend merges results from all queriers.
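A minimal sketch of that fan-out and merge, with in-memory dicts standing in for meta.json and the Parquet blocks (the block metadata, span fields, and match predicate are all hypothetical):

```python
# Query fan-out: split the time range, let each "querier" scan only blocks
# whose metadata overlaps its sub-range, then merge the partial results.
from concurrent.futures import ThreadPoolExecutor

def split_range(start, end, parts):
    step = (end - start) / parts
    return [(start + i * step, start + (i + 1) * step) for i in range(parts)]

def querier(sub_range, blocks, predicate):
    lo, hi = sub_range
    hits = []
    for block in blocks:
        if block["max_time"] < lo or block["min_time"] > hi:
            continue                 # meta.json says out of range: skip block
        hits.extend(span for span in block["spans"] if predicate(span))
    return hits

blocks = [
    {"min_time": 1,  "max_time": 4,  "spans": [{"service": "checkout", "dur_ms": 2300}]},
    {"min_time": 16, "max_time": 19, "spans": [{"service": "payments", "dur_ms": 40}]},
]
predicate = lambda s: s["service"] == "checkout" and s["dur_ms"] > 2000

with ThreadPoolExecutor() as pool:
    sub_ranges = split_range(0, 20, parts=4)
    partials = list(pool.map(querier, sub_ranges, [blocks] * 4, [predicate] * 4))

print([span for part in partials for span in part])
```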
Single-Node vs Distributed
Monolithic mode: A single binary runs all components (distributor, ingester, querier, query-frontend, compactor). Suitable for development, testing, and small deployments up to ~50K spans/sec. Simple to deploy: one process, one config file, one S3 bucket.
Microservices mode: Each component runs as a separate process (or Kubernetes deployment). Scale each dimension independently:
| Component | Role | Scales with | Stateful? |
|---|---|---|---|
| Distributor | Accepts OTLP spans, routes by trace_id | Ingestion rate | No |
| Ingester | Buffers spans, writes Parquet to S3 | Ingestion rate + trace cardinality | Yes (WAL on local SSD) |
| Querier | Reads Parquet from S3, evaluates TraceQL | Query concurrency | No |
| Query Frontend | Query scheduling, splitting, caching | Query concurrency | No (cache is external) |
| Compactor | Merges blocks, rebuilds bloom filters | Block count + retention volume | No (reads/writes S3) |
The ingester is the only stateful component (local WAL). Everything else is stateless and horizontally scalable.
Decision Criteria
| Criteria | Grafana Tempo | Jaeger | VictoriaTraces |
|---|---|---|---|
| Storage backend | S3/GCS (object storage) | Elasticsearch, Cassandra, or ClickHouse | Local NVMe/SSD |
| Index strategy | No index. Bloom filters for trace_id. | Full inverted index (ES) or column index (ClickHouse) | Bloom filters on all span fields |
| Query language | TraceQL (structural + attribute) | Tag-based search (limited) | Jaeger APIs + LogsQL |
| Storage cost (1 PB) | ~$23K/month (S3 Standard) | ~$200-500K/month (ES cluster) | Local NVMe provisioning |
| Resource efficiency | 1.35 vCPU, 4.26 GiB at 10K spans/sec | Depends on backend | 0.50 vCPU, 1.15 GiB at 10K spans/sec |
| Grafana integration | Native Tempo datasource | Jaeger datasource plugin | Jaeger datasource (+ experimental Tempo API) |
| Tiered storage | S3 lifecycle policies (automatic) | ES ILM or Cassandra TTL | NVMe → HDD → S3 via vmbackup (manual) |
| Operational complexity | Medium (S3 config, compactor monitoring) | High (external DB management) | Low (local disk, single binary or 3-component cluster) |
| Attribute search speed | Slow on large ranges (no index, Parquet scan) | Fast (inverted index) | Medium (bloom filter assisted) |
| Community | Large (Grafana Labs ecosystem) | Largest (CNCF graduated) | Growing (VictoriaMetrics ecosystem) |
Choosing between them:
- TraceQL structural queries are required: Grafana Tempo. Only option.
- S3-native with automatic lifecycle tiering: Grafana Tempo. Zero local disk management.
- Resource efficiency is the priority: VictoriaTraces. 3.7x less RAM, 2.6x less CPU.
- Fast attribute search on large time ranges: Jaeger with Elasticsearch. Full inverted index.
- Already running VictoriaMetrics + VictoriaLogs: VictoriaTraces. Same operational model.
- Air-gapped environments: VictoriaTraces. No external storage dependencies.
Capacity Planning
Ingester sizing:
| Spans/sec | Ingesters | CPU per ingester | RAM per ingester | Local SSD (WAL) |
|---|---|---|---|---|
| 50,000 | 3 | 2 cores | 8 GiB | 50 GiB |
| 200,000 | 10 | 4 cores | 16 GiB | 100 GiB |
| 1,000,000 | 30 | 4 cores | 16 GiB | 100 GiB |
S3 storage estimates (after tail-based sampling at 0.5% keep rate):
| Raw spans/sec | After sampling | Daily S3 writes | Monthly S3 cost (Standard) |
|---|---|---|---|
| 10M | 50K | ~4 TB | ~$92 |
| 100M | 500K | ~40 TB | ~$920 |
| 200M | 1M | ~86 TB | ~$1,978 |
S3 request costs add 10-30% on top of storage costs at high query volume. Use query-frontend caching (memcached) to reduce S3 GETs.
Compactor sizing: 1 compactor instance per ~500 GB of daily ingestion. 4-8 GiB RAM, 2-4 CPU cores each. Monitor tempodb_compaction_outstanding_blocks — if this grows continuously, add compactor capacity.
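As one way to wire up that check, a small sketch that polls a Prometheus-compatible endpoint scraping Tempo via the requests library (the Prometheus URL and the expected-block baseline are assumptions; only the metric name comes from above):

```python
# Compare outstanding compaction blocks against 2x an expected baseline.
import requests

PROM_URL = "http://prometheus:9090/api/v1/query"   # assumed Prometheus address
EXPECTED_BLOCKS = 500                              # assumed healthy baseline

resp = requests.get(PROM_URL, params={"query": "sum(tempodb_compaction_outstanding_blocks)"})
result = resp.json()["data"]["result"]
outstanding = float(result[0]["value"][1]) if result else 0.0

if outstanding > 2 * EXPECTED_BLOCKS:
    print(f"Compactor backlog: {outstanding:.0f} outstanding blocks, add capacity")
```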
Failure Scenarios
Scenario 1: Ingester Crash — WAL Replay
Trigger: An ingester runs out of memory during a traffic spike and is OOM-killed.
Impact: In-memory spans not yet flushed to S3 are lost from memory, but the WAL on local SSD has a copy. New spans for trace_ids hashed to this ingester are temporarily unavailable until it restarts.
Recovery: The ingester restarts, replays the WAL from local SSD, reconstructs the in-memory buffer, and flushes to S3. With replication factor 2+, the replica ingester continues accepting spans during the outage. Recovery time: 30 seconds to 2 minutes depending on WAL size.
Scenario 2: Compactor Backlog
Trigger: Compactor is under-resourced or temporarily down. Small blocks accumulate on S3 faster than the compactor can merge them.
Impact: Block count grows. Each query must check more bloom filters and potentially read more Parquet blocks. Query latency degrades gradually — from sub-second to multi-second for trace ID lookups, from seconds to tens of seconds for attribute searches.
Detection: Monitor tempodb_compaction_outstanding_blocks. Alert when outstanding blocks exceed 2x the expected count for the current ingestion rate.
Recovery: Scale up compactor resources (more instances, more memory). The compactor will catch up by merging the backlog. Once block count returns to normal, query latency recovers. No data loss occurs — the blocks are all valid, just not optimally merged.
Scenario 3: S3 Query Latency Spike
Trigger: A broad TraceQL query ({span.http.status_code = 500} across 7 days) fans out into thousands of S3 GET requests. S3 throttles the request rate or the querier exhausts its connection pool.
Impact: The broad query dominates querier resources. Other queries (including targeted trace ID lookups) queue behind it. Overall query latency spikes.
Recovery: Configure query-frontend limits: max_bytes_per_trace, max_search_duration (limit attribute search time range), and per-tenant query concurrency limits. Use a dedicated querier pool for broad analytical queries so they don't affect operational trace lookups. Add memcached query-frontend cache to avoid repeated S3 reads for the same blocks.
Pros
- No index to maintain. Bloom filters for trace ID lookup, Parquet columnar format for attribute search. Zero index management overhead.
- S3-native storage means trace cost scales with object storage pricing, not compute. Petabytes of traces for a fraction of Elasticsearch cost.
- TraceQL is the only trace query language with structural operators (find traces where span A is a parent of span B with duration > 2s).
- Native Grafana integration. Tempo datasource is built-in. Trace waterfall, service maps, and exemplar links work out of the box.
- Automatic tiered storage via S3 lifecycle policies. Hot to Glacier with zero application-level logic.
- Stateless query layer scales horizontally. Add querier pods to handle more concurrent queries without touching storage.
- Apache Parquet columnar format enables column pruning — queries that filter on one attribute skip all other columns.
Cons
- S3 query latency is higher than local disk. Non-cached attribute searches can take 2-10 seconds depending on block count.
- Uses 4.26 GiB RAM at 10K spans/sec vs VictoriaTraces' 1.15 GiB. Ingesters buffer spans in memory before flushing to S3.
- Compactor is critical infrastructure. If it falls behind, block count grows, bloom filters fragment, and query latency degrades.
- No local-disk-only option. S3 or compatible object storage is required. Cannot run in air-gapped environments without MinIO or similar.
- Attribute-based queries (not by trace ID) can be slow on large time ranges because there is no inverted index — Tempo must scan Parquet blocks.
- Bloom filter false positives increase with block count. Well-compacted blocks are essential for query performance.
When to use
- S3-native storage with automatic lifecycle tiering is a requirement
- TraceQL queries for structural trace analysis are needed
- Already running Grafana and want native datasource integration
- Trace volume is high and object storage economics make more sense than local disk provisioning
- Team prefers operational simplicity over resource efficiency (stateless query, no local state to manage)
When NOT to use
- Air-gapped or no-external-dependency environments (VictoriaTraces uses local disk only)
- Resource efficiency is the top priority (VictoriaTraces uses 3.7x less RAM)
- Already running VictoriaMetrics + VictoriaLogs and want the same operational model for traces
- Need sub-second attribute search on large time ranges (Elasticsearch-backed Jaeger has inverted indexes)
- Budget does not allow S3 costs for trace storage
Key Points
- Tempo stores trace data as Apache Parquet columnar files on S3. Each block contains spans for a time window (default: 30 minutes to 2 hours depending on ingestion rate). Parquet's columnar format means a query filtering on service_name reads only the service_name column, skipping all other span attributes. ZSTD compression on top of Parquet typically achieves 5-10x compression on trace data.
- Bloom filters eliminate unnecessary S3 reads for trace ID lookups. Each Parquet block has an associated bloom filter containing all trace IDs in that block. When a user queries by trace_id, Tempo checks bloom filters first. If the bloom filter says 'definitely not in this block', the block is skipped entirely — no S3 GET required. For a cluster with 10,000 blocks, a single trace ID lookup typically reads 1-3 blocks instead of all 10,000.
- Ingesters buffer incoming spans in memory and write to a local Write-Ahead Log (WAL) on SSD. When the buffer reaches a configurable size or time threshold, the ingester batches the buffered spans into a Parquet block and flushes it to S3. If an ingester crashes, the WAL is replayed on restart, ensuring no span loss. This buffer-then-flush pattern is similar to how VictoriaMetrics' vmstorage handles metrics.
- The compactor is Tempo's background maintenance process. It periodically merges small Parquet blocks into larger, better-compressed blocks. This is critical for query performance — fewer blocks means fewer bloom filter checks and fewer S3 GETs per query. The compactor also rebuilds bloom filters for merged blocks and deletes blocks that have exceeded the retention period. A healthy compactor is the single most important factor in Tempo query performance.
- TraceQL is a purpose-built query language for structural trace analysis. Basic filters: {resource.service.name="checkout" && duration > 2s}. Structural queries: find traces where a specific parent span contains a child span matching certain criteria. This is something no other trace backend offers — Jaeger has tag-based search, VictoriaTraces has LogsQL, but neither supports structural trace queries.
- Tiered storage is handled entirely by S3 lifecycle policies, requiring zero application-level logic. Configure S3 lifecycle rules to transition blocks: S3 Standard (0-3 days) to S3 Standard-IA (3-30 days) to Glacier Instant Retrieval (30-180 days) to Glacier Deep Archive (180+ days). Tempo reads transparently from any storage class with immediate access; Deep Archive objects must be restored first. Glacier Instant Retrieval provides millisecond access time, so even cold traces are queryable without restore delays.
- Tempo runs in two deployment modes. Monolithic mode (single binary) handles everything — suitable for development and small deployments up to ~50K spans/sec. Microservices mode splits into distributor (accepts spans, hashes by trace_id), ingester (buffers + flushes to S3), querier (reads from S3), query-frontend (query scheduling and caching), and compactor (block maintenance). Each component scales independently.
Common Mistakes
- ✗ Neglecting the compactor. If the compactor falls behind or is under-resourced, block count grows unbounded. More blocks means more bloom filter checks per query, more S3 GETs, and progressively slower queries. Monitor compactor lag and block count. Alert when outstanding blocks exceed 2x the expected count.
- ✗ Setting ingester flush period too short. Frequent flushes create many small Parquet blocks on S3. Small blocks have higher per-block overhead (bloom filter, block metadata) and degrade query performance. The compactor must work harder to merge them. Default flush period (30s-2min) works for most deployments. Only shorten it if you need lower ingestion-to-query latency.
- ✗ Running attribute-based queries on large time ranges without understanding the cost. A query like {span.http.status_code = 500} across 7 days may scan thousands of Parquet blocks. Tempo has no inverted index — it must check bloom filters and potentially read blocks. Always add time range constraints and service_name filters to narrow the search.
- ✗ Not configuring query-frontend result caching. Without caching, repeated TraceQL queries (common during incident investigation when multiple engineers query the same traces) each hit S3. A memcached or Redis cache on the query-frontend eliminates redundant S3 reads and dramatically improves query latency for repeated lookups.
- ✗ Ignoring S3 request costs at scale. Tempo generates S3 GET requests for every Parquet block read during queries and PUT requests during ingester flushes and compaction. At high query volume, S3 request costs can exceed storage costs. Use the query-frontend cache to reduce GETs. Monitor S3 request metrics per bucket.
- ✗ Using Glacier Deep Archive for traces that may need operational access. Deep Archive has a 12-hour restore time. If a compliance audit or post-incident review needs 6-month-old traces, Glacier Instant Retrieval is a better default cold tier (millisecond access, slightly higher storage cost).