VictoriaTraces
Distributed tracing built on the same engine as VictoriaLogs, without the external storage tax
Why It Exists
Distributed tracing backends traditionally require external storage dependencies. Jaeger needs Elasticsearch or Cassandra. Grafana Tempo needs S3 or GCS. These dependencies work, but they add operational surface area. Running an Elasticsearch cluster for trace storage means managing shards, ILM policies, and JVM tuning. Using S3 means paying for object storage and accepting higher query latency for non-cached data.
VictoriaTraces takes a different approach. It stores trace data on local disk using the same storage engine as VictoriaLogs, with bloom filter indexing, columnar compression, and daily partition management. No external database. No object storage. The trade-off is that local disk requires capacity planning and manual cold-tier backup to S3 via vmbackup. But the operational surface area is dramatically smaller: a single binary or a vtinsert/vtselect/vtstorage cluster, local NVMe, and standard backup tooling.
For teams already running VictoriaMetrics (metrics) and VictoriaLogs (logs), VictoriaTraces completes the observability stack with the same operational model. The cluster architecture follows the same pattern (insert/select/storage), the same retention model (daily partitions, configurable retention period), and the same monitoring approach. One team can operate all three systems without learning three different operational paradigms.
How the Storage Engine Works
VictoriaTraces is built directly on the VictoriaLogs storage engine. Trace spans, despite looking different from log lines, are structurally similar: both are collections of key-value pairs with a timestamp, high ingestion volume, and selective querying.
When a trace span arrives via OTLP, VictoriaTraces transforms it into a structured entry:
- Stream fields (indexed, low-cardinality): service name, span name. These become the primary filtering dimensions, similar to Loki's labels or VictoriaLogs' stream fields.
- Time field: the span's start timestamp. Used for partition routing and time-range queries.
- Ordinary fields (bloom filter indexed): everything else. trace_id, span_id, parent_span_id, duration, http.method, http.status_code, db.system, and any custom attributes. Each field's tokens are added to the bloom filter for the data block.
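A minimal sketch of this mapping, assuming a simplified span type and illustrative field names (the real VictoriaTraces schema and the full OTLP span model carry more fields than shown here):

```go
package sketch

import "strconv"

// Span is a simplified stand-in for an OTLP span; real spans also carry
// events, links, status, and resource attributes.
type Span struct {
	ServiceName string
	SpanName    string
	StartUnixNs int64
	TraceID     string
	SpanID      string
	ParentID    string
	DurationNs  int64
	Attributes  map[string]string
}

// toEntry flattens a span into the three field classes described above:
// stream fields (indexed), the time field, and ordinary fields (bloom filtered).
func toEntry(s Span) (stream map[string]string, timeNs int64, ordinary map[string]string) {
	stream = map[string]string{
		"service_name": s.ServiceName, // low cardinality, defines the stream
		"span_name":    s.SpanName,
	}
	timeNs = s.StartUnixNs // routes the entry to a daily partition

	ordinary = map[string]string{
		"trace_id":       s.TraceID,
		"span_id":        s.SpanID,
		"parent_span_id": s.ParentID,
		"duration":       strconv.FormatInt(s.DurationNs, 10),
	}
	for k, v := range s.Attributes { // http.method, db.system, custom attributes, ...
		ordinary[k] = v
	}
	return stream, timeNs, ordinary
}
```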
The bloom filter is the key indexing mechanism. Every unique token in every span field costs roughly 2 bytes of bloom filter storage. When a query searches for a specific trace_id (a 32-character hex string), the bloom filter on each data block can instantly determine "this block definitely does not contain this trace_id" and skip it. Only blocks that might contain the token get read from disk. For unique identifiers like trace_id, this eliminates 99%+ of disk reads.
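The skip logic amounts to something like the sketch below; the bloom filter interface and block metadata are illustrative, not the actual on-disk structures:

```go
package sketch

// BlockMeta pairs a data block with the bloom filter built from its field tokens.
type BlockMeta struct {
	Path  string
	Bloom interface{ MayContain(token string) bool }
}

// blocksToRead keeps only the blocks whose bloom filter might contain the
// token (for example a trace_id). Blocks that definitely lack the token are
// skipped without any disk read; for unique identifiers that is nearly all of them.
func blocksToRead(blocks []BlockMeta, token string) []BlockMeta {
	var candidates []BlockMeta
	for _, b := range blocks {
		if b.Bloom.MayContain(token) { // false positives possible, false negatives never
			candidates = append(candidates, b)
		}
	}
	return candidates
}
```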
Data is stored in daily UTC partitions. All spans whose start timestamp falls on a given UTC calendar day go into that day's partition directory. Retention works by deleting entire partition directories once they age past the configured -retentionPeriod. No scanning, no garbage collection, no tombstones. With 30-day retention and ~100 GB daily partitions, retention cleanup deletes a single directory in milliseconds, regardless of how many spans are inside.
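Retention enforcement is essentially directory removal. A sketch, assuming YYYYMMDD-named partition directories (the real layout and flag handling differ):

```go
package sketch

import (
	"os"
	"path/filepath"
	"time"
)

// dropExpiredPartitions removes whole daily partition directories that fall
// outside the retention window. The cost is one directory removal per expired
// day, independent of how many spans each partition holds.
func dropExpiredPartitions(root string, retention time.Duration, now time.Time) error {
	cutoff := now.UTC().Add(-retention)
	entries, err := os.ReadDir(root)
	if err != nil {
		return err
	}
	for _, e := range entries {
		day, err := time.Parse("20060102", e.Name()) // assumed partition naming
		if err != nil || !e.IsDir() {
			continue
		}
		if day.Before(cutoff) {
			if err := os.RemoveAll(filepath.Join(root, e.Name())); err != nil {
				return err
			}
		}
	}
	return nil
}
```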
Columnar compression stores each span field column-by-column. All service_name values for a block are compressed together, all duration values together, all trace_id values together. Repeated values (the same service name across thousands of spans) compress to nearly nothing. Timestamps use delta-of-delta encoding (identical to VictoriaMetrics). The result: 3.27 GiB of disk for the same span dataset that takes 5.86 GiB in ClickHouse and roughly 4.4 GiB in Tempo.
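For intuition, a toy delta-of-delta encoder over sorted span timestamps; the real encoding adds variable-length integer packing and block-level compression on top:

```go
package sketch

// deltaOfDelta encodes sorted timestamps as second-order differences.
// A steady span rate produces evenly spaced timestamps, which collapse
// into long runs of zeros and compress extremely well.
func deltaOfDelta(ts []int64) []int64 {
	if len(ts) == 0 {
		return nil
	}
	out := make([]int64, len(ts))
	out[0] = ts[0] // first timestamp stored as-is
	prevDelta := int64(0)
	for i := 1; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		out[i] = delta - prevDelta // usually zero or close to it
		prevDelta = delta
	}
	return out
}
```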
Single-Node vs Cluster
Single-node is a single binary that handles OTLP ingestion, storage, and queries. For deployments ingesting up to a few thousand spans per second, this is sufficient. A single node on modest hardware (4 cores, 8 GB RAM, NVMe) handled 10,000 spans/sec in benchmarks at 0.50 vCPU and 1.15 GiB RAM. Scale that to a 16-core machine and single-node handles tens of thousands of spans per second comfortably.
Cluster mode splits into three components:
- vtinsert: Accepts incoming OTLP spans (gRPC and HTTP). Distributes spans across vtstorage nodes by trace_id hash. This ensures all spans from a single trace land on the same vtstorage node, so trace reconstruction is a local read operation. Stateless and horizontally scalable behind a load balancer.
- vtstorage: Stores span data on local NVMe/SSD. Each node is essentially an independent single-node VictoriaTraces instance. Adding a vtstorage node adds capacity linearly with no rebalancing. Handles tokenization, bloom filter construction, columnar compression, and local query execution.
- vtselect: Accepts queries via Jaeger Query Service APIs. Fans out queries to all vtstorage nodes. Merges results. Stateless. Scale based on query concurrency.
The trace_id-based distribution in vtinsert is critical. Earlier versions distributed spans randomly, which meant reconstructing a single trace required reading from all vtstorage nodes. With trace_id-based routing, all spans for a trace are co-located on one node. A trace_id lookup reads from a single vtstorage node, not from every node in the cluster.
vtinsert continues functioning when some vtstorage nodes are unavailable. It automatically reroutes spans to the remaining healthy nodes, ensuring ingestion continues uninterrupted. When the failed node recovers, spans for its trace_id range resume routing there. The traces that were rerouted during the outage end up split across nodes, but vtselect handles cross-node trace reconstruction during query.
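A simplified sketch of that routing behavior: hash the trace_id to pick a vtstorage node, and fall through to the next healthy node when the target is down. The FNV hash and node list are illustrative; the actual vtinsert implementation also handles connection management and backpressure.

```go
package sketch

import "hash/fnv"

type storageNode struct {
	Addr    string
	Healthy bool
}

// pickNode routes a span to a vtstorage node by trace_id hash, so all spans of
// one trace normally land on the same node. If that node is down, the span is
// rerouted to the next healthy node so ingestion continues; traces rerouted
// during an outage end up split across nodes until queried.
func pickNode(traceID string, nodes []storageNode) *storageNode {
	if len(nodes) == 0 {
		return nil
	}
	h := fnv.New64a()
	h.Write([]byte(traceID))
	start := int(h.Sum64() % uint64(len(nodes)))
	for i := 0; i < len(nodes); i++ {
		n := &nodes[(start+i)%len(nodes)]
		if n.Healthy {
			return n
		}
	}
	return nil // no healthy vtstorage nodes left
}
```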
OTLP Ingestion
VictoriaTraces accepts trace spans exclusively via the OpenTelemetry Protocol (OTLP). Three transport formats are supported:
- OTLP/gRPC on port 4317 (requires the -otlpGRPCListenAddr flag)
- OTLP/HTTP with binary protobuf payloads
- OTLP/HTTP with JSON payloads
The gRPC implementation is notable for what it does not use. Instead of the standard gRPC-Go library (which pulls in a large dependency tree), VictoriaTraces implements a custom HTTP/2 server that speaks the gRPC wire protocol directly. The endpoint path follows the OTLP specification: /opentelemetry.proto.collector.trace.v1.TraceService/Export.
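To make "speaks the gRPC wire protocol directly" concrete: a gRPC unary request body is one or more length-prefixed protobuf messages (a 1-byte compression flag, a 4-byte big-endian length, then the payload) POSTed over HTTP/2 to the method path. A minimal parsing sketch of that framing, not the actual VictoriaTraces code:

```go
package sketch

import (
	"encoding/binary"
	"fmt"
)

// readGRPCMessage parses one length-prefixed gRPC message from a request body:
// 1 byte compressed flag, 4 bytes big-endian length, then the protobuf payload.
// For OTLP traces, the payload is an ExportTraceServiceRequest message.
func readGRPCMessage(body []byte) (payload, rest []byte, err error) {
	if len(body) < 5 {
		return nil, nil, fmt.Errorf("truncated gRPC frame: %d bytes", len(body))
	}
	if body[0] != 0 {
		return nil, nil, fmt.Errorf("compressed messages not handled in this sketch")
	}
	n := binary.BigEndian.Uint32(body[1:5])
	if uint32(len(body)-5) < n {
		return nil, nil, fmt.Errorf("incomplete gRPC message: want %d bytes, have %d", n, len(body)-5)
	}
	return body[5 : 5+n], body[5+n:], nil
}
```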
For protobuf unmarshalling, the implementation uses easyproto instead of golang/protobuf. easyproto does not require protoc code generation and does not increase binary size with generated code. The combined effect: 25% smaller binary and 36% less CPU usage compared to a standard gRPC-Go implementation.
This matters at scale. At 100,000 spans/sec across a fleet of vtinsert nodes, a 36% CPU reduction translates directly to fewer nodes and lower infrastructure cost. The OTel Collector fleet already running for metrics and logs can ship trace spans to vtinsert with a configuration change: add an OTLP exporter pointed at the vtinsert endpoint.
Querying and Visualization
VictoriaTraces exposes the Jaeger Query Service JSON APIs. This makes it compatible with two visualization frontends:
Grafana connects to VictoriaTraces using the built-in Jaeger datasource. In the Grafana datasource configuration, point the Jaeger URL at the vtselect endpoint. Trace waterfall views, service dependency graphs, and trace search all work through this integration. The Tempo datasource API is also supported experimentally, which enables Grafana Tempo-style trace panels.
Jaeger UI connects directly to the vtselect API endpoint. This provides the classic Jaeger trace search experience: filter by service, operation, tags, duration range, and time window.
For programmatic trace search, the /select/logsql/query HTTP endpoint accepts LogsQL queries against the underlying span data. Since spans are stored as structured entries, LogsQL queries work directly:
# Find all spans for a specific trace
trace_id:"4bf92f3577b34da6a3ce929d0e0e4736"
# Find error spans in the checkout service
service_name:checkout AND http.status_code:500
# Count database spans per service over the last hour
db.system:postgresql AND _time:1h | stats by(service_name) count() as db_span_count
# Find spans with a specific user attribute
user_id:"user_123" AND _time:5m
The LogsQL approach differs from Grafana Tempo's TraceQL. TraceQL is purpose-built for trace queries with structural operators (find traces where span A is a parent of span B). LogsQL treats spans as flat records and searches by field values. For most operational debugging (find this trace, find errors in this service, find slow queries), the difference is minimal. For structural trace analysis (find traces where the database span is a child of a specific API span), TraceQL is more expressive.
Decision Criteria
| Criteria | VictoriaTraces | Grafana Tempo | Jaeger |
|---|---|---|---|
| Storage backend | Local NVMe/SSD. No external dependencies. | S3/GCS (object storage). No local storage. | Elasticsearch, Cassandra, or ClickHouse |
| Storage cost (1 PB) | Local NVMe provisioning. No per-request S3 costs. | ~$23K/month (S3 Standard) | ~$200-500K/month (ES cluster) |
| Index strategy | Bloom filters on all span tokens. Index-assisted search. | No index. Bloom filters for trace_id lookup only. | Full inverted index (ES) or column index (ClickHouse) |
| Query language | Jaeger APIs + LogsQL | TraceQL (attribute filtering, structural queries) | Jaeger APIs |
| Resource efficiency | 0.50 vCPU, 1.15 GiB at 10K spans/sec | 1.35 vCPU, 4.26 GiB at 10K spans/sec | Depends on backend (ES is heavy) |
| OTLP support | Native (custom HTTP/2, 36% less CPU) | Native | Via OTel Collector |
| Grafana integration | Jaeger datasource (+ experimental Tempo API) | Native Tempo datasource | Jaeger datasource |
| Tiered storage | NVMe (hot) to HDD (warm) to S3 via vmbackup (cold) | S3 lifecycle policies (automatic) | ES ILM or Cassandra TTL |
| Cluster architecture | vtinsert/vtselect/vtstorage (linear scaling) | Distributor/Ingester/Querier/Compactor | Collector/Query/Ingester (varies by backend) |
| Operational complexity | Low (single binary or 3-component cluster, local disk) | Medium (stateless query, S3 config, compactor) | High (external DB management) |
| Trace-to-logs | VictoriaLogs correlation via shared trace_id field | Native Loki correlation in Grafana | Manual configuration |
| Community | Growing, part of VictoriaMetrics ecosystem | Large (Grafana Labs ecosystem) | Largest (CNCF graduated, Uber origin) |
| Best for | Resource-constrained, air-gapped, or VM-native stacks | S3-native, TraceQL queries, Grafana-native workflows | Mature ecosystem, flexible backend choice |
Choosing between them:
- TraceQL is required: Grafana Tempo. No other backend supports this query language.
- S3-native tiered storage is required: Grafana Tempo. Automatic lifecycle policies, no local disk management.
- Resource efficiency is the priority: VictoriaTraces. 3.7x less RAM, 2.6x less CPU at the same span rate.
- Already running VictoriaMetrics + VictoriaLogs: VictoriaTraces. Same operational model, same cluster pattern, same tooling.
- Mature ecosystem and community support: Jaeger. CNCF graduated, widest adoption, most integrations.
- Air-gapped or no-external-dependency environments: VictoriaTraces. Local disk only, no S3 or external database.
Capacity Planning
Single-node sizing based on benchmarks:
| Spans/sec | CPU | RAM | Local Disk (7-day retention) |
|---|---|---|---|
| 10,000 | 1 core | 2 GiB | ~50 GiB |
| 30,000 | 3 cores | 4 GiB | ~150 GiB |
| 100,000 | 8 cores | 16 GiB | ~500 GiB |
These estimates extrapolate from the published benchmark (10K spans/sec = 0.50 vCPU, 1.15 GiB RAM, 3.27 GiB disk per day). Actual numbers depend on span size, attribute count, and compression ratio. The disk column starts from ~3.3 GiB per 10K spans/sec per day, multiplies by 7 days of retention, and then roughly doubles the result as provisioning headroom for merges, ingestion buffers, and growth.
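A back-of-the-envelope helper encoding that extrapolation; the 3.3 GiB constant and the headroom multiplier are assumptions to recalibrate against a pilot workload:

```go
package sketch

// estimateDiskGiB extrapolates local-disk needs from the published benchmark
// figure of roughly 3.3 GiB per day at 10,000 spans/sec, then applies a
// headroom multiplier (the sizing table above works out to roughly 2x) to
// cover merges, ingestion buffers, and growth.
func estimateDiskGiB(spansPerSec float64, retentionDays int, headroom float64) float64 {
	const gibPerDayPer10K = 3.3
	raw := spansPerSec / 10_000 * gibPerDayPer10K * float64(retentionDays)
	return raw * headroom
}

// Example: estimateDiskGiB(100_000, 7, 2.0) is about 462 GiB, in line with
// the ~500 GiB row for 100K spans/sec above.
```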
For the cluster version, each component has different bottlenecks:
| Component | Instance | Throughput | Bottleneck |
|---|---|---|---|
| vtinsert | c6g.xlarge (4 vCPU, 8GB) | ~50K spans/sec routing + OTLP parsing | CPU (HTTP/2 handling, protobuf parsing, trace_id hashing) |
| vtstorage | i3en.3xlarge (12 vCPU, 96GB, 2x7.5TB NVMe) | ~100K spans/sec ingestion + storage | Disk I/O (bloom filter writes, columnar compression) |
| vtselect | r6g.xlarge (4 vCPU, 32GB) | ~50 concurrent trace queries | Memory (result merging from fan-out) |
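A rough node-count calculation from those per-node ceilings (~50K spans/sec per vtinsert, ~100K per vtstorage); the +1 spare node for failover headroom is a common provisioning convention, not a VictoriaTraces requirement:

```go
package sketch

import "math"

// clusterNodes estimates how many vtinsert and vtstorage nodes a target span
// rate needs, using the per-node throughput ceilings from the table above and
// adding one spare node of each type for failover headroom.
func clusterNodes(targetSpansPerSec float64) (vtinsert, vtstorage int) {
	vtinsert = int(math.Ceil(targetSpansPerSec/50_000)) + 1
	vtstorage = int(math.Ceil(targetSpansPerSec/100_000)) + 1
	return vtinsert, vtstorage
}

// Example: clusterNodes(300_000) returns 7 vtinsert and 4 vtstorage nodes.
```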
Tiered storage planning: VictoriaTraces supports moving older partitions from NVMe to HDD via partition attach/detach APIs. Hot data (0-3 days) should live on NVMe for sub-second queries. Warm data (3-30 days) can move to HDD where 2-5 second query latency is acceptable. Cold data beyond the retention window gets archived to S3 via vmbackup snapshots.
Network bandwidth: vtinsert distributes spans by trace_id to vtstorage nodes. At high span rates, the internal network between vtinsert and vtstorage carries the full ingestion volume. Budget 10 Gbps between these components. vtselect fan-out queries generate less sustained traffic but can spike during broad queries.
Failure Scenarios
Scenario 1: Trace ID Routing Skew Overloads a Single vtstorage Node
Trigger: A popular service generates a disproportionate number of traces. Because vtinsert routes by trace_id hash, if many trace_ids hash to the same vtstorage node, that node receives more spans than others. This happens most often when a small number of very large traces (thousands of spans each) all route to the same node.
Impact: The overloaded vtstorage node's disk I/O saturates. Ingestion latency increases. Bloom filter construction falls behind, causing queries against that node to slow down. Other vtstorage nodes are underutilized. The overall cluster has capacity, but one node becomes the bottleneck.
Detection: Monitor per-node ingestion rate (vt_rows_ingested_total per vtstorage) and disk I/O utilization. Alert when any single node's ingestion rate exceeds 150% of the cluster average.
Recovery: In the short term, add more vtstorage nodes to spread the hash space. trace_id hashing is deterministic, so adding nodes redistributes future traces. Existing data stays on the original node. For persistent skew caused by a single service generating most of the trace volume, consider adding a trace_id salt or adjusting the OTel Collector sampling rate for that service to reduce its trace volume. The root fix is ensuring the trace_id distribution is reasonably uniform, which it should be if trace_ids are generated by standard OTel SDKs (128-bit random).
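The 150%-of-average rule can be prototyped outside the alerting stack with a few lines of Go; in practice the per-node rates would come from vt_rows_ingested_total scraped by your monitoring system, and the threshold is the scenario's suggested 1.5:

```go
package sketch

// skewedNodes returns the vtstorage nodes whose ingestion rate exceeds the
// given multiple of the cluster average (1.5 matches the alerting rule above).
// Rates are spans/sec per node, keyed by node address.
func skewedNodes(ratePerNode map[string]float64, threshold float64) []string {
	if len(ratePerNode) == 0 {
		return nil
	}
	var total float64
	for _, r := range ratePerNode {
		total += r
	}
	avg := total / float64(len(ratePerNode))

	var hot []string
	for node, r := range ratePerNode {
		if r > avg*threshold {
			hot = append(hot, node)
		}
	}
	return hot
}
```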
Scenario 2: Query Timeout on Broad Time Range Trace Search
Trigger: An operator searches for all traces with http.status_code:500 across a 7-day time range. This query touches every daily partition on every vtstorage node. Bloom filters help skip blocks that definitely do not contain 500, but the token "500" appears in many blocks (it is a common value in http.status_code fields). The query ends up scanning a large fraction of the data.
Impact: vtselect and vtstorage CPU spike. The query takes 30+ seconds and may time out. If multiple operators run similar broad queries simultaneously, vtstorage disk I/O saturates, slowing down all concurrent queries including targeted trace_id lookups.
Detection: Monitor vt_slow_queries_total and vtselect query latency. Alert when query p99 exceeds 10 seconds.
Recovery: Narrow the query. Add service_name or time-range constraints to reduce the scan scope. At the platform level, set default query time-range limits on vtselect (e.g., a maximum of 3 days for non-trace_id queries) and keep the metadata_lookbehind window at 1-3 days to match operational debugging patterns. For analytical queries that genuinely need broad scans, route them to a dedicated vtselect instance with separate resource allocation so they do not affect operational query performance.
Scenario 3: vtinsert Crashes During OTLP Burst, Spans Lost
Trigger: A traffic spike causes all services to emit 5x normal trace volume simultaneously. vtinsert instances hit their memory limit from buffering incoming OTLP requests while waiting for vtstorage acknowledgments. One or more vtinsert instances crash with OOM.
Impact: The OTel Collectors detect the vtinsert failure and begin buffering locally (if configured with retry). Spans generated during the vtinsert downtime are either buffered in the OTel Collector, buffered in Kafka (if Kafka sits between OTel and vtinsert), or dropped if buffer capacity is exceeded. After vtinsert restarts, the backlog arrives as a burst, potentially causing another OOM cycle.
Detection: Monitor vtinsert memory usage and OTel Collector export error rates. Alert when vtinsert RSS exceeds 80% of available memory or when OTel Collector otelcol_exporter_send_failed_spans spikes.
Recovery: Scale vtinsert horizontally before the next spike. Set memory limits on vtinsert and configure the OTel Collector exporter with sending_queue and retry_on_failure (exponential backoff). If Kafka sits in the ingestion path, it acts as a natural buffer and prevents OTel Collector data loss. After recovery, the Kafka consumer group on vtinsert drains the backlog at a controlled rate. For the direct OTLP path (no Kafka), bound the OTel Collector exporter's sending_queue size so backpressure is applied instead of allowing unbounded memory growth.
Pros
- • Uses 3.7x less RAM and 2.6x less CPU than Grafana Tempo in benchmarks
- • No external storage dependencies. No Elasticsearch, Cassandra, or S3 required for production.
- • Same operational model as VictoriaMetrics and VictoriaLogs (vtinsert/vtselect/vtstorage)
- • OTLP-native ingestion with custom HTTP/2 server (25% smaller binary, 36% less CPU than gRPC-Go)
- • Bloom filter indexed search on all span fields without manual index configuration
- • Cluster mode with linear horizontal scaling. Each component scales independently.
- • Compatible with Grafana (via Jaeger datasource) and Jaeger UI for visualization
Cons
- • Younger project than Jaeger and Grafana Tempo, smaller community
- • No S3-native storage. Uses local disk. Cold/archive requires vmbackup to S3.
- • No TraceQL equivalent. Querying uses Jaeger APIs and LogsQL, which is less expressive for trace-specific patterns
- • Grafana integration via Jaeger datasource plugin, not a native datasource
- • Tempo datasource API support is still experimental
- • Fewer managed service options and third-party integrations compared to Tempo and Jaeger
When to use
- • Already running VictoriaMetrics and VictoriaLogs and want the same operational model for traces
- • Resource efficiency is a priority and the infrastructure budget is tight
- • Trace volume is high and Tempo's RAM usage (roughly 3.7x higher in benchmarks) is a concern
- • No external storage dependencies is a hard requirement (air-gapped environments, strict compliance)
- • Need trace storage with bloom filter indexed search on all span attributes
When NOT to use
- • Need TraceQL for advanced trace queries (Grafana Tempo is the only option for this)
- • S3-native storage with automatic lifecycle tiering is required
- • Deep Grafana ecosystem integration is a priority (Tempo has native support)
- • Need a mature, battle-tested tracing backend with a large community (Jaeger, Tempo)
- • Team is already invested in the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir)
Key Points
- • VictoriaTraces is built on the VictoriaLogs storage engine. Trace spans are transformed into structured key-value entries where service name and span name become stream fields (indexed), timestamps become the time field, and all other span attributes become ordinary fields. This reuse means the same bloom filter indexing, columnar compression, and daily partition model that powers VictoriaLogs also powers trace storage.
- • In benchmarks at 10,000 spans/sec, VictoriaTraces used 0.50 vCPU (vs Tempo's 1.35 vCPU), 1.15 GiB RAM (vs Tempo's 4.26 GiB), and 3.27 GiB disk (vs ClickHouse's 5.86 GiB). At 30,000 spans/sec, average CPU usage was 1.2 cores with peaks at 2.6 cores. The 3.7x RAM advantage over Tempo comes from avoiding in-memory buffering of entire trace blocks before flushing to storage.
- • OTLP ingestion uses a custom HTTP/2 server instead of the standard gRPC-Go library. This produces a 25% smaller binary and 36% less CPU usage compared to gRPC-Go implementations. The server handles OTLP/gRPC on port 4317, OTLP/HTTP with protobuf, and OTLP/HTTP with JSON payloads. Protobuf unmarshalling uses easyproto instead of golang/protobuf, avoiding protoc code generation overhead.
- • The cluster mode distributes spans by trace_id to vtstorage nodes. This ensures all spans from a single trace land on the same vtstorage node, making trace reconstruction a local operation rather than a cross-node fan-out. vtinsert handles the routing, vtstorage handles storage and local query execution, vtselect handles query fan-out and result merging.
- • Querying works through Jaeger Query Service JSON APIs, making VictoriaTraces compatible with both Grafana (via Jaeger datasource) and the standalone Jaeger UI. For programmatic trace search, the /select/logsql/query HTTP endpoint accepts LogsQL queries against the underlying span data. This enables queries like trace_id:abc123 or service_name:checkout AND http.status_code:500.
- • Storage uses daily UTC partitions, identical to VictoriaLogs. Each day's trace data lives in a partition subdirectory. Retention is enforced by deleting entire partition directories, making cleanup instant regardless of data volume. Multi-tier storage is supported: hot data on NVMe, warm data migrated to HDD via partition attach/detach APIs, cold/archive via vmbackup snapshots to S3.
- • Service dependency graphs are available through the experimental Jaeger service dependencies API (enabled with --servicegraph.enableTask=true). This generates a directed graph of service-to-service calls from trace data, useful for understanding runtime architecture and spotting unexpected dependencies.
Common Mistakes
- ✗ Expecting TraceQL query capabilities. VictoriaTraces exposes Jaeger APIs and LogsQL, not TraceQL. For trace queries that need attribute-based filtering with duration and structural conditions, the query patterns are different from what Tempo users expect. Plan query workflows around Jaeger UI search and LogsQL before migrating.
- ✗ Not configuring the -retentionPeriod flag. Default retention is 7 days. Production deployments handling compliance or post-incident analysis typically need 30-90 days. Set this at startup because changing it later only affects new data.
- ✗ Running vtstorage on HDD for hot data. Bloom filter lookups and block scans are I/O-intensive. NVMe or SSD is required for hot trace data to maintain sub-second query latency. HDD is acceptable for warm tier data accessed less frequently.
- ✗ Not setting -otlpGRPCListenAddr when using OTel Collector. OTLP/gRPC ingestion requires this flag to be explicitly set on single-node VictoriaTraces or vtinsert. Without it, the gRPC endpoint is not started and the OTel Collector will fail to connect.
- ✗ Ignoring the metadata_lookbehind window for trace queries. Trace ID lookups scan partitions within a configurable lookbehind window. Setting this too wide (e.g., 30 days) makes trace_id queries slow because every partition gets scanned. Set it to match the typical age of traces being investigated (1-3 days for operational debugging).