Distributed Tracing
Why It Exists
Metrics say something broke. Logs say what happened inside a single service. Neither one answers the question that actually needs answering during an incident: where did the request spend its time, and which service caused the failure?
When a request touches an API gateway, three backends, two databases, and a message queue, something needs to reconstruct the full path. That's distributed tracing. It provides the entire request lifecycle as a directed acyclic graph of operations, exposing latency bottlenecks, fan-out patterns, and failure propagation. It's the only observability signal that captures causality across service boundaries. Without it, everyone's guessing.
How It Works
Trace Anatomy
A trace represents the full journey of a single request. It's made up of spans. Each span captures one unit of work: an HTTP call, a database query, a cache lookup, a queue publish. Every span records start time, duration, status, and key-value attributes. Spans link together through parent-child relationships. The root span is the entry point, and child spans represent downstream operations.
The data model matters. If spans don't carry the right attributes, traces become noise.
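As a concrete illustration of that data model, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python API. The service name, span name, and attribute keys are illustrative choices, not prescribed by the spec:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")  # instrumentation scope name is illustrative

def charge_card(order_id: str, amount_cents: int) -> None:
    # One span per unit of work. Attributes are what make the span useful
    # later: without order.id here, the trace is just an anonymous timing.
    with tracer.start_as_current_span("payment.charge") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        try:
            ...  # call the payment provider (omitted)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```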
OpenTelemetry Architecture
The OTel SDK lives in each service and creates spans through auto-instrumentation (HTTP/gRPC libraries, database drivers) or manual instrumentation for custom business logic. Spans get batched and shipped to the OTel Collector, a standalone process that receives, processes (filtering, sampling, enrichment), and exports telemetry to the backend of choice. The Collector decouples instrumentation from storage. Switching from Jaeger to Tempo requires zero application code changes. That flexibility matters more than most teams expect when the tracing bill triples overnight.
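A minimal sketch of the SDK-to-Collector wiring in Python, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages; the Collector endpoint and service name are placeholder values:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# BatchSpanProcessor queues spans and exports in the background, so a slow
# or unavailable Collector never adds latency to the request path.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
```

Because the exporter only speaks OTLP to the Collector, swapping the backend (Jaeger, Tempo, a SaaS vendor) is a Collector config change, not an application change.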
Context Propagation
The traceparent header (W3C Trace Context) carries the trace ID and parent span ID across network boundaries. For HTTP calls, OTel instrumentation handles this automatically. For message queues (Kafka, SQS, RabbitMQ), the trace context must be manually injected into message headers and extracted on the consumer side.
This is the most commonly missed propagation point. I've seen teams run for months with broken traces at every async boundary and not notice because the individual spans "looked fine."
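For HTTP, auto-instrumentation does this for you; the sketch below just makes visible what ends up on the wire, using the generic propagation API. The service name and downstream URL are illustrative:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("api-gateway")

with tracer.start_as_current_span("GET /inventory"):
    headers: dict[str, str] = {}
    inject(headers)  # fills in W3C headers from the current span context
    # headers["traceparent"] now looks like:
    #   00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    #   (version - trace-id - parent span-id - trace flags)
    requests.get("http://inventory-service/stock", headers=headers)
```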
Sampling Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| Head-based | Decide at entry point (1% random) | Simple, but misses rare errors |
| Tail-based | Decide after trace completes | Captures errors/slow, but requires buffering all spans |
| Adaptive | Adjust rate based on traffic volume | Balances cost and coverage |
| Always-on for errors | Sample 100% of errored traces | Guarantees error visibility, moderate cost |
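Head-based sampling is a one-line SDK configuration. A minimal sketch with the Python SDK, using a 1% rate as in the table:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based: the entry-point service keeps ~1% of traces, chosen by trace ID.
# ParentBased makes downstream services honor the decision carried in the
# incoming traceparent flags instead of sampling independently.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.01)))
trace.set_tracer_provider(provider)
```

Tail-based and adaptive strategies live in the Collector rather than the SDK, which is why they can look at the whole trace before deciding.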
Production Considerations
- Tail-based sampling: Deploy the OTel Collector in a stateful mode with the `tail_sampling` processor. Buffer spans for 30-60s, then decide what to keep: all traces with errors, latency above p99, or specific attributes. This captures the traces that actually matter while cutting storage by 90%+.
- Span limits: Set `max_attributes_per_span` and `max_events_per_span`. Without these, runaway instrumentation creates oversized spans that choke the collector pipeline. Skipping this step guarantees a hard lesson later.
- Service maps: Trace backends like Jaeger and Tempo can auto-generate service dependency graphs from trace data. These maps are genuinely useful for understanding runtime architecture and spotting unexpected dependencies nobody documented.
- Trace-to-logs correlation: Emit `trace_id` in structured logs and configure Grafana to link from a trace span directly to the corresponding log lines. This eliminates manual correlation during incidents. A small investment that pays back constantly.
- Queue instrumentation: For Kafka, inject `traceparent` into record headers on produce and extract on consume, creating a new child span linked to the producer span (see the sketch after this list). Without this, traces break at every async boundary. No exceptions.
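A minimal sketch of the inject-on-produce / extract-on-consume flow using the generic OTel propagation API. The kafka-python client, the orders topic, and the handle() callback are illustrative assumptions; the opentelemetry-instrumentation-kafka-python package can do the same thing automatically:

```python
from kafka import KafkaProducer, KafkaConsumer  # illustrative client choice
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("order-pipeline")

def publish_order(producer: KafkaProducer, payload: bytes) -> None:
    with tracer.start_as_current_span("orders publish", kind=trace.SpanKind.PRODUCER):
        carrier: dict[str, str] = {}
        inject(carrier)  # writes traceparent/tracestate for the current span
        headers = [(key, value.encode("utf-8")) for key, value in carrier.items()]
        producer.send("orders", value=payload, headers=headers)

def consume_orders(consumer: KafkaConsumer) -> None:
    for record in consumer:
        # Rebuild the producer's context from the record headers, then start
        # the consumer span as its child so the trace stays connected.
        carrier = {key: value.decode("utf-8") for key, value in (record.headers or [])}
        with tracer.start_as_current_span(
            "orders process", context=extract(carrier), kind=trace.SpanKind.CONSUMER
        ):
            handle(record.value)

def handle(payload: bytes) -> None:  # placeholder for real processing
    ...
```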
Failure Scenarios
Scenario 1: OTel Collector Pipeline Backpressure Crashes Services. The OTel Collector's export queue to Jaeger fills up when the Jaeger backend hits elevated write latency (say, a Cassandra compaction storm). With sending_queue at the default 5,000 spans, the Collector starts returning RESOURCE_EXHAUSTED to application SDKs. Applications using synchronous span export block on the gRPC call, adding 200-500ms latency to every request. The payment service p99 spikes from 150ms to 800ms. Detection: Monitor otelcol_exporter_queue_size and otelcol_exporter_send_failed_spans. Alert when queue utilization exceeds 70%. Recovery: Increase sending_queue size to 50,000 and set retry_on_failure with exponential backoff. Switch application SDK export to asynchronous BatchSpanProcessor with a 30s timeout. This decouples application latency from Collector health, which is where it should have been from day one.
Scenario 2: Trace Context Propagation Break at Async Boundary. A team introduces a new Kafka-based event pipeline but doesn't inject traceparent into Kafka headers. All traces terminate at the producer, and the 8 downstream consumer services appear as disconnected root spans. Service maps show phantom services with no upstream dependencies, and latency analysis for end-to-end order fulfillment becomes impossible. The worst part: this goes undetected for 3 months because traces "look normal." They're just incomplete. Detection: Track trace_depth distribution and alert when a service expected to be downstream consistently appears as root spans. Monitor traces_without_parent_ratio per service. Recovery: Add the OTel Kafka instrumentation library for producer/consumer. Validate trace connectivity in integration tests by asserting parentSpanId != null for downstream services.
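One way to catch this in CI, in the spirit of the recovery note above, is a connectivity test against an in-memory exporter that fails whenever the "consumer" span comes out as a new root. The test name and span names are illustrative:

```python
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_consumer_span_is_linked_to_producer():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("connectivity-test")

    carrier: dict[str, str] = {}  # stands in for Kafka record headers
    with tracer.start_as_current_span("producer publish"):
        inject(carrier)

    with tracer.start_as_current_span("consumer process", context=extract(carrier)):
        pass

    producer_span, consumer_span = exporter.get_finished_spans()
    # A broken boundary shows up as parent == None and a brand-new trace ID.
    assert consumer_span.parent is not None
    assert consumer_span.context.trace_id == producer_span.context.trace_id
```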
Scenario 3: Tail-Based Sampling Data Loss During Collector Restart. The OTel Collector running tail-based sampling buffers spans in memory for 60 seconds before making sampling decisions. During a rolling restart (Kubernetes Deployment update), buffered spans for in-flight traces get dropped. Traces that started before the restart and complete after appear fragmented, missing 10-40% of their spans. During an incident, engineers see partial traces and can't determine root cause. Detection: Monitor otelcol_processor_tail_sampling_count_traces_dropped and compare pre/post restart trace completeness rates. Recovery: Deploy the Collector as a StatefulSet with persistent volume-backed span buffer. Use the groupbytrace processor before tail_sampling to make sure complete traces get evaluated together. Set up graceful shutdown with a drain period that matches the sampling window.
Capacity Planning
Storage estimation formula: daily_trace_storage = total_rps * sampling_rate * avg_spans_per_trace * avg_span_size * 86,400. For a platform handling 50K RPS with 1% head-based sampling, 8 spans/trace, and 1 KB/span: 50,000 * 0.01 * 8 * 1,024 * 86,400 = ~350 GB/day.
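The same estimate as a small helper, assuming decimal GB (10^9 bytes); the function and argument names are mine, not from any library:

```python
def daily_trace_storage_gb(total_rps: float, sampling_rate: float,
                           avg_spans_per_trace: float, avg_span_bytes: float) -> float:
    """Estimated trace storage per day, in decimal GB (10**9 bytes)."""
    seconds_per_day = 86_400
    daily_bytes = (total_rps * sampling_rate * avg_spans_per_trace
                   * avg_span_bytes * seconds_per_day)
    return daily_bytes / 1e9

# 50K RPS, 1% head-based sampling, 8 spans/trace, 1 KB/span -> ~354 GB/day
print(round(daily_trace_storage_gb(50_000, 0.01, 8, 1_024)))
```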
| Scale Tier | Total RPS | Sampling Rate | Spans/Day | Storage/Day | Reference |
|---|---|---|---|---|---|
| Startup | 1K | 10% | 7M | 7 GB | Early-stage API |
| Mid-scale | 20K | 1% | 140M | 140 GB | E-commerce platform |
| Large-scale | 200K | 0.1% + tail | 1.4B | 1.4 TB | Uber (~14B spans/day) |
| Hyper-scale | 1M+ | 0.01% + tail | 7B+ | 7 TB+ | Google-scale (Dapper) |
Key thresholds:
- Jaeger with an Elasticsearch backend gets expensive past 500 GB/day; migrate to Grafana Tempo with S3/GCS for a 5-10x cost reduction.
- Tail-based sampling Collectors need roughly 4 GB RAM per 100K spans/second in the decision buffer.
- Keep OTel Collector CPU utilization under 70%; above that, span drops increase nonlinearly, losing the data that matters.
- Keep max trace duration under 5 minutes for tail-based sampling; longer traces eat proportionally more buffer memory.
- Budget 1 Collector replica per 50K spans/second of throughput.
Architecture Decision Record
Decision: Choosing a Distributed Tracing Backend
| Criteria (Weight) | Jaeger | Grafana Tempo | Datadog APM | AWS X-Ray |
|---|---|---|---|---|
| Storage cost (25%) | 2, Elasticsearch/Cassandra | 5, Object storage (S3/GCS) | 2, $0.30/M analyzed spans | 3, Managed, pay-per-scan |
| Query capability (20%) | 3, Tag-based search | 4, TraceQL, exemplars | 5, Service maps, error tracking, flame graphs | 2, Limited query language |
| Operational complexity (15%) | 3, Collector + query + storage | 4, Stateless query, simple ops | 5, Fully managed | 5, Fully managed |
| OTel compatibility (15%) | 5, Native OTLP support | 5, Native OTLP support | 4, OTLP support, proprietary agent preferred | 3, AWS-specific SDK preferred |
| Trace-to-logs/metrics (15%) | 3, Manual configuration | 5, Native Grafana links to Loki/Mimir | 5, Unified platform | 3, CloudWatch integration |
| Multi-cloud support (10%) | 5, Self-hosted anywhere | 5, Self-hosted anywhere | 4, SaaS, agent on any cloud | 1, AWS only |
When to choose what:
- Team < 20, already using Grafana: Tempo. Zero-cost storage on S3, TraceQL for trace search, native links to Loki logs and Mimir metrics.
- Team 20-100, need deep analysis: Datadog APM. Service maps, error tracking, latency breakdowns out of the box. Worth the cost when faster incident resolution is a priority (it should be).
- AWS-only shop, minimal tracing needs: X-Ray. Zero ops, native Lambda/API Gateway integration. Accept the limited query capabilities and move on.
- Large platform, multi-cloud: Jaeger or Tempo with an OTel Collector fleet. Vendor-neutral, self-hosted, full data control. Tempo wins on cost; Jaeger has a more mature query UI.
- Compliance-sensitive (data residency): Self-hosted Tempo or Jaeger. Trace data can contain PII in span attributes, and keeping it in-house is the only way to stay compliant.
Key Points
- Tracks a single request across multiple services, showing the complete call chain
- Traces are made of spans. Each span is one unit of work with timing and metadata
- OpenTelemetry is the instrumentation standard. Vendor-neutral, covers metrics, logs, and traces
- Head-based vs tail-based sampling. Tail-based is better at catching errors and slow requests
- Trace context propagation via W3C Trace Context headers (traceparent, tracestate) is the standard
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Jaeger | Open Source | Distributed tracing, CNCF graduated, mature UI | Medium-Enterprise |
| Grafana Tempo | Open Source | Object storage backend, cost-efficient, TraceQL | Medium-Enterprise |
| OpenTelemetry | Open Source | Vendor-neutral instrumentation SDK and collector | Small-Enterprise |
| Datadog APM | Commercial | Unified observability, service maps, error tracking | Small-Enterprise |
Common Mistakes
- Tracing 100% of requests in production. Storage and processing costs become crushing at scale
- Not propagating trace context through message queues. Async calls silently break the trace chain
- Only instrumenting HTTP calls. Database queries, cache lookups, and queue operations need spans too
- Ignoring sampling configuration. Default head-based sampling misses rare but important errors
- Not correlating traces with logs and metrics. All three pillars should be linked by trace ID