Distributed Tracing
Why It Exists
Metrics say something broke. Logs say what happened inside a single service. Neither one answers the question that actually needs answering during an incident: where did the request spend its time, and which service caused the failure?
When a request touches an API gateway, three backends, two databases, and a message queue, something needs to reconstruct the full path. That's distributed tracing. It provides the entire request lifecycle as a directed acyclic graph of operations, exposing latency bottlenecks, fan-out patterns, and failure propagation. It's the only observability signal that captures causality across service boundaries. Without it, everyone's guessing.
How It Works
Trace Anatomy
A trace represents the full journey of a single request. It's made up of spans. Each span captures one unit of work: an HTTP call, a database query, a cache lookup, a queue publish. Every span records start time, duration, status, and key-value attributes. Spans link together through parent-child relationships. The root span is the entry point, and child spans represent downstream operations.
The data model matters. If spans don't carry the right attributes, traces become noise.
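As a concrete illustration of that data model, here is a minimal sketch of manual instrumentation with the OpenTelemetry Python API. The service name, span name, and attribute keys are illustrative choices, not prescribed by the spec:

```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer("checkout-service")  # instrumentation scope name is illustrative

def charge_card(order_id: str, amount_cents: int) -> None:
    # One span per unit of work. Attributes are what make the span useful
    # later: without order.id here, the trace is just an anonymous timing.
    with tracer.start_as_current_span("payment.charge") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("payment.amount_cents", amount_cents)
        try:
            ...  # call the payment provider (omitted)
        except Exception as exc:
            span.record_exception(exc)
            span.set_status(Status(StatusCode.ERROR))
            raise
```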
OpenTelemetry Architecture
The OTel SDK lives in each service and creates spans through auto-instrumentation (HTTP/gRPC libraries, database drivers) or manual instrumentation for custom business logic. Spans get batched and shipped to the OTel Collector, a standalone process that receives, processes (filtering, sampling, enrichment), and exports telemetry to the backend of choice. The Collector decouples instrumentation from storage. Switching from Jaeger to Tempo requires zero application code changes. That flexibility matters more than most teams expect when the tracing bill triples overnight.
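A minimal sketch of the SDK-to-Collector wiring in Python, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages; the Collector endpoint and service name are placeholder values:

```python
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# BatchSpanProcessor queues spans and exports in the background, so a slow
# or unavailable Collector never adds latency to the request path.
provider = TracerProvider(
    resource=Resource.create({"service.name": "checkout-service"})
)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
```

Because the exporter only speaks OTLP to the Collector, swapping the backend (Jaeger, Tempo, a SaaS vendor) is a Collector config change, not an application change.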
Context Propagation
The traceparent header (W3C Trace Context) carries the trace ID and parent span ID across network boundaries. For HTTP calls, OTel instrumentation handles this automatically. For message queues (Kafka, SQS, RabbitMQ), the trace context must be manually injected into message headers and extracted on the consumer side.
This is the most commonly missed propagation point. I've seen teams run for months with broken traces at every async boundary and not notice because the individual spans "looked fine."
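For HTTP, auto-instrumentation does this for you; the sketch below just makes visible what ends up on the wire, using the generic propagation API. The service name and downstream URL are illustrative:

```python
import requests
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("api-gateway")

with tracer.start_as_current_span("GET /inventory"):
    headers: dict[str, str] = {}
    inject(headers)  # fills in W3C headers from the current span context
    # headers["traceparent"] now looks like:
    #   00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
    #   (version - trace-id - parent span-id - trace flags)
    requests.get("http://inventory-service/stock", headers=headers)
```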
Sampling Strategies
| Strategy | How It Works | Trade-off |
|---|---|---|
| Head-based | Decide at entry point (1% random) | Simple, but misses rare errors |
| Tail-based | Decide after trace completes | Captures errors/slow, but requires buffering all spans |
| Adaptive | Adjust rate based on traffic volume | Balances cost and coverage |
| Always-on for errors | Sample 100% of errored traces | Guarantees error visibility, moderate cost |
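Head-based sampling is a one-line SDK configuration. A minimal sketch with the Python SDK, using a 1% rate as in the table:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Head-based: the entry-point service keeps ~1% of traces, chosen by trace ID.
# ParentBased makes downstream services honor the decision carried in the
# incoming traceparent flags instead of sampling independently.
provider = TracerProvider(sampler=ParentBased(root=TraceIdRatioBased(0.01)))
trace.set_tracer_provider(provider)
```

Tail-based and adaptive strategies live in the Collector rather than the SDK, which is why they can look at the whole trace before deciding.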
Production Considerations
- Tail-based sampling: Deploy the OTel Collector in a stateful mode with the `tail_sampling` processor. Buffer spans for 30-60s, then decide what to keep: all traces with errors, latency above p99, or specific attributes. This captures the traces that actually matter while cutting storage by 90%+.
- Span limits: Set `max_attributes_per_span` and `max_events_per_span`. Without these, runaway instrumentation creates oversized spans that choke the collector pipeline. Skipping this step guarantees a hard lesson later.
- Service maps: Trace backends like Jaeger and Tempo can auto-generate service dependency graphs from trace data. These maps are genuinely useful for understanding runtime architecture and spotting unexpected dependencies nobody documented.
- Trace-to-logs correlation: Emit `trace_id` in structured logs and configure Grafana to link from a trace span directly to the corresponding log lines. This eliminates manual correlation during incidents. A small investment that pays back constantly.
- Queue instrumentation: For Kafka, inject `traceparent` into record headers on produce and extract on consume, creating a new child span linked to the producer span (see the sketch after this list). Without this, traces break at every async boundary. No exceptions.
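A minimal sketch of the inject-on-produce / extract-on-consume flow using the generic OTel propagation API. The kafka-python client, the orders topic, and the handle() callback are illustrative assumptions; the opentelemetry-instrumentation-kafka-python package can do the same thing automatically:

```python
from kafka import KafkaProducer, KafkaConsumer  # illustrative client choice
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer("order-pipeline")

def publish_order(producer: KafkaProducer, payload: bytes) -> None:
    with tracer.start_as_current_span("orders publish", kind=trace.SpanKind.PRODUCER):
        carrier: dict[str, str] = {}
        inject(carrier)  # writes traceparent/tracestate for the current span
        headers = [(key, value.encode("utf-8")) for key, value in carrier.items()]
        producer.send("orders", value=payload, headers=headers)

def consume_orders(consumer: KafkaConsumer) -> None:
    for record in consumer:
        # Rebuild the producer's context from the record headers, then start
        # the consumer span as its child so the trace stays connected.
        carrier = {key: value.decode("utf-8") for key, value in (record.headers or [])}
        with tracer.start_as_current_span(
            "orders process", context=extract(carrier), kind=trace.SpanKind.CONSUMER
        ):
            handle(record.value)

def handle(payload: bytes) -> None:  # placeholder for real processing
    ...
```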
Failure Scenarios
Scenario 1: OTel Collector Pipeline Backpressure Crashes Services. The OTel Collector's export queue to Jaeger fills up when the Jaeger backend hits elevated write latency (say, a Cassandra compaction storm). With sending_queue at the default 5,000 spans, the Collector starts returning RESOURCE_EXHAUSTED to application SDKs. Applications using synchronous span export block on the gRPC call, adding 200-500ms latency to every request. The payment service p99 spikes from 150ms to 800ms. Detection: Monitor otelcol_exporter_queue_size and otelcol_exporter_send_failed_spans. Alert when queue utilization exceeds 70%. Recovery: Increase sending_queue size to 50,000 and set retry_on_failure with exponential backoff. Switch application SDK export to asynchronous BatchSpanProcessor with a 30s timeout. This decouples application latency from Collector health, which is where it should have been from day one.
Scenario 2: Trace Context Propagation Break at Async Boundary. A team introduces a new Kafka-based event pipeline but doesn't inject traceparent into Kafka headers. All traces terminate at the producer, and the 8 downstream consumer services appear as disconnected root spans. Service maps show phantom services with no upstream dependencies, and latency analysis for end-to-end order fulfillment becomes impossible. The worst part: this goes undetected for 3 months because traces "look normal." They're just incomplete. Detection: Track trace_depth distribution and alert when a service expected to be downstream consistently appears as root spans. Monitor traces_without_parent_ratio per service. Recovery: Add the OTel Kafka instrumentation library for producer/consumer. Validate trace connectivity in integration tests by asserting parentSpanId != null for downstream services.
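One way to catch this in CI, in the spirit of the recovery note above, is a connectivity test against an in-memory exporter that fails whenever the "consumer" span comes out as a new root. The test name and span names are illustrative:

```python
from opentelemetry.propagate import inject, extract
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

def test_consumer_span_is_linked_to_producer():
    exporter = InMemorySpanExporter()
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(exporter))
    tracer = provider.get_tracer("connectivity-test")

    carrier: dict[str, str] = {}  # stands in for Kafka record headers
    with tracer.start_as_current_span("producer publish"):
        inject(carrier)

    with tracer.start_as_current_span("consumer process", context=extract(carrier)):
        pass

    producer_span, consumer_span = exporter.get_finished_spans()
    # A broken boundary shows up as parent == None and a brand-new trace ID.
    assert consumer_span.parent is not None
    assert consumer_span.context.trace_id == producer_span.context.trace_id
```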
Scenario 3: Tail-Based Sampling Data Loss During Collector Restart. The OTel Collector running tail-based sampling buffers spans in memory for 60 seconds before making sampling decisions. During a rolling restart (Kubernetes Deployment update), buffered spans for in-flight traces get dropped. Traces that started before the restart and complete after appear fragmented, missing 10-40% of their spans. During an incident, engineers see partial traces and can't determine root cause. Detection: Monitor otelcol_processor_tail_sampling_count_traces_dropped and compare pre/post restart trace completeness rates. Recovery: Deploy the Collector as a StatefulSet with persistent volume-backed span buffer. Use the groupbytrace processor before tail_sampling to make sure complete traces get evaluated together. Set up graceful shutdown with a drain period that matches the sampling window.
Capacity Planning
Storage estimation formula: daily_trace_storage = total_rps * sampling_rate * avg_spans_per_trace * avg_span_size * 86,400. For a platform handling 50K RPS with 1% head-based sampling, 8 spans/trace, and 1 KB/span: 50,000 * 0.01 * 8 * 1,024 * 86,400 = ~350 GB/day.
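The same estimate as a small helper, assuming decimal GB (10^9 bytes); the function and argument names are mine, not from any library:

```python
def daily_trace_storage_gb(total_rps: float, sampling_rate: float,
                           avg_spans_per_trace: float, avg_span_bytes: float) -> float:
    """Estimated trace storage per day, in decimal GB (10**9 bytes)."""
    seconds_per_day = 86_400
    daily_bytes = (total_rps * sampling_rate * avg_spans_per_trace
                   * avg_span_bytes * seconds_per_day)
    return daily_bytes / 1e9

# 50K RPS, 1% head-based sampling, 8 spans/trace, 1 KB/span -> ~354 GB/day
print(round(daily_trace_storage_gb(50_000, 0.01, 8, 1_024)))
```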
| Scale Tier | Total RPS | Sampling Rate | Spans/Day | Storage/Day | Reference |
|---|---|---|---|---|---|
| Startup | 1K | 10% | 7M | 7 GB | Early-stage API |
| Mid-scale | 20K | 1% | 140M | 140 GB | E-commerce platform |
| Large-scale | 200K | 0.1% + tail | 1.4B | 1.4 TB | Uber (~14B spans/day) |
| Hyper-scale | 1M+ | 0.01% + tail | 7B+ | 7 TB+ | Google-scale (Dapper) |
Key thresholds:
- Jaeger with an Elasticsearch backend gets expensive past 500 GB/day; migrate to Grafana Tempo with S3/GCS for a 5-10x cost reduction.
- Tail-based sampling Collectors need roughly 4 GB RAM per 100K spans/second in the decision buffer.
- Keep OTel Collector CPU utilization under 70%; above that, span drops increase nonlinearly, losing the data that matters.
- Keep max trace duration under 5 minutes for tail-based sampling; longer traces eat proportionally more buffer memory.
- Budget 1 Collector replica per 50K spans/second of throughput.
Architecture Decision Record
Decision: Choosing a Distributed Tracing Backend
| Criteria (Weight) | Jaeger | Grafana Tempo | Datadog APM | AWS X-Ray |
|---|---|---|---|---|
| Storage cost (25%) | 2, Elasticsearch/Cassandra | 5, Object storage (S3/GCS) | 2, $0.30/M analyzed spans | 3, Managed, pay-per-scan |
| Query capability (20%) | 3, Tag-based search | 4, TraceQL, exemplars | 5, Service maps, error tracking, flame graphs | 2, Limited query language |
| Operational complexity (15%) | 3, Collector + query + storage | 4, Stateless query, simple ops | 5, Fully managed | 5, Fully managed |
| OTel compatibility (15%) | 5, Native OTLP support | 5, Native OTLP support | 4, OTLP support, proprietary agent preferred | 3, AWS-specific SDK preferred |
| Trace-to-logs/metrics (15%) | 3, Manual configuration | 5, Native Grafana links to Loki/Mimir | 5, Unified platform | 3, CloudWatch integration |
| Multi-cloud support (10%) | 5, Self-hosted anywhere | 5, Self-hosted anywhere | 4, SaaS, agent on any cloud | 1, AWS only |
When to choose what:
- Team < 20, already using Grafana: Tempo. Zero-cost storage on S3, TraceQL for trace search, native links to Loki logs and Mimir metrics.
- Team 20-100, need deep analysis: Datadog APM. Service maps, error tracking, latency breakdowns out of the box. Worth the cost when faster incident resolution is a priority (it should be).
- AWS-only shop, minimal tracing needs: X-Ray. Zero ops, native Lambda/API Gateway integration. Accept the limited query capabilities and move on.
- Large platform, multi-cloud: Jaeger or Tempo with an OTel Collector fleet. Vendor-neutral, self-hosted, full data control. Tempo wins on cost; Jaeger has a more mature query UI.
- Compliance-sensitive (data residency): Self-hosted Tempo or Jaeger. Trace data can contain PII in span attributes, and keeping it in-house is the only way to stay compliant.
Key Points
- Tracks a single request across multiple services, showing the complete call chain
- Traces are made of spans. Each span is one unit of work with timing and metadata
- OpenTelemetry is the instrumentation standard. Vendor-neutral, covers metrics, logs, and traces
- Head-based vs tail-based sampling. Tail-based is better at catching errors and slow requests
- Trace context propagation via W3C Trace Context headers (traceparent, tracestate) is the standard
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Jaeger | Open Source | Distributed tracing, CNCF graduated, mature UI | Medium-Enterprise |
| Grafana Tempo | Open Source | Object storage backend, cost-efficient, TraceQL | Medium-Enterprise |
| OpenTelemetry | Open Source | Vendor-neutral instrumentation SDK and collector | Small-Enterprise |
| Datadog APM | Commercial | Unified observability, service maps, error tracking | Small-Enterprise |
Common Mistakes
- Tracing 100% of requests in production. Storage and processing costs become crushing at scale
- Not propagating trace context through message queues. Async calls silently break the trace chain
- Only instrumenting HTTP calls. Database queries, cache lookups, and queue operations need spans too
- Ignoring sampling configuration. Default head-based sampling misses rare but important errors
- Not correlating traces with logs and metrics. All three pillars should be linked by trace ID