OpenTelemetry
The vendor-neutral standard for instrumentation, collection, and export of telemetry data
Why It Exists
Before OpenTelemetry, instrumenting a service meant choosing a vendor- or backend-specific SDK: Datadog's dd-trace, New Relic's agent, Jaeger's client, Prometheus's client library -- each with its own API, wire format, and export target. Switching vendors meant rewriting instrumentation across every service. Running multiple backends (Prometheus for metrics, Jaeger for traces) meant maintaining multiple SDKs in the same codebase.
OpenTelemetry unifies all of this into a single standard. One SDK produces metrics, traces, logs, and profiles. One wire protocol (OTLP) exports to any compatible backend. One collector processes and routes all signals. The application code never changes when switching from Jaeger to Tempo, or from Prometheus to VictoriaMetrics.
The project formed in 2019 from the merger of OpenTracing and OpenCensus. It graduated as a CNCF project and is now the second-most active CNCF project after Kubernetes. Every major observability vendor supports OTLP ingestion.
Who Provides What
OpenTelemetry is a vendor-neutral open standard governed by the CNCF, like Kubernetes. No single company owns it. Google, Microsoft, Splunk, Datadog, Grafana Labs, and 1,000+ contributors maintain it together. There are three components, and different teams own each:
OTel SDK (application team's responsibility). Open-source libraries that application developers import into their code, the same way one would import express, gin, or spring-boot-starter. The SDK provides the API to create spans, record metrics, and propagate context. Application teams decide which endpoints to instrument and what custom metrics to create. The SDK is available for Go, Java, Python, Node.js, .NET, Ruby, Rust, and C++.
OTel Collector (platform team's responsibility). An open-source binary (written in Go, but shipped as a container image -- no Go knowledge needed) that the platform/infrastructure team deploys and operates. Application teams never touch it. The platform team runs it as a Kubernetes DaemonSet (one per node) or as a centralized gateway fleet. Applications export telemetry to localhost:4317 and the Collector handles everything after that: batching, filtering, enrichment, routing, retry, and export to storage backends. Think of it as the "telemetry proxy" that sits between applications and backends.
All customization happens through YAML configuration, not code. The Collector ships with 100+ pre-built components (receivers, processors, exporters) compiled into the binary as plugins; pick which ones to enable and how to wire them together in the config file. Need to drop health check metrics? Add a filter processor with a regex pattern. Need to enrich spans with Kubernetes metadata? Add the k8sattributes processor. Need to route metrics to VictoriaMetrics and traces to Tempo? Add two exporters and assign them to different pipelines. No Go code, no recompilation.
For the rare case where a truly custom component is needed that doesn't exist (e.g., a proprietary exporter for an internal system), the OpenTelemetry Collector Builder (ocb) lets teams compile a custom binary with a Go plugin included -- but most teams never need this.
OTLP (the wire protocol). The OpenTelemetry Protocol defines the protobuf format for metrics, traces, logs, and profiles over gRPC or HTTP. This is what makes the system vendor-neutral: any SDK speaking OTLP can send to any Collector, and any Collector can export to any OTLP-compatible backend.
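As a sketch of that neutrality in Go (endpoint values are illustrative): the gRPC and HTTP exporters are interchangeable because both speak OTLP and return the same exporter type, so swapping transports -- or pointing the endpoint at a different backend -- is a one-line change at SDK init, never an application change.
import (
	"context"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
)

// newExporter is a hypothetical helper: both constructors produce the same
// *otlptrace.Exporter, so the rest of the SDK setup is identical either way.
func newExporter(ctx context.Context, useHTTP bool) (*otlptrace.Exporter, error) {
	if useHTTP {
		return otlptracehttp.New(ctx,
			otlptracehttp.WithEndpoint("localhost:4318"), // OTLP/HTTP default port
			otlptracehttp.WithInsecure(),
		)
	}
	return otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"), // OTLP/gRPC default port
		otlptracegrpc.WithInsecure(),
	)
}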
Vendor distributions. Major vendors ship their own builds of the Collector with proprietary exporters pre-bundled. Grafana Labs ships grafana-agent (now Alloy), Datadog ships the datadog-agent with OTel support, and AWS ships the aws-otel-collector. These are convenience wrappers around the same open-source core. The upstream Collector is always an option instead.
Installation and Setup
Go
Go has no runtime agent. Instrumentation is explicit through library wrappers.
# Core SDK
go get go.opentelemetry.io/otel
go get go.opentelemetry.io/otel/sdk
go get go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc
go get go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc
// main.go -- initialize the SDK
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initTracer(ctx context.Context) (*trace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"), // OTel Collector gRPC
		otlptracegrpc.WithInsecure(),                 // use TLS in production
	)
	if err != nil {
		return nil, err
	}
	tp := trace.NewTracerProvider(
		trace.WithBatcher(exporter), // BatchSpanProcessor (async export)
		trace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("checkout-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil
}
For HTTP servers, wrap the router with the otelhttp middleware:
import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
mux := http.NewServeMux()
mux.HandleFunc("/checkout", handleCheckout)
handler := otelhttp.NewHandler(mux, "server") // auto-creates spans for each request
http.ListenAndServe(":8080", handler)
Java
Java uses a -javaagent flag for zero-code auto-instrumentation. This is the easiest setup across all languages.
# Download the agent JAR
curl -L -o opentelemetry-javaagent.jar \
https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar
# Run with the agent attached
java -javaagent:opentelemetry-javaagent.jar \
-Dotel.service.name=payment-service \
-Dotel.exporter.otlp.endpoint=http://localhost:4317 \
-jar payment-service.jar
That's it. The agent automatically instruments Spring Boot, JAX-RS, Servlet, JDBC, Hibernate, gRPC, Kafka clients, Redis clients, and 100+ other libraries. No code changes needed. Spans are created for every incoming HTTP request, every outgoing HTTP call, every database query, and every message produce/consume.
For Kubernetes deployments, add the agent as an init container or use the OpenTelemetry Operator's Instrumentation CRD for auto-injection:
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
spec:
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
  exporter:
    endpoint: http://otel-collector:4317
  env:
    - name: OTEL_SERVICE_NAME
      valueFrom:
        fieldRef:
          fieldPath: metadata.labels['app']
Annotate pods with instrumentation.opentelemetry.io/inject-java: "true" and the operator injects the agent automatically.
Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
opentelemetry-bootstrap -a install # auto-detects and installs instrumentation packages
# Run with auto-instrumentation
opentelemetry-instrument \
--service_name inventory-service \
--exporter_otlp_endpoint http://localhost:4317 \
python app.py
Auto-instrumentation covers Flask, Django, FastAPI, SQLAlchemy, psycopg2, redis-py, requests, httpx, grpcio, celery, and more. The opentelemetry-bootstrap command scans installed packages and installs matching instrumentation libraries.
For programmatic setup (when more control is needed):
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider(resource=Resource.create({"service.name": "inventory-service"}))
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True)))
trace.set_tracer_provider(provider)
Node.js / TypeScript
npm install @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-grpc @opentelemetry/exporter-metrics-otlp-grpc
// tracing.ts -- must be loaded BEFORE application code
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
const sdk = new NodeSDK({
serviceName: 'cart-service',
traceExporter: new OTLPTraceExporter({ url: 'http://localhost:4317' }),
instrumentations: [getNodeAutoInstrumentations()],
})
sdk.start()
# Run with --require so tracing loads before any application code
# (compile tracing.ts to tracing.js first, e.g. with tsc)
node --require ./tracing.js app.js
Auto-instrumentation covers Express, Fastify, Koa, HTTP, gRPC, pg, mysql2, mongodb, redis, ioredis, and AWS SDK.
Language and Framework Coverage
| Language | Auto-Instrumentation | Stability | Setup Method | Notable Frameworks |
|---|---|---|---|---|
| Java | Comprehensive (100+ libraries) | Stable | -javaagent JVM flag | Spring Boot, JAX-RS, Servlet, JDBC, Hibernate, Kafka, gRPC, Jedis, Lettuce |
| Go | Library wrappers (explicit) | Stable | Import wrapper packages | net/http, gin, gorilla/mux, gRPC, database/sql, redis/go-redis |
| Python | Good (30+ libraries) | Stable | opentelemetry-instrument CLI | Flask, Django, FastAPI, SQLAlchemy, psycopg2, requests, celery, grpcio |
| Node.js | Good (25+ libraries) | Stable | --require hook | Express, Fastify, Koa, pg, mysql2, mongodb, redis, ioredis, AWS SDK |
| .NET | Good (20+ libraries) | Stable | AddOpenTelemetry() in startup | ASP.NET Core, HttpClient, SqlClient, Entity Framework, gRPC |
| Ruby | Moderate (15+ libraries) | Beta | Bundler.require + config | Rails, Sinatra, Rack, pg, mysql2, redis, Faraday, Net::HTTP |
| Rust | Minimal (manual) | Alpha | tracing-opentelemetry crate | tokio, tonic (gRPC), reqwest, sqlx -- mostly manual span creation |
| C++ | Minimal (manual) | Alpha | CMake/vcpkg + manual init | gRPC, HTTP -- manual instrumentation for most libraries |
Go's explicit model vs Java's agent model: Go lacks runtime bytecode manipulation, so there is no -goagent equivalent. Every library needs an explicit instrumentation wrapper (otelhttp for net/http, otelgrpc for gRPC, otelsql for database/sql). This means more setup code but also more control over what gets instrumented. Java's agent approach instruments everything by default, which is convenient but can add unexpected overhead to hot paths.
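To make the explicit model concrete, here is a minimal sketch of the client side (the server side used otelhttp.NewHandler above); the inventory URL and function name are placeholders:
import (
	"context"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

// Wrapping the Transport gives every outgoing request a client span and
// an injected traceparent header -- the Go equivalent of what the Java
// agent does to HttpClient automatically.
var client = &http.Client{
	Transport: otelhttp.NewTransport(http.DefaultTransport),
}

func callInventory(ctx context.Context) error {
	// Build the request from ctx so the active span is propagated downstream
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "http://inventory:8080/stock", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	return resp.Body.Close()
}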
Custom Metrics: Counters, Histograms, and Gauges
Auto-instrumentation covers HTTP and RPC metrics. For business-specific metrics, use the OTel Metrics API directly.
Creating Custom Metrics (Go Example)
import (
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// Create a meter (one per package, reuse across functions)
var meter = otel.Meter("checkout-service")

// Counter: monotonically increasing value
var ordersPlaced, _ = meter.Int64Counter("orders_placed_total",
	metric.WithDescription("Total orders successfully placed"),
	metric.WithUnit("{order}"),
)

// Histogram: distribution of values
var orderValue, _ = meter.Float64Histogram("order_value_dollars",
	metric.WithDescription("Order value in USD"),
	metric.WithUnit("USD"),
	metric.WithExplicitBucketBoundaries(10, 25, 50, 100, 250, 500, 1000),
)

// UpDownCounter: value that goes up and down
var activeCartSessions, _ = meter.Int64UpDownCounter("active_cart_sessions",
	metric.WithDescription("Number of active shopping cart sessions"),
)

func handleCheckout(w http.ResponseWriter, r *http.Request) {
	ctx := r.Context()
	// Record custom metrics with attributes (low-cardinality only!)
	ordersPlaced.Add(ctx, 1,
		metric.WithAttributes(
			attribute.String("payment_method", "credit_card"),
			attribute.String("region", "us-east-1"),
		),
	)
	orderValue.Record(ctx, 149.99,
		metric.WithAttributes(
			attribute.String("currency", "USD"),
		),
	)
	activeCartSessions.Add(ctx, -1) // session ended
}
Creating Custom Metrics (Python Example)
from opentelemetry import metrics

meter = metrics.get_meter("inventory-service")

# Counter
items_sold = meter.create_counter("items_sold_total", description="Total items sold")

# Histogram
fulfillment_time = meter.create_histogram(
    "fulfillment_time_seconds",
    description="Time from order to shipment",
    unit="s",
)

def process_order(order):
    items_sold.add(order.quantity, {"category": order.category, "warehouse": order.warehouse_id})
    fulfillment_time.record(order.fulfillment_duration.total_seconds(), {"priority": order.priority})
Creating Custom Metrics (Java Example)
import java.util.List;

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.DoubleHistogram;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;

Meter meter = GlobalOpenTelemetry.getMeter("payment-service");

LongCounter paymentsProcessed = meter.counterBuilder("payments_processed_total")
    .setDescription("Total payments processed")
    .setUnit("{payment}")
    .build();

DoubleHistogram paymentLatency = meter.histogramBuilder("payment_processing_seconds")
    .setDescription("Payment processing latency")
    .setUnit("s")
    .setExplicitBucketBoundariesAdvice(List.of(0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0))
    .build();

public void processPayment(PaymentRequest req) {
    long start = System.nanoTime();
    // ... payment logic ...
    paymentsProcessed.add(1, Attributes.of(
        AttributeKey.stringKey("method"), req.getMethod(),
        AttributeKey.stringKey("provider"), req.getProvider()
    ));
    paymentLatency.record((System.nanoTime() - start) / 1e9, Attributes.of(
        AttributeKey.stringKey("method"), req.getMethod()
    ));
}
Critical rule: Metric attributes must be low-cardinality. payment_method (4 values) is fine. user_id (millions of values) will create millions of time series and crash the TSDB. Use span attributes for high-cardinality data -- spans are sampled, metrics are not.
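A short Go sketch of that split, reusing the ordersPlaced counter from above (userID and method are placeholder values): the high-cardinality value goes on the active span, never on the metric.
import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
	"go.opentelemetry.io/otel/trace"
)

func recordOrder(ctx context.Context, userID, method string) {
	// High-cardinality: attach to the span -- spans are sampled, cost is bounded
	span := trace.SpanFromContext(ctx)
	span.SetAttributes(attribute.String("user.id", userID))

	// Low-cardinality: safe as a metric attribute (a handful of values)
	ordersPlaced.Add(ctx, 1, metric.WithAttributes(
		attribute.String("payment_method", method),
		// attribute.String("user.id", userID), // DON'T: one time series per user
	))
}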
The OTel Collector: Pipeline Architecture
The Collector is where raw telemetry becomes production-grade. It decouples applications from backends and handles everything between "SDK exported data" and "backend received data."
Deployment Models
DaemonSet (recommended for most setups): One collector per Kubernetes node. Applications export to localhost:4317 (zero network hop). The collector batches, processes, and forwards to Kafka or directly to backends. This is what the observability platform in the blog post uses -- 3,000 DaemonSet collectors across the fleet.
Gateway: A centralized collector pool that applications send to over the network. Useful when centralized processing is needed (tail-based sampling requires all spans from a trace on one instance) or when running outside Kubernetes.
Sidecar: One collector per pod. Maximum isolation but high resource overhead. Rarely needed unless tenants require strict data isolation at the collection layer.
Collector Configuration
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317  # SDK sends here
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'kubernetes-pods'
          kubernetes_sd_configs:
            - role: pod
processors:
  memory_limiter:  # ALWAYS first -- prevents OOM
    check_interval: 1s
    limit_mib: 1500
    spike_limit_mib: 500
  filter:  # Drop noise before batching
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - ".*health.*"
          - ".*readiness.*"
  attributes:  # Enrich with metadata
    actions:
      - key: k8s.namespace.name
        action: upsert
        from_context: k8s.namespace.name
  batch:  # Batch before export
    send_batch_size: 8192
    timeout: 200ms
exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
  prometheusremotewrite:
    endpoint: http://vminsert:8480/insert/0/prometheus/api/v1/write
  otlp/logs:
    endpoint: vlinsert:4317
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, filter, attributes, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, filter, attributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, filter, batch]
      exporters: [otlp/logs]
Processor chain order matters. The correct sequence is: memory_limiter (safety net) → filter (drop unwanted data) → attributes/transform (enrich) → batch (group for efficient export) → export. Reversing filter and batch means CPU is wasted batching data destined to be dropped. Putting batch before memory_limiter means the limiter can't protect against batch buffer growth.
Manual Span Creation
Auto-instrumentation covers framework-level spans (HTTP handler, DB query, gRPC call). For business logic visibility, create manual spans.
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/codes"
)

var tracer = otel.Tracer("checkout-service")

func handleCheckout(ctx context.Context, cart Cart) error {
	ctx, span := tracer.Start(ctx, "validate-cart")
	defer span.End()

	span.SetAttributes(
		attribute.Int("cart.item_count", len(cart.Items)),
		attribute.String("cart.currency", cart.Currency),
	)
	if err := validateItems(ctx, cart); err != nil {
		span.RecordError(err)
		span.SetStatus(codes.Error, "cart validation failed")
		return err
	}

	// Child span for payment
	ctx, paySpan := tracer.Start(ctx, "process-payment")
	defer paySpan.End()
	// ... payment logic ...
	return nil
}
The key: pass ctx through every function call. The context carries the active span, so child spans automatically link to their parent. If the context is lost (a goroutine spawned without it, an async callback that drops it), the trace breaks.
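A short sketch of the goroutine case, reusing the tracer above: handing ctx to the goroutine keeps the async span parented to the request trace; omitting it would start a new, disconnected trace.
import "context"

func handleOrder(ctx context.Context) {
	ctx, span := tracer.Start(ctx, "handle-order")
	defer span.End()

	// Pass ctx as an argument -- the goroutine may outlive the request scope
	go func(ctx context.Context) {
		_, emailSpan := tracer.Start(ctx, "send-confirmation-email") // child of handle-order
		defer emailSpan.End()
		// ... async work ...
	}(ctx)
}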
Baggage: Cross-Service Context Without Spans
Baggage propagates key-value pairs across service boundaries via HTTP headers, without creating spans. Use it for context that downstream services need but that doesn't belong in span attributes.
import "go.opentelemetry.io/otel/baggage"
// Service A: set baggage
member, _ := baggage.NewMember("tenant_id", "acme-corp")
bag, _ := baggage.New(member)
ctx = baggage.ContextWithBaggage(ctx, bag)
// Make HTTP call -- baggage propagates automatically via W3C baggage header
// Service B: read baggage
bag := baggage.FromContext(ctx)
tenantID := bag.Member("tenant_id").Value() // "acme-corp"
Baggage is transmitted as HTTP headers on every request in the trace, so keep it small. Do not put large payloads or sensitive data (PII, tokens) in baggage.
OTel vs Beyla: The Two-Tier Model
OpenTelemetry SDK and Grafana Beyla are not alternatives -- they're complementary layers:
| Capability | OTel SDK (Tier 2) | Beyla eBPF (Tier 1) |
|---|---|---|
| Setup required | Code changes + dependency | DaemonSet deploy only |
| Custom business metrics | Yes (counters, histograms, gauges) | No |
| Custom span attributes | Yes (user_id, cart_size, feature_flag) | No (fixed: method, path, status) |
| Internal function tracing | Yes (manual spans) | No (HTTP boundary only) |
| Async/background work | Yes (with context propagation) | No |
| Kafka/SQS context propagation | Yes (with instrumentation libraries) | No |
| Profiling integration | Yes (span_id correlation) | No |
| Framework coverage | Language-specific (varies) | Any language on Linux 5.8+ |
| Effort to deploy | Hours per service | Minutes for entire cluster |
Deploy Beyla on day one for baseline RED metrics across all services. Add OTel SDK incrementally to critical services that need business metrics, custom attributes, and profiling correlation. The OTel Collector deduplicates overlapping signals so both can run simultaneously without double-counting.
Production Considerations
Resource overhead: The OTel SDK adds ~1-5 MB heap per service (span and metric buffers). The BatchSpanProcessor defaults to 2,048 spans in-flight and 512 spans per batch. For high-throughput services (>10K requests/sec), tune MaxQueueSize and MaxExportBatchSize to avoid dropped spans.
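A sketch of that tuning in Go; the numbers are illustrative starting points for a high-throughput service, not recommendations.
import (
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newHighThroughputProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter,
			sdktrace.WithMaxQueueSize(8192),       // default 2048; spans beyond this are dropped
			sdktrace.WithMaxExportBatchSize(1024), // default 512
			sdktrace.WithBatchTimeout(200*time.Millisecond),
		),
	)
}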
Sampling at the SDK level: For extremely high-throughput services, use ParentBasedSampler with TraceIDRatioBased to sample a fraction of new traces at creation time (head-based sampling). This reduces SDK and Collector load but misses rare errors. Combine with tail-based sampling in the Collector for the best of both worlds.
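In Go, that head-sampling setup is one option at provider construction (the 10% ratio is illustrative):
import (
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func newSampledProvider(exporter sdktrace.SpanExporter) *sdktrace.TracerProvider {
	return sdktrace.NewTracerProvider(
		// Roots are kept with 10% probability; children always follow their
		// parent's decision, so traces are kept or dropped as a whole.
		sdktrace.WithSampler(sdktrace.ParentBased(sdktrace.TraceIDRatioBased(0.10))),
		sdktrace.WithBatcher(exporter),
	)
}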
Graceful shutdown: Call tracerProvider.Shutdown(ctx) and meterProvider.Shutdown(ctx) on application exit. Without this, the last batch of spans and metrics in the buffer is lost. In Kubernetes, hook this into the SIGTERM handler with a 5-second grace period.
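A minimal sketch of that shutdown hook in Go, assuming the TracerProvider from the initTracer example above (the helper name is hypothetical):
import (
	"context"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// waitAndShutdown blocks until SIGTERM/SIGINT, then flushes buffered
// spans with a 5-second deadline before the process exits.
func waitAndShutdown(tp *sdktrace.TracerProvider) {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, os.Interrupt)
	defer stop()
	<-ctx.Done() // block until the signal arrives

	flushCtx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := tp.Shutdown(flushCtx); err != nil {
		log.Printf("telemetry flush failed, last batch may be lost: %v", err)
	}
}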
Environment variable configuration: In production, configure the SDK via environment variables rather than code. This makes it possible to change export endpoints, sampling rates, and service names without redeploying:
# Kubernetes pod spec
env:
  - name: OTEL_SERVICE_NAME
    value: "checkout-service"
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://localhost:4317"
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"  # Sample 10% of new traces
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "deployment.environment=production,k8s.namespace.name=checkout"
Pros
- • Single SDK for all four signals (metrics, traces, logs, profiles). One dependency instead of four separate libraries
- • Vendor-neutral wire protocol (OTLP). Switch backends without changing application code. Export to VictoriaMetrics, Tempo, Datadog, or any OTLP-compatible receiver
- • Automatic instrumentation libraries for most frameworks. Spring Boot, Express, Flask, net/http, gRPC -- get spans and metrics with a few lines of setup code
- • Context propagation is built in. W3C Trace Context (traceparent/tracestate) propagates trace_id and span_id across HTTP, gRPC, and messaging boundaries automatically
- • The OTel Collector decouples applications from backends. Applications export to the local collector, the collector handles batching, retry, routing, and format conversion. Backend changes never touch application code
- • CNCF graduated project with contributions from Google, Microsoft, Splunk, Datadog, Grafana Labs, and 1,000+ contributors. The industry standard, not a single-vendor bet
Cons
- • SDK maturity varies by language. Go and Java are production-stable. Python and JavaScript are stable but have rough edges in async context propagation. Rust and C++ are experimental
- • Auto-instrumentation adds latency overhead. Java agent adds 1-5% latency to instrumented calls. For latency-critical hot paths, manual instrumentation with selective span creation is better
- • Configuration complexity. The Collector alone has 100+ receivers, processors, and exporters. Getting the processor chain right (order matters) takes operational experience
- • Log signal is newer than metrics and traces. The logs SDK stabilized later and some language implementations are still catching up. Bridge APIs exist for existing log frameworks but add a translation layer
- • Breaking changes between SDK versions still happen in newer signals (logs, profiles). Pin versions carefully and test upgrades in staging
When to use
- • Any new service that needs observability. OTel should be the default instrumentation choice for greenfield development
- • Migrating off vendor-specific SDKs (Datadog APM, New Relic agents) to avoid vendor lock-in
- • Building a multi-signal observability platform where metrics, traces, logs, and profiles share context (trace_id, span_id)
- • Custom business metrics that need labels, histograms, or counters beyond what auto-instrumentation provides
When NOT to use
- • Kernel-level network telemetry without code changes. Use eBPF (Grafana Beyla) instead -- OTel SDK requires application code changes
- • Simple Prometheus metric scraping from existing /metrics endpoints. The Prometheus client library is lighter if you only need counters and gauges with no tracing
- • Environments where adding a dependency is not possible (embedded systems, bare-metal firmware, legacy COBOL)
Key Points
- • OTLP (OpenTelemetry Protocol) is the wire format for all four signals. It uses protobuf over gRPC (default) or HTTP/JSON. Every OTel SDK and Collector speaks OTLP natively, which is why backends can be swapped without touching application code
- • The OTel Collector is the backbone of the telemetry pipeline. It runs as a DaemonSet (one per node) or as a gateway (centralized fleet). Receivers accept data in any format (OTLP, Prometheus, Jaeger, Zipkin). Processors transform it (batch, filter, enrich, sample). Exporters send it to backends. The processor chain order matters -- batch before export, filter before batch
- • Context propagation is what makes distributed tracing work. The SDK automatically injects W3C traceparent headers into outgoing HTTP and gRPC calls. Downstream services extract the trace_id and parent_span_id, creating child spans that form a complete trace tree across service boundaries
- • Auto-instrumentation wraps common frameworks and libraries to produce spans and metrics without manual code. Java uses a -javaagent JVM flag. Python uses sitecustomize.py or programmatic setup. Go requires explicit library wrappers (no runtime agent). Node.js uses --require hooks. The coverage and quality vary by language
- • Custom metrics use the OTel Metrics API: Counter (monotonic, e.g., orders_placed), UpDownCounter (non-monotonic, e.g., active_connections), Histogram (distributions, e.g., request_duration), and Gauge (point-in-time, e.g., queue_depth). These map directly to Prometheus metric types at export
Common Mistakes
- ✗ Creating a span for every function call. Spans have overhead (~1-5 microseconds each). Instrument service boundaries (HTTP handlers, RPC calls, database queries), not internal function calls. Use profiling (Pyroscope) for function-level visibility
- ✗ Forgetting to call span.End(). Unfinished spans leak memory in the span processor buffer and never get exported. In Go, always use defer span.End() immediately after span creation
- ✗ Using synchronous export in production. A SimpleSpanProcessor exports each span as it ends, blocking until the backend acknowledges. Use the BatchSpanProcessor (the default in most SDKs), which buffers spans and exports them in background batches
- ✗ Putting high-cardinality values in metric labels. Adding user_id or request_id as a metric attribute creates millions of time series and crashes the storage backend. Use span attributes for high-cardinality data, metric attributes only for low-cardinality dimensions (service, method, status, region)
- ✗ Running the OTel Collector without resource limits. A misconfigured processor chain (e.g., unbounded tail sampling buffer) can consume all node memory. Always set memory_limiter as the first processor in the chain and set container resource limits
- ✗ Ignoring the processor chain order in the Collector. Processors execute in the order listed in the config. Putting the batch processor before the filter processor means batching everything, including data destined to be dropped -- wasting memory and CPU. Correct order: memory_limiter → filter → enrich → batch → export