Distributed Logging
Why It Exists
A request fails somewhere in a chain of 12 microservices. Opening the logs on the pod that returned the error tells almost nothing. The actual cause was three hops upstream, in a service the team doesn't even own. This is the reality of debugging in a distributed system without centralized logging.
Distributed logging pulls logs from every service, container, and infrastructure component into one place where they are actually searchable. The result is a single timeline of what happened, with filtering by correlation ID across service boundaries, and the ability to spot patterns that are invisible when logs live on individual nodes. At scale, a platform can generate terabytes of logs per day. Without centralized aggregation, everyone is just guessing.
How It Works
Structured Logging
Every log entry should be a JSON object with consistent fields: timestamp, level, service, trace_id, span_id, message, and whatever request-scoped metadata matters. With this structure, writing queries like "show all ERROR logs from payment-service where trace_id = abc123 in the last hour" becomes trivial. Without it, the fallback is regex and prayers.
Libraries like zap (Go), structlog (Python), and logback with a JSON encoder (Java) emit structured logs out of the box. Pick one and enforce it. The hardest part isn't choosing the library, it's getting every team to actually use it consistently.
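A minimal sketch with structlog (one of the libraries above); the service name, trace ID, and span ID values are placeholders, and the exact processor list varies by team and structlog version:

```python
# Minimal structured-logging sketch with structlog (Python).
# The field values (service name, trace_id, span_id) are illustrative placeholders.
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),  # adds "timestamp"
        structlog.processors.add_log_level,           # adds "level"
        structlog.processors.JSONRenderer(),          # one JSON object per line
    ]
)

log = structlog.get_logger(service="payment-service")

# Bind request-scoped metadata once; every subsequent line carries it.
request_log = log.bind(trace_id="abc123", span_id="span-0042")
request_log.info("charge_authorized", amount_cents=1999, currency="USD")
# -> {"service": "payment-service", "trace_id": "abc123", "span_id": "span-0042",
#     "amount_cents": 1999, "currency": "USD", "event": "charge_authorized",
#     "level": "info", "timestamp": "..."}
```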
ELK Pipeline vs. Loki
The ELK stack (Elasticsearch, Logstash, Kibana) full-text indexes every log line. This makes ad-hoc queries powerful, but it's expensive: Elasticsearch indexes consume 1.5-2x the raw log size in storage. Paying for that when queries only ever filter by service name and log level is money wasted.
Grafana Loki takes a fundamentally different approach. It indexes only metadata labels (service, namespace, level) and stores the actual log lines as compressed chunks in object storage. Queries filter by labels first, then grep through the chunks. Loki is 10-50x cheaper at scale, but noticeably slower for arbitrary full-text search.
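To make the model concrete, here is a toy Python sketch of the label-first-then-grep idea; it illustrates the query model only, not Loki's actual data structures:

```python
# Toy sketch of Loki's query model: index labels only, grep the chunk contents.
import gzip
from collections import defaultdict

label_index = defaultdict(list)   # (label_key, label_value) -> chunk ids
chunks = {}                       # chunk id -> compressed log lines

def store_chunk(chunk_id, labels, lines):
    for kv in labels.items():
        label_index[kv].append(chunk_id)
    chunks[chunk_id] = gzip.compress("\n".join(lines).encode())

def query(labels, needle):
    # 1. Cheap part: intersect the chunk lists for each requested label.
    candidate_sets = [set(label_index[kv]) for kv in labels.items()]
    candidates = set.intersection(*candidate_sets) if candidate_sets else set()
    # 2. Expensive part: decompress and grep only the matching chunks.
    hits = []
    for chunk_id in candidates:
        for line in gzip.decompress(chunks[chunk_id]).decode().splitlines():
            if needle in line:
                hits.append(line)
    return hits

store_chunk("c1", {"service": "payment-service", "level": "error"},
            ['{"trace_id": "abc123", "msg": "card declined"}'])
print(query({"service": "payment-service", "level": "error"}, "abc123"))
```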
The tradeoff in a table:
| Aspect | Elasticsearch | Grafana Loki |
|---|---|---|
| Indexing | Full-text inverted index | Label-based index only |
| Storage cost | High (index + data) | Low (object storage) |
| Query speed | Fast for any query | Fast for label queries, slower for grep |
| Operations | Complex (shards, replicas, JVM tuning) | Simpler (stateless queriers) |
My honest take: most teams should start with Loki unless they already know they need full-text search. Elasticsearch can always be added later for a specific use case. Going the other direction (migrating away from Elasticsearch to save money) is painful.
Log Sampling
At high volume, logging 100% of requests is neither affordable nor necessary. Deterministic sampling based on trace ID hash means the same request is either fully logged or fully dropped across all services. No one ends up with half a trace. Dynamic sampling cranks up verbosity for error paths and dials it down for healthy traffic. This is the right default for most production systems above 1 TB/day.
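A minimal sketch of the deterministic part, assuming hex trace IDs and a 10% rate for non-error traffic; real collectors and SDKs implement the same idea:

```python
# Deterministic, trace-ID-based log sampling sketch.
# Every service applying the same hash and rate makes the same keep/drop
# decision for a given trace, so a trace is never half-logged.
import hashlib

SAMPLE_RATE = 0.10  # keep 10% of non-error traffic (illustrative)

def keep_log(trace_id: str, level: str) -> bool:
    if level in ("ERROR", "WARN"):
        return True  # error paths stay at 100%
    # Hash the trace ID into a stable number in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

print(keep_log("abc123", "INFO"))   # same answer on every service, every time
print(keep_log("abc123", "ERROR"))  # True
```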
Production Considerations
- Log buffering. Put Kafka or a persistent queue between the collectors and storage. This absorbs burst traffic and prevents backpressure from crashing apps when the logging backend gets slow. I've seen teams skip this and regret it within months.
- Retention tiers. Hot storage (7 days) for recent searchable logs, warm storage (30 days) for less frequent access, cold storage (S3/GCS, 1+ year) for compliance. Automate the lifecycle policies on day one or someone will forget and get a surprise storage bill.
- Dynamic log levels. Expose a runtime endpoint or use feature flags to switch a service from INFO to DEBUG without redeploying. This is critical for production debugging. Redeploying to change a log level under load is reckless.
- PII scrubbing. Build pipeline-level processors (Fluentd filters, Logstash grok) to redact sensitive fields before they hit storage. Retroactive deletion from Elasticsearch is expensive, slow, and unreliable. Treat this as a hard requirement, not a nice-to-have.
- Correlation. Inject trace_id and span_id into every log line using middleware or framework integrations (a sketch follows this list). This bridges logs to distributed traces, enabling a jump from a log entry to the full trace waterfall in one click. Running microservices without correlation IDs is a recipe for on-call misery.
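One possible shape for that correlation middleware, sketched in plain Python with contextvars; the header names and wiring are assumptions, and in practice an OpenTelemetry or framework integration does this for you:

```python
# Sketch of correlation-ID injection into every log line.
# Header names (X-Trace-Id) and the middleware shape are illustrative assumptions.
import contextvars
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default="-")
span_id_var = contextvars.ContextVar("span_id", default="-")

class CorrelationFilter(logging.Filter):
    """Copy the current trace/span IDs onto every log record."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter(
    '{"ts": "%(asctime)s", "level": "%(levelname)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s", "msg": "%(message)s"}'))
logging.basicConfig(level=logging.INFO, handlers=[handler])

def handle_request(headers):
    # Middleware step: propagate the upstream trace ID or start a new one.
    trace_id_var.set(headers.get("X-Trace-Id", uuid.uuid4().hex))
    span_id_var.set(uuid.uuid4().hex[:16])
    logging.getLogger("payment-service").info("charge authorized")

handle_request({"X-Trace-Id": "abc123"})
```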
Failure Scenarios
Scenario 1: Elasticsearch Cluster Red Status During Index Rollover. A nightly ILM rollover tries to create a new index, but the cluster doesn't have enough disk space for primary shard allocation. The cluster goes red and rejects all new writes. Log agents start buffering locally, eating pod memory. Within 20 minutes, pods are OOMing. Services that log synchronously start throwing backpressure errors, and user-facing latency jumps 40%. Detection: Monitor elasticsearch_cluster_health_status (0=green, 1=yellow, 2=red) and elasticsearch_fs_total_free_bytes. Alert when free disk drops below 20%. Recovery: Delete the oldest indices or force merge cold indices to reclaim space. Long-term, set up watermark-based ILM that rolls to warm/cold tiers before disk reaches 85%.
Scenario 2: Kafka Log Buffer Partition Leader Election Storm. The Kafka cluster used as a log buffer loses a broker. Partition leader election for 200+ partitions triggers a thundering herd of producer reconnections from 500 Fluent Bit agents. Kafka producer buffers fill up, agents fall back to filesystem buffering, and log delivery latency jumps from 5 seconds to 15 minutes. During this gap, an on-call engineer investigating a separate incident has no recent logs in Kibana and misdiagnoses the problem. Detection: Monitor kafka_server_ReplicaManager_UnderReplicatedPartitions and Fluent Bit fluentbit_output_retries_total. Recovery: Run Kafka with min.insync.replicas=2 and 3+ brokers. Configure Fluent Bit with filesystem buffering and a 2 GB buffer limit so it can survive 30-minute outages.
Scenario 3: PII Leak Into Log Storage. A new gRPC service logs full request bodies at INFO level, including customer emails and phone numbers. Nobody catches it for 72 hours until a security scan flags PII in Elasticsearch. Now there are 14 TB of logs containing personal data subject to GDPR. Retroactive deletion from Elasticsearch means reindexing, which costs around $8K in compute and takes 48 hours. Detection: Run automated PII scanners (Amazon Macie, or a custom regex pipeline) on a sampled stream of log data. Alert on matches for email, SSN, and credit card patterns. Recovery: Deploy a Fluentd filter with regex-based redaction immediately. Start a GDPR data protection impact assessment (DPIA). Reindex the affected indices with PII fields scrubbed.
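The scrubbing step itself can be as simple as a regex pass in the pipeline. A toy Python version of the idea (the patterns are deliberately rough and would need tuning, plus allow-lists of known-safe fields):

```python
# Toy PII-redaction pass of the kind a Fluentd/Logstash filter performs.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(line: str) -> str:
    # Replace each match with a labeled placeholder before the line hits storage.
    for name, pattern in PATTERNS.items():
        line = pattern.sub(f"[REDACTED_{name}]", line)
    return line

print(scrub('{"msg": "signup", "email": "jane.doe@example.com"}'))
# {"msg": "signup", "email": "[REDACTED_EMAIL]"}
```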
Capacity Planning
Volume estimation formula: daily_log_volume = aggregate_rps * log_lines_per_request * avg_log_line_size * 86,400 (seconds per day), where aggregate RPS is roughly num_services * avg_rps_per_service. For a 200-service platform at 10K aggregate RPS generating 3 log lines per request at 500 bytes each, that works out to: 10,000 * 3 * 500 * 86,400 ≈ 1.3 TB/day raw.
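The same arithmetic as a tiny helper, reusing the figures from the example and the storage multipliers from the table below:

```python
# Daily log volume estimate, using the formula and example figures above.
def daily_log_volume_bytes(aggregate_rps, lines_per_request, bytes_per_line):
    return aggregate_rps * lines_per_request * bytes_per_line * 86_400  # seconds/day

raw = daily_log_volume_bytes(aggregate_rps=10_000, lines_per_request=3, bytes_per_line=500)
print(f"{raw / 1e12:.2f} TB/day raw")              # ~1.30 TB/day
print(f"{raw * 1.7 / 1e12:.2f} TB Elasticsearch")  # with the 1.7x index overhead
print(f"{raw * 0.3 / 1e12:.2f} TB Loki")           # with the 0.3x compressed-chunk ratio
```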
| Scale Tier | Services | Daily Volume (Raw) | Elasticsearch Storage (1.7x) | Loki Storage (0.3x) | Reference |
|---|---|---|---|---|---|
| Startup | 10 | 10 GB | 17 GB | 3 GB | Early-stage SaaS |
| Mid-scale | 100 | 500 GB | 850 GB | 150 GB | Series C platform |
| Large-scale | 500 | 5 TB | 8.5 TB | 1.5 TB | Uber-scale (100TB+/day) |
| Hyper-scale | 2000+ | 50 TB+ | 85 TB+ | 15 TB+ | Netflix, Cloudflare |
Key thresholds to keep in mind. Elasticsearch shard size should be 10-50 GB for good performance. More than 1,000 shards per node causes cluster instability. Loki chunk_target_size should be 1.5 MB for optimal object storage performance. Budget 1 GB heap per 20 active shards for Elasticsearch JVM. When daily volume exceeds 5 TB, Loki with an S3 backend becomes 10-20x cheaper than Elasticsearch. Log sampling at 10% for non-error traffic is standard above 1 TB/day.
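A back-of-the-envelope check against those thresholds, assuming a 30 GB target shard size and 7-day hot retention; purely illustrative:

```python
# Back-of-the-envelope Elasticsearch sizing from the thresholds above.
# Assumes a 30 GB shard size (inside the 10-50 GB range) and 1 GB of JVM heap
# per 20 active shards, spread across the cluster.
def es_sizing(daily_raw_tb, retention_days=7, index_overhead=1.7, shard_gb=30):
    hot_storage_gb = daily_raw_tb * 1000 * index_overhead * retention_days
    shards = hot_storage_gb / shard_gb
    heap_gb = shards / 20
    return round(hot_storage_gb), round(shards), round(heap_gb, 1)

print(es_sizing(5))  # 5 TB/day tier: ~59,500 GB hot, ~1,983 shards, ~99 GB total heap
```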
Architecture Decision Record
Decision: Choosing a Centralized Logging Stack
| Criteria (Weight) | Elasticsearch + Kibana | Grafana Loki | Datadog Logs | Splunk |
|---|---|---|---|---|
| Storage cost (25%) | 2 - Full-text index doubles storage | 5 - Object storage, label-only index | 2 - $0.10/GB ingested, costly at scale | 1 - Most expensive per GB |
| Query flexibility (20%) | 5 - Full-text search, aggregations | 3 - Label filters + grep, limited analytics | 4 - Pattern analysis, faceting | 5 - SPL is the most powerful query language |
| Operational burden (20%) | 2 - JVM tuning, shard management, cluster ops | 4 - Stateless read path, simple scaling | 5 - Fully managed | 3 - Heavy infrastructure for on-prem |
| Ecosystem integration (15%) | 3 - Separate from metrics/traces | 5 - Native Grafana, links to Tempo/Mimir | 4 - Unified with APM/metrics | 3 - Standalone ecosystem |
| Compliance / retention (10%) | 4 - ILM policies, snapshot restore | 3 - S3 lifecycle policies | 3 - Limited retention controls | 5 - Best-in-class compliance features |
| Real-time tail (10%) | 3 - Kibana Discover, some lag | 4 - Live tail via LogCLI | 5 - Real-time live tail | 4 - Real-time search |
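To see how the weights play out, a quick computation of the weighted totals, with scores and weights copied straight from the matrix above:

```python
# Weighted totals for the decision matrix above.
weights = {"storage": 0.25, "query": 0.20, "ops": 0.20,
           "ecosystem": 0.15, "compliance": 0.10, "tail": 0.10}

scores = {
    "Elasticsearch + Kibana": {"storage": 2, "query": 5, "ops": 2, "ecosystem": 3, "compliance": 4, "tail": 3},
    "Grafana Loki":           {"storage": 5, "query": 3, "ops": 4, "ecosystem": 5, "compliance": 3, "tail": 4},
    "Datadog Logs":           {"storage": 2, "query": 4, "ops": 5, "ecosystem": 4, "compliance": 3, "tail": 5},
    "Splunk":                 {"storage": 1, "query": 5, "ops": 3, "ecosystem": 3, "compliance": 5, "tail": 4},
}

for tool, s in scores.items():
    total = sum(weights[k] * s[k] for k in weights)
    print(f"{tool}: {total:.2f}")
# Loki comes out ahead (4.10) vs Datadog (3.70), Splunk (3.20), Elasticsearch (3.05),
# which lines up with the "start with Loki" guidance above.
```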
When to choose what:
- Team < 20, already using Grafana: Loki. It's 10x cheaper, plugs right into existing Grafana dashboards, and is simple to operate.
- Team 20-100, need ad-hoc search: Elasticsearch. Full-text search is indispensable when debugging problems nobody anticipated. Budget for a dedicated person (or at least 20% of someone's time) to run the cluster.
- Regulated enterprise (SOC 2, HIPAA): Splunk. The compliance reporting, certified integrations, and long-term retention are built in. The cost is high, but it's justified by audit readiness.
- Already on Datadog for metrics: Datadog Logs. A unified platform cuts context-switching. Accept the cost premium for the convenience.
- Cost-constrained, >5 TB/day: Loki with aggressive log sampling (10% for non-errors, 100% for errors). Archive raw logs to S3 for compliance, query only through Loki.
Key Points
- Centralized logging pulls logs from all services into one searchable system
- Structured logging (JSON) makes querying and filtering possible. Unstructured text logs become useless past a handful of services
- ELK/EFK stack (Elasticsearch, Fluentd/Logstash, Kibana) is the classic open-source approach
- Log levels (DEBUG, INFO, WARN, ERROR) should be adjustable at runtime without a redeploy
- Correlation IDs (trace IDs) connect logs across services for a single request, and they are non-negotiable for debugging
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Grafana Loki | Open Source | Log aggregation without full-text indexing, cost-efficient | Medium-Enterprise |
| Elasticsearch + Kibana | Open Source | Full-text search, complex queries, mature ecosystem | Medium-Enterprise |
| Datadog Logs | Commercial | Unified with metrics/traces, live tail, patterns | Small-Enterprise |
| Fluentd/Fluent Bit | Open Source | Log collection and routing, CNCF graduated | Medium-Enterprise |
Common Mistakes
- Logging sensitive data (passwords, tokens, PII), which violates compliance and creates security risks
- Unstructured log messages. Grep-based debugging falls apart past 10 services
- Not setting log retention policies, so storage costs grow forever
- Logging too much at INFO level, overwhelming the logging pipeline with noise
- Not buffering logs. Direct writes to Elasticsearch from every pod cause write amplification