Distributed Logging
Why It Exists
A request fails somewhere in a chain of 12 microservices. Opening the logs on the pod that returned the error tells almost nothing. The actual cause was three hops upstream, in a service the team doesn't even own. This is the reality of debugging in a distributed system without centralized logging.
Distributed logging pulls logs from every service, container, and infrastructure component into one place where they are actually searchable. The result is a single timeline of what happened, with filtering by correlation ID across service boundaries, and the ability to spot patterns that are invisible when logs live on individual nodes. At scale, a platform can generate terabytes of logs per day. Without centralized aggregation, everyone is just guessing.
How It Works
Structured Logging
Every log entry should be a JSON object with consistent fields: timestamp, level, service, trace_id, span_id, message, and whatever request-scoped metadata matters. With this structure, writing queries like "show all ERROR logs from payment-service where trace_id = abc123 in the last hour" becomes trivial. Without it, the fallback is regex and prayers.
Libraries like zap (Go), structlog (Python), and logback with a JSON encoder (Java) emit structured logs out of the box. Pick one and enforce it. The hardest part isn't choosing the library, it's getting every team to actually use it consistently.
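A minimal sketch with structlog (one of the libraries above); the service name, trace ID, and span ID values are placeholders, and the exact processor list varies by team and structlog version:

```python
# Minimal structured-logging sketch with structlog (Python).
# The field values (service name, trace_id, span_id) are illustrative placeholders.
import structlog

structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),  # adds "timestamp"
        structlog.processors.add_log_level,           # adds "level"
        structlog.processors.JSONRenderer(),          # one JSON object per line
    ]
)

log = structlog.get_logger(service="payment-service")

# Bind request-scoped metadata once; every subsequent line carries it.
request_log = log.bind(trace_id="abc123", span_id="span-0042")
request_log.info("charge_authorized", amount_cents=1999, currency="USD")
# -> {"service": "payment-service", "trace_id": "abc123", "span_id": "span-0042",
#     "amount_cents": 1999, "currency": "USD", "event": "charge_authorized",
#     "level": "info", "timestamp": "..."}
```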
ELK Pipeline vs. Loki
The ELK stack (Elasticsearch, Logstash, Kibana) full-text indexes every log line. This makes ad-hoc queries powerful, but it's expensive: Elasticsearch indexes consume 1.5-2x the raw log size in storage. Paying for that when queries only ever filter by service name and log level is money wasted.
Grafana Loki takes a fundamentally different approach. It indexes only metadata labels (service, namespace, level) and stores the actual log lines as compressed chunks in object storage. Queries filter by labels first, then grep through the chunks. Loki is 10-50x cheaper at scale, but noticeably slower for arbitrary full-text search.
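To make the model concrete, here is a toy Python sketch of the label-first-then-grep idea; it illustrates the query model only, not Loki's actual data structures:

```python
# Toy sketch of Loki's query model: index labels only, grep the chunk contents.
import gzip
from collections import defaultdict

label_index = defaultdict(list)   # (label_key, label_value) -> chunk ids
chunks = {}                       # chunk id -> compressed log lines

def store_chunk(chunk_id, labels, lines):
    for kv in labels.items():
        label_index[kv].append(chunk_id)
    chunks[chunk_id] = gzip.compress("\n".join(lines).encode())

def query(labels, needle):
    # 1. Cheap part: intersect the chunk lists for each requested label.
    candidate_sets = [set(label_index[kv]) for kv in labels.items()]
    candidates = set.intersection(*candidate_sets) if candidate_sets else set()
    # 2. Expensive part: decompress and grep only the matching chunks.
    hits = []
    for chunk_id in candidates:
        for line in gzip.decompress(chunks[chunk_id]).decode().splitlines():
            if needle in line:
                hits.append(line)
    return hits

store_chunk("c1", {"service": "payment-service", "level": "error"},
            ['{"trace_id": "abc123", "msg": "card declined"}'])
print(query({"service": "payment-service", "level": "error"}, "abc123"))
```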
The tradeoff in a table:
| Aspect | Elasticsearch | Grafana Loki |
|---|---|---|
| Indexing | Full-text inverted index | Label-based index only |
| Storage cost | High (index + data) | Low (object storage) |
| Query speed | Fast for any query | Fast for label queries, slower for grep |
| Operations | Complex (shards, replicas, JVM tuning) | Simpler (stateless queriers) |
My honest take: most teams should start with Loki unless they already know they need full-text search. Elasticsearch can always be added later for a specific use case. Going the other direction (migrating away from Elasticsearch to save money) is painful.
Log Sampling
At high volume, logging 100% of requests is neither affordable nor necessary. Deterministic sampling based on trace ID hash means the same request is either fully logged or fully dropped across all services. No one ends up with half a trace. Dynamic sampling cranks up verbosity for error paths and dials it down for healthy traffic. This is the right default for most production systems above 1 TB/day.
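A minimal sketch of the deterministic part, assuming hex trace IDs and a 10% rate for non-error traffic; real collectors and SDKs implement the same idea:

```python
# Deterministic, trace-ID-based log sampling sketch.
# Every service applying the same hash and rate makes the same keep/drop
# decision for a given trace, so a trace is never half-logged.
import hashlib

SAMPLE_RATE = 0.10  # keep 10% of non-error traffic (illustrative)

def keep_log(trace_id: str, level: str) -> bool:
    if level in ("ERROR", "WARN"):
        return True  # error paths stay at 100%
    # Hash the trace ID into a stable number in [0, 1) and compare to the rate.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < SAMPLE_RATE

print(keep_log("abc123", "INFO"))   # same answer on every service, every time
print(keep_log("abc123", "ERROR"))  # True
```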
Production Considerations
- Log buffering. Put Kafka or a persistent queue between the collectors and storage. This absorbs burst traffic and prevents backpressure from crashing apps when the logging backend gets slow. I've seen teams skip this and regret it within months.
- Retention tiers. Hot storage (7 days) for recent searchable logs, warm storage (30 days) for less frequent access, cold storage (S3/GCS, 1+ year) for compliance. Automate the lifecycle policies on day one or someone will forget and get a surprise storage bill.
- Dynamic log levels. Expose a runtime endpoint or use feature flags to switch a service from INFO to DEBUG without redeploying. This is critical for production debugging. Redeploying to change a log level under load is reckless.
- PII scrubbing. Build pipeline-level processors (Fluentd filters, Logstash grok) to redact sensitive fields before they hit storage. Retroactive deletion from Elasticsearch is expensive, slow, and unreliable. Treat this as a hard requirement, not a nice-to-have.
- Correlation. Inject trace_id and span_id into every log line using middleware or framework integrations (a sketch follows this list). This bridges logs to distributed traces, enabling a jump from a log entry to the full trace waterfall in one click. Running microservices without correlation IDs is a recipe for on-call misery.
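One possible shape for that correlation middleware, sketched in plain Python with contextvars; the header names and wiring are assumptions, and in practice an OpenTelemetry or framework integration does this for you:

```python
# Sketch of correlation-ID injection into every log line.
# Header names (X-Trace-Id) and the middleware shape are illustrative assumptions.
import contextvars
import logging
import uuid

trace_id_var = contextvars.ContextVar("trace_id", default="-")
span_id_var = contextvars.ContextVar("span_id", default="-")

class CorrelationFilter(logging.Filter):
    """Copy the current trace/span IDs onto every log record."""
    def filter(self, record):
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        return True

handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter(
    '{"ts": "%(asctime)s", "level": "%(levelname)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s", "msg": "%(message)s"}'))
logging.basicConfig(level=logging.INFO, handlers=[handler])

def handle_request(headers):
    # Middleware step: propagate the upstream trace ID or start a new one.
    trace_id_var.set(headers.get("X-Trace-Id", uuid.uuid4().hex))
    span_id_var.set(uuid.uuid4().hex[:16])
    logging.getLogger("payment-service").info("charge authorized")

handle_request({"X-Trace-Id": "abc123"})
```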
Failure Scenarios
Scenario 1: Elasticsearch Cluster Red Status During Index Rollover. A nightly ILM rollover tries to create a new index, but the cluster doesn't have enough disk space for primary shard allocation. The cluster goes red and rejects all new writes. Log agents start buffering locally, eating pod memory. Within 20 minutes, pods are OOMing. Services that log synchronously start throwing backpressure errors, and user-facing latency jumps 40%. Detection: Monitor elasticsearch_cluster_health_status (0=green, 1=yellow, 2=red) and elasticsearch_fs_total_free_bytes. Alert when free disk drops below 20%. Recovery: Delete the oldest indices or force merge cold indices to reclaim space. Long-term, set up watermark-based ILM that rolls to warm/cold tiers before disk reaches 85%.
Scenario 2: Kafka Log Buffer Partition Leader Election Storm. The Kafka cluster used as a log buffer loses a broker. Partition leader election for 200+ partitions triggers a thundering herd of producer reconnections from 500 Fluent Bit agents. Kafka producer buffers fill up, agents fall back to filesystem buffering, and log delivery latency jumps from 5 seconds to 15 minutes. During this gap, an on-call engineer investigating a separate incident has no recent logs in Kibana and misdiagnoses the problem. Detection: Monitor kafka_server_ReplicaManager_UnderReplicatedPartitions and Fluent Bit fluentbit_output_retries_total. Recovery: Run Kafka with min.insync.replicas=2 and 3+ brokers. Configure Fluent Bit with filesystem buffering and a 2 GB buffer limit so it can survive 30-minute outages.
Scenario 3: PII Leak Into Log Storage. A new gRPC service logs full request bodies at INFO level, including customer emails and phone numbers. Nobody catches it for 72 hours until a security scan flags PII in Elasticsearch. Now there are 14 TB of logs containing personal data subject to GDPR. Retroactive deletion from Elasticsearch means reindexing, which costs around $8K in compute and takes 48 hours. Detection: Run automated PII scanners (Amazon Macie, or a custom regex pipeline) on a sampled stream of log data. Alert on matches for email, SSN, and credit card patterns. Recovery: Deploy a Fluentd filter with regex-based redaction immediately. Start a GDPR data protection impact assessment (DPIA). Reindex the affected indices with PII fields scrubbed.
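The scrubbing step itself can be as simple as a regex pass in the pipeline. A toy Python version of the idea (the patterns are deliberately rough and would need tuning, plus allow-lists of known-safe fields):

```python
# Toy PII-redaction pass of the kind a Fluentd/Logstash filter performs.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "CARD":  re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub(line: str) -> str:
    # Replace each match with a labeled placeholder before the line hits storage.
    for name, pattern in PATTERNS.items():
        line = pattern.sub(f"[REDACTED_{name}]", line)
    return line

print(scrub('{"msg": "signup", "email": "jane.doe@example.com"}'))
# {"msg": "signup", "email": "[REDACTED_EMAIL]"}
```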
Capacity Planning
Volume estimation formula: daily_log_volume = aggregate_rps * log_lines_per_request * avg_log_line_size * 86,400 (seconds per day), where aggregate RPS is roughly num_services * avg_rps_per_service. For a 200-service platform at 10K aggregate RPS generating 3 log lines per request at 500 bytes each, that works out to: 10,000 * 3 * 500 * 86,400 ≈ 1.3 TB/day raw.
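The same arithmetic as a tiny helper, reusing the figures from the example and the storage multipliers from the table below:

```python
# Daily log volume estimate, using the formula and example figures above.
def daily_log_volume_bytes(aggregate_rps, lines_per_request, bytes_per_line):
    return aggregate_rps * lines_per_request * bytes_per_line * 86_400  # seconds/day

raw = daily_log_volume_bytes(aggregate_rps=10_000, lines_per_request=3, bytes_per_line=500)
print(f"{raw / 1e12:.2f} TB/day raw")              # ~1.30 TB/day
print(f"{raw * 1.7 / 1e12:.2f} TB Elasticsearch")  # with the 1.7x index overhead
print(f"{raw * 0.3 / 1e12:.2f} TB Loki")           # with the 0.3x compressed-chunk ratio
```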
| Scale Tier | Services | Daily Volume (Raw) | Elasticsearch Storage (1.7x) | Loki Storage (0.3x) | Reference |
|---|---|---|---|---|---|
| Startup | 10 | 10 GB | 17 GB | 3 GB | Early-stage SaaS |
| Mid-scale | 100 | 500 GB | 850 GB | 150 GB | Series C platform |
| Large-scale | 500 | 5 TB | 8.5 TB | 1.5 TB | Uber-scale (100TB+/day) |
| Hyper-scale | 2000+ | 50 TB+ | 85 TB+ | 15 TB+ | Netflix, Cloudflare |
Key thresholds to keep in mind. Elasticsearch shard size should be 10-50 GB for good performance. More than 1,000 shards per node causes cluster instability. Loki chunk_target_size should be 1.5 MB for optimal object storage performance. Budget 1 GB heap per 20 active shards for Elasticsearch JVM. When daily volume exceeds 5 TB, Loki with an S3 backend becomes 10-20x cheaper than Elasticsearch. Log sampling at 10% for non-error traffic is standard above 1 TB/day.
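A back-of-the-envelope check against those thresholds, assuming a 30 GB target shard size and 7-day hot retention; purely illustrative:

```python
# Back-of-the-envelope Elasticsearch sizing from the thresholds above.
# Assumes a 30 GB shard size (inside the 10-50 GB range) and 1 GB of JVM heap
# per 20 active shards, spread across the cluster.
def es_sizing(daily_raw_tb, retention_days=7, index_overhead=1.7, shard_gb=30):
    hot_storage_gb = daily_raw_tb * 1000 * index_overhead * retention_days
    shards = hot_storage_gb / shard_gb
    heap_gb = shards / 20
    return round(hot_storage_gb), round(shards), round(heap_gb, 1)

print(es_sizing(5))  # 5 TB/day tier: ~59,500 GB hot, ~1,983 shards, ~99 GB total heap
```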
Architecture Decision Record
Decision: Choosing a Centralized Logging Stack
| Criteria (Weight) | Elasticsearch + Kibana | Grafana Loki | Datadog Logs | Splunk |
|---|---|---|---|---|
| Storage cost (25%) | 2 - Full-text index doubles storage | 5 - Object storage, label-only index | 2 - $0.10/GB ingested, costly at scale | 1 - Most expensive per GB |
| Query flexibility (20%) | 5 - Full-text search, aggregations | 3 - Label filters + grep, limited analytics | 4 - Pattern analysis, faceting | 5 - SPL is the most powerful query language |
| Operational burden (20%) | 2 - JVM tuning, shard management, cluster ops | 4 - Stateless read path, simple scaling | 5 - Fully managed | 3 - Heavy infrastructure for on-prem |
| Ecosystem integration (15%) | 3 - Separate from metrics/traces | 5 - Native Grafana, links to Tempo/Mimir | 4 - Unified with APM/metrics | 3 - Standalone ecosystem |
| Compliance / retention (10%) | 4 - ILM policies, snapshot restore | 3 - S3 lifecycle policies | 3 - Limited retention controls | 5 - Best-in-class compliance features |
| Real-time tail (10%) | 3 - Kibana Discover, some lag | 4 - Live tail via LogCLI | 5 - Real-time live tail | 4 - Real-time search |
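To see how the weights play out, a quick computation of the weighted totals, with scores and weights copied straight from the matrix above:

```python
# Weighted totals for the decision matrix above.
weights = {"storage": 0.25, "query": 0.20, "ops": 0.20,
           "ecosystem": 0.15, "compliance": 0.10, "tail": 0.10}

scores = {
    "Elasticsearch + Kibana": {"storage": 2, "query": 5, "ops": 2, "ecosystem": 3, "compliance": 4, "tail": 3},
    "Grafana Loki":           {"storage": 5, "query": 3, "ops": 4, "ecosystem": 5, "compliance": 3, "tail": 4},
    "Datadog Logs":           {"storage": 2, "query": 4, "ops": 5, "ecosystem": 4, "compliance": 3, "tail": 5},
    "Splunk":                 {"storage": 1, "query": 5, "ops": 3, "ecosystem": 3, "compliance": 5, "tail": 4},
}

for tool, s in scores.items():
    total = sum(weights[k] * s[k] for k in weights)
    print(f"{tool}: {total:.2f}")
# Loki comes out ahead (4.10) vs Datadog (3.70), Splunk (3.20), Elasticsearch (3.05),
# which lines up with the "start with Loki" guidance above.
```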
When to choose what:
- Team < 20, already using Grafana: Loki. It's 10x cheaper, plugs right into existing Grafana dashboards, and is simple to operate.
- Team 20-100, need ad-hoc search: Elasticsearch. Full-text search is indispensable when debugging problems nobody anticipated. Budget for a dedicated person (or at least 20% of someone's time) to run the cluster.
- Regulated enterprise (SOC 2, HIPAA): Splunk. The compliance reporting, certified integrations, and long-term retention are built in. The cost is high, but it's justified by audit readiness.
- Already on Datadog for metrics: Datadog Logs. A unified platform cuts context-switching. Accept the cost premium for the convenience.
- Cost-constrained, >5 TB/day: Loki with aggressive log sampling (10% for non-errors, 100% for errors). Archive raw logs to S3 for compliance, query only through Loki.
Key Points
- Centralized logging pulls logs from all services into one searchable system
- Structured logging (JSON) makes querying and filtering possible. Unstructured text logs become useless past a handful of services
- ELK/EFK stack (Elasticsearch, Fluentd/Logstash, Kibana) is the classic open-source approach
- Log levels (DEBUG, INFO, WARN, ERROR) should be adjustable at runtime without a redeploy
- Correlation IDs (trace IDs) connect logs across services for a single request, and they are non-negotiable for debugging
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Grafana Loki | Open Source | Log aggregation without full-text indexing, cost-efficient | Medium-Enterprise |
| Elasticsearch + Kibana | Open Source | Full-text search, complex queries, mature ecosystem | Medium-Enterprise |
| Datadog Logs | Commercial | Unified with metrics/traces, live tail, patterns | Small-Enterprise |
| Fluentd/Fluent Bit | Open Source | Log collection and routing, CNCF graduated | Medium-Enterprise |
Common Mistakes
- Logging sensitive data (passwords, tokens, PII), which violates compliance and creates security risks
- Unstructured log messages. Grep-based debugging falls apart past 10 services
- Not setting log retention policies, so storage costs grow forever
- Logging too much at INFO level, overwhelming the logging pipeline with noise
- Not buffering logs. Direct writes to Elasticsearch from every pod cause write amplification