Monocle (Datadog)
Datadog's shard-per-core Rust TSDB that replaced their legacy storage
Architecture
How It Works Internally
Monocle is built around one core idea: eliminate all cross-core contention by giving each CPU core its own complete storage engine instance.
Each core runs an independent LSM-tree (Log-Structured Merge-tree). Write operations land in an in-memory memtable owned by a single core. When the memtable fills, it flushes to an immutable SSTable on disk. Compaction (merging SSTables to reduce read amplification) happens per-core, so one core compacting never blocks another core from serving reads.
Memory allocation is also per-core. Each core has its own allocator arena, eliminating contention on the global heap. In traditional multi-threaded databases, memory allocation is a hidden bottleneck because malloc() internally uses locks. Monocle sidesteps this entirely.
I/O is similarly partitioned. Each core owns a dedicated I/O queue and a set of file descriptors. This means disk operations on core 0 cannot create head-of-line blocking for core 3. On modern NVMe drives with multiple hardware queues, this maps cleanly to the storage hardware.
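To make this concrete, here is a simplified Rust sketch of the state a single core might own: a private memtable plus its flushed SSTables (the allocator arena and I/O queue are elided). All type and field names are illustrative assumptions, not Monocle's actual internals.

```rust
use std::collections::BTreeMap;

type SeriesId = u64;
type Timestamp = i64;

/// Immutable run produced by a memtable flush (disk format elided).
struct SsTable {
    entries: BTreeMap<(SeriesId, Timestamp), f64>,
}

/// Everything one core owns exclusively. Because only this core's
/// thread ever touches these fields, no locks are needed.
struct CoreShard {
    memtable: BTreeMap<(SeriesId, Timestamp), f64>, // active write buffer
    sstables: Vec<SsTable>,                         // flushed runs, newest last
}

impl CoreShard {
    fn write(&mut self, series: SeriesId, ts: Timestamp, value: f64) {
        // Single-writer hot path: no synchronization at all.
        self.memtable.insert((series, ts), value);
        if self.memtable.len() >= 1_000_000 {
            self.flush();
        }
    }

    fn flush(&mut self) {
        // Atomic swap: freeze the full memtable into an immutable
        // SSTable and continue writes on a fresh, empty one.
        let frozen = std::mem::take(&mut self.memtable);
        self.sstables.push(SsTable { entries: frozen });
    }
}

fn main() {
    let mut shard = CoreShard { memtable: BTreeMap::new(), sstables: Vec::new() };
    shard.write(42, 1_700_000_000, 0.73);
}
```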
The indexing layer uses RocksDB for the inverted index that maps label combinations to time-series IDs. This mirrors the index/data separation in VictoriaMetrics (whose index is mergeset, a custom B-tree variant), but Monocle chose RocksDB for its maturity and tuning ecosystem. The index is shared across cores but accessed through carefully designed lock-free read paths for the hot query case.
Data itself is stored in a custom columnar format optimized for time-series access patterns: sequential timestamp reads, value compression (Gorilla-style), and block-level predicate pushdown. Datadog has not published the full wire format, but conference talks describe it as a hybrid of Gorilla encoding for recent data and dictionary-compressed columnar blocks for historical queries.
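Gorilla's timestamp trick is worth seeing concretely: regular scrape intervals make the delta-of-delta zero almost everywhere, and the real bit-level encoding stores each zero in a single bit. The sketch below (simplified, with no bit packing) just computes those values to show why they compress so well.

```rust
/// Delta-of-delta values for a timestamp sequence, as in Gorilla.
/// The first timestamp and first delta are stored raw; everything
/// after that is the change between consecutive deltas.
fn delta_of_deltas(timestamps: &[i64]) -> Vec<i64> {
    let mut out = Vec::new();
    let mut prev_delta: Option<i64> = None;
    for pair in timestamps.windows(2) {
        let delta = pair[1] - pair[0];
        if let Some(prev) = prev_delta {
            out.push(delta - prev); // zero for perfectly regular intervals
        }
        prev_delta = Some(delta);
    }
    out
}

fn main() {
    // A 10-second scrape interval with one sample arriving a second late.
    let ts = [0, 10, 20, 30, 41, 51];
    assert_eq!(delta_of_deltas(&ts), vec![0, 0, 1, -1]);
}
```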
The Write Path
The write path is designed for zero cross-core coordination.
- Intake API receives metric payloads over HTTP/gRPC from Datadog agents running on customer infrastructure. Payloads contain batched metric samples: {metric_name, tags[], value, timestamp}.
- Shard Router computes a fingerprint from the metric name and tag set (a hash of the canonical label string). This fingerprint deterministically maps to a target CPU core. The same time series always lands on the same core, which means the memtable for that series is always local: no cross-core lookups on write. (A routing sketch follows below.)
- Core-local memtable accepts the write. Because only one core ever writes to this memtable, no locks are needed. The memtable is a sorted in-memory structure (typically a skip list or B-tree variant) that supports both point lookups and range scans.
- Flush to SSTable triggers when the memtable reaches a size threshold (typically tens of MB). The flush is an atomic swap: a new empty memtable takes over writes while the full one serializes to an immutable SSTable on disk. Gorilla encoding compresses timestamps and values in the SSTable blocks.
- Per-core compaction runs in the background. Each core merges its own SSTables independently. This is where shard-per-core pays off most clearly: one core running heavy compaction cannot stall writes or reads on any other core. The tradeoff is higher total write amplification: N cores means N independent compaction streams, each doing their own merge work.
The critical invariant: at no point in the write path does one core wait on another. This is what makes write throughput scale linearly with core count.
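The routing step can be sketched in a few lines. The canonical-string format, hash function, and modulo core selection below are illustrative assumptions; Datadog has not published its actual fingerprint scheme.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Build a canonical label string so that tag order never changes the hash.
fn canonical_key(metric: &str, mut tags: Vec<(&str, &str)>) -> String {
    tags.sort();
    let joined: Vec<String> = tags.iter().map(|(k, v)| format!("{k}={v}")).collect();
    format!("{metric}{{{}}}", joined.join(","))
}

/// Deterministically map a series to a core: same series, same core, always.
fn route_to_core(metric: &str, tags: Vec<(&str, &str)>, num_cores: usize) -> usize {
    let mut h = DefaultHasher::new();
    canonical_key(metric, tags).hash(&mut h);
    (h.finish() as usize) % num_cores
}

fn main() {
    // Tag order differs, but both writes land on the same core.
    let a = route_to_core("cpu.user", vec![("host", "web-1"), ("region", "us-east-1")], 64);
    let b = route_to_core("cpu.user", vec![("region", "us-east-1"), ("host", "web-1")], 64);
    assert_eq!(a, b);
}
```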
The Query Path
Queries are inherently cross-core because data for a given time range may span multiple series assigned to different cores. Monocle uses a scatter-gather pattern.
- Query frontend receives the query (e.g., "average CPU usage across all hosts in us-east-1 for the last hour"). It parses the expression and builds a query plan.
- Index lookup consults RocksDB to resolve the label matchers (region=us-east-1) to a set of time-series IDs. The index tells the query planner which cores own which matching series.
- Scatter fans the query out to the relevant cores. Each core receives a list of series IDs and a time range. The core scans its local LSM-tree, checking the active memtable first, then SSTables from newest to oldest.
- Per-core scan decompresses Gorilla-encoded blocks, applies any predicate pushdown (e.g., filtering by value threshold), and produces a partial result set. Because each core's data is independent, this step runs fully in parallel across cores.
- Gather and merge collects the N partial results and combines them. For aggregation queries (sum, average, percentile), the merge step performs the final reduction. For raw queries, it merges sorted streams by timestamp.
The scatter-gather pattern means query latency is bounded by the slowest core. If one core is doing heavy compaction or has more data to scan, every query touching that core's data pays the penalty. Datadog mitigates this with pre-aggregated rollups: common query patterns (5-minute, 1-hour, 1-day rollups) are materialized during ingestion, reducing the scatter fan-out for dashboard queries that do not need raw-resolution data.
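The sketch below shows the scatter-gather shape with a per-core deadline: one thread stands in for each core, and the gather loop stops waiting once the deadline passes, returning a partial aggregate. The channel choice, timings, and partial-result policy are assumptions for illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Stand-in for a core scanning its local LSM-tree into a partial sum.
fn scan_core(core_id: usize) -> f64 {
    thread::sleep(Duration::from_millis(core_id as u64 * 5)); // uneven cores
    core_id as f64 * 10.0
}

fn main() {
    let num_cores = 8;
    let deadline = Duration::from_millis(25);
    let (tx, rx) = mpsc::channel();

    // Scatter: fan the query out to every relevant core.
    for core in 0..num_cores {
        let tx = tx.clone();
        thread::spawn(move || {
            let _ = tx.send(scan_core(core));
        });
    }
    drop(tx);

    // Gather: merge partials, but stop once the deadline expires.
    let start = Instant::now();
    let (mut sum, mut responded) = (0.0, 0);
    while responded < num_cores {
        let remaining = deadline.saturating_sub(start.elapsed());
        match rx.recv_timeout(remaining) {
            Ok(partial) => {
                sum += partial;
                responded += 1;
            }
            Err(_) => break, // slow cores are dropped; the result is partial
        }
    }
    println!("sum={sum} from {responded}/{num_cores} cores");
}
```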
Why Datadog Built It
Before Monocle, Datadog used a Go-based storage engine. At their scale, Go's garbage collector became a problem. GC pauses introduced latency spikes at p99 and p999. With billions of live objects in the heap (each time series has metadata, index entries, and active memtable pointers), even Go's concurrent GC could not avoid stop-the-world pauses in the tens-of-milliseconds range.
The second problem was lock contention. Their Go engine used a shared-state architecture with fine-grained locking. As core counts grew (modern servers have 64-128+ cores), even reader-writer locks became bottlenecks. Cache line bouncing between cores for atomic reference counting added invisible overhead.
Datadog evaluated two options: rewrite in Go with a shared-nothing architecture, or rewrite in Rust with shard-per-core. They chose Rust because it solved both problems at once. No GC means no pause-time regressions as the heap grows. Rust's ownership model makes shard-per-core natural: each core owns its data structures, and the borrow checker enforces at compile time that no other core can mutate them.
The rewrite took multiple years. The migration strategy ran both engines in parallel, comparing output for correctness, before cutting over.
The Migration: Go to Rust
The Go-to-Rust rewrite is one of the better-documented large-scale storage engine migrations. The strategy had four phases.
Phase 1: Dual-write shadow mode. Both the old Go engine and the new Rust engine received identical write traffic. Neither engine was aware of the other. The Go engine remained the source of truth for all queries.
Phase 2: Output comparison. A validation framework ran identical queries against both engines and compared results. Differences were logged, triaged, and fixed. This phase ran for months because edge cases (timezone handling, NaN propagation, counter reset detection) surfaced slowly under real traffic patterns. The framework needed to tolerate acceptable floating-point differences while catching genuine logic bugs.
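The heart of such a framework is an approximate-equality predicate. One plausible shape, assuming a relative-error tolerance and treating NaN-vs-NaN as agreement (the actual rules Datadog used are not public):

```rust
/// Tolerant comparison for dual-engine validation: both-NaN counts as a
/// match, exact equality covers infinities, and everything else is held
/// to a relative-error bound. The tolerance is an illustrative choice.
fn results_match(old: f64, new: f64, rel_tol: f64) -> bool {
    if old.is_nan() && new.is_nan() {
        return true; // both engines agree the point is undefined
    }
    if old == new {
        return true; // covers infinities and exact values
    }
    let scale = old.abs().max(new.abs());
    (old - new).abs() <= rel_tol * scale
}

fn main() {
    assert!(results_match(f64::NAN, f64::NAN, 1e-9));
    assert!(results_match(100.0, 100.0 + 1e-8, 1e-9));
    assert!(!results_match(100.0, 101.0, 1e-9));
}
```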
Phase 3: Gradual traffic shift. Read traffic migrated incrementally — first internal dashboards, then low-priority customer queries, then progressively higher tiers. At each stage, latency, correctness, and resource usage were compared. The team maintained the ability to instantly revert any customer's queries back to the Go engine.
Phase 4: Decommission. Once all traffic ran on Monocle with stable metrics for an extended period, the Go engine was decommissioned. The dual-write pipeline was removed to eliminate the ingestion cost.
This pattern (dual-write, compare, shift, decommission) is the standard approach for storage engine migrations at scale. Google used a similar strategy when migrating Bigtable internals, and Meta used it to move MySQL's storage engine from InnoDB to RocksDB (MyRocks).
Multi-Tenancy at Datadog Scale
Datadog serves tens of thousands of customers from a shared infrastructure. Monocle must isolate tenants so that one customer's burst does not degrade another's experience.
Tenant-aware shard routing. The shard router incorporates tenant ID into the hash function. This means a tenant's series are spread across cores, but the system can track per-tenant resource consumption at the core level.
Per-tenant resource quotas. Each tenant has ingestion rate limits (data points per second), cardinality limits (number of unique time series), and query concurrency limits. These are enforced at the intake API layer before data reaches Monocle's cores, so a runaway tenant is rejected early.
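Limits like these are commonly enforced with token buckets. The sketch below shows one per-tenant bucket at the intake layer; the tenant IDs, capacities, and refill rates are invented for illustration.

```rust
use std::collections::HashMap;
use std::time::Instant;

/// Classic token bucket: `capacity` caps the burst, `refill_per_sec`
/// caps the sustained rate (both in data points here).
struct TokenBucket {
    capacity: f64,
    tokens: f64,
    refill_per_sec: f64,
    last: Instant,
}

impl TokenBucket {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, tokens: capacity, refill_per_sec, last: Instant::now() }
    }

    /// Admit a batch of `n` points, or reject it before it reaches a core.
    fn try_admit(&mut self, n: f64) -> bool {
        let now = Instant::now();
        let elapsed = now.duration_since(self.last).as_secs_f64();
        self.last = now;
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        if self.tokens >= n {
            self.tokens -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut limits: HashMap<&str, TokenBucket> = HashMap::new();
    limits.insert("tenant-a", TokenBucket::new(10_000.0, 50_000.0));
    let bucket = limits.get_mut("tenant-a").unwrap();
    println!("batch admitted: {}", bucket.try_admit(500.0));
}
```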
Noisy neighbor detection. Monocle tracks per-tenant CPU time, memory usage, and I/O bandwidth at the core level. If a tenant's queries consume disproportionate resources on a core, the system can throttle that tenant's query rate or redirect their queries to dedicated capacity. This is similar to how cloud databases implement workload management (Snowflake's virtual warehouses, BigQuery's slot allocation).
Query fairness. The scatter-gather query path includes per-tenant fair queuing. When multiple tenants' queries compete for the same core's scan capacity, a weighted fair queue ensures no single tenant monopolizes scan time. This prevents a tenant running expensive SELECT *-equivalent queries from starving other tenants' dashboard refreshes.
The multi-tenancy design is deeply integrated — it is not a layer on top of a single-tenant engine. This is one reason Monocle cannot be easily extracted from Datadog's infrastructure: the tenant isolation logic is woven through every component.
Failure Scenarios
Core crash. If a CPU core fails or its thread panics, only that core's shard is affected. The shard router detects the failure and reassigns the affected series to remaining cores. Data recovery uses the per-core WAL (Write-Ahead Log): on restart, the core replays its WAL to reconstruct the memtable state. Queries for series owned by the failed core return partial results or errors until reassignment completes.
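WAL replay itself is conceptually simple: read records in append order and reinsert them into a fresh memtable, as in the sketch below. The record type is hypothetical, and a real per-core WAL would be an on-disk, checksummed log with fsync on append.

```rust
use std::collections::BTreeMap;

struct WalRecord {
    series: u64,
    ts: i64,
    value: f64,
}

/// Rebuild the memtable by replaying WAL records in write order;
/// later records for the same (series, ts) key win, matching the
/// order in which the writes were originally applied.
fn replay(wal: &[WalRecord]) -> BTreeMap<(u64, i64), f64> {
    let mut memtable = BTreeMap::new();
    for rec in wal {
        memtable.insert((rec.series, rec.ts), rec.value);
    }
    memtable
}

fn main() {
    let wal = vec![
        WalRecord { series: 1, ts: 100, value: 0.5 },
        WalRecord { series: 1, ts: 110, value: 0.7 },
    ];
    assert_eq!(replay(&wal).len(), 2);
}
```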
NVMe drive failure. Because I/O is partitioned per-core, a drive failure only impacts cores whose I/O queues are mapped to that drive. Monocle runs on servers with multiple NVMe drives, and the core-to-drive mapping is configured at deployment. Affected cores stop serving reads for on-disk data while replication (cross-node, not cross-core) provides the fallback. This is a key advantage of per-core I/O isolation: a single drive failure does not cascade to all cores.
Compaction stall. A core falling behind on compaction accumulates excess SSTables, increasing read amplification (more files to check per query) and disk usage. Because compaction is per-core, this does not affect other cores. The system monitors per-core SSTable counts and can temporarily reduce that core's write intake to let compaction catch up. In the worst case, the core's series are redistributed.
Query timeout on scatter-gather. If one core is slow (compaction, I/O saturation, or heavy scan), the gather step waits. Monocle handles this with per-core timeouts: if a core does not respond within the deadline, the query returns partial results with a warning rather than failing entirely. For aggregation queries, the missing core's contribution is estimated from rollup data when available.
Architecture Lessons for System Design
Monocle is proprietary and cannot be deployed outside Datadog. But the architecture patterns are broadly applicable.
Shard-per-core as a scaling pattern. When lock contention is the bottleneck, partitioning state by CPU core eliminates it completely. ScyllaDB uses the same pattern (via Seastar) for NoSQL workloads. Redpanda uses it for Kafka-compatible streaming. TigerBeetle uses it for financial transaction processing. The tradeoff: complexity increases because cross-core coordination requires explicit message passing, and load imbalance between cores can waste capacity.
When shard-per-core is overkill. Most systems do not need this level of optimization. If the database runs on 4-8 cores and handles thousands (not billions) of writes per second, the overhead of shard-per-core design (core pinning, NUMA-aware memory allocation, explicit message passing) is not justified. Fine-grained locking or lock-free concurrent data structures are simpler and sufficient. Shard-per-core starts paying off at 32+ cores with millions of operations per second where cache line bouncing becomes measurable.
Rust for latency-sensitive storage engines. When GC pauses are unacceptable at scale, a systems language without GC is the pragmatic choice. InfluxDB IOx, Quickwit (search engine), and SurrealDB all made the same decision. The tradeoff: longer development time, smaller hiring pool, and the learning curve of Rust's ownership model.
Separating index from data. Using RocksDB for the inverted index and a custom format for the actual time-series data provides best-of-both-worlds: a battle-tested key-value engine for lookups and a purpose-built format for sequential time-series reads. VictoriaMetrics, Prometheus, and InfluxDB all use this separation pattern.
Per-core memory allocation. General-purpose allocators like glibc's malloc use internal locks. At high thread counts, allocation becomes a bottleneck. Per-core arenas (or thread-local allocators like jemalloc's tcache) eliminate this. This matters for any high-throughput data system, not just TSDBs.
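A toy illustration of the idea: with a thread-local bump arena, allocation never takes a lock because each thread (or pinned core) touches only memory it owns. This sketches the concept only; it is nothing like production allocator code.

```rust
use std::cell::RefCell;

thread_local! {
    // Each thread owns its arena outright, so no lock is ever taken.
    static ARENA: RefCell<Vec<u8>> = RefCell::new(Vec::with_capacity(1 << 20));
}

/// Bump-allocate `n` bytes in this thread's arena and return their
/// offset. Freeing is wholesale: reset the arena when a batch is done.
fn alloc_bytes(n: usize) -> usize {
    ARENA.with(|arena| {
        let mut buf = arena.borrow_mut();
        let offset = buf.len();
        buf.resize(offset + n, 0);
        offset
    })
}

fn main() {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            std::thread::spawn(|| {
                // Allocations on different threads never contend.
                for _ in 0..1_000 {
                    alloc_bytes(64);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
}
```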
Production Numbers
What is publicly known from Datadog's blog posts and conference talks:
- Ingestion rate: Trillions of data points per day across the metrics product. This translates to tens of millions of data points per second sustained (one trillion per day is roughly 11.6 million per second), with burst capacity significantly higher.
- Series cardinality: Billions of unique time series across all tenants. The RocksDB index layer handles this cardinality with sub-millisecond label lookups.
- Query latency: Sub-second p99 for dashboard queries. Pre-aggregated rollups handle the majority of dashboard load; raw-resolution queries against recent data (last few hours) also target sub-second response.
- Hardware profile: Datadog runs on high-core-count servers (64-128+ cores per node) with multiple NVMe drives per node. Custom kernel configurations include NUMA-aware scheduling, interrupt affinity pinning, and I/O scheduler tuning for NVMe hardware queues.
- Compression ratio: Gorilla-style encoding achieves 1.5-2 bytes per data point for recent data. Historical data uses dictionary-compressed columnar blocks that achieve higher compression ratios at the cost of random access speed.
- Node count: Not publicly disclosed, but the combination of shard-per-core (vertical scaling within a node) and horizontal sharding (across nodes) means Monocle achieves its throughput with fewer nodes than a design that relies solely on horizontal scaling.
These numbers are achievable because Datadog controls the full stack: hardware selection, kernel tuning, network configuration, and deployment tooling. Reproducing similar throughput on commodity cloud instances with default kernel settings would require significantly more nodes.
How Open-Source Alternatives Compare
| Dimension | Monocle | VictoriaMetrics | Grafana Mimir | InfluxDB IOx |
|---|---|---|---|---|
| Language | Rust | Go | Go | Rust |
| Threading | Shard-per-core | Goroutines + shared state | Goroutines + shared state | Async Rust (Tokio) |
| Compression | Gorilla + dictionary columnar | Gorilla (1.5-2 bytes/point) | Prometheus chunks | Apache Arrow columnar |
| Index | RocksDB | Custom mergeset | Hash ring + TSDB index | Apache Parquet |
| Max tested scale | Trillions/day | 500M/sec (cluster) | 100M/sec | Analytical workloads |
| Deployable | No (proprietary) | Yes (open-source) | Yes (open-source) | Yes (open-source) |
VictoriaMetrics (Go, shared-nothing cluster). Achieves horizontal scaling through separate vminsert/vmselect/vmstorage components with consistent hashing. No per-core partitioning. Uses Go's GC but keeps heap pressure low through aggressive Gorilla compression and careful object pooling. Handles 2-5M data points/sec per vmstorage node. Proven at 500M/sec in cluster mode.
Grafana Mimir (Go, horizontally scaled Prometheus). Uses a consistent hash ring for distribution. Each ingester handles a shard of the metric space. No shard-per-core design. Relies on Go's concurrent GC and keeps memory bounded through chunk lifecycle management. Targets the 10M-100M metrics/sec range for most deployments.
InfluxDB IOx (Rust, Apache Arrow + DataFusion). The closest open-source analog to Monocle's language choice. Uses Rust for the storage engine and Apache Arrow's columnar format for query execution. DataFusion provides the query engine. Does not use shard-per-core (uses async Rust with Tokio instead). Targets analytical time-series workloads with SQL compatibility.
Key differences from Monocle: All three open-source options scale horizontally by adding nodes, not by partitioning within a single node's cores. Monocle does both: shard-per-core within a node and horizontal sharding across nodes. This two-level partitioning is what enables trillion-scale ingestion on fewer machines, but it requires controlling the full deployment stack.
Pros
- Shard-per-core: each CPU core owns its own LSM-tree, memory allocator, and I/O queue. Zero cross-core locks.
- Written in Rust for predictable latency and memory safety without GC pauses
- RocksDB-based indexing layer for label lookups at billion-series cardinality
- Designed for multi-tenant isolation from day one (Datadog's SaaS model)
- Handles trillions of data points per day in production
Cons
- Proprietary: not available outside Datadog. Cannot be self-hosted or evaluated.
- No public API documentation or query language specification
- Architecture details come only from blog posts and conference talks, not source code
- Tightly coupled to Datadog's infrastructure (custom networking, deployment tooling)
- Not a viable option for build-vs-buy decisions; study-only value
When to use
- You are evaluating Datadog as a managed observability vendor
- You want to study shard-per-core TSDB architecture for your own system design
- You need a reference point for what 'beyond open-source scale' looks like
When NOT to use
- You need a self-hosted or open-source TSDB (use VictoriaMetrics, Mimir, or InfluxDB instead)
- You want to run your own metrics infrastructure
- You need a system you can inspect, fork, or contribute to
Key Points
- Shard-per-core eliminates contention. Each core runs an independent LSM-tree with its own memory allocator and I/O queue. No mutexes, no lock contention, no false sharing.
- Rust gives predictable tail latency. No garbage collector means no GC pauses at p99. Memory safety comes from the borrow checker, not runtime overhead.
- RocksDB handles the inverted index for label-to-series lookups. The time-series data itself lives in Monocle's custom columnar format.
- Datadog processes trillions of data points per day across metrics, traces, and logs. Monocle is the storage engine behind their metrics product.
- The architecture was purpose-built after existing open-source TSDBs and their earlier Go-based engine could not scale further.
- Cross-core queries use scatter-gather: the query frontend fans out to all cores, each scans its local LSM-tree, and a merge step combines partial results. This adds latency proportional to the slowest core.
- Write amplification is a deliberate tradeoff. Each core compacts independently, so total compaction I/O scales with core count. Monocle accepts higher write amplification in exchange for zero write-path contention.
- The Go-to-Rust migration used dual-write shadow mode for months: both engines ingested identical data, and a comparison framework validated output equivalence before cutover.
Common Mistakes
- ✗ Treating Monocle as a technology that can be adopted. It is proprietary and cannot be deployed outside Datadog.
- ✗ Assuming shard-per-core is always better. It trades simplicity for performance; most workloads do not need this level of optimization.
- ✗ Comparing Monocle benchmarks directly to open-source TSDBs. Datadog controls the full stack (network, kernel tuning, hardware), making apples-to-apples comparison impossible.
- ✗ Assuming Datadog's published throughput numbers are achievable on commodity hardware. Datadog runs custom kernel configs, NVMe tuning, and NUMA-aware scheduling that most teams cannot replicate.
- ✗ Overlooking the operational complexity of shard-per-core. Core pinning, NUMA topology, interrupt affinity, and I/O scheduler tuning are all prerequisites, not optional optimizations.