VictoriaMetrics
The Prometheus long-term storage that does more with less hardware
Why Teams Switch To This
A team runs Prometheus. It works great until it doesn't. Maybe they hit 5 million active series and Prometheus is eating 32GB of RAM. Maybe they need 6 months of retention and the local disk is full. Maybe they try Thanos and find that the sidecar, store-gateway, compactor, and query components are more infrastructure than they bargained for.
VictoriaMetrics solves these problems with a simpler architecture and better resource efficiency. It started as a long-term storage backend for Prometheus (accepting data via remote_write), but at this point many teams use it as a complete Prometheus replacement with vmagent handling the scraping. Cloudflare, Grammarly, Fly.io, and Adidas all run it in production.
How the Storage Engine Works
VictoriaMetrics uses a custom storage engine built on the merge tree concept. It is similar in spirit to ClickHouse's MergeTree (not surprising, since the author previously contributed to ClickHouse), but specialized for time series.
When data arrives, it lands in an in-memory buffer called the rawRows buffer. When this fills up (or at regular intervals), the buffer gets sorted by metric name and timestamp, then flushed to disk as an immutable part. Each part contains two main structures: the data block with compressed timestamp-value pairs, and the metaindex block with offsets for each time series within the part.
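A minimal sketch of that flush path in Go, assuming simplified illustrative types (rawRow and part here are not the real structures):

```go
// Minimal sketch of the buffer-flush path, with simplified types.
package main

import (
	"fmt"
	"sort"
)

type rawRow struct {
	metricName string // canonical metric name plus sorted labels
	timestamp  int64  // milliseconds
	value      float64
}

// part models an immutable on-disk part: rows sorted by metric name,
// then timestamp, ready for block compression and the metaindex.
type part struct{ rows []rawRow }

// flush sorts the in-memory buffer and freezes it into a new part.
// From here on the part is never modified; background merges combine
// whole parts into larger ones.
func flush(buf []rawRow) part {
	sort.Slice(buf, func(i, j int) bool {
		if buf[i].metricName != buf[j].metricName {
			return buf[i].metricName < buf[j].metricName
		}
		return buf[i].timestamp < buf[j].timestamp
	})
	return part{rows: buf}
}

func main() {
	buf := []rawRow{
		{`cpu{host="b"}`, 30000, 0.7},
		{`cpu{host="a"}`, 15000, 0.2},
		{`cpu{host="a"}`, 30000, 0.3},
	}
	fmt.Println(flush(buf).rows)
}
```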
Compression is the biggest win. It implements the Gorilla algorithm from Facebook's 2015 VLDB paper, with additional optimizations on top.
Timestamps use delta-of-delta encoding. For metrics scraped at regular 15-second intervals, the delta between consecutive timestamps is constant (15,000ms), so the delta-of-delta is zero. A long run of zeros compresses to roughly 1.5 bits per timestamp. Irregular intervals cost more bits, but most infrastructure metrics are scraped on fixed intervals.
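A worked example of the arithmetic, assuming a regular 15-second scrape with one late sample (the very first delta is stored at full width; only later delta-of-deltas collapse to zero):

```go
// Worked delta-of-delta example for a mostly regular 15s scrape.
package main

import "fmt"

func main() {
	ts := []int64{15000, 30000, 45000, 60000, 75250} // ms; last scrape 250ms late
	prevDelta := int64(0)
	for i := 1; i < len(ts); i++ {
		delta := ts[i] - ts[i-1]
		dod := delta - prevDelta // zero on a fixed interval -> ~1.5 bits each
		fmt.Printf("t=%d delta=%d delta-of-delta=%d\n", ts[i], delta, dod)
		prevDelta = delta
	}
}
```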
Values use XOR encoding against the previous value. The key insight: consecutive metric samples are often identical or very similar. When two IEEE 754 floats are XORed, identical values produce all zeros, which encode as a single bit. In practice, 51% of consecutive metric values are identical (CPU usage didn't change in the last 15 seconds, request count stayed the same). Another 30% share the same leading/trailing zero structure and compress to about 27 bits. The remaining 19% need about 37 bits.
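A sketch of the value-encoding bit cost in Go. It covers the identical-value case and the new-window case; the real Gorilla encoder has a third, middle case that reuses the previous leading/trailing-zero window, which is where the ~27-bit average comes from:

```go
// Simplified Gorilla value-encoding cost model (bit accounting only,
// no actual bitstream writer).
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// xorBitCost returns roughly how many bits an encoder spends on a
// value, given the previous value in the same series.
func xorBitCost(prev, cur float64) int {
	x := math.Float64bits(prev) ^ math.Float64bits(cur)
	if x == 0 {
		return 1 // identical value: a single '0' control bit
	}
	meaningful := 64 - bits.LeadingZeros64(x) - bits.TrailingZeros64(x)
	// '11' control bits + 5-bit leading-zero count + 6-bit length + payload
	return 2 + 5 + 6 + meaningful
}

func main() {
	samples := []float64{42.0, 42.0, 42.5, 42.5, 43.0}
	total := 0
	for i := 1; i < len(samples); i++ {
		c := xorBitCost(samples[i-1], samples[i])
		total += c
		fmt.Printf("%v -> %v: %d bits\n", samples[i-1], samples[i], c)
	}
	fmt.Printf("average: %.1f bits/value (vs 64 bits raw)\n",
		float64(total)/float64(len(samples)-1))
}
```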
The math: a raw sample is 16 bytes (8-byte int64 timestamp + 8-byte float64 value). With Gorilla compression, the average drops to 1.37 bytes per sample. That is roughly 12x compression. A workload producing 54 TB/day uncompressed fits in under 5 TB/day compressed. This is not theoretical. These are the numbers from the original Facebook paper and from VictoriaMetrics's own benchmarks.
Background merges combine small parts into larger ones, similar to LSM tree compaction. Merges reduce the total number of parts (which improves query performance) and improve compression (larger blocks have more context for the compressor). Merges are the most I/O-intensive operation. If query latency increases, the first thing to check is whether merges are keeping up with ingestion rate.
The inverted index maps metric names and label key-value pairs to internal time series IDs (TSIDs). This index is also stored in a merge tree format with its own background merge process. When querying http_requests_total{job="api"}, the index resolves "job=api" to a set of TSIDs, and then only reads data blocks for those specific series.
Single-Node vs Cluster
Single-node is a single binary (victoria-metrics) that handles everything. It ingests, stores, and queries data. For most teams, this is the right choice. A single node on a modern server (16 cores, 64GB RAM, NVMe SSD) handles 1M+ samples/sec ingestion, 10-20 million active time series, and 6-12 months of retention. That covers a lot of organizations.
Cluster mode splits into three components. vminsert accepts incoming data (Prometheus remote_write, InfluxDB line protocol, Graphite, OpenTSDB, and others). vmstorage persists data to disk and serves read requests from vmselect. vmselect handles queries and aggregates results from multiple vmstorage nodes. Each component scales independently:
- Need more write throughput? Add vminsert instances behind a load balancer.
- Need more query capacity? Add vmselect instances.
- Need more storage? Add vmstorage nodes.
The data distribution uses consistent hashing. vminsert computes a fingerprint for each time series (FNV-1a hash of the sorted metric name + label set) and routes it to a specific vmstorage node based on that hash. This ensures all data for a given series lives on one node, which keeps queries efficient. When a vmstorage node goes down, vminsert detects the failure and reroutes writes to remaining nodes.
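A hedged sketch of the routing in Go. The fingerprint follows the FNV-1a-over-sorted-labels description above; the modulo placement at the end is a deliberate simplification, since real consistent hashing remaps far fewer series when nodes join or leave:

```go
// Simplified series-to-node routing sketch.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type label struct{ name, value string }

// fingerprint hashes the metric name plus sorted labels with FNV-1a,
// so the same series always produces the same hash.
func fingerprint(metric string, labels []label) uint64 {
	sort.Slice(labels, func(i, j int) bool { return labels[i].name < labels[j].name })
	h := fnv.New64a()
	h.Write([]byte(metric))
	for _, l := range labels {
		h.Write([]byte(l.name))
		h.Write([]byte{0}) // separator to avoid ambiguous concatenations
		h.Write([]byte(l.value))
		h.Write([]byte{0})
	}
	return h.Sum64()
}

func main() {
	nodes := []string{"vmstorage-0", "vmstorage-1", "vmstorage-2"}
	fp := fingerprint("http_requests_total", []label{
		{"job", "api"}, {"env", "prod"},
	})
	// Modulo placement for illustration only; consistent hashing would
	// remap only ~1/n of series when the node count changes.
	fmt.Printf("fingerprint=%016x -> %s\n", fp, nodes[fp%uint64(len(nodes))])
}
```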
Replication is optional but recommended for production: set -replicationFactor=2 on vminsert and each data point gets written to 2 vmstorage nodes. RF=2 is sufficient here, unlike databases that need RF=3 for quorum. VictoriaMetrics data is append-only and immutable, so there is no read-before-write conflict. If a node goes down, the replica still serves queries for that shard's data.
On the query side, vmselect fans out every query to ALL vmstorage nodes in parallel. Each node returns its local results, vmselect merges them, deduplicates overlapping data (from replication or partition recovery), and returns the final result. This fan-out pattern means query latency is bounded by the slowest vmstorage node, not the sum. It also means adding more vmstorage nodes increases total storage and write throughput without hurting query latency.
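The fan-out pattern, sketched in Go with a stand-in queryNode helper (hypothetical; the real vmselect-to-vmstorage protocol is a custom RPC):

```go
// Query fan-out sketch: ask every storage node in parallel, then merge.
package main

import (
	"fmt"
	"sync"
)

// queryNode stands in for an RPC to one vmstorage node; here it just
// returns a canned per-node partial result keyed by series.
func queryNode(node string) map[string][]float64 {
	return map[string][]float64{`up{job="api"}`: {1, 1, 1}}
}

func main() {
	nodes := []string{"vmstorage-0", "vmstorage-1", "vmstorage-2"}
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged = map[string][][]float64{}
	)
	// Latency is bounded by the slowest node, not the sum of all nodes.
	for _, n := range nodes {
		wg.Add(1)
		go func(n string) {
			defer wg.Done()
			partial := queryNode(n)
			mu.Lock()
			defer mu.Unlock()
			for series, points := range partial {
				merged[series] = append(merged[series], points)
			}
		}(n)
	}
	wg.Wait()
	for series, partials := range merged {
		fmt.Printf("%s: %d partial results to merge and dedup\n", series, len(partials))
	}
}
```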
Production Operations
Monitoring VictoriaMetrics itself (yes, monitoring the monitoring). Key metrics to track:
- vm_rows_inserted_total: ingestion rate. If this drops, something upstream is broken.
- vm_active_merges: number of concurrent merge operations. Should be non-zero but not maxed out.
- vm_parts: total number of parts across all partitions. If this grows steadily, merges are not keeping up.
- vm_slow_queries_total: queries exceeding the slow query threshold. Spikes usually point to a dashboard or alerting misconfiguration.
- process_resident_memory_bytes: actual RSS. Compare against available memory.
For backups, use vmbackup which creates consistent snapshots using hard links (nearly instant) and uploads to S3, GCS, or local filesystem. Restore with vmrestore. This is far simpler than backing up Prometheus, which involves dealing with WAL segments and block boundaries.
Capacity Planning
Single-node sizing guidelines based on real-world deployments:
| Active Series | Ingestion Rate | RAM | CPU | Storage (1 year) |
|---|---|---|---|---|
| 1M | 100K samples/sec | 8GB | 4 cores | 200GB |
| 5M | 500K samples/sec | 16GB | 8 cores | 1TB |
| 20M | 1M samples/sec | 64GB | 16 cores | 4TB |
These assume 15-second scrape intervals and typical metric cardinality. Actual numbers depend on the number of labels per series and query complexity. The storage estimates use VictoriaMetrics's compression, which is roughly 3-5x better than Prometheus.
For the cluster version, each component has different bottlenecks:
| Component | Instance | Throughput | Bottleneck |
|---|---|---|---|
| vminsert | c6g.xlarge (4 vCPU, 8GB) | ~12.5M samples/sec routing | CPU (hashing, protobuf parsing) |
| vmstorage | i3.4xlarge (16 vCPU, 122GB, 3.8TB NVMe) | ~5M samples/sec ingestion | Disk I/O (merges), RAM (inverted index) |
| vmselect | r6g.xlarge (4 vCPU, 32GB) | ~100 concurrent queries | Memory (result merging, caching) |
The vmstorage number deserves context. The official VictoriaMetrics benchmark achieved 2.2M samples/sec on a single node with 6GB RAM. On an i3.4xlarge (16 vCPU, 122GB, NVMe), the shard-per-core architecture scales roughly linearly with CPU count, making 5M samples/sec realistic. Scaling that up to a 500M samples/sec workload would take 100 vmstorage nodes; with RF=2 for high availability, 200 nodes total.
Inverted Index (Mergeset)
The inverted index is what makes label-based queries fast. When a query asks for http_requests_total{service="checkout", env="prod"}, the system needs to find which time series match those labels out of potentially billions.
VictoriaMetrics uses a data structure called mergeset, an LSM-tree variant optimized for sorted key lookups. It maps label key-value pairs to Time Series IDs (TSIDs). Each unique combination of metric name + labels gets a 64-bit TSID.
New label-to-TSID mappings go into an in-memory table. When the table fills up, it flushes to disk as an immutable sorted "part." Background merges periodically combine small parts into larger ones, same as the data merge tree.
The read path handles multi-label queries through posting list intersection. For {service="checkout", env="prod"}, the index looks up two posting lists: all TSIDs where service=checkout, and all TSIDs where env=prod. It intersects these lists (starting with the smallest one for efficiency) to find the matching series.
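A sketch of the intersection step in Go, over illustrative TSID lists:

```go
// Sorted posting-list intersection sketch.
package main

import "fmt"

// intersect walks two sorted TSID lists with a two-pointer merge. A
// real engine may instead binary-search the larger list, driven by the
// smaller one, which is why starting with the smallest list pays off.
func intersect(a, b []uint64) []uint64 {
	var out []uint64
	i, j := 0, 0
	for i < len(a) && j < len(b) {
		switch {
		case a[i] < b[j]:
			i++
		case a[i] > b[j]:
			j++
		default:
			out = append(out, a[i])
			i++
			j++
		}
	}
	return out
}

func main() {
	checkout := []uint64{3, 17, 42, 99, 105} // TSIDs where service="checkout"
	prod := []uint64{1, 17, 42, 77, 99, 200} // TSIDs where env="prod"
	fmt.Println(intersect(checkout, prod))   // [17 42 99]
}
```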
Each index part has a bloom filter that allows fast "definitely not here" checks. If the bloom filter says a label value does not exist in a part, the system skips that part entirely. This is effective for low-cardinality labels like service or env, but less helpful for high-cardinality labels where bloom filter false positive rates increase.
The critical difference from Prometheus: mergeset is disk-backed, not in-memory. Prometheus holds its entire label index in RAM, which is why high-cardinality workloads cause OOM. VictoriaMetrics keeps hot index data in memory and pages in cold data from disk as needed. This is how it handles 100M+ series without falling over. The tradeoff is that cold index lookups hit disk, adding latency for queries touching rarely-accessed label combinations. In practice, active monitoring queries hit cached index data and stay fast.
High-cardinality labels (like user_id or request_id) bloat the index linearly. More unique values mean more entries in the mergeset, more disk I/O during compaction, more memory for bloom filters, and longer startup times when rebuilding the index after a restart (10-30 minutes with billions of series).
MetricsQL
MetricsQL is a superset of PromQL. Everything valid in PromQL works in MetricsQL without changes. Existing alerts, recording rules, and Grafana dashboards carry over with zero modifications.
The extensions solve real annoyances with PromQL. The most impactful: MetricsQL preserves metric names after aggregation functions. In PromQL, rate(http_requests_total[5m]) drops the metric name from the result, which makes chaining aggregations awkward. MetricsQL keeps it.
WITH templates allow reusable query fragments. Instead of copy-pasting complex label matchers across 20 dashboard panels, define them once and reference them. rollup_rate() automatically selects the best aggregation function for a given time range, handling the common mistake of using rate() with a range that is too short for the scrape interval. label_set() transforms labels in queries without needing recording rules.
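A minimal WITH-template sketch (the metric and label names are illustrative):

```metricsql
# Define common label filters once, reuse them across expressions.
WITH (
    commonFilters = {job="api", env="prod"}
)
rate(http_requests_total{commonFilters}[5m])
```

In a dashboard with 20 panels, changing commonFilters in one place updates every expression that references it.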
MetricsQL also handles missing data more gracefully. PromQL returns NaN gaps that break dashboard visualizations. MetricsQL interpolates across gaps using the last known value within a configurable lookback window. For operational dashboards where brief scrape failures are common, this produces cleaner charts without hiding real issues.
vmalert
VictoriaMetrics does not embed alerting into the storage engine. Instead, vmalert is a separate binary that handles both alerting rules and recording rules. It reads from vmselect, writes back to vminsert, and fires alerts to Alertmanager. Three flags define the entire connection model:
- -datasource.url points to vmselect (where vmalert runs MetricsQL queries)
- -remoteWrite.url points to vminsert (where recording rule output gets written back)
- -notifier.url points to Alertmanager (where firing alerts get sent)
The evaluation loop is simple. Every interval (default 15 seconds, configurable per rule group), vmalert executes each rule's MetricsQL expression against vmselect. For alerting rules, it compares the result against the threshold. If the condition holds longer than the for duration, the alert transitions from "pending" to "firing" and gets sent to Alertmanager. For recording rules, the result gets written back to VictoriaMetrics as a new time series via remote_write. Dashboards then query those pre-computed series instead of scanning millions of raw ones.
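The pending-to-firing transition, sketched in Go with a hypothetical evalQuery stand-in for the MetricsQL call (real vmalert tracks much more state):

```go
// Sketch of an alerting rule's pending -> firing state machine.
package main

import (
	"fmt"
	"time"
)

type rule struct {
	expr        string
	forDuration time.Duration
	activeSince time.Time // zero while the condition is false
}

// evalQuery stands in for running the rule's MetricsQL expression
// against vmselect; here it always reports the condition as true.
func evalQuery(expr string) bool { return true }

func (r *rule) eval(now time.Time) string {
	if !evalQuery(r.expr) {
		r.activeSince = time.Time{} // condition cleared: back to inactive
		return "inactive"
	}
	if r.activeSince.IsZero() {
		r.activeSince = now // condition just became true
	}
	if now.Sub(r.activeSince) >= r.forDuration {
		return "firing" // held past the 'for' duration: notify Alertmanager
	}
	return "pending"
}

func main() {
	r := &rule{expr: `up == 0`, forDuration: 30 * time.Second}
	start := time.Now()
	for i := 0; i < 4; i++ {
		fmt.Println(r.eval(start.Add(time.Duration(i) * 15 * time.Second)))
	}
}
```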
This separation from the storage engine is the key difference from Prometheus. In Prometheus, the ruler is embedded in the TSDB process. If the ruler gets overloaded with expensive rule evaluations, it competes for CPU and memory with ingestion and queries. vmalert runs as its own process. It can be scaled, restarted, or upgraded without touching storage or query availability. If a bad recording rule causes high CPU, kill that vmalert instance. vmselect and vmstorage keep running.
For large deployments with thousands of rules, run multiple vmalert instances with sharded rule groups. Each instance loads a subset of rule files and evaluates independently. There is no built-in coordination between instances, so the sharding is managed through configuration (separate rule file directories per instance). This is straightforward but manual. At extreme scale, a consistent-hashing layer in front of vmalert (like the observability platform design in Section 9.7 of the Observability Platform article) automates the shard assignment.
vmalert also provides a web UI at its HTTP port showing rule evaluation status, last evaluation time, and firing alerts. Useful for debugging why a rule is or isn't firing without digging through Alertmanager.
Multi-Tenancy
Running VictoriaMetrics for multiple teams or organizations requires tenant isolation at three levels.
Data plane. Each tenant gets a separate directory on vmstorage: /data/{tenant_id}/. Data from different tenants never mixes on disk. This isolation also means a compaction storm from one tenant's data does not affect another tenant's read performance.
Query plane. vmselect injects the authenticated tenant_id as a label filter into every query. If a user tries to include an explicit tenant_id label matcher in their query (to access another tenant's data), vmselect strips it and replaces it with the authenticated tenant ID from the request header. Audit logs flag these attempts.
Control plane. Per-tenant cardinality limits at vminsert prevent one tenant from blowing up shared infrastructure. When a tenant exceeds its active series budget, vminsert rejects new series with HTTP 429 and an X-Series-Limit header. Existing series continue ingesting normally.
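A sketch of that admission rule in Go, with hypothetical names (limiter, admit); the essential behavior is that only new series get rejected while existing series keep flowing:

```go
// Sketch of a per-tenant active-series limiter at the ingest tier.
package main

import (
	"fmt"
	"net/http"
)

type limiter struct {
	budget int
	active map[string]map[uint64]struct{} // tenant -> set of series fingerprints
}

// admit returns the HTTP status for a sample: 200 when accepted,
// 429 when the series is new and the tenant is over budget.
func (l *limiter) admit(tenant string, fp uint64) int {
	series, ok := l.active[tenant]
	if !ok {
		series = map[uint64]struct{}{}
		l.active[tenant] = series
	}
	if _, seen := series[fp]; seen {
		return http.StatusOK // existing series keep ingesting normally
	}
	if len(series) >= l.budget {
		return http.StatusTooManyRequests // reject only the new series
	}
	series[fp] = struct{}{}
	return http.StatusOK
}

func main() {
	l := &limiter{budget: 2, active: map[string]map[uint64]struct{}{}}
	fmt.Println(l.admit("team-a", 1), l.admit("team-a", 2), l.admit("team-a", 3))
}
```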
Shuffle-sharding adds another layer of isolation. Instead of spreading each tenant across all vmstorage nodes, each tenant gets assigned to a subset (for example, 4 out of 200 nodes). If one tenant's cardinality explosion causes OOM on their assigned nodes, only those nodes go down. Other tenants on different nodes see no impact. This is the same isolation pattern that Grafana Mimir uses, and it converts a cluster-wide outage into a single-tenant degradation.
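One way to sketch the assignment in Go, using a rendezvous-hashing style selection (the scheme is illustrative; the Mimir and VictoriaMetrics implementations differ in detail):

```go
// Shuffle-sharding sketch: deterministically pick a small subset of
// storage nodes per tenant, so a noisy tenant only affects its subset.
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// shardsFor ranks all n nodes by a per-(tenant, node) hash and takes
// the top k; the ranking stays stable for a tenant as the cluster grows.
func shardsFor(tenant string, n, k int) []int {
	type ranked struct {
		node int
		h    uint64
	}
	rs := make([]ranked, n)
	for i := 0; i < n; i++ {
		h := fnv.New64a()
		fmt.Fprintf(h, "%s/%d", tenant, i)
		rs[i] = ranked{node: i, h: h.Sum64()}
	}
	sort.Slice(rs, func(a, b int) bool { return rs[a].h > rs[b].h })
	nodes := make([]int, k)
	for i := 0; i < k; i++ {
		nodes[i] = rs[i].node
	}
	sort.Ints(nodes)
	return nodes
}

func main() {
	// Each tenant lands on 4 of 200 vmstorage nodes.
	fmt.Println(shardsFor("team-checkout", 200, 4))
	fmt.Println(shardsFor("team-search", 200, 4))
}
```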
Consistency Model
VictoriaMetrics is AP in CAP theorem terms. It prioritizes availability over strict consistency.
There is no quorum on writes. Each vmstorage node owns its shard independently. vminsert writes to the target shard (and optionally its replica) without waiting for acknowledgment from other nodes. This shared-nothing design eliminates coordination overhead and is a big reason VictoriaMetrics benchmarks 6.6x faster than InfluxDB on ingestion.
During a network partition, both sides of the split continue accepting writes for the shards they can reach. Writes to unreachable shards fail, and vminsert surfaces errors to the upstream producer (vmagent, Prometheus, or OTel Collector). When the partition heals, vmselect deduplicates overlapping data from both sides during query execution. The dedup is transparent to the user.
This model works because metric data is append-only and immutable. There is no update-in-place, no read-before-write, and no transaction isolation to worry about. Two nodes accepting the same data point independently is a non-issue since dedup handles it. Losing a few seconds of data during a partition is tolerable because the scrape cycle will produce the next data point 15 seconds later. For historical data, S3 exports provide a durable backup that survives even total cluster loss.
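A sketch of dedup-on-read in Go, assuming samples arrive sorted by timestamp and an interval analogous to -dedup.minScrapeInterval; the rule here keeps one sample per interval (the latest in each bucket), which is a simplification of the real behavior:

```go
// Dedup-on-read sketch over one series merged from two HA writers.
package main

import "fmt"

type sample struct {
	ts  int64 // ms
	val float64
}

// dedup keeps the last sample in each dedup-interval bucket and drops
// the near-duplicate points written by redundant scrapers.
func dedup(samples []sample, intervalMs int64) []sample {
	var out []sample
	for i, s := range samples {
		// Keep s only if it is the final sample in its interval bucket.
		if i == len(samples)-1 || samples[i+1].ts/intervalMs != s.ts/intervalMs {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// Two HA scrapers wrote near-duplicate points for the same series.
	merged := []sample{{15000, 1}, {15020, 1}, {30010, 1}, {30040, 1}}
	fmt.Println(dedup(merged, 15000)) // one sample per 15s bucket
}
```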
Failure Scenarios
Scenario 1: Merge queue grows unbounded. High ingestion rate combined with slow disks means background merges cannot keep up. The number of parts grows, which makes each query touch more files, which slows down queries, which creates a feedback loop. Detection: monitor vm_parts and vm_active_merges. If parts count grows consistently over hours, merges are falling behind. Recovery: reduce ingestion rate temporarily (or buffer in Kafka/vmagent), upgrade to faster storage, or increase -mergeConcurrency if CPU headroom exists. Long-term fix: size disk I/O for peak ingestion rate, not average.
Scenario 2: Cardinality spike exhausts memory. A deployment change causes label churn (pod restarts with new pod IDs as labels). Thousands of new series per minute hit VictoriaMetrics. The inverted index grows, eating memory. Unlike Prometheus which OOMs relatively quickly, VictoriaMetrics handles high cardinality better but still has limits. At 50M+ series, even VictoriaMetrics needs 64GB+ RAM for the index. Detection: track vm_new_timeseries_created_total for unusual spikes. Prevention: use relabeling in vmagent or Prometheus to drop high-cardinality labels before they reach storage. The -search.maxUniqueTimeseries flag prevents runaway queries from scanning the entire index.
Scenario 3: Network partition splits vminsert from vmstorage subset. vminsert can reach only half the vmstorage nodes. Writes for unreachable shards fail. vmselect returns partial query results from the reachable nodes only. Detection: monitor vm_rpc_connection_errors_total on vminsert and query result completeness on vmselect. Recovery: when the partition heals, deduplication on vmselect merges any overlapping writes. If replication factor is 2, the replica may be on the reachable side, in which case writes continue with no data loss. The key risk is prolonged partitions where the WAL on isolated vmstorage nodes fills up. Monitor disk usage on all vmstorage nodes during partition events.
Scenario 4: Cascading query overload from a single tenant. A tenant opens a dashboard with 20 panels, each running a broad query that touches millions of series. The dashboard auto-refreshes every 10 seconds. Each query fans out to all vmstorage nodes. This saturates vmstorage CPU across the cluster, slowing down all tenants' queries. Alert evaluation latency increases. Detection: monitor vm_concurrent_queries per tenant. Prevention: set per-tenant query concurrency limits (default 20) and per-query series limits (300K by default via -search.maxUniqueTimeseries). vmselect returns HTTP 422 when limits are exceeded. With shuffle-sharding, the impact is limited to the tenant's assigned vmstorage nodes rather than the entire cluster.
Pros
- Dramatically lower resource usage than Prometheus for the same workload, often 5-10x less RAM
- Handles high cardinality far better than Prometheus without falling over
- Drop-in Prometheus replacement with full PromQL compatibility plus MetricsQL extensions
- Compression is exceptional, typically 0.4-0.8 bytes per data point
- Single binary deployment. Download, run, done. Operationally simple.
- Shared-nothing cluster architecture where vminsert, vmselect, and vmstorage scale independently with zero coordination overhead
Cons
- Smaller community than Prometheus and Thanos, though growing fast
- Cluster version has a different architecture than single-node (not just 'add more nodes')
- MetricsQL extensions are useful but create vendor lock-in if you rely on them heavily
- Alerting is a separate binary (vmalert), not embedded in the storage engine. Extra component to deploy and configure.
- Documentation is functional but not as polished as Prometheus ecosystem docs
When to use
- You need long-term Prometheus storage without the complexity of Thanos
- Prometheus is running out of memory or disk and you need a more efficient backend
- Multi-cluster or multi-tenant monitoring where each team pushes metrics centrally
- High-cardinality workloads that crash Prometheus
When NOT to use
- You want alerting embedded in the same process as storage (Prometheus bundles this; VictoriaMetrics requires the separate vmalert binary)
- Your metrics volume fits comfortably in a single Prometheus instance with local storage
- You need strong ecosystem support and battle-tested integrations right now
- Log aggregation or distributed tracing (this is a metrics-only database)
Key Points
- VictoriaMetrics achieves 0.4-0.8 bytes per data point through a custom compression scheme. It uses delta-of-delta for timestamps and XOR-based Gorilla encoding for values, similar to what Prometheus does, but adds several optimizations on top. The result is that a workload using 100GB in Prometheus often fits in 10-20GB in VictoriaMetrics.
- The merge tree storage engine is append-only and designed to minimize random I/O. Data arrives, gets buffered in memory, and flushes to disk as immutable parts sorted by metric name and timestamp. Background merges combine small parts into larger ones. This is conceptually similar to LSM trees but specialized for time series access patterns.
- MetricsQL is a superset of PromQL. Everything that works in PromQL works in MetricsQL. But it adds useful functions like rollup_rate(), which automatically selects the best aggregation for a given time range, and label_set() for transforming labels in queries. These are genuinely useful extensions, not gimmicks.
- vmagent is a lightweight Prometheus-compatible scraper that handles service discovery, scraping, and remote_write. It is smaller and faster than a full Prometheus instance and can aggregate, deduplicate, and relabel metrics before sending them to VictoriaMetrics. Many teams replace Prometheus entirely with vmagent + VictoriaMetrics.
- The cluster version splits into three components: vminsert (accepts writes), vmselect (serves queries), and vmstorage (stores data). These scale independently. Write-heavy? Add more vminsert nodes. Query-heavy? Add more vmselect nodes. Need more storage? Add vmstorage nodes with more disk. This separation makes capacity planning straightforward.
- Deduplication is built in. If multiple Prometheus instances scrape the same targets (for HA), VictoriaMetrics automatically deduplicates the data on query. No need for the complex sidecar and store-gateway setup that Thanos requires. Configure the -dedup.minScrapeInterval flag and duplicates merge transparently.
- Gorilla compression works at the bit level. Timestamps use delta-of-delta encoding: with regular 15-second scrape intervals, the delta is constant so the delta-of-delta is zero, compressing to nearly nothing (~1.5 bits per timestamp). Values use XOR encoding against the previous value: 51% of consecutive metric values are identical, so their XOR is zero and they compress to a single bit. Another 30% share the same bit pattern structure and compress to ~27 bits. The remaining 19% need ~37 bits. The combined average is 1.37 bytes per sample, down from 16 bytes raw (8-byte timestamp + 8-byte float). That is a 12x compression ratio.
- The inverted index uses a data structure called mergeset, an LSM-tree variant optimized for sorted key lookups. It maps label key-value pairs to Time Series IDs (TSIDs). When a query like {service="checkout", env="prod"} arrives, the index looks up posting lists for each label, then intersects them to find matching TSIDs. Bloom filters per index part allow fast "definitely not here" checks that skip irrelevant parts entirely. Unlike Prometheus, which holds the entire index in memory, mergeset is disk-backed and handles 100M+ series without OOM.
- Multi-tenancy in cluster mode works through tenant ID headers. The X-Scope-OrgID header carries the tenant identifier through the entire pipeline. On the data plane, each tenant gets a separate directory on vmstorage. On the query plane, vmselect injects the tenant_id filter into every query automatically, preventing cross-tenant data access. vminsert enforces per-tenant cardinality limits and rejects new series with HTTP 429 when a tenant exceeds its budget. Shuffle-sharding assigns each tenant to a subset of vmstorage nodes, so one tenant's cardinality explosion only affects their assigned nodes, not the entire cluster.
- VictoriaMetrics prioritizes availability over consistency (AP in CAP). There is no quorum on writes. Each vmstorage node owns its shard independently. During a network partition, both sides continue accepting writes. When the partition heals, vmselect deduplicates overlapping data during query. This works because metric data is append-only and immutable. Replication factor 2 is sufficient (not 3 like databases that need quorum) because the dedup-on-read model handles split-brain safely.
Common Mistakes
- ✗ Running VictoriaMetrics with insufficient disk I/O. The merge process is I/O-intensive, and if merges fall behind ingestion, the result is too many open files and query latency spikes. Use SSDs, and monitor the number of active merges vs the merge queue size.
- ✗ Not configuring retention properly. The -retentionPeriod flag defaults to 1 month. For 1 year of data, set it at startup. Lowering retention later does not immediately free disk space from already-stored data, because deletion happens at the partition level during the next merge cycle.
- ✗ Ignoring the -search.maxUniqueTimeseries flag. By default, a single query can touch up to 300K unique time series. If a dashboard panel accidentally queries all series (missing a label filter), it can saturate vmselect. Set this limit based on what the dashboards actually need.
- ✗ Using the cluster version when single-node would suffice. The single-node binary handles 1M+ samples/sec ingestion and 10+ million active series on a single modern server. The cluster version adds operational complexity with three separate components. Only go multi-node when horizontal scaling is genuinely needed.
- ✗ Skipping vmagent and using Prometheus remote_write directly. This works, but vmagent handles retries, buffering, and backpressure more gracefully. If VictoriaMetrics goes down briefly, vmagent buffers data to disk and replays it. Prometheus remote_write has a limited buffer and will drop data faster.
- ✗ Not setting per-tenant cardinality limits in multi-tenant deployments. One tenant adding an unbounded label like user_id to a histogram creates millions of new series overnight. Without limits at vminsert, this causes OOM on the vmstorage nodes that tenant is assigned to, taking down all other tenants on those same nodes. Always configure per-tenant active series limits and monitor vm_new_timeseries_created_total per tenant.