InfluxDB
The purpose-built time series database that owns the IoT and metrics space
Architecture
How It Works Internally
InfluxDB was built from the ground up for time series data. The internals back it up. The storage engine, query planner, and data model all assume the data has timestamps and arrives mostly in order. Data that does not fit this pattern fights the system.
The TSM (Time-Structured Merge Tree) engine is the core of InfluxDB's storage layer. Think of it as an LSM tree that has been redesigned for time series. When data arrives, it goes into a write-ahead log (WAL) for durability and simultaneously into an in-memory cache for fast reads of recent data. The cache is a sorted map keyed by series key + timestamp. When the cache hits a size threshold (default 25MB) or a time limit (10 minutes), it gets snapshotted and written to a new TSM file on disk.
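To make the write path concrete, here is a toy model in Python. It is a heavily simplified sketch of the WAL-plus-cache-plus-snapshot flow described above, not InfluxDB's actual implementation; the class name and per-point cost are invented for illustration.
```python
import bisect
from collections import defaultdict

# Toy snapshot threshold, mirroring the 25MB cache snapshot default described above.
SNAPSHOT_THRESHOLD = 25 * 1024 * 1024

class TinyTSMEngine:
    """Hypothetical, simplified model of the TSM write path."""
    def __init__(self):
        self.wal = []                   # append-only log for durability (the real WAL is on disk)
        self.cache = defaultdict(list)  # series key -> time-ordered list of (timestamp, value)
        self.cache_bytes = 0
        self.tsm_files = []             # immutable snapshots, newest last

    def write(self, series_key, timestamp, value):
        self.wal.append((series_key, timestamp, value))             # durability first
        bisect.insort(self.cache[series_key], (timestamp, value))   # keep recent points sorted for reads
        self.cache_bytes += 16                                      # rough, made-up per-point cost
        if self.cache_bytes >= SNAPSHOT_THRESHOLD:
            self.snapshot()

    def snapshot(self):
        # Freeze the cache into an immutable "TSM file" and truncate the WAL.
        self.tsm_files.append({k: tuple(v) for k, v in self.cache.items()})
        self.cache.clear()
        self.cache_bytes = 0
        self.wal.clear()

engine = TinyTSMEngine()
engine.write("cpu,host=server01,region=us-east", 1_700_000_000, 42.5)
```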
TSM files are columnar and compressed. Each file contains a set of blocks, where each block holds timestamps and values for a single series over a time range. Timestamps get delta-encoded and then compressed with simple8b encoding or run-length encoding. Float values use Facebook's Gorilla compression (XOR-based), which achieves about 1.37 bytes per point for typical metrics data. Integer and string values use different codecs, but the compression is consistently good.
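A tiny Python sketch shows why regularly spaced timestamps compress so well: delta encoding collapses them to a constant, which run-length encoding then stores almost for free. This illustrates the idea only; it is not InfluxDB's simple8b or Gorilla implementation.
```python
def delta_encode(timestamps):
    """Store the first timestamp plus the difference between consecutive points."""
    deltas = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    return timestamps[0], deltas

def run_length_encode(deltas):
    """Collapse runs of identical deltas into [delta, count] pairs."""
    runs = []
    for d in deltas:
        if runs and runs[-1][0] == d:
            runs[-1][1] += 1
        else:
            runs.append([d, 1])
    return runs

# Points scraped every 10 seconds: the deltas are all 10, so a million timestamps
# reduce to one starting value and a single run.
ts = [1_700_000_000 + 10 * i for i in range(8)]
first, deltas = delta_encode(ts)
print(first, run_length_encode(deltas))   # 1700000000 [[10, 7]]
```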
Compaction runs in the background and merges smaller TSM files into larger ones. There are four compaction levels. Level 1 compactions merge snapshot files quickly. Level 2 and 3 compactions combine blocks more aggressively. Full compaction creates the final, highly compressed TSM files. The compaction process is similar to LevelDB's approach, but tuned for the append-mostly pattern of time series data.
Sharding and Data Organization
Data in InfluxDB is organized into shard groups based on time ranges. When a retention policy is configured with a 7-day duration, InfluxDB creates shard groups that each cover a time window (by default 1 hour for RPs shorter than 2 days, 1 day for RPs between 2 days and 6 months, and 7 days for longer RPs). Each shard group contains one or more shards, and each shard is essentially an independent TSM engine instance with its own WAL, cache, and TSM files.
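As a concrete example, here is roughly how a 7-day retention policy with an explicit 1-day shard group duration could be created through the Python influxdb (1.x) client. The database and policy names are placeholders, and schema-changing InfluxQL statements must be sent with POST.
```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

# 7 days of raw data, one shard group per day; DEFAULT makes this the policy
# used when writes do not name a retention policy explicitly.
client.query(
    'CREATE RETENTION POLICY "raw_7d" ON "metrics" '
    'DURATION 7d REPLICATION 1 SHARD DURATION 1d DEFAULT',
    method="POST",
)
```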
This time-based sharding is why InfluxDB handles retention so efficiently. When a shard group expires, InfluxDB drops the entire directory. No scanning, no marking individual rows for deletion, no garbage collection. Just rm -rf and update the metadata. This is also why deleting individual points within a shard group is expensive. TSM files have to be rewritten without the deleted points, which is a compaction-level operation.
The series index is an inverted index that maps tag key-value pairs to series IDs. A query like WHERE region = 'us-east' becomes a direct index lookup rather than a scan, and InfluxDB only reads TSM blocks for the matching series. This is fast, but the index grows with every unique series and much of it has to stay resident in memory. That is the root cause of cardinality problems. The older in-memory (inmem) index keeps everything in RAM and hits limits quickly; the TSI (Time Series Index) engine moves most of the structure into memory-mapped files on disk, but at 10 million unique series still expect 2-4GB of memory for the index.
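A stripped-down model of that index makes the cardinality cost visible. The sketch below is illustrative Python, not InfluxDB's index format: every new tag combination adds posting-list entries that stick around, which is exactly how a high-cardinality tag inflates memory.
```python
from collections import defaultdict

class SeriesIndex:
    """Toy inverted index: (tag key, tag value) -> set of series IDs."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.series_ids = {}   # series key -> integer ID
        self.next_id = 0

    def add_series(self, measurement, tags):
        key = measurement + "," + ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
        if key not in self.series_ids:
            self.series_ids[key] = self.next_id
            for k, v in tags.items():
                self.postings[(k, v)].add(self.next_id)
            self.next_id += 1

    def lookup(self, tag_key, tag_value):
        # WHERE region = 'us-east' becomes a single posting-list fetch,
        # not a scan over TSM data.
        return self.postings[(tag_key, tag_value)]

idx = SeriesIndex()
idx.add_series("cpu", {"host": "server01", "region": "us-east"})
idx.add_series("cpu", {"host": "server02", "region": "us-west"})
print(idx.lookup("region", "us-east"))   # {0}
```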
Production Setup
For a production single-node deployment, start with an SSD (NVMe preferred) because TSM compaction is I/O-intensive. A 4-core, 16GB RAM machine handles around 250K writes/sec and 5 million active series comfortably. For more headroom, 8 cores and 32GB RAM handles 500K+ writes/sec and 10-15 million active series.
The TICK stack is the full production deployment. Telegraf collects metrics from the infrastructure (CPU, memory, disk, network, Docker, Kubernetes, databases, applications) through its plugin system. InfluxDB stores the data. Chronograf provides dashboards and an admin UI, though most teams use Grafana instead. Kapacitor handles alerting and stream processing on the data. In practice, many teams run Telegraf + InfluxDB + Grafana and skip Chronograf and Kapacitor entirely.
Key configuration to tune: cache-max-memory-size caps the in-memory cache (default 1GB); once the cache exceeds it, InfluxDB rejects writes until a snapshot completes, and a larger cache also means more WAL to replay (with a risk of OOM) on restart. cache-snapshot-memory-size (default 25MB) controls how often the cache is flushed into TSM files; set it too low and the system churns out many small TSM files for compaction to clean up. max-series-per-database prevents cardinality explosions (default 1 million). max-values-per-tag caps individual tag cardinality (default 100K). These defaults are conservative on purpose. Raise them only after understanding the memory implications.
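For reference, these knobs live in the [data] section of influxdb.conf on InfluxDB 1.x. The excerpt below restates the defaults discussed above; verify the exact key names and value formats against your installed version.
```toml
[data]
  # Hard ceiling on the in-memory cache; writes are rejected once it is exceeded.
  cache-max-memory-size = "1g"
  # Snapshot threshold: how large the cache grows before being flushed to a TSM file.
  cache-snapshot-memory-size = "25m"
  # Cardinality guard rails.
  max-series-per-database = 1000000
  max-values-per-tag = 100000
```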
Capacity Planning
Write throughput scales roughly linearly with CPU cores up to about 8 cores, then gains taper off due to compaction contention. A well-tuned single node handles 500K-1M points/sec sustained, with burst capacity higher. Each point is a single field value at a single timestamp for a single series.
Storage math: with Gorilla compression on floats, expect about 2 bytes per point. At 100K points/sec sustained, that is about 16GB/day of compressed data. With a 30-day retention policy, plan on roughly 500GB of SSD storage, plus headroom for compaction temporary files (20-30% extra). Do not let disks go above 80% utilization or compactions will stall.
Memory: the TSI (Time Series Index) uses about 300 bytes per series on average. At 5 million active series, that is 1.5GB just for the index. Add the in-memory cache (up to 1GB default), query processing buffers, and OS page cache for TSM files. A node with 5 million series and 100K writes/sec needs 8-16GB RAM minimum. At 20 million series, that means 32-64GB.
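The arithmetic is simple enough to script. A rough back-of-the-envelope calculator in Python, assuming the ~2 bytes/point and ~300 bytes/series figures above; treat it as a planning aid, not a guarantee.
```python
def capacity_estimate(points_per_sec, retention_days, active_series,
                      bytes_per_point=2.0, index_bytes_per_series=300,
                      compaction_headroom=0.3):
    """Back-of-the-envelope disk and index-memory sizing for a single node."""
    points_per_day = points_per_sec * 86_400
    disk_per_day_gb = points_per_day * bytes_per_point / 1024**3
    disk_total_gb = disk_per_day_gb * retention_days * (1 + compaction_headroom)
    index_gb = active_series * index_bytes_per_series / 1024**3
    return disk_per_day_gb, disk_total_gb, index_gb

per_day, total, index = capacity_estimate(points_per_sec=100_000,
                                           retention_days=30,
                                           active_series=5_000_000)
print(f"~{per_day:.0f} GB/day, ~{total:.0f} GB for 30 days incl. headroom, "
      f"~{index:.1f} GB series index")
# ~16 GB/day, ~628 GB total with 30% headroom, ~1.4 GB index
```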
Failure Scenarios
Scenario 1: Cardinality explosion kills the node. A developer adds a request_id tag to their metrics. Each request generates a unique series. Within hours, the series count goes from 500K to 50 million. The series index consumes all available memory. Queries time out, then the process OOMs. Detection: monitor the numSeries metric and alert when growth rate exceeds the baseline by 10x. Recovery: identify the offending tag with SHOW TAG VALUES CARDINALITY, drop the problematic measurement if needed, and restart. Prevention: set max-series-per-database and max-values-per-tag to sane limits before this happens.
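Detection and diagnosis can be scripted against the InfluxQL cardinality commands. A minimal sketch with the Python influxdb (1.x) client, assuming a database named metrics and a suspect tag key of request_id; the alert threshold belongs in your own monitoring baseline.
```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

# Overall series count -- alert when it grows far faster than the baseline.
total = list(client.query('SHOW SERIES CARDINALITY ON "metrics"').get_points())
print("series cardinality:", total)

# Per-tag cardinality for the suspect tag key.
per_tag = client.query(
    'SHOW TAG VALUES CARDINALITY ON "metrics" WITH KEY = "request_id"'
)
for row in per_tag.get_points():
    print(row)

# Last resort once the offending measurement is confirmed (POST required):
# client.query('DROP MEASUREMENT "http_requests"', method="POST")
```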
Scenario 2: WAL replay after crash takes too long. The node crashes with a large in-memory cache (maybe cache-max-memory-size was set to 4GB for throughput). On restart, InfluxDB replays the entire WAL to rebuild the cache. With 4GB of WAL data, this can take 10-30 minutes depending on disk speed. During replay, the node is unresponsive. Detection: monitor restart times. Prevention: keep cache-max-memory-size reasonable (1-2GB), which means more frequent flushes but faster recovery. For both high throughput and fast recovery, put a message queue like Kafka in front of InfluxDB so replay can happen from there instead of relying solely on the WAL.
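One way to get that queue-in-front pattern is a small consumer that drains Kafka and writes to InfluxDB in batches, so the consumer group offset rather than the WAL becomes the replay point. This is a hedged sketch using the kafka-python and influxdb client libraries; the topic name, group id, and message format are assumptions, and a real consumer needs error handling and offset management.
```python
import json
from kafka import KafkaConsumer
from influxdb import InfluxDBClient

consumer = KafkaConsumer(
    "metrics",                          # hypothetical topic carrying JSON-encoded points
    bootstrap_servers=["kafka:9092"],
    group_id="influx-writer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
client = InfluxDBClient(host="localhost", port=8086, database="metrics")

batch = []
for msg in consumer:
    # Expects {"measurement": ..., "tags": {...}, "fields": {...}, "time": ...}
    batch.append(msg.value)
    if len(batch) >= 5000:
        client.write_points(batch)      # one HTTP request per 5000 points
        batch = []
```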
Pros
- • Purpose-built for time series from day one, not bolted on as an afterthought
- • InfluxQL is SQL-like enough that most engineers pick it up in an afternoon
- • Telegraf agent ecosystem covers 300+ integrations out of the box
- • Built-in retention policies and continuous queries handle data lifecycle automatically
- • Impressive write throughput for a single node, easily 500K+ points/sec
Cons
- • The open-source version (OSS) is single-node only. Clustering requires InfluxDB Cloud or Enterprise
- • Flux query language is powerful but has a steep learning curve and not everyone loves it
- • High cardinality series can cause memory issues and slow queries significantly
- • Schema-on-write means you cannot change tag vs field decisions after the fact without rewriting data
- • Delete operations are expensive and discouraged in practice
When to use
- • You need a dedicated TSDB for metrics, IoT, or sensor data
- • Your team wants something up and running fast with minimal configuration
- • Write-heavy workloads where ingestion speed matters more than complex queries
- • You already use the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor)
When NOT to use
- • You need distributed clustering on open-source (look at VictoriaMetrics or TimescaleDB)
- • Complex relational queries with joins across multiple measurements
- • Your workload is more analytical/OLAP than time series (use ClickHouse)
- • You need strong consistency guarantees for financial transactions
Key Points
- •The TSM (Time-Structured Merge Tree) engine is what makes InfluxDB fast. It is a write-optimized columnar format inspired by LSM trees but designed specifically for time-stamped data. Data lands in a WAL and in-memory cache first, then gets compacted into immutable TSM files sorted by time and series key.
- •Sharding is time-based. Each shard group covers a configurable time range (1 hour, 1 day, 1 week), and when the data in a shard group ages past its retention policy, InfluxDB drops the entire shard group. This is why deletes are fast for old data but expensive for individual points. The operation is a file drop, not a scan that removes rows.
- •Tags are indexed, fields are not. This is the single most important schema design decision. Putting a high-cardinality value like user_id as a tag creates millions of unique series and the inverted index blows up. Fields are for values that get aggregated. Tags are for values that get filtered on, and they need to be low cardinality.
- •Series cardinality is the number of unique combinations of measurement name + tag set. A measurement with 3 tags, each having 100 unique values, creates up to 1 million series. InfluxDB keeps the series index in memory. At 10+ million series, expect OOM issues on a 16GB machine.
- •Continuous queries (CQ) run on a schedule and downsample raw data into summary measurements. For example, per-second CPU readings can be rolled up into 5-minute averages. This is how query performance stays manageable as data ages. Without CQs, queries over months of raw data will be painfully slow.
- •Retention policies (RP) automatically drop data older than a configured duration. Combine RPs with CQs for a tiered storage pattern: keep raw data for 7 days, 5-minute rollups for 90 days, hourly rollups for a year. This is the standard production pattern and it works well (see the sketch after this list).
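A minimal version of that tiered pattern, expressed as raw InfluxQL sent through the Python influxdb (1.x) client. The database, measurement, and policy names are placeholders, and a raw_7d default policy is assumed to exist already.
```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

# Tier 2: a second retention policy that keeps 5-minute rollups for 90 days.
client.query('CREATE RETENTION POLICY "rollup_90d" ON "metrics" '
             'DURATION 90d REPLICATION 1', method="POST")

# Continuous query: downsample raw cpu points into 5-minute means,
# preserving every tag via GROUP BY *.
client.query(
    'CREATE CONTINUOUS QUERY "cq_cpu_5m" ON "metrics" BEGIN '
    'SELECT mean("usage_idle") AS "usage_idle" '
    'INTO "metrics"."rollup_90d"."cpu_5m" '
    'FROM "metrics"."raw_7d"."cpu" '
    'GROUP BY time(5m), * '
    'END',
    method="POST",
)
```
Long-range dashboards then query cpu_5m under rollup_90d instead of raw points, which is what keeps month-long queries fast.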
Common Mistakes
- ✗Putting high-cardinality values in tags. User IDs, request IDs, email addresses as tags will create millions of series and eventually OOM the process. These belong in fields if they must be stored, but honestly a TSDB is the wrong place for that data.
- ✗Not setting up retention policies early. Without RPs, data grows unbounded. Disk fills up, compactions slow down, and eventually the node becomes unresponsive. Set retention policies before ingesting production data, not after the disk is 90% full.
- ✗Writing each point individually instead of batching. Every HTTP write has overhead. Sending one point per request is 100x slower than sending 5000 points in a single batch. Use the Telegraf agent or batch writes in the client library (see the sketch after this list).
- ✗Ignoring shard group duration settings. The default shard group duration is based on the retention policy, but it might not match the workload. Too-short durations create many small shards with overhead. Too-long durations mean slow deletes because InfluxDB cannot drop a shard until every point in it has expired.
- ✗Treating InfluxDB like a relational database and trying to do JOINs. There are no JOINs. To correlate data from different measurements, handle it in the application layer or use a tool like Grafana that can overlay multiple queries.
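A hedged sketch of the batching pattern with the Python influxdb (1.x) client; the point layout, timestamps, and batch size are illustrative.
```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="metrics")

points = [
    {
        "measurement": "cpu",
        "tags": {"host": "server01", "region": "us-east"},
        "fields": {"usage_idle": 98.2},
        "time": 1_700_000_000 + i,
    }
    for i in range(10_000)
]

# Anti-pattern: one HTTP request per point.
# for p in points:
#     client.write_points([p])

# Better: let the client chunk the list into 5000-point batches, one request each.
client.write_points(points, time_precision="s", batch_size=5000)
```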