ClickHouse
The columnar OLAP database that actually delivers on sub-second analytics
Architecture
How It Works Internally
ClickHouse stores data in columnar format. Each column lives in its own file within a data part. When a query reads only specific columns, it skips everything else. For a table with 200 columns where a query touches 5, only 2.5% of the data hits disk. This is the single biggest reason ClickHouse beats row-oriented databases by 10-100x on analytical queries. It is not magic. It is just less I/O.
The MergeTree engine is the heart of ClickHouse. When data is inserted, ClickHouse creates a new immutable part: a directory containing one file per column plus a primary index file. The primary index stores one entry per index_granularity rows (default 8192), recording the ORDER BY column values at the start of each granule. This sparse index usually fits in memory and lets a binary search prune row ranges quickly. Background merge threads combine small parts into larger ones. Depending on the engine variant, merges can also apply deduplication (ReplacingMergeTree), aggregation (AggregatingMergeTree), or TTL expiration for data lifecycle management.
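As a minimal sketch, here is what that looks like in DDL. The events table and its columns are illustrative, not from the text:

```sql
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String),
    latency_ms UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)       -- defines the sparse primary index
SETTINGS index_granularity = 8192;   -- one index entry per 8192 rows (the default)

-- Touches only the event_date and event_type column files; the rest never hit disk.
SELECT event_type, count()
FROM events
WHERE event_date >= '2024-01-01'     -- pruned via the sparse primary index
GROUP BY event_type;
```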
Vectorized query execution processes data in chunks of 8192 rows. Instead of the classic Volcano model that evaluates one row at a time, ClickHouse applies each operation to an entire column chunk. Tight loops exploit CPU cache locality and SIMD instructions. A single core can scan and aggregate 1-3 billion rows per second for simple operations. That is not a typo. Yandex reported scanning 1 billion rows in 0.1 seconds on a single node for Metrica analytics.
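You do not have to take the execution model on faith: EXPLAIN PIPELINE prints the chain of processors a query runs through (output format varies by version; events is the illustrative table from above):

```sql
-- Shows the vectorized processor chain (expression, aggregation, merge stages)
-- rather than query results.
EXPLAIN PIPELINE
SELECT event_type, count()
FROM events
GROUP BY event_type;
```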
Compression is not an afterthought here. Each column file gets compressed independently using codec chains. LZ4 is the default because it is fast, but CODEC(Delta(4), ZSTD(1)) works better for timestamps (delta-encode first, then compress) and CODEC(DoubleDelta, ZSTD) for monotonically increasing values. CloudFlare gets 20:1 compression ratios on their DNS analytics data, storing 10 trillion rows in compressed form.
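Codecs are declared per column in the DDL. The choices below are starting points to benchmark on real data, not universal answers, and the Gorilla codec for floats is an addition beyond the examples above:

```sql
CREATE TABLE metrics
(
    ts      DateTime CODEC(Delta(4), ZSTD(1)),    -- delta-encode timestamps, then compress
    counter UInt64   CODEC(DoubleDelta, ZSTD(1)), -- monotonically increasing values
    reading Float64  CODEC(Gorilla, ZSTD(1)),     -- slowly-changing float samples
    host    LowCardinality(String)                -- dictionary-encoded strings
)
ENGINE = MergeTree
ORDER BY (host, ts);
```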
Production Architecture
For production, run a ReplicatedMergeTree setup with ClickHouse Keeper, which is the ZooKeeper-compatible consensus system built directly into ClickHouse. A typical topology looks like this: 3 ClickHouse Keeper nodes for coordination, and N shards with 2-3 replicas each. Put Distributed tables in front as a query facade that routes queries to the right shards and merges results. Define sharding with a sharding_key expression, something like sipHash64(user_id) % shard_count.
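A sketch of that topology in DDL, assuming a cluster named analytics is declared in remote_servers and {shard}/{replica} macros are set in each server's config; every name here is illustrative:

```sql
CREATE TABLE events_local ON CLUSTER analytics
(
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String),
    latency_ms UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Query facade: routes by sipHash64(user_id) and merges per-shard results.
CREATE TABLE events_all ON CLUSTER analytics AS events_local
ENGINE = Distributed(analytics, currentDatabase(), events_local, sipHash64(user_id));
```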
Spread replicas across availability zones. Tune max_insert_threads for parallel part writes and background_pool_size for merge concurrency. Set up tiered storage with storage_policy to automatically move old parts from NVMe to S3-compatible object storage. ClickHouse's S3 integration provides effectively unlimited cold storage at a fraction of local disk cost, and it actually works well in practice.
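Tiered storage might look like the following, assuming a storage policy named hot_to_cold (NVMe volume hot, S3-backed volume cold) is already declared in the server's storage configuration; note that switching a table's policy requires the new policy to include the table's current disks:

```sql
ALTER TABLE events_local ON CLUSTER analytics
    MODIFY SETTING storage_policy = 'hot_to_cold';

-- Move parts older than 30 days to the S3-backed volume in the background.
ALTER TABLE events_local ON CLUSTER analytics
    MODIFY TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold';
```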
What to monitor: query_log for slow queries, MergedRows/sec vs InsertedRows/sec (merges must keep up with ingestion), ReplicasMaxAbsoluteDelay (replication lag in seconds), MaxPartCountForPartition (anything above 300 triggers warnings), and MemoryTracking for total server memory usage. Get alerts on all of these before the on-call rotation discovers them the hard way.
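Most of these are one query away in the system tables; a few sketches, with thresholds and time windows as assumptions:

```sql
-- Slowest queries in the last hour (query_log is enabled by default).
SELECT query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Part count and replication lag, as exported for alerting.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN ('MaxPartCountForPartition', 'ReplicasMaxAbsoluteDelay');

-- Total tracked server memory.
SELECT metric, value
FROM system.metrics
WHERE metric = 'MemoryTracking';
```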
Capacity Planning
ClickHouse can ingest 500K-2M rows/sec per node, depending on row width and hardware. Batching matters more than anything else here: aim for 10,000-100,000 rows per INSERT. If the application cannot batch, the Buffer table engine can accumulate small writes and flush in batches, as sketched below. For Kafka ingestion, the built-in Kafka table engine handles batching automatically.
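A Buffer table sketch, reusing the illustrative events_local table; the thresholds are assumptions to tune, and note that a Buffer's contents are lost if the server crashes before a flush:

```sql
CREATE TABLE events_buffer AS events_local
ENGINE = Buffer(currentDatabase(), events_local,
                16,                     -- num_layers: parallel buffer shards
                10, 60,                 -- min_time, max_time in seconds
                10000, 100000,          -- min_rows, max_rows
                10000000, 100000000);   -- min_bytes, max_bytes

-- The application writes small inserts here; ClickHouse flushes to events_local
-- once any max threshold (or all min thresholds) is reached.
INSERT INTO events_buffer VALUES ('2024-01-01', 42, 'click', 37);
```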
Storage math: with 10:1 compression, 1TB of raw CSV compresses to roughly 100GB on disk. A 3-node cluster with 2TB of NVMe per node therefore stores about 60TB of raw data, before accounting for replication (a replication factor of 2 halves that). For bigger datasets, use S3-backed storage: hot data can be queried from local NVMe and cold data directly from S3 without changing queries.
Memory: ClickHouse uses memory for merges, query processing, and caches. The max_memory_usage setting (default 10GB per query) prevents runaway queries from eating the whole box. For a 64GB node, allocate 80% to ClickHouse and reserve 20% for OS page cache. Size the uncompressed cache and mark cache at 5-10GB each. As a rough benchmark, a 3-node cluster with 64GB RAM and NVMe SSDs handles 100 concurrent dashboard queries with p99 under 500ms over 50 billion rows. Results will vary based on query complexity, but that serves as a starting point.
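The per-query knobs are plain settings; the values below mirror the sizing above and are starting points, not gospel:

```sql
SET max_memory_usage = 10000000000;                   -- 10 GB hard cap per query
SET max_bytes_before_external_group_by = 5000000000;  -- spill big GROUP BYs to disk at ~5 GB
```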
Failure Scenarios
Scenario 1: Too-many-parts exception halts inserts. An application inserts small batches every second, maybe 1000 rows each. Each insert creates a new part, and background merges cannot keep up. The backlog of unmerged parts grows, and when the active part count in a partition exceeds parts_to_throw_insert (default 300), ClickHouse rejects new inserts with a "Too many parts" error. Detection: monitor MaxPartCountForPartition and alert at 200, not 300. Recovery: stop inserts temporarily, let merges complete, then fix the application to batch inserts at 10K-100K rows per batch. If the application genuinely cannot batch, put a Buffer table engine in front of it. Another option is a Kafka-based buffering layer that batches events into 50K-row inserts every 10 seconds.
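A hedged diagnosis-and-drain sketch; the table name and partition ID are illustrative:

```sql
-- Which partitions are accumulating parts?
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 5;

-- With inserts paused, a manual merge can drain a hot partition (expensive; use sparingly).
OPTIMIZE TABLE events_local PARTITION ID '202401';
```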
Scenario 2: Replication lag diverges after a schema migration. A schema migration (ALTER TABLE ADD COLUMN) is applied on one replica but fails on another because its disk is full. The replicas diverge: the failing replica stops applying entries from the replication log in ClickHouse Keeper, inserts keep succeeding on the healthy replica, and the stale one falls further behind. If ignored long enough, recovery requires a full re-sync from the healthy replica. Detection: monitor ReplicasMaxAbsoluteDelay (seconds of lag) and ReplicasMaxQueueSize (pending operations); alert when delay exceeds 60 seconds. Recovery: fix the disk space issue, after which ClickHouse automatically replays the replication queue. If the queue is gone, drop the local table and re-create it; ClickHouse will clone the data from a healthy replica. Lesson learned: always validate disk space on all replicas before running schema changes. An orchestration tool that checks preconditions will prevent a 3am page.
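Detection and recovery sketched against the system tables; the thresholds and table name are illustrative:

```sql
-- Per-replica lag and queue depth.
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 60 OR queue_size > 100;

-- After freeing disk space, block until the replication queue has drained.
SYSTEM SYNC REPLICA events_local;
```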
Pros
- Analytical queries are absurdly fast thanks to columnar storage
- Excellent compression ratios, often 10-20x
- Handles billions of rows with sub-second response times
- SQL-compatible query interface, so the learning curve is gentle
- Materialized views and real-time aggregation work out of the box
Cons
- Not built for OLTP or point lookups. Do not try.
- No full ACID transactions
- Updates and deletes are expensive (async mutations under the hood)
- You have to think carefully about schema design or performance tanks
- Smaller ecosystem than PostgreSQL or Elasticsearch
When to use
- You need sub-second queries over billions of rows
- Analytical/OLAP workloads heavy on aggregations
- Real-time dashboards and reporting systems
- You want a cost-effective alternative to managed analytics services
When NOT to use
- OLTP workloads with frequent updates/deletes
- You need full transaction support
- Small datasets that fit comfortably in PostgreSQL
- Full-text search (use Elasticsearch instead)
Key Points
- Columnar storage reads only the columns a query actually touches. Querying 5 of 200 columns means reading 2.5% of the data. That is where the I/O savings come from.
- MergeTree is the foundation. Data lands as immutable parts, background merges combine them, and the ORDER BY clause defines the primary key sort for efficient range scans.
- Vectorized execution processes data in batches of 8192 rows using SIMD instructions. Expect 1-3 billion rows/sec per core for simple aggregations.
- Materialized views fire on INSERT and pre-aggregate data into summary tables. They are essential for sub-second dashboard queries over raw event streams (see the sketch after this list).
- Compression ratios of 10-20x are normal. ClickHouse defaults to LZ4, but codec chains like Delta + ZSTD can hit 90-95% compression on time-series data.
- Approximate query functions (uniq, quantile, topK) use HyperLogLog, t-digest, and Space-Saving algorithms, aggregating over billions of rows in a fixed memory footprint (also shown in the sketch below).
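A sketch tying the materialized-view and approximate-function points together. Every name is illustrative, and it assumes the events_local table from the architecture section:

```sql
-- Rollup target: stores partial aggregate states, merged at read time.
CREATE TABLE daily_stats
(
    day        Date,
    event_type LowCardinality(String),
    events     AggregateFunction(count),
    users      AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (day, event_type);

-- Fires on every INSERT into the raw table and writes aggregate states.
CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT event_date AS day, event_type,
       countState()       AS events,
       uniqState(user_id) AS users
FROM events_local
GROUP BY day, event_type;

-- Dashboards read the small rollup instead of billions of raw rows.
SELECT day, event_type, countMerge(events) AS total_events, uniqMerge(users) AS unique_users
FROM daily_stats
GROUP BY day, event_type;

-- Approximate functions directly over raw data: fixed memory, bounded error.
SELECT uniq(user_id)             AS unique_users,
       quantile(0.95)(latency_ms) AS p95_latency,
       topK(10)(event_type)       AS top_events
FROM events_local;
```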
Common Mistakes
- ✗ Using ClickHouse for OLTP workloads. Point lookups, single-row updates, and transactions are fundamentally incompatible with a columnar, batch-oriented architecture.
- ✗ Too many small inserts. Each INSERT creates a new part on disk. Inserting row-by-row creates millions of tiny parts that overwhelm the merge process. Batch at 10K+ rows per insert.
- ✗ Picking the wrong ORDER BY for MergeTree. This clause defines the primary index sort. Get it wrong and efficient range scans become full column scans.
- ✗ Ignoring merge pressure. Background merges have a concurrency limit. If inserts outpace merges, part count grows without bound and queries slow dramatically (too many parts exception).
- ✗ Skipping materialized views for common aggregations. Querying raw event tables with billions of rows on every dashboard refresh wastes compute. Pre-aggregate at insert time.