ClickHouse
The columnar OLAP database that actually delivers on sub-second analytics
Architecture
How It Works Internally
ClickHouse stores data in columnar format. Each column lives in its own file within a data part. When a query reads only specific columns, it skips everything else. For a table with 200 columns where a query touches 5, only 2.5% of the data hits disk. This is the single biggest reason ClickHouse beats row-oriented databases by 10-100x on analytical queries. It is not magic. It is just less I/O.
The MergeTree engine is the heart of ClickHouse. When data is inserted, ClickHouse creates a new immutable part: a directory containing one file per column plus a primary index file. The primary index stores one entry per index_granularity rows (default 8192), recording the ORDER BY column values at the start of each granule. This sparse index usually fits in memory and lets a binary search prune row ranges quickly. Background merge threads combine small parts into larger ones. Depending on the engine variant, merges can also apply deduplication (ReplacingMergeTree), aggregation (AggregatingMergeTree), or TTL expiration for data lifecycle management.
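As a minimal sketch, here is what that looks like in DDL. The events table and its columns are illustrative, not from the text:

```sql
CREATE TABLE events
(
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String),
    latency_ms UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id)       -- defines the sparse primary index
SETTINGS index_granularity = 8192;   -- one index entry per 8192 rows (the default)

-- Touches only the event_date and event_type column files; the rest never hit disk.
SELECT event_type, count()
FROM events
WHERE event_date >= '2024-01-01'     -- pruned via the sparse primary index
GROUP BY event_type;
```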
Vectorized query execution processes data in chunks of 8192 rows. Instead of the classic Volcano model that evaluates one row at a time, ClickHouse applies each operation to an entire column chunk. Tight loops exploit CPU cache locality and SIMD instructions. A single core can scan and aggregate 1-3 billion rows per second for simple operations. That is not a typo. Yandex reported scanning 1 billion rows in 0.1 seconds on a single node for Metrica analytics.
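You do not have to take the execution model on faith: EXPLAIN PIPELINE prints the chain of processors a query runs through (output format varies by version; events is the illustrative table from above):

```sql
-- Shows the vectorized processor chain (expression, aggregation, merge stages)
-- rather than query results.
EXPLAIN PIPELINE
SELECT event_type, count()
FROM events
GROUP BY event_type;
```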
Compression is not an afterthought here. Each column file gets compressed independently using codec chains. LZ4 is the default because it is fast, but CODEC(Delta(4), ZSTD(1)) works better for timestamps (delta-encode first, then compress) and CODEC(DoubleDelta, ZSTD) for monotonically increasing values. CloudFlare gets 20:1 compression ratios on their DNS analytics data, storing 10 trillion rows in compressed form.
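Codecs are declared per column in the DDL. The choices below are starting points to benchmark on real data, not universal answers, and the Gorilla codec for floats is an addition beyond the examples above:

```sql
CREATE TABLE metrics
(
    ts      DateTime CODEC(Delta(4), ZSTD(1)),    -- delta-encode timestamps, then compress
    counter UInt64   CODEC(DoubleDelta, ZSTD(1)), -- monotonically increasing values
    reading Float64  CODEC(Gorilla, ZSTD(1)),     -- slowly-changing float samples
    host    LowCardinality(String)                -- dictionary-encoded strings
)
ENGINE = MergeTree
ORDER BY (host, ts);
```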
Production Architecture
For production, run a ReplicatedMergeTree setup with ClickHouse Keeper, which is the ZooKeeper-compatible consensus system built directly into ClickHouse. A typical topology looks like this: 3 ClickHouse Keeper nodes for coordination, and N shards with 2-3 replicas each. Put Distributed tables in front as a query facade that routes queries to the right shards and merges results. Define sharding with a sharding_key expression, something like sipHash64(user_id) % shard_count.
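A sketch of that topology in DDL, assuming a cluster named analytics is declared in remote_servers and {shard}/{replica} macros are set in each server's config; every name here is illustrative:

```sql
CREATE TABLE events_local ON CLUSTER analytics
(
    event_date Date,
    user_id    UInt64,
    event_type LowCardinality(String),
    latency_ms UInt32
)
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/events_local', '{replica}')
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id);

-- Query facade: routes by sipHash64(user_id) and merges per-shard results.
CREATE TABLE events_all ON CLUSTER analytics AS events_local
ENGINE = Distributed(analytics, currentDatabase(), events_local, sipHash64(user_id));
```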
Spread replicas across availability zones. Tune max_insert_threads for parallel part writes and background_pool_size for merge concurrency. Set up tiered storage with storage_policy to automatically move old parts from NVMe to S3-compatible object storage. ClickHouse's S3 integration provides effectively unlimited cold storage at a fraction of local disk cost, and it actually works well in practice.
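Tiered storage might look like the following, assuming a storage policy named hot_to_cold (NVMe volume hot, S3-backed volume cold) is already declared in the server's storage configuration; note that switching a table's policy requires the new policy to include the table's current disks:

```sql
ALTER TABLE events_local ON CLUSTER analytics
    MODIFY SETTING storage_policy = 'hot_to_cold';

-- Move parts older than 30 days to the S3-backed volume in the background.
ALTER TABLE events_local ON CLUSTER analytics
    MODIFY TTL event_date + INTERVAL 30 DAY TO VOLUME 'cold';
```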
What to monitor: query_log for slow queries, MergedRows/sec vs InsertedRows/sec (merges must keep up with ingestion), ReplicasMaxAbsoluteDelay (replication lag in seconds), MaxPartCountForPartition (anything above 300 triggers warnings), and MemoryTracking for total server memory usage. Get alerts on all of these before the on-call rotation discovers them the hard way.
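Most of these are one query away in the system tables; a few sketches, with thresholds and time windows as assumptions:

```sql
-- Slowest queries in the last hour (query_log is enabled by default).
SELECT query_duration_ms, query
FROM system.query_log
WHERE type = 'QueryFinish' AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms DESC
LIMIT 10;

-- Part count and replication lag, as exported for alerting.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric IN ('MaxPartCountForPartition', 'ReplicasMaxAbsoluteDelay');

-- Total tracked server memory.
SELECT metric, value
FROM system.metrics
WHERE metric = 'MemoryTracking';
```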
Capacity Planning
ClickHouse can ingest 500K-2M rows/sec per node, depending on row width and hardware. Batching matters more than anything else here: aim for 10,000-100,000 rows per INSERT. If the application cannot batch, the Buffer table engine can accumulate small writes and flush in batches, as sketched below. For Kafka ingestion, the built-in Kafka table engine handles batching automatically.
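A Buffer table sketch, reusing the illustrative events_local table; the thresholds are assumptions to tune, and note that a Buffer's contents are lost if the server crashes before a flush:

```sql
CREATE TABLE events_buffer AS events_local
ENGINE = Buffer(currentDatabase(), events_local,
                16,                     -- num_layers: parallel buffer shards
                10, 60,                 -- min_time, max_time in seconds
                10000, 100000,          -- min_rows, max_rows
                10000000, 100000000);   -- min_bytes, max_bytes

-- The application writes small inserts here; ClickHouse flushes to events_local
-- once any max threshold (or all min thresholds) is reached.
INSERT INTO events_buffer VALUES ('2024-01-01', 42, 'click', 37);
```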
Storage math: with 10:1 compression, 1TB of raw CSV compresses to roughly 100GB on disk. A 3-node cluster with 2TB of NVMe per node therefore stores about 60TB of raw data, before accounting for replication (a replication factor of 2 halves that). For bigger datasets, use S3-backed storage: hot data can be queried from local NVMe and cold data directly from S3 without changing queries.
Memory: ClickHouse uses memory for merges, query processing, and caches. The max_memory_usage setting (default 10GB per query) prevents runaway queries from eating the whole box. For a 64GB node, allocate 80% to ClickHouse and reserve 20% for OS page cache. Size the uncompressed cache and mark cache at 5-10GB each. As a rough benchmark, a 3-node cluster with 64GB RAM and NVMe SSDs handles 100 concurrent dashboard queries with p99 under 500ms over 50 billion rows. Results will vary based on query complexity, but that serves as a starting point.
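The per-query knobs are plain settings; the values below mirror the sizing above and are starting points, not gospel:

```sql
SET max_memory_usage = 10000000000;                   -- 10 GB hard cap per query
SET max_bytes_before_external_group_by = 5000000000;  -- spill big GROUP BYs to disk at ~5 GB
```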
Failure Scenarios
Scenario 1: Too-many-parts exception halts inserts. An application inserts small batches every second, maybe 1000 rows each. Each insert creates a new part, and background merges cannot keep up. The backlog of unmerged parts grows, and when the active part count in a partition exceeds parts_to_throw_insert (default 300), ClickHouse rejects new inserts with a "Too many parts" error. Detection: monitor MaxPartCountForPartition and alert at 200, not 300. Recovery: stop inserts temporarily, let merges complete, then fix the application to batch inserts at 10K-100K rows per batch. If the application genuinely cannot batch, put a Buffer table engine in front of it. Another option is a Kafka-based buffering layer that batches events into 50K-row inserts every 10 seconds.
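A hedged diagnosis-and-drain sketch; the table name and partition ID are illustrative:

```sql
-- Which partitions are accumulating parts?
SELECT database, table, partition, count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 5;

-- With inserts paused, a manual merge can drain a hot partition (expensive; use sparingly).
OPTIMIZE TABLE events_local PARTITION ID '202401';
```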
Scenario 2: Replication lag diverges after a schema migration. A schema migration (ALTER TABLE ADD COLUMN) is applied on one replica but fails on another because its disk is full. The replicas diverge: the failing replica stops applying entries from the replication log in ClickHouse Keeper, inserts keep succeeding on the healthy replica, and the stale one falls further behind. If ignored long enough, recovery requires a full re-sync from the healthy replica. Detection: monitor ReplicasMaxAbsoluteDelay (seconds of lag) and ReplicasMaxQueueSize (pending operations); alert when delay exceeds 60 seconds. Recovery: fix the disk space issue, after which ClickHouse automatically replays the replication queue. If the queue is gone, drop the local table and re-create it; ClickHouse will clone the data from a healthy replica. Lesson learned: always validate disk space on all replicas before running schema changes. An orchestration tool that checks preconditions will prevent a 3am page.
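Detection and recovery sketched against the system tables; the thresholds and table name are illustrative:

```sql
-- Per-replica lag and queue depth.
SELECT database, table, is_readonly, absolute_delay, queue_size
FROM system.replicas
WHERE absolute_delay > 60 OR queue_size > 100;

-- After freeing disk space, block until the replication queue has drained.
SYSTEM SYNC REPLICA events_local;
```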
Pros
- Analytical queries are absurdly fast thanks to columnar storage
- Excellent compression ratios, often 10-20x
- Handles billions of rows with sub-second response times
- SQL-compatible query interface, so the learning curve is gentle
- Materialized views and real-time aggregation work out of the box
Cons
- Not built for OLTP or point lookups. Do not try.
- No full ACID transactions
- Updates and deletes are expensive (async mutations under the hood)
- You have to think carefully about schema design or performance tanks
- Smaller ecosystem than PostgreSQL or Elasticsearch
When to use
- You need sub-second queries over billions of rows
- Analytical/OLAP workloads heavy on aggregations
- Real-time dashboards and reporting systems
- You want a cost-effective alternative to managed analytics services
When NOT to use
- OLTP workloads with frequent updates/deletes
- You need full transaction support
- Small datasets that fit comfortably in PostgreSQL
- Full-text search (use Elasticsearch instead)
Key Points
- Columnar storage reads only the columns a query actually touches. Querying 5 of 200 columns means reading 2.5% of the data. That is where the I/O savings come from.
- MergeTree is the foundation. Data lands as immutable parts, background merges combine them, and the ORDER BY clause defines the primary key sort for efficient range scans.
- Vectorized execution processes data in batches of 8192 rows using SIMD instructions. Expect 1-3 billion rows/sec per core for simple aggregations.
- Materialized views fire on INSERT and pre-aggregate data into summary tables. They are essential for sub-second dashboard queries over raw event streams (see the sketch after this list).
- Compression ratios of 10-20x are normal. ClickHouse defaults to LZ4, but codec chains like Delta + ZSTD can hit 90-95% compression on time-series data.
- Approximate query functions (uniq, quantile, topK) use HyperLogLog, t-digest, and Space-Saving algorithms, aggregating over billions of rows in a fixed memory footprint (also shown in the sketch below).
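A sketch tying the materialized-view and approximate-function points together. Every name is illustrative, and it assumes the events_local table from the architecture section:

```sql
-- Rollup target: stores partial aggregate states, merged at read time.
CREATE TABLE daily_stats
(
    day        Date,
    event_type LowCardinality(String),
    events     AggregateFunction(count),
    users      AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree
ORDER BY (day, event_type);

-- Fires on every INSERT into the raw table and writes aggregate states.
CREATE MATERIALIZED VIEW daily_stats_mv TO daily_stats AS
SELECT event_date AS day, event_type,
       countState()       AS events,
       uniqState(user_id) AS users
FROM events_local
GROUP BY day, event_type;

-- Dashboards read the small rollup instead of billions of raw rows.
SELECT day, event_type, countMerge(events) AS total_events, uniqMerge(users) AS unique_users
FROM daily_stats
GROUP BY day, event_type;

-- Approximate functions directly over raw data: fixed memory, bounded error.
SELECT uniq(user_id)             AS unique_users,
       quantile(0.95)(latency_ms) AS p95_latency,
       topK(10)(event_type)       AS top_events
FROM events_local;
```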
Common Mistakes
- ✗ Using ClickHouse for OLTP workloads. Point lookups, single-row updates, and transactions are fundamentally incompatible with a columnar, batch-oriented architecture.
- ✗ Too many small inserts. Each INSERT creates a new part on disk. Inserting row-by-row creates millions of tiny parts that overwhelm the merge process. Batch at 10K+ rows per insert.
- ✗ Picking the wrong ORDER BY for MergeTree. This clause defines the primary index sort. Get it wrong and efficient range scans become full column scans.
- ✗ Ignoring merge pressure. Background merges have a concurrency limit. If inserts outpace merges, part count grows without bound and queries slow dramatically (too many parts exception).
- ✗ Skipping materialized views for common aggregations. Querying raw event tables with billions of rows on every dashboard refresh wastes compute. Pre-aggregate at insert time.