Elasticsearch
The search engine most teams reach for first, built on Lucene
How It Works Internally
Running Elasticsearch in production requires understanding Lucene. Not surface-level, but actually understanding it. Every Elasticsearch shard is a standalone Lucene index. A Lucene index is made up of immutable segments, and each segment contains an inverted index, stored fields, doc values (columnar data for sorting and aggregations), and term vectors. The inverted index maps each analyzed term to a posting list: a sorted array of document IDs that contain that term. When running boolean queries, Elasticsearch intersects or unions those posting lists using skip-list algorithms.
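To see exactly which terms a string contributes to the inverted index, the _analyze API exposes the analyzer's output. A minimal sketch in Kibana Dev Tools console syntax:

```
# Inspect the terms the standard analyzer would add to the inverted index
GET /_analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox"
}
# => tokens "the", "quick", "brown", "fox" (lowercased, split on word boundaries);
#    each token becomes a term pointing at a posting list
```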
When a document is indexed, it first lands in an in-memory buffer and the translog (write-ahead log). Every refresh_interval (default 1 second), the buffer is flushed to a new Lucene segment, and only then does the document become searchable. This is why Elasticsearch calls itself "near real-time" and not real-time. The distinction matters. By default the translog is fsync'd on every request (index.translog.durability: request); setting it to async defers the fsync to every sync_interval (5 seconds by default) in exchange for a small window of potential data loss. In the background, segment merges run continuously: small segments get combined into bigger ones, and deleted documents are physically removed.
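A sketch of the settings involved, on a hypothetical logs-app index (shown here with async durability, which trades a few seconds of potential data loss for throughput):

```
# refresh_interval controls when buffered docs become searchable;
# the translog settings control fsync behavior
PUT /logs-app
{
  "settings": {
    "index.refresh_interval": "1s",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
  }
}
```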
Query execution follows a query-then-fetch pattern. The coordinating node broadcasts the query to every relevant shard. Each shard runs the query against its local Lucene index and sends back the top-N document IDs with their scores. The coordinating node merges all of those results (it's really a distributed top-K problem), then fetches the full documents from the shards that actually hold them. So for a query across 50 shards requesting 10 results, each shard returns 10 candidates (500 total), the coordinator picks the global top 10, then grabs those 10 documents. The math adds up fast with hundreds of shards.
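A plain top-10 search illustrates the pattern; logs-app and the message field are placeholders. With 50 shards, this single request fans out into 50 shard-level queries:

```
# Each shard returns its local top 10 by _score; the coordinator merges
# the 500 candidates down to the global top 10, then fetches those documents
GET /logs-app/_search
{
  "size": 10,
  "query": {
    "match": { "message": "timeout" }
  }
}
```

The _shards section of the response confirms how many shards participated.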
Aggregations rely on doc values, which are column-oriented data structures stored alongside the inverted index. Terms aggregations build an in-memory hash map of term-to-count. Composite aggregations paginate through results so there is no need to load everything into heap at once. A critical detail: the compressed-oops boundary at roughly 32GB of JVM heap is a hard wall. Going above it actually reduces usable memory because pointer compression is lost. Elastic's own guidance says to stay below the threshold, in practice 31GB, and they mean it.
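A composite-aggregation sketch for paging through buckets instead of materializing them all in heap at once (the status field is hypothetical):

```
# Returns up to 1000 buckets per page; feed the response's "after_key"
# back in as "after" to fetch the next page
GET /logs-app/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "composite": {
        "size": 1000,
        "sources": [
          { "status": { "terms": { "field": "status" } } }
        ]
      }
    }
  }
}
```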
Production Architecture
Run dedicated master-eligible nodes (3 or 5, always odd for quorum), separate coordinating nodes for query routing, and data nodes for storage. Do not co-locate master and data roles on high-throughput clusters. A GC pause on a data node that doubles as a master will destabilize the entire cluster. I've seen this happen, and it's not a fun 2 AM page. Use cluster.routing.allocation.awareness.attributes: zone for cross-AZ replica placement.
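Node roles are static settings in each node's elasticsearch.yml (e.g. node.roles: [ master ] on dedicated masters, plus a node.attr.zone attribute on data nodes); the awareness attribute itself can be applied dynamically, sketched here:

```
# Tell the allocator to spread primaries and replicas across zones
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}
```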
For time-series data (logs, metrics), use data tiers. Hot nodes on NVMe SSDs hold the recent indices. Warm nodes on regular SSDs store aging data. Cold nodes on HDDs handle archive. Set up ILM policies to automatically roll over indices at 50GB or 30 days, force-merge to 1 segment on the warm tier (this cuts heap usage significantly), and delete after the retention window closes.
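A minimal ILM policy along those lines. The phase timings are illustrative rather than a recommendation, and max_primary_shard_size assumes a reasonably recent Elasticsearch (7.13+):

```
# Hot: roll over at 50GB per primary shard or 30 days.
# Warm: merge down to 1 segment to cut heap usage.
# Delete: drop the index once the retention window closes.
PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```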
Set JVM heap to exactly 50% of RAM, capped at 31GB. The other half goes to OS filesystem cache, and Lucene depends heavily on mmap'd segment files. A 64GB node should have 31GB heap and 33GB for filesystem cache. That split is not a suggestion. Monitor these: cluster health, pending_tasks, JVM heap usage %, field data cache size, segment count, search latency p99, and indexing latency p99.
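A quick spot-check sketch for a few of those metrics via the standard APIs:

```
# Cluster status and the backlog of master-level tasks
GET /_cluster/health?filter_path=status,number_of_pending_tasks

# Per-node JVM heap pressure
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent

# Segment counts (watch for unmerged segment sprawl)
GET /_cat/segments?v
```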
Capacity Planning
Aim for shard sizes between 10GB and 50GB. Each shard eats roughly 1MB of heap just for metadata, so a cluster with 100,000 shards burns 100GB on metadata alone, which will exhaust the available heap. The rule of thumb: fewer than 20 shards per GB of heap. On a 31GB-heap node, that means staying under 620 shards.
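A quick way to audit where a cluster stands against those numbers, assuming cat-API access:

```
# Largest shards first: index, shard number, primary/replica, on-disk size
GET /_cat/shards?v&h=index,shard,prirep,store&s=store:desc
```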
For search-heavy workloads, plan on one shard per 10-30 million documents. For log analytics, size by daily ingestion volume. Take an ingestion rate of 500GB/day with 30-day retention: that's 15TB of primary data, or 30TB with one replica. At 50GB per shard, that is 600 shards, which fits comfortably on a 10-node cluster.
Index throughput: a single data node typically handles 10,000-40,000 docs/sec, depending on document size and mapping complexity. Bulk indexing with batches of 5-15MB per request delivers the best throughput. For heavy write workloads, bump the refresh interval to 30 seconds instead of 1. That alone cuts segment creation overhead by roughly 30x.
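A sketch of both knobs together, on a hypothetical write-heavy index (the bulk body is NDJSON: an action line followed by a source line):

```
# Slow the refresh cadence to reduce segment churn
PUT /logs-app/_settings
{
  "index": { "refresh_interval": "30s" }
}

# Bulk-index two documents; batch 5-15MB of these per request
POST /logs-app/_bulk
{ "index": {} }
{ "message": "first event", "@timestamp": "2024-01-01T00:00:00Z" }
{ "index": {} }
{ "message": "second event", "@timestamp": "2024-01-01T00:00:01Z" }
```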
Failure Scenarios
Scenario 1: Mapping explosion kills the cluster. A service starts indexing JSON documents with dynamic keys (think user-generated field names). Dynamic mapping creates a new field in the mapping for every unique key. Once the field count hits 10,000+, cluster state updates become expensive because each mapping change gets broadcast to all nodes. The master node ends up spending all its time processing mapping updates. Cluster state publication times out. Nodes start dropping. Detection: monitor cluster.state.update.count and mapping field count per index. Recovery: delete the problematic index, recreate it with dynamic: strict or dynamic: false, and reindex from the source of truth.
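A recovery sketch along those lines; the index and field names are hypothetical:

```
# Recreate with strict dynamic mapping: documents with unknown fields are rejected
PUT /events-v2
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "user_id":    { "type": "keyword" },
      "@timestamp": { "type": "date" }
    }
  }
}

# Rebuild from the old index (or better, from the source of truth)
POST /_reindex
{
  "source": { "index": "events-v1" },
  "dest":   { "index": "events-v2" }
}
```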
Scenario 2: Coordinating node OOM from a heavy aggregation. Someone runs a terms aggregation on a high-cardinality field (100M+ unique values) across 200 shards. Each shard sends back its local top-N buckets, and the coordinating node has to merge all of them in heap. The heap fills up, GC pauses blow past 30 seconds, and the coordinating node goes unresponsive. Every in-flight query times out. Detection: watch coordinating node heap usage and GC pause duration; alert on jvm.mem.heap_used_percent > 85%. Recovery: switch to composite aggregation for paginated results, add execution_hint: map for low-cardinality fields, or use the shard_size parameter to cap per-shard bucket count. This is one of the most common ways teams accidentally take down their search cluster.
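For the terms-aggregation variant of the fix, shard_size caps how many candidate buckets each shard ships back to the coordinator (user_id stands in for the high-cardinality field):

```
# Each shard returns at most 100 candidate buckets instead of its full term set
GET /logs-app/_search
{
  "size": 0,
  "aggs": {
    "top_users": {
      "terms": {
        "field": "user_id",
        "size": 10,
        "shard_size": 100
      }
    }
  }
}
```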
Pros
- Near real-time full-text search
- Horizontally scalable with automatic sharding
- Rich query DSL with aggregations
- Schema-free JSON documents
- Powerful text analysis and tokenization
Cons
- Not a primary data store (no ACID transactions)
- High memory consumption for indexing
- Split-brain risk without careful cluster config
- Complex capacity planning and tuning
- License changes (SSPL) may affect deployment
When to use
- Need full-text search with relevance scoring
- Log aggregation and analytics (ELK/EFK stack)
- Real-time search across millions of documents
- Complex aggregations and faceted search
When NOT to use
- Primary data store for transactional workloads
- Simple key-value lookups
- Strong consistency requirements
- Limited infrastructure budget (resource-hungry)
Key Points
- The inverted index maps every unique term to the list of documents containing it, enabling fast term lookups and efficient boolean query intersection
- Near real-time means documents become searchable after a refresh interval (default 1 second), when the in-memory buffer gets written to a Lucene segment
- Shard count is locked after index creation. Adding or removing primary shards requires a full reindex, so get the capacity math right upfront
- Aggregations (terms, histogram, date_histogram, composite) run analytics but live entirely in JVM heap. High-cardinality aggregations will cause OOM
- Index Lifecycle Management (ILM) automates rollover, shrink, force-merge, and delete phases. It is non-negotiable for time-series data like logs
- The query-then-fetch model means a query across 100 shards fires 100 parallel requests, then fetches actual documents from the relevant shards
Common Mistakes
- ✗ Too many small shards (<10GB each). Each shard is a Lucene index that consumes heap for metadata; clusters with 100K+ shards become unstable
- ✗ Not setting explicit mappings upfront. Dynamic mapping auto-creates fields with wrong types, causing mapping explosion (10K+ fields) and heap exhaustion
- ✗ Searching across too many indices at once. Each shard adds a thread and heap allocation; hitting 500 shards per query saturates the coordinating node
- ✗ Not using ILM for retention. Indices grow unbounded, disk fills, and the cluster goes red. Always configure rollover with size/age triggers and delete policies
- ✗ Ignoring heap pressure from aggregations. A terms aggregation on a high-cardinality field (e.g., user_id with 100M unique values) can eat the entire 31GB heap