Elasticsearch
The search engine most teams reach for first, built on Lucene
How It Works Internally
Running Elasticsearch in production requires understanding Lucene. Not surface-level, but actually understanding it. Every Elasticsearch shard is a standalone Lucene index. A Lucene index is made up of immutable segments, and each segment contains an inverted index, stored fields, doc values (columnar data for sorting and aggregations), and term vectors. The inverted index maps each analyzed term to a posting list: a sorted array of document IDs that contain that term. When running boolean queries, Elasticsearch intersects or unions those posting lists using skip-list algorithms.
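To see exactly which terms a string contributes to the inverted index, the _analyze API exposes the analyzer's output. A minimal sketch in Kibana Dev Tools console syntax:

```
# Inspect the terms the standard analyzer would add to the inverted index
GET /_analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown Fox"
}
# => tokens "the", "quick", "brown", "fox" (lowercased, split on word boundaries);
#    each token becomes a term pointing at a posting list
```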
When a document is indexed, it first lands in an in-memory buffer and the translog (write-ahead log). Every refresh_interval (default 1 second), the buffer is flushed to a new Lucene segment, and only then does the document become searchable. This is why Elasticsearch calls itself "near real-time" and not real-time. The distinction matters. By default the translog is fsync'd on every request (index.translog.durability: request); setting it to async defers the fsync to every sync_interval (5 seconds by default) in exchange for a small window of potential data loss. In the background, segment merges run continuously: small segments get combined into bigger ones, and deleted documents are physically removed.
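A sketch of the settings involved, on a hypothetical logs-app index (shown here with async durability, which trades a few seconds of potential data loss for throughput):

```
# refresh_interval controls when buffered docs become searchable;
# the translog settings control fsync behavior
PUT /logs-app
{
  "settings": {
    "index.refresh_interval": "1s",
    "index.translog.durability": "async",
    "index.translog.sync_interval": "5s"
  }
}
```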
Query execution follows a query-then-fetch pattern. The coordinating node broadcasts the query to every relevant shard. Each shard runs the query against its local Lucene index and sends back the top-N document IDs with their scores. The coordinating node merges all of those results (it's really a distributed top-K problem), then fetches the full documents from the shards that actually hold them. So for a query across 50 shards requesting 10 results, each shard returns 10 candidates (500 total), the coordinator picks the global top 10, then grabs those 10 documents. The math adds up fast with hundreds of shards.
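A plain top-10 search illustrates the pattern; logs-app and the message field are placeholders. With 50 shards, this single request fans out into 50 shard-level queries:

```
# Each shard returns its local top 10 by _score; the coordinator merges
# the 500 candidates down to the global top 10, then fetches those documents
GET /logs-app/_search
{
  "size": 10,
  "query": {
    "match": { "message": "timeout" }
  }
}
```

The _shards section of the response confirms how many shards participated.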
Aggregations rely on doc values, which are column-oriented data structures stored alongside the inverted index. Terms aggregations build an in-memory hash map of term-to-count. Composite aggregations paginate through results so there is no need to load everything into heap at once. A critical detail: the compressed-oops boundary at roughly 32GB of JVM heap is a hard wall. Going above it actually reduces usable memory because pointer compression is lost. Elastic's own guidance says to stay below the threshold, in practice 31GB, and they mean it.
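A composite-aggregation sketch for paging through buckets instead of materializing them all in heap at once (the status field is hypothetical):

```
# Returns up to 1000 buckets per page; feed the response's "after_key"
# back in as "after" to fetch the next page
GET /logs-app/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "composite": {
        "size": 1000,
        "sources": [
          { "status": { "terms": { "field": "status" } } }
        ]
      }
    }
  }
}
```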
Production Architecture
Run dedicated master-eligible nodes (3 or 5, always odd for quorum), separate coordinating nodes for query routing, and data nodes for storage. Do not co-locate master and data roles on high-throughput clusters. A GC pause on a data node that doubles as a master will destabilize the entire cluster. I've seen this happen, and it's not a fun 2 AM page. Use cluster.routing.allocation.awareness.attributes: zone for cross-AZ replica placement.
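Node roles are static settings in each node's elasticsearch.yml (e.g. node.roles: [ master ] on dedicated masters, plus a node.attr.zone attribute on data nodes); the awareness attribute itself can be applied dynamically, sketched here:

```
# Tell the allocator to spread primaries and replicas across zones
PUT /_cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}
```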
For time-series data (logs, metrics), use data tiers. Hot nodes on NVMe SSDs hold the recent indices. Warm nodes on regular SSDs store aging data. Cold nodes on HDDs handle archive. Set up ILM policies to automatically roll over indices at 50GB or 30 days, force-merge to 1 segment on the warm tier (this cuts heap usage significantly), and delete after the retention window closes.
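A minimal ILM policy along those lines. The phase timings are illustrative rather than a recommendation, and max_primary_shard_size assumes a reasonably recent Elasticsearch (7.13+):

```
# Hot: roll over at 50GB per primary shard or 30 days.
# Warm: merge down to 1 segment to cut heap usage.
# Delete: drop the index once the retention window closes.
PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "30d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "forcemerge": { "max_num_segments": 1 }
        }
      },
      "delete": {
        "min_age": "30d",
        "actions": { "delete": {} }
      }
    }
  }
}
```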
Set JVM heap to exactly 50% of RAM, capped at 31GB. The other half goes to OS filesystem cache, and Lucene depends heavily on mmap'd segment files. A 64GB node should have 31GB heap and 33GB for filesystem cache. That split is not a suggestion. Monitor these: cluster health, pending_tasks, JVM heap usage %, field data cache size, segment count, search latency p99, and indexing latency p99.
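A quick spot-check sketch for a few of those metrics via the standard APIs:

```
# Cluster status and the backlog of master-level tasks
GET /_cluster/health?filter_path=status,number_of_pending_tasks

# Per-node JVM heap pressure
GET /_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent

# Segment counts (watch for unmerged segment sprawl)
GET /_cat/segments?v
```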
Capacity Planning
Aim for shard sizes between 10GB and 50GB. Each shard eats roughly 1MB of heap just for metadata, so a cluster with 100,000 shards burns 100GB on metadata alone, which will exhaust the available heap. The rule of thumb: fewer than 20 shards per GB of heap. On a 31GB-heap node, that means staying under 620 shards.
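A quick way to audit where a cluster stands against those numbers, assuming cat-API access:

```
# Largest shards first: index, shard number, primary/replica, on-disk size
GET /_cat/shards?v&h=index,shard,prirep,store&s=store:desc
```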
For search-heavy workloads, plan on one shard per 10-30 million documents. For log analytics, size by daily ingestion volume. Take an ingestion rate of 500GB/day with 30-day retention: that's 15TB of primary data, or 30TB with one replica. At 50GB per shard, that is 600 shards, which fits comfortably on a 10-node cluster.
Index throughput: a single data node typically handles 10,000-40,000 docs/sec, depending on document size and mapping complexity. Bulk indexing with batches of 5-15MB per request delivers the best throughput. For heavy write workloads, bump the refresh interval to 30 seconds instead of 1. That alone cuts segment creation overhead by roughly 30x.
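A sketch of both knobs together, on a hypothetical write-heavy index (the bulk body is NDJSON: an action line followed by a source line):

```
# Slow the refresh cadence to reduce segment churn
PUT /logs-app/_settings
{
  "index": { "refresh_interval": "30s" }
}

# Bulk-index two documents; batch 5-15MB of these per request
POST /logs-app/_bulk
{ "index": {} }
{ "message": "first event", "@timestamp": "2024-01-01T00:00:00Z" }
{ "index": {} }
{ "message": "second event", "@timestamp": "2024-01-01T00:00:01Z" }
```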
Failure Scenarios
Scenario 1: Mapping explosion kills the cluster. A service starts indexing JSON documents with dynamic keys (think user-generated field names). Dynamic mapping creates a new field in the mapping for every unique key. Once the field count hits 10,000+, cluster state updates become expensive because each mapping change gets broadcast to all nodes. The master node ends up spending all its time processing mapping updates. Cluster state publication times out. Nodes start dropping. Detection: monitor cluster.state.update.count and mapping field count per index. Recovery: delete the problematic index, recreate it with dynamic: strict or dynamic: false, and reindex from the source of truth.
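A recovery sketch along those lines; the index and field names are hypothetical:

```
# Recreate with strict dynamic mapping: documents with unknown fields are rejected
PUT /events-v2
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "user_id":    { "type": "keyword" },
      "@timestamp": { "type": "date" }
    }
  }
}

# Rebuild from the old index (or better, from the source of truth)
POST /_reindex
{
  "source": { "index": "events-v1" },
  "dest":   { "index": "events-v2" }
}
```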
Scenario 2: Coordinating node OOM from a heavy aggregation. Someone runs a terms aggregation on a high-cardinality field (100M+ unique values) across 200 shards. Each shard sends back its local top-N buckets, and the coordinating node has to merge all of them in heap. The heap fills up, GC pauses blow past 30 seconds, and the coordinating node goes unresponsive. Every in-flight query times out. Detection: watch coordinating node heap usage and GC pause duration; alert on jvm.mem.heap_used_percent > 85%. Recovery: switch to composite aggregation for paginated results, add execution_hint: map for low-cardinality fields, or use the shard_size parameter to cap per-shard bucket count. This is one of the most common ways teams accidentally take down their search cluster.
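For the terms-aggregation variant of the fix, shard_size caps how many candidate buckets each shard ships back to the coordinator (user_id stands in for the high-cardinality field):

```
# Each shard returns at most 100 candidate buckets instead of its full term set
GET /logs-app/_search
{
  "size": 0,
  "aggs": {
    "top_users": {
      "terms": {
        "field": "user_id",
        "size": 10,
        "shard_size": 100
      }
    }
  }
}
```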
Pros
- Near real-time full-text search
- Horizontally scalable with automatic sharding
- Rich query DSL with aggregations
- Schema-free JSON documents
- Powerful text analysis and tokenization
Cons
- Not a primary data store (no ACID transactions)
- High memory consumption for indexing
- Split-brain risk without careful cluster config
- Complex capacity planning and tuning
- License changes (SSPL) may affect deployment
When to use
- Need full-text search with relevance scoring
- Log aggregation and analytics (ELK/EFK stack)
- Real-time search across millions of documents
- Complex aggregations and faceted search
When NOT to use
- Primary data store for transactional workloads
- Simple key-value lookups
- Strong consistency requirements
- Limited infrastructure budget (resource-hungry)
Key Points
- The inverted index maps every unique term to the list of documents containing it, enabling fast term lookups and efficient boolean query intersection
- Near real-time means documents become searchable after a refresh interval (default 1 second), when the in-memory buffer gets written to a Lucene segment
- Shard count is locked after index creation. Adding or removing primary shards requires a full reindex, so get the capacity math right upfront
- Aggregations (terms, histogram, date_histogram, composite) run analytics but live entirely in JVM heap. High-cardinality aggregations will cause OOM
- Index Lifecycle Management (ILM) automates rollover, shrink, force-merge, and delete phases. It is non-negotiable for time-series data like logs
- The query-then-fetch model means a query across 100 shards fires 100 parallel requests, then fetches actual documents from the relevant shards
Common Mistakes
- ✗ Too many small shards (<10GB each). Each shard is a Lucene index that consumes heap for metadata; clusters with 100K+ shards become unstable
- ✗ Not setting explicit mappings upfront. Dynamic mapping auto-creates fields with wrong types, causing mapping explosion (10K+ fields) and heap exhaustion
- ✗ Searching across too many indices at once. Each shard adds a thread and heap allocation; hitting 500 shards per query saturates the coordinating node
- ✗ Not using ILM for retention. Indices grow unbounded, disk fills, and the cluster goes red. Always configure rollover with size/age triggers and delete policies
- ✗ Ignoring heap pressure from aggregations. A terms aggregation on a high-cardinality field (e.g., user_id with 100M unique values) can eat the entire 31GB heap