Hot, Warm & Cold Data Tiering
Why It Exists
Not all data deserves the same storage. The last 24 hours of orders get queried 500 times per second. Last month's orders see maybe 10 queries per minute. Last year's orders sit there for compliance and the occasional customer support lookup.
Storing everything in Redis because "it is fast" costs $3,000 per month per terabyte. Storing everything in S3 because "it is cheap" means the checkout page takes 8 seconds to load order history. Tiering delivers fast where it matters and cheap where it does not.
The economics tell the whole story. 1 TB in Redis costs roughly $3,000/month. The same data on SSD-backed PostgreSQL costs about $100/month. On S3 it is $23/month. On S3 Glacier it is $4/month. That is a 750x cost difference between the fastest and cheapest tier. For a company storing 50 TB, the difference between "everything in one tier" and "intelligent tiering" can be $100K/month.
Defining the Tiers
Hot tier. Data accessed multiple times per second. Latency budget: sub-millisecond to single-digit milliseconds. Lives in memory (Redis, Memcached) or on NVMe SSDs with in-memory caching (PostgreSQL with shared_buffers tuned, Elasticsearch hot nodes).
Examples: active user sessions, real-time dashboards, live order status, current pricing, feature flags, rate limiter counters.
Warm tier. Data accessed a few times per minute to a few times per hour. Latency budget: 10-100ms. Lives on SSD-backed databases or Elasticsearch warm nodes. This is where most queryable data sits.
Examples: last 30 days of orders, recent log data, user profiles accessed occasionally, product catalog, recent search history.
Cold tier. Data accessed a few times per day or less. Latency budget: seconds to minutes. Acceptable because the queries are infrequent and the user expects to wait. Lives on object storage (S3), HDFS, Elasticsearch frozen nodes, or searchable snapshots.
Examples: compliance archives, historical analytics beyond 90 days, old audit trails, decommissioned product data, old user-generated content.
Frozen/archive tier. Data that might never be accessed again but must be retained for legal or regulatory reasons. Latency budget: hours (restore from archive before querying). Lives on S3 Glacier, S3 Glacier Deep Archive, or tape.
Examples: legal holds, 7-year financial records, HIPAA-mandated medical data retention.
Query Routing Patterns (Concrete Example)
Here is how tiering works in practice. Take an e-commerce order history API.
Request: GET /api/orders?user_id=123&days=365
The query router splits this into three sub-queries based on tier boundaries:
Hot path (last 24 hours). Query Redis: ZRANGEBYSCORE orders:123 <24h_ago> +inf. Returns in 0.3ms. Finds 2 recent orders.
Warm path (1 to 90 days). Query PostgreSQL: SELECT * FROM orders WHERE user_id = 123 AND created_at > now() - interval '90 days'. Returns in 8ms. Finds 15 orders.
Cold path (90 to 365 days). Query S3 via Athena: SELECT * FROM orders_archive WHERE user_id = 123 AND year_month >= '2025-06'. Returns in 1.2 seconds. Finds 8 orders.
The API merges all three results and returns 25 orders. Total response time is dominated by the cold-tier query at 1.2 seconds. But here is what matters: for the common case where a user checks their recent orders (last 7 days), only the hot and warm paths fire. That response comes back in under 10ms.
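As a minimal sketch of that routing logic in Python, assuming a redis-py client, a psycopg2 connection to the warm tier, and a run_athena_query() helper standing in for the Athena call (table names, key layout, and window boundaries are illustrative, not taken from a real schema):

```python
from datetime import datetime, timedelta, timezone

import redis
import psycopg2
import psycopg2.extras

HOT_WINDOW = timedelta(hours=24)
WARM_WINDOW = timedelta(days=90)

r = redis.Redis(host="localhost", port=6379, decode_responses=True)
pg = psycopg2.connect("dbname=shop user=app")  # warm tier: SSD-backed PostgreSQL


def run_athena_query(sql: str) -> list[dict]:
    """Hypothetical helper that submits an Athena query and waits for results.
    In practice this would wrap boto3's start_query_execution / get_query_results."""
    raise NotImplementedError


def get_order_history(user_id: int, days: int) -> list[dict]:
    now = datetime.now(timezone.utc)
    since = now - timedelta(days=days)
    results: list[dict] = []

    # Hot path: last 24h in a sorted set scored by timestamp (members could be
    # order IDs or serialized order payloads).
    hot_cutoff = now - HOT_WINDOW
    results += [
        {"order_id": member, "tier": "hot"}
        for member in r.zrangebyscore(f"orders:{user_id}", hot_cutoff.timestamp(), "+inf")
    ]

    # Warm path: 1-90 days from PostgreSQL. Only fires if the requested window
    # actually reaches past the hot boundary.
    if since < hot_cutoff:
        with pg.cursor(cursor_factory=psycopg2.extras.RealDictCursor) as cur:
            cur.execute(
                """SELECT * FROM orders
                   WHERE user_id = %s
                     AND created_at >= %s
                     AND created_at < %s""",
                (user_id, max(since, now - WARM_WINDOW), hot_cutoff),
            )
            results += [dict(row, tier="warm") for row in cur.fetchall()]

    # Cold path: beyond 90 days, scan the S3 archive. Slowest by far, so it is
    # skipped entirely for the common "recent orders" case.
    if since < now - WARM_WINDOW:
        results += run_athena_query(
            f"SELECT * FROM orders_archive WHERE user_id = {user_id} "
            f"AND created_at >= timestamp '{since:%Y-%m-%d %H:%M:%S}'"
        )

    return results
```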
The alternative without tiering? Store everything in PostgreSQL. Every query scans 365 days of data regardless of whether the user asked for yesterday or last year. At scale, the table grows to billions of rows and the database spends most of its time scanning old data that nobody asked for.
A smarter version of the cold path: pre-aggregate old data into monthly summaries and store those summaries in the warm tier. Most cold-tier queries do not need individual records. "Total spending: $2,340 in March 2025" is often enough. This avoids the Athena round-trip entirely for summary requests.
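One way to build those summaries is a nightly rollup inside the warm tier before the detail rows age out. A sketch, assuming hypothetical orders and order_monthly_summary tables with a unique constraint on (user_id, year_month):

```python
import psycopg2

ROLLUP_SQL = """
INSERT INTO order_monthly_summary (user_id, year_month, order_count, total_spend)
SELECT user_id,
       to_char(created_at, 'YYYY-MM') AS year_month,
       count(*)                       AS order_count,
       sum(total_amount)              AS total_spend
FROM orders
WHERE created_at >= date_trunc('month', now() - interval '1 month')
  AND created_at <  date_trunc('month', now())
GROUP BY user_id, to_char(created_at, 'YYYY-MM')
ON CONFLICT (user_id, year_month) DO UPDATE
  SET order_count = EXCLUDED.order_count,
      total_spend = EXCLUDED.total_spend;
"""


def rollup_last_month() -> None:
    """Nightly batch: summarize last month's orders before the detail rows are
    demoted to the cold tier, so summary queries never leave the warm tier."""
    with psycopg2.connect("dbname=shop user=app") as conn, conn.cursor() as cur:
        cur.execute(ROLLUP_SQL)
```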
Promotion and Demotion Policies
Time-based. The simplest approach. Move data from hot to warm after 24 hours, warm to cold after 90 days. This works when access patterns correlate with age, which they do for most transactional data. Most teams start here and it is good enough for a long time.
Access-count-based. Track how often each record (or partition, or index) is accessed. Demote when the access count falls below a threshold over a time window. More accurate than time-based, but it requires access-tracking infrastructure, which few storage engines provide out of the box.
Hybrid. Time-based demotion with access-count-based promotion. If a cold record suddenly gets accessed frequently (a viral old blog post, a reopened support ticket, a legal discovery request), promote it back to warm. This is the best of both worlds but the most complex to implement.
Manual override. Some data is always hot regardless of access patterns. Configuration, feature flags, pricing data. Pin it to the hot tier explicitly and never demote it.
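A sketch of the hybrid rule as plain logic; the thresholds, pinned keys, and Record shape are illustrative assumptions, not prescriptions:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative thresholds; tune them against measured access patterns.
HOT_MAX_AGE = timedelta(hours=24)
WARM_MAX_AGE = timedelta(days=90)
PROMOTE_ACCESSES_PER_DAY = 10                 # cold record suddenly popular -> promote
PINNED_KEYS = {"feature_flags", "pricing"}    # manual override: always hot


@dataclass
class Record:
    key: str
    tier: str                 # "hot" | "warm" | "cold"
    created_at: datetime
    accesses_last_24h: int


def target_tier(rec: Record, now: datetime | None = None) -> str:
    now = now or datetime.now(timezone.utc)
    age = now - rec.created_at

    if rec.key in PINNED_KEYS:
        return "hot"
    # Access-count-based promotion: popularity beats age.
    if rec.accesses_last_24h >= PROMOTE_ACCESSES_PER_DAY:
        return "hot" if age <= HOT_MAX_AGE else "warm"
    # Time-based demotion for everything else.
    if age <= HOT_MAX_AGE:
        return "hot"
    if age <= WARM_MAX_AGE:
        return "warm"
    return "cold"
```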
Concrete policy example (Elasticsearch ILM):
| Phase | Age | Actions | Node Type |
|---|---|---|---|
| Hot | 0-7 days | 3 replicas, force merge to 1 segment | NVMe SSD, 64GB RAM |
| Warm | 7-30 days | Shrink to 1 shard, 1 replica, read-only | SSD, 32GB RAM |
| Cold | 30-90 days | Searchable snapshot on S3, frozen | Minimal (S3-backed) |
| Delete | 90+ days | Delete index | N/A |
This policy handles a logging pipeline doing 100 GB/day. Only the last week sits on expensive hot hardware. Everything older than a month is on S3. Total storage cost drops by roughly 80% compared to keeping everything on hot nodes.
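Expressed as code, the policy above looks roughly like this. A sketch using the elasticsearch Python client; the snapshot repository name is assumed, and the exact client call shape varies between client versions:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Phases mirror the table: hot 0-7d, warm 7-30d, cold 30-90d on S3, delete at 90d.
# Replica count for hot indices is set in the index template, not in ILM.
policy = {
    "phases": {
        "hot": {
            "actions": {
                "rollover": {"max_age": "7d"},
                "forcemerge": {"max_num_segments": 1},
            }
        },
        "warm": {
            "min_age": "7d",
            "actions": {
                "shrink": {"number_of_shards": 1},
                "allocate": {"number_of_replicas": 1},
                "readonly": {},
            },
        },
        "cold": {
            "min_age": "30d",
            "actions": {
                # "logs-s3-repo" is an assumed snapshot repository name.
                "searchable_snapshot": {"snapshot_repository": "logs-s3-repo"}
            },
        },
        "delete": {"min_age": "90d", "actions": {"delete": {}}},
    }
}

es.ilm.put_lifecycle(name="logs-tiered-policy", policy=policy)
```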
Built-in Tiered Storage in Practice
Elasticsearch ILM is the most battle-tested tiering implementation in the ecosystem. It provides lifecycle policies that automatically roll over indices when they hit a size or age threshold, shrink them, freeze them, and eventually delete them. Hot nodes run NVMe SSDs with high CPU for indexing and search. Warm nodes run cheaper SSDs with less CPU. Cold and frozen nodes back their indices with S3 searchable snapshots, paying almost nothing for storage while keeping the data queryable (at higher latency). The query API is identical across tiers. The application does not know or care which tier serves a particular index.
Kafka Tiered Storage (KIP-405) solves a different problem. Kafka brokers traditionally keep all log segments on local disk. With 90 days of retention on a topic doing 1 TB/day, that requires 90 TB of broker disk. Tiered storage offloads older segments to S3 while keeping recent segments on local SSD. Consumers reading the latest data hit local disk with normal latency. Consumers replaying from 3 months ago transparently fetch from S3. This makes it practical to set Kafka retention to "forever" without breaking the bank on broker hardware.
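A sketch of the per-topic opt-in, assuming a Kafka 3.6+ cluster with remote storage already enabled at the broker level; the topic name, partition count, and retention values are illustrative:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

# Keep ~2 days of segments on local broker disk; everything older is served
# from remote (S3) storage up to the 90-day total retention.
topic = NewTopic(
    "orders.events",
    num_partitions=12,
    replication_factor=3,
    config={
        "remote.storage.enable": "true",                       # per-topic opt-in
        "local.retention.ms": str(2 * 24 * 60 * 60 * 1000),    # local hot window
        "retention.ms": str(90 * 24 * 60 * 60 * 1000),         # total retention
    },
)

futures = admin.create_topics([topic])
futures["orders.events"].result()  # raises if topic creation failed
```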
ClickHouse with S3-backed MergeTree works at the partition level. Recent partitions (this week's data) live on local NVMe for fast queries. Old partitions automatically move to S3 based on a storage_policy configuration. Queries that span both local and S3 partitions run transparently. The catch: S3-backed partitions are slower to scan, so cold queries take longer. Pre-aggregate old data into rollup tables for sub-second analytics on historical ranges.
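A sketch of the partition-level setup in DDL, assuming a storage policy named 'hot_and_cold' (local disk plus an S3 disk) is already defined in the server's storage configuration; the table and column names are illustrative:

```python
import clickhouse_connect

client = clickhouse_connect.get_client(host="localhost")

client.command("""
CREATE TABLE IF NOT EXISTS events
(
    event_time DateTime,
    user_id    UInt64,
    payload    String
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(event_time)
ORDER BY (user_id, event_time)
TTL event_time + INTERVAL 7 DAY TO VOLUME 'cold'   -- move old parts to the S3-backed volume
SETTINGS storage_policy = 'hot_and_cold'
""")
```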
Custom tiering (Redis + PostgreSQL + S3). When the database does not have built-in tiering, the application handles routing. Check Redis first. On miss, check PostgreSQL. If the data is older than the warm tier boundary, query S3 via Athena or a similar query engine. This is what most teams build, and it works fine. The downside is that every new feature needs to be aware of the tiering logic.
Cost Analysis
| Tier | Storage | Cost per TB/month | Read Latency | Example Tech |
|---|---|---|---|---|
| Hot | In-memory | ~$3,000 | < 1ms | Redis, Memcached |
| Warm | SSD-backed DB | ~$100 | 5-50ms | PostgreSQL, ES hot nodes |
| Cold | Object storage | ~$23 | 100ms-5s | S3, ES frozen tier |
| Frozen | Archive | ~$4 | 1-12 hours | S3 Glacier Deep Archive |
Real example. A system storing 10 TB total with typical access distribution: 100 GB hot + 1 TB warm + 9 TB cold.
- With tiering: $300 (hot) + $100 (warm) + $207 (cold) = $607/month
- All in PostgreSQL: ~$1,000/month, and cold-tier queries are slow because the database is scanning 10 TB
- All in Redis: ~$30,000/month. Do not laugh, I have seen teams do this
The tiering payoff grows with data volume. Scale the same distribution to 100 TB and the all-PostgreSQL bill is roughly $10K/month against a tiered bill of about $6K, and the gap widens dramatically if any meaningful fraction of that data would otherwise sit in memory.
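The arithmetic behind those numbers, as a quick back-of-the-envelope check using the approximate per-TB prices from the table above:

```python
# Approximate $/TB/month from the cost table above.
COST_PER_TB = {"hot": 3000, "warm": 100, "cold": 23, "frozen": 4}


def monthly_cost(tb_by_tier: dict[str, float]) -> float:
    return sum(COST_PER_TB[tier] * tb for tier, tb in tb_by_tier.items())


# 10 TB with the typical skew: 100 GB hot, 1 TB warm, 9 TB cold.
tiered   = monthly_cost({"hot": 0.1, "warm": 1, "cold": 9})   # ~$607
all_warm = monthly_cost({"warm": 10.1})                        # ~$1,010 (all PostgreSQL)
all_hot  = monthly_cost({"hot": 10.1})                         # ~$30,300 (all Redis)

# Same skew scaled to 100 TB: the gap against a single warm tier keeps growing.
tiered_100tb   = monthly_cost({"hot": 1, "warm": 10, "cold": 89})   # ~$6,047
all_warm_100tb = monthly_cost({"warm": 100})                        # ~$10,000

print(f"10 TB  tiered ${tiered:,.0f} vs all-warm ${all_warm:,.0f} vs all-hot ${all_hot:,.0f}")
print(f"100 TB tiered ${tiered_100tb:,.0f} vs all-warm ${all_warm_100tb:,.0f}")
```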
When Not to Tier
Tiering adds complexity. Every query needs to know which tier to hit. Every new feature needs to respect tier boundaries. Debugging becomes harder because data lives in three places.
Skip tiering if the total dataset fits on a single SSD (under 500 GB) and query latency is acceptable. One tier, one technology, one place to look when something breaks.
Also skip it if the access patterns are uniform. If every record gets queried with roughly equal frequency (a configuration store, a small product catalog), there is no "cold" data to move. The whole point of tiering is exploiting the fact that most data is rarely accessed. If that is not true for a given workload, tiering is overhead with no payoff.
Failure Scenarios
Scenario 1: Hot tier goes down, warm tier gets crushed. Redis crashes. Every request that used to hit Redis now falls through to PostgreSQL. The database was sized for warm-tier load (50 QPS), not the full hot-tier load (5,000 QPS). Connection pool exhaustion hits within seconds. Queries start timing out. The monitoring dashboard, which also queries PostgreSQL, goes dark.
Detection: Alert on Redis availability and on PostgreSQL connection pool utilization crossing 80%.
Prevention: Put a circuit breaker on the hot-tier fallback path. When Redis is down, return a degraded response (cached from the last successful read, or a "temporarily unavailable" status) instead of blindly forwarding all traffic to the warm tier. Pre-compute a capacity buffer for how much extra load the warm tier can absorb and set the circuit breaker threshold accordingly. In practice, a warm tier can usually handle 2-3x its normal load for short bursts, not 100x.
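A minimal sketch of that guardrail; the 3x budget, the 1-second window, and the redis_lookup/postgres_lookup stubs are illustrative assumptions:

```python
import time


class FallbackBreaker:
    """Caps how much hot-tier traffic may fall through to the warm tier.

    When Redis is down, only max_fallback_qps requests per second are forwarded
    to PostgreSQL; the rest get a degraded response instead of piling onto a
    database sized for a fraction of the load."""

    def __init__(self, max_fallback_qps: float):
        self.max_fallback_qps = max_fallback_qps
        self._window_start = time.monotonic()
        self._count = 0

    def allow_fallback(self) -> bool:
        now = time.monotonic()
        if now - self._window_start >= 1.0:        # new 1-second window
            self._window_start, self._count = now, 0
        if self._count < self.max_fallback_qps:
            self._count += 1
            return True
        return False


def redis_lookup(order_id: str) -> dict:
    """Hot-tier read; raises ConnectionError when Redis is unreachable (stub)."""
    raise ConnectionError("redis down")


def postgres_lookup(order_id: str) -> dict:
    """Warm-tier read against PostgreSQL (stubbed for the sketch)."""
    return {"order_id": order_id, "status": "shipped", "tier": "warm"}


# Warm tier normally sees ~50 QPS; assume it can absorb ~3x for short bursts.
breaker = FallbackBreaker(max_fallback_qps=150)


def get_order_status(order_id: str) -> dict:
    try:
        return redis_lookup(order_id)                 # hot path
    except ConnectionError:
        if breaker.allow_fallback():
            return postgres_lookup(order_id)          # bounded warm-tier fallback
        return {"status": "temporarily_unavailable"}  # degraded response
```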
Scenario 2: Cold-tier query blocks the API. A customer support agent searches 3 years of order history. The Athena query scans 500 GB of Parquet files and takes 8 seconds. The API gateway has a 5-second timeout. The agent sees a 504 error. They retry. Now two Athena queries are running.
Detection: Track cold-tier query latency as a separate SLI from hot/warm. Alert when p95 exceeds the API gateway timeout.
Fix: Never block a synchronous API on a cold-tier scan. Use an async query pattern: return a job ID immediately, let the client poll for results or subscribe to a notification. The UI shows "Loading historical data..." instead of a timeout error. For common cold-tier queries, pre-aggregate results into the warm tier on a nightly batch job so the live query never needs to touch S3.
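One shape the async pattern can take, sketched with plain threads and an in-memory job table; a real system would persist jobs, run them on a worker pool, and call Athena where the sleep stands in:

```python
import threading
import time
import uuid

jobs: dict[str, dict] = {}  # job_id -> {"status", "result"}


def run_cold_query(job_id: str, user_id: int, days: int) -> None:
    """Worker: executes the slow archive scan off the request path.
    time.sleep stands in for an Athena query over S3/Parquet."""
    time.sleep(8)
    jobs[job_id].update(status="done", result={"user_id": user_id, "days": days, "orders": []})


def submit_history_query(user_id: int, days: int) -> str:
    """API handler: returns a job ID immediately instead of blocking."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "running", "result": None}
    threading.Thread(target=run_cold_query, args=(job_id, user_id, days), daemon=True).start()
    return job_id


def poll_history_query(job_id: str) -> dict:
    """API handler the client polls; the UI shows 'Loading historical data...' until done."""
    return jobs.get(job_id, {"status": "not_found"})


# Client flow: submit, then poll until the status flips to "done".
jid = submit_history_query(user_id=123, days=1095)
print(poll_history_query(jid))  # {'status': 'running', 'result': None}
```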
Key Points
- Not all data deserves the same storage. Hot data sits in memory or on NVMe SSDs, warm data on cheaper SSDs, cold data on object storage. The cost difference between the fastest and cheapest tiers is two to three orders of magnitude
- The boundary between tiers is defined by access frequency and latency requirements, not by data age alone. A 3-year-old record queried daily is hot, not cold
- Promotion and demotion policies determine when data moves between tiers. Time-based is simplest, access-count-based is most accurate, and most teams use a hybrid
- Elasticsearch, ClickHouse, and Kafka all have built-in tiered storage. A custom query routing layer is not always necessary
- The query pattern matters as much as the storage tier. A cold-tier query that scans terabytes needs a different approach (async, pre-aggregated) than a hot-tier point lookup
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Elasticsearch ILM | Open Source | Log and event data lifecycle with automatic rollover, shrink, freeze, and delete | Medium-Enterprise |
| ClickHouse Tiered Storage | Open Source | Analytics data with volume-based policies, S3-backed MergeTree for cold partitions | Medium-Enterprise |
| Kafka Tiered Storage (KIP-405) | Open Source | Event log retention beyond broker disk, transparent S3 offload for old segments | Large-Enterprise |
| AWS S3 Intelligent-Tiering | Managed | Object storage with automatic access-pattern-based tiering, no retrieval fees | Small-Enterprise |
| Snowflake | Commercial | Transparent hot/warm/cold with auto-scaling compute per tier, zero admin | Medium-Enterprise |
Common Mistakes
- Tiering by age alone. A 3-year-old record that gets queried daily is hot, not cold. Measure access frequency before drawing tier boundaries
- No warm tier. Going straight from in-memory cache to S3 creates a latency cliff where responses jump from 1ms to 3 seconds with nothing in between
- Forgetting that cold-tier queries still need to be usable. Users do not care where the data lives. If the query takes 8 seconds, that is the experience
- Not measuring access patterns before choosing tier boundaries. Most teams guess wrong about what is hot. Instrument first, tier second
- Over-engineering tiering for small datasets. If everything fits on a single SSD, skip the complexity and just use one SSD