Caching Strategies
Why It Exists
Most applications follow a Pareto distribution: somewhere around 10-20% of the data serves 80%+ of traffic. Anyone who has watched a slow dashboard magically speed up after someone dropped Redis in front of the database has seen this in action.
Caching exploits two things: temporal locality (recently accessed data tends to be accessed again soon) and the fact that memory is orders of magnitude faster than disk. Serving a hot key from Redis takes well under a millisecond; the same query hitting Postgres might take 5-10ms on a good day. Multiply that difference across millions of requests and the savings in database load are enormous.
The catch? The system is now maintaining two copies of the truth. That is where every caching headache begins.
How It Works
The Four Caching Patterns
- Cache-aside (lazy loading): The application checks cache first. On a miss, it queries the database, then stuffs the result into cache. The application owns both the read and write paths. This is the most common pattern and the one I reach for by default because it gives full control; a minimal sketch follows this list. The tradeoff is that the application is responsible for invalidation logic.
- Read-through: The cache layer fetches from the database on a miss. The application only talks to the cache. Cleaner code, but now the caching layer has a hard dependency on the database client.
- Write-through: Every write goes to both cache and database synchronously. The cache is always consistent, but write latency takes a hit. Use this when reads dwarf writes by a wide margin.
- Write-behind (write-back): Writes land in cache immediately. The cache flushes to the database asynchronously. The fastest writes possible, but if the cache node dies before flushing, that data is gone. Only use this for workloads that can tolerate some data loss.
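A minimal cache-aside sketch in Python, assuming a local Redis on the default port via redis-py; the key layout, TTL, and the `load_user_from_db`/`write_user_to_db` helpers are illustrative stand-ins for the real database access:

```python
import json
import redis

# Assumes a Redis server on localhost:6379; key layout and TTL are illustrative.
r = redis.Redis(host="localhost", port=6379)
USER_TTL_SECONDS = 300

def get_user(user_id: int) -> dict:
    """Cache-aside read: check the cache, fall back to the database, then populate."""
    key = f"user:{user_id}:profile:v2"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit
    user = load_user_from_db(user_id)      # cache miss: hit the source of truth
    r.set(key, json.dumps(user), ex=USER_TTL_SECONDS)  # always set a TTL
    return user

def update_user(user_id: int, fields: dict) -> None:
    """Write path: update the database, then invalidate (not update) the cached copy."""
    write_user_to_db(user_id, fields)
    r.delete(f"user:{user_id}:profile:v2")  # next read repopulates the cache

def load_user_from_db(user_id: int) -> dict:
    # Placeholder for the real query (e.g. SELECT ... FROM users WHERE id = %s).
    return {"id": user_id, "name": "example"}

def write_user_to_db(user_id: int, fields: dict) -> None:
    # Placeholder for the real UPDATE statement.
    pass
```

Deleting on write rather than updating the cached value keeps the write path simple and avoids racing a concurrent read that might overwrite fresh data with a stale copy.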
Eviction Policies
| Policy | Behavior | Best For |
|---|---|---|
| LRU (Least Recently Used) | Evicts the item not accessed for the longest time | General purpose, most common |
| LFU (Least Frequently Used) | Evicts the item with the fewest accesses | Workloads with stable hot sets |
| FIFO | Evicts the oldest inserted item | Time-series or session data |
| TTL-based | Expires after a fixed duration | Data with known freshness requirements |
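To make LRU concrete, here is a toy in-process LRU cache built on Python's `OrderedDict`. It illustrates the eviction behavior described in the table, not a production implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now most recently used
cache.put("c", 3)    # evicts "b", the least recently used key
assert cache.get("b") is None
```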
Cache Stampede Prevention
When a popular cache key expires, hundreds of concurrent requests all miss at once and slam the database. This is the thundering herd problem, and it will absolutely take down the database without proper handling.
Three ways to handle it:
- Mutex/distributed lock: The first request grabs a lock, fetches from the DB, and populates cache. Everyone else waits or gets stale data (sketched after this list). Simple but adds lock contention.
- Stale-while-revalidate: Keep serving the expired value while a single background thread refreshes it. No one blocks. This is my preferred approach for most read-heavy workloads.
- Probabilistic early expiration: Each request has a small random chance of refreshing cache before the TTL actually expires, spreading the load over time. Clever, but harder to reason about.
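A sketch of the mutex approach using Redis `SET` with `NX` and an expiry as a best-effort distributed lock; key names, timeouts, and the retry loop are illustrative, and the `loader` callable stands in for the real database query:

```python
import json
import time
import redis

r = redis.Redis()

def get_with_lock(key: str, loader, ttl: int = 300, lock_ttl: int = 10):
    """On a miss, only one caller rebuilds the value; the rest briefly poll for it."""
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    lock_key = f"lock:{key}"
    # SET NX EX: acquire the lock only if nobody else holds it; the expiry is a safety net
    # in case the holder crashes before releasing.
    if r.set(lock_key, "1", nx=True, ex=lock_ttl):
        try:
            value = loader()                        # single caller hits the database
            r.set(key, json.dumps(value), ex=ttl)
            return value
        finally:
            r.delete(lock_key)

    # Lost the race: poll briefly for the winner's result instead of hitting the database.
    for _ in range(20):
        time.sleep(0.05)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return loader()  # last resort: fall through to the database
```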
Multi-Tier Architecture
- L1 (in-process): Caffeine, Guava, or even a plain `Map` inside the application process. Sub-microsecond reads, zero network hops. The downside: limited by heap size, and each instance maintains its own copy, so they drift apart.
- L2 (distributed): Redis or Memcached. Sub-millisecond reads over the network. Shared across all instances. Scales independently of the app servers.
Reads check L1 first, then L2, then the database. Writes invalidate L1 across all nodes (via pub/sub or an event bus) and update L2. Getting the invalidation broadcast right is the hard part here.
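A rough sketch of that read path in Python, with a plain dict standing in for the in-process L1 and Redis as the shared L2; the pub/sub invalidation is only hinted at, and `load_from_db` is a placeholder:

```python
import json
import redis

l1: dict = {}      # in-process L1; a real service would use a bounded cache with eviction
r = redis.Redis()  # shared L2

def read(key: str):
    """L1 first, then L2, then the database; populate both tiers on the way back up."""
    if key in l1:
        return l1[key]                        # L1 hit: no network hop
    cached = r.get(key)
    if cached is not None:
        value = json.loads(cached)
        l1[key] = value                       # promote into L1
        return value
    value = load_from_db(key)                 # both tiers missed
    r.set(key, json.dumps(value), ex=300)
    l1[key] = value
    return value

def invalidate(key: str) -> None:
    """Writes delete L2 and broadcast so every instance drops its own L1 copy."""
    r.delete(key)
    r.publish("cache-invalidation", key)      # each instance subscribes and evicts locally

def load_from_db(key: str):
    return {"key": key}                       # placeholder for the real query
```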
Production Considerations
- Key design: Use structured, namespaced keys like `user:{id}:profile:v2`. That version segment matters. When the schema changes, nobody wants to deserialize corrupted data from stale keys. I have seen this bug take down a checkout flow.
- Cache warming: After a deploy or restart, pre-populate cache with the hottest keys before traffic arrives. Pull from an access frequency log with a background job. Skipping this step means the database eats a cold-start spike right when the deploy is supposed to be "zero downtime."
- Monitoring: Track hit ratio (aim for 95%+), eviction rate, and memory usage. A dropping hit ratio usually means the working set outgrew the cache. This is the most important cache metric, period.
- Graceful degradation: If Redis goes down, the app should bypass the cache and hit the database directly, not throw a 500 at the user. Put a circuit breaker around cache calls (see the fallback sketch after this list). Test this path regularly because it will break if never exercised.
- Serialization overhead: JSON is human-readable but slow. For serious throughput, switch to Protocol Buffers or MessagePack. Profile the serialization cost. I have seen cases where encoding/decoding takes longer than the actual cache lookup.
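A minimal version of the graceful-degradation idea from the list above: treat the cache as optional and swallow its connection errors, so a Redis outage degrades to slower reads rather than 500s. The timeouts are illustrative, and a real circuit breaker would also stop calling Redis for a cooldown window after repeated failures.

```python
import json
import logging
import redis

# Fail fast so a dead cache does not hang every request.
r = redis.Redis(socket_timeout=0.05, socket_connect_timeout=0.05)
log = logging.getLogger("cache")

def get_or_load(key: str, loader, ttl: int = 300):
    """Cache errors are logged and bypassed; the database remains the source of truth."""
    try:
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        log.warning("cache read failed for %s, falling back to database", key)

    value = loader()  # cache unavailable or miss: go straight to the database
    try:
        r.set(key, json.dumps(value), ex=ttl)
    except redis.RedisError:
        log.warning("cache write failed for %s, serving uncached", key)
    return value
```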
Failure Scenarios
Scenario 1: Cache Stampede on Cold Start. A Redis cluster comes back after maintenance with an empty cache. Thousands of app instances fire off their hot-key queries simultaneously. The database connection pool runs dry in seconds, P99 latency jumps from 5ms to 12 seconds, and the DB starts rejecting connections. Now there is a full outage caused by the caching layer coming back online, which is ironic. Detection: Watch the cache hit ratio. A drop below 50% combined with database connection pool above 90% utilization signals a stampede. Recovery: Build a cache warming script that pre-loads the top 10,000 keys from an access frequency log before routing production traffic to the cluster. Instagram runs a dedicated warming service that replays the last hour of access patterns before their memcached clusters accept real traffic.
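A sketch of the warming script described in that recovery step; `top_keys_from_access_log` and `load_from_db` are hypothetical helpers standing in for the access-frequency log and the real source-of-truth query:

```python
import json
import redis

r = redis.Redis()

def warm_cache(top_n: int = 10_000, ttl: int = 3600) -> None:
    """Pre-populate the hottest keys before the cluster takes production traffic."""
    pipe = r.pipeline(transaction=False)      # batch writes to avoid 10K round trips
    for key in top_keys_from_access_log(top_n):
        value = load_from_db(key)             # read from the source of truth
        pipe.set(key, json.dumps(value), ex=ttl)
    pipe.execute()

def top_keys_from_access_log(n: int):
    # Hypothetical: return the n most frequently accessed keys from the last hour's log.
    return []

def load_from_db(key: str):
    return {"key": key}                       # placeholder for the real query
```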
Scenario 2: Hot Key Imbalance. A product page goes viral and suddenly one Redis key is getting 15% of all cache reads. That key's shard handles 200K QPS while neighboring shards sit at 5K. The shard's network bandwidth saturates, and every other key on that shard gets slow as collateral damage. Detection: Track per-key access frequency with Redis OBJECT FREQ or client-side sampling. Alert when any single key crosses 1% of total cluster QPS. Recovery: Replicate the hot key across multiple shards with a suffix (product:123:r1, product:123:r2) and randomize reads across the replicas.
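A sketch of the hot-key replication trick: write the value under several suffixed copies so they land on different shards, then randomize reads across the replicas. The replica count and key format are illustrative.

```python
import json
import random
import redis

r = redis.Redis()
REPLICAS = 4  # number of copies of the hot key, spread across shards by the suffix

def set_hot(key: str, value, ttl: int = 60) -> None:
    """Write every replica so any of them can serve a read."""
    payload = json.dumps(value)
    for i in range(REPLICAS):
        r.set(f"{key}:r{i}", payload, ex=ttl)

def get_hot(key: str):
    """Read a random replica; successive requests land on different shards."""
    cached = r.get(f"{key}:r{random.randrange(REPLICAS)}")
    return json.loads(cached) if cached is not None else None
```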
Scenario 3: Silent Data Corruption from Stale Cache. A write-through setup has a subtle bug: the database write succeeds but the cache update fails silently on a network timeout. No TTL was set on the key, so the cache serves stale data forever. Users see outdated pricing for 6 hours before someone files a support ticket. This one is nasty because monitoring probably looks green the whole time. Detection: Run periodic reconciliation checks between cache and source of truth for critical keys. Track cache write failure rates separately from read metrics. Recovery: Always set a maximum TTL even on write-through keys as a safety net (1 hour is a reasonable default). Facebook ran into this class of bug in their TAO cache and built a lease mechanism. A cache miss returns a "lease token" that must be presented when populating the cache, which prevents stale writes from overwriting fresh data.
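A sketch of the periodic reconciliation check, assuming a small hypothetical set of critical keys and a `load_from_db` loader; mismatches are repaired and counted so a stale-cache bug surfaces in metrics instead of support tickets:

```python
import json
import redis

r = redis.Redis()
CRITICAL_KEYS = ["pricing:plan:basic", "pricing:plan:pro"]  # illustrative key names

def reconcile(max_ttl: int = 3600) -> int:
    """Compare cached values against the source of truth and repair any drift."""
    mismatches = 0
    for key in CRITICAL_KEYS:
        truth = load_from_db(key)
        cached = r.get(key)
        if cached is None or json.loads(cached) != truth:
            mismatches += 1
            r.set(key, json.dumps(truth), ex=max_ttl)  # repair, and cap the TTL as a safety net
    return mismatches  # export as a metric and alert when it is persistently nonzero

def load_from_db(key: str):
    return {"key": key}  # placeholder for the real query
```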
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Hit ratio | < 95% | Increase cache size or review eviction policy |
| Memory utilization | > 75% | Scale cluster or evict aggressively |
| Eviction rate | > 1% of total keys/min | Working set exceeds capacity, add memory |
| P99 latency | > 2ms (Redis), > 5ms (Memcached) | Network saturation or hot key, investigate |
Real-world scale references: Facebook runs the largest known memcached deployment, with thousands of servers handling billions of requests per second at a 99%+ hit ratio for their TAO social graph cache. Discord operates 180+ Redis nodes processing 300M+ messages per day using consistent hashing. Netflix's EVCache (built on memcached) pushes 30M requests per second across 10,000+ instances with a median latency of 1ms.
Capacity formula: Required cache memory = working_set_size * overhead_factor, where overhead_factor is 1.3 to 1.5 for Redis (accounting for data structure overhead and fragmentation). Working set = number of unique keys accessed in one TTL window multiplied by average value size. Connection formula: max_connections = app_instances * connections_per_instance * 1.2. A single Redis instance can technically handle around 50K concurrent connections, but performance degrades past 10K, so use connection pooling. As a starting heuristic, budget about 1GB of cache per 5,000 QPS, then refine once real hit ratio data is available.
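Plugging illustrative numbers into the formulas above (20 million unique keys per TTL window, 2KB average value, a Redis overhead factor of 1.4, 40 app instances with 50 pooled connections each; every input is an assumption):

```python
# Illustrative capacity math using the formulas above.
unique_keys_per_ttl_window = 20_000_000
avg_value_bytes = 2_048
overhead_factor = 1.4            # Redis data-structure overhead and fragmentation

working_set_bytes = unique_keys_per_ttl_window * avg_value_bytes
required_memory_gb = working_set_bytes * overhead_factor / 2**30
print(f"required cache memory: {required_memory_gb:.1f} GB")   # ~53.4 GB -> plan multiple shards

app_instances = 40
connections_per_instance = 50
max_connections = app_instances * connections_per_instance * 1.2
print(f"connection headroom needed: {max_connections:.0f}")     # 2400, well under the 10K comfort zone
```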
Architecture Decision Record
Caching Technology Decision Matrix
| Criteria (Weight) | Redis | Memcached | In-Process (Caffeine) | CDN Cache |
|---|---|---|---|---|
| Data structure needs | Rich (hashes, sorted sets, streams) | Simple key-value only | Any Java object | HTTP responses only |
| Latency requirement | < 1ms network | < 1ms network | < 10μs (no network) | Varies by edge location |
| Consistency across instances | Shared state | Shared state | Each instance has own copy | Eventually consistent |
| Persistence needed | Yes (RDB/AOF) | No | No | No |
| Max dataset size | ~25GB per node (practical) | ~64GB per node | Limited by JVM heap (< 8GB) | Unlimited (distributed) |
Decision rules:
- Simple strings or blobs with maximum throughput needed: Go with Memcached. Its multi-threaded architecture handles more QPS per node than single-threaded Redis. At extreme scale, this difference matters.
- Data structures needed (sorted leaderboards, rate limiting counters, pub/sub): Redis. Nothing else comes close for this use case.
- Sub-millisecond latency is critical and data fits in memory (under 2GB working set per instance): Use an L1 in-process cache like Caffeine for the JVM or `lru-cache` for Node.js. Eliminate the network hop entirely. This is the single biggest latency win available.
- Dataset exceeds 100GB: Shard Redis with Redis Cluster (scales to 1000 nodes) or use client-side consistent hashing over standalone instances (see the hash ring sketch after these rules). I generally prefer standalone instances with client hashing because the operational model is simpler.
- Zero tolerance for cache downtime: Run Redis Sentinel or Redis Cluster with 3+ replicas per shard. For truly critical paths, consider dual-writing to both Redis and Memcached and reading from whichever is available.
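A minimal consistent-hash ring over standalone Redis instances, the client-side approach preferred above. The hosts and virtual-node count are illustrative, and a production client would add replication and failure handling:

```python
import bisect
import hashlib
import redis

class HashRing:
    """Maps keys onto a fixed set of nodes; adding or removing a node only remaps nearby keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []                        # sorted list of (hash, node) points
        for node in nodes:
            for i in range(vnodes):            # virtual nodes smooth out the distribution
                self._ring.append((self._hash(f"{node}-{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

# Illustrative hosts; each maps to its own redis-py client.
hosts = ["cache-1:6379", "cache-2:6379", "cache-3:6379"]
ring = HashRing(hosts)
clients = {h: redis.Redis(host=h.split(":")[0], port=int(h.split(":")[1])) for h in hosts}

def get(key: str):
    return clients[ring.node_for(key)].get(key)  # route each key to its owning node
```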
Key Points
- Cuts database load and latency by serving hot data from in-memory stores instead of disk
- Cache-aside, read-through, write-through, write-behind: each pattern fits different read/write ratios
- Cache invalidation is genuinely hard. TTL, event-driven invalidation, and versioned keys are the main tools
- Cache stampede (thundering herd) hits when many requests miss at the same time. Use locking or stale-while-revalidate
- Multi-tier caching (L1 in-process, L2 distributed) trades off latency against consistency
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Redis | Open Source | Rich data structures, pub/sub, persistence options | Small-Enterprise |
| Memcached | Open Source | Simple key-value, multi-threaded, maximum throughput | Medium-Enterprise |
| Caffeine | Open Source | JVM in-process cache, near-optimal hit ratio | Small-Large |
| Hazelcast | Open Source | Distributed cache, embedded or client-server | Medium-Enterprise |
Common Mistakes
- Caching without setting TTL. Stale data sticks around forever
- Treating cache as a primary data store. Eviction means data loss
- Not handling cache failures. If Redis goes down, the app should fall back to the database, not crash
- Sloppy key design. Poorly namespaced keys cause silent data overwrites
- Skipping cache warming after deploys. A cold cache on restart hammers the database