Caching Strategies
Why It Exists
Most applications follow a Pareto distribution: somewhere around 10-20% of the data serves 80%+ of traffic. Anyone who has watched a slow dashboard magically speed up after someone dropped Redis in front of the database has seen this in action.
Caching exploits two things: temporal locality (recently accessed data tends to be accessed again soon) and the fact that memory is orders of magnitude faster than disk. Serving a hot key from Redis takes well under a millisecond; the same query hitting Postgres might take 5-10ms on a good day. Multiply that difference across millions of requests and the savings in database load are enormous.
The catch? The system is now maintaining two copies of the truth. That is where every caching headache begins.
How It Works
The Four Caching Patterns
- Cache-aside (lazy loading): The application checks cache first. On a miss, it queries the database, then stuffs the result into cache. The application owns both the read and write paths. This is the most common pattern and the one I reach for by default because it gives full control; a minimal sketch follows this list. The tradeoff is that the application is responsible for invalidation logic.
- Read-through: The cache layer fetches from the database on a miss. The application only talks to the cache. Cleaner code, but now the caching layer has a hard dependency on the database client.
- Write-through: Every write goes to both cache and database synchronously. The cache is always consistent, but write latency takes a hit. Use this when reads dwarf writes by a wide margin.
- Write-behind (write-back): Writes land in cache immediately. The cache flushes to the database asynchronously. The fastest writes possible, but if the cache node dies before flushing, that data is gone. Only use this for workloads that can tolerate some data loss.
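A minimal cache-aside sketch in Python, assuming a local Redis on the default port via redis-py; the key layout, TTL, and the `load_user_from_db`/`write_user_to_db` helpers are illustrative stand-ins for the real database access:

```python
import json
import redis

# Assumes a Redis server on localhost:6379; key layout and TTL are illustrative.
r = redis.Redis(host="localhost", port=6379)
USER_TTL_SECONDS = 300

def get_user(user_id: int) -> dict:
    """Cache-aside read: check the cache, fall back to the database, then populate."""
    key = f"user:{user_id}:profile:v2"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)          # cache hit
    user = load_user_from_db(user_id)      # cache miss: hit the source of truth
    r.set(key, json.dumps(user), ex=USER_TTL_SECONDS)  # always set a TTL
    return user

def update_user(user_id: int, fields: dict) -> None:
    """Write path: update the database, then invalidate (not update) the cached copy."""
    write_user_to_db(user_id, fields)
    r.delete(f"user:{user_id}:profile:v2")  # next read repopulates the cache

def load_user_from_db(user_id: int) -> dict:
    # Placeholder for the real query (e.g. SELECT ... FROM users WHERE id = %s).
    return {"id": user_id, "name": "example"}

def write_user_to_db(user_id: int, fields: dict) -> None:
    # Placeholder for the real UPDATE statement.
    pass
```

Deleting on write rather than updating the cached value keeps the write path simple and avoids racing a concurrent read that might overwrite fresh data with a stale copy.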
Eviction Policies
| Policy | Behavior | Best For |
|---|---|---|
| LRU (Least Recently Used) | Evicts the item not accessed for the longest time | General purpose, most common |
| LFU (Least Frequently Used) | Evicts the item with the fewest accesses | Workloads with stable hot sets |
| FIFO | Evicts the oldest inserted item | Time-series or session data |
| TTL-based | Expires after a fixed duration | Data with known freshness requirements |
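To make LRU concrete, here is a toy in-process LRU cache built on Python's `OrderedDict`. It illustrates the eviction behavior described in the table, not a production implementation:

```python
from collections import OrderedDict

class LRUCache:
    """Evicts the least recently used entry once capacity is exceeded."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items: OrderedDict = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)         # mark as most recently used
        return self._items[key]

    def put(self, key, value) -> None:
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)   # drop the least recently used entry

cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" is now most recently used
cache.put("c", 3)    # evicts "b", the least recently used key
assert cache.get("b") is None
```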
Cache Stampede Prevention
When a popular cache key expires, hundreds of concurrent requests all miss at once and slam the database. This is the thundering herd problem, and it will absolutely take down the database without proper handling.
Three ways to handle it:
- Mutex/distributed lock: The first request grabs a lock, fetches from the DB, and populates cache. Everyone else waits or gets stale data (sketched after this list). Simple but adds lock contention.
- Stale-while-revalidate: Keep serving the expired value while a single background thread refreshes it. No one blocks. This is my preferred approach for most read-heavy workloads.
- Probabilistic early expiration: Each request has a small random chance of refreshing cache before the TTL actually expires, spreading the load over time. Clever, but harder to reason about.
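A sketch of the mutex approach using Redis `SET` with `NX` and an expiry as a best-effort distributed lock; key names, timeouts, and the retry loop are illustrative, and the `loader` callable stands in for the real database query:

```python
import json
import time
import redis

r = redis.Redis()

def get_with_lock(key: str, loader, ttl: int = 300, lock_ttl: int = 10):
    """On a miss, only one caller rebuilds the value; the rest briefly poll for it."""
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)

    lock_key = f"lock:{key}"
    # SET NX EX: acquire the lock only if nobody else holds it; the expiry is a safety net
    # in case the holder crashes before releasing.
    if r.set(lock_key, "1", nx=True, ex=lock_ttl):
        try:
            value = loader()                        # single caller hits the database
            r.set(key, json.dumps(value), ex=ttl)
            return value
        finally:
            r.delete(lock_key)

    # Lost the race: poll briefly for the winner's result instead of hitting the database.
    for _ in range(20):
        time.sleep(0.05)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    return loader()  # last resort: fall through to the database
```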
Multi-Tier Architecture
- L1 (in-process): Caffeine, Guava, or even a plain `Map` inside the application process. Sub-microsecond reads, zero network hops. The downside: limited by heap size, and each instance maintains its own copy, so they drift apart.
- L2 (distributed): Redis or Memcached. Sub-millisecond reads over the network. Shared across all instances. Scales independently of the app servers.
Reads check L1 first, then L2, then the database. Writes invalidate L1 across all nodes (via pub/sub or an event bus) and update L2. Getting the invalidation broadcast right is the hard part here.
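A rough sketch of that read path in Python, with a plain dict standing in for the in-process L1 and Redis as the shared L2; the pub/sub invalidation is only hinted at, and `load_from_db` is a placeholder:

```python
import json
import redis

l1: dict = {}      # in-process L1; a real service would use a bounded cache with eviction
r = redis.Redis()  # shared L2

def read(key: str):
    """L1 first, then L2, then the database; populate both tiers on the way back up."""
    if key in l1:
        return l1[key]                        # L1 hit: no network hop
    cached = r.get(key)
    if cached is not None:
        value = json.loads(cached)
        l1[key] = value                       # promote into L1
        return value
    value = load_from_db(key)                 # both tiers missed
    r.set(key, json.dumps(value), ex=300)
    l1[key] = value
    return value

def invalidate(key: str) -> None:
    """Writes delete L2 and broadcast so every instance drops its own L1 copy."""
    r.delete(key)
    r.publish("cache-invalidation", key)      # each instance subscribes and evicts locally

def load_from_db(key: str):
    return {"key": key}                       # placeholder for the real query
```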
Production Considerations
- Key design: Use structured, namespaced keys like `user:{id}:profile:v2`. That version segment matters. When the schema changes, nobody wants to deserialize corrupted data from stale keys. I have seen this bug take down a checkout flow.
- Cache warming: After a deploy or restart, pre-populate cache with the hottest keys before traffic arrives. Pull from an access frequency log with a background job. Skipping this step means the database eats a cold-start spike right when the deploy is supposed to be "zero downtime."
- Monitoring: Track hit ratio (aim for 95%+), eviction rate, and memory usage. A dropping hit ratio usually means the working set outgrew the cache. This is the most important cache metric, period.
- Graceful degradation: If Redis goes down, the app should bypass the cache and hit the database directly, not throw a 500 at the user. Put a circuit breaker around cache calls (see the fallback sketch after this list). Test this path regularly because it will break if never exercised.
- Serialization overhead: JSON is human-readable but slow. For serious throughput, switch to Protocol Buffers or MessagePack. Profile the serialization cost. I have seen cases where encoding/decoding takes longer than the actual cache lookup.
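A minimal version of the graceful-degradation idea from the list above: treat the cache as optional and swallow its connection errors, so a Redis outage degrades to slower reads rather than 500s. The timeouts are illustrative, and a real circuit breaker would also stop calling Redis for a cooldown window after repeated failures.

```python
import json
import logging
import redis

# Fail fast so a dead cache does not hang every request.
r = redis.Redis(socket_timeout=0.05, socket_connect_timeout=0.05)
log = logging.getLogger("cache")

def get_or_load(key: str, loader, ttl: int = 300):
    """Cache errors are logged and bypassed; the database remains the source of truth."""
    try:
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
    except redis.RedisError:
        log.warning("cache read failed for %s, falling back to database", key)

    value = loader()  # cache unavailable or miss: go straight to the database
    try:
        r.set(key, json.dumps(value), ex=ttl)
    except redis.RedisError:
        log.warning("cache write failed for %s, serving uncached", key)
    return value
```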
Failure Scenarios
Scenario 1: Cache Stampede on Cold Start. A Redis cluster comes back after maintenance with an empty cache. Thousands of app instances fire off their hot-key queries simultaneously. The database connection pool runs dry in seconds, P99 latency jumps from 5ms to 12 seconds, and the DB starts rejecting connections. Now there is a full outage caused by the caching layer coming back online, which is ironic. Detection: Watch the cache hit ratio. A drop below 50% combined with database connection pool above 90% utilization signals a stampede. Recovery: Build a cache warming script that pre-loads the top 10,000 keys from an access frequency log before routing production traffic to the cluster. Instagram runs a dedicated warming service that replays the last hour of access patterns before their memcached clusters accept real traffic.
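A sketch of the warming script described in that recovery step; `top_keys_from_access_log` and `load_from_db` are hypothetical helpers standing in for the access-frequency log and the real source-of-truth query:

```python
import json
import redis

r = redis.Redis()

def warm_cache(top_n: int = 10_000, ttl: int = 3600) -> None:
    """Pre-populate the hottest keys before the cluster takes production traffic."""
    pipe = r.pipeline(transaction=False)      # batch writes to avoid 10K round trips
    for key in top_keys_from_access_log(top_n):
        value = load_from_db(key)             # read from the source of truth
        pipe.set(key, json.dumps(value), ex=ttl)
    pipe.execute()

def top_keys_from_access_log(n: int):
    # Hypothetical: return the n most frequently accessed keys from the last hour's log.
    return []

def load_from_db(key: str):
    return {"key": key}                       # placeholder for the real query
```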
Scenario 2: Hot Key Imbalance. A product page goes viral and suddenly one Redis key is getting 15% of all cache reads. That key's shard handles 200K QPS while neighboring shards sit at 5K. The shard's network bandwidth saturates, and every other key on that shard gets slow as collateral damage. Detection: Track per-key access frequency with Redis OBJECT FREQ or client-side sampling. Alert when any single key crosses 1% of total cluster QPS. Recovery: Replicate the hot key across multiple shards with a suffix (product:123:r1, product:123:r2) and randomize reads across the replicas.
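A sketch of the hot-key replication trick: write the value under several suffixed copies so they land on different shards, then randomize reads across the replicas. The replica count and key format are illustrative.

```python
import json
import random
import redis

r = redis.Redis()
REPLICAS = 4  # number of copies of the hot key, spread across shards by the suffix

def set_hot(key: str, value, ttl: int = 60) -> None:
    """Write every replica so any of them can serve a read."""
    payload = json.dumps(value)
    for i in range(REPLICAS):
        r.set(f"{key}:r{i}", payload, ex=ttl)

def get_hot(key: str):
    """Read a random replica; successive requests land on different shards."""
    cached = r.get(f"{key}:r{random.randrange(REPLICAS)}")
    return json.loads(cached) if cached is not None else None
```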
Scenario 3: Silent Data Corruption from Stale Cache. A write-through setup has a subtle bug: the database write succeeds but the cache update fails silently on a network timeout. No TTL was set on the key, so the cache serves stale data forever. Users see outdated pricing for 6 hours before someone files a support ticket. This one is nasty because monitoring probably looks green the whole time. Detection: Run periodic reconciliation checks between cache and source of truth for critical keys. Track cache write failure rates separately from read metrics. Recovery: Always set a maximum TTL even on write-through keys as a safety net (1 hour is a reasonable default). Facebook ran into this class of bug in their TAO cache and built a lease mechanism. A cache miss returns a "lease token" that must be presented when populating the cache, which prevents stale writes from overwriting fresh data.
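A sketch of the periodic reconciliation check, assuming a small hypothetical set of critical keys and a `load_from_db` loader; mismatches are repaired and counted so a stale-cache bug surfaces in metrics instead of support tickets:

```python
import json
import redis

r = redis.Redis()
CRITICAL_KEYS = ["pricing:plan:basic", "pricing:plan:pro"]  # illustrative key names

def reconcile(max_ttl: int = 3600) -> int:
    """Compare cached values against the source of truth and repair any drift."""
    mismatches = 0
    for key in CRITICAL_KEYS:
        truth = load_from_db(key)
        cached = r.get(key)
        if cached is None or json.loads(cached) != truth:
            mismatches += 1
            r.set(key, json.dumps(truth), ex=max_ttl)  # repair, and cap the TTL as a safety net
    return mismatches  # export as a metric and alert when it is persistently nonzero

def load_from_db(key: str):
    return {"key": key}  # placeholder for the real query
```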
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| Hit ratio | < 95% | Increase cache size or review eviction policy |
| Memory utilization | > 75% | Scale cluster or evict aggressively |
| Eviction rate | > 1% of total keys/min | Working set exceeds capacity, add memory |
| P99 latency | > 2ms (Redis), > 5ms (Memcached) | Network saturation or hot key, investigate |
Real-world scale references: Facebook runs the largest known memcached deployment, with thousands of servers handling billions of requests per second at a 99%+ hit ratio for their TAO social graph cache. Discord operates 180+ Redis nodes processing 300M+ messages per day using consistent hashing. Netflix's EVCache (built on memcached) pushes 30M requests per second across 10,000+ instances with a median latency of 1ms.
Capacity formula: Required cache memory = working_set_size * overhead_factor, where overhead_factor is 1.3 to 1.5 for Redis (accounting for data structure overhead and fragmentation). Working set = number of unique keys accessed in one TTL window multiplied by average value size. Connection formula: max_connections = app_instances * connections_per_instance * 1.2. A single Redis instance can technically handle around 50K concurrent connections, but performance degrades past 10K, so use connection pooling. As a starting heuristic, budget about 1GB of cache per 5,000 QPS, then refine once real hit ratio data is available.
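Plugging illustrative numbers into the formulas above (20 million unique keys per TTL window, 2KB average value, a Redis overhead factor of 1.4, 40 app instances with 50 pooled connections each; every input is an assumption):

```python
# Illustrative capacity math using the formulas above.
unique_keys_per_ttl_window = 20_000_000
avg_value_bytes = 2_048
overhead_factor = 1.4            # Redis data-structure overhead and fragmentation

working_set_bytes = unique_keys_per_ttl_window * avg_value_bytes
required_memory_gb = working_set_bytes * overhead_factor / 2**30
print(f"required cache memory: {required_memory_gb:.1f} GB")   # ~53.4 GB -> plan multiple shards

app_instances = 40
connections_per_instance = 50
max_connections = app_instances * connections_per_instance * 1.2
print(f"connection headroom needed: {max_connections:.0f}")     # 2400, well under the 10K comfort zone
```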
Architecture Decision Record
Caching Technology Decision Matrix
| Criteria (Weight) | Redis | Memcached | In-Process (Caffeine) | CDN Cache |
|---|---|---|---|---|
| Data structure needs | Rich (hashes, sorted sets, streams) | Simple key-value only | Any Java object | HTTP responses only |
| Latency requirement | < 1ms network | < 1ms network | < 10μs (no network) | Varies by edge location |
| Consistency across instances | Shared state | Shared state | Each instance has own copy | Eventually consistent |
| Persistence needed | Yes (RDB/AOF) | No | No | No |
| Max dataset size | ~25GB per node (practical) | ~64GB per node | Limited by JVM heap (< 8GB) | Unlimited (distributed) |
Decision rules:
- Simple strings or blobs with maximum throughput needed: Go with Memcached. Its multi-threaded architecture handles more QPS per node than single-threaded Redis. At extreme scale, this difference matters.
- Data structures needed (sorted leaderboards, rate limiting counters, pub/sub): Redis. Nothing else comes close for this use case.
- Sub-millisecond latency is critical and data fits in memory (under 2GB working set per instance): Use an L1 in-process cache like Caffeine for the JVM or `lru-cache` for Node.js. Eliminate the network hop entirely. This is the single biggest latency win available.
- Dataset exceeds 100GB: Shard Redis with Redis Cluster (scales to 1000 nodes) or use client-side consistent hashing over standalone instances (see the hash ring sketch after these rules). I generally prefer standalone instances with client hashing because the operational model is simpler.
- Zero tolerance for cache downtime: Run Redis Sentinel or Redis Cluster with 3+ replicas per shard. For truly critical paths, consider dual-writing to both Redis and Memcached and reading from whichever is available.
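A minimal consistent-hash ring over standalone Redis instances, the client-side approach preferred above. The hosts and virtual-node count are illustrative, and a production client would add replication and failure handling:

```python
import bisect
import hashlib
import redis

class HashRing:
    """Maps keys onto a fixed set of nodes; adding or removing a node only remaps nearby keys."""

    def __init__(self, nodes, vnodes: int = 100):
        self._ring = []                        # sorted list of (hash, node) points
        for node in nodes:
            for i in range(vnodes):            # virtual nodes smooth out the distribution
                self._ring.append((self._hash(f"{node}-{i}"), node))
        self._ring.sort()
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

# Illustrative hosts; each maps to its own redis-py client.
hosts = ["cache-1:6379", "cache-2:6379", "cache-3:6379"]
ring = HashRing(hosts)
clients = {h: redis.Redis(host=h.split(":")[0], port=int(h.split(":")[1])) for h in hosts}

def get(key: str):
    return clients[ring.node_for(key)].get(key)  # route each key to its owning node
```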
Key Points
- Cuts database load and latency by serving hot data from in-memory stores instead of disk
- Cache-aside, read-through, write-through, write-behind: each pattern fits different read/write ratios
- Cache invalidation is genuinely hard. TTL, event-driven invalidation, and versioned keys are the main tools
- Cache stampede (thundering herd) hits when many requests miss at the same time. Use locking or stale-while-revalidate
- Multi-tier caching (L1 in-process, L2 distributed) trades off latency against consistency
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Redis | Open Source | Rich data structures, pub/sub, persistence options | Small-Enterprise |
| Memcached | Open Source | Simple key-value, multi-threaded, maximum throughput | Medium-Enterprise |
| Caffeine | Open Source | JVM in-process cache, near-optimal hit ratio | Small-Large |
| Hazelcast | Open Source | Distributed cache, embedded or client-server | Medium-Enterprise |
Common Mistakes
- Caching without setting TTL. Stale data sticks around forever
- Treating cache as a primary data store. Eviction means data loss
- Not handling cache failures. If Redis goes down, the app should fall back to the database, not crash
- Sloppy key design. Poorly namespaced keys cause silent data overwrites
- Skipping cache warming after deploys. A cold cache on restart hammers the database