CDN & Edge Computing
Why It Exists
Light in fiber takes about 150ms for a Tokyo-to-Virginia round trip, and there is no optimizing around physics. CDNs place copies of content at 200+ locations around the world so the nearest edge node can serve the request in 5-20ms instead. That is the entire value proposition, and it is a big one.
How It Works
- DNS Resolution - The user's DNS query resolves to the nearest edge PoP via Anycast or geo-DNS.
- Cache Lookup - The edge checks its local cache using a cache key built from the URL plus any Vary headers.
- Cache Hit - Resource comes straight from the edge. No origin contact at all. This is the fast path, around 5ms.
- Cache Miss - The edge fetches from origin (or origin shield), caches the response, then serves it. Subsequent requests hit the edge cache until the TTL expires or the object is evicted.
- Cache Invalidation - When content changes, either wait for TTL expiry or actively purge via API. Fastly can purge globally in about 150ms. CloudFront takes up to 60 seconds, which can feel like an eternity during an incident.
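To make the hit/miss flow above concrete, here is a minimal sketch of the same logic as a Cloudflare Worker using the Workers Cache API. The 300-second TTL is an illustrative assumption, not a recommendation:

```typescript
// Minimal sketch of the edge hit/miss flow (Cloudflare Workers Cache API).
// The 300s TTL is an illustrative assumption.
export default {
  async fetch(request: Request): Promise<Response> {
    const cache = caches.default;
    // Cache key: the request URL. Real setups also account for Vary headers.
    const cacheKey = new Request(request.url, request);

    // Cache hit: the fast path, served entirely from the edge.
    const hit = await cache.match(cacheKey);
    if (hit) return hit;

    // Cache miss: fetch from origin (or origin shield), cache, then serve.
    const originResponse = await fetch(cacheKey);
    const response = new Response(originResponse.body, originResponse);
    response.headers.set("Cache-Control", "public, max-age=300");
    await cache.put(cacheKey, response.clone());
    return response;
  },
};
```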
Edge Computing
Modern CDNs are not just cache boxes anymore. Edge Workers (Cloudflare Workers, Lambda@Edge, Deno Deploy) run JavaScript and WASM right at the edge:
- A/B testing - Route users to variants without a round-trip to origin. This alone saves hundreds of milliseconds per page load.
- Auth validation - Check JWTs at the edge and reject unauthorized requests before they ever touch the origin servers.
- Personalization - Inject user-specific content into cached page shells. The result is the speed of caching with the flexibility of dynamic content.
- API routing - Edge-side GraphQL resolution, request fan-out, and response aggregation.
The catch? Debugging edge workers is painful. The code runs in 200+ locations with limited observability. Start simple and add complexity only when the monitoring is there to support it.
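Starting simple might look like the A/B-testing case from the list above: assign a variant in a cookie and rewrite the path at the edge, with no origin round-trip for the assignment. The cookie name and variant path here are made up for illustration:

```typescript
// Sketch: edge A/B testing. The "ab_variant" cookie and "/variant-b" path
// are hypothetical names for illustration.
export default {
  async fetch(request: Request): Promise<Response> {
    const cookies = request.headers.get("Cookie") ?? "";
    let variant = cookies.match(/ab_variant=(a|b)/)?.[1];
    const isNewVisitor = !variant;
    if (!variant) variant = Math.random() < 0.5 ? "a" : "b";

    // Route variant B to an alternate path; variant A gets the default.
    const url = new URL(request.url);
    if (variant === "b" && url.pathname === "/") url.pathname = "/variant-b";
    const response = await fetch(new Request(url.toString(), request));

    if (isNewVisitor) {
      // Pin the assignment. Note: a response carrying Set-Cookie must not
      // be cached (see Common Mistakes below).
      const withCookie = new Response(response.body, response);
      withCookie.headers.append("Set-Cookie", `ab_variant=${variant}; Path=/`);
      return withCookie;
    }
    return response;
  },
};
```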
Production Considerations
- Cache hit ratio - Target above 90% for static assets. Monitor this in the CDN analytics dashboard. A low ratio probably means a cache key problem or too many unique URL parameters.
- Stale-while-revalidate - Serve stale content immediately while fetching fresh content in the background (see the sketch after this list). This is the best tradeoff for content that can tolerate brief staleness, which is most content.
- Origin shield - Pick one PoP as the "shield" between all edge nodes and the origin. This collapses N cache misses down to 1. Skip this, and the lesson comes the hard way during a traffic spike.
- Compression - Turn on Brotli at the edge (20-30% smaller than gzip). Cache compressed and uncompressed variants separately. Skipping this leaves free performance on the table.
- Security - The CDN doubles as a WAF: rate limiting, bot detection, DDoS absorption. Cloudflare absorbs multi-Tbps attacks at the edge. That is worth the monthly bill by itself.
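For the stale-while-revalidate item above, the behavior is controlled by a standard Cache-Control extension (RFC 5861). A small sketch, with illustrative TTL values:

```typescript
// Sketch: opt a response into stale-while-revalidate. Values are
// illustrative: fresh for 5 minutes, then for up to an hour serve the
// stale copy immediately while a single background fetch refreshes it.
function withStaleWhileRevalidate(response: Response): Response {
  const res = new Response(response.body, response);
  res.headers.set(
    "Cache-Control",
    "public, max-age=300, stale-while-revalidate=3600"
  );
  return res;
}
```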
Failure Scenarios
Scenario 1: Cache Poisoning via Host Header Injection. An attacker sends a request with a spoofed Host header. The CDN caches the response keyed on the URL but with malicious redirect content baked in. Every user hitting that edge PoP gets the poisoned response. This can affect millions of users for the duration of the TTL. Detection: look for anomalies on cache_store events where response headers contain unexpected Location values, and monitor for Host header mismatches in origin access logs. Recovery: immediately purge the affected cache keys (Fastly: ~150ms, CloudFront: up to 60s). The long-term fix is to normalize and validate Host headers at the origin, include Host in the cache key via Vary, and set Cache-Control: private for any personalized responses. I have seen this happen in production and it is genuinely scary how fast it spreads.
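A minimal sketch of the Host-header hardening described above, written as an edge worker. The allowlist is a hypothetical example:

```typescript
// Sketch: validate the Host header before anything can be cached or
// forwarded to origin. ALLOWED_HOSTS is a hypothetical allowlist.
const ALLOWED_HOSTS = new Set(["example.com", "www.example.com"]);

export default {
  async fetch(request: Request): Promise<Response> {
    const host = (request.headers.get("Host") ?? "").toLowerCase();
    if (!ALLOWED_HOSTS.has(host)) {
      // 421 Misdirected Request: never cached, never reaches origin.
      return new Response("Misdirected request", { status: 421 });
    }
    return fetch(request);
  },
};
```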
Scenario 2: Thundering Herd on Cache Expiration. Picture this: a viral content page has 500K concurrent readers, and the TTL expires simultaneously across 200+ PoPs. Each PoP independently requests the origin, generating a 200x amplification spike. The origin servers, provisioned for 2K RPS, suddenly get hit with 400K RPS in a 2-second window. The origin crashes. Cache misses start returning 502s. Those 502s get cached (negative caching), creating a secondary outage that persists even after origin recovers. Detection: watch for origin request rate spikes that correlate with TTL boundaries, and monitor origin_5xx_rate alongside cache_miss_rate. Recovery: enable origin shield (collapses 200 PoP requests into 1), use stale-while-revalidate (serve stale content while one request refreshes), and turn on request coalescing at the edge. Any high-traffic site without origin shield enabled should stop and go enable it right now.
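The request-coalescing idea from the recovery steps, as an in-process sketch: concurrent misses for the same key share one in-flight origin fetch instead of each issuing its own. CDN vendors expose this as a feature toggle; the sketch just shows the mechanism:

```typescript
// Sketch: request coalescing. Concurrent misses for one key await a
// single shared origin fetch rather than each hitting origin.
const inFlight = new Map<string, Promise<Response>>();

async function coalescedFetch(url: string): Promise<Response> {
  let pending = inFlight.get(url);
  if (!pending) {
    // .finally() clears the entry, which is exactly the cleanup the
    // leaking Worker in Scenario 3 below forgets to do.
    pending = fetch(url).finally(() => inFlight.delete(url));
    inFlight.set(url, pending);
  }
  // Each caller clones so every one gets its own readable body.
  return (await pending).clone();
}
```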
Scenario 3: Edge Worker Memory Leak. A Cloudflare Worker deployed globally has a subtle memory leak, say 2KB per request from a map that never gets cleared. At 50K RPS per PoP, the 128MB limit is hit within minutes. Workers start throwing MemoryError exceptions and all traffic falls through to origin. The origin, sized for the 5% of traffic that normally misses cache, collapses under the full load. Detection: Worker CPU time metrics approaching limits, sudden spike in origin_request_rate without any corresponding cache purge events. Recovery: rollback the Worker version via API, which propagates in under 30 seconds. Prevention: mandatory memory profiling in CI, canary Worker deployments to a single PoP before global rollout, and hard circuit breakers that bypass Workers when error rate exceeds 1%.
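For what the leak itself tends to look like: module scope in a Worker outlives individual requests, so an uncleared map accumulates until the isolate hits its memory limit. A hedged sketch with invented names:

```typescript
// Sketch of the leak: module-scope state persists across requests in a
// Worker isolate, so this map grows on every request and is never cleared.
const seenRequests = new Map<string, string>(); // BUG: unbounded growth

export default {
  async fetch(request: Request): Promise<Response> {
    seenRequests.set(crypto.randomUUID(), request.url); // never deleted
    return fetch(request);
  },
};
// Fix: keep per-request state inside fetch(), or bound the map with
// eviction so it cannot grow past a fixed size.
```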
Capacity Planning
Modern CDN PoPs serve 10-100 Gbps per location with hundreds of edge locations globally. Cloudflare has 310+ PoPs with ~280 Tbps total capacity. CloudFront operates 600+ PoPs. Akamai has 4,100+ PoPs with ~350 Tbps capacity.
| Metric | Target | Warning | Action |
|---|---|---|---|
| Cache hit ratio (static) | > 95% | < 90% | Review cache keys, extend TTLs |
| Cache hit ratio (dynamic) | > 60% | < 40% | Add edge-side personalization |
| Origin bandwidth | < 10% of edge | > 20% of edge | Enable origin shield, extend TTL |
| Edge latency (P50) | < 10ms | > 30ms | Check PoP routing, enable Anycast |
| Purge propagation | < 5s | > 30s | Evaluate CDN provider SLA |
Real-world numbers worth knowing: Netflix serves ~400 Gbps per Open Connect appliance (their custom CDN embedded in ISPs), totaling over 100 Tbps during peak. Stripe runs a multi-CDN strategy (Cloudflare + Fastly) for redundancy, failing over via DNS in under 60s. Capacity formula: origin_capacity = edge_traffic × (1 − cache_hit_ratio) × headroom_factor, with a headroom factor of 3. For a site serving 10 Gbps at the edge with a 95% hit ratio: 10 × 0.05 × 3 = 1.5 Gbps of origin capacity. Always plan for the 3x headroom. It will be needed.
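The capacity formula as a tiny helper, using the 3x headroom rule of thumb from above:

```typescript
// origin_capacity = edge_traffic × (1 − cache_hit_ratio) × headroom
function originCapacityGbps(
  edgeTrafficGbps: number,
  cacheHitRatio: number,
  headroom = 3
): number {
  return edgeTrafficGbps * (1 - cacheHitRatio) * headroom;
}

console.log(originCapacityGbps(10, 0.95)); // ≈ 1.5 Gbps
```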
Architecture Decision Record
ADR: CDN Strategy Selection
Context: Choosing between single-CDN, multi-CDN, and custom CDN (build-your-own) strategies. The right answer depends on traffic volume, latency requirements, and honestly, how many people are available to operate the thing.
| Criteria (Weight) | Single CDN (Cloudflare/CloudFront) | Multi-CDN (Cloudflare + Fastly) | Custom CDN (Netflix Open Connect) |
|---|---|---|---|
| Ops complexity (25%) | Low | Medium (traffic director needed) | Very high (hardware + software) |
| Latency (20%) | Good (PoP-dependent) | Best (route to fastest) | Best (ISP-embedded) |
| Cost at 10 TB/mo (20%) | ~$0-500 | ~$500-1000 | N/A at this scale |
| Cost at 10 PB/mo (20%) | ~$50K-200K | ~$80K-250K | ~$10K-30K (amortized) |
| Reliability (15%) | Single provider risk | Failover capable | Full control |
Decision framework:
- Traffic < 1 TB/month AND team < 20 engineers. Use a single CDN (Cloudflare free/pro or CloudFront). Operational simplicity wins here. Focus on getting cache-hit ratios above 90% before even thinking about anything else.
- Traffic 1-100 TB/month AND latency-sensitive global audience. Use a primary CDN with DNS-based failover to a secondary. Set up Real User Monitoring (RUM) to measure actual user latency per-PoP. Fastly + CloudFront is a solid pairing (instant purge + AWS integration).
- Traffic > 100 TB/month OR media streaming. Multi-CDN with an active traffic director (Citrix/NS1/Conviva). Route each request to the fastest or cheapest CDN based on real-time performance data. Hulu, Disney+, and Twitch all do this.
- Traffic > 1 PB/month AND predictable content catalog. Build or embed custom CDN appliances in major ISPs (the Netflix Open Connect model). The $10M+ upfront investment pays back at this scale, because CDN egress fees at petabyte volumes would exceed $1M/month. Realistically, fewer than 10 companies in the world need this.
Key Points
- Caches content at geographically distributed edge nodes close to users
- Cuts origin server load and drops latency by 50-90% for static assets
- Edge computing goes beyond caching. Run actual logic at the edge with Workers or Lambda@Edge
- Cache invalidation is the genuinely hard part. TTL, purge APIs, stale-while-revalidate all have tradeoffs
- Shield/origin-shield pattern prevents thundering herd on cache misses
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| CloudFront | Managed | AWS ecosystem, Lambda@Edge | Small-Enterprise |
| Cloudflare | Managed | Edge Workers, DDoS protection, massive PoP network | Small-Enterprise |
| Fastly | Managed | Instant purge, VCL customization, real-time logging | Medium-Enterprise |
| Akamai | Commercial | Largest network, media delivery, enterprise SLAs | Enterprise |
Common Mistakes
- Setting overly long TTLs without a purge strategy. Stale content ends up stuck globally
- Caching responses with Set-Cookie headers, which serves one user's session to another
- Not varying cache keys on relevant headers (Accept-Encoding, Accept-Language)
- Ignoring cache hit ratio metrics. A low hit ratio means the CDN is just adding latency for no benefit
- Skipping origin shield. Without it, N edge PoPs each independently hammer the origin on a miss