CDN & Edge Computing
Why It Exists
Light in fiber takes about 150ms for a Tokyo-to-Virginia round trip, and there is no optimizing around physics. CDNs place copies of content at 200+ locations around the world so the nearest edge node can serve the request in 5-20ms instead. That is the entire value proposition, and it is a big one.
How It Works
- DNS Resolution - The user's DNS query resolves to the nearest edge PoP via Anycast or geo-DNS.
- Cache Lookup - The edge checks its local cache using a cache key built from the URL plus any Vary headers.
- Cache Hit - Resource comes straight from the edge. No origin contact at all. This is the fast path, around 5ms.
- Cache Miss - The edge fetches from origin (or origin shield), caches the response, then serves it. Subsequent requests hit the edge cache until the TTL expires or the object is evicted.
- Cache Invalidation - When content changes, either wait for TTL expiry or actively purge via API. Fastly can purge globally in about 150ms. CloudFront takes up to 60 seconds, which can feel like an eternity during an incident.
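To make the hit/miss flow above concrete, here is a minimal sketch of the same logic as a Cloudflare Worker using the Workers Cache API. The 300-second TTL is an illustrative assumption, not a recommendation:

```typescript
// Minimal sketch of the edge hit/miss flow (Cloudflare Workers Cache API).
// The 300s TTL is an illustrative assumption.
export default {
  async fetch(request: Request): Promise<Response> {
    const cache = caches.default;
    // Cache key: the request URL. Real setups also account for Vary headers.
    const cacheKey = new Request(request.url, request);

    // Cache hit: the fast path, served entirely from the edge.
    const hit = await cache.match(cacheKey);
    if (hit) return hit;

    // Cache miss: fetch from origin (or origin shield), cache, then serve.
    const originResponse = await fetch(cacheKey);
    const response = new Response(originResponse.body, originResponse);
    response.headers.set("Cache-Control", "public, max-age=300");
    await cache.put(cacheKey, response.clone());
    return response;
  },
};
```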
Edge Computing
Modern CDNs are not just cache boxes anymore. Edge Workers (Cloudflare Workers, Lambda@Edge, Deno Deploy) run JavaScript and WASM right at the edge:
- A/B testing - Route users to variants without a round-trip to origin. This alone saves hundreds of milliseconds per page load.
- Auth validation - Check JWTs at the edge and reject unauthorized requests before they ever touch the origin servers.
- Personalization - Inject user-specific content into cached page shells. The result is the speed of caching with the flexibility of dynamic content.
- API routing - Edge-side GraphQL resolution, request fan-out, and response aggregation.
The catch? Debugging edge workers is painful. The code runs in 200+ locations with limited observability. Start simple and add complexity only when the monitoring is there to support it.
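Starting simple might look like the A/B-testing case from the list above: assign a variant in a cookie and rewrite the path at the edge, with no origin round-trip for the assignment. The cookie name and variant path here are made up for illustration:

```typescript
// Sketch: edge A/B testing. The "ab_variant" cookie and "/variant-b" path
// are hypothetical names for illustration.
export default {
  async fetch(request: Request): Promise<Response> {
    const cookies = request.headers.get("Cookie") ?? "";
    let variant = cookies.match(/ab_variant=(a|b)/)?.[1];
    const isNewVisitor = !variant;
    if (!variant) variant = Math.random() < 0.5 ? "a" : "b";

    // Route variant B to an alternate path; variant A gets the default.
    const url = new URL(request.url);
    if (variant === "b" && url.pathname === "/") url.pathname = "/variant-b";
    const response = await fetch(new Request(url.toString(), request));

    if (isNewVisitor) {
      // Pin the assignment. Note: a response carrying Set-Cookie must not
      // be cached (see Common Mistakes below).
      const withCookie = new Response(response.body, response);
      withCookie.headers.append("Set-Cookie", `ab_variant=${variant}; Path=/`);
      return withCookie;
    }
    return response;
  },
};
```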
Production Considerations
- Cache hit ratio - Target above 90% for static assets. Monitor this in the CDN analytics dashboard. A low ratio probably means a cache key problem or too many unique URL parameters.
- Stale-while-revalidate - Serve stale content immediately while fetching fresh content in the background (see the sketch after this list). This is the best tradeoff for content that can tolerate brief staleness, which is most content.
- Origin shield - Pick one PoP as the "shield" between all edge nodes and the origin. This collapses N cache misses down to 1. Skip this, and the lesson comes the hard way during a traffic spike.
- Compression - Turn on Brotli at the edge (20-30% smaller than gzip). Cache compressed and uncompressed variants separately. Skipping this leaves free performance on the table.
- Security - The CDN doubles as a WAF: rate limiting, bot detection, DDoS absorption. Cloudflare absorbs multi-Tbps attacks at the edge. That is worth the monthly bill by itself.
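For the stale-while-revalidate item above, the behavior is controlled by a standard Cache-Control extension (RFC 5861). A small sketch, with illustrative TTL values:

```typescript
// Sketch: opt a response into stale-while-revalidate. Values are
// illustrative: fresh for 5 minutes, then for up to an hour serve the
// stale copy immediately while a single background fetch refreshes it.
function withStaleWhileRevalidate(response: Response): Response {
  const res = new Response(response.body, response);
  res.headers.set(
    "Cache-Control",
    "public, max-age=300, stale-while-revalidate=3600"
  );
  return res;
}
```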
Failure Scenarios
Scenario 1: Cache Poisoning via Host Header Injection. An attacker sends a request with a spoofed Host header. The CDN caches the response keyed on the URL but with malicious redirect content baked in. Every user hitting that edge PoP gets the poisoned response. This can affect millions of users for the duration of the TTL. Detection: look for anomalies on cache_store events where response headers contain unexpected Location values, and monitor for Host header mismatches in origin access logs. Recovery: immediately purge the affected cache keys (Fastly: ~150ms, CloudFront: up to 60s). The long-term fix is to normalize and validate Host headers at the origin, include Host in the cache key via Vary, and set Cache-Control: private for any personalized responses. I have seen this happen in production and it is genuinely scary how fast it spreads.
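A minimal sketch of the Host-header hardening described above, written as an edge worker. The allowlist is a hypothetical example:

```typescript
// Sketch: validate the Host header before anything can be cached or
// forwarded to origin. ALLOWED_HOSTS is a hypothetical allowlist.
const ALLOWED_HOSTS = new Set(["example.com", "www.example.com"]);

export default {
  async fetch(request: Request): Promise<Response> {
    const host = (request.headers.get("Host") ?? "").toLowerCase();
    if (!ALLOWED_HOSTS.has(host)) {
      // 421 Misdirected Request: never cached, never reaches origin.
      return new Response("Misdirected request", { status: 421 });
    }
    return fetch(request);
  },
};
```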
Scenario 2: Thundering Herd on Cache Expiration. Picture this: a viral content page has 500K concurrent readers, and the TTL expires simultaneously across 200+ PoPs. Each PoP independently requests the origin, generating a 200x amplification spike. The origin servers, provisioned for 2K RPS, suddenly get hit with 400K RPS in a 2-second window. The origin crashes. Cache misses start returning 502s. Those 502s get cached (negative caching), creating a secondary outage that persists even after origin recovers. Detection: watch for origin request rate spikes that correlate with TTL boundaries, and monitor origin_5xx_rate alongside cache_miss_rate. Recovery: enable origin shield (collapses 200 PoP requests into 1), use stale-while-revalidate (serve stale content while one request refreshes), and turn on request coalescing at the edge. Any high-traffic site without origin shield enabled should stop and go enable it right now.
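The request-coalescing idea from the recovery steps, as an in-process sketch: concurrent misses for the same key share one in-flight origin fetch instead of each issuing its own. CDN vendors expose this as a feature toggle; the sketch just shows the mechanism:

```typescript
// Sketch: request coalescing. Concurrent misses for one key await a
// single shared origin fetch rather than each hitting origin.
const inFlight = new Map<string, Promise<Response>>();

async function coalescedFetch(url: string): Promise<Response> {
  let pending = inFlight.get(url);
  if (!pending) {
    // .finally() clears the entry, which is exactly the cleanup the
    // leaking Worker in Scenario 3 below forgets to do.
    pending = fetch(url).finally(() => inFlight.delete(url));
    inFlight.set(url, pending);
  }
  // Each caller clones so every one gets its own readable body.
  return (await pending).clone();
}
```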
Scenario 3: Edge Worker Memory Leak. A Cloudflare Worker deployed globally has a subtle memory leak, say 2KB per request from a map that never gets cleared. At 50K RPS per PoP, the 128MB limit is hit within minutes. Workers start throwing MemoryError exceptions and all traffic falls through to origin. The origin, sized for the 5% of traffic that normally misses cache, collapses under the full load. Detection: Worker CPU time metrics approaching limits, sudden spike in origin_request_rate without any corresponding cache purge events. Recovery: rollback the Worker version via API, which propagates in under 30 seconds. Prevention: mandatory memory profiling in CI, canary Worker deployments to a single PoP before global rollout, and hard circuit breakers that bypass Workers when error rate exceeds 1%.
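For what the leak itself tends to look like: module scope in a Worker outlives individual requests, so an uncleared map accumulates until the isolate hits its memory limit. A hedged sketch with invented names:

```typescript
// Sketch of the leak: module-scope state persists across requests in a
// Worker isolate, so this map grows on every request and is never cleared.
const seenRequests = new Map<string, string>(); // BUG: unbounded growth

export default {
  async fetch(request: Request): Promise<Response> {
    seenRequests.set(crypto.randomUUID(), request.url); // never deleted
    return fetch(request);
  },
};
// Fix: keep per-request state inside fetch(), or bound the map with
// eviction so it cannot grow past a fixed size.
```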
Capacity Planning
Modern CDN PoPs serve 10-100 Gbps per location with hundreds of edge locations globally. Cloudflare has 310+ PoPs with ~280 Tbps total capacity. CloudFront operates 600+ PoPs. Akamai has 4,100+ PoPs with ~350 Tbps capacity.
| Metric | Target | Warning | Action |
|---|---|---|---|
| Cache hit ratio (static) | > 95% | < 90% | Review cache keys, extend TTLs |
| Cache hit ratio (dynamic) | > 60% | < 40% | Add edge-side personalization |
| Origin bandwidth | < 10% of edge | > 20% of edge | Enable origin shield, extend TTL |
| Edge latency (P50) | < 10ms | > 30ms | Check PoP routing, enable Anycast |
| Purge propagation | < 5s | > 30s | Evaluate CDN provider SLA |
Real-world numbers worth knowing: Netflix serves ~400 Gbps per Open Connect appliance (their custom CDN embedded in ISPs), totaling over 100 Tbps during peak. Stripe runs a multi-CDN strategy (Cloudflare + Fastly) for redundancy, failing over via DNS in under 60s. Capacity formula: origin_capacity = edge_traffic × (1 − cache_hit_ratio) × headroom_factor, with a headroom factor of 3. For a site serving 10 Gbps at the edge with a 95% hit ratio: 10 × 0.05 × 3 = 1.5 Gbps of origin capacity. Always plan for the 3x headroom. It will be needed.
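The capacity formula as a tiny helper, using the 3x headroom rule of thumb from above:

```typescript
// origin_capacity = edge_traffic × (1 − cache_hit_ratio) × headroom
function originCapacityGbps(
  edgeTrafficGbps: number,
  cacheHitRatio: number,
  headroom = 3
): number {
  return edgeTrafficGbps * (1 - cacheHitRatio) * headroom;
}

console.log(originCapacityGbps(10, 0.95)); // ≈ 1.5 Gbps
```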
Architecture Decision Record
ADR: CDN Strategy Selection
Context: Choosing between single-CDN, multi-CDN, and custom CDN (build-your-own) strategies. The right answer depends on traffic volume, latency requirements, and honestly, how many people are available to operate the thing.
| Criteria (Weight) | Single CDN (Cloudflare/CloudFront) | Multi-CDN (Cloudflare + Fastly) | Custom CDN (Netflix Open Connect) |
|---|---|---|---|
| Ops complexity (25%) | Low | Medium (traffic director needed) | Very high (hardware + software) |
| Latency (20%) | Good (PoP-dependent) | Best (route to fastest) | Best (ISP-embedded) |
| Cost at 10 TB/mo (20%) | ~$0-500 | ~$500-1000 | N/A at this scale |
| Cost at 10 PB/mo (20%) | ~$50K-200K | ~$80K-250K | ~$10K-30K (amortized) |
| Reliability (15%) | Single provider risk | Failover capable | Full control |
Decision framework:
- Traffic < 1 TB/month AND team < 20 engineers. Use a single CDN (Cloudflare free/pro or CloudFront). Operational simplicity wins here. Focus on getting cache-hit ratios above 90% before even thinking about anything else.
- Traffic 1-100 TB/month AND latency-sensitive global audience. Use a primary CDN with DNS-based failover to a secondary. Set up Real User Monitoring (RUM) to measure actual user latency per-PoP. Fastly + CloudFront is a solid pairing (instant purge + AWS integration).
- Traffic > 100 TB/month OR media streaming. Multi-CDN with an active traffic director (Citrix/NS1/Conviva). Route each request to the fastest or cheapest CDN based on real-time performance data. Hulu, Disney+, and Twitch all do this.
- Traffic > 1 PB/month AND predictable content catalog. Build or embed custom CDN appliances in major ISPs (the Netflix Open Connect model). The $10M+ upfront investment pays back at this scale, because CDN egress fees at petabyte volumes would exceed $1M/month. Realistically, fewer than 10 companies in the world need this.
Key Points
- Caches content at geographically distributed edge nodes close to users
- Cuts origin server load and drops latency by 50-90% for static assets
- Edge computing goes beyond caching. Run actual logic at the edge with Workers or Lambda@Edge
- Cache invalidation is the genuinely hard part. TTL, purge APIs, stale-while-revalidate all have tradeoffs
- Shield/origin-shield pattern prevents thundering herd on cache misses
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| CloudFront | Managed | AWS ecosystem, Lambda@Edge | Small-Enterprise |
| Cloudflare | Managed | Edge Workers, DDoS protection, massive PoP network | Small-Enterprise |
| Fastly | Managed | Instant purge, VCL customization, real-time logging | Medium-Enterprise |
| Akamai | Commercial | Largest network, media delivery, enterprise SLAs | Enterprise |
Common Mistakes
- Setting overly long TTLs without a purge strategy. Stale content ends up stuck globally
- Caching responses with Set-Cookie headers, which serves one user's session to another
- Not varying cache keys on relevant headers (Accept-Encoding, Accept-Language)
- Ignoring cache hit ratio metrics. A low hit ratio means the CDN is just adding latency for no benefit
- Skipping origin shield. Without it, N edge PoPs each independently hammer the origin on a miss