Rate Limiting & Throttling
Why It Exists
Here is the uncomfortable truth: without rate limiting, one bad client can take down an entire system. It does not even have to be an attacker. A buggy retry loop in a partner's integration will do it just fine.
Rate limiting is the first line of defense. It caps how many requests a client can make in a given window so that no single actor starves everyone else. It also protects downstream dependencies that cannot scale as fast as the edge.
How It Works
Token Bucket Algorithm
Start with a bucket of N tokens. Every request costs one token. Tokens refill at a fixed rate R per second. If the bucket is empty, the request gets rejected. Simple as that.
The nice property here is that it naturally allows bursts up to the bucket capacity while still enforcing a steady-state rate.
- Bucket capacity: 100 tokens
- Refill rate: 10 tokens/second
- Burst: 100 requests instantly, then 10/second sustained
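A minimal in-process sketch of the algorithm (type and parameter names here are illustrative, not from any particular library). Refill is computed lazily from elapsed time, so no background goroutine is needed:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket refills at rate tokens/second up to capacity.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64
	tokens   float64
	rate     float64 // refill rate, tokens per second
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, rate: rate, last: time.Now()}
}

// Allow spends one token if available, refilling first based on
// how much time has passed since the last call.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity // never exceed burst capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := NewTokenBucket(100, 10) // burst of 100, then 10/second sustained
	fmt.Println(bucket.Allow())       // true: bucket starts full
}
```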
Sliding Window Log
Store the timestamp of every request in the current window. Count entries to decide allow or deny. It delivers precise results, but it is memory-hungry because a timestamp is stored per request. For high-throughput APIs, this gets expensive fast.
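A sketch of the log-based approach, which makes the per-request memory cost visible (names are illustrative):

```go
package ratelimit

import (
	"sync"
	"time"
)

// SlidingWindowLog stores one timestamp per accepted request, which
// is what makes it precise and also what makes it memory-hungry.
type SlidingWindowLog struct {
	mu     sync.Mutex
	window time.Duration
	limit  int
	log    []time.Time
}

func (l *SlidingWindowLog) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	cutoff := now.Add(-l.window)
	// Evict timestamps that have aged out of the window.
	i := 0
	for i < len(l.log) && l.log[i].Before(cutoff) {
		i++
	}
	l.log = l.log[i:]
	if len(l.log) >= l.limit {
		return false // window full: deny
	}
	l.log = append(l.log, now)
	return true
}
```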
Sliding Window Counter
This is the hybrid approach most production systems actually use. It keeps counters for the current and previous fixed windows, then weights the previous window's counter by the overlap percentage. Good tradeoff between precision and memory.
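The weighting step is easier to see in code. A minimal sketch (function names are illustrative):

```go
package ratelimit

import "time"

// Estimate blends the previous window's count into the current one,
// weighted by how much of the sliding window still overlaps it.
// Example: 60s windows, 15s into the current window, prev=80, curr=30:
// estimate = 30 + 80*(45/60) = 90.
func Estimate(prevCount, currCount int, window, elapsedInCurrent time.Duration) float64 {
	overlap := 1 - elapsedInCurrent.Seconds()/window.Seconds()
	if overlap < 0 {
		overlap = 0
	}
	return float64(currCount) + float64(prevCount)*overlap
}

// Allowed is the decision: deny once the weighted estimate hits the limit.
func Allowed(prevCount, currCount, limit int, window, elapsed time.Duration) bool {
	return Estimate(prevCount, currCount, window, elapsed) < float64(limit)
}
```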
Fixed Window Counter
The simplest approach: count requests in fixed time windows (say, per minute). The catch is that a burst at the window boundary lets a client send 2x the intended limit. If that imprecision is acceptable, great. If not, use sliding window.
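A compact sketch for completeness; the comment marks the boundary-burst caveat:

```go
package ratelimit

import (
	"sync"
	"time"
)

// FixedWindow keeps a single counter per discrete window. The boundary
// problem: a client can spend the full limit just before the window
// rolls over and again just after, sending 2x the intended rate.
type FixedWindow struct {
	mu     sync.Mutex
	window time.Duration
	limit  int
	idx    int64 // index of the window the counter belongs to
	count  int
}

func (f *FixedWindow) Allow() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	idx := time.Now().UnixNano() / int64(f.window)
	if idx != f.idx {
		f.idx, f.count = idx, 0 // new window: reset the counter
	}
	if f.count >= f.limit {
		return false
	}
	f.count++
	return true
}
```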
Production Considerations
- Redis atomicity. Use Lua scripts for atomic check-and-increment. MULTI/EXEC transactions will not work here: they cannot branch on a value read mid-transaction, and the check and the increment must happen in a single atomic step (see the Lua sketch after this list).
- Clock synchronization. Distributed rate limiting needs roughly synchronized clocks. A few seconds of NTP drift is fine for most algorithms, but if clocks are minutes apart, weird behavior follows.
- Graceful degradation. If Redis goes down, fail open (allow traffic) rather than fail closed (reject everything). Yes, this means rate limiting is temporarily lost. That is better than a total outage. Monitor and alert on rate limiter availability to catch when this happens.
- Multi-dimensional limits. Rate limit by user ID, API key, IP, endpoint, and HTTP method independently. A user might get 1000 req/min for reads but only 100 req/min for writes. Different operations have different costs.
- Cost-based limiting. Not all requests are equal. A search query scanning millions of rows should cost more "tokens" than a simple key lookup. Weight expensive operations accordingly.
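Here is a sketch of the atomic check-and-increment pattern the first bullet refers to, written against the go-redis client with a fixed-window Lua script (key name, limit, and window are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// The check and the increment run inside Redis as one Lua script,
// so no other client can interleave between them.
var fixedWindowScript = redis.NewScript(`
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('PEXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
  return 0
end
return 1
`)

func allow(ctx context.Context, rdb *redis.Client, key string, limit int, window time.Duration) (bool, error) {
	res, err := fixedWindowScript.Run(ctx, rdb, []string{key}, limit, window.Milliseconds()).Int()
	if err != nil {
		return false, err
	}
	return res == 1, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ok, err := allow(context.Background(), rdb, "ratelimit:user:42", 100, time.Minute)
	fmt.Println(ok, err)
}
```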
Failure Scenarios
Scenario 1: Redis Failure, Fail-Open Stampede. The Redis cluster backing the distributed rate limiter hits a leader election during a network partition. This typically lasts 5 to 30 seconds. During that window, the rate limiter fails open (as configured) and all clients bypass limits. A single aggressive client or bot network pushes 50K RPS instead of its 100 RPM limit, overwhelming the backend database. Connection pool exhaustion cascades to every service sharing that database. Detection: monitor rate_limiter_redis_errors_total and rate_limiter_bypass_total, and alert when bypass rate exceeds 1% of total decisions. Recovery: implement a local fallback rate limiter (an in-process token bucket per instance) that kicks in when Redis is unavailable. Even imprecise per-instance limits (global_limit / instance_count) will prevent total collapse. Stripe and Shopify both use this layered pattern: local limiter as a safety net, Redis for precision.
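A sketch of that layered pattern, assuming the Redis check is a function like the Lua-based one above, with golang.org/x/time/rate as the per-instance safety net (the limiter math mirrors the global_limit / instance_count heuristic):

```go
package ratelimit

import (
	"context"

	"golang.org/x/time/rate"
)

// FallbackLimiter tries the precise Redis check first; if Redis is
// unreachable, it degrades to an imprecise per-instance token bucket
// instead of failing fully open.
type FallbackLimiter struct {
	redisAllow func(ctx context.Context, key string) (bool, error)
	local      *rate.Limiter
}

func NewFallbackLimiter(redisAllow func(context.Context, string) (bool, error), globalLimit float64, instanceCount int) *FallbackLimiter {
	perInstance := globalLimit / float64(instanceCount)
	return &FallbackLimiter{
		redisAllow: redisAllow,
		local:      rate.NewLimiter(rate.Limit(perInstance), int(perInstance)),
	}
}

func (f *FallbackLimiter) Allow(ctx context.Context, key string) bool {
	if ok, err := f.redisAllow(ctx, key); err == nil {
		return ok // Redis healthy: use the precise global decision
	}
	// Redis down: fall back to the local bucket. This is also the place
	// to increment rate_limiter_bypass_total so the degradation is visible.
	return f.local.Allow()
}
```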
Scenario 2: Rate Limit Bypass via API Key Rotation. A malicious actor creates free-tier API keys programmatically, each with its own 1000 req/hr limit. With 100 keys, they get 100K req/hr, effectively unlimited. Each key individually looks fine, so no alarms fire. Detection: monitor unique API keys per IP address (alert on more than 5 keys/IP/hour), track total requests per IP regardless of API key, and add velocity checks on key creation (more than 3 keys per account per day triggers review). Recovery: add IP-based rate limiting as a secondary layer, require email verification for API keys, and implement progressive rate reduction for new keys. Start them at 10% of the full limit and ramp over 24 hours.
Scenario 3: Retry Storm After Rate Limit Response. The API returns 429 responses without Retry-After headers. A popular third-party client library implements immediate retry with no backoff. At 10K users, a momentary limit breach triggers 10K retries within one second, which causes more 429s, which causes more retries. Exponential amplification. The system oscillates between overload and recovery for 5 to 10 minutes. Detection: monitor 429_response_rate and retry_request_ratio (requests with retry-indicative headers / total requests); alert when the retry ratio exceeds 30%. Recovery: always return a Retry-After header with jittered delay (base + random 0 to 5s), implement server-side backpressure by increasing retry delay under load, and add client identification to detect and block retry-storming clients.
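On the server side, the first recovery step is cheap to implement. A minimal sketch (the base delay is illustrative):

```go
package ratelimit

import (
	"fmt"
	"math/rand"
	"net/http"
)

// reject429 sends a rate-limit response with a jittered Retry-After
// (base + random 0-5s, as described above) so that clients honoring
// the header do not all come back in the same instant.
func reject429(w http.ResponseWriter, baseDelaySeconds int) {
	delay := baseDelaySeconds + rand.Intn(6) // base + 0..5s jitter
	w.Header().Set("Retry-After", fmt.Sprintf("%d", delay))
	http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
}
```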
Capacity Planning
Redis handles roughly 200K operations/second for simple INCR commands on a single instance (r6g.large). A Lua-script-based sliding window check costs about 2 to 3 operations per rate limit decision, so the result is around 70K rate limit decisions/second per Redis instance.
| Metric | Target | Warning | Action |
|---|---|---|---|
| Redis CPU (rate limit instance) | < 40% | > 60% | Scale Redis or shard by key prefix |
| Rate limit decision latency (P99) | < 2ms | > 10ms | Check Redis connection pool, network |
| False positive rate (legitimate 429s) | < 0.01% | > 0.1% | Widen limits, review limit granularity |
| Memory per rate limit key | ~100 bytes (counter) | N/A | Plan: 100B * unique_keys * 2 windows |
| Redis failover duration | < 5s | > 30s | Test Sentinel/cluster failover regularly |
Real-world numbers worth knowing:
- Stripe rate-limits at roughly 100 read / 100 write requests per second per key in production, using a multi-tier system (per-IP, per-API-key, per-merchant, and global).
- GitHub allows 5,000 requests/hour for authenticated API users with a token bucket algorithm and returns precise X-RateLimit-* headers.
- Capacity formula: redis_instances = (peak_rps * avg_rate_checks_per_request) / 70000. A service at 100K RPS with 2 rate-limit checks per request needs (100K * 2) / 70K = roughly 3 Redis instances (deploy 6 for HA with replicas).
Architecture Decision Record
ADR: Rate Limiting Architecture
Context: Where and how rate limits get enforced affects accuracy, latency, failure modes, and operational complexity. There is no single right answer, but there is a clear decision framework.
| Criteria (Weight) | Local In-Process | Centralized (Redis) | Edge (CDN/WAF) | Hybrid (Local + Redis) |
|---|---|---|---|---|
| Accuracy (30%) | Low (per-instance) | High (global count) | Medium (edge count) | High |
| Latency added (25%) | ~0ms | ~1-3ms (Redis RTT) | ~0ms (inline) | ~0-3ms |
| Failure mode (20%) | Fail-safe (no deps) | Fail-open risk | Fail-safe | Graceful degradation |
| Ops complexity (15%) | Lowest | Medium (Redis HA) | Low (managed) | Highest |
| Multi-tenant support (10%) | Poor | Excellent | Limited | Excellent |
Decision framework:
- Single service, under 10K RPM, no multi-tenant requirements. Use in-process rate limiting (Go rate.Limiter, Java Guava RateLimiter). Zero external dependencies, sub-microsecond overhead. The trade-off is that limits are per-instance, not global. This is fine for most internal services (see the sketch after this list).
- Multi-service, 10K to 500K RPM, API product with paying customers. Use Redis-backed centralized rate limiting. Deploy Redis Sentinel or Cluster for HA. Implement the local fallback pattern to degrade to per-instance limits if Redis goes down. This is what Stripe, GitHub, and Twilio do.
- Public-facing, over 500K RPM, DDoS/abuse is a primary concern. Layer edge rate limiting (Cloudflare Rate Limiting, AWS WAF) in front of application-level limits. Let the edge handle volumetric attacks (IP-based) while the application handles business-logic limits (per-user, per-resource). Edge plus Redis is the gold standard here.
- Multi-region, eventual consistency is acceptable. Use local rate limiting with periodic cross-region synchronization (for example, sync counters via Kafka every 5 seconds). Accept that limits may be exceeded by up to sync_interval * region_count in the worst case. This avoids cross-region Redis latency (50 to 100ms) on the hot path, which matters more than most teams expect at scale.
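For the first tier, a minimal sketch using Go's golang.org/x/time/rate, which implements a token bucket in-process (the numbers mirror the earlier token bucket example):

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// 10 requests/second sustained, bursts of 100. Per instance,
	// not global, which is the trade-off named above.
	limiter := rate.NewLimiter(rate.Limit(10), 100)
	if limiter.Allow() {
		fmt.Println("request allowed")
	} else {
		fmt.Println("429: slow down")
	}
}
```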
Key Points
- Caps request rates to protect services from abuse, DDoS, and resource exhaustion
- Token bucket, sliding window, and leaky bucket are the three core algorithms worth knowing
- Distributed rate limiting needs shared state (Redis). Local-only limits apply per instance, which is often not the intended behavior
- Always communicate limits through standard headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
- Set different limits per tier. Free users, paid users, and internal services should not share the same budget
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Redis + Lua | Open Source | Distributed counters, atomic operations | Medium-Enterprise |
| Envoy Rate Limit | Open Source | Service mesh integration, per-route limits | Large-Enterprise |
| Kong Rate Limiting | Open Source | API gateway plugin, Redis-backed | Medium-Enterprise |
| AWS WAF | Managed | Edge rate limiting, IP-based rules | Small-Enterprise |
Common Mistakes
- Using per-instance rate limiting instead of global. Clients can bypass it by hitting different instances
- Not separating authenticated and unauthenticated rate limits
- Setting limits too tight at launch, then throttling legitimate traffic before real usage data exists
- Skipping the Retry-After header. Clients retry immediately and the result is a thundering herd
- Rate limiting by IP only. Shared IPs behind NAT or corporate proxies punish every user behind them