Rate Limiting & Throttling
Why It Exists
Here is the uncomfortable truth: without rate limiting, one bad client can take down an entire system. It does not even have to be an attacker. A buggy retry loop in a partner's integration will do it just fine.
Rate limiting is the first line of defense. It caps how many requests a client can make in a given window so that no single actor starves everyone else. It also protects downstream dependencies that cannot scale as fast as the edge.
How It Works
Token Bucket Algorithm
Start with a bucket of N tokens. Every request costs one token. Tokens refill at a fixed rate R per second. If the bucket is empty, the request gets rejected. Simple as that.
The nice property here is that it naturally allows bursts up to the bucket capacity while still enforcing a steady-state rate.
- Bucket capacity: 100 tokens
- Refill rate: 10 tokens/second
- Burst: 100 requests instantly, then 10/second sustained
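A minimal in-process sketch of the algorithm (type and parameter names here are illustrative, not from any particular library). Refill is computed lazily from elapsed time, so no background goroutine is needed:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket refills at rate tokens/second up to capacity.
type TokenBucket struct {
	mu       sync.Mutex
	capacity float64
	tokens   float64
	rate     float64 // refill rate, tokens per second
	last     time.Time
}

func NewTokenBucket(capacity, rate float64) *TokenBucket {
	return &TokenBucket{capacity: capacity, tokens: capacity, rate: rate, last: time.Now()}
}

// Allow spends one token if available, refilling first based on
// how much time has passed since the last call.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity // never exceed burst capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	bucket := NewTokenBucket(100, 10) // burst of 100, then 10/second sustained
	fmt.Println(bucket.Allow())       // true: bucket starts full
}
```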
Sliding Window Log
Store the timestamp of every request in the current window. Count entries to decide allow or deny. It delivers precise results, but it is memory-hungry because a timestamp is stored per request. For high-throughput APIs, this gets expensive fast.
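A sketch of the log-based approach, which makes the per-request memory cost visible (names are illustrative):

```go
package ratelimit

import (
	"sync"
	"time"
)

// SlidingWindowLog stores one timestamp per accepted request, which
// is what makes it precise and also what makes it memory-hungry.
type SlidingWindowLog struct {
	mu     sync.Mutex
	window time.Duration
	limit  int
	log    []time.Time
}

func (l *SlidingWindowLog) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	cutoff := now.Add(-l.window)
	// Evict timestamps that have aged out of the window.
	i := 0
	for i < len(l.log) && l.log[i].Before(cutoff) {
		i++
	}
	l.log = l.log[i:]
	if len(l.log) >= l.limit {
		return false // window full: deny
	}
	l.log = append(l.log, now)
	return true
}
```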
Sliding Window Counter
This is the hybrid approach most production systems actually use. It keeps counters for the current and previous fixed windows, then weights the previous window's counter by the overlap percentage. Good tradeoff between precision and memory.
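The weighting step is easier to see in code. A minimal sketch (function names are illustrative):

```go
package ratelimit

import "time"

// Estimate blends the previous window's count into the current one,
// weighted by how much of the sliding window still overlaps it.
// Example: 60s windows, 15s into the current window, prev=80, curr=30:
// estimate = 30 + 80*(45/60) = 90.
func Estimate(prevCount, currCount int, window, elapsedInCurrent time.Duration) float64 {
	overlap := 1 - elapsedInCurrent.Seconds()/window.Seconds()
	if overlap < 0 {
		overlap = 0
	}
	return float64(currCount) + float64(prevCount)*overlap
}

// Allowed is the decision: deny once the weighted estimate hits the limit.
func Allowed(prevCount, currCount, limit int, window, elapsed time.Duration) bool {
	return Estimate(prevCount, currCount, window, elapsed) < float64(limit)
}
```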
Fixed Window Counter
The simplest approach: count requests in fixed time windows (say, per minute). The catch is that a burst at the window boundary lets a client send 2x the intended limit. If that imprecision is acceptable, great. If not, use sliding window.
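A compact sketch for completeness; the comment marks the boundary-burst caveat:

```go
package ratelimit

import (
	"sync"
	"time"
)

// FixedWindow keeps a single counter per discrete window. The boundary
// problem: a client can spend the full limit just before the window
// rolls over and again just after, sending 2x the intended rate.
type FixedWindow struct {
	mu     sync.Mutex
	window time.Duration
	limit  int
	idx    int64 // index of the window the counter belongs to
	count  int
}

func (f *FixedWindow) Allow() bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	idx := time.Now().UnixNano() / int64(f.window)
	if idx != f.idx {
		f.idx, f.count = idx, 0 // new window: reset the counter
	}
	if f.count >= f.limit {
		return false
	}
	f.count++
	return true
}
```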
Production Considerations
- Redis atomicity. Use Lua scripts for atomic check-and-increment. MULTI/EXEC transactions will not work here: they cannot branch on a value read mid-transaction, and the check and the increment must happen in a single atomic step (see the Lua sketch after this list).
- Clock synchronization. Distributed rate limiting needs roughly synchronized clocks. A few seconds of NTP drift is fine for most algorithms, but if clocks are minutes apart, weird behavior follows.
- Graceful degradation. If Redis goes down, fail open (allow traffic) rather than fail closed (reject everything). Yes, this means rate limiting is temporarily lost. That is better than a total outage. Monitor and alert on rate limiter availability to catch when this happens.
- Multi-dimensional limits. Rate limit by user ID, API key, IP, endpoint, and HTTP method independently. A user might get 1000 req/min for reads but only 100 req/min for writes. Different operations have different costs.
- Cost-based limiting. Not all requests are equal. A search query scanning millions of rows should cost more "tokens" than a simple key lookup. Weight expensive operations accordingly.
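Here is a sketch of the atomic check-and-increment pattern the first bullet refers to, written against the go-redis client with a fixed-window Lua script (key name, limit, and window are illustrative):

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// The check and the increment run inside Redis as one Lua script,
// so no other client can interleave between them.
var fixedWindowScript = redis.NewScript(`
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('PEXPIRE', KEYS[1], ARGV[2])
end
if count > tonumber(ARGV[1]) then
  return 0
end
return 1
`)

func allow(ctx context.Context, rdb *redis.Client, key string, limit int, window time.Duration) (bool, error) {
	res, err := fixedWindowScript.Run(ctx, rdb, []string{key}, limit, window.Milliseconds()).Int()
	if err != nil {
		return false, err
	}
	return res == 1, nil
}

func main() {
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ok, err := allow(context.Background(), rdb, "ratelimit:user:42", 100, time.Minute)
	fmt.Println(ok, err)
}
```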
Failure Scenarios
Scenario 1: Redis Failure, Fail-Open Stampede. The Redis cluster backing the distributed rate limiter hits a leader election during a network partition. This typically lasts 5 to 30 seconds. During that window, the rate limiter fails open (as configured) and all clients bypass limits. A single aggressive client or bot network pushes 50K RPS instead of its 100 RPM limit, overwhelming the backend database. Connection pool exhaustion cascades to every service sharing that database. Detection: monitor rate_limiter_redis_errors_total and rate_limiter_bypass_total, and alert when bypass rate exceeds 1% of total decisions. Recovery: implement a local fallback rate limiter (an in-process token bucket per instance) that kicks in when Redis is unavailable. Even imprecise per-instance limits (global_limit / instance_count) will prevent total collapse. Stripe and Shopify both use this layered pattern: local limiter as a safety net, Redis for precision.
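A sketch of that layered pattern, assuming the Redis check is a function like the Lua-based one above, with golang.org/x/time/rate as the per-instance safety net (the limiter math mirrors the global_limit / instance_count heuristic):

```go
package ratelimit

import (
	"context"

	"golang.org/x/time/rate"
)

// FallbackLimiter tries the precise Redis check first; if Redis is
// unreachable, it degrades to an imprecise per-instance token bucket
// instead of failing fully open.
type FallbackLimiter struct {
	redisAllow func(ctx context.Context, key string) (bool, error)
	local      *rate.Limiter
}

func NewFallbackLimiter(redisAllow func(context.Context, string) (bool, error), globalLimit float64, instanceCount int) *FallbackLimiter {
	perInstance := globalLimit / float64(instanceCount)
	return &FallbackLimiter{
		redisAllow: redisAllow,
		local:      rate.NewLimiter(rate.Limit(perInstance), int(perInstance)),
	}
}

func (f *FallbackLimiter) Allow(ctx context.Context, key string) bool {
	if ok, err := f.redisAllow(ctx, key); err == nil {
		return ok // Redis healthy: use the precise global decision
	}
	// Redis down: fall back to the local bucket. This is also the place
	// to increment rate_limiter_bypass_total so the degradation is visible.
	return f.local.Allow()
}
```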
Scenario 2: Rate Limit Bypass via API Key Rotation. A malicious actor creates free-tier API keys programmatically, each with its own 1000 req/hr limit. With 100 keys, they get 100K req/hr, effectively unlimited. Each key individually looks fine, so no alarms fire. Detection: monitor unique API keys per IP address (alert on more than 5 keys/IP/hour), track total requests per IP regardless of API key, and add velocity checks on key creation (more than 3 keys per account per day triggers review). Recovery: add IP-based rate limiting as a secondary layer, require email verification for API keys, and implement progressive rate reduction for new keys. Start them at 10% of the full limit and ramp over 24 hours.
Scenario 3: Retry Storm After Rate Limit Response. The API returns 429 responses without Retry-After headers. A popular third-party client library implements immediate retry with no backoff. At 10K users, a momentary limit breach triggers 10K retries within one second, which causes more 429s, which causes more retries. Exponential amplification. The system oscillates between overload and recovery for 5 to 10 minutes. Detection: monitor 429_response_rate and retry_request_ratio (requests with retry-indicative headers / total requests); alert when the retry ratio exceeds 30%. Recovery: always return a Retry-After header with jittered delay (base + random 0 to 5s), implement server-side backpressure by increasing retry delay under load, and add client identification to detect and block retry-storming clients.
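On the server side, the first recovery step is cheap to implement. A minimal sketch (the base delay is illustrative):

```go
package ratelimit

import (
	"fmt"
	"math/rand"
	"net/http"
)

// reject429 sends a rate-limit response with a jittered Retry-After
// (base + random 0-5s, as described above) so that clients honoring
// the header do not all come back in the same instant.
func reject429(w http.ResponseWriter, baseDelaySeconds int) {
	delay := baseDelaySeconds + rand.Intn(6) // base + 0..5s jitter
	w.Header().Set("Retry-After", fmt.Sprintf("%d", delay))
	http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
}
```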
Capacity Planning
Redis handles roughly 200K operations/second for simple INCR commands on a single instance (r6g.large). A Lua-script-based sliding window check costs about 2 to 3 operations per rate limit decision, so the result is around 70K rate limit decisions/second per Redis instance.
| Metric | Target | Warning | Action |
|---|---|---|---|
| Redis CPU (rate limit instance) | < 40% | > 60% | Scale Redis or shard by key prefix |
| Rate limit decision latency (P99) | < 2ms | > 10ms | Check Redis connection pool, network |
| False positive rate (legitimate 429s) | < 0.01% | > 0.1% | Widen limits, review limit granularity |
| Memory per rate limit key | ~100 bytes (counter) | N/A | Plan: 100B * unique_keys * 2 windows |
| Redis failover duration | < 5s | > 30s | Test Sentinel/cluster failover regularly |
Real-world numbers worth knowing:
- Stripe rate-limits at roughly 100 read / 100 write requests per second per key in production, using a multi-tier system (per-IP, per-API-key, per-merchant, and global).
- GitHub allows 5,000 requests/hour for authenticated API users with a token bucket algorithm and returns precise X-RateLimit-* headers.
- Capacity formula: redis_instances = (peak_rps * avg_rate_checks_per_request) / 70000. A service at 100K RPS with 2 rate-limit checks per request needs (100K * 2) / 70K = roughly 3 Redis instances (deploy 6 for HA with replicas).
Architecture Decision Record
ADR: Rate Limiting Architecture
Context: Where and how rate limits get enforced affects accuracy, latency, failure modes, and operational complexity. There is no single right answer, but there is a clear decision framework.
| Criteria (Weight) | Local In-Process | Centralized (Redis) | Edge (CDN/WAF) | Hybrid (Local + Redis) |
|---|---|---|---|---|
| Accuracy (30%) | Low (per-instance) | High (global count) | Medium (edge count) | High |
| Latency added (25%) | ~0ms | ~1-3ms (Redis RTT) | ~0ms (inline) | ~0-3ms |
| Failure mode (20%) | Fail-safe (no deps) | Fail-open risk | Fail-safe | Graceful degradation |
| Ops complexity (15%) | Lowest | Medium (Redis HA) | Low (managed) | Highest |
| Multi-tenant support (10%) | Poor | Excellent | Limited | Excellent |
Decision framework:
- Single service, under 10K RPM, no multi-tenant requirements. Use in-process rate limiting (Go rate.Limiter, Java Guava RateLimiter). Zero external dependencies, sub-microsecond overhead. The trade-off is that limits are per-instance, not global. This is fine for most internal services (see the sketch after this list).
- Multi-service, 10K to 500K RPM, API product with paying customers. Use Redis-backed centralized rate limiting. Deploy Redis Sentinel or Cluster for HA. Implement the local fallback pattern to degrade to per-instance limits if Redis goes down. This is what Stripe, GitHub, and Twilio do.
- Public-facing, over 500K RPM, DDoS/abuse is a primary concern. Layer edge rate limiting (Cloudflare Rate Limiting, AWS WAF) in front of application-level limits. Let the edge handle volumetric attacks (IP-based) while the application handles business-logic limits (per-user, per-resource). Edge plus Redis is the gold standard here.
- Multi-region, eventual consistency is acceptable. Use local rate limiting with periodic cross-region synchronization (for example, sync counters via Kafka every 5 seconds). Accept that limits may be exceeded by up to sync_interval * region_count in the worst case. This avoids cross-region Redis latency (50 to 100ms) on the hot path, which matters more than most teams expect at scale.
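For the first tier, a minimal sketch using Go's golang.org/x/time/rate, which implements a token bucket in-process (the numbers mirror the earlier token bucket example):

```go
package main

import (
	"fmt"

	"golang.org/x/time/rate"
)

func main() {
	// 10 requests/second sustained, bursts of 100. Per instance,
	// not global, which is the trade-off named above.
	limiter := rate.NewLimiter(rate.Limit(10), 100)
	if limiter.Allow() {
		fmt.Println("request allowed")
	} else {
		fmt.Println("429: slow down")
	}
}
```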
Key Points
- Caps request rates to protect services from abuse, DDoS, and resource exhaustion
- Token bucket, sliding window, and leaky bucket are the three core algorithms worth knowing
- Distributed rate limiting needs shared state (Redis). Local-only limits apply per instance, which is often not the intended behavior
- Always communicate limits through standard headers: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset
- Set different limits per tier. Free users, paid users, and internal services should not share the same budget
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| Redis + Lua | Open Source | Distributed counters, atomic operations | Medium-Enterprise |
| Envoy Rate Limit | Open Source | Service mesh integration, per-route limits | Large-Enterprise |
| Kong Rate Limiting | Open Source | API gateway plugin, Redis-backed | Medium-Enterprise |
| AWS WAF | Managed | Edge rate limiting, IP-based rules | Small-Enterprise |
Common Mistakes
- Using per-instance rate limiting instead of global. Clients can bypass it by hitting different instances
- Not separating authenticated and unauthenticated rate limits
- Setting limits too tight at launch, then throttling legitimate traffic before real usage data exists
- Skipping the Retry-After header. Clients retry immediately and the result is a thundering herd
- Rate limiting by IP only. Shared IPs behind NAT or corporate proxies punish every user behind them