Design a Rate Limiter
Design a rate limiting service that protects APIs from abuse. It should support different rate limit rules (per user, per IP, per endpoint), work across multiple servers, and handle high throughput with minimal latency overhead.
Key Topics
Interview Cheat Sheet
60s skim · 3min careful readEmbedded Go library inside the API gateway, not a sidecar service. One atomic Lua script in Valkey evaluates all four scope counters (global, tenant, user, IP) per request in sub-millisecond. Sliding window counter as default, token bucket for burst endpoints. Fail-open via in-process circuit breaker. The non-obvious detail is the hash-tag on Valkey keys so all of a tenant's scopes land on the same shard for atomic multi-key Lua.
- Check · 1M req/sec · sub-millisecond p99
Every request hits the rate limiter inside the gateway. A single Lua script in Valkey checks all four counters (global, tenant, user, IP) at once and either lets the request through or returns 429 with proper headers. No separate service hop.
Requestgatewayembedded rate-limit libraryone Lua script on Valkey reads + increments 4 counters atomically (global, tenant, user, IP) using redis.call('TIME') as server-side clockif any exceeds: return 429 with X-RateLimit-Limit/Remaining/Reset headerselse: forward to upstreamhash tag on key forces a tenant's 4 scopes onto one shard so multi-key Lua is atomic - Hot tenant · per-pod pre-aggregation · 100ms flush
A hot tenant doing 300K/sec on one Valkey shard would blow the shard's 20K scripts/sec ceiling. Instead, each gateway pod tracks the count in memory and flushes a batch to Valkey every 100ms. That cuts shard ops by 100x at the cost of about 1% overshoot.
Hot tenant requestin-process pre-aggregator counts locallyevery 100ms: batched INCRBY {batch} to Valkey (single network call instead of per-request)shard ops drop from 300K/sec to ~3K/seccost: up to 100ms staleness, ~1% overshoot at sustained burst (acceptable trade) - Fail-open · circuit breaker on Valkey loss
If Valkey goes away, the gateway can't enforce limits. The choice is fail-open (let everything through) or fail-closed (block everything). Fail-open is safer because the API backends are expected to self-protect; the rate limiter is defense-in-depth, never the only defense.
Valkey timeouts ≥5 consecutive (~500ms)circuit breaker opens on each gateway podall requests pass through unlimitedbackends rely on their own connection limits, queue depth, and load sheddingbreaker probes every 5s, closes when a probe succeedsemergency multiplier at Valkey rl:emergency_multiplier read by Lua every check (5.0 scales every limit 5x in one second across all pods)
- •1M rate-limit checks/sec peak, sub-millisecond p99 added latency
- •Four scopes: global 500K/min, tenant 10K/min, user 100/min, IP 1K/min
- •Valkey ~10 GB across 10 shards, two counters per rule at 16 bytes each
- •Valkey single-thread: ~20K Lua scripts/sec per shard, ~0.05ms per script
- •Sliding window counter: two counters per key, ~0.3ms, within 5% of true sliding
- •Multi-region split: US 660 + EU 330 + Asia 110 = 1100, 10% over-provision on 1000 global
- •Pre-aggregation flush every 100ms, ~1% overshoot at 50K/sec on a 500K/min cap
- •Embedded library in the gateway, not a separate sidecar service (saves 0.5-2ms per request)
- •Sliding window counter as default, token bucket as exception for burst endpoints
- •One atomic Lua script for all four scopes, not four separate INCR commands
- •Hash-tag on Valkey keys forces tenant scopes onto one shard, required for scripted multi-key
- •Fail-open via circuit breaker, not fail-closed (limiter must never cause an API outage)
- •Approach C multi-region (proportional split with 10% over-provision), not gossip by default
- •Rate limiter is defense-in-depth, API backends must still load-shed independently
- •redis.call('TIME') inside Lua, not client clocks, so attackers can't game windows
- •30s rule-cache propagation versus 1s emergency-multiplier path, different SLAs
- •Valkey single-threaded ceiling is what justifies pre-aggregation, not raw shard count
- Sliding window counter over log or fixed window
Sliding log stores every request timestamp, so a 1000/min limit at 1M req/sec is 60M timestamps per key with ZRANGEBYSCORE on each check, not sub-ms. Fixed window has boundary bursts (2x the limit at window edges). Sliding window counter holds two counters per key and weights the previous window by elapsed ratio, gets within 5% of true sliding accuracy in <0.3ms. Same approach Stripe and Cloudflare use. Token bucket is the exception, kept for burst endpoints where 'allow up to N at once, then refill' is the actual product requirement.
- Embedded library over standalone service (the 0.5-2ms math)
A standalone rate-limit service adds 0.5-2ms per request for the network hop on top of ~0.2ms Valkey roundtrip. At 1M checks/sec that tax is unaffordable. The library is a Go module compiled into every gateway pod, talks Valkey directly. Library updates ship with gateway releases (which we accept because rule changes go through the 30s rule cache, not library deploys). Same operational trade as the news-aggregator read path being independent of the write path: keep the hot path lean.
- Single atomic Lua over four separate INCR calls
If we did the four scope checks (global, tenant, user, IP) as separate INCR commands, a burst could pass the global check, then the tenant check, before any counter increments are visible to concurrent requests. The combined Lua script reads all four counters, evaluates them, and increments together. Either all four pass or none, in a single round-trip. Also reduces network hops from four to one. Hash tags on keys force a tenant's four scope keys onto one shard so multi-key Lua actually works in cluster mode.
- Multi-region: proportional split + 10% over-provision
True global accuracy needs Paxos on every check, which doesn't pencil at 1M/sec. Approach C: split the global limit by region proportional to traffic plus 10% over-provisioning. A 1000/min global becomes US 660 + EU 330 + Asia 110 = 1100 aggregate. We accept 10% global overshoot for zero cross-region dependency. For billing-sensitive limits needing tighter accuracy, upgrade to Approach D: gossip counters between regions every 5s, accepting some staleness for closer-to-global enforcement. The trade is the same shape as the URL shortener's region-prefixed counter: bound the worst case with structure, don't synchronize.
40-Minute Interview Playbook
Each phase is what the interviewer expects you to do and say. Concrete steps, not topic hints. The diagrams are what you sketch on the board.
- 15 min
Clarify Requirements and Scope
GoalLock in the four scopes (global, tenant, user, IP), the throughput target, and the fail-open posture. The big mistake is starting to code Redis without nailing down what 'a rate limit' actually means in this product.
Do & Say- ASK·1Open by asking what are we protecting and at what scope? Walk them to four hierarchical scopes: global (500K/min), tenant (10K/min), user (100/min), per-IP (1K/min). Write those four numbers stacked in the corner of the board.
- SAY·2Pin the throughput: 1M rate-limit checks/sec peak, sub-millisecond p99 added latency. Confirm out loud that the rate limiter must be faster than whatever it protects.
- SAY·3State the fail-open contract: if Valkey is unreachable, we allow the request, we don't deny, the rate limiter must never cause an outage of the API it protects. Get a nod.
- SAY·4Park out of scope: DDoS scrubbing at the edge, WAF rules, abuse detection. Say L7 rate limiting only, the network edge handles volumetric attacks.
- WRITE·5Justify the per-key math: 10M tenants × ~5 rules each, 2 counters per rule at 16 bytes, plus Valkey overhead, ~10 GB across the cluster, single-digit GB per shard. Write 10GB total, 10 shards on the board.
Interviewer is grading: You name the scope hierarchy before drawing anything, you write the fail-open rule down so it can't get forgotten in the deep dive, and you don't conflate rate limiting with WAF/DDoS.
- 25 min
API and Data Model
GoalDefine the embedded Check() call, the response headers, and the two-counter sliding-window key format. No external rate-limit service, no per-request network hop.
Do & Say- WRITE·1Write the embedded library signature: Check(ctx, req) returns Decision{Allowed, Limit, Remaining, RetryAfterSec, TriggeredRule}. Say this is a Go library inside the API Gateway, not a sidecar.
- WRITE·2Write the four headers a 429 must emit: X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, Retry-After. Clients build backoff from these, so they're non-negotiable.
- DRAW·3Sketch the Valkey key shape with a Cluster hash tag: rl:{acme_corp}:tenant:1711360800 and the previous-window key one window earlier. Underline the curly braces. The {acme_corp} hash tag forces all keys for one tenant onto the same shard, required because the Lua script touches all four scope keys atomically.
- SAY·4Token bucket key format for burst endpoints: tb:user:acme:42 stores tokens|last_refill_epoch in a single string, one GET, one SET per check.
- SAY·5Rule store: PostgreSQL table with columns (tenant_id, endpoint_pattern, scope, algorithm, max_requests, window_seconds, priority). 30-second in-process cache per gateway pod, so no DB hop on the hot path.
Interviewer is grading: You volunteer the hash-tag detail before being prompted (Valkey Cluster gotcha), you keep the rate limiter embedded (no separate service), and you can defend why rules live in PostgreSQL and counters live in Valkey.
- 310 min
High-Level Design
GoalOne picture that shows the embedded library in the gateway, the Valkey Cluster for counters, the rule store and async decision log on the side. Label the hot-path arrow as 'one Lua roundtrip, sub-millisecond.'
Draw on the boardDo & Say- DRAW·1Draw the gateway pod with the rate limit library inside it. Label the box embedded, not a service. A standalone rate limiter adds 0.5 to 2 ms per request. We can't afford that.
- DRAW·2Draw one arrow from the gateway to Valkey labeled EVALSHA, one Lua script, all four scope checks atomic. global, tenant, user, and endpoint are evaluated together. If we did them as four separate commands, bursts could pass the global check before any counter increments.
- SAY·3Add the rule cache loop: PostgreSQL feeds an in-process Go map, 30-second refresh. 30 seconds is the propagation latency for normal rule changes; we expose a flush endpoint for emergency changes during incidents.
- SAY·4Add the dotted async path: 1% sampling on allows, 100% on denies, → Kafka → Flink → ClickHouse for the top 10 rate-limited tenants in the last hour dashboard.
- SAY·5Mark the failure path: circuit-breaker open after 5 consecutive Valkey failures, dotted arrow gateway → Allowed. This is the fail-open contract. The library never blocks on Valkey.
Interviewer is grading: You draw the rate limiter inside the gateway, you label the single Lua-script roundtrip explicitly, and your diagram already shows the fail-open path before they ask about Valkey failures.
- 415 min
Deep Dive: Algorithm Choice, Hot Keys, and Multi-Region
GoalThree sub-dives the interviewer will push on: which algorithm and why, what happens to a hot tenant key that's pinned to one shard, and how the limit behaves across regions.
Draw on the boardDo & Say- SAY·1Put four algorithms on the board: fixed window, sliding log, sliding window counter, token bucket. Kill fixed window with one line: 2x burst at the window boundary, not usable. Kill sliding log: 60M timestamps per key at 1M req/sec, ZRANGEBYSCORE isn't sub-millisecond.
- SAY·2Defend sliding window counter as the default: two counters per key, weighted estimate, sub-0.3ms, within 5% of true sliding window. This is what Stripe and Cloudflare use. Formula: estimated = prev * (1 - elapsed_ratio) + curr.
- SAY·3Defend token bucket as the exception for burst endpoints: bulk APIs and pagination, where a user legitimately makes 100 calls in a second then nothing for a minute. Token bucket allows that, sliding window punishes it.
- SAY·4Pivot to the Valkey single-thread bottleneck: Valkey is single-threaded per shard, 0.05ms per Lua script execution, 20K scripts/sec per shard ceiling. For 1M/sec we need 50 shards, or we cut Valkey ops with local pre-aggregation.
- DRAW·5Draw the pre-aggregator: each gateway pod batches counts in-process, flushes via INCRBY every 100ms. Trade-off: up to 100ms of staleness, limit can be exceeded by 5K requests at 50K/sec, 1% overshoot on a 500K/min cap. Acceptable.
- SAY·6Hot tenant pin: all four scope keys for a tenant share the same hash tag → live on one shard by design. If acme_corp generates 300K/sec, that one shard is the bottleneck. Pre-aggregation knocks it down to ~3K batch updates/sec, well within the 20K Lua-script ceiling.
- SAY·7Clock drift: use redis.call("TIME") inside the Lua script instead of client-side timestamps. Valkey's clock is the only one that matters, so a user can't game window boundaries by routing to a server with a later wall clock.
- SAY·8Multi-region Approach C: split the global limit by region proportional to traffic + 10% over-provisioning. A 1000/min global → US=660, EU=330, Asia=110, total 1100. We accept a 10% global overshoot for zero cross-region dependency. For billing-sensitive limits, upgrade to Approach D: gossip every 5 seconds.
- WATCH·9Be ready for the override question. Three mechanisms stacked: permanent bypass list for internal services, per-tenant override row in PostgreSQL with expires_at, emergency multiplier at Valkey key rl:emergency_multiplier read by Lua every check. Setting it 5.0 scales every limit 5x within one second across all pods.
Interviewer is grading: You don't ship a single algorithm, you ship a default plus an exception. You volunteer the Valkey single-threaded recalc before they catch you. You name the 10% over-provisioning cost of multi-region without being prompted.
- 55 min
Failure Modes, Trade-offs, and Wrap-up
GoalDefend the fail-open posture, name two deliberate trade-offs, close in one sentence.
Do & Say- SAY·1Valkey shard failure: one shard down = 10% of keys unreachable, circuit breakers open after 5 timeouts (~500ms), requests fail open, Valkey Cluster promotes replica in ~5s. Worst case: a tenant on the failed shard overshoots by 10s of traffic, ~1,667 extra at 10K/min. Beats blocking 10% of traffic.
- SAY·2Complete Valkey outage: every circuit opens, every request is allowed, API backends self-protect via connection limits, queue depth, and load shedding. The rate limiter is defense-in-depth, not the only defense. This is the deliberate trade-off the fail-open rule encodes.
- SAY·3PostgreSQL outage: zero impact on the hot path, 30-second in-process cache serves stale rules indefinitely, only rule management is blocked. Two-tier architecture pays off here, the hot path and the config path fail independently.
- SAY·4Close in one sentence: embedded Go library inside the gateway, one atomic Lua script across four scope counters in Valkey, sliding window counter by default + token bucket for burst endpoints, fail-open via circuit breaker, multi-region by proportional split with 10% over-provisioning, async decision logs into ClickHouse for visibility.
- SAY·5If there's time, offer the next layers: Lua script hash-tag trick, emergency multiplier rollout playbook, or Approach C → Approach D upgrade path.
Interviewer is grading: You volunteer the fail-open consequences (some over-limit traffic, by design) instead of pretending the system is perfect. You name defense-in-depth as the reason rate limiting alone isn't enough. You can summarize the whole design in one breath.
Interview Grading by Level
What an interviewer at each level expects to see in your answer. Use this to calibrate, not to perform.
Mid-Level Engineer (L4 / SDE-II)
Picks a reasonable algorithm and a key-value store, but stops short of explaining what happens when the limiter itself fails or when a single tenant overwhelms one shard.
- Names token bucket or sliding window and can describe how each one counts requests.
- Puts Redis (or Valkey) in front of the rate limit check and knows it needs an atomic operation.
- Returns 429 with Retry-After and remembers the X-RateLimit-* headers when prompted.
- Knows that a per-process counter on each app server won't work because requests for one user hit different pods.
- Stores rules in a relational database and caches them in process to avoid a DB hop per request.
- Doesn't mention atomicity of multi-step checks; uses INCR + EXPIRE as separate commands and misses the race.
- Picks 'a single Redis instance' without acknowledging it's a SPOF or sizing how many shards 1M req/sec needs.
- Has no answer for what happens when Redis goes down (often picks fail-closed, which causes outages).
- Doesn't realize the hash-tag/key-distribution problem when a single tenant generates 50K req/sec.
- Treats multi-region as 'just put a load balancer in front' with no plan for cross-region counter accuracy.
Senior Engineer (L5 / SDE-III)
Drives the four-scope hierarchy, picks sliding window counter with a token bucket exception, defends fail-open, and handles hot keys with pre-aggregation when pushed.
- Writes the four hierarchical scopes (global, tenant, user, IP) and their limits on the board within the first 5 minutes.
- Picks sliding window counter as default and token bucket for burst endpoints, with one-line rejections of fixed window and sliding log.
- Implements all four scope checks as a single atomic Lua script and explains why separate commands would race under burst.
- Forces all keys for a tenant onto the same shard with hash tags, and knows this is a Valkey Cluster requirement for scripted multi-key access.
- Defends fail-open via in-process circuit breaker as the contract, not a bug. Names the consequence (some over-limit traffic during outages).
- Layers three override mechanisms: permanent bypass list, per-tenant override with expiry, emergency multiplier stored in Valkey.
- Routes hot tenants with per-pod local pre-aggregation flushed every 100ms, accepting ~1% overshoot.
- Names Valkey's single-threaded execution only after the interviewer pushes; revises shard count mid-conversation rather than starting from the per-script latency budget.
- Discusses multi-region in handwavy terms ('we'll replicate counters') and only lands on the proportional-split approach with prompting.
- Forgets to use redis.call('TIME') inside the Lua script, leaving clock drift as an attack surface that the interviewer has to surface.
Staff+ Engineer (L6+)
Owns the room, names defense-in-depth explicitly, treats the limiter as part of a layered protection story, and brings operational rehearsal artifacts.
- Frames the rate limiter as defense-in-depth: API backends must independently load-shed, the rate limiter is never the only protection.
- Volunteers the Valkey single-thread bottleneck (~20K Lua-scripts/sec per shard) and uses it to justify pre-aggregation, not 50 shards of raw capacity.
- Brings up clock-source discipline (redis.call('TIME')) without being asked, because client-side timestamps let attackers route to the server with the later wall clock.
- Quantifies the failure consequence: '10 seconds of unprotected traffic during a shard failover is roughly 1,667 extra requests for a 10K/min tenant, and we accept that explicitly.'
- Distinguishes the 30s rule-cache propagation from the 1s emergency-multiplier path and names which one is used in an incident playbook.
- Pushes back on requirements: 'do you actually want 100 req/min global, or 100 per region? Because Approach C means a 10% overshoot at the global level.'
- Closes with one-sentence summary plus an explicit list of what they'd cover with more time (Lua hash-tag trick, gossip-based Approach D, header negotiation for partial-content downloads under throttling).
Common Follow-up Questions
click to expandQuestions an interviewer is likely to ask after your walkthrough. Rehearse the short answer.
Foundations Referenced
Detailed Solution Coming Soon
Full walkthrough coming soon. Stay tuned!