Token Bucket & Sliding Window Rate Limiters
A rate limiter caps the request rate per key. Token bucket: tokens refill at R per second and each request consumes one. Sliding window: count the requests in the last window and reject once the count hits the limit. Both need atomic state updates, and the choice between lazy and scheduled refill depends on traffic patterns and burst tolerance.
The problem in plain English
Cap how many requests a user, IP, or API key can make in a given time window. Two algorithms cover almost every real system.
Token bucket
Imagine a bucket that holds at most C tokens. Tokens drip in at a steady rate (R per second). Each request takes one token and proceeds; if the bucket is empty, the request is rejected.
Bucket capacity C = 10 tokens. Refill rate R = 2 tokens/second.
t=0s                          bucket: [**********] (full)
request > take 1              bucket: [*********o]
request > take 1              bucket: [********oo]
...8 more fast requests...    bucket: [oooooooooo] (empty, 10 taken in total)
request > REJECT (no tokens)
t=1s, bucket refilled by 2    bucket: [**oooooooo]
2 requests OK                 bucket: [oooooooooo]
request > REJECT
t=5s                          bucket: [********oo] (refilled to 8 over 4 quiet seconds)
A long quiet stretch lets the bucket refill, ready to absorb a burst.
Token bucket allows bursts (up to C requests instantly) but caps the average at R per second. The state is just two numbers per key: tokens remaining, and last refill timestamp. Update is one CAS or one Redis Lua script.
Sliding window
Keep a list of the timestamps of recent requests. Reject a request if admitting it would put more than N in the last W seconds (the window).
Limit: 5 requests per 10 seconds. Window slides with current time.
t=0s    record: [t=0]                         count in [-10s, 0s]: 1   OK
t=1s    record: [t=0, t=1]                    count in [-9s, 1s]:  2   OK
t=2s    record: [t=0, t=1, t=2]               count in [-8s, 2s]:  3   OK
t=3s    record: [t=0, t=1, t=2, t=3]          count in [-7s, 3s]:  4   OK
t=4s    record: [t=0, t=1, t=2, t=3, t=4]     count in [-6s, 4s]:  5   OK
t=4.5s  would make 6 in [-5.5s, 4.5s]         REJECT (rejected request is not recorded)
t=11s   record: [t=1, t=2, t=3, t=4]          count in [1s, 11s]:  4   OK
        (t=0 is older than 10s ago, drop it)
Sliding window enforces a strict cap; no bursts allowed. The state is an ordered list of timestamps per key (or a counter per sub-bucket if the window is divided into fixed slots). Update needs the list under a lock.
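The timestamp-log variant can be sketched directly from that description. A minimal sketch, assuming one instance per key, an ArrayDeque of nanosecond timestamps, and synchronized as "the list under a lock" (the class and field names are illustrative, not from any particular library):

import java.util.ArrayDeque;

// Sketch: sliding-window log. The deque holds the timestamps (nanos)
// of accepted requests that are still inside the current window.
class SlidingWindowLimiter {
    private final int limit;            // max requests per window
    private final long windowNanos;     // window length in nanos
    private final ArrayDeque<Long> log = new ArrayDeque<>();

    SlidingWindowLimiter(int limit, long windowSeconds) {
        this.limit = limit;
        this.windowNanos = windowSeconds * 1_000_000_000L;
    }

    // "the list under a lock": one monitor guards prune + count + append
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Drop timestamps that have slid out of the window.
        while (!log.isEmpty() && now - log.peekFirst() >= windowNanos) {
            log.pollFirst();
        }
        if (log.size() >= limit) return false;  // strict cap, no bursts
        log.addLast(now);                       // rejected requests are not recorded
        return true;
    }
}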
Picking one
| | Token Bucket | Sliding Window |
|---|---|---|
| Allow bursts? | Yes (up to capacity) | No (strict cap) |
| Memory per key | Constant (2 numbers) | One entry per recent request |
| Update cost | One CAS / Lua script | Lock around a list |
| Best for | API quotas with grace, friendly throttling | Billing, strict per-window contracts |
Most real systems use token bucket because it's cheap and friendly. Stripe's billing endpoints use sliding window because they can't allow a burst that exceeds the window's hard cap.
Production-grade considerations
What real services run
- AWS: token bucket per service per region. Global limits via SQS quota daemon.
- Stripe: sliding window per customer per endpoint. Lua script in Redis.
- Cloudflare: leaky bucket at the edge; aggregate at origin.
- GitHub API: 5000/hour authenticated, sliding hourly window.
Algorithm tradeoffs
| | Token Bucket | Sliding Window |
|---|---|---|
| Allow bursts | Yes (up to capacity) | No (strict cap) |
| Memory per key | Constant (2 numbers) | O(requests in window) |
| Atomic update cost | One CAS | Lock (deque manipulation) |
| Best for | API quota with grace | Billing / strict limits |
When a rate limiter is unnecessary
- Internal services with predictable load → connection pool already bounds concurrency.
- Per-second limits on a service serving millions of req/sec → sharding by key is cheaper than tracking each.
- Simple "max N concurrent" → use a Semaphore, not a rate limiter (see the sketch below).
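A concurrency cap and a rate cap are different tools. A minimal sketch of the former using java.util.concurrent.Semaphore (the class and method names around it are illustrative):

import java.util.concurrent.Semaphore;

// Bounds *concurrency* (requests in flight), not *rate* (requests per second).
class ConcurrencyCap {
    private final Semaphore permits;

    ConcurrencyCap(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public boolean handle(Runnable request) {
        if (!permits.tryAcquire()) return false;  // over the cap: shed the request
        try {
            request.run();
            return true;
        } finally {
            permits.release();  // always return the permit
        }
    }
}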
The senior signal: articulating what to do when the limiter itself fails. Fail open (let everything through) vs. fail closed (reject everything): billing services should fail closed; CDNs fail open. The decision reveals a lot about a candidate's production thinking.
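One way to make that decision explicit is to pass it to a wrapper around the real limiter. A minimal sketch; RemoteLimiter and its exception are stand-ins for whatever Redis- or DB-backed check a real system uses:

// Sketch: the fail-open / fail-closed policy as an explicit constructor argument.
interface RemoteLimiter {
    boolean tryAcquire(String key) throws Exception;  // e.g. a Redis-backed check
}

class FailureAwareLimiter {
    private final RemoteLimiter delegate;
    private final boolean failOpen;  // true: CDN-style; false: billing-style

    FailureAwareLimiter(RemoteLimiter delegate, boolean failOpen) {
        this.delegate = delegate;
        this.failOpen = failOpen;
    }

    public boolean tryAcquire(String key) {
        try {
            return delegate.tryAcquire(key);
        } catch (Exception e) {
            // The limiter itself failed: fall back to the configured policy.
            return failOpen;
        }
    }
}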
Implementations
Lazy refill: on each tryAcquire(), compute how many tokens would have been earned since the last refill. The state lives in two AtomicLongs, tokens and last-refill nanos (don't pack both into one long: System.nanoTime() is 64-bit, and truncating it to 32 bits wraps every ~4 seconds, blowing up the (now - lastRefill) math). A CAS-retry loop keeps the pair consistent enough; as the in-code comment notes, the lastRefill advance is best-effort under contention.
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.TimeUnit;

class TokenBucket {
    private final long capacity;
    private final long refillNanos;       // nanos per token
    private final AtomicLong tokens;
    private final AtomicLong lastRefill;  // System.nanoTime() snapshot

    public TokenBucket(long capacity, long ratePerSec) {
        this.capacity = capacity;
        this.refillNanos = TimeUnit.SECONDS.toNanos(1) / ratePerSec;
        this.tokens = new AtomicLong(capacity);
        this.lastRefill = new AtomicLong(System.nanoTime());
    }

    public boolean tryAcquire() {
        long now = System.nanoTime();
        while (true) {
            long prev = lastRefill.get();
            long earned = (now - prev) / refillNanos;
            long current = tokens.get();
            long updated = Math.min(capacity, current + earned);
            if (updated < 1) return false;
            // CAS the token counter; on success, also advance lastRefill
            if (tokens.compareAndSet(current, updated - 1)) {
                // Advance lastRefill by the time-equivalent of the earned tokens.
                // Best-effort: a parallel acquire may also advance it.
                lastRefill.compareAndSet(prev, prev + earned * refillNanos);
                return true;
            }
        }
    }
}

Limit per user/IP/API key. A ConcurrentHashMap stores one TokenBucket per key; computeIfAbsent ensures each key gets exactly one bucket, regardless of concurrent first use.
import java.util.concurrent.ConcurrentHashMap;

class PerKeyRateLimiter {
    private final ConcurrentHashMap<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final long capacity;
    private final long ratePerSec;

    public PerKeyRateLimiter(long capacity, long ratePerSec) {
        this.capacity = capacity;
        this.ratePerSec = ratePerSec;
    }

    public boolean tryAcquire(String key) {
        TokenBucket bucket = buckets.computeIfAbsent(
            key, k -> new TokenBucket(capacity, ratePerSec));
        return bucket.tryAcquire();
    }
}
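A typical call site, as a sketch; the handler shape and the 429 status constant are illustrative assumptions, not part of the classes above:

// Sketch: gate each incoming request on the per-key limiter.
class RateLimitedHandler {
    private final PerKeyRateLimiter limiter = new PerKeyRateLimiter(10, 2);  // burst 10, 2 req/s sustained

    /** Returns the HTTP status to send: 200 if admitted, 429 if throttled. */
    public int handle(String apiKey) {
        if (!limiter.tryAcquire(apiKey)) {
            return 429;  // Too Many Requests
        }
        // ... do the real work ...
        return 200;
    }
}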
Key points
- Token bucket: refill rate plus a max bucket capacity (allows bursts)
- Sliding window: log of timestamps, count those in [now - window, now]
- Per-key state requires a concurrent map; per-bucket updates require atomics
- Distributed rate limiting: Redis SETNX + INCR + EXPIRE, or an atomic Lua script (see the sketch after this list)
- Real production: Stripe's algorithm uses sliding window + leaky bucket per dimension
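A sketch of the distributed variant from the bullet above, using the INCR + EXPIRE fixed-window counter and assuming the Jedis client (key naming and limits are illustrative). In production the two commands are usually wrapped in a Lua script so the expiry can't be lost between them:

import redis.clients.jedis.Jedis;

// Sketch: fixed-window counter in Redis, one key per (apiKey, window) pair.
class RedisFixedWindowLimiter {
    private final Jedis jedis;
    private final long limit;         // max requests per window
    private final int windowSeconds;  // window length

    RedisFixedWindowLimiter(Jedis jedis, long limit, int windowSeconds) {
        this.jedis = jedis;
        this.limit = limit;
        this.windowSeconds = windowSeconds;
    }

    public boolean tryAcquire(String apiKey) {
        long window = System.currentTimeMillis() / 1000L / windowSeconds;
        String key = "rl:" + apiKey + ":" + window;
        long count = jedis.incr(key);          // atomic increment across all app servers
        if (count == 1) {
            jedis.expire(key, windowSeconds);  // first hit sets the TTL; a Lua script makes this atomic with INCR
        }
        return count <= limit;
    }
}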