Token Bucket & Sliding Window Rate Limiters
A rate limiter caps the request rate per key. Token bucket: tokens refill at R per second and each request consumes one. Sliding window: count the requests in the last window and reject once the count hits the limit. Both need atomic state updates, and the choice between lazy and scheduled refill depends on traffic patterns and burst tolerance.
The problem in plain English
Cap how many requests a user, IP, or API key can make in a given time window. Two algorithms cover almost every real system.
Token bucket
Imagine a bucket that holds at most C tokens. Tokens drip in at a steady rate (R per second). Each request takes one token and proceeds; if the bucket is empty, the request is rejected.
Bucket capacity C = 10 tokens. Refill rate R = 2 tokens/second.
t=0s                          bucket: [**********] (full)
request > take 1              bucket: [*********o]
request > take 1              bucket: [********oo]
...8 more fast requests...    bucket: [oooooooooo] (empty, 10 taken in total)
request > REJECT (no tokens)
t=1s, bucket refilled by 2    bucket: [**oooooooo]
2 requests OK                 bucket: [oooooooooo]
request > REJECT
t=5s                          bucket: [********oo] (refilled to 8 over 4 quiet seconds)
A long quiet stretch lets the bucket refill, ready to absorb a burst.
Token bucket allows bursts (up to C requests instantly) but caps the average at R per second. The state is just two numbers per key: tokens remaining, and last refill timestamp. Update is one CAS or one Redis Lua script.
Sliding window
Keep a list of the timestamps of recent requests. Reject a request if admitting it would put more than N in the last W seconds (the window).
Limit: 5 requests per 10 seconds. Window slides with current time.
t=0s    record: [t=0]                         count in [-10s, 0s]: 1   OK
t=1s    record: [t=0, t=1]                    count in [-9s, 1s]:  2   OK
t=2s    record: [t=0, t=1, t=2]               count in [-8s, 2s]:  3   OK
t=3s    record: [t=0, t=1, t=2, t=3]          count in [-7s, 3s]:  4   OK
t=4s    record: [t=0, t=1, t=2, t=3, t=4]     count in [-6s, 4s]:  5   OK
t=4.5s  would make 6 in [-5.5s, 4.5s]         REJECT (rejected request is not recorded)
t=11s   record: [t=1, t=2, t=3, t=4]          count in [1s, 11s]:  4   OK
        (t=0 is older than 10s ago, drop it)
Sliding window enforces a strict cap; no bursts allowed. The state is an ordered list of timestamps per key (or a counter per sub-bucket if the window is divided into fixed slots). Update needs the list under a lock.
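The timestamp-log variant can be sketched directly from that description. A minimal sketch, assuming one instance per key, an ArrayDeque of nanosecond timestamps, and synchronized as "the list under a lock" (the class and field names are illustrative, not from any particular library):

import java.util.ArrayDeque;

// Sketch: sliding-window log. The deque holds the timestamps (nanos)
// of accepted requests that are still inside the current window.
class SlidingWindowLimiter {
    private final int limit;            // max requests per window
    private final long windowNanos;     // window length in nanos
    private final ArrayDeque<Long> log = new ArrayDeque<>();

    SlidingWindowLimiter(int limit, long windowSeconds) {
        this.limit = limit;
        this.windowNanos = windowSeconds * 1_000_000_000L;
    }

    // "the list under a lock": one monitor guards prune + count + append
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        // Drop timestamps that have slid out of the window.
        while (!log.isEmpty() && now - log.peekFirst() >= windowNanos) {
            log.pollFirst();
        }
        if (log.size() >= limit) return false;  // strict cap, no bursts
        log.addLast(now);                       // rejected requests are not recorded
        return true;
    }
}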
Picking one
| | Token Bucket | Sliding Window |
|---|---|---|
| Allow bursts? | Yes (up to capacity) | No (strict cap) |
| Memory per key | Constant (2 numbers) | One entry per recent request |
| Update cost | One CAS / Lua script | Lock around a list |
| Best for | API quotas with grace, friendly throttling | Billing, strict per-window contracts |
Most real systems use token bucket because it's cheap and friendly. Stripe's billing endpoints use sliding window because they can't allow a burst that exceeds the window's hard cap.
Production-grade considerations
What real services run
- AWS: token bucket per service per region. Global limits via SQS quota daemon.
- Stripe: sliding window per customer per endpoint. Lua script in Redis.
- Cloudflare: leaky bucket at the edge; aggregate at origin.
- GitHub API: 5000/hour authenticated, sliding hourly window.
Algorithm tradeoffs
| | Token Bucket | Sliding Window |
|---|---|---|
| Allow bursts | Yes (up to capacity) | No (strict cap) |
| Memory per key | Constant (2 numbers) | O(requests in window) |
| Atomic update cost | One CAS | Lock (deque manipulation) |
| Best for | API quota with grace | Billing / strict limits |
When a rate limiter is unnecessary
- Internal services with predictable load → connection pool already bounds concurrency.
- Per-second limits on a service serving millions of req/sec → sharding by key is cheaper than tracking each.
- Simple "max N concurrent" → use a Semaphore, not a rate limiter (see the sketch below).
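A concurrency cap and a rate cap are different tools. A minimal sketch of the former using java.util.concurrent.Semaphore (the class and method names around it are illustrative):

import java.util.concurrent.Semaphore;

// Bounds *concurrency* (requests in flight), not *rate* (requests per second).
class ConcurrencyCap {
    private final Semaphore permits;

    ConcurrencyCap(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    public boolean handle(Runnable request) {
        if (!permits.tryAcquire()) return false;  // over the cap: shed the request
        try {
            request.run();
            return true;
        } finally {
            permits.release();  // always return the permit
        }
    }
}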
The senior signal: articulating what to do when the limiter itself fails. Fail open (let everything through) vs. fail closed (reject everything): billing services should fail closed; CDNs fail open. The decision reveals a lot about a candidate's production thinking.
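One way to make that decision explicit is to pass it to a wrapper around the real limiter. A minimal sketch; RemoteLimiter and its exception are stand-ins for whatever Redis- or DB-backed check a real system uses:

// Sketch: the fail-open / fail-closed policy as an explicit constructor argument.
interface RemoteLimiter {
    boolean tryAcquire(String key) throws Exception;  // e.g. a Redis-backed check
}

class FailureAwareLimiter {
    private final RemoteLimiter delegate;
    private final boolean failOpen;  // true: CDN-style; false: billing-style

    FailureAwareLimiter(RemoteLimiter delegate, boolean failOpen) {
        this.delegate = delegate;
        this.failOpen = failOpen;
    }

    public boolean tryAcquire(String key) {
        try {
            return delegate.tryAcquire(key);
        } catch (Exception e) {
            // The limiter itself failed: fall back to the configured policy.
            return failOpen;
        }
    }
}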
Implementations
Lazy refill: on each tryAcquire(), compute how many tokens would have been earned since the last refill. The state lives in two AtomicLongs, tokens and last-refill nanos (don't pack both into one long: System.nanoTime() is 64-bit, and truncating it to 32 bits wraps every ~4 seconds, blowing up the (now - lastRefill) math). A CAS-retry loop keeps the pair consistent enough; as the in-code comment notes, the lastRefill advance is best-effort under contention.
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.TimeUnit;

class TokenBucket {
    private final long capacity;
    private final long refillNanos;       // nanos per token
    private final AtomicLong tokens;
    private final AtomicLong lastRefill;  // System.nanoTime() snapshot

    public TokenBucket(long capacity, long ratePerSec) {
        this.capacity = capacity;
        this.refillNanos = TimeUnit.SECONDS.toNanos(1) / ratePerSec;
        this.tokens = new AtomicLong(capacity);
        this.lastRefill = new AtomicLong(System.nanoTime());
    }

    public boolean tryAcquire() {
        long now = System.nanoTime();
        while (true) {
            long prev = lastRefill.get();
            long earned = (now - prev) / refillNanos;
            long current = tokens.get();
            long updated = Math.min(capacity, current + earned);
            if (updated < 1) return false;
            // CAS the token counter; on success, also advance lastRefill
            if (tokens.compareAndSet(current, updated - 1)) {
                // Advance lastRefill by the time-equivalent of the earned tokens.
                // Best-effort: a parallel acquire may also advance it.
                lastRefill.compareAndSet(prev, prev + earned * refillNanos);
                return true;
            }
        }
    }
}

Limit per user/IP/API key. A ConcurrentHashMap stores one TokenBucket per key; computeIfAbsent ensures each key gets exactly one bucket, regardless of concurrent first use.
import java.util.concurrent.ConcurrentHashMap;

class PerKeyRateLimiter {
    private final ConcurrentHashMap<String, TokenBucket> buckets = new ConcurrentHashMap<>();
    private final long capacity;
    private final long ratePerSec;

    public PerKeyRateLimiter(long capacity, long ratePerSec) {
        this.capacity = capacity;
        this.ratePerSec = ratePerSec;
    }

    public boolean tryAcquire(String key) {
        TokenBucket bucket = buckets.computeIfAbsent(
            key, k -> new TokenBucket(capacity, ratePerSec));
        return bucket.tryAcquire();
    }
}
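A typical call site, as a sketch; the handler shape and the 429 status constant are illustrative assumptions, not part of the classes above:

// Sketch: gate each incoming request on the per-key limiter.
class RateLimitedHandler {
    private final PerKeyRateLimiter limiter = new PerKeyRateLimiter(10, 2);  // burst 10, 2 req/s sustained

    /** Returns the HTTP status to send: 200 if admitted, 429 if throttled. */
    public int handle(String apiKey) {
        if (!limiter.tryAcquire(apiKey)) {
            return 429;  // Too Many Requests
        }
        // ... do the real work ...
        return 200;
    }
}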
Key points
- Token bucket: refill rate plus a max bucket capacity (allows bursts)
- Sliding window: log of timestamps, count those in [now - window, now]
- Per-key state requires a concurrent map; per-bucket updates require atomics
- Distributed rate limiting: Redis SETNX + INCR + EXPIRE, or an atomic Lua script (see the sketch after this list)
- Real production: Stripe's algorithm uses sliding window + leaky bucket per dimension
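A sketch of the distributed variant from the bullet above, using the INCR + EXPIRE fixed-window counter and assuming the Jedis client (key naming and limits are illustrative). In production the two commands are usually wrapped in a Lua script so the expiry can't be lost between them:

import redis.clients.jedis.Jedis;

// Sketch: fixed-window counter in Redis, one key per (apiKey, window) pair.
class RedisFixedWindowLimiter {
    private final Jedis jedis;
    private final long limit;         // max requests per window
    private final int windowSeconds;  // window length

    RedisFixedWindowLimiter(Jedis jedis, long limit, int windowSeconds) {
        this.jedis = jedis;
        this.limit = limit;
        this.windowSeconds = windowSeconds;
    }

    public boolean tryAcquire(String apiKey) {
        long window = System.currentTimeMillis() / 1000L / windowSeconds;
        String key = "rl:" + apiKey + ":" + window;
        long count = jedis.incr(key);          // atomic increment across all app servers
        if (count == 1) {
            jedis.expire(key, windowSeconds);  // first hit sets the TTL; a Lua script makes this atomic with INCR
        }
        return count <= limit;
    }
}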