Throughput vs Latency Tuning
Throughput = ops/sec the system can sustain. Latency = time per op. They trade off via batching, queuing, and concurrency. Higher concurrency → higher throughput but worse tail latency. Little's Law: average concurrency = throughput × average latency.
The two metrics that matter
Throughput (req/sec) is what monitoring shows in the steady state. Latency (ms per op) is what users feel. They're related but not interchangeable.
Higher concurrency → more in-flight ops → higher throughput. But also: longer queues → higher per-op latency. Service degrades from "fast at low load" to "slower at high load", even if no individual operation has changed.
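To see why, a sketch under an assumed M/M/1 queueing model (single server, random arrivals; the section doesn't commit to a specific model): average time in system is W = 1/(μ − λ), which explodes as utilization approaches 100%.

```python
# Illustrative only: assumed M/M/1 queue, service rate mu = 1000 ops/sec (hypothetical).
# W = 1 / (mu - lam) is average time in system; it explodes near saturation.
MU = 1000.0  # service rate, ops/sec

for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = utilization * MU          # arrival rate, ops/sec
    w_ms = 1000.0 / (MU - lam)      # average latency, ms
    print(f"utilization {utilization:.0%}: avg latency {w_ms:6.1f} ms")
# 50% -> 2 ms, 90% -> 10 ms, 99% -> 100 ms: same service, 50x the latency.
```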
Little's Law, the only law that matters here
L = λW
L = average number of items in the system. λ = arrival rate (req/sec). W = average time per item (sec).
Worked example: API receives 10K req/sec, average latency 50ms. Concurrent in-flight requests = 10K × 0.05 = 500. So sizing needs: 500-connection DB pool, 500-thread (or 500 virtual threads, or asyncio with concurrency=500) capacity. Below that, requests queue; latency grows.
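The same arithmetic as a tiny helper (the function name is illustrative, not a real sizing tool):

```python
def inflight(throughput_rps: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W, the average number of in-flight requests."""
    return throughput_rps * avg_latency_s

# Worked example from above: 10K req/sec at 50ms average latency.
assert inflight(10_000, 0.050) == 500.0
# A DB pool or worker count below ~500 forces requests to queue.
```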
The Universal Scalability Law
Throughput doesn't scale linearly forever. Adding threads gives diminishing returns due to (a) contention on shared resources and (b) coherence overhead (cross-CPU coordination). At some N, more threads make throughput worse. The USL models relative capacity as C(N) = N / (1 + α(N−1) + βN(N−1)), where α is the contention coefficient and β the coherence coefficient.
The optimal N depends on the workload. Find it empirically: load-test at N=2,4,8,16,32,64. Throughput plateaus or drops past optimal. Don't assume "more = better."
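A sketch of that sweep. Real numbers would come from a load generator (wrk, vegeta, locust, ...); here the USL formula itself stands in for the measurements, with hypothetical α and β coefficients:

```python
# Sweep concurrency and watch throughput peak, then fall (the USL knee).
# Replace usl_throughput() with an actual load-test run in practice.
ALPHA = 0.05   # contention coefficient (hypothetical)
BETA = 0.002   # coherence coefficient (hypothetical)

def usl_throughput(n: int) -> float:
    """Relative capacity at concurrency n under Gunther's USL."""
    return n / (1 + ALPHA * (n - 1) + BETA * n * (n - 1))

for n in (2, 4, 8, 16, 32, 64):
    print(f"N={n:3d}: relative throughput {usl_throughput(n):5.2f}")
# Rises through N=16, flattens, then drops by N=64: "more threads" is not free.
```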
Latency percentiles
| Percentile | What it measures | Use for |
|---|---|---|
| p50 (median) | Typical experience | Trend over time |
| p95 | "Most users" tail | SLO targets |
| p99 | Worst-1%, what frustrated users see | SLO violations |
| p99.9 | Catastrophe | Capacity planning |
| max | Single worst case | Debugging |
Average is dangerous: it hides bimodal distributions (90% fast + 10% terrible looks the same on average as 100% mediocre).
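A quick illustration with synthetic numbers chosen to match the 90/10 example above (the workload shape and the nearest-rank percentile helper are assumptions, not real telemetry):

```python
import random

# Bimodal workload: 90% of requests at ~10ms, 10% at ~500ms.
random.seed(42)
latencies_ms = [random.gauss(10, 2) if random.random() < 0.9 else random.gauss(500, 50)
                for _ in range(100_000)]

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for monitoring-style summaries."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean {mean:6.1f} ms")                          # ~59 ms: looks "mediocre"
print(f"p50  {percentile(latencies_ms, 50):6.1f} ms")  # ~10 ms: typical user is fine
print(f"p99  {percentile(latencies_ms, 99):6.1f} ms")  # ~550 ms: what the tail feels
```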
Backpressure, bounded latency under load
When arrival rate exceeds service rate, queue grows. Without intervention, queue depth → infinity, latency → infinity. Three responses:
- Drop: return 503 and free resources (a minimal sketch follows this list).
- Block: apply admission control upstream; accept no new requests until the queue drains.
- Bulk-process: serve in batches and prioritize.
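A minimal drop-based sketch, using a bounded `queue.Queue` as the admission gate (`handle_request` and the 500-item cap are illustrative, not a real framework API):

```python
import queue

# A bounded queue caps in-flight work; when full, shed load instead of
# letting queue depth and latency grow without bound.
MAX_QUEUE_DEPTH = 500  # e.g. from the Little's Law sizing above
work_queue: "queue.Queue[str]" = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

def handle_request(request: str) -> int:
    """Admit work if there's room; otherwise drop. Returns an HTTP-style status."""
    try:
        work_queue.put_nowait(request)  # non-blocking admit
        return 202                      # accepted for processing
    except queue.Full:
        return 503                      # drop: bounded queue -> bounded latency
```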
The interview answer: "Throughput and latency trade off. Little's Law connects them via concurrency. Tail latency (p99/p99.9) matters more than average. Beyond the optimal concurrency, throughput drops; find that optimum empirically. Backpressure is mandatory in production to bound latency under burst load."
Key points
- •Throughput: system-level capacity (req/sec)
- •Latency: per-op time, especially p50/p99/p99.9 percentiles
- •Little's Law: L = λW (concurrent in-flight = throughput × avg latency)
- •Batching trades latency for throughput (wait to fill a batch, then process many at once; see the sketch after this list)
- •Queue length grows without bound when arrival rate exceeds service rate; queueing delay explodes as utilization approaches 100%
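A micro-batching sketch for the batching point above; `get_item` and `write_batch` are hypothetical callbacks standing in for your queue and storage layer:

```python
import time

# Each item waits up to MAX_WAIT_S for the batch to fill: per-item latency
# is traded for per-batch throughput (one bulk op amortizes per-op overhead).
BATCH_SIZE = 32
MAX_WAIT_S = 0.010  # caps the latency cost of waiting for a full batch

def batch_writer(get_item, write_batch):
    """get_item(timeout) -> item or None on timeout; write_batch(list) does one bulk write."""
    while True:
        batch, deadline = [], time.monotonic() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE:
            item = get_item(timeout=max(0.0, deadline - time.monotonic()))
            if item is None:        # timed out: flush a partial batch
                break
            batch.append(item)
        if batch:
            write_batch(batch)      # one bulk write instead of len(batch) small ones
```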
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| Optimize for throughput | Maximizes utilization; batching amortizes per-op overhead | Higher per-op latency; deeper queues under load | Batch jobs, async pipelines, ETL, analytics queries |
| Optimize for latency | Fast, predictable responses; tight tail percentiles | Lower utilization; less batching means more per-op overhead | Interactive services, trading, real-time |
| Adaptive (backpressure-driven) | Bounds latency under burst load; degrades gracefully | More moving parts: admission control, load shedding | Production services with mixed workload |
Follow-up questions
▸What's Little's Law and why does it matter?
▸Why does p99 matter more than average?
▸How does batching affect latency?
▸What's USL (Universal Scalability Law)?
Gotchas
- !Average latency hides the tail; always look at p99/p99.9
- !Adding threads beyond optimal N HURTS throughput (USL)
- !Unbounded queues turn brief overload into permanent latency degradation
- !Cache hit rate drops when concurrency increases (cache-line contention, eviction)