Throughput vs Latency Tuning
Throughput = ops/sec the system can sustain. Latency = time per op. They trade off via batching, queuing, and concurrency. Higher concurrency → higher throughput but worse tail latency. Little's Law: average concurrency = throughput × average latency.
The two metrics that matter
Throughput (req/sec) is what monitoring shows in the steady state. Latency (ms per op) is what users feel. They're related but not interchangeable.
Higher concurrency → more in-flight ops → higher throughput. But also: longer queues → higher per-op latency. Service degrades from "fast at low load" to "slower at high load", even if no individual operation has changed.
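To see why, a sketch under an assumed M/M/1 queueing model (single server, random arrivals; the section doesn't commit to a specific model): average time in system is W = 1/(μ − λ), which explodes as utilization approaches 100%.

```python
# Illustrative only: assumed M/M/1 queue, service rate mu = 1000 ops/sec (hypothetical).
# W = 1 / (mu - lam) is average time in system; it explodes near saturation.
MU = 1000.0  # service rate, ops/sec

for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    lam = utilization * MU          # arrival rate, ops/sec
    w_ms = 1000.0 / (MU - lam)      # average latency, ms
    print(f"utilization {utilization:.0%}: avg latency {w_ms:6.1f} ms")
# 50% -> 2 ms, 90% -> 10 ms, 99% -> 100 ms: same service, 50x the latency.
```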
Little's Law, the only law that matters here
L = λW
L = average number of items in the system. λ = arrival rate (req/sec). W = average time per item (sec).
Worked example: API receives 10K req/sec, average latency 50ms. Concurrent in-flight requests = 10K × 0.05 = 500. So sizing needs: 500-connection DB pool, 500-thread (or 500 virtual threads, or asyncio with concurrency=500) capacity. Below that, requests queue; latency grows.
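The same arithmetic as a tiny helper (the function name is illustrative, not a real sizing tool):

```python
def inflight(throughput_rps: float, avg_latency_s: float) -> float:
    """Little's Law: L = lambda * W, the average number of in-flight requests."""
    return throughput_rps * avg_latency_s

# Worked example from above: 10K req/sec at 50ms average latency.
assert inflight(10_000, 0.050) == 500.0
# A DB pool or worker count below ~500 forces requests to queue.
```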
The Universal Scalability Law
Throughput doesn't scale linearly forever. Adding threads gives diminishing returns due to (a) contention on shared resources and (b) coherence overhead (cross-CPU coordination). At some N, more threads make throughput worse. The USL models relative capacity as C(N) = N / (1 + α(N−1) + βN(N−1)), where α is the contention coefficient and β the coherence coefficient.
The optimal N depends on the workload. Find it empirically: load-test at N=2,4,8,16,32,64. Throughput plateaus or drops past optimal. Don't assume "more = better."
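A sketch of that sweep. Real numbers would come from a load generator (wrk, vegeta, locust, ...); here the USL formula itself stands in for the measurements, with hypothetical α and β coefficients:

```python
# Sweep concurrency and watch throughput peak, then fall (the USL knee).
# Replace usl_throughput() with an actual load-test run in practice.
ALPHA = 0.05   # contention coefficient (hypothetical)
BETA = 0.002   # coherence coefficient (hypothetical)

def usl_throughput(n: int) -> float:
    """Relative capacity at concurrency n under Gunther's USL."""
    return n / (1 + ALPHA * (n - 1) + BETA * n * (n - 1))

for n in (2, 4, 8, 16, 32, 64):
    print(f"N={n:3d}: relative throughput {usl_throughput(n):5.2f}")
# Rises through N=16, flattens, then drops by N=64: "more threads" is not free.
```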
Latency percentiles
| Percentile | What it measures | Use for |
|---|---|---|
| p50 (median) | Typical experience | Trend over time |
| p95 | "Most users" tail | SLO targets |
| p99 | Worst-1%, what frustrated users see | SLO violations |
| p99.9 | Catastrophe | Capacity planning |
| max | Single worst case | Debugging |
Average is dangerous: it hides bimodal distributions (90% fast + 10% terrible looks the same on average as 100% mediocre).
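A quick illustration with synthetic numbers chosen to match the 90/10 example above (the workload shape and the nearest-rank percentile helper are assumptions, not real telemetry):

```python
import random

# Bimodal workload: 90% of requests at ~10ms, 10% at ~500ms.
random.seed(42)
latencies_ms = [random.gauss(10, 2) if random.random() < 0.9 else random.gauss(500, 50)
                for _ in range(100_000)]

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for monitoring-style summaries."""
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

mean = sum(latencies_ms) / len(latencies_ms)
print(f"mean {mean:6.1f} ms")                          # ~59 ms: looks "mediocre"
print(f"p50  {percentile(latencies_ms, 50):6.1f} ms")  # ~10 ms: typical user is fine
print(f"p99  {percentile(latencies_ms, 99):6.1f} ms")  # ~550 ms: what the tail feels
```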
Backpressure, bounded latency under load
When arrival rate exceeds service rate, queue grows. Without intervention, queue depth → infinity, latency → infinity. Three responses:
- Drop: return 503 and free resources (a minimal sketch follows this list).
- Block: apply admission control upstream; accept no new requests until the queue drains.
- Bulk-process: serve in batches and prioritize.
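A minimal drop-based sketch, using a bounded `queue.Queue` as the admission gate (`handle_request` and the 500-item cap are illustrative, not a real framework API):

```python
import queue

# A bounded queue caps in-flight work; when full, shed load instead of
# letting queue depth and latency grow without bound.
MAX_QUEUE_DEPTH = 500  # e.g. from the Little's Law sizing above
work_queue: "queue.Queue[str]" = queue.Queue(maxsize=MAX_QUEUE_DEPTH)

def handle_request(request: str) -> int:
    """Admit work if there's room; otherwise drop. Returns an HTTP-style status."""
    try:
        work_queue.put_nowait(request)  # non-blocking admit
        return 202                      # accepted for processing
    except queue.Full:
        return 503                      # drop: bounded queue -> bounded latency
```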
The interview answer: "Throughput and latency trade off. Little's Law connects them via concurrency. Tail latency (p99/p99.9) matters more than average. Beyond the optimal concurrency, throughput drops; find that optimum empirically. Backpressure is mandatory in production to bound latency under burst load."
Key points
- •Throughput: system-level capacity (req/sec)
- •Latency: per-op time, especially p50/p99/p99.9 percentiles
- •Little's Law: L = λW (concurrent in-flight = throughput × avg latency)
- •Batching trades latency for throughput (wait to fill a batch, then process many at once; see the sketch after this list)
- •Queue length grows without bound when arrival rate exceeds service rate; queueing delay explodes as utilization approaches 100%
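A micro-batching sketch for the batching point above; `get_item` and `write_batch` are hypothetical callbacks standing in for your queue and storage layer:

```python
import time

# Each item waits up to MAX_WAIT_S for the batch to fill: per-item latency
# is traded for per-batch throughput (one bulk op amortizes per-op overhead).
BATCH_SIZE = 32
MAX_WAIT_S = 0.010  # caps the latency cost of waiting for a full batch

def batch_writer(get_item, write_batch):
    """get_item(timeout) -> item or None on timeout; write_batch(list) does one bulk write."""
    while True:
        batch, deadline = [], time.monotonic() + MAX_WAIT_S
        while len(batch) < BATCH_SIZE:
            item = get_item(timeout=max(0.0, deadline - time.monotonic()))
            if item is None:        # timed out: flush a partial batch
                break
            batch.append(item)
        if batch:
            write_batch(batch)      # one bulk write instead of len(batch) small ones
```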
Tradeoffs
| Option | Pros | Cons | When to use |
|---|---|---|---|
| Optimize for throughput | Maximizes utilization; batching amortizes per-op overhead | Higher per-op latency; deeper queues under load | Batch jobs, async pipelines, ETL, analytics queries |
| Optimize for latency | Fast, predictable responses; tight tail percentiles | Lower utilization; less batching means more per-op overhead | Interactive services, trading, real-time |
| Adaptive (backpressure-driven) | Bounds latency under burst load; degrades gracefully | More moving parts: admission control, load shedding | Production services with mixed workload |
Follow-up questions
▸What's Little's Law and why does it matter?
▸Why does p99 matter more than average?
▸How does batching affect latency?
▸What's USL (Universal Scalability Law)?
Gotchas
- !Average latency hides the tail; always look at p99/p99.9
- !Adding threads beyond optimal N HURTS throughput (USL)
- !Unbounded queues turn brief overload into permanent latency degradation
- !Cache hit rate drops when concurrency increases (cache-line contention, eviction)