Performance Engineering at Scale
Why P99 Matters More Than P50
Your median latency is the experience most users have. Your p99 latency is the experience your most valuable users have. High-volume customers, power users running complex queries, and API integrators pushing large payloads all cluster in the tail. When your p99 doubles, the people who notice are the ones generating the most revenue.
A stable p50 with a doubled p99 is a diagnostic fingerprint. It tells you the problem affects a subset of requests, not all of them. Your first move is segmentation: which requests slowed down? Filter by endpoint, payload size, customer tier, geographic region, and cache hit rate. At Uber, a p99 spike that only affected rides with more than 3 stops led engineers to a quadratic routing calculation that was invisible at p50 because most rides have one stop.
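A minimal sketch of that segmentation pass, assuming request logs exported to a CSV with illustrative column names (endpoint, customer_tier, region, cache_hit, latency_ms):

```python
# Compute p50/p99 per slice of traffic so you can see which subset actually slowed down.
# Column names and the CSV export are illustrative, not a specific logging pipeline.
import pandas as pd

def latency_by_segment(path: str, dimensions: list[str]) -> None:
    df = pd.read_csv(path)
    for dim in dimensions:
        pct = df.groupby(dim)["latency_ms"].quantile([0.50, 0.99]).unstack()
        pct.columns = ["p50_ms", "p99_ms"]
        pct["requests"] = df.groupby(dim).size()   # sample size, so tiny slices don't mislead
        print(f"--- by {dim} ---")
        print(pct.sort_values("p99_ms", ascending=False).head())

latency_by_segment("requests.csv", ["endpoint", "customer_tier", "region", "cache_hit"])
```

Whichever dimension shows the spike concentrated in one slice is where you dig next.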
GC pauses are another classic p99 offender. If your service runs on the JVM and a deploy increased object allocation rates, the garbage collector will occasionally pause for tens or hundreds of milliseconds. The median request sails through between pauses. The unlucky 1% lands during a GC stop-the-world event. Check your GC logs alongside your latency percentiles. If p99 spikes correlate with GC pause timestamps, you have your answer.
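A rough way to test that correlation, assuming you have already parsed the GC log into (start, pause) pairs and pulled timestamps for requests slower than the old p99 (plain-Python sketch, names illustrative):

```python
# What fraction of the slow requests landed inside a GC pause window?
# gc_pauses: list of (start_epoch_seconds, pause_seconds) parsed from GC logs.
# slow_requests: list of (epoch_seconds, latency_ms) for requests above the old p99.
from bisect import bisect_right

def fraction_during_gc(slow_requests, gc_pauses, slack_s=0.1):
    gc_pauses = sorted(gc_pauses)                      # order by pause start time
    starts = [start for start, _ in gc_pauses]
    hits = 0
    for ts, _latency in slow_requests:
        i = bisect_right(starts, ts) - 1               # most recent pause starting at or before ts
        if i >= 0:
            start, pause = gc_pauses[i]
            if ts <= start + pause + slack_s:          # request overlapped the stop-the-world window
                hits += 1
    return hits / len(slow_requests) if slow_requests else 0.0
```

A fraction near 1.0 means GC is your culprit; a fraction near the background rate means keep looking.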
Cold caches after a deploy are a third common cause. If your deploy restarts instances, the in-memory cache is empty. The first request for each cache key hits the database. Once the cache warms, latency drops back to normal. But if you have millions of cache keys, the warming period can last 30 minutes, and during that window, your p99 is dominated by cache misses.
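If that is the diagnosis, one common mitigation is to pre-warm the hottest keys before an instance takes traffic. A sketch, where cache, db, and get_hottest_keys are hypothetical clients and helpers rather than any specific library:

```python
# Pre-populate the in-memory cache with the most frequently accessed keys after a restart,
# so the miss cost is paid before real users arrive. All names here are hypothetical.
def warm_cache(cache, db, get_hottest_keys, limit=50_000):
    for key in get_hottest_keys(limit):      # e.g. top keys by access count over the last 24h
        if cache.get(key) is None:
            cache.set(key, db.fetch(key))
```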
Capacity Modeling That Actually Works
When product says "we are launching in Brazil next quarter," you need a number: can the current architecture handle it? Most engineers answer with a napkin calculation. Staff engineers answer with a model.
Start with traffic analysis from an existing region of similar size. Decompose it into request types, not just total QPS. A region doing 10,000 QPS might be 6,000 reads, 3,000 writes, and 1,000 search queries. Each hits different infrastructure bottlenecks. Reads are cache-friendly and cheap. Writes go to the database primary. Search hits Elasticsearch. Scaling to handle 6,000 more QPS of reads is a different problem than scaling for 3,000 more writes.
Next, identify your actual bottleneck. Run load tests that replay production traffic patterns against a staging environment. Ramp linearly until something breaks. Maybe the database connection pool saturates at 4,000 writes per second. Maybe Elasticsearch falls behind on indexing at 1,500 search queries per second. The first thing that breaks is your capacity constraint, and your capacity plan is whatever work pushes that constraint past your projected load plus a 30% safety margin.
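A sketch of the resulting model, using the load-test break points and traffic decomposition from the examples above (all figures illustrative):

```python
# Per-request-type projected load vs. measured bottleneck capacity, with a 30% safety margin.
# Capacities are where each subsystem broke in the load test; traffic numbers are illustrative.
SAFETY_MARGIN = 1.3

measured_capacity_qps = {
    "reads (cache + replicas)": 25_000,
    "writes (db primary)": 4_000,
    "search (elasticsearch)": 1_500,
}
current_qps    = {"reads (cache + replicas)": 6_000, "writes (db primary)": 3_000, "search (elasticsearch)": 1_000}
new_region_qps = {"reads (cache + replicas)": 6_000, "writes (db primary)": 3_000, "search (elasticsearch)": 1_000}

for subsystem, capacity in measured_capacity_qps.items():
    projected = (current_qps[subsystem] + new_region_qps[subsystem]) * SAFETY_MARGIN
    status = "OK" if projected <= capacity else "NEEDS SCALING"
    print(f"{subsystem:26s} projected {projected:>8.0f} qps vs capacity {capacity:>6d} -> {status}")
```

With these numbers the reads fit comfortably, while the write path and search cluster are the constraints the plan has to address.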
The mistake nearly everyone makes: testing with uniform synthetic traffic. Real traffic has a thundering herd at 9 AM when people start work, a quiet period at 3 AM, and periodic spikes from batch jobs and marketing campaigns. Your load test should replay a 24-hour production traffic recording, not a flat line of constant QPS. Netflix's approach of replaying shadowed production traffic to staging environments is the gold standard.
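One way to approximate that, short of full shadow replay, is to derive an hourly rate schedule from a 24-hour production recording and drive the load generator with it. A sketch, assuming a CSV of request timestamps:

```python
# Derive an hourly target-RPS schedule from a one-day production traffic recording,
# so the load test ramps with the real diurnal curve instead of a flat QPS line.
import pandas as pd

def hourly_rate_schedule(path: str, scale: float = 1.0) -> list[tuple[int, float]]:
    ts = pd.read_csv(path)["epoch_seconds"]            # assumes the recording covers exactly one day
    hours = pd.to_datetime(ts, unit="s").dt.hour
    per_hour = hours.value_counts().sort_index()        # requests observed in each hour of the day
    return [(hour, count / 3600.0 * scale) for hour, count in per_hour.items()]

# Feed the schedule to your load tool as per-stage target RPS; scale=1.3 adds headroom.
for hour, rps in hourly_rate_schedule("prod_requests.csv", scale=1.3):
    print(f"{hour:02d}:00  target {rps:,.0f} rps")
```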
Building Regression Detection Into CI
You should catch performance regressions before they reach production. The challenge is doing it without drowning in false positives.
Step one: establish a stable baseline. Run your benchmark suite against the main branch nightly. Store the results over a rolling 14-day window. Use the median of those runs as your baseline, not a single run, because individual benchmark results have variance from system load, JIT warmup, and garbage collection timing.
Step two: run the same suite on every pull request. Compare the PR results against the baseline. But do not alert on every delta. A benchmark that runs in 47ms on main and 49ms on the PR is within noise. You need a statistical threshold. A common approach: flag regressions only when the change exceeds 2 standard deviations from the baseline median and the absolute change is above a minimum threshold (say, 5ms or 10%). Both conditions must be true.
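A minimal sketch of that two-condition gate, where baseline_runs_ms is the rolling 14-day window of nightly results on main (the 5ms / 10% / 2-sigma thresholds mirror the text above, but the exact policy is a team choice):

```python
# Flag a regression only when the delta is both statistically and practically significant.
import statistics

def is_regression(baseline_runs_ms: list[float], pr_result_ms: float,
                  min_abs_ms: float = 5.0, min_rel: float = 0.10) -> bool:
    baseline = statistics.median(baseline_runs_ms)
    noise = statistics.stdev(baseline_runs_ms)
    delta = pr_result_ms - baseline
    significant = delta > 2 * noise                                  # beyond normal run-to-run variance
    material = delta >= min_abs_ms or delta >= baseline * min_rel    # big enough to care about
    return significant and material

# e.g. is_regression([46.8, 47.1, 47.5, 46.9, 48.0, 47.2, 47.7], 49.0) -> False (under the 5ms/10% floor)
#      is_regression([46.8, 47.1, 47.5, 46.9, 48.0, 47.2, 47.7], 58.0) -> True
```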
Step three: make it actionable. When a regression is detected, the CI output should include which specific benchmark regressed, by how much, and a link to the flame graph diff between baseline and PR. Engineers will ignore a red flag that just says "performance regression detected." They will investigate a message that says "parseUserPayload p99 increased from 12ms to 31ms; flame graph shows new JSON validation step added 18ms."
The teams that get this right, like Android's benchmarking infrastructure and Chrome's perf bots, treat regression detection as a product. It has an SLO for false positive rate (under 5%), a process for triaging alerts, and an owner who maintains it.
Performance Budgets as an Engineering Practice
A performance budget assigns each component of a request path a maximum latency contribution. The API gateway gets 10ms. Authentication gets 5ms. Business logic gets 80ms. Database queries get 30ms. Whatever the components do not consume is headroom for network transit and queuing. The total budget is your SLO: 200ms at p99.
When a team wants to add a new middleware layer or an extra database call, they have to fit it within their budget. If auth wants to add a rate-limit check that would push its total cost to 8ms, the team needs to find 3ms of savings elsewhere in the auth path to stay within its 5ms budget, or negotiate for a larger budget from the total pool. This turns performance from a vague goal into a concrete constraint with clear ownership.
Google's web team popularized this approach for page load performance, but it works equally well for backend services. The key is measurement infrastructure that breaks down latency by component on every request. If you cannot attribute latency to specific components, you cannot enforce budgets. Distributed tracing with span-level timing is the minimum requirement.
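A sketch of the enforcement check, assuming your tracing pipeline can hand you per-component durations for a request (component names and budgets follow the example above):

```python
# Compare span-level timings against per-component budgets and report any overages in ms.
BUDGET_MS = {"gateway": 10, "auth": 5, "business_logic": 80, "db": 30}

def over_budget(spans: dict[str, float]) -> dict[str, float]:
    """Return the components that exceeded their budget and by how many milliseconds."""
    return {
        component: duration - BUDGET_MS[component]
        for component, duration in spans.items()
        if component in BUDGET_MS and duration > BUDGET_MS[component]
    }

# e.g. over_budget({"gateway": 8.2, "auth": 9.1, "business_logic": 74.0, "db": 31.5})
# -> {"auth": 4.1, "db": 1.5}
```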
Budget violations should be visible, not punitive. A weekly report showing which components are over budget and trending in which direction gives teams the information to prioritize without turning performance into a blame game.
Sample Questions
Your service's p99 latency doubled after a recent deploy but p50 is unchanged. How do you investigate?
Tail latency questions test whether you understand percentile distributions. The interviewer wants to see you reason about which subset of requests slowed down, whether it is GC pressure, cold caches, a specific code path, or a dependency timeout affecting a fraction of traffic.
Product wants to launch in a new region. How do you model whether the current architecture can handle the load?
Capacity modeling questions separate engineers who guess from engineers who measure. Strong answers discuss traffic pattern analysis, load testing methodology that mirrors real behavior, and systematic identification of bottlenecks before they hit production.
How do you build a performance regression detection system into your CI/CD pipeline?
The interviewer is looking for statistical rigor, not just 'run benchmarks.' Key areas: establishing stable baselines, determining statistical significance of regressions, and avoiding false positives that train the team to ignore alerts.
Evaluation Criteria
- Follows a systematic investigation methodology rather than guessing at causes
- Demonstrates capacity modeling skills grounded in real traffic patterns, not back-of-napkin estimates
- Understands what tail latency reveals that median latency hides
- Builds proactive detection systems rather than only debugging reactively
- Knows when to optimize and when optimization is premature or targeting the wrong layer
Key Points
- P99 and p50 tell completely different stories. A stable p50 with a spiking p99 often points to GC pauses, cold cache misses on a subset of keys, or a specific code path triggered by a fraction of requests. You need to segment before you theorize.
- Capacity models built on synthetic uniform traffic are fiction. Real traffic is bursty, follows time-of-day curves, and has hot keys. Your load test needs to replay production traffic patterns or your capacity estimate will be wrong by 2x or more.
- Performance budgets per component force teams to own their latency contribution. If the API gateway gets 10ms, auth gets 5ms, and the business logic gets 50ms, every team knows their ceiling and can optimize independently.
- Profiling in development and profiling in production reveal different truths. Your local JMH benchmark runs with a warm JIT and no GC pressure. Production has cold starts, noisy neighbors, and network jitter. Always validate with production profiling.
- Regression detection needs statistical rigor. A 3% latency increase might be noise or might be real. Without proper baseline windows, confidence intervals, and minimum sample sizes, your detection system either misses real regressions or cries wolf daily.
Common Mistakes
- Optimizing based on averages instead of percentiles. An average latency of 50ms can hide the fact that 1% of your users are waiting 3 seconds.
- Load testing with uniform synthetic traffic when real traffic is bursty and follows power-law distributions on key access patterns.
- Treating all latency as equal when user-facing request latency and background job latency have completely different impact profiles and optimization priorities.
- Not accounting for graceful degradation under overload, then discovering during a traffic spike that your service falls off a cliff instead of shedding load.