Performance Engineering at Scale
Why P99 Matters More Than P50
Your median latency is the experience most users have. Your p99 latency is the experience your most valuable users have. High-volume customers, power users running complex queries, and API integrators pushing large payloads all cluster in the tail. When your p99 doubles, the people who notice are the ones generating the most revenue.
A stable p50 with a doubled p99 is a diagnostic fingerprint. It tells you the problem affects a subset of requests, not all of them. Your first move is segmentation: which requests slowed down? Filter by endpoint, payload size, customer tier, geographic region, and cache hit rate. At Uber, a p99 spike that only affected rides with more than 3 stops led engineers to a quadratic routing calculation that was invisible at p50 because most rides have one stop.
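A minimal sketch of that segmentation pass, assuming request logs exported to a CSV with illustrative column names (endpoint, customer_tier, region, cache_hit, latency_ms):

```python
# Compute p50/p99 per slice of traffic so you can see which subset actually slowed down.
# Column names and the CSV export are illustrative, not a specific logging pipeline.
import pandas as pd

def latency_by_segment(path: str, dimensions: list[str]) -> None:
    df = pd.read_csv(path)
    for dim in dimensions:
        pct = df.groupby(dim)["latency_ms"].quantile([0.50, 0.99]).unstack()
        pct.columns = ["p50_ms", "p99_ms"]
        pct["requests"] = df.groupby(dim).size()   # sample size, so tiny slices don't mislead
        print(f"--- by {dim} ---")
        print(pct.sort_values("p99_ms", ascending=False).head())

latency_by_segment("requests.csv", ["endpoint", "customer_tier", "region", "cache_hit"])
```

Whichever dimension shows the spike concentrated in one slice is where you dig next.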
GC pauses are another classic p99 offender. If your service runs on the JVM and a deploy increased object allocation rates, the garbage collector will occasionally pause for tens or hundreds of milliseconds. The median request sails through between pauses. The unlucky 1% lands during a GC stop-the-world event. Check your GC logs alongside your latency percentiles. If p99 spikes correlate with GC pause timestamps, you have your answer.
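A rough way to test that correlation, assuming you have already parsed the GC log into (start, pause) pairs and pulled timestamps for requests slower than the old p99 (plain-Python sketch, names illustrative):

```python
# What fraction of the slow requests landed inside a GC pause window?
# gc_pauses: list of (start_epoch_seconds, pause_seconds) parsed from GC logs.
# slow_requests: list of (epoch_seconds, latency_ms) for requests above the old p99.
from bisect import bisect_right

def fraction_during_gc(slow_requests, gc_pauses, slack_s=0.1):
    gc_pauses = sorted(gc_pauses)                      # order by pause start time
    starts = [start for start, _ in gc_pauses]
    hits = 0
    for ts, _latency in slow_requests:
        i = bisect_right(starts, ts) - 1               # most recent pause starting at or before ts
        if i >= 0:
            start, pause = gc_pauses[i]
            if ts <= start + pause + slack_s:          # request overlapped the stop-the-world window
                hits += 1
    return hits / len(slow_requests) if slow_requests else 0.0
```

A fraction near 1.0 means GC is your culprit; a fraction near the background rate means keep looking.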
Cold caches after a deploy are a third common cause. If your deploy restarts instances, the in-memory cache is empty. The first request for each cache key hits the database. Once the cache warms, latency drops back to normal. But if you have millions of cache keys, the warming period can last 30 minutes, and during that window, your p99 is dominated by cache misses.
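If that is the diagnosis, one common mitigation is to pre-warm the hottest keys before an instance takes traffic. A sketch, where cache, db, and get_hottest_keys are hypothetical clients and helpers rather than any specific library:

```python
# Pre-populate the in-memory cache with the most frequently accessed keys after a restart,
# so the miss cost is paid before real users arrive. All names here are hypothetical.
def warm_cache(cache, db, get_hottest_keys, limit=50_000):
    for key in get_hottest_keys(limit):      # e.g. top keys by access count over the last 24h
        if cache.get(key) is None:
            cache.set(key, db.fetch(key))
```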
Capacity Modeling That Actually Works
When product says "we are launching in Brazil next quarter," you need a number: can the current architecture handle it? Most engineers answer with a napkin calculation. Staff engineers answer with a model.
Start with traffic analysis from an existing region of similar size. Decompose it into request types, not just total QPS. A region doing 10,000 QPS might be 6,000 reads, 3,000 writes, and 1,000 search queries. Each hits different infrastructure bottlenecks. Reads are cache-friendly and cheap. Writes go to the database primary. Search hits Elasticsearch. Scaling to handle 6,000 more QPS of reads is a different problem than scaling for 3,000 more writes.
Next, identify your actual bottleneck. Run load tests that replay production traffic patterns against a staging environment. Ramp linearly until something breaks. Maybe the database connection pool saturates at 4,000 writes per second. Maybe Elasticsearch falls behind on indexing at 1,500 search queries per second. The first thing that breaks is your capacity constraint, and your capacity plan is whatever work pushes that constraint past your projected load plus a 30% safety margin.
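A sketch of the resulting model, using the load-test break points and traffic decomposition from the examples above (all figures illustrative):

```python
# Per-request-type projected load vs. measured bottleneck capacity, with a 30% safety margin.
# Capacities are where each subsystem broke in the load test; traffic numbers are illustrative.
SAFETY_MARGIN = 1.3

measured_capacity_qps = {
    "reads (cache + replicas)": 25_000,
    "writes (db primary)": 4_000,
    "search (elasticsearch)": 1_500,
}
current_qps    = {"reads (cache + replicas)": 6_000, "writes (db primary)": 3_000, "search (elasticsearch)": 1_000}
new_region_qps = {"reads (cache + replicas)": 6_000, "writes (db primary)": 3_000, "search (elasticsearch)": 1_000}

for subsystem, capacity in measured_capacity_qps.items():
    projected = (current_qps[subsystem] + new_region_qps[subsystem]) * SAFETY_MARGIN
    status = "OK" if projected <= capacity else "NEEDS SCALING"
    print(f"{subsystem:26s} projected {projected:>8.0f} qps vs capacity {capacity:>6d} -> {status}")
```

With these numbers the reads fit comfortably, while the write path and search cluster are the constraints the plan has to address.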
The mistake nearly everyone makes: testing with uniform synthetic traffic. Real traffic has a thundering herd at 9 AM when people start work, a quiet period at 3 AM, and periodic spikes from batch jobs and marketing campaigns. Your load test should replay a 24-hour production traffic recording, not a flat line of constant QPS. Netflix's approach of replaying shadowed production traffic to staging environments is the gold standard.
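One way to approximate that, short of full shadow replay, is to derive an hourly rate schedule from a 24-hour production recording and drive the load generator with it. A sketch, assuming a CSV of request timestamps:

```python
# Derive an hourly target-RPS schedule from a one-day production traffic recording,
# so the load test ramps with the real diurnal curve instead of a flat QPS line.
import pandas as pd

def hourly_rate_schedule(path: str, scale: float = 1.0) -> list[tuple[int, float]]:
    ts = pd.read_csv(path)["epoch_seconds"]            # assumes the recording covers exactly one day
    hours = pd.to_datetime(ts, unit="s").dt.hour
    per_hour = hours.value_counts().sort_index()        # requests observed in each hour of the day
    return [(hour, count / 3600.0 * scale) for hour, count in per_hour.items()]

# Feed the schedule to your load tool as per-stage target RPS; scale=1.3 adds headroom.
for hour, rps in hourly_rate_schedule("prod_requests.csv", scale=1.3):
    print(f"{hour:02d}:00  target {rps:,.0f} rps")
```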
Building Regression Detection Into CI
You should catch performance regressions before they reach production. The challenge is doing it without drowning in false positives.
Step one: establish a stable baseline. Run your benchmark suite against the main branch nightly. Store the results over a rolling 14-day window. Use the median of those runs as your baseline, not a single run, because individual benchmark results have variance from system load, JIT warmup, and garbage collection timing.
Step two: run the same suite on every pull request. Compare the PR results against the baseline. But do not alert on every delta. A benchmark that runs in 47ms on main and 49ms on the PR is within noise. You need a statistical threshold. A common approach: flag regressions only when the change exceeds 2 standard deviations from the baseline median and the absolute change is above a minimum threshold (say, 5ms or 10%). Both conditions must be true.
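A minimal sketch of that two-condition gate, where baseline_runs_ms is the rolling 14-day window of nightly results on main (the 5ms / 10% / 2-sigma thresholds mirror the text above, but the exact policy is a team choice):

```python
# Flag a regression only when the delta is both statistically and practically significant.
import statistics

def is_regression(baseline_runs_ms: list[float], pr_result_ms: float,
                  min_abs_ms: float = 5.0, min_rel: float = 0.10) -> bool:
    baseline = statistics.median(baseline_runs_ms)
    noise = statistics.stdev(baseline_runs_ms)
    delta = pr_result_ms - baseline
    significant = delta > 2 * noise                                  # beyond normal run-to-run variance
    material = delta >= min_abs_ms or delta >= baseline * min_rel    # big enough to care about
    return significant and material

# e.g. is_regression([46.8, 47.1, 47.5, 46.9, 48.0, 47.2, 47.7], 49.0) -> False (under the 5ms/10% floor)
#      is_regression([46.8, 47.1, 47.5, 46.9, 48.0, 47.2, 47.7], 58.0) -> True
```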
Step three: make it actionable. When a regression is detected, the CI output should include which specific benchmark regressed, by how much, and a link to the flame graph diff between baseline and PR. Engineers will ignore a red flag that just says "performance regression detected." They will investigate a message that says "parseUserPayload p99 increased from 12ms to 31ms; flame graph shows new JSON validation step added 18ms."
The teams that get this right, like Android's benchmarking infrastructure and Chrome's perf bots, treat regression detection as a product. It has an SLO for false positive rate (under 5%), a process for triaging alerts, and an owner who maintains it.
Performance Budgets as an Engineering Practice
A performance budget assigns each component of a request path a maximum latency contribution. The API gateway gets 10ms. Authentication gets 5ms. Business logic gets 80ms. Database queries get 30ms. Whatever the components do not consume is headroom for network transit and queuing. The total budget is your SLO: 200ms at p99.
When a team wants to add a new middleware layer or an extra database call, they have to fit it within their budget. If auth wants to add a rate-limit check that would push its total cost to 8ms, the team needs to find 3ms of savings elsewhere in the auth path to stay within its 5ms budget, or negotiate for a larger budget from the total pool. This turns performance from a vague goal into a concrete constraint with clear ownership.
Google's web team popularized this approach for page load performance, but it works equally well for backend services. The key is measurement infrastructure that breaks down latency by component on every request. If you cannot attribute latency to specific components, you cannot enforce budgets. Distributed tracing with span-level timing is the minimum requirement.
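A sketch of the enforcement check, assuming your tracing pipeline can hand you per-component durations for a request (component names and budgets follow the example above):

```python
# Compare span-level timings against per-component budgets and report any overages in ms.
BUDGET_MS = {"gateway": 10, "auth": 5, "business_logic": 80, "db": 30}

def over_budget(spans: dict[str, float]) -> dict[str, float]:
    """Return the components that exceeded their budget and by how many milliseconds."""
    return {
        component: duration - BUDGET_MS[component]
        for component, duration in spans.items()
        if component in BUDGET_MS and duration > BUDGET_MS[component]
    }

# e.g. over_budget({"gateway": 8.2, "auth": 9.1, "business_logic": 74.0, "db": 31.5})
# -> {"auth": 4.1, "db": 1.5}
```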
Budget violations should be visible, not punitive. A weekly report showing which components are over budget and trending in which direction gives teams the information to prioritize without turning performance into a blame game.
Sample Questions
Your service's p99 latency doubled after a recent deploy but p50 is unchanged. How do you investigate?
Tail latency questions test whether you understand percentile distributions. The interviewer wants to see you reason about which subset of requests slowed down, whether it is GC pressure, cold caches, a specific code path, or a dependency timeout affecting a fraction of traffic.
Product wants to launch in a new region. How do you model whether the current architecture can handle the load?
Capacity modeling questions separate engineers who guess from engineers who measure. Strong answers discuss traffic pattern analysis, load testing methodology that mirrors real behavior, and systematic identification of bottlenecks before they hit production.
How do you build a performance regression detection system into your CI/CD pipeline?
The interviewer is looking for statistical rigor, not just 'run benchmarks.' Key areas: establishing stable baselines, determining statistical significance of regressions, and avoiding false positives that train the team to ignore alerts.
Evaluation Criteria
- Follows a systematic investigation methodology rather than guessing at causes
- Demonstrates capacity modeling skills grounded in real traffic patterns, not back-of-napkin estimates
- Understands what tail latency reveals that median latency hides
- Builds proactive detection systems rather than only debugging reactively
- Knows when to optimize and when optimization is premature or targeting the wrong layer
Key Points
- P99 and p50 tell completely different stories. A stable p50 with a spiking p99 often points to GC pauses, cold cache misses on a subset of keys, or a specific code path triggered by a fraction of requests. You need to segment before you theorize.
- Capacity models built on synthetic uniform traffic are fiction. Real traffic is bursty, follows time-of-day curves, and has hot keys. Your load test needs to replay production traffic patterns or your capacity estimate will be wrong by 2x or more.
- Performance budgets per component force teams to own their latency contribution. If the API gateway gets 10ms, auth gets 5ms, and the business logic gets 50ms, every team knows their ceiling and can optimize independently.
- Profiling in development and profiling in production reveal different truths. Your local JMH benchmark runs with a warm JIT and no GC pressure. Production has cold starts, noisy neighbors, and network jitter. Always validate with production profiling.
- Regression detection needs statistical rigor. A 3% latency increase might be noise or might be real. Without proper baseline windows, confidence intervals, and minimum sample sizes, your detection system either misses real regressions or cries wolf daily.
Common Mistakes
- Optimizing based on averages instead of percentiles. An average latency of 50ms can hide the fact that 1% of your users are waiting 3 seconds.
- Load testing with uniform synthetic traffic when real traffic is bursty and follows power-law distributions on key access patterns.
- Treating all latency as equal when user-facing request latency and background job latency have completely different impact profiles and optimization priorities.
- Not accounting for graceful degradation under overload, then discovering during a traffic spike that your service falls off a cliff instead of shedding load.