Amdahl's Law
Maximum speedup from parallelisation = 1 / (s + p/N), where s is the serial fraction, p = 1 − s is the parallel fraction, and N is the processor count. Even a small serial fraction caps maximum speedup: 5% serial means at most 20x, no matter how many cores. The implication: optimising the serial part is often the highest-leverage work.
What it is
Amdahl's Law (Gene Amdahl, 1967) gives the maximum speedup achievable by parallelising a workload. The formula:
Speedup(N) = 1 / (s + (1 - s) / N)
Where s is the serial fraction (the part that can't be parallelised) and N is the number of processors.
The shocking implication: even a small serial fraction caps speedup. A workload with 5% serial code can never run more than 20x faster, no matter how many cores are thrown at it. As N → ∞, speedup → 1/s.
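A minimal sketch of the formula in Python, showing how quickly the curve flattens toward the 1/s ceiling (the specific core counts are just illustrative):

```python
def amdahl_speedup(s, n):
    """Maximum speedup for serial fraction s on n processors."""
    return 1.0 / (s + (1.0 - s) / n)

# 5% serial: the asymptotic ceiling is 1/s = 20x. Even 1024 cores
# get close to it but never reach it.
print(amdahl_speedup(0.05, 8))     # ~5.93x on 8 cores
print(amdahl_speedup(0.05, 1024))  # ~19.64x on 1024 cores
print(1 / 0.05)                    # 20x ceiling as N -> infinity
```

Note the diminishing returns: going from 8 to 1024 cores (128x the hardware) buys only about a 3.3x improvement.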
Why it matters
It's a corrective to the intuition "twice the cores = twice the speed". That's true for the parallel part, but the serial part doesn't get faster.
Practical numbers:
| Serial fraction | Max speedup |
|---|---|
| 50% | 2x |
| 10% | 10x |
| 5% | 20x |
| 1% | 100x |
| 0.1% | 1000x |
Massive speedup requires very small serial fractions. Real workloads rarely have less than 1-5% serial overhead (setup, aggregation, I/O coordination), so 20-100x is a typical practical ceiling for parallelisation.
What is "serial" in real systems
Common serial fractions:
- Setup and teardown: initial config load, final result aggregation, opening/closing connections.
- Lock contention: even brief lock holds show up as serial when many threads compete.
- Coordinated I/O: single-threaded log writer, sequential database flush, shared cache invalidation.
- Result merging: parallel computation followed by sequential combine (the "fan-in" half of fan-out/fan-in).
- Inherently sequential algorithms: dynamic programming with chain dependencies, sequential decoders.
Profile the workload at varying core counts. If speedup is sub-linear, the gap is the serial fraction.
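One way to quantify that gap is the Karp–Flatt metric: given a measured speedup S on N cores, the experimentally determined serial fraction is e = (1/S − 1/N) / (1 − 1/N). A sketch, with hypothetical measurements:

```python
def karp_flatt(speedup, n):
    """Experimentally determined serial fraction (Karp-Flatt metric)
    from a measured speedup on n processors."""
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)

# Hypothetical measurements: 6.0x on 8 cores, 9.5x on 16 cores.
print(karp_flatt(6.0, 8))   # ~0.048 -> roughly 5% serial
print(karp_flatt(9.5, 16))  # ~0.046 -> consistent estimate
```

If the estimate stays roughly constant as N grows, Amdahl explains the workload; if it grows with N, there is contention or coherency overhead beyond the serial fraction (USL territory, below).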
How to fight Amdahl
Two strategies.
Reduce the serial fraction. Often the highest-leverage move. Replace a sequential aggregator with a tree-reduce. Replace a global lock with sharded locks. Move single-threaded I/O to a per-thread design. Each shaves the serial part, raising the asymptote.
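The tree-reduce idea can be sketched in a few lines. This version runs sequentially, but each pass combines independent pairs, so the dependency depth is O(log n) instead of the O(n) chain of a sequential fold:

```python
from operator import add

def tree_reduce(op, values):
    """Combine values pairwise: O(log n) parallel depth
    instead of an O(n) sequential fold."""
    while len(values) > 1:
        # Every pair in a pass is independent of the others,
        # so a real implementation could run each pass in parallel.
        pairs = [op(values[i], values[i + 1])
                 for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:          # carry an unpaired trailing element
            pairs.append(values[-1])
        values = pairs
    return values[0]

print(tree_reduce(add, list(range(16))))  # 120, same as sum(range(16))
```

Swapping a sequential aggregator for this shape turns a serial fan-in step into log-depth work, shrinking s and raising the 1/s asymptote.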
Re-architect to remove the serial part. Some serial fractions are removable; others are inherent. If the workload genuinely requires a final sequential step (writing the result to a single sink), it can't be parallelised. But often, multiple sinks, partitioning the output, or eliminating the synchronisation will work.
Amdahl vs USL
Amdahl's Law is the simple upper bound. The Universal Scalability Law (USL, Neil Gunther) is more realistic for real systems:
Speedup(N) = N / (1 + α(N - 1) + β·N(N - 1))
Where α is the contention coefficient and β is the coherency coefficient. The β term lets USL model the case where adding cores actually slows the system down (lock contention overhead, cache coherency traffic). Real systems often peak at some N and degrade beyond it; USL captures that, Amdahl doesn't.
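A quick sketch of the USL curve. The coefficients here are assumed for illustration, not measured; note the peak (at N = √((1 − α)/β) under USL) followed by degradation, which Amdahl's monotone curve can never show:

```python
import math

def usl_speedup(n, alpha, beta):
    """Universal Scalability Law: alpha = contention, beta = coherency."""
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

# Illustrative coefficients: 3% contention, 0.05% coherency cost.
alpha, beta = 0.03, 0.0005
for n in (1, 8, 32, 64, 128):
    print(n, round(usl_speedup(n, alpha, beta), 2))
# Speedup rises, peaks, then falls as the beta term dominates.

print(math.sqrt((1 - alpha) / beta))  # peak near N ~ 44 cores
```

With these coefficients, 128 cores are slower than 32: the coherency term grows quadratically and eventually swamps the parallel gain.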
For most performance reasoning, Amdahl is enough. For capacity planning of large systems, USL is more accurate.
The right sequence when "we need more performance" comes up
Adding cores is rarely the right first move. Measure at varying parallelism levels and compute observed speedup. Compare it to Amdahl's prediction at the suspected serial fraction. Profile to find the serial parts (lock contention, I/O serialisation). Reduce the serial fraction by re-architecting those hotspots. Only then add cores, knowing the new ceiling.
Skipping straight to "add cores" wastes money. Cores are the easy lever; the serial fraction is the actual bottleneck.
Key points
- Speedup(N) = 1 / (s + (1-s)/N). As N → ∞, speedup → 1/s.
- 5% serial = 20x maximum speedup. 1% serial = 100x. 10% = 10x.
- Real systems have multiple serial parts: I/O, lock contention, sequential bottlenecks.
- Highest-leverage optimisation: reduce the serial fraction, not just add cores.
- Universal Scalability Law (USL) extends Amdahl with contention and coherency terms, often more realistic.
Follow-up questions
- What's the practical lesson?
- How does Amdahl differ from USL?
- Does Amdahl apply to distributed systems?
- Why don't speedups in benchmarks match Amdahl predictions?
Gotchas
- Adding cores past the Amdahl ceiling is wasted money.
- "Embarrassingly parallel" workloads still have serial setup/teardown that limits speedup at scale.
- Lock contention is a hidden serial fraction; profile to find it.
- Per-core wins (better algorithm, less contention) sometimes beat throwing more cores at the problem.
- Real systems are often USL-shaped (degradation past a peak), not Amdahl-shaped (asymptote).