Test Coverage & Effectiveness
The Coverage Trap
A codebase with 90% test coverage can still have serious quality problems. Coverage only measures which lines of code get executed during a test run. It says nothing about whether the assertions are meaningful. A test that calls a function and checks that it doesn't throw an exception "covers" that code but catches almost nothing.
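To make the gap concrete, here is a hypothetical sketch (the function and test names are made up): both tests execute `apply_discount`, so both count identically toward coverage, but only the second one can catch a bug.

```python
def apply_discount(price, percent):
    # Hypothetical function under test.
    return price * (1 - percent / 100)

def test_runs_without_error():
    # Executes the code, so coverage tools mark these lines as covered,
    # but asserts nothing about the result: most bugs would slip through.
    apply_discount(100, 10)

def test_discount_value():
    # Pins down actual behavior: a 10% discount on 100 must be 90.
    assert apply_discount(100, 10) == 90
```

Both tests produce the same coverage report; only the second has any defect-detection value.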
That said, coverage below 60% almost always correlates with higher defect escape rates. The metric is useful as a floor, not a ceiling. Track coverage trends to make sure new code isn't dragging the number down, but don't set a target above 80% and expect it to guarantee quality.
Mutation Testing for Real Effectiveness
Mutation testing is the answer to "are these tests actually good?" Tools like Stryker (JavaScript/TypeScript), PIT (Java), or mutmut (Python) make small changes to your source code (mutations) and then run your tests. If a mutation survives (tests still pass), that means your tests didn't catch a real code change. The mutation score (percentage of mutations killed) is a far better quality signal than coverage.
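A miniature, hand-rolled illustration of what those tools do automatically (the function and suite names here are invented): a boundary mutant survives a suite that never probes the boundary, and is killed by one that does.

```python
def is_adult(age):
    return age >= 18

def is_adult_mutant(age):
    # Mutation: >= replaced with > (a typical boundary mutant).
    return age > 18

def weak_suite(fn):
    # Never tests the boundary value, so the mutant behaves identically.
    return fn(30) is True and fn(5) is False

def strong_suite(fn):
    # Adds a boundary check, which distinguishes the mutant.
    return weak_suite(fn) and fn(18) is True

assert weak_suite(is_adult) and weak_suite(is_adult_mutant)        # mutant survives
assert strong_suite(is_adult) and not strong_suite(is_adult_mutant)  # mutant killed
```

A real tool generates hundreds of such mutants (swapped operators, deleted statements, changed constants) and reports the fraction your suite kills.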
A typical finding: a codebase with 85% coverage might have a mutation score of only 55%. That gap represents code that is executed but not meaningfully tested. Focus mutation testing on critical paths first. Running it across the entire codebase is expensive, so start with your payments module, authentication flow, or whatever breaks production most often.
Flaky Tests Are an Emergency
A flaky test is one that sometimes passes and sometimes fails with no change to the code under test. At a 2% flake rate, a suite of 500 tests will produce roughly 10 spurious failures per run on average. Engineers quickly learn to re-run the pipeline and ignore the noise. That's the real damage: flaky tests erode trust in the entire suite. When the build is red, people stop investigating because "it's probably just flaky."
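The arithmetic is worth spelling out, assuming flakes are independent across tests:

```python
n_tests, flake_rate = 500, 0.02

# Expected spurious failures per run (linearity of expectation).
expected_failures = n_tests * flake_rate        # 10.0

# Probability that an otherwise-correct run comes back fully green.
p_all_green = (1 - flake_rate) ** n_tests       # ~0.004%
```

At that rate, a green build on a correct change is nearly impossible without retries, which is exactly why teams stop believing red builds.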
Track flake rate as a first-class metric. Quarantine flaky tests automatically (mark them, skip in CI, file a ticket). Fix or delete them within a sprint. Google's internal data shows that teams with flake rates under 0.5% merge code significantly faster because they trust their green builds.
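Detection can be automated from run history: a test whose result varies across repeated runs of the same commit is flaky by definition. A minimal sketch (the run-history data shape and test names are assumptions for illustration):

```python
# Results of the same test across repeated runs of one unchanged commit.
history = {
    "test_login":          ["pass", "pass", "pass"],
    "test_checkout_retry": ["pass", "fail", "pass"],  # flaky: varies with no code change
}

def flaky_tests(history):
    # A test with more than one distinct outcome on identical code is flaky.
    return sorted(name for name, results in history.items()
                  if len(set(results)) > 1)

quarantine_list = flaky_tests(history)
```

The resulting list feeds the quarantine step: skip these in CI, file a ticket, and track the list size as your flake-rate metric.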
The Test Pyramid in Practice
The pyramid model says: many fast unit tests at the base, fewer integration tests in the middle, and a small number of end-to-end tests at the top. A healthy ratio is roughly 70/20/10. The reason is economics. Unit tests run in milliseconds, are cheap to write, and isolate failures precisely. E2E tests take minutes, are expensive to maintain, and produce vague failure messages.
Teams that invert this pyramid (few unit tests, heavy E2E suite) end up with 30-minute CI runs, constant flakiness from browser automation, and a codebase where nobody wants to add tests because each one is painful to write and maintain.
Test Execution Time
If your test suite takes more than 10 minutes, developers stop running it locally. They push and wait for CI, which means they context-switch to something else and lose flow. Track test execution time as a developer experience metric. When it starts creeping up, invest in parallelization, smarter test selection (only run tests affected by changed files), and pruning redundant tests. Fast feedback loops drive better testing habits.
Key Points
- Coverage percentage tells you what code is executed during tests, not whether the tests actually catch bugs
- Mutation testing measures real test effectiveness by injecting faults and checking if tests detect them
- Flaky test rate above 2-3% visibly degrades developer trust in the test suite and slows down merges
- Test pyramid ratios (70% unit / 20% integration / 10% e2e) keep execution fast and maintenance manageable
- Test execution time is a developer experience metric: suites over 10 minutes break the feedback loop
Common Mistakes
- ✗ Chasing a coverage number (like 80%) without considering what the tests actually assert
- ✗ Writing tests after the fact to hit a coverage gate, which produces low-value tests that test implementation details
- ✗ Ignoring flaky tests until the suite is so unreliable that engineers stop trusting green builds
- ✗ Building a test diamond (heavy integration, light unit), which makes the suite slow and brittle