Race Detection in CI

What it is

Race detectors are tools that instrument the program to detect data races at runtime: two threads accessing the same memory without synchronisation, where at least one is a write. They catch a specific bug class (data races) that's responsible for many flaky-test and "works on x86, fails on ARM" stories.

The mainstream detectors:

Go: -race flag. Built in. Catches data races during test runs.
C++: ThreadSanitizer (TSan). Built into Clang and GCC. -fsanitize=thread.
Rust: TSan via RUSTFLAGS="-Zsanitizer=thread" (a -Z flag, so it needs a nightly toolchain). The type system catches most races at compile time anyway.
Java: no general race detector (the JMM allows races as defined behaviour). Use jcstress for memory-model assumption tests; standard JUnit for higher-level concurrency tests.
Python: no general race detector (the GIL hides most). Stress testing is the practical approach.

How they work

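The standard algorithm is happens-before tracking via vector clocks. Each thread carries a vector clock, and synchronisation operations (lock release and acquire, channel send and receive, thread start and join) pass clock values from one thread to another. Each memory location remembers the clocks of its most recent accesses. On a new access, the detector checks that the accessing thread's clock dominates the clock of every earlier conflicting access; if it doesn't, the two accesses are unordered, and that's a race.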
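The happens-before check can be sketched as a toy detector in Go. This is an illustration of the idea only, not how Go's detector or TSan is actually implemented (real detectors use compact shadow memory and bounded history). The Detector type and its methods are hypothetical names, and only locks are modelled as synchronisation.

```go
package main

import "fmt"

// VC is a vector clock with one logical-time entry per thread.
type VC []int

// leq reports whether a is component-wise <= b, i.e. everything
// recorded in a is already known to b.
func (a VC) leq(b VC) bool {
	for i := range a {
		if a[i] > b[i] {
			return false
		}
	}
	return true
}

// join raises each entry of a to at least the matching entry of b.
func (a VC) join(b VC) {
	for i := range a {
		if b[i] > a[i] {
			a[i] = b[i]
		}
	}
}

// Detector is a toy happens-before checker (hypothetical, for illustration).
type Detector struct {
	threads []VC          // current clock of each thread
	locks   map[string]VC // clock published by the last release of each lock
	writes  map[string]VC // clock of the last write to each variable
	reads   map[string]VC // join of the clocks of all reads of each variable
}

func NewDetector(n int) *Detector {
	d := &Detector{
		threads: make([]VC, n),
		locks:   map[string]VC{},
		writes:  map[string]VC{},
		reads:   map[string]VC{},
	}
	for t := range d.threads {
		d.threads[t] = make(VC, n)
		d.threads[t][t] = 1 // each thread starts at its own time 1
	}
	return d
}

// Acquire imports everything the previous holder of the lock had done.
func (d *Detector) Acquire(t int, lock string) {
	if c, ok := d.locks[lock]; ok {
		d.threads[t].join(c)
	}
}

// Release publishes the thread's clock on the lock, then advances its time.
func (d *Detector) Release(t int, lock string) {
	d.locks[lock] = append(VC(nil), d.threads[t]...)
	d.threads[t][t]++
}

// Read reports whether this read races with an earlier write.
func (d *Detector) Read(t int, v string) bool {
	now := d.threads[t]
	w, ok := d.writes[v]
	race := ok && !w.leq(now)
	if d.reads[v] == nil {
		d.reads[v] = make(VC, len(now))
	}
	d.reads[v].join(now)
	return race
}

// Write reports whether this write races with an earlier read or write.
func (d *Detector) Write(t int, v string) bool {
	now := d.threads[t]
	race := false
	if w, ok := d.writes[v]; ok && !w.leq(now) {
		race = true
	}
	if r, ok := d.reads[v]; ok && !r.leq(now) {
		race = true
	}
	d.writes[v] = append(VC(nil), now...)
	return race
}

func main() {
	d := NewDetector(2)
	// Thread 0 writes x under lock m, then thread 1 does the same: ordered.
	d.Acquire(0, "m")
	d.Write(0, "x")
	d.Release(0, "m")
	d.Acquire(1, "m")
	fmt.Println(d.Write(1, "x")) // false
	d.Release(1, "m")
	// Thread 0 writes x again without taking the lock: unordered.
	fmt.Println(d.Write(0, "x")) // true
}
```

Read clocks are kept separately from write clocks so that two unordered reads are not flagged; a race needs at least one write.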

The implementation cost: 2-10x slowdown, 5-10x more memory. Acceptable for CI; not for production.

Where to run them

CI. Every PR. Run the test suite with race detection enabled. Failures block merge.

If the test suite is too slow with -race (long-running integration tests, large parameter sweeps), split into a fast race-detected subset (per PR) and a full suite (nightly).

What they do and don't catch

Catch:

Concurrent unprotected access to a variable.
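Memory-ordering bugs that amount to a data race (a write published to another thread without synchronisation).
Mixing atomic and non-atomic accesses: plain reads or writes to a variable that other threads access atomically, or plain data that an atomic flag was supposed to guard but doesn't.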

Don't catch:

Deadlocks (no progress = no race).
Livelocks (threads stay busy but make no useful progress; every access can be correctly synchronised, so there is no race to report).
Missing notifications / lost wakeups.
Atomicity violations across multiple atomics (each operation is atomic, but the sequence as a whole is not).
Bugs that didn't happen during the test (test didn't trigger the race).

The race detector is necessary, not sufficient. Combine with stress tests that exercise concurrent paths at high load; the combination catches more.

Stress testing

For Go: high-N goroutine tests that perform many operations concurrently. Run repeatedly under -race to maximise the chance of triggering races.

go test -race -count=100 -run=TestConcurrent

For Java: concurrent JUnit tests with N threads doing M operations. Use jcstress for the lower-level memory-model fixtures.

For Python: repeated runs with pytest --count (the option comes from the pytest-repeat plugin), optionally spread across worker processes with pytest-xdist. Fewer races to find (GIL helps), but multi-bytecode operations can still race.

The race detector + stress test combination is the standard for catching concurrency bugs before production.

Race detection in CI is non-negotiable for any concurrent code: cheap, automatic, catches a major bug class. It is not a complete safety net, so combine it with stress tests, code review, and deadlock analysis. And remember it is a development-time tool; production runs un-instrumented, so the catch has to happen before ship.

The race detector finds the bugs that hide on x86 and surface on ARM. Every team should be running it.

Follow-up questions

▸Should the race detector be on for production?

No. It slows execution 2-10x and uses 5-10x more memory. CI is where it belongs: every PR, every test. Production runs the un-instrumented binary. The race detector's job is to catch bugs before they ship; once shipped, production should never see a race.

▸How does Go's race detector compare to TSan?

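Same theory (happens-before vector clocks) and largely the same engine: Go's race detector is built on the ThreadSanitizer runtime. The difference is integration. Go's is part of the toolchain and enabled with -race, no extra setup. TSan for C++ needs a Clang or GCC build with -fsanitize=thread. Both have very low false positive rates: if they report a race, there is almost always a race.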

▸Does the race detector catch all bugs?

No. It catches data races (concurrent unprotected access). Doesn't catch deadlocks, livelocks, missing notifications, lost wakeups, atomicity violations on multi-step operations. Also doesn't catch bugs that depend on specific scheduling that didn't happen during the test run. Useful, not sufficient.

▸What about Rust?

Rust's type system catches most data races at compile time (Send/Sync traits). TSan is still useful for unsafe code and for FFI. The Rust compiler + clippy + TSan in CI is a strong combination.
