Data Corruption Patterns
Silent vs Visible Corruption
Not all corruption looks the same. Visible corruption crashes your application. A null pointer, a constraint violation, a type mismatch. These are the good outcomes. The application fails loudly, the bad data doesn't propagate, and you get an error log to investigate.
Silent corruption is the dangerous kind. A user's balance is wrong by $0.50. An order has 3 items but the total reflects 2. A timestamp is off by one timezone. The application keeps running. Users might not notice for days. When they do notice, the corrupted data has been replicated, cached, exported to analytics, sent in invoices, and used for business decisions.
Race Conditions and Lost Updates
The most common application-level corruption comes from read-modify-write cycles without proper locking. Two requests read the same row, both modify it, both write it back. The second write overwrites the first. This is called a lost update, and it happens in any system that doesn't guard the read-modify-write cycle with row locks (SELECT ... FOR UPDATE), a strict enough isolation level, or optimistic locking.
Optimistic locking with a version column catches this: UPDATE users SET balance = 100, version = 4 WHERE id = 1 AND version = 3. If the version changed between read and write, the update affects zero rows and the application retries. This is simple, effective, and almost nobody implements it until they've had a corruption incident.
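A minimal sketch of the pattern, assuming a users table with an integer version column (column names are illustrative; the retry loop lives in application code):

    -- Read the row, remembering the version that was observed.
    SELECT balance, version FROM users WHERE id = 1;   -- suppose this returns (120, 3)

    -- Write back only if nobody else has modified the row since the read.
    UPDATE users
    SET balance = 100,
        version = version + 1
    WHERE id = 1
      AND version = 3;   -- the version observed at read time

    -- If the UPDATE reports 0 rows affected, another writer won the race:
    -- the application re-reads the row and repeats the read-modify-write cycle.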
Event Stream Propagation
Modern architectures use event streams (Kafka, SQS, EventBridge) to propagate changes across services. When corrupted data enters the stream, every consumer creates its own copy of the corruption. The inventory service has wrong counts. The billing service has wrong amounts. The analytics pipeline has wrong metrics.
Fixing this requires replaying events from a known good state. If you use event sourcing, you have a complete history and can reconstruct the correct state. If you use CDC (Change Data Capture), you need to reprocess from the point before corruption started. Either way, every downstream system needs its own recovery process.
Impact Assessment
When corruption is detected, the first question isn't "how do we fix it?" It's "how big is this?" You need to know: when did corruption start, how many records are affected, which downstream systems consumed the corrupted data, and which customers are impacted.
Build queries that can answer these questions quickly. If your order totals don't match the sum of line items, you need a query that finds all mismatched orders. If inventory went negative, you need a query that finds all products with negative stock. Have these queries ready before you need them.
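Two examples of the kind of query to keep on hand, assuming orders, order_items, and products tables (table and column names are placeholders for your own schema):

    -- Orders whose stored total no longer matches the sum of their line items.
    SELECT o.id, o.total, SUM(i.quantity * i.unit_price) AS computed_total
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    GROUP BY o.id, o.total
    HAVING o.total <> SUM(i.quantity * i.unit_price);

    -- Products whose stock has gone negative.
    SELECT id, sku, stock FROM products WHERE stock < 0;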
Recovery Strategies
Point-in-time recovery. Restore the database to the moment before corruption started, extract the correct data, and apply it to the current database. This is surgical but requires PITR to be enabled and the corruption timestamp to be known.
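A hedged sketch of the "extract and apply" step, using PostgreSQL-style UPDATE ... FROM and assuming the pre-corruption restore has been loaded next to production under a separate restore schema, with the affected order ids already staged from the impact assessment (all names are placeholders):

    -- Copy known-good values back into the live table, but only for rows the
    -- impact assessment identified as corrupted, so legitimate changes made
    -- since the restore point are left alone.
    UPDATE orders AS o
    SET total = r.total
    FROM restore.orders AS r
    WHERE r.id = o.id
      AND o.id IN (SELECT id FROM corrupted_order_ids);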
Event replay. If you have an event store, replay events from before the corruption through corrected application logic. This is the cleanest approach but requires event sourcing architecture.
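A rough illustration of the replay idea, assuming an append-only events table of balance deltas per account (the table, columns, and cutoff timestamp are hypothetical); replaying the newer events through corrected consumer logic is application-specific and not shown:

    -- Rebuild each account's balance from events up to the last known good point,
    -- then compare against the live table to see which rows have diverged.
    CREATE TEMP TABLE rebuilt AS
    SELECT account_id, SUM(amount_delta) AS balance
    FROM events
    WHERE recorded_at < TIMESTAMP '2024-01-15 00:00:00'   -- last known good point (placeholder)
    GROUP BY account_id;

    SELECT a.id, a.balance AS live_balance, r.balance AS rebuilt_balance
    FROM accounts a
    JOIN rebuilt r ON r.account_id = a.id
    WHERE a.balance <> r.balance;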
Manual reconciliation. For small numbers of affected records, human review might be the safest option. Export the corrupted records, compare against source data (payment gateway records, upstream system data), and fix them one by one. Slow but accurate.
Document every correction in a remediation log. Auditors and customers will ask what was changed and why.
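A minimal sketch of that workflow, assuming the affected ids are already staged; the backup table, the remediation_log schema, and the specific values are all illustrative:

    -- Snapshot the corrupted rows before touching them, for analysis and audit.
    CREATE TABLE corrupted_orders_backup AS
    SELECT * FROM orders
    WHERE id IN (SELECT id FROM corrupted_order_ids);

    -- Record each manual correction, then apply it.
    INSERT INTO remediation_log (order_id, field, old_value, new_value, reason, fixed_by)
    VALUES (48211, 'total', '59.98', '89.97', 'lost update in order service, Jan 15 incident', 'on-call engineer');

    UPDATE orders SET total = 89.97 WHERE id = 48211;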
Incident Timeline
- T+0m: A race condition in the order service writes a partial update to the database. Two concurrent requests modify the same row, and the last write wins with incomplete data.
- T+2m: Downstream services consume the corrupted data via event stream. Inventory counts go negative. Billing calculates wrong amounts. The corruption is spreading.
- T+5m: A reconciliation job flags 47 orders with mismatched totals. Alert fires, but the on-call initially dismisses it as a known false positive from last month.
- T+10m: Customer complaints arrive: wrong charges, missing items in orders. On-call escalates to a P0. Team begins impact assessment.
- T+15m: Source of corruption identified. The race condition is in code deployed 3 days ago. Corrupted records span 72 hours. Rolling back the code stops new corruption but doesn't fix existing bad data.
- T+30m: Data repair begins using event sourcing replay from the last known good state. Manual review required for 200+ orders that can't be automatically reconciled.
Detection Signals
- Reconciliation job failures or mismatches between source-of-truth systems
- Constraint violation errors in database logs that indicate data integrity issues
- Customer reports of incorrect balances, missing records, or duplicate transactions
- Negative values in columns that should only be positive (inventory counts, account balances)
Prevention
- Use database-level constraints (UNIQUE, CHECK, FOREIGN KEY) as the last line of defense. Application-level validation is necessary but not sufficient (see the sketch after this list)
- Implement optimistic locking with version columns for all records that can be concurrently modified
- Run reconciliation jobs continuously, comparing data across systems (e.g., order totals vs payment amounts vs inventory changes)
- Use event sourcing or write-ahead logs for critical data paths so you can reconstruct state from events
- Enable checksums on database pages (PostgreSQL data_checksums, MySQL innodb_checksum_algorithm) to detect storage-level corruption
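A sketch of the constraints-as-last-line-of-defense idea from the first item above, with placeholder table and column names:

    -- Rules the database enforces even when a buggy code path skips validation.
    ALTER TABLE products
      ADD CONSTRAINT stock_non_negative CHECK (stock >= 0);

    ALTER TABLE payments
      ADD CONSTRAINT one_capture_per_payment UNIQUE (order_id, capture_id);

    ALTER TABLE order_items
      ADD CONSTRAINT order_items_order_fk FOREIGN KEY (order_id) REFERENCES orders (id);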
Key Points
- Silent data corruption is worse than visible corruption. If users see an error, they retry. If they see wrong data, they make decisions based on it
- Corruption that enters an event stream is exponentially harder to fix because every consumer has its own corrupted copy of the data
- Point-in-time recovery (PITR) is your most important database feature. Without it, you're choosing between losing legitimate changes by restoring an old backup and living with the corrupted data
- Most data corruption comes from application bugs, not hardware failures. Race conditions, missing transactions, and incorrect update logic cause 90% of corruption
- The blast radius of data corruption grows with time. Detecting corruption 5 minutes after it starts affects 50 records. Detecting it 3 days later affects 50,000
Common Mistakes
- Restoring a full database backup to fix corruption, which also reverts all legitimate changes made since the backup
- Fixing corrupted records in production manually with UPDATE statements without first backing up the corrupted data for analysis
- Assuming database replication protects against corruption. Replication faithfully copies corrupted data to every replica