Data Corruption Patterns
Silent vs Visible Corruption
Not all corruption looks the same. Visible corruption crashes your application. A null pointer, a constraint violation, a type mismatch. These are the good outcomes. The application fails loudly, the bad data doesn't propagate, and you get an error log to investigate.
Silent corruption is the dangerous kind. A user's balance is wrong by $0.50. An order has 3 items but the total reflects 2. A timestamp is off by one timezone. The application keeps running. Users might not notice for days. When they do notice, the corrupted data has been replicated, cached, exported to analytics, sent in invoices, and used for business decisions.
Race Conditions and Lost Updates
The most common application-level corruption comes from read-modify-write cycles without proper locking. Two requests read the same row, both modify it, both write it back. The second write overwrites the first. This is called a lost update, and it happens in any system that doesn't guard the read-modify-write cycle with row locks (SELECT ... FOR UPDATE), a strict enough isolation level, or optimistic locking.
Optimistic locking with a version column catches this: UPDATE users SET balance = 100, version = 4 WHERE id = 1 AND version = 3. If the version changed between read and write, the update affects zero rows and the application retries. This is simple, effective, and almost nobody implements it until they've had a corruption incident.
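A minimal sketch of the pattern, assuming a users table with an integer version column (column names are illustrative; the retry loop lives in application code):

    -- Read the row, remembering the version that was observed.
    SELECT balance, version FROM users WHERE id = 1;   -- suppose this returns (120, 3)

    -- Write back only if nobody else has modified the row since the read.
    UPDATE users
    SET balance = 100,
        version = version + 1
    WHERE id = 1
      AND version = 3;   -- the version observed at read time

    -- If the UPDATE reports 0 rows affected, another writer won the race:
    -- the application re-reads the row and repeats the read-modify-write cycle.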
Event Stream Propagation
Modern architectures use event streams (Kafka, SQS, EventBridge) to propagate changes across services. When corrupted data enters the stream, every consumer creates its own copy of the corruption. The inventory service has wrong counts. The billing service has wrong amounts. The analytics pipeline has wrong metrics.
Fixing this requires replaying events from a known good state. If you use event sourcing, you have a complete history and can reconstruct the correct state. If you use CDC (Change Data Capture), you need to reprocess from the point before corruption started. Either way, every downstream system needs its own recovery process.
Impact Assessment
When corruption is detected, the first question isn't "how do we fix it?" It's "how big is this?" You need to know: when did corruption start, how many records are affected, which downstream systems consumed the corrupted data, and which customers are impacted.
Build queries that can answer these questions quickly. If your order totals don't match the sum of line items, you need a query that finds all mismatched orders. If inventory went negative, you need a query that finds all products with negative stock. Have these queries ready before you need them.
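Two examples of the kind of query to keep on hand, assuming orders, order_items, and products tables (table and column names are placeholders for your own schema):

    -- Orders whose stored total no longer matches the sum of their line items.
    SELECT o.id, o.total, SUM(i.quantity * i.unit_price) AS computed_total
    FROM orders o
    JOIN order_items i ON i.order_id = o.id
    GROUP BY o.id, o.total
    HAVING o.total <> SUM(i.quantity * i.unit_price);

    -- Products whose stock has gone negative.
    SELECT id, sku, stock FROM products WHERE stock < 0;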
Recovery Strategies
Point-in-time recovery. Restore the database to the moment before corruption started, extract the correct data, and apply it to the current database. This is surgical but requires PITR to be enabled and the corruption timestamp to be known.
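A hedged sketch of the "extract and apply" step, using PostgreSQL-style UPDATE ... FROM and assuming the pre-corruption restore has been loaded next to production under a separate restore schema, with the affected order ids already staged from the impact assessment (all names are placeholders):

    -- Copy known-good values back into the live table, but only for rows the
    -- impact assessment identified as corrupted, so legitimate changes made
    -- since the restore point are left alone.
    UPDATE orders AS o
    SET total = r.total
    FROM restore.orders AS r
    WHERE r.id = o.id
      AND o.id IN (SELECT id FROM corrupted_order_ids);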
Event replay. If you have an event store, replay events from before the corruption through corrected application logic. This is the cleanest approach but requires event sourcing architecture.
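A rough illustration of the replay idea, assuming an append-only events table of balance deltas per account (the table, columns, and cutoff timestamp are hypothetical); replaying the newer events through corrected consumer logic is application-specific and not shown:

    -- Rebuild each account's balance from events up to the last known good point,
    -- then compare against the live table to see which rows have diverged.
    CREATE TEMP TABLE rebuilt AS
    SELECT account_id, SUM(amount_delta) AS balance
    FROM events
    WHERE recorded_at < TIMESTAMP '2024-01-15 00:00:00'   -- last known good point (placeholder)
    GROUP BY account_id;

    SELECT a.id, a.balance AS live_balance, r.balance AS rebuilt_balance
    FROM accounts a
    JOIN rebuilt r ON r.account_id = a.id
    WHERE a.balance <> r.balance;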
Manual reconciliation. For small numbers of affected records, human review might be the safest option. Export the corrupted records, compare against source data (payment gateway records, upstream system data), and fix them one by one. Slow but accurate.
Document every correction in a remediation log. Auditors and customers will ask what was changed and why.
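A minimal sketch of that workflow, assuming the affected ids are already staged; the backup table, the remediation_log schema, and the specific values are all illustrative:

    -- Snapshot the corrupted rows before touching them, for analysis and audit.
    CREATE TABLE corrupted_orders_backup AS
    SELECT * FROM orders
    WHERE id IN (SELECT id FROM corrupted_order_ids);

    -- Record each manual correction, then apply it.
    INSERT INTO remediation_log (order_id, field, old_value, new_value, reason, fixed_by)
    VALUES (48211, 'total', '59.98', '89.97', 'lost update in order service, Jan 15 incident', 'on-call engineer');

    UPDATE orders SET total = 89.97 WHERE id = 48211;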
Incident Timeline
- T+0m: A race condition in the order service writes a partial update to the database. Two concurrent requests modify the same row, and the last write wins with incomplete data.
- T+2m: Downstream services consume the corrupted data via event stream. Inventory counts go negative. Billing calculates wrong amounts. The corruption is spreading.
- T+5m: A reconciliation job flags 47 orders with mismatched totals. Alert fires, but the on-call initially dismisses it as a known false positive from last month.
- T+10m: Customer complaints arrive: wrong charges, missing items in orders. On-call escalates to a P0. Team begins impact assessment.
- T+15m: Source of corruption identified. The race condition is in code deployed 3 days ago. Corrupted records span 72 hours. Rolling back the code stops new corruption but doesn't fix existing bad data.
- T+30m: Data repair begins using event sourcing replay from the last known good state. Manual review required for 200+ orders that can't be automatically reconciled.
Detection Signals
- Reconciliation job failures or mismatches between source-of-truth systems
- Constraint violation errors in database logs that indicate data integrity issues
- Customer reports of incorrect balances, missing records, or duplicate transactions
- Negative values in columns that should only be positive (inventory counts, account balances)
Prevention
- Use database-level constraints (UNIQUE, CHECK, FOREIGN KEY) as the last line of defense. Application-level validation is necessary but not sufficient (see the sketch after this list)
- Implement optimistic locking with version columns for all records that can be concurrently modified
- Run reconciliation jobs continuously, comparing data across systems (e.g., order totals vs payment amounts vs inventory changes)
- Use event sourcing or write-ahead logs for critical data paths so you can reconstruct state from events
- Enable checksums on database pages (PostgreSQL data_checksums, MySQL innodb_checksum_algorithm) to detect storage-level corruption
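A sketch of the constraints-as-last-line-of-defense idea from the first item above, with placeholder table and column names:

    -- Rules the database enforces even when a buggy code path skips validation.
    ALTER TABLE products
      ADD CONSTRAINT stock_non_negative CHECK (stock >= 0);

    ALTER TABLE payments
      ADD CONSTRAINT one_capture_per_payment UNIQUE (order_id, capture_id);

    ALTER TABLE order_items
      ADD CONSTRAINT order_items_order_fk FOREIGN KEY (order_id) REFERENCES orders (id);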
Key Points
- Silent data corruption is worse than visible corruption. If users see an error, they retry. If they see wrong data, they make decisions based on it
- Corruption that enters an event stream is exponentially harder to fix because every consumer has its own corrupted copy of the data
- Point-in-time recovery (PITR) is your most important database feature. Without it, you're choosing between losing legitimate changes by restoring an old backup and living with the corrupted data
- Most data corruption comes from application bugs, not hardware failures. Race conditions, missing transactions, and incorrect update logic cause 90% of corruption
- The blast radius of data corruption grows with time. Detecting corruption 5 minutes after it starts affects 50 records. Detecting it 3 days later affects 50,000
Common Mistakes
- Restoring a full database backup to fix corruption, which also reverts all legitimate changes made since the backup
- Fixing corrupted records in production manually with UPDATE statements without first backing up the corrupted data for analysis
- Assuming database replication protects against corruption. Replication faithfully copies corrupted data to every replica