Configuration Drift Incidents
The Hidden Cost of Manual Changes
Every production system has drift. The question is whether you know about it. Configuration drift happens when the running state of your infrastructure diverges from what your code says it should be. Someone edits a security group in the AWS console. Someone changes a ConfigMap with kubectl edit. Someone tunes a database parameter through the admin UI. Each change works. None of them get committed to version control.
The incident doesn't happen when the drift is introduced. It happens weeks later when an automated process reverts the drift, or when a new deployment assumes the old state is still in place.
Categories of Drift
Infrastructure drift. Cloud resources that don't match Terraform or CloudFormation state. A security group rule added manually, an IAM policy attached outside of code, a load balancer setting tuned in the console. This is the most common and most dangerous category because it often involves security-critical configurations.
Application configuration drift. Environment variables, feature flags, and config files that differ between what's deployed and what's in git. This often happens through CI/CD overrides, manual deployments, or config changes that bypass the normal deployment pipeline.
Data drift. Database schemas that don't match migration state. Seeds or reference data that was updated directly in production. This overlaps with database migration failures but is subtler because the application might work fine with the drifted schema until a specific code path is hit.
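A quick way to surface this kind of drift is a direct schema comparison between environments, sketched here with pg_dump and bash process substitution; the connection-string variables are placeholders for your own setup:

```bash
# Compare live schemas across environments; any diff is unexplained drift.
# STAGING_DB_URL and PROD_DB_URL are placeholder connection strings.
diff <(pg_dump --schema-only "$STAGING_DB_URL") \
     <(pg_dump --schema-only "$PROD_DB_URL")
```

The same comparison against a scratch database built from your migrations catches schemas that no longer match migration state.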
GitOps as Prevention
GitOps is the strongest prevention against drift. ArgoCD watches a git repository and continuously reconciles the cluster state with the repository state. If someone makes a manual change, ArgoCD reverts it within minutes. This sounds aggressive, but that's the point. You want manual changes to be impossible to persist.
The key shift is that git becomes the source of truth, not the running system. You don't SSH in and change things. You commit a change, and the system converges to match. If you need to make an emergency change, you still commit it, just with an expedited review process.
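As a minimal sketch with the argocd CLI (the application name my-app is a placeholder), self-healing and pruning are opt-in per application:

```bash
# Enable automated sync for one application. --self-heal reverts manual
# cluster edits; --auto-prune deletes resources that were removed from git.
argocd app set my-app \
  --sync-policy automated \
  --self-heal \
  --auto-prune
```

With self-heal on, a kubectl edit survives only until the next reconciliation loop.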
Detecting Existing Drift
Before you can prevent drift, you need to find what's already drifted. Run terraform plan across all your state files and document every unexpected change. Use driftctl to compare your cloud provider's actual state against Terraform. For Kubernetes, compare running manifests against what's in git using kubectl diff.
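A first audit pass with those three tools might look like the following; the state file path and manifest directory are placeholders for your own layout:

```bash
# Terraform: -detailed-exitcode returns 0 (no changes), 1 (error), 2 (drift)
terraform plan -detailed-exitcode -input=false

# driftctl: compare the cloud provider's actual state against a state file
driftctl scan --from tfstate://terraform.tfstate

# Kubernetes: diff what's in git against what's running in the cluster
kubectl diff -f manifests/
```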
This audit usually reveals surprises. Security groups with rules nobody remembers adding. IAM roles with permissions that aren't in any Terraform file. Database parameters tuned manually during a previous incident and never codified.
Emergency Changes Done Right
Sometimes you need to change something in production right now. That's fine. The process should be: make the change, immediately create a PR that codifies it, and mark the PR as incident-related. Don't wait until the incident is over. The 5 minutes it takes to create the PR while the fix is fresh in your mind saves hours of archaeology later.
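One way to make the codify step a 5-minute habit, assuming the GitHub CLI and an existing incident label in the repo; the branch and file names are illustrative:

```bash
# Codify an emergency change while it's fresh; mark the PR as incident-related.
git checkout -b hotfix/nginx-routing
git add infra/nginx.conf
git commit -m "Codify emergency Nginx routing fix applied manually in prod"
git push -u origin hotfix/nginx-routing
gh pr create --fill --label incident
```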
Keep a #manual-changes channel in Slack. Every manual production change gets posted there with who, what, when, and why. It's not as good as GitOps, but it's infinitely better than nothing.
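The posting itself can be one command, sketched here against a Slack incoming webhook; SLACK_WEBHOOK_URL and the message contents are placeholders:

```bash
# Record who, what, when, and why for a manual production change.
curl -sS -X POST -H 'Content-type: application/json' \
  --data "{\"text\": \"manual change: $(whoami) edited nginx upstream on web-1 at $(date -u +%FT%TZ); reason: routing incident\"}" \
  "$SLACK_WEBHOOK_URL"
```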
Incident Timeline
- T-3w: An engineer SSH's into a production server and changes an Nginx config to fix a routing issue. The change works. No PR is created.
- T+0m: Three weeks later, Terraform runs as part of a routine deployment. It detects the manual change as drift and reverts the Nginx config to the version in code.
- T+5m: The routing fix disappears. 15% of API traffic starts hitting the wrong backend. Error rates climb for a subset of customers.
- T+10m: On-call investigates the deployment. The diff shows only application code changes, nothing related to Nginx. The infrastructure change isn't in any PR.
- T+15m: The team spends 20 minutes comparing the running config against Terraform state before finding the discrepancy. The original engineer is on vacation.
- T+45m: Fix applied in Terraform, PR merged, deployed. The incident took 45 minutes to resolve because the root cause wasn't documented anywhere.
Detection Signals
- terraform plan showing unexpected changes on infrastructure that wasn't recently modified
- Configuration management tools (Ansible, Puppet) reporting convergence failures or remediation actions (a dry-run example follows this list)
- Differences between staging and production that aren't explained by feature flags or environment-specific configs
- Deployment failures caused by conflicts between the expected and actual infrastructure state
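For the configuration-management signal, both Ansible and Puppet have dry-run modes that report pending remediation; the Ansible form is sketched below (site.yml is a placeholder playbook):

```bash
# Report what Ansible would change without applying anything. On a host
# that should already be converged, any reported change is drift.
ansible-playbook site.yml --check --diff
```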
Prevention
- Enforce GitOps with ArgoCD or Flux. All changes go through git. Manual kubectl or SSH changes are overwritten automatically within minutes
- Run terraform plan in CI on a schedule (daily at minimum) and alert on any detected drift, even if no deployment is happening (a minimal check script follows this list)
- Use tools like driftctl to scan for infrastructure drift between cloud provider state and Terraform state files
- Disable direct SSH access to production servers. Use Session Manager or similar audited access that logs every command
- Implement Open Policy Agent (OPA) or Kyverno policies that reject configurations not matching your standards
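A minimal version of the scheduled plan check, assuming one Terraform root module per directory under envs/; wire the exit status into whatever alerting your CI already has:

```bash
#!/usr/bin/env bash
# Daily drift check: plan every environment, fail if anything diverged.
set -u

drift=0
for dir in envs/*/; do
  terraform -chdir="$dir" init -input=false >/dev/null
  terraform -chdir="$dir" plan -detailed-exitcode -input=false >/dev/null 2>&1
  case $? in
    0) echo "OK:    $dir" ;;
    2) echo "DRIFT: $dir"; drift=1 ;;
    *) echo "ERROR: $dir"; drift=1 ;;
  esac
done
exit "$drift"
```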
Key Points
- Configuration drift is silent. The system works fine until something interacts with the drifted config, which could be days or months after the manual change
- The most dangerous drift happens in security configs: security groups, IAM policies, and network ACLs changed manually and never reviewed
- Feature flags are intentional drift. Treat them with the same rigor: track who changed what, when, and why. Use LaunchDarkly or Unleash with audit logs
- Environment parity between staging and production is a myth in most organizations. Document the known differences explicitly
Common Mistakes
- ✗ Making manual hotfixes in production under pressure and then forgetting to backport them into the infrastructure code
- ✗ Running terraform apply without reviewing the plan output, accidentally reverting previous manual changes
- ✗ Using different Terraform module versions across environments, so staging and production infrastructure slowly diverge
- ✗ Storing configuration in environment variables without a validation layer, allowing typos to propagate silently (a minimal guard follows below)
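For that last mistake, even a shell-level guard at process startup stops typos from propagating; the variable names here are examples:

```bash
# Fail fast if required variables are unset or empty: ${VAR:?msg} aborts
# the script with the message instead of starting with a bad config.
: "${DATABASE_URL:?DATABASE_URL must be set}"
: "${REDIS_HOST:?REDIS_HOST must be set}"

# Where it's cheap, validate format as well as presence.
case "$DATABASE_URL" in
  postgres://*|postgresql://*) ;;
  *) echo "DATABASE_URL must be a postgres:// URL" >&2; exit 1 ;;
esac
```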