Configuration Drift Incidents
The Hidden Cost of Manual Changes
Every production system has drift. The question is whether you know about it. Configuration drift happens when the running state of your infrastructure diverges from what your code says it should be. Someone edits a security group in the AWS console. Someone changes a ConfigMap with kubectl edit. Someone tunes a database parameter through the admin UI. Each change works. None of them get committed to version control.
The incident doesn't happen when the drift is introduced. It happens weeks later when an automated process reverts the drift, or when a new deployment assumes the old state is still in place.
Categories of Drift
Infrastructure drift. Cloud resources that don't match Terraform or CloudFormation state. A security group rule added manually, an IAM policy attached outside of code, a load balancer setting tuned in the console. This is the most common and most dangerous category because it often involves security-critical configurations.
Application configuration drift. Environment variables, feature flags, and config files that differ between what's deployed and what's in git. This often happens through CI/CD overrides, manual deployments, or config changes that bypass the normal deployment pipeline.
Data drift. Database schemas that don't match migration state. Seeds or reference data that was updated directly in production. This overlaps with database migration failures but is subtler because the application might work fine with the drifted schema until a specific code path is hit.
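A quick way to surface this kind of drift is a direct schema comparison between environments, sketched here with pg_dump and bash process substitution; the connection-string variables are placeholders for your own setup:

```bash
# Compare live schemas across environments; any diff is unexplained drift.
# STAGING_DB_URL and PROD_DB_URL are placeholder connection strings.
diff <(pg_dump --schema-only "$STAGING_DB_URL") \
     <(pg_dump --schema-only "$PROD_DB_URL")
```

The same comparison against a scratch database built from your migrations catches schemas that no longer match migration state.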
GitOps as Prevention
GitOps is the strongest prevention against drift. ArgoCD watches a git repository and continuously reconciles the cluster state with the repository state. If someone makes a manual change, ArgoCD reverts it within minutes. This sounds aggressive, but that's the point. You want manual changes to be impossible to persist.
The key shift is that git becomes the source of truth, not the running system. You don't SSH in and change things. You commit a change, and the system converges to match. If you need to make an emergency change, you still commit it, just with an expedited review process.
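As a minimal sketch with the argocd CLI (the application name my-app is a placeholder), self-healing and pruning are opt-in per application:

```bash
# Enable automated sync for one application. --self-heal reverts manual
# cluster edits; --auto-prune deletes resources that were removed from git.
argocd app set my-app \
  --sync-policy automated \
  --self-heal \
  --auto-prune
```

With self-heal on, a kubectl edit survives only until the next reconciliation loop.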
Detecting Existing Drift
Before you can prevent drift, you need to find what's already drifted. Run terraform plan across all your state files and document every unexpected change. Use driftctl to compare your cloud provider's actual state against Terraform. For Kubernetes, compare running manifests against what's in git using kubectl diff.
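A first audit pass with those three tools might look like the following; the state file path and manifest directory are placeholders for your own layout:

```bash
# Terraform: -detailed-exitcode returns 0 (no changes), 1 (error), 2 (drift)
terraform plan -detailed-exitcode -input=false

# driftctl: compare the cloud provider's actual state against a state file
driftctl scan --from tfstate://terraform.tfstate

# Kubernetes: diff what's in git against what's running in the cluster
kubectl diff -f manifests/
```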
This audit usually reveals surprises. Security groups with rules nobody remembers adding. IAM roles with permissions that aren't in any Terraform file. Database parameters tuned manually during a previous incident and never codified.
Emergency Changes Done Right
Sometimes you need to change something in production right now. That's fine. The process should be: make the change, immediately create a PR that codifies it, and mark the PR as incident-related. Don't wait until the incident is over. The 5 minutes it takes to create the PR while the fix is fresh in your mind saves hours of archaeology later.
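One way to make the codify step a 5-minute habit, assuming the GitHub CLI and an existing incident label in the repo; the branch and file names are illustrative:

```bash
# Codify an emergency change while it's fresh; mark the PR as incident-related.
git checkout -b hotfix/nginx-routing
git add infra/nginx.conf
git commit -m "Codify emergency Nginx routing fix applied manually in prod"
git push -u origin hotfix/nginx-routing
gh pr create --fill --label incident
```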
Keep a #manual-changes channel in Slack. Every manual production change gets posted there with who, what, when, and why. It's not as good as GitOps, but it's infinitely better than nothing.
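The posting itself can be one command, sketched here against a Slack incoming webhook; SLACK_WEBHOOK_URL and the message contents are placeholders:

```bash
# Record who, what, when, and why for a manual production change.
curl -sS -X POST -H 'Content-type: application/json' \
  --data "{\"text\": \"manual change: $(whoami) edited nginx upstream on web-1 at $(date -u +%FT%TZ); reason: routing incident\"}" \
  "$SLACK_WEBHOOK_URL"
```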
Incident Timeline
- T-3w: An engineer SSH's into a production server and changes an Nginx config to fix a routing issue. The change works. No PR is created.
- T+0m: Three weeks later, Terraform runs as part of a routine deployment. It detects the manual change as drift and reverts the Nginx config to the version in code.
- T+5m: The routing fix disappears. 15% of API traffic starts hitting the wrong backend. Error rates climb for a subset of customers.
- T+10m: On-call investigates the deployment. The diff shows only application code changes, nothing related to Nginx. The infrastructure change isn't in any PR.
- T+15m: The team spends 20 minutes comparing the running config against Terraform state before finding the discrepancy. The original engineer is on vacation.
- T+45m: Fix applied in Terraform, PR merged, deployed. The incident took 45 minutes to resolve because the root cause wasn't documented anywhere.
Detection Signals
- terraform plan showing unexpected changes on infrastructure that wasn't recently modified
- Configuration management tools (Ansible, Puppet) reporting convergence failures or remediation actions (a dry-run example follows this list)
- Differences between staging and production that aren't explained by feature flags or environment-specific configs
- Deployment failures caused by conflicts between the expected and actual infrastructure state
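For the configuration-management signal, both Ansible and Puppet have dry-run modes that report pending remediation; the Ansible form is sketched below (site.yml is a placeholder playbook):

```bash
# Report what Ansible would change without applying anything. On a host
# that should already be converged, any reported change is drift.
ansible-playbook site.yml --check --diff
```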
Prevention
- Enforce GitOps with ArgoCD or Flux. All changes go through git. Manual kubectl or SSH changes are overwritten automatically within minutes
- Run terraform plan in CI on a schedule (daily at minimum) and alert on any detected drift, even if no deployment is happening (a minimal check script follows this list)
- Use tools like driftctl to scan for infrastructure drift between cloud provider state and Terraform state files
- Disable direct SSH access to production servers. Use Session Manager or similar audited access that logs every command
- Implement Open Policy Agent (OPA) or Kyverno policies that reject configurations not matching your standards
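A minimal version of the scheduled plan check, assuming one Terraform root module per directory under envs/; wire the exit status into whatever alerting your CI already has:

```bash
#!/usr/bin/env bash
# Daily drift check: plan every environment, fail if anything diverged.
set -u

drift=0
for dir in envs/*/; do
  terraform -chdir="$dir" init -input=false >/dev/null
  terraform -chdir="$dir" plan -detailed-exitcode -input=false >/dev/null 2>&1
  case $? in
    0) echo "OK:    $dir" ;;
    2) echo "DRIFT: $dir"; drift=1 ;;
    *) echo "ERROR: $dir"; drift=1 ;;
  esac
done
exit "$drift"
```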
Key Points
- Configuration drift is silent. The system works fine until something interacts with the drifted config, which could be days or months after the manual change
- The most dangerous drift happens in security configs: security groups, IAM policies, and network ACLs changed manually and never reviewed
- Feature flags are intentional drift. Treat them with the same rigor: track who changed what, when, and why. Use LaunchDarkly or Unleash with audit logs
- Environment parity between staging and production is a myth in most organizations. Document the known differences explicitly
Common Mistakes
- ✗ Making manual hotfixes in production under pressure and then forgetting to backport them into the infrastructure code
- ✗ Running terraform apply without reviewing the plan output, accidentally reverting previous manual changes
- ✗ Using different Terraform module versions across environments, so staging and production infrastructure slowly diverge
- ✗ Storing configuration in environment variables without a validation layer, allowing typos to propagate silently (a minimal guard follows below)
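For that last mistake, even a shell-level guard at process startup stops typos from propagating; the variable names here are examples:

```bash
# Fail fast if required variables are unset or empty: ${VAR:?msg} aborts
# the script with the message instead of starting with a bad config.
: "${DATABASE_URL:?DATABASE_URL must be set}"
: "${REDIS_HOST:?REDIS_HOST must be set}"

# Where it's cheap, validate format as well as presence.
case "$DATABASE_URL" in
  postgres://*|postgresql://*) ;;
  *) echo "DATABASE_URL must be a postgres:// URL" >&2; exit 1 ;;
esac
```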