GitOps & Infrastructure as Code
Why It Exists
Anyone who has spent an afternoon figuring out why production looks different from staging, and eventually traced it back to a console change someone made at 2 AM during an incident, already understands the problem.
ClickOps and imperative scripts create drift. They leave no audit trail. They cannot be reproduced reliably. GitOps and Infrastructure as Code fix this by treating infrastructure definitions as versioned, reviewable, testable code. Every change is a pull request: a peer reviews it, CI validates it, and git records who changed what, when, and why.
For platform engineers, this is not optional. It is the foundation of any compliance-ready, scalable infrastructure. Without it, one bad weekend leads to a state nobody can explain.
How It Works
GitOps Principles
GitOps boils down to four ideas. First, declarative: describe the desired state, not the steps to get there. Second, versioned and immutable: Git holds the canonical record. Third, pulled automatically: agents inside the cluster reconcile desired vs actual state on their own. Fourth, continuously reconciled: drift gets detected and corrected without anyone intervening.
ArgoCD and Flux both run inside the cluster, polling Git and applying any differences they find. This pull-based model has a real security advantage over push-based approaches. The cluster never needs to expose credentials to external CI systems. The CI pipeline does not need cluster-admin access. That matters more than most teams realize until they get bitten by it.
Terraform Workflow
Terraform follows a plan-apply-state lifecycle. terraform plan shows what it intends to create, modify, or destroy. terraform apply runs that plan against cloud provider APIs. The state file maps declared resources to actual cloud resource IDs.
This state must live in a remote backend (S3 + DynamoDB, Terraform Cloud, or GCS) with locking turned on. Without locking, two engineers running apply at the same time will corrupt the state. I have seen this happen on a Friday afternoon. It is not fun. The fix usually involves manual state surgery and a lot of swearing.
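A minimal backend block looks like the following sketch. The bucket and table names are placeholders, and both resources must exist (with versioning and encryption enabled on the bucket) before terraform init can use them.

```hcl
# backend.tf -- placeholder names; the bucket and DynamoDB table must already exist
terraform {
  backend "s3" {
    bucket         = "example-org-terraform-state"    # versioned, encrypted state bucket
    key            = "prod/payments/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                              # encrypt state at rest
    dynamodb_table = "terraform-state-locks"           # enables state locking
  }
}
```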
ArgoCD Sync Process
ArgoCD watches a Git repository path that contains Kubernetes manifests (raw YAML, Kustomize, or Helm charts). On each sync cycle, it runs a three-way diff between the desired state in Git and the live state in the cluster. Resources that diverge get marked OutOfSync.
Automated sync policies can self-heal drift, or manual sync gates can be configured for sensitive namespaces where a human needs to be in the loop. ApplicationSets let teams template hundreds of microservice deployments from a single generator definition, which is incredibly useful at scale but also a footgun if misconfigured (more on that in the failure scenarios below).
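A minimal Application with automated sync and self-healing might look like the sketch below; the repository URL, path, and namespaces are placeholders rather than anything from a real setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api                  # hypothetical service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/k8s-manifests.git
    targetRevision: main
    path: apps/payments-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```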
IaC Testing Strategies
| Test Level | Tool | What It Validates |
|---|---|---|
| Static analysis | tflint, checkov, tfsec | Syntax errors, security misconfigs, policy violations |
| Unit tests | terratest, pytest (Pulumi) | Module logic, expected plan output |
| Integration tests | Apply to ephemeral environment | Real resource creation, connectivity, IAM |
| Policy as code | OPA/Rego, Sentinel | Organizational guardrails (no public S3 buckets, enforced tagging) |
Most teams skip the middle two rows. Static analysis is easy to add, so people do it. Policy as code gets mandated by compliance. But actual unit and integration testing of infrastructure? Rare. And that gap is where the nastiest production surprises come from.
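One low-friction way to cover the unit-test row is Terraform's built-in test framework. The sketch below assumes Terraform 1.6+ and a module that contains an aws_s3_bucket_public_access_block resource named this; both are assumptions for illustration, not details from this section.

```hcl
# tests/s3_private.tftest.hcl -- executed with `terraform test`
run "bucket_blocks_public_access" {
  command = plan   # assert against the plan without creating real resources

  assert {
    condition     = aws_s3_bucket_public_access_block.this.block_public_acls == true
    error_message = "S3 bucket must block public ACLs."
  }
}
```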
Multi-Environment Management
Use a layered approach. Shared modules define the infrastructure blueprint. Environment-specific variable files (dev.tfvars, prod.tfvars) or Kustomize overlays customize configs per environment. Promote changes sequentially through dev, staging, and production, with automated gates between each step.
This sounds obvious on paper, but the temptation to "just apply directly to prod, it's a small change" is strong. Resist it. Small changes cause outages too.
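On the Kubernetes side, a production Kustomize overlay might look like this sketch; the base path, image name, and patch file are placeholders.

```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: payments
resources:
  - ../../base                  # shared manifests used by every environment
patches:
  - path: replica-count.yaml    # prod-only override, e.g. a higher replica count
images:
  - name: payments-api
    newTag: "1.4.2"             # promoted tag; dev and staging overlays pin their own
```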
Production Considerations
Encrypt state files at rest and lock down access with IAM policies. The state file contains sensitive outputs: database passwords, private IPs, service account keys. Treat it like a secret, because it is one.
Use Terraform workspaces or directory-based isolation to prevent accidental cross-environment changes. For GitOps, configure RBAC in ArgoCD so teams can only sync their own namespaces. Nobody wants a junior engineer accidentally syncing the payments namespace.
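A sketch of that boundary, assuming a hypothetical payments team and repository: an AppProject limits where their Applications may deploy, and a policy.csv entry in argocd-rbac-cm limits who may sync them.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: payments
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/example-org/payments-manifests.git
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments          # Applications in this project can only target this namespace
  clusterResourceWhitelist: []     # no cluster-scoped resources for this team
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    p, role:payments-deployer, applications, sync, payments/*, allow
    g, payments-team, role:payments-deployer
```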
Establish a break-glass procedure for emergency manual changes. Sometimes someone needs to kubectl edit something at 3 AM during an incident. That is fine. What is not fine is forgetting to reconcile Git with the actual state afterward. Make the follow-up PR mandatory and track compliance.
Monitor reconciliation loop latency. Alert when drift persists beyond the SLO threshold. If ArgoCD says everything is in sync but it has not actually checked in 30 minutes, that "in sync" status is a lie.
Failure Scenarios
Scenario 1: Terraform State Lock Deadlock
An engineer's terraform apply crashes mid-execution. Maybe their laptop dies, the network drops, or the process gets OOM-killed. The DynamoDB state lock never gets released. Every subsequent terraform plan or apply for that environment hangs, waiting for a lock that nobody is holding.
Cascading impact: All infrastructure changes for that environment are blocked. If this is production, nobody can scale up during an incident, rotate compromised credentials, or push a hotfix. Everything is stuck.
Detection: Monitor lock age via DynamoDB item timestamps. Alert when any lock exceeds 30 minutes. Track terraform_lock_wait_seconds in CI pipeline metrics.
Recovery: Document terraform force-unlock as a break-glass procedure that requires two-person approval. Wrap all CI-based applies with a timeout that auto-releases locks after 20 minutes. Better yet, use Terraform Cloud or Spacelift, which handle lock management with built-in timeouts. This is one of those problems to solve once and then forget about.
Scenario 2: ArgoCD Reconciliation Storm
A misconfigured ApplicationSet generator produces 500+ Applications at once. ArgoCD's repo-server and application-controller try to reconcile all of them simultaneously, overwhelming the Kubernetes API server's request budget.
Cascading impact: The API server starts throttling all clients. kubectl commands time out. Pod scheduling stalls. Health checks fail. Existing workloads begin restarting because they miss liveness probes. One bad config change takes down the control plane for every team.
Detection: Monitor ArgoCD's argocd_app_reconcile_count and argocd_app_sync_total rate. Alert when reconciliation rate exceeds 50/min. Watch apiserver_request_total with code=429 for throttling signals.
Recovery: Set spec.syncPolicy.retry.limit and rate limits on ArgoCD sync operations. Use argocd app list --status OutOfSync to identify the blast radius. Going forward, implement progressive ApplicationSet rollout with wave annotations. I strongly recommend testing ApplicationSet changes in a non-production cluster first. The blast radius of getting this wrong is enormous.
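The retry limit mentioned above lives on each Application's sync policy; the values in this sketch are illustrative, not recommendations.

```yaml
# Fragment of an Application manifest
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 3             # give up after three failed sync attempts
      backoff:
        duration: 30s      # initial delay between retries
        factor: 2          # exponential backoff multiplier
        maxDuration: 5m    # cap on the retry delay
```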
Scenario 3: Terraform Provider Version Drift Across Teams
Team A pins the aws provider to ~> 5.20. Team B uses ~> 5.40. Both teams contribute to shared modules. A module update works for Team B but breaks Team A's plan with incompatible resource schemas. Nobody catches it until a production apply fails.
Cascading impact: Production infrastructure changes are blocked. Debugging version conflicts across 50+ modules takes hours while everyone points fingers.
Detection: Enforce provider version constraints in CI using terraform providers lock. Run a weekly job that validates all module consumers against the latest provider version.
Recovery: Have a platform team maintain a shared provider version pin file across all Terraform roots. Use Renovate or Dependabot to propose provider upgrades as PRs with automated plan validation. Build a provider compatibility matrix and test it in CI. This is tedious to set up but saves the team from painful debugging sessions down the road.
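The shared pin usually ends up as a versions.tf that is kept identical across every Terraform root and only changed through an automated PR. A sketch, with example constraint values:

```hcl
# versions.tf -- identical in all roots, bumped via Renovate/Dependabot PRs
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"   # single org-wide constraint; example value only
    }
  }
}
```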
Capacity Planning
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Terraform plan duration | < 3 min | > 5 min | > 15 min |
| ArgoCD sync latency | < 30 sec | > 2 min | > 10 min |
| Drift detection interval | < 5 min | > 10 min | > 30 min |
| State file size | < 50 MB | > 100 MB | > 500 MB |
| Resources per state file | < 500 | > 1,000 | > 5,000 |
Scaling rules of thumb: Plan time grows roughly linearly with the number of resources in a state file (plan_time ~ O(n)). When plan exceeds 5 minutes, split the state. One state file per service per environment, or per domain boundary, works well in practice.
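Once a state file is split, downstream stacks can still consume the upstream outputs through terraform_remote_state. A sketch, assuming a separate network stack that exports an output named private_subnet_ids (placeholder names throughout):

```hcl
# Read the network stack's outputs from its own, smaller state file
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "example-org-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Example consumer: place this service's database in the shared VPC subnets
resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```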
For ArgoCD, each repo-server pod handles about 200 Applications effectively. For 1,000+ apps, provision 5+ repo-server replicas with 4GB RAM each. Use controller sharding to spread the load: run multiple argocd-application-controller replicas with the ARGOCD_CONTROLLER_REPLICAS environment variable set to the replica count, so each replica takes a shard of the managed clusters.
The numbers look clean on paper, but real-world performance depends heavily on how complex the manifests are and how often they change. Monitor actual latencies and adjust from there rather than trusting theoretical calculations.
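If ArgoCD is deployed with the community Helm chart, the scaling knobs live in its values file. A sketch using the rough figures above; chart keys can differ between chart versions, so treat this as illustrative rather than exact.

```yaml
# values.yaml for the argo-cd Helm chart (verify key names against your chart version)
repoServer:
  replicas: 5
  resources:
    requests:
      memory: 4Gi
      cpu: "1"

controller:
  replicas: 2    # application-controller sharding: each replica owns a shard of clusters
```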
Architecture Decision Record
ADR: Pull-Based GitOps vs Push-Based IaC Deployment
Context: The decision is whether infrastructure changes get applied by pull-based agents (ArgoCD, Flux) running inside the target environment, or by push-based pipelines (Terraform in CI) that apply changes from the outside.
| Criteria (Weight) | Pull-Based (ArgoCD/Flux) | Push-Based (Terraform in CI) | Hybrid |
|---|---|---|---|
| Drift detection & self-healing (20%) | 10 | 3 | 8 |
| Security posture (20%) | 9 | 5 | 7 |
| Cloud resource provisioning (15%) | 3 | 10 | 9 |
| Kubernetes-native workflows (15%) | 10 | 4 | 8 |
| Audit & approval gates (15%) | 7 | 9 | 8 |
| Multi-cloud support (10%) | 5 | 9 | 8 |
| Operational complexity (5%) | 6 | 7 | 4 |
| Weighted Score | 7.60 | 6.30 | 7.75 |
Decision scenarios:
- Kubernetes-only infrastructure: Go with pull-based GitOps (ArgoCD/Flux). You get continuous reconciliation, drift self-healing, and a cluster that never exposes credentials externally. This is the clear winner when 80%+ of the infrastructure is K8s-native.
- Multi-cloud with managed services (RDS, BigQuery, CloudFront): Use push-based Terraform. Cloud provider resources need API-level provisioning that pull-based K8s controllers cannot handle natively without Crossplane, and Crossplane adds its own complexity.
- Both K8s and cloud-native resources (this is most teams): Go hybrid. Terraform provisions cloud resources (VPCs, databases, IAM). ArgoCD manages Kubernetes workloads. Terraform outputs like database endpoints and secret ARNs reach the ArgoCD-managed workloads via ExternalSecrets (sketched after this list). More moving parts, but each tool does what it is actually good at.
- Regulated industry with change approval boards: Push-based with Terraform Cloud or Spacelift. Sentinel/OPA policies enforce approval gates, cost estimation, and drift detection in a unified workflow with full audit trails. Regulators like seeing a single pane of glass for change management.
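For the hybrid wiring described in the third scenario above, the handoff from Terraform to ArgoCD-managed workloads can run through the External Secrets Operator. A sketch, assuming Terraform wrote the database URL into AWS Secrets Manager and a ClusterSecretStore is already configured (all names are placeholders):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager            # assumed to be configured separately
  target:
    name: payments-db-credentials        # Kubernetes Secret consumed by the workload
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/payments/database-url  # written by Terraform alongside the RDS module
```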
Key Points
- Git as the single source of truth for both application and infrastructure config
- Declarative infrastructure: define the desired state, not how to get there
- Pull-based GitOps (ArgoCD, Flux) vs push-based IaC (Terraform apply in CI)
- Drift detection through continuous reconciliation, keeping actual state in line with desired state
- Infrastructure changes go through the same PR review process as application code
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| ArgoCD | Open Source | K8s GitOps, multi-cluster, UI dashboard | Medium-Enterprise |
| Terraform | Open Source | Multi-cloud IaC, stateful resource management | Small-Enterprise |
| Pulumi | Open Source | IaC in real programming languages (TS, Python, Go) | Small-Enterprise |
| Crossplane | Open Source | K8s-native cloud resource provisioning via CRDs | Medium-Enterprise |
Common Mistakes
- Running kubectl apply by hand in production, which throws away the audit trail and review process entirely
- Storing Terraform state locally without a remote backend and locking. Two concurrent applies will corrupt the state.
- Copy-pasting infrastructure instead of writing reusable modules. This falls apart the moment anything needs updating.
- Mixing application deployment with infrastructure provisioning in the same pipeline
- Skipping infrastructure tests. Terraform plan is the unit test. Apply to staging is the integration test. Do both.