GitOps & Infrastructure as Code
Why It Exists
Anyone who has spent an afternoon figuring out why production looks different from staging, and eventually traced it back to a console change someone made at 2 AM during an incident, already understands the problem.
ClickOps and imperative scripts create drift. They leave no audit trail. They cannot be reproduced reliably. GitOps and Infrastructure as Code fix this by treating infrastructure definitions as versioned, reviewable, testable code. Every change is a pull request: a peer reviews it, CI validates it, and git records who changed what, when, and why.
For platform engineers, this is not optional. It is the foundation of any compliance-ready, scalable infrastructure. Without it, one bad weekend leads to a state nobody can explain.
How It Works
GitOps Principles
GitOps boils down to four ideas. First, declarative: describe the desired state, not the steps to get there. Second, versioned and immutable: Git holds the canonical record. Third, pulled automatically: agents inside the cluster reconcile desired vs actual state on their own. Fourth, continuously reconciled: drift gets detected and corrected without anyone intervening.
ArgoCD and Flux both run inside the cluster, polling Git and applying any differences they find. This pull-based model has a real security advantage over push-based approaches. The cluster never needs to expose credentials to external CI systems. The CI pipeline does not need cluster-admin access. That matters more than most teams realize until they get bitten by it.
Terraform Workflow
Terraform follows a plan-apply-state lifecycle. terraform plan shows what it intends to create, modify, or destroy. terraform apply runs that plan against cloud provider APIs. The state file maps declared resources to actual cloud resource IDs.
This state must live in a remote backend (S3 + DynamoDB, Terraform Cloud, or GCS) with locking turned on. Without locking, two engineers running apply at the same time will corrupt the state. I have seen this happen on a Friday afternoon. It is not fun. The fix usually involves manual state surgery and a lot of swearing.
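A minimal backend block looks like the following sketch. The bucket and table names are placeholders, and both resources must exist (with versioning and encryption enabled on the bucket) before terraform init can use them.

```hcl
# backend.tf -- placeholder names; the bucket and DynamoDB table must already exist
terraform {
  backend "s3" {
    bucket         = "example-org-terraform-state"    # versioned, encrypted state bucket
    key            = "prod/payments/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true                              # encrypt state at rest
    dynamodb_table = "terraform-state-locks"           # enables state locking
  }
}
```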
ArgoCD Sync Process
ArgoCD watches a Git repository path that contains Kubernetes manifests (raw YAML, Kustomize, or Helm charts). On each sync cycle, it runs a three-way diff between the desired state in Git and the live state in the cluster. Resources that diverge get marked OutOfSync.
Automated sync policies can self-heal drift, or manual sync gates can be configured for sensitive namespaces where a human needs to be in the loop. ApplicationSets let teams template hundreds of microservice deployments from a single generator definition, which is incredibly useful at scale but also a footgun if misconfigured (more on that in the failure scenarios below).
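A minimal Application with automated sync and self-healing might look like the sketch below; the repository URL, path, and namespaces are placeholders rather than anything from a real setup.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments-api                  # hypothetical service
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example-org/k8s-manifests.git
    targetRevision: main
    path: apps/payments-api/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual changes made directly in the cluster
```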
IaC Testing Strategies
| Test Level | Tool | What It Validates |
|---|---|---|
| Static analysis | tflint, checkov, tfsec | Syntax errors, security misconfigs, policy violations |
| Unit tests | terratest, pytest (Pulumi) | Module logic, expected plan output |
| Integration tests | Apply to ephemeral environment | Real resource creation, connectivity, IAM |
| Policy as code | OPA/Rego, Sentinel | Organizational guardrails (no public S3 buckets, enforced tagging) |
Most teams skip the middle two rows. Static analysis is easy to add, so people do it. Policy as code gets mandated by compliance. But actual unit and integration testing of infrastructure? Rare. And that gap is where the nastiest production surprises come from.
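One low-friction way to cover the unit-test row is Terraform's built-in test framework. The sketch below assumes Terraform 1.6+ and a module that contains an aws_s3_bucket_public_access_block resource named this; both are assumptions for illustration, not details from this section.

```hcl
# tests/s3_private.tftest.hcl -- executed with `terraform test`
run "bucket_blocks_public_access" {
  command = plan   # assert against the plan without creating real resources

  assert {
    condition     = aws_s3_bucket_public_access_block.this.block_public_acls == true
    error_message = "S3 bucket must block public ACLs."
  }
}
```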
Multi-Environment Management
Use a layered approach. Shared modules define the infrastructure blueprint. Environment-specific variable files (dev.tfvars, prod.tfvars) or Kustomize overlays customize configs per environment. Promote changes sequentially through dev, staging, and production, with automated gates between each step.
This sounds obvious on paper, but the temptation to "just apply directly to prod, it's a small change" is strong. Resist it. Small changes cause outages too.
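On the Kubernetes side, a production Kustomize overlay might look like this sketch; the base path, image name, and patch file are placeholders.

```yaml
# overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: payments
resources:
  - ../../base                  # shared manifests used by every environment
patches:
  - path: replica-count.yaml    # prod-only override, e.g. a higher replica count
images:
  - name: payments-api
    newTag: "1.4.2"             # promoted tag; dev and staging overlays pin their own
```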
Production Considerations
Encrypt state files at rest and lock down access with IAM policies. The state file contains sensitive outputs: database passwords, private IPs, service account keys. Treat it like a secret, because it is one.
Use Terraform workspaces or directory-based isolation to prevent accidental cross-environment changes. For GitOps, configure RBAC in ArgoCD so teams can only sync their own namespaces. Nobody wants a junior engineer accidentally syncing the payments namespace.
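A sketch of that boundary, assuming a hypothetical payments team and repository: an AppProject limits where their Applications may deploy, and a policy.csv entry in argocd-rbac-cm limits who may sync them.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: payments
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/example-org/payments-manifests.git
  destinations:
    - server: https://kubernetes.default.svc
      namespace: payments          # Applications in this project can only target this namespace
  clusterResourceWhitelist: []     # no cluster-scoped resources for this team
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-rbac-cm
  namespace: argocd
data:
  policy.csv: |
    p, role:payments-deployer, applications, sync, payments/*, allow
    g, payments-team, role:payments-deployer
```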
Establish a break-glass procedure for emergency manual changes. Sometimes someone needs to kubectl edit something at 3 AM during an incident. That is fine. What is not fine is forgetting to reconcile Git with the actual state afterward. Make the follow-up PR mandatory and track compliance.
Monitor reconciliation loop latency. Alert when drift persists beyond the SLO threshold. If ArgoCD says everything is in sync but it has not actually checked in 30 minutes, that "in sync" status is a lie.
Failure Scenarios
Scenario 1: Terraform State Lock Deadlock
An engineer's terraform apply crashes mid-execution. Maybe their laptop dies, the network drops, or the process gets OOM-killed. The DynamoDB state lock never gets released. Every subsequent terraform plan or apply for that environment hangs, waiting for a lock that nobody is holding.
Cascading impact: All infrastructure changes for that environment are blocked. If this is production, nobody can scale up during an incident, rotate compromised credentials, or push a hotfix. Everything is stuck.
Detection: Monitor lock age via DynamoDB item timestamps. Alert when any lock exceeds 30 minutes. Track terraform_lock_wait_seconds in CI pipeline metrics.
Recovery: Document terraform force-unlock as a break-glass procedure that requires two-person approval. Wrap all CI-based applies with a timeout that auto-releases locks after 20 minutes. Better yet, use Terraform Cloud or Spacelift, which handle lock management with built-in timeouts. This is one of those problems to solve once and then forget about.
Scenario 2: ArgoCD Reconciliation Storm
A misconfigured ApplicationSet generator produces 500+ Applications at once. ArgoCD's repo-server and application-controller try to reconcile all of them simultaneously, overwhelming the Kubernetes API server's request budget.
Cascading impact: The API server starts throttling all clients. kubectl commands time out. Pod scheduling stalls. Health checks fail. Existing workloads begin restarting because they miss liveness probes. One bad config change takes down the control plane for every team.
Detection: Monitor ArgoCD's argocd_app_reconcile_count and argocd_app_sync_total rate. Alert when reconciliation rate exceeds 50/min. Watch apiserver_request_total with code=429 for throttling signals.
Recovery: Set spec.syncPolicy.retry.limit and rate limits on ArgoCD sync operations. Use argocd app list --status OutOfSync to identify the blast radius. Going forward, implement progressive ApplicationSet rollout with wave annotations. I strongly recommend testing ApplicationSet changes in a non-production cluster first. The blast radius of getting this wrong is enormous.
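The retry limit mentioned above lives on each Application's sync policy; the values in this sketch are illustrative, not recommendations.

```yaml
# Fragment of an Application manifest
spec:
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    retry:
      limit: 3             # give up after three failed sync attempts
      backoff:
        duration: 30s      # initial delay between retries
        factor: 2          # exponential backoff multiplier
        maxDuration: 5m    # cap on the retry delay
```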
Scenario 3: Terraform Provider Version Drift Across Teams
Team A pins the aws provider to ~> 5.20. Team B uses ~> 5.40. Both teams contribute to shared modules. A module update works for Team B but breaks Team A's plan with incompatible resource schemas. Nobody catches it until a production apply fails.
Cascading impact: Production infrastructure changes are blocked. Debugging version conflicts across 50+ modules takes hours while everyone points fingers.
Detection: Enforce provider version constraints in CI using terraform providers lock. Run a weekly job that validates all module consumers against the latest provider version.
Recovery: Have a platform team maintain a shared provider version pin file across all Terraform roots. Use Renovate or Dependabot to propose provider upgrades as PRs with automated plan validation. Build a provider compatibility matrix and test it in CI. This is tedious to set up but saves the team from painful debugging sessions down the road.
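The shared pin usually ends up as a versions.tf that is kept identical across every Terraform root and only changed through an automated PR. A sketch, with example constraint values:

```hcl
# versions.tf -- identical in all roots, bumped via Renovate/Dependabot PRs
terraform {
  required_version = ">= 1.6.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.40"   # single org-wide constraint; example value only
    }
  }
}
```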
Capacity Planning
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Terraform plan duration | < 3 min | > 5 min | > 15 min |
| ArgoCD sync latency | < 30 sec | > 2 min | > 10 min |
| Drift detection interval | < 5 min | > 10 min | > 30 min |
| State file size | < 50 MB | > 100 MB | > 500 MB |
| Resources per state file | < 500 | > 1,000 | > 5,000 |
Scaling rules of thumb: Plan time grows roughly linearly with the number of resources in a state file (plan_time ~ O(n)). When plan exceeds 5 minutes, split the state. One state file per service per environment, or per domain boundary, works well in practice.
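Once a state file is split, downstream stacks can still consume the upstream outputs through terraform_remote_state. A sketch, assuming a separate network stack that exports an output named private_subnet_ids (placeholder names throughout):

```hcl
# Read the network stack's outputs from its own, smaller state file
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "example-org-terraform-state"
    key    = "prod/network/terraform.tfstate"
    region = "us-east-1"
  }
}

# Example consumer: place this service's database in the shared VPC subnets
resource "aws_db_subnet_group" "payments" {
  name       = "payments"
  subnet_ids = data.terraform_remote_state.network.outputs.private_subnet_ids
}
```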
For ArgoCD, each repo-server pod handles about 200 Applications effectively. For 1,000+ apps, provision 5+ repo-server replicas with 4GB RAM each. Use controller sharding to spread the load: run multiple argocd-application-controller replicas with the ARGOCD_CONTROLLER_REPLICAS environment variable set to the replica count, so each replica takes a shard of the managed clusters.
The numbers look clean on paper, but real-world performance depends heavily on how complex the manifests are and how often they change. Monitor actual latencies and adjust from there rather than trusting theoretical calculations.
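If ArgoCD is deployed with the community Helm chart, the scaling knobs live in its values file. A sketch using the rough figures above; chart keys can differ between chart versions, so treat this as illustrative rather than exact.

```yaml
# values.yaml for the argo-cd Helm chart (verify key names against your chart version)
repoServer:
  replicas: 5
  resources:
    requests:
      memory: 4Gi
      cpu: "1"

controller:
  replicas: 2    # application-controller sharding: each replica owns a shard of clusters
```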
Architecture Decision Record
ADR: Pull-Based GitOps vs Push-Based IaC Deployment
Context: The decision is whether infrastructure changes get applied by pull-based agents (ArgoCD, Flux) running inside the target environment, or by push-based pipelines (Terraform in CI) that apply changes from the outside.
| Criteria (Weight) | Pull-Based (ArgoCD/Flux) | Push-Based (Terraform in CI) | Hybrid |
|---|---|---|---|
| Drift detection & self-healing (20%) | 10 | 3 | 8 |
| Security posture (20%) | 9 | 5 | 7 |
| Cloud resource provisioning (15%) | 3 | 10 | 9 |
| Kubernetes-native workflows (15%) | 10 | 4 | 8 |
| Audit & approval gates (15%) | 7 | 9 | 8 |
| Multi-cloud support (10%) | 5 | 9 | 8 |
| Operational complexity (5%) | 6 | 7 | 4 |
| Weighted Score | 7.60 | 6.30 | 7.75 |
Decision scenarios:
- Kubernetes-only infrastructure: Go with pull-based GitOps (ArgoCD/Flux). You get continuous reconciliation, drift self-healing, and a cluster that never exposes credentials externally. This is the clear winner when 80%+ of the infrastructure is K8s-native.
- Multi-cloud with managed services (RDS, BigQuery, CloudFront): Use push-based Terraform. Cloud provider resources need API-level provisioning that pull-based K8s controllers cannot handle natively without Crossplane, and Crossplane adds its own complexity.
- Both K8s and cloud-native resources (this is most teams): Go hybrid. Terraform provisions cloud resources (VPCs, databases, IAM). ArgoCD manages Kubernetes workloads. Terraform outputs like database endpoints and secret ARNs reach the ArgoCD-managed workloads via ExternalSecrets (sketched after this list). More moving parts, but each tool does what it is actually good at.
- Regulated industry with change approval boards: Push-based with Terraform Cloud or Spacelift. Sentinel/OPA policies enforce approval gates, cost estimation, and drift detection in a unified workflow with full audit trails. Regulators like seeing a single pane of glass for change management.
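For the hybrid wiring described in the third scenario above, the handoff from Terraform to ArgoCD-managed workloads can run through the External Secrets Operator. A sketch, assuming Terraform wrote the database URL into AWS Secrets Manager and a ClusterSecretStore is already configured (all names are placeholders):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: payments-db
  namespace: payments
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager            # assumed to be configured separately
  target:
    name: payments-db-credentials        # Kubernetes Secret consumed by the workload
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/payments/database-url  # written by Terraform alongside the RDS module
```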
Key Points
- Git as the single source of truth for both application and infrastructure config
- Declarative infrastructure: define the desired state, not how to get there
- Pull-based GitOps (ArgoCD, Flux) vs push-based IaC (Terraform apply in CI)
- Drift detection through continuous reconciliation, keeping actual state in line with desired state
- Infrastructure changes go through the same PR review process as application code
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| ArgoCD | Open Source | K8s GitOps, multi-cluster, UI dashboard | Medium-Enterprise |
| Terraform | Open Source | Multi-cloud IaC, stateful resource management | Small-Enterprise |
| Pulumi | Open Source | IaC in real programming languages (TS, Python, Go) | Small-Enterprise |
| Crossplane | Open Source | K8s-native cloud resource provisioning via CRDs | Medium-Enterprise |
Common Mistakes
- Running kubectl apply by hand in production, which throws away the audit trail and review process entirely
- Storing Terraform state locally without a remote backend and locking. Two concurrent applies will corrupt the state.
- Copy-pasting infrastructure instead of writing reusable modules. This falls apart the moment anything needs updating.
- Mixing application deployment with infrastructure provisioning in the same pipeline
- Skipping infrastructure tests. Terraform plan is the unit test. Apply to staging is the integration test. Do both.