CI/CD Pipeline Design
Why It Exists
Anyone who has spent a Friday evening untangling a merge gone wrong, or watched a deployment turn into a three-hour Slack thread, already knows why CI/CD pipelines exist.
The core idea is simple. Take the manual, error-prone work of building, testing, and shipping software, and turn it into an automated, repeatable workflow. Every commit gets built and tested the same way, no matter who wrote it. No special incantations. No "it worked on my machine."
Without a pipeline, long-lived branches drift apart, merge conflicts pile up, and deployments become something people dread. With one, shipping code becomes boring. And boring is exactly what a deployment process should be.
At a staff or architect level, the pipeline is not just a developer convenience. It is the primary quality gate for the entire organization and the foundation of delivery metrics (DORA: deployment frequency, lead time, change failure rate, MTTR). If the pipeline is slow or unreliable, everything downstream suffers.
How It Works
The Testing Pyramid in CI
Structure the pipeline around the testing pyramid. This is not a new idea, but it is surprising how many teams get it wrong.
Unit tests (70-80% of the suite) run first. They are fast, isolated, and catch logic errors in seconds. Integration tests (15-20%) validate service boundaries, database queries, and API contracts. End-to-end tests (5-10%) run against a deployed staging environment and check critical user journeys.
The most important principle here: fail fast. If unit tests break, there is zero reason to kick off a 20-minute E2E suite. Kill the pipeline early. The developers will thank the team that set it up.
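As a concrete sketch in GitHub Actions syntax, stage ordering with needs: gives fail-fast behavior for free. The job names and npm test scripts here are hypothetical, not a prescribed layout:

```yaml
# A minimal fail-fast sketch. Job names and the npm scripts
# (test:unit, test:integration, test:e2e) are hypothetical.
name: ci
on: [push]

jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:unit

  integration:
    needs: unit            # skipped entirely if unit tests fail
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:integration

  e2e:
    needs: integration     # the slow suite only runs last
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci && npm run test:e2e
```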
Build Optimization
Three things will make the biggest difference in build times:
- Dependency caching: Store node_modules, .m2, or Go module caches between builds, keyed by a hash of the lockfile so the cache invalidates exactly when dependencies change. Without this, the pipeline burns minutes on every single build for no reason (sketched below).
- Parallelism: Split test suites across multiple runners using sharding (e.g., Jest --shard, pytest-xdist). A 30-minute test suite split across 6 runners finishes in 5 minutes. The math is simple, and the payoff is huge.
- Affected detection: In monorepos, tools like Nx, Turborepo, or Bazel compute the dependency graph and only build/test packages touched by the changeset. Nobody should be rebuilding 50 services because someone fixed a typo in one.
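A minimal sketch of the first two, again in GitHub Actions syntax; the cache path, four-way shard count, and Jest invocation are illustrative assumptions:

```yaml
# Lockfile-keyed dependency caching plus a four-way test shard.
# Cache path and shard count are illustrative, not recommendations.
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/cache@v4
        with:
          path: ~/.npm
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
      - run: npm ci
      - run: npx jest --shard=${{ matrix.shard }}/4
```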
Secrets Management in CI
Do not hardcode secrets in pipeline definitions. I know this sounds obvious, but I have seen it in production more times than I would like to admit.
Use the CI platform's encrypted secret store (GitHub Actions secrets, GitLab CI variables) and inject them as environment variables at runtime. For more complex setups, pull short-lived credentials from HashiCorp Vault or AWS Secrets Manager on each pipeline run. Rotate CI service account tokens on a schedule, not "when we remember."
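For the simple case, a sketch in GitHub Actions syntax; DEPLOY_TOKEN and scripts/deploy.sh are hypothetical names:

```yaml
# A minimal sketch: the secret lives in the platform's encrypted store and
# is injected only for the step that needs it, never written to the repo.
# DEPLOY_TOKEN and scripts/deploy.sh are hypothetical.
steps:
  - name: Deploy
    run: ./scripts/deploy.sh
    env:
      DEPLOY_TOKEN: ${{ secrets.DEPLOY_TOKEN }}
```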
Monorepo vs Polyrepo Pipelines
| Concern | Monorepo | Polyrepo |
|---|---|---|
| Pipeline trigger | Path-filtered, affected detection | Per-repo webhook |
| Build graph | Centralized, shared dependency resolution | Independent per service |
| Caching | Cross-project cache sharing | Isolated per repo |
| Versioning | Unified or independent per package | Independent by default |
Neither approach is universally better. Monorepos provide atomic cross-service changes and shared caching. Polyrepos provide isolation and simpler per-team ownership. Pick based on team structure and how tightly coupled the services actually are.
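As a sketch of the path-filtered trigger from the table, in GitHub Actions syntax with hypothetical paths:

```yaml
# A minimal path-filtered monorepo trigger: this workflow runs only when
# the payments service or a shared library changes. Paths are hypothetical.
on:
  pull_request:
    paths:
      - 'services/payments/**'
      - 'libs/shared/**'
```

Path filters are a blunt instrument compared to a real build graph, but they are a cheap first step before reaching for Nx or Bazel.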
Production Considerations
An unobservable pipeline is an untrustworthy pipeline. Emit structured logs from every stage, track stage durations, and set up alerts for degradation. If the median build time drifts from 8 minutes to 14, the platform team should know before the engineers start complaining.
A few non-negotiable practices:
- Enforce branch protection rules that require green CI before merge. No exceptions, not even for senior engineers. The moment "just this once" bypasses are allowed, the game is lost.
- Set pipeline-level timeouts. A hung build otherwise holds its runner for hours and blocks everything queued behind it (sketched after this list).
- Use ephemeral, containerized runners. A clean environment per execution eliminates the entire class of "works on the CI machine" bugs. Yes, it costs a bit more in setup time. It is worth it.
- For regulated industries, keep an immutable audit log of every pipeline execution, including who approved the production deployment gate.
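For the timeout rule above, a minimal sketch in GitHub Actions syntax; the 20- and 5-minute values are illustrative, not a recommendation:

```yaml
# Cap both the job and any step that talks to the network, so a hung build
# releases its runner instead of holding it for the platform default
# (6 hours on GitHub-hosted runners).
jobs:
  build:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
        timeout-minutes: 5
```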
Failure Scenarios
Scenario 1: Shared Runner Pool Exhaustion
A large monorepo merge queue triggers 200+ concurrent pipeline runs. The shared runner pool (say, 50 self-hosted runners) saturates. Every subsequent pipeline queues indefinitely, blocking every team in the org.
What happens next: PRs cannot merge, hotfixes stall, deployment freezes spread. Monday mornings are the worst for this.
How to detect it: Monitor ci_pending_jobs queue depth and runner_utilization_ratio. Alert when queue depth exceeds 3x the runner count or p95 wait time crosses 5 minutes.
How to fix it: Set up autoscaling runner groups (GitHub Actions scale sets, GitLab Runner autoscaler on Kubernetes). Impose per-team concurrency limits so one team's bulk merge cannot starve everyone else. Create a priority lane for hotfix branches that jumps ahead of feature work.
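One cheap, partial mitigation in GitHub Actions is a workflow-level concurrency group, sketched below. It only caps duplicate runs per branch; true per-team quotas have to live in runner group or platform-level settings:

```yaml
# Cancel superseded runs for the same branch so a busy merge queue does
# not hold runners for work that is already obsolete. This is a per-branch
# cap, not a per-team quota.
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
```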
Scenario 2: Dependency Mirror/Cache Corruption
The internal npm/Maven mirror returns corrupted packages or stale versions. Builds produce artifacts with wrong dependency versions.
What happens next: This is a nasty one. Builds still succeed, but they produce incorrect binaries. Subtle runtime bugs slip into production because nothing flagged the problem at build time.
How to detect it: Verify dependency checksums against lockfile hashes post-install. Track cache hit rates. A sudden drop from 95% to 0% is a clear sign of cache eviction or corruption.
How to fix it: Add checksum verification as a pipeline step (sketched below). Maintain a fallback to upstream registries behind a circuit breaker. Back the mirror with immutable storage snapshots so a corrupted state can be rolled back quickly.
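A sketch of the verification step, assuming a Node project with a committed package-lock.json; npm ci installs strictly from the lockfile and fails on an integrity mismatch:

```yaml
# A minimal sketch: npm ci fails the build when a package's integrity hash
# does not match what the lockfile records, catching a corrupted mirror
# before an artifact is ever produced.
steps:
  - uses: actions/checkout@v4
  - name: Install with integrity verification
    run: npm ci
```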
Scenario 3: Secret Rotation During Active Deployments
A scheduled rotation of CI service account tokens invalidates credentials mid-pipeline. In-flight deployments fail at the deploy step after 15 minutes of successful build and test.
What happens next: Half-deployed state in staging, broken integration tests across teams, and confusion about whether to roll back or push forward.
How to detect it: Monitor auth_failure_rate in the CI platform metrics. Alert on more than 2% authentication failures within a 5-minute window.
How to fix it: Use short-lived OIDC tokens instead of long-lived secrets (GitHub Actions OIDC for AWS is a good example). Implement graceful rotation with overlapping validity windows so old tokens stay valid while new ones propagate. Add a pre-flight credential validation check as the very first pipeline step. If creds are bad, fail immediately instead of wasting 15 minutes.
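A sketch of the OIDC approach; the role ARN, region, and deploy script are placeholders, and the IAM role's trust policy must allow this repository's OIDC claims:

```yaml
# A minimal sketch of the GitHub Actions OIDC flow for AWS. No long-lived
# AWS keys are stored in CI at all; the runner exchanges a short-lived
# OIDC token for temporary credentials at runtime.
permissions:
  id-token: write   # required for the OIDC token exchange
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/ci-deploy  # placeholder
          aws-region: us-east-1
      - run: ./scripts/deploy.sh   # hypothetical deploy script
```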
Capacity Planning
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Median build time | < 10 min | > 15 min | > 25 min |
| p95 queue wait time | < 2 min | > 5 min | > 15 min |
| Runner utilization | 60-75% | > 85% | > 95% |
| Cache hit rate | > 90% | < 80% | < 60% |
| Pipeline success rate | > 95% | < 90% | < 80% |
Here is how to size the runner pool. The formula: Required runners = (daily_pipeline_runs x avg_duration_minutes) / (1440 x target_utilization), where 1440 is the number of runner-minutes each runner provides per day. For 5,000 daily runs at 12 minutes average with a 70% utilization target: (5000 x 12) / (1440 x 0.7) ≈ 60 runners. Add 30% headroom for burst traffic (Monday mornings, release days), giving roughly 78 runners. Budget $150-300/month per self-hosted runner for compute, storage, and network.
One thing to keep in mind: these numbers assume steady-state traffic. Real pipeline traffic is bursty. Most teams see a spike between 10am-12pm and another in the afternoon. The autoscaler needs to handle the peaks, not the averages.
Architecture Decision Record
ADR: Selecting a CI/CD Platform
Context: The CI/CD platform choice will shape developer workflows and operational overhead for years. Migrations are painful, so it is worth getting this right.
| Criteria (Weight) | GitHub Actions | GitLab CI | Jenkins | CircleCI |
|---|---|---|---|---|
| Setup complexity (15%) | 9 | 8 | 4 | 8 |
| Ecosystem & plugins (15%) | 9 | 7 | 10 | 7 |
| Self-hosted option (10%) | 7 | 9 | 10 | 3 |
| Monorepo support (20%) | 6 | 7 | 8 | 7 |
| Security & compliance (15%) | 8 | 9 | 6 | 7 |
| Cost at scale (15%) | 5 | 7 | 8 | 5 |
| Observability (10%) | 7 | 8 | 5 | 7 |
| Weighted Score | 7.25 | 7.75 | 7.30 | 6.45 |
My recommendations by situation:
- Startup on GitHub (under 50 engineers): GitHub Actions. Low ops overhead, generous free tier, and native integration means the team is not context-switching between tools. Do not overthink this.
- Regulated enterprise with air-gapped environments: GitLab CI (self-managed). A single platform for SCM, CI, registry, and security scanning satisfies auditor requirements for unified audit trails. That consolidation matters when compliance reviews come around.
- Large monorepo with custom build tooling (200+ engineers): Jenkins or Buildkite. The flexibility for custom executors, distributed build graphs, and integration with Bazel/Buck becomes essential. GitHub Actions will start to feel limiting at this scale.
- Multi-cloud organization: GitLab CI or Jenkins. Avoid tying the organization to a single cloud provider's CI tooling (AWS CodePipeline, GCP Cloud Build). The lock-in becomes painful the moment the cloud strategy shifts.
Key Points
- Continuous Integration merges code frequently; Continuous Delivery automates the release; Continuous Deployment goes further and auto-deploys every passing change to production
- Pipeline stages: lint, test, build, security scan, deploy to staging, integration test, deploy to prod
- Fast feedback loops matter more than anything else. Aim for under 10 min from commit to test results.
- Trunk-based development with short-lived feature branches keeps merge conflicts small and manageable
- Pipeline as code (Jenkinsfile, .github/workflows) makes builds reproducible and auditable
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| GitHub Actions | Managed | GitHub-native, marketplace actions, matrix builds | Small-Enterprise |
| GitLab CI | Open Source | Integrated DevOps platform, self-hosted option | Medium-Enterprise |
| Jenkins | Open Source | Maximum flexibility, plugin ecosystem, self-hosted | Medium-Enterprise |
| CircleCI | Managed | Docker-first, fast builds, orbs ecosystem | Small-Large |
Common Mistakes
- Not parallelizing independent test suites. Sequential execution wastes minutes per build.
- Running all tests on every PR. Use affected/changed file detection for large monorepos.
- Manual deployment steps in the pipeline. That defeats the whole point of automation.
- Not caching dependencies between builds. A fresh npm install adds 2-5 minutes every time.
- Sharing mutable state between pipeline stages. Leftover files, containers, or database rows from one stage make later tests flaky.