Artifact Management & Container Registry
Why It Exists
Every build produces something. Container images, npm packages, JAR files, binaries. Those artifacts need to live somewhere that is versioned, secured, and reachable by the deployment targets.
I have seen teams skip this step. They build images directly on production hosts. They scp tarballs around. They pull everything from Docker Hub at deploy time and then act surprised when rate limits kill a release at 2am. All of these are traps.
A container registry is the bridge between the CI pipeline and the runtime. But for senior and staff engineers, it is also a security boundary. This is where image signing policies, vulnerability thresholds, and supply chain integrity get enforced before anything touches production. Any team not treating the registry as a security checkpoint is leaving a massive gap.
How It Works
OCI Image Specification
Container images follow the Open Container Initiative (OCI) specification. An image is made up of three things: a manifest (JSON describing the image), a config (runtime parameters like entrypoint, environment variables, exposed ports), and an ordered set of layers (filesystem diffs stored as compressed tarballs). Each layer is content-addressed by its SHA256 digest.
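As a concrete illustration, here is a minimal sketch (with placeholder layer bytes) that computes a layer's content address and assembles the manifest that references it. The media types come from the OCI image spec; everything else is illustrative.

```python
# Minimal sketch of OCI content addressing: hash a layer blob and build
# the manifest that references it. The layer bytes are a stand-in for a
# real filesystem diff packed as a compressed tarball.
import gzip
import hashlib
import json

def digest(blob: bytes) -> str:
    """Content address: algorithm prefix plus hex SHA256 of the exact bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

layer = gzip.compress(b"...filesystem diff as a tarball...")  # placeholder
config = json.dumps({"architecture": "amd64", "os": "linux",
                     "config": {"Entrypoint": ["/app/server"]}}).encode()

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "digest": digest(config),
        "size": len(config),
    },
    "layers": [{
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": digest(layer),
        "size": len(layer),
    }],
}
print(json.dumps(manifest, indent=2))
```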
Layer Mechanics and Deduplication
When an image is pushed, the registry checks which layers already exist by their digest. Only new or modified layers get uploaded. This is why Dockerfile ordering matters so much.
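A toy model of that push-time check, assuming a registry we can ask "do you already have this digest?" (the real protocol does this with a HEAD request against /v2/&lt;name&gt;/blobs/&lt;digest&gt;):

```python
# Toy model of push-time layer deduplication: only blobs whose digest is
# unknown to the registry cross the wire.
import hashlib

def digest(blob: bytes) -> str:
    return "sha256:" + hashlib.sha256(blob).hexdigest()

registry_blobs: set[str] = set()  # stand-in for the registry's blob store

def push_image(layers: list[bytes]) -> None:
    for blob in layers:
        d = digest(blob)
        if d in registry_blobs:
            print(f"skip   {d[:19]}  (already present)")
        else:
            registry_blobs.add(d)  # only new layers are uploaded
            print(f"upload {d[:19]}  ({len(blob)} bytes)")

base = b"apt-get install ..."    # stable layer: cached after the first push
deps = b"pip install -r reqs"    # stable layer
app_v1, app_v2 = b"app code v1", b"app code v2"

push_image([base, deps, app_v1])  # first push uploads everything
push_image([base, deps, app_v2])  # second push uploads only the app layer
```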
Put the things that change often (like COPY . .) at the bottom of the Dockerfile. Keep system package installs and dependency fetches at the top so those layers stay cached. A well-structured Dockerfile turns a five-minute push into a ten-second one. I have seen teams shave hours off their daily CI time just by reordering their Dockerfile.
Image Signing and Verification
Verify that the image running in production is the exact image the CI pipeline built. No exceptions.
Cosign (part of the Sigstore project) signs images using keyless signing with OIDC identity tokens tied to the CI provider. Kubernetes admission controllers like Kyverno or OPA Gatekeeper then enforce that only signed images from trusted pipelines get into the cluster. The Notary project is another option: Notary v1 uses The Update Framework (TUF) for signature storage and delegation, while its successor Notation (Notary v2) stores signatures as OCI artifacts alongside the image.
My strong recommendation: start with Cosign. It is simpler to set up, the keyless workflow removes an entire class of key management headaches, and the ecosystem is moving in its direction.
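A hedged sketch of what that looks like in CI, shelling out to the Cosign 2.x CLI. The image reference, identity regexp, and issuer URL are illustrative values for a GitHub Actions pipeline, not prescribed ones.

```python
# Sketch of keyless signing plus verification via the Cosign CLI.
# Assumes Cosign 2.x and an ambient OIDC token from the CI provider.
import subprocess

# Sign the digest, not a tag; the digest below is a placeholder.
IMAGE = "registry.example.com/payments/api@sha256:..."

def sign(image: str) -> None:
    # In 2.x, keyless is the default flow when an OIDC token is available;
    # --yes skips the interactive confirmation in CI.
    subprocess.run(["cosign", "sign", "--yes", image], check=True)

def verify(image: str) -> bool:
    # Identity and issuer pin the signature to a specific CI pipeline.
    result = subprocess.run(
        ["cosign", "verify",
         "--certificate-identity-regexp", r"^https://github\.com/my-org/.*",
         "--certificate-oidc-issuer", "https://token.actions.githubusercontent.com",
         image],
        capture_output=True,
    )
    return result.returncode == 0
```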
Vulnerability Scanning Pipeline
Scan at multiple points. Not just one.
- Build time: Scan the Dockerfile and base image during CI (Trivy, Grype, Snyk Container)
- Registry time: Enable automatic scanning on push (ECR, Harbor native scanning)
- Runtime: Continuously rescan deployed images as new CVEs get published
Define a policy gate: block deployment if any critical or high-severity CVE exists without a fix available. Track vulnerability debt with SLOs, for example, all high-severity CVEs patched within 7 days.
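One way that gate might look as a CI step, shelling out to Trivy. The image name is a placeholder, and --ignore-unfixed is what encodes the "without a fix available" clause.

```python
# Sketch of a build-time policy gate: fail the pipeline when Trivy finds
# a fixable CRITICAL/HIGH CVE in the image.
import subprocess
import sys

def scan_gate(image: str) -> None:
    result = subprocess.run([
        "trivy", "image",
        "--severity", "CRITICAL,HIGH",
        "--ignore-unfixed",    # no fix available -> tracked as debt, not a block
        "--exit-code", "1",    # nonzero exit when findings remain
        image,
    ])
    if result.returncode != 0:
        sys.exit(f"blocking deploy: fixable HIGH/CRITICAL CVEs in {image}")

scan_gate("registry.example.com/payments/api:1.4.2")
```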
The hard part is not setting up scanning. It is keeping the noise manageable so the team does not start ignoring the alerts.
Retention Policies and Cost Control
Configure lifecycle policies to automatically expire untagged images after N days and retain only the last M tagged versions per repository. ECR lifecycle policies and Harbor tag retention rules handle this automatically. Without retention policies, registry storage will quietly balloon to terabytes of orphaned layers. I have seen storage bills triple in six months because nobody set this up.
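A sketch of those rules expressed as an ECR lifecycle policy applied via boto3; the repository name, the 14-day window, and the 30-version cap are placeholder values. Harbor expresses the same idea as tag retention rules.

```python
# Sketch: apply an ECR lifecycle policy that expires untagged images
# after N days and caps tagged versions at M per repository.
import json
import boto3

policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "expire untagged images after 14 days",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": 14,
            },
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": "keep only the last 30 tagged versions",
            "selection": {
                "tagStatus": "tagged",
                "tagPrefixList": ["v"],
                "countType": "imageCountMoreThan",
                "countNumber": 30,
            },
            "action": {"type": "expire"},
        },
    ]
}

boto3.client("ecr").put_lifecycle_policy(
    repositoryName="payments/api",
    lifecyclePolicyText=json.dumps(policy),
)
```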
Multi-Architecture Builds
Modern platforms need images for both amd64 and arm64. Docker Buildx with --platform linux/amd64,linux/arm64 produces a manifest list (fat manifest) that lets the container runtime pull the correct architecture automatically. For mixed fleets (x86 servers alongside Graviton or other ARM instances), this is not optional.
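A minimal wrapper around that Buildx invocation (tag and build context are placeholders):

```python
# Sketch: build and publish a multi-arch image via Docker Buildx.
import subprocess

def build_multiarch(tag: str, context: str = ".") -> None:
    subprocess.run([
        "docker", "buildx", "build",
        "--platform", "linux/amd64,linux/arm64",
        "--tag", tag,
        # --push sends the manifest list straight to the registry; the
        # classic local image store cannot hold multi-arch manifest lists.
        "--push",
        context,
    ], check=True)

build_multiarch("registry.example.com/payments/api:1.4.2")
```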
Production Considerations
Mirror critical public base images (node, python, alpine) into the private registry. This protects the pipeline from Docker Hub rate limits and outages. I have watched entire deploy freezes happen because Docker Hub went down during a release window.
For multi-region deployments, set up geo-replicated registries. Harbor supports replication rules between instances, and ECR supports cross-region replication.
Lock down RBAC so only CI service accounts can push images. Developers get read-only access. Monitor pull latency and error rates as part of the platform SLOs, because a slow or broken registry directly blocks deployments and autoscaling events.
Failure Scenarios
Scenario 1: Registry Outage During Autoscaling Event
A traffic spike triggers Kubernetes Horizontal Pod Autoscaler to scale a service from 10 to 50 pods. The container registry is throwing 503s. New pods fail with ImagePullBackOff and never become ready. The existing 10 pods drown under traffic they cannot shed.
Cascading impact: The load balancer's health checks start failing on overloaded pods. Circuit breakers trip in upstream services. User-facing errors spike. HPA keeps requesting more pods (which also fail to pull), creating a feedback loop of failed scheduling.
Detection: Monitor kubelet_image_pull_duration_seconds and container_image_pull_errors_total per node. Alert when image pull failure rate exceeds 5% within a 2-minute window.
Recovery: Set imagePullPolicy: IfNotPresent (not Always) for production workloads so existing nodes use cached layers. Pre-pull critical images to all nodes via DaemonSet. Deploy registry mirrors (Harbor proxy cache, Dragonfly P2P distribution) so nodes pull from local cache first. This is one of those things that feels like over-engineering until the first time it prevents an outage.
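A sketch of that pre-pull DaemonSet using the official Kubernetes Python client. The image list and namespace are placeholders, and it assumes each image ships a /bin/sh for the no-op init command.

```python
# Sketch: DaemonSet that pre-pulls critical images onto every node.
# Pulling is the entire job, so each init container runs a no-op.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

CRITICAL_IMAGES = [
    "registry.example.com/payments/api:1.4.2",      # placeholder images
    "registry.example.com/platform/gateway:2.0.1",
]

def prepull_daemonset(images: list[str]) -> client.V1DaemonSet:
    labels = {"app": "image-prepull"}
    init = [
        client.V1Container(
            name=f"pull-{i}",
            image=img,
            command=["/bin/sh", "-c", "true"],  # assumes a shell in the image
            image_pull_policy="IfNotPresent",
        )
        for i, img in enumerate(images)
    ]
    # The long-running container just parks so the pod stays scheduled.
    pause = client.V1Container(name="pause", image="registry.k8s.io/pause:3.9")
    return client.V1DaemonSet(
        metadata=client.V1ObjectMeta(name="image-prepull"),
        spec=client.V1DaemonSetSpec(
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(init_containers=init, containers=[pause]),
            ),
        ),
    )

client.AppsV1Api().create_namespaced_daemon_set(
    namespace="kube-system", body=prepull_daemonset(CRITICAL_IMAGES)
)
```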
Scenario 2: Supply Chain Attack via Compromised Base Image
A widely-used public base image (say, node:18-alpine) gets compromised upstream. The CI pipeline pulls it and builds on top of the poisoned image. Vulnerability scanners do not flag it because the payload is a zero-day with no CVE yet. The image passes all gates and deploys to production.
Cascading impact: A cryptominer or data exfiltration payload runs inside production containers. If that base image is shared across 50+ services, the blast radius is the entire platform.
Detection: Monitor unexpected outbound network connections from containers using Falco or Cilium network policies. Watch for CPU utilization anomalies per pod. Cryptominers cause sustained >80% CPU on pods that should be idle. Generate SBOMs with Syft and compare dependency hashes between builds.
Recovery: Pin base images to specific SHA digests. Never use floating tags. Maintain a curated base image catalog rebuilt weekly from source with provenance attestation (SLSA Level 3). Enforce Cosign signature verification in admission controllers so only internally-built base images can be deployed.
This scenario is not theoretical. It has happened multiple times in the wild. Without pinned digests and verified signatures, the organization is rolling the dice.
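A small sketch of the pin-and-attest habit from the recovery steps, assuming the crane and syft CLIs are on the PATH; node:18-alpine is just the example from above.

```python
# Sketch: resolve a floating tag to its digest once, record the pinned
# reference, and generate an SBOM for diffing between builds.
import subprocess

def resolve_digest(image: str) -> str:
    out = subprocess.run(["crane", "digest", image],
                         capture_output=True, text=True, check=True)
    return f"{image}@{out.stdout.strip()}"

def sbom(image_ref: str, path: str) -> None:
    # Syft emits a JSON SBOM; diff package hashes across builds to spot drift.
    with open(path, "w") as f:
        subprocess.run(["syft", image_ref, "-o", "json"], stdout=f, check=True)

pinned = resolve_digest("node:18-alpine")  # e.g. node:18-alpine@sha256:...
sbom(pinned, "sbom-node18.json")
print("pin this in the Dockerfile FROM line:", pinned)
```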
Scenario 3: Layer Storage Backend Corruption
The registry's blob storage backend (S3, GCS, Azure Blob) experiences silent data corruption or accidental deletion of a layer blob. Image pulls fail with checksum mismatch errors for a specific layer shared across hundreds of images.
Cascading impact: Any deployment or scaling event for images referencing the corrupted layer fails. Rollback might also fail if the previous version shares the same base layers.
Detection: Monitor registry_storage_errors_total and image pull success rates. Run periodic garbage collection with integrity verification (registry garbage-collect --dry-run).
Recovery: Enable S3 versioning on the storage backend for recovering deleted blobs. Maintain cross-region registry replication as a warm standby. Implement a registry health check that periodically pulls a canary image from each repository.
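A minimal version of that canary check, assuming a dedicated lightweight probe image; it deletes the local copy first so the timing measures a real pull rather than a cache hit.

```python
# Sketch: periodic registry canary that times a real image pull and
# flags failures or latency regressions.
import subprocess
import time

def canary_pull(image: str, timeout_s: int = 30) -> float | None:
    """Returns pull latency in seconds, or None on failure."""
    subprocess.run(["docker", "image", "rm", "-f", image],
                   capture_output=True)  # force a real pull, not a cache hit
    start = time.monotonic()
    try:
        subprocess.run(["docker", "pull", image], check=True,
                       capture_output=True, timeout=timeout_s)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    return time.monotonic() - start

latency = canary_pull("registry.example.com/platform/canary:probe")
if latency is None or latency > 5.0:  # mirrors the <5s target in the table below
    print("ALERT: registry canary failed or breached latency target")
```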
Capacity Planning
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Image pull latency (same region) | < 5 sec | > 15 sec | > 30 sec |
| Registry storage growth | < 10% month-over-month | > 20% MoM | > 50% MoM |
| Image scan time | < 60 sec per image | > 3 min | > 10 min |
| Layer deduplication ratio | > 70% | < 50% | < 30% |
| Concurrent pull capacity | 500+ pulls/min | < 200/min | < 50/min |
Scale references: Docker Hub serves 13B+ image pulls per month. GitHub Container Registry handles 1B+ pulls per month across public and private repositories. Shopify's internal registry serves 50K+ image pulls per day with p99 latency under 8 seconds, using Harbor with S3 backend and CloudFront CDN for layer distribution.
Storage formula: monthly_storage = images_per_day x avg_unique_layers x avg_layer_size x 30 x (1 - dedup_ratio). For 200 images/day, 5 unique layers at 50MB average, 70% deduplication: 200 x 5 x 50MB x 30 x 0.3 = 450 GB/month. At S3 Standard pricing ($0.023 per GB-month), that is roughly $10.35/month for the new storage alone, plus about $0.09/GB for data transfer out.
Retention policy math: Keeping 30 tagged versions per repository across 100 repositories with 500MB average image size = 100 x 30 x 500MB = 1.5 TB before deduplication. With 70% dedup, actual storage is around 450 GB.
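The same arithmetic as executable Python, using the example inputs from the text (GB = 1000 MB here):

```python
# Back-of-envelope registry storage math from the two formulas above.
images_per_day, unique_layers, layer_mb = 200, 5, 50
dedup_ratio = 0.70

monthly_gb = images_per_day * unique_layers * layer_mb * 30 * (1 - dedup_ratio) / 1000
print(f"monthly growth: {monthly_gb:,.0f} GB")                    # 450 GB

s3_per_gb_month = 0.023
print(f"storage cost:  ${monthly_gb * s3_per_gb_month:,.2f}/month")  # ~$10.35

repos, versions, image_mb = 100, 30, 500
retained_gb = repos * versions * image_mb * (1 - dedup_ratio) / 1000
print(f"retained set:  {retained_gb:,.0f} GB after dedup")        # 450 GB
```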
Architecture Decision Record
ADR: Selecting an Artifact Storage and Container Registry Strategy
Context: The container registry sits in the critical path of every deployment and autoscaling event. Choosing between managed, self-hosted, and hybrid registry architectures has long-term consequences for cost, security, and operational overhead. Get this wrong and the consequences will linger for years.
| Criteria (Weight) | AWS ECR | Harbor (Self-Hosted) | Docker Hub | GitHub Packages |
|---|---|---|---|---|
| Security scanning (20%) | 8 | 9 | 5 | 6 |
| Multi-cloud portability (15%) | 4 | 10 | 8 | 6 |
| Operational overhead (15%) | 9 | 4 | 10 | 9 |
| Cost at scale (15%) | 6 | 8 | 5 | 7 |
| Geo-replication (10%) | 7 | 8 | 3 | 4 |
| Image signing & SBOM (15%) | 7 | 9 | 4 | 6 |
| RBAC granularity (10%) | 7 | 10 | 4 | 6 |
| Weighted Score | 6.90 | 8.25 | 5.75 | 6.40 |
Decision scenarios:
- AWS-native organization with EKS clusters: Go with ECR. Native IAM integration eliminates credential management for image pulls. Lifecycle policies handle retention. That trades some cloud lock-in for minimal ops overhead, and it is usually a good trade.
- Multi-cloud or hybrid cloud with strict compliance needs: Harbor, self-hosted. This provides full control over image storage location, scanning policies, and replication topology. It integrates with Trivy, Notary, and Cosign. Deploy on Kubernetes with Helm for HA. Be honest about the cost: budget 2-4 engineer-days per month for operational upkeep.
- Open source project or public images: Docker Hub or GitHub Packages. The free tier works fine for public distribution. Do not use either as the sole registry for production private images though. Rate limits bite hard (100 pulls per 6 hours for anonymous, 200 for authenticated).
- Enterprise with 100+ microservices and supply chain security requirements: Harbor plus a cloud-managed proxy. Run Harbor as the source of truth with Cosign signing enforcement, then replicate to ECR or GCR per region as pull-through caches. Admission controllers validate signatures against Harbor's trust store.
Key Points
- Central repository for build artifacts (container images, packages, binaries) with versioning and access control
- Container registries store OCI images with layer deduplication. Only changed layers get pushed or pulled.
- Image scanning detects CVEs in base images and dependencies before deployment
- Immutable references (pin SHA digests rather than mutable tags like :latest) ensure reproducible deployments
- Proximity matters. A registry in the same region as the cluster cuts pull times dramatically.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| AWS ECR | Managed | EKS integration, lifecycle policies, scanning | Small-Enterprise |
| Harbor | Open Source | Enterprise registry, replication, RBAC, scanning | Medium-Enterprise |
| Docker Hub | Managed | Public images, community ecosystem | Small-Medium |
| GitHub Packages | Managed | GitHub-native, multi-format (npm, Docker, Maven) | Small-Large |
Common Mistakes
- Using :latest in production. There is no way to deterministically reproduce a deployment.
- Not scanning images before deployment. Known CVEs walk straight into production.
- Storing secrets in image layers. They persist in layer history even if a later layer deletes them.
- No image retention policy. Registry storage grows unbounded and costs creep up.
- Pulling from public registries in CI. Rate limits and outages will break the pipeline.