Artifact Management & Container Registry
Why It Exists
Every build produces something. Container images, npm packages, JAR files, binaries. Those artifacts need to live somewhere that is versioned, secured, and reachable by the deployment targets.
I have seen teams skip this step. They build images directly on production hosts. They scp tarballs around. They pull everything from Docker Hub at deploy time and then act surprised when rate limits kill a release at 2am. All of these are traps.
A container registry is the bridge between the CI pipeline and the runtime. But for senior and staff engineers, it is also a security boundary. This is where image signing policies, vulnerability thresholds, and supply chain integrity get enforced before anything touches production. Any team not treating the registry as a security checkpoint is leaving a massive gap.
How It Works
OCI Image Specification
Container images follow the Open Container Initiative (OCI) specification. An image is made up of three things: a manifest (JSON describing the image), a config (runtime parameters like entrypoint, environment variables, exposed ports), and an ordered set of layers (filesystem diffs stored as compressed tarballs). Each layer is content-addressed by its SHA256 digest.
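As a concrete illustration, here is a minimal sketch (with placeholder layer bytes) that computes a layer's content address and assembles the manifest that references it. The media types come from the OCI image spec; everything else is illustrative.

```python
# Minimal sketch of OCI content addressing: hash a layer blob and build
# the manifest that references it. The layer bytes are a stand-in for a
# real filesystem diff packed as a compressed tarball.
import gzip
import hashlib
import json

def digest(blob: bytes) -> str:
    """Content address: algorithm prefix plus hex SHA256 of the exact bytes."""
    return "sha256:" + hashlib.sha256(blob).hexdigest()

layer = gzip.compress(b"...filesystem diff as a tarball...")  # placeholder
config = json.dumps({"architecture": "amd64", "os": "linux",
                     "config": {"Entrypoint": ["/app/server"]}}).encode()

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": {
        "mediaType": "application/vnd.oci.image.config.v1+json",
        "digest": digest(config),
        "size": len(config),
    },
    "layers": [{
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": digest(layer),
        "size": len(layer),
    }],
}
print(json.dumps(manifest, indent=2))
```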
Layer Mechanics and Deduplication
When an image is pushed, the registry checks which layers already exist by their digest. Only new or modified layers get uploaded. This is why Dockerfile ordering matters so much.
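A toy model of that push-time check, assuming a registry we can ask "do you already have this digest?" (the real protocol does this with a HEAD request against /v2/&lt;name&gt;/blobs/&lt;digest&gt;):

```python
# Toy model of push-time layer deduplication: only blobs whose digest is
# unknown to the registry cross the wire.
import hashlib

def digest(blob: bytes) -> str:
    return "sha256:" + hashlib.sha256(blob).hexdigest()

registry_blobs: set[str] = set()  # stand-in for the registry's blob store

def push_image(layers: list[bytes]) -> None:
    for blob in layers:
        d = digest(blob)
        if d in registry_blobs:
            print(f"skip   {d[:19]}  (already present)")
        else:
            registry_blobs.add(d)  # only new layers are uploaded
            print(f"upload {d[:19]}  ({len(blob)} bytes)")

base = b"apt-get install ..."    # stable layer: cached after the first push
deps = b"pip install -r reqs"    # stable layer
app_v1, app_v2 = b"app code v1", b"app code v2"

push_image([base, deps, app_v1])  # first push uploads everything
push_image([base, deps, app_v2])  # second push uploads only the app layer
```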
Put the things that change often (like COPY . .) at the bottom of the Dockerfile. Keep system package installs and dependency fetches at the top so those layers stay cached. A well-structured Dockerfile turns a five-minute push into a ten-second one. I have seen teams shave hours off their daily CI time just by reordering their Dockerfile.
Image Signing and Verification
Verify that the image running in production is the exact image the CI pipeline built. No exceptions.
Cosign (part of the Sigstore project) signs images using keyless signing with OIDC identity tokens tied to the CI provider. Kubernetes admission controllers like Kyverno or OPA Gatekeeper then enforce that only signed images from trusted pipelines get into the cluster. The Notary project is another option: Notary v1 uses The Update Framework (TUF) for signature storage and delegation, while its successor Notation (Notary v2) stores signatures as OCI artifacts alongside the image.
My strong recommendation: start with Cosign. It is simpler to set up, the keyless workflow removes an entire class of key management headaches, and the ecosystem is moving in its direction.
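A hedged sketch of what that looks like in CI, shelling out to the Cosign 2.x CLI. The image reference, identity regexp, and issuer URL are illustrative values for a GitHub Actions pipeline, not prescribed ones.

```python
# Sketch of keyless signing plus verification via the Cosign CLI.
# Assumes Cosign 2.x and an ambient OIDC token from the CI provider.
import subprocess

# Sign the digest, not a tag; the digest below is a placeholder.
IMAGE = "registry.example.com/payments/api@sha256:..."

def sign(image: str) -> None:
    # In 2.x, keyless is the default flow when an OIDC token is available;
    # --yes skips the interactive confirmation in CI.
    subprocess.run(["cosign", "sign", "--yes", image], check=True)

def verify(image: str) -> bool:
    # Identity and issuer pin the signature to a specific CI pipeline.
    result = subprocess.run(
        ["cosign", "verify",
         "--certificate-identity-regexp", r"^https://github\.com/my-org/.*",
         "--certificate-oidc-issuer", "https://token.actions.githubusercontent.com",
         image],
        capture_output=True,
    )
    return result.returncode == 0
```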
Vulnerability Scanning Pipeline
Scan at multiple points. Not just one.
- Build time: Scan the Dockerfile and base image during CI (Trivy, Grype, Snyk Container)
- Registry time: Enable automatic scanning on push (ECR, Harbor native scanning)
- Runtime: Continuously rescan deployed images as new CVEs get published
Define a policy gate: block deployment if any critical or high-severity CVE exists without a fix available. Track vulnerability debt with SLOs, for example, all high-severity CVEs patched within 7 days.
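One way that gate might look as a CI step, shelling out to Trivy. The image name is a placeholder, and --ignore-unfixed is what encodes the "without a fix available" clause.

```python
# Sketch of a build-time policy gate: fail the pipeline when Trivy finds
# a fixable CRITICAL/HIGH CVE in the image.
import subprocess
import sys

def scan_gate(image: str) -> None:
    result = subprocess.run([
        "trivy", "image",
        "--severity", "CRITICAL,HIGH",
        "--ignore-unfixed",    # no fix available -> tracked as debt, not a block
        "--exit-code", "1",    # nonzero exit when findings remain
        image,
    ])
    if result.returncode != 0:
        sys.exit(f"blocking deploy: fixable HIGH/CRITICAL CVEs in {image}")

scan_gate("registry.example.com/payments/api:1.4.2")
```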
The hard part is not setting up scanning. It is keeping the noise manageable so the team does not start ignoring the alerts.
Retention Policies and Cost Control
Configure lifecycle policies to automatically expire untagged images after N days and retain only the last M tagged versions per repository. ECR lifecycle policies and Harbor tag retention rules handle this automatically. Without retention policies, registry storage will quietly balloon to terabytes of orphaned layers. I have seen storage bills triple in six months because nobody set this up.
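A sketch of those rules expressed as an ECR lifecycle policy applied via boto3; the repository name, the 14-day window, and the 30-version cap are placeholder values. Harbor expresses the same idea as tag retention rules.

```python
# Sketch: apply an ECR lifecycle policy that expires untagged images
# after N days and caps tagged versions at M per repository.
import json
import boto3

policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "expire untagged images after 14 days",
            "selection": {
                "tagStatus": "untagged",
                "countType": "sinceImagePushed",
                "countUnit": "days",
                "countNumber": 14,
            },
            "action": {"type": "expire"},
        },
        {
            "rulePriority": 2,
            "description": "keep only the last 30 tagged versions",
            "selection": {
                "tagStatus": "tagged",
                "tagPrefixList": ["v"],
                "countType": "imageCountMoreThan",
                "countNumber": 30,
            },
            "action": {"type": "expire"},
        },
    ]
}

boto3.client("ecr").put_lifecycle_policy(
    repositoryName="payments/api",
    lifecyclePolicyText=json.dumps(policy),
)
```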
Multi-Architecture Builds
Modern platforms need images for both amd64 and arm64. Docker Buildx with --platform linux/amd64,linux/arm64 produces a manifest list (fat manifest) that lets the container runtime pull the correct architecture automatically. For mixed fleets (x86 servers alongside Graviton or other ARM instances), this is not optional.
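A minimal wrapper around that Buildx invocation (tag and build context are placeholders):

```python
# Sketch: build and publish a multi-arch image via Docker Buildx.
import subprocess

def build_multiarch(tag: str, context: str = ".") -> None:
    subprocess.run([
        "docker", "buildx", "build",
        "--platform", "linux/amd64,linux/arm64",
        "--tag", tag,
        # --push sends the manifest list straight to the registry; the
        # classic local image store cannot hold multi-arch manifest lists.
        "--push",
        context,
    ], check=True)

build_multiarch("registry.example.com/payments/api:1.4.2")
```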
Production Considerations
Mirror critical public base images (node, python, alpine) into the private registry. This protects the pipeline from Docker Hub rate limits and outages. I have watched entire deploy freezes happen because Docker Hub went down during a release window.
For multi-region deployments, set up geo-replicated registries. Harbor supports replication rules between instances, and ECR supports cross-region replication.
Lock down RBAC so only CI service accounts can push images. Developers get read-only access. Monitor pull latency and error rates as part of the platform SLOs, because a slow or broken registry directly blocks deployments and autoscaling events.
Failure Scenarios
Scenario 1: Registry Outage During Autoscaling Event
A traffic spike triggers Kubernetes Horizontal Pod Autoscaler to scale a service from 10 to 50 pods. The container registry is throwing 503s. New pods fail with ImagePullBackOff and never become ready. The existing 10 pods drown under traffic they cannot shed.
Cascading impact: The load balancer's health checks start failing on overloaded pods. Circuit breakers trip in upstream services. User-facing errors spike. HPA keeps requesting more pods (which also fail to pull), creating a feedback loop of failed scheduling.
Detection: Monitor kubelet_image_pull_duration_seconds and container_image_pull_errors_total per node. Alert when image pull failure rate exceeds 5% within a 2-minute window.
Recovery: Set imagePullPolicy: IfNotPresent (not Always) for production workloads so existing nodes use cached layers. Pre-pull critical images to all nodes via DaemonSet. Deploy registry mirrors (Harbor proxy cache, Dragonfly P2P distribution) so nodes pull from local cache first. This is one of those things that feels like over-engineering until the first time it prevents an outage.
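A sketch of that pre-pull DaemonSet using the official Kubernetes Python client. The image list and namespace are placeholders, and it assumes each image ships a /bin/sh for the no-op init command.

```python
# Sketch: DaemonSet that pre-pulls critical images onto every node.
# Pulling is the entire job, so each init container runs a no-op.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside the cluster

CRITICAL_IMAGES = [
    "registry.example.com/payments/api:1.4.2",      # placeholder images
    "registry.example.com/platform/gateway:2.0.1",
]

def prepull_daemonset(images: list[str]) -> client.V1DaemonSet:
    labels = {"app": "image-prepull"}
    init = [
        client.V1Container(
            name=f"pull-{i}",
            image=img,
            command=["/bin/sh", "-c", "true"],  # assumes a shell in the image
            image_pull_policy="IfNotPresent",
        )
        for i, img in enumerate(images)
    ]
    # The long-running container just parks so the pod stays scheduled.
    pause = client.V1Container(name="pause", image="registry.k8s.io/pause:3.9")
    return client.V1DaemonSet(
        metadata=client.V1ObjectMeta(name="image-prepull"),
        spec=client.V1DaemonSetSpec(
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=client.V1PodSpec(init_containers=init, containers=[pause]),
            ),
        ),
    )

client.AppsV1Api().create_namespaced_daemon_set(
    namespace="kube-system", body=prepull_daemonset(CRITICAL_IMAGES)
)
```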
Scenario 2: Supply Chain Attack via Compromised Base Image
A widely-used public base image (say, node:18-alpine) gets compromised upstream. The CI pipeline pulls it and builds on top of the poisoned image. Vulnerability scanners do not flag it because the payload is a zero-day with no CVE yet. The image passes all gates and deploys to production.
Cascading impact: A cryptominer or data exfiltration payload runs inside production containers. If that base image is shared across 50+ services, the blast radius is the entire platform.
Detection: Monitor unexpected outbound network connections from containers using Falco or Cilium network policies. Watch for CPU utilization anomalies per pod. Cryptominers cause sustained >80% CPU on pods that should be idle. Generate SBOMs with Syft and compare dependency hashes between builds.
Recovery: Pin base images to specific SHA digests. Never use floating tags. Maintain a curated base image catalog rebuilt weekly from source with provenance attestation (SLSA Level 3). Enforce Cosign signature verification in admission controllers so only internally-built base images can be deployed.
This scenario is not theoretical. It has happened multiple times in the wild. Without pinned digests and verified signatures, the organization is rolling the dice.
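A small sketch of the pin-and-attest habit from the recovery steps, assuming the crane and syft CLIs are on the PATH; node:18-alpine is just the example from above.

```python
# Sketch: resolve a floating tag to its digest once, record the pinned
# reference, and generate an SBOM for diffing between builds.
import subprocess

def resolve_digest(image: str) -> str:
    out = subprocess.run(["crane", "digest", image],
                         capture_output=True, text=True, check=True)
    return f"{image}@{out.stdout.strip()}"

def sbom(image_ref: str, path: str) -> None:
    # Syft emits a JSON SBOM; diff package hashes across builds to spot drift.
    with open(path, "w") as f:
        subprocess.run(["syft", image_ref, "-o", "json"], stdout=f, check=True)

pinned = resolve_digest("node:18-alpine")  # e.g. node:18-alpine@sha256:...
sbom(pinned, "sbom-node18.json")
print("pin this in the Dockerfile FROM line:", pinned)
```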
Scenario 3: Layer Storage Backend Corruption
The registry's blob storage backend (S3, GCS, Azure Blob) experiences silent data corruption or accidental deletion of a layer blob. Image pulls fail with checksum mismatch errors for a specific layer shared across hundreds of images.
Cascading impact: Any deployment or scaling event for images referencing the corrupted layer fails. Rollback might also fail if the previous version shares the same base layers.
Detection: Monitor registry_storage_errors_total and image pull success rates. Run periodic garbage collection with integrity verification (registry garbage-collect --dry-run).
Recovery: Enable S3 versioning on the storage backend for recovering deleted blobs. Maintain cross-region registry replication as a warm standby. Implement a registry health check that periodically pulls a canary image from each repository.
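A minimal version of that canary check, assuming a dedicated lightweight probe image; it deletes the local copy first so the timing measures a real pull rather than a cache hit.

```python
# Sketch: periodic registry canary that times a real image pull and
# flags failures or latency regressions.
import subprocess
import time

def canary_pull(image: str, timeout_s: int = 30) -> float | None:
    """Returns pull latency in seconds, or None on failure."""
    subprocess.run(["docker", "image", "rm", "-f", image],
                   capture_output=True)  # force a real pull, not a cache hit
    start = time.monotonic()
    try:
        subprocess.run(["docker", "pull", image], check=True,
                       capture_output=True, timeout=timeout_s)
    except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
        return None
    return time.monotonic() - start

latency = canary_pull("registry.example.com/platform/canary:probe")
if latency is None or latency > 5.0:  # mirrors the <5s target in the table below
    print("ALERT: registry canary failed or breached latency target")
```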
Capacity Planning
| Metric | Target | Warning Threshold | Critical Threshold |
|---|---|---|---|
| Image pull latency (same region) | < 5 sec | > 15 sec | > 30 sec |
| Registry storage growth | < 10% month-over-month | > 20% MoM | > 50% MoM |
| Image scan time | < 60 sec per image | > 3 min | > 10 min |
| Layer deduplication ratio | > 70% | < 50% | < 30% |
| Concurrent pull capacity | 500+ pulls/min | < 200/min | < 50/min |
Scale references: Docker Hub serves 13B+ image pulls per month. GitHub Container Registry handles 1B+ pulls per month across public and private repositories. Shopify's internal registry serves 50K+ image pulls per day with p99 latency under 8 seconds, using Harbor with S3 backend and CloudFront CDN for layer distribution.
Storage formula: monthly_storage = images_per_day x avg_unique_layers x avg_layer_size x 30 x (1 - dedup_ratio). For 200 images/day, 5 unique layers at 50MB average, 70% deduplication: 200 x 5 x 50MB x 30 x 0.3 = 450 GB/month. At S3 Standard pricing ($0.023 per GB-month), that is roughly $10.35/month for the new storage alone, plus about $0.09/GB for data transfer out.
Retention policy math: Keeping 30 tagged versions per repository across 100 repositories with 500MB average image size = 100 x 30 x 500MB = 1.5 TB before deduplication. With 70% dedup, actual storage is around 450 GB.
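The same arithmetic as executable Python, using the example inputs from the text (GB = 1000 MB here):

```python
# Back-of-envelope registry storage math from the two formulas above.
images_per_day, unique_layers, layer_mb = 200, 5, 50
dedup_ratio = 0.70

monthly_gb = images_per_day * unique_layers * layer_mb * 30 * (1 - dedup_ratio) / 1000
print(f"monthly growth: {monthly_gb:,.0f} GB")                    # 450 GB

s3_per_gb_month = 0.023
print(f"storage cost:  ${monthly_gb * s3_per_gb_month:,.2f}/month")  # ~$10.35

repos, versions, image_mb = 100, 30, 500
retained_gb = repos * versions * image_mb * (1 - dedup_ratio) / 1000
print(f"retained set:  {retained_gb:,.0f} GB after dedup")        # 450 GB
```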
Architecture Decision Record
ADR: Selecting an Artifact Storage and Container Registry Strategy
Context: The container registry sits in the critical path of every deployment and autoscaling event. Choosing between managed, self-hosted, and hybrid registry architectures has long-term consequences for cost, security, and operational overhead. Get this wrong and the consequences will linger for years.
| Criteria (Weight) | AWS ECR | Harbor (Self-Hosted) | Docker Hub | GitHub Packages |
|---|---|---|---|---|
| Security scanning (20%) | 8 | 9 | 5 | 6 |
| Multi-cloud portability (15%) | 4 | 10 | 8 | 6 |
| Operational overhead (15%) | 9 | 4 | 10 | 9 |
| Cost at scale (15%) | 6 | 8 | 5 | 7 |
| Geo-replication (10%) | 7 | 8 | 3 | 4 |
| Image signing & SBOM (15%) | 7 | 9 | 4 | 6 |
| RBAC granularity (10%) | 7 | 10 | 4 | 6 |
| Weighted Score | 6.90 | 8.25 | 5.75 | 6.40 |
Decision scenarios:
- AWS-native organization with EKS clusters: Go with ECR. Native IAM integration eliminates credential management for image pulls. Lifecycle policies handle retention. That trades some cloud lock-in for minimal ops overhead, and it is usually a good trade.
- Multi-cloud or hybrid cloud with strict compliance needs: Harbor, self-hosted. This provides full control over image storage location, scanning policies, and replication topology. It integrates with Trivy, Notary, and Cosign. Deploy on Kubernetes with Helm for HA. Be honest about the cost: budget 2-4 engineer-days per month for operational upkeep.
- Open source project or public images: Docker Hub or GitHub Packages. The free tier works fine for public distribution. Do not use either as the sole registry for production private images though. Rate limits bite hard (100 pulls per 6 hours for anonymous, 200 for authenticated).
- Enterprise with 100+ microservices and supply chain security requirements: Harbor plus a cloud-managed proxy. Run Harbor as the source of truth with Cosign signing enforcement, then replicate to ECR or GCR per region as pull-through caches. Admission controllers validate signatures against Harbor's trust store.
Key Points
- Central repository for build artifacts (container images, packages, binaries) with versioning and access control
- Container registries store OCI images with layer deduplication. Only changed layers get pushed or pulled.
- Image scanning detects CVEs in base images and dependencies before deployment
- Immutable references (pin SHA digests rather than mutable tags like :latest) ensure reproducible deployments
- Proximity matters. A registry in the same region as the cluster cuts pull times dramatically.
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| AWS ECR | Managed | EKS integration, lifecycle policies, scanning | Small-Enterprise |
| Harbor | Open Source | Enterprise registry, replication, RBAC, scanning | Medium-Enterprise |
| Docker Hub | Managed | Public images, community ecosystem | Small-Medium |
| GitHub Packages | Managed | GitHub-native, multi-format (npm, Docker, Maven) | Small-Large |
Common Mistakes
- Using :latest in production. There is no way to deterministically reproduce a deployment.
- Not scanning images before deployment. Known CVEs walk straight into production.
- Storing secrets in image layers. They persist in layer history even if a later layer deletes them.
- No image retention policy. Registry storage grows unbounded and costs creep up.
- Pulling from public registries in CI. Rate limits and outages will break the pipeline.