Secrets Management
Why It Exists
Anyone who has been building distributed systems long enough has seen it happen. Someone hardcodes a database password in a config file, it ends up in a Docker image, someone pushes that image to a registry with too-broad read access, and now an intern three teams over can see production credentials. Or worse, a contractor commits a connection string to a public repo and the team finds out from Twitter.
Every service depends on secrets: database credentials, API keys, TLS certificates, encryption keys, service account tokens. The problem isn't that secrets exist. The problem is that they end up everywhere. Sprinkled across environment variables, baked into container images, copy-pasted into Slack DMs. Once they scatter, auditing who has access becomes impossible, rotating them requires a fire drill, and detecting leaks is a guessing game.
Centralized secrets management provides one place to store credentials with encryption, access control, audit logging, and automatic rotation. It turns secrets from a ticking time bomb into something that can actually be managed.
How It Works
Vault Architecture
HashiCorp Vault is the most widely adopted option, so it's worth understanding how it works internally.
- Storage backend. Vault encrypts everything before writing it to Consul, Raft (integrated storage), or cloud storage. Plaintext never hits disk.
- Seal/Unseal. Vault boots in a sealed state where it physically cannot decrypt its own storage. Unsealing requires a threshold of key shares (Shamir's Secret Sharing) or auto-unseal through cloud KMS. This means stealing the storage volume alone gets an attacker nothing.
- Auth methods. Services prove their identity through Kubernetes ServiceAccount tokens, AWS IAM roles, AppRole (for machine-to-machine), or OIDC. Each auth method maps to a Vault policy that defines exactly which secret paths the caller can read.
- Secret engines. KV (static key-value), Database (dynamic credentials), PKI (certificate issuance), Transit (encryption-as-a-service), and AWS (dynamic IAM credentials). The Database and PKI engines are the biggest wins compared to simpler alternatives.
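The threshold unseal described above relies on Shamir's Secret Sharing: the master key is split into n shares such that any k of them reconstruct it, while fewer than k reveal nothing. A toy sketch of the math (illustrative only, not Vault's actual implementation, which operates on the master key bytes):

```python
import random

# Toy Shamir's Secret Sharing over a prime field. Illustrative only.
PRIME = 2**127 - 1

def split_secret(secret, n_shares, threshold):
    """Split `secret` into n_shares points; any `threshold` of them recover it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(threshold - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):          # Horner evaluation, so f(0) == secret
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, f(x)) for x in range(1, n_shares + 1)]

def combine_shares(shares):
    """Lagrange-interpolate the polynomial at x=0 to recover the secret."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

With a 3-of-5 split, any three share holders can unseal; two learn nothing, which is exactly why stealing the storage volume alone gets an attacker nothing.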
Dynamic Secrets Workflow
This is the feature that makes Vault worth the operational overhead. Instead of storing a static database password that lives forever, Vault generates a unique credential per request with a short TTL:
- The application authenticates to Vault via Kubernetes auth
- It requests database credentials from the `database/creds/readonly` path
- Vault connects to the database, creates a temporary user with the `readonly` role, and hands back a credential with a 1-hour lease
- The application uses that credential for database queries
- When the lease expires, Vault revokes the credential and drops the database user
Think about what this means. If a credential leaks, the blast radius is one service for one hour. Not every service, not forever. That's a fundamentally different security posture.
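The lease lifecycle above can be modeled in a few lines. This in-memory sketch (illustrative only — real Vault talks to an actual database and uses real clocks) shows the key property: one database user per lease, dropped on expiry:

```python
import secrets

class DynamicSecretsEngine:
    """Toy model of Vault's database secrets engine: one DB user per lease."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.db_users = {}   # username -> password (simulated database)
        self.leases = {}     # lease_id -> (username, expiry timestamp)

    def issue(self, role, now):
        """Create a temporary DB user and hand back a leased credential."""
        username = f"v-{role}-{secrets.token_hex(4)}"
        password = secrets.token_urlsafe(16)
        self.db_users[username] = password
        lease_id = secrets.token_hex(8)
        self.leases[lease_id] = (username, now + self.ttl)
        return lease_id, username, password

    def expire(self, now):
        """Revoke every lease past its TTL and drop its database user."""
        for lease_id, (user, expires) in list(self.leases.items()):
            if now >= expires:
                del self.leases[lease_id]
                self.db_users.pop(user, None)
```

A credential issued at t=0 with a 1-hour TTL authenticates at t=1800 but is gone by t=3600: the leak window is bounded by construction, not by process.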
Kubernetes Integration Patterns
| Pattern | Mechanism | Pros | Cons |
|---|---|---|---|
| Secrets Store CSI Driver | Mount secrets as files in pods | No app changes, auto-rotation | Requires CSI driver + provider |
| External Secrets Operator | Sync to K8s Secrets | Works with existing Secret consumers | K8s Secrets are base64, not encrypted |
| Vault Agent Sidecar | Sidecar injects secrets into shared volume | Dynamic secrets, lease renewal | Extra container per pod |
| Direct SDK | Application calls Vault API | Full control, dynamic secrets | Application code changes required |
My honest take: start with External Secrets Operator for getting secrets from a cloud provider into Kubernetes. It's the least disruptive path. Move to the CSI driver or Vault Agent when dynamic secrets or lease renewal are actually needed. The direct SDK approach gives the most control but couples application code to Vault, a tradeoff to make deliberately.
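As a concrete sketch of the External Secrets Operator path, a manifest like the following syncs a value from AWS Secrets Manager into a native Kubernetes Secret. All names, paths, and the store reference here are illustrative assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials        # illustrative name
spec:
  refreshInterval: 1h             # how often ESO re-syncs from the provider
  secretStoreRef:
    name: aws-secrets-manager     # assumes a SecretStore with this name exists
    kind: SecretStore
  target:
    name: app-db-credentials      # the K8s Secret that gets created
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/app/db          # path in AWS Secrets Manager
        property: password
```

Keep the table's caveat in mind: the resulting Kubernetes Secret is still only base64-encoded, so enable etcd encryption at rest alongside this pattern.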
Production Considerations
- High availability. Run Vault in HA mode with Raft consensus (3 or 5 nodes). If Vault goes down, no application can fetch secrets. It becomes the most critical dependency in the infrastructure, so treat it that way.
- Break-glass procedures. Document and regularly test the unseal process. If the auto-unseal KMS becomes unavailable, recovery falls back to operators holding Shamir key shares, which should be physically distributed across trusted individuals. I've seen teams that couldn't unseal Vault during an incident because the key holders were on vacation. Don't be that team.
- Secret sprawl detection. Run `trufflehog` or `gitleaks` in CI pipelines on every commit. This is not optional. It's a matter of when, not if, someone accidentally commits a credential.
- Rotation strategy. Build rotation through Lambda functions (AWS) or CronJobs (K8s) that trigger on a schedule. Always test rotation in staging first. Many applications choke on mid-connection credential changes because their connection pools don't refresh cleanly.
- Disaster recovery. Take Vault snapshots regularly and store them encrypted in a separate cloud account. Test restoration quarterly. I know "test the backups" sounds obvious, but I've personally watched a team discover their Vault snapshots were corrupted during an actual DR event.
Failure Scenarios
Scenario 1: Vault Cluster Quorum Loss Causes Total Secret Access Failure. A 3-node Vault cluster on Raft storage loses two nodes at once (AZ failure). The surviving node can't form a quorum and drops into standby mode, rejecting all reads and writes. Every application that needs to fetch or renew secrets breaks. Database connections expire because dynamic secret leases can't renew. Within 15 minutes, all services on dynamic database credentials start throwing auth errors. Revenue-impacting services go down within 30 minutes.

Detection: Monitor `vault_core_active` (should be 1), `vault_raft_peers` (should equal cluster size), and `vault_runtime_alloc_bytes`. Alert when active node count drops to 0 or peer count falls below quorum.

Recovery: Restore the second node from the most recent Raft snapshot. If AZ recovery is delayed, do an emergency unseal of a new node in a healthy AZ using the backup Shamir keys. Long-term: deploy 5-node Vault clusters across 3 AZs to survive a dual-node failure.
Scenario 2: Secret Rotation Breaks Active Database Connections. An automated rotation CronJob rotates the PostgreSQL `app_user` password in both Vault and the database at the same time. Meanwhile, the application holds a connection pool of 200 persistent connections using the old password. When the pool needs to open new connections during a scale-up, the old password gets rejected. Existing connections keep working until the pool recycles, creating a partial outage where some requests succeed and some fail. This is maddening to debug.

Detection: Monitor PostgreSQL logs for authentication failures. Correlate `vault_secret_lease_creation_count` with the application error rate.

Recovery: Use dual-credential rotation. Create new credentials, update Vault, wait for applications to pick up the new credentials (via lease expiry or CSI volume sync), then revoke the old ones. Never rotate in-place. Vault's dynamic database credentials give you this property for free: each lease is a separate database user, so old and new credentials coexist during the transition window.
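The dual-credential pattern can be sketched as a tiny state machine. This toy model (in-memory, illustrative names) shows why the overlap window matters: existing connections keep authenticating until the old credential is revoked, which happens only after every consumer has switched:

```python
class Database:
    """Simulated auth backend that can hold multiple valid credentials."""
    def __init__(self):
        self.valid = {"app_user_v1": "old-password"}

    def authenticate(self, user, password):
        return self.valid.get(user) == password

def dual_credential_rotation(db, app_creds):
    # Step 1: create the NEW credential alongside the old one.
    db.valid["app_user_v2"] = "new-password"
    # Old pool connections still authenticate during the overlap window.
    assert db.authenticate(*app_creds)
    # Step 2: the application picks up the new credential
    # (via lease expiry or CSI volume sync).
    app_creds = ("app_user_v2", "new-password")
    assert db.authenticate(*app_creds)
    # Step 3: only now revoke the old credential.
    del db.valid["app_user_v1"]
    return app_creds
```

In-place rotation is the same sequence with step 3 performed first, which is exactly what makes the overlap-window check fail and the pool throw auth errors.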
Scenario 3: Secret Sprawl, Leaked API Key in CI Logs. A developer adds `echo $DATABASE_URL` to a CI debug step. The CI system logs the output, exposing the database connection string (including the password) in build logs visible to 200 engineers. The secret sits in CI log storage for 90 days. A `trufflehog` scan catches it 4 days later.

Detection: Run `trufflehog` and `gitleaks` in CI pipelines on every commit. Scan CI log output with pattern matching for known secret formats (AWS keys start with `AKIA`, Vault tokens start with `hvs.`).

Recovery: Rotate the exposed credential in Vault immediately. Purge the CI logs. Add pipeline rules that mask environment variables in log output. Better yet, move to Vault-injected secrets via sidecar instead of environment variables to eliminate this entire class of leak.
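The pattern matching mentioned above can be as simple as a few regexes. A minimal sketch (the patterns are illustrative, not exhaustive; a real deployment should use `trufflehog` or `gitleaks`, which also do entropy analysis and credential verification):

```python
import re

# Known secret-format prefixes; illustrative, not exhaustive.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "vault_token":    re.compile(r"\bhvs\.[A-Za-z0-9_-]{20,}\b"),
    "postgres_url":   re.compile(r"postgres(?:ql)?://[^\s:]+:[^\s@]+@[^\s/]+"),
}

def scan_line(line):
    """Return the names of any secret patterns found in a log or diff line."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(line)]
```

Running something like this over CI log output catches the `echo $DATABASE_URL` class of leak the same day instead of 4 days later.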
Capacity Planning
Vault sizing formula: `memory_gb = 2 + (num_secrets / 100,000) * 1 + (requests_per_sec / 1,000) * 0.5`, i.e. a 2 GB base plus 1 GB per 100K secrets plus 0.5 GB per 1,000 requests/sec. Vault's Raft storage works well up to roughly 1M secrets. Beyond that, look into sharding by namespace.
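The sizing formula as a function. The constants come straight from the rule of thumb above; treat the result as a starting point for load testing, not a guarantee:

```python
def vault_memory_gb(num_secrets: int, requests_per_sec: float) -> float:
    """Rule-of-thumb Vault memory sizing: 2 GB base,
    +1 GB per 100K secrets, +0.5 GB per 1,000 req/s."""
    base_gb = 2.0
    secrets_gb = (num_secrets / 100_000) * 1.0
    throughput_gb = (requests_per_sec / 1_000) * 0.5
    return base_gb + secrets_gb + throughput_gb
```

For the mid-scale tier in the table (10K secrets, 200 req/s) this gives about 2.2 GB per node; the large-scale tier (100K secrets, 2,000 req/s) lands at 4 GB.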
| Scale Tier | Secrets | Requests/sec | Dynamic Leases | Vault Nodes | Reference |
|---|---|---|---|---|---|
| Startup | 500 | 10 | 50 | 3 (single cluster) | Series A company |
| Mid-scale | 10K | 200 | 2,000 | 3-5 (single cluster) | 100-service platform |
| Large-scale | 100K | 2,000 | 50,000 | 5 (primary) + 3 (DR) | Stripe-scale |
| Hyper-scale | 1M+ | 10,000+ | 500,000+ | Multi-cluster, namespace sharding | Netflix, Uber |
Key thresholds:
- Keep Vault Raft write latency under 10ms. Anything above 50ms indicates a storage backend problem.
- Don't exceed 500,000 active leases per cluster, because lease expiry processing will start eating into performance.
- Token creation rate should stay under 5,000/sec sustained on a single cluster.
- Auto-unseal via AWS KMS adds about 50ms to the unseal process but removes the headache of managing Shamir keys.
- The Secrets Store CSI Driver syncs every 2 minutes by default. For faster rotation response, drop `rotationPollInterval` to 30s, but that trades more Vault API calls for speed.
- Budget roughly 1 Vault admin per 500 engineers for day-to-day operations and policy management.
Architecture Decision Record
Decision: Choosing a Secrets Management Solution
| Criteria (Weight) | HashiCorp Vault | AWS Secrets Manager | GCP Secret Manager | CyberArk Conjur |
|---|---|---|---|---|
| Dynamic secrets (25%) | 5, Database/AWS/PKI/SSH | 2, RDS rotation only | 2, Basic rotation | 3, Limited dynamic |
| Multi-cloud support (20%) | 5, Any infrastructure | 1, AWS only | 1, GCP only | 4, Multi-cloud agent |
| Operational complexity (20%) | 2, HA cluster/unseal/upgrades | 5, Fully managed | 5, Fully managed | 2, Complex on-prem |
| Kubernetes integration (15%) | 5, CSI/sidecar/ESO | 3, ESO only | 3, ESO only | 3, Follower injection |
| Compliance / audit (10%) | 5, Full audit log/namespaces | 4, CloudTrail integration | 4, Cloud Audit Logs | 5, Enterprise-grade audit |
| Cost (10%) | 3, Free OSS + infra cost (Enterprise is paid) | 3, $0.40/secret/month | 3, $0.06/secret/month + access fees | 1, Expensive enterprise license |
When to choose what:
- Team under 20, single cloud (AWS): Go with AWS Secrets Manager. Zero ops, native RDS rotation, CloudTrail audit. It covers what most startups actually need.
- Team of 20-100, multi-cloud or K8s-heavy: HashiCorp Vault. Dynamic secrets, PKI, and transit encryption justify the operational cost. Just don't underestimate that cost.
- Enterprise with PAM requirements: CyberArk Conjur. It plugs into existing PAM infrastructure with privileged session recording and enterprise compliance certifications.
- GCP-native shop: GCP Secret Manager + Workload Identity. Least friction, IAM-based access control, automatic replication.
- Hybrid (managed + self-hosted): Use the cloud provider's secrets manager for cloud resources (RDS passwords, API keys) and Vault for dynamic secrets, PKI, and cross-cloud use cases. External Secrets Operator bridges the two in Kubernetes. This is what most mid-to-large organizations end up doing in practice, and honestly it's a reasonable outcome even if it feels architecturally messy.
Key Points
- Centralized secret storage with encryption at rest and fine-grained access control
- Secrets should never live in code, environment variables, or container images
- Dynamic secrets (generated on-demand with a TTL) beat static credentials for security
- Automatic rotation shrinks the blast radius when credentials get compromised
- Audit logging tracks who accessed which secret and when, which is non-negotiable for compliance
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| HashiCorp Vault | Open Source | Dynamic secrets, PKI, transit encryption | Medium-Enterprise |
| AWS Secrets Manager | Managed | AWS-native, automatic RDS rotation | Small-Enterprise |
| External Secrets Operator | Open Source | Sync cloud secrets into K8s Secrets | Medium-Enterprise |
| CyberArk Conjur | Commercial | Enterprise PAM, compliance-focused | Enterprise |
Common Mistakes
- Storing secrets in environment variables. They leak into logs, crash dumps, and child processes.
- Committing secrets to git. Even after deletion, they persist in git history forever.
- Not rotating secrets after an employee leaves or a breach is suspected
- Using the same credentials across environments (dev/staging/prod)
- Not encrypting secrets at rest in the secret store. Defense in depth applies here too.