Secrets Management
Why It Exists
Anyone who has been building distributed systems long enough has seen it happen. Someone hardcodes a database password in a config file, it ends up in a Docker image, someone pushes that image to a registry with too-broad read access, and now an intern three teams over can see production credentials. Or worse, a contractor commits a connection string to a public repo and the team finds out from Twitter.
Every service depends on secrets: database credentials, API keys, TLS certificates, encryption keys, service account tokens. The problem isn't that secrets exist. The problem is that they end up everywhere. Sprinkled across environment variables, baked into container images, copy-pasted into Slack DMs. Once they scatter, auditing who has access becomes impossible, rotating them requires a fire drill, and detecting leaks is a guessing game.
Centralized secrets management provides one place to store credentials with encryption, access control, audit logging, and automatic rotation. It turns secrets from a ticking time bomb into something that can actually be managed.
How It Works
Vault Architecture
HashiCorp Vault is the most widely adopted option, so it's worth understanding how it works internally.
- Storage backend. Vault encrypts everything before writing it to Consul, Raft (integrated storage), or cloud storage. Plaintext never hits disk.
- Seal/Unseal. Vault boots in a sealed state where it physically cannot decrypt its own storage. Unsealing requires a threshold of key shares (Shamir's Secret Sharing) or auto-unseal through cloud KMS. This means stealing the storage volume alone gets an attacker nothing.
- Auth methods. Services prove their identity through Kubernetes ServiceAccount tokens, AWS IAM roles, AppRole (for machine-to-machine), or OIDC. Each auth method maps to a Vault policy that defines exactly which secret paths the caller can read.
- Secret engines. KV (static key-value), Database (dynamic credentials), PKI (certificate issuance), Transit (encryption-as-a-service), and AWS (dynamic IAM credentials). The Database and PKI engines are the biggest wins compared to simpler alternatives.
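The threshold unseal described above relies on Shamir's Secret Sharing: the master key is split into n shares such that any k of them reconstruct it, while fewer than k reveal nothing. A toy sketch of the math (illustrative only, not Vault's actual implementation, which operates on the master key bytes):

```python
import random

# Toy Shamir's Secret Sharing over a prime field. Illustrative only.
PRIME = 2**127 - 1

def split_secret(secret, n_shares, threshold):
    """Split `secret` into n_shares points; any `threshold` of them recover it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(threshold - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):          # Horner evaluation, so f(0) == secret
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, f(x)) for x in range(1, n_shares + 1)]

def combine_shares(shares):
    """Lagrange-interpolate the polynomial at x=0 to recover the secret."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

With a 3-of-5 split, any three share holders can unseal; two learn nothing, which is exactly why stealing the storage volume alone gets an attacker nothing.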
Dynamic Secrets Workflow
This is the feature that makes Vault worth the operational overhead. Instead of storing a static database password that lives forever, Vault generates a unique credential per request with a short TTL:
- The application authenticates to Vault via Kubernetes auth
- It requests database credentials from the `database/creds/readonly` path
- Vault connects to the database, creates a temporary user with the `readonly` role, and hands back a credential with a 1-hour lease
- The application uses that credential for database queries
- When the lease expires, Vault revokes the credential and drops the database user
Think about what this means. If a credential leaks, the blast radius is one service for one hour. Not every service, not forever. That's a fundamentally different security posture.
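The lease lifecycle above can be modeled in a few lines. This in-memory sketch (illustrative only — real Vault talks to an actual database and uses real clocks) shows the key property: one database user per lease, dropped on expiry:

```python
import secrets

class DynamicSecretsEngine:
    """Toy model of Vault's database secrets engine: one DB user per lease."""
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.db_users = {}   # username -> password (simulated database)
        self.leases = {}     # lease_id -> (username, expiry timestamp)

    def issue(self, role, now):
        """Create a temporary DB user and hand back a leased credential."""
        username = f"v-{role}-{secrets.token_hex(4)}"
        password = secrets.token_urlsafe(16)
        self.db_users[username] = password
        lease_id = secrets.token_hex(8)
        self.leases[lease_id] = (username, now + self.ttl)
        return lease_id, username, password

    def expire(self, now):
        """Revoke every lease past its TTL and drop its database user."""
        for lease_id, (user, expires) in list(self.leases.items()):
            if now >= expires:
                del self.leases[lease_id]
                self.db_users.pop(user, None)
```

A credential issued at t=0 with a 1-hour TTL authenticates at t=1800 but is gone by t=3600: the leak window is bounded by construction, not by process.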
Kubernetes Integration Patterns
| Pattern | Mechanism | Pros | Cons |
|---|---|---|---|
| Secrets Store CSI Driver | Mount secrets as files in pods | No app changes, auto-rotation | Requires CSI driver + provider |
| External Secrets Operator | Sync to K8s Secrets | Works with existing Secret consumers | K8s Secrets are base64, not encrypted |
| Vault Agent Sidecar | Sidecar injects secrets into shared volume | Dynamic secrets, lease renewal | Extra container per pod |
| Direct SDK | Application calls Vault API | Full control, dynamic secrets | Application code changes required |
My honest take: start with External Secrets Operator for getting secrets from a cloud provider into Kubernetes. It's the least disruptive path. Move to the CSI driver or Vault Agent when dynamic secrets or lease renewal are actually needed. The direct SDK approach gives the most control but couples application code to Vault, a tradeoff to make deliberately.
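As a concrete sketch of the External Secrets Operator path, a manifest like the following syncs a value from AWS Secrets Manager into a native Kubernetes Secret. All names, paths, and the store reference here are illustrative assumptions:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-db-credentials        # illustrative name
spec:
  refreshInterval: 1h             # how often ESO re-syncs from the provider
  secretStoreRef:
    name: aws-secrets-manager     # assumes a SecretStore with this name exists
    kind: SecretStore
  target:
    name: app-db-credentials      # the K8s Secret that gets created
  data:
    - secretKey: DB_PASSWORD
      remoteRef:
        key: prod/app/db          # path in AWS Secrets Manager
        property: password
```

Keep the table's caveat in mind: the resulting Kubernetes Secret is still only base64-encoded, so enable etcd encryption at rest alongside this pattern.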
Production Considerations
- High availability. Run Vault in HA mode with Raft consensus (3 or 5 nodes). If Vault goes down, no application can fetch secrets. It becomes the most critical dependency in the infrastructure, so treat it that way.
- Break-glass procedures. Document and regularly test the unseal process. If the auto-unseal KMS becomes unavailable, recovery falls back to operators holding Shamir key shares, which should be physically distributed across trusted individuals. I've seen teams that couldn't unseal Vault during an incident because the key holders were on vacation. Don't be that team.
- Secret sprawl detection. Run `trufflehog` or `gitleaks` in CI pipelines on every commit. This is not optional. It's a matter of when, not if, someone accidentally commits a credential.
- Rotation strategy. Build rotation through Lambda functions (AWS) or CronJobs (K8s) that trigger on a schedule. Always test rotation in staging first. Many applications choke on mid-connection credential changes because their connection pools don't refresh cleanly.
- Disaster recovery. Take Vault snapshots regularly and store them encrypted in a separate cloud account. Test restoration quarterly. I know "test the backups" sounds obvious, but I've personally watched a team discover their Vault snapshots were corrupted during an actual DR event.
Failure Scenarios
Scenario 1: Vault Cluster Quorum Loss Causes Total Secret Access Failure. A 3-node Vault cluster on Raft storage loses two nodes at once (AZ failure). The surviving node can't form a quorum and drops into standby mode, rejecting all reads and writes. Every application that needs to fetch or renew secrets breaks. Database connections expire because dynamic secret leases can't renew. Within 15 minutes, all services on dynamic database credentials start throwing auth errors. Revenue-impacting services go down within 30 minutes.

Detection: Monitor `vault_core_active` (should be 1), `vault_raft_peers` (should equal cluster size), and `vault_runtime_alloc_bytes`. Alert when active node count drops to 0 or peer count falls below quorum.

Recovery: Restore the second node from the most recent Raft snapshot. If AZ recovery is delayed, do an emergency unseal of a new node in a healthy AZ using the backup Shamir keys. Long-term: deploy 5-node Vault clusters across 3 AZs to survive a dual-node failure.
Scenario 2: Secret Rotation Breaks Active Database Connections. An automated rotation CronJob rotates the PostgreSQL `app_user` password in both Vault and the database at the same time. Meanwhile, the application holds a connection pool of 200 persistent connections using the old password. When the pool needs to open new connections during a scale-up, the old password gets rejected. Existing connections keep working until the pool recycles, creating a partial outage where some requests succeed and some fail. This is maddening to debug.

Detection: Monitor PostgreSQL logs for authentication failures. Correlate `vault_secret_lease_creation_count` with the application error rate.

Recovery: Use dual-credential rotation. Create new credentials, update Vault, wait for applications to pick up the new credentials (via lease expiry or CSI volume sync), then revoke the old ones. Never rotate in-place. Vault's dynamic database credentials give you this property for free: each lease is a separate database user, so old and new credentials coexist during the transition window.
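The dual-credential pattern can be sketched as a tiny state machine. This toy model (in-memory, illustrative names) shows why the overlap window matters: existing connections keep authenticating until the old credential is revoked, which happens only after every consumer has switched:

```python
class Database:
    """Simulated auth backend that can hold multiple valid credentials."""
    def __init__(self):
        self.valid = {"app_user_v1": "old-password"}

    def authenticate(self, user, password):
        return self.valid.get(user) == password

def dual_credential_rotation(db, app_creds):
    # Step 1: create the NEW credential alongside the old one.
    db.valid["app_user_v2"] = "new-password"
    # Old pool connections still authenticate during the overlap window.
    assert db.authenticate(*app_creds)
    # Step 2: the application picks up the new credential
    # (via lease expiry or CSI volume sync).
    app_creds = ("app_user_v2", "new-password")
    assert db.authenticate(*app_creds)
    # Step 3: only now revoke the old credential.
    del db.valid["app_user_v1"]
    return app_creds
```

In-place rotation is the same sequence with step 3 performed first, which is exactly what makes the overlap-window check fail and the pool throw auth errors.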
Scenario 3: Secret Sprawl, Leaked API Key in CI Logs. A developer adds `echo $DATABASE_URL` to a CI debug step. The CI system logs the output, exposing the database connection string (including the password) in build logs visible to 200 engineers. The secret sits in CI log storage for 90 days. A `trufflehog` scan catches it 4 days later.

Detection: Run `trufflehog` and `gitleaks` in CI pipelines on every commit. Scan CI log output with pattern matching for known secret formats (AWS keys start with `AKIA`, Vault tokens start with `hvs.`).

Recovery: Rotate the exposed credential in Vault immediately. Purge the CI logs. Add pipeline rules that mask environment variables in log output. Better yet, move to Vault-injected secrets via sidecar instead of environment variables to eliminate this entire class of leak.
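The pattern matching mentioned above can be as simple as a few regexes. A minimal sketch (the patterns are illustrative, not exhaustive; a real deployment should use `trufflehog` or `gitleaks`, which also do entropy analysis and credential verification):

```python
import re

# Known secret-format prefixes; illustrative, not exhaustive.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "vault_token":    re.compile(r"\bhvs\.[A-Za-z0-9_-]{20,}\b"),
    "postgres_url":   re.compile(r"postgres(?:ql)?://[^\s:]+:[^\s@]+@[^\s/]+"),
}

def scan_line(line):
    """Return the names of any secret patterns found in a log or diff line."""
    return [name for name, pat in SECRET_PATTERNS.items() if pat.search(line)]
```

Running something like this over CI log output catches the `echo $DATABASE_URL` class of leak the same day instead of 4 days later.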
Capacity Planning
Vault sizing formula: `memory_gb = 2 + (num_secrets / 100,000) * 1 + (requests_per_sec / 1,000) * 0.5`, i.e. a 2 GB base plus 1 GB per 100K secrets plus 0.5 GB per 1,000 requests/sec. Vault's Raft storage works well up to roughly 1M secrets. Beyond that, look into sharding by namespace.
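The sizing formula as a function. The constants come straight from the rule of thumb above; treat the result as a starting point for load testing, not a guarantee:

```python
def vault_memory_gb(num_secrets: int, requests_per_sec: float) -> float:
    """Rule-of-thumb Vault memory sizing: 2 GB base,
    +1 GB per 100K secrets, +0.5 GB per 1,000 req/s."""
    base_gb = 2.0
    secrets_gb = (num_secrets / 100_000) * 1.0
    throughput_gb = (requests_per_sec / 1_000) * 0.5
    return base_gb + secrets_gb + throughput_gb
```

For the mid-scale tier in the table (10K secrets, 200 req/s) this gives about 2.2 GB per node; the large-scale tier (100K secrets, 2,000 req/s) lands at 4 GB.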
| Scale Tier | Secrets | Requests/sec | Dynamic Leases | Vault Nodes | Reference |
|---|---|---|---|---|---|
| Startup | 500 | 10 | 50 | 3 (single cluster) | Series A company |
| Mid-scale | 10K | 200 | 2,000 | 3-5 (single cluster) | 100-service platform |
| Large-scale | 100K | 2,000 | 50,000 | 5 (primary) + 3 (DR) | Stripe-scale |
| Hyper-scale | 1M+ | 10,000+ | 500,000+ | Multi-cluster, namespace sharding | Netflix, Uber |
Key thresholds:
- Keep Vault Raft write latency under 10ms. Anything above 50ms indicates a storage backend problem.
- Don't exceed 500,000 active leases per cluster, because lease expiry processing will start eating into performance.
- Token creation rate should stay under 5,000/sec sustained on a single cluster.
- Auto-unseal via AWS KMS adds about 50ms to the unseal process but removes the headache of managing Shamir keys.
- The Secrets Store CSI Driver syncs every 2 minutes by default. For faster rotation response, drop `rotationPollInterval` to 30s, but that trades more Vault API calls for speed.
- Budget roughly 1 Vault admin per 500 engineers for day-to-day operations and policy management.
Architecture Decision Record
Decision: Choosing a Secrets Management Solution
| Criteria (Weight) | HashiCorp Vault | AWS Secrets Manager | GCP Secret Manager | CyberArk Conjur |
|---|---|---|---|---|
| Dynamic secrets (25%) | 5, Database/AWS/PKI/SSH | 2, RDS rotation only | 2, Basic rotation | 3, Limited dynamic |
| Multi-cloud support (20%) | 5, Any infrastructure | 1, AWS only | 1, GCP only | 4, Multi-cloud agent |
| Operational complexity (20%) | 2, HA cluster/unseal/upgrades | 5, Fully managed | 5, Fully managed | 2, Complex on-prem |
| Kubernetes integration (15%) | 5, CSI/sidecar/ESO | 3, ESO only | 3, ESO only | 3, Follower injection |
| Compliance / audit (10%) | 5, Full audit log/namespaces | 4, CloudTrail integration | 4, Cloud Audit Logs | 5, Enterprise-grade audit |
| Cost (10%) | 3, Free OSS + infra cost (Enterprise is paid) | 3, $0.40/secret/month | 3, $0.06/secret/month + access fees | 1, Expensive enterprise license |
When to choose what:
- Team under 20, single cloud (AWS): Go with AWS Secrets Manager. Zero ops, native RDS rotation, CloudTrail audit. It covers what most startups actually need.
- Team of 20-100, multi-cloud or K8s-heavy: HashiCorp Vault. Dynamic secrets, PKI, and transit encryption justify the operational cost. Just don't underestimate that cost.
- Enterprise with PAM requirements: CyberArk Conjur. It plugs into existing PAM infrastructure with privileged session recording and enterprise compliance certifications.
- GCP-native shop: GCP Secret Manager + Workload Identity. Least friction, IAM-based access control, automatic replication.
- Hybrid (managed + self-hosted): Use the cloud provider's secrets manager for cloud resources (RDS passwords, API keys) and Vault for dynamic secrets, PKI, and cross-cloud use cases. External Secrets Operator bridges the two in Kubernetes. This is what most mid-to-large organizations end up doing in practice, and honestly it's a reasonable outcome even if it feels architecturally messy.
Key Points
- Centralized secret storage with encryption at rest and fine-grained access control
- Secrets should never live in code, environment variables, or container images
- Dynamic secrets (generated on-demand with a TTL) beat static credentials for security
- Automatic rotation shrinks the blast radius when credentials get compromised
- Audit logging tracks who accessed which secret and when, which is non-negotiable for compliance
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| HashiCorp Vault | Open Source | Dynamic secrets, PKI, transit encryption | Medium-Enterprise |
| AWS Secrets Manager | Managed | AWS-native, automatic RDS rotation | Small-Enterprise |
| External Secrets Operator | Open Source | Sync cloud secrets into K8s Secrets | Medium-Enterprise |
| CyberArk Conjur | Commercial | Enterprise PAM, compliance-focused | Enterprise |
Common Mistakes
- Storing secrets in environment variables. They leak into logs, crash dumps, and child processes.
- Committing secrets to git. Even after deletion, they persist in git history forever.
- Not rotating secrets after an employee leaves or a breach is suspected
- Using the same credentials across environments (dev/staging/prod)
- Not encrypting secrets at rest in the secret store. Defense in depth applies here too.