Compliance & Audit Logging
Why It Exists
Sooner or later, every production system gets hit with the same question: "Who changed what, and when?" Maybe it is a security incident. Maybe a regulator is at the door. Maybe the team is running a postmortem and nobody can explain why a database table disappeared at 3am. The answer has to come from somewhere, and that somewhere is an immutable, tamper-proof audit trail.
Compliance frameworks are not suggestions to wave off. SOC 2 Type II auditors will ask for evidence of access controls, change management, and incident response. GDPR requires logging access to personal data. PCI-DSS demands audit trails for all access to cardholder data. Without structured audit logging, organizations fail audits, pay fines, and have no way to do forensic analysis when (not if) a breach happens.
How It Works
Audit Event Structure
Every audit event needs to capture six things:
- Actor (who): user ID, service account, IP
- Action (what): CRUD operation or API method
- Resource (the target): resource type, ID, namespace
- Timestamp (when): UTC with nanosecond precision
- Outcome: success or failure, plus any error code
- Context (where from): source IP, user agent, session ID
Standardize this schema across all services. Without a common format, cross-system correlation becomes a nightmare.
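A minimal sketch of such a shared schema, as a Python dataclass. All field names here are illustrative, not from any specific library; serializing via a single type like this is what keeps events comparable across services.

```python
# Hypothetical standardized audit event schema (field names illustrative).
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEvent:
    actor: str      # who: user ID, service account, or client IP
    action: str     # what: CRUD operation or API method, e.g. "secrets.read"
    resource: str   # target: resource type/ID/namespace
    outcome: str    # "success" or "failure" (append error code on failure)
    context: dict   # where from: source IP, user agent, session ID
    # UTC timestamp; datetime carries microseconds -- use time.time_ns()
    # where true nanosecond precision is required.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

event = AuditEvent(
    actor="svc-billing",
    action="secrets.read",
    resource="secret/prod/stripe-api-key",
    outcome="success",
    context={"source_ip": "10.2.1.17", "session_id": "abc123"},
)
record = asdict(event)  # serialize identically in every service
```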
Immutability Guarantees
Audit logs must be append-only and tamper-evident. There is no middle ground here:
- WORM storage (Write Once, Read Many). S3 Object Lock in Compliance mode prevents deletion even by root accounts for the configured retention period. This is the strongest guarantee AWS offers.
- Separate account. Store audit logs in a dedicated AWS account with locked-down IAM policies. The production account can write but never delete or modify. This separation is worth the operational overhead.
- Cryptographic chaining. Hash each log entry with the previous entry's hash (yes, like a blockchain, except actually useful) to detect tampering or deletion after the fact.
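The chaining idea in the last bullet can be sketched in a few lines: each entry's hash covers the previous entry's hash, so editing or deleting any record breaks verification for everything after it. This is a toy illustration of the technique, not a production implementation.

```python
# Minimal hash-chained audit log: tampering anywhere breaks the chain.
import hashlib
import json

def append_entry(chain: list, event: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64  # genesis sentinel
    payload = json.dumps(event, sort_keys=True)           # canonical form
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"event": event, "prev_hash": prev_hash, "hash": entry_hash})

def verify_chain(chain: list) -> bool:
    prev_hash = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log: list = []
append_entry(log, {"actor": "alice", "action": "secrets.read"})
append_entry(log, {"actor": "bob", "action": "rbac.update"})
assert verify_chain(log)
log[0]["event"]["actor"] = "mallory"  # tamper with the first record
assert not verify_chain(log)          # detected: every later hash now mismatches
```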
Compliance Framework Requirements
| Framework | Retention | Key Requirements |
|---|---|---|
| SOC 2 | 1 year | Access control logs, change management evidence, incident response records |
| GDPR | Duration of processing + reasonable period | Data access logs for personal data, right-to-erasure evidence, consent records |
| HIPAA | 6 years | PHI access logs, user authentication records, security incident logs |
| PCI-DSS | 1 year (3 months immediately accessible) | All access to cardholder data, admin actions, audit trail integrity verification |
Kubernetes Audit Policy
The K8s API server provides four audit levels: None, Metadata, Request, and RequestResponse. In production, log RequestResponse for sensitive resources (secrets, RBAC, pods/exec), Metadata for standard resources, and None for high-volume read-only endpoints like health checks and metrics. Logging everything at RequestResponse will cause storage costs to spiral and the SIEM to drown in noise. Be deliberate about what gets captured.
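The tiering described above maps onto an `audit.k8s.io/v1` Policy document, which in practice is YAML passed to the API server via `--audit-policy-file`. Shown here as a Python dict for consistency with the other examples; rules are evaluated in order and the first match wins, so the noisy `None` rule comes first.

```python
# Tiered Kubernetes audit policy (audit.k8s.io/v1), expressed as a dict.
# Rule order matters: the API server applies the first matching rule.
import json

audit_policy = {
    "apiVersion": "audit.k8s.io/v1",
    "kind": "Policy",
    "rules": [
        # None: drop high-volume read-only noise (health checks, metrics).
        {"level": "None",
         "nonResourceURLs": ["/healthz*", "/readyz*", "/metrics"]},
        # RequestResponse: full request and response bodies for sensitive
        # resources (secrets, RBAC, pod exec).
        {"level": "RequestResponse",
         "resources": [
             {"group": "", "resources": ["secrets", "pods/exec"]},
             {"group": "rbac.authorization.k8s.io", "resources": ["*"]},
         ]},
        # Metadata: who/what/when for everything else, without bodies.
        {"level": "Metadata"},
    ],
}
print(json.dumps(audit_policy, indent=2))  # serialize to YAML for the API server
```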
Production Considerations
- Separation of duties. Set up distinct IAM roles: application teams write audit events, security teams read and analyze, and no single role can both write and delete. Enforce this with AWS Organizations SCPs or GCP Organization Policies. If one person can both generate and erase logs, the audit trail means nothing.
- Real-time alerting. Feed audit events into Falco or a SIEM. Alert on anomalies: privilege escalation attempts, access from unusual IPs, bulk data exports, secret access outside normal patterns. Detection latency should be minutes, not days. I have seen teams that only check logs weekly. By then the damage is done.
- Compliance-as-code. Use OPA Gatekeeper to enforce compliance policies as Kubernetes admission webhooks: all pods must have resource limits, no privileged containers, images only from approved registries, required labels for data classification. The goal is to prevent violations, not just detect them.
- Retention lifecycle. Automate S3 lifecycle policies: hot storage (30 days, Standard), warm (1 year, Infrequent Access), cold (7 years, Glacier Deep Archive). Tag by compliance framework to apply framework-specific retention. Do not manually manage this. Someone will forget, and then an auditor will find the gap.
- Audit the auditors. Monitor the audit logging pipeline itself. If the collector goes down for 6 hours, that is a 6-hour gap in the forensic record. Alert on pipeline lag, dropped events, and storage write failures. The worst time to discover the audit pipeline is broken is during an actual incident.
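Gatekeeper policies are written in Rego, but the checks named in the compliance-as-code bullet boil down to logic like the following sketch of what a validating admission webhook applies to a Pod spec. The approved-registry prefix and label key are made-up examples.

```python
# Sketch of the admission checks from the compliance-as-code bullet.
# Registry prefix and label key are hypothetical examples.
APPROVED_REGISTRIES = ("registry.internal.example.com/",)

def validate_pod(pod: dict) -> list:
    """Return a list of policy violations; non-empty means deny admission."""
    violations = []
    for c in pod.get("spec", {}).get("containers", []):
        limits = c.get("resources", {}).get("limits", {})
        if "cpu" not in limits or "memory" not in limits:
            violations.append(f"{c['name']}: missing resource limits")
        if c.get("securityContext", {}).get("privileged"):
            violations.append(f"{c['name']}: privileged container forbidden")
        if not c.get("image", "").startswith(APPROVED_REGISTRIES):
            violations.append(f"{c['name']}: image not from approved registry")
    if "data-classification" not in pod.get("metadata", {}).get("labels", {}):
        violations.append("missing data-classification label")
    return violations

# A pod that violates all four rules:
bad_pod = {
    "metadata": {"labels": {}},
    "spec": {"containers": [
        {"name": "app", "image": "docker.io/nginx",
         "securityContext": {"privileged": True}},
    ]},
}
```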
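The hot/warm/cold retention lifecycle from the bullets above looks roughly like this in the shape `boto3`'s `put_bucket_lifecycle_configuration` expects. Bucket name and filter tag are illustrative; new objects start in Standard, so the hot tier needs no explicit transition.

```python
# S3 lifecycle rule for framework-tagged audit logs: Standard (hot) for
# 30 days, Infrequent Access (warm) to 1 year, Deep Archive (cold) to
# ~7 years, then expire. Bucket and tag values are illustrative.
lifecycle = {
    "Rules": [{
        "ID": "audit-retention-7y",
        "Status": "Enabled",
        "Filter": {"Tag": {"Key": "compliance", "Value": "pci-dss"}},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm after 30 days
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}, # cold after 1 year
        ],
        "Expiration": {"Days": 2555},  # ~7 years
    }],
}
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="org-audit-logs", LifecycleConfiguration=lifecycle)
```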
Failure Scenarios
Scenario 1: Audit Log Pipeline Gap During an Infrastructure Incident. A Kafka cluster serving as the audit event bus goes down for 4 hours due to a botched broker upgrade. During that window, 2.3 million audit events from 300 services vanish: they were buffered in application memory and dropped when pods recycled during the incident response. Two weeks later, a SOC 2 auditor asks for access control evidence covering that exact period. The gap is inexplicable. The result: a qualified audit opinion and a $200K remediation engagement. Detection: Monitor audit_events_produced_total vs audit_events_consumed_total and alert when the delta exceeds 1,000 events over 5 minutes. Run an end-to-end synthetic audit event through the full pipeline every 60 seconds, and alert when the canary does not arrive. Recovery: Deploy a dead-letter queue for failed audit writes. Add local filesystem buffering in the audit SDK with at-least-once delivery guarantees. Backfill the gap from application-level logs if available, and document the gap formally for auditors: they accept a documented gap far more readily than an unexplained one.
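The produced-vs-consumed delta check from the detection guidance can be sketched as a per-window comparison of the two counters. The metric names and the 1,000-event threshold come from the scenario; in practice the totals would come from a Prometheus query rather than function arguments.

```python
# Per-window lag check over the two pipeline counters named above.
# Threshold (1,000 events over the window) is from the scenario text.
DELTA_THRESHOLD = 1_000

def pipeline_lag_alert(produced_total: int, consumed_total: int,
                       prev_produced: int, prev_consumed: int) -> bool:
    """Alert when this window's produced count outruns consumed by more
    than DELTA_THRESHOLD (counters are cumulative, so diff per window)."""
    produced = produced_total - prev_produced
    consumed = consumed_total - prev_consumed
    return (produced - consumed) > DELTA_THRESHOLD

# 12,400 events produced but only 10,900 consumed in the window -> alert:
assert pipeline_lag_alert(512_400, 510_900, 500_000, 500_000)
# Small lag within threshold -> no alert:
assert not pipeline_lag_alert(500_800, 500_750, 500_000, 500_000)
```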
Scenario 2: Insider Threat, Engineer Deletes Audit Records. A disgruntled engineer with production AWS access deletes CloudTrail logs from the S3 audit bucket to cover tracks after exfiltrating customer data. Nobody notices for 72 hours because the monitoring system only checks for new log delivery, not log integrity. Prevention and detection: Enable S3 Object Lock in Compliance mode, which makes deletion impossible, even for root accounts, for the configured retention period. Enable S3 access logging on the audit bucket itself (audit the audit storage). Turn on CloudTrail log file integrity validation, which uses SHA-256 digest files to catch tampering. Recovery: If Object Lock was not enabled, restore from S3 cross-region replication into a separate AWS account (the insider likely does not have access to the backup account). Add an AWS Organizations SCP that denies s3:DeleteObject on audit buckets across all accounts. Separate the audit storage account from the production account with distinct IAM boundaries. This scenario is exactly why separate accounts matter.
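The deny-delete SCP described in the recovery steps is an ordinary IAM-style policy document; a sketch follows, with the bucket ARN as a made-up example. Denying `s3:PutBucketLifecycleConfiguration` as well closes the loophole of "deleting" objects by attaching an aggressive expiration rule.

```python
# AWS Organizations SCP sketch denying audit-log deletion org-wide.
# Bucket ARN is illustrative.
scp_deny_audit_delete = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Deny object-level deletion, including versioned deletes.
            "Sid": "DenyAuditObjectDeletion",
            "Effect": "Deny",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": "arn:aws:s3:::org-audit-logs/*",
        },
        {   # Deny bucket deletion and lifecycle changes that could
            # silently expire audit objects.
            "Sid": "DenyAuditBucketTampering",
            "Effect": "Deny",
            "Action": ["s3:DeleteBucket", "s3:PutBucketLifecycleConfiguration"],
            "Resource": "arn:aws:s3:::org-audit-logs",
        },
    ],
}
```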
Scenario 3: Compliance Scope Creep After Acquisition. A company acquires a startup whose application processes payment card data (PCI-DSS scope) but has zero audit logging infrastructure. The acquired app runs on a separate AWS account with no CloudTrail, no Kubernetes audit policy, and application logs that overwrite daily. During PCI-DSS audit, the QSA flags the acquired system as a critical gap: 90 days to fix it or face non-compliance for the entire organization. Detection: Maintain a compliance inventory that maps every application to its applicable compliance frameworks. Run automated scans (AWS Config Rules, Prowler) that check for CloudTrail enablement, log retention policies, and encryption at rest across all accounts. Do this before acquisitions close, not after. Recovery: Deploy a standardized audit logging stack (CloudTrail + Falco + centralized SIEM) to the acquired account within the remediation window. Use network segmentation to isolate PCI-scoped workloads and shrink the compliance boundary as much as possible.
Capacity Planning
Audit event volume estimation: daily_audit_events = api_requests_per_day * audit_ratio + admin_actions_per_day + auth_events_per_day. A platform with 50M API requests/day, a 1% audit ratio, 10K admin actions, and 500K auth events generates roughly 500,000 + 10,000 + 500,000 = ~1M audit events/day at ~1 KB each = 1 GB/day raw. That number gets big fast.
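The formula above as a runnable estimate, using the ~1 KB/event assumption from the text:

```python
# Daily audit volume estimate (1 KB/event assumption from the text).
def daily_audit_storage_gb(api_requests: int, audit_ratio: float,
                           admin_actions: int, auth_events: int,
                           bytes_per_event: int = 1_000):
    """Return (events/day, GB/day) per the estimation formula."""
    events = int(api_requests * audit_ratio) + admin_actions + auth_events
    return events, events * bytes_per_event / 1e9

# The mid-scale example: 50M requests/day, 1% audit ratio,
# 10K admin actions, 500K auth events.
events, gb_per_day = daily_audit_storage_gb(50_000_000, 0.01, 10_000, 500_000)
# ~1.01M events/day and ~1 GB/day; at 7-year retention that is ~2.6 TB raw,
# before replication or index overhead.
```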
| Scale Tier | API Requests/Day | Audit Events/Day | Daily Storage | 7-Year Retention | Reference |
|---|---|---|---|---|---|
| Startup | 1M | 50K | 50 MB | 127 GB | Pre-compliance |
| Mid-scale | 50M | 1M | 1 GB | 2.5 TB | SOC 2 Type II |
| Large-scale | 500M | 20M | 20 GB | 51 TB | PCI-DSS Level 1 |
| Hyper-scale | 5B+ | 500M+ | 500 GB+ | 1.3 PB+ | Stripe, Coinbase |
Key thresholds:
- S3 Object Lock Compliance mode retention should match framework requirements (SOC 2: 1 year, HIPAA: 6 years, PCI-DSS: 1 year, financial services: 7 years).
- Keep Elasticsearch for SIEM hot queries at 30 days; use Glacier for long-term cold storage.
- Falco generates roughly 100 alerts/day per 1,000 pods in a well-tuned deployment. Above 500 alerts/day, the rules are too noisy and need tuning.
- CloudTrail delivers the first copy of management events in each account free; data events (S3 object-level, Lambda invocation) cost $0.10 per 100,000 events, which adds up at scale.
- Budget 1 GRC/compliance engineer per 500 engineers and 3 compliance frameworks.
Architecture Decision Record
Decision: Choosing a Compliance & Audit Logging Architecture
| Criteria (Weight) | CloudTrail + S3 + Athena | Splunk Enterprise | ELK + S3 Archive | Datadog Security |
|---|---|---|---|---|
| Compliance certifications (25%) | 4 - SOC 2, PCI, HIPAA eligible | 5 - Most certifications, GovCloud | 3 - Self-managed, self-certified | 4 - SOC 2, ISO 27001 |
| Query & investigation (20%) | 3 - Athena SQL, slow for ad-hoc | 5 - SPL is industry-leading | 4 - Lucene queries, KQL | 4 - Log analytics, SIEM |
| Immutability guarantees (20%) | 5 - S3 Object Lock, cross-account | 3 - Index immutability, not native | 3 - Requires manual S3 archival | 3 - SaaS retention, limited control |
| Cost at scale (15%) | 5 - S3 + Glacier is cheapest | 1 - Most expensive per GB | 3 - Self-hosted infra cost | 2 - Per-GB ingestion pricing |
| Operational complexity (10%) | 4 - Managed services, minimal ops | 3 - Heavy infrastructure | 2 - Cluster management, lifecycle | 5 - Fully managed |
| Real-time detection (10%) | 2 - EventBridge delay, not real-time | 5 - Real-time correlation | 3 - Near-real-time with pipeline | 4 - Real-time threat detection |
When to choose what:
- AWS-native, cost-conscious: CloudTrail + S3 Object Lock + Athena. Cheapest long-term storage with immutability built in. Add GuardDuty for real-time threat detection. Works for SOC 2, HIPAA, PCI-DSS.
- Enterprise SOC with a dedicated security team: Splunk. The most powerful query language, real-time correlation rules, compliance dashboards out of the box. The cost is justified when breach exposure exceeds $10M.
- Multi-cloud, K8s-heavy: ELK stack with Falco for runtime security, S3 archival for compliance retention. Most flexible but highest operational burden. Budget at least one person dedicated to keeping the cluster healthy.
- Team under 50, need compliance fast: Datadog Security. Unified with existing monitoring, built-in compliance rules, managed retention. The trade is cost for time-to-compliance, which is often the right call for smaller teams.
- Financial services or government: Splunk GovCloud or AWS GovCloud + CloudTrail. FedRAMP authorized, ITAR compliant, required for government contracts. No open-source alternative meets the certification requirements here. That is just the reality.
Key Points
- Audit logs answer the only question that matters after an incident: who did what, when, and from where
- Immutable, append-only storage is non-negotiable. If someone can delete the logs, the logs are worthless
- SOC 2, GDPR, HIPAA, PCI-DSS all require specific logging and retention policies, and auditors will check
- Separation of duties matters. Engineers who deploy code should never be able to touch audit logs
- Automated compliance checks in CI/CD catch policy violations before they hit production
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| AWS CloudTrail | Managed | AWS API audit logging, S3 integration | Small-Enterprise |
| Falco | Open Source | Runtime security, K8s audit, eBPF-based | Medium-Enterprise |
| Splunk | Commercial | Enterprise SIEM, compliance reporting, SPL queries | Large-Enterprise |
| Open Policy Agent | Open Source | Policy enforcement, admission webhooks, audit | Medium-Enterprise |
Common Mistakes
- Storing audit logs in the same system they audit. A compromised system can wipe its own trail
- Ignoring failed authentication attempts. These are often the earliest signal of an attack
- Keeping logs for too short a period. Compliance frameworks demand 1-7 years depending on the standard
- No alerting on suspicious audit events. Logs nobody reads are just expensive storage
- Stuffing PII into audit records. The audit logs become their own compliance liability