Compliance & Audit Logging
Why It Exists
Sooner or later, every production system gets hit with the same question: "Who changed what, and when?" Maybe it is a security incident. Maybe a regulator is at the door. Maybe the team is running a postmortem and nobody can explain why a database table disappeared at 3am. The answer has to come from somewhere, and that somewhere is an immutable, tamper-proof audit trail.
Compliance frameworks are not suggestions to wave off. SOC 2 Type II auditors will ask for evidence of access controls, change management, and incident response. GDPR requires logging access to personal data. PCI-DSS demands audit trails for all access to cardholder data. Without structured audit logging, organizations fail audits, pay fines, and have no way to do forensic analysis when (not if) a breach happens.
How It Works
Audit Event Structure
Every audit event needs to capture six things:
- Actor (who): user ID, service account, IP
- Action (what): CRUD operation or API method
- Resource (the target): resource type, ID, namespace
- Timestamp (when): UTC with nanosecond precision
- Outcome: success or failure, plus any error code
- Context (where from): source IP, user agent, session ID
Standardize this schema across all services. Without a common format, cross-system correlation becomes a nightmare.
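A minimal sketch of such a shared schema, as a Python dataclass. All field names here are illustrative, not from any specific library; serializing via a single type like this is what keeps events comparable across services.

```python
# Hypothetical standardized audit event schema (field names illustrative).
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class AuditEvent:
    actor: str      # who: user ID, service account, or client IP
    action: str     # what: CRUD operation or API method, e.g. "secrets.read"
    resource: str   # target: resource type/ID/namespace
    outcome: str    # "success" or "failure" (append error code on failure)
    context: dict   # where from: source IP, user agent, session ID
    # UTC timestamp; datetime carries microseconds -- use time.time_ns()
    # where true nanosecond precision is required.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

event = AuditEvent(
    actor="svc-billing",
    action="secrets.read",
    resource="secret/prod/stripe-api-key",
    outcome="success",
    context={"source_ip": "10.2.1.17", "session_id": "abc123"},
)
record = asdict(event)  # serialize identically in every service
```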
Immutability Guarantees
Audit logs must be append-only and tamper-evident. There is no middle ground here:
- WORM storage (Write Once, Read Many). S3 Object Lock in Compliance mode prevents deletion even by root accounts for the configured retention period. This is the strongest guarantee AWS offers.
- Separate account. Store audit logs in a dedicated AWS account with locked-down IAM policies. The production account can write but never delete or modify. This separation is worth the operational overhead.
- Cryptographic chaining. Hash each log entry with the previous entry's hash (yes, like a blockchain, except actually useful) to detect tampering or deletion after the fact.
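The chaining idea in the last bullet can be sketched in a few lines: each entry's hash covers the previous entry's hash, so editing or deleting any record breaks verification for everything after it. This is a toy illustration of the technique, not a production implementation.

```python
# Minimal hash-chained audit log: tampering anywhere breaks the chain.
import hashlib
import json

def append_entry(chain: list, event: dict) -> None:
    prev_hash = chain[-1]["hash"] if chain else "0" * 64  # genesis sentinel
    payload = json.dumps(event, sort_keys=True)           # canonical form
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    chain.append({"event": event, "prev_hash": prev_hash, "hash": entry_hash})

def verify_chain(chain: list) -> bool:
    prev_hash = "0" * 64
    for entry in chain:
        payload = json.dumps(entry["event"], sort_keys=True)
        expected = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            return False
        prev_hash = entry["hash"]
    return True

log: list = []
append_entry(log, {"actor": "alice", "action": "secrets.read"})
append_entry(log, {"actor": "bob", "action": "rbac.update"})
assert verify_chain(log)
log[0]["event"]["actor"] = "mallory"  # tamper with the first record
assert not verify_chain(log)          # detected: every later hash now mismatches
```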
Compliance Framework Requirements
| Framework | Retention | Key Requirements |
|---|---|---|
| SOC 2 | 1 year | Access control logs, change management evidence, incident response records |
| GDPR | Duration of processing + reasonable period | Data access logs for personal data, right-to-erasure evidence, consent records |
| HIPAA | 6 years | PHI access logs, user authentication records, security incident logs |
| PCI-DSS | 1 year (3 months immediately accessible) | All access to cardholder data, admin actions, audit trail integrity verification |
Kubernetes Audit Policy
The K8s API server provides four audit levels: None, Metadata, Request, and RequestResponse. In production, log RequestResponse for sensitive resources (secrets, RBAC, pods/exec), Metadata for standard resources, and None for high-volume read-only endpoints like health checks and metrics. Logging everything at RequestResponse will cause storage costs to spiral and the SIEM to drown in noise. Be deliberate about what gets captured.
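The tiering described above maps onto an `audit.k8s.io/v1` Policy document, which in practice is YAML passed to the API server via `--audit-policy-file`. Shown here as a Python dict for consistency with the other examples; rules are evaluated in order and the first match wins, so the noisy `None` rule comes first.

```python
# Tiered Kubernetes audit policy (audit.k8s.io/v1), expressed as a dict.
# Rule order matters: the API server applies the first matching rule.
import json

audit_policy = {
    "apiVersion": "audit.k8s.io/v1",
    "kind": "Policy",
    "rules": [
        # None: drop high-volume read-only noise (health checks, metrics).
        {"level": "None",
         "nonResourceURLs": ["/healthz*", "/readyz*", "/metrics"]},
        # RequestResponse: full request and response bodies for sensitive
        # resources (secrets, RBAC, pod exec).
        {"level": "RequestResponse",
         "resources": [
             {"group": "", "resources": ["secrets", "pods/exec"]},
             {"group": "rbac.authorization.k8s.io", "resources": ["*"]},
         ]},
        # Metadata: who/what/when for everything else, without bodies.
        {"level": "Metadata"},
    ],
}
print(json.dumps(audit_policy, indent=2))  # serialize to YAML for the API server
```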
Production Considerations
- Separation of duties. Set up distinct IAM roles: application teams write audit events, security teams read and analyze, and no single role can both write and delete. Enforce this with AWS Organizations SCPs or GCP Organization Policies. If one person can both generate and erase logs, the audit trail means nothing.
- Real-time alerting. Feed audit events into Falco or a SIEM. Alert on anomalies: privilege escalation attempts, access from unusual IPs, bulk data exports, secret access outside normal patterns. Detection latency should be minutes, not days. I have seen teams that only check logs weekly. By then the damage is done.
- Compliance-as-code. Use OPA Gatekeeper to enforce compliance policies as Kubernetes admission webhooks: all pods must have resource limits, no privileged containers, images only from approved registries, required labels for data classification. The goal is to prevent violations, not just detect them.
- Retention lifecycle. Automate S3 lifecycle policies: hot storage (30 days, Standard), warm (1 year, Infrequent Access), cold (7 years, Glacier Deep Archive). Tag by compliance framework to apply framework-specific retention. Do not manually manage this. Someone will forget, and then an auditor will find the gap.
- Audit the auditors. Monitor the audit logging pipeline itself. If the collector goes down for 6 hours, that is a 6-hour gap in the forensic record. Alert on pipeline lag, dropped events, and storage write failures. The worst time to discover the audit pipeline is broken is during an actual incident.
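Gatekeeper policies are written in Rego, but the checks named in the compliance-as-code bullet boil down to logic like the following sketch of what a validating admission webhook applies to a Pod spec. The approved-registry prefix and label key are made-up examples.

```python
# Sketch of the admission checks from the compliance-as-code bullet.
# Registry prefix and label key are hypothetical examples.
APPROVED_REGISTRIES = ("registry.internal.example.com/",)

def validate_pod(pod: dict) -> list:
    """Return a list of policy violations; non-empty means deny admission."""
    violations = []
    for c in pod.get("spec", {}).get("containers", []):
        limits = c.get("resources", {}).get("limits", {})
        if "cpu" not in limits or "memory" not in limits:
            violations.append(f"{c['name']}: missing resource limits")
        if c.get("securityContext", {}).get("privileged"):
            violations.append(f"{c['name']}: privileged container forbidden")
        if not c.get("image", "").startswith(APPROVED_REGISTRIES):
            violations.append(f"{c['name']}: image not from approved registry")
    if "data-classification" not in pod.get("metadata", {}).get("labels", {}):
        violations.append("missing data-classification label")
    return violations

# A pod that violates all four rules:
bad_pod = {
    "metadata": {"labels": {}},
    "spec": {"containers": [
        {"name": "app", "image": "docker.io/nginx",
         "securityContext": {"privileged": True}},
    ]},
}
```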
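The hot/warm/cold retention lifecycle from the bullets above looks roughly like this in the shape `boto3`'s `put_bucket_lifecycle_configuration` expects. Bucket name and filter tag are illustrative; new objects start in Standard, so the hot tier needs no explicit transition.

```python
# S3 lifecycle rule for framework-tagged audit logs: Standard (hot) for
# 30 days, Infrequent Access (warm) to 1 year, Deep Archive (cold) to
# ~7 years, then expire. Bucket and tag values are illustrative.
lifecycle = {
    "Rules": [{
        "ID": "audit-retention-7y",
        "Status": "Enabled",
        "Filter": {"Tag": {"Key": "compliance", "Value": "pci-dss"}},
        "Transitions": [
            {"Days": 30, "StorageClass": "STANDARD_IA"},   # warm after 30 days
            {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}, # cold after 1 year
        ],
        "Expiration": {"Days": 2555},  # ~7 years
    }],
}
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="org-audit-logs", LifecycleConfiguration=lifecycle)
```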
Failure Scenarios
Scenario 1: Audit Log Pipeline Gap During an Infrastructure Incident. A Kafka cluster serving as the audit event bus goes down for 4 hours due to a botched broker upgrade. During that window, 2.3 million audit events from 300 services vanish: they were buffered in application memory and dropped when pods recycled during the incident response. Two weeks later, a SOC 2 auditor asks for access control evidence covering that exact period. The gap is inexplicable. The result: a qualified audit opinion and a $200K remediation engagement. Detection: Monitor audit_events_produced_total vs audit_events_consumed_total and alert when the delta exceeds 1,000 events over 5 minutes. Run an end-to-end synthetic audit event through the full pipeline every 60 seconds, and alert when the canary does not arrive. Recovery: Deploy a dead-letter queue for failed audit writes. Add local filesystem buffering in the audit SDK with at-least-once delivery guarantees. Backfill the gap from application-level logs if available, and document the gap formally for auditors: they accept a documented gap far more readily than an unexplained one.
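The produced-vs-consumed delta check from the detection guidance can be sketched as a per-window comparison of the two counters. The metric names and the 1,000-event threshold come from the scenario; in practice the totals would come from a Prometheus query rather than function arguments.

```python
# Per-window lag check over the two pipeline counters named above.
# Threshold (1,000 events over the window) is from the scenario text.
DELTA_THRESHOLD = 1_000

def pipeline_lag_alert(produced_total: int, consumed_total: int,
                       prev_produced: int, prev_consumed: int) -> bool:
    """Alert when this window's produced count outruns consumed by more
    than DELTA_THRESHOLD (counters are cumulative, so diff per window)."""
    produced = produced_total - prev_produced
    consumed = consumed_total - prev_consumed
    return (produced - consumed) > DELTA_THRESHOLD

# 12,400 events produced but only 10,900 consumed in the window -> alert:
assert pipeline_lag_alert(512_400, 510_900, 500_000, 500_000)
# Small lag within threshold -> no alert:
assert not pipeline_lag_alert(500_800, 500_750, 500_000, 500_000)
```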
Scenario 2: Insider Threat, Engineer Deletes Audit Records. A disgruntled engineer with production AWS access deletes CloudTrail logs from the S3 audit bucket to cover tracks after exfiltrating customer data. Nobody notices for 72 hours because the monitoring system only checks for new log delivery, not log integrity. Prevention and detection: Enable S3 Object Lock in Compliance mode, which makes deletion impossible, even for root accounts, for the configured retention period. Enable S3 access logging on the audit bucket itself (audit the audit storage). Turn on CloudTrail log file integrity validation, which uses SHA-256 digest files to catch tampering. Recovery: If Object Lock was not enabled, restore from S3 cross-region replication into a separate AWS account (the insider likely does not have access to the backup account). Add an AWS Organizations SCP that denies s3:DeleteObject on audit buckets across all accounts. Separate the audit storage account from the production account with distinct IAM boundaries. This scenario is exactly why separate accounts matter.
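The deny-delete SCP described in the recovery steps is an ordinary IAM-style policy document; a sketch follows, with the bucket ARN as a made-up example. Denying `s3:PutBucketLifecycleConfiguration` as well closes the loophole of "deleting" objects by attaching an aggressive expiration rule.

```python
# AWS Organizations SCP sketch denying audit-log deletion org-wide.
# Bucket ARN is illustrative.
scp_deny_audit_delete = {
    "Version": "2012-10-17",
    "Statement": [
        {   # Deny object-level deletion, including versioned deletes.
            "Sid": "DenyAuditObjectDeletion",
            "Effect": "Deny",
            "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
            "Resource": "arn:aws:s3:::org-audit-logs/*",
        },
        {   # Deny bucket deletion and lifecycle changes that could
            # silently expire audit objects.
            "Sid": "DenyAuditBucketTampering",
            "Effect": "Deny",
            "Action": ["s3:DeleteBucket", "s3:PutBucketLifecycleConfiguration"],
            "Resource": "arn:aws:s3:::org-audit-logs",
        },
    ],
}
```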
Scenario 3: Compliance Scope Creep After Acquisition. A company acquires a startup whose application processes payment card data (PCI-DSS scope) but has zero audit logging infrastructure. The acquired app runs on a separate AWS account with no CloudTrail, no Kubernetes audit policy, and application logs that overwrite daily. During PCI-DSS audit, the QSA flags the acquired system as a critical gap: 90 days to fix it or face non-compliance for the entire organization. Detection: Maintain a compliance inventory that maps every application to its applicable compliance frameworks. Run automated scans (AWS Config Rules, Prowler) that check for CloudTrail enablement, log retention policies, and encryption at rest across all accounts. Do this before acquisitions close, not after. Recovery: Deploy a standardized audit logging stack (CloudTrail + Falco + centralized SIEM) to the acquired account within the remediation window. Use network segmentation to isolate PCI-scoped workloads and shrink the compliance boundary as much as possible.
Capacity Planning
Audit event volume estimation: daily_audit_events = api_requests_per_day * audit_ratio + admin_actions_per_day + auth_events_per_day. A platform with 50M API requests/day, a 1% audit ratio, 10K admin actions, and 500K auth events generates roughly 500,000 + 10,000 + 500,000 = ~1M audit events/day at ~1 KB each = 1 GB/day raw. That number gets big fast.
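The formula above as a runnable estimate, using the ~1 KB/event assumption from the text:

```python
# Daily audit volume estimate (1 KB/event assumption from the text).
def daily_audit_storage_gb(api_requests: int, audit_ratio: float,
                           admin_actions: int, auth_events: int,
                           bytes_per_event: int = 1_000):
    """Return (events/day, GB/day) per the estimation formula."""
    events = int(api_requests * audit_ratio) + admin_actions + auth_events
    return events, events * bytes_per_event / 1e9

# The mid-scale example: 50M requests/day, 1% audit ratio,
# 10K admin actions, 500K auth events.
events, gb_per_day = daily_audit_storage_gb(50_000_000, 0.01, 10_000, 500_000)
# ~1.01M events/day and ~1 GB/day; at 7-year retention that is ~2.6 TB raw,
# before replication or index overhead.
```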
| Scale Tier | API Requests/Day | Audit Events/Day | Daily Storage | 7-Year Retention | Reference |
|---|---|---|---|---|---|
| Startup | 1M | 50K | 50 MB | 127 GB | Pre-compliance |
| Mid-scale | 50M | 1M | 1 GB | 2.5 TB | SOC 2 Type II |
| Large-scale | 500M | 20M | 20 GB | 51 TB | PCI-DSS Level 1 |
| Hyper-scale | 5B+ | 500M+ | 500 GB+ | 1.3 PB+ | Stripe, Coinbase |
Key thresholds:
- S3 Object Lock Compliance mode retention should match framework requirements (SOC 2: 1 year, HIPAA: 6 years, PCI-DSS: 1 year, financial services: 7 years).
- Keep Elasticsearch for SIEM hot queries at 30 days; use Glacier for long-term cold storage.
- Falco generates roughly 100 alerts/day per 1,000 pods in a well-tuned deployment. Above 500 alerts/day, the rules are too noisy and need tuning.
- CloudTrail delivers the first copy of management events in each account free; data events (S3 object-level, Lambda invocation) cost $0.10 per 100,000 events, which adds up at scale.
- Budget 1 GRC/compliance engineer per 500 engineers and 3 compliance frameworks.
Architecture Decision Record
Decision: Choosing a Compliance & Audit Logging Architecture
| Criteria (Weight) | CloudTrail + S3 + Athena | Splunk Enterprise | ELK + S3 Archive | Datadog Security |
|---|---|---|---|---|
| Compliance certifications (25%) | 4 - SOC 2, PCI, HIPAA eligible | 5 - Most certifications, GovCloud | 3 - Self-managed, self-certified | 4 - SOC 2, ISO 27001 |
| Query & investigation (20%) | 3 - Athena SQL, slow for ad-hoc | 5 - SPL is industry-leading | 4 - Lucene queries, KQL | 4 - Log analytics, SIEM |
| Immutability guarantees (20%) | 5 - S3 Object Lock, cross-account | 3 - Index immutability, not native | 3 - Requires manual S3 archival | 3 - SaaS retention, limited control |
| Cost at scale (15%) | 5 - S3 + Glacier is cheapest | 1 - Most expensive per GB | 3 - Self-hosted infra cost | 2 - Per-GB ingestion pricing |
| Operational complexity (10%) | 4 - Managed services, minimal ops | 3 - Heavy infrastructure | 2 - Cluster management, lifecycle | 5 - Fully managed |
| Real-time detection (10%) | 2 - EventBridge delay, not real-time | 5 - Real-time correlation | 3 - Near-real-time with pipeline | 4 - Real-time threat detection |
When to choose what:
- AWS-native, cost-conscious: CloudTrail + S3 Object Lock + Athena. Cheapest long-term storage with immutability built in. Add GuardDuty for real-time threat detection. Works for SOC 2, HIPAA, PCI-DSS.
- Enterprise SOC with a dedicated security team: Splunk. The most powerful query language, real-time correlation rules, compliance dashboards out of the box. The cost is justified when breach exposure exceeds $10M.
- Multi-cloud, K8s-heavy: ELK stack with Falco for runtime security, S3 archival for compliance retention. Most flexible but highest operational burden. Budget at least one person dedicated to keeping the cluster healthy.
- Team under 50, need compliance fast: Datadog Security. Unified with existing monitoring, built-in compliance rules, managed retention. The trade is cost for time-to-compliance, which is often the right call for smaller teams.
- Financial services or government: Splunk GovCloud or AWS GovCloud + CloudTrail. FedRAMP authorized, ITAR compliant, required for government contracts. No open-source alternative meets the certification requirements here. That is just the reality.
Key Points
- Audit logs answer the only question that matters after an incident: who did what, when, and from where
- Immutable, append-only storage is non-negotiable. If someone can delete the logs, the logs are worthless
- SOC 2, GDPR, HIPAA, PCI-DSS all require specific logging and retention policies, and auditors will check
- Separation of duties matters. Engineers who deploy code should never be able to touch audit logs
- Automated compliance checks in CI/CD catch policy violations before they hit production
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| AWS CloudTrail | Managed | AWS API audit logging, S3 integration | Small-Enterprise |
| Falco | Open Source | Runtime security, K8s audit, eBPF-based | Medium-Enterprise |
| Splunk | Commercial | Enterprise SIEM, compliance reporting, SPL queries | Large-Enterprise |
| Open Policy Agent | Open Source | Policy enforcement, admission webhooks, audit | Medium-Enterprise |
Common Mistakes
- Storing audit logs in the same system they audit. A compromised system can wipe its own trail
- Ignoring failed authentication attempts. These are often the earliest signal of an attack
- Keeping logs for too short a period. Compliance frameworks demand 1-7 years depending on the standard
- No alerting on suspicious audit events. Logs nobody reads are just expensive storage
- Stuffing PII into audit records. The audit logs become their own compliance liability