Data Classification Frameworks
Why Classification Comes First
Every compliance framework you will encounter asks the same question: do you know where your sensitive data is? GDPR requires you to inventory personal data. HIPAA requires you to track PHI. PCI DSS requires you to map cardholder data flows. SOC 2 auditors ask about data classification policies. Without a classification framework, you are answering all of these questions with guesswork.
The Four Levels
Keep it simple. Four levels cover the vast majority of use cases:
- Public - Information intended for external audiences. Marketing content, open-source code, published documentation. No access restrictions. No encryption requirements beyond TLS in transit.
- Internal - Information meant for employees but not harmful if exposed. Internal wikis, team meeting notes, non-sensitive configs. Standard authentication required. No special encryption at rest.
- Confidential - Business-sensitive information. Customer data, financial records, employee PII, source code, API keys. Role-based access controls. Encryption at rest and in transit. Audit logging on access.
- Restricted - Highly sensitive data with regulatory implications. PHI, payment card data, credentials, encryption keys, data subject to legal hold. Strict need-to-know access. Strong encryption. Full audit trails. Enhanced monitoring.
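These levels only help if tooling can read them. Here is a minimal sketch of encoding the levels and their baseline controls in code; the flag names are illustrative, not a standard:

```python
from enum import IntEnum

class Classification(IntEnum):
    """The four levels, ordered so policies can compare sensitivity
    (e.g. 'Confidential and above')."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

# Baseline controls per level, following the list above.
CONTROLS = {
    Classification.PUBLIC:       {"encrypt_at_rest": False, "rbac": False, "audit_access": False, "need_to_know": False},
    Classification.INTERNAL:     {"encrypt_at_rest": False, "rbac": False, "audit_access": False, "need_to_know": False},
    Classification.CONFIDENTIAL: {"encrypt_at_rest": True,  "rbac": True,  "audit_access": True,  "need_to_know": False},
    Classification.RESTRICTED:   {"encrypt_at_rest": True,  "rbac": True,  "audit_access": True,  "need_to_know": True},
}

def controls_for(level: Classification) -> dict:
    """Look up the minimum controls a data store at this level must enforce."""
    return CONTROLS[level]
```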
Implementing Classification in Practice
Classification works when it connects to your infrastructure layer. Tag data stores in your cloud provider with their classification level. Use AWS resource tags, GCP labels, or Azure tags. Then build IAM policies that reference those tags. A Restricted S3 bucket gets a bucket policy requiring SSE-KMS encryption and restricting access to specific IAM roles. A Confidential database column gets column-level security in your warehouse.
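As a concrete sketch of the S3 case, the boto3 calls below tag a bucket with its classification and attach a policy that denies uploads without SSE-KMS. The bucket name and tag key are placeholders; `s3:x-amz-server-side-encryption` is the condition key S3 evaluates on object uploads.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-restricted-bucket"  # placeholder bucket name

# Tag the bucket so ABAC-style IAM policies can key off the classification.
s3.put_bucket_tagging(
    Bucket=BUCKET,
    Tagging={"TagSet": [{"Key": "data-classification", "Value": "restricted"}]},
)

# Deny any object upload that does not request SSE-KMS encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        }
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```

Restricting the bucket to specific IAM roles would add a second deny statement conditioned on `aws:PrincipalArn`.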
At the application layer, annotate your data models. Mark fields that contain PII, PHI, or financial data. Use these annotations to drive behavior: automatic redaction in logs, masking in non-production environments, and inclusion in deletion workflows for data subject requests.
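In Python, dataclass field metadata is one lightweight way to carry these annotations; the `pii` flag and the redaction helper below are illustrative names, not a library API.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Customer:
    id: str
    email: str = field(metadata={"classification": "confidential", "pii": True})
    card_last4: str = field(metadata={"classification": "restricted", "pii": True})
    plan: str = "free"  # non-sensitive, safe to log

def redact_for_logs(record) -> dict:
    """Replace any field marked as PII with a placeholder before logging."""
    out = {}
    for f in fields(record):
        value = getattr(record, f.name)
        out[f.name] = "[REDACTED]" if f.metadata.get("pii") else value
    return out

print(redact_for_logs(Customer(id="c_123", email="a@example.com", card_last4="4242")))
# {'id': 'c_123', 'email': '[REDACTED]', 'card_last4': '[REDACTED]', 'plan': 'free'}
```

The same metadata can feed masking in non-production copies and the field inventory for data subject deletion requests.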
Automated Discovery
Manual classification does not scale. Use automated discovery tools to scan your data stores:
- AWS Macie scans S3 for PII patterns using machine learning
- Google Cloud DLP inspects BigQuery, Cloud Storage, and Datastore
- Microsoft Purview classifies data across Azure, SQL Server, and third-party sources
- Open-source options like Apache Atlas and DataHub provide metadata management and lineage tracking
Run discovery scans regularly, not just once. New tables get created. Data schemas evolve. Engineers copy production data to staging environments. Continuous scanning catches classification drift before it becomes an audit finding.
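As a sketch of what "regularly" looks like in practice, Macie classification jobs can be created on a recurring schedule via boto3. The account ID and bucket below are placeholders.

```python
import boto3

macie = boto3.client("macie2")

# A recurring daily scan picks up new objects and schema drift,
# rather than a one-time snapshot of the bucket.
macie.create_classification_job(
    jobType="SCHEDULED",
    scheduleFrequency={"dailySchedule": {}},
    name="daily-pii-scan",
    initialRun=True,
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "123456789012", "buckets": ["example-data-lake"]}
        ]
    },
)
```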
Data Catalogs
A data catalog ties it all together. Tools like DataHub, Amundsen, or Atlan provide a searchable inventory of every dataset in your organization, complete with ownership, classification, lineage, and freshness metadata. When a compliance team asks "where do we store customer email addresses?" the answer should be a catalog query, not a Slack thread.
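What that query looks like depends on your catalog. As a rough sketch against DataHub's GraphQL API, with the endpoint, token, and search term all placeholders:

```python
import requests

DATAHUB_URL = "http://localhost:8080/api/graphql"  # placeholder endpoint
QUERY = """
query findEmailDatasets {
  searchAcrossEntities(input: {types: [DATASET], query: "customer email", start: 0, count: 10}) {
    total
    searchResults { entity { urn } }
  }
}
"""

resp = requests.post(
    DATAHUB_URL,
    json={"query": QUERY},
    headers={"Authorization": "Bearer <token>"},  # placeholder token
    timeout=10,
)
for result in resp.json()["data"]["searchAcrossEntities"]["searchResults"]:
    print(result["entity"]["urn"])
```

In practice you would search on a tag or glossary term such as PII rather than free text, so results do not depend on how datasets happen to be named.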
Key Points
- Four classification levels work for most organizations: Public, Internal, Confidential, and Restricted. More levels create confusion without adding meaningful protection
- Classification is only useful if it drives automated policy enforcement. Labels that do not trigger access controls, encryption, or retention rules are just decoration
- Data discovery and classification tools (AWS Macie, Google Cloud DLP, Microsoft Purview) can scan structured and unstructured data stores for PII, PHI, and financial data automatically
- Every data store in your organization needs an owner, a classification level, and a documented retention period
- Classification must happen at ingestion time. Retroactively classifying a data lake with three years of untagged data is painful and expensive
Common Mistakes
- Creating a classification policy document that nobody references because it does not connect to technical controls
- Classifying data at the system level instead of the field level, leading to entire databases being labeled Confidential when only a few columns contain sensitive data
- Letting engineering teams self-classify without validation, resulting in sensitive data being marked Internal to avoid access control overhead
- Not accounting for derived data and aggregated datasets that may inherit the classification of their source data