Data Classification Frameworks
Why Classification Comes First
Every compliance framework you will encounter asks the same question: do you know where your sensitive data is? GDPR requires you to inventory personal data. HIPAA requires you to track PHI. PCI DSS requires you to map cardholder data flows. SOC 2 auditors ask about data classification policies. Without a classification framework, you are answering all of these questions with guesswork.
The Four Levels
Keep it simple. Four levels cover the vast majority of use cases:
- Public - Information intended for external audiences. Marketing content, open-source code, published documentation. No access restrictions. No encryption requirements beyond TLS in transit.
- Internal - Information meant for employees but not harmful if exposed. Internal wikis, team meeting notes, non-sensitive configs. Standard authentication required. No special encryption at rest.
- Confidential - Business-sensitive information. Customer data, financial records, employee PII, source code, API keys. Role-based access controls. Encryption at rest and in transit. Audit logging on access.
- Restricted - Highly sensitive data with regulatory implications. PHI, payment card data, credentials, encryption keys, data subject to legal hold. Strict need-to-know access. Strong encryption. Full audit trails. Enhanced monitoring.
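These levels only help if tooling can read them. Here is a minimal sketch of encoding the levels and their baseline controls in code; the flag names are illustrative, not a standard:

```python
from enum import IntEnum

class Classification(IntEnum):
    """The four levels, ordered so policies can compare sensitivity
    (e.g. 'Confidential and above')."""
    PUBLIC = 1
    INTERNAL = 2
    CONFIDENTIAL = 3
    RESTRICTED = 4

# Baseline controls per level, following the list above.
CONTROLS = {
    Classification.PUBLIC:       {"encrypt_at_rest": False, "rbac": False, "audit_access": False, "need_to_know": False},
    Classification.INTERNAL:     {"encrypt_at_rest": False, "rbac": False, "audit_access": False, "need_to_know": False},
    Classification.CONFIDENTIAL: {"encrypt_at_rest": True,  "rbac": True,  "audit_access": True,  "need_to_know": False},
    Classification.RESTRICTED:   {"encrypt_at_rest": True,  "rbac": True,  "audit_access": True,  "need_to_know": True},
}

def controls_for(level: Classification) -> dict:
    """Look up the minimum controls a data store at this level must enforce."""
    return CONTROLS[level]
```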
Implementing Classification in Practice
Classification works when it connects to your infrastructure layer. Tag data stores in your cloud provider with their classification level. Use AWS resource tags, GCP labels, or Azure tags. Then build IAM policies that reference those tags. A Restricted S3 bucket gets a bucket policy requiring SSE-KMS encryption and restricting access to specific IAM roles. A Confidential database column gets column-level security in your warehouse.
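As a concrete sketch of the S3 case, the boto3 calls below tag a bucket with its classification and attach a policy that denies uploads without SSE-KMS. The bucket name and tag key are placeholders; `s3:x-amz-server-side-encryption` is the condition key S3 evaluates on object uploads.

```python
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "example-restricted-bucket"  # placeholder bucket name

# Tag the bucket so ABAC-style IAM policies can key off the classification.
s3.put_bucket_tagging(
    Bucket=BUCKET,
    Tagging={"TagSet": [{"Key": "data-classification", "Value": "restricted"}]},
)

# Deny any object upload that does not request SSE-KMS encryption.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyUnencryptedUploads",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:PutObject",
            "Resource": f"arn:aws:s3:::{BUCKET}/*",
            "Condition": {
                "StringNotEquals": {"s3:x-amz-server-side-encryption": "aws:kms"}
            },
        }
    ],
}
s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))
```

Restricting the bucket to specific IAM roles would add a second deny statement conditioned on `aws:PrincipalArn`.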
At the application layer, annotate your data models. Mark fields that contain PII, PHI, or financial data. Use these annotations to drive behavior: automatic redaction in logs, masking in non-production environments, and inclusion in deletion workflows for data subject requests.
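In Python, dataclass field metadata is one lightweight way to carry these annotations; the `pii` flag and the redaction helper below are illustrative names, not a library API.

```python
from dataclasses import dataclass, field, fields

@dataclass
class Customer:
    id: str
    email: str = field(metadata={"classification": "confidential", "pii": True})
    card_last4: str = field(metadata={"classification": "restricted", "pii": True})
    plan: str = "free"  # non-sensitive, safe to log

def redact_for_logs(record) -> dict:
    """Replace any field marked as PII with a placeholder before logging."""
    out = {}
    for f in fields(record):
        value = getattr(record, f.name)
        out[f.name] = "[REDACTED]" if f.metadata.get("pii") else value
    return out

print(redact_for_logs(Customer(id="c_123", email="a@example.com", card_last4="4242")))
# {'id': 'c_123', 'email': '[REDACTED]', 'card_last4': '[REDACTED]', 'plan': 'free'}
```

The same metadata can feed masking in non-production copies and the field inventory for data subject deletion requests.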
Automated Discovery
Manual classification does not scale. Use automated discovery tools to scan your data stores:
- AWS Macie scans S3 for PII patterns using machine learning
- Google Cloud DLP inspects BigQuery, Cloud Storage, and Datastore
- Microsoft Purview classifies data across Azure, SQL Server, and third-party sources
- Open-source options like Apache Atlas and DataHub provide metadata management and lineage tracking
Run discovery scans regularly, not just once. New tables get created. Data schemas evolve. Engineers copy production data to staging environments. Continuous scanning catches classification drift before it becomes an audit finding.
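As a sketch of what "regularly" looks like in practice, Macie classification jobs can be created on a recurring schedule via boto3. The account ID and bucket below are placeholders.

```python
import boto3

macie = boto3.client("macie2")

# A recurring daily scan picks up new objects and schema drift,
# rather than a one-time snapshot of the bucket.
macie.create_classification_job(
    jobType="SCHEDULED",
    scheduleFrequency={"dailySchedule": {}},
    name="daily-pii-scan",
    initialRun=True,
    s3JobDefinition={
        "bucketDefinitions": [
            {"accountId": "123456789012", "buckets": ["example-data-lake"]}
        ]
    },
)
```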
Data Catalogs
A data catalog ties it all together. Tools like DataHub, Amundsen, or Atlan provide a searchable inventory of every dataset in your organization, complete with ownership, classification, lineage, and freshness metadata. When a compliance team asks "where do we store customer email addresses?" the answer should be a catalog query, not a Slack thread.
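What that query looks like depends on your catalog. As a rough sketch against DataHub's GraphQL API, with the endpoint, token, and search term all placeholders:

```python
import requests

DATAHUB_URL = "http://localhost:8080/api/graphql"  # placeholder endpoint
QUERY = """
query findEmailDatasets {
  searchAcrossEntities(input: {types: [DATASET], query: "customer email", start: 0, count: 10}) {
    total
    searchResults { entity { urn } }
  }
}
"""

resp = requests.post(
    DATAHUB_URL,
    json={"query": QUERY},
    headers={"Authorization": "Bearer <token>"},  # placeholder token
    timeout=10,
)
for result in resp.json()["data"]["searchAcrossEntities"]["searchResults"]:
    print(result["entity"]["urn"])
```

In practice you would search on a tag or glossary term such as PII rather than free text, so results do not depend on how datasets happen to be named.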
Key Points
- Four classification levels work for most organizations: Public, Internal, Confidential, and Restricted. More levels create confusion without adding meaningful protection
- Classification is only useful if it drives automated policy enforcement. Labels that do not trigger access controls, encryption, or retention rules are just decoration
- Data discovery and classification tools (AWS Macie, Google Cloud DLP, Microsoft Purview) can scan structured and unstructured data stores for PII, PHI, and financial data automatically
- Every data store in your organization needs an owner, a classification level, and a documented retention period
- Classification must happen at ingestion time. Retroactively classifying a data lake with three years of untagged data is painful and expensive
Common Mistakes
- Creating a classification policy document that nobody references because it does not connect to technical controls
- Classifying data at the system level instead of the field level, leading to entire databases being labeled Confidential when only a few columns contain sensitive data
- Letting engineering teams self-classify without validation, resulting in sensitive data being marked Internal to avoid access control overhead
- Not accounting for derived data and aggregated datasets that may inherit the classification of their source data