Object Storage & Data Lake
Architecture Diagram
Why It Exists
File systems and block storage were never built for internet scale. File systems hit inode limits and force teams into messy distributed protocols like NFS and CIFS. Block storage is tied to a single machine. Object storage threw all of that out and started over: each object is just a blob of bytes plus some metadata, addressed by a unique key in a flat namespace. No hierarchy, no inodes, no locking protocols.
That simplicity is the whole trick. It is what lets S3 handle over 100 million requests per second while hitting 11 nines of durability through erasure coding and replication across multiple availability zones. The trade-off is giving up random writes and partial updates, and in return getting storage that scales to hundreds of petabytes without anyone ever thinking about capacity.
How It Works
Objects vs Files vs Blocks
| Storage Type | Access Pattern | Unit | Mutability | Use Case |
|---|---|---|---|---|
| Block | Random read/write | Fixed-size blocks | In-place updates | Databases, boot volumes |
| File | Hierarchical path | Files in directories | In-place updates | Shared documents, NFS |
| Object | Key-value HTTP API | Immutable objects | Replace entire object | Media, logs, backups, data lake |
Objects are immutable. Appending to one or doing a partial update is not possible. That sounds like a limitation, and it is, but it also makes aggressive caching, replication, and content-addressable storage straightforward. The trade-off is worth it for the vast majority of data that, once written, is read far more than modified.
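A minimal boto3 sketch of that access model (the bucket and key names are made up): every operation is a whole-object PUT or GET against a key, and an "update" means uploading a complete replacement.

```python
import boto3

s3 = boto3.client("s3")  # bucket and key below are hypothetical

# Write: the object is addressed by a flat key; the "directories" are just key characters
s3.put_object(Bucket="example-bucket", Key="logs/2024/app.log", Body=b"first version")

# Read: a GET returns the whole object (or an explicit byte range), never a writable handle
body = s3.get_object(Bucket="example-bucket", Key="logs/2024/app.log")["Body"].read()

# "Update": there is no append or in-place edit; you upload a full replacement object
s3.put_object(Bucket="example-bucket", Key="logs/2024/app.log", Body=body + b"\nsecond version")
```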
Consistency Models
AWS S3 now provides strong read-after-write consistency for all operations (since December 2020). GCS has always been strongly consistent. A GET right after a PUT returns the latest version. Period.
Anyone who has been doing this long enough remembers the pain of S3's old eventual consistency model. Write an object, list the bucket, and the new object would not show up. Entire data pipelines broke because of this. That history is relevant because workarounds for it still linger in older codebases where they are no longer necessary.
Data Lake Architecture
A data lake stores raw data in its native format on object storage and applies schema-on-read at query time. This is the opposite of a traditional data warehouse, which requires defining the schema before writing anything (schema-on-write).
The modern data lake stack looks like this:
- Storage layer: S3, GCS, or ADLS holding Parquet, ORC, or Avro files.
- Catalog: AWS Glue Catalog or Hive Metastore. These map table schemas to object paths so query engines know where to find the data and what shape it is in.
- Query engines: Athena (serverless Presto), Spark, Trino, or Databricks SQL. These read the catalog and execute queries against the stored files.
- Open table formats: This is where the real action is right now. Apache Iceberg, Delta Lake, and Apache Hudi add ACID transactions, schema evolution, time travel, and partition evolution on top of object storage. Iceberg has pulled ahead as the default for new projects because of its vendor neutrality and hidden partitioning. For greenfield projects, start with Iceberg unless there's a specific reason not to.
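To make the layering concrete, here is a rough PySpark sketch of the storage + catalog + table-format combination, assuming the Iceberg Spark runtime is on the classpath; the catalog name, warehouse path, and table schema are made up, not part of any particular deployment.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime jar is available; catalog name, warehouse path,
# and table/column names are hypothetical.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lake/warehouse/")
    .getOrCreate()
)

# The data files stay on object storage as Parquet; the table format tracks schema,
# snapshots, and partitioning on top of them.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        event_id STRING, user_id STRING, ts TIMESTAMP, payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))  -- hidden partitioning: queries filter on ts directly
""")

spark.sql("INSERT INTO lake.analytics.events VALUES ('e1', 'u1', current_timestamp(), '{}')")
spark.sql("SELECT count(*) FROM lake.analytics.events").show()
```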
Schema-on-Read vs Schema-on-Write
Schema-on-write (warehouses) validates data at ingestion. The result is quality guarantees, but the cost is upfront schema design and rigidity. Schema-on-read (lakes) ingests raw data and applies structure at query time. More flexible, but data quality nightmares are inevitable without discipline.
The lakehouse pattern (Delta Lake, Iceberg) tries to provide both. Raw data lands in a bronze layer, gets cleaned into silver, and is curated into gold. Each layer adds more schema enforcement and quality checks. It works well in practice, though "bronze/silver/gold" is one of those terms people love to put in architecture docs and rarely implement as cleanly as described.
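A small PySpark sketch of the bronze-to-silver step: the raw JSON is read with a schema asserted at query time (schema-on-read), then the silver write applies the kind of quality rules a warehouse would have enforced at ingestion. The paths, field names, and rules are made up.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, LongType

spark = SparkSession.builder.getOrCreate()  # reuse an existing session if one is configured

# Bronze: raw events stored as-is; the schema is only asserted when reading
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("amount", LongType()),
])
bronze = spark.read.schema(event_schema).json("s3://example-lake/bronze/events/")

# Silver: dedupe and drop malformed rows -- the enforcement schema-on-write would do up front
silver = (
    bronze
    .dropDuplicates(["event_id"])
    .filter(F.col("event_id").isNotNull() & F.col("ts").isNotNull())
    .withColumn("dt", F.to_date("ts"))
)
silver.write.mode("overwrite").partitionBy("dt").parquet("s3://example-lake/silver/events/")
```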
Production Considerations
- Cost optimization: Set up lifecycle policies. Transition objects to Infrequent Access after 30 days, Glacier after 90, Deep Archive after 365. For typical log data, this cuts storage costs by 60-80%. Skipping this step is basically lighting money on fire (a lifecycle-policy sketch follows this list).
- Request rate optimization: S3 partitions data by key prefix. Use randomized or hashed prefixes (e.g., hex(hash(id))/data/) to spread load across partitions. Do not use sequential timestamps as prefixes. I have seen this mistake take down analytics pipelines that worked fine at low volume and fell apart at scale.
- Multipart uploads: Anything over 100MB should use multipart upload. The benefits are parallel upload, resumability on failure, and support for objects up to 5TB. There is no good reason to skip this (an upload sketch also follows the list).
- Versioning and replication: Turn on versioning for production buckets. Someone will eventually delete something they should not have. Cross-region replication provides disaster recovery with RPO in minutes.
- Encryption: Use server-side encryption (SSE-S3 or SSE-KMS) for data at rest. Enforce aws:SecureTransport in bucket policies so all traffic uses TLS. For sensitive workloads where the provider should not hold keys, use client-side encryption. It is more operational burden but sometimes the right call.
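For the cost-optimization item, a boto3 sketch of the 30/90/365 tiering schedule described above; the bucket name and log prefix are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Tier objects under logs/ through IA, Glacier, and Deep Archive on the schedule above
s3.put_bucket_lifecycle_configuration(
    Bucket="example-prod-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-log-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```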
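For the multipart-upload item, a boto3 sketch using the transfer manager, which switches to multipart above the threshold and uploads parts in parallel; the file, bucket, and key names are made up.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Above the threshold, upload_file splits the object into parts and uploads them
# concurrently; a failed part is retried without restarting the whole transfer.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # multipart for anything over 100MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)
s3.upload_file("nightly-backup.tar", "example-prod-bucket", "backups/nightly-backup.tar",
               Config=config)
```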
Failure Scenarios
Scenario 1: Small File Problem Cascading into Query Failure. An event pipeline writes one Parquet file per minute to S3, producing 1,440 files per day. After 6 months, a single partition has 260,000+ tiny files (under 1MB each). Athena and Spark queries that used to take 10 seconds now take 15 minutes. The query engine spends 90% of its time on S3 LIST and GET request overhead, not actual data processing. S3 request costs spike 20x. Detection: Monitor average file size per partition and alert when it drops below 64MB. Track the ratio of query planning time to execution time. If planning exceeds 30% of total time, there's a small file problem. Recovery: Run a compaction job (Spark repartition or Iceberg's rewrite_data_files) to merge small files into 128MB-256MB targets. Netflix runs automated compaction to keep file sizes between 128MB and 512MB, and they report 60% faster Spark queries as a result.
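A sketch of the compaction step in both the plain-Parquet and Iceberg forms; the paths, table name, and repartition count are assumptions, and the Iceberg option assumes a catalog configured like the one shown earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse an existing session if one is configured

# Option 1: plain Parquet on S3 -- rewrite the day's partition with fewer, larger files.
# The repartition count is a guess: pick it so each output file lands around 128-256MB.
df = spark.read.parquet("s3://example-lake/events/dt=2024-06-01/")
df.repartition(8).write.mode("overwrite").parquet("s3://example-lake/events_compacted/dt=2024-06-01/")

# Option 2: Iceberg tables -- the built-in maintenance procedure picks file groups itself
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '268435456')  -- 256MB target
    )
""")
```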
Scenario 2: Accidental Bucket Deletion. A misconfigured Terraform destroy command wipes out a production S3 bucket with 50TB of business data. No versioning. No cross-region replication. The data is gone. 18 months of analytics, unrecoverable. This is not hypothetical. Detection: Enable CloudTrail logging for all S3 API calls. Alert on DeleteBucket or DeleteObject calls from non-approved IAM roles. Use S3 Object Lock in governance mode for compliance-critical data. Recovery: Prevention is the only real recovery here. Enable versioning on all production buckets so deleted objects become delete markers that can be restored. Layer the defenses: MFA-delete, bucket policies that deny s3:DeleteBucket, and cross-region replication as a last resort. Any single layer can fail. The combination is what provides real protection.
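A boto3 sketch of two of those layers: versioning plus a bucket policy that denies s3:DeleteBucket. The bucket name is made up, and a real policy would scope the deny more carefully; MFA-delete and cross-region replication are configured separately.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "example-prod-bucket"  # hypothetical

# Layer 1: versioning turns deletes into delete markers that can be rolled back
s3.put_bucket_versioning(Bucket=bucket, VersioningConfiguration={"Status": "Enabled"})

# Layer 2: deny bucket deletion outright, even for principals with broad permissions
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyBucketDeletion",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:DeleteBucket",
        "Resource": f"arn:aws:s3:::{bucket}",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```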
Scenario 3: Data Lake Schema Drift. Upstream producers change the schema of events written to the data lake without telling anyone. A new field appears. An existing field changes from integer to string. A required field gets dropped. Downstream Spark jobs break on type mismatches, the Glue Catalog falls out of sync with actual data on disk, and data analysts get wrong results for 3 days before anyone notices. Detection: Validate schemas at ingestion using a schema registry. Track column-level statistics (null ratios, type distributions) per partition. If the null ratio for a column jumps from 0% to 40% overnight, something broke upstream. Recovery: Apache Iceberg's schema evolution handles this well since it tracks schema changes and can read old data with old schemas. Enforce backward-compatible schema changes only. The bigger lesson here is organizational, not technical. If producers can change schemas without going through a review process, no amount of tooling will help.
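A PySpark sketch of the column-level check described above: compute per-column null ratios for one partition and compare them against a stored baseline. The path, baseline values, and alert threshold are all made up.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # reuse an existing session if one is configured

# Per-column null ratios for today's partition, computed in a single pass
df = spark.read.parquet("s3://example-lake/events/dt=2024-06-01/")
null_ratios = df.agg(*[
    (F.sum(F.col(c).isNull().cast("int")) / F.count(F.lit(1))).alias(c)
    for c in df.columns
]).first().asDict()

# Compare against yesterday's baseline (hypothetical values and a 20-point threshold)
baseline = {"event_id": 0.0, "user_id": 0.01, "ts": 0.0, "amount": 0.05}
for col, ratio in null_ratios.items():
    if ratio - baseline.get(col, 0.0) > 0.20:
        print(f"ALERT: null ratio for {col} jumped to {ratio:.0%}")
```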
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| S3 request rate per prefix | > 3,500 PUT/s or 5,500 GET/s | Redistribute keys across more prefixes |
| Average file size | < 64MB (analytics) | Run compaction jobs |
| Storage cost per TB/month | > $23 (S3 Standard) | Review lifecycle policies, move cold data to IA/Glacier |
| Query scan volume | > 10x result size | Partition pruning is failing; review partition strategy |
| Cross-region replication lag | > 15 minutes | Investigate bandwidth or throttling |
Scale references: AWS S3 stores over 200 trillion objects with sustained throughput of 100+ million requests/second. Netflix's data lake on S3 holds over 60PB with 100,000+ daily Spark/Presto jobs. These numbers are useful as upper-bound benchmarks, but most teams will never get anywhere close. Focus on actual access patterns first.
Capacity formulas: Monthly storage cost = (hot_TB * $23) + (IA_TB * $12.50) + (glacier_TB * $4) + (deep_archive_TB * $0.99). Request cost at scale: monthly_request_cost = (PUT_requests * $0.000005) + (GET_requests * $0.0000004). Here is why file size matters more than almost anything else for data lake costs: a data lake ingesting 1TB/day in 128MB files makes about 8,000 PUT requests per day. Negligible cost. That same 1TB/day in 1KB files is 1 billion PUTs per day, which is $5,000/day in request costs alone. File size is the single most important cost lever available.
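A quick back-of-the-envelope check of that claim, assuming the S3 Standard PUT price of $0.005 per 1,000 requests:

```python
# 1 TB/day of ingest at two different file sizes (assumed S3 Standard PUT price)
TB = 1024 ** 4
PUT_PRICE = 0.000005  # $ per PUT request

for file_size, label in [(128 * 1024 ** 2, "128MB files"), (1024, "1KB files")]:
    puts_per_day = TB // file_size
    print(f"{label}: {puts_per_day:,} PUTs/day -> ${puts_per_day * PUT_PRICE:,.2f}/day")

# 128MB files: 8,192 PUTs/day         -> $0.04/day
# 1KB files:   1,073,741,824 PUTs/day -> $5,368.71/day
```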
Architecture Decision Record
Storage and Table Format Decision Matrix
| Criteria (Weight) | Raw S3 + Athena | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|---|
| ACID transactions needed | No | Yes | Yes | Yes |
| Time travel / snapshots | No | Yes (30-day default) | Yes (configurable) | Yes |
| Schema evolution | Manual (break/fix) | Append-only columns | Full (add, rename, reorder, drop) | Limited |
| Engine compatibility | Athena, Presto, Spark | Spark-first (Databricks) | All major engines | Spark-first |
| Partition evolution | Requires rewrite | Requires rewrite | Transparent (hidden partitioning) | Requires rewrite |
| Vendor lock-in risk | Low | Medium (Databricks ecosystem) | Lowest (vendor-neutral) | Low-Medium |
Decision rules:
- If the data lake is under 10TB and queries are ad-hoc: Raw Parquet on S3 + Athena is enough. Do not add a table format just because it sounds more professional. Plenty of startups over-engineer this and regret the complexity later.
- If ACID transactions, time travel, or concurrent writes are needed: Pick a table format. Iceberg is the default for new projects. It has the broadest engine support (Spark, Trino, Flink, Dremio, Snowflake) and the best partition evolution story. Unless there's a strong reason to pick something else, just use Iceberg.
- For Databricks shops: Delta Lake is deeply integrated and well-optimized within that ecosystem. The performance work they have done (Z-ordering, liquid clustering) is genuinely impressive inside the Databricks runtime. Outside of it, the story gets weaker.
- For near-real-time incremental ingestion (CDC from databases into the lake): Apache Hudi was built for exactly this. Uber created it to solve their incremental ETL problem at petabyte scale, and it still does that particular thing better than the alternatives.
- When multi-engine access is a hard requirement (Spark for ETL, Trino for ad-hoc, Flink for streaming): Iceberg is the safest bet. Its open specification and catalog-level interoperability mean teams will not get boxed into a single engine. Apple, Netflix, and LinkedIn all standardized on Iceberg for this reason.
Key Points
- Stores unstructured data (files, images, logs, backups) as objects with metadata in flat namespaces
- Virtually unlimited scalability. S3 stores over 200 trillion objects
- 11 nines (99.999999999%) durability via erasure coding and redundant storage across multiple availability zones
- Data lakes layer structured query engines (Athena, Presto, Spark) over raw object storage
- Storage tiering (hot/warm/cold/archive) keeps costs sane. Lifecycle policies automate transitions
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| AWS S3 | Managed | De facto standard, broadest ecosystem integration | Small-Enterprise |
| MinIO | Open Source | S3-compatible on-premise, Kubernetes-native | Medium-Enterprise |
| GCS | Managed | BigQuery integration, strong consistency | Small-Enterprise |
| Azure Blob Storage | Managed | Azure ecosystem, ADLS Gen2 for analytics | Small-Enterprise |
Common Mistakes
- Storing small objects individually. High request overhead per object, so batch into larger files
- Not enabling versioning. Accidental deletes or overwrites become irrecoverable
- Ignoring storage class optimization. Keeping cold data in hot tier wastes 60-80% on storage costs
- Not using multipart upload for large files. Single PUT fails silently on network issues
- Flat namespace without key prefix strategy. Poor prefix design causes throttling (S3 partitions by prefix); see the key sketch below
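A sketch of that prefix strategy: derive a short hash from a stable id so writes spread across many prefixes instead of piling onto one time-ordered prefix. The key layout and id format here are made up.

```python
import hashlib

def object_key(record_id: str, filename: str) -> str:
    # Four hex characters give 65,536 possible prefixes to spread request load across
    prefix = hashlib.sha256(record_id.encode()).hexdigest()[:4]
    return f"{prefix}/data/{filename}"

print(object_key("order-10001", "order-10001.json"))  # something like "3fa7/data/order-10001.json"
```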