Object Storage & Data Lake
Architecture Diagram
Why It Exists
File systems and block storage were never built for internet scale. File systems hit inode limits and force teams into messy distributed protocols like NFS and CIFS. Block storage is tied to a single machine. Object storage threw all of that out and started over: each object is just a blob of bytes plus some metadata, addressed by a unique key in a flat namespace. No hierarchy, no inodes, no locking protocols.
That simplicity is the whole trick. It is what lets S3 handle over 100 million requests per second while hitting 11 nines of durability through erasure coding and replication across multiple availability zones. The trade-off is giving up random writes and partial updates, and in return getting storage that scales to hundreds of petabytes without anyone ever thinking about capacity.
How It Works
Objects vs Files vs Blocks
| Storage Type | Access Pattern | Unit | Mutability | Use Case |
|---|---|---|---|---|
| Block | Random read/write | Fixed-size blocks | In-place updates | Databases, boot volumes |
| File | Hierarchical path | Files in directories | In-place updates | Shared documents, NFS |
| Object | Key-value HTTP API | Immutable objects | Replace entire object | Media, logs, backups, data lake |
Objects are immutable. Appending to one or doing a partial update is not possible. That sounds like a limitation, and it is, but it also makes aggressive caching, replication, and content-addressable storage straightforward. The trade-off is worth it for the vast majority of data that, once written, is read far more than modified.
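A minimal boto3 sketch of that access model (the bucket and key names are made up): every operation is a whole-object PUT or GET against a key, and an "update" means uploading a complete replacement.

```python
import boto3

s3 = boto3.client("s3")  # bucket and key below are hypothetical

# Write: the object is addressed by a flat key; the "directories" are just key characters
s3.put_object(Bucket="example-bucket", Key="logs/2024/app.log", Body=b"first version")

# Read: a GET returns the whole object (or an explicit byte range), never a writable handle
body = s3.get_object(Bucket="example-bucket", Key="logs/2024/app.log")["Body"].read()

# "Update": there is no append or in-place edit; you upload a full replacement object
s3.put_object(Bucket="example-bucket", Key="logs/2024/app.log", Body=body + b"\nsecond version")
```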
Consistency Models
AWS S3 now provides strong read-after-write consistency for all operations (since December 2020). GCS has always been strongly consistent. A GET right after a PUT returns the latest version. Period.
Anyone who has been doing this long enough remembers the pain of S3's old eventual consistency model. Write an object, list the bucket, and the new object would not show up. Entire data pipelines broke because of this. That history is relevant because workarounds for it still linger in older codebases where they are no longer necessary.
Data Lake Architecture
A data lake stores raw data in its native format on object storage and applies schema-on-read at query time. This is the opposite of a traditional data warehouse, which requires defining the schema before writing anything (schema-on-write).
The modern data lake stack looks like this:
- Storage layer: S3, GCS, or ADLS holding Parquet, ORC, or Avro files.
- Catalog: AWS Glue Catalog or Hive Metastore. These map table schemas to object paths so query engines know where to find the data and what shape it is in.
- Query engines: Athena (serverless Presto), Spark, Trino, or Databricks SQL. These read the catalog and execute queries against the stored files.
- Open table formats: This is where the real action is right now. Apache Iceberg, Delta Lake, and Apache Hudi add ACID transactions, schema evolution, time travel, and partition evolution on top of object storage. Iceberg has pulled ahead as the default for new projects because of its vendor neutrality and hidden partitioning. For greenfield projects, start with Iceberg unless there's a specific reason not to.
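To make the layering concrete, here is a rough PySpark sketch of the storage + catalog + table-format combination, assuming the Iceberg Spark runtime is on the classpath; the catalog name, warehouse path, and table schema are made up, not part of any particular deployment.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime jar is available; catalog name, warehouse path,
# and table/column names are hypothetical.
spark = (
    SparkSession.builder
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3://example-lake/warehouse/")
    .getOrCreate()
)

# The data files stay on object storage as Parquet; the table format tracks schema,
# snapshots, and partitioning on top of them.
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.events (
        event_id STRING, user_id STRING, ts TIMESTAMP, payload STRING
    ) USING iceberg
    PARTITIONED BY (days(ts))  -- hidden partitioning: queries filter on ts directly
""")

spark.sql("INSERT INTO lake.analytics.events VALUES ('e1', 'u1', current_timestamp(), '{}')")
spark.sql("SELECT count(*) FROM lake.analytics.events").show()
```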
Schema-on-Read vs Schema-on-Write
Schema-on-write (warehouses) validates data at ingestion. The result is quality guarantees, but the cost is upfront schema design and rigidity. Schema-on-read (lakes) ingests raw data and applies structure at query time. More flexible, but data quality nightmares are inevitable without discipline.
The lakehouse pattern (Delta Lake, Iceberg) tries to provide both. Raw data lands in a bronze layer, gets cleaned into silver, and is curated into gold. Each layer adds more schema enforcement and quality checks. It works well in practice, though "bronze/silver/gold" is one of those terms people love to put in architecture docs and rarely implement as cleanly as described.
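A small PySpark sketch of the bronze-to-silver step: the raw JSON is read with a schema asserted at query time (schema-on-read), then the silver write applies the kind of quality rules a warehouse would have enforced at ingestion. The paths, field names, and rules are made up.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, LongType

spark = SparkSession.builder.getOrCreate()  # reuse an existing session if one is configured

# Bronze: raw events stored as-is; the schema is only asserted when reading
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("user_id", StringType()),
    StructField("ts", TimestampType()),
    StructField("amount", LongType()),
])
bronze = spark.read.schema(event_schema).json("s3://example-lake/bronze/events/")

# Silver: dedupe and drop malformed rows -- the enforcement schema-on-write would do up front
silver = (
    bronze
    .dropDuplicates(["event_id"])
    .filter(F.col("event_id").isNotNull() & F.col("ts").isNotNull())
    .withColumn("dt", F.to_date("ts"))
)
silver.write.mode("overwrite").partitionBy("dt").parquet("s3://example-lake/silver/events/")
```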
Production Considerations
- Cost optimization: Set up lifecycle policies. Transition objects to Infrequent Access after 30 days, Glacier after 90, Deep Archive after 365. For typical log data, this cuts storage costs by 60-80%. Skipping this step is basically lighting money on fire (a lifecycle-policy sketch follows this list).
- Request rate optimization: S3 partitions data by key prefix. Use randomized or hashed prefixes (e.g., hex(hash(id))/data/) to spread load across partitions. Do not use sequential timestamps as prefixes. I have seen this mistake take down analytics pipelines that worked fine at low volume and fell apart at scale.
- Multipart uploads: Anything over 100MB should use multipart upload. The benefits are parallel upload, resumability on failure, and support for objects up to 5TB. There is no good reason to skip this (an upload sketch also follows the list).
- Versioning and replication: Turn on versioning for production buckets. Someone will eventually delete something they should not have. Cross-region replication provides disaster recovery with RPO in minutes.
- Encryption: Use server-side encryption (SSE-S3 or SSE-KMS) for data at rest. Enforce aws:SecureTransport in bucket policies so all traffic uses TLS. For sensitive workloads where the provider should not hold keys, use client-side encryption. It is more operational burden but sometimes the right call.
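For the cost-optimization item, a boto3 sketch of the 30/90/365 tiering schedule described above; the bucket name and log prefix are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Tier objects under logs/ through IA, Glacier, and Deep Archive on the schedule above
s3.put_bucket_lifecycle_configuration(
    Bucket="example-prod-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-log-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
        }]
    },
)
```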
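For the multipart-upload item, a boto3 sketch using the transfer manager, which switches to multipart above the threshold and uploads parts in parallel; the file, bucket, and key names are made up.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Above the threshold, upload_file splits the object into parts and uploads them
# concurrently; a failed part is retried without restarting the whole transfer.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # multipart for anything over 100MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=8,
)
s3.upload_file("nightly-backup.tar", "example-prod-bucket", "backups/nightly-backup.tar",
               Config=config)
```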
Failure Scenarios
Scenario 1: Small File Problem Cascading into Query Failure. An event pipeline writes one Parquet file per minute to S3, producing 1,440 files per day. After 6 months, a single partition has 260,000+ tiny files (under 1MB each). Athena and Spark queries that used to take 10 seconds now take 15 minutes. The query engine spends 90% of its time on S3 LIST and GET request overhead, not actual data processing. S3 request costs spike 20x. Detection: Monitor average file size per partition and alert when it drops below 64MB. Track the ratio of query planning time to execution time. If planning exceeds 30% of total time, there's a small file problem. Recovery: Run a compaction job (Spark repartition or Iceberg's rewrite_data_files) to merge small files into 128MB-256MB targets. Netflix runs automated compaction to keep file sizes between 128MB and 512MB, and they report 60% faster Spark queries as a result.
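A sketch of the compaction step in both the plain-Parquet and Iceberg forms; the paths, table name, and repartition count are assumptions, and the Iceberg option assumes a catalog configured like the one shown earlier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # reuse an existing session if one is configured

# Option 1: plain Parquet on S3 -- rewrite the day's partition with fewer, larger files.
# The repartition count is a guess: pick it so each output file lands around 128-256MB.
df = spark.read.parquet("s3://example-lake/events/dt=2024-06-01/")
df.repartition(8).write.mode("overwrite").parquet("s3://example-lake/events_compacted/dt=2024-06-01/")

# Option 2: Iceberg tables -- the built-in maintenance procedure picks file groups itself
spark.sql("""
    CALL lake.system.rewrite_data_files(
        table => 'analytics.events',
        options => map('target-file-size-bytes', '268435456')  -- 256MB target
    )
""")
```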
Scenario 2: Accidental Bucket Deletion. A misconfigured Terraform destroy command wipes out a production S3 bucket with 50TB of business data. No versioning. No cross-region replication. The data is gone. 18 months of analytics, unrecoverable. This is not hypothetical. Detection: Enable CloudTrail logging for all S3 API calls. Alert on DeleteBucket or DeleteObject calls from non-approved IAM roles. Use S3 Object Lock in governance mode for compliance-critical data. Recovery: Prevention is the only real recovery here. Enable versioning on all production buckets so deleted objects become delete markers that can be restored. Layer the defenses: MFA-delete, bucket policies that deny s3:DeleteBucket, and cross-region replication as a last resort. Any single layer can fail. The combination is what provides real protection.
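A boto3 sketch of two of those layers: versioning plus a bucket policy that denies s3:DeleteBucket. The bucket name is made up, and a real policy would scope the deny more carefully; MFA-delete and cross-region replication are configured separately.

```python
import json
import boto3

s3 = boto3.client("s3")
bucket = "example-prod-bucket"  # hypothetical

# Layer 1: versioning turns deletes into delete markers that can be rolled back
s3.put_bucket_versioning(Bucket=bucket, VersioningConfiguration={"Status": "Enabled"})

# Layer 2: deny bucket deletion outright, even for principals with broad permissions
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Sid": "DenyBucketDeletion",
        "Effect": "Deny",
        "Principal": "*",
        "Action": "s3:DeleteBucket",
        "Resource": f"arn:aws:s3:::{bucket}",
    }],
}
s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))
```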
Scenario 3: Data Lake Schema Drift. Upstream producers change the schema of events written to the data lake without telling anyone. A new field appears. An existing field changes from integer to string. A required field gets dropped. Downstream Spark jobs break on type mismatches, the Glue Catalog falls out of sync with actual data on disk, and data analysts get wrong results for 3 days before anyone notices. Detection: Validate schemas at ingestion using a schema registry. Track column-level statistics (null ratios, type distributions) per partition. If the null ratio for a column jumps from 0% to 40% overnight, something broke upstream. Recovery: Apache Iceberg's schema evolution handles this well since it tracks schema changes and can read old data with old schemas. Enforce backward-compatible schema changes only. The bigger lesson here is organizational, not technical. If producers can change schemas without going through a review process, no amount of tooling will help.
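A PySpark sketch of the column-level check described above: compute per-column null ratios for one partition and compare them against a stored baseline. The path, baseline values, and alert threshold are all made up.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # reuse an existing session if one is configured

# Per-column null ratios for today's partition, computed in a single pass
df = spark.read.parquet("s3://example-lake/events/dt=2024-06-01/")
null_ratios = df.agg(*[
    (F.sum(F.col(c).isNull().cast("int")) / F.count(F.lit(1))).alias(c)
    for c in df.columns
]).first().asDict()

# Compare against yesterday's baseline (hypothetical values and a 20-point threshold)
baseline = {"event_id": 0.0, "user_id": 0.01, "ts": 0.0, "amount": 0.05}
for col, ratio in null_ratios.items():
    if ratio - baseline.get(col, 0.0) > 0.20:
        print(f"ALERT: null ratio for {col} jumped to {ratio:.0%}")
```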
Capacity Planning
| Metric | Threshold | Action |
|---|---|---|
| S3 request rate per prefix | > 3,500 PUT/s or 5,500 GET/s | Redistribute keys across more prefixes |
| Average file size | < 64MB (analytics) | Run compaction jobs |
| Storage cost per TB/month | > $23 (S3 Standard) | Review lifecycle policies, move cold data to IA/Glacier |
| Query scan volume | > 10x result size | Partition pruning is failing; review partition strategy |
| Cross-region replication lag | > 15 minutes | Investigate bandwidth or throttling |
Scale references: AWS S3 stores over 200 trillion objects with sustained throughput of 100+ million requests/second. Netflix's data lake on S3 holds over 60PB with 100,000+ daily Spark/Presto jobs. These numbers are useful as upper-bound benchmarks, but most teams will never get anywhere close. Focus on actual access patterns first.
Capacity formulas: Monthly storage cost = (hot_TB * $23) + (IA_TB * $12.50) + (glacier_TB * $4) + (deep_archive_TB * $0.99). Request cost at scale: monthly_request_cost = (PUT_requests * $0.000005) + (GET_requests * $0.0000004). Here is why file size matters more than almost anything else for data lake costs: a data lake ingesting 1TB/day in 128MB files makes about 8,000 PUT requests per day. Negligible cost. That same 1TB/day in 1KB files is 1 billion PUTs per day, which is $5,000/day in request costs alone. File size is the single most important cost lever available.
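A quick back-of-the-envelope check of that claim, assuming the S3 Standard PUT price of $0.005 per 1,000 requests:

```python
# 1 TB/day of ingest at two different file sizes (assumed S3 Standard PUT price)
TB = 1024 ** 4
PUT_PRICE = 0.000005  # $ per PUT request

for file_size, label in [(128 * 1024 ** 2, "128MB files"), (1024, "1KB files")]:
    puts_per_day = TB // file_size
    print(f"{label}: {puts_per_day:,} PUTs/day -> ${puts_per_day * PUT_PRICE:,.2f}/day")

# 128MB files: 8,192 PUTs/day         -> $0.04/day
# 1KB files:   1,073,741,824 PUTs/day -> $5,368.71/day
```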
Architecture Decision Record
Storage and Table Format Decision Matrix
| Criteria (Weight) | Raw S3 + Athena | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|---|
| ACID transactions needed | No | Yes | Yes | Yes |
| Time travel / snapshots | No | Yes (30-day default) | Yes (configurable) | Yes |
| Schema evolution | Manual (break/fix) | Append-only columns | Full (add, rename, reorder, drop) | Limited |
| Engine compatibility | Athena, Presto, Spark | Spark-first (Databricks) | All major engines | Spark-first |
| Partition evolution | Requires rewrite | Requires rewrite | Transparent (hidden partitioning) | Requires rewrite |
| Vendor lock-in risk | Low | Medium (Databricks ecosystem) | Lowest (vendor-neutral) | Low-Medium |
Decision rules:
- If the data lake is under 10TB and queries are ad-hoc: Raw Parquet on S3 + Athena is enough. Do not add a table format just because it sounds more professional. Plenty of startups over-engineer this and regret the complexity later.
- If ACID transactions, time travel, or concurrent writes are needed: Pick a table format. Iceberg is the default for new projects. It has the broadest engine support (Spark, Trino, Flink, Dremio, Snowflake) and the best partition evolution story. Unless there's a strong reason to pick something else, just use Iceberg.
- For Databricks shops: Delta Lake is deeply integrated and well-optimized within that ecosystem. The performance work they have done (Z-ordering, liquid clustering) is genuinely impressive inside the Databricks runtime. Outside of it, the story gets weaker.
- For near-real-time incremental ingestion (CDC from databases into the lake): Apache Hudi was built for exactly this. Uber created it to solve their incremental ETL problem at petabyte scale, and it still does that particular thing better than the alternatives.
- When multi-engine access is a hard requirement (Spark for ETL, Trino for ad-hoc, Flink for streaming): Iceberg is the safest bet. Its open specification and catalog-level interoperability mean teams will not get boxed into a single engine. Apple, Netflix, and LinkedIn all standardized on Iceberg for this reason.
Key Points
- Stores unstructured data (files, images, logs, backups) as objects with metadata in flat namespaces
- Virtually unlimited scalability. S3 stores over 200 trillion objects
- 11 nines (99.999999999%) durability via erasure coding and redundant storage across multiple availability zones
- Data lakes layer structured query engines (Athena, Presto, Spark) over raw object storage
- Storage tiering (hot/warm/cold/archive) keeps costs sane. Lifecycle policies automate transitions
Tool Comparison
| Tool | Type | Best For | Scale |
|---|---|---|---|
| AWS S3 | Managed | De facto standard, broadest ecosystem integration | Small-Enterprise |
| MinIO | Open Source | S3-compatible on-premise, Kubernetes-native | Medium-Enterprise |
| GCS | Managed | BigQuery integration, strong consistency | Small-Enterprise |
| Azure Blob Storage | Managed | Azure ecosystem, ADLS Gen2 for analytics | Small-Enterprise |
Common Mistakes
- Storing small objects individually. High request overhead per object, so batch into larger files
- Not enabling versioning. Accidental deletes or overwrites become irrecoverable
- Ignoring storage class optimization. Keeping cold data in hot tier wastes 60-80% on storage costs
- Not using multipart upload for large files. Single PUT fails silently on network issues
- Flat namespace without key prefix strategy. Poor prefix design causes throttling (S3 partitions by prefix); see the key sketch below
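A sketch of that prefix strategy: derive a short hash from a stable id so writes spread across many prefixes instead of piling onto one time-ordered prefix. The key layout and id format here are made up.

```python
import hashlib

def object_key(record_id: str, filename: str) -> str:
    # Four hex characters give 65,536 possible prefixes to spread request load across
    prefix = hashlib.sha256(record_id.encode()).hexdigest()[:4]
    return f"{prefix}/data/{filename}"

print(object_key("order-10001", "order-10001.json"))  # something like "3fa7/data/order-10001.json"
```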