MinIO
S3-compatible object storage you can run anywhere in 15 minutes
Why It Exists
Most teams that need object storage reach for S3. It works, it scales, there's nothing to manage. But sometimes S3 isn't an option. The data can't leave the datacenter (regulatory compliance, data sovereignty). Cloud egress costs are too high. The environment is air-gapped. Or there's a need to test locally against a real S3-compatible API without mocking.
MinIO fills that gap. It's a full S3-compatible object storage server packaged as a single Go binary. Download, run, point an S3 SDK at it. Bucket operations, multipart uploads, presigned URLs, versioning, lifecycle rules, server-side encryption. It all works. The S3 compatibility is not a token subset or a rough approximation: it's close enough that production apps switch between S3 and MinIO by changing an endpoint URL.
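A minimal sketch of what that endpoint swap looks like with boto3; the endpoint, credentials, bucket, and key below are placeholders, not real values:

```python
# A minimal sketch: the same boto3 code talks to AWS S3 or MinIO.
# Endpoint URL and credentials below are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.internal.example:9000",  # drop this line to target AWS S3
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

s3.create_bucket(Bucket="backups")
s3.put_object(Bucket="backups", Key="db/2024-01-01.dump", Body=b"...")

# Presigned URLs, multipart uploads, versioning, etc. use the same SDK calls.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "backups", "Key": "db/2024-01-01.dump"},
    ExpiresIn=3600,
)
print(url)
```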
The trade-off is scale. MinIO handles petabytes well. It does not handle exabytes. For hundreds of petabytes across thousands of nodes with rack-aware placement and cross-AZ erasure coding, Ceph or a custom system is more appropriate. MinIO's sweet spot is 10TB to 10PB of S3-compatible storage on owned hardware, with minimal operational overhead.
How It Works Internally
A MinIO deployment starts with a set of servers and drives. Say 4 servers with 4 drives each: 16 drives total. MinIO groups these into erasure sets. With the default parity of EC:4, all 16 drives form one erasure set. Each object gets split into 12 data shards and 4 parity shards, one per drive.
The write path looks like this: a client sends a PUT to any MinIO server. That server computes which erasure set owns the object (hash of bucket name + object key, deterministic). It reads the incoming data, splits it into 12 data shards using Reed-Solomon encoding, computes 4 parity shards, and writes all 16 shards to the 16 drives in parallel. Each shard is a file on disk. Next to it sits an xl.meta file containing the object's metadata (name, size, ETag, erasure parameters, part numbers for multipart). In a healthy set the write completes when all 16 shards are written and fsynced; strictly, MinIO only requires write quorum (12 of 16 drives with EC:4) to acknowledge.
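A conceptual sketch of that placement and fan-out, with simplified stand-ins for MinIO's actual hashing, shard sizing, and Reed-Solomon encoding:

```python
# Conceptual sketch of the write path, not MinIO's real code.
# Real MinIO uses its own hashing, Reed-Solomon library, and on-disk layout.
import hashlib

DATA_SHARDS, PARITY_SHARDS = 12, 4        # EC:4 on a 16-drive erasure set
ERASURE_SETS = 1                          # 16 drives -> one erasure set in this example

def erasure_set_for(bucket: str, key: str) -> int:
    """Deterministic: the same object always maps to the same erasure set."""
    digest = hashlib.sha256(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % ERASURE_SETS

def split_into_shards(data: bytes) -> list[bytes]:
    """Split the object into 12 equal data shards (padding the tail)."""
    shard_size = -(-len(data) // DATA_SHARDS)   # ceiling division
    padded = data.ljust(shard_size * DATA_SHARDS, b"\0")
    shards = [padded[i * shard_size:(i + 1) * shard_size] for i in range(DATA_SHARDS)]
    # Real MinIO computes 4 parity shards with Reed-Solomon; placeholder here.
    shards += [b"<parity>"] * PARITY_SHARDS
    return shards                               # 16 shards, one per drive

shards = split_into_shards(b"hello object storage" * 1000)
print(erasure_set_for("backups", "db/2024-01-01.dump"), len(shards))
```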
The read path is the reverse. Client sends a GET. The server computes the erasure set, reads the 12 data shards in parallel, and streams them back to the client. If some drives are down, it reads however many data shards are available plus enough parity shards to reconstruct. With EC:4, it can reconstruct from any 12 of 16 shards.
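The same numbers drive the degraded-read decision. A sketch of the availability check (the actual reconstruction is Reed-Solomon decoding, not shown):

```python
# Sketch of the read-availability check for one object (EC:4 on 16 drives).
def can_read(healthy_shards: int, data_shards: int = 12) -> bool:
    # Any 12 of the 16 shards (data or parity) are enough to rebuild the object.
    return healthy_shards >= data_shards

print(can_read(16))  # True: normal read, no reconstruction needed
print(can_read(12))  # True: 4 drives lost, reads still succeed via reconstruction
print(can_read(11))  # False: data for this object is unrecoverable
```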
Metadata lives on disk alongside data. There is no separate metadata database, no distributed KV store, no consensus protocol for metadata. The xl.meta file on each drive is the source of truth for objects on that drive. For a LIST operation, MinIO scans the filesystem to enumerate objects. This works fine for buckets with thousands or millions of objects. For buckets with hundreds of millions, it gets slow compared to systems that maintain a sorted index.
Server pools are how MinIO scales. Your initial 4-server deployment is Pool 1. When you need more capacity, you add Pool 2 (another set of servers). MinIO places new objects on the pool with the most free space. Old objects stay on Pool 1 unless you explicitly rebalance. Each pool has its own erasure sets. A drive failure in Pool 1 has no effect on Pool 2.
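A sketch of that placement rule; MinIO's real selection is a deterministic, free-space-weighted choice, and the pool names and sizes here are made up for illustration:

```python
# Sketch: new objects land on the pool with the most free space.
# Pool names and free-space figures are hypothetical.
pools = {
    "pool-1": {"free_bytes": 40 * 2**40},   # 40 TiB free (original pool)
    "pool-2": {"free_bytes": 180 * 2**40},  # 180 TiB free (newly added pool)
}

def pool_for_new_object() -> str:
    return max(pools, key=lambda p: pools[p]["free_bytes"])

print(pool_for_new_object())  # "pool-2" until free space evens out
```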
Production Architecture
A typical production MinIO deployment:
- 4-16 servers per pool, each with 4-16 NVMe drives
- Load balancer (nginx, HAProxy, or K8s Service) in front for a single endpoint
- TLS termination at the load balancer or at MinIO directly
- Active-active replication to a second site for disaster recovery
- Prometheus metrics exposed at /minio/v2/metrics/cluster
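A quick way to confirm metrics are being served, assuming the cluster allows unauthenticated scrapes (e.g. MINIO_PROMETHEUS_AUTH_TYPE=public) and substituting your own endpoint; otherwise a bearer token is required:

```python
# Sanity check that cluster metrics are being served.
# Assumes MINIO_PROMETHEUS_AUTH_TYPE=public; otherwise a bearer token is needed.
# The endpoint host is a placeholder.
import requests

resp = requests.get(
    "https://minio.internal.example:9000/minio/v2/metrics/cluster", timeout=5
)
resp.raise_for_status()
for line in resp.text.splitlines():
    if line.startswith("minio_cluster_capacity"):  # capacity-related gauges
        print(line)
```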
| Deployment Size | Servers | Drives | Usable Capacity (EC:4, 12TB drives) | Fault Tolerance |
|---|---|---|---|---|
| Small | 4 | 16 (4 per server) | ~144TB | 4 drives |
| Medium | 8 | 64 (8 per server) | ~576TB | 4 drives per erasure set |
| Large | 16 | 256 (16 per server) | ~2.3PB | 4 drives per erasure set |
| Multi-pool | 32+ | 500+ | 4.5PB+ | Independent per pool |
MinIO on Kubernetes uses the MinIO Operator, which manages MinIO tenants as custom resources. Each tenant gets its own StatefulSet, PersistentVolumes, and TLS certificates. The operator handles rolling upgrades and pool expansion.
Decision Criteria
| Criteria | MinIO | AWS S3 | Ceph (RGW) |
|---|---|---|---|
| S3 compatibility | Near-complete | Native | Near-complete |
| Deployment | Self-hosted, single binary | Managed | Self-hosted, complex |
| Operational effort | Low | Zero | High |
| Scale ceiling | ~10PB per cluster | Unlimited | Exabytes |
| Placement control | Per erasure set | Managed internally | CRUSH (rack/AZ aware) |
| Cost model | Hardware + ops | Per-GB + requests + egress | Hardware + ops (more ops) |
| Latency (local) | Sub-ms on NVMe | ~10-50ms (network) | Depends on config |
Pick MinIO when you need S3 on your own hardware and want something running in an afternoon. Pick Ceph when you need rack-aware placement at massive scale and have the ops team. Pick S3 when ops overhead matters more than cost or data locality.
Capacity Planning
Throughput: A single MinIO server with 8 NVMe drives handles ~10-20 GB/sec aggregate throughput (reads + writes). A 4-server pool: 40-80 GB/sec. MinIO is designed for throughput, not IOPS. Small random reads are slower.
Storage: Usable capacity = total raw capacity x (data shards / total shards). With EC:4 on 16 drives of 12TB each: 16 x 12TB x (12/16) = 144TB usable. Leave 20% headroom for healing and rebalancing.
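The same arithmetic as a throwaway helper; the drive count and size are the example's, not a recommendation:

```python
# Usable capacity for one erasure set: raw * (data shards / total shards).
def usable_tb(drives: int, drive_tb: float, parity: int) -> float:
    data = drives - parity
    return drives * drive_tb * data / drives   # simplifies to data * drive_tb

print(usable_tb(16, 12, 4))          # 144.0 TB usable
print(usable_tb(16, 12, 4) * 0.8)    # ~115 TB after 20% headroom
```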
Object count: Each object creates metadata files on disk. At 100 million objects per bucket, LIST operations slow down. For very large buckets, use prefixes to partition and speed up listing.
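One way to keep listings bounded is to iterate prefix by prefix; a sketch with boto3, where the endpoint, credentials, bucket, and prefixes are placeholders:

```python
# Listing a huge bucket one prefix at a time keeps each LIST bounded.
# Endpoint, credentials, bucket, and prefixes are placeholders.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://minio.internal.example:9000",
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

paginator = s3.get_paginator("list_objects_v2")
for prefix in ("logs/2024/01/", "logs/2024/02/", "logs/2024/03/"):
    count = sum(
        len(page.get("Contents", []))
        for page in paginator.paginate(Bucket="backups", Prefix=prefix)
    )
    print(prefix, count)
```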
Network: Each write fans out to all drives in the erasure set. A 1GB object write on a 16-drive set generates 1GB of network traffic across 4 servers. Plan for at least 25Gbps NIC per server.
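The arithmetic behind that 1GB figure, assuming the server receiving the PUT also hosts 4 of the 16 drives:

```python
# Network fan-out for a 1 GB PUT on a 16-drive erasure set (EC:4),
# assuming the receiving server hosts 4 of the 16 drives.
object_gb = 1.0
data, parity, total = 12, 4, 16

shard_total_gb = object_gb * total / data            # ~1.33 GB of shards written in total
local_fraction = 4 / total                           # shards kept on the receiving server
network_gb = shard_total_gb * (1 - local_fraction)   # ~1.0 GB sent to the other 3 servers
print(round(shard_total_gb, 2), round(network_gb, 2))
```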
Failure Scenarios
Scenario 1: Drive Failure in an Erasure Set
What happens: One of 16 drives in the erasure set dies. Objects on that drive lose one shard each. With EC:4, 15 of 16 shards remain. Well within tolerance.
User impact: None. Reads and writes continue normally. Reads for affected objects skip the dead drive and use the remaining 15 shards, reconstructing from parity when the missing shard was a data shard.
Recovery: Replace the drive. MinIO's healing process kicks in automatically (or trigger it manually with mc admin heal). It scans objects on the replacement drive's erasure set, identifies missing shards, reads the surviving shards, reconstructs the missing ones, and writes them to the new drive. Healing speed depends on disk throughput but typically takes hours for a full drive.
When it gets dangerous: Losing 5 drives in the same erasure set (with EC:4) means data loss for objects in that set. This is why monitoring drive health is critical. Replace failed drives before a second one goes.
Scenario 2: Server Goes Down (All Drives on That Server)
What happens: A server with 4 drives crashes. Each erasure set that had drives on this server loses 4 shards. With EC:4, losing exactly 4 means you're at the edge: any additional failure in the same set causes data loss.
User impact: Reads continue but are degraded. Every read now requires reconstruction (reading parity shards and doing the math), which adds latency. Writes to the affected erasure sets continue as long as write quorum is met (12 of 16 drives with EC:4), but objects written during the outage have no spare parity until healing catches up.
Recovery: Bring the server back or replace it. If the drives are intact, data recovers when the server rejoins. If drives are lost, healing reconstructs from parity. While a server is down, the cluster is in a vulnerable state. If you have multiple erasure sets and the dead server only contributed to some of them, unaffected sets work normally.
Scenario 3: Pool Expansion Gone Wrong
What happens: You add a second server pool to expand capacity. New objects go to Pool 2. But you misconfigure the pool (wrong drive count, wrong erasure coding setting). Objects written to Pool 2 have different parity than Pool 1.
User impact: Objects on Pool 2 may have weaker or stronger durability than intended. If Pool 2 has fewer drives than the parity setting requires, writes fail entirely.
Recovery: Catch it early with monitoring. MinIO logs and metrics show erasure set configuration per pool. If objects were written with wrong settings, there's no automatic fix. You'd need to copy them to a correctly configured pool and delete the originals. Test pool expansion in staging first.
Pros
- Near-complete S3 API compatibility. Most S3 SDKs and tools work out of the box.
- Single Go binary. No JVM, no dependencies, no complex installation. Deploy in minutes.
- Reed-Solomon erasure coding per object. Configurable data/parity ratio.
- High throughput on NVMe/SSD. Designed for modern hardware, not spinning disks.
- Kubernetes-native via the MinIO Operator. First-class Helm charts and CRDs.
- Built-in bucket replication, versioning, lifecycle management, and encryption.
Cons
- No topology-aware placement (no CRUSH equivalent). Shards distribute across a server pool, but not rack/AZ-aware by default.
- Scaling requires adding full server pools. Adding a single node to an existing pool is not supported.
- Metadata is co-located with data (no separate metadata tier). At very large scale, this limits flexibility.
- Rebalancing after expansion is manual and can be slow for large datasets.
- Community edition lacks some enterprise features (LDAP, AD integration, audit logging require paid tier).
- Not battle-tested at exabyte scale. Designed for petabytes, not hundreds of petabytes.
When to use
- Need S3-compatible storage on your own hardware or in any cloud
- Team wants simplicity over operational flexibility
- Data fits in petabyte range (up to ~10PB comfortably)
- Running on Kubernetes and want native integration
- AI/ML workloads that benefit from local high-throughput storage
When NOT to use
- Need exabyte-scale storage (Ceph or custom is more appropriate)
- Require fine-grained rack/AZ-aware placement control
- Want to scale incrementally by adding single nodes
- Need a managed service with zero ops (use S3 itself)
Key Points
- MinIO organizes storage into server pools and erasure sets. A server pool is a group of servers that were added together (e.g., 4 servers with 4 disks each). An erasure set is a subset of drives across the pool that shares erasure coding. A 16-drive pool with EC 12+4 has one erasure set of 16 drives. Objects are striped across the drives in their erasure set.
- Erasure coding is per-object, not per-volume. Each object is independently encoded with Reed-Solomon. You configure the parity level (e.g., EC:4 means 4 parity shards). With 16 drives and EC:4, you get 12 data + 4 parity, tolerating any 4 drive failures.
- Metadata is stored alongside data as small files on the same drives. Each object has an xl.meta file containing the object name, version, erasure coding parameters, part info, and checksums. There is no separate metadata database. This keeps things simple but means metadata operations (LIST, versioning) are bounded by filesystem performance.
- Scaling happens by adding new server pools, not by expanding existing ones. If you start with a 4-server pool and need more capacity, you add a second 4-server pool. MinIO treats them as separate erasure domains. New objects are placed on the pool with the most free space. Existing objects stay where they are.
- S3 API compatibility is the core selling point. Bucket notifications (webhooks, Kafka, and other targets), bucket replication (active-active), object locking (WORM), lifecycle rules, server-side encryption (SSE-S3, SSE-KMS, SSE-C), and presigned URLs all work. Most apps that talk to S3 can point at MinIO with zero code changes.
- Write path: a client sends an object to any MinIO server via the S3 API. That server computes the erasure set for the object (based on a hash of bucket + object name), splits the data into shards, encodes parity shards, and writes all shards to the drives in the erasure set in parallel. The write is acknowledged once write quorum is reached (12 of 16 drives with EC:4); in a healthy set all 16 shards are written and fsynced.
Common Mistakes
- ✗ Starting with a small pool and expecting to expand it later. You can't add drives to an existing server pool. Plan your initial pool size for at least 6-12 months of growth, then add a new pool.
- ✗ Running on HDDs and expecting S3-like latency. MinIO is optimized for NVMe/SSD. On spinning disks, small object reads are slow because each read does filesystem seeks for both the data and the xl.meta file.
- ✗ Not setting up bucket replication for disaster recovery. MinIO supports active-active replication between sites, but it's not enabled by default. Without it, a site failure loses all data at that site.
- ✗ Ignoring the erasure set size. With EC:4 on a 16-drive pool, losing 5 drives means data loss. Monitor drive health and replace failed drives promptly. MinIO's healing process reconstructs missing shards, but only if enough healthy drives remain.
- ✗ Using MinIO as a drop-in S3 replacement without testing LIST performance. MinIO's metadata is filesystem-based. Buckets with millions of objects and deep prefix hierarchies can have slower LIST than S3, which uses a purpose-built index.
- ✗ Deploying without a load balancer. Each MinIO server can handle any request, but clients need a single endpoint. Use nginx, HAProxy, or a Kubernetes Service to distribute traffic across the pool.