SeaweedFS
Fast, simple distributed storage that just works for small files
Why It Exists
Most object storage systems are designed for the general case: any object size, any access pattern, any scale. That generality comes with overhead. Ceph's CRUSH algorithm, PG management, and OSD tuning make sense at petabyte scale, but for a team storing 50 million product images at 200KB each, it's a lot of machinery.
SeaweedFS starts from a different question: what if the design optimizes specifically for high volumes of small files? The answer is a simple two-tier architecture. A lightweight master server handles volume assignment (which server should this file go to?). Volume servers store the data by appending files into large volume files, similar to Facebook's Haystack. The file ID encodes the volume and offset directly, so reads are a single lookup followed by a single disk seek.
The project started in 2011 as a Go implementation inspired by Facebook's Haystack paper. It stayed focused on that original insight: if you can encode location in the file ID, you eliminate most of the metadata overhead that slows down other systems. A SeaweedFS cluster with a few volume servers can handle millions of small file writes per day with minimal operational effort.
The trade-off is the central master. Every write assignment goes through it. At very high write rates or very large cluster sizes, the master becomes the bottleneck. This is why SeaweedFS stays in the TB-to-low-PB range, while Ceph (with its decentralized CRUSH placement) scales to exabytes.
How It Works Internally
Volume assignment (write path):
- Client asks the Master: "I want to write a file. Where should it go?"
- Master picks a volume with free space (round-robin or weighted) and returns the volume ID plus a volume server address.
- Client sends the file directly to that volume server.
- Volume server appends the file to the volume's .dat file, writes an entry to the .idx file, and returns a file ID like 3,01637037d6.
The file ID 3,01637037d6 encodes: volume ID = 3, file key = 01, file cookie = 637037d6. The cookie is a random 32-bit value that prevents unauthorized guessing of file IDs.
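Concretely, the write path is two HTTP calls. Below is a minimal Go sketch against the master's /dir/assign endpoint and a direct multipart POST to the volume server; the master address, the filename, and the trimmed error handling are assumptions for illustration:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"mime/multipart"
	"net/http"
)

// assignResult mirrors the JSON returned by the master's /dir/assign endpoint.
type assignResult struct {
	Fid       string `json:"fid"`
	URL       string `json:"url"`
	PublicURL string `json:"publicUrl"`
}

func writeFile(masterAddr, name string, data []byte) (string, error) {
	// Step 1: ask the master for a volume assignment.
	resp, err := http.Get("http://" + masterAddr + "/dir/assign")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	var a assignResult
	if err := json.NewDecoder(resp.Body).Decode(&a); err != nil {
		return "", err
	}

	// Step 2: POST the bytes directly to the assigned volume server,
	// addressed by the file ID the master just handed out.
	var buf bytes.Buffer
	w := multipart.NewWriter(&buf)
	part, err := w.CreateFormFile("file", name)
	if err != nil {
		return "", err
	}
	part.Write(data)
	w.Close()

	up, err := http.Post("http://"+a.URL+"/"+a.Fid, w.FormDataContentType(), &buf)
	if err != nil {
		return "", err
	}
	up.Body.Close()
	return a.Fid, nil
}

func main() {
	fid, err := writeFile("localhost:9333", "photo.jpg", []byte("hello"))
	if err != nil {
		panic(err)
	}
	fmt.Println("stored as file ID:", fid) // e.g. 3,01637037d6
}
```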
Read path:
- Client has the file ID 3,01637037d6, so it knows the volume ID is 3.
- Client asks the Master: "Which server has volume 3?" (This mapping is cached aggressively by the client.)
- Client reads directly from the volume server: seek to the offset in the .dat file, read the bytes.
For a small file (say 50KB), this is one network round-trip to the Master (cached after the first call) plus one round-trip to the volume server. Total latency: a few milliseconds on a local network.
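The read path is the mirror image. A sketch under the same assumed defaults; a production client would cache the lookup result rather than asking the master every time:

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// lookupResult mirrors the master's /dir/lookup response.
type lookupResult struct {
	Locations []struct {
		URL string `json:"url"`
	} `json:"locations"`
}

// readFile resolves which server hosts the file's volume, then reads the
// bytes directly from that server.
func readFile(masterAddr, fid string) ([]byte, error) {
	volumeID := strings.SplitN(fid, ",", 2)[0] // "3,01637037d6" -> "3"

	resp, err := http.Get("http://" + masterAddr + "/dir/lookup?volumeId=" + volumeID)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()

	var l lookupResult
	if err := json.NewDecoder(resp.Body).Decode(&l); err != nil {
		return nil, err
	}
	if len(l.Locations) == 0 {
		return nil, fmt.Errorf("volume %s has no known location", volumeID)
	}

	// Direct read: the volume server maps the file key to an offset via its
	// in-memory index, seeks, and streams the bytes back.
	data, err := http.Get("http://" + l.Locations[0].URL + "/" + fid)
	if err != nil {
		return nil, err
	}
	defer data.Body.Close()
	return io.ReadAll(data.Body)
}

func main() {
	b, err := readFile("localhost:9333", "3,01637037d6")
	if err != nil {
		panic(err)
	}
	fmt.Printf("read %d bytes\n", len(b))
}
```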
Volume structure:
Each volume is a pair of files:
- .dat file: sequential append of all objects. Each object has a small header (file key, cookie, size, checksum) followed by the raw bytes. Append-only, never modified in place.
- .idx file: an index mapping file keys to offsets in the .dat file. Loaded into an in-memory map for O(1) lookups.
This is conceptually identical to Facebook's Haystack and the extent files described in object storage designs. The advantage over one-file-per-object is massive: no inode overhead, sequential writes instead of random creates, and the index fits in RAM.
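To make the layout concrete, here is an illustrative sketch of the two record types. Field names and exact widths are assumptions for clarity, not SeaweedFS's byte-exact on-disk encoding:

```go
package main

// needleHeader precedes each object's raw bytes in the append-only .dat file.
type needleHeader struct {
	Cookie   uint32 // random value; prevents file ID guessing
	Key      uint64 // file key from the file ID
	Size     uint32 // length of the raw bytes that follow
	Checksum uint32 // integrity check over the data
}

// indexEntry is one record in the .idx file. The whole file is loaded into an
// in-memory map (key -> offset), so a lookup costs one hash probe plus one
// disk seek into the .dat file.
type indexEntry struct {
	Key    uint64 // file key
	Offset uint32 // position of the needle in the .dat file
	Size   uint32 // needle size (a tombstone value marks deletion)
}
```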
Deletion: Deleted files are marked with a tombstone in the .dat file. The space isn't immediately reclaimed. A background compaction process periodically rewrites the .dat file, skipping tombstoned entries. After compaction, dead space is reclaimed.
Replication: SeaweedFS supports synchronous replication at the volume level. When a volume is created with replication 001 (one extra copy on a different server), writes go to both servers before acknowledging. You can also configure rack-aware replication (010 = copy on a different rack) or datacenter-aware replication (100 = copy in a different datacenter).
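Placement is requested per assignment (or set as a master-wide default). A sketch using the replication query parameter on /dir/assign, assuming the default master address:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask the master for a write assignment with replication 010: one extra
	// copy on a different rack. The same parameter accepts 001 (different
	// server) or 100 (different datacenter). The master only returns volumes
	// that satisfy the placement, and the volume server fans the write out to
	// the replica before acknowledging.
	resp, err := http.Get("http://localhost:9333/dir/assign?replication=010")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body)) // JSON with fid + volume server URL, as above
}
```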
Erasure coding: Applied per-volume. A 30GB volume with 10+4 EC produces 14 shards of ~3GB each, spread across 14 volume servers. This saves storage vs replication (1.4x vs 2x-3x) but makes recovery coarser. Losing one volume server means reconstructing all the EC volume shards it held, not just individual files.
Production Architecture
A production SeaweedFS cluster:
- 3 Master servers with Raft consensus for high availability
- N Volume servers (scaled as needed for capacity and throughput)
- Optional: Filer for S3 API and directory tree (backed by PostgreSQL, MySQL, or etcd)
- Optional: S3 gateway (runs on top of the Filer)
| Deployment Size | Masters | Volume Servers | Volumes | Capacity |
|---|---|---|---|---|
| Small | 3 | 3-5 | ~100 | ~3TB |
| Medium | 3 | 10-30 | ~1,000 | ~30TB |
| Large | 3 | 50-100 | ~5,000 | ~150TB |
| Max practical | 3 | 200+ | ~10,000+ | ~300TB-1PB |
Beyond ~200 volume servers, the Master becomes the coordination bottleneck. Each write assignment is a round-trip to the Master. At 50K writes/sec, a single Master (even with Raft) starts showing latency. This is the practical scaling ceiling.
Filer backends for the S3 gateway:
| Backend | Durability | Performance | Use Case |
|---|---|---|---|
| LevelDB | Single node, no replication | Fast | Dev/test only |
| PostgreSQL | Replicated, ACID | Good | Production, small-medium |
| MySQL | Replicated | Good | Production, small-medium |
| etcd | Raft-replicated | Good for small directories | Production, limited scale |
| Redis | In-memory, optional persistence | Very fast | Caching layer, not primary |
Decision Criteria
| Criteria | SeaweedFS | MinIO | Ceph (RGW) |
|---|---|---|---|
| Sweet spot | Small files, TB-PB scale | General S3, PB scale | Any workload, PB-EB scale |
| Architecture | Central master + volume servers | Distributed, peer-to-peer | CRUSH, fully distributed |
| Small file perf | Excellent (optimized for this) | Good | OK (PG overhead per object) |
| S3 compatibility | Partial (via Filer) | Near-complete | Near-complete |
| Erasure coding | Per-volume | Per-object | Per-pool (configurable) |
| Operational effort | Low | Low | High |
| Scaling ceiling | ~1PB (master bottleneck) | ~10PB | Exabytes |
SeaweedFS fits when the workload is dominated by millions of small files and simplicity matters most. MinIO fits when full S3 compatibility is required. Ceph fits when multi-petabyte scale with topology-aware placement is needed.
Capacity Planning
Write throughput: A single volume server on SSD handles ~10K-30K small file writes/sec (depending on file size and replication). The Master handles ~50K assignment requests/sec before becoming a bottleneck. For workloads under 10K writes/sec, the master is never an issue.
Read throughput: Reads bypass the master after the first volume lookup (cached). A volume server with SSD serves ~50K-100K small file reads/sec. Reads are a single seek to the offset in the .dat file plus a sequential read of the file bytes.
Storage: Each volume is 30GB by default. With replication 001 (2 copies total), you need 2x raw capacity; with EC 10+4, 1.4x. A volume server with 4TB of SSD holds ~133 volume files (4000GB / 30GB), which works out to ~66 volumes of logical data with replication, or ~95 with EC.
Memory: The .idx file for each volume is loaded into memory. A 30GB volume with 1 million files: ~10MB for the index. A server hosting 100 volumes: ~1GB for indexes. This is modest. But a volume with 30 million tiny files: ~300MB for the index. Plan accordingly.
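This sizing arithmetic is mechanical enough to script. A rough calculator under the assumptions above (30GB volumes, ~10 bytes of index RAM per file, 4TB per server):

```go
package main

import "fmt"

func main() {
	const (
		volumeSizeGB   = 30.0   // default volume size
		serverRawGB    = 4000.0 // 4TB of SSD per volume server
		idxBytesPerObj = 10.0   // rough in-memory index cost per file (from the text)
	)

	// Raw volume files a server can hold.
	fmt.Printf("volume files per server: ~%.0f\n", serverRawGB/volumeSizeGB) // ~133

	// Logical data per server after durability overhead.
	fmt.Printf("logical volumes, replication 001 (2x): ~%.0f\n", serverRawGB/2/volumeSizeGB)   // ~66
	fmt.Printf("logical volumes, EC 10+4 (1.4x):       ~%.0f\n", serverRawGB/1.4/volumeSizeGB) // ~95

	// Index RAM for a volume packed with tiny files.
	const tinyFiles = 30e6 // 30 million 1KB files in one 30GB volume
	fmt.Printf("index RAM for that volume: ~%.0f MB\n", tinyFiles*idxBytesPerObj/1e6) // ~300 MB
}
```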
Failure Scenarios
Scenario 1: Volume Server Failure
What happens: A volume server hosting 50 volumes goes down. All files on those volumes become temporarily unavailable on that server.
Impact with replication: If each volume has a replica on another server, reads automatically fall back to the replica. Writes to those volumes pause until the master reassigns the primary. User impact: minimal if replication is configured. A few seconds of write latency while the master updates assignments.
Impact without replication or EC: Files on those volumes are unavailable until the server comes back. If the disk is permanently lost, those files are gone.
Recovery: If the server restarts with data intact, it rejoins and the master re-registers its volumes. If the disk is lost, the master removes those volumes from its registry. Replicated volumes have their surviving copies promoted. EC volumes reconstruct missing shards from the surviving 10+ shards (for 10+4).
Scenario 2: Master Server Failure
What happens: The Raft leader among the 3 master servers crashes.
Impact: Raft elects a new leader in a few seconds. During the election, no new write assignments are issued. Clients that already have volume-to-server mappings cached can still read (and write to already-assigned volumes). New clients that need a volume assignment will see errors until the election completes.
If all masters are down: No new writes can start. Reads continue if clients have cached the volume-to-server mapping. But the cluster cannot recover from volume server failures because the master can't update assignments. This is the central coordination risk.
Recovery: Bring at least 2 of 3 masters back to restore quorum. Masters store very little data (just the volume-to-server mapping), so recovery is fast.
Scenario 3: Volume Compaction Pressure
What happens: A high-delete workload (e.g., image cache with TTL expiry) marks millions of files as deleted. The .dat files fill up with tombstoned entries. Actual disk usage stays high even though logical usage is low.
Impact: Disk space runs out despite having "plenty" of logical free space. Write assignments for new volumes fail because there's no physical disk space left.
Detection: Monitor the ratio of logical data to physical data per volume. If a volume's .dat file is 30GB but only 5GB is live data, it needs compaction.
Recovery: Trigger compaction (happens automatically on a schedule, or manually via the API). Compaction rewrites the .dat file, skipping tombstoned entries. The new file is smaller. Old file is deleted. During compaction, the volume is briefly read-only. For high-delete workloads, schedule compaction more aggressively or use smaller volumes so each compaction is faster.
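For the manual path, the master exposes a vacuum endpoint. A sketch assuming the default master address and the garbageThreshold parameter (compact any volume that is more than 30% dead space):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask the master to vacuum (compact) every volume whose tombstoned space
	// exceeds 30% of its .dat file. Each affected volume is rewritten without
	// dead entries and is briefly read-only while the swap happens.
	resp, err := http.Get("http://localhost:9333/vol/vacuum?garbageThreshold=0.3")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```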
Pros
- • Extremely fast for small files. Optimized for the use case where most objects are under 10MB.
- • Simple architecture. Master + Volume servers. Easy to understand and deploy.
- • Low metadata overhead per object. File ID encodes volume and offset directly.
- • Built-in Reed-Solomon erasure coding for volume-level durability.
- • Filer component provides directory semantics and S3 API on top of the blob store.
- • Written in Go. Single binary per component. Minimal dependencies.
Cons
- • Central master server is a coordination bottleneck. All volume assignments go through it.
- • Even with Raft replication, the master tier remains on the critical path: every write assignment flows through the current leader.
- • S3 compatibility is partial. Some advanced S3 features (object lock, complex lifecycle rules) are missing or incomplete.
- • Not designed for exabyte scale. Works well up to low petabytes.
- • Erasure coding is per-volume (groups of objects), not per-object. A volume failure recovery reconstructs the entire volume.
- • Smaller community and ecosystem compared to MinIO or Ceph. Fewer production references.
When to use
- • Primary workload is lots of small files (images, thumbnails, logs)
- • Want something simpler than Ceph with less operational overhead
- • Data fits in terabytes to low petabytes
- • Team is small and needs a storage system they can understand end to end
When NOT to use
- • Need exabyte scale or fine-grained, CRUSH-style placement policies (use Ceph)
- • Require full S3 API compatibility (use MinIO)
- • Workload is dominated by large objects (>1GB). SeaweedFS's advantage is small files.
- • Cannot tolerate a central master in the critical path
- • Need enterprise support and a large community
Key Points
- • SeaweedFS has two core components: the Master and Volume Servers. The Master tracks which volumes exist, which servers host them, and which volumes have free space. Volume Servers store the actual data. The Master is the brain; Volume Servers are the muscle.
- • Data is stored in volumes. A volume is a large file (default 30GB) that contains many objects packed together. Think of it like a Haystack-style extent file. Each volume has a .dat file (the actual data, appended sequentially) and a .idx file (an index mapping file IDs to offsets within the .dat file).
- • File IDs encode location directly. On upload, SeaweedFS returns a file ID like '3,01637037d6'. The '3' is the volume ID, and the rest is the file key plus cookie. To read, the client asks the Master which server hosts volume 3, then reads directly from that server, which maps the key to an offset via its in-memory index. One lookup, one read. Very fast.
- • The Master uses Raft for high availability (3 masters recommended). But the master is still in the critical path for writes: before writing, the client asks the master which volume to write to. This is the central coordination trade-off. It keeps the design simple but puts a ceiling on write throughput.
- • Erasure coding in SeaweedFS is per-volume, not per-object. A volume with 10+4 EC is split into 14 EC shards, distributed across 14 volume servers. If a server goes down, the entire volume shard is reconstructed. This is different from MinIO (per-object EC) and means recovery is coarser grained.
- • The Filer is an optional component that adds directory semantics and S3 compatibility on top of the raw blob store. It maintains a directory tree (stored in a backend like LevelDB, MySQL, PostgreSQL, or etcd) and maps paths to file IDs. The S3 gateway runs on top of the Filer.
Common Mistakes
- ✗ Running a single master without Raft replication. If the master dies, no new writes can be assigned (reads still work since clients cache volume-to-server mappings). Always run 3 masters with Raft.
- ✗ Not pre-allocating volumes. If volumes fill up before new ones are created, there's a pause while the master creates and assigns a new volume. Pre-allocate with -volumePreallocate.
- ✗ Ignoring volume size for the workload. The default 30GB volume works for general use, but if your files are tiny (1KB each), a 30GB volume holds 30 million objects, and its index needs ~300MB of RAM. Smaller volumes reduce index memory pressure.
- ✗ Expecting full S3 compatibility from the Filer. Multipart upload works, but features like object lock, cross-region replication, and complex lifecycle policies are limited or missing.
- ✗ Not configuring the Filer backend for durability. The default LevelDB backend is fine for dev but not for production. Use PostgreSQL, MySQL, or etcd for a durable, replicated directory tree.
- ✗ Using SeaweedFS for large files without considering that volume-level EC means recovering a 30GB volume shard takes much longer than recovering a single object's shards.