Ceph
The storage system behind CERN's physics data and DigitalOcean's block storage
Why It Exists
Ceph was born in 2004 as Sage Weil's PhD research at UC Santa Cruz. The question was simple: how to build a storage system that scales to petabytes without a single point of failure, without central metadata bottlenecks, and without manual data placement? The answer was CRUSH, an algorithm that computes where data lives instead of looking it up in a table.
Two decades later, Ceph runs at CERN (storing hundreds of petabytes of particle physics data), DigitalOcean (block storage for their cloud), Bloomberg (financial data), and dozens of telecom providers. It's the default storage backend for OpenStack private clouds. Red Hat packages it commercially.
What makes Ceph different from MinIO or S3 is that it's not just object storage. The same cluster serves object storage (via RADOS Gateway), block devices (via RBD for VMs), and a POSIX filesystem (via CephFS). One cluster, one ops team, three storage types. That's the value proposition. The trade-off is complexity. Ceph is not an afternoon setup.
How It Works Internally
Everything is built on RADOS, a flat object store. When the S3 gateway receives a PUT for photos/beach.jpg, it translates that into one or more RADOS object writes (splitting the data into 4MB chunks if it's large). When a VM writes a 4KB block to its virtual disk, the RBD driver translates that into a RADOS object write. Everything goes through RADOS.
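The layering is visible from the command line: the rados tool talks to RADOS directly, below RGW, RBD, and CephFS. A minimal sketch, assuming a scratch pool named demo (pool, object, and PG numbers here are illustrative, not recommendations):

```bash
# Create a small pool and store a file as a raw RADOS object
ceph osd pool create demo 128
rados -p demo put beach.jpg ./beach.jpg   # object name, then local file
rados -p demo ls                          # list objects in the pool
rados -p demo stat beach.jpg              # size and mtime of the stored object
```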
Placement Groups (PGs) are the indirection layer. When RADOS receives an object write, it hashes the object name to get a PG number: pg_id = hash(object_name) % pg_count. The PG is the unit of replication and recovery. Each PG maps to a set of OSDs via CRUSH.
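You can ask the cluster to show this mapping for any object name: ceph osd map prints the PG an object hashes to and the OSDs that PG currently maps to (the pool and object below are the hypothetical ones from the sketch above; the output line is paraphrased):

```bash
# Which PG does this object land in, and which OSDs hold that PG right now?
ceph osd map demo beach.jpg
# -> pool 'demo' object 'beach.jpg' -> pg 1.xxxxxxxx -> up ([4,17,23], p4) acting ([4,17,23], p4)
```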
CRUSH takes a PG ID and the cluster topology (the CRUSH map) and deterministically computes which OSDs store that PG's data. The topology is a tree: root at the top, then datacenters, racks, hosts, and OSDs at the leaves. CRUSH rules say things like "place 3 replicas on 3 different hosts" or "place erasure-coded shards across 3 different racks." Any client or OSD can run CRUSH locally. There is no placement lookup service.
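A sketch of what that looks like operationally, assuming a hypothetical rule name replicated_rack and the usual default root bucket:

```bash
# Inspect the hierarchy CRUSH will walk (root -> rack -> host -> osd)
ceph osd crush tree

# Create a replicated rule that places each replica in a different rack
ceph osd crush rule create-replicated replicated_rack default rack

# Point an existing pool at the rule; CRUSH recomputes placement accordingly
ceph osd pool set demo crush_rule replicated_rack
```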
The write path for a replicated pool (3x): the client computes the PG and primary OSD via CRUSH. It sends the write to the primary. The primary forwards to the two secondary OSDs. Once all three have written to their local BlueStore and acknowledged, the primary acknowledges the client. The write is durable.
For an erasure-coded pool (e.g., 8+3): the primary OSD receives the full object, splits it into 8 data chunks, computes 3 parity chunks, and sends each chunk to the appropriate OSD. Acknowledgment after all 11 chunks are written. Reads only need the 8 data chunks unless some are missing.
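Erasure coding is configured per pool through a profile. A sketch of an 8+3 profile with rack-level failure domains (profile and pool names are hypothetical):

```bash
# Define an 8 data + 3 parity profile whose chunks spread across racks
ceph osd erasure-code-profile set ec-8-3 k=8 m=3 crush-failure-domain=rack

# Create an erasure-coded pool from the profile (PG numbers illustrative)
ceph osd pool create ecpool 1024 1024 erasure ec-8-3

# RBD and CephFS on EC pools need overwrite support; RGW does not
ceph osd pool set ecpool allow_ec_overwrites true
```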
BlueStore is the on-OSD storage engine. It writes directly to the raw block device, bypassing the Linux filesystem entirely. Internally, it uses RocksDB for metadata (object names, extent maps) and stores data in allocated extents on the raw device. This eliminates the double-write problem: a journaling filesystem would write data to the journal first, then to the final location. BlueStore writes data once and updates the RocksDB metadata atomically.
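Provisioning reflects this design: an OSD is created directly on a block device, optionally with its RocksDB on a faster device. A sketch using ceph-volume, with hypothetical device paths:

```bash
# BlueStore OSD on a raw HDD, with RocksDB metadata (block.db) on an NVMe partition
ceph-volume lvm create --data /dev/sdb --block.db /dev/nvme0n1p1
```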
Monitors are the cluster brain. Three or five monitors run a Paxos consensus cluster that maintains the master copy of the cluster map (which OSDs exist, which are up/down, the CRUSH map, PG placement). When an OSD joins, fails, or gets decommissioned, monitors update the map. Every OSD and client caches the map and recomputes placement locally.
Production Architecture
A production Ceph cluster:
- 3-5 Monitor nodes (dedicated, local SSD for Paxos). Never co-locate with OSDs.
- N OSD nodes, each running one OSD process per physical disk. A node with 12 HDDs runs 12 OSD daemons.
- RADOS Gateway (RGW) nodes for S3/Swift API, behind a load balancer.
- Optional: MDS nodes for CephFS (required only if using the filesystem interface).
- Manager (MGR) for the dashboard, Prometheus module, and balancer plugin. Runs on monitor nodes.
| Cluster Size | OSD Nodes | Disks | Capacity (3x replication) | Capacity (EC 8+3) |
|---|---|---|---|---|
| Small | 3-5 | 36-60 | ~80-130TB | ~130-210TB |
| Medium | 10-30 | 120-360 | ~400TB-1.2PB | ~650TB-2PB |
| Large | 50-200 | 600-2400 | ~4-16PB | ~6-26PB |
| CERN-scale | 1000+ | 10,000+ | 100PB+ | 160PB+ |
PG count tuning: Target 100-200 PGs per OSD. A 100-OSD cluster with 3x replication: 100 OSDs * 100 PGs / 3 replicas = ~3,333, round up to 4,096 (power of 2). Getting this wrong is one of the most common causes of uneven data distribution.
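The same back-of-envelope calculation, written out (the numbers are the example's; recent releases can also manage this automatically via the PG autoscaler):

```bash
# Rough PG sizing: OSDs * target PGs per OSD / replica count, rounded up to a power of two
OSDS=100; TARGET_PER_OSD=100; REPLICAS=3
RAW=$(( OSDS * TARGET_PER_OSD / REPLICAS ))                        # ~3333
PG_NUM=1; while [ "$PG_NUM" -lt "$RAW" ]; do PG_NUM=$(( PG_NUM * 2 )); done
echo "pg_num: $PG_NUM"                                             # 4096

# Or let Ceph adjust pg_num itself (pool name hypothetical)
ceph osd pool set demo pg_autoscale_mode on
```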
Decision Criteria
| Criteria | Ceph | MinIO | AWS S3 |
|---|---|---|---|
| Scale ceiling | Exabytes | Petabytes | Unlimited |
| Storage types | Object + Block + File | Object only | Object only |
| Placement control | CRUSH (rack/AZ/DC aware) | Per erasure set | Managed |
| Self-healing | Automatic, configurable | Automatic (simpler) | Managed |
| Operational effort | High (dedicated team) | Low | Zero |
| Erasure coding | Multiple profiles (RS, LRC, SHEC) | RS only | Managed |
| Kubernetes | Rook-Ceph operator | MinIO Operator | N/A (managed) |
Ceph fits when multi-petabyte storage with topology-aware placement is needed and the ops team is available. MinIO fits when S3 compatibility matters more than placement control. S3 itself fits when running storage infrastructure isn't an option.
Capacity Planning
Throughput: A single OSD on a 7200 RPM HDD does ~100-150 MB/sec sequential, ~100 IOPS random. On NVMe: ~500MB-3GB/sec sequential, ~100K IOPS. Aggregate cluster throughput scales linearly with OSD count. A 100-OSD HDD cluster: ~10-15 GB/sec aggregate.
Recovery bandwidth: When an OSD dies, recovery pulls data from surviving replicas. With 200 PGs per OSD and 100MB per PG shard, that's 20GB to rebuild. At a rate-limited 100MB/sec recovery: ~3 minutes. At 1TB per PG shard (large data): closer to 3 hours per PG at the same rate. Budget recovery bandwidth: 10-20% of cluster throughput.
Memory: Each OSD needs ~3-5GB RAM for BlueStore cache and RocksDB. A node with 12 OSDs needs 36-60GB RAM. Monitors need 4-8GB.
Network: Ceph uses two networks: a public network (client to OSD) and a cluster network (OSD to OSD for replication and recovery). Budget 10Gbps minimum for each. At large scale, 25Gbps per network per node.
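The same planning arithmetic, written out (all figures are the rough numbers above, not measurements):

```bash
# Aggregate sequential throughput for a 100-OSD HDD cluster
OSDS=100; MB_PER_OSD=125                                  # midpoint of 100-150 MB/sec
echo "aggregate ~$(( OSDS * MB_PER_OSD / 1000 )) GB/sec"  # ~12 GB/sec

# Time to re-replicate one failed OSD's data at a throttled recovery rate
PGS_ON_OSD=200; MB_PER_PG=100; RECOVERY_MBS=100
echo "rebuild ~$(( PGS_ON_OSD * MB_PER_PG / RECOVERY_MBS / 60 )) minutes"  # ~3 minutes for 20GB

# RAM for a 12-OSD node at 3-5GB per OSD
echo "node RAM: $(( 12 * 3 ))-$(( 12 * 5 )) GB"           # 36-60 GB
```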
Failure Scenarios
Scenario 1: OSD Failure and Recovery Storm
What happens: An OSD hosting 200 PGs goes down. The monitors detect it (heartbeat timeout, ~20 seconds). They mark the OSD as down in the cluster map. All 200 PGs are now degraded (one fewer replica than configured).
Impact: Client reads and writes continue because the surviving replicas serve the data. But performance degrades because recovery starts: for each of the 200 PGs, surviving OSDs read the PG's data and write a new copy to another OSD. This recovery I/O competes with client I/O.
The storm: Without rate limits, 200 PGs rebuilding simultaneously generate enormous disk and network load. Client latency spikes from 5ms to 500ms+. Other OSDs that were healthy start timing out under I/O pressure, and monitors mark them as down too. Now you have a cascading failure where the "recovery" is making things worse.
Recovery: Set osd_recovery_max_active (default 3, lower to 1 for busy clusters) and osd_recovery_sleep to throttle recovery. Prioritize PGs with the fewest remaining replicas. Monitor ceph health detail and watch for slow_ops warnings. Once recovery finishes (minutes to hours depending on data volume), the cluster returns to full health.
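A sketch of those throttles and checks (the values are conservative starting points, not universal recommendations):

```bash
# Slow recovery down so it stops competing with client I/O
ceph config set osd osd_recovery_max_active 1    # concurrent recovery ops per OSD
ceph config set osd osd_recovery_sleep 0.1       # pause (seconds) between recovery ops
ceph config set osd osd_max_backfills 1          # concurrent backfills per OSD

# Watch progress and slow-request warnings
ceph health detail
ceph -s    # degraded/misplaced object counts and recovery rate
```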
Scenario 2: CRUSH Map Misconfiguration
What happens: An operator adds new OSDs but doesn't update the CRUSH rules. The new OSDs are added under the default host bucket but not assigned to the correct rack or AZ in the CRUSH hierarchy. CRUSH now places replicas on nodes it thinks are in separate failure domains but are actually in the same rack.
Impact: Data is less protected than it appears. The cluster reports "3 replicas in 3 different hosts" but two of those hosts are in the same rack. A rack power failure now loses two replicas instead of one.
Detection: Run ceph osd tree and verify the hierarchy matches your physical layout. Use crushtool --test to simulate failures and verify that placement rules actually distribute data across the intended failure domains. This should be part of your deployment checklist.
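A sketch of that check (file names are hypothetical; the rule ID and replica count must match your pools):

```bash
# Dump the live CRUSH map and decompile it for review
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Simulate placement: does rule 0 with 3 replicas actually spread across the intended failure domains?
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-mappings | head
crushtool -i crushmap.bin --test --rule 0 --num-rep 3 --show-utilization
```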
Recovery: Update the CRUSH map with correct bucket assignments. Ceph will automatically rebalance data to match the new rules. This triggers data movement, so do it during a maintenance window.
Scenario 3: Monitor Quorum Loss
What happens: Two of three monitors go down (same rack, same power circuit). The remaining monitor cannot form a Paxos quorum.
Impact: The cluster map cannot be updated. No new OSDs can join, no failed OSDs can be marked down, no PG recovery can start. Existing reads and writes continue if clients have a cached cluster map and the OSDs they need are still up. But any change to the cluster state is frozen. New clients that don't have the map cached can't connect at all.
Recovery: Bring at least one more monitor back online to restore quorum. If the monitors are permanently lost, you can bootstrap a new monitor from a surviving one's data store. Monitors are lightweight. Their data (the cluster map) is small. The main risk is availability, not data loss. This is why monitors should be on separate power circuits, separate racks, and ideally separate AZs.
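Checking quorum, and rebuilding the monmap from a survivor in the worst case, looks roughly like this. Monitor IDs and paths are hypothetical; treat it as a sketch of the documented procedure, not a copy-paste runbook:

```bash
# Quorum state (the admin socket works even when the mon has no quorum)
ceph daemon mon.mon1 mon_status

# If the other monitors are permanently lost: remove them from the surviving
# monitor's map so it can form a quorum of one, then restart it
systemctl stop ceph-mon@mon1
ceph-mon -i mon1 --extract-monmap /tmp/monmap
monmaptool /tmp/monmap --rm mon2 --rm mon3
ceph-mon -i mon1 --inject-monmap /tmp/monmap
systemctl start ceph-mon@mon1
```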
Pros
- • Proven at exabyte scale. CERN stores hundreds of petabytes of physics data on it.
- • CRUSH algorithm places data across failure domains (racks, AZs) without a central lookup
- • Three storage interfaces from one cluster: object (RGW), block (RBD), file (CephFS)
- • Configurable erasure coding profiles (RS, LRC, SHEC) per storage pool
- • Self-healing. Detects failed OSDs and automatically rebalances and reconstructs data.
- • No single point of failure. Monitors, OSDs, and metadata servers are all distributed.
Cons
- • Operational complexity is high. Requires a dedicated ops team that understands PG states, CRUSH maps, OSD tuning, and recovery dynamics.
- • Performance tuning is hard. Dozens of settings interact (PG count, bluestore cache, recovery limits, scrub schedules). Wrong defaults cause silent degradation.
- • Recovery storms can saturate the network. When an OSD dies, all PGs on that OSD start rebuilding simultaneously unless you rate-limit.
- • BlueStore (the storage backend) has known edge cases with fragmentation on long-running clusters.
- • Upgrading across major versions requires careful planning. Rolling upgrades are supported but risky if you skip versions.
- • Not designed for small deployments. Minimum viable cluster is 3 nodes, but you really want 5+ for production.
When to use
- • Need petabyte to exabyte scale on your own hardware
- • Require rack-aware and AZ-aware data placement
- • Want object, block, and file storage from one platform
- • Have an ops team with distributed systems experience
- • Building private or hybrid cloud infrastructure
When NOT to use
- • Small team without Ceph operational experience
- • Storage under 100TB (MinIO is simpler)
- • Need a managed service with zero ops
- • Performance-sensitive workloads where tuning time is limited
- • Greenfield project where S3 is an option and data sovereignty isn't a concern
Key Points
- •Everything in Ceph is built on RADOS (Reliable Autonomic Distributed Object Store). RADOS is a flat object store: you give it an object name and pool, and it stores the bytes. The S3 gateway (RGW), block device (RBD), and filesystem (CephFS) are all thin layers that translate their semantics into RADOS object operations.
- •CRUSH (Controlled Replication Under Scalable Hashing) is the placement algorithm. Given an object name and a pool, CRUSH computes which OSDs store the object's replicas or erasure-coded shards. The computation is deterministic and topology-aware: you define a hierarchy (root > datacenter > rack > host > OSD), and CRUSH ensures replicas land in different failure domains. No central placement lookup needed.
- •A Placement Group (PG) is the unit of data management. Objects hash to PGs, and PGs map to OSDs via CRUSH. A typical cluster has 100-200 PGs per OSD. PGs are what get replicated, recovered, and rebalanced. When an OSD dies, Ceph doesn't reason about individual objects. It reasons about which PGs lost a replica and need rebuilding.
- •Each OSD (Object Storage Daemon) manages one physical disk. It stores data using BlueStore, which writes directly to the raw block device (no filesystem). BlueStore uses RocksDB internally for its metadata (object names, extent maps, checksums). This avoids the double-write problem of journaling filesystems.
- •Monitors form a Paxos cluster (typically 3 or 5) that maintains the cluster map: which OSDs are up, which are down, the CRUSH map, and the PG placement. Every client and OSD has a copy of this map. When something changes (OSD joins or fails), monitors update the map and distribute it.
- •Self-healing: when an OSD goes down, the monitors detect it (heartbeat timeout, default 20 seconds). They update the cluster map. All OSDs that shared PGs with the dead OSD start recovering: they read surviving replicas/shards and write new copies to other OSDs. This happens automatically, no human intervention needed. But it generates a lot of I/O, which is why recovery rate limits exist.
Common Mistakes
- ✗Setting PG count wrong. Too few PGs means uneven data distribution (some OSDs overloaded, others idle). Too many means high memory overhead per OSD. Rule of thumb: target 100-200 PGs per OSD, then round up to a power of 2.
- ✗Not rate-limiting recovery. When an OSD with 200 PGs dies, all 200 PGs start rebuilding at once. Without osd_recovery_max_active and osd_recovery_sleep settings, recovery I/O saturates the network and kills client performance.
- ✗Ignoring CRUSH rule design. Default CRUSH rules replicate across hosts but not across racks. If your two replicas land on the same rack and that rack loses power, you lose both copies. Define CRUSH rules that enforce rack-level or AZ-level separation.
- ✗Using the wrong erasure coding profile for the workload. LRC (Local Reconstruction Codes) reduce recovery network traffic at the cost of extra storage. SHEC (Shingled Erasure Coding) also trades extra storage for cheaper recovery, reading less surviving data to rebuild a lost chunk. The default RS profile is fine for most cases, but don't use EC for pools with lots of small random writes (the read-modify-write overhead is painful).
- ✗Running monitors on the same hosts as OSDs. Monitors need stable, low-latency storage for Paxos. A noisy OSD compacting its RocksDB can starve the monitor process and cause election storms.
- ✗Skipping regular scrubbing. Ceph's scrubber compares replicas to detect bit rot and silent corruption. Disabling it to 'save IOPS' means corruption goes undetected until a recovery event reads the bad data. Run shallow scrubs daily and deep scrubs weekly.
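Scrub cadence is controlled by a few OSD options. A sketch of the daily/weekly cadence above (values are in seconds; defaults vary by release):

```bash
# Shallow scrub roughly daily, deep scrub roughly weekly
ceph config set osd osd_scrub_min_interval 86400      # 1 day
ceph config set osd osd_scrub_max_interval 604800     # force a shallow scrub within a week
ceph config set osd osd_deep_scrub_interval 604800    # deep scrub weekly

# Keep scrubbing out of peak hours if client latency is sensitive
ceph config set osd osd_scrub_begin_hour 22
ceph config set osd osd_scrub_end_hour 6
```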