MongoDB
The document database you'll probably use at least once in your career
MongoDB is everywhere. Anyone who has worked on a modern web stack has either used it or had to justify the alternative choice. The flexible document model and the low friction of getting started are genuinely useful, especially early in a project when the schema is still shifting. Multi-document ACID transactions, added in 4.0 and extended to sharded clusters in 4.2, closed the biggest gap in the document model story, making Mongo a realistic option for transactional workloads that used to require a relational database. That said, do not confuse "viable" with "ideal." Postgres still wins for heavily relational data, and that is fine.
How It Works Internally
MongoDB stores data as BSON (Binary JSON) documents. BSON is a binary-encoded format that extends JSON with extra types: dates, binary data, ObjectId, Decimal128, regular expressions, and others. Each document can be up to 16MB and carries a unique _id field (typically an ObjectId encoding a 4-byte timestamp, a 5-byte random value, and a 3-byte counter; older versions used a machine identifier and process ID in place of the random value).
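Because the timestamp sits in the first four bytes, an ObjectId's creation time can be recovered without any driver library. A minimal pure-Python sketch (the sample ObjectId below is made up for illustration):

```python
from datetime import datetime, timezone

def objectid_timestamp(oid_hex: str) -> datetime:
    """Decode the creation time embedded in a 24-char ObjectId hex string.

    The first 4 bytes (8 hex chars) are a big-endian Unix timestamp in
    seconds; the remaining 16 hex chars hold the random value and counter.
    """
    if len(oid_hex) != 24:
        raise ValueError("ObjectId must be 24 hex characters")
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

# A hypothetical ObjectId whose leading bytes decode to September 2023.
print(objectid_timestamp("650000000000000000000000"))
```

This is why sorting on _id roughly sorts by insertion time, and also why ObjectId makes a poor shard key: it is monotonically increasing.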
Under the hood, the WiredTiger storage engine (default since 3.2) manages data through in-memory pages and on-disk checkpoints. It provides document-level concurrency control via optimistic locking, so multiple threads can write to different documents in the same collection at the same time. WiredTiger compresses with snappy by default (zstd and zlib are also available) and typically hits 50-70% compression ratios on real workloads. A write-ahead journal records every write before it touches the in-memory data files, with journal commits every 100ms by default (storage.journal.commitIntervalMs). This is what keeps data intact if the process crashes.
Replication works through the oplog (operations log), a capped collection on the primary that records every write as an idempotent entry. Secondaries continuously tail this oplog and replay operations locally. When a primary goes down, the remaining members hold an election using a Raft-like protocol (replication protocol version 1, introduced in 3.2) and promote an electable secondary with the most current oplog position. Elections typically finish within 12 seconds. During that window, writes to the replica set are unavailable. Plan for it.
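The idempotency point matters: a secondary that restarts mid-replay may apply the same entry twice, so entries record resulting values rather than relative changes. A toy model (not the real oplog schema) showing why replaying twice is harmless:

```python
# Toy model of idempotent oplog replay (hypothetical entry format, not
# the real oplog schema). Each entry records the *resulting* field
# values ($set semantics), so applying it once or many times yields the
# same document state.

def apply_entry(doc: dict, entry: dict) -> dict:
    if entry["op"] == "set":
        doc = {**doc, **entry["fields"]}
    elif entry["op"] == "unset":
        doc = {k: v for k, v in doc.items() if k not in entry["fields"]}
    return doc

oplog = [
    {"op": "set", "fields": {"qty": 5}},          # recorded as qty=5, not "inc by 1"
    {"op": "set", "fields": {"status": "paid"}},
]

once = {}
for e in oplog:
    once = apply_entry(once, e)

twice = once
for e in oplog:  # simulate a secondary replaying the same batch after a restart
    twice = apply_entry(twice, e)

assert once == twice == {"qty": 5, "status": "paid"}
```

This is also why an $inc on the primary is rewritten as a $set in the oplog before secondaries see it.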
Production Architecture
Start with a 3-member replica set per shard: one primary, two secondaries, spread across three availability zones. For sharded clusters, put at least two mongos routers behind a load balancer, plus a dedicated 3-member config server replica set that holds shard metadata and chunk distribution maps.
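A sketch of what the topology above looks like as the config document you would pass to rs.initiate() in mongosh, expressed here as a Python dict; hostnames and availability-zone tags are placeholders:

```python
# Hypothetical 3-member replica set config: one member per availability
# zone, so losing a whole zone still leaves a voting majority of 2.
rs_config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "mongo-a.internal:27017", "tags": {"az": "us-east-1a"}},
        {"_id": 1, "host": "mongo-b.internal:27017", "tags": {"az": "us-east-1b"}},
        {"_id": 2, "host": "mongo-c.internal:27017", "tags": {"az": "us-east-1c"}},
    ],
}

zones = {m["tags"]["az"] for m in rs_config["members"]}
assert len(rs_config["members"]) == 3 and len(zones) == 3
```

Three members in two zones is a common mistake: lose the zone holding two members and the survivor cannot form a majority, so the set becomes read-only.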
Now, shard key selection. This is the single decision that causes the most pain if it goes wrong. A good shard key has high cardinality, low frequency, and is non-monotonic. Hashed shard keys distribute writes evenly but kill range queries. Compound shard keys (e.g., {tenant_id: 1, created_at: 1}) offer targeted queries for a single tenant while still spreading data across shards. Before MongoDB 5.0, a shard key was effectively immutable (4.4's refineCollectionShardKey can only append suffix fields). 5.0 introduced reshardCollection, but resharding a large cluster is still expensive and disruptive. Pick carefully the first time.
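"High cardinality, low frequency" can be checked against a sample of real documents before committing to a key. A minimal sketch (the order/tenant data below is invented for illustration):

```python
from collections import Counter

def shard_key_stats(docs, key):
    """Rough screen for a candidate shard key: count distinct values
    (cardinality) and the share of documents held by the most common
    value (frequency). Want: high cardinality, low top share."""
    values = [doc[key] for doc in docs]
    counts = Counter(values)
    cardinality = len(counts)
    top_share = counts.most_common(1)[0][1] / len(values)
    return cardinality, top_share

# Hypothetical sample: 1000 orders, one tenant generating 70% of them.
docs = [{"tenant_id": "t1", "order_id": i} for i in range(700)] + \
       [{"tenant_id": f"t{2 + i % 3}", "order_id": 700 + i} for i in range(300)]

print(shard_key_stats(docs, "tenant_id"))  # (4, 0.7): low cardinality, one hot value
print(shard_key_stats(docs, "order_id"))   # (1000, 0.001): unique per document
```

Here tenant_id alone fails both tests, which is exactly why the compound key {tenant_id: 1, created_at: 1} is the usual fix: the prefix keeps queries targeted while the suffix restores cardinality.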
Enable authentication, TLS on all connections, and audit logging in production. Use readPreference: secondaryPreferred for analytics queries to take load off the primary, but know that reads from secondaries can return stale data. For strong consistency, read from the primary.
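Read preference is usually set per connection via the URI, so analytics services can be pointed at secondaries without touching application code. A sketch with placeholder hostnames and database name:

```python
# Hypothetical connection string for an analytics service. readPreference,
# replicaSet, and tls are standard MongoDB URI options understood by all
# official drivers; the hosts and database here are placeholders.
analytics_uri = (
    "mongodb://mongo-a.internal:27017,mongo-b.internal:27017,"
    "mongo-c.internal:27017/reports"
    "?replicaSet=rs0&readPreference=secondaryPreferred&tls=true"
)
print(analytics_uri)
```

secondaryPreferred (rather than secondary) means the queries fall back to the primary if no secondary is available, trading isolation for availability.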
Capacity Planning
A replica set on m5.2xlarge instances (8 vCPUs, 32GB RAM) with gp3 storage handles roughly 20,000-50,000 operations per second, depending on document size and index complexity. Set WiredTiger's internal cache to 50% of (RAM - 1GB), which is the default formula, with a floor of 256MB. Watch serverStatus().wiredTiger.cache["bytes currently in the cache"] and keep the cache hit ratio above 95%.
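The default cache formula is worth making concrete, since it decides how much of the working set fits in memory. A minimal sketch:

```python
def wiredtiger_cache_gb(ram_gb: float) -> float:
    """Default WiredTiger internal cache size: the larger of
    50% of (RAM - 1 GB) or 256 MB."""
    return max(0.5 * (ram_gb - 1), 0.25)

print(wiredtiger_cache_gb(32))  # 15.5 GB on an m5.2xlarge
print(wiredtiger_cache_gb(1))   # tiny hosts floor at 0.25 GB
```

Note the remaining ~16GB on that instance is not wasted: filesystem cache holds the compressed on-disk pages, which is why overriding the formula upward usually backfires.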
The most important number to estimate is the working set size: the data plus indexes that get actively queried. If the working set fits in WiredTiger's cache, performance stays predictable. The moment it spills to disk, latency spikes and tail latencies get ugly. Monitor oplog.rs collection size and make sure it covers at least 48-72 hours of operations. This gives secondaries and disaster recovery processes enough buffer to catch up after maintenance or outages. Track globalLock.activeClients.readers and globalLock.activeClients.writers to spot concurrency bottlenecks before they bite in production.
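Sizing the oplog for that 48-72 hour target is simple division over the observed write rate. A sketch with invented numbers:

```python
def oplog_window_hours(oplog_size_gb: float, oplog_gb_per_hour: float) -> float:
    """Estimate how many hours of writes a capped oplog can retain,
    given the observed rate at which the cluster generates oplog."""
    return oplog_size_gb / oplog_gb_per_hour

# Hypothetical: a 50 GB oplog on a cluster writing 1 GB of oplog per hour.
hours = oplog_window_hours(50, 1.0)
print(hours)  # 50.0 -> just above the 48-hour floor
```

The catch is that oplog_gb_per_hour is spiky: a bulk import can burn through days of normal headroom in an hour, so size against peak rate, not the average.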
Failure Scenarios
Scenario 1: Chunk migration storm from a bad shard key. Pick a monotonically increasing shard key (ObjectId, timestamp) and all writes land on the topmost chunk of one shard's range. The balancer keeps splitting and migrating chunks to other shards, burning I/O and network bandwidth. At scale this creates a feedback loop: migrations slow down the overloaded shard, which causes more chunks to queue for migration. Spot this by checking sh.status() for jumbo chunks and looking at moveChunk operations in the mongos logs. The fix: use a hashed shard key or a compound shard key with a high-cardinality prefix. For existing deployments stuck with a bad key, MongoDB 5.0+ supports online resharding via reshardCollection, but expect it to take a while on large datasets.
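The hotspot is easy to demonstrate with a toy partitioner (a simplification of real chunk routing; chunk count and key space are invented):

```python
import hashlib

NUM_CHUNKS = 8

def range_chunk(key: int, max_key: int) -> int:
    """Range partitioning over [0, max_key): which slice the key falls in."""
    return min(key * NUM_CHUNKS // max_key, NUM_CHUNKS - 1)

def hashed_chunk(key: int) -> int:
    """Hashed partitioning: hash the key, then bucket it."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_CHUNKS

# Monotonically increasing keys (think ObjectId/timestamp) arriving now:
recent = range(990, 1000)
print({range_chunk(k, 1000) for k in recent})  # {7}: every write hits the top chunk
print(len({hashed_chunk(k) for k in recent}))  # spread across several chunks
```

Same ten writes, two distributions: under range partitioning they all queue behind one shard, under hashing they fan out, which is the whole tradeoff in one picture (and exactly what the hashed key gives up: those ten keys are no longer adjacent, so range scans touch every shard).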
Scenario 2: Oplog window exhaustion forcing a full resync. A secondary falls behind because of network issues or a heavy batch job on the primary. If the secondary's replication position falls off the end of the primary's oplog (a capped collection typically sized for 24-72 hours), it cannot resume incremental replication. It has to do a full initial sync, copying the entire dataset. For a 2TB shard, that means 12-24 hours over a 10Gbps network. During resync you lose that member's read capacity and a layer of failover redundancy. Monitor rs.status() for optimeDate lag and the oplog window reported by rs.printReplicationInfo(). Size the oplog generously with --oplogSize, alert when any secondary's lag exceeds 1 hour, and run batch operations on hidden members so production secondaries stay healthy.
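The 1-hour lag alert reduces to comparing optimes from rs.status(). A sketch of that check, with invented hostnames and timestamps:

```python
from datetime import datetime, timedelta, timezone

LAG_ALERT = timedelta(hours=1)

def lag_alerts(primary_optime: datetime, secondary_optimes: dict) -> list:
    """Given the primary's latest optime and each secondary's applied
    optime (the optimeDate values rs.status() reports per member),
    return hosts whose replication lag exceeds the alert threshold."""
    return [
        host for host, optime in secondary_optimes.items()
        if primary_optime - optime > LAG_ALERT
    ]

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
secondaries = {
    "mongo-b.internal:27017": now - timedelta(minutes=5),  # healthy
    "mongo-c.internal:27017": now - timedelta(hours=3),    # falling behind
}
print(lag_alerts(now, secondaries))  # ['mongo-c.internal:27017']
```

Alerting at 1 hour against a 48-hour oplog window leaves plenty of runway to fix the cause before the member falls off the oplog entirely.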
Pros
- • Flexible schema, no migrations needed
- • Rich query language with aggregation pipeline
- • Horizontal scaling via built-in sharding
- • Native JSON document model
- • Multi-document ACID transactions
Cons
- • Joins are limited ($lookup, added in 3.2, is the only server-side join and is costly at scale)
- • WiredTiger's cache consumes significant RAM (roughly half of system memory by default)
- • Denormalization leads to data duplication
- • Write amplification with large documents
- • Sharding requires careful key selection
When to use
- • Schema evolves frequently (startups, MVPs)
- • Data is naturally document-shaped (JSON)
- • Need flexible querying over semi-structured data
- • Rapid prototyping with changing requirements
When NOT to use
- • Highly relational data with many joins
- • Transactions spanning many documents dominate the workload (possible since 4.0, but a relational database does it better)
- • Write-heavy append-only workloads (prefer Cassandra)
- • Strict schema enforcement is required
Key Points
- • The document model stores data as BSON (Binary JSON), supporting nested objects, arrays, and types beyond JSON such as Decimal128, dates, and binary data. Each document can have a unique structure within a collection.
- • WiredTiger uses document-level locking, snappy compression by default, and B-tree storage with an on-disk write-ahead journal for crash recovery.
- • Change Streams provide a real-time CDC (Change Data Capture) API built on the oplog (replica sets and sharded clusters only). Event-driven architectures work out of the box without polling or bolting on Debezium.
- • The aggregation pipeline processes data through sequential stages ($match, $group, $lookup, $unwind) and pushes computation to the server, cutting data transfer to the application.
- • Sharding distributes data via mongos routers that consult config servers for chunk-to-shard mappings. Shard key selection is the single most important architectural decision.
- • Read concern and write concern settings provide fine-grained control over the consistency vs. latency tradeoff, from fire-and-forget writes to majority-acknowledged linearizable reads.
Common Mistakes
- ✗ Not creating indexes for query patterns. MongoDB performs collection scans (full table scans) without indexes. Use explain() to verify every production query hits an index.
- ✗ Using $where or JavaScript in queries. Server-side JavaScript execution bypasses the query optimizer, cannot use indexes, and opens security holes. Use standard query operators instead.
- ✗ Unbounded arrays in documents. Arrays that grow forever (e.g., storing all user events inside a user document) cause document growth past the 16MB BSON limit, write amplification, and index bloat.
- ✗ Not setting appropriate read/write concerns. A w:1 write concern acknowledges after the primary writes but before replication, so acknowledged writes can be rolled back on failover. Use w:majority for critical data (the server default since 5.0, but verify older deployments).
- ✗ Ignoring the oplog window for replication. If a secondary falls behind the oplog window (typically 24-72 hours of operations), it must do a full resync, which takes hours and tanks cluster performance.