Database Selection Framework
The Decision Matrix
Picking a database is one of the hardest-to-reverse decisions in a system. The right framework evaluates five dimensions: consistency requirements, query patterns, scale trajectory, operational maturity, and team expertise.
Build a simple matrix. List your top 10 queries in rows and the candidate databases in columns. Score each combination on a 1-5 scale for natural fit. This exercise alone eliminates most bad choices.
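To make the matrix concrete, here is a minimal sketch in Python; the queries, candidate databases, and scores are illustrative placeholders, not recommendations:

```python
# Query-fit matrix: rows are your application's top queries, columns are
# candidate databases, each cell a 1-5 judgment of natural fit.
from statistics import mean

scores = {
    "fetch user profile by id":         {"postgresql": 5, "mongodb": 5, "dynamodb": 5},
    "orders joined with line items":    {"postgresql": 5, "mongodb": 2, "dynamodb": 1},
    "full-text search over products":   {"postgresql": 4, "mongodb": 3, "dynamodb": 1},
    "rolling 30-day revenue by region": {"postgresql": 4, "mongodb": 2, "dynamodb": 1},
}

for db in ["postgresql", "mongodb", "dynamodb"]:
    avg = mean(row[db] for row in scores.values())
    worst = min(row[db] for row in scores.values())
    # The worst case matters as much as the average: one unworkable
    # query can disqualify a database outright.
    print(f"{db:<11} avg={avg:.1f} worst={worst}")
```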
Relational Databases
PostgreSQL and MySQL cover 80% of use cases. PostgreSQL has stronger support for complex queries, JSON columns, and extensions like PostGIS and pg_trgm. MySQL has a simpler operational model and better replication ergonomics at scale. If you need ACID transactions across multiple tables, relational databases are still the best answer.
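As a hedged illustration of the two strengths mentioned above, JSONB querying and multi-table ACID transactions, here is a sketch using the psycopg2 driver; the DSN, table, and column names are hypothetical:

```python
import psycopg2

conn = psycopg2.connect("dbname=app user=app")  # placeholder DSN
with conn:  # commits on success, rolls back on exception
    with conn.cursor() as cur:
        # JSONB containment: find events whose payload includes this key/value.
        cur.execute(
            "SELECT id FROM events WHERE payload @> %s::jsonb",
            ('{"type": "signup"}',),
        )
        event_ids = [row[0] for row in cur.fetchall()]

        # ACID across tables: both updates commit together or not at all.
        cur.execute("UPDATE accounts SET balance = balance - 10 WHERE id = %s", (1,))
        cur.execute("UPDATE accounts SET balance = balance + 10 WHERE id = %s", (2,))
```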
At scale, look at CockroachDB or TiDB for distributed SQL that maintains PostgreSQL or MySQL compatibility. Expect 2-3x latency overhead compared to a single-node setup, but you gain horizontal scalability without application changes.
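Because CockroachDB speaks the PostgreSQL wire protocol, "without application changes" mostly means repointing the connection string. A sketch, with placeholder host and credentials:

```python
import psycopg2

# The same PostgreSQL driver, aimed at a hypothetical CockroachDB cluster.
conn = psycopg2.connect(
    host="crdb.internal.example.com",  # placeholder cluster address
    port=26257,                        # CockroachDB's default port
    dbname="app",
    user="app",
    sslmode="require",
)
with conn, conn.cursor() as cur:
    cur.execute("SELECT version()")    # reports a CockroachDB version string
    print(cur.fetchone()[0])
```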
Document Stores
MongoDB works well when your data is naturally hierarchical and your access pattern is "fetch one document by ID." It struggles when you need joins across collections or strong consistency across documents. DynamoDB is a better fit for pure key-value and key-range patterns where you can design your access patterns upfront. DynamoDB charges per read/write unit, so unpredictable query patterns get expensive fast.
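A sketch of the access pattern DynamoDB rewards: a query by partition key plus a bounded sort-key range, using boto3. The table name and key schema are hypothetical:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("orders")  # hypothetical table

# Efficient: one partition, bounded range, billed for items actually read.
response = table.query(
    KeyConditionExpression=(
        Key("customer_id").eq("cust-123")
        & Key("order_date").between("2024-01-01", "2024-03-31")
    )
)
for item in response["Items"]:
    print(item["order_date"], item.get("total"))
# By contrast, an ad-hoc filter on a non-key attribute forces a Scan,
# which reads (and bills for) the entire table.
```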
Specialized Databases
Time-series: TimescaleDB (PostgreSQL extension) or InfluxDB for metrics, IoT, and financial tick data. Compression and retention policies are built in.
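A minimal sketch of the TimescaleDB model, assuming the extension is installed and using a hypothetical readings table: a plain PostgreSQL table becomes a hypertable and is downsampled with time_bucket():

```python
import psycopg2

conn = psycopg2.connect("dbname=metrics user=app")  # placeholder DSN
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS readings (
            time   TIMESTAMPTZ NOT NULL,
            device TEXT        NOT NULL,
            value  DOUBLE PRECISION
        )
    """)
    # Partition by time; TimescaleDB manages the chunks automatically.
    cur.execute("SELECT create_hypertable('readings', 'time', if_not_exists => TRUE)")
    # Downsample to 5-minute averages per device.
    cur.execute("""
        SELECT time_bucket('5 minutes', time) AS bucket, device, avg(value)
        FROM readings GROUP BY bucket, device ORDER BY bucket
    """)
```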
Graph: Neo4j for relationship-heavy queries like social networks, fraud detection, or recommendation engines. If you are doing more than 3-hop traversals, a graph database will outperform SQL self-joins by orders of magnitude.
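A sketch of a variable-length traversal in Cypher via the official Python driver; the URI, credentials, labels, and relationship type are hypothetical:

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))

# Follow suggestions up to 3 hops out, excluding the starting user.
# The equivalent SQL needs a self-join per hop; Cypher states the depth once.
query = """
MATCH (me:User {id: $user_id})-[:FOLLOWS*1..3]->(other:User)
WHERE other <> me
RETURN DISTINCT other.id AS suggestion
LIMIT 25
"""
with driver.session() as session:
    for record in session.run(query, user_id="u-42"):
        print(record["suggestion"])
driver.close()
```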
Wide-column: Cassandra or ScyllaDB for write-heavy workloads at massive scale. Netflix runs Cassandra for over a trillion rows. The trade-off is eventual consistency and a rigid query model where you design tables around specific queries.
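A sketch of that query-first modeling with the Python cassandra-driver; the keyspace, table, and query are illustrative:

```python
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("telemetry")  # hypothetical keyspace

# This table exists to answer exactly one question: "recent events for a
# device". The partition key and clustering order are chosen for that query.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_device (
        device_id TEXT,
        ts        TIMESTAMP,
        payload   TEXT,
        PRIMARY KEY ((device_id), ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")

# The query the table was designed for; any other access pattern
# generally means creating another table shaped for it.
rows = session.execute(
    "SELECT ts, payload FROM events_by_device WHERE device_id = %s LIMIT 100",
    ("dev-7",),
)
```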
Key-value cache: Redis for sub-millisecond reads on hot data. Treat it as ephemeral. Teams that use Redis as a primary data store eventually regret it when they hit memory limits or lose data on restart.
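A cache-aside sketch that follows this advice: every key carries a TTL, and a miss falls through to the source of truth. The function and key names are hypothetical; requires the redis package:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_profile(user_id, loader, ttl_seconds=300):
    """Return a cached profile, falling back to the primary database."""
    key = f"profile:{user_id}"
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)
    profile = loader(user_id)                       # hits the primary store
    r.setex(key, ttl_seconds, json.dumps(profile))  # expires automatically
    return profile
```

Because every entry expires, a flushed or restarted Redis costs only latency, never data.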
The Polyglot Persistence Trap
Using multiple databases is sometimes necessary but always expensive operationally. Each database is another system to monitor, back up, upgrade, and page on. Stripe famously ran on just MongoDB and Redis for years before adding more databases. Start with as few databases as you can justify and add specialists only when PostgreSQL demonstrably cannot meet a specific requirement.
Key Points
- Start with query patterns, not the database. List the top 10 queries your application will run and let that drive the selection
- PostgreSQL is the safe default for most applications. It handles JSON, full-text search, geospatial, and time-series reasonably well before you need a specialist
- Operational complexity matters more than benchmarks. A database your team can operate confidently at 2 AM is better than one that wins synthetic benchmarks
- Multi-model databases reduce operational burden but sacrifice peak performance. DynamoDB is fast for key-value but painful for ad-hoc analytics
- Data gravity is real. Once you have terabytes in one system, migration cost dominates all other considerations
Common Mistakes
- ✗ Choosing a database based on hype or conference talks instead of matching it to actual access patterns
- ✗ Running benchmarks on empty datasets. Performance characteristics change dramatically at 100GB, 1TB, and 10TB
- ✗ Ignoring the backup, restore, and disaster recovery story. A database you cannot reliably restore is a liability