PostgreSQL
The relational database that keeps earning its spot in production
Use Cases
PostgreSQL is the database most teams should start with, and many never need to leave. Apple, Instagram, Spotify, and Reddit all run it in production, and that is not a coincidence. After 35+ years of development, Postgres handles everything from basic CRUD apps to analytical workloads on petabytes of data. It is not the fastest option for every access pattern, but the combination of ACID compliance, extensibility, and SQL standards support makes it the safest default for a primary datastore.
How It Works Internally
PostgreSQL's concurrency model, MVCC (Multi-Version Concurrency Control), stores multiple physical versions of each row right in the heap. Every row carries xmin (the transaction ID that created it) and xmax (the transaction ID that deleted or updated it) as system columns. When a row is updated, Postgres inserts a new version and stamps the old one with xmax, building a version chain. Readers only see versions visible to their snapshot, which is determined by which transactions had committed when the snapshot was taken, checked against the commit log (CLOG). The upside: reads never block writes. The downside: dead tuples accumulate, and VACUUM has to clean them up.
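You can watch this happen by selecting the system columns directly. A minimal sketch, using a hypothetical accounts table:

```sql
-- Hypothetical table to illustrate row versioning.
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 100);

-- xmin is the XID that created this version; xmax = 0 means no
-- transaction has deleted or superseded it yet.
SELECT xmin, xmax, id, balance FROM accounts;

UPDATE accounts SET balance = 90 WHERE id = 1;

-- The visible row now has a new xmin (the updating transaction's XID).
-- The old version still sits in the heap as a dead tuple until VACUUM
-- reclaims it; it just isn't visible to our snapshot.
SELECT xmin, xmax, id, balance FROM accounts;
```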
This is the defining detail of Postgres internals. Unlike Oracle or MySQL/InnoDB, which use a separate undo log, Postgres leaves old row versions in place. That design choice shapes everything from the autovacuum configuration to how long-running transactions are handled.
The Write-Ahead Log (WAL) is how Postgres delivers durability and replication. Every data modification goes to sequential WAL segments (16MB each by default) before the transaction commits. WAL records are binary representations of page-level changes. During crash recovery, Postgres replays WAL from the last checkpoint forward. Streaming replication ships these WAL records to replicas in near-real-time. With synchronous replication enabled, the primary waits for at least one replica to confirm before acknowledging the commit. The result is zero committed-data loss on primary failure, at the cost of higher commit latency.
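A minimal primary-side sketch of that synchronous setup; the standby name replica1 is an assumption, not a convention:

```ini
# postgresql.conf (primary) -- sketch; standby name is illustrative
wal_level = replica                                # emit enough WAL for streaming replicas
synchronous_standby_names = 'FIRST 1 (replica1)'   # wait for one named standby
synchronous_commit = on                            # commit returns only after the
                                                   # standby confirms the WAL flush
```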
Postgres supports multiple index types, and picking the right one matters more than most engineers realize. B-tree indexes handle equality and range queries on scalar types. GIN (Generalized Inverted Index) indexes are the right choice for full-text search, JSONB containment queries, and array operations. GiST (Generalized Search Tree) indexes cover geometric data, range types, and nearest-neighbor queries. BRIN (Block Range Index) indexes store min/max summaries per block range, using orders of magnitude less space for naturally ordered data like timestamps. A 1TB time-series table can have a BRIN index under 1MB versus a 20GB B-tree. That is not a typo.
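As a sketch of what each looks like in practice (table and column names are hypothetical; the GiST example assumes a point-typed column):

```sql
-- B-tree (the default): equality and range queries on scalars
CREATE INDEX idx_orders_created ON orders (created_at);

-- GIN: JSONB containment (@>); jsonb_path_ops trades flexibility for size
CREATE INDEX idx_events_payload ON events USING gin (payload jsonb_path_ops);

-- GiST: geometric data and nearest-neighbor (ORDER BY location <-> point)
CREATE INDEX idx_stores_location ON stores USING gist (location);

-- BRIN: tiny min/max summaries per block range, for naturally ordered data
CREATE INDEX idx_metrics_ts ON metrics USING brin (recorded_at);
```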
Production Architecture
A production Postgres deployment revolves around a primary instance with streaming replicas. Put PgBouncer in front of the primary in transaction pooling mode. This collapses thousands of application connections down to hundreds of actual Postgres backends. For financial or compliance workloads, use synchronous replication to at least one replica. For read scaling and disaster recovery, async replicas do the job.
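A sketch of the corresponding PgBouncer configuration; the host and pool sizes are illustrative, not recommendations:

```ini
; pgbouncer.ini -- transaction pooling sketch
[databases]
appdb = host=10.0.0.5 port=5432 dbname=appdb

[pgbouncer]
listen_addr = *
listen_port = 6432
pool_mode = transaction      ; server connection is released at transaction end
max_client_conn = 5000       ; application-side connections PgBouncer accepts
default_pool_size = 100      ; actual Postgres backends per database/user pair
```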
Configure replication slots so the primary does not discard WAL segments that replicas have not consumed yet. Keep an eye on pg_stat_replication for lag. If lag exceeds a few seconds, that points to network issues or an overloaded replica. Use pg_basebackup for initial replica provisioning and point-in-time recovery (PITR) backups. For automated failover, Patroni (used by GitLab and Zalando) works well with etcd or ZooKeeper for consensus.
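Checking lag from the primary is one query against pg_stat_replication:

```sql
-- Run on the primary: per-standby replication lag in bytes.
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;
```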
One thing I will be direct about: Postgres replication is not as turnkey as managed databases make it look. Expect to spend time configuring slots, monitoring lag, and planning failover. But once the pieces are understood, the operational model is straightforward and well-documented.
Capacity Planning
A single Postgres instance on a db.r6g.4xlarge (128GB RAM, 16 vCPUs, io2 storage) can push 10,000-50,000 transactions per second depending on query complexity. Set shared_buffers to 25% of RAM (32GB), effective_cache_size to 75% (96GB), and work_mem to 128-256MB for analytical queries. Watch max_connections carefully. Each backend costs 5-10MB, so 500 connections burn 2.5-5GB just on process overhead. This is why PgBouncer is not optional.
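Expressed as configuration, with the values from the sizing above (tune for your own workload):

```ini
# postgresql.conf for a 128GB RAM instance -- values mirror the sizing above
shared_buffers = 32GB          # ~25% of RAM
effective_cache_size = 96GB    # ~75% of RAM; a planner hint, not an allocation
work_mem = 128MB               # per sort/hash node, per query -- multiplies fast
max_connections = 200          # keep low and pool with PgBouncer instead
```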
Monitor tup_deleted versus tup_inserted ratios in pg_stat_database. If deletes consistently outpace inserts, table bloat is building up. Track pg_stat_user_tables.n_dead_tup and make sure autovacuum runs often enough. On modern SSDs, set autovacuum_vacuum_cost_delay to 2ms (the default on PostgreSQL 12 and later; older versions default to 20ms) so VACUUM can actually keep up.
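A quick check for tables where autovacuum is losing ground:

```sql
-- Tables with the most dead tuples; a high dead:live ratio plus a stale
-- last_autovacuum timestamp means autovacuum is not keeping up.
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```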
Failure Scenarios
Scenario 1: XID wraparound shutdown. This one is scary because it can bring down the database with no obvious warning unless you monitor the right metric. PostgreSQL uses 32-bit transaction IDs that wrap around at roughly 4.2 billion. When a table's relfrozenxid falls too far behind the current XID, Postgres enters "emergency autovacuum" mode. If that fails (usually because of a bloated table or I/O saturation), Postgres shuts down with: "database is not accepting commands to avoid wraparound data loss." Mailchimp's Mandrill hit this in 2019 and lost hours of availability. Detection: monitor age(relfrozenxid) per table and alert when anything exceeds 500 million transactions. Prevention: tune autovacuum_freeze_max_age and give autovacuum workers enough I/O bandwidth to do their job.
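The detection query is short enough to wire straight into monitoring:

```sql
-- Oldest unfrozen XID per table; alert when xid_age crosses ~500 million.
SELECT c.relname, age(c.relfrozenxid) AS xid_age
FROM pg_class c
WHERE c.relkind = 'r'
ORDER BY xid_age DESC
LIMIT 10;
```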
Scenario 2: Replication slot causing disk exhaustion. A paused or disconnected replica with an active replication slot stops the primary from deleting old WAL segments. WAL piles up on the primary until the disk fills, at which point Postgres can no longer write WAL and PANICs, taking down all writes. GitLab hit a variant of this during their 2017 database incident. Detection: monitor pg_replication_slots for inactive slots and use pg_wal_lsn_diff() to check the gap between current WAL position and each slot's restart LSN. Starting with PostgreSQL 13, set max_slot_wal_keep_size to cap WAL retention per slot. This automatically invalidates slots that fall too far behind, which is better than running out of disk.
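A sketch of that detection plus the PostgreSQL 13+ safety valve (the size cap below is illustrative):

```sql
-- How much WAL each slot is forcing the primary to keep around;
-- an inactive slot with a growing gap is the danger sign.
SELECT slot_name, active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_wal_bytes
FROM pg_replication_slots;

-- And the PostgreSQL 13+ safety valve in postgresql.conf (cap is illustrative):
--   max_slot_wal_keep_size = 200GB
```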
Pros
- Full ACID compliance with MVCC
- Wildly extensible (custom types, functions, extensions)
- Best-in-class SQL standard compliance
- Rich indexing: B-tree, GIN, GiST, BRIN
- Strong community and ecosystem
Cons
- Vertical scaling hits a wall eventually
- Write-heavy workloads need careful tuning
- Replication is async by default
- Sharding requires extensions like Citus
- VACUUM overhead bites you during long-running transactions
When to use
- You need strong consistency and ACID transactions
- Complex relational data with joins
- Mixed workloads (relational + JSON + full-text search)
- Geospatial queries
When NOT to use
- You need automatic horizontal sharding at massive scale
- Your access pattern is purely key-value lookups
- Ultra-low latency requirements (sub-millisecond)
- Append-only time-series at millions of events per second
Key Points
- MVCC stores old row versions (dead tuples) directly in the heap, not in a separate undo log. Reads never block writes, but dead tuples pile up and VACUUM has to clean them.
- WAL (Write-Ahead Log) is what provides durability: every transaction hits sequential WAL segments before the commit returns. This is also how point-in-time recovery and streaming replication work.
- Extensibility is what sets Postgres apart. PostGIS, TimescaleDB, pgvector, pg_cron, Citus: they all run as extensions, turning one engine into geospatial, time-series, vector, and distributed databases.
- Connection pooling through PgBouncer is not optional. Each PostgreSQL backend eats 5-10MB of RAM. 500 direct connections blow up to 5GB just on connection overhead.
- VACUUM overhead gets dangerous with transaction ID wraparound risk. If autovacuum falls behind, PostgreSQL will refuse all writes to prevent XID wraparound. This is a catastrophic failure mode.
- Index selection matters more than most people think. B-tree for equality and range. GIN for full-text and JSONB containment. GiST for geometric and nearest-neighbor. BRIN for naturally ordered large tables.
Common Mistakes
- ✗ Skipping connection pooling (PgBouncer). 5000 direct connections will eat all the memory and tank performance because of the process-per-connection model. Use PgBouncer in transaction mode.
- ✗ Ignoring table bloat and VACUUM. Dead tuples stack up without aggressive autovacuum tuning, and the tables end up 10x larger than the actual data. Sequential scans slow to a crawl.
- ✗ Running with default shared_buffers (128MB) and work_mem (4MB). Set shared_buffers to 25% of RAM and work_mem to 50-256MB per sort/hash operation, scaled to max_connections and available memory.
- ✗ Using SELECT * everywhere. It pulls unnecessary columns, kills index-only scans, wastes I/O, and makes query plans unpredictable as the schema evolves.
- ✗ Letting long-running transactions block VACUUM. A single open transaction stops VACUUM from reclaiming any dead tuples created after it started, and table bloat accelerates across the entire database. The query after this list shows how to spot them.
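A quick way to find the offending transactions, assuming access to pg_stat_activity:

```sql
-- Transactions that have been open the longest; any of these pins the
-- xmin horizon and blocks VACUUM from reclaiming newer dead tuples.
SELECT pid, now() - xact_start AS xact_age, state,
       left(query, 60) AS query_head
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;
```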