Self-Service Database Provisioning
Self-Service Provisioning
Developers should not file tickets to get a database. The platform provides a self-service interface, whether that's a CLI tool, a web portal, or a Kubernetes custom resource. The developer selects a database type (PostgreSQL, MySQL, Redis, MongoDB), picks a size tier, and gets a running instance with credentials delivered to their secrets manager. The whole process takes minutes, not days.
Behind the scenes, the platform uses Terraform, Crossplane, or cloud-native operators to provision the infrastructure. AWS RDS, GCP Cloud SQL, or Azure Database services handle the heavy lifting. The platform team's job is to wrap these with sensible defaults: encryption at rest enabled, automated backups configured, monitoring agents installed, and network policies applied.
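For the Kubernetes custom-resource path, the developer-facing interface might look like the sketch below. The API group, `Database` kind, and field names are hypothetical, invented for illustration; a real platform would define its own schema (or expose a Crossplane composite resource):

```yaml
# Hypothetical platform custom resource -- group, kind, and fields
# are illustrative, not a real Crossplane or operator API.
apiVersion: platform.example.com/v1alpha1
kind: Database
metadata:
  name: orders-db
  namespace: team-checkout
spec:
  engine: postgresql        # postgresql | mysql | redis | mongodb
  tier: small               # platform-defined size tier
  writeCredentialsTo:
    secretName: orders-db-credentials   # delivered to the team's secrets manager
```

The platform controller reconciles this into an RDS/Cloud SQL instance with the encryption, backup, and monitoring defaults already applied.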
Connection Pooling Architecture
PostgreSQL has a process-per-connection model. Each connection consumes about 10MB of memory. A Kubernetes cluster running 200 pods each opening 10 connections means 2,000 connections, which would require 20GB of memory just for connection overhead. Most workloads don't need dedicated connections.
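The arithmetic behind that estimate is worth making explicit; the 10 MB-per-connection figure is a rule of thumb, and real usage varies with `work_mem` and catalog cache size:

```python
# Back-of-envelope PostgreSQL connection overhead.
# 10 MB per backend process is a rough rule of thumb, not a constant.
PODS = 200
CONNS_PER_POD = 10
MB_PER_CONN = 10

total_conns = PODS * CONNS_PER_POD            # 2,000 connections
overhead_gb = total_conns * MB_PER_CONN / 1000

print(f"{total_conns} connections ~ {overhead_gb:.1f} GB of memory")
# prints: 2000 connections ~ 20.0 GB of memory
```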
PgBouncer sits between your application and PostgreSQL. In transaction pooling mode, a connection is assigned to a client only for the duration of a transaction. Between transactions, the connection returns to the pool. This lets 2,000 application connections share 50 actual database connections. Deploy PgBouncer as a sidecar or as a shared service per database cluster.
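A minimal `pgbouncer.ini` matching those numbers might look like this; the host, port, and database names are placeholders:

```ini
[databases]
; placeholder connection target
appdb = host=db-primary.internal port=5432 dbname=appdb

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; connection held only for a transaction
max_client_conn = 2000       ; client-side connections accepted
default_pool_size = 50       ; actual server connections per db/user pair
```

With `pool_mode = transaction`, session-level features like prepared statements and advisory locks need care, since consecutive transactions from one client may land on different server connections.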
ProxySQL does the same for MySQL with additional features like query routing and caching. Redis doesn't need connection pooling in the same way since its single-threaded model handles connections differently, but connection limits still matter.
Schema Migration Tooling
Database schema changes are code. They should be version controlled, reviewed, and applied through an automated pipeline. Flyway uses numbered SQL migration files. Atlas uses declarative schema definitions and generates migration plans automatically. Both integrate with CI/CD.
The workflow: developer writes a migration, opens a PR, CI runs the migration against a test database and validates it. After merge, the deployment pipeline applies the migration to staging, runs integration tests, then applies to production. No human touches the production database directly.
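The CI gate in that workflow might be wired up like the following sketch (GitHub Actions syntax; job names and credentials are illustrative, and it assumes the Flyway CLI is available on the runner):

```yaml
# Illustrative CI job: validate migrations against a throwaway database.
name: migrations
on: pull_request
jobs:
  validate:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: ci-only
        ports: ["5432:5432"]
    steps:
      - uses: actions/checkout@v4
      - name: Apply migrations to the test database
        run: >
          flyway migrate
          -url=jdbc:postgresql://localhost:5432/postgres
          -user=postgres -password=ci-only
```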
Dangerous operations like dropping columns or adding indexes on large tables need special handling. Atlas can detect these and flag them for review. For PostgreSQL, use CREATE INDEX CONCURRENTLY to avoid locking tables. For column drops, first stop reading the column, deploy, then drop it in a follow-up migration.
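Expressed as Flyway-style numbered migration files, the two safe patterns above might look like this (file, table, and column names are illustrative):

```sql
-- V7__add_orders_status_index.sql (illustrative Flyway naming)
-- CONCURRENTLY avoids holding an exclusive lock on a large table.
-- PostgreSQL cannot run it inside a transaction, so the migration
-- tool must be told not to wrap this file in one.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_status
    ON orders (status);

-- V8__drop_orders_legacy_flag.sql
-- Applied only after a deploy has removed every read of the column.
ALTER TABLE orders DROP COLUMN legacy_flag;
```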
Monitoring and Alerting
Every database instance needs baseline monitoring from day one. The critical alert thresholds: CPU utilization above 80%, free disk space below 20%, replication lag above 5 seconds, active connections above 80% of the configured maximum, and queries running longer than 30 seconds. Breaches should fire alerts to the owning team's Slack channel and to PagerDuty.
The platform team sets up the monitoring stack once and applies it automatically to every provisioned database. Prometheus with postgres_exporter for PostgreSQL, mysqld_exporter for MySQL, and redis_exporter for Redis. Grafana dashboards templated per database type give teams instant visibility.
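One of those thresholds, written as a Prometheus alerting rule, might look like the sketch below. The metric name `pg_replication_lag` comes from postgres_exporter but varies by exporter version, so treat it as an assumption:

```yaml
# Prometheus alerting rule sketch; metric name is exporter-dependent.
groups:
  - name: database-baseline
    rules:
      - alert: PostgresReplicationLagHigh
        expr: pg_replication_lag > 5
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Replication lag above 5s on {{ $labels.instance }}"
```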
Cost Management and Rightsizing
Database instances are the second-largest cloud cost after compute for most organizations. The platform should track per-team database costs and surface rightsizing recommendations. An RDS instance running at 5% CPU for three months should be flagged for downsizing. Development databases should be automatically stopped outside business hours, saving 65% on those instances.
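The off-hours savings figure follows from simple schedule arithmetic, assuming a 12-hour weekday business window (an assumption; adjust for your organization):

```python
# Savings from stopping a dev database outside business hours,
# assuming a 12h x 5-day window.
HOURS_PER_WEEK = 24 * 7    # 168
BUSINESS_HOURS = 12 * 5    # 60

savings = 1 - BUSINESS_HOURS / HOURS_PER_WEEK
print(f"Instance-hours saved: {savings:.0%}")
# prints: Instance-hours saved: 64%
```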
Reserved instances for stable production workloads cut costs by 30-60% compared to on-demand. The platform team should handle RI purchasing centrally based on aggregate usage across all teams.
Key Points
- Self-service database provisioning should take under 5 minutes from request to a running instance with the connection string delivered
- PgBouncer in transaction pooling mode can reduce PostgreSQL connection count from 2,000 application connections to 50 database connections
- Automated daily backups with point-in-time recovery are the minimum; test restores monthly, because untested backups are not backups
- Schema migration tooling like Atlas or Flyway should be integrated into CI/CD so database changes go through the same review process as code
- Multi-tenant shared instances save 60-80% on costs but require strict resource isolation through connection limits, statement timeouts, and schema separation
Common Mistakes
- Letting every team run its own database without centralized backup verification, monitoring, or security patching
- Using shared connection pools across all services without per-service connection limits, allowing one misbehaving service to exhaust the pool
- Skipping schema migration tooling and letting teams apply DDL changes manually in production through direct database access
- Not setting up automated alerting for disk space, replication lag, and long-running queries before they cause outages