Amazon SQS — AWS's managed message queue. No servers, no clusters, no headaches.
Category: Messaging
## Why It Exists
Use Cases for Amazon SQS
- Decoupling microservices through async message passing
- Distributing work across a fleet of consumers
- Buffering requests when traffic spikes hit
- Fan-out with SNS-to-SQS for event-driven architectures
- Dead-letter queues for catching and debugging failed messages
- Serverless event processing with Lambda triggers
Pros of Amazon SQS
- Fully managed. No infrastructure to provision, patch, or scale.
- Virtually unlimited throughput on standard queues, no pre-provisioning needed
- At-least-once delivery with automatic retries and dead-letter queues
- FIFO queues give you exactly-once processing and strict ordering
- Deep AWS integration (Lambda, SNS, EventBridge, Step Functions)
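The at-least-once and dead-letter behavior above can be sketched with a toy in-memory queue. This is not the AWS API (no boto3, no visibility timeout clock — redelivery is modeled as immediate); it only illustrates why consumers must delete explicitly and how a redrive policy catches poison messages.

```python
# Toy model of SQS semantics: at-least-once delivery means a received-but-
# not-deleted message comes back, and a redrive policy moves a message to
# the dead-letter queue once it has been received too many times.
class ToyQueue:
    def __init__(self, max_receive_count=3):
        self.messages = []
        self.dead_letter = []
        self.max_receive_count = max_receive_count

    def send(self, body):
        self.messages.append({"body": body, "receives": 0})

    def receive(self):
        while self.messages:
            msg = self.messages[0]
            msg["receives"] += 1
            if msg["receives"] > self.max_receive_count:
                # Redrive: the poison message goes to the DLQ for debugging.
                self.dead_letter.append(self.messages.pop(0)["body"])
                continue
            return msg
        return None

    def delete(self, msg):
        # Consumers must delete explicitly after successful processing.
        self.messages.remove(msg)

q = ToyQueue(max_receive_count=3)
q.send("order-123")

first = q.receive()
# The consumer "crashes" before calling delete(), so the message becomes
# visible again (in real SQS, after the visibility timeout expires).
second = q.receive()
third = q.receive()
exhausted = q.receive()   # fourth receive exceeds max_receive_count
```

A consumer that processes successfully would call `q.delete(first)`; skipping that is exactly the failure mode the DLQ exists to catch.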
Cons of Amazon SQS
- Maximum message size is 256 KB. Larger payloads need the S3 pointer pattern.
- Standard queues deliver at-least-once, not exactly-once
- FIFO queues cap at 300 messages/sec per API action (3,000 with batching); high-throughput mode raises the ceiling further, but standard queues remain the scale-out default
- No message replay. Once you delete a message, it is gone.
- Vendor lock-in to the AWS ecosystem
When to Use Amazon SQS
- You are on AWS and need a simple, reliable message queue
- You want zero operational overhead for messaging infra
- You are decoupling services that do not need streaming semantics
- Serverless architectures with Lambda-based consumers
When Not to Use Amazon SQS
- You need event streaming with replay and consumer groups (look at Kafka or Pulsar)
- Multi-cloud or on-premise deployments where portability matters
- High-throughput ordered streaming, because FIFO throughput limits will bite you
- Complex routing patterns (use RabbitMQ or an SNS+SQS combo instead)
Alternatives to Amazon SQS
RabbitMQ, Kafka, Apache Pulsar
Grafana Beyla — eBPF-based auto-instrumentation for zero-code observability
Category: Observability
## Why It Exists
Use Cases for Grafana Beyla
- HTTP/gRPC auto-instrumentation without touching application code
- Baseline RED metrics (Rate, Errors, Duration) for every service from day one
- Service mesh alternative for network-level telemetry collection
- Brownfield observability rollout across polyglot environments
- Bootstrapping trace context propagation before SDK adoption
Pros of Grafana Beyla
- Zero code changes required. Attach to a running process and get RED metrics and basic trace spans immediately
- Kernel-level visibility via eBPF uprobes and kprobes. Sees HTTP/gRPC/SQL calls that application-level instrumentation might miss
- Low overhead: ~200 MB RAM per node. Runs as a DaemonSet alongside OTel Collectors without competing for resources
- Complements OTel SDK instrumentation. Beyla provides the baseline, SDK adds depth. The two-tier model means every service has observability even before teams write instrumentation code
- Native OTLP export. Feeds directly into the OTel Collector pipeline so metrics and traces flow through the same processing, routing, and storage path as SDK-generated telemetry
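The RED signals Beyla derives in-kernel are simple to compute once you have request records. A minimal sketch of Rate, Errors, and Duration from invented sample data (Beyla does this from observed HTTP/gRPC traffic, not from an in-process list):

```python
# RED metrics from raw request records: (status_code, duration_ms).
requests = [
    (200, 12), (200, 8), (500, 40), (200, 15), (404, 5),
]
window_seconds = 10

rate = len(requests) / window_seconds                     # requests/sec
errors = sum(1 for s, _ in requests if s >= 500) / len(requests)
durations = sorted(d for _, d in requests)
p50 = durations[len(durations) // 2]                      # median duration, ms
```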
Cons of Grafana Beyla
- Linux-only. Requires kernel 5.8+ with BTF (BPF Type Format) support. No Windows, no macOS, no older kernels
- Limited to network-level signals. Cannot capture custom business metrics, application-specific span attributes, or baggage propagation. For those you still need the OTel SDK
- eBPF verifier constraints limit program complexity. Some edge cases in protocol parsing (non-standard HTTP framing, custom binary protocols) may not be detected
- No custom metric dimensions. The labels Beyla produces are fixed (service name, HTTP method, status code, URL path). You cannot add business context like customer_id or feature_flag
- Kernel upgrades can break eBPF programs. BTF relocations handle most cases, but major kernel version jumps require testing
When to Use Grafana Beyla
- Bootstrapping observability for existing services that have zero instrumentation today
- Providing baseline RED metrics and trace spans before teams adopt the OTel SDK
- Polyglot environments where maintaining SDK instrumentation across 5+ languages is impractical
- Validating that services are correctly instrumented by comparing eBPF-captured metrics against SDK-reported metrics
When Not to Use Grafana Beyla
- Custom business metrics like orders_placed or payment_amount. Use the OTel SDK with custom metric instruments
- Windows or macOS hosts. eBPF is a Linux kernel feature with no equivalent on other operating systems
- Kernels older than 5.8. Without BTF support, Beyla cannot attach its eBPF programs
- Deep trace instrumentation with custom span attributes, baggage propagation, or manual context injection. Use OTel SDK
Alternatives to Grafana Beyla
OpenTelemetry, OTel Collector, Prometheus, Grafana, Grafana Tempo, VictoriaMetrics
Cassandra — Distributed wide-column store built for punishing write loads
Category: Databases
## How It Works Internally
Use Cases for Cassandra
- Time-series data at scale
- IoT sensor data ingestion
- Messaging and chat history
- User activity tracking
- Product catalogs
- Write-heavy workloads
Pros of Cassandra
- Linear horizontal scalability
- No single point of failure (peer-to-peer)
- Tunable consistency levels per query
- Optimized for high write throughput
- Multi-datacenter replication built-in
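The linear scalability above comes from placing rows on a token ring. A miniature sketch: hash the partition key to a token, then walk clockwise for the replicas. Real Cassandra uses Murmur3 tokens and vnodes; the three fixed tokens and node names here are invented.

```python
import bisect
import hashlib

# Three nodes, each owning a slice of the token space.
RING = [(0, "node-a"), (2**32 // 3, "node-b"), (2 * 2**32 // 3, "node-c")]
_TOKENS = [t for t, _ in RING]

def token(partition_key):
    return int(hashlib.md5(partition_key.encode()).hexdigest(), 16) % 2**32

def replicas(partition_key, rf=2):
    # Owner = first node whose token is >= the key's token (wrapping),
    # then the next rf-1 nodes clockwise around the ring.
    i = bisect.bisect_left(_TOKENS, token(partition_key)) % len(RING)
    return [RING[(i + k) % len(RING)][1] for k in range(rf)]

owners = replicas("user:42", rf=2)
```

Because placement is pure hashing, any coordinator can compute where a row lives without a central directory — which is also why adding a node only moves a bounded slice of the data.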
Cons of Cassandra
- Limited query flexibility (no joins, only limited single-partition aggregations)
- Data modeling driven by query patterns
- Eventual consistency by default
- Operational complexity (compaction, tombstones, repairs)
- Read performance depends heavily on data model
When to Use Cassandra
- Need to handle millions of writes per second
- Data is naturally partitioned (time-series, per-user)
- Require multi-region active-active deployment
- Availability matters more than strong consistency
When Not to Use Cassandra
- Need complex queries with joins
- Dataset is small (< 10 GB)
- Require strong consistency for every read
- Ad-hoc querying is a primary use case
Alternatives to Cassandra
DynamoDB, MongoDB, PostgreSQL
Ceph — The storage system that runs CERN's physics data and DigitalOcean's block storage
Category: Storage
## Why It Exists
Use Cases for Ceph
- Exabyte-scale object storage via RADOS Gateway (S3/Swift compatible)
- Block storage for VMs and containers (RBD)
- Shared filesystem for HPC and research clusters (CephFS)
- Private cloud storage backend (OpenStack Cinder/Manila)
- Data lake storage for analytics at petabyte+ scale
- Unified storage platform replacing separate block/file/object systems
Pros of Ceph
- Proven at exabyte scale. CERN stores hundreds of petabytes of physics data on it.
- CRUSH algorithm places data across failure domains (racks, AZs) without a central lookup
- Three storage interfaces from one cluster: object (RGW), block (RBD), file (CephFS)
- Configurable erasure coding profiles (RS, LRC, SHEC) per storage pool
- Self-healing. Detects failed OSDs and automatically rebalances and reconstructs data.
- No single point of failure. Monitors, OSDs, and metadata servers are all distributed.
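The CRUSH idea — placement is computed, not looked up in a central table — can be imitated with rendezvous (highest-random-weight) hashing. This is a stand-in, not the real CRUSH algorithm (which walks a weighted hierarchy map); the cluster layout and names are invented.

```python
import hashlib

# Toy cluster: racks are the failure domain, OSDs live inside racks.
CLUSTER = {
    "rack-a": ["osd-1", "osd-2"],
    "rack-b": ["osd-3", "osd-4"],
    "rack-c": ["osd-5", "osd-6"],
}

def _score(obj, item):
    # Deterministic per (object, item) pair; comparisons pick the "winner".
    return hashlib.sha256(f"{obj}/{item}".encode()).hexdigest()

def place(obj, replicas=3):
    # Pick `replicas` distinct racks by hash score, then one OSD per rack,
    # so no two copies of the object share a rack.
    racks = sorted(CLUSTER, key=lambda r: _score(obj, r), reverse=True)[:replicas]
    return [max(CLUSTER[r], key=lambda o: _score(obj, o)) for r in racks]

osds = place("object-42")
```

Any client with the cluster map computes the same placement, which is how Ceph avoids a central metadata lookup on the data path.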
Cons of Ceph
- Operational complexity is high. Requires dedicated ops team who understand PG states, CRUSH maps, OSD tuning, and recovery dynamics.
- Performance tuning is hard. Dozens of settings interact (PG count, bluestore cache, recovery limits, scrub schedules). Wrong defaults cause silent degradation.
- Recovery storms can saturate the network. When an OSD dies, all PGs on that OSD start rebuilding simultaneously unless you rate-limit.
- BlueStore (the storage backend) has known edge cases with fragmentation on long-running clusters.
- Upgrading across major versions requires careful planning. Rolling upgrades are supported but risky if you skip versions.
- Not designed for small deployments. Minimum viable cluster is 3 nodes, but you really want 5+ for production.
When to Use Ceph
- Need petabyte to exabyte scale on your own hardware
- Require rack-aware and AZ-aware data placement
- Want object, block, and file storage from one platform
- Have an ops team with distributed systems experience
- Building private or hybrid cloud infrastructure
When Not to Use Ceph
- Small team without Ceph operational experience
- Storage under 100 TB (MinIO is simpler)
- Need a managed service with zero ops
- Performance-sensitive workloads where tuning time is limited
- Greenfield project where S3 is an option and data sovereignty isn't a concern
Alternatives to Ceph
MinIO, SeaweedFS, RocksDB
ClickHouse — The columnar OLAP database that actually delivers on sub-second analytics
Category: Search & Analytics
## How It Works Internally
Use Cases for ClickHouse
- Real-time analytics dashboards
- Ad tech and clickstream analysis
- Time-series data at scale
- Business intelligence queries
- Log analytics (a serious alternative to Elasticsearch)
- A/B test result analysis
Pros of ClickHouse
- Analytical queries are absurdly fast thanks to columnar storage
- Excellent compression ratios, often 10-20x
- Handles billions of rows with sub-second response times
- SQL-compatible query interface, so the learning curve is gentle
- Materialized views and real-time aggregation work out of the box
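Those 10-20x compression ratios come from columnar layout: a sorted, low-cardinality column is mostly runs of repeated values. Run-length encoding below is a toy stand-in for ClickHouse's real codecs (LZ4, ZSTD, Delta), but it shows why sorting by the right key matters so much.

```python
# RLE on a sorted low-cardinality column, as an ORDER BY key would produce.
def rle_encode(column):
    runs = []
    for v in column:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# One million events but only three distinct country codes, stored sorted.
column = ["DE"] * 400_000 + ["FR"] * 350_000 + ["US"] * 250_000
runs = rle_encode(column)
ratio = len(column) / len(runs)   # values stored vs runs stored
```

The same column shuffled into row order would compress far worse — which is the intuition behind "think carefully about schema design or performance tanks."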
Cons of ClickHouse
- Not built for OLTP or point lookups. Do not try.
- No full ACID transactions
- Updates and deletes are expensive (async mutations under the hood)
- You have to think carefully about schema design or performance tanks
- Smaller ecosystem than PostgreSQL or Elasticsearch
When to Use ClickHouse
- You need sub-second queries over billions of rows
- Analytical/OLAP workloads heavy on aggregations
- Real-time dashboards and reporting systems
- You want a cost-effective alternative to managed analytics services
When Not to Use ClickHouse
- OLTP workloads with frequent updates/deletes
- You need full transaction support
- Small datasets that fit comfortably in PostgreSQL
- Full-text search (use Elasticsearch instead)
Alternatives to ClickHouse
Elasticsearch, PostgreSQL, Spark
CockroachDB — Distributed SQL that actually survives failures
Category: Databases
CockroachDB is the tool to reach for when the requirements include the relational model, real ACID transactions, and the ability to scale horizontally across regions. It draws heavily from Google's Spanner paper, but it can actually run without owning a fleet of atomic clocks. The core problem it solves is real: scaling a relational database horizontally without giving up consistency. Anyone who has operated a sharded Postgres cluster with application-level routing knows exactly why this matters.
Use Cases for CockroachDB
- Multi-region apps that need real transactions
- Financial systems where consistency is non-negotiable
- Global SaaS platforms serving users across continents
- Replacing painfully sharded PostgreSQL setups
- Disaster recovery without someone getting paged at 3am
Pros of CockroachDB
- Distributed ACID transactions that actually work
- Automatic horizontal scaling and rebalancing
- PostgreSQL-compatible wire protocol
- Multi-region with locality-aware reads
- Survives node, rack, and datacenter failures
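The "consensus overhead" trade-off in the cons below has a simple shape: a write commits only when a majority of replicas acknowledge it, so commit latency is set by the majority-th fastest replica, not the fastest one. A toy model with invented numbers (not CockroachDB's actual Raft implementation):

```python
# A write commits once a majority (quorum) of replicas acknowledge it.
def commit(replica_acks):
    needed = len(replica_acks) // 2 + 1
    return sum(replica_acks) >= needed

def commit_latency(replica_latencies_ms):
    # The write is as fast as the quorum-th fastest ack — for 3 replicas,
    # the 2nd fastest — which is the consensus tax on every write.
    acked = sorted(replica_latencies_ms)
    return acked[len(acked) // 2]

survives_one_failure = commit([True, True, False])    # 2 of 3: commits
survives_two_failures = commit([True, False, False])  # 1 of 3: blocks
latency = commit_latency([2, 9, 40])                  # ms per replica ack
```

This is also why the cluster survives a node failure without losing writes: a quorum of 2 out of 3 still exists.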
Cons of CockroachDB
- Write latency is higher because of consensus overhead
- Not fully PostgreSQL-compatible (missing extensions will surprise you)
- Steep learning curve if you have never operated distributed SQL
- Overkill and expensive for small datasets that fit on one Postgres box
When to Use CockroachDB
- You need horizontal scaling with SQL and ACID, not just one of them
- Multi-region deployment with strong consistency
- You want automatic failover without manual runbooks
- Your single PostgreSQL instance is starting to sweat
When Not to Use CockroachDB
- Single-region, small data volume. Just use Postgres.
- Ultra-low latency requirements where single-node PG is measurably faster
- You depend on PostgreSQL extensions like PostGIS or pg_cron
- Budget-constrained projects where the infra cost is hard to justify
Alternatives to CockroachDB
PostgreSQL, Cassandra, DynamoDB
DynamoDB — AWS managed NoSQL database with single-digit millisecond reads at any scale
Category: Databases
For teams building on AWS with well-defined access patterns, DynamoDB is probably the right database. Full stop. It is the one NoSQL service where operations are truly handed off to AWS with no need to think about nodes, disk space, and failover. The catch is that convenience comes at a price, both in dollars and in flexibility. The data model must be designed around queries upfront, and changing access patterns later is painful.
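"Design around queries upfront" is concrete in the single-table pattern: items live under a partition key (PK) and sort key (SK), and a Query reads one partition, optionally narrowed by an SK prefix. A toy sketch — the key scheme and item shapes are invented, and this is a plain dict, not the DynamoDB API:

```python
# PK -> list of (SK, item), kept sorted by SK like a DynamoDB partition.
table = {}

def put_item(pk, sk, item):
    rows = table.setdefault(pk, [])
    rows.append((sk, item))
    rows.sort(key=lambda r: r[0])

def query(pk, sk_prefix=""):
    # Cheap in DynamoDB: one partition, contiguous sort-key range.
    return [item for sk, item in table.get(pk, []) if sk.startswith(sk_prefix)]

put_item("USER#42", "PROFILE", {"name": "Ada"})
put_item("USER#42", "ORDER#2024-01-15", {"total": 30})
put_item("USER#42", "ORDER#2024-02-03", {"total": 55})

orders = query("USER#42", sk_prefix="ORDER#")
```

"All orders for user 42" is one efficient query. "All orders over $50 across all users" is not expressible this way — that access pattern needs a GSI designed in advance, which is the flexibility cost mentioned above.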
Use Cases for DynamoDB
- Serverless application backends
- Gaming leaderboards and player state
- Shopping carts and user preferences
- IoT data ingestion
- Session management
- Ad tech and real-time bidding
Pros of DynamoDB
- Fully managed, zero operational overhead
- Single-digit millisecond latency at any scale
- Automatic scaling with on-demand capacity
- Built-in DAX caching layer
- Global Tables for multi-region replication
Cons of DynamoDB
- Vendor lock-in to AWS
- Expensive at large scale compared to self-managed alternatives
- 25 GSI limit per table
- Item size limited to 400 KB
- Complex pricing model (RCU/WCU)
When to Use DynamoDB
- Building on AWS and want zero ops overhead
- Need predictable single-digit millisecond performance
- Access patterns are known and well-defined
- Serverless architectures with Lambda
When Not to Use DynamoDB
- Need complex relational queries or joins
- Want multi-cloud or vendor-neutral solution
- Data model is highly relational
- Need full-text search capabilities
Alternatives to DynamoDB
Cassandra, MongoDB, Redis
Elasticsearch — The search engine most teams reach for first, built on Lucene
Category: Search & Analytics
## How It Works Internally
Use Cases for Elasticsearch
- Full-text search across large datasets
- Log and event data analysis (ELK stack)
- E-commerce product search
- Application performance monitoring
- Security analytics (SIEM)
- Autocomplete and suggestions
Pros of Elasticsearch
- Near real-time full-text search
- Horizontally scalable with automatic sharding
- Rich query DSL with aggregations
- Schema-free JSON documents
- Powerful text analysis and tokenization
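Full-text search rests on an inverted index: a map from term to the documents containing it. A minimal sketch — real Lucene adds stemming, relevance scoring, term positions, and compressed postings lists:

```python
docs = {
    1: "the quick brown fox",
    2: "quick brown cats",
    3: "lazy dogs sleep",
}

# Build the inverted index: term -> set of doc ids.
index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

def search(query):
    # AND semantics: intersect the postings list of each query term.
    postings = [index.get(t, set()) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

hits = search("quick brown")
```

Lookup cost is driven by postings-list size, not corpus size — which is why this structure scales to "real-time search across millions of documents."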
Cons of Elasticsearch
- Not a primary data store (no ACID transactions)
- High memory consumption for indexing
- Split-brain risk without careful cluster config
- Complex capacity planning and tuning
- License changes (SSPL) may affect deployment
When to Use Elasticsearch
- Need full-text search with relevance scoring
- Log aggregation and analytics (ELK/EFK stack)
- Real-time search across millions of documents
- Complex aggregations and faceted search
When Not to Use Elasticsearch
- Primary data store for transactional workloads
- Simple key-value lookups
- Strong consistency requirements
- Limited infrastructure budget (resource-hungry)
Alternatives to Elasticsearch
ClickHouse, PostgreSQL, MongoDB
Elixir / BEAM — The runtime that handles millions of concurrent connections where Java and Go need hundreds of servers
Category: Messaging
## Why BEAM Exists
Use Cases for Elixir / BEAM
- Real-time chat and messaging platforms (Discord, WhatsApp-scale WebSocket handling)
- Notification delivery layer holding millions of concurrent push connections
- IoT device communication (millions of persistent connections from sensors and devices)
- Telecom infrastructure (call routing, signaling, session management)
- Live collaboration features (presence indicators, cursor tracking, typing indicators)
- API gateways and connection proxies for high-concurrency workloads
Pros of Elixir / BEAM
- Millions of lightweight processes per machine. Each process is ~2 KB vs ~1 MB for an OS thread. One server handles what takes hundreds of servers in Java/Go
- Built-in distributed process registry across nodes. No Redis or external coordination needed to track which user is on which server
- Fault tolerance via supervision trees. A crashed process is restarted automatically without affecting other processes. WhatsApp achieved 99.999% uptime with this
- Hot code upgrades. Deploy new code without dropping connections. Telecom systems ran for years without restart
- Battle-tested runtime. Erlang/BEAM has powered telecom switches since 1986. Elixir (2012) adds modern syntax and tooling on the same VM
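The "let it crash" restart policy can be sketched in any language; what cannot be sketched outside the BEAM is the cheap process isolation that makes it practical at scale. This Python toy shows only the supervisor's restart logic (the function names and restart limit are invented, and OTP supervisors offer several strategies beyond this one-for-one shape):

```python
# A supervisor runs a worker and restarts it on crashes, escalating to
# its own parent once the restart budget is exhausted.
def supervise(worker, max_restarts=3):
    restarts = 0
    while True:
        try:
            return worker()
        except Exception:
            restarts += 1
            if restarts > max_restarts:
                raise  # escalate: let the parent supervisor decide
            # otherwise fall through and restart the worker fresh

attempts = {"n": 0}

def flaky_worker():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("transient crash")
    return "ok"

result = supervise(flaky_worker)
```

The point of the pattern: transient failures are handled by restarting from known-good state instead of defensive try/catch at every call site.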
Cons of Elixir / BEAM
- Smaller ecosystem than Java/Go/Node. Fewer libraries, fewer framework choices, fewer Stack Overflow answers
- Limited hiring pool. Finding experienced Elixir developers is harder than Java or Go developers
- Not suited for CPU-heavy work. Number crunching, ML inference, image processing are all better in Go/Rust/C++. BEAM is optimized for I/O concurrency, not compute
- Learning curve for OTP patterns. Supervisors, GenServers, and the 'let it crash' philosophy require a mental model shift from try/catch thinking
- Distributed Erlang has a fully connected mesh topology. Past ~50-100 nodes, the mesh becomes expensive. Large deployments need clustering libraries like libcluster or partisan
When to Use Elixir / BEAM
- The system needs to hold millions of concurrent WebSocket or TCP connections with minimal infrastructure
- Real-time messaging or notification delivery where connection routing is the bottleneck
- Fault tolerance matters and process-level isolation is preferred over container restarts
- The project is a connection/delivery layer while business logic stays in Java/Go/Python
When Not to Use Elixir / BEAM
- CPU-bound workloads like ML inference, video encoding, or heavy computation
- The team has zero Erlang/Elixir experience and the project timeline is tight
- Simple request-response APIs where Go or Java already meet the latency and throughput needs
- A massive library ecosystem is needed for third-party integrations (payment SDKs, cloud provider clients)
Alternatives to Elixir / BEAM
Redis, Kafka
Envoy Proxy — The L7 proxy that actually solved the 'every service does networking differently' problem
Category: API Infrastructure
## Why It Exists
Use Cases for Envoy Proxy
- Service mesh data plane (sidecar proxy sitting next to each service)
- API gateway and edge proxy for external traffic
- Load balancing with algorithms like least-request, ring-hash, and Maglev
- Automatic L7 metrics, distributed tracing, and access logging without touching app code
- Traffic management: retries, circuit breaking, rate limiting, fault injection
- TLS termination and mutual TLS (mTLS) for zero-trust networking
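Least-request balancing, in its common two-choice form, is simple to picture: sample two hosts at random and send the request to the one with fewer outstanding requests. A sketch with invented host names and static counts (a real balancer updates the counts as requests start and finish):

```python
import random

# Power-of-two-choices: compare two random hosts, pick the less loaded.
def pick_host(active_requests, rng):
    a, b = rng.sample(list(active_requests), 2)
    return a if active_requests[a] <= active_requests[b] else b

rng = random.Random(7)
active = {"pod-a": 12, "pod-b": 3, "pod-c": 90}

picks = [pick_host(active, rng) for _ in range(1000)]
# The overloaded pod-c loses every comparison it appears in.
overloaded_share = picks.count("pod-c") / len(picks)
```

Two random samples are enough to steer traffic sharply away from hot hosts without the coordination cost of ranking the whole fleet on every request.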
Pros of Envoy Proxy
- Understands L7 protocols (HTTP/2, gRPC, WebSocket, MongoDB, Redis), so it can make smart routing decisions
- Dynamic configuration via xDS APIs. No restarts, no reloads. Config just shows up.
- Ships with Prometheus metrics, distributed tracing, and structured access logs out of the box
- Battle-tested at serious scale (Lyft, Google, Stripe, Airbnb)
- CNCF graduated project with a stable API and an active community
Cons of Envoy Proxy
- Each sidecar eats 50-100 MB of memory. That adds up fast.
- The xDS API has a steep learning curve. Expect a few weeks before your team is comfortable.
- Sidecar proxying adds 0.5-2ms of tail latency per hop
- Debugging proxy issues means you need to understand L7 protocol internals
- Filter chain ordering is easy to mess up, and misconfigurations cause subtle routing bugs
When to Use Envoy Proxy
- Microservices that need consistent load balancing and observability across languages
- Service mesh deployments (Istio, Consul Connect, or your own custom setup)
- You need canary deployments, traffic shifting, or fault injection
- Zero-trust networking with mutual TLS between all services
When Not to Use Envoy Proxy
- A monolith with no inter-service communication. Envoy has nothing to do here.
- Environments where 50-100 MB of memory overhead per pod is a dealbreaker
- Teams without the bandwidth to learn and operate L7 proxy infrastructure
- Pure L4 load balancing needs. Just use IPVS or a simpler L4 proxy.
Alternatives to Envoy Proxy
gRPC, Nginx, Kong
etcd — The consensus store that Kubernetes literally cannot run without
Category: Coordination
## Why It Exists
Use Cases for etcd
- Kubernetes cluster state storage
- Service discovery
- Distributed configuration
- Leader election
- Feature flag management
- Distributed locking
Pros of etcd
- Strong consistency via Raft consensus
- Simple HTTP/gRPC API
- Watch API for real-time change notifications
- Lease-based TTL for ephemeral keys
- Foundation of Kubernetes control plane
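Lease-based TTL is the mechanism behind service discovery and health checking: a key attached to a lease vanishes when the lease expires unless the client keeps sending keep-alives. A toy sketch with a fake clock — the class and method names are invented, not etcd's API:

```python
class LeaseStore:
    def __init__(self):
        self.now = 0
        self.data = {}   # key -> (value, expires_at or None)

    def put(self, key, value, lease_ttl=None):
        expires = self.now + lease_ttl if lease_ttl else None
        self.data[key] = (value, expires)

    def keep_alive(self, key, lease_ttl):
        # Heartbeat: push the expiry forward without rewriting the value.
        value, _ = self.data[key]
        self.data[key] = (value, self.now + lease_ttl)

    def tick(self, seconds):
        self.now += seconds
        self.data = {k: v for k, v in self.data.items()
                     if v[1] is None or v[1] > self.now}

    def get(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None

store = LeaseStore()
store.put("/services/api/instance-1", "10.0.0.5:8080", lease_ttl=10)
store.tick(8)
store.keep_alive("/services/api/instance-1", lease_ttl=10)  # heartbeat
store.tick(8)
alive = store.get("/services/api/instance-1")   # renewed, still registered
store.tick(8)
gone = store.get("/services/api/instance-1")    # no heartbeat: expired
```

A service that crashes simply stops heartbeating and drops out of discovery automatically — no explicit deregistration path needed.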
Cons of etcd
- Not designed for large data volumes (recommended < 8 GB)
- Write latency depends on cluster size and network
- All data must fit in memory
- Limited query capabilities (prefix-based only)
- Compaction needed to prevent unbounded growth
When to Use etcd
- Kubernetes or cloud-native infrastructure
- Need strongly consistent configuration store
- Service discovery with health checking
- Distributed coordination in Go-based systems
When Not to Use etcd
- General-purpose data storage
- Large datasets (> 8 GB)
- High write throughput requirements
- Complex query patterns
Alternatives to etcd
ZooKeeper, Kafka
Flink — The streaming engine that treats batch as a special case, not the other way around
Category: Stream Processing
## Why It Exists
Use Cases for Flink
- Real-time fraud detection
- Event-driven applications
- Real-time ETL pipelines
- Complex event processing (CEP)
- Real-time machine learning inference
- Continuous monitoring and alerting
Pros of Flink
- True event-time processing with watermarks
- Exactly-once state consistency
- Low-latency stream processing
- Sophisticated windowing (tumbling, sliding, session)
- Unified batch and stream processing
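Event-time windows and watermarks — the first two pros — fit in a few lines once reduced to their essentials. Events carry their own timestamps and may arrive out of order; a window fires only when the watermark (max seen event time minus an allowed lateness) passes its end. The window size, lateness, and events below are invented:

```python
WINDOW = 10       # tumbling window size, seconds
LATENESS = 5      # watermark lags the max seen event time by this much

def run(events):
    windows = {}  # window start -> buffered values
    fired = {}    # window start -> emitted aggregate
    max_ts = 0
    for ts, value in events:              # (event_time, payload)
        start = ts // WINDOW * WINDOW     # assign by event time
        windows.setdefault(start, []).append(value)
        max_ts = max(max_ts, ts)
        watermark = max_ts - LATENESS
        # Fire every window whose end the watermark has passed.
        for s in [s for s in windows if s + WINDOW <= watermark]:
            fired[s] = sum(windows.pop(s))
    return fired

# The event at t=7 arrives after t=12 but still lands in window [0, 10),
# because assignment uses event time, not arrival order.
result = run([(1, 10), (12, 1), (7, 5), (16, 2), (23, 4), (31, 0)])
```

The lateness knob is the core trade-off: a larger value tolerates more disorder but delays every result; real Flink adds allowed-lateness side outputs and per-partition watermarks on top of this idea.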
Cons of Flink
- Steep learning curve
- Complex cluster management
- Checkpointing can impact latency under load
- Smaller community than Spark
- Resource-intensive for stateful operations
When to Use Flink
- Need true real-time processing (low latency)
- Complex event patterns with event-time semantics
- Exactly-once processing guarantees required
- Stateful stream processing (aggregations, joins)
When Not to Use Flink
- Simple batch processing jobs
- Small data volumes that don't justify the complexity
- Team lacks streaming expertise
- Ad-hoc analytical queries (use Spark or ClickHouse)
Alternatives to Flink
Spark, Kafka Streams, Kafka
FoundationDB — The ordered, transactional key-value store that Apple trusts with iCloud
Category: Databases
## Why It Exists
Use Cases for FoundationDB
- Metadata store for object storage systems
- Strongly consistent ordered key-value layer
- Multi-model database foundation (document, graph, relational layers on top)
- Distributed ACID transactions across shards
- Cloud infrastructure control plane storage
- Record layer for structured data at scale
Pros of FoundationDB
- Serializable ACID transactions across the entire keyspace
- Ordered keys with efficient range scans
- Extremely high write throughput (millions of writes/sec at scale)
- Simulation testing framework catches bugs before production
- Multi-tenant isolation at the key prefix level
- Automatic sharding and rebalancing
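"Ordered keys with efficient range scans" is the property layers build on: keys sharing a prefix are contiguous in sorted order, so a prefix read is two binary searches plus a sequential scan. A sketch using sorted Python lists — the key scheme is invented, and real FoundationDB keys are byte strings inside transactions:

```python
import bisect

keys, values = [], []   # kept sorted by key

def set_kv(key, value):
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        values[i] = value
    else:
        keys.insert(i, key)
        values.insert(i, value)

def get_range(prefix):
    # All keys under a prefix form one contiguous slice of the keyspace.
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\xff")
    return list(zip(keys[lo:hi], values[lo:hi]))

set_kv("user/42/name", "Ada")
set_kv("user/42/email", "ada@example.com")
set_kv("user/99/name", "Grace")

user_42 = get_range("user/42/")
```

This is how document, relational, and queue layers get modeled on a plain ordered key-value store: encode structure into key prefixes and read it back with range scans.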
Cons of FoundationDB
- 5-second transaction time limit, so long-running transactions must be broken up
- Value size limit of 100 KB, so large blobs must be chunked
- Key size limit of 10 KB
- Operational complexity at scale (coordinators, storage servers, proxies)
- Smaller community than PostgreSQL or Cassandra
- Limited built-in query language (requires layers or client logic)
When to Use FoundationDB
- Need ordered key-value store with ACID transactions
- Building infrastructure that requires strong consistency at scale (metadata stores, control planes)
- Want to build custom data models on a reliable foundation
- Need cross-shard transactions without 2PC complexity
When Not to Use FoundationDB
- Simple CRUD applications (PostgreSQL is easier)
- Analytics workloads (use ClickHouse or a columnar store)
- Need values larger than 100 KB without chunking
- Small team without distributed systems experience
- Need a full SQL query engine out of the box
Alternatives to FoundationDB
etcd, CockroachDB, RocksDB, TiKV
Google S2 Geometry — The spherical geometry library behind Google Maps, Spanner, and every serious geospatial system
Category: Geospatial
## Why This Matters
Use Cases for Google S2 Geometry
- Spatial indexing for location-based queries (find all restaurants within 2km)
- Geo-fencing and point-in-region containment checks
- Ride-sharing supply/demand matching by geographic cells
- Region covering for spatial database queries
- Proximity search at planetary scale
- Map tile generation and spatial partitioning
Pros of Google S2 Geometry
- Mathematically rigorous. Works on a sphere, not a flat plane, so distance calculations are accurate everywhere on Earth
- Hierarchical cell system gives you 31 levels of precision (0 through 30), from continent-sized to sub-centimeter
- Cell IDs are 64-bit integers, which means you can index them in any database with a B-tree
- Battle-tested at Google scale across Maps, Spanner, and dozens of internal services
- Hilbert curve ordering means spatially close points get numerically close cell IDs, perfect for range scans
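The "index spatial data with a B-tree" trick can be illustrated without S2 itself. Real S2 projects onto a cube and orders cells along a Hilbert curve; the simpler Morton (Z-order) code below shows the same underlying move — interleave the bits of a 2-D coordinate into one integer so that nearby points usually get nearby IDs, which any ordered index can range-scan:

```python
# Morton (Z-order) code: alternate the bits of x and y into one integer.
# This is NOT S2's algorithm, just the locality idea in its simplest form.
def interleave(x, y, bits=16):
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return code

a = interleave(10, 20)        # two horizontally adjacent grid cells...
b = interleave(11, 20)        # ...get consecutive codes here
far = interleave(1000, 2000)  # a distant point gets a distant code
```

S2 uses a Hilbert curve precisely because its locality is stronger than Z-order's (no large jumps between some adjacent cells), but the database-side payoff is identical: spatial proximity queries become integer range scans.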
Cons of Google S2 Geometry
- Steep learning curve. You need to understand spherical geometry and the cell hierarchy to use it well
- The C++ library is the reference. Ports to Go, Java, and Python vary in completeness
- No built-in persistence or query engine. It is a library, not a database
- Debugging spatial issues is hard without visualization tools
- Covering algorithms require tuning (max cells, min/max level) for your specific use case
When to Use Google S2 Geometry
- You are building a location-based service that needs to scale beyond PostGIS
- Spatial queries need to be converted to simple range scans on integer keys
- Your system runs on Spanner or another database without native spatial indexing
- You need consistent precision across the entire globe, not just near the equator
When Not to Use Google S2 Geometry
- Simple distance calculations where the haversine formula is good enough
- You already have PostGIS and your query patterns are well-served by its spatial functions
- Your team does not have the time to learn spherical geometry concepts
- You need a turnkey geospatial database, not a library to build on top of
Alternatives to Google S2 Geometry
PostgreSQL, Spanner, DynamoDB
Grafana — Open-source dashboarding and alerting that sits on top of whatever backends you already run
Category: Observability
## Why It Exists
Use Cases for Grafana
- Infrastructure and application monitoring dashboards
- Single-pane-of-glass observability across metrics, logs, and traces
- Business KPI dashboards with real-time data
- Incident response with correlated views from multiple sources
- SLO tracking and reporting for engineering teams
- IoT and sensor data visualization
Pros of Grafana
- 60+ data source plugins (Prometheus, Loki, Elasticsearch, PostgreSQL, CloudWatch, and more)
- Rich visualization library with 15+ panel types including graphs, heatmaps, and geo maps
- Dashboard-as-code with JSON models and Terraform provider
- Unified alerting across all data sources with a single rule engine
- Free and open-source core with enterprise features available in Grafana Cloud
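Dashboard-as-code follows from the fact that a Grafana dashboard is just a JSON document, so it can be generated, diffed, and version-controlled instead of hand-edited. A trimmed sketch carrying only a few of the real model's fields; the service names and metric label scheme are invented:

```python
import json

def make_dashboard(service):
    # Minimal dashboard model: title plus one timeseries panel whose
    # target is a PromQL query parameterized by service.
    return {
        "title": f"{service} overview",
        "panels": [
            {
                "type": "timeseries",
                "title": "Request rate",
                "targets": [
                    {"expr": f'sum(rate(http_requests_total{{service="{service}"}}[5m]))'}
                ],
            }
        ],
    }

# One generated dashboard per service, ready to commit or apply via API.
dashboards = [json.dumps(make_dashboard(s)) for s in ("checkout", "payments")]
```

Generating the JSON is also the practical answer to dashboard sprawl: templated dashboards stay consistent, and the diff in review shows exactly what changed.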
Cons of Grafana
- Dashboard sprawl is real. Organizations end up with hundreds of unmaintained dashboards nobody owns
- Complex dashboard JSON models are painful to manage without Terraform or Jsonnet
- Performance tanks when you pack too many panels or high-cardinality queries into one dashboard
- Steep learning curve for advanced PromQL/LogQL queries inside panels
- Plugin quality is inconsistent. Some community plugins are abandoned or buggy
When to Use Grafana
- You need a single visualization layer across multiple data sources
- You are building monitoring dashboards for Prometheus, Loki, or other time series data
- You want dashboard-as-code for version-controlled, reproducible observability
- Different teams use different backends and you need cross-team visibility
When Not to Use Grafana
- Data collection and storage. Grafana only visualizes, it does not store metrics. Use Prometheus for that
- Business intelligence with complex data transformations (use Looker, Tableau, or Metabase)
- Simple status pages (use Betteruptime, Statuspage, or a similar SaaS)
- Real-time streaming dashboards with sub-second updates (build a custom WebSocket solution instead)
Alternatives to Grafana
Prometheus, Elasticsearch, Kafka
gRPC — Google's RPC framework that actually delivers on the 'high performance' promise, built on HTTP/2 and Protocol Buffers
Category: API Infrastructure
## Why It Exists
Use Cases for gRPC
- Low-latency service-to-service calls in microservices
- Real-time streaming between services, including bidirectional
- Mobile client-to-server communication where bandwidth is tight
- Polyglot environments where your services are written in different languages
- Internal APIs where JSON serialization has become a measurable bottleneck
- IoT device communication over constrained networks
Pros of gRPC
- 10-100x faster serialization than JSON thanks to Protocol Buffers binary format
- HTTP/2 multiplexing kills head-of-line blocking and opens the door to streaming
- Strongly typed contracts with code generation for 12+ languages
- Four communication patterns: unary, server streaming, client streaming, bidirectional
- Built-in deadline propagation, cancellation, and metadata passing
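A taste of why Protocol Buffers payloads are compact: integers go on the wire as base-128 varints, one byte per 7 bits of value, with the high bit marking continuation. This encoder follows the varint scheme described in the protobuf encoding docs:

```python
def encode_varint(n):
    """Encode a non-negative integer as a protobuf-style base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

small = encode_varint(1)     # one byte on the wire
medium = encode_varint(300)  # two bytes, vs four ASCII characters in JSON
```

Combine varints with numeric field tags instead of repeated string keys, and the 10-100x serialization gap over JSON stops being surprising.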
Cons of gRPC
- Binary format is not human-readable. You need tooling just to inspect payloads.
- Browser support requires a gRPC-Web proxy since browsers don't expose raw HTTP/2
- Load balancing gets tricky. HTTP/2 connection reuse means L4 balancers won't cut it.
- Proto schema evolution demands real discipline around backward compatibility
- Smaller ecosystem of tooling and middleware compared to REST/JSON
When to Use gRPC
- Internal service-to-service calls where latency is on the critical path
- You need streaming: real-time updates, log tailing, event feeds
- Polyglot microservices that benefit from generated, typed client/server code
- High-throughput APIs where JSON serialization shows up in your flame graphs
When Not to Use gRPC
- Public APIs consumed by web browsers (just use REST or GraphQL)
- Simple CRUD services where shipping fast matters more than shaving milliseconds
- Teams that have never touched Protocol Buffers or schema management
- Environments where HTTP/2 is blocked or unsupported by network intermediaries
Alternatives to gRPC
Envoy Proxy, Kafka, Kong
H3 — Uber's hexagonal spatial index that makes geospatial aggregation and visualization actually intuitive
Category: Geospatial
## Why Uber Built This
Use Cases for H3
- Ride-sharing and delivery supply/demand heat maps
- Geospatial aggregation and analytics (average price per neighborhood)
- Coverage analysis for service areas and delivery zones
- Movement flow analysis between geographic regions
- Market segmentation by location
- Network coverage planning for telecom
Pros of H3
- Hexagons tile a plane with uniform adjacency. Every hex has exactly 6 neighbors, same distance from center to center
- Great for visualization and aggregation. Hex grids look clean on maps and avoid the visual bias of square grids
- 16 resolution levels from continent-scale down to ~1 m² per hex
- Hierarchical structure with predictable parent-child relationships
- Bindings for Python, JavaScript, Java, Go, Rust, and more. Well-maintained.
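Uniform adjacency — the first pro — is easy to verify on a flat axial-coordinate hex grid. Real H3 cells live on an icosahedron with 64-bit indexes; this sketch only demonstrates the property squares lack (4 edge neighbors plus 4 diagonal ones at a different distance):

```python
# The six axial-coordinate offsets to a hexagon's neighbors.
HEX_DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def neighbors(q, r):
    return [(q + dq, r + dr) for dq, dr in HEX_DIRECTIONS]

def center_distance(a, b):
    # Hex grid distance between cell centers, in axial coordinates.
    dq, dr = a[0] - b[0], a[1] - b[1]
    return (abs(dq) + abs(dr) + abs(dq + dr)) / 2

ring = neighbors(0, 0)
distances = [center_distance((0, 0), n) for n in ring]
```

Every neighbor sits at distance exactly 1, which is why flow analysis and k-ring aggregations behave uniformly on hexes but need distance corrections on square grids.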
Cons of H3
- Not a perfect tiling on a sphere. Uses 12 pentagons (unavoidable from topology) which can cause edge cases
- Parent-child mapping is not exact. A parent hex does not perfectly contain its 7 children due to aperture-7 subdivision
- Hex IDs are 64-bit, but their locality in the integer space is weaker than S2's Hilbert-curve ordering for range scan queries
- Less suited for point-in-polygon or containment queries compared to S2
- Community is strong but smaller than PostGIS or S2 ecosystems
When to Use H3
- You need to aggregate data by geographic area for analytics or visualization
- Heat maps, demand forecasting, or coverage analysis are core features
- Your team wants a simpler mental model than S2's quadtree cells
- Neighbor traversal and adjacency queries matter (routing, flow analysis)
When Not to Use H3
- Point-in-polygon containment checks (S2 or PostGIS are better)
- You need range-scan-friendly IDs for database spatial indexing (S2 is stronger here)
- Exact spatial containment where parent must perfectly contain children
- Your workload is purely distance-based queries (find nearest K points)
Alternatives to H3
Google S2 Geometry, PostgreSQL, ClickHouse
Hocuspocus — The Yjs WebSocket server that handles sync, auth, and persistence so teams do not have to
Category: Collaboration
Yjs handles the CRDT math, but it says nothing about servers. If two users need to collaborate on a document, their Yjs instances have to exchange updates somehow. One option is to build a raw WebSocket server that relays binary messages between clients. But then you need authentication (who can access this document?), persistence (where do edits go when the server restarts?), reconnection logic (what happens when a client drops off for 30 seconds?), and multi-server coordination (what if there are 3 servers behind a load balancer?). That is a lot of infrastructure code that has nothing to do with the product.
Use Cases for Hocuspocus
- WebSocket relay for Yjs document sync
- Server-side persistence of collaborative documents
- Authentication and authorization for document access
- Multi-server Yjs deployments with Redis pub/sub fan-out
Pros of Hocuspocus
- Native Yjs binary sync protocol. No JSON serialization overhead
- Hook-based lifecycle (onConnect, onAuthenticate, onLoadDocument, onStoreDocument) for custom logic
- Persistence adapters for PostgreSQL, SQLite, Redis, and S3
- Horizontal scaling via Redis pub/sub for multi-server coordination
Cons of Hocuspocus
- Node.js only. No Go, Rust, or Python server implementation
- Single-document-per-connection model complicates multi-document UIs
- No built-in rate limiting. Must be added separately
- Debugging sync issues requires understanding Yjs internals and the binary protocol
When to Use Hocuspocus
- Building collaborative features with Yjs and need a server for relay and persistence
- Need authentication before granting document access
- Running multiple server instances that need to stay in sync
- Want debounced persistence without building custom flush logic
When Not to Use Hocuspocus
- Non-Yjs collaboration (use Socket.IO or a custom WebSocket server)
- Simple real-time features like presence indicators (use Supabase Realtime or Pusher)
- Read-only document viewing with no editing
- Teams that need a server in Go, Rust, or Java
Alternatives to Hocuspocus
Yjs, TipTap, Redis
Hugging Face — The de facto open-source registry for ML models, datasets, and the Transformers library
Category: AI & ML
## Why It Exists
Use Cases for Hugging Face
- Pulling pre-trained ML models and running inference fast
- Fine-tuning foundation models on your own data
- Hosting and sharing models, datasets, and ML demos
- Running inference pipelines with the Transformers library
- Building ML workflows around standardized model interfaces
- Benchmarking and comparing model performance
Pros of Hugging Face
- Largest open model registry with 800K+ models and 200K+ datasets
- Transformers library gives you a unified API across PyTorch, TensorFlow, and JAX
- Hub supports versioned model and dataset hosting with Git LFS
- Inference Endpoints let you deploy a model to cloud GPUs in one click
- Active community with model cards, discussion forums, and leaderboards
Cons of Hugging Face
- Transformers abstractions can hide important implementation details from you
- Model quality is all over the place. No curation on community uploads
- Large model downloads eat significant bandwidth and storage
- Free tier rate limits will bite you in CI/CD pipelines
- Auto-classes can pull unexpected model variants if you don't pin versions
When to Use Hugging Face
- Need pre-trained models for NLP, vision, audio, or multimodal tasks
- Fine-tuning open-weight models on domain-specific data
- Sharing models and datasets within a team or with the community
- Rapid prototyping with current model architectures
When Not to Use Hugging Face
- Production inference at scale (use vLLM, TGI, or dedicated serving infrastructure)
- Training models from scratch with custom architectures (use raw PyTorch/JAX)
- Applications requiring proprietary models not on the Hub
- Air-gapped environments where downloading models is not an option
Alternatives to Hugging Face
vLLM, LangChain, RAG
InfluxDB — The purpose-built time series database that owns the IoT and metrics space
Category: Time Series
## How It Works Internally
Use Cases for InfluxDB
- IoT sensor data collection and analytics
- Infrastructure and application monitoring
- Real-time analytics on event streams
- Financial market data tracking
- Energy grid and smart meter telemetry
- DevOps metrics and SLA tracking
Pros of InfluxDB
- Purpose-built for time series from day one, not bolted on as an afterthought
- InfluxQL is SQL-like enough that most engineers pick it up in an afternoon
- Telegraf agent ecosystem covers 300+ integrations out of the box
- Built-in retention policies and continuous queries handle data lifecycle automatically
- Impressive write throughput for a single node, easily 500K+ points/sec
Cons of InfluxDB
- The open-source version (OSS) is single-node only. Clustering requires InfluxDB Cloud or Enterprise
- Flux query language is powerful but has a steep learning curve and not everyone loves it
- High cardinality series can cause memory issues and slow queries significantly
- Schema-on-write means you cannot change tag vs field decisions after the fact without rewriting data
- Delete operations are expensive and discouraged in practice
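The tag-vs-field decision is baked into InfluxDB's line protocol itself: tags are indexed metadata, fields are the measured values, and the split is fixed at write time. A minimal stdlib sketch of the wire format (no escaping of special characters, and integer fields would need an `i` suffix in the real protocol):

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build an InfluxDB line protocol string:
    measurement,tag=val,... field=val,... timestamp"""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_str = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

line = to_line_protocol(
    "cpu",
    {"host": "web01", "region": "eu"},   # tags: indexed, low cardinality
    {"usage_idle": 97.2},                # fields: the actual values
    1700000000000000000,
)
# -> cpu,host=web01,region=eu usage_idle=97.2 1700000000000000000
```

Once `host` is written as a tag, turning it into a field later means rewriting the data, which is the schema-on-write limitation from the cons above.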
When to Use InfluxDB
- You need a dedicated TSDB for metrics, IoT, or sensor data
- Your team wants something up and running fast with minimal configuration
- Write-heavy workloads where ingestion speed matters more than complex queries
- You already use the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor)
When Not to Use InfluxDB
- You need distributed clustering on open-source (look at VictoriaMetrics or TimescaleDB)
- Complex relational queries with joins across multiple measurements
- Your workload is more analytical/OLAP than time series (use ClickHouse)
- You need strong consistency guarantees for financial transactions
Alternatives to InfluxDB
Prometheus, TimescaleDB, Grafana
Kafka Streams — A stream processing library that ships inside your JVM app, not a cluster you babysit
Category: Stream Processing
## How It Works Internally
Use Cases for Kafka Streams
- Real-time data transformation
- Event-driven microservices
- Stream-table joins
- Real-time aggregations and windowing
- Data enrichment pipelines
- Lightweight ETL within Kafka
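Tumbling-window aggregation, the bread-and-butter Streams operation, can be sketched in plain Python. This is a conceptual stand-in for illustration, not the JVM API (in real Kafka Streams this is roughly `stream.groupByKey().windowedBy(...).count()` with state kept in RocksDB-backed stores):

```python
from collections import defaultdict

WINDOW_MS = 60_000  # 1-minute tumbling windows

def window_start(ts_ms):
    """Align an event timestamp to the start of its tumbling window."""
    return ts_ms - (ts_ms % WINDOW_MS)

def count_by_key_and_window(events):
    """events: iterable of (key, timestamp_ms) -> {(key, window_start): count}"""
    counts = defaultdict(int)
    for key, ts in events:
        counts[(key, window_start(ts))] += 1
    return dict(counts)

events = [("user-1", 5_000), ("user-1", 59_999), ("user-1", 61_000), ("user-2", 10)]
result = count_by_key_and_window(events)
assert result[("user-1", 0)] == 2       # two events in the first minute
assert result[("user-1", 60_000)] == 1  # the third rolls into the next window
```

The real library adds what this sketch ignores: fault-tolerant state, late-arrival grace periods, and repartitioning when you group by a new key.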
Pros of Kafka Streams
- No separate cluster needed. It runs as a library inside your app.
- Exactly-once processing semantics
- Elastic scaling via Kafka consumer groups
- Interactive queries on local state stores
- Simple deployment (just a JVM application)
Cons of Kafka Streams
- Locked to Kafka for both input and output
- JVM only (Java/Kotlin/Scala)
- Limited to Kafka's partitioning model
- State stores can grow large on disk
- Less capable than Flink for complex processing
When to Use Kafka Streams
- You already run Kafka and need straightforward stream processing
- You want to skip managing a separate processing cluster
- Building event-driven microservices
- You need exactly-once guarantees inside the Kafka ecosystem
When Not to Use Kafka Streams
- Processing data from non-Kafka sources
- You need advanced CEP or event-time processing
- Your team runs Python or another non-JVM stack
- Complex multi-stream joins and windowing beyond what KStreams handles well
Alternatives to Kafka Streams
Kafka, Flink, Spark
Kafka — The distributed commit log that became the backbone of event streaming
Category: Messaging
## How It Works Internally
Use Cases for Kafka
- Event-driven microservices
- Real-time data pipelines
- Log aggregation
- Change data capture (CDC) with Debezium
- Stream processing with Kafka Streams or Flink
- Activity tracking and clickstream analytics
- Audit logging and compliance event trails
- Decoupling producers and consumers across organizational boundaries
Pros of Kafka
- Absurdly high throughput (millions of messages/sec on modest hardware) thanks to sequential I/O and zero-copy transfer
- Durable message storage with configurable retention, from hours to forever
- Scales horizontally by adding brokers and partitions without downtime
- Strong ordering guarantee within partitions, which is enough for most use cases
- Rich ecosystem: Connect for integrations, Streams for processing, Schema Registry for governance
- KRaft mode eliminates ZooKeeper entirely, simplifying operations and reducing the component count
- Tiered storage offloads old data to object storage, decoupling compute from long-term retention costs
Cons of Kafka
- Operationally heavy even with KRaft. You still need to understand brokers, partitions, consumer groups, ISR, and replication.
- Wrong tool for low-latency request-reply patterns. If you need sub-millisecond RPC, look elsewhere.
- Consumer group rebalancing can stall your entire pipeline if not configured carefully
- No built-in message routing or filtering. Every consumer reads from a partition and filters client-side.
- Steep learning curve, especially around offset management, exactly-once semantics, and partition key design
- Partition count is hard to change after the fact. Repartitioning means re-keying all your data.
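Why partition count is so sticky follows from how keys are routed. A simplified partitioner sketch (Kafka's default actually uses murmur2; SHA-256 here is just a deterministic stand-in):

```python
import hashlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Deterministically map a message key to a partition."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Same key -> same partition, every time. That is what gives you
# per-key ordering: all events for order-42 land on one partition.
assert partition_for(b"order-42", 12) == partition_for(b"order-42", 12)

# And it is why changing the partition count is painful: the modulus
# changes, keys remap to different partitions, and per-key ordering
# across the resize is lost unless you re-key the topic.
```

This is also the argument for choosing partition keys up front: pick a key that spreads load evenly and keeps the events that must stay ordered together.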
When to Use Kafka
- You need durable, ordered event streaming at scale
- You're building event-driven, CQRS, or event-sourcing architectures
- High-throughput log or data pipeline ingestion (100K+ events/sec)
- Decoupling producers and consumers where replay capability matters
- CDC pipelines that capture database changes and fan them out to downstream systems
When Not to Use Kafka
- Simple task queues with low volume (use SQS or Redis streams instead)
- You need complex message routing, priority queues, or dead-letter exchanges (RabbitMQ is better suited)
- Request-reply messaging patterns where you need synchronous responses
- Small team without the bandwidth to operate distributed infrastructure (consider a managed service or a simpler queue)
- Message ordering across multiple partitions is a hard requirement (Kafka only guarantees order within a partition)
Alternatives to Kafka
RabbitMQ, Kafka Streams, Flink, Pulsar, Amazon SQS, Spark
Kong — Cloud-native API gateway built on NGINX
Category: API Infrastructure
## Why It Exists
Use Cases for Kong
- Centralized API gateway
- Authentication and authorization (OAuth, JWT, API keys)
- Rate limiting and throttling
- Request/response transformation
- API analytics and monitoring
- Service mesh with Kong Mesh
Pros of Kong
- Built on NGINX (via OpenResty), so you inherit its battle-tested performance
- Rich plugin ecosystem (100+ plugins)
- Supports declarative and database-backed configuration
- Kubernetes-native with Ingress Controller
- Open-source core with enterprise features
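Declarative (DB-less) mode boils down to one YAML file. A minimal sketch with hypothetical service names and illustrative values, combining auth and rate limiting on a single route:

```yaml
# kong.yml -- DB-less declarative config (Kong 3.x format)
_format_version: "3.0"
services:
  - name: orders-api
    url: http://orders.internal:8080
    routes:
      - name: orders-route
        paths: ["/orders"]
    plugins:
      - name: key-auth            # require an API key
      - name: rate-limiting
        config:
          minute: 60              # 60 requests/min per consumer
          policy: local
```

Point Kong at the file with `declarative_config` and it runs with no Postgres or Cassandra dependency at all.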
Cons of Kong
- Adds latency compared to running NGINX directly
- Plugin ecosystem quality is all over the place
- Enterprise features require a paid license
- Configuration gets messy at scale
- Database dependency (Postgres/Cassandra) for some modes
When to Use Kong
- You need a centralized API gateway with auth and rate limiting
- You're managing multiple APIs/microservices behind one entry point
- Kubernetes environments needing an ingress controller
- You want plugin-based extensibility without writing custom code
When Not to Use Kong
- Simple reverse proxy needs (just use NGINX directly)
- Ultra-low latency where every microsecond counts
- Tight budget constraints (enterprise features are paid)
- Simple static site serving
Alternatives to Kong
NGINX, Kafka
LangChain — The go-to framework for wiring up LLM apps with chains, agents, and retrieval
Category: AI & ML
## Why It Exists
Use Cases for LangChain
- Building RAG pipelines: loading docs, splitting them, embedding, and retrieving at query time
- Multi-step AI agents that pick tools and act on results
- Chatbots that actually remember previous turns
- Pulling structured data out of messy, unstructured text
- Routing requests across different LLMs based on task type
- Testing and evaluating LLM outputs (harder than it sounds)
Pros of LangChain
- Widest integration ecosystem out there (150+ LLMs, 50+ vector stores, 100+ tools)
- LangGraph lets you build agent workflows with cycles, branching, and persistent state
- LangSmith gives you real observability: tracing, evaluation, and debugging in production
- Good built-in abstractions for RAG, agents, and structured output
- Very active community, ships fast, docs are solid
Cons of LangChain
- Abstraction layers make debugging harder than you'd expect
- Breaking changes between versions happen more than they should
- Overkill for simple use cases. Sometimes a direct API call is all you need
- Performance overhead from chain composition and serialization
- LangGraph's state machine model takes time to click
When to Use LangChain
- Your LLM pipeline has multiple steps, tools, and data sources
- You need pre-built integrations with vector stores, LLMs, or tools
- You want to prototype LLM apps fast using well-known patterns
- You're building multi-agent systems that need state management and tool coordination
When Not to Use LangChain
- You're making a single LLM call with a static prompt. Just use the provider SDK.
- Latency matters so much that framework overhead is a problem
- Your team wants minimal dependencies and full control over every LLM interaction
- You need stability more than you need the latest features
Alternatives to LangChain
RAG, Vector Databases, MCP Server
MCP Server — The open protocol that lets AI models talk to your tools and data without custom glue code
Category: AI & ML
## Why It Exists
Use Cases for MCP Server
- Connecting LLMs to databases, APIs, and file systems
- Building AI-powered IDE extensions and developer tools
- Enterprise AI assistants that need secure access to internal systems
- Multi-tool AI agents that call out to external services
- Giving every LLM provider the same tool interface
- Context-aware AI workflows that pull in data on the fly
Pros of MCP Server
- Write one server, use it with any MCP-compatible client
- Three clean primitives (Tools, Resources, Prompts) cover most integration patterns
- Solid security model with OAuth 2.1 for remote servers
- Official SDKs for TypeScript and Python, plus a growing community ecosystem
- Kills the custom integration code you used to write for each model provider
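Under the hood, MCP is JSON-RPC 2.0 over stdio or HTTP. A sketch of what a `tools/call` request looks like on the wire; the tool name and arguments here are hypothetical:

```python
import json

def tools_call_request(request_id, tool_name, arguments):
    """Build an MCP-style JSON-RPC 2.0 tools/call request."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

# Hypothetical tool: the client asks the server to run a database query.
req = tools_call_request(1, "query_database", {"sql": "SELECT 1"})
wire = json.dumps(req)
assert json.loads(wire)["method"] == "tools/call"
```

The server responds with a result payload (or a JSON-RPC error), and the same three method families (`tools/*`, `resources/*`, `prompts/*`) cover the protocol's primitives.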
Cons of MCP Server
- Still a young protocol (launched 2024), so the ecosystem is catching up
- Remote server deployment adds latency compared to in-process tool calls
- Debugging distributed MCP chains gets painful fast
- Not every AI platform supports MCP natively yet
- Schema evolution and versioning need careful planning upfront
When to Use MCP Server
- Your AI tools need to hit external data sources or APIs
- You want a single integration that works across multiple AI clients
- You need authenticated, scoped access to enterprise systems from AI models
- You are building reusable tool servers that multiple AI apps can share
When Not to Use MCP Server
- Simple one-off LLM API calls that do not need tool access
- Latency-critical apps where any middleware overhead is a dealbreaker
- Apps locked into a single AI provider's native tool format
- Trivial tools where the protocol overhead costs more than the implementation
Alternatives to MCP Server
LangChain, RAG, gRPC
Memcached — The simplest distributed cache that actually works at scale
Category: Caching
Most caching debates start with "Redis or Memcached?" and the answer people want is always Redis. But if the requirement is a fast key-value cache, Memcached is the better tool. It's been running at the core of Facebook, Twitter, YouTube, and Wikipedia for over two decades. Brad Fitzpatrick built it for LiveJournal in 2003, and the design philosophy hasn't changed since. Key-value strings with TTL expiration. That's it. No data structures, no scripting, no persistence. That constraint is the whole point. It makes Memcached one of the most predictable and operationally boring (in the best way) components in a stack.
Use Cases for Memcached
- Database query result caching
- HTML fragment caching
- Session storage
- API response caching
- Object caching for web apps
Pros of Memcached
- Dead simple. Just key-value strings, nothing to overthink
- Multi-threaded architecture for high throughput
- Consistent hashing makes horizontal scaling straightforward
- Minimal memory overhead per item
- Mature and battle-tested at massive scale
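The consistent hashing that makes horizontal scaling straightforward lives in the client, not the server. A toy ring (virtual nodes omitted for brevity; node addresses are illustrative):

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring for picking a cache node per key."""
    def __init__(self, nodes):
        self._ring = sorted((self._hash(n), n) for n in nodes)
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first node at or after the key's hash.
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["cache-a:11211", "cache-b:11211", "cache-c:11211"])
node = ring.node_for("user:42:profile")
assert node in {"cache-a:11211", "cache-b:11211", "cache-c:11211"}
# Adding or removing one node only remaps the keys on the affected arc,
# instead of reshuffling nearly everything like `hash(key) % n` would.
```

Real clients add ~100+ virtual nodes per server so keys spread evenly, but the routing idea is exactly this.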
Cons of Memcached
- No persistence. Data gone on restart, full stop
- Only supports string values (no rich data structures)
- No built-in replication
- No pub/sub or advanced features
- Limited to key-value operations
When to Use Memcached
- Simple caching of serialized objects or query results
- Need multi-threaded performance for high concurrency
- Want a lightweight, low-overhead cache layer
- Already have a durable primary store
When Not to Use Memcached
- Need data persistence or durability
- Require rich data structures (use Redis)
- Need pub/sub or stream processing
- Want built-in replication and failover
Alternatives to Memcached
Redis, NGINX
MinIO — S3-compatible object storage you can run anywhere in 15 minutes
Category: Storage
## Why It Exists
Use Cases for MinIO
- Private cloud object storage with full S3 API compatibility
- On-prem replacement for AWS S3 (data sovereignty, compliance)
- Backend storage for AI/ML training pipelines (fast local reads)
- Kubernetes-native persistent storage for stateful workloads
- Data lake storage tier for Spark, Presto, Trino queries
- Backup target for databases and application data
Pros of MinIO
- Near-complete S3 API compatibility. Most S3 SDKs and tools work out of the box.
- Single Go binary. No JVM, no dependencies, no complex installation. Deploy in minutes.
- Reed-Solomon erasure coding per object. Configurable data/parity ratio.
- High throughput on NVMe/SSD. Designed for modern hardware, not spinning disks.
- Kubernetes-native via the MinIO Operator. First-class Helm charts and CRDs.
- Built-in bucket replication, versioning, lifecycle management, and encryption.
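The erasure-coding trade-off is plain arithmetic: split an object into data shards plus parity shards, and you can lose any `parity` shards while paying far less overhead than replication. A sketch of the generic Reed-Solomon math (MinIO's actual shard counts depend on the pool layout):

```python
def erasure_profile(data_shards: int, parity_shards: int, object_mb: float):
    """Overhead and failure tolerance for a data+parity erasure layout."""
    total = data_shards + parity_shards
    return {
        "shards": total,
        "tolerates_failures": parity_shards,      # any `parity` shards can be lost
        "storage_overhead": total / data_shards,  # vs 3.0x for triple replication
        "stored_mb": round(object_mb * total / data_shards, 2),
    }

# EC 8+4: a 100 MB object survives 4 drive losses at only 1.5x overhead.
p = erasure_profile(8, 4, 100.0)
assert p["tolerates_failures"] == 4
assert p["storage_overhead"] == 1.5
assert p["stored_mb"] == 150.0
```

Triple replication would store 300 MB for the same durability class, which is why erasure coding dominates at object-storage scale.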
Cons of MinIO
- No topology-aware placement (no CRUSH equivalent). Shards distribute across a server pool, but not rack/AZ-aware by default.
- Scaling requires adding full server pools. Adding a single node to an existing pool is not supported.
- Metadata is co-located with data (no separate metadata tier). At very large scale, this limits flexibility.
- Rebalancing after expansion is manual and can be slow for large datasets.
- Community edition lacks some enterprise features (LDAP, AD integration, audit logging require paid tier).
- Not battle-tested at exabyte scale. Designed for petabytes, not hundreds of petabytes.
When to Use MinIO
- Need S3-compatible storage on your own hardware or in any cloud
- Team wants simplicity over operational flexibility
- Data fits in petabyte range (up to ~10PB comfortably)
- Running on Kubernetes and want native integration
- AI/ML workloads that benefit from local high-throughput storage
When Not to Use MinIO
- Need exabyte-scale storage (Ceph or custom is more appropriate)
- Require fine-grained rack/AZ-aware placement control
- Want to scale incrementally by adding single nodes
- Need a managed service with zero ops (use S3 itself)
Alternatives to MinIO
Ceph, SeaweedFS, etcd
MongoDB — The document database you'll probably use at least once in your career
Category: Databases
MongoDB is everywhere. Anyone who has worked on a modern web stack has either used it or had to justify the alternative choice. The flexible document model and the low friction of getting started are genuinely useful, especially early in a project when the schema is still shifting. Since version 4.0, multi-document ACID transactions filled the biggest gap in the document model story, making Mongo a realistic option for transactional workloads that used to require a relational database. That said, do not confuse "viable" with "ideal." Postgres still wins for heavily relational data, and that is fine.
Use Cases for MongoDB
- Content management systems
- Product catalogs with varied attributes
- Mobile app backends
- Real-time analytics
- User profiles and personalization
- Prototyping and rapid iteration
Pros of MongoDB
- Flexible schema, no migrations needed
- Rich query language with aggregation pipeline
- Horizontal scaling via built-in sharding
- Native JSON document model
- Multi-document ACID transactions
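The aggregation pipeline is worth seeing, because it is just structured data: an ordered list of stage documents. Collection and field names here are hypothetical; the stage operators are real MongoDB:

```python
# Top 10 customers by shipped order value, expressed as pipeline stages.
pipeline = [
    {"$match": {"status": "shipped"}},                                # filter first
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},# aggregate
    {"$sort": {"total": -1}},                                         # rank
    {"$limit": 10},                                                   # cut off
]

# With pymongo this would run as: db.orders.aggregate(pipeline)
assert [next(iter(stage)) for stage in pipeline] == [
    "$match", "$group", "$sort", "$limit"
]
```

Stage order matters: putting `$match` first lets the server use an index and shrink the working set before grouping.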
Cons of MongoDB
- Joins are limited ($lookup, added in 3.2, is costly and far short of relational joins)
- WiredTiger's cache consumes significant RAM (roughly half of system memory by default)
- Denormalization leads to data duplication
- Write amplification with large documents
- Sharding requires careful key selection
When to Use MongoDB
- Schema evolves frequently (startups, MVPs)
- Data is naturally document-shaped (JSON)
- Need flexible querying over semi-structured data
- Rapid prototyping with changing requirements
When Not to Use MongoDB
- Highly relational data with many joins
- Need strong multi-row transactions across collections
- Write-heavy append-only workloads (prefer Cassandra)
- Strict schema enforcement is required
Alternatives to MongoDB
PostgreSQL, DynamoDB, Elasticsearch
Monocle (Datadog) — Datadog's shard-per-core Rust TSDB that replaced their legacy storage
Category: Time Series
## How It Works Internally
Use Cases for Monocle (Datadog)
- Trillion-scale metric ingestion
- Multi-tenant SaaS observability
- Sub-second query at petabyte scale
- Real-time anomaly detection backing store
Pros of Monocle (Datadog)
- Shard-per-core: each CPU core owns its own LSM-tree, memory allocator, and I/O queue. Zero cross-core locks.
- Written in Rust for predictable latency and memory safety without GC pauses
- RocksDB-based indexing layer for label lookups at billion-series cardinality
- Designed for multi-tenant isolation from day one (Datadog's SaaS model)
- Handles trillions of data points per day in production
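The shard-per-core idea itself is simple enough to sketch, even if Monocle's implementation is closed. A purely illustrative Python model (not Datadog's code): route each series to a fixed shard, and let each shard own its state privately so nothing is shared across cores:

```python
import zlib

NUM_SHARDS = 8  # stand-in for "one shard per CPU core"
shards = [dict() for _ in range(NUM_SHARDS)]  # each shard owns private state

def shard_for(series_key: str) -> int:
    """Deterministic routing: a series always lands on the same shard."""
    return zlib.crc32(series_key.encode()) % NUM_SHARDS

def write_point(series_key: str, ts: int, value: float):
    # Only the owning shard ever touches this series -> no cross-shard locks.
    shards[shard_for(series_key)].setdefault(series_key, []).append((ts, value))

write_point("cpu.user{host:web01}", 1_700_000_000, 42.0)
write_point("cpu.user{host:web01}", 1_700_000_010, 43.5)
owner = shard_for("cpu.user{host:web01}")
assert len(shards[owner]["cpu.user{host:web01}"]) == 2
```

The cost of the pattern shows up at query time: a query touching many series must fan out across shards and merge, which is where the real engineering effort goes.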
Cons of Monocle (Datadog)
- Proprietary: not available outside Datadog. Cannot be self-hosted or evaluated.
- No public API documentation or query language specification
- Architecture details come only from blog posts and conference talks, not source code
- Tightly coupled to Datadog's infrastructure (custom networking, deployment tooling)
- Not a viable option for build-vs-buy decisions; study-only value
When to Use Monocle (Datadog)
- You are evaluating Datadog as a managed observability vendor
- You want to study shard-per-core TSDB architecture for your own system design
- You need a reference point for what 'beyond open-source scale' looks like
When Not to Use Monocle (Datadog)
- You need a self-hosted or open-source TSDB (use VictoriaMetrics, Mimir, or InfluxDB instead)
- You want to run your own metrics infrastructure
- You need a system you can inspect, fork, or contribute to
Alternatives to Monocle (Datadog)
VictoriaMetrics, Prometheus, InfluxDB, TimescaleDB, RocksDB
NGINX — The web server that actually handles scale, plus reverse proxy and load balancer
Category: API Infrastructure
## Why It Exists
Use Cases for NGINX
- Reverse proxy and load balancing
- SSL/TLS termination
- Static file serving
- API gateway
- Rate limiting and access control
- HTTP caching
Pros of NGINX
- Handles massive concurrency through event-driven, non-blocking I/O
- Low memory footprint
- Battle-tested at serious scale
- Rich module ecosystem
- Supports HTTP, TCP, and UDP load balancing
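Reverse proxy, load balancing, and TLS termination fit in a handful of lines. A minimal sketch with placeholder addresses and cert paths:

```nginx
# Two app servers behind one TLS listener
upstream app_servers {
    server 10.0.0.11:8080;
    server 10.0.0.12:8080;
}

server {
    listen 443 ssl;
    server_name example.com;
    ssl_certificate     /etc/nginx/tls/example.crt;
    ssl_certificate_key /etc/nginx/tls/example.key;

    location / {
        proxy_pass http://app_servers;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```

The default `upstream` balancing is round-robin; `least_conn` or `ip_hash` are one-line changes inside the block.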
Cons of NGINX
- Configuration gets gnarly fast for advanced use cases
- Dynamic reconfiguration requires a reload
- Limited built-in API management features
- Free version is missing several enterprise features
- Lua scripting for advanced logic adds real complexity
When to Use NGINX
- You need a reverse proxy in front of application servers
- SSL termination and HTTP/2 support
- Serving static files alongside dynamic content
- Simple load balancing without a service mesh
When Not to Use NGINX
- You need a full API gateway with auth, rate limiting, analytics
- Service mesh with dynamic service discovery
- Complex traffic routing that needs programmatic control
- GraphQL-specific gateway features
Alternatives to NGINX
Kong, Kafka
OpenTelemetry — The vendor-neutral standard for instrumentation, collection, and export of telemetry data
Category: Observability
## Why It Exists
Use Cases for OpenTelemetry
- Unified instrumentation across metrics, traces, logs, and profiles with a single SDK
- Vendor-neutral telemetry export to any backend (Prometheus, Jaeger, Tempo, Datadog, etc.)
- Distributed tracing with automatic context propagation across service boundaries
- Custom business metrics (orders, revenue, SLA counters) alongside infrastructure metrics
- Fleet-wide telemetry pipeline processing via the OTel Collector (batching, routing, enrichment, sampling)
Pros of OpenTelemetry
- Single SDK for all four signals (metrics, traces, logs, profiles). One dependency instead of four separate libraries
- Vendor-neutral wire protocol (OTLP). Switch backends without changing application code. Export to VictoriaMetrics, Tempo, Datadog, or any OTLP-compatible receiver
- Automatic instrumentation libraries for most frameworks. Spring Boot, Express, Flask, net/http, gRPC -- get spans and metrics with a few lines of setup code
- Context propagation is built in. W3C Trace Context (traceparent/tracestate) propagates trace_id and span_id across HTTP, gRPC, and messaging boundaries automatically
- The OTel Collector decouples applications from backends. Applications export to the local collector, the collector handles batching, retry, routing, and format conversion. Backend changes never touch application code
- CNCF graduated project with contributions from Google, Microsoft, Splunk, Datadog, Grafana Labs, and 1,000+ contributors. The industry standard, not a single-vendor bet
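The W3C `traceparent` header that carries context between services is a fixed-format string: `version-traceid-spanid-flags`. A stdlib-only parser sketch (the sample header is the canonical example from the Trace Context spec):

```python
import re

# version(2 hex)-trace_id(32 hex)-parent_id(16 hex)-flags(2 hex)
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str):
    m = TRACEPARENT_RE.match(header)
    if not m or m["trace_id"] == "0" * 32 or m["parent_id"] == "0" * 16:
        return None  # malformed or all-zero IDs are invalid per the spec
    return m.groupdict()

ctx = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
assert ctx["trace_id"] == "4bf92f3577b34da6a3ce929d0e0e4736"
assert ctx["flags"] == "01"  # low bit set = sampled
```

The OTel SDKs inject and extract this header for you on every outbound and inbound call, which is what "automatic context propagation" means in practice.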
Cons of OpenTelemetry
- SDK maturity varies by language. Go and Java are production-stable. Python and JavaScript are stable but have rough edges in async context propagation. Rust and C++ are experimental
- Auto-instrumentation adds latency overhead. Java agent adds 1-5% latency to instrumented calls. For latency-critical hot paths, manual instrumentation with selective span creation is better
- Configuration complexity. The Collector alone has 100+ receivers, processors, and exporters. Getting the processor chain right (order matters) takes operational experience
- Log signal is newer than metrics and traces. The logs SDK stabilized later and some language implementations are still catching up. Bridge APIs exist for existing log frameworks but add a translation layer
- Breaking changes between SDK versions still happen in newer signals (logs, profiles). Pin versions carefully and test upgrades in staging
When to Use OpenTelemetry
- Any new service that needs observability. OTel should be the default instrumentation choice for greenfield development
- Migrating off vendor-specific SDKs (Datadog APM, New Relic agents) to avoid vendor lock-in
- Building a multi-signal observability platform where metrics, traces, logs, and profiles share context (trace_id, span_id)
- Custom business metrics that need labels, histograms, or counters beyond what auto-instrumentation provides
When Not to Use OpenTelemetry
- Kernel-level network telemetry without code changes. Use eBPF (Grafana Beyla) instead -- OTel SDK requires application code changes
- Simple Prometheus metric scraping from existing /metrics endpoints. The Prometheus client library is lighter if you only need counters and gauges with no tracing
- Environments where adding a dependency is not possible (embedded systems, bare-metal firmware, legacy COBOL)
Alternatives to OpenTelemetry
OTel Collector, Grafana Beyla, Prometheus, Grafana, Grafana Tempo, VictoriaMetrics, Kafka, Grafana Pyroscope
OTel Collector — The vendor-neutral telemetry pipeline for receiving, processing, and exporting observability data
Category: Observability
## Why It Exists
Use Cases for OTel Collector
- Decoupling applications from observability backends -- SDK exports to localhost, Collector handles everything else
- Fleet-wide telemetry processing: batching, filtering, enrichment, sampling, and routing across all services
- Protocol translation between different telemetry formats (OTLP, Prometheus, Jaeger, Zipkin, Kafka)
- Tail-based trace sampling that keeps 100% of errors while reducing baseline volume by 98-99.5%
- Value-based data routing: full fidelity for SLO-critical signals, sample or drop health checks and debug noise
Pros of OTel Collector
- Vendor-neutral pipeline. The same Collector instance exports to VictoriaMetrics, Tempo, Datadog, or any OTLP-compatible backend. Switch backends without touching application code
- Plugin architecture with 100+ pre-built components. Receivers, processors, and exporters are assembled into a single binary at build time and configured entirely through YAML
- Backpressure-aware. When a backend is slow, the sending queue buffers data, retries with exponential backoff, and only drops data as a last resort. The memory_limiter processor prevents OOM
- Runs anywhere. DaemonSet on Kubernetes, sidecar, gateway, bare-metal binary, Docker container. Same binary, same config format
- Self-monitoring built in. The Collector exposes its own metrics (accepted/dropped/exported counts, queue depth, memory usage) on a Prometheus endpoint for meta-monitoring
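A whole pipeline is assembled in YAML. A minimal sketch with a placeholder backend endpoint, showing the processor-order point from above (guard memory before batching):

```yaml
# OTLP in -> memory guard -> batch -> OTLP out
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  memory_limiter:          # order matters: run the memory guard first
    check_interval: 1s
    limit_mib: 512
  batch:                   # then batch to cut export overhead
exporters:
  otlphttp:
    endpoint: https://backend.example.com:4318
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlphttp]
```

Swapping the backend means editing the `exporters` block; the applications exporting to this Collector never notice.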
Cons of OTel Collector
- Configuration complexity. 100+ components with different config schemas. Getting the processor chain right (order matters) requires operational experience
- Single-binary plugin model means you can't add a custom component at runtime. Adding a new receiver or exporter requires rebuilding the binary with the Collector Builder (ocb)
- Memory overhead for stateful processors. Tail-based sampling buffers all spans for 60 seconds. At high throughput, this consumes several GB of RAM per instance
- No built-in persistent buffering in the default setup. The in-memory sending queue loses data on crash. Persistent queue (disk WAL) is available but adds I/O overhead
- Debug difficulty. When data disappears between SDK and backend, tracing which processor dropped it requires checking multiple internal metrics
When to Use OTel Collector
- Any production observability pipeline. The Collector should sit between your applications and your storage backends
- When you need to process telemetry before storage: filter noise, enrich with metadata, sample traces, route by tenant
- Multi-backend setups where metrics go to VictoriaMetrics, traces to Tempo, and logs to VictoriaLogs from the same collection layer
- When you want to change backends without redeploying applications
When Not to Use OTel Collector
- Simple single-app setups where the SDK can export directly to the backend with no processing needed
- When latency of even a single network hop is unacceptable (though DaemonSet mode uses localhost, so overhead is minimal)
- As a long-term storage buffer. The Collector is a processing pipeline, not a message queue. Use Kafka for durable buffering
Alternatives to OTel Collector
OpenTelemetry, Grafana Beyla, Kafka, Prometheus, VictoriaMetrics, Grafana Tempo
PostGIS — The spatial extension that turned PostgreSQL into a serious GIS database
Category: Geospatial
## What PostGIS Actually Is
Use Cases for PostGIS
- Storing and querying geographic features (points, lines, polygons)
- Location-based search (find all stores within 10 km)
- Route planning and network analysis
- Land parcel management and urban planning
- Geospatial data pipelines for mapping and GIS workflows
- Fleet tracking with real-time spatial queries
Pros of PostGIS
- Full OGC standards compliance (WKT, WKB, GeoJSON, KML, GML). If a GIS tool exists, it probably talks to PostGIS
- R-tree spatial indexing via GiST handles complex polygon queries efficiently
- Raster support for satellite imagery, elevation models, and remote sensing data alongside vector data
- Topology support for network analysis, routing, and connectivity queries
- Massive function library. 800+ spatial functions covering everything from distance calculations to Voronoi diagrams
Cons of PostGIS
- Inherits PostgreSQL's single-writer limitation and vacuum overhead
- R-tree indexes are less parallelizable than cell-based spatial schemes like S2 or H3
- Complex geometry operations (buffer, union on large polygons) can be CPU-intensive
- Scaling horizontally requires Citus or manual sharding. Not straightforward.
- Learning curve for the full GIS stack (SRIDs, projections, coordinate systems) is significant
When to Use PostGIS
- You need a full-featured GIS database with standards compliance
- Complex spatial operations: polygon intersection, buffering, union, Voronoi diagrams, routing
- Your team already runs PostgreSQL and wants to add spatial capabilities
- Integration with the broader GIS ecosystem (QGIS, GeoServer, MapServer, GDAL)
When Not to Use PostGIS
- Simple proximity queries where S2 or H3 cell IDs on a regular B-tree would suffice
- Billions of points with simple spatial lookups (consider a key-value store with S2 indexing)
- Real-time streaming geospatial data at very high write throughput
- Your workload is purely analytics/aggregation by area (H3 in a data warehouse is simpler)
Alternatives to PostGIS
PostgreSQL, Google S2 Geometry, Elasticsearch
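The 10 km store lookup above is a one-liner in PostGIS, something like `SELECT name FROM stores WHERE ST_DWithin(geog, point::geography, 10000)` (table and column names invented here). Under the hood that is a great-circle distance test plus an index scan. A rough sketch of the distance math in Python, using spherical haversine where PostGIS's geography type actually computes on a spheroid:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters on a sphere (mean Earth radius)."""
    R = 6371000.0
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * R * asin(sqrt(a))

# Toy "stores" table: (name, lat, lon)
stores = [
    ("downtown", 52.5200, 13.4050),   # Berlin center, ~2 km from the query point
    ("airport",  52.3667, 13.5033),   # well outside 10 km
]
query_point = (52.5300, 13.3850)

# Equivalent of: SELECT name FROM stores
#                WHERE ST_DWithin(geog, point::geography, 10000);
within_10km = [name for name, lat, lon in stores
               if haversine_m(lat, lon, *query_point) <= 10_000]
```

In production the point of PostGIS is that the GiST index prunes candidates before any distance is computed; this sketch only shows the predicate.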
PostgreSQL — The relational database that keeps earning its spot in production
Category: Databases
PostgreSQL is the database most teams should start with, and many never need to leave. Apple, Instagram, Spotify, and Reddit all run it in production. That is not a coincidence. After 35+ years of development, Postgres handles everything from basic CRUD apps to analytical workloads on petabytes of data. It is not the fastest option for every access pattern, but the combination of ACID compliance, extensibility, and SQL standards support makes it the safest default for a primary datastore.
Use Cases for PostgreSQL
- OLTP transactional workloads
- Complex queries with joins and aggregations
- Geospatial data (PostGIS)
- Full-text search
- JSON/JSONB document storage
- Time-series data (with TimescaleDB)
Pros of PostgreSQL
- Full ACID compliance with MVCC
- Wildly extensible (custom types, functions, extensions)
- Best-in-class SQL standard compliance
- Rich indexing: B-tree, GIN, GiST, BRIN
- Strong community and ecosystem
Cons of PostgreSQL
- Vertical scaling hits a wall eventually
- Write-heavy workloads need careful tuning
- Replication is async by default
- Sharding requires extensions like Citus
- VACUUM overhead bites you during long-running transactions
When to Use PostgreSQL
- You need strong consistency and ACID transactions
- Complex relational data with joins
- Mixed workloads (relational + JSON + full-text search)
- Geospatial queries
When Not to Use PostgreSQL
- You need automatic horizontal sharding at massive scale
- Your access pattern is purely key-value lookups
- Ultra-low latency requirements (sub-millisecond)
- Append-only time-series at millions of events per second
Alternatives to PostgreSQL
CockroachDB, MongoDB, Cassandra
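The ACID guarantee in practice: either every statement in a transaction lands, or none do. A self-contained sketch using stdlib sqlite3 so it runs anywhere; against Postgres you would run the same pattern through a driver such as psycopg, with identical transaction semantics:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INT)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
conn.commit()

def transfer(conn, src, dst, amount):
    """Move money atomically: both UPDATEs commit together or neither does."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            cur = conn.execute("SELECT balance FROM accounts WHERE name = ?", (src,))
            if cur.fetchone()[0] < 0:
                raise ValueError("insufficient funds")  # triggers rollback
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

ok1 = transfer(conn, "alice", "bob", 60)    # succeeds: alice 40, bob 110
ok2 = transfer(conn, "alice", "bob", 500)   # fails: first UPDATE rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```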
Prometheus — The pull-based metrics system that actually works for containers and Kubernetes
Category: Observability
## Why It Exists
Use Cases for Prometheus
- Collecting infrastructure metrics like CPU, memory, disk, and network
- Tracking application performance through custom metrics
- Alerting when SLOs are violated or anomalies appear
- Monitoring Kubernetes clusters and workloads
- Measuring SLIs for reliability engineering
- Capacity planning based on historical metrics
Pros of Prometheus
- Pull model makes service discovery and health detection straightforward
- PromQL is hands-down the best metrics query language available
- Kubernetes integration with automatic service discovery works out of the box
- Huge exporter ecosystem with 500+ integrations
- CNCF graduated project, widely adopted, battle-tested
Cons of Prometheus
- Single-node by default. You need Thanos or Cortex to scale horizontally
- Local storage is not durable. A disk failure wipes your metrics
- High cardinality labels will blow up memory and kill query performance
- Pull model means Prometheus needs network access to every target
- No built-in dashboards. You will need Grafana
When to Use Prometheus
- Cloud-native environments, especially anything running on Kubernetes
- You want a proven, standards-based monitoring stack
- Your team practices SRE and needs SLI/SLO tracking
- You need multi-dimensional metrics with label-based querying
When Not to Use Prometheus
- Log aggregation or distributed tracing (reach for Loki or Jaeger instead)
- Billing or accounting metrics where 100% accuracy matters (Prometheus can drop data)
- Environments where Prometheus cannot reach targets (use a push-based alternative)
- Very long retention (years) without adding Thanos or Cortex for remote storage
Alternatives to Prometheus
Grafana, Kafka, Elasticsearch
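PromQL's workhorse is `rate()`, which turns a monotonically increasing counter into a per-second rate while surviving process restarts. A simplified Python version of the calculation; real `rate()` also extrapolates to the window boundaries, which this skips:

```python
def simple_rate(samples):
    """Per-second rate of a counter from (unix_ts, value) samples.

    Handles counter resets the way PromQL does: a drop in value means
    the process restarted and the counter resumed from roughly zero,
    so the post-reset value itself counts as increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur  # reset detected
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed

# Counter resets at t=30 (process restart); the rate stays sane:
samples = [(0, 100), (15, 160), (30, 10), (45, 70)]
per_second = simple_rate(samples)   # (60 + 10 + 60) / 45
```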
Apache Pulsar — A messaging and streaming system built around separated compute and storage, with multi-tenancy and geo-replication baked in from day one
Category: Messaging
## Why It Exists
Use Cases for Apache Pulsar
- Running both queuing and streaming workloads on one platform instead of stitching Kafka and RabbitMQ together
- Multi-tenant messaging where different teams need real isolation, not just separate topic prefixes
- Geo-replicated event streaming across data centers without bolting on MirrorMaker
- Event sourcing with long-term retention by offloading old data to S3 or GCS
- Lightweight event processing via Pulsar Functions (skip the full Flink deployment for simple transforms)
- High-volume IoT ingestion where you need flexible subscription models
Pros of Apache Pulsar
- Brokers are stateless, BookKeeper handles storage. You can scale them independently, and replacing a dead broker takes seconds, not hours.
- Multi-tenancy is a first-class concept. Tenant, namespace, topic hierarchy gives you per-namespace quotas and policies out of the box.
- Geo-replication is built into the protocol. One admin command to set it up, no external tools needed.
- Tiered storage moves old ledger segments to object storage automatically. Infinite retention without burning SSD budget.
- Supports queuing (shared subscription) and streaming (exclusive/failover) in the same system, so you don't need two different platforms.
Cons of Apache Pulsar
- Operationally heavy. You need ZooKeeper + BookKeeper + Brokers. That is a minimum of 9 processes (three of each for HA) before you even publish a message.
- The ecosystem is smaller than Kafka's. Fewer connectors, fewer managed offerings, fewer Stack Overflow answers.
- Simple streaming workloads see higher tail latency compared to Kafka because of the extra BookKeeper hop.
- Schema registry and exactly-once semantics still lag behind Kafka's implementations in maturity.
- Managed cloud options are limited. Kafka has Confluent, MSK, Aiven, and more. Pulsar has StreamNative and not much else.
When to Use Apache Pulsar
- You actually need multi-tenancy with hard isolation between teams or customers
- Geo-replication is a real requirement, not just a nice-to-have
- You want queues and streaming topics in one system and are tired of running both RabbitMQ and Kafka
- You need to retain messages for months or years without paying for SSD-tier storage the whole time
When Not to Use Apache Pulsar
- Single-cluster streaming where Kafka works fine and has a decade of battle scars to prove it
- Your team wants something simpler to operate. Kafka is fewer moving parts.
- You already have deep Kafka Connect integrations and ecosystem tooling
- Community support matters a lot to you. Kafka's community is 5-10x larger.
Alternatives to Apache Pulsar
Kafka, RabbitMQ, Kafka Streams
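The queuing-plus-streaming claim comes down to subscription modes. A conceptual sketch (not the pulsar-client API) of how exclusive and shared subscriptions dispatch the same topic differently:

```python
def dispatch(messages, consumers, mode):
    """Conceptual sketch of Pulsar subscription modes.

    exclusive: one consumer receives every message, in order (streaming).
    shared:    messages round-robin across consumers (queuing semantics).
    """
    delivered = {c: [] for c in consumers}
    if mode == "exclusive":
        for m in messages:
            delivered[consumers[0]].append(m)
    elif mode == "shared":
        for i, m in enumerate(messages):
            delivered[consumers[i % len(consumers)]].append(m)
    return delivered

msgs = ["m1", "m2", "m3", "m4"]
shared = dispatch(msgs, ["c1", "c2"], "shared")        # work spread across both
exclusive = dispatch(msgs, ["c1", "c2"], "exclusive")  # c2 is a standby
```

Failover mode is exclusive plus automatic promotion of the standby, which this sketch omits.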
Grafana Pyroscope — S3-native continuous profiling with span-to-profile correlation and flame graph visualization
Category: Observability
## Why It Exists
Use Cases for Grafana Pyroscope
- Continuous profiling of CPU, memory, goroutine, and mutex contention in production
- Profile-to-trace correlation: click from a slow trace span to the CPU flame graph showing the exact bottleneck
- Identifying hot functions, excessive allocations, and lock contention without reproducing locally
- Regression detection by comparing flame graphs across deployments
- Cost-effective long-term profile retention via S3 storage classes
- Polyglot profiling across Go, Java, Python, Ruby, Rust, and .NET services
Pros of Grafana Pyroscope
- S3-native storage means profile cost scales with object storage pricing. Petabytes of profiles at a fraction of local-disk cost
- Span-to-profile correlation via shared span_id labels in pprof data. Click from a Tempo trace span to the CPU flame graph for that exact execution window
- Native Grafana integration. Flame graph panel is built-in. Differential flame graphs compare two time ranges to spot regressions
- pprof format is the industry standard. Go, Java (async-profiler), Python (py-spy), Ruby, Rust, and .NET all produce pprof-compatible output
- Same architecture as Tempo: ingesters buffer in memory, flush to S3, compactors merge blocks. One operational model for traces and profiles
- OTel profile signal support (experimental) means profiles flow through the same OTel Collector fleet as metrics, traces, and logs
Cons of Grafana Pyroscope
- Profiles require SDK-level instrumentation. eBPF (Grafana Beyla) does not produce profiles — only RED metrics and basic trace spans
- Per-profile size is large (50-200 KB per snapshot) compared to metrics (2 bytes) or trace spans (1 KB). Storage adds up at scale
- Ingester memory consumption is significant. Buffering 50K profiles/sec at 100 KB average requires careful node sizing
- OTel profile signal is still experimental as of early 2026. Most deployments use the Pyroscope SDK or async-profiler agent directly
- Flame graph interpretation requires performance engineering skills. Without training, teams may struggle to act on profile data
When to Use Grafana Pyroscope
- You already run Grafana + Tempo and want profile-to-trace correlation in the same UI
- Debugging production performance issues requires knowing which function is the bottleneck, not just which service
- S3-native storage with automatic lifecycle tiering is a requirement for cost control
- Multiple language runtimes need profiling under a single system (Go, Java, Python, Ruby)
When Not to Use Grafana Pyroscope
- You only need RED metrics and basic traces. Grafana Beyla covers that without profiles
- Your services run on Windows or platforms without pprof-compatible profilers
- Budget does not allow the incremental S3 storage cost for continuous profiles at scale
- Team lacks performance engineering skills to interpret flame graphs (invest in training first)
Alternatives to Grafana Pyroscope
Grafana, Tempo, Grafana Beyla, Prometheus, VictoriaMetrics
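Flame graphs are just aggregated stack samples. A toy version of the aggregation step, collapsing sampled call stacks into the folded text format that flame-graph tooling consumes; an illustration of the idea, not Pyroscope's storage format:

```python
from collections import Counter

def collapse(stack_samples):
    """Aggregate raw stack samples into folded flame-graph lines.

    Each sample is one call stack, outermost frame first. Identical
    stacks merge into a single "a;b;c count" line; frame width in the
    rendered flame graph is proportional to that count.
    """
    counts = Counter(";".join(stack) for stack in stack_samples)
    return sorted(f"{stack} {n}" for stack, n in counts.items())

# Three CPU samples: parse() was on-CPU twice, render() once
samples = [
    ["main", "handle", "parse"],
    ["main", "handle", "parse"],
    ["main", "handle", "render"],
]
folded = collapse(samples)
```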
QuestDB — The zero-GC time series database built for speed that actually delivers millions of rows per second
Category: Time Series
## Why It Exists
Use Cases for QuestDB
- Financial market data and tick-level analytics
- High-frequency IoT sensor ingestion
- Real-time dashboards over streaming data
- Network telemetry and flow analytics
- Application performance monitoring at high resolution
- Cryptocurrency trading and exchange analytics
Pros of QuestDB
- Ingestion speed is genuinely best-in-class. Millions of rows/sec on commodity hardware.
- Standard SQL with time series extensions, no new query language to learn
- Zero-GC Java implementation avoids the latency spikes that plague other JVM databases
- Built-in support for InfluxDB Line Protocol and PostgreSQL wire protocol
- Column-oriented storage with SIMD-accelerated query execution
Cons of QuestDB
- Younger project with a smaller community compared to InfluxDB or TimescaleDB
- No built-in replication or clustering yet. Single-node only for now.
- Limited support for UPDATE and DELETE operations
- Ecosystem of integrations and connectors is still growing
- Documentation covers the basics but lacks depth on advanced operational topics
When to Use QuestDB
- You need the fastest possible ingestion for high-volume time series data
- Financial or trading applications where microsecond timestamps matter
- Your team wants SQL and does not want to learn InfluxQL, Flux, or PromQL
- Single-node deployment is acceptable and you want maximum performance per node
When Not to Use QuestDB
- You need multi-node clustering or built-in replication for HA
- Heavy UPDATE/DELETE workloads on existing data
- You need a proven, battle-tested solution for mission-critical systems
- Prometheus-compatible monitoring (use VictoriaMetrics instead)
Alternatives to QuestDB
InfluxDB, TimescaleDB, ClickHouse
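The ILP support means you can ingest without a client library: one text line per row over TCP (port 9009 by default). A minimal sketch of building such a line; real clients also handle escaping and string fields, which this skips:

```python
def ilp_line(table, tags, fields, ts_ns):
    """Build one InfluxDB Line Protocol line:
        table,tag1=v1,tag2=v2 field1=1.5,field2=3i timestamp_ns
    Integer fields take an 'i' suffix; the timestamp is nanoseconds.
    No escaping of spaces/commas in names, so keep inputs simple.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in tags.items())
    def fmt(v):
        return f"{v}i" if isinstance(v, int) else str(v)
    field_part = ",".join(f"{k}={fmt(v)}" for k, v in fields.items())
    return f"{table},{tag_part} {field_part} {ts_ns}"

line = ilp_line("trades",
                {"symbol": "BTC-USD", "side": "buy"},
                {"price": 64250.5, "amount": 3},
                1700000000000000000)
# One line like this per row, newline-terminated, streamed over a socket.
```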
RabbitMQ — The message broker you actually understand on day one
Category: Messaging
## How It Works Internally
Use Cases for RabbitMQ
- Task queues and background job processing
- Request-reply messaging patterns
- Complex message routing (topic, headers, fanout)
- Microservice communication
- Delayed and scheduled message delivery
- Priority queues
Pros of RabbitMQ
- Rich routing with exchanges (direct, topic, fanout, headers)
- Supports multiple protocols (AMQP, MQTT, STOMP)
- Message acknowledgment and dead-letter queues
- Priority queues and TTL support
- Easy to set up and operate for small-medium scale
Cons of RabbitMQ
- Lower throughput than Kafka for streaming workloads
- Messages are deleted after consumption (not replayable)
- Clustering can be complex at large scale
- Memory pressure under high message backlog
- Not designed for event sourcing or log-based systems
When to Use RabbitMQ
- Need flexible message routing patterns
- Task queues with acknowledgment and retries
- Request-reply or RPC patterns over messaging
- Small-to-medium scale with simpler operations
When Not to Use RabbitMQ
- Need event replay or long-term message storage
- Ultra-high throughput streaming (millions/sec)
- Event sourcing or CQRS architectures
- Need strong message ordering across partitions
Alternatives to RabbitMQ
Kafka, Redis
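The routing flexibility lives in binding keys. For topic exchanges, `*` matches exactly one dot-separated word and `#` matches zero or more. A small matcher that mirrors those binding-key semantics (broker-side logic, not the pika client API):

```python
def topic_match(pattern, key):
    """AMQP topic-exchange matching over dot-separated words:
    '*' matches exactly one word, '#' matches zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' can absorb any number of remaining words, including none
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if p[0] == "*" or p[0] == k[0]:
            return match(p[1:], k[1:])
        return False
    return match(pattern.split("."), key.split("."))
```

A queue bound with `logs.#` sees everything under `logs`, while `logs.*.error` sees only error messages exactly one level deep.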
RAG — Retrieval-Augmented Generation: give your LLM actual facts instead of letting it guess
Category: AI & ML
## Why It Exists
Use Cases for RAG
- Enterprise knowledge base Q&A
- Customer support bots backed by real product docs
- Legal document analysis and contract review
- Medical literature search and clinical decision support
- Code documentation search and developer assistants
- Internal wiki and policy compliance queries
Pros of RAG
- Grounds LLM responses in factual, up-to-date sources
- Cuts hallucinations drastically compared to raw LLM generation
- No fine-tuning needed to add domain-specific knowledge
- Sources are citable, so users can actually verify answers
- You can update the knowledge base without retraining the model
Cons of RAG
- Retrieval quality caps generation quality. Garbage in, garbage out.
- Chunking strategy has a huge impact on relevance, and there is no universal answer
- Latency overhead from embedding, retrieval, and re-ranking adds up
- Evaluation is tricky. You need to measure retrieval precision, context relevance, and answer faithfulness separately.
- Cost scales with corpus size (vector storage, embedding API calls, re-ranking)
When to Use RAG
- Your LLM needs access to private, proprietary, or frequently changing data
- Factual accuracy and source attribution matter
- Fine-tuning is too expensive or the knowledge base changes too often
- Domain-specific Q&A where the LLM's training data falls short
When Not to Use RAG
- Tasks that need pure reasoning or creativity, not factual grounding
- Tiny knowledge bases where the whole corpus fits in the context window
- Real-time applications that cannot tolerate retrieval latency
- Highly structured data that is better served by SQL queries or APIs
Alternatives to RAG
Vector Databases, LangChain, Elasticsearch
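The core loop is embed, retrieve, stuff into the prompt. A toy end-to-end sketch with hand-made 3-d vectors standing in for a real embedding model and vector database:

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def retrieve(query_vec, corpus, k=2):
    """Rank chunks by cosine similarity to the query and keep the top-k."""
    ranked = sorted(corpus, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:k]

corpus = [
    {"text": "Refunds are processed within 5 business days.", "vec": [0.9, 0.1, 0.0]},
    {"text": "Our office is closed on public holidays.",      "vec": [0.0, 0.2, 0.9]},
    {"text": "Refund requests require an order number.",      "vec": [0.8, 0.3, 0.1]},
]
query = [1.0, 0.2, 0.0]  # pretend embedding of "how do refunds work?"

context = "\n".join(c["text"] for c in retrieve(query, corpus))
prompt = f"Answer using only this context:\n{context}\n\nQ: how do refunds work?"
```

The chunking and re-ranking concerns from the cons list all live inside `retrieve`; the generation step never gets better than what comes out of it.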
Redis — The in-memory store you'll reach for first when latency matters
Category: Caching
Most engineers working on backend systems that need to be fast have already used Redis. It started as a caching layer, but these days it sits at the center of session management, rate limiting, real-time analytics, and a dozen other use cases. The reason it stuck around while other tools came and went is simple: it delivers sub-millisecond latency with data structures that actually match real problems, not just key-value pairs.
Use Cases for Redis
- Session storage
- Rate limiting
- Leaderboards and rankings
- Real-time analytics
- Pub/Sub messaging
- Distributed locks
- Caching hot data
Pros of Redis
- Sub-millisecond latency for reads and writes
- Rich data structures: lists, sets, sorted sets, hashes, streams
- Built-in replication and Lua scripting
- Persistence options (RDB snapshots, AOF)
- Cluster mode for horizontal scaling
Cons of Redis
- RAM-bound, so your entire dataset must fit in memory
- Single-threaded command execution
- Cluster mode adds real operational complexity
- No query language for complex lookups
When to Use Redis
- You need sub-millisecond reads and writes
- Caching hot data in front of a slower database
- Real-time counters, leaderboards, or rate limiters
- Session management across multiple app servers
When Not to Use Redis
- Your dataset is much larger than available RAM
- You need full ACID transactions with joins
- Primary long-term storage for data you cannot lose
- Complex relational queries
Alternatives to Redis
Memcached, DynamoDB, PostgreSQL
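A classic Redis pattern: fixed-window rate limiting with INCR + EXPIRE. The sketch below keeps the same logic but swaps a dict in for Redis so it is self-contained; the equivalent Redis commands are in the comments:

```python
import time

class FixedWindowLimiter:
    """Fixed-window rate limiter, the logic usually run against Redis."""
    def __init__(self, limit, window_s):
        self.limit, self.window_s = limit, window_s
        self.counters = {}

    def allow(self, key, now=None):
        now = time.time() if now is None else now
        window = int(now // self.window_s)
        bucket = (key, window)                    # Redis key: f"rl:{key}:{window}"
        count = self.counters.get(bucket, 0) + 1  # Redis: INCR rl:{key}:{window}
        self.counters[bucket] = count             # Redis: EXPIRE on first INCR
        return count <= self.limit

rl = FixedWindowLimiter(limit=3, window_s=60)
results = [rl.allow("user-1", now=t) for t in (0, 1, 2, 3, 61)]
# Fourth call in the same window is rejected; the window at t=61 starts fresh.
```

In Redis the INCR is atomic across all app servers, which is the whole point: the counter lives in one place, not per-process.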
RocksDB — The embedded LSM-tree engine that powers half the databases you already use
Category: Databases
## How It Works Internally
Use Cases for RocksDB
- State backend for stream processors like Flink and Kafka Streams
- Embedded key-value layer inside distributed databases (CockroachDB, TiKV)
- High-write-throughput time-series ingestion
- Metadata storage for distributed file systems
- Persistent local caching when Redis feels like overkill
- State management for stateful microservices
Pros of RocksDB
- Built for fast storage (SSD/NVMe) with write throughput that is hard to beat
- Embeddable. Runs in-process, so no network hop, no serialization tax
- Extremely tunable compaction, compression, and memory settings
- Incremental checkpointing through hard links makes snapshots nearly free
- Column families let you logically separate data inside one instance
Cons of RocksDB
- Not a standalone database. You need to build access patterns on top of it
- Read amplification is real. The LSM-tree structure means checking multiple levels
- Tuning is its own discipline. Dozens of knobs, and they interact in ways that surprise you
- Write amplification from compaction can hit 10-30x in the worst case
- Space amplification during compaction means you need to provision extra disk headroom
When to Use RocksDB
- You are building a system that needs an embedded storage engine (stream processor, distributed DB, etc.)
- Write throughput is your primary bottleneck
- Your data fits on a single node's local disk
- You need fast point lookups and range scans over sorted keys
When Not to Use RocksDB
- You want a standalone database with SQL or some query language
- You need distributed transactions across multiple nodes
- Your workload is read-heavy with random access patterns (a B-tree will likely serve you better)
- Your team has no experience tuning LSM-tree engines and no appetite to learn
Alternatives to RocksDB
Flink, Kafka Streams, CockroachDB
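The read amplification mentioned above falls straight out of the LSM design. A toy sketch of the write and read paths: memtable first, then immutable sorted runs checked newest-to-oldest. Nothing close to RocksDB's real engine, but the shape is the same:

```python
class TinyLSM:
    """Toy LSM tree: writes land in an in-memory memtable, which flushes
    to immutable sorted runs ("SSTables"). Reads check the memtable, then
    every run from newest to oldest: that is read amplification."""
    def __init__(self, memtable_limit=2):
        self.memtable, self.sstables = {}, []
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.sstables.append(sorted(self.memtable.items()))  # flush
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):   # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
db.put("a", 1); db.put("b", 2)   # flush -> run 1
db.put("a", 9); db.put("c", 3)   # flush -> run 2 shadows the old "a"
```

Compaction (absent here) merges runs to cap that read cost, which is exactly where the write and space amplification in the cons list come from.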
ScyllaDB — A shard-per-core NoSQL database built for low latency at serious scale
Category: Databases
## How It Works Internally
Use Cases for ScyllaDB
- High-throughput notification storage
- Real-time user profile and preference stores
- IoT and time-series ingestion
- Ad tech bidding and event logging
- Session and state management
- Write-heavy workloads where you need predictable latency
Pros of ScyllaDB
- Shard-per-core architecture removes cross-CPU contention entirely
- Drop-in Cassandra CQL compatibility (drivers, tools, data model all work)
- Written in C++ on the Seastar framework. No GC pauses. Period.
- Automatic workload-aware scheduling (reads vs compaction vs streaming)
- Speculative execution that genuinely cuts tail latency
Cons of ScyllaDB
- Same query-driven data modeling constraints as Cassandra
- Smaller community and ecosystem than Cassandra
- Enterprise features (CDC, LDAP, encryption at rest) sit behind a paywall
- Fewer managed service options compared to DynamoDB or Cassandra on Astra
- Lightweight transactions (LWT) are noticeably slower than regular writes because of Paxos
When to Use ScyllaDB
- You want the Cassandra data model but 2-10x lower p99 latency
- GC pauses in your JVM-based database are causing tail-latency spikes
- Your workload exceeds 100K ops/sec per node and you want fewer nodes overall
- You are running latency-sensitive reads alongside compaction-heavy tables
When Not to Use ScyllaDB
- Your dataset fits comfortably in a single relational database
- You need complex joins or ad-hoc analytical queries
- Your team has zero experience with Cassandra data modeling
- You need a fully managed serverless setup with zero ops
Alternatives to ScyllaDB
Cassandra, DynamoDB, Redis
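Shard-per-core means a partition key deterministically maps to one core, so the hot path needs no cross-CPU locks. A conceptual sketch of that routing; real ScyllaDB uses murmur3 tokens and vnode-based ownership, and `hash()` here is just a stand-in:

```python
def route(partition_key, num_nodes, cores_per_node):
    """Toy shard-per-core routing: key -> token -> node -> core.
    The same key always lands on the same core, so that core owns the
    partition's data exclusively and never takes a cross-CPU lock."""
    token = hash(partition_key) & 0xFFFFFFFF   # toy 32-bit token
    node = token % num_nodes
    shard = (token // num_nodes) % cores_per_node
    return node, shard

n1, s1 = route("user:42", num_nodes=3, cores_per_node=8)
n2, s2 = route("user:42", num_nodes=3, cores_per_node=8)
# Same key, same node, same core, every time (within one process).
```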
SeaweedFS — Fast, simple distributed storage that just works for small files
Category: Storage
## Why It Exists
Use Cases for SeaweedFS
- High-volume small file storage (images, thumbnails, log files)
- CDN origin storage where most objects are under 10MB
- Lightweight object storage for startups and mid-size teams
- S3 gateway for applications that need basic S3 compatibility
- File storage backend for web applications
- Blob storage for microservices that don't need S3-scale complexity
Pros of SeaweedFS
- Extremely fast for small files. Optimized for the use case where most objects are under 10MB.
- Simple architecture. Master + Volume servers. Easy to understand and deploy.
- Low metadata overhead per object. File ID encodes volume and offset directly.
- Built-in Reed-Solomon erasure coding for volume-level durability
- Filer component provides directory semantics and S3 API on top of the blob store
- Written in Go. Single binary per component. Minimal dependencies.
Cons of SeaweedFS
- Central master server is a coordination bottleneck. All volume assignments go through it.
- Master is a single point of failure (though it can be replicated with Raft, it's still the critical path)
- S3 compatibility is partial. Some advanced S3 features (object lock, complex lifecycle rules) are missing or incomplete.
- Not designed for exabyte scale. Works well up to low petabytes.
- Erasure coding is per-volume (groups of objects), not per-object. A volume failure recovery reconstructs the entire volume.
- Smaller community and ecosystem compared to MinIO or Ceph. Fewer production references.
When to Use SeaweedFS
- Primary workload is lots of small files (images, thumbnails, logs)
- Want something simpler than Ceph with less operational overhead
- Data fits in terabytes to low petabytes
- Team is small and needs a storage system they can understand end to end
When Not to Use SeaweedFS
- Need exabyte scale or rack-aware placement (use Ceph)
- Require full S3 API compatibility (use MinIO)
- Workload is dominated by large objects (>1GB). SeaweedFS's advantage is small files.
- Cannot tolerate a central master in the critical path
- Need enterprise support and a large community
Alternatives to SeaweedFS
MinIO, Ceph, RocksDB
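The low metadata overhead comes from the file ID itself: a fid like `3,01637037d6` names the volume and the needle inside it, so there is no per-object metadata database to consult. A minimal parser, with the read path sketched in comments (URLs follow SeaweedFS's lookup API, simplified):

```python
def parse_fid(fid):
    """Split a SeaweedFS file ID like '3,01637037d6' into its volume ID
    and hex needle ID (key + cookie). The volume ID is all a client needs
    to ask the master which volume server holds the blob."""
    volume_part, needle_hex = fid.split(",", 1)
    return int(volume_part), needle_hex

volume_id, needle = parse_fid("3,01637037d6")
# Read path, roughly:
#   GET http://<master>/dir/lookup?volumeId=3     -> volume server address
#   GET http://<volume-server>/3,01637037d6       -> the blob itself
```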
Google Cloud Spanner — The database that uses atomic clocks to solve distributed consistency
Category: Databases
Most distributed databases force a choice: strong consistency or horizontal scale. Spanner is the rare system that actually delivers both, and the trick behind it is wonderfully weird. Google put atomic clocks and GPS receivers in every data center, built a time API around them, and used bounded clock uncertainty to make globally consistent transactions possible. It has been running internally at Google since before it became a cloud product in 2017, powering AdWords, Google Play, and Photos at scale.
Use Cases for Google Cloud Spanner
- Global financial systems where inconsistency means real money lost
- Multi-region SaaS platforms that need strong consistency, not eventual
- Gaming backends with global leaderboards that must be accurate in real time
- Inventory systems spread across continents
- Ad platforms crunching billions of events per day
- Government and healthcare systems with strict data integrity rules
Pros of Google Cloud Spanner
- External consistency (the strongest guarantee you can get). Linearizable reads and writes, globally.
- Fully managed. You do not deal with replication, sharding, or failover at all.
- Automatic split-based sharding with zero manual partitioning
- 99.999% SLA for multi-region setups (under 5.3 minutes of downtime per year)
- Real SQL support with schemas, secondary indexes, and interleaved tables
Cons of Google Cloud Spanner
- Hard vendor lock-in to Google Cloud. No on-prem, no multi-cloud.
- Expensive. Minimum $0.90/hour per node (~$650/month) before storage and network.
- Custom SQL dialect. The PostgreSQL interface exists but has real limitations.
- Write latency goes up for multi-region instances because Paxos has to cross continents
- No stored procedures, no triggers, no user-defined functions
When to Use Google Cloud Spanner
- You are already on Google Cloud and need globally consistent transactions
- Regulations or business rules require the absolute strongest consistency guarantees
- You need a managed database that scales from 1 node to thousands without rearchitecting
- Your workload needs multi-region writes with automatic conflict resolution
When Not to Use Google Cloud Spanner
- You need multi-cloud or on-prem deployment flexibility
- You are budget-constrained. Spanner's minimum cost is overkill for smaller workloads.
- You depend on the full PostgreSQL extension ecosystem or stored procedures
- Your workload is read-heavy and you can tolerate eventual consistency (much cheaper options exist)
Alternatives to Google Cloud Spanner
CockroachDB, DynamoDB, PostgreSQL
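The atomic-clock trick in miniature: TrueTime returns an interval, not an instant, and commit-wait is what makes timestamps meaningful across the planet. A simulation with a fake clock; the 7 ms uncertainty is the ballpark Google has published:

```python
class ToyTrueTime:
    """Toy TrueTime: now() returns [earliest, latest], an interval whose
    width is the clock uncertainty epsilon. Driven by a fake clock."""
    def __init__(self, epsilon):
        self.epsilon, self.clock = epsilon, 0.0

    def now(self):
        return self.clock - self.epsilon, self.clock + self.epsilon

    def advance(self, dt):
        self.clock += dt

def commit_wait(tt):
    """Spanner's rule: pick commit timestamp s = TT.now().latest, then
    wait until TT.now().earliest > s before acknowledging. After the
    wait, no node anywhere can read a clock value below s, which is
    what external consistency rests on."""
    _, s = tt.now()
    waited = 0.0
    while tt.now()[0] <= s:      # spin until the earliest bound passes s
        tt.advance(0.001)
        waited += 0.001
    return s, waited

tt = ToyTrueTime(epsilon=0.007)  # 7 ms uncertainty
s, waited = commit_wait(tt)      # waits roughly 2 * epsilon
```

The commit-wait cost is why tight clock uncertainty matters: the smaller epsilon is, the less every write has to stall.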
Spark — The Swiss Army knife of distributed data processing, for better or worse
Category: Stream Processing
## Why It Exists
Use Cases for Spark
- Large-scale batch ETL
- Machine learning pipelines (MLlib)
- Interactive SQL analytics (Spark SQL)
- Graph processing (GraphX)
- Micro-batch stream processing (Structured Streaming)
- Data lake processing
Pros of Spark
- One API covers batch, streaming, SQL, and ML instead of stitching four systems together
- In-memory processing makes iterative workloads dramatically faster than MapReduce ever was
- Massive community means most problems already have a Stack Overflow answer
- Pick your language: Scala, Python, Java, R, or SQL
- Catalyst optimizer does genuinely smart things with your SQL queries
Cons of Spark
- Micro-batch streaming adds seconds of latency, not milliseconds. If you need real-time, look elsewhere.
- Memory hungry. Budget for it or watch your executors die.
- Tuning Spark well is practically a full-time job
- Overkill for anything that fits on a single machine
- Shuffle operations will punish you if you are not careful
When to Use Spark
- Large-scale batch data processing
- You want one platform for batch, streaming, and ML instead of maintaining three
- Interactive SQL queries on datasets too big for a single database
- Your team already knows the JVM or Python ecosystem
When Not to Use Spark
- You need true sub-second stream processing (use Flink instead)
- Your ETL fits comfortably on one machine. Just use pandas or DuckDB.
- Real-time event processing where latency actually matters
- OLTP workloads. Spark is not a database.
Alternatives to Spark
Flink, Kafka Streams, ClickHouse
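`reduceByKey` is the mental model for most Spark jobs, and also where the shuffle pain above comes from: every value for a key has to reach one place before it can be folded. The primitive in plain Python, with word count on top:

```python
from itertools import groupby

def reduce_by_key(pairs, fn):
    """Spark's reduceByKey in plain Python: sort by key (in Spark this
    sort-and-move step is the shuffle, sending all values for one key to
    one executor), group, then fold each group's values with fn."""
    out = []
    for key, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        values = [v for _, v in group]
        acc = values[0]
        for v in values[1:]:
            acc = fn(acc, v)
        out.append((key, acc))
    return out

# Classic word count: flatMap -> map to (word, 1) -> reduceByKey(add)
lines = ["to be or not", "to be"]
pairs = [(w, 1) for line in lines for w in line.split()]
counts = dict(reduce_by_key(pairs, lambda a, b: a + b))
```

On a cluster the `sorted()` call becomes a network-wide repartition of the data, which is why careless shuffles dominate Spark job cost.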
Grafana Tempo — S3-native distributed tracing with no index to maintain and TraceQL for structural queries
Category: Observability
## Why It Exists
Use Cases for Grafana Tempo
- Distributed tracing across microservice architectures at scale
- S3-native trace storage with automatic tiered lifecycle
- TraceQL queries for attribute-based and structural trace analysis
- Grafana-native observability workflows with metric-to-trace correlation
- High-volume trace ingestion with tail-based sampling
- Cost-effective long-term trace retention via S3 storage classes
- Service dependency map generation from trace data
- Incident investigation with exemplar-linked trace lookups
Pros of Grafana Tempo
- No index to maintain. Bloom filters for trace ID lookup, Parquet columnar format for attribute search. Zero index management overhead.
- S3-native storage means trace cost scales with object storage pricing, not compute. Petabytes of traces for a fraction of Elasticsearch cost.
- TraceQL is the only trace query language with structural operators (find traces where span A is a parent of span B with duration > 2s)
- Native Grafana integration. Tempo datasource is built-in. Trace waterfall, service maps, and exemplar links work out of the box.
- Automatic tiered storage via S3 lifecycle policies. Hot to Glacier with zero application-level logic.
- Stateless query layer scales horizontally. Add querier pods to handle more concurrent queries without touching storage.
- Apache Parquet columnar format enables column pruning — queries that filter on one attribute skip all other columns
Cons of Grafana Tempo
- S3 query latency is higher than local disk. Non-cached attribute searches can take 2-10 seconds depending on block count.
- Uses 4.26 GiB RAM at 10K spans/sec vs VictoriaTraces' 1.15 GiB. Ingesters buffer spans in memory before flushing to S3.
- Compactor is critical infrastructure. If it falls behind, block count grows, bloom filters fragment, and query latency degrades.
- No local-disk-only option. S3 or compatible object storage is required. Cannot run in air-gapped environments without MinIO or similar.
- Attribute-based queries (not by trace ID) can be slow on large time ranges because there is no inverted index — Tempo must scan Parquet blocks.
- Bloom filter false positives increase with block count. Well-compacted blocks are essential for query performance.
When to Use Grafana Tempo
- S3-native storage with automatic lifecycle tiering is a requirement
- TraceQL queries for structural trace analysis are needed
- Already running Grafana and want native datasource integration
- Trace volume is high and object storage economics make more sense than local disk provisioning
- Team prefers operational simplicity over resource efficiency (stateless query, no local state to manage)
When Not to Use Grafana Tempo
- Air-gapped or no-external-dependency environments (VictoriaTraces uses local disk only)
- Resource efficiency is the top priority (VictoriaTraces uses 3.7x less RAM)
- Already running VictoriaMetrics + VictoriaLogs and want the same operational model for traces
- Need sub-second attribute search on large time ranges (Elasticsearch-backed Jaeger has inverted indexes)
- Budget does not allow S3 costs for trace storage
Alternatives to Grafana Tempo
VictoriaTraces, Grafana, Kafka, VictoriaMetrics, Prometheus
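What a structural TraceQL query buys you, in miniature. A query like `{ span.name = "checkout" } > { span.name = "charge" && duration > 2s }` selects traces where the second span is a direct child of the first. The same predicate evaluated over toy span dicts; real Tempo evaluates it over Parquet blocks in S3:

```python
def find_traces(traces, parent_name, child_name, min_child_ms):
    """Return IDs of traces containing a span named child_name that is a
    direct child of a span named parent_name and slower than min_child_ms."""
    hits = []
    for trace in traces:
        by_id = {s["id"]: s for s in trace["spans"]}
        for s in trace["spans"]:
            parent = by_id.get(s.get("parent"))
            if (parent and parent["name"] == parent_name
                    and s["name"] == child_name
                    and s["duration_ms"] > min_child_ms):
                hits.append(trace["trace_id"])
                break
    return hits

traces = [
    {"trace_id": "t1", "spans": [
        {"id": "a", "name": "checkout", "duration_ms": 3000, "parent": None},
        {"id": "b", "name": "charge",   "duration_ms": 2500, "parent": "a"}]},
    {"trace_id": "t2", "spans": [
        {"id": "a", "name": "checkout", "duration_ms": 900, "parent": None},
        {"id": "b", "name": "charge",   "duration_ms": 400, "parent": "a"}]},
]
slow = find_traces(traces, "checkout", "charge", min_child_ms=2000)
```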
TiDB — MySQL-compatible distributed database that actually handles both OLTP and OLAP
Category: Databases
Most teams hit the same wall eventually: the MySQL instance is maxed out on writes, the analytics pipeline is a fragile mess of ETL jobs copying data into a warehouse, and half the reports are hours stale. TiDB exists to solve that specific pain. It is a distributed database that speaks MySQL protocol and can serve both transactional and analytical queries from the same cluster. That sounds like marketing, but it actually works in practice, with tradeoffs worth understanding before committing.
Use Cases for TiDB
- Running analytics directly on transactional data without an ETL pipeline
- Scaling beyond a single MySQL instance while keeping SQL compatibility
- Mixed OLTP/OLAP workloads that currently need two separate databases
- Multi-tenant SaaS platforms hitting MySQL write limits
- Financial systems that need both fast transactions and real-time reporting
- High-volume e-commerce order management with analytics
Pros of TiDB
- Speaks the MySQL wire protocol, so most MySQL drivers, ORMs, and tools just work
- TiFlash columnar replicas let you run analytics without hurting OLTP performance
- Scales horizontally with automatic Region splitting and rebalancing
- Strong consistency through Raft consensus on every data Region
- Open source with a real community and commercial backing from PingCAP
Cons of TiDB
- Write latency is higher than single-node MySQL because of Raft consensus round-trips
- Not every MySQL feature is supported (stored procedures and triggers are limited)
- Three component types to operate (TiDB, TiKV, PD), which adds real operational overhead
- Cross-Region transactions pay for multi-Raft-group coordination
- TiFlash replication doubles your storage cost since it keeps a columnar copy of row data
When to Use TiDB
- You have outgrown a single MySQL instance and need horizontal scale with SQL
- You want real-time analytics on live transactional data, no ETL
- You need distributed ACID transactions with MySQL compatibility
- You are running separate OLTP and OLAP systems and want to merge them
When Not to Use TiDB
- Your workload fits comfortably on a single MySQL or PostgreSQL instance
- You depend heavily on stored procedures and triggers
- You need sub-millisecond latency where single-node databases are just faster
- Your team has never operated a distributed database cluster
Alternatives to TiDB
CockroachDB, PostgreSQL, RocksDB
TiKV — Distributed key-value store that delivers transactions without giving up range scans
Category: Databases
## How It Works Internally
Use Cases for TiKV
- Storage engine for TiDB (distributed MySQL-compatible database)
- Metadata store for large-scale infrastructure
- Ordered key-value layer with distributed transactions
- Backend for systems that need both point lookups and range scans at scale
- Replacement for single-node key-value stores that hit scaling limits
Pros of TiKV
- Distributed ACID transactions with snapshot isolation across shards
- Sorted keys with efficient range scans, not just point lookups
- Raft consensus per region gives strong consistency without a single leader bottleneck
- Automatic region splitting and merging as data grows or shrinks
- Coprocessor pushes computation to storage nodes, cutting network round-trips
- Built on RocksDB. Battle-tested LSM-tree performance underneath
Cons of TiKV
- Operational complexity. You're running Placement Driver (PD) + TiKV nodes + monitoring
- Snapshot isolation, not serializable. Phantom reads are possible in edge cases
- Write latency depends on Raft replication. Cross-AZ deployments add 5-15ms
- Region splitting can cause brief latency spikes during transitions
- Smaller ecosystem than etcd or Redis. Fewer client libraries and community resources
When to Use TiKV
- Need ordered key-value storage with distributed transactions
- Outgrowing a single-node store and need horizontal scaling
- Running TiDB and want its native storage engine
- Workload needs both fast point reads and efficient range scans
When Not to Use TiKV
- Simple coordination or config storage (etcd is simpler and good enough)
- Pure cache workloads (Redis is faster and more appropriate)
- Data fits on one machine (RocksDB embedded avoids all the distributed overhead)
- Need serializable isolation (FoundationDB is a better fit)
- Team doesn't have distributed systems operational experience
Alternatives to TiKV
RocksDB, etcd, FoundationDB, CockroachDB
Tile38 — The real-time geospatial database built for tracking things that move
Category: Geospatial
## Why This Exists
Use Cases for Tile38
- Real-time fleet and vehicle tracking
- Geo-fencing with instant entry/exit notifications
- Live asset tracking for logistics and supply chain
- Location-based push notifications
- Drone and robot position monitoring
- Real-time delivery driver tracking
Pros of Tile38
- Purpose-built for real-time geospatial data. Updates and queries on moving objects are first-class operations
- Built-in geo-fencing with webhook notifications on enter/exit events
- Redis-compatible protocol makes integration trivial for teams that know Redis
- Supports points, polygons, GeoJSON, and geohashes natively
- In-memory with persistence to disk (AOF), so reads are consistently fast
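At its core, a circular geo-fence check like Tile38's enter/exit detection reduces to a great-circle distance test. A pure-Python haversine sketch of that idea (illustrative only, not Tile38's implementation; the depot coordinates are made up):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in meters."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def in_circular_fence(lat, lon, fence_lat, fence_lon, radius_m):
    return haversine_m(lat, lon, fence_lat, fence_lon) <= radius_m

# A delivery driver relative to a 500 m fence around a depot:
depot = (40.7128, -74.0060)
print(in_circular_fence(40.7130, -74.0062, *depot, 500))  # a few dozen meters away
print(in_circular_fence(40.7500, -74.0060, *depot, 500))  # several kilometers away
```

Tile38's value is doing this continuously against millions of moving objects and firing webhooks on state transitions, rather than the distance math itself.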
Cons of Tile38
- Single-node only. No built-in clustering or sharding
- Dataset must fit in memory. Not suitable for historical data at scale
- Smaller community and ecosystem compared to PostGIS or Redis with its GEO commands
- No SQL interface. Query language is custom (RESP-based commands)
- Limited analytical capabilities. It tracks objects, it does not analyze spatial patterns
When to Use Tile38
- You need to track millions of moving objects with sub-second update latency
- Geo-fencing is a core requirement and you need real-time enter/exit events
- Your workload is dominated by frequent location updates, not complex spatial queries
- You want something simpler than PostGIS for real-time tracking
When Not to Use Tile38
- You need complex spatial operations (polygon intersection, buffering, spatial joins)
- Your data does not fit in memory or you need long-term spatial data storage
- You need SQL and standard GIS tool compatibility
- Analytics and aggregation over spatial data are more important than real-time tracking
Alternatives to Tile38
Redis, PostGIS, Google S2 Geometry
TimescaleDB — Full PostgreSQL with time series superpowers bolted in at the storage layer
Category: Time Series
## Why This Exists
Use Cases for TimescaleDB
- IoT and sensor data with complex relational queries
- Financial tick data and market analytics
- Application metrics alongside business data in one database
- Geospatial time series (fleet tracking, weather stations)
- SLA/SLO monitoring with historical trending
- Energy and utilities metering data
Pros of TimescaleDB
- It is PostgreSQL. Full SQL, JOINs, CTEs, window functions, stored procedures, all of it
- Your existing PostgreSQL tooling, ORMs, drivers, and expertise all carry over
- Automatic time-based partitioning via hypertables is genuinely painless
- Continuous aggregates give you materialized rollups that refresh incrementally
- Native compression achieves 90-95% reduction on typical time series data
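The continuous-aggregate idea is a rollup over fixed time buckets. A plain-Python sketch of what TimescaleDB's `time_bucket()` does conceptually, using hypothetical sensor readings (TimescaleDB does this in SQL and refreshes the result incrementally):

```python
from collections import defaultdict
from datetime import datetime, timedelta

def time_bucket(width: timedelta, ts: datetime) -> datetime:
    """Floor a timestamp to the start of its bucket."""
    epoch = datetime(1970, 1, 1)
    seconds = int((ts - epoch).total_seconds())
    bucket = seconds - seconds % int(width.total_seconds())
    return epoch + timedelta(seconds=bucket)

readings = [  # (timestamp, temperature) from a hypothetical sensor
    (datetime(2024, 1, 1, 10, 4), 20.0),
    (datetime(2024, 1, 1, 10, 11), 22.0),
    (datetime(2024, 1, 1, 10, 16), 24.0),
]

sums = defaultdict(lambda: [0.0, 0])
for ts, temp in readings:
    b = time_bucket(timedelta(minutes=15), ts)
    sums[b][0] += temp
    sums[b][1] += 1

rollup = {b: total / n for b, (total, n) in sums.items()}
print(rollup)  # two 15-minute buckets: 10:00 averages 21.0, 10:15 averages 24.0
```

In TimescaleDB the equivalent is a `SELECT time_bucket('15 minutes', ts), avg(temp) ... GROUP BY 1` materialized as a continuous aggregate, so dashboards read precomputed rows instead of rescanning raw data.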
Cons of TimescaleDB
- Single-node performance ceiling for writes compared to purpose-built TSDBs
- Multi-node (distributed hypertables) adds operational complexity
- Compression is columnar but queries on compressed chunks are slower than on uncompressed data
- Still carries the PostgreSQL overhead for MVCC, vacuuming, and WAL management
- License changed from Apache 2.0 to a more restrictive Timescale License for some features
When to Use TimescaleDB
- You already run PostgreSQL and want to add time series without another database
- Your queries need JOINs between time series data and relational tables
- Your team knows SQL well and does not want to learn a new query language
- You need ACID transactions on time series data
When Not to Use TimescaleDB
- Pure metrics collection at massive scale (Prometheus or VictoriaMetrics is simpler)
- You need 1M+ inserts/sec sustained on a single node (look at QuestDB or InfluxDB)
- Your workload is append-only with no relational queries (a purpose-built TSDB will be leaner)
- You want a fully managed experience without any PostgreSQL administration
Alternatives to TimescaleDB
PostgreSQL, InfluxDB, Prometheus
TipTap — The headless rich-text editor built on ProseMirror with first-class Yjs support
Category: Collaboration
ProseMirror is the most powerful editor framework ever built. It is also genuinely hard to use. The API is low-level, the documentation assumes familiarity with document schemas and state machines, and building a basic bold button requires understanding transactions, steps, and marks. TipTap wraps ProseMirror with a developer-friendly API while keeping the full power accessible when needed. Think of it as React to ProseMirror's DOM.
Use Cases for TipTap
- Rich text editors in SaaS products
- CMS content editing interfaces
- Collaborative document editing (Google Docs-style)
- Notion-style block editors with custom node types
- Comments and annotations with inline formatting
Pros of TipTap
- Headless architecture. Zero UI opinions. Full control over every pixel of the editor
- ProseMirror backbone provides a battle-tested document model with schema enforcement
- First-class Yjs collaboration via @tiptap/extension-collaboration
- Extension system for custom nodes, marks, and keyboard shortcuts
- Schema-enforced structure prevents invalid document states at the transaction level
Cons of TipTap
- ProseMirror learning curve is steep. The mental model is not intuitive at first
- Bundle size grows with every extension added. Tree-shaking helps but only so much
- Documentation has gaps for advanced ProseMirror interop and custom node specs
- Mobile support is limited. Touch selection and virtual keyboards are pain points
When to Use TipTap
- Building a rich text editor with full control over the UI
- Need real-time collaboration with multiple simultaneous editors
- Want schema-enforced content structure (headings can only contain inline content, etc.)
- Building a Notion-style block editor with custom block types
When Not to Use TipTap
- Plain text input where a textarea is sufficient
- Static content display with no editing needed
- WYSIWYG email composers (email HTML is a different beast entirely)
- Teams that need a working editor in a day. The learning curve is real
Alternatives to TipTap
Yjs, Hocuspocus, Elasticsearch
Vector Databases — Specialized databases for storing, indexing, and querying high-dimensional vector embeddings at scale
Category: AI & ML
## Why It Exists
Use Cases for Vector Databases
- Semantic search and retrieval for RAG pipelines
- Recommendation engines built on embedding similarity
- Image and video similarity search
- Anomaly detection across high-dimensional feature spaces
- Deduplication and near-duplicate detection
- Multimodal search across text, images, and audio
Pros of Vector Databases
- Sub-millisecond similarity search across billions of vectors
- ANN indexes dramatically outperform brute-force scanning
- Native metadata filtering combined with vector search
- Managed cloud options reduce operational overhead
- Support for hybrid search (dense + sparse vectors)
Cons of Vector Databases
- Results are approximate. ANN algorithms trade recall for speed.
- Index build times grow steeply with dataset size (hours at billion-vector scale)
- Memory-hungry. HNSW indexes hold graph structures in RAM.
- No standard query language. Every database ships its own API.
- Changing your embedding model means re-indexing your entire corpus
When to Use Vector Databases
- Semantic similarity search where keyword matching falls short
- RAG applications that need fast retrieval over large document sets
- Recommendation systems built on learned embeddings
- Any application where items are represented as dense vectors
When Not to Use Vector Databases
- Exact match or structured queries (use a relational database)
- Small datasets under 10K vectors (brute-force cosine similarity works fine)
- Frequently swapping embedding models (re-indexing cost adds up fast)
- Workloads that require ACID transactions on vector data
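The small-dataset point above is easy to demonstrate: exact brute-force cosine similarity needs no index at all and returns perfect recall. A self-contained sketch with tiny made-up 3-dimensional "embeddings" (real embeddings are hundreds to thousands of dimensions, but the mechanics are identical):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query: list[float], corpus: dict[str, list[float]], k: int = 2):
    """Exact nearest neighbors by cosine similarity: O(N*d), fine for small N."""
    scored = [(cosine(query, vec), name) for name, vec in corpus.items()]
    return sorted(scored, reverse=True)[:k]

corpus = {  # hypothetical embeddings
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}
results = top_k([0.85, 0.15, 0.05], corpus)
print([name for _, name in results])  # animals rank above the vehicle
```

A vector database earns its keep only when N grows large enough that this linear scan per query becomes the bottleneck; below roughly 10K vectors the scan is fast, exact, and operationally free.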
Alternatives to Vector Databases
RAG, Elasticsearch, PostgreSQL
VictoriaLogs — The log database that indexes everything without the Elasticsearch bill
Category: Observability
## Why It Exists
Use Cases for VictoriaLogs
- Centralized log aggregation for Kubernetes clusters
- High-volume operational logging (millions of log lines/sec)
- Full-text log search across high-cardinality fields
- Incident investigation with trace-to-log correlation
- Multi-tenant log storage for platform teams
- Replacing Elasticsearch for log storage at lower cost
- Complementing VictoriaMetrics metrics with the same operational model
Pros of VictoriaLogs
- Indexes all log fields automatically via bloom filters. No schema design or label planning required.
- Uses 30x less RAM and 15x less disk than Elasticsearch on the same workload
- 3x higher ingestion throughput and 87% less RAM than Grafana Loki in benchmarks
- Single binary, zero-config deployment. Same operational simplicity as VictoriaMetrics.
- LogsQL query language with full-text search built in, not bolted on
- Handles high-cardinality log fields natively without performance degradation
- Cluster mode with vlinsert/vlselect/vlstorage for linear horizontal scaling
Cons of VictoriaLogs
- Younger project than Elasticsearch and Loki, smaller community and ecosystem
- No native S3/object storage backend. Uses local disk. Cold tier requires vmbackup to S3.
- Grafana integration via datasource plugin, not native like Loki
- LogsQL is powerful but less widely known than Elasticsearch KQL or Loki LogQL
- Fewer third-party integrations and managed service options compared to Elasticsearch
When to Use VictoriaLogs
- High-volume log ingestion where Elasticsearch cost is prohibitive
- Log queries frequently search content, not just labels (where Loki brute-force grep is slow)
- Already running VictoriaMetrics and want the same operational model for logs
- Need full-text search on logs without the operational overhead of Elasticsearch
- High-cardinality log fields (trace_id, user_id, request_id) are common query targets
When Not to Use VictoriaLogs
- Need S3-native storage with automatic lifecycle tiering (Loki is simpler here)
- Log analytics and dashboards built from log content (Elasticsearch excels at aggregation queries)
- Small team that wants a fully managed solution (Elasticsearch Service, Grafana Cloud Loki)
- Tight Grafana ecosystem integration is a hard requirement (Loki has native support)
Alternatives to VictoriaLogs
VictoriaMetrics, Grafana, Elasticsearch, Kafka, Prometheus
VictoriaMetrics — The Prometheus long-term storage that does more with less hardware
Category: Time Series
## Why Teams Switch To This
Use Cases for VictoriaMetrics
- Long-term Prometheus metrics storage
- Multi-tenant monitoring for platform teams
- High-cardinality metrics that would OOM Prometheus
- Cost-effective replacement for Thanos or Cortex
- Centralized metrics aggregation across Kubernetes clusters
- IoT and sensor data monitoring at scale
- 500M+ metrics/sec ingestion at Datadog-tier scale (cluster mode)
- Gorilla-compressed hot storage tier in tiered TSDB architectures
Pros of VictoriaMetrics
- Dramatically lower resource usage than Prometheus for the same workload, often 5-10x less RAM
- Handles high cardinality far better than Prometheus without falling over
- Drop-in Prometheus replacement with full PromQL compatibility plus MetricsQL extensions
- Compression is exceptional, typically 0.4-0.8 bytes per data point
- Single binary deployment. Download, run, done. Operationally simple.
- Shared-nothing cluster architecture where vminsert, vmselect, and vmstorage scale independently with zero coordination overhead
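Compression ratios like 0.4-0.8 bytes per point come from Gorilla-style techniques such as delta-of-delta encoding of timestamps: regularly scraped series collapse to runs of zeros that a bit-level codec stores in about one bit each. A sketch of the encoding idea only (not VictoriaMetrics' actual codec, which adds variable-bit packing and value compression on top):

```python
def delta_of_delta(timestamps: list[int]) -> list[int]:
    """First value, then first delta, then deltas-of-deltas.
    A fixed scrape interval turns into zeros after the second element."""
    out = [timestamps[0]]
    prev_delta = None
    for prev, cur in zip(timestamps, timestamps[1:]):
        delta = cur - prev
        out.append(delta if prev_delta is None else delta - prev_delta)
        prev_delta = delta
    return out

# A 15-second scrape interval with one late scrape at the end:
ts = [1000, 1015, 1030, 1045, 1061]
print(delta_of_delta(ts))  # [1000, 15, 0, 0, 1]
```

The zeros are what compress so well; jitter shows up only as small residuals, which still pack into a few bits.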
Cons of VictoriaMetrics
- Smaller community than Prometheus and Thanos, though growing fast
- Cluster version has a different architecture than single-node (not just 'add more nodes')
- MetricsQL extensions are useful but create vendor lock-in if you rely on them heavily
- Alerting is a separate binary (vmalert), not embedded in the storage engine. Extra component to deploy and configure.
- Documentation is functional but not as polished as Prometheus ecosystem docs
When to Use VictoriaMetrics
- You need long-term Prometheus storage without the complexity of Thanos
- Prometheus is running out of memory or disk and you need a more efficient backend
- Multi-cluster or multi-tenant monitoring where each team pushes metrics centrally
- High cardinality workloads that crash Prometheus
When Not to Use VictoriaMetrics
- You want alerting embedded in the same process as storage (Prometheus bundles this; VictoriaMetrics requires the separate vmalert binary)
- Your metrics volume fits comfortably in a single Prometheus instance with local storage
- You need strong ecosystem support and battle-tested integrations right now
- Log aggregation or distributed tracing (this is a metrics-only database)
Alternatives to VictoriaMetrics
Prometheus, Grafana, InfluxDB, TimescaleDB, Thanos
VictoriaTraces — Distributed tracing built on the same engine as VictoriaLogs, without the external storage tax
Category: Observability
## Why It Exists
Use Cases for VictoriaTraces
- Distributed tracing across microservice architectures
- Request latency analysis and bottleneck identification
- End-to-end request flow visualization
- Trace-based alerting via vmalert integration
- Full-stack observability alongside VictoriaMetrics and VictoriaLogs
- High-volume trace ingestion from OpenTelemetry instrumented services
- Service dependency graph generation
- Root cause analysis during production incidents
Pros of VictoriaTraces
- Uses 3.7x less RAM and 2.6x less CPU than Grafana Tempo in benchmarks
- No external storage dependencies. No Elasticsearch, Cassandra, or S3 required for production.
- Same operational model as VictoriaMetrics and VictoriaLogs (vtinsert/vtselect/vtstorage)
- OTLP-native ingestion with custom HTTP/2 server (25% smaller binary, 36% less CPU than gRPC-Go)
- Bloom filter indexed search on all span fields without manual index configuration
- Cluster mode with linear horizontal scaling. Each component scales independently.
- Compatible with Grafana (via Jaeger datasource) and Jaeger UI for visualization
Cons of VictoriaTraces
- Younger project than Jaeger and Grafana Tempo, smaller community
- No S3-native storage. Uses local disk. Cold/archive requires vmbackup to S3.
- No TraceQL equivalent. Querying uses Jaeger APIs and LogsQL, which is less expressive for trace-specific patterns
- Grafana integration via Jaeger datasource plugin, not a native datasource
- Tempo datasource API support is still experimental
- Fewer managed service options and third-party integrations compared to Tempo and Jaeger
When to Use VictoriaTraces
- Already running VictoriaMetrics and VictoriaLogs and want the same operational model for traces
- Resource efficiency is a priority and the infrastructure budget is tight
- Trace volume is high and Tempo's RAM usage (4x higher) is a concern
- Running with zero external storage dependencies is a hard requirement (air-gapped environments, strict compliance)

- Need trace storage with bloom filter indexed search on all span attributes
When Not to Use VictoriaTraces
- Need TraceQL for advanced trace queries (Grafana Tempo is the only option for this)
- S3-native storage with automatic lifecycle tiering is required
- Deep Grafana ecosystem integration is a priority (Tempo has native support)
- Need a mature, battle-tested tracing backend with a large community (Jaeger, Tempo)
- Team is already invested in the Grafana LGTM stack (Loki, Grafana, Tempo, Mimir)
Alternatives to VictoriaTraces
VictoriaMetrics, VictoriaLogs, Grafana, Kafka, Prometheus
vLLM — The go-to LLM inference engine, built around PagedAttention for squeezing real throughput out of your GPUs
Category: AI & ML
## Why It Exists
Use Cases for vLLM
- Running open-weight LLMs in production without paying per-token API fees
- Serving Llama, Mistral, Qwen, and similar models at scale
- Batch processing large queues of LLM requests overnight or on demand
- Hosting multiple models behind one unified API gateway
- Cutting inference costs once your request volume justifies owning GPUs
- Low-latency inference for real-time chat or autocomplete features
Pros of vLLM
- 2-24x higher throughput than naive HuggingFace inference, thanks to PagedAttention
- OpenAI-compatible API, so you can swap it in without changing your client code
- Supports 50+ model architectures: Llama, Mistral, Qwen, Gemma, and more
- Continuous batching keeps GPU utilization high across concurrent requests
- Large, active open-source community shipping features fast
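PagedAttention manages the KV cache in fixed-size blocks, much like virtual-memory pages: memory is reserved one block at a time as a sequence grows, instead of one contiguous buffer sized for the maximum possible length. A toy allocator showing that idea (illustrative only, not vLLM's code; class and parameter names are made up):

```python
class PagedKVCache:
    """Toy block allocator for a paged KV cache."""
    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))       # free block ids
        self.tables: dict[str, list[int]] = {}    # sequence id -> its block ids
        self.lengths: dict[str, int] = {}         # tokens generated per sequence

    def append_token(self, seq: str) -> None:
        n = self.lengths.get(seq, 0)
        if n % self.block_size == 0:              # current block full: grab a new one
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq, []).append(self.free.pop())
        self.lengths[seq] = n + 1

    def release(self, seq: str) -> None:
        """Sequence finished: its blocks return to the shared pool."""
        self.free.extend(self.tables.pop(seq, []))
        self.lengths.pop(seq, None)

cache = PagedKVCache(num_blocks=4, block_size=16)
for _ in range(20):           # a 20-token sequence needs ceil(20/16) = 2 blocks
    cache.append_token("req-1")
print(len(cache.tables["req-1"]), len(cache.free))  # 2 blocks used, 2 still free
```

Because unused capacity stays in the shared pool, many concurrent sequences can be batched onto one GPU, which is where the throughput gains over naive per-request buffers come from.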
Cons of vLLM
- NVIDIA GPUs (CUDA) are basically required. AMD and Intel support is still rough
- Large models (70B+) need careful memory tuning or you will hit OOM errors
- You will spend time tuning config flags to match your specific hardware
- Inference only. No built-in fine-tuning support
- Multi-node tensor parallelism adds real networking headaches
When to Use vLLM
- You are self-hosting open-weight LLMs and need production-grade serving
- The model's native serving framework cannot keep up with your throughput needs
- You want an OpenAI-compatible endpoint for internal services
- Your request volume is high enough that owning GPUs beats paying per-token API costs
When Not to Use vLLM
- Low request volume where API providers are cheaper than renting or owning GPUs
- You need proprietary models like GPT-4 or Claude (those are not self-hostable)
- You do not have GPU infrastructure and are not ready to set it up
- You are rapidly experimenting across many models and setup time matters more than throughput
Alternatives to vLLM
Hugging Face, RAG, LangChain
Yjs — The CRDT library that makes real-time collaboration work without a central authority
Category: Collaboration
Anyone who has tried to build real-time collaboration into an app knows the hard part is not the WebSocket. The hard part is what happens when two users edit the same paragraph at the same time while one of them is on a flaky train Wi-Fi. Yjs solves that problem. It is a CRDT (Conflict-Free Replicated Data Type) library that makes every character in a document a unique, mergeable item. Two users can edit the same word simultaneously, go offline, come back 20 minutes later, and their changes merge without losing anything. No server needed to decide who wins.
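Yjs's YATA algorithm is involved, but the core CRDT property — replicas converge just by merging state, with no coordinator deciding who wins — can be shown with a much simpler CRDT, a grow-only counter. This is purely illustrative and is not Yjs's text type:

```python
class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot;
    merge takes the per-replica max, so merges commute and always converge."""
    def __init__(self, replica: str):
        self.replica = replica
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.replica] = self.counts.get(self.replica, 0) + n

    def merge(self, other: "GCounter") -> None:
        for rep, n in other.counts.items():
            self.counts[rep] = max(self.counts.get(rep, 0), n)

    def value(self) -> int:
        return sum(self.counts.values())

a, b = GCounter("alice"), GCounter("bob")
a.increment(3)           # alice edits offline
b.increment(2)           # bob edits offline on the train
a.merge(b); b.merge(a)   # they sync later, in either order
print(a.value(), b.value())  # 5 5
```

Yjs applies the same merge-and-converge discipline to sequences of characters, which is a much harder problem, but the guarantee is the same: no update is lost and no server arbitrates.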
Use Cases for Yjs
- Collaborative document editing (Google Docs-style)
- Shared whiteboards and design tools
- Multiplayer coding environments
- Offline-first apps that sync when back online
- Peer-to-peer collaboration without a server
Pros of Yjs
- Works offline. Edits sync automatically when connectivity returns
- No central server required for correctness (peer-to-peer capable)
- Sub-millisecond local operations. Edits feel instant
- Language-agnostic binary sync protocol. Clients in JS, Rust, Swift, Kotlin
- Mature ecosystem: bindings for ProseMirror, Monaco, CodeMirror, Quill
Cons of Yjs
- Document size grows over time from internal metadata (tombstones, clock vectors)
- No built-in access control or permissions. A server layer handles that
- Debugging merge conflicts requires understanding YATA internals
- Garbage collection of deleted content is limited by design
When to Use Yjs
- Building a collaborative editor (text, code, diagrams)
- Offline-first applications where users edit without connectivity
- Peer-to-peer sync without a central server dependency
- Need sub-100ms sync latency for real-time cursor presence
When Not to Use Yjs
- Simple form-based collaboration where last-writer-wins is fine
- Data with strict invariants (inventory counts, bank balances)
- Very large documents (100MB+) where metadata overhead matters
- Teams unfamiliar with CRDTs who need something simpler
Alternatives to Yjs
TipTap, Hocuspocus, Redis
ZooKeeper — The coordination service that half the big data world still depends on
Category: Coordination
## Why It Exists
Use Cases for ZooKeeper
- Leader election
- Distributed configuration management
- Service discovery
- Distributed locking
- Cluster membership tracking
- Barrier synchronization
Pros of ZooKeeper
- Strong consistency through ZAB consensus protocol
- Ordered, sequential operations
- Ephemeral nodes for failure detection
- Watch mechanism for real-time notifications
- Battle-tested in production (Kafka, HBase, Solr)
Cons of ZooKeeper
- Not designed for large data storage
- Write throughput is limited (leader bottleneck)
- Operational complexity with quorum management
- Java-based, which means GC pause headaches
- Being replaced by newer alternatives (etcd, KRaft)
When to Use ZooKeeper
- Need leader election for distributed services
- Existing Hadoop/Kafka ecosystem dependency
- Distributed configuration that must be consistent
- Service coordination requiring strong ordering
When Not to Use ZooKeeper
- General-purpose data storage
- High write throughput requirements
- Greenfield projects (consider etcd instead)
- Simple service discovery (consider Consul)
Alternatives to ZooKeeper
etcd, Kafka