Grafana
Open-source dashboarding and alerting that sits on top of whatever backends you already run
Why It Exists
Every monitoring backend ships its own UI, and every one of them is mediocre. Prometheus offers a bare-bones expression browser. Elasticsearch has Kibana (which is great for logs but clunky for metrics). InfluxDB has Chronograf. Running three backends means tabbing between three UIs, and none of them talk to each other.
Torkel Ödegaard started Grafana in 2014 as a fork of Kibana's visualization layer, and the core idea was simple: decouple the visualization from the storage. Connect to any backend through a plugin, render it all in one place. That separation turned out to be the killer feature. Pick whatever storage makes sense for each signal type, and Grafana provides a unified view.
Today it has over 65,000 GitHub stars and a massive ecosystem. It is used everywhere from startups to places like CERN (monitoring the Large Hadron Collider).
How It Works
Data Source Plugins: Grafana stores nothing. It queries external data sources in real time. Each plugin translates the panel's query configuration into the backend's native language (PromQL for Prometheus, LogQL for Loki, JSON DSL for Elasticsearch, plain SQL for databases) and returns results in Grafana's internal data frame format. The plugin system is open, so anyone can write a data source plugin for their own backend.
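Registering a backend is just declaring a plugin instance. A minimal sketch of a data source provisioning file (hostnames and names here are placeholders, not defaults):

```yaml
# datasources.yaml -- conventionally under /etc/grafana/provisioning/datasources/
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus        # the plugin that translates panel queries into PromQL
    access: proxy           # the Grafana server proxies queries to the backend
    url: http://prometheus:9090
    isDefault: true
  - name: Loki
    type: loki              # same Grafana, different backend and query language (LogQL)
    access: proxy
    url: http://loki:3100
```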
Dashboards: Under the hood, a dashboard is just a JSON document. It holds panels (visualizations), variables, time range settings, and annotations. Each panel has a query (data source + expression), a visualization type (time series, stat, gauge, table, heatmap, etc.), and display config like thresholds and colors. Dashboards can be built in the UI, provisioned from JSON/YAML files, or managed through the Grafana HTTP API. The JSON format is verbose and not fun to write by hand, which is why tooling like Grafonnet exists.
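Trimmed to the essentials, the JSON model for a one-panel dashboard looks roughly like this. Everything is illustrative: real exported dashboards carry many more generated fields, and the metric, variable, and titles are made up for the sketch:

```json
{
  "title": "Service Overview",
  "templating": {
    "list": [
      { "name": "namespace", "type": "query", "query": "label_values(namespace)" }
    ]
  },
  "panels": [
    {
      "title": "Request rate",
      "type": "timeseries",
      "datasource": { "type": "prometheus" },
      "targets": [
        { "refId": "A", "expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m]))" }
      ]
    }
  ]
}
```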
Alerting: Before v9, alerting in Grafana was a mess. Dashboard-level alerts, Prometheus alerts, and external alerting tools all coexisted without clear ownership. Unified alerting (v9+) fixed this with a single rule engine that evaluates conditions against any data source. Alert rules use the same query language as panels. Notification policies route alerts to contact points (email, Slack, PagerDuty, Opsgenie) based on labels, with support for grouping, silencing, and muting.
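As a rough sketch of the file-provisioned shape (the rule, UIDs, metric, and threshold are all invented for illustration), a rule that fires when the 5xx ratio stays above 5% for five minutes might look like:

```yaml
# alert-rules.yaml -- unified alerting rule, file provisioning format (apiVersion 1)
apiVersion: 1
groups:
  - orgId: 1
    name: api-slo
    folder: Alerts
    interval: 1m
    rules:
      - uid: api-error-ratio         # stable UID so re-provisioning updates in place
        title: API 5xx ratio above 5%
        condition: B                 # the refId whose result decides firing
        for: 5m
        labels:
          severity: critical         # notification policies route on labels like this
        data:
          - refId: A                 # the query, in the data source's own language
            datasourceUid: prometheus
            relativeTimeRange: { from: 600, to: 0 }
            model:
              expr: 'sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))'
          - refId: B                 # server-side expression that applies the threshold
            datasourceUid: "__expr__"
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator: { type: gt, params: [0.05] }
```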
Architecture Deep Dive
Grafana Server: It runs as a single Go binary serving both the web UI and API. Configuration state (dashboards, data sources, users, alerts) lives in a database. SQLite works fine for local dev or single-node setups. For production, use PostgreSQL or MySQL. Authentication supports built-in accounts, LDAP, OAuth, and SAML. Authorization is role-based: viewer, editor, admin. Nothing fancy, but it covers most org structures.
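Switching off SQLite takes only a few settings. A docker-compose-style sketch using Grafana's GF_&lt;SECTION&gt;_&lt;KEY&gt; environment variable convention (hostnames and credentials are placeholders):

```yaml
# docker-compose.yaml (fragment) -- run Grafana against Postgres instead of SQLite
services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_DATABASE_TYPE: postgres      # maps to [database] type in grafana.ini
      GF_DATABASE_HOST: postgres:5432
      GF_DATABASE_NAME: grafana
      GF_DATABASE_USER: grafana
      GF_DATABASE_PASSWORD: ${GRAFANA_DB_PASSWORD}
```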
Panel Rendering: When a dashboard loads, the frontend sends a query request to the server for each panel. The server proxies those to the right data sources, collects the responses, and streams results back to the browser. The frontend renders visualizations using uPlot for time series, D3.js for custom charts, and Grafana's own rendering engine. Each panel is isolated and renders independently, which is both a strength (one broken panel does not take down the dashboard) and a weakness (30 panels means 30 separate queries).
The LGTM Stack: Grafana Labs pushes the LGTM stack for full observability: Loki (log aggregation, like Prometheus but for logs, using LogQL), Grafana (visualization), Tempo (distributed tracing), and Mimir (long-term metrics storage, compatible with Prometheus remote write). The real value is correlation. Spot a spike in a metrics panel, click through to related logs in Loki, and drill down to individual traces in Tempo. When it works, it cuts incident investigation time dramatically. When the linking breaks (and it does sometimes, usually due to misconfigured labels), it is back to manual correlation.
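The log-to-trace link is exactly the kind of configuration that breaks: it hinges on a derived field whose regex actually matches how trace IDs appear in the log lines. A sketch of a Loki data source provisioned with such a link (the regex and UIDs are assumptions about one particular log format):

```yaml
# loki-datasource.yaml -- provisions Loki with a clickable link into Tempo
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: 'trace_id=(\w+)'  # must match how trace IDs appear in logs
          url: '$${__value.raw}'          # $$ escapes a literal $ in provisioning files
          datasourceUid: tempo            # must match the Tempo data source UID
```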
Dashboard as Code: Grafonnet is a Jsonnet library that generates dashboard JSON programmatically. It is great for stamping out similar dashboards across dozens of services. The Terraform Grafana provider manages dashboards, data sources, and folders declaratively. Provisioning reads YAML/JSON from disk at startup, so dashboards can deploy via ConfigMaps in Kubernetes. Pick whichever approach fits the workflow, but do pick one. Hand-clicking dashboards in the UI does not scale past a handful of services.
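The provisioning side is a one-time setup: a provider file tells Grafana where on disk to read dashboard JSON from. A minimal sketch (folder name and path are illustrative):

```yaml
# dashboards.yaml -- conventionally under /etc/grafana/provisioning/dashboards/
apiVersion: 1
providers:
  - name: team-dashboards
    folder: Services                  # Grafana folder to load dashboards into
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30         # how often Grafana re-scans the path
    options:
      path: /var/lib/grafana/dashboards  # mount a ConfigMap here in Kubernetes
      foldersFromFilesStructure: true
```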
Best Practices
Structure dashboards as a hierarchy: service overview (RED metrics across all services), then service detail (per-service deep dive), then component detail (database, cache, queue specifics). Each level should answer "is something broken?" before anyone needs to drill deeper.
Use template variables for every selector. A well-built dashboard works for any instance of a service by changing a single dropdown. If cloning a dashboard and swapping out hardcoded values becomes necessary, the dashboard needs parameterization.
Set meaningful thresholds on panels. Green, yellow, red. If everything is always green, the thresholds are too loose. If everything is always red, nobody pays attention. The goal is that anyone on the team can glance at the dashboard and know whether to worry. That takes tuning over time, and it is worth the effort.
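Both practices show up directly in the panel JSON. A sketch of a stat panel that is parameterized by a $namespace variable and carries explicit green/yellow/red thresholds (the metric name and threshold values are illustrative, not recommendations):

```json
{
  "type": "stat",
  "title": "p99 latency",
  "datasource": { "type": "prometheus" },
  "targets": [
    {
      "refId": "A",
      "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])))"
    }
  ],
  "fieldConfig": {
    "defaults": {
      "unit": "s",
      "thresholds": {
        "mode": "absolute",
        "steps": [
          { "color": "green", "value": null },
          { "color": "yellow", "value": 0.5 },
          { "color": "red", "value": 2 }
        ]
      }
    }
  }
}
```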
Pros
- • 60+ data source plugins (Prometheus, Loki, Elasticsearch, PostgreSQL, CloudWatch, and more)
- • Rich visualization library with 15+ panel types including graphs, heatmaps, and geo maps
- • Dashboard-as-code with JSON models and Terraform provider
- • Unified alerting across all data sources with a single rule engine
- • Free and open-source core, with paid enterprise features available through Grafana Enterprise and Grafana Cloud
Cons
- • Dashboard sprawl is real. Organizations end up with hundreds of unmaintained dashboards nobody owns
- • Complex dashboard JSON models are painful to manage without Terraform or Jsonnet
- • Performance tanks when you pack too many panels or high-cardinality queries into one dashboard
- • Steep learning curve for advanced PromQL/LogQL queries inside panels
- • Plugin quality is inconsistent. Some community plugins are abandoned or buggy
When to use
- • You need a single visualization layer across multiple data sources
- • You are building monitoring dashboards for Prometheus, Loki, or other time series data
- • You want dashboard-as-code for version-controlled, reproducible observability
- • Different teams use different backends and you need cross-team visibility
When NOT to use
- • Data collection and storage. Grafana only visualizes; it does not store metrics. Use Prometheus for that
- • Business intelligence with complex data transformations (use Looker, Tableau, or Metabase)
- • Simple status pages (use Better Uptime, Statuspage, or a similar SaaS)
- • Real-time streaming dashboards with sub-second updates (build a custom WebSocket solution instead)
Key Points
- • Mixed data source mode lets a single panel query multiple backends at once. Overlaying Prometheus metrics with Loki log counts correlates error rates with specific error messages in one view (see the sketch after this list)
- • Dashboard variables (template variables) make dashboards reusable. A single Kubernetes dashboard with $namespace, $pod, and $container variables can serve every team without per-team copies
- • Unified alerting (since v9) evaluates alert rules server-side against any data source, not just Prometheus. This replaces the need for separate alerting systems per backend
- • Explore view provides an ad-hoc query mode built for incident investigation. Split view enables correlating metrics and logs side by side, with time range sync and log-to-trace linking
- • Dashboard provisioning via YAML/JSON files in a ConfigMap supports GitOps workflows. Dashboards ship alongside application code, so monitoring evolves with the service it covers
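The mixed-mode panel mentioned above is configured by pointing the panel at the built-in Mixed data source and giving each query its own backend. A trimmed sketch (the UIDs, app label, and queries are illustrative):

```json
{
  "type": "timeseries",
  "title": "5xx rate vs. error log volume",
  "datasource": { "type": "datasource", "uid": "-- Mixed --" },
  "targets": [
    {
      "refId": "A",
      "datasource": { "type": "prometheus", "uid": "prometheus" },
      "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m]))"
    },
    {
      "refId": "B",
      "datasource": { "type": "loki", "uid": "loki" },
      "expr": "sum(count_over_time({app=\"api\"} |= `error` [5m]))"
    }
  ]
}
```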
Common Mistakes
- ✗ Cramming 30+ panels into one dashboard. Each panel fires an independent query. With 30 panels on a 5-minute auto-refresh, that is 360 queries/hour per viewer, which will hammer the data sources
- ✗ Hardcoding values instead of using dashboard variables. Baking in a namespace or service name leads to duplicated dashboards. Parameterize all selectors so one dashboard works for any service
- ✗ Skipping folder organization and permissions. Without folder-based RBAC, every user sees every dashboard. Organize by team or service and lock down access
- ✗ Querying raw high-resolution data over long time ranges. A 30-day graph at 15-second resolution pulls 172,800 points per series. Use Prometheus recording rules to pre-aggregate for anything beyond 6 hours (see the sketch after this list)
- ✗ Not testing alerting contacts and notification policies. Grafana Alerting without configured contact points and routing trees sends alerts into the void. Always test the full notification path before depending on it
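For the pre-aggregation point above: a Prometheus recording rule computes the expensive expression once on the server, and long-range panels then query the cheap pre-computed series instead of raw samples. A minimal sketch (the rule name and metric are illustrative):

```yaml
# prometheus-rules.yaml -- pre-aggregate so 30-day panels read one series per service
groups:
  - name: http-aggregations
    interval: 1m
    rules:
      - record: service:http_requests:rate5m   # query this name in long-range panels
        expr: sum by (service) (rate(http_requests_total[5m]))
```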