Mar 19, 2025

The Monitoring Stack I Recommend

The stack and principles I reach for when I want visibility without turning observability into theater.

Observability · Monitoring · SRE

When people ask for a monitoring stack recommendation, they often expect a tool list. The real answer is not just which tools to install. It is what kind of visibility problem you are actually trying to solve.

For most teams building on modern infrastructure, I like a practical, open-source setup that grows with the system:

  • Prometheus for multi-dimensional metrics
  • Grafana for visualization and centralized alerting
  • Loki for log aggregation and exploration
  • Tempo for distributed tracing (when the service graph justifies it)
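As a rough sketch, the whole stack can be spun up locally with Docker Compose for experimentation. The image names are the upstream defaults and the ports are the conventional ones; treat both as assumptions and pin versions before relying on this:

```yaml
# Illustrative local-experimentation sketch only -- no persistence,
# no auth, no production hardening. Pin image tags for real use.
services:
  prometheus:
    image: prom/prometheus
    ports: ["9090:9090"]
  grafana:
    image: grafana/grafana
    ports: ["3000:3000"]
  loki:
    image: grafana/loki
    ports: ["3100:3100"]
  tempo:
    image: grafana/tempo
    ports: ["3200:3200"]
```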

Why this stack?

This combination works because it provides a unified view of system health without the vendor lock-in or the high cost of SaaS solutions.

Prometheus: The backbone of metrics

Prometheus isn't just a database; it's an ecosystem. The ability to pull metrics from exporters (like node_exporter or kube-state-metrics) means you can get deep visibility into your infrastructure and applications with minimal code changes. I prefer it over push-based systems because the pull model keeps control of collection frequency and overhead in the monitoring system itself, rather than scattered across every instrumented application.
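A minimal prometheus.yml makes the pull-model point concrete: the scrape interval lives in one place, on the monitoring side. The target addresses below are placeholders for wherever your exporters actually run:

```yaml
# Minimal Prometheus scrape configuration (illustrative sketch).
# Target addresses are placeholders -- adjust for your environment.
global:
  scrape_interval: 30s        # Prometheus decides how often to pull, not the app

scrape_configs:
  - job_name: "node"          # becomes the `job` label on every scraped series
    static_configs:
      - targets: ["node-exporter:9100"]          # hypothetical host:port

  - job_name: "kube-state-metrics"
    static_configs:
      - targets: ["kube-state-metrics:8080"]     # hypothetical host:port
```

Changing collection frequency or dropping a noisy target is a one-line edit here, with no redeploy of the instrumented services.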

Grafana: The "Single Pane of Glass"

Grafana is where the data becomes useful. I focus on building dashboards that follow the Four Golden Signals: Latency, Traffic, Errors, and Saturation. If a dashboard doesn't help me identify which of these is failing during an incident, it’s just noise.
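For illustration, the Four Golden Signals map onto PromQL expressions like the recording rules below. The metric names follow common Prometheus client-library conventions (http_requests_total, http_request_duration_seconds_bucket) and are assumptions, not a specific application's metrics:

```yaml
# Illustrative recording rules for the Four Golden Signals.
# Metric names are assumptions based on common client-library
# conventions -- substitute what your services actually expose.
groups:
  - name: golden-signals
    rules:
      - record: job:request_latency_seconds:p99      # Latency
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      - record: job:request_rate:rps                 # Traffic
        expr: sum by (job) (rate(http_requests_total[5m]))
      - record: job:error_ratio:5m                   # Errors
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))
      - record: instance:cpu_saturation:avg          # Saturation
        expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))
```

Pre-computing these as recording rules also keeps incident-time dashboards fast, since the expensive aggregation has already been done.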

Loki: Logs like Prometheus

Loki’s approach—indexing labels rather than the full log text—makes it incredibly cost-effective. It allows you to use the same labels for your logs as you do for your metrics. This is the "killer feature": you can see a spike in a Prometheus chart and jump directly to the relevant logs in the same timeframe with zero context switching.
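A sketch of what that label reuse looks like in practice, using Promtail (Loki's agent) to attach the same job label that Prometheus uses. The log path and label values here are hypothetical:

```yaml
# Promtail scrape config (illustrative sketch).
# Reusing the Prometheus `job` label means the LogQL selector
# {job="api"} lands on the logs for the same service whose metrics
# you were just looking at. Paths and values are placeholders.
scrape_configs:
  - job_name: api-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: api                      # matches the Prometheus `job` label
          __path__: /var/log/api/*.log  # hypothetical log location
```

With this in place, a spike on a chart for job="api" and a LogQL query like {job="api"} |= "error" are two views of the same label set.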

Observability is not "Theater"

I see many teams build impressive-looking dashboards that help nobody during an actual outage. This happens when you optimize for "looking smart" instead of "debugging fast."

To avoid this, I follow three principles:

  1. Alert on symptoms, not causes. Don't alert me because CPU is high; alert me because the user-facing latency has increased.
  2. Structure your logs. JSON-formatted logs make Loki far more powerful. They turn a wall of text into a searchable database.
  3. Keep the signal-to-noise ratio high. If an alert doesn't require immediate action, it shouldn't be an alert—it should be a dashboard entry or a daily report.
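The first principle can be sketched as a Prometheus alerting rule: page on user-facing latency, not on CPU. The metric name and the 500ms threshold are assumptions; take real values from your own SLOs:

```yaml
# Symptom-based alert (illustrative sketch).
# Metric name and threshold are assumptions -- derive them from your SLOs.
groups:
  - name: symptom-alerts
    rules:
      - alert: HighRequestLatency
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 0.5
        for: 10m                  # sustained breach, not a one-sample blip
        labels:
          severity: page          # requires immediate action (principle 3)
        annotations:
          summary: "p99 latency above 500ms for {{ $labels.job }}"
```

Note what is absent: there is no alert on CPU here. High CPU with normal latency is a dashboard entry; high latency is a page regardless of what the CPU is doing.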

Final thought

A good monitoring stack is one that reduces ambiguity under pressure. If it cannot shorten the path from confusion to action, it is overhead dressed up as maturity. Use your tools to build confidence, not just collection.