Mar 2025 – Apr 2025 • SRE / Observability

Observability Playground

Interactive observability lab demonstrating metrics, logs, and tracing.

The Problem

Many systems ship without adequate metrics, logging, or tracing, so when something breaks in production, engineers have little evidence to diagnose it with.

The Solution

Observability Playground demonstrates metrics, logs, and traces through simulated failure scenarios.

Implementation Details

You can't fix what you can't see. This project is a curated environment where "chaos" is invited so that we can learn how to observe it.

The Three Pillars

I integrated the full LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) to show how logs, metrics, and traces correlate. When a simulated memory leak occurs, you can see the resident memory metric spike, find the specific error logs, and trace the exact request path that triggered the leak.
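The correlation described above depends on every signal from one request sharing an identifier. The sketch below is an illustration of that idea using only the Python standard library, not the project's actual instrumentation: the names `new_trace_id`, `make_log_record`, and `handle_request` are hypothetical, and in a real setup a tracing SDK such as OpenTelemetry would normally supply the trace ID.

```python
import json
import secrets
import time


def new_trace_id() -> str:
    # 16 random bytes rendered as 32 hex characters, the same shape as a
    # W3C Trace Context trace-id. Hypothetical helper for illustration only.
    return secrets.token_hex(16)


def make_log_record(level: str, message: str, trace_id: str, **fields) -> str:
    # Emit one JSON log line that carries the trace ID next to the message,
    # so a log backend can index it and link the line to its trace.
    record = {
        "ts": time.time(),
        "level": level,
        "message": message,
        "trace_id": trace_id,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def handle_request(path: str, leaked: list) -> tuple[str, list[str]]:
    # Simulated handler: every log line for this request shares one trace ID.
    trace_id = new_trace_id()
    lines = [make_log_record("info", "request started", trace_id, path=path)]
    # Simulated leak (assumption): the buffer is appended to a long-lived
    # list and never released, so memory grows with each request.
    leaked.append(bytearray(1024))
    lines.append(
        make_log_record(
            "error",
            "buffer retained after request",
            trace_id,
            path=path,
            leaked_buffers=len(leaked),
        )
    )
    return trace_id, lines


if __name__ == "__main__":
    leaked_buffers: list = []
    _, log_lines = handle_request("/checkout", leaked_buffers)
    for line in log_lines:
        print(line)
```

With the trace ID present as a field in each log line, a log query for the error can lead straight to the matching trace. Grafana can be configured to turn such a field into a link to the trace, for example through derived fields on a Loki data source.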

Simulated Failures

The playground includes a "Chaos Dashboard" where you can trigger network latency, 5xx error storms, and CPU saturation events. That makes it a safe place for SREs to practice diagnosing incidents before they meet them in production.
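One way the three failure types above could be injected into a request handler is sketched here. This is a minimal sketch under stated assumptions, not the playground's real code: `ChaosConfig`, `burn_cpu`, and `handle_with_chaos` are hypothetical names, latency is approximated with a sleep, and CPU saturation with a busy loop on the calling thread.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class ChaosConfig:
    # Hypothetical knobs a dashboard might expose; all default to "off".
    latency_seconds: float = 0.0   # extra delay added to each request
    error_rate: float = 0.0        # fraction of requests answered with 503
    cpu_burn_seconds: float = 0.0  # time spent spinning per request


def burn_cpu(seconds: float) -> int:
    # Busy-wait until the deadline passes to keep one core occupied.
    deadline = time.perf_counter() + seconds
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1
    return iterations


def handle_with_chaos(config: ChaosConfig, rng: random.Random) -> int:
    # Apply the configured faults, then return an HTTP-style status code.
    if config.latency_seconds > 0:
        time.sleep(config.latency_seconds)
    if config.cpu_burn_seconds > 0:
        burn_cpu(config.cpu_burn_seconds)
    if rng.random() < config.error_rate:
        return 503
    return 200


if __name__ == "__main__":
    rng = random.Random(42)
    storm = ChaosConfig(latency_seconds=0.05, error_rate=0.5)
    statuses = [handle_with_chaos(storm, rng) for _ in range(10)]
    print(statuses)
```

Passing the random generator in explicitly keeps a scenario repeatable: the same seed replays the same sequence of failures, which helps when comparing dashboards between runs.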