Kubernetes Failure Lab
Mar 2026 – Apr 2026 | SRE Training Platform

Interactive lab for learning Kubernetes troubleshooting.

The Problem

Most engineers learn how to deploy to Kubernetes but struggle to diagnose failures once something breaks in production.

The Solution

Interactive platform for diagnosing Kubernetes failures.

Implementation Details

The best way to learn is to fix a broken system. The Kubernetes Failure Lab provides a set of "Broken Cluster" scenarios that you have to investigate and repair.

Scenario Engine

I built an automated agent that deliberately breaks a specific part of a Kubernetes cluster (for example, misconfiguring the CoreDNS ConfigMap or deleting a CNI plugin binary from a node) and then challenges the user to find the root cause using standard kubectl commands and observability tools.
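
As an illustration of what one fault injection of this kind could look like, here is a minimal Python sketch of the CoreDNS case. It is not the lab's actual agent code. The function names (break_corefile, patch_command, inject), the choice of fault (pointing the forward plugin at an unreachable upstream), and the assumption that CoreDNS runs as the coredns Deployment and ConfigMap in kube-system (as on kubeadm-style clusters) are all hypothetical.

```python
import json
import re
import subprocess
from typing import Callable, List

# Assumption: 192.0.2.1 sits in TEST-NET-1 (RFC 5737), a documentation-only
# range, so queries forwarded there should never get an answer.
BROKEN_UPSTREAM = "192.0.2.1"

# Matches a Corefile line such as "forward . /etc/resolv.conf {" and captures
# the prefix, the upstream list, and any trailing "{" separately.
_FORWARD_LINE = re.compile(
    r"^([ \t]*forward[ \t]+\.[ \t]+)([^{\n]+?)([ \t]*\{?[ \t]*)$",
    re.MULTILINE,
)


def break_corefile(corefile: str) -> str:
    """Return a copy of the Corefile whose forward plugin targets a dead upstream."""
    broken, count = _FORWARD_LINE.subn(
        rf"\g<1>{BROKEN_UPSTREAM}\g<3>", corefile
    )
    if count == 0:
        raise ValueError("no 'forward .' line found in Corefile")
    return broken


def patch_command(corefile: str) -> List[str]:
    """Build the kubectl argv that merge-patches the CoreDNS ConfigMap."""
    patch = json.dumps({"data": {"Corefile": corefile}})
    return [
        "kubectl", "-n", "kube-system", "patch", "configmap", "coredns",
        "--type", "merge", "-p", patch,
    ]


def inject(run: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> str:
    """Break cluster DNS and return the original Corefile for later rollback."""
    original = run(
        ["kubectl", "-n", "kube-system", "get", "configmap", "coredns",
         "-o", "jsonpath={.data.Corefile}"],
        check=True, capture_output=True, text=True,
    ).stdout
    run(patch_command(break_corefile(original)), check=True)
    # Restart CoreDNS so the fault takes effect without waiting for a reload.
    run(["kubectl", "-n", "kube-system", "rollout", "restart",
         "deployment", "coredns"], check=True)
    return original
```

Keeping the Corefile rewrite as a pure function makes the fault testable without a cluster, and returning the original Corefile from inject gives a reset step something to restore by calling patch_command on it. The run parameter is there only so a fake runner can stand in for subprocess.run in tests.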

Real-world Simulations

Scenarios include "The Stealthy CPU Leak," "The Flaky Webhook," and "The OOMKill Mystery," all based on real incidents I've encountered or studied.