Sep 10, 2025

Why Alert Fatigue Happens

Teams do not ignore alerts because they are careless. They ignore alerts because the alert system trained them to.

Observability, On-call, SRE

Alert fatigue is a systems design problem first and a people problem second. When most alerts require no action, ignoring them is exactly the behavior the alerting system rewards.

The Vicious Cycle

Alert fatigue starts when the line between "informational" and "critical" becomes blurred. It follows a predictable and destructive pattern:

  1. Low-Signal Alerts: A team sets up alerts for minor deviations (e.g., CPU spikes at 70%, occasional 5xx errors).
  2. Desensitization: People start seeing these alerts multiple times a day. Since they rarely require action, the mental cost of "checking" them feels like wasted effort.
  3. The Mute Habit: Engineers start muting channels or creating auto-archive filters.
  4. The Critical Miss: A genuine, system-threatening incident occurs, but its alert is buried under the noise that engineers have been trained to ignore.

Designing for Human Attention

Good on-call culture begins with respecting human attention. Every alert that triggers a page should meet a high bar of clarity and urgency.

1. The Actionability Test

If an alert triggers at 3 AM and the response is "I'll check this when I wake up," then it wasn't a page; it was an email. An alert with neither a clear next step nor a linked runbook shouldn't be interruptive.
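One way to make this test concrete is to encode it as a check that runs whenever an alert rule is defined. The sketch below is only an illustration: the `AlertRule` fields, the function names, and the example URL are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertRule:
    """Hypothetical alert definition; all field names are illustrative."""
    name: str
    next_step: Optional[str] = None        # first action a responder should take
    runbook_url: Optional[str] = None      # link to a runbook, if one exists
    can_wait_until_morning: bool = False   # is "I'll check this later" acceptable?


def passes_actionability_test(rule: AlertRule) -> bool:
    """An alert may interrupt only if it is urgent and tells the responder what to do."""
    has_guidance = bool(rule.next_step) or bool(rule.runbook_url)
    return has_guidance and not rule.can_wait_until_morning


def delivery_channel(rule: AlertRule) -> str:
    """Alerts that fail the test are demoted to a non-interruptive channel."""
    return "page" if passes_actionability_test(rule) else "email"
```

Under these assumptions, a rule with a runbook that cannot wait is delivered as a page, while a rule with no guidance, or one that can wait until morning, is delivered as an email.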

2. Symptom over Cause

Don't page on a cause alone unless you have to. For example, don't page because "Memory is high." Page because "Memory is high AND swap usage is increasing AND latency has spiked." Requiring a user-visible symptom alongside the cause reduces the likelihood of paging on transient spikes that the system would have recovered from on its own.
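The combined memory, swap, and latency condition can be sketched as a single predicate. This is a minimal illustration, not a recommended configuration: the `Snapshot` fields and both threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical metrics sample for one host; field names are illustrative."""
    memory_used_pct: float
    swap_used_mb_prev: float   # swap usage at the previous sample
    swap_used_mb_now: float    # swap usage at the current sample
    p99_latency_ms: float


# Assumed thresholds for illustration only; tune these for a real system.
MEMORY_HIGH_PCT = 90.0
LATENCY_SPIKE_MS = 500.0


def memory_pressure_page(s: Snapshot) -> bool:
    """Page only when the cause (memory) coincides with a symptom (latency)."""
    memory_high = s.memory_used_pct >= MEMORY_HIGH_PCT
    swap_increasing = s.swap_used_mb_now > s.swap_used_mb_prev
    latency_spiked = s.p99_latency_ms >= LATENCY_SPIKE_MS
    return memory_high and swap_increasing and latency_spiked
```

With these thresholds, high memory on its own never pages. All three conditions have to hold in the same sample.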

3. Tiered Alerting

Separating your signals into tiers is essential:

  • Critical (P1): Wakes someone up. Needs a runbook. Needs immediate action.
  • Warning (P2): Visible in a dashboard or a non-interruptive Slack channel. Needs to be addressed within the current work day.
  • Informational (P3): Purely for debugging or post-incident analysis.
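The tiers above can be sketched as a small routing table. The destination names and the `destination` function are hypothetical. The runbook check reflects the P1 requirement in the list, enforced here by rejecting the rule outright, which is one possible design choice among several.

```python
from enum import Enum
from typing import Optional


class Tier(Enum):
    P1 = "critical"
    P2 = "warning"
    P3 = "informational"


# Hypothetical destinations; substitute whatever your tooling provides.
ROUTES = {
    Tier.P1: "pager",            # wakes someone up
    Tier.P2: "warning-channel",  # dashboard or non-interruptive chat channel
    Tier.P3: "debug-log",        # kept for debugging and post-incident analysis
}


def destination(tier: Tier, runbook_url: Optional[str] = None) -> str:
    """Return where an alert of this tier is delivered."""
    if tier is Tier.P1 and not runbook_url:
        # A P1 without a runbook is rejected instead of being allowed to page.
        raise ValueError("P1 alerts need a runbook before they are allowed to page")
    return ROUTES[tier]
```

Rejecting a P1 rule that has no runbook at definition time keeps the "needs a runbook" requirement from eroding as new alerts are added.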

Final thought

A paging system that wastes attention will eventually fail during the one incident that matters. Scaling an engineering team isn't just about hiring more people; it's about making sure the people you have are looking at the right signals.