Skip to main content

What On-Call Fatigue Taught Me About System Design

1 min read

The week from hell

Three pages in one night. Two of them were false positives from badly configured alert thresholds. The third was a real issue — but one we’d seen before and hadn’t fixed permanently.

The pattern

Alert typeCountAction taken
PagerDuty noise12Adjusted thresholds
Recurring incident5Manual restart each time
Novel problem2Permanent fix

The recurring incidents were the real drain. Each restart took 15 minutes and happened at 3 AM.

What we fixed

  • Auto-remediation: If a service crashes the same way twice in an hour, restart it automatically and file a ticket
  • Silence known issues: If there’s a known workaround, automate it
  • Post-incident investment: Every recurring incident gets a permanent fix within two weeks

What I learned

A quiet pager doesn’t mean the system is healthy. But a pager that wakes you up for the same thing twice means the system is neglected. Invest in permanence, not monitoring.