What On-Call Fatigue Taught Me About System Design

November 1, 2025 1 min read

The week from hell

Three pages in one night. Two of them were false positives from badly configured alert thresholds. The third was a real issue — but one we’d seen before and hadn’t fixed permanently.

The pattern

Alert type	Count	Action taken
PagerDuty noise	12	Adjusted thresholds
Recurring incident	5	Manual restart each time
Novel problem	2	Permanent fix

The recurring incidents were the real drain. Each restart took 15 minutes and happened at 3 AM.

What we fixed

Auto-remediation: If a service crashes the same way twice in an hour, restart it automatically and file a ticket
Silence known issues: If there’s a known workaround, automate it
Post-incident investment: Every recurring incident gets a permanent fix within two weeks

What I learned

A quiet pager doesn’t mean the system is healthy. But a pager that wakes you up for the same thing twice means the system is neglected. Invest in permanence, not monitoring.