What On-Call Fatigue Taught Me About System Design
1 min read
The week from hell
Three pages in one night. Two of them were false positives from badly configured alert thresholds. The third was a real issue — but one we’d seen before and hadn’t fixed permanently.
The pattern
| Alert type | Count | Action taken |
|---|---|---|
| PagerDuty noise | 12 | Adjusted thresholds |
| Recurring incident | 5 | Manual restart each time |
| Novel problem | 2 | Permanent fix |
The recurring incidents were the real drain. Each restart took 15 minutes and happened at 3 AM.
What we fixed
- Auto-remediation: If a service crashes the same way twice in an hour, restart it automatically and file a ticket
- Silence known issues: If there’s a known workaround, automate it
- Post-incident investment: Every recurring incident gets a permanent fix within two weeks
What I learned
A quiet pager doesn’t mean the system is healthy. But a pager that wakes you up for the same thing twice means the system is neglected. Invest in permanence, not monitoring.