Debugging a CPU Spike on Linux: A Practical Walkthrough
The symptom
Production alert: CPU > 90% for api-server. By the time SSH was open, it was back to normal. Happened every hour on the hour.
The toolkit
| Tool | What it told us |
|---|---|
| top | Which process (the server) |
| strace | What syscall (a tight loop) |
| perf | Where in the code (a mutex) |
| Flamegraph | The full picture |
The culprit
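# -f follows every thread of the process; timeout ends the trace so the -c summary is printed
timeout 30 strace -f -c -S time -p "$(pgrep -o api-server)" 2>&1 | head -20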
The summary showed futex syscalls dominating, a sign of mutex contention. A background goroutine was holding a lock while doing a slow Redis operation, blocking the main request handler.
The fix
Move the Redis operation out of the critical section. The mutex should only protect the in-memory state, not I/O.
What I learned
strace is the first tool, not the last. It tells you what syscall is busy but not why. perf top and flamegraphs fill in the “why.” The combination is faster than guessing.