Debugging a CPU Spike on Linux: A Practical Walkthrough

The symptom

Production alert: CPU > 90% for api-server. By the time an SSH session was open, usage was back to normal. It happened every hour, on the hour.

The toolkit

Tool        What it told us
top         Which process (the server)
strace      What syscall (a tight loop)
perf        Where in the code (a mutex)
Flamegraph  The full picture

The culprit

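# Sample for 10 seconds across all threads (-f); strace -c prints its summary only when it exits.
timeout 10 strace -f -c -S time -p "$(pgrep -o api-server)" 2>&1 | head -20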

The summary showed futex syscalls dominating the time, which points to mutex contention. A background goroutine was holding a lock while doing a slow Redis operation, blocking the main request handler.
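To make the pattern concrete, here is a minimal sketch of the anti-pattern, not the actual api-server code. Store, RefreshHoldingLock, and LockHeldDuringFetch are hypothetical names, and the fetch callback stands in for the slow Redis call. The sketch uses sync.Mutex.TryLock (Go 1.18+) only to observe that the lock is unavailable while the fetch is in flight. A real handler would call Lock and simply block.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical in-memory cache shared by the request handler
// and a background refresher.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

// RefreshHoldingLock reproduces the anti-pattern: the lock is taken first,
// and the slow fetch (standing in for the Redis call) runs while it is held.
func (s *Store) RefreshHoldingLock(key string, fetch func() string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = fetch() // slow I/O inside the critical section
}

// LockHeldDuringFetch reports whether the mutex is unavailable while the
// fetch is still in flight.
func LockHeldDuringFetch(s *Store) bool {
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		s.RefreshHoldingLock("k", func() string {
			close(started) // the simulated Redis call has begun
			<-release      // and stays in flight until released
			return "v"
		})
	}()

	<-started
	// A request handler calling s.mu.Lock() here would block.
	// TryLock lets us observe that without hanging.
	held := !s.mu.TryLock()
	if !held {
		s.mu.Unlock()
	}
	close(release)
	<-done
	return held
}

func main() {
	s := &Store{data: map[string]string{}}
	fmt.Println("lock held during fetch:", LockHeldDuringFetch(s))
	// prints "lock held during fetch: true"
}
```

Every handler that needs the lock waits for as long as the fetch takes, so one slow Redis round trip stalls all requests that touch the shared state.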

The fix

Move the Redis operation out of the critical section. The mutex should only protect the in-memory state, not I/O.
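A minimal sketch of that restructuring, under assumptions: Store, Refresh, and Get are hypothetical names rather than the real api-server code, and the fetch callback stands in for the Redis call. The slow call runs first with no lock held, and the mutex is taken only for the map write.

```go
package main

import (
	"fmt"
	"sync"
)

// Store is a hypothetical in-memory cache guarded by a mutex.
type Store struct {
	mu   sync.Mutex
	data map[string]string
}

// Refresh performs the slow fetch with no lock held, then takes the mutex
// only for the in-memory update.
func (s *Store) Refresh(key string, fetch func() string) {
	v := fetch() // slow I/O, outside the critical section

	s.mu.Lock()
	s.data[key] = v
	s.mu.Unlock()
}

// Get is what a request handler would call to read the cached value.
func (s *Store) Get(key string) (string, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	v, ok := s.data[key]
	return v, ok
}

func main() {
	s := &Store{data: map[string]string{"k": "old"}}
	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})

	go func() {
		defer close(done)
		s.Refresh("k", func() string {
			close(started) // the simulated Redis call has begun
			<-release      // and stays in flight until released
			return "new"
		})
	}()

	<-started
	v, _ := s.Get("k") // does not block: the fetch holds no lock
	fmt.Println("during fetch:", v) // prints "during fetch: old"

	close(release)
	<-done
	v, _ = s.Get("k")
	fmt.Println("after fetch:", v) // prints "after fetch: new"
}
```

The trade-off is that readers can see the previous value while a refresh is in flight, and if two refreshes overlap, the last writer wins. For a cache that is usually acceptable. If it is not, the update step needs a version check or similar guard, still without holding the lock across I/O.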

What I learned

strace is the first tool, not the last. It tells you what syscall is busy but not why. perf top and flamegraphs fill in the “why.” The combination is faster than guessing.