I'm wondering how I could monitor spinlocks. At my client's site, we have been getting CPU soft lockup failures, for which, if I understand correctly, a spinlock is a likely cause.

Different teams use that server for predictive modeling with R, Python and SAS, meaning we often have many unsupervised processes running in parallel, possibly using multiprocessing libraries.

Monitoring the number of spinlocks or, even better, which processes use them, might help confirm or rule them out as a cause of our frequent failures (5 in the last 3 weeks).

Is there any way to monitor them? If not, how could we find out what is causing those soft lockups?

Jeff Schaller

1 Answer

If the spin-locks are in user space, you likely can't monitor them directly. Some software tracks its own spin-lock time and provides a method for extracting it. Otherwise, you may be able to monitor by proxy using the count of runnable processes. That count should rise when many processes are spinning, because a spinning process stays runnable instead of going to sleep.
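
As a rough illustration of the proxy approach, here is a minimal Python sketch that extracts the runnable and blocked process counts from the text of /proc/stat on Linux. The function names are mine, not part of any tool, and the counts only hint at spinning; they don't prove it.

```python
def parse_proc_stat_counts(text):
    """Return (procs_running, procs_blocked) from the content of /proc/stat.

    Either value is None if the corresponding line is missing.
    """
    counts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[0] in ("procs_running", "procs_blocked"):
            counts[fields[0]] = int(fields[1])
    return counts.get("procs_running"), counts.get("procs_blocked")


def read_counts(path="/proc/stat"):
    """Read the live counts; Linux only, since it relies on /proc."""
    with open(path) as stat_file:
        return parse_proc_stat_counts(stat_file.read())
```

Calling read_counts() every few seconds and logging the result with a timestamp gives a series you can compare against the time of a lockup. Note that the running count includes the process doing the reading.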

A well-behaved program will abandon its spin-lock after a short period. If spinning does not acquire the lock, it will give up the CPU and block until the lock becomes available. A poorly behaved program keeps spinning instead, so it will increase CPU utilization when it spin-locks frequently.
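
The spin-then-block pattern can be sketched in Python as below. This only illustrates the control flow: real spin-locks are built on atomic instructions in C or assembly, and the function name and max_spins default are made up for the example.

```python
import threading


def acquire_spin_then_block(lock, max_spins=1000):
    """Acquire lock, spinning for a bounded number of attempts first.

    Returns True if the lock was taken while spinning, False if the
    caller had to fall back to a blocking wait.
    """
    for _ in range(max_spins):
        # Non-blocking attempt: returns immediately whether or not it succeeds.
        if lock.acquire(blocking=False):
            return True
    # Spinning did not pay off: sleep until the holder releases the lock.
    lock.acquire()
    return False
```

A badly behaved program is the same loop without the bound, which burns a full CPU for as long as the lock is held elsewhere.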

Recording the system state over time can be useful in cases like this. sar can run in the background, recording data periodically. This is useful in a case such as yours because you can examine the trends leading up to the failure. There are tools that will provide graphical output, but I find looking at the raw data more useful.
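
If you want to see what such a recording boils down to, the sketch below computes how CPU time was split between two samples of the aggregate "cpu" line in /proc/stat on Linux. It is not a replacement for sar, and the function names are my own. Time spent in the kernel shows up in the "system" share.

```python
CPU_FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(text):
    """Return the tick counters from the aggregate 'cpu' line of /proc/stat."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before, after):
    """Return the fraction of elapsed ticks per category between two samples.

    Only the first eight counters are used, because guest time is already
    counted inside user and nice. Returns None if no ticks elapsed.
    """
    deltas = [new - old for old, new in zip(before, after)][:len(CPU_FIELDS)]
    total = sum(deltas)
    if total <= 0:
        return None
    return {name: delta / total for name, delta in zip(CPU_FIELDS, deltas)}
```

Take one sample, sleep for the interval you care about, take another, and log cpu_shares(first, second) with a timestamp. A "system" share that climbs before each failure would point toward the kernel rather than user code.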

There are also tools that will record ongoing usage directly into rrd (round-robin database) files and graph the results. These are useful for trend analysis.

If these are batch or batch-like programs, you may want to dynamically nice the programs using the most CPU. There are various programs available that monitor resource utilization and adjust priorities so resource hogs don't kill performance for other users.
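
To give an idea of what such a tool does, here is a minimal Python sketch for Linux. It reads per-process CPU ticks from the content of /proc/<pid>/stat, picks the processes whose usage grew the most between two samples, and raises their niceness. The function names, the threshold and the niceness of 10 are assumptions for the example, not taken from any existing program.

```python
import os


def parse_pid_stat_ticks(stat_text):
    """Return utime + stime in clock ticks from the content of /proc/<pid>/stat."""
    # The command name sits in parentheses and may contain spaces, so split
    # after the last ')'. What remains starts at field 3 (process state).
    rest = stat_text.rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return utime + stime


def pick_hogs(before, after, min_ticks):
    """Return sorted PIDs whose CPU ticks grew by at least min_ticks.

    before and after map PID to total ticks at two sampling times.
    """
    return sorted(
        pid for pid, ticks in after.items()
        if pid in before and ticks - before[pid] >= min_ticks
    )


def renice(pid, niceness=10):
    """Set the niceness of pid; an unprivileged user can only raise it,
    and only for processes they own."""
    os.setpriority(os.PRIO_PROCESS, pid, niceness)
```

A loop that samples every minute and calls renice() on whatever pick_hogs() returns is the whole idea. The existing tools add policy on top, such as exemptions and restoring priorities later.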

BillThor
  • So if I understand you correctly, the kernel uses spin-locks itself. Does that mean that a system overloaded by CPU-heavy programs might crash by itself? I'm trying to see whether one of the programs might be using a spinlock incorrectly, or whether the general load by itself could crash the system. – laurent exsteens Jun 01 '16 at 16:23
  • Other question: if we manage to pinpoint programs using spinlocks incorrectly, will nicing them solve the issue (meaning, can a spinlock be put to sleep? From my understanding it can't)? Or would the only way be to force them to run single-threaded to avoid using spinlocks? – laurent exsteens Jun 01 '16 at 16:24
  • @laurentexsteens Nicing the process won't solve the problem, but it will allow all other processes to get time slices in preference to the program that has been niced. This should minimize the issue. I would expect the kernel spinlocks to terminate quickly. Any long-running kernel spinlock would be a bug, and should show up as high system time. – BillThor Jun 02 '16 at 13:31