15

I'm running Ubuntu Linux 12.04.1, with VirtualMin 4.08.gpl GPL and 2 CPU cores.

Pretty much all the time for the last few weeks, it's been running at a load average well above 5, usually closer to 10, sometimes reaching 20.

Right now, CPU load averages: 9.20 (1 min) 8.20 (5 mins) 7.81 (15 mins)

At the same time, VirtualMin returns:

Virtual Memory: 996 MB total, 15.44 MB used
Real Memory: 3.80 GB total, 972.43 MB used 
Local disk space: 915.94 GB total, 116.03 GB used

I have restarted the machine (shutdown -rf now) a few times, and sure enough, sooner or later the high load comes back.

Running top (or htop) shows nothing significant running at high CPU. In fact, after watching it for a few minutes, the highest item would maybe hit 3% CPU.

Top returns this also:

Cpu(s): 2.2%us, 1.2%sy, 0.0%ni, 0.0%id, 96.5%wa, 0.0%hi, 0.2%si, 0.0%st

The %wa concerns me as it's so high - seems to stay up above 80%. I understand this is % in wait, but not sure what that means in practical terms.

Where can I start to debug this and figure out what's causing the high CPU load?

Braiam
rjbathgate

1 Answer

19

Those are not "CPU load averages" but system "load averages". A high value doesn't necessarily mean that your CPU is busy, only that something in your system is. The value comes from /proc/loadavg, which man proc explains in more detail:

/proc/loadavg

The first three fields in this file are load average figures giving the number of jobs in the run queue (state R) or waiting for disk I/O (state D) averaged over 1, 5, and 15 minutes. They are the same as the load average numbers given by uptime(1) and other programs. The fourth field consists of two numbers separated by a slash (/). The first of these is the number of currently runnable kernel scheduling entities (processes, threads). The value after the slash is the number of kernel scheduling entities that currently exist on the system. The fifth field is the PID of the process that was most recently created on the system.

So, what you are seeing is the average number of processes that are running or waiting for the disk.
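To make the field layout quoted above concrete, here is a minimal Python sketch that splits the contents of /proc/loadavg into named values. The function name `parse_loadavg` and the dictionary keys are my own choices for illustration, not part of any standard tool.

```python
def parse_loadavg(text):
    """Split the five fields of /proc/loadavg into a dict."""
    one, five, fifteen, entities, last_pid = text.split()
    # The fourth field is "runnable/total" scheduling entities.
    runnable, total = entities.split("/")
    return {
        "load_1m": float(one),
        "load_5m": float(five),
        "load_15m": float(fifteen),
        "runnable": int(runnable),
        "total": int(total),
        "last_pid": int(last_pid),
    }
```

On a Linux machine it could be called as `parse_loadavg(open("/proc/loadavg").read())`.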

If you are seeing a load average of 20, it means that an average of 20 processes are either runnable (state R) or blocked waiting for disk I/O (state D). You can have a very high load average with low CPU usage, or a very low load average with high CPU usage, because load average is not a measure of CPU utilization.
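One way to see where the CPU time is actually going is to compare two samples of the aggregate `cpu` line in /proc/stat, whose first eight counters are user, nice, system, idle, iowait, irq, softirq and steal. The following is a rough sketch under that assumption; the helper names `parse_cpu_line` and `cpu_percentages` are hypothetical, and the result should roughly correspond to what top prints in its `Cpu(s)` line.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Return the first eight counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {name: b[name] - a[name] for name in a}
    total = sum(delta.values())
    if total == 0:
        # No ticks elapsed between the samples.
        return {name: 0.0 for name in delta}
    return {name: 100.0 * d / total for name, d in delta.items()}
```

To use it, read the first line of /proc/stat, wait a few seconds, read it again, and pass both lines in. A large `iowait` share combined with small `user` and `system` shares matches the picture of a high load average with an idle-looking CPU.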

A high %wa can mean that some process is thrashing the disk so heavily that everything else slows down, so figure out which one is the culprit, starting with the processes in state D. On most implementations of top, wa means I/O wait.
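As a starting point for finding processes in state D, here is a small Python sketch that walks /proc and reads the state field from each /proc/<pid>/stat file. The function names `parse_stat` and `d_state_processes` are made up for this example, and it assumes the usual Linux layout where the command name appears in parentheses followed by a one-letter state.

```python
import os


def parse_stat(text):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or parentheses, so split around the last ')'.
    left, right = text.rsplit(")", 1)
    pid, comm = left.split("(", 1)
    return int(pid), comm, right.split()[0]


def d_state_processes(proc_root="/proc"):
    """Return (pid, command) for every process currently in state D."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or its stat file was unreadable
        if state == "D":
            found.append((pid, comm))
    return found
```

Processes move in and out of state D quickly, so a single call may return nothing. Calling `d_state_processes()` repeatedly, for example once a second for a minute, and counting which commands show up most often gives a better picture.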

Braiam
  • Hi, thanks for the helpful reply. What would be my next step to figure out which processes are the culprit? Thanks – rjbathgate Jun 04 '14 at 02:02
  • My last paragraph already says to check out the D processes. – Braiam Jun 04 '14 at 02:05
  • Yes, thanks. Forgive me, but how do I check out the D processes? What is a D process? – rjbathgate Jun 04 '14 at 02:24
  • @user1513196 I've answered that question a few times, so I'll shamelessly supply a link to one of those times: http://unix.stackexchange.com/a/116865/4358 – phemmer Jun 04 '14 at 02:37
  • Thanks, appreciate the link. These seem a bit suspect (TIME+ / WCHAN / COMMAND): 26:28.85 / sleep_on_ / jbd2/sda1-8 and 39:30.24 / get_reque / flush-8:0. They come and go from being in D state, but the TIME+ seems excessive. There are other D processes, but none are in D for long and I recognise them all as OK. – rjbathgate Jun 04 '14 at 03:02
  • @user1513196 seems like you need to be asking another question ;) – Braiam Jun 04 '14 at 03:16
  • @Braiam - I would think the general gist of what I am trying to solve here is pretty clear. – rjbathgate Jun 04 '14 at 03:24
  • I had this situation: 100% CPU usage (all cores) and a load of about 40. The system was apparently running fine, but the CPU was heating up. I sent SIGKILL to 3 culprit pids, and the remaining one left me with about 80% CPU usage on all cores. htop, top and ps could not tell me which pid was causing that, so I created [this question](http://unix.stackexchange.com/questions/138502/detect-process-eating-cpu-without-top-htop-ps). Unfortunately I didn't look at the %wa, and I still can't reproduce the problem. I would like to try to create a script that quickly gathers all the info needed to track down the culprit pids. – Aquarius Power Jun 22 '14 at 23:22