
We have two identically configured servers, and one of them is showing a high %sy (system CPU time) while the other does not. This seems to indicate that the kernel is busy doing something on the affected machine.

Cpu(s): 28.1%us, 66.3%sy,  0.7%ni,  4.8%id,  0.0%wa,  0.2%hi,  0.0%si,  0.0%st

My Question

How do I determine what sub-module(s) of the kernel are causing this high load?

Additional info

  • Both systems are Ubuntu 12.04 LTS
  • Both systems are EC2 instances running on AWS (Amazon)

The affected system experienced what looks like a stack trace in the logs, along with this message:

Soft lockup message
[33644527.529071] BUG: soft lockup - CPU#0 stuck for 23s! [monitorcron:31103]
[33644527.529087] Modules linked in: isofs ip6table_filter ip6_tables ipt_REJECT xt_state iptable_filter xt_REDIRECT xt_comment xt_multiport iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables intel_rapl x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
Stack trace
[33644527.529116] CPU: 0 PID: 31103 Comm: moncron Tainted: G        W    3.13.0-34-generic #60~precise1-Ubuntu
[33644527.529120] task: ffff8800a565afe0 ti: ffff8800c6150000 task.ti: ffff8800c6150000
[33644527.529122] RIP: e030:[<ffffffff8175f32f>]  [<ffffffff8175f32f>] _raw_spin_unlock+0x1f/0x30
[33644527.529133] RSP: e02b:ffff8800c6151c58  EFLAGS: 00000286
[33644527.529135] RAX: ffff8801aed728c0 RBX: ffff8800c6151cc0 RCX: ffff8801aed728c0
[33644527.529137] RDX: ffff8801aed728c0 RSI: 00000000a10ca10a RDI: ffff8801aed72898
[33644527.529139] RBP: ffff8800c6151c58 R08: 000000000000000a R09: 0000000000000000
[33644527.529141] R10: 0000000000000131 R11: 0000000000000130 R12: ffff8801aed72840
[33644527.529142] R13: ffff8801aed728c0 R14: ffff8801aed72898 R15: ffff8801aed72840
[33644527.529149] FS:  00007f37888a8700(0000) GS:ffff8801dee00000(0000) knlGS:0000000000000000
[33644527.529152] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[33644527.529153] CR2: 0000000000dac3b8 CR3: 00000000051e5000 CR4: 0000000000002660
[33644527.529156] Stack:
[33644527.529158]  ffff8800c6151ca8 ffffffff811e0a98 ffff8800c6151cb8 ffff8801aed728c0
[33644527.529161]  ffff8800c6151d00 ffff8800c6151cc0 ffff8801aee64900 0000000000007980
[33644527.529164]  0000000000007980 ffff8801aee64900 ffff8800c6151ce8 ffffffff811e0c18
[33644527.529168] Call Trace:
[33644527.529177]  [<ffffffff811e0a98>] shrink_dentry_list+0x28/0xe0
[33644527.529181]  [<ffffffff811e0c18>] shrink_dcache_parent+0x28/0x70
[33644527.529188]  [<ffffffff81232257>] proc_flush_task_mnt.isra.15+0x77/0x170
[33644527.529194]  [<ffffffff81235776>] proc_flush_task+0x56/0x70
[33644527.529200]  [<ffffffff8106c803>] release_task+0x33/0x130
[33644527.529204]  [<ffffffff8106cdcf>] wait_task_zombie+0x4cf/0x5f0
[33644527.529209]  [<ffffffff8106cffb>] wait_consider_task.part.8+0x10b/0x180
[33644527.529213]  [<ffffffff8106d0d5>] wait_consider_task+0x65/0x70
[33644527.529217]  [<ffffffff8106d1e1>] do_wait+0x101/0x260
[33644527.529220]  [<ffffffff8106e213>] SyS_wait4+0xa3/0x100
[33644527.529225]  [<ffffffff8106bc10>] ? task_stopped_code+0x50/0x50
[33644527.529231]  [<ffffffff8176853f>] tracesys+0xe1/0xe6
[33644527.529232] Code: 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 e9 0a 00 00 00 66 83 07 02 5d c3 0f 1f 40 00 8b 37 f0 66 83 07 02 <f6> 47 02 01 74 ed e8 d0 74 fe ff 5d c3 0f 1f 40 00 66 66 66 66 

I'm assuming this is related but haven't determined how or why, if indeed it is.

Braiam
slm
    Possibly related: http://unix.stackexchange.com/questions/70377/bug-soft-lockup-cpu-stuck-for-x-seconds – muru Dec 05 '14 at 15:27

1 Answer


First, you should try to determine whether this problem is caused by a single process/application or is system-wide. There are probably tools to do this in a more straightforward way (which I am not aware of), but without them at hand I suggest iterating through all processes that consume a relevant share of CPU time and stopping each one for a moment (`kill -STOP $PID`). If the load is caused by one or a few processes, the %sy value should drop noticeably while they are stopped.
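A minimal sketch of that stop-and-watch approach. To keep it safe to run, it demonstrates `SIGSTOP`/`SIGCONT` on a throwaway `sleep` process; in practice you would substitute the PID of a suspect process and watch whether %sy in `top` drops while it is frozen:

```shell
#!/bin/sh
# Demo of pausing and resuming a process without killing it.
sleep 60 &
PID=$!                                   # stand-in for a suspect PID

kill -STOP "$PID"                        # freeze the process
sleep 1
state=$(ps -o stat= -p "$PID" | cut -c1)
echo "state after STOP: $state"          # 'T' means stopped

# ...here you would watch `top` for a few seconds to see if %sy drops...

kill -CONT "$PID"                        # resume the process
sleep 1
state=$(ps -o stat= -p "$PID" | cut -c1)
echo "state after CONT: $state"

kill "$PID" 2>/dev/null                  # clean up the demo process
```

Because `SIGSTOP` is not catchable, this works even on processes that ignore other signals, and `SIGCONT` resumes them with no state lost.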

If you have found such a process, you can attach `strace -c -p $PID` to it for a few seconds to see which syscalls it makes and how much time they take. That may give you a hint about which kernel parts are involved (especially if you compare the result with the output on the healthy system).
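One way to run that for a bounded time (the `timeout` wrapper and the 10-second window are my choices, not part of the answer). Since there is no suspect PID available here, the runnable part below demonstrates the `-c` summary on a short command instead of attaching:

```shell
# In practice, attach to the suspect PID for a bounded interval, e.g.:
#   timeout 10 strace -c -p "$PID"
# On detach, strace prints a table of syscalls, call counts and time spent.

# Self-contained demo of the same -c summary on a short command:
strace -c -o summary.txt true
tail -n 3 summary.txt   # divider plus the closing "total" line
```

The `-c` summary attributes time per syscall, which is exactly the hint needed here: heavy `wait4`/`clone` suggests process churn, while heavy file/dentry syscalls would fit the dcache-related stack trace in the question.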

Hauke Laging
  • The high load does not seem to be caused by a userland process (at least not directly). Top shows all the procs with loads of 0. Occasionally a proc will pop up and consume load, but it's normal and then once they're done their individual load drops back down. We've found a proc that seems to be blocking on a lock file, so I'll try attaching to it to see if it sheds additional info on the issue. – slm Dec 05 '14 at 15:23
  • FYI, when attaching to the suspected proc it's not doing anything. – slm Dec 05 '14 at 15:31
  • @slm You get the results when aborting `strace`. – Hauke Laging Dec 05 '14 at 15:35
  • [How to quit strace when attached to a process?](http://stackoverflow.com/questions/5431047/how-to-quit-strace-when-attached-to-a-process) – slm Dec 05 '14 at 15:43
  • Connected to the hung `strace` with another `strace`, saw it was doing a `wait(...`, tried `kill -15` (didn't work), so did a `kill -9` and it's gone. – slm Dec 05 '14 at 15:46
  • @slm Seems to be part of the problem. Does that happen with all processes or just with this one? – Hauke Laging Dec 05 '14 at 15:54
  • Just that one is exhibiting it. – slm Dec 05 '14 at 16:00
  • @slm You could kill the process and restart it via `strace`: `strace -o problemprog.strace -f problemprog` – Hauke Laging Dec 05 '14 at 16:17
  • We're afraid to kill it since we'll lose the state it's in and lose the opportunity to debug this further. – slm Dec 05 '14 at 16:21
  • @slm Instead of killing: Can you start another instance of this program (via `strace`)? – Hauke Laging Dec 05 '14 at 16:26
  • No. And even if we did there's no guarantee that a second one would exhibit the same problem. – slm Dec 05 '14 at 16:38