
We have two identically configured servers, and one of them is showing a high %sy (system CPU time) while the other does not. This seems to indicate that the kernel is busy doing something on the affected machine.

Cpu(s): 28.1%us, 66.3%sy,  0.7%ni,  4.8%id,  0.0%wa,  0.2%hi,  0.0%si,  0.0%st

My Question

How do I determine what sub-module(s) of the kernel are causing this high load?

Additional info

  • Both systems are Ubuntu 12.04 LTS
  • Both systems are EC2 instances running on AWS (Amazon)

The affected system experienced what looks like a stack trace in the logs, along with this message:

Soft lockup message
[33644527.529071] BUG: soft lockup - CPU#0 stuck for 23s! [monitorcron:31103]
[33644527.529087] Modules linked in: isofs ip6table_filter ip6_tables ipt_REJECT xt_state iptable_filter xt_REDIRECT xt_comment xt_multiport iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables intel_rapl x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
Stack trace
[33644527.529116] CPU: 0 PID: 31103 Comm: moncron Tainted: G        W    3.13.0-34-generic #60~precise1-Ubuntu
[33644527.529120] task: ffff8800a565afe0 ti: ffff8800c6150000 task.ti: ffff8800c6150000
[33644527.529122] RIP: e030:[<ffffffff8175f32f>]  [<ffffffff8175f32f>] _raw_spin_unlock+0x1f/0x30
[33644527.529133] RSP: e02b:ffff8800c6151c58  EFLAGS: 00000286
[33644527.529135] RAX: ffff8801aed728c0 RBX: ffff8800c6151cc0 RCX: ffff8801aed728c0
[33644527.529137] RDX: ffff8801aed728c0 RSI: 00000000a10ca10a RDI: ffff8801aed72898
[33644527.529139] RBP: ffff8800c6151c58 R08: 000000000000000a R09: 0000000000000000
[33644527.529141] R10: 0000000000000131 R11: 0000000000000130 R12: ffff8801aed72840
[33644527.529142] R13: ffff8801aed728c0 R14: ffff8801aed72898 R15: ffff8801aed72840
[33644527.529149] FS:  00007f37888a8700(0000) GS:ffff8801dee00000(0000) knlGS:0000000000000000
[33644527.529152] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
[33644527.529153] CR2: 0000000000dac3b8 CR3: 00000000051e5000 CR4: 0000000000002660
[33644527.529156] Stack:
[33644527.529158]  ffff8800c6151ca8 ffffffff811e0a98 ffff8800c6151cb8 ffff8801aed728c0
[33644527.529161]  ffff8800c6151d00 ffff8800c6151cc0 ffff8801aee64900 0000000000007980
[33644527.529164]  0000000000007980 ffff8801aee64900 ffff8800c6151ce8 ffffffff811e0c18
[33644527.529168] Call Trace:
[33644527.529177]  [<ffffffff811e0a98>] shrink_dentry_list+0x28/0xe0
[33644527.529181]  [<ffffffff811e0c18>] shrink_dcache_parent+0x28/0x70
[33644527.529188]  [<ffffffff81232257>] proc_flush_task_mnt.isra.15+0x77/0x170
[33644527.529194]  [<ffffffff81235776>] proc_flush_task+0x56/0x70
[33644527.529200]  [<ffffffff8106c803>] release_task+0x33/0x130
[33644527.529204]  [<ffffffff8106cdcf>] wait_task_zombie+0x4cf/0x5f0
[33644527.529209]  [<ffffffff8106cffb>] wait_consider_task.part.8+0x10b/0x180
[33644527.529213]  [<ffffffff8106d0d5>] wait_consider_task+0x65/0x70
[33644527.529217]  [<ffffffff8106d1e1>] do_wait+0x101/0x260
[33644527.529220]  [<ffffffff8106e213>] SyS_wait4+0xa3/0x100
[33644527.529225]  [<ffffffff8106bc10>] ? task_stopped_code+0x50/0x50
[33644527.529231]  [<ffffffff8176853f>] tracesys+0xe1/0xe6
[33644527.529232] Code: 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 e9 0a 00 00 00 66 83 07 02 5d c3 0f 1f 40 00 8b 37 f0 66 83 07 02 <f6> 47 02 01 74 ed e8 d0 74 fe ff 5d c3 0f 1f 40 00 66 66 66 66 

I'm assuming this is related but haven't determined how or why, if indeed it is.

Braiam
slm
    Possibly related: http://unix.stackexchange.com/questions/70377/bug-soft-lockup-cpu-stuck-for-x-seconds – muru Dec 05 '14 at 15:27

1 Answer


First, you should try to determine whether this problem is caused by a single process/application or is system-wide. There are probably tools to do this in a more straightforward way (which I am not aware of), but without them at hand I suggest iterating through all processes that consume a relevant share of CPU time and stopping each one for a moment (`kill -STOP $PID`). If the load is caused by one or a few processes, the %sy value should drop noticeably while they are stopped.
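A minimal sketch of that stop-and-watch approach. To keep it safe to run, it demonstrates `SIGSTOP`/`SIGCONT` on a throwaway `sleep` process; in practice you would substitute the PID of a suspect process and watch whether %sy in `top` drops while it is frozen:

```shell
#!/bin/sh
# Demo of pausing and resuming a process without killing it.
sleep 60 &
PID=$!                                   # stand-in for a suspect PID

kill -STOP "$PID"                        # freeze the process
sleep 1
state=$(ps -o stat= -p "$PID" | cut -c1)
echo "state after STOP: $state"          # 'T' means stopped

# ...here you would watch `top` for a few seconds to see if %sy drops...

kill -CONT "$PID"                        # resume the process
sleep 1
state=$(ps -o stat= -p "$PID" | cut -c1)
echo "state after CONT: $state"

kill "$PID" 2>/dev/null                  # clean up the demo process
```

Because `SIGSTOP` is not catchable, this works even on processes that ignore other signals, and `SIGCONT` resumes them with no state lost.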

If you have found such a process, you can attach `strace -c -p $PID` to it for a few seconds to see which syscalls it makes and how much time they take. That may give you a hint about which kernel parts are involved (especially if you compare the result with the output on the healthy system).
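One way to run that for a bounded time (the `timeout` wrapper and the 10-second window are my choices, not part of the answer). Since there is no suspect PID available here, the runnable part below demonstrates the `-c` summary on a short command instead of attaching:

```shell
# In practice, attach to the suspect PID for a bounded interval, e.g.:
#   timeout 10 strace -c -p "$PID"
# On detach, strace prints a table of syscalls, call counts and time spent.

# Self-contained demo of the same -c summary on a short command:
strace -c -o summary.txt true
tail -n 3 summary.txt   # divider plus the closing "total" line
```

The `-c` summary attributes time per syscall, which is exactly the hint needed here: heavy `wait4`/`clone` suggests process churn, while heavy file/dentry syscalls would fit the dcache-related stack trace in the question.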

Hauke Laging
  • The high load does not seem to be caused by a userland process (at least not directly). Top shows all the procs with loads of 0. Occasionally a proc will pop up and consume load, but it's normal and then once they're done their individual load drops back down. We've found a proc that seems to be blocking on a lock file, so I'll try attaching to it to see if it sheds additional info on the issue. – slm Dec 05 '14 at 15:23
  • FYI, when attaching to the suspected proc it's not doing anything. – slm Dec 05 '14 at 15:31
  • @slm You get the results when aborting `strace`. – Hauke Laging Dec 05 '14 at 15:35
  • [How to quit strace when attached to a process?](http://stackoverflow.com/questions/5431047/how-to-quit-strace-when-attached-to-a-process) – slm Dec 05 '14 at 15:43
  • Connected to the hung `strace` with another `strace`, saw it was doing a `wait(...`, tried `kill -15` (didn't work), so did a `kill -9` and it's gone. – slm Dec 05 '14 at 15:46
  • @slm Seems to be part of the problem. Does that happen with all processes or just with this one? – Hauke Laging Dec 05 '14 at 15:54
  • Just that one is exhibiting it. – slm Dec 05 '14 at 16:00
  • @slm You could kill the process and restart it via `strace`: `strace -o problemprog.strace -f problemprog` – Hauke Laging Dec 05 '14 at 16:17
  • We're afraid to kill it since we'll lose the state it's in and lose the opportunity to debug this further. – slm Dec 05 '14 at 16:21
  • @slm Instead of killing: Can you start another instance of this program (via `strace`)? – Hauke Laging Dec 05 '14 at 16:26
  • No. And even if we did there's no guarantee that a second one would exhibit the same problem. – slm Dec 05 '14 at 16:38