
I have been trying to debug random system freezes when running the 4.14.93-rt kernel. To this end, I have enabled the lockup detector in the kernel using the following config:

CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=1
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=1

The goal is to trigger a kernel panic when a lockup occurs. I also enabled the NMI watchdog on the kernel cmdline:

nmi_watchdog=1

Making use of kdump/kexec tools, I have configured the system to generate a kernel crash dump on kernel panic. The mechanism works when triggering a panic manually using:

echo c > /proc/sysrq-trigger

I can confirm that the system loads a dump-capture kernel in that case. However, when experiencing an actual lockup, the system just reboots when the watchdog kicks in. AFAIK there is no kernel panic occurring: there is no switch to the dump-capture kernel, no core dump, and nothing stored in the logs.

Note that I enabled all relevant sysctl options:

kernel.panic = 1
kernel.panic_on_oops = 1
kernel.unknown_nmi_panic = 1
kernel.panic_on_unrecovered_nmi = 1
kernel.panic_on_io_nmi = 1
kernel.softlockup_panic = 1
kernel.hung_task_panic = 1
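
To double-check that these values are actually in effect on the running kernel (and were not reset after boot), they can be read back from /proc/sys. The following is a minimal sketch, assuming Python 3 on Linux; the function names `read_sysctls` and `report` are illustrative helpers, not part of any existing tool.

```python
from pathlib import Path

# The knobs listed above. A sysctl name maps to a path under /proc/sys
# by replacing each dot with a path separator.
PANIC_SYSCTLS = [
    "kernel.panic",
    "kernel.panic_on_oops",
    "kernel.unknown_nmi_panic",
    "kernel.panic_on_unrecovered_nmi",
    "kernel.panic_on_io_nmi",
    "kernel.softlockup_panic",
    "kernel.hung_task_panic",
]


def read_sysctls(names, root="/proc/sys"):
    """Return {name: value string}, with None for missing or unreadable entries."""
    values = {}
    for name in names:
        path = Path(root, *name.split("."))
        try:
            values[name] = path.read_text().strip()
        except OSError:
            # The knob may not exist if the matching kernel option is not built in.
            values[name] = None
    return values


def report(values):
    """Format one 'name = value' line per knob, flagging unavailable ones."""
    return [
        f"{name} = {value if value is not None else '<not available>'}"
        for name, value in values.items()
    ]
```

Calling `print("\n".join(report(read_sysctls(PANIC_SYSCTLS))))` prints the live values in the same form as the list above; an entry shown as `<not available>` would suggest the corresponding feature is not compiled into the running kernel.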

I see this behavior when experiencing a real-life system freeze. It also occurs when running a CPU-hogging while-loop on all cores with high RT priority. I would expect this to be detected as a hung task and lead to a panic.
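
For reference, a stress test of this kind could look roughly like the sketch below. It assumes Python 3 on Linux and root privileges (needed for SCHED_FIFO); the function names, the bounded duration and the priority argument are illustrative and not taken from the original test. Running it with a high priority may make the machine unresponsive for the whole duration.

```python
import os
import time


def pin_and_make_rt(cpu, priority):
    """Pin the calling process to one CPU and switch it to SCHED_FIFO (needs root)."""
    os.sched_setaffinity(0, {cpu})
    os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))


def busy_loop(seconds):
    """Spin without sleeping or yielding; return the number of iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def hog_all_cpus(seconds, priority):
    """Fork one pinned RT busy loop per usable CPU and wait for all of them."""
    pids = []
    for cpu in sorted(os.sched_getaffinity(0)):
        pid = os.fork()
        if pid == 0:  # child process
            status = 1
            try:
                pin_and_make_rt(cpu, priority)
                busy_loop(seconds)
                status = 0
            finally:
                # Never return into the parent's code path from the child.
                os._exit(status)
        pids.append(pid)
    for pid in pids:
        os.waitpid(pid, 0)
```

A call such as `hog_all_cpus(60, 90)` would keep every usable core busy at RT priority 90 for 60 seconds; both numbers are placeholders. Bounding the duration is a deliberate choice so that the system recovers on its own if no panic is triggered.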

What could cause a reboot in this case without triggering the panic/kdump mechanism?

tomptz
  • Could it be that you have reboot-on-panic set? The system enters a hard lockup, which causes a panic, which in turn causes the reboot. – user352632 May 12 '19 at 20:08
  • @user352632 What do you really mean? Kernel parameters? – firo Jul 22 '20 at 05:36
  • I have a similar issue on armv7 (Cortex-A9) running Linux 4.14.85. On investigation, I found that something has tainted the disable-MMU path: the CPU simply gets stuck in a loop after trying to disable the MMU in the hard-lockup / soft-lockup path. Ref: https://elixir.bootlin.com/linux/v4.14.85/source/arch/arm/mm/proc-v7.S#L61 The mechanism works when triggering a panic manually using: echo c > /proc/sysrq-trigger – Theorizchy Cleven Aug 25 '20 at 16:55
  • tomptz, were you able to work around the problem? Please share your findings if you managed to resolve it. – Sandeep Oct 06 '20 at 07:03

0 Answers