
I am using ./einj_mem_uc -f 'single' to inject an uncorrected error.

Source code: https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git

I tested on CentOS 8.5 (kernel 4.18.0-348.el8.x86_64). The system seems to boot into the mini kernel (kexec?) and then hits a kernel panic.

I then tried a 5.15 kernel, and the system did not crash. Why do the two kernels behave differently?

In both tests, I can see:

Memory failure: 0x2xxxxx: recovery action for dirty LRU page: Recovered

In the run that passed (the one that did not crash), I can also see:

SIGBUS: addr = 0x7exxxxxxxxx
page not present
Saw local machine check
Test passed

The run that crashed simply crashed, without printing any of the output above.
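For reference, the "SIGBUS: addr = ..." line in the passing run comes from a user-space SIGBUS handler. Below is a minimal sketch of such a handler, not the actual einj_mem_uc code: the names on_sigbus, install_sigbus_handler, describe_sigbus and the sigbus_* variables are hypothetical and exist only for illustration. It records si_code and si_addr from siginfo_t so that a machine-check SIGBUS (BUS_MCEERR_AR or BUS_MCEERR_AO) can be told apart from any other SIGBUS. It assumes Linux with glibc.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names, for illustration only. Filled in by the handler. */
static volatile sig_atomic_t sigbus_seen;  /* set to 1 once a SIGBUS arrives */
static volatile sig_atomic_t sigbus_code;  /* copy of siginfo_t si_code */
static void *volatile sigbus_addr;         /* copy of siginfo_t si_addr */

/* SA_SIGINFO-style handler: only stores values, no stdio in signal context. */
static void on_sigbus(int sig, siginfo_t *si, void *ctx)
{
    (void)sig;
    (void)ctx;
    sigbus_seen = 1;
    sigbus_code = si->si_code;
    sigbus_addr = si->si_addr;  /* meaningful only for fault-type SIGBUS */
}

/* Install the handler; returns 0 on success, -1 on error (like sigaction). */
static int install_sigbus_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}

/* Map si_code to a short description of the SIGBUS cause. */
static const char *describe_sigbus(int code)
{
    switch (code) {
#ifdef BUS_MCEERR_AR
    case BUS_MCEERR_AR:
        return "machine check, action required";
#endif
#ifdef BUS_MCEERR_AO
    case BUS_MCEERR_AO:
        return "machine check, action optional";
#endif
    default:
        return "not a machine-check SIGBUS";
    }
}
```

A caller would run install_sigbus_handler() before touching the injected address and then inspect sigbus_seen, sigbus_code and sigbus_addr. Note that this sketch simply returns from the handler; after a real BUS_MCEERR_AR the faulting access would be retried, so a real test would also need a way to skip it (for example sigsetjmp/siglongjmp), which is omitted here.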

Mark K
  • A/ Don't tell us that you are running that latest version of ras_tools on a 4.18 kernel. B/ The comparison between the two seems telling to me: a SIGBUS event is generated, and in the "pass one" occurrence the signal handler correctly handles the situation, whereas in the panicking case it is not even offered the chance to run. I have seen a couple of kernel patches related to SIGBUS; I'll check. In the meantime, since all of this initially starts from an MCE interrupt, you can investigate the differences in its handler. – MC68020 Aug 27 '22 at 10:57
  • Yes, the latest version of ras_tools is used in both test cases (kernels). – Mark K Aug 27 '22 at 14:15
  • BTW, in the other question you opened under a different title, right at the beginning of the problem, you reported *"Killing einj_mem_uc"*. How the hell would you expect its local SIGBUS handler to handle anything properly? – MC68020 Aug 27 '22 at 14:22
  • As I understand it (I may be wrong), the 4.18 code at https://elixir.bootlin.com/linux/v4.18/source/mm/memory-failure.c#L187 logs the string "Killing xxxxx" but actually does the same thing as the 5.15 kernel (https://elixir.bootlin.com/linux/v5.15/source/mm/memory-failure.c#L256). – Mark K Aug 27 '22 at 14:27

0 Answers