My OS hit a kernel panic (it looks like it triggered another kernel to dump; kdump?):

[   124.674715] core: Uncorrected hardware memory error in user-access at xxxxxxx
[   124.684140] BUG: scheduling while atomic: einj_mem_uc/5151/0xxxxxxxxx
[   124.684310] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
r = 0xxxxxxxxxxx[   124.691839] Memory failure: 0x25eae3: Killing einj_mem_uc:6161 due to hardware memory corruption
[   124.700827] {1}[Hardware Error]: event severity: recoverable
[   124.700828] {1}[Hardware Error]:  Error 0, type: recoverable
00 paddr = xxxxx[   124.700829] {1}[Hardware Error]:  fru_text: Card01, ChnE, DIMM0
[   124.700830] {1}[Hardware Error]:   section_type: memory error
[   124.700835] {1}[Hardware Error]:   error_status: 0x0000000000000400
[   124.712309] Memory failure: 0x25eae3: recovery action for dirty LRU page: Recovered
[   124.718713] {1}[Hardware Error]:   physical_address: 0x000000015ace3400
[   124.718715] {1}[Hardware Error]:   node: 0 card: 4 module: 0 rank: 0 bank: 21 device: 0 row: 10455 column: 1408 
[   124.718716] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[   124.718718] {1}[Hardware Error]:   DIMM location: _Node0_Channel4_Dimm0 CPU0_E0 
[   124.791089] Memory failure: 0x25eae3: already hardware poisoned
3 116
400
[    0.000000] Linux version 4.18.0-348.el8.x86_64 

I checked the source code:

https://elixir.bootlin.com/linux/v4.18/source/kernel/sched/core.c#L3287

According to that code, the OS should panic here only when panic_on_warn == 1, but I checked my system:

sudo sysctl -a | grep -i panic_on
...
kernel.panic_on_warn = 0
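
Since `panic_on_warn` is only one of several panic-related sysctls, it can help to list every such knob that is non-zero. The following is a minimal sketch, not a definitive tool: the function names `panic_knobs` and `enabled` are made up for illustration, and the sample text contains the one line shown above plus a hypothetical `kernel.panic_on_oops` line that is NOT taken from my machine.

```python
import re

# Matches lines such as "kernel.panic_on_warn = 0" in `sysctl -a` output.
# Any key containing "panic" is kept; everything else is ignored.
KNOB = re.compile(r"^(\S*panic\S*)\s*=\s*(-?\d+)$")


def panic_knobs(sysctl_text):
    """Return {name: value} for every panic-related line in the given text."""
    knobs = {}
    for line in sysctl_text.splitlines():
        match = KNOB.match(line.strip())
        if match:
            knobs[match.group(1)] = int(match.group(2))
    return knobs


def enabled(knobs):
    """Return the sorted names of the knobs whose value is non-zero."""
    return sorted(name for name, value in knobs.items() if value != 0)


# Sample input: the first line is from the output above,
# the second is a hypothetical value used only to exercise the filter.
SAMPLE = """\
kernel.panic_on_warn = 0
kernel.panic_on_oops = 1
"""

knobs = panic_knobs(SAMPLE)
print(knobs)
print(enabled(knobs))
```

On a real system the text would come from `sudo sysctl -a` instead of `SAMPLE`; a knob reported by `enabled` would be a candidate explanation for a panic that `panic_on_warn = 0` alone does not rule out.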
MC68020
Mark K
  • Curious about what is being printed between the BUG… message and the start of the reboot. **You should have "scheduling while atomic"** and then the stack dump. According to the code you should get the stack dump. And since you have panic_on_warn=0, the system could well panic for some other reason while dumping the stack, and **not** because of the scheduling bug. – MC68020 Aug 26 '22 at 16:58
  • I'd indeed bet on some double fault when dumping the stack. So please provide the missing lines you represented as.... – MC68020 Aug 26 '22 at 17:21
  • And BTW, DO TAKE CARE when reading code on Linux's GitHub. You are referring to **current**, from which… 4.18 is actually **very far**. (Not a real problem here, since the debug scheduling code did not change much.) – MC68020 Aug 26 '22 at 17:25
  • I've updated the question with the missing lines. – Mark K Aug 26 '22 at 23:22
  • I've updated the kernel source link to 4.18. – Mark K Aug 26 '22 at 23:25

1 Answer

OK then, just to confirm my comments above, thanks to the supplemental information you provided:

The kernel does not panic because of BUG: scheduling while atomic (which, as intended with kernel.panic_on_warn = 0, is not a valid reason for panicking) but, more obviously, because of repeated hardware memory failures detected by the MCE interrupt handler, which are possibly the source of some fatal problem in that handler.
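
To make the "repeated memory failures" visible, the `Memory failure:` lines can be grouped by page frame number. This is only a rough sketch under my own assumptions: the regular expression and the function name `memory_failures` are illustrative, not part of any kernel tooling, and the sample lines are copied from the log in the question (including the interleaved console output at the start of the first one).

```python
import re
from collections import defaultdict

# Captures the timestamp, the page frame number and the message of a
# "Memory failure:" line. search() is used so that stray console output
# in front of the timestamp does not prevent a match.
MEMFAIL = re.compile(
    r"\[\s*(\d+\.\d+)\]\s+Memory failure: (0x[0-9a-fA-F]+): (.*)$"
)


def memory_failures(log_text):
    """Group memory-failure events by pfn as {pfn: [(timestamp, message), ...]}."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = MEMFAIL.search(line)
        if match:
            timestamp, pfn, message = match.groups()
            events[pfn].append((float(timestamp), message.strip()))
    return dict(events)


# Lines copied from the log in the question.
SAMPLE = """\
r = 0xxxxxxxxxxx[   124.691839] Memory failure: 0x25eae3: Killing einj_mem_uc:6161 due to hardware memory corruption
[   124.712309] Memory failure: 0x25eae3: recovery action for dirty LRU page: Recovered
[   124.791089] Memory failure: 0x25eae3: already hardware poisoned
"""

failures = memory_failures(SAMPLE)
for pfn, page_events in failures.items():
    print(f"{pfn}: {len(page_events)} event(s)")
    for timestamp, message in page_events:
        print(f"  {timestamp:.6f}  {message}")
```

With the log above, the handler is reported three times for the same page, the last time on a page that was already marked as poisoned.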

MC68020
  • It is a single injected, recoverable error. The same test on a newer kernel didn't crash. Please refer to: https://unix.stackexchange.com/questions/714922/inject-uncorrected-error-then-system-reboot – Mark K Aug 27 '22 at 01:45
  • @GreenTea : ACK ! I'll switch to this new question. As far as this very question is concerned (kernel panic when panic_on_warn==0), my point remains that **"scheduling while atomic" did not trigger the kernel panic** (you'd have had the stack dump); it was instead caused by some problem in handling the MCE interrupt related to the detection of the hardware memory errors. – MC68020 Aug 27 '22 at 10:11
  • I didn't see any related log indicating why it rebooted/panicked; the "scheduling while atomic" line is the only difference from the PASS (no crash) log. – Mark K Aug 27 '22 at 10:20
  • @GreenTea : So that is a way to tell that it is a real panic deep in kernel code. Something like a double fault happening before the stack is dumped. I'll switch to your other thread, in which you give more info (SIGBUS). – MC68020 Aug 27 '22 at 10:44