
I am getting various hardware errors on one computer (see the photo below; unfortunately, in my experience they are not getting written to disk). They occur most often when transferring data over the LAN; I was unable to reproduce them with stress-ng.

The computer has also passed a 24-hour memtest with flying colours (around 11 passes). Its processor is an A10-9700 APU. The PSU is a beQuiet Pure 750 W; the computer has been cleaned and the thermal paste reapplied. This is the second PSU: previously it had a budget 500 W Chieftec, and the PSU change brought no improvement.

The rest is in the screenshot. I have updated the BIOS to version F24, without any improvement (newer versions do not support the CPU).

When memory dedicated to the integrated graphics was set to "auto" it would either crash (restart) or spit out mce errors:

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091048] [Hardware Error]: Corrected error, no action required.

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091095] [Hardware Error]: CPU:0 (15:65:1) MC1_STATUS[-|CE|MiscV|-|-|-|-]: 0x9800000000130151

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091160] [Hardware Error]: MC1 Error: Decoder predecode buffer parity error.

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091210] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091302] [Hardware Error]: Corrected error, no action required.

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091344] [Hardware Error]: CPU:0 (15:65:1) MC5_STATUS[-|CE|-|-|-|-|-]: 0x90000000000c0e0f

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091404] [Hardware Error]: MC5 Error: DE error occurred.

Message from syslogd@HOSTNAMEHERE at Mar  1 16:37:14 ...
 kernel:[31135.091446] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)

(In a different fashion) there are random restarts with no log info every 20-60 minutes. The same happens with kernel 4.19 (Debian buster: 4.19.0-14-amd64). The amdgpu graphics driver currently runs with "nomodeset".

I have never seen "cpu stuck" so far, but the mce errors appear:

  • (as stated) in kernel 4.19 every 20-30 minutes
  • in kernel 5.9 and 5.10 every 2-10 hours

In both cases there are restarts; typically 2-4 mce-type error sessions occur for each restart.
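
For reference, the raw MCx_STATUS values from the log above can be unpacked with a short script. This is only a sketch, not a diagnosis: it assumes the usual architectural layout of the status register (bit 63 Val, 62 Over, 61 UC, 60 En, 59 MiscV, 58 AddrV, 57 PCC, low 16 bits error code) and assumes that the extended error code sits in bits 20:16 on this CPU family. The function name `decode_mci_status` is made up for the example.

```python
# Sketch: split a raw MCi_STATUS value into its flag bits and error codes.
# Assumption: architectural bit layout; extended error code in bits 20:16.
FLAGS = {63: "Val", 62: "Over", 61: "UC", 60: "En",
         59: "MiscV", 58: "AddrV", 57: "PCC"}


def decode_mci_status(status: int) -> dict:
    """Return the set flag names and the (extended) error codes."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    return {
        "flags": flags,
        "corrected": "UC" not in flags,        # UC clear means corrected error
        "error_code": status & 0xFFFF,          # low 16 bits
        "ext_error_code": (status >> 16) & 0x1F,  # assumed 5-bit field
    }


# The two raw values reported in the log above.
for raw in (0x9800000000130151, 0x90000000000C0E0F):
    print(hex(raw), decode_mci_status(raw))
```

Both values decode with UC clear, which matches the "Corrected error" lines, and only the MC1 value has MiscV set, which matches the bracketed flags the kernel printed.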

What should I do?

[Photo: hardware error messages shown on the screen]

  • I do not understand why, but with the following boot options I have seen no such error for 5 days already: BOOT_IMAGE=/ROOT/boot/vmlinuz-5.10.0-0.bpo.3-amd64 root=UUID=3f048493-7e17-4f40-84b9-17c395246806 ro rootflags=subvol=ROOT crashkernel=1024 nmi_watchdog=1 iommu=pt iommu=1 amdgpu.dc=0 amdgpu.audio=0 crashkernel=384M-:128M The crucial ones seem to be: nmi_watchdog=1 crashkernel=1024 crashkernel=384M-:128M This is trial and error. –  Mar 08 '21 at 17:48
  • But it is not stable: a similar error just occurred, though it took several days rather than "at most a few hours". –  Mar 08 '21 at 21:09
