I am getting various hardware errors on a computer (see the photo below; unfortunately, in my experience they do not get written to disk). They occur most often when transferring data over the LAN; I was unable to trigger them with stress-ng.
The computer has also passed a 24 h memtest with flying colours (around 11 passes). Its processor is an A10-9700 APU. The PSU is a beQuiet Pure 750 W; the computer has been cleaned and the thermal paste reapplied. This is the second PSU: it previously had a budget 500 W Chieftec, and the change brought no improvement.
The rest is in the screenshot. I have updated the BIOS to version F24 without any improvement (newer versions do not support the CPU).
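Since the errors do not reach the disk on their own, one thing that might help capture them is forcing every hardware error line out to disk as soon as it appears. Below is a minimal sketch of that idea, not something I have verified against a hard reset. The function name `persist_hw_errors` and the `[Hardware Error]` marker filter are my own assumptions for illustration.

```python
import os
from typing import Iterable

# Assumption: the kernel tags the relevant lines with this marker,
# as in the messages I see on the console.
MARKER = "[Hardware Error]"


def persist_hw_errors(lines: Iterable[str], path: str) -> int:
    """Append every line containing MARKER to `path`, syncing after each one.

    Returns the number of lines written.
    """
    written = 0
    with open(path, "a", encoding="utf-8") as out:
        for line in lines:
            if MARKER not in line:
                continue
            # Make sure each record ends with exactly one newline.
            out.write(line if line.endswith("\n") else line + "\n")
            # Push Python's buffer to the OS, then ask the OS to write to disk.
            out.flush()
            os.fsync(out.fileno())
            written += 1
    return written
```

It could be fed from standard input, for example by piping `dmesg --follow` into a small script that calls `persist_hw_errors(sys.stdin, "mce-capture.log")` (the file name is hypothetical). Whether the last lines survive a sudden restart still depends on the disk and filesystem.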
When the memory dedicated to the integrated graphics was set to "auto", it would either crash (restart) or spit out MCE errors:
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091048] [Hardware Error]: Corrected error, no action required.
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091095] [Hardware Error]: CPU:0 (15:65:1) MC1_STATUS[-|CE|MiscV|-|-|-|-]: 0x9800000000130151
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091160] [Hardware Error]: MC1 Error: Decoder predecode buffer parity error.
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091210] [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091302] [Hardware Error]: Corrected error, no action required.
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091344] [Hardware Error]: CPU:0 (15:65:1) MC5_STATUS[-|CE|-|-|-|-|-]: 0x90000000000c0e0f
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091404] [Hardware Error]: MC5 Error: DE error occurred.
Message from syslogd@HOSTNAMEHERE at Mar 1 16:37:14 ...
kernel:[31135.091446] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: GEN, part-proc: GEN (no timeout)
In a different fashion, there are also random restarts every 20-60 minutes with no log info. The same happens on kernel 4.19 (Debian buster: 4.19.0-14-amd64). The amdgpu graphics driver is currently running with "nomodeset".
I have never seen "cpu stuck" so far, but the MCE errors appear:
- (as stated) in kernel 4.19 every 20-30 minutes
- in kernel 5.9 and 5.10 every 2-10 hours
In both cases there are restarts; typically 2-4 bursts of MCE errors occur per restart.
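For reference, here is how I read the two status values from the log above. This is a minimal sketch that only decodes the generic x86 machine-check status flags (bits 57 to 63) and the low 16-bit error code; it does not attempt the AMD-specific extended fields. The helper name `decode_mci_status` is my own.

```python
# Bit positions of the generic x86 MCA flags in an MCi_STATUS register.
FLAGS = {
    "VAL": 63,    # register contains a valid error
    "OVER": 62,   # an earlier error was overwritten
    "UC": 61,     # uncorrected error (clear means corrected)
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MCi_MISC holds valid extra information
    "ADDRV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
}


def decode_mci_status(status: int) -> dict:
    """Return the generic flags and the low 16-bit error code of a status value."""
    decoded = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    decoded["error_code"] = status & 0xFFFF
    return decoded


# The two values copied from the log above.
for bank, status in (("MC1", 0x9800000000130151), ("MC5", 0x90000000000C0E0F)):
    d = decode_mci_status(status)
    set_flags = [name for name in FLAGS if d[name]]
    print(f"{bank}: flags={set_flags} error_code={d['error_code']:#06x}")
```

In both values UC and PCC are clear, which agrees with the kernel reporting them as corrected errors; MISCV is set only for MC1, matching the `MiscV` field in the log.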
What should I do?
