I need some help diagnosing and finding the root cause of my system stability issues. All signs point to some sort of hardware problem (disk or RAM), but my investigations so far have not turned up anything.
This is a brand new system with new hardware running a fresh install of Ubuntu 20.04. It is a NUC (D54250WYK / NUC8I5BEH) with 2x16GB RAM and a 2TB Samsung SSD (Samsung 970 EVO Plus). The system has very little installed on it: just Docker Engine and around 8 containers.
The symptom is that every so often the system comes to a complete halt. I can barely log in to the machine via SSH. The one time I managed to, every command I ran gave:
-bash: /usr/bin/ls: Input/output error
Other times I could not log in remotely at all, but on a terminal opened directly on the machine I could see many errors being logged, mostly about the disk being full or being unable to write to disk.
A reboot fixes things and the system runs fine for between 1 and 6 days before the issue occurs again.
Checking dmesg and syslog, I don't see much from just before the system goes unresponsive. I'm guessing it is unable to write logs because the disk has gone read-only. I do see other services complaining a bit, e.g.:
[826122.177679] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[826122.178711] systemd[1161852]: containerd.service: Failed to connect stdout to the journal socket, ignoring: Connection refused
[826122.178970] systemd[1161852]: containerd.service: Failed to execute command: Input/output error
[826122.179022] systemd[1161852]: containerd.service: Failed at step EXEC spawning /usr/bin/containerd: Input/output error
[826122.179430] systemd[1]: containerd.service: Main process exited, code=exited, status=203/EXEC
[826122.179439] systemd[1]: containerd.service: Failed with result 'exit-code'.
[826122.179568] systemd[1]: Failed to start containerd container runtime.
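To check my read-only guess the next time this happens, something like the following could be used. It is only a minimal sketch that I have not been able to run during a failure: the function names `mount_mode` and `storage_errors` are made up for illustration, and the grep patterns are my assumption about which kernel messages are relevant. If the disk has already dropped out, `awk` and `grep` may themselves fail to load with the same Input/output error.

```shell
#!/usr/bin/env bash
# Print "ro" or "rw" for a mount point, reading a /proc/mounts-style file.
# Usage: mount_mode <mount_point> [mounts_file]   (hypothetical helper)
mount_mode() {
    local target="$1" mounts_file="${2:-/proc/mounts}"
    # Fields per line: device mountpoint fstype options dump pass.
    # The last matching line wins, since later mounts shadow earlier ones.
    awk -v t="$target" '
        $2 == t { opts = $4 }
        END {
            if (opts == "") exit 1          # mount point not found
            n = split(opts, o, ",")
            mode = "rw"
            for (i = 1; i <= n; i++) if (o[i] == "ro") mode = "ro"
            print mode
        }' "$mounts_file"
}

# Keep only log lines that mention storage trouble (reads stdin).
# The pattern list is an assumption, not an exhaustive set.
storage_errors() {
    grep -Ei 'nvme|I/O error|EXT4-fs (error|warning)|remount'
}
```

Usage would be `mount_mode /` to see whether the root filesystem is currently `ro`, and `sudo dmesg -T | storage_errors` to pull out any NVMe or filesystem messages from the kernel ring buffer, which lives in memory and so does not depend on the disk being writable.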
I also see a lot of log entries from the UFW firewall blocking various requests (some are for ports I allow, and I'm not sure why that would be happening).
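Because the UFW entries drown out everything else, a small filter helps when reading the kernel log. This is a sketch: `without_ufw` is a hypothetical name, and it assumes the UFW lines carry the usual bracketed prefix such as `[UFW BLOCK]`.

```shell
#!/usr/bin/env bash
# Drop UFW firewall entries from a log stream read on stdin,
# so that the remaining kernel messages are easier to spot.
without_ufw() {
    grep -Ev '\[UFW [A-Z ]+\]'
}
```

For example, `sudo dmesg -T | without_ufw | tail -n 50` shows the most recent non-firewall kernel messages.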
Based on my research, this looks like faulty hardware, most likely the disk or memory, so I have run as many diagnostics on both as I could:
- smartctl reports no errors and a healthy SSD
- badblocks runs fine through the system without issue, zero errors
- fsck does not pick up any issues except after I reboot due to a bad shutdown (which are fixed immediately)
- memtest86 ran through several loops without problem and zero errors reported
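In case it helps anyone compare results, this is roughly how the SMART output can be narrowed down to the NVMe health fields. It is a sketch under assumptions: `nvme_health_fields` is a made-up helper name, and the device path `/dev/nvme0` in the usage line is a guess at how the SSD is enumerated on this machine.

```shell
#!/usr/bin/env bash
# Keep only the NVMe health fields from `smartctl -a` output (reads stdin).
# The field list is a selection, not everything smartctl prints.
nvme_health_fields() {
    grep -E '^(Critical Warning|Temperature|Available Spare|Percentage Used|Unsafe Shutdowns|Media and Data Integrity Errors|Error Information Log Entries):'
}
```

Usage: `sudo smartctl -a /dev/nvme0 | nvme_health_fields`. Watching whether the error counters or the unsafe shutdown count change between a clean boot and the next hang would show whether the drive itself recorded anything.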
What else can I do to better diagnose this issue? Is there more logging I can turn on? Are there other diagnostic tools I can use to work out the cause?