
I need some help diagnosing and finding the root cause of my system stability issues. All signs point to some sort of hardware problem (disk or RAM), but my investigations so far haven't turned up anything.

This is a brand-new system with new hardware running Ubuntu 20.04: a NUC (D54250WYK / NUC8I5BEH) with 2x16GB RAM and a 2TB Samsung SSD (Samsung 970 EVO Plus). It's a fresh install with very little on it, just Docker Engine and around 8 containers.

The symptoms are that every so often the system grinds to a complete halt. I can barely log in to the machine via SSH; one time I could, and every command I ran gave:

```
-bash: /usr/bin/ls: Input/output error
```

Other times I could not log in remotely at all, but on the machine's local console I could see many errors, mostly about the disk being full or writes to disk failing.

A reboot fixes things and the system runs fine for between 1 and 6 days before the issue occurs again.

Checking dmesg and syslog, I don't see much from just before the system goes unresponsive; I'm guessing it can't write logs once the disk goes read-only. I do see other services complaining a bit, e.g.:

```
[826122.177679] systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
[826122.178711] systemd[1161852]: containerd.service: Failed to connect stdout to the journal socket, ignoring: Connection refused
[826122.178970] systemd[1161852]: containerd.service: Failed to execute command: Input/output error
[826122.179022] systemd[1161852]: containerd.service: Failed at step EXEC spawning /usr/bin/containerd: Input/output error
[826122.179430] systemd[1]: containerd.service: Main process exited, code=exited, status=203/EXEC
[826122.179439] systemd[1]: containerd.service: Failed with result 'exit-code'.
[826122.179568] systemd[1]: Failed to start containerd container runtime.
```
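
For reference, this is how I have been pulling logs from the boot where the hang happened. It only works if journald keeps logs across reboots (i.e. a /var/log/journal directory exists for persistent storage):

```bash
# List recorded boots, then pull errors from the boot before this one.
# Assumes persistent journald storage (/var/log/journal exists).
journalctl --list-boots
journalctl -b -1 -p err          # priority "err" and worse from the previous boot
journalctl -b -1 -k | tail -50   # last kernel messages before the hang
```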

I also see a LOT of logging from the UFW firewall blocking various requests (some on ports I allow; I'm not sure why that would be happening).
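
If the UFW noise matters, the rules and logging level are easy to double-check (standard ufw commands):

```bash
sudo ufw status verbose   # confirm the allow rules and default policies
sudo ufw logging low      # dial back the [UFW BLOCK] noise while debugging, if desired
```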

Based on research this appears to be faulty hardware, likely disk or memory, so I have run as much diagnostics on both as I could (the commands are sketched below the list):

  • smartctl reports no errors and a healthy SSD
  • badblocks runs through the whole disk without issue, zero errors
  • fsck finds nothing except the expected journal recovery after an unclean shutdown (fixed immediately)
  • memtest86 ran through several passes with zero errors reported
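
For completeness, roughly the commands I ran. The device names are assumptions for a typical single-NVMe setup (adjust /dev/nvme0n1 and the partition to match); badblocks is used in its default read-only mode, and fsck should be run from a live USB with the filesystem unmounted:

```bash
sudo smartctl -a /dev/nvme0n1     # SMART health and error log for the NVMe drive
sudo badblocks -sv /dev/nvme0n1   # read-only surface scan (default, non-destructive)
sudo fsck -f /dev/nvme0n1p2       # forced filesystem check; partition name is an example
# memtest86 boots from its own USB stick rather than running inside Linux
```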

What else can I do to better diagnose this issue? Is there more logging I can turn on? Are there other diagnostic tools I can use to work out the cause?

  • Maybe filesystem corruption? I'd try a new kernel. Maybe try booting a live distro to verify the hardware is in good shape (I prefer the [systemrescue](https://www.system-rescue.org/) distro). – Veles Jan 07 '21 at 04:42
  • Thanks, I will try systemrescue; I have not seen that before. I thought it was disk or filesystem corruption, but the disk checks I've tried so far came back clean, and the system sometimes runs without issue for over a week (doing a lot of disk I/O as well). Quite confusing! systemrescue has some more disk tools, though, so I will try those. – AdamK Jan 07 '21 at 05:02
  • @Ladon FWIW, I did try systemrescue, but all diagnostics and tests came back fine. I think I have found the issue though and have posted the answer in case you are interested. – AdamK Jan 19 '21 at 06:30

1 Answer


After a lot of digging, I appear to have found the solution (no crashes so far).

tl;dr:

In /etc/default/grub I added nvme_core.default_ps_max_latency_us=0 and pcie_aspm=off to the GRUB_CMDLINE_LINUX_DEFAULT variable, so it ends up looking like:

```
GRUB_CMDLINE_LINUX_DEFAULT="nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
```
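
To apply the change, the usual Ubuntu GRUB steps; the last command just confirms the parameters made it onto the kernel command line after the reboot:

```bash
sudo nano /etc/default/grub   # add the two parameters to GRUB_CMDLINE_LINUX_DEFAULT
sudo update-grub              # regenerate /boot/grub/grub.cfg
sudo reboot
cat /proc/cmdline             # should now include both parameters
```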

Together these disable APST (Autonomous Power State Transitions) on the NVMe drive and ASPM (Active State Power Management) on its PCIe link. One of these power-saving features appears to have been causing the Samsung Evo SSD to drop offline and become inaccessible.

There is quite a history of issues between some newer NVMe SSDs and APST/ASPM on Linux. Most of those issues looked to have fixes in place already, but it seems to have still been affecting me.
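
To sanity-check that the setting took effect, a couple of read-only checks (the sysfs path is standard for the nvme_core module; the nvme commands assume the nvme-cli package is installed and that the drive is /dev/nvme0):

```bash
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us   # should print 0
sudo nvme id-ctrl /dev/nvme0 | grep -i apsta                     # does the drive advertise APST?
sudo nvme get-feature /dev/nvme0 -f 0x0c -H                      # current APST configuration
```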

