Debian 11 Stable
KDE Plasma 5.20.5

About once per day my system crashes. I wrote down on paper what seem to be the most pertinent errors:

[83262.955525] systemd-journal [301]: failed to write entry (22 items, 747 bytes), 
ignoring: Read-only file system

EXT-FS Error (device sda1) __ext4_find_entry:1534 inode #1573987: 
com gmain: reading directory lblock 0

I found this photo here:

[image]

That's almost exactly how my screen looks during the crashes.

Drives and Partitions:

Drives:    Local Storage: total: 1.38 TiB used: 853.2 GiB (60.6%) 
           ID-1: /dev/sda vendor: Samsung model: SSD 860 EVO 1TB size: 931.51 GiB 
           ID-2: /dev/sdb vendor: Samsung model: SSD 850 PRO 512GB size: 476.94 GiB 
Partition: ID-1: / size: 45.53 GiB used: 8.3 GiB (18.2%) fs: ext4 dev: /dev/sda1 
           ID-2: /home size: 869.04 GiB used: 755.96 GiB (87.0%) fs: ext4 dev: /dev/sda3 
Swap:      ID-1: swap-1 type: partition size: 976 MiB used: 0 KiB (0.0%) dev: /dev/sda5 

I haven't attempted to install any proprietary drivers on this installation. I use this computer for work, and the non-proprietary driver's performance seems adequate for my work needs (unless the proprietary drivers are somehow needed to stop these crashes).

The only errors I see in the log file are related to firmware, which I didn't think I even needed with Debian:

[    0.101567] DMAR: [Firmware Bug]: No firmware reserved region can cover this RMRR [0x00000000bdeac000-0x00000000bdecbfff], contact BIOS vendor for fixes
[    0.101697] DMAR: [Firmware Bug]: Your BIOS is broken; bad RMRR [0x00000000bdeac000-0x00000000bdecbfff]
               BIOS vendor: Hewlett-Packard; Ver: F.25; Product Version: 0499220000241210001040000
[    0.237433] core: CPUID marked event: 'bus cycles' unavailable
[    0.244065] MDS CPU bug present and SMT on, data leak possible. See https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/mds.html for more details.
[    0.245624]  #5 #6 #7
[    0.266653] mtrr: your CPUs had inconsistent variable MTRR settings
[    1.889198] [Firmware Bug]: Invalid critical threshold (0)
[    2.306443] ACPI Warning: SystemIO range 0x0000000000000428-0x000000000000042F conflicts with OpRegion 0x0000000000000400-0x000000000000047F (\PMIO) (20200925/utaddress-204)
[    2.306458] ACPI Warning: SystemIO range 0x0000000000000540-0x000000000000054F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20200925/utaddress-204)
[    2.306467] ACPI Warning: SystemIO range 0x0000000000000530-0x000000000000053F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20200925/utaddress-204)
[    2.306476] ACPI Warning: SystemIO range 0x0000000000000500-0x000000000000052F conflicts with OpRegion 0x0000000000000500-0x0000000000000563 (\GPIO) (20200925/utaddress-204)
[    2.306484] lpc_ich: Resource conflict(s) found affecting gpio_ich
[    2.354733] r8169 0000:03:00.0: can't disable ASPM; OS doesn't have ASPM control
[    2.740251] nouveau 0000:01:00.0: bios: OOB 1 015f1901 015f1901
[    2.764763] ata5.00: supports DRM functions and may not be fully accessible
[    2.765804] ata1.00: supports DRM functions and may not be fully accessible
[    2.768822] ata1.00: supports DRM functions and may not be fully accessible
[    2.776029] ata5.00: supports DRM functions and may not be fully accessible
[    5.256428] systemd[1]: /lib/systemd/system/plymouth-start.service:16: Unit configured to use KillMode=none. This is unsafe, as it disables systemd's process lifecycle management for the service. Please update your service to use a safer KillMode=, such as 'mixed' or 'control-group'. Support for KillMode=none is deprecated and will eventually be removed.
[    5.599321] systemd-journald[301]: File /var/log/journal/e60bce5c0cc141a5b1ca070182b03357/system.journal corrupted or uncleanly shut down, renaming and replacing.
[    5.631756] i801_smbus 0000:00:1f.3: BIOS is accessing SMBus registers
[    5.631758] i801_smbus 0000:00:1f.3: Driver SMBus register access inhibited
[    5.711693] at24 0-0050: supply vcc not found, using dummy regulator
[    5.765689] rc rc0: nonsensical timing event of duration 0
[    5.765692] rc rc0: two consecutive events of type space
[    5.895968] iwlwifi 0000:02:00.0: can't disable ASPM; OS doesn't have ASPM control
[    5.900275] iwlwifi 0000:02:00.0: firmware: failed to load iwlwifi-6000-4.ucode (-2)
[    5.900277] firmware_class: See https://wiki.debian.org/Firmware for information about missing firmware
[    5.900279] iwlwifi 0000:02:00.0: Direct firmware load for iwlwifi-6000-4.ucode failed with error -2
[    5.900284] iwlwifi 0000:02:00.0: iwlwifi-6000-4 is required
[    5.900286] iwlwifi 0000:02:00.0: check git://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
[    6.136626] kvm: VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL does not work properly. Using workaround
[    6.765931] r8169 0000:03:00.0: firmware: failed to load rtl_nic/rtl8168d-2.fw (-2)
[    6.766420] r8169 0000:03:00.0: Direct firmware load for rtl_nic/rtl8168d-2.fw failed with error -2
[    6.766425] r8169 0000:03:00.0: Unable to load firmware rtl_nic/rtl8168d-2.fw (-2)
[    7.019228] L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.
[    7.020441] kvm: KVM_SET_TSS_ADDR need to be called before entering vcpu
[   20.142355] systemd-journald[301]: File /var/log/journal/e60bce5c0cc141a5b1ca070182b03357/user-1000.journal corrupted or uncleanly shut down, renaming and replacing.
[   24.444130] nouveau 0000:01:00.0: firmware: failed to load nouveau/nva5_fuc084 (-2)
[   24.444136] nouveau 0000:01:00.0: Direct firmware load for nouveau/nva5_fuc084 failed with error -2
[   24.444149] nouveau 0000:01:00.0: firmware: failed to load nouveau/nva5_fuc084d (-2)
[   24.444151] nouveau 0000:01:00.0: Direct firmware load for nouveau/nva5_fuc084d failed with error -2
[   24.444154] nouveau 0000:01:00.0: msvld: unable to load firmware data
[   24.444157] nouveau 0000:01:00.0: msvld: init failed, -19
[   24.505830] CE: hpet5 increased min_delta_ns to 20115 nsec
[   25.499297] CE: hpet6 increased min_delta_ns to 20115 nsec
[   34.557568] hrtimer: interrupt took 14722 ns
[ 2760.299762] CE: hpet3 increased min_delta_ns to 20115 nsec
[ 2979.256577] CE: hpet increased min_delta_ns to 20115 nsec
[ 3050.545325] show_signal_msg: 19 callbacks suppressed
[ 3053.040108] CE: hpet7 increased min_delta_ns to 20115 nsec
[ 5509.560255] CE: hpet4 increased min_delta_ns to 20115 nsec

Are the free drivers generally stable? Here's my video card:

lspci -vnn | grep -A12 'VGA\|Display'
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GT216M [GeForce GT 230M] [10de:0a28] (rev a2) (prog-if 00 [VGA controller])
    DeviceName: NVIDIA Video Graphics Controller
    Subsystem: Hewlett-Packard Company GT216M [GeForce GT 230M] [103c:7001]
    Flags: bus master, fast devsel, latency 0, IRQ 33, IOMMU group 16
    Memory at d2000000 (32-bit, non-prefetchable) [size=16M]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Memory at d0000000 (64-bit, prefetchable) [size=32M]
    I/O ports at 6000 [size=128]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: <access denied>
    Kernel driver in use: nouveau
    Kernel modules: nouveau

I'm not great at determining the exact cause of these crashes. Any advice is appreciated.

Does this look like something I can repair? If so, how?

Update

In LinuxSecurityFreak's answer, he suggested forcing fsck repair on reboot. I found this in the boot log after doing that:

cat /var/log/boot.log

------------ Tue Dec 07 06:01:46 CST 2021 ------------
/dev/sda1: recovering journal
/dev/sda1: Clearing orphaned inode 1966113 (uid=1000, gid=1000, mode=0100600, size=3538944)
/dev/sda1: Clearing orphaned inode 1966101 (uid=1000, gid=1000, mode=0100600, size=9830400)
/dev/sda1: Clearing orphaned inode 2802080 (uid=0, gid=0, mode=0100644, size=71592)
/dev/sda1: Clearing orphaned inode 2802077 (uid=0, gid=0, mode=0100644, size=917632)
/dev/sda1: Clearing orphaned inode 2802076 (uid=0, gid=0, mode=0100644, size=191416)
/dev/sda1: Clearing orphaned inode 2802075 (uid=0, gid=0, mode=0100644, size=190368)
/dev/sda1: Clearing orphaned inode 2802073 (uid=0, gid=0, mode=0100644, size=34728)
/dev/sda1: Clearing orphaned inode 2802071 (uid=0, gid=0, mode=0100644, size=18352)
/dev/sda1: Clearing orphaned inode 2802069 (uid=0, gid=0, mode=0100644, size=18352)
/dev/sda1: Clearing orphaned inode 2802067 (uid=0, gid=0, mode=0100644, size=14256)
/dev/sda1: Clearing orphaned inode 2802065 (uid=0, gid=0, mode=0100644, size=14256)
/dev/sda1: Clearing orphaned inode 2802063 (uid=0, gid=0, mode=0100644, size=22448)
/dev/sda1: Clearing orphaned inode 2802061 (uid=0, gid=0, mode=0100644, size=14256)
/dev/sda1: Clearing orphaned inode 2802059 (uid=0, gid=0, mode=0100644, size=14328)
/dev/sda1: Clearing orphaned inode 2802057 (uid=0, gid=0, mode=0100644, size=14256)
/dev/sda1: clean, 255277/3055616 files, 2444203/12206848 blocks

------------ Tue Dec 07 07:08:18 CST 2021 ------------
e2fsck 1.46.2 (28-Feb-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure                                           
Pass 3: Checking directory connectivity                                        
Pass 4: Checking reference counts
Pass 5: Checking group summary information                                     
/dev/sda1: 255255/3055616 files (0.1% non-contiguous), 2446557/12206848 blocks 

Update 2

The BIOS is the latest I could find on HP's website (Hewlett-Packard v: F.25). There may be a later BIOS elsewhere, but I'm not sure if I can trust that source.

inxi -Fx

System:    Host: sidekick Kernel: 5.10.0-9-amd64 x86_64 bits: 64 compiler: gcc v: 10.2.1 
           Desktop: KDE Plasma 5.20.5 Distro: Debian GNU/Linux 11 (bullseye) 
Machine:   Type: Laptop System: Hewlett-Packard product: HP Pavilion dv8 Notebook PC 
           v: 0499220000241210001040000 serial: CNF02839BM 
           Mobo: Hewlett-Packard model: 7001 v: 35.35 serial: CNF02839BM BIOS: Hewlett-Packard v: F.25 
           date: 05/31/2010 
Battery:   ID-1: BAT0 charge: 0% condition: 93.1/365.8 Wh (25%) model: Hewlett-Packard Primary status: Unknown 
CPU:       Info: Quad Core model: Intel Core i7 Q 740 bits: 64 type: MT MCP arch: Nehalem rev: 5 L2 cache: 6 MiB 
           flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 27668 
           Speed: 931 MHz min/max: 933/1734 MHz boost: enabled Core speeds (MHz): 1: 931 2: 931 3: 931 4: 931 
           5: 931 6: 931 7: 931 8: 931 
Graphics:  Device-1: NVIDIA GT216M [GeForce GT 230M] vendor: Hewlett-Packard driver: nouveau v: kernel 
           bus ID: 01:00.0 
           Device-2: Quanta HP Webcam type: USB driver: uvcvideo bus ID: 2-1.5:5 
           Display: x11 server: X.Org 1.20.11 driver: loaded: modesetting unloaded: fbdev,vesa resolution: 
           1: 1920x1080~60Hz 2: 1920x1080~60Hz 
           OpenGL: renderer: NVA5 v: 3.3 Mesa 20.3.5 direct render: Yes 
Audio:     Device-1: Intel 5 Series/3400 Series High Definition Audio vendor: Hewlett-Packard 
           driver: snd_hda_intel v: kernel bus ID: 00:1b.0 
           Device-2: NVIDIA GT216 HDMI Audio vendor: Hewlett-Packard driver: snd_hda_intel v: kernel 
           bus ID: 01:00.1 
           Sound Server: ALSA v: k5.10.0-9-amd64 
Network:   Device-1: Intel Centrino Advanced-N 6200 driver: N/A port: 6000 bus ID: 02:00.0 
           Device-2: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet vendor: Hewlett-Packard 
           driver: r8169 v: kernel port: 4000 bus ID: 03:00.0 
           IF: enp3s0 state: up speed: 1000 Mbps duplex: full mac: c8:0a:a9:eb:14:ba 
Drives:    Local Storage: total: 1.38 TiB used: 853.2 GiB (60.6%) 
           ID-1: /dev/sda vendor: Samsung model: SSD 860 EVO 1TB size: 931.51 GiB 
           ID-2: /dev/sdb vendor: Samsung model: SSD 850 PRO 512GB size: 476.94 GiB 
Partition: ID-1: / size: 45.53 GiB used: 8.3 GiB (18.2%) fs: ext4 dev: /dev/sda1 
           ID-2: /home size: 869.04 GiB used: 755.96 GiB (87.0%) fs: ext4 dev: /dev/sda3 
Swap:      ID-1: swap-1 type: partition size: 976 MiB used: 0 KiB (0.0%) dev: /dev/sda5 
Sensors:   System Temperatures: cpu: 59.0 C mobo: N/A gpu: nouveau temp: 56.0 C 
           Fan Speeds (RPM): N/A 
Info:      Processes: 223 Uptime: 20m Memory: 7.76 GiB used: 1.9 GiB (24.5%) Init: systemd runlevel: 5 
           Compilers: gcc: N/A Packages: 2579 Shell: Bash v: 5.1.4 inxi: 3.3.01
Lonnie Best
  • 3
    This seems like a useful hint: “read-only file system” (from the first error message). Are any of your file systems read-only? (Not intentionally, I imagine, but as a result of a file system error.) – Stephen Kitt Dec 06 '21 at 16:42
  • @StephenKitt Thanks for helping. No, though. I mean, maybe they become read-only during the crash, but otherwise it is just a Samsung 1TB SSD, which is readable and writable during normal operation. – Lonnie Best Dec 06 '21 at 16:45
  • 4
    I think it's more likely the disk is becoming read only due to a filesystem issue, and that's causing the crash – roaima Dec 06 '21 at 16:46
  • @roaima Maybe. If so, this will be the first Samsung SSD that's gone bad on me. I've installed at least 50 of them at work. None of those have gone bad, and this one has way less miles on it than them. Maybe there's a command I can use to test the health of it. – Lonnie Best Dec 06 '21 at 16:52
  • 1
    doesn't have to be a hardware fault. bit errors on transport happen, your RAM has an error probability too, and to top it off, the linux kernel does have bugs. – Marcus Müller Dec 06 '21 at 16:53
  • One of those log entries says my BIOS is broken! I checked, and it is latest BIOS version already. I wonder if I should overwrite it with the same version again. Maybe it won't seem broken after that? – Lonnie Best Dec 06 '21 at 16:56
  • Think I should just try reinstalling? – Lonnie Best Dec 06 '21 at 16:57
  • 2
    stop worrying about your hardware. You won't get a better bios than you have right now (if there was anything different about the firmware image than what you've flashed, loading it would have failed and your computer wouldn't have booted) – Marcus Müller Dec 06 '21 at 17:09
  • 2
    so what happens here is that journalctl (your system logs) can't write to the journal, which is pretty serious, but mostly, as others said, indicative of a damaged file system. Whether or not that happened due to malfunctioning hardware is at this point not relevant. You need to fix that file system; on what file system is /var/log? – Marcus Müller Dec 06 '21 at 17:12
  • 1
    That's on the root filesystem (/dev/sda1), which is a [50GiB](http://www.lonniebest.com/DataUnitConverter/#50GiB) partition. /dev/sda3 is /home, and I recall the suggested swap /dev/sda5 was only 1GiB (which is smaller than I usually make it). @MarcusMüller Oh, and you're right, I do believe it was systemd giving those crash errors. – Lonnie Best Dec 06 '21 at 17:17
  • 1
    what *type* of file system is that? ext4, btrfs, XFS, ZFS…? – Marcus Müller Dec 06 '21 at 17:28
  • 1
    I didn’t mention a hardware error; any number of file system errors can result in this. Hardware may be at fault but it doesn’t have to be (although repeated errors of this type do suggest that). – Stephen Kitt Dec 06 '21 at 18:13
  • @MarcusMüller ext4 for everything except swap – Lonnie Best Dec 06 '21 at 19:35
  • @MarcusMüller : The crash just happened again. I was able to write down more details about the error messages. I've updated in my question above with a more detailed error. Maybe I can repair the specified inode . . . not sure yet (researching). – Lonnie Best Dec 07 '21 at 12:12

1 Answer


Initial attempt

I suggest doing this first: follow "What should I do to force the root filesystem check (and optionally a fix) at boot?", or use an almost identical solution via GRUB:

Add this to the GRUB_CMDLINE_LINUX_DEFAULT line in your /etc/default/grub:

fsck.mode=force fsck.repair=yes

Then run update-grub and reboot to have your ext4 file system checked and repaired at boot time.

If this does not help, please report back.
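
The GRUB edit described above could be scripted roughly as follows. This is only a minimal sketch: it assumes GNU sed and a GRUB_CMDLINE_LINUX_DEFAULT line whose value is enclosed in double quotes, and the function name add_fsck_params is made up for illustration. Back up the file before touching it.

```shell
#!/bin/sh
# Hypothetical helper: append the fsck parameters to the
# GRUB_CMDLINE_LINUX_DEFAULT line of the given file
# (default: /etc/default/grub).
# Assumes GNU sed and a double-quoted value on that line.
add_fsck_params() {
    file="${1:-/etc/default/grub}"
    # Do nothing if the parameters are already present.
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*fsck\.mode=force' "$file"; then
        return 0
    fi
    # Insert the parameters just before the closing quote.
    sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 fsck.mode=force fsck.repair=yes"/' "$file"
}

# Typical use, as root:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_fsck_params
#   update-grub
#   reboot
```

While these parameters stay on the kernel command line, the check is forced at every boot, so you may want to remove them again and rerun update-grub once the file system comes up clean.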


Further elaboration

Since your BIOS is up to date, we can do nothing about the BIOS bugs mentioned in your dmesg. However, they may be just notices rather than serious bugs, so I personally would be OK with those.

Then there is smartmontools (if I spelled it correctly), which you can run as smartctl -a /dev/sda; just substitute your own device. I use it often in my lab, but I am out of the office right now. Take care to specify your boot device (SATA disk, NVMe drive, memory stick, whatever) correctly.
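
For reference, here are a few smartctl invocations that tend to be useful in this situation, plus a small helper for pulling one attribute out of the attribute table. Treat it as a sketch: smart_attr_raw is a hypothetical function name, it assumes the usual ten-column layout printed by smartctl -A, and Reallocated_Sector_Ct is just one example attribute, since the set of attributes a drive reports varies by vendor.

```shell
#!/bin/sh
# Hypothetical helper: read `smartctl -A` output on stdin and print the
# RAW_VALUE (10th column) of the attribute named in $1.
# Prints nothing if the drive does not report that attribute.
smart_attr_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Typical use, as root (replace /dev/sda with your device):
#   smartctl -H /dev/sda            # overall health self-assessment
#   smartctl -a /dev/sda            # full SMART report
#   smartctl -t short /dev/sda      # start a short self-test
#   smartctl -l selftest /dev/sda   # read the self-test log afterwards
#   smartctl -A /dev/sda | smart_attr_raw Reallocated_Sector_Ct
```

A clean SMART report does not rule out other causes, such as cabling, RAM, or a kernel bug; it only makes a failing drive less likely.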

Vlastimil Burián
  • 1
    @LonnieBest You may be a _victim_ of GDPR. I live in the EU, and HP stopped providing BIOS updates for _older_ machines. I'm not a lawyer, so I can only say that I know of some older machines that stopped receiving BIOS patches, even when they were actually available. I called them and they told me their whole locked-BIOS system against misuse was originally a good idea but is no longer compliant, so they cut pretty much everything once GDPR took effect and now provide updates only for newer laptops. Again, that's what they _told_ me, not necessarily true. Sorry friend. – Vlastimil Burián Dec 07 '21 at 14:17
  • 1
    @LonnieBest That is horrible behavior. Some companies do charge a lot, but this is just another overkill! I wish you the best anyway, going home in a bit more. – Vlastimil Burián Dec 07 '21 at 14:46
  • I was running solid as a rock, but then had two crashes late yesterday. During the crashes I was in a Jitsi meeting. That first crash wasn't like the others. I had a blank screen with no errors on it and had to reboot improperly. The 2nd crash happened about 15 minutes later, and it was just like the one I reported here. It may be unrelated, but I noticed that I was using an [AppImage of Ungoogled Chromium](https://ungoogled-software.github.io/ungoogled-chromium-binaries/releases/appimage/64bit/) during both crashes. – Lonnie Best Dec 09 '21 at 11:03
  • 1
    Did you already run [Memtest](https://www.memtest86.com/) on your machine to verify RAM? – Vlastimil Burián Dec 09 '21 at 11:17
  • No, I didn't know about that. I'm adding memtest86-usb.img to my [Ventoy](https://www.ventoy.net/) right now! – Lonnie Best Dec 09 '21 at 11:56
  • The memory tests completed successfully. I ended up installing a new hard drive. You won't believe it, but after this the system continued to crash about 5 times per day, but this time without any indication as to why in the logs. It was more of a freeze up than a crash, but it still required a hard reboot. You won't believe what finally made my problems go away. After having 5 crashes in one day, I installed XFCE onto the same OS that was crashing. Now it is days later, and I haven't had one crash. There is something unstable in KDE (for my hardware, at least). – Lonnie Best Dec 16 '21 at 11:02
  • XFCE isn't too fancy, but I'll take a plain desktop that's stable over a fancy one that isn't. – Lonnie Best Dec 16 '21 at 11:14
  • I have a similar problem on Mint 19.3. When I boot from USB and run all kinds of SSD/NVMe checks, it's all good, so I think the problem was introduced by some Linux updates. I also fear that maybe my NVMe is dying somehow, but again, from the USB live version there is never a problem. – Pawel Cioch Sep 06 '22 at 15:14
  • Having this same issue w/ Ubuntu 22 running on a new XPS 13 9315. The crashes are completely random, could be twice a day, could be once a week. All DELL diagnostics, btrfs-checks, etc come up clean. Based on this advice, I've just switched to XFCE w/ lightdm and will see how it goes. Maybe I'll try a non-Debian based Linux rocking TWM for my desktop if this continues :/ – James Adam May 02 '23 at 02:44