How can I make a Linux-based device reboot once its rootfs becomes unavailable?

Only a software watchdog is available.

The problem is that the rootfs is mounted over NFS. When I stop the NFS server, the device hangs, but I want it to reboot instead. How can I achieve this?

In other words: when the rootfs becomes faulty, is there anything at the kernel level that can reset the whole system? I don't care about open or corrupted files and resources.

Note: I don't have the kernel sources for this architecture. The device is headless: no monitor or keyboard is attached. There is a root console with agetty (defined in /etc/inittab).

Daniel
  • The software watchdog is inside the kernel so it should be possible to trigger it just as normal. Loop: read contents of rootfs. If no error then reset watchdog timer. This will catch unblocked read errors and blocked reads – roaima Oct 08 '20 at 07:19
  • Where shall I put this loop? Into a bash script? – Daniel Oct 08 '20 at 10:30
  • The problem is that once the rootfs goes down, even the root console (via `agetty`) seems blocked. – Daniel Oct 08 '20 at 10:33
  • Related - [linux watchdog and systemd watchdog](https://unix.stackexchange.com/a/308677/100397) – roaima Oct 08 '20 at 11:10
  • There is no hardware watchdog available. Even though `/dev/watchdog` exists, it can be a software-based implementation. I'm using it, but once the rootfs goes down, all processes seem to get blocked. – Daniel Oct 08 '20 at 12:07
  • Even though `/dev/watchdog` is software only, it should be managed inside the kernel and therefore outside the effect of a blocked rootfs. – roaima Oct 08 '20 at 12:42
  • Okay, but believe me, it is not rebooting the device. What can I check, if even agetty is blocked? – Daniel Oct 09 '20 at 12:02
  • If I get time this weekend I'll see if I can create a similar configuration. It probably won't be a rootfs NFS mount, but maybe we can proof-of-concept with a different one – roaima Oct 09 '20 at 12:34
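
The loop roaima describes in the comments could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a tested fix for an NFS root: `WATCHDOG_DEV`, `PROBE_FILE`, `INTERVAL` and the function names are hypothetical, and it assumes that opening a file on the dead mount either fails or blocks. In both cases the watchdog is no longer fed, so its timer should expire. `INTERVAL` has to be shorter than the watchdog timeout.

```shell
#!/bin/sh
# Sketch only: feed the watchdog while a file on the rootfs can still be opened.
# WATCHDOG_DEV, PROBE_FILE and INTERVAL are hypothetical, overridable settings.
WATCHDOG_DEV=${WATCHDOG_DEV:-/dev/watchdog}
PROBE_FILE=${PROBE_FILE:-/etc/hostname}
INTERVAL=${INTERVAL:-5}

# Succeeds if PROBE_FILE can be opened for reading. The subshell keeps a
# failed redirection from terminating the calling shell.
probe_rootfs() {
    ( : < "$PROBE_FILE" ) 2>/dev/null
}

# Writes one keepalive byte to file descriptor 3 if the probe succeeds.
feed_once() {
    probe_rootfs && printf '.' >&3
}

main() {
    # Opening the device starts the timer; fd 3 stays open for the whole run.
    exec 3> "$WATCHDOG_DEV" || exit 1
    while feed_once; do
        sleep "$INTERVAL"
    done
    # Probe failed: keep fd 3 open but stop feeding so the timer can expire.
    while :; do sleep "$INTERVAL"; done
}

# Only start the loop when called as: script run
if [ "${1:-}" = run ]; then main; fi
```

Note that `sleep` is usually an external binary loaded from the rootfs. If it cannot be loaded, the loop blocks, which also stops the feeding.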

2 Answers

You didn't state if you have a physical keyboard attached, but if you do, then the "Magic SysRq Keys" might help. In your case

  • Alt+SysRq+S for emergency sync-to-disk, and
  • Alt+SysRq+B for immediate reboot

should do the job. Note that this only works if these key combinations are not disabled; see the setting in /proc/sys/kernel/sysrq, which is an ORed bitmask of allowed SysRq actions (reproduced from here):

  2 =   0x2 - enable control of console logging level
  4 =   0x4 - enable control of keyboard (SAK, unraw)
  8 =   0x8 - enable debugging dumps of processes etc.
 16 =  0x10 - enable sync command
 32 =  0x20 - enable remount read-only
 64 =  0x40 - enable signalling of processes (term, kill, oom-kill)
128 =  0x80 - allow reboot/poweroff
256 = 0x100 - allow nicing of all RT tasks
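
To check whether a given setting permits a particular action, the bitmask can be tested with shell arithmetic. The helper below is a small sketch; the name `sysrq_allows` is hypothetical. It also handles the two special values that are not part of the bitmask: 0 disables SysRq completely and 1 enables all functions.

```shell
# sysrq_allows MASK BIT
# Succeeds if a /proc/sys/kernel/sysrq value of MASK permits the action BIT.
sysrq_allows() {
    # A value of 1 enables every SysRq function.
    [ "$1" -eq 1 ] && return 0
    # Otherwise the action is allowed only if its bit is set in the mask.
    [ $(( $1 & $2 )) -ne 0 ]
}

# Example: is reboot (128) allowed by the current setting?
# sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 128 && echo "reboot allowed"
```

For instance, a value of 176 (128 + 32 + 16) allows reboot, remount read-only and sync, but not signalling of processes (64).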

You can also trigger this from a shell script/program by writing to /proc/sysrq-trigger:

echo "b" > /proc/sysrq-trigger

This works regardless of the value in /proc/sys/kernel/sysrq, which only restricts keyboard-triggered SysRq events. Writing to the file requires root privileges.
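
A script could combine this with a check of the rootfs. The following is only a sketch under assumptions: `SYSRQ_TRIGGER`, `PROBE_FILE`, `TIMEOUT` and the function names are hypothetical, and it assumes that opening a file on the dead mount either fails or hangs. The open runs in a background subshell so that a hanging open does not stall the caller.

```shell
#!/bin/sh
# Sketch only. SYSRQ_TRIGGER, PROBE_FILE and TIMEOUT are hypothetical settings.
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
PROBE_FILE=${PROBE_FILE:-/etc/hostname}
TIMEOUT=${TIMEOUT:-10}

# Opens PROBE_FILE in a background subshell. Fails if the open fails or
# has not finished after roughly TIMEOUT seconds.
probe_with_timeout() {
    ( : < "$PROBE_FILE" ) 2>/dev/null &
    probe_pid=$!
    waited=0
    while kill -0 "$probe_pid" 2>/dev/null; do
        if [ "$waited" -ge "$TIMEOUT" ]; then
            kill -KILL "$probe_pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # The subshell has finished; return its exit status.
    wait "$probe_pid"
}

# Requests an immediate reboot, the same as: echo b > /proc/sysrq-trigger
emergency_reboot() {
    printf 'b' > "$SYSRQ_TRIGGER"
}

# Reboots only when the probe fails or times out.
reboot_if_rootfs_gone() {
    probe_with_timeout || emergency_reboot
}
```

This does not remove the dependency on the rootfs: `sleep` is usually an external binary, so if it is no longer cached and cannot be loaded, the script itself may hang before it reaches the reboot.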

AdminBee
  • No, I have no keyboard nor monitor attached. Only a root console via `agetty`. – Daniel Oct 08 '20 at 10:31
  • But the problem is once rootfs goes down, even this root console seems blocked. – Daniel Oct 08 '20 at 10:32
  • So, in theory this could be good, but I would need something that can survive a broken rootfs. `/proc` may still be alive, but I would have to start a program that runs from memory and writes to this file once the rootfs becomes unavailable. Do you have any idea how I can achieve this? – Daniel Oct 08 '20 at 12:10
  • @Daniel Ah, that's tricky; I cannot come up with a quick solution to this unfortunately ... – AdminBee Oct 08 '20 at 12:19

It sounds like you need the mount option onerror=panic for your NFS root filesystem, but I'm not sure whether it works with NFS. You might also need to mount the NFS root filesystem with the NFS-specific mount option soft, so that requests time out and return an error instead of being retried forever.

Note: the soft mount option may cause file corruption and/or data loss, but in the comments you specifically said you don't care about that.

Worth a try, maybe?
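
Before experimenting, it may help to confirm which options the root mount currently uses. The sketch below reads them from /proc/mounts; `MOUNTS_FILE` and the function names are hypothetical, and it assumes `awk` is available on the device. For an NFS root, the options are typically passed on the kernel command line as part of the `nfsroot=` parameter, so changing them probably means changing the boot configuration.

```shell
#!/bin/sh
# Sketch only. MOUNTS_FILE is a hypothetical, overridable setting.
MOUNTS_FILE=${MOUNTS_FILE:-/proc/mounts}

# Prints the option string (4th field) of the last entry mounted on "/".
root_mount_options() {
    awk '$2 == "/" { opts = $4 } END { print opts }' "$MOUNTS_FILE"
}

# has_mount_option OPTION
# Succeeds if OPTION appears as a whole item in the comma-separated list.
has_mount_option() {
    case ",$(root_mount_options)," in
        *",$1,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
# has_mount_option soft && echo "rootfs is soft-mounted"
```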

telcoM