I have a small server running Ubuntu 10.04, which I administer from another computer via ssh, and I tried to use NFS on it to share files. That mostly works, until one of the clients unmounts and I want to shut down nfs-kernel-server on the server. The stop itself seems to go properly:

$ sudo service nfs-kernel-server stop
 * Stopping NFS kernel daemon                                                     [ OK ] 
 * Unexporting directories for NFS kernel daemon...                               [ OK ] 

... I do get something like this in the log:

Feb  5 11:50:17 user init: statd main process (3806) killed by KILL signal
Feb  5 11:50:17 user init: statd main process ended, respawning
Feb  5 11:50:17 user init: idmapd main process (3808) killed by KILL signal
Feb  5 11:50:17 user init: idmapd main process ended, respawning
Feb  5 11:50:17 user statd-pre-start: local-filesystems started
Feb  5 11:50:17 user sm-notify[3815]: Already notifying clients; Exiting!
Feb  5 11:50:17 user rpc.statd[3830]: Version 1.1.6 Starting
Feb  5 11:50:17 user rpc.statd[3830]: Flags: 

... meaning that some NFS-related processes ignored the stop request and respawned. If at this point I try to do sudo service nfs-kernel-server start (again via ssh), that command freezes, and in /var/log/syslog I get this:

Feb  5 11:43:55 user mountd[2045]: authenticated mount request from 192.168.0.2:1005 for /media/disk (/media/disk)
Feb  5 11:45:19 user mountd[2045]: Caught signal 15, un-registering and exiting.
Feb  5 11:45:19 user kernel: [27428.148368] nfsd: last server has exited, flushing export cache
Feb  5 11:45:19 user kernel: [27428.148431] BUG: Dentry d0bc8b28{i=1f6,n=} still in use (1) [unmount of vfat sdd8]
Feb  5 11:45:19 user kernel: [27428.148473] ------------[ cut here ]------------
Feb  5 11:45:19 user kernel: [27428.148481] kernel BUG at /build/buildd/linux-2.6.32/fs/dcache.c:670!
Feb  5 11:45:19 user kernel: [27428.148491] invalid opcode: 0000 [#1] SMP 
Feb  5 11:45:19 user kernel: [27428.148501] last sysfs file: /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq
...
Feb  5 11:45:19 user kernel: [27428.148807] Call Trace:
Feb  5 11:45:19 user kernel: [27428.148824]  [<c024c780>] ? vfs_quota_off+0x0/0x20
Feb  5 11:45:19 user kernel: [27428.148838]  [<c021d4fc>] ? shrink_dcache_for_umount+0x3c/0x50
Feb  5 11:45:19 user kernel: [27428.148852]  [<c020d090>] ? generic_shutdown_super+0x20/0xe0
...
Feb  5 11:45:19 user kernel: [27428.149511] EIP: [<c021d4a9>] shrink_dcache_for_umount_subtree+0x249/0x260 SS:ESP 0068:ccc6de6c
Feb  5 11:45:19 user kernel: [27428.149631] ---[ end trace 6198103bb62887ac ]---
Feb  5 11:49:53 user init: idmapd main process (838) killed by TERM signal
Feb  5 11:49:53 user init: idmapd main process ended, respawning
Feb  5 11:49:53 user rpc.statd[769]: Caught signal 15, un-registering and exiting.
Feb  5 11:49:53 user init: statd main process ended, respawning
Feb  5 11:49:53 user statd-pre-start: local-filesystems started
Feb  5 11:49:53 user sm-notify[3790]: Already notifying clients; Exiting!
Feb  5 11:49:53 user rpc.statd[3806]: Version 1.1.6 Starting
Feb  5 11:49:53 user rpc.statd[3806]: Flags: 
...

Now, the thing is this - after this bug happens, the server's ssh server is (for some reason) usually still "live", so I can log in via ssh again, and try to close processes (and realize it is impossible to kill /usr/sbin/rpc.nfsd 8, which is the one hanging).

BUT - if at this point, I try to issue a reboot via sudo shutdown -r now && exit from ssh, then this server PC will start the reboot process - but will not complete it; it will drop to a terminal, dump some error messages, and stay there :(

The problem is - the server PC is in a really difficult to access location, and having to go there to do Alt+SysRq + REISUB to properly reboot (if the kernel reacts to that key combo; else it's hard powerdown) is really difficult.

So my question is - is there some "hardcore reboot" command in Linux that will more or less "guarantee" that the PC will reboot (and not just hang/freeze), even if it has encountered a kernel bug - and which I could issue via ssh? Something that would be the equivalent of a hard powerdown (i.e. turning off the power by e.g. holding the power button for 10+ seconds) and hard powerup?

sdaau

2 Answers

16

To ensure that the system will reboot no matter what, I always do this sequence:

# echo s > /proc/sysrq-trigger
# echo u > /proc/sysrq-trigger
# echo s > /proc/sysrq-trigger
# echo b > /proc/sysrq-trigger

This asks the kernel to:

  • do an emergency sync of the block devices
  • remount all filesystems read-only
  • sync again
  • force an immediate reboot (use o instead of b to power off).

See e.g. here for explanation of this feature.
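The sequence above can be wrapped in a small function. This is only a sketch: the function name `sysrq_sequence`, the trigger-path argument and the delay argument are assumptions added for illustration, so that the sequence can be rehearsed against an ordinary file before it is pointed at the real `/proc/sysrq-trigger`. The pause between keys is there because the emergency sync is only requested, not waited for.

```shell
#!/bin/sh
# sysrq_sequence TRIGGER DELAY KEY...
# Writes each KEY to TRIGGER in order, pausing DELAY seconds after each write.
# TRIGGER and DELAY are hypothetical parameters of this sketch, not part of
# the kernel interface; the kernel only sees the single characters written.
sysrq_sequence() {
    trigger=$1
    delay=$2
    shift 2
    for key in "$@"; do
        printf '%s\n' "$key" > "$trigger" || return 1
        echo "sent sysrq '$key'"
        sleep "$delay"
    done
}

# Real use, as root (the machine reboots as soon as 'b' is written):
#   sysrq_sequence /proc/sysrq-trigger 2 s u s b
```

Nothing is executed when the file is sourced; the reboot only happens if the function is called with the real trigger path.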

wurtel
  • Thanks for that, @wurtel - I tried from my `ssh` shell: `sudo bash -c "( echo a ; sleep 5; echo s > /proc/sysrq-trigger ; echo u > /proc/sysrq-trigger ; echo s > /proc/sysrq-trigger ; echo b > /proc/sysrq-trigger )" & exit`; however, this doesn't seem to force the server into reboot... Is there a specific way I should format these commands, if I want to run them while I'm connected through a remote `ssh` shell? – sdaau Feb 05 '15 at 12:10
  • @sdaau it's possible you first need to enable sysrq. Begin the sequence with `echo 1 > /proc/sys/kernel/sysrq` – Gert Feb 05 '15 at 12:25
  • Thanks, @Gert; I checked with `cat /proc/sys/kernel/sysrq` and it was already `1`, so that was already active, I guess... I think this has something to do with the fact that I'm running this via `ssh` remote shell, but cannot tell what – sdaau Feb 05 '15 at 12:28
  • Some distros have kernels where magic sysrq is not configured, perhaps that is your problem :-( I have encountered this myself, but I didn't check what the value of `/proc/sys/kernel/sysrq` was, so I can't confirm that this is your problem in this case. – wurtel Feb 05 '15 at 14:20
  • I can confirm that this works: I had a Raspberry Pi which had gone bad to the point where `reboot` processes would get suspended in the D state. `echo b > /proc/sysrq-trigger` did the trick. – Dmitry Grigoryev Aug 16 '18 at 11:02
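
On the `/proc/sys/kernel/sysrq` value discussed in the comments above: it is a bitmask, where `0` disables sysrq, `1` enables every function, and larger values enable individual groups (16 for sync, 32 for remount read-only, 128 for reboot/poweroff). According to the kernel's sysrq documentation, this value only gates the keyboard combination; writes to `/proc/sysrq-trigger` by root are accepted regardless. A minimal sketch for decoding the value, where the helper name `sysrq_allows` is hypothetical:

```shell
#!/bin/sh
# sysrq_allows VALUE BIT
# Succeeds if the kernel.sysrq setting VALUE permits the function group BIT
# when invoked from the keyboard.
sysrq_allows() {
    value=$1
    bit=$2
    [ "$value" -eq 1 ] && return 0     # 1 means "all functions enabled"
    [ $(( value & bit )) -ne 0 ]       # otherwise test the individual bit
}

# Example: is the reboot key (group 128) enabled on this machine?
#   sysrq_allows "$(cat /proc/sys/kernel/sysrq)" 128 && echo "reboot key enabled"
```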
1

You have to bypass the normal shutdown process, which unmounts filesystems, stops daemons and so on. That is where the reboot gets stuck: it cannot safely stop the processes. What you need is reboot -f or poweroff -f, depending on what you want to achieve (some init systems, systemd for instance, may bring their own commands). The "force" option skips the regular shutdown sequence and goes directly to the hardware reboot.
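
When the forced reboot is issued over ssh, it helps to detach the command from the session first, so that logging out does not take it down. A minimal sketch, assuming a hypothetical helper named `run_detached`; the 5-second delay in the usage line is an arbitrary value for illustration. If `reboot -f` itself blocks inside the kernel, detaching it will not help.

```shell
#!/bin/sh
# run_detached CMD [ARGS...]
# Starts CMD in the background, immune to SIGHUP and with all standard
# streams redirected, so the ssh session can close without waiting for it.
run_detached() {
    nohup "$@" >/dev/null 2>&1 </dev/null &
}

# Usage, as root, followed by logging out:
#   run_detached sh -c 'sleep 5; reboot -f'
#   exit
```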

orion
  • Many thanks for that, @orion - I tried from `ssh` in the controlling terminal this: `sudo bash -c "( echo a; sleep 5; reboot -f )" & exit` but nothing happens on the server PC... I guess if I issue it manually on the server PC itself it would work, but here I need to issue it from a remote ssh shell, and immediately exit from the shell before the reboot takes place (and the [`at` command doesn't recognize seconds.](http://stackoverflow.com/questions/13905886/run-command-at-5-seconds-from-now)). Cheers! – sdaau Feb 05 '15 at 12:06
  • To avoid killing the process prematurely, either background it, call `disown` to detach it, and then log out, or prefix the line with `nohup`. I prefer the first option. – orion Feb 05 '15 at 12:08
  • Just a note: I just tested several times with `sudo reboot -f` from ssh (both typed manually and in a script), and realized late that actually there are several instances of `reboot -f` shown in `ps axf` with status `D` which is `uninterruptible sleep (usually IO)`; probably the kernel bug freezes the reboot commands as well... – sdaau Feb 05 '15 at 14:23
  • Oh dear. If `reboot` itself gets stuck in kernel mode (D status means something went wrong in syscall) then I don't see much hope - raw sysrq has a chance, but I'm not sure. And connecting through `ssh` has nothing to do with anything (it doesn't make a difference) – orion Feb 05 '15 at 16:02