I have a Supermicro X9DR3-F motherboard on which pins 1 and 2 of the JWD jumper are shorted and the watchdog function is enabled in UEFI.

This means that the system is reset after about 5 minutes unless something resets the hardware watchdog timer. I installed the watchdog daemon and configured it to use the iTCO_wdt driver:

$ cat /etc/default/watchdog 
# Start watchdog at boot time? 0 or 1
run_watchdog=1
# Start wd_keepalive after stopping watchdog? 0 or 1
run_wd_keepalive=1
# Load module before starting watchdog
watchdog_module="iTCO_wdt"
# Specify additional watchdog options here (see manpage).
$ 

When the watchdog daemon starts, the driver loads without issues:

$ sudo dmesg | grep iTCO_wdt
[   17.435620] iTCO_wdt: Intel TCO WatchDog Timer Driver v1.11
[   17.435667] iTCO_wdt: Found a Patsburg TCO device (Version=2, TCOBASE=0x0460)
[   17.435761] iTCO_wdt: initialized. heartbeat=30 sec (nowayout=0)
$ 

Also, the /dev/watchdog file is present:

$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec  8 22:36 /dev/watchdog
$ 

The watchdog-device option in the watchdog daemon configuration points to this file:

$ grep -v ^# /etc/watchdog.conf 



watchdog-device    = /dev/watchdog
watchdog-timeout   = 60


interval           = 5
log-dir            = /var/log/watchdog
verbose            = yes
realtime           = yes
priority           = 1

heartbeat-file     = /var/log/watchdog/heartbeat
heartbeat-stamps   = 1000
$ 
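As a quick sanity check on these settings: the daemon has to write to the device more often than the timeout it requests from the driver. The sketch below reads a watchdog.conf-style file and prints the slack between `watchdog-timeout` and `interval`. The function name `watchdog_margin` is made up for illustration, it parses only plain `key = value` lines, and it knows nothing about the timeout the BIOS watchdog itself uses.

```shell
#!/bin/sh
# watchdog_margin FILE
# Hypothetical helper: report how many seconds of slack remain between the
# write interval and the timeout requested from the driver.
watchdog_margin() {
    awk -F= '
        /^[ \t]*#/ { next }                       # ignore comment lines
        {
            key = $1; val = $2
            gsub(/[ \t]/, "", key)                # strip padding around key
            gsub(/[ \t]/, "", val)                # ... and around value
        }
        key == "interval"         { interval = val }
        key == "watchdog-timeout" { timeout = val }
        END {
            if (interval == "" || timeout == "") {
                print "interval or watchdog-timeout not set"
                exit 1
            }
            printf "interval=%d timeout=%d margin=%d\n",
                   interval, timeout, timeout - interval
        }
    ' "$1"
}
```

Called as `watchdog_margin /etc/watchdog.conf`, it should report a margin of 55 seconds for the configuration shown above (60 minus 5).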

To debug the writes to the watchdog device, I enabled the heartbeat-file option, and it looks like the keepalive messages are being sent to /dev/watchdog:

$ tail /var/log/watchdog/heartbeat
 1575830728
 1575830728
 1575830728
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
 1575830733
$ 
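Reading the timestamps by eye gets tedious on a long file. The following is a minimal sketch that scans a heartbeat file and prints any pair of consecutive timestamps that are further apart than a given number of seconds. The function name `heartbeat_gaps` is hypothetical, and it assumes one epoch timestamp per line, as in the output above.

```shell
#!/bin/sh
# heartbeat_gaps FILE [MAX_GAP]
# Print every pair of consecutive timestamps whose difference exceeds
# MAX_GAP seconds (default 5, matching "interval = 5" in /etc/watchdog.conf).
heartbeat_gaps() {
    awk -v max="${2:-5}" '
        NF == 0 { next }                          # skip blank lines
        prev != "" && $1 - prev > max {
            printf "gap of %d s between %d and %d\n", $1 - prev, prev, $1
        }
        { prev = $1 }                             # remember last timestamp
    ' "$1"
}
```

For example, `heartbeat_gaps /var/log/watchdog/heartbeat 5` prints nothing when no two consecutive keepalives are more than 5 seconds apart.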

However, despite this, the server resets itself at roughly five-minute intervals.

My next thought was that the iTCO_wdt driver might control the watchdog in the C606 chipset, while the watchdog that resets the server is part of IPMI. So I made sure that the iTCO_wdt driver was not loaded at boot and rebooted the server. Sure enough, /dev/watchdog was no longer present. I then loaded the ipmi_watchdog module:

$ ls -l /dev/watchdog
ls: cannot access '/dev/watchdog': No such file or directory
$ sudo modprobe ipmi_watchdog
$ sudo dmesg -T | tail -1
[Tue Dec 10 21:12:48 2019] IPMI Watchdog: driver initialized
$ ls -l /dev/watchdog
crw------- 1 root root 10, 130 Dec 10 21:12 /dev/watchdog
$ 

Finally, I started the watchdog daemon, which, judging by the /var/log/watchdog/heartbeat file, writes to /dev/watchdog at 5 s intervals. This can also be confirmed with strace:

$ ps -p 2296 -f
UID        PID  PPID  C STIME TTY          TIME CMD
root      2296     1  0 01:28 ?        00:00:00 /usr/sbin/watchdog
$ sudo strace -y -p 2296
strace: Process 2296 attached
restart_syscall(<... resuming interrupted nanosleep ...>) = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, NULL)                 = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, NULL)                 = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
open("/proc/uptime", O_RDONLY)          = 2</proc/uptime>
close(2</proc/uptime>)                  = 0
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
write(1</dev/watchdog>, "\0", 1)        = 1
nanosleep({5, 0}, ^Cstrace: Process 2296 detached
 <detached ...>
$

The watchdog daemon above (PID 2296) was started with the heartbeat-file option in /etc/watchdog.conf commented out, to reduce the number of write system calls in the strace output.
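To check a longer trace for failed keepalives without reading every line, the output can be saved with strace's `-o` option and summarised. This is only a sketch: the function name `keepalive_summary` is made up, and it assumes the trace was taken with `-y` and without `-f`, so each line starts with `write(N</dev/watchdog>` and ends with the return value, as in the output above.

```shell
#!/bin/sh
# keepalive_summary TRACE_FILE
# Count write() calls to /dev/watchdog in a saved strace log and how many
# of them did not return 1 (one byte written).
keepalive_summary() {
    awk '
        /^write\([0-9]+<\/dev\/watchdog>/ {
            total++
            if ($NF != 1) failed++                # last field is the result
        }
        END { printf "writes=%d failed=%d\n", total, failed }
    ' "$1"
}
```

A trace in which every keepalive succeeds gives `failed=0`, which matches what the manual inspection above shows.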

However, the server still reboots at roughly 300 s intervals.

Why isn't the watchdog daemon able to reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard?

Martin
  • From another Supermicro FAQ: disable the watchdog function in BIOS, OS and BMC first. BIOS: Advanced >> Power Configuration >> Watch Dog function >> Disabled. OS: `echo 0 >> /proc/sys/kernel/nmi_watchdog`. IPMI: smcipmitool <30> <97> Data1: enable/disable, 0-disable, 1-enable; Data2: expire time, counted in minutes. Then use only one of NMI or IPMI. – ThatOneDude Dec 13 '19 at 22:09
  • @ssnobody Thanks! I also read that Supermicro FAQ ID `26478`, but isn't this essentially asking one to disable that BIOS watchdog function altogether? If yes, then shouldn't there be a way to reset that BIOS watchdog from the OS? Or how should one use BIOS watchdog if there is no way to reset its timer from the OS? – Martin Dec 17 '19 at 13:26
  • Does strace show the watchdog daemon writing to /dev/watchdog? Are there any errors associated with the write? – icarus Dec 17 '19 at 16:21
  • @icarus Good idea! `strace` shows that the `watchdog` daemon writes to `/dev/watchdog` at 5 s intervals, and there seem to be no errors. I added the output of `sudo strace -y -p ` to my initial post. – Martin Dec 17 '19 at 23:46
  • @Martin Yes, it's asking to disable the watchdog in BIOS, but apparently that doesn't stop it from using the NMI or IPMI watchdog functionality. I would agree that if they provide a "watchdog", there should probably be some way to reset it from the OS, or they should rename it "Rebooter" instead of "watchdog". The post over at https://serverfault.com/questions/695650/supermicro-bmc-watchdog-caused-reboots has another suggestion for you, though, if you want to keep the BIOS watchdog enabled and use it. – ThatOneDude Dec 17 '19 at 23:54
  • For reference, the suggestion is "just leave watchdog jumper (JWD1) open with neither NMI nor hard-reset selected. Watchdog is enabled in BIOS settings" – ThatOneDude Dec 18 '19 at 00:03
  • @ssnobody When I leave pins 1 and 2 of the `JWD` jumper on my Supermicro X9DR3-F motherboard open, with `Watch Dog Function` in BIOS `Enabled` and the `watchdog` daemon in the OS **not** running, the motherboard does not reset. Basically, leaving the `JWD` jumper open has the same effect as setting `Watch Dog Function` in BIOS to `Disabled`. – Martin Dec 18 '19 at 21:23
  • In addition, as user [Terry Kennedy writes in servethehome.com forum](https://forums.servethehome.com/index.php?threads/supermicro-x9scm-f-powers-off-every-5-minutes.10164/), the BIOS seems to be able to reset the BIOS watchdog. In other words, the system is not reset if `Watch Dog Function` in BIOS is `Enabled` and one stays in the BIOS. Or maybe the BIOS `Watch Dog Function` is temporarily deactivated as long as one stays in BIOS. – Martin Dec 19 '19 at 23:59

2 Answers

The watchdog daemon could not reset the hardware watchdog timer on the Supermicro X9DR3-F motherboard because the watchdog function in UEFI controls a third watchdog, which sits on the Winbond Super I/O 83527 chip. In other words, iTCO_wdt and ipmi_watchdog were the wrong drivers for that watchdog chip.
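Because a board can carry more than one hardware watchdog, it helps to see which device the kernel actually registered behind /dev/watchdog. The following is a sketch, assuming a kernel built with watchdog sysfs support (`CONFIG_WATCHDOG_SYSFS`), which exposes `identity` and `timeout` attributes under `/sys/class/watchdog`. Drivers that do not register through the kernel's watchdog core may not show up there. The function name `list_watchdogs` is made up for illustration.

```shell
#!/bin/sh
# list_watchdogs [SYSFS_DIR]
# Print the identity and timeout of each registered watchdog device.
# SYSFS_DIR defaults to /sys/class/watchdog.
list_watchdogs() {
    dir="${1:-/sys/class/watchdog}"
    for dev in "$dir"/watchdog*; do
        [ -d "$dev" ] || continue                 # glob matched nothing
        id=unknown
        timeout=unknown
        [ -r "$dev/identity" ] && id=$(cat "$dev/identity")
        [ -r "$dev/timeout" ] && timeout=$(cat "$dev/timeout")
        printf '%s: identity=%s timeout=%s\n' "${dev##*/}" "$id" "$timeout"
    done
}
```

Running `list_watchdogs` with each candidate driver loaded shows which identity string the kernel reports, which makes it easier to notice when the loaded driver does not belong to the chip that is resetting the machine.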

Martin

On an A2SDi-4C-HLN4F, I had to use bmc_watchdog (from freeipmi) to get it to work.

RafD
  • Welcome to the site, and thank you for your contribution. Would you mind adding some more explanation on how to start/configure/use that watchdog? – AdminBee Jul 01 '21 at 13:02
  • [Brevity is acceptable, but fuller explanations are better](https://unix.stackexchange.com/help/how-to-answer). – Kusalananda Jul 12 '21 at 14:47