I have a setup of two bash scripts: one that starts a node server (start-server.sh), and one that runs the first script and reruns it whenever it gets terminated via SIGTERM (start.sh).

start.sh looks like this:

#!/bin/bash

trap './start-server.sh' TERM
./start-server.sh

Inside of start-server.sh, some environment variables are exported and afterwards a node server is started.

To restart said server, I have the following bash snippet:

kill -TERM "-$(ps -ax -o pgid,command | tr -s " " | grep -E "[[:digit:]]+[[:space:]]+/bin/bash ./start.sh" | xargs | cut -d " " -f 1)"

This sends SIGTERM to the whole process group that was started by start.sh, which terminates all child processes. The trap inside start.sh itself then restarts the start-server.sh script.
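
As an aside, the PGID lookup in the kill command above can probably be written with less text parsing. The following is a minimal sketch, not the snippet from the setup above: pgid_of and restart_group are hypothetical helper names, and it assumes a ps that accepts -o pgid= and -p (to my knowledge both the Linux procps ps and the macOS ps do).

```shell
#!/bin/bash
# Sketch: look up a process group ID without parsing the full ps listing.
# pgid_of and restart_group are hypothetical helper names.

# Print the process group ID of PID $1 with the column padding stripped.
pgid_of() {
  ps -o pgid= -p "$1" | tr -d '[:space:]'
}

# Send SIGTERM to the whole process group that PID $1 belongs to.
restart_group() {
  local pgid
  pgid=$(pgid_of "$1")
  [ -n "$pgid" ] || return 1   # no such process: nothing to signal
  kill -TERM -- "-$pgid"       # a negative argument addresses the group
}
```

It would be called with the PID of start.sh, for example restart_group "$(pgrep -f start.sh | head -n 1)", assuming pgrep is available and the pattern matches only the intended process.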

On my personal machine running Pop!_OS 22.04 LTS x86_64 and zsh 5.8.1 (x86_64-ubuntu-linux-gnu) this works flawlessly.

However, on my colleague's machine, running macOS and zsh 5.9, repeatedly running the above kill command stops working after exactly three times. How can that be? After three restarts, start.sh itself just stops when a fourth kill is issued. Even waiting a moderate amount of time between the kill executions does not change this.


edit:

We now discovered that the following works on both machines:

#!/bin/bash

START_SCRIPT_PATH="./start-server.sh"

handle_sigterm() {
  echo "Received SIGTERM, will restart"
}

handle_sigint() {
  echo "Received SIGINT, will exit"
  exit 1
}

trap 'handle_sigterm' SIGTERM
trap 'handle_sigint' SIGINT

while true
do
  bash "$START_SCRIPT_PATH"
  sleep 1
done

I would still be interested in understanding why the first solution does not work on both machines, i.e., what exactly the difference is between these two. Is it a timing issue, where the first solution can terminate when start-server.sh gets terminated, before the new run is started? That seems strange, as it consistently stopped working only after the third restart. We did try executing with bash start-server.sh instead of ./start-server.sh, as well as the other "signal styles", so the difference seems very likely to be related to the loop.
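
One piece of background that may matter for the timing question, without claiming it explains the macOS behavior: the bash manual states that when bash is waiting for a command to complete and receives a signal for which a trap is set, the trap is not executed until that command completes. The following minimal sketch illustrates this. The inner script, the LOG variable and on_term are stand-ins made up for the demo, not the real start.sh.

```shell
#!/bin/bash
# Sketch: show when bash runs a TERM trap relative to its foreground command.
# The inner script and the LOG variable are stand-ins, not the real start.sh.

log=$(mktemp)

LOG="$log" bash -c '
  on_term() { echo "trap ran after ${SECONDS}s" >> "$LOG"; }
  trap on_term TERM
  sleep 3                                    # stand-in for the server
  echo "sleep finished after ${SECONDS}s" >> "$LOG"
' &
pid=$!

sleep 1               # give the inner bash time to install its trap
kill -TERM "$pid"     # signal only the inner bash, not its sleep child
wait "$pid"           # returns once the inner bash has exited

result=$(cat "$log")
rm -f "$log"
printf '%s\n' "$result"
```

If the trap ran immediately, the first line would report about 1s; with the deferral described in the manual it should report about 3s. In the setup from the question the kill targets the whole process group, so the child is terminated at once and the trap in start.sh can follow right away.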

linusha
  • I tried to make (1) and (2) clearer. > (3) Maybe the difference is in the fact the first method allows `./start-server.sh` and its descendants to be started again before the previous instance exits, while the second method restarts `./start-server.sh` strictly after the previous instance exits. I was under the assumption that the signal would "bubble up" through the process group, i.e., `start.sh` as the "parent" (not sure what the exact term is here) would receive the signal last. Am I understanding your comment correctly that this assumption is false? – linusha Jul 17 '23 at 10:16
