I have a setup of two bash scripts: start-server.sh, which starts a Node server, and start.sh, which runs start-server.sh and reruns it whenever it is terminated via SIGTERM.
start.sh looks like this:
```bash
#!/bin/bash
trap './start-server.sh' TERM
./start-server.sh
```
Inside start-server.sh, some environment variables are exported and then a Node server is started.
To restart the server, I use the following snippet:
```bash
kill -TERM "-$(ps -ax -o pgid,command | tr -s " " | grep -E "[[:digit:]]+[[:space:]]+/bin/bash ./start.sh" | xargs | cut -d " " -f 1)"
```
It sends SIGTERM to the whole process group started by start.sh: all child processes terminate, and the trap inside start.sh then restarts start-server.sh.
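A less brittle way to find the process group avoids parsing the column layout of `ps -ax`: `pgrep -f` matches against the full command line, and `ps -o pgid= -p PID` prints only the group ID with no header. Both are available on Linux (procps) and macOS. This is only a sketch: the function names `pgid_for_pattern` and `restart_group` are hypothetical, and the pattern `/bin/bash ./start.sh` is an assumption that has to match how start.sh actually shows up in `ps` on each machine.

```bash
#!/bin/bash
# Hypothetical helpers (not part of the original setup).

# Print the process-group ID of the oldest process whose full command line
# matches the regex in $1. -o picks the oldest match, so subshells forked
# by start.sh (same command line, same group) do not confuse the lookup.
pgid_for_pattern() {
    local pid
    pid=$(pgrep -o -f "$1") || return 1   # pgrep exits 1 when nothing matches
    ps -o pgid= -p "$pid" | tr -d ' '     # "pgid=" suppresses the header line
}

# Send SIGTERM to every process in that group; a negative PID after "--"
# addresses a whole process group.
restart_group() {
    local pgid
    pgid=$(pgid_for_pattern "$1") || { echo "no process matches: $1" >&2; return 1; }
    kill -TERM -- "-$pgid"
}
```

Usage, after sourcing the functions in the shell that issues the restart: `restart_group '/bin/bash ./start.sh'`.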
On my personal machine running Pop!_OS 22.04 LTS x86_64 and zsh 5.8.1 (x86_64-ubuntu-linux-gnu) this works flawlessly.
However, on my colleague's machine, running macOS and zsh 5.9, the above kill command works exactly three times and then stops working. How can that be? After three restarts, start.sh itself simply exits when a fourth kill is issued. Waiting a moderate amount of time between the kill commands does not change this.
Edit:
We have now found that the following works on both machines:
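```bash
#!/bin/bash
START_SCRIPT_PATH="./start-server.sh"

# Only logs; the while loop below does the actual restart.
handle_sigterm() {
    echo "Received SIGTERM, will restart"
}

handle_sigint() {
    echo "Received SIGINT, will exit"
    exit 1
}

trap 'handle_sigterm' SIGTERM
trap 'handle_sigint' SIGINT

# start-server.sh runs in the foreground; whenever it exits (e.g. because
# the group received SIGTERM), the loop starts it again after a short pause.
while true
do
    bash "$START_SCRIPT_PATH"
    sleep 1
done
```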
I would still like to understand why the first solution does not work on both machines, i.e., what exactly the difference between the two is. Is it a timing issue, where the first solution can terminate when start-server.sh is terminated, before the new run has started? That seems strange, as it consistently stopped working only after the third restart.
We also tried running bash start-server.sh instead of ./start-server.sh, as well as the other signal spellings (SIGTERM instead of TERM); neither change made the first version work. So the difference very likely lies in the loop.
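One structural difference can be shown without a Node server. Bash runs a trap only after the foreground command it is waiting for has finished, and when the handler returns, execution resumes after the interrupted command. In the first start.sh the restart happens inside the handler, and ./start-server.sh is the last line, so as soon as any handler returns, the script exits. Every further restart therefore depends on the next SIGTERM being trapped again while bash is still inside an earlier handler. How bash treats a trap for a signal that arrives during a handler for that same signal may differ between bash 3.2 and 5.x; that is a plausible but unverified explanation for the "exactly three" pattern. In the loop version the handler only logs and the loop does the restart, so nothing is nested. The sketch below models only the single-level case of the first start.sh: `sleep` stands in for start-server.sh (an assumption so it runs without node), and a background subshell plays the external kill, signalling only the script itself rather than the whole group.

```bash
#!/bin/bash
# Minimal model of the ORIGINAL start.sh; `sleep 1` stands in for ./start-server.sh.
events=()
trap 'events+=("trap handler ran (restart would happen here)")' TERM

# Plays the role of the external kill, sent about 0.2 s later to this script only.
( sleep 0.2; kill -TERM "$$" ) &

# Bash defers the trap until this foreground command has finished.
sleep 1
events+=("back in main script after the interrupted command")
wait   # reap the background helper

# Nothing follows here, just as nothing follows ./start-server.sh in start.sh,
# so the real script would exit at this point.
printf '%s\n' "${events[@]}"
```

The two lines are printed in that order: the handler runs first, then control falls through to the end of the script.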