In the past I used PID files to make sure a script could not run twice at the same time. The downside was that the PID file was not deleted when the kernel killed the script, so manual cleanup was sometimes needed (removing the PID file, killing leftover subprocesses).
I eventually came up with a solution that uses pgrep to check whether the script is already running and kills very old instances that should no longer exist:
while read -r pid cmd; do
    # check if pid or parent pid belong to current script execution
    if [[ $pid == "$$" ]] || [[ $(ps -o ppid= -p "$pid" | xargs) == "$$" ]]; then
        continue
    fi
    # avoid re-execution of script within 12 hours (43200 seconds)
    if [[ $(date +%s --d="now - $( stat -c%X "/proc/$pid" ) seconds") -lt 43200 ]]; then
        echo "Error: Script is already running!"
        exit
    fi
    # kill outdated script executions
    if kill -9 "$pid"; then
        echo "Warning: Outdated script execution ($pid) has been killed!"
    fi
done < <(pgrep -af "/bin/bash $(basename "$(readlink -f "$0")")")
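The age test in the loop asks `date` to compute "now minus the timestamp of /proc/$pid". As a minimal sketch, the same value can be computed with plain shell arithmetic, which may be easier to read and to guard against a vanished process. The function name `proc_entry_age` is hypothetical and not part of the original script; it relies on the same `stat -c %X` timestamp as the loop and makes no claim that this timestamp equals the real start time of the process.

```shell
#!/bin/bash
# Hypothetical helper (not from the original script): print the age in
# seconds of the /proc/<pid> entry, based on the same "stat -c %X"
# timestamp the loop uses. Returns non-zero if the entry does not exist.
proc_entry_age() {
    local pid=$1 started now
    started=$(stat -c %X "/proc/$pid" 2>/dev/null) || return 1
    now=$(date +%s)
    echo $(( now - started ))
}

# Usage: treat anything younger than 12 hours (43200 seconds) as "running".
if age=$(proc_entry_age "$$") && (( age < 43200 )); then
    echo "entry for pid $$ is $age seconds old"
fi
```

Because the function fails when `/proc/$pid` is gone, the caller can tell "process already exited" apart from "process is young".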
The pgrep loop has worked on hundreds of servers and across thousands of script executions without any issues, but in very rare cases it reported "already running" although no other instance of the script was running.
As I could not find a bug in my code, I added more debug output:
echo "pid: $pid"
echo "cmd: $cmd"
echo "lsof:"
lsof -p "$pid"
echo "/proc/$pid/cmdline:"
xargs -0 <"/proc/$pid/cmdline"
echo "pstree:"
pstree -asp "$pid"
After several weeks it produced this:
2023-04-02 06:10:01 Error: Script is already running!
2023-04-02 06:10:01 pid: 3902
2023-04-02 06:10:01 cmd: /bin/bash /usr/local/bin/script.sh
2023-04-02 06:10:01 lsof:
2023-04-02 06:10:01 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
2023-04-02 06:10:01 sed 3902 root cwd DIR 8,3 4096 6291457 /root
2023-04-02 06:10:01 sed 3902 root rtd DIR 8,3 4096 2 /
2023-04-02 06:10:01 sed 3902 root txt REG 8,3 77664 2886809 /usr/bin/sed
2023-04-02 06:10:01 sed 3902 root mem REG 8,3 1936776 5243076 /lib64/libc-2.22.so
2023-04-02 06:10:01 sed 3902 root mem REG 8,3 10344 2892257 /usr/lib64/coreutils/libstdbuf.so
2023-04-02 06:10:01 sed 3902 root mem REG 8,3 164144 5242936 /lib64/ld-2.22.so
2023-04-02 06:10:01 sed 3902 root 0r FIFO 0,12 0t0 96041317 pipe
2023-04-02 06:10:01 sed 3902 root 1w FIFO 0,12 0t0 96037120 pipe
2023-04-02 06:10:01 sed 3902 root 2w FIFO 0,12 0t0 96021482 pipe
2023-04-02 06:10:01 /proc/3902/cmdline:
2023-04-02 06:10:01 sed s/%/%%/g
2023-04-02 06:10:01 pstree:
2023-04-02 06:10:01 systemd,1 --switched-root --system --deserialize 23
2023-04-02 06:10:01 `-cron,3599 -n
2023-04-02 06:10:01 `-cron,3885 -n
2023-04-02 06:10:01 `-fcc-monitor-and,3891 /usr/local/bin/script.sh
2023-04-02 06:10:01 `-fcc-monitor-and,3901 /usr/local/bin/script.sh
2023-04-02 06:10:01 `-sed,3902 s/%/%%/g
As you can see, pgrep returned 3902, which belongs to a child process of one of the script's subshells (sed, according to lsof and pstree). I would have expected 3891 or 3901 instead (as in all other executions). Is this a bug in pgrep, or why isn't it 100% reliable?
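One plausible explanation, stated here as an assumption rather than something the logs prove: bash starts an external command such as sed by first forking a copy of itself and then calling exec in that copy. Between those two steps the new PID still reports the script's command line in `/proc/<pid>/cmdline`, so `pgrep -f` can match it, while lsof and pstree, which ran a moment later, already see sed. The sketch below shows only the first half of that, namely that a forked subshell has its own PID but the parent's command line. The helper name `cmdline_of` is hypothetical, and the sketch assumes Linux with `/proc` mounted.

```shell
#!/bin/bash
# Hypothetical helper: print the command line of a PID as /proc reports it.
# Uses only shell builtins, so the read happens in the calling process.
cmdline_of() {
    local arg out=()
    while IFS= read -r -d '' arg; do
        out+=("$arg")
    done <"/proc/$1/cmdline"
    echo "${out[*]}"
}

parent_pid=$$
parent_cmd=$(cmdline_of "$parent_pid")

# A command substitution runs in a forked copy of the script.
# $BASHPID is the PID of that copy, while $$ stays the PID of the script.
child_info=$(echo "$BASHPID"; cmdline_of "$BASHPID")
child_pid=${child_info%%$'\n'*}
child_cmd=${child_info#*$'\n'}

echo "parent: $parent_pid $parent_cmd"
echo "child:  $child_pid $child_cmd"
```

If this is what happened, any PID that has the current script among its ancestors should be skipped, not only the script's direct children.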
Update: For anyone interested, here is the final solution:
# obtain script vars
script_pid="$$"
script_path=$(readlink -f "$0")
# make script safe against race conditions (can't be executed twice)
while read -r pid; do
    # obtain all parent and child pids
    pid_list=$(pstree --show-parents --show-pids "$pid")
    # ignore the list if it contains the current script execution
    if echo "$pid_list" | grep -F "($script_pid)" >/dev/null; then
        continue
    fi
    # verify that pid is part of the list ("pstree --show-parents" returns a list even if the pid does not exist, see https://github.com/acg/psmisc/issues/5)
    if ! echo "$pid_list" | grep -F "($pid)" >/dev/null; then
        continue # process does not exist anymore
    fi
    # obtain age of pid
    pid_time=$( stat -c%X "/proc/$pid" 2>/dev/null )
    if [[ ! $pid_time ]]; then
        continue # process does not exist anymore
    fi
    # kill outdated script executions (older than 12 hours / 43200 seconds)
    if [[ $(date +%s --d="now - $pid_time seconds") -gt 43200 ]]; then
        if kill -9 "$pid"; then
            echo "Warning: Outdated script execution ($pid) has been killed!"
        fi
    else
        # another recent execution exists: we are facing a race condition
        echo "Error: Script is already running!"
        exit 1
    fi
done < <(pgrep -f "^/bin/bash $script_path") # obtain pids that belong to this script
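The "does this pid belong to my own process tree" test can also be sketched without pstree by walking the `PPid:` field in `/proc/<pid>/status` upwards. This is only an illustration under the assumption of a Linux `/proc`; the function name `is_in_tree_of` is hypothetical and not part of the solution above. It fails for a pid that no longer exists, so no separate existence check is needed.

```shell
#!/bin/bash
# Hypothetical helper: succeed if pid $1 is $2 itself or has $2 among its
# ancestors. Walks the PPid field of /proc/<pid>/status towards pid 1.
is_in_tree_of() {
    local pid=$1 root=$2 key value next
    while [[ $pid =~ ^[1-9][0-9]*$ ]]; do
        [[ $pid == "$root" ]] && return 0
        # a missing status file means the process does not exist (anymore)
        [[ -r /proc/$pid/status ]] || return 1
        next=
        while read -r key value; do
            if [[ $key == PPid: ]]; then
                next=$value
                break
            fi
        done <"/proc/$pid/status"
        pid=$next
    done
    return 1
}

# Usage: a subshell of this script is part of the script's own tree.
( is_in_tree_of "$BASHPID" "$$" && echo "subshell $BASHPID belongs to $$" )
```

The walk stops once the parent PID is 0, which is the parent of PID 1, so the loop always terminates.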