I've run into an odd issue in which a ps -o args -p <pid> command very occasionally fails to find the process, even though it is definitely running on the server. The processes involved are long-running wrapper scripts used to launch some Java apps.
The "in the wild" occurrences always seem to happen early in the morning, when the server is quite heavily loaded, so there is some evidence that the issue is related to disk load. However, by running the same ps in a tight loop, I can eventually replicate the problem: once every few hundred or so runs I get an error.
By running the following bash script, I've managed to generate strace output for both a failed and a successful run:
while [ $? == 0 ] ; do strace -o fail.out ps -o args -p <pid> >/dev/null ; done ; strace -o good.out ps -o args -p <pid>
Comparing fail.out and good.out, I can see that the getdents system call on the run that fails somehow returns far fewer entries than the actual count of processes on the system (roughly 500 compared with roughly 1100):
grep getdents good.out
getdents(5, /* 1174 entries */, 32768) = 32760
getdents(5, /* 31 entries */, 32768) = 992
getdents(5, /* 0 entries */, 32768) = 0
grep getdents fail.out
getdents(5, /* 673 entries */, 32768) = 16728
getdents(5, /* 0 entries */, 32768) = 0
... and that shorter list doesn't include the actual pid in question, so it's not found.
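To compare the two traces more directly, the per-call entry counts in the getdents lines above can be summed. This is only a rough sketch: count_dirents is a hypothetical helper name, it assumes the strace comment format shown above ("/* N entries */"), and the total covers every directory entry returned, not just the numeric process directories.

```shell
# Hypothetical helper: sum the "N entries" counts that strace prints
# for each getdents call in the given trace file.
count_dirents() {
    awk '/getdents/ && match($0, /[0-9]+ entries/) {
             # substr yields e.g. "1174 entries"; adding 0 keeps the leading number
             total += substr($0, RSTART, RLENGTH) + 0
         }
         END { print total + 0 }' "$1"
}

# Example usage, assuming the trace files from the loop above exist:
#   count_dirents good.out
#   count_dirents fail.out
```

For the grep output shown above, this would report 1205 entries for the successful run and 673 for the failed one.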
You can ignore this section; the ENOTTY errors are explained by dave_thompson's comment below and are unrelated.
Additionally, the failed run gets some ENOTTY errors that don't appear in the successful run. Near the beginning of the output I see
ioctl(1, TIOCGWINSZ, 0x7fffe19db310) = -1 ENOTTY (Inappropriate ioctl for device)
ioctl(1, TCGETS, 0x7fffe19db280) = -1 ENOTTY (Inappropriate ioctl for device)
And at the end I see a single
ioctl(1, TCGETS, 0x7fffe19db0d0) = -1 ENOTTY (Inappropriate ioctl for device)
The failed ioctl at the end happens right before the ps returns, but it occurs after the ps has already printed an empty result set, so I'm not sure if they're related. I do know that they're consistent in all of the failed strace outputs I have, but don't appear in the successful ones.
I have absolutely no idea why getdents would occasionally not return the full list of processes. I've now reached the point where I'm just going to slap a band-aid on the entire thing: I'll change the control script that checks the wrapper script so that it calls ps a second time if the first call fails. Still, I'd be interested to know if anyone has any ideas about what's going on here.
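For reference, the band-aid I have in mind looks roughly like the sketch below. The function name ps_args_with_retry and the optional attempt count are made up for illustration, and I use "args=" (rather than "args") only to suppress the header line so a failed attempt prints nothing. This is not the actual control script.

```shell
# Hypothetical helper: print the command line of a PID, running ps again
# if an attempt fails to find the process.
ps_args_with_retry() {
    local pid=$1
    local attempts=${2:-2}   # default: one try plus one retry
    local i=0 out
    while [ "$i" -lt "$attempts" ]; do
        # ps exits non-zero when it does not find the requested PID
        if out=$(ps -o args= -p "$pid"); then
            printf '%s\n' "$out"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}

# Example usage with the current shell's own PID:
#   ps_args_with_retry "$$"
```

This only papers over the symptom: if both attempts happen to hit the short getdents result, the check still fails.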
The system is running kernel 4.16.13-1.el7.elrepo.x86_64 on CentOS 7, with procps-ng version 3.3.10-17.el7_5.2.x86_64.