I have a script that reports the cpu usage of a specific program reading values from pidstat. The script launches as:
pidstat -C ^frontend$ -h -l -p ALL 1 1
The values reported this way sometimes exceed 100% cpu usage, which it's not possible since the process is single-trheaded.
After going up and down trying to find what could be causing this behaviour, I tried to simply increase the time pidstat would collect info for, and found that surprisingly, the values tend to be much more reasonable:
pidstat -C ^frontend$ -h -l -p ALL 1 10
Linux 3.13.0-32-generic (eu-123) 09/22/2014 _x86_64_ (8 CPU)
# Time PID %usr %system %guest %CPU CPU Command
1411396853 15884 28.43 29.41 0.00 57.84 1 /srv/propulsor/frontend-01/frontend
1411396853 15911 33.33 32.35 0.00 65.69 2 /srv/propulsor/frontend-02/frontend
1411396853 15968 58.82 0.00 0.00 58.82 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411396854 15884 141.00 4.00 0.00 145.00 1 /srv/propulsor/frontend-01/frontend
1411396854 15911 143.00 5.00 0.00 148.00 2 /srv/propulsor/frontend-02/frontend
1411396854 15968 32.00 13.00 0.00 45.00 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411396855 15884 0.00 0.00 0.00 0.00 1 /srv/propulsor/frontend-01/frontend
1411396855 15911 0.00 0.00 0.00 0.00 2 /srv/propulsor/frontend-02/frontend
1411396855 15968 32.00 32.00 0.00 64.00 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411396856 15884 61.00 0.00 0.00 61.00 1 /srv/propulsor/frontend-01/frontend
1411396856 15911 69.00 0.00 0.00 69.00 2 /srv/propulsor/frontend-02/frontend
1411396856 15968 33.00 31.00 0.00 64.00 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411396857 15884 30.00 7.00 0.00 37.00 1 /srv/propulsor/frontend-01/frontend
1411396857 15911 34.00 20.00 0.00 54.00 2 /srv/propulsor/frontend-02/frontend
1411396857 15968 32.00 32.00 0.00 64.00 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411396858 15884 31.00 32.00 0.00 63.00 1 /srv/propulsor/frontend-01/frontend
1411396858 15911 34.00 33.00 0.00 67.00 2 /srv/propulsor/frontend-02/frontend
1411396858 15968 141.00 7.00 0.00 148.00 3 /srv/propulsor/frontend-03/frontend
[snip]
Compared to:
root@eu-123 ~ # pidstat -C ^frontend$ -h -l -p ALL 3 10
Linux 3.13.0-32-generic (eu-123) 09/22/2014 _x86_64_ (8 CPU)
# Time PID %usr %system %guest %CPU CPU Command
1411397159 15884 37.09 10.60 0.00 47.68 1 /srv/propulsor/frontend-01/frontend
1411397159 15911 30.46 29.80 0.00 60.26 2 /srv/propulsor/frontend-02/frontend
1411397159 15968 60.60 13.25 0.00 73.84 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411397162 15884 29.33 29.33 0.00 58.67 1 /srv/propulsor/frontend-01/frontend
1411397162 15911 66.00 2.00 0.00 68.00 2 /srv/propulsor/frontend-02/frontend
1411397162 15968 41.00 15.00 0.00 56.00 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411397165 15884 66.33 0.00 0.00 66.33 1 /srv/propulsor/frontend-01/frontend
1411397165 15911 30.00 19.67 0.00 49.67 2 /srv/propulsor/frontend-02/frontend
1411397165 15968 61.00 13.33 0.00 74.33 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411397168 15884 30.00 23.33 0.00 53.33 1 /srv/propulsor/frontend-01/frontend
1411397168 15911 66.00 21.33 0.00 87.33 2 /srv/propulsor/frontend-02/frontend
1411397168 15968 42.00 15.67 0.00 57.67 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411397171 15884 59.00 12.33 0.00 71.33 1 /srv/propulsor/frontend-01/frontend
1411397171 15911 30.00 0.67 0.00 30.67 2 /srv/propulsor/frontend-02/frontend
1411397171 15968 70.00 24.67 0.00 94.67 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411397174 15884 37.67 11.00 0.00 48.67 1 /srv/propulsor/frontend-01/frontend
1411397174 15911 30.00 29.67 0.00 59.67 2 /srv/propulsor/frontend-02/frontend
1411397174 15968 33.00 4.67 0.00 37.67 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411397177 15884 65.33 21.67 0.00 87.00 1 /srv/propulsor/frontend-01/frontend
1411397177 15911 65.67 2.33 0.00 68.00 2 /srv/propulsor/frontend-02/frontend
1411397177 15968 32.33 32.00 0.00 64.33 3 /srv/propulsor/frontend-03/frontend
# Time PID %usr %system %guest %CPU CPU Command
1411397180 15884 29.00 0.00 0.00 29.00 1 /srv/propulsor/frontend-01/frontend
1411397180 15911 30.00 19.33 0.00 49.33 2 /srv/propulsor/frontend-02/frontend
1411397180 15968 70.00 2.33 0.00 72.33 3 /srv/propulsor/frontend-03/frontend
[snip]
Ok, so now the read looks more reasonable, but the question for me remains, what is it causing such a difference? And how can pidstat 1 1 return values over 100%?
Bonus question: Should I be just using different tools to track cpu usage of my processes?
Thanks in advance
Edit: I kept researching the issue. Seems like the process only starts doing such a strange thing after a while (namely ~2 weeks of lifetime). In the same time we found an increased amount of connections handled by the process.
The observed behaviour is to get a couple of runs of pidstat with a flat 0% cpu usage followed by a high peak, which averaged in the magnitude of tenths stays within expected cpu usage values.
I'm starting to think it could be something like the application is doing some loop of system calls through the connection pool which after a while is big enough so that it doesn't write down its /proc/ stats making pidstat believe that the usage is 0% and therefore hitting a big >100% usage after a couple rounds on 0.
When does the process write the /proc/ data? As far as I understand, pidstat uses that info to calculate the cpu usage, or am I wrong?
I hope for someone to shed some light on this issue...