I am trying to understand whether setting `cpu.cfs_quota_us` in the cpu cgroup subsystem has any impact on application performance. Essentially, if I reduce the CPU quota but increase the number of CPUs so that the "effective" number of CPUs stays the same, would it impact the application? For example, is a 4-CPU, 100%-quota configuration the same as an 8-CPU, 50%-quota configuration?
I know this depends a lot on the application design and whether it's CPU- or I/O-bound. Here I am only concerned with CPU-intensive applications.
My effort:
I wrote a simple C application available here https://github.com/ashu-mehra/cpu-quota-test.
This program creates 'N' threads. Each thread computes prime numbers between a starting number 'n' and 1000000, where 'n' is different for each thread. After computing 100 prime numbers, the thread sleeps for a fixed duration. Once the thread reaches 1000000, it starts over from 2. At the end, the main thread displays the cumulative number of primes calculated by all the threads. I treat this as the "throughput" of this sample application.
I ran this program under the following configurations:
- In a cgroup with 4 CPUs and no limit on quota.
- In a cgroup with 8 CPUs and a 50% quota.
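To make the comparison concrete, here is a sketch of how the two cgroups can be set up under cgroup v1 (the cgroup names and mount points are my assumptions; the sysfs writes need root, so those lines are commented out — the quota arithmetic is the point):

```shell
#!/bin/sh
# Sketch of the two cgroup-v1 configurations. 8 CPUs at 50% each gives
# the same 4 "effective" CPUs as 4 CPUs with no quota at all.
PERIOD=100000                       # cpu.cfs_period_us default (100 ms)

# Case 1: 4 CPUs, unlimited quota (cpu.cfs_quota_us = -1)
# echo 0-3 > /sys/fs/cgroup/cpuset/cpu4quota100/cpuset.cpus
# echo -1  > /sys/fs/cgroup/cpu/cpu4quota100/cpu.cfs_quota_us

# Case 2: 8 CPUs, 50% quota each
NCPUS=8
PCT=50
QUOTA=$((PERIOD * NCPUS * PCT / 100))
echo "cpu8quota50: cpu.cfs_quota_us = $QUOTA"    # 400000 = 4 CPUs' worth
# echo 0-7     > /sys/fs/cgroup/cpuset/cpu8quota50/cpuset.cpus
# echo $QUOTA  > /sys/fs/cgroup/cpu/cpu8quota50/cpu.cfs_quota_us
```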
I disabled hyperthreading by setting `/sys/devices/system/cpu/cpuN/online` to 0 for the sibling threads.
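For reference, a sketch of how the sibling threads can be identified and offlined (standard Linux sysfs topology paths; the actual offline write needs root, so it is left commented):

```shell
# For each logical CPU, read its thread_siblings_list; any CPU that is
# not the first entry in its own list is a hyperthread sibling.
for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  sib="$cpu/topology/thread_siblings_list"
  [ -r "$sib" ] || continue
  # The list looks like "0,4" or "0-1"; grab the first CPU id.
  first=$(cut -d',' -f1 "$sib" | cut -d'-' -f1)
  n=${cpu##*cpu}
  if [ "$n" != "$first" ]; then
    echo "cpu$n is a hyperthread sibling of cpu$first"
    # echo 0 > "$cpu/online"    # as root: offline this hyperthread
  fi
done
```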
For each configuration I varied the number of threads from 4 to 32. Following are the results for the "throughput" reported by the sample program; numbers are averages of 10 iterations.

    threads  cpu4quota100  cpu8quota50
    4        66229.5       66079.4
    8        128129        129768
    16       189247        134882
    24       188238        98917.8
    32       176236        87252.5
Notice there is a big difference in throughput between the two cases from 16 threads onwards. For 24 and 32 threads, throughput drops considerably in the "cpu8quota50" case.
I have the perf stat results for these runs as well. I noticed that the cpu-migrations count reported by perf varies a lot between the two configurations. Here is the comparison:

    threads  cpu4quota100  cpu8quota50
    4        9.6           11.2
    8        3252.2        37.9
    16       2956.2        4490.5
    24       472.6         2347
    32       118.3         1727.2
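For reference, this is roughly how such numbers can be collected per run (a sketch; the cgroup names and binary name are my own, and `cgexec` comes from the libcgroup tools):

```shell
# Run the benchmark inside one of the cgroups and collect
# scheduler-related events, averaged over 10 repeats:
perf stat -r 10 -e cpu-migrations,context-switches,task-clock \
    cgexec -g cpu,cpuset:cpu8quota50 ./prime-test 32
```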
The numbers for 4, 8 and 16 threads make sense, but I can't comprehend the numbers for 24 and 32 threads in the "cpu4quota100" case, which are way lower than for 16 threads.
Can someone provide an explanation for these results? Also, does "cpu-migration" have any impact on application performance?
Sorry for the long post!
Edit 1:
I updated my script for running the above-mentioned sample program to time the execution using the time command, to see if there is any difference between the "cpu4quota100" and "cpu8quota50" cases.
I did the run for 32 threads only, and these are the results:
    time  cpu4quota100  cpu8quota50
    user  119.956 secs  120.076 secs
    sys   0.001 secs    0.009 secs
    CPU   386.2%        386.5%
So there is not much difference in user and sys time between the two cases, but the "throughput" in the cpu4quota100 case is twice that of the cpu8quota50 case.
Edit 2:
It seems changing the kernel governor for CPU frequency helped improve the throughput in the cpu8quota50 case.
The earlier numbers were obtained with the "powersave" frequency governor in use. With "powersave", the CPU frequency of the cores shot up to the maximum in the cpu4quota100 case, but stayed much lower in the cpu8quota50 case.
However, after changing the frequency governor to "performance", the CPU frequency in the cpu8quota50 case was also close to the maximum.
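The governor change itself is a one-liner per core (a sketch; needs root, and assumes the standard Linux cpufreq sysfs paths):

```shell
# Switch every core from "powersave" to the "performance" governor:
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"
done
# Check the resulting frequencies while the benchmark runs:
grep "cpu MHz" /proc/cpuinfo
```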
For 32 threads running with "performance" as the frequency governor, I get the following numbers:

    threads  cpu4quota100  cpu8quota50
    32       175804        163831
So the difference has now come down from nearly 50% to just 6.8%.
But it's interesting to note the difference in the behavior of the "powersave" governor between the two cases, as mentioned above. I am not sure it is working as expected in the cpu8quota50 case.