
We have the following Red Hat Linux VM servers (each server runs an application in Docker containers):

Red Hat Enterprise Linux version: 7.6
Number of CPU cores: 16

We suspect that the number of cores isn't enough, because CPU idle is low, around 40%-50% and sometimes even under 40%, even though the load average is normal, around 9-12.

We ran the following tests.

Output of sar -u 2 5:
Linux 3.10.0-862.el7.x86_64 (bigdata-machine03.kondel.com)  08/21/2022      _x86_64_        (16 CPU)

02:14:07 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
02:14:09 PM     all     36.82      0.00     14.64      0.57      0.00     47.97
02:14:11 PM     all     35.50      0.00     16.01      0.82      0.00     47.68
02:14:13 PM     all     21.52      0.00     10.90      0.69      0.00     66.89
02:14:15 PM     all     21.45      0.00     10.96      0.63      0.00     66.97
02:14:17 PM     all     22.28      0.00     10.15      0.78      0.00     66.78
Average:        all     27.51      0.00     12.53      0.70      0.00     59.27


vmstat 1 21
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 6  0 269568 26388424      0 29302496    0    0     0   419    0    0 19  9 72  0  0
 5  0 269568 26257112      0 29424172    0    0     0 131098 9739 4328 24  8 67  1  0
 5  0 269568 26124560      0 29548576    0    0     0 66573 8790 2414 24  8 67  0  0
 5  0 269568 25992844      0 29671288    0    0     0 146499 8701 2124 23  9 67  1  0
 5  0 269568 25861804      0 29795272    0    0     0 114700 9146 4341 23  8 67  1  0
 5  0 269568 25726984      0 29924684    0    0     0 131127 10060 4263 24  8 67  1  0
 5  0 269568 25592612      0 30049624    0    0     0 131098 9127 3958 24  8 67  1  0
 5  0 269568 25462696      0 30172108    0    0     0 131369 10000 4500 24  8 67  1  0
 5  0 269568 25325716      0 30297560    0    0     0 98332 8723 2942 24  8 67  1  0
 6  0 269568 25181400      0 30436356    0    0     0 98324 8585 2740 24  7 68  1  0
 6  0 269568 25044572      0 30560928    0    0     0 163876 9983 4029 24  8 67  1  0
 4  1 269568 24903352      0 30693816    0    0     0 157720 8468 3220 25  8 67  1  0
 6  0 269568 24770240      0 30819368    0    0     0 71702 9439 5035 24  7 67  1  0
 5  0 269568 24633396      0 30946824    0    0     0 131115 8974 3863 25  7 67  1  0
 5  0 269568 24508664      0 31064812    0    0     0 163873 9523 4525 23  8 67  1  0
 4  1 269568 24366044      0 31196540    0    0     0 65547 8381 2131 24  8 67  0  0
 5  0 269568 24243064      0 31314580    0    0     0 98326 8936 4413 24  7 68  1  0
 5  0 269568 24115296      0 31436264    0    0     0 163872 9698 4941 23  7 68  2  0
 5  0 269568 23974156      0 31569112    0    0     0 163876 9298 4221 24  7 68  2  0
 4  1 269568 23835196      0 31700900    0    0     0 65546 8262 2000 25  7 67  0  0
15  0 269568 22972552      0 31833020    0    0     0 131101 32338 4679 55 25 20  1  0



 # uptime
 14:14:31 up 149 days, 23:06,  1 user,  load average: 9.31, 9.32, 9.48

iostat
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sda              14.36         0.58      6648.36    7483539 86140749988
dm-0              0.27         0.12         2.10    1503954   27251899
dm-1              0.10         0.19         0.20    2427092    2539536
dm-2             14.18         0.27      6646.06    3449263 86110943670

What is the red line that we should consider before adding more CPU cores?

MC68020
yael
  • What response times or data throughput are you expecting from your applications? If you're meeting that metric you don't need to throw resources at a problem that does not exist. If the apps are not meeting your requirements then you need to identify the bottleneck and address that. Consider the apps also. If you have 12 single-threaded processes as the major applications then giving them more cores is not going to help. – doneal24 Aug 21 '22 at 15:54
  • From the application side everything is OK, and I don't see anything unusual, but what worries me a little is the idle value of about 40%. – yael Aug 21 '22 at 16:01
  • The cores are there to be used. I'd be more concerned about what the CPU usage spikes to. If the usage never hits 80-90% then you don't have a problem. – doneal24 Aug 21 '22 at 16:05
  • As you can see from my results, sar reports CPU idle, and you are talking about CPU utilization. What is your preferred way to check CPU utilization? – yael Aug 21 '22 at 16:08
  • With that uptime, are you doing regular updates? RHEL 7.6 went EOL August 6, 2019 that's 3 years ago. The latest RHEL 7 is 7.9 which came out nearly two years ago. Maybe it's time to start planning for RHEL 8 or 9? –  Aug 21 '22 at 16:33
  • Yes, of course. We are preparing the RHEL upgrade from 7.6 to 7.9, but it will take some time since we have a lot of machines. Meanwhile, we want to understand whether the CPU core allocation is enough. – yael Aug 21 '22 at 16:37

2 Answers


From what I see, you have an average of about 5-6 runnable processes (the r column in vmstat), and uptime shows a load average of about 9. When the load stays at 12 or more, you may want to start thinking about an upgrade.
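
The rule of thumb above can be sketched as a small check. This is only an illustration: the function name and the 0.75 ratio (12 out of 16 cores) are assumptions derived from the numbers in this thread, not a standard threshold.

```python
def load_is_sustained_high(load_avgs, cores, ratio=0.75):
    """Return True when the 1, 5 and 15 minute load averages are all
    at or above ratio * cores (assumed threshold, 12 on a 16-core host)."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    limit = ratio * cores
    # "Constantly" is approximated by requiring all three averages to be high.
    return all(load >= limit for load in load_avgs)


# Values from the uptime output in the question, on 16 cores.
question_load = (9.31, 9.32, 9.48)
needs_attention = load_is_sustained_high(question_load, cores=16)
```

On a live host the three values could come from `os.getloadavg()` and the core count from `os.cpu_count()`.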

Your CPU usage is relatively low. Once it reaches 75% for 50% of the time, it would be wise to start planning ahead.
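
A minimal sketch of that check, assuming utilization is taken as 100 minus the %idle column from sar (which also counts %iowait as busy). The function name is hypothetical, and the sample values are the five %idle readings from the question.

```python
def busy_fraction(idle_samples, busy_threshold=75.0):
    """Fraction of samples whose utilization (100 - idle) is at or
    above busy_threshold percent."""
    if not idle_samples:
        raise ValueError("need at least one sample")
    utilization = [100.0 - idle for idle in idle_samples]
    busy = sum(1 for u in utilization if u >= busy_threshold)
    return busy / len(utilization)


# %idle values from the sar -u 2 5 output in the question.
sar_idle = [47.97, 47.68, 66.89, 66.97, 66.78]
fraction = busy_fraction(sar_idle)

# Under the rule of thumb above, planning starts once this reaches 0.5.
time_to_plan = fraction >= 0.5
```

Five samples over ten seconds are far too few to decide anything. The same function could be fed a full day of sar history.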

What you should pay attention to is interrupts. Having >100K is a bit much for me, though this depends a lot on the programs you run. I would dig in this direction.
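
When reading vmstat output it helps to map each value to its column name, since the columns are easy to misalign by eye. In the vmstat output from the question, the `in` column (interrupts per second) stays around 8,000 to 10,000 with one sample at 32,338, while the six-figure values sit in the `bo` column (blocks written out). A small parser, written here only as a sketch with a hypothetical function name, makes this explicit:

```python
def parse_vmstat(text):
    """Turn vmstat output into a list of dicts keyed by column name."""
    header = None
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "r":
            # The second header line names the columns: r b swpd free ...
            header = fields
        elif header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, (int(f) for f in fields))))
    return rows


# Two data lines copied from the vmstat output in the question.
sample = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 5  0 269568 26257112      0 29424172    0    0     0 131098 9739 4328 24  8 67  1  0
15  0 269568 22972552      0 31833020    0    0     0 131101 32338 4679 55 25 20  1  0
"""

rows = parse_vmstat(sample)
peak_interrupts = max(row["in"] for row in rows)
peak_blocks_out = max(row["bo"] for row in rows)
```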

Romeo Ninov
  • about what you said - "pay attention is interrupts. Having >100K" - how to check that? – yael Aug 21 '22 at 16:24
  • @yael, you can check this answer: https://unix.stackexchange.com/a/331969/101265 – Romeo Ninov Aug 21 '22 at 16:25
  • CPU idle is around 40%. Can we say that when CPU idle drops to 10-20%, CPU utilization is close to 100%? – yael Aug 21 '22 at 16:29
  • @yael, yes, when you see idle time decreasing, it's time for vertical scaling. But for the moment I see idle at around 65% (do not pay much attention to rare peaks). – Romeo Ninov Aug 21 '22 at 16:33
  • So if idle is the opposite of utilization, what is the best way to verify CPU utilization? – yael Aug 21 '22 at 16:35
  • It's a combination of CPU idle and load average. They represent the CPU load from different points of view: usage of the processors, and processes in the run queue. – Romeo Ninov Aug 21 '22 at 16:37
  • So can we summarize that, based on our results, we don't need any additional cores, but, as you said, when the load average gets close to 16 (the number of cores) we need to think about a core upgrade. Am I right? – yael Aug 21 '22 at 16:39
  • Second, do you have a "red line" threshold for CPU idle? If, for example, CPU idle is 10%, what should we conclude about adding cores? – yael Aug 21 '22 at 16:41
  • Correct about the cores. About the idle, as I mentioned in the answer: 25% idle or less for 50% of the time or more. – Romeo Ninov Aug 21 '22 at 16:48
  • Remember, load average includes processes in I/O wait. If your load is higher than makes sense for the idle cpu, then you need to look for I/O contention and maybe you need an upgrade there. (More cache ram? Wider raid? More SSDs less disks?) – user10489 Aug 21 '22 at 17:06
  • @user10489, that's the reason I added the last paragraph to my answer. – Romeo Ninov Aug 21 '22 at 17:14
  • Yes, I meant to add that. High interrupts are a potential sign of high I/O. More cpus won't help that much probably, but enabling hyperthreading might. – user10489 Aug 21 '22 at 17:15
  • I have another question: do you think 8 cores per machine would be enough, based on the details I gave? I get the feeling that 16 are too many. – yael Aug 21 '22 at 17:59
  • @yael, please create a new question and link this one (so there is no need to paste the data again). But IMHO with 8 CPUs the machine will be on the edge (from both sides). – Romeo Ninov Aug 21 '22 at 18:09

I personally monitor the RES line in /proc/interrupts, which reports the number of rescheduling interrupts (in a context without CPU pinning).
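
A minimal sketch of reading that counter, assuming the usual /proc/interrupts layout where the RES row holds one count per CPU followed by a description. The function names are hypothetical and the sample text is made up for illustration.

```python
def res_total(interrupts_text):
    """Sum the per-CPU counts on the RES line of /proc/interrupts."""
    for line in interrupts_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "RES:":
            # Numeric fields are per-CPU counts; the trailing words are the label.
            return sum(int(f) for f in fields[1:] if f.isdigit())
    raise ValueError("no RES line found")


def res_per_second(total_before, total_after, seconds):
    """Average rescheduling interrupts per second between two readings."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (total_after - total_before) / seconds


# Illustrative two-CPU sample, not taken from the machine in the question.
sample = """\
           CPU0       CPU1
  0:         45          0   IO-APIC-edge      timer
RES:        100        250   Rescheduling interrupts
CAL:         10         20   Function call interrupts
"""

total = res_total(sample)
```

On a real host the text would come from reading `/proc/interrupts` twice, a known number of seconds apart, and passing both totals to `res_per_second`.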

These interrupts occur whenever a CPU is busy, another task (including IRQ processing in IRQ threads) placed in the same scheduler queue could also run on that CPU, and the scheduler manages to find an idle CPU to which the task can be migrated.

Therefore, a lower RES count means either fewer occurrences of more than one runnable task in a CPU's work queue, or fewer occasions on which the scheduler managed to find an idle CPU to migrate to.

Of course, the latter would tell you for certain that increasing the number of cpus would be beneficial to your workload.

In order to decide, I suggest you benchmark your system starting from some minimal workload (the minimal total number of running tasks for that workload), then progressively increase the load and watch the increase in RES (grand total).
When the curve RES per second = f(number of tasks) stops increasing significantly then…
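
One way to spot where that curve flattens, sketched under assumptions: measurements are collected as (number of tasks, RES per second) pairs, and "significantly" is read as a relative gain of at least 5%, which is an arbitrary cutoff chosen for illustration. The function name and the data are hypothetical.

```python
def plateau_point(samples, min_relative_gain=0.05):
    """Return the task count at which RES/s first grows by less than
    min_relative_gain over the previous step, or None if it never does.

    samples: list of (tasks, res_per_second) sorted by tasks.
    """
    for (_, prev_rate), (tasks, rate) in zip(samples, samples[1:]):
        if prev_rate > 0 and (rate - prev_rate) / prev_rate < min_relative_gain:
            return tasks
    return None


# Made-up benchmark results, for illustration only.
measurements = [(4, 1000), (8, 2100), (12, 3000), (16, 3050), (20, 3060)]
flat_from = plateau_point(measurements)
```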

MC68020