
I thought I understood CFS but...

The scheduler latency is the period within which every runnable process can expect to get a share of the processor. The Linux kernel code says:

(default: 6ms * (1 + ilog(ncpus)), units: nanoseconds)

Which for a 1-CPU machine gives 6 ms and for a 4-core machine 18 ms. I checked this on a one-core Rpi Zero and a four-core Rpi 4, and it does seem to be the case. However, the Pi 4 with four cores is a more powerful machine than the Zero, so surely you would expect the scheduler latency to be smaller, not three times bigger? Looking at sched_min_granularity, in both cases they handle up to 8 tasks before falling back to a fixed time slice, of 0.75 ms and 2.25 ms respectively. I am clearly cross-wired on this ...

Mike James

1 Answer


surely you would expect the schedule latency to be smaller not three times bigger

And that is the reason why it gets "corrected" with the tunable scaling modes _NONE, _LOG, or _LINEAR (SCHED_TUNABLESCALING_*).

The concept of SMP also works when you look at it as splitting a CPU, not adding one. Then you don't gain overall performance, but you still get better responsiveness.

This short function ("period") uses both min_granularity and latency. I reformatted it a bit. I don't think you need to know any C to understand it - there is even an unlikely hint:

static u64 __sched_period(unsigned long nr_running)
{
    if (unlikely(nr_running > sched_nr_latency))
        return nr_running * sysctl_sched_min_granularity;
    else
        return sysctl_sched_latency;
}

In the end it is more about the word than the thing - Wikipedia on CFS:

...the atomic units by which an individual process' share of the CPU was allocated (thus making redundant the previous notion of timeslices)


That "redundant" notion of a timeslice still shows up in kernel/sched/fair.c:

 * (to see the precise effective timeslice length of your workload,
 *  run vmstat and monitor the context-switches (cs) field)

The values 6 ms, 0.75 ms (= 6 ms / 8) and 24 ms (= _LOG-corrected 6 ms for ncpus = 8) can IMHO be interpreted as periods, i.e. timeslices. If you convert them to hertz, they match the Kconfig.hz ranges, which go from 100 HZ (server) to 1000 HZ (high responsiveness).

1/.00075 s
1333.3 Hz

More than a thousand min-granularity "slices" fit in a second.

1/.006 s
166.6 Hz

166 uncorrected latency "slices" lie between the 100 HZ "server" and the 250 HZ "compromise" settings.

1/.024 s
41.6 Hz

With log-correction for 8 cores, each one can reduce its context switching by a factor of 4, and still the "effective latency" stays low.


Compare it to a barber shop, where you want to guarantee that no new customer has to wait longer than 10 minutes. This means you have to preempt the current customer in the seat every 10 minutes, at least for the time it takes to say hello.

A shop with four seats and four barbers can relax that 10-minute slice. With four barbers each working in a cabin, each one only has to stop and peek every 40 minutes, and on average a newly entered customer will still wait only 10 minutes, as before.

That would be the full, "linear" correction of latency: multiply by N.

But in the worst case, all four check for new customers at the same time - because they started simultaneously. If a customer enters one minute after that, he might have to wait 39 minutes before he gets served.

So as a compromise you multiply not by N, but by log(N).

1 + ilog(N)

This gives 1 + ilog(4) = 1 + 2, so the 4 barbers can extend their slice from 10 to 30 minutes (instead of 40). Together they still achieve a 10-minute latency.

Quadruple to 16 barbers and the slice extends only to 50 minutes. The "correction" is logarithmic and has this + 1.

  • I have been trying to understand your answer for some days and I fall at the first point. What is log-correction? I have googled and read the code and I still don't understand what it does or why it is involved at all. Perhaps I'm searching for the wrong things and perhaps there is a document that I'm missing, but I'm still confused. – Mike James May 11 '20 at 05:32