Host CPU does not scale frequency when KVM guest needs it

Question

Observation:
I have an HP server with an AMD dual-core CPU (Turion II Neo N40L) which can scale its frequency from 800 to 1500 MHz. Frequency scaling works under FreeBSD 9 and under Ubuntu 12.04 with the Linux kernel 3.5. However, when I run FreeBSD 9 as a KVM guest on top of Ubuntu, frequency scaling does not work. The guest (FreeBSD) does not detect the minimum and maximum frequencies and thus does not scale anything when CPU usage gets higher. On the host (Ubuntu) the KVM process uses between 80 and 140 % of the CPU, but no frequency scaling happens: the frequency stays at 800 MHz, although when I run any other process on the same Ubuntu box, the ondemand governor quickly scales the frequency to 1500 MHz!

Concern and question:
I don't understand how the CPU is virtualised, or whether it is up to the guest to perform the scaling. Does it require some CPU features to be exposed to the guest for this to work?

Appendix:
The following Red Hat release note tends to suggest that frequency scaling ought to work even in a virtualised environment (see chapters 6.2.2 and 6.2.3), though the note fails to say which virtualisation technology this works with (KVM, Xen, etc.).

For information, the cpufreq-info output on Ubuntu is:

$ cpufreq-info
cpufrequtils 007: cpufreq-info (C) Dominik Brodowski 2004-2009
Report errors and bugs to [email protected], please.
analyzing CPU 0:
  driver: powernow-k8
  CPUs which run at the same hardware frequency: 0
  CPUs which need to have their frequency coordinated by software: 0
  maximum transition latency: 8.0 us.
  hardware limits: 800 MHz - 1.50 GHz
  available frequency steps: 1.50 GHz, 1.30 GHz, 1000 MHz, 800 MHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 800 MHz and 1.50 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 800 MHz.
  cpufreq stats: 1.50 GHz:14.79%, 1.30 GHz:1.07%, 1000 MHz:0.71%, 800 MHz:83.43%  (277433)
analyzing CPU 1:
  driver: powernow-k8
  CPUs which run at the same hardware frequency: 1
  CPUs which need to have their frequency coordinated by software: 1
  maximum transition latency: 8.0 us.
  hardware limits: 800 MHz - 1.50 GHz
  available frequency steps: 1.50 GHz, 1.30 GHz, 1000 MHz, 800 MHz
  available cpufreq governors: conservative, ondemand, userspace, powersave, performance
  current policy: frequency should be within 800 MHz and 1.50 GHz.
                  The governor "ondemand" may decide which speed to use
                  within this range.
  current CPU frequency is 800 MHz.
  cpufreq stats: 1.50 GHz:14.56%, 1.30 GHz:1.06%, 1000 MHz:0.79%, 800 MHz:83.59%  (384089)

The reasons I want this feature to work are: saving energy, running quieter (less heat), and also simple curiosity to understand why this is not working and how to make it work.

That Microserver only supports Windows and RHEL, see the quickspecs. — , Feb 08 '13 at 13:47
run `cpufreq-info` on the host OS, it will probably complain that there's no driver available. — Chris S, Feb 08 '13 at 13:55
It officially supports Windows and RHEL. It does not mean that other OSes won't run on top of it. Note that CPU scaling works perfectly on Ubuntu and FreeBSD when they are installed on bare metal (so not through virtualisation). In addition, when installed on bare metal both OSes work perfectly, with no missing driver or weird behaviour. Finally, `cpufreq-info` does not complain and outputs proper information, so the CPU is fully supported (in a way, of course!). The driver used is powernow-k8, which is also logical. — Huygens, Feb 08 '13 at 14:06
@ChrisS I have added the cpufreq-info information to the original question. — Huygens, Feb 08 '13 at 18:36
If you don't really need frequency scaling, you can always disable it. — Michael Hampton, Feb 08 '13 at 18:44
@MichaelHampton as I am doing some benchmarking to properly tune my kvm install, I set the governor to performance, which indeed disables scaling. However, I obviously want this feature to work or I would not have asked the question ;-) — Huygens, Feb 08 '13 at 21:05

score 10 · Accepted Answer · edited Apr 13 '17 at 12:36

I have found the solution thanks to the tip given by Nils and a nice article.

Tuning the ondemand CPU DVFS governor

The ondemand governor has a set of parameters that control when it triggers dynamic frequency scaling (or DVFS, for dynamic voltage and frequency scaling). Those parameters are located under the sysfs tree: /sys/devices/system/cpu/cpufreq/ondemand/

One of these parameters is up_threshold which, as the name suggests, is a threshold (the unit is % of CPU; I haven't found out whether this is per core or across all cores) above which the ondemand governor kicks in and starts changing the frequency dynamically.

To change it to 50% (for example) using sudo is simple:
sudo bash -c "echo 50 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold"

If you are root, an even simpler command is possible:
echo 50 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold

Note: those changes will be lost after the next host reboot. You should add them to a script that is run during boot, like /etc/rc.local on Ubuntu.
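A minimal sketch of what such a boot-time snippet could look like, assuming the ondemand sysfs directory shown above. The function name `set_up_threshold` and the `ONDEMAND_DIR` variable are made up for this illustration and are not part of any tool; the accepted range of 1 to 100 is also an assumption, and the kernel may enforce a narrower one.

```shell
#!/bin/sh
# Hypothetical helper: write an up_threshold value for the ondemand governor.
# ONDEMAND_DIR defaults to the sysfs path from the answer; override it to test.
ONDEMAND_DIR="${ONDEMAND_DIR:-/sys/devices/system/cpu/cpufreq/ondemand}"

set_up_threshold() {
    value="$1"
    file="$ONDEMAND_DIR/up_threshold"
    # Accept only whole numbers (percent of CPU).
    case "$value" in
        ''|*[!0-9]*) echo "invalid threshold: $value" >&2; return 1 ;;
    esac
    # Assumed sanity range; the kernel itself may reject some of these values.
    if [ "$value" -lt 1 ] || [ "$value" -gt 100 ]; then
        echo "threshold out of range: $value" >&2
        return 1
    fi
    # The file is only present while the ondemand governor is in use.
    if [ ! -w "$file" ]; then
        echo "cannot write $file (is ondemand active, are you root?)" >&2
        return 1
    fi
    echo "$value" > "$file"
}

# Example call from a boot script run as root:
# set_up_threshold 50
```

Checking that the file is writable first avoids a confusing shell error when another governor (for example performance) is selected at boot and the ondemand directory does not exist yet.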

I found out that my guest VM, although consuming a lot of CPU (80-140%) on the host, was distributing the load across both cores, so no single core was above 95%; thus the CPU, to my exasperation, was staying at 800 MHz. Now, with the above change, the CPU changes its frequency per core much sooner, which suits my needs better. 50% seems a better threshold for my guest usage; your mileage may vary.
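To see how a load like this is spread over the cores, one option is to compare two snapshots of /proc/stat. The sketch below is only an illustration: the function name `per_core_busy` is made up, it counts idle and iowait as idle time, and an average over a few seconds only approximates what the governor sees at its own sampling interval.

```shell
#!/bin/sh
# Hypothetical helper: print the busy percentage of each core between two
# snapshots of /proc/stat ($1 = earlier snapshot, $2 = later snapshot).
per_core_busy() {
    awk '
        # "cpu0", "cpu1", ... are per-core lines; the bare "cpu" line is the sum.
        $1 ~ /^cpu[0-9]+$/ {
            total = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9   # user .. steal
            idle  = $5 + $6                                   # idle + iowait
            if (NR == FNR) {          # first file: remember the baseline
                t0[$1] = total
                i0[$1] = idle
            } else {                  # second file: report the difference
                dt = total - t0[$1]
                di = idle - i0[$1]
                if (dt > 0) printf "%s %.0f\n", $1, 100 * (dt - di) / dt
            }
        }
    ' "$1" "$2"
}

# Example use on the host:
# cat /proc/stat > /tmp/stat.a; sleep 5; cat /proc/stat > /tmp/stat.b
# per_core_busy /tmp/stat.a /tmp/stat.b
```

A process shown at 130% in top can correspond to something like 70% on one core and 60% on the other, in which case neither core crosses a 95% threshold.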

Optionally, verify if you are using HPET

It is possible that some applications which implement timers incorrectly might be affected by DVFS. This can be a problem on the host and/or in the guest, though the host can use some convoluted algorithm to try to minimise it. However, modern CPUs have newer TSCs (Time Stamp Counters) which are independent of the current CPU/core frequency. These are: constant (constant_tsc), invariant (invariant_tsc) or non-stop (nonstop_tsc); see this Chromium article about TSC resynchronisation for more information on each. So if your CPU is equipped with one of these TSCs, you don't need to force HPET. To verify whether your host CPU supports them, use a command like the following (change the grep pattern to the corresponding CPU feature; here we test for the constant TSC):

$ grep constant_tsc /proc/cpuinfo
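The same check can be run for all three feature names in one go. This is a small sketch, with a made-up function name `tsc_flags`; it simply looks for the names used above in the first `flags` line, so a "no" may only mean that your kernel does not report a feature under that exact name.

```shell
#!/bin/sh
# Hypothetical helper: report which of the TSC-related names from the text
# appear in the CPU flags. $1 defaults to /proc/cpuinfo.
tsc_flags() {
    cpuinfo="${1:-/proc/cpuinfo}"
    for flag in constant_tsc nonstop_tsc invariant_tsc; do
        # Only the first "flags" line is inspected; -w matches whole words.
        if grep -m1 '^flags' "$cpuinfo" | grep -qw "$flag"; then
            echo "$flag: yes"
        else
            echo "$flag: no"
        fi
    done
}

# Example use on the host:
# tsc_flags
```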

If you do not have one of these modern TSCs, you should either:

Activate HPET, as described below;
Not use CPU DVFS if you have any applications in the VM that rely on precise timing, which is what Red Hat recommends.

A safe solution is to enable HPET timers (see below for more details). They are slower to query than TSC ones (the TSC is in the CPU, whereas the HPET is on the motherboard) and perhaps not as precise (HPET >10 MHz; TSC often runs at the maximum CPU clock), but they are much more reliable, especially in a DVFS configuration where each core could have a different frequency. Linux is clever enough to use the best available timer: it relies on the TSC first, but if it finds it too unreliable, it uses the HPET instead. This works well on host (bare metal) systems, but because the hypervisor does not export all the information properly, it is harder for the guest VM to detect a badly behaving TSC. The trick is then to force the guest to use HPET, although you need the hypervisor to make this clock source available to the guests!

Below you can find how to configure and/or enable HPET on Linux and FreeBSD.

Linux HPET configuration

HPET, or high-precision event timer, is a hardware timer that you can find in most commodity PCs since 2005. This timer can be used efficiently by modern OSes (the Linux kernel has supported it since 2.6; FreeBSD introduced it in 6.3, with stable support since the latest 9.x) to provide consistent timing independently of CPU power management. It also makes tick-less scheduler implementations easier to build.

Basically, HPET acts as a safety barrier: even if the host has DVFS active, host and guest timing events will be less affected.

There is a good article from IBM about enabling HPET; it explains how to verify which hardware timer your kernel is using and which ones are available. Here is a brief summary:

Checking the available hardware timer(s):
cat /sys/devices/system/clocksource/clocksource0/available_clocksource

Checking the current active timer:
cat /sys/devices/system/clocksource/clocksource0/current_clocksource

A simpler way to force usage of HPET, if you have it available, is to modify your boot loader configuration to enable it (since kernel 2.6.16). This configuration is distribution dependent, so please refer to your distribution's documentation to set it properly. You should add hpet=enable or clocksource=hpet to the kernel boot line (this again depends on the kernel version or distribution; I did not find any consistent information).
This makes sure that the guest is using the HPET timer.

Note: on my kernel 3.5, Linux seems to pick up the hpet timer automatically.
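The clock source can also be switched at runtime by writing to current_clocksource, which is handy for trying HPET before touching the boot loader. The sketch below assumes the sysfs directory used in the two checks above; the function name `use_hpet_if_available` and the `CLOCKSOURCE_DIR` variable are made up for the example, and the change does not survive a reboot.

```shell
#!/bin/sh
# Hypothetical helper: switch to the hpet clock source if the kernel offers it.
# CLOCKSOURCE_DIR defaults to the sysfs path from the text; override it to test.
CLOCKSOURCE_DIR="${CLOCKSOURCE_DIR:-/sys/devices/system/clocksource/clocksource0}"

use_hpet_if_available() {
    available="$CLOCKSOURCE_DIR/available_clocksource"
    current="$CLOCKSOURCE_DIR/current_clocksource"
    if grep -qw hpet "$available"; then
        # Needs root; takes effect immediately but is lost at the next boot.
        echo hpet > "$current"
        echo "now using: $(cat "$current")"
    else
        echo "hpet not offered; keeping: $(cat "$current")"
        return 1
    fi
}

# Example use as root, on the host or in a Linux guest:
# use_hpet_if_available
```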

FreeBSD guest HPET configuration

On FreeBSD one can check which timers are available by running:
sysctl kern.timecounter.choice

The currently chosen timer can be verified with:
sysctl kern.timecounter.hardware

FreeBSD 9.1 seems to automatically prefer HPET over other timer providers.

Todo: how to force HPET on FreeBSD.
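One possible way to address this, untested on the setup described here: kern.timecounter.hardware is a writable sysctl, so setting it to HPET selects that timecounter when it is listed in kern.timecounter.choice. The helper name `has_hpet_timecounter` below is made up, and the sample output in the comment is illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: succeed if HPET is listed in the output of
# `sysctl kern.timecounter.choice`, passed in as $1. Illustrative sample:
#   kern.timecounter.choice: TSC-low(1000) ACPI-fast(900) HPET(950) i8254(0)
has_hpet_timecounter() {
    case " $1 " in
        *" HPET("*) return 0 ;;
        *)          return 1 ;;
    esac
}

# On the FreeBSD guest, as root (assumes HPET shows up in the list):
# if has_hpet_timecounter "$(sysctl kern.timecounter.choice)"; then
#     sysctl kern.timecounter.hardware=HPET
# fi
```

To keep the setting across reboots, the line `kern.timecounter.hardware=HPET` can be added to /etc/sysctl.conf in the guest.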

Hypervisor HPET export

KVM seems to export HPET automatically when the host supports it. However, Linux guests will prefer the other automatically exported clock, kvm-clock (a paravirtualised version of the host TSC). Some people report trouble with the preferred clock; your mileage may vary. If you want to force HPET in the guest, refer to the above section.

VirtualBox does not export the HPET clock to the guest by default, and there is no option to do so in the GUI. You need to use the command line and make sure the VM is powered off. The command is:

./VBoxManage modifyvm "VM NAME" --hpet on

If the guest keeps selecting a source other than HPET after the above change, please refer to the above section on how to force the kernel to use the HPET clock as a source.

is there a real application for this, or is it just a one-off trick? — ewwhite, Feb 10 '13 at 11:52
@ewwhite what do you mean by a one-off trick? The finding is that DVFS (dynamic voltage and frequency scaling) actually works with KVM and a Linux host. The process CPU utilisation of 80-140% was probably distributed evenly across both cores, so no single core was reaching the 95% default threshold which would lead to frequency scaling. Without changing anything, if I create a single-threaded process which uses 100% of one core in the VM, then frequency scaling kicks in, so I was just not seeing it. As for the real application of DVFS, it is about saving power and lowering the temperature. — Huygens, Feb 10 '13 at 16:37
@ewwhite Do you mean "is there an application that tunes this value for me?" I think the answer is no. Otherwise someone would have put in a _sensible_ default already. 95 is definitely not sensible here. — Michael Hampton, Feb 10 '13 at 18:12
No, is there an application (reason) to want to use this in a virtualized setup? — ewwhite, Feb 10 '13 at 18:15
Reason: I don't want the CPU to run at full speed for several reasons: power consumption is higher, temperature is higher, wear is faster, fans spin faster (more power and faster wear), and it requires a bigger UPS battery. — Huygens, Feb 10 '13 at 20:11

score 4 · Answer 2 · answered Feb 09 '13 at 20:51

It is not the guest that triggers the upscale; the host must do this. So you have to lower the corresponding trigger level on the host.

Nils

Interesting, would you happen to know how to do this? – Huygens Feb 09 '13 at 21:02
@Huygens Normally this is done via some sort of cpufrequency-daemon. There is a config-file for that daemon where you can change its behaviour and up/downscale values. Not sure where exactly this is located at Ubuntu. – Nils Feb 09 '13 at 21:13
You solved it: by default (on Ubuntu at least) the threshold is 95%, though I'm not sure if it is per CPU. By lowering it to 50% I get the expected behaviour! On Ubuntu you would do it like this: `sudo bash -c "echo 50 > /sys/devices/system/cpu/cpufreq/ondemand/up_threshold"` Source: http://www.ivanlam.info/blog/2012/04/26/kvm-virtual-machine-low-network-throughput/ – Huygens Feb 09 '13 at 21:36
@Huygens I had this problem on CentOS - there the config-file for `cpuspeed` is located at **/etc/sysconfig/cpuspeed** to make such a change permanent. In my case I had a VBox-VM with just one CPU (physically dual-core). I had to lower the level to 45% to get the upscale-effect in the VM. – Nils Feb 17 '13 at 20:45

score 2 · Answer 3 · answered Feb 09 '13 at 05:31

On the host, a KVM CPU looks like a process. The scaling mechanism doesn't watch processes, only the overall CPU consumption.

Also, it is generally best practice to disable CPU scaling/throttling/etc. when running VMs.

dyasny

Weirdly, when I run top on the host I can see that the overall CPU consumption is about 80-130 % (all consumed by the kvm and ksm processes, by the way), but there is no frequency scaling. When I run other processes which consume CPU, the ondemand governor quickly kicks in! The only difference I can think of is that the kvm process is using some virtualisation technology (AMD SVM in my case) which could keep the host governor from reacting. And the guest does not manage to request the underlying hardware to scale, I guess, though on bare metal it worked. – Huygens Feb 09 '13 at 08:58
Could you refer to an article detailing why frequency scaling is not a best practice when running VMs? I am curious to understand why. Red Hat seems to support this, see chapter 6.2.4 (there was a problem in earlier release than RHEL 5.3) https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/5/html/5.3_Release_Notes/sect-Release_Notes-Virtualization.html – Huygens Feb 09 '13 at 08:59
I don't have any articles handy, but think about what virtualization is for, and how it works. You are planning to utilize an underutilized machine by loading it with VMs. The VMs should be stable and predictable. Do you think CPU frequency adjustments underneath the VMs will help with that? And talking about best practice, Ubuntu as a virt host is not a good idea in my experience – dyasny Feb 09 '13 at 15:55
You can migrate VMs between different hosts, sometimes differing in frequency. The thing that should be stable in this case is the set of features exposed by the host CPU, so that if SSE4 is exposed and your application makes use of it, your VM does not crash once you migrate to a CPU that does not support it. So frequency scaling, which is a big factor in reducing power consumption, should not be a problem. I googled it and did not find any article mentioning that. – Huygens Feb 09 '13 at 16:23
Using Ubuntu, RHEL, Fedora or you-name-it as a host is not the problem. What would differ is the kernel, kvm and ACPI versions provided by each of these vendors. Maybe Ubuntu does not have the latest and greatest, but if frequency scaling was not supported on older versions of kvm, I want to know which ones and check whether my installation has that version or a newer one. – Huygens Feb 09 '13 at 16:32
Nothing to do with live migration actually. The CPU flag set is easily brought to a common denominator by libvirt's CPU model abstraction, but have you actually tried to migrate a production VM between two drastically different CPUs? In quite a few applications, alternating frequencies can cause problems. And with VMs the problem can only grow, because there is no mechanism that would provide a feedback to the CPU controller. – dyasny Feb 09 '13 at 20:09
As for the distribution, I mentioned Ubuntu as problematic because it is. Both on SF and other sites, I keep seeing people reporting KVM related issues that I never manage to reproduce on Fedora or RHEL. Please feel free to disagree, I am not continuing into a flamewar here. – dyasny Feb 09 '13 at 20:10
I do not expect any issue with the type of application I am going to run (mostly NAS-like). I also remember a time when I was using Linux and had a physical switch on my PC to change the CPU frequency from 66 to 33 MHz (486 DX2 66). Linux never found it problematic! And that switch does exactly the same thing as the host doing frequency scaling for the guest. There are plenty of papers on frequency scaling and virtualization returned by Google. It is a thing that works; it just does not work on my setup and I want to find out why. – Huygens Feb 09 '13 at 20:32
Hi @dyasny, following your advice on the risk of CPU DVFS, I googled a bit further than I intended. It seems that since around 2005, commodity PCs have a new timer, HPET. It has been supported in several OSes since then and is often used instead of the older RTC to provide coherent timing independent of CPU power management. This also allows tickless systems to handle timing events. Linux since kernel 2.6 and FreeBSD since version 9.x (at least) support HPET: https://wiki.freebsd.org/TuningPowerConsumption My host uses a Linux kernel 3.5 and my guest uses FreeBSD 9.1, so I should be safe. Thx anyway – Huygens Feb 11 '13 at 08:28