What's a CPU to do when it has nothing to do?

The following subscription-only content has been made available to you by an LWN subscriber. Thousands of subscribers depend on LWN for the best news from the Linux and free software communities. If you enjoy this article, please consider accepting the trial offer on the right. Thank you for visiting LWN.net!

It would be reasonable to expect doing nothing to be an easy, simple task for a kernel, but it isn't. At Kernel Recipes 2018, Rafael Wysocki discussed what CPUs do when they don't have anything to do, how the kernel handles this, problems inherent in the current strategy, and how his recent rework of the kernel's idle loop has improved power consumption on systems that aren't doing anything.

The idle loop, one of the kernel subsystems that Wysocki maintains, controls what a CPU does when it has no processes to run. Precise to a fault, Wysocki defined his terms: for the purposes of this discussion, a CPU is an entity that can take instructions from memory and execute them at the same time as any other entities in the same system are doing likewise. On a simple, single-core single-processor system, that core is the CPU. If the processor has multiple cores, each of those cores is a CPU. If each of those cores exposes multiple interfaces for simultaneous instruction execution, which Intel calls "hyperthreading", then each of those threads is a CPU. [Rafael Wysocki]

A CPU is idle if there are no tasks for it to run. Or, again more precisely, the Linux kernel has a number of internal scheduling classes, including the special idle class. If there are no tasks to run on a given CPU in any of those classes save the idle class, the CPU is regarded as idle. If the hardware doesn't make allowance for this, then the CPU will have to run useless instructions until it is needed for real work. However, this is a wildly inefficient use of electricity, so most CPUs support a number of lower-power states into which the kernel can put them until they are needed to do useful work.

Idle states are not free to enter or exit. Entry and exit both require some time, and moreover power consumption briefly rises slightly above normal for the current state on entry to idle and above normal for the destination state on exit from idle. Although increasingly deep idle states consume decreasing amounts of power, they have increasingly large costs to enter and exit. This implies that for short idle periods, a fairly shallow idle state is the best use of system resources; for longer idle periods, the costs of a deeper idle state will be justified by the increased power savings while idle. It is therefore in the kernel's best interests to predict how long a CPU will be idle before deciding how deeply to idle it. This is the job of the idle loop.

In this loop, the CPU scheduler notices that a CPU is idle because it has no work for the CPU to do. The scheduler then calls the governor, which does its best to predict the appropriate idle state to enter. There are currently two governors in the kernel, called "menu" and "ladder". They are used in different cases, but they both try to do roughly the same thing: keep track of system state when a CPU idles and how long it ended up idling for. This is done in order to predict how long a freshly-idle CPU is likely to remain so, and thus what idle state is most appropriate for it.

This job is made particularly difficult by the CPU scheduler's clock tick. This is a timer that is run by the CPU scheduler for the purpose of time-sharing the CPU: if you are going to run multiple jobs on a single CPU, each job can only be run for a while, then periodically put aside in favor of another job. This tick doesn't need to run on a CPU that is idle, since there are no jobs between which the CPU should be shared. Moreover, if the tick is allowed to run on an otherwise-idle CPU, it will prevent the governor from selecting deep idle states by limiting the time for which the CPU is likely to remain idle. So in kernels 4.16 and older, the scheduler disables the tick before calling the governor. When the CPU is woken by an interrupt, the scheduler makes a decision about whether there's work to do and, if so, reactivates the tick.

If the governor predicts a long idle, and the idle period turns out to be long, the governor "wins": the CPU will enter a deep idle state and power will be saved. But if the governor predicts long idle and the period turns out to be short, the governor "loses" because the costs of entering a deep idle state are not repaid by power savings over the short idle period. Worse, if the governor predicts a short idle period, it loses regardless of the actual idle duration: if the actual duration is long, potential power savings have been missed out on, and if it's short, the costs of stopping and restarting the tick have been paid needlessly. Or to put it another way, because stopping and starting the tick have a cost, there is no point in stopping the tick if the governor is going to predict a short idle.

Wysocki considered trying to redesign the governor to work around this, but concluded that the essential problem is that the tick is stopped before the governor is invoked, thus before the recommended idle state is known. He therefore reworked the idle loop for kernel 4.17 so that the decision about stopping the tick is taken after the governor has made its recommendation of the idle state. If the recommendation is for a long idle, the tick is stopped so as not to wake the CPU prematurely. If the recommendation is for a short idle, the tick is left on to avoid paying the cost of turning it off. That means the tick is also a safety net that will wake the CPU in the event that the idle turns out to be longer than predicted and give the governor another chance to get it right.

When the idled CPU is woken by an interrupt, whether from the tick that was left running or by some other event, the scheduler immediately makes a decision about whether there's work to do. If there is, the tick is restarted if need be; but if there is not, the governor is immediately re-invoked. Since that means the governor can now be invoked both when the tick is running and when it is stopped, the governor had to be reworked to take this into account.

Re-examining the win/loss table from earlier, Wysocki expects things to be improved by this rework. If long idle is predicted, the tick is still stopped, so nothing changes; we win if the actual idle is long, and lose if it's short. But if short idle is predicted, we're better off: if the actual idle is short, we've saved the cost of stopping and restarting the tick, and if the actual idle is long, the unstopped timer will wake us up and give us another bite at the prediction cherry.

[Comparison graph]

Since game theory is no substitute for real-world data, Wysocki tested this on a number of systems. The graph above is characteristic of all the systems tested and shows power consumption against time on a system that is idle. The green line is with the old idle loop, the red is with the new: power consumption is less under the new scheme, and moreover it is much more predictable than before. Not all CPUs tested showed as large a gap between the green and red lines, but all showed a flat red line beneath a bumpy green one. As Wysocki put it, this new scheme predicts short idles less often than the old scheme did, but it is right about them being short more often.

In response to a question from the audience, Wysocki said that the work is architecture-independent. Intel CPUs will benefit from it particularly, because they have a comparatively large array of idle states from which the governor may select, giving the governor the best chance of doing well if it predicts correctly; but ARM CPUs, for example, will also benefit.

[CPU use]

A 20% drop in idle power consumption may seem small as victories go, but it's not. Any system that wants to be able to cope reasonably well with peak loads will need spare capacity in normal operation, which will manifest as idle time. The graph above shows CPU usage on my mail/talk/file-transfer/VPN/NTP/etc. server over the past year; the bright yellow is idle time. Saving 20% of that power will please my co-location provider very much indeed, and it's good for the planet, too.

[We would like to thank LWN's travel sponsor, The Linux Foundation, for assistance with travel funding for Kernel Recipes.]

Did you like this article? Please accept our trial subscription offer to be able to see more content like it and to participate in the discussion.

(Log in to post comments)