Silicon Power Management Features


This page discusses some of the concepts behind many of the power savings projects on this site.


CPU Power States

There are three kinds of CPU power states:
  • C-states, a set of idle states
  • P-states, performance states, which allow you to scale the frequency and voltage of your CPU
  • T-states, thermal states that allow the system to respond to emergency thermal conditions

You can adjust these states to regulate your system’s power use to more accurately reflect your usage needs, and save power where possible.

C-states

The basic idea of "C-states" is that a microprocessor, when it's not executing instructions, can save power in several ways (called "states"). Each of those ways has a different tradeoff in terms of power saving versus latency and performance. This is a rather abstract concept, so here’s an example taken from the Intel® Core™2 Duo datasheet:



C-state           Max Power Consumption
C0 (busy wait)    35 Watts (TDP)
C1                13.5 Watts
C2                12.9 Watts
C3                7.7 Watts
C4                1.2 Watts

Notes:

  • The higher the C-state number, the less power is consumed. Read more about the tradeoffs of these states in The Power-Performance Tradeoff section below.
  • While these power consumption numbers denote the maximum possible (or TDP) values, and not actual values as you will see them on your laptop, they give a reasonable indication of the differences in power reduction of the various C-states.

The Power-Performance Tradeoff

While power is reduced for higher-numbered (deeper) C-states, this power reduction comes at a price: the deeper the C-state, the longer it takes to leave the C-state, and the more energy the transition costs. For example, if your computer is going in and out of idle at a high frequency (say, every millisecond), and the kernel decided to go to the C3 or C4 state during the short idle periods, it could easily take the processor 185 microseconds to wake up, and in addition the wakeup would consume more energy than was saved during the very brief resting period. However, had the kernel decided to use the C2 state, the processor would come out of its nap in maybe 10 microseconds. While C2 in a steady state uses much more power than C4, the energy cost of going into or out of C2 is much lower, so for this scenario C2 would actually be more energy efficient.

Now, if your system is actually idle for longer periods of time, say 20 ms, then C4 becomes the clear winner: the one-time energy cost of the transition is dwarfed by the difference in energy savings, unless the exit latency cannot be tolerated. If you are doing high-precision audio recording, a 185 microsecond delay may well be more than you want to cope with, in which case the kernel should pick C2 again.
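The tradeoff can be made concrete with a small back-of-the-envelope model. The sketch below is only illustrative: the steady-state power figures come from the table above and the 10 and 185 microsecond exit latencies from the text, but the per-transition energy costs are hypothetical placeholders chosen for this example, not datasheet values. C3 is left out because the text gives no separate latency for it, and the model ignores the fact that the wakeup latency eats into the idle period.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CState:
    name: str
    power_w: float               # steady-state power while resident (from the table)
    exit_latency_us: float       # time needed to wake up (from the text)
    transition_energy_uj: float  # HYPOTHETICAL enter + exit cost, in microjoules


STATES = [
    CState("C2", 12.9, 10.0, 50.0),
    CState("C4", 1.2, 185.0, 15000.0),
]


def idle_energy_uj(state: CState, idle_us: float) -> float:
    """Energy spent during one idle period: watts * microseconds = microjoules."""
    return state.power_w * idle_us + state.transition_energy_uj


def pick_state(idle_us: float, max_latency_us: Optional[float] = None) -> Optional[CState]:
    """Pick the cheapest state whose exit latency is tolerable, or None if there is none."""
    candidates = [
        s for s in STATES
        if max_latency_us is None or s.exit_latency_us <= max_latency_us
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda s: idle_energy_uj(s, idle_us))


if __name__ == "__main__":
    for idle_us, limit in [(1000, None), (20000, None), (20000, 100)]:
        choice = pick_state(idle_us, limit)
        print(idle_us, limit, choice.name if choice else None)
```

With these placeholder costs the model reproduces the reasoning in the text: C2 wins for a 1 ms idle period, C4 wins for a 20 ms idle period, and C2 is chosen again once a latency limit below 185 microseconds is imposed.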

If you want to see what C-states your Linux laptop has, and what the exit latencies for each state are, issue the following command:

cat /proc/acpi/processor/*/power
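On kernels that expose the cpuidle sysfs interface, similar information is available under /sys/devices/system/cpu/cpu0/cpuidle/, with one stateN directory per idle state containing a name file and a latency file (exit latency in microseconds). The following is a minimal sketch that reads those files; the function name read_cstates is made up for this example, and the directory may be absent on a given system, in which case the function returns an empty list.

```python
import os


def read_cstates(base="/sys/devices/system/cpu/cpu0/cpuidle"):
    """Return a list of (name, exit_latency_us) tuples, one per idle state."""
    states = []
    if not os.path.isdir(base):
        return states  # cpuidle interface not present on this system

    # Keep only "state0", "state1", ... and sort them numerically.
    entries = [e for e in os.listdir(base)
               if e.startswith("state") and e[5:].isdigit()]
    for entry in sorted(entries, key=lambda e: int(e[5:])):
        try:
            with open(os.path.join(base, entry, "name")) as f:
                name = f.read().strip()
            with open(os.path.join(base, entry, "latency")) as f:
                latency_us = int(f.read().strip())
        except (OSError, ValueError):
            continue  # skip states that cannot be read or parsed
        states.append((name, latency_us))
    return states


if __name__ == "__main__":
    for name, latency_us in read_cstates():
        print(f"{name}: exit latency {latency_us} us")
```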

The Kernel’s Timer Tick Prevented Power Savings

Now that we've seen that to get good power savings from the processor, we want to be idle for at least 20 milliseconds, it's time for the kernel to ruin the party.

Until 2.6.21, the Linux kernel programmed the PIT chip of the PC (or its equivalent on other architectures) to generate interrupts at a regular rate of either 250 Hz (4 ms interval) or 1000 Hz (1 ms interval). This regular interrupt, often called the "timer tick", has several tasks:

  • Increment the "jiffies" variable. This global variable represents the kernel's internal notion of time and is used in many places all across the kernel.
  • Process accounting: each timer tick, the kernel looks at which process is running (if any) and increments the CPU usage counter of this process. This information (including the idle information) is used by programs such as "top" to display which programs are consuming CPU time.
  • Process scheduler time slicing: the scheduler gives each running process a so-called "time slice", an amount of time that it is allowed to run before other processes get their turn. The timer tick interrupts the process at the end of its time slice, and the scheduler then gives the CPU to another process at the end of the timer tick processing.
  • Deferred events (timers): many things in a kernel are of a "do this after 50 milliseconds" or "if nothing else happened in 3 seconds, call this function to recover" nature. In Linux kernel speak these are called "timers". Each timer tick, the kernel looks at the queue of outstanding future timer events to see if any have now become due (this sounds expensive, but the timer list is sorted so it's a cheap operation), and processes the ones whose turn it is.
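The four jobs listed above can be pictured with a toy model of a periodic tick handler. This is a deliberately simplified sketch, not kernel code: the class and method names (TickingKernel, add_timer, tick) are invented here, the scheduler step is reduced to a countdown, and a heap stands in for the kernel's sorted timer list.

```python
import heapq


class TickingKernel:
    TIMESLICE_TICKS = 10  # arbitrary time-slice length for this model

    def __init__(self):
        self.jiffies = 0
        self.cpu_usage = {}   # process name -> number of ticks charged
        self.current = None   # name of the running process, None when idle
        self.slice_left = self.TIMESLICE_TICKS
        self.reschedules = 0
        self._timers = []     # heap of (expiry_jiffies, sequence, callback)
        self._seq = 0         # tie-breaker so callbacks are never compared

    def add_timer(self, delay_ticks, callback):
        heapq.heappush(self._timers,
                       (self.jiffies + delay_ticks, self._seq, callback))
        self._seq += 1

    def tick(self):
        # 1. Advance the kernel's notion of time.
        self.jiffies += 1

        # 2. Process accounting: charge this tick to whoever is running.
        owner = self.current if self.current is not None else "idle"
        self.cpu_usage[owner] = self.cpu_usage.get(owner, 0) + 1

        # 3. Time slicing: count down the slice of the running process.
        if self.current is not None:
            self.slice_left -= 1
            if self.slice_left == 0:
                self.reschedules += 1  # a real scheduler would switch here
                self.slice_left = self.TIMESLICE_TICKS

        # 4. Deferred events: only the head of the sorted queue is inspected.
        while self._timers and self._timers[0][0] <= self.jiffies:
            _, _, callback = heapq.heappop(self._timers)
            callback()


if __name__ == "__main__":
    kernel = TickingKernel()
    kernel.current = "editor"
    kernel.add_timer(3, lambda: print("timer fired at jiffy", kernel.jiffies))
    for _ in range(10):
        kernel.tick()
    print(kernel.cpu_usage, kernel.reschedules)
```

Note that tick() runs on every interval whether or not there is anything to do, which is exactly the property that interferes with deep C-states.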

This timer tick approach has a certain elegance in its simplicity and has served Linux well since the early 1990s. However, in light of the C-state hardware capabilities, a regular 1 ms or 4 ms interrupt has the effect of frequently waking the CPU from the deep sleep states (or even preventing the CPU from ever entering the deepest sleep states), which makes the system consume more power than needed.

The Introduction of Tickless Idle into the Kernel

In the last year or so, Thomas Gleixner and others have worked on a feature called "tickless idle" or just "tickless": the removal of the regular timer tick when the system is idle. This feature has become part of the 2.6.21 kernel for the i386 architecture, and 2.6.23 will likely have the x86_64 version included as well.

Of the four functions the timer tick has, only two are really relevant when the CPU is idle:

  • updating the kernel's notion of what time it is (jiffies)
  • processing deferred events (timers)

When Linux was created, the PIT was more or less the only usable clock device on the PC; however, a modern PC has a wide range of clocks and clock-like devices. A fundamental step in the Linux kernel architecture in the last year was to split these devices into two categories:

  • Clock sources
  • Event sources

Clock Sources are devices that can answer the "what time is it right now" question, and the clock source infrastructure abstracts away the differences between the various devices towards the rest of the kernel. The source code for the clocksource infrastructure lives in the kernel/time/clocksource.c file.

Event Sources are those devices that can generate an interrupt after a software-specified amount of time. The event source layer abstracts these various devices for the rest of the kernel, and picks the best one for a task based on the various capabilities (in terms of precision, accuracy and maximum duration) of each of the devices in the system. The source code for the event source infrastructure lives in the kernel/time/clockevents.c file.

This separation and the two hardware abstraction layers make the exercise of getting rid of the timer tick during idle a relatively simple task:

  • Instead of incrementing jiffies by one every timer tick, the kernel sets jiffies to the value it should have, based on asking the clock source layer "what time is it?".
  • Instead of checking every millisecond whether any timer is due for processing, the kernel calculates when the first timer is due and asks the event source layer to deliver a single interrupt at exactly the right time.

In order to fulfill the other two tasks of the timer tick (accounting and time slicing), the timer tick is kept running when the CPU is not actually idle.
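The two changes to the idle path can be sketched as a toy idle loop. This is an illustrative model rather than kernel code: the names (TicklessIdle, idle_until_next_timer) are hypothetical, a plain millisecond counter plays the role of the clock source, and "programming the event source" is reduced to jumping that counter forward to the earliest timer expiry.

```python
import heapq

HZ = 1000  # model assumption: one jiffy per millisecond


class TicklessIdle:
    def __init__(self):
        self.clock_ms = 0   # stand-in for a clock source ("what time is it?")
        self.jiffies = 0
        self.wakeups = 0
        self._timers = []   # heap of (expiry_ms, sequence, callback)
        self._seq = 0

    def add_timer(self, delay_ms, callback):
        heapq.heappush(self._timers,
                       (self.clock_ms + delay_ms, self._seq, callback))
        self._seq += 1

    def _update_jiffies(self):
        # Catch up in one step from the clock source instead of +1 per tick.
        self.jiffies = self.clock_ms * HZ // 1000

    def idle_until_next_timer(self):
        """Sleep until the earliest timer; return the sleep length in ms,
        or None if no timer is pending (the CPU could sleep indefinitely)."""
        if not self._timers:
            return None
        expiry_ms = self._timers[0][0]
        slept_ms = expiry_ms - self.clock_ms
        self.clock_ms = expiry_ms  # one one-shot interrupt from the event source
        self.wakeups += 1
        self._update_jiffies()
        while self._timers and self._timers[0][0] <= self.clock_ms:
            _, _, callback = heapq.heappop(self._timers)
            callback()
        return slept_ms


if __name__ == "__main__":
    kernel = TicklessIdle()
    kernel.add_timer(50, lambda: print("50 ms timer"))
    kernel.add_timer(3000, lambda: print("3 s timer"))
    while kernel.idle_until_next_timer() is not None:
        pass
    print("wakeups:", kernel.wakeups, "jiffies:", kernel.jiffies)
```

In this model the CPU is woken twice over three seconds, where a periodic 1000 Hz tick would have woken it 3000 times over the same span.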

As with most simple things, the devil is in the details. While modern PCs have many clock-like components, the reliability of each of those components varies wildly by manufacturer, technology generation or even BIOS settings. This varying reliability is the primary answer to the "why did it take so long" question. Thankfully, for most systems, the so-called HPET component of the chipset is the most promising and reliable event source.

By not having a regular timer tick when the processor is idle, it is in principle possible to have really long idle periods, as long as no future timers are planned. Unfortunately, in practice on a current Linux distribution, both the kernel and userspace applications set so many timers that it is not uncommon to have 500 or more such events per second, so the system would only go from a 1 millisecond sleep time to approximately a 2 millisecond sleep time. This would obviously provide only a minimal power gain over the kernel with the regular tick.

There are roughly two areas where this needs fixing: the kernel side and the userspace side.

Fixing High-Frequency Events -- Kernel Side

On the kernel side, there are drivers and subsystems that have timers that are just "randomly short", and whose intervals can be increased without any noticeable effect on the system or the user. But doing that isn't quite enough: a second technique is to align as many timers as possible so that they fire at the same time. Many timers are of the "once every 2 seconds" type, but the user of the timer doesn't really care when *exactly* the timer fires. By making sure all timers of this class only fire at the start of a full second, the processor can (if no other timers exist) sleep for the rest of the second without being interrupted.

The API that the kernel has for this is called round_jiffies, and there are two primary functions that drivers and subsystems should use:

unsigned long round_jiffies(unsigned long time);
unsigned long round_jiffies_relative(unsigned long delta);

The round_jiffies() function, as the name suggests, rounds its argument such that all callers of round_jiffies() whose timers would fall within the same second end up at exactly the same moment. round_jiffies_relative() performs the same function, with one difference: round_jiffies() operates on absolute times (values of the "jiffies" variable), while round_jiffies_relative() operates on time deltas. This matches the diversity of kernel APIs for scheduling future work, some of which take an absolute time while others take a time delta.
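To see why rounding helps, the sketch below models the effect in plain Python. It is a simplified stand-in, not the kernel's implementation: it assumes HZ = 1000, always rounds up to the next whole second, and ignores any other details of the real functions. The names round_up_to_second, round_up_relative and distinct_wakeups are made up for this model.

```python
HZ = 1000  # model assumption: jiffies per second


def round_up_to_second(expiry):
    """Round an absolute expiry time (in jiffies) up to a whole second."""
    return -(-expiry // HZ) * HZ  # ceiling division, then back to jiffies


def round_up_relative(delta, now):
    """Relative variant: takes a delta from `now` and returns a rounded delta."""
    return round_up_to_second(now + delta) - now


def distinct_wakeups(expiries):
    """Timers that expire at the same jiffy share a single wakeup."""
    return len(set(expiries))


if __name__ == "__main__":
    now = 12345
    deltas = [1800, 2000, 2100, 2650]  # four unrelated "about 2 seconds" timers
    raw = [now + d for d in deltas]
    rounded = [round_up_to_second(e) for e in raw]
    print("unrounded wakeups:", distinct_wakeups(raw))
    print("rounded wakeups:  ", distinct_wakeups(rounded))
```

In this example all four timers land on the same second boundary, so the processor is woken once instead of four times.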

Fixing High-Frequency Events -- Userspace Side

Unfortunately, the userspace situation is a much bigger problem than the kernel side. Many desktop programs behave really badly and have a large number of timers that aren't really needed. The most frequent scenario is code that polls for something when it could simply receive an event if the programmer had bothered. In defense of the userspace programmers, until the kernel went tickless, this behavior not only did not affect the system negatively, it wasn't even possible to see what was happening.
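The difference between polling and waiting for an event can be illustrated with a small sketch using Python's threading.Event. The function names and the 10 ms polling interval are arbitrary choices for this example; the point is only that the polling loop wakes up repeatedly while the blocking wait is woken once.

```python
import threading
import time


def wait_by_polling(ready, interval_s=0.01):
    """Check a flag every interval_s seconds; return the number of wakeups."""
    wakeups = 0
    while True:
        wakeups += 1
        if ready.is_set():
            return wakeups
        time.sleep(interval_s)  # each sleep ends with a timer wakeup


def wait_for_event(ready):
    """Block until the flag is set; the waiting thread is woken once."""
    ready.wait()
    return 1


def run(waiter, delay_s=0.2):
    """Set the flag after delay_s seconds and report the waiter's wakeups."""
    ready = threading.Event()
    threading.Timer(delay_s, ready.set).start()
    return waiter(ready)


if __name__ == "__main__":
    print("polling wakeups:     ", run(wait_by_polling))
    print("event-driven wakeups:", run(wait_for_event))
```

The exact polling count depends on timing, but it grows with the length of the wait, whereas the event-driven version stays at one wakeup no matter how long it waits.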

Now, with the tickless kernel, the environment has changed. Programs that poll frequently are actively hurting power consumption, and they can be exposed much more easily as well. Linux PowerTOP is a tool that developers and users can use to see which pieces of software are taking the processor out of its idle sleep state. In addition, PowerTOP shows how well the system is doing in terms of using the deeper C-states, and it provides tuning suggestions for the system.

Even when all the "needlessly polling" behavior gets fixed, userspace programs may have housekeeping tasks that just have to happen once in a while. Similar to the round_jiffies() infrastructure in the kernel, there is an API in recent versions of the glib library that provides rounding and grouping timers: g_timeout_add_seconds().

