ARM Architecture Channel

Boosting energy efficiency – Energy debugging

Monday, April 4th, 2011 by Oyvind Janbu

Using an ultra-low-power microcontroller does not, alas, by itself mean that an embedded system designer will automatically arrive at the lowest possible energy consumption.  To achieve this, the important role of software also needs to be taken into account.  Code needs to be optimized, not just in terms of functionality but also with respect to energy efficiency.  Software has perhaps never really been formally identified as an 'energy drain', and it needs to be.  Every clock cycle and every line of code consumes energy, and this needs to be minimized if the best possible energy efficiency is to be achieved.

While the first two parts of this article proposed the fundamental ways in which microcontroller design needs to evolve in the pursuit of real energy efficiency, this third part considers how the tools that support them also need to change.  Having tools available that give developers detailed monitoring of their embedded system's energy consumption is becoming vital for many existing and emerging energy-sensitive, battery-backed applications.

As a development process proceeds, code size naturally increases and optimizing it for energy efficiency becomes a much harder and time-consuming task.  Without proper tools, identifying a basic code error such as a while loop that should have been replaced with an interrupt service routine can be difficult.  Such a simple code oversight causes a processor to stay active waiting for an event instead of entering an energy saving sleep mode – it therefore matters!  If these ‘energy bugs’ are not identified and corrected during the development phase then they’re virtually impossible to detect in field or burn-in tests.  Add together a good collection of such bugs and they will have an impact on total battery lifetime, perhaps critically so.
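As a minimal sketch of the difference, assume a Cortex-M-class device; the UART register, flag and handler names below are hypothetical, and __WFI() is the usual CMSIS wait-for-interrupt intrinsic.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device symbols, for illustration only. */
extern volatile uint32_t UART_STATUS;      /* assumed UART status register        */
#define UART_RX_READY (1u << 0)            /* assumed "byte received" flag        */
void __WFI(void);                          /* wait-for-interrupt, provided by CMSIS */
void handle_byte(void);

static volatile bool rx_ready = false;     /* set by the RX interrupt             */

/* Energy bug: the CPU stays in run mode, burning current while it polls. */
void wait_for_byte_polling(void)
{
    while ((UART_STATUS & UART_RX_READY) == 0) {
        /* busy-wait: every cycle spent here is wasted energy */
    }
    handle_byte();
}

/* Interrupt-driven alternative: sleep until the UART wakes the core. */
void UART_RX_IRQHandler(void)
{
    rx_ready = true;                       /* keep the ISR minimal */
}

void wait_for_byte_sleeping(void)
{
    while (!rx_ready) {
        __WFI();                           /* enter sleep; the UART interrupt wakes the CPU */
    }
    rx_ready = false;
    handle_byte();
}

Functionally the two versions behave the same, which is exactly why this kind of energy bug survives ordinary testing.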

In estimating potential battery lifetimes, embedded developers have been able to use spreadsheets provided by microcontroller vendors to get a reasonable estimation of application behavior in terms of current consumption.  Measurements of a hardware setup made by an oscilloscope or multimeter and entered into spreadsheets can be extrapolated to give a pretty close estimation of battery life expectancy.  This approach however does not provide any correlation between current consumption and code and the impact of any bugs – the application’s milliamp rating is OK, but what’s not known is whether it could be any better.
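The underlying spreadsheet arithmetic is straightforward duty-cycle math; a small, self-contained example with purely illustrative numbers (not measurements of any particular device) looks like this:

#include <stdio.h>

/* Illustrative duty-cycle battery-life estimate; all figures are examples. */
int main(void)
{
    const double battery_mAh = 230.0;    /* e.g. a coin cell's nominal capacity */
    const double i_active_mA = 3.0;      /* current while awake                 */
    const double i_sleep_mA  = 0.001;    /* 1 uA deep-sleep current             */
    const double t_active_ms = 5.0;      /* awake 5 ms per wake-up              */
    const double period_ms   = 1000.0;   /* one wake-up per second              */

    double duty     = t_active_ms / period_ms;
    double i_avg_mA = duty * i_active_mA + (1.0 - duty) * i_sleep_mA;
    double hours    = battery_mAh / i_avg_mA;

    printf("average current: %.4f mA, estimated life: %.0f hours (%.1f years)\n",
           i_avg_mA, hours, hours / (24.0 * 365.0));
    return 0;
}

The estimate is only as good as the current and timing figures fed into it, and it says nothing about whether those figures are hiding an energy bug.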

With a logic analyzer, a developer gets greater access to the behavior of the application and can begin to recognize that 'something strange' is going on.  It is, however, a 'code view' tool: it shows data as timing diagrams, protocol decodes, state machine traces, assembly language or its correlation with source-level software, but it offers no direct relationship with energy usage.

Combine the logic analyzer, the multimeter, and the spreadsheet and you do start to make a decent connection between energy usage and code, but the time and effort spent in setting up all the test equipment (and possibly repeating it identically on numerous occasions), making the measurements and recording them into spreadsheets can be prohibitive if not practically impossible.

Low-power processors such as the ARM Cortex-M3, however, already provide a Serial Wire Output (SWO) that offers quite sophisticated and flexible debug and monitoring capabilities, which tool suppliers can harness to enable embedded developers to correlate code execution directly with energy usage.

Simple development platforms can be created which continuously sample microcontroller power rail current consumption, convert it, and send it along with voltage and timing data via USB to a PC-based energy-to-code profiling tool.  Courtesy of the ARM core's SWO pin, the tool can also retrieve program counter information from the CPU.  Bringing these two streams of data together enables true 'energy debugging' to take place.

Provided the current measurements have a fairly high dynamic range, say from 0.1µA to 100mA, it is possible to monitor a very wide and practical range of microcontroller current consumption.  Once supplied with the microcontroller object code, the energy profiling tool has all the information it needs to accurately correlate energy consumption with code.
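Conceptually, the host-side correlation can be as simple as attributing each timed current sample to the function whose address range contains the program counter captured at the same instant.  A simplified sketch of that bookkeeping follows; the data structures are hypothetical, not the internals of any particular tool.

#include <stddef.h>
#include <stdint.h>

/* One synchronized sample: PC from the SWO trace, current from the measurement hardware. */
typedef struct {
    uint32_t pc;            /* program counter at sample time           */
    double   current_mA;    /* measured supply current                  */
} sample_t;

/* Per-function accumulator; address ranges come from the object code's symbol table. */
typedef struct {
    const char *name;
    uint32_t    start;      /* first address of the function            */
    uint32_t    end;        /* one past its last address                */
    double      charge_mC;  /* accumulated current * time; multiply by the
                               supply voltage to express it as energy   */
} func_energy_t;

/* Attribute each sample's charge to the function containing the sampled PC.
 * dt_s is the sampling period in seconds. */
void attribute_samples(const sample_t *s, size_t n_samples, double dt_s,
                       func_energy_t *funcs, size_t n_funcs)
{
    for (size_t i = 0; i < n_samples; i++) {
        for (size_t j = 0; j < n_funcs; j++) {
            if (s[i].pc >= funcs[j].start && s[i].pc < funcs[j].end) {
                funcs[j].charge_mC += s[i].current_mA * dt_s;
                break;
            }
        }
    }
}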

The energyAware Profiler tool from Energy Micro shows the relationship between current consumption, C/C++ code, and the energy used by a particular function. Clicking on a current peak, top right, reveals the associated code, bottom left.

The tool correlates the program-counter value with machine code, and because it is aware of the functions of the C/C++ source program, it can readily indicate how energy use changes as various functions run.  So the idea of a tool that can highlight an energy-hungry code problem to a developer in real time comes to fruition.  The developer watches a trace of power versus time, spots a surprising peak, clicks on it and is immediately shown the offending code.

The ability to identify and rectify energy drains in this way, at an early stage of prototype development, will certainly help reduce the overall energy consumption of the end product, and it will not add to the development time either; on the contrary, it is likely to shorten it.

We would all be wise to consider the debug process of low power embedded systems development as becoming a 3-stage cycle from now on:  hardware debugging, software functionality debugging, and software energy debugging.

Microcontroller development tools need to evolve to enable designers to identify wasteful 'energy bugs' in software during the development cycle.  Discovering energy-inefficient behavior that endangers battery lifetime during a product field trial is after all rather costly and really just a little bit too late!

Boosting energy efficiency – Sleeping and waking

Friday, March 18th, 2011 by Oyvind Janbu

While using a 32-bit processor can enable a microcontroller to stay in a deep-sleep mode for longer, there is nevertheless some baseline power consumption that can significantly influence the overall energy budget. Historically, 32-bit processors have admittedly not been available with useful sub-µA standby modes. With the introduction of power-efficient 32-bit architectures, however, the standby options now complement the reduced processing and active time.

Although the power many microcontrollers consume in deep sleep is relatively low, the functionality they provide in these modes is often very limited.  Since applications often require features such as real-time counters, power-on reset / brown-out detection or UART reception to be enabled at all times, many microcontroller systems are prevented from ever entering deep sleep because such basic features are only available in an active run mode.  Many microcontroller solutions also have limited SRAM and CPU state retention in sub-µA standby modes, if any.  Others need to turn off or duty-cycle their brown-out and power-on reset detectors in order to save power.

In the pursuit of energy efficiency, microcontrollers need to provide product designers with a choice of sleep modes offering the flexibility to scale basic resources, and thereby the power consumption, in several defined levels or energy modes.  While energy modes constitute a coarse division of basic resources, additional fine-grained tuning of resources within each energy mode should also be possible by enabling or disabling individual peripheral functions, as the sketch below illustrates.
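Here is what such scaling might look like from the application's point of view; the mode names and driver calls are hypothetical, not any particular vendor's API.

#include <stdbool.h>

/* Hypothetical energy modes, coarsest resource scaling first. */
typedef enum {
    EM_RUN,        /* CPU and all enabled peripherals clocked            */
    EM_SLEEP,      /* CPU clock gated, peripherals still running         */
    EM_DEEP_SLEEP, /* low-frequency domain only: RTC, brown-out, RAM kept */
    EM_SHUTOFF     /* everything off except wake-up logic                */
} energy_mode_t;

/* Hypothetical driver calls. */
void clock_enable(int peripheral, bool on);  /* fine-grained gating   */
void enter_energy_mode(energy_mode_t mode);  /* coarse mode selection */

#define PERIPH_UART 1
#define PERIPH_ADC  2

void sensor_idle(void)
{
    /* Fine-grained tuning: gate off what this phase does not need... */
    clock_enable(PERIPH_UART, false);
    clock_enable(PERIPH_ADC, false);

    /* ...then drop to the deepest mode that still keeps the RTC and
     * brown-out detector alive, so the device can wake on the next tick. */
    enter_energy_mode(EM_DEEP_SLEEP);
}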

There is little point, though, in offering a microcontroller with impressively low sleep-mode energy consumption if its energy efficiency gains are lost to the time it takes for the microcontroller to wake up and enter run mode.

When a microcontroller goes from a deep sleep state, where the oscillators are disabled, to an active state, there is always a wake-up period, where the processor must wait for the oscillators to stabilize before starting to execute code.  Since no processing can be done during this period of time, the energy spent while waking up is wasted energy, and so reducing the wake-up time is important to reduce overall energy consumption.

Furthermore, microcontroller applications impose real time demands which often mean that the wake-up time must be kept to a minimum to enable the microcontroller to respond to an event within a set period of time.  Because the latency demanded by many applications is lower than the wake-up time of many existing microcontrollers, the device is often inhibited from going into deep sleep at all – not a very good solution for energy sensitive applications.

A beneficial solution is to use a very fast RC oscillator that wakes the CPU instantly and then, if needed, hands the clock source over to a crystal oscillator. This meets real-time demands as well as encouraging run- and sleep-mode duty cycling. Although the RC oscillator is not as accurate as a crystal oscillator, it is sufficient as the CPU's clock source during crystal start-up, as in the sketch below.
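This sketch shows the wake-up sequence using hypothetical clock-control calls rather than a real driver API.

#include <stdbool.h>

/* Hypothetical clock-control interface; names are illustrative only. */
void hfrco_enable(void);            /* fast internal RC oscillator              */
void hfxo_start(void);              /* begin crystal start-up (non-blocking)    */
bool hfxo_ready(void);              /* crystal has reached stable oscillation   */
void select_system_clock_rc(void);
void select_system_clock_xtal(void);

void do_time_critical_work(void);
void do_timing_accurate_work(void);

/* Wake-up path: run immediately from the RC oscillator, switch to the
 * crystal only once it has stabilized (and only if accuracy is needed). */
void on_wakeup(void)
{
    hfrco_enable();
    select_system_clock_rc();       /* RC clock is ready almost immediately     */
    hfxo_start();                   /* crystal warms up in the background       */

    do_time_critical_work();        /* real-time response runs from the RC clock */

    while (!hfxo_ready()) {
        /* could sleep here instead of spinning, waking on a "crystal ready" IRQ */
    }
    select_system_clock_xtal();
    do_timing_accurate_work();      /* e.g. UART or RF timing from the crystal  */
}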

We know that getting back to sleep mode is key to saving energy. The CPU should therefore preferably use a high clock frequency to complete its tasks more quickly and efficiently.  Even if the higher frequency at first appears to require more power, the advantage is a system that is able to return to low power modes in a fraction of the time.
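A rough worked example makes the point; the figures and the simple current model (a fixed overhead I_0 plus a frequency-proportional term) are assumptions chosen only to illustrate the shape of the trade-off.

\[
N = 10^{5}\ \text{cycles}, \qquad I(f) = I_0 + k f, \qquad I_0 = 1\,\text{mA}, \qquad k = 0.2\,\text{mA/MHz}
\]
\[
f = 1\,\text{MHz}: \quad t = \frac{N}{f} = 100\,\text{ms}, \quad Q = I(f)\,t = 1.2\,\text{mA} \times 100\,\text{ms} = 0.12\,\text{mC}
\]
\[
f = 20\,\text{MHz}: \quad t = 5\,\text{ms}, \quad Q = 5\,\text{mA} \times 5\,\text{ms} = 0.025\,\text{mC}
\]

The faster clock draws more current while it runs, but the fixed overhead is paid for only 5 ms instead of 100 ms, and the rest of the period can be spent at sleep-mode currents, so the charge consumed per task drops by nearly a factor of five.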

Peripherals however might not need to run at the CPU’s clock frequency.  One solution to this conundrum is to pre-scale the clock to the core and peripherals, thereby ensuring the dynamic power consumption of the different parts is kept to a minimum.  If the peripherals can further operate without the supervision of the CPU, we realize that a flexible clocking system is a vital requirement for energy efficient microcontrollers.

The obvious way for microcontrollers to use less energy is to allow the CPU to stay asleep while the peripherals are active, and so the development of peripherals that can operate with minimum or no intervention from the CPU is another worthy consideration for microcontroller designers.  When peripherals look after themselves, the CPU can either solve other high level tasks or simply fall asleep, saving energy either way.

With advanced sequence programming, routines for operating peripherals previously controlled by the CPU can be handled by the peripherals themselves.  The use of a DMA controller provides a pragmatic approach to autonomous peripheral operation.  Helping to offload CPU workload to peripherals, a flexible DMA controller can effectively handle data transfers between memory and communication or data processing interfaces.
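As a sketch of the idea, a generic descriptor and invented driver calls stand in here for whatever a real DMA controller provides.

#include <stddef.h>
#include <stdint.h>

/* Generic DMA descriptor for illustration; real controllers differ. */
typedef struct {
    volatile const uint32_t *src;     /* e.g. a peripheral data register  */
    volatile uint32_t       *dst;     /* e.g. a buffer in SRAM            */
    size_t                   count;   /* number of transfers              */
    int                      trigger; /* peripheral request line          */
} dma_descriptor_t;

/* Hypothetical driver calls. */
void dma_configure(int channel, const dma_descriptor_t *desc);
void dma_enable(int channel);
void enter_sleep(void);               /* CPU sleeps; DMA keeps running    */

extern volatile const uint32_t ADC_RESULT;  /* hypothetical ADC data register */
#define DMA_TRIGGER_ADC 7                   /* hypothetical request line      */

static uint32_t samples[256];

void collect_samples_without_cpu(void)
{
    dma_descriptor_t d = {
        .src     = &ADC_RESULT,
        .dst     = samples,
        .count   = 256,
        .trigger = DMA_TRIGGER_ADC,   /* one transfer per conversion      */
    };
    dma_configure(0, &d);
    dma_enable(0);

    /* The CPU can sleep; the DMA completion interrupt wakes it when the
     * buffer is full and ready for processing. */
    enter_sleep();
}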

Of course there’s little point in using autonomous peripherals to relieve the burden of the CPU if they’re energy hungry.  Microcontroller makers also need to closely consider the energy consumption of peripherals such as serial communication interfaces, data encryption/decryption engines, display drivers and radio communication peripherals.  All peripherals must be efficiently implemented and optimized for energy consumption in order to fulfill the application’s need for a low system level energy consumption.

Taking the autonomy ideal a step further, the introduction of additional programmable interconnect structures into a microcontroller enables peripherals to talk to each other without the intervention of the CPU, thereby reducing energy consumption even further.  A typical example of one peripheral talking to another is an ADC conversion periodically triggered by a timer. A flexible peripheral interconnect allows direct hardware interaction between such peripherals, solving the task while the CPU remains in its deepest sleep state.
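A sketch of that timer-triggered ADC conversion, routed through such an interconnect; the routing API and signal names are invented for illustration.

#include <stdint.h>

/* Hypothetical interconnect routing: a producer signal is wired directly to a
 * consumer trigger in hardware, with no CPU involvement. */
typedef enum { SIGNAL_TIMER0_OVERFLOW } producer_signal_t;
typedef enum { TRIGGER_ADC0_START }     consumer_trigger_t;

void interconnect_route(producer_signal_t producer, consumer_trigger_t consumer);
void timer_start_periodic(uint32_t period_ms);
void adc_enable(void);
void enter_deep_sleep(void);

/* Periodic ADC conversions while the CPU sleeps: each timer overflow pulse
 * starts a conversion in hardware; results can then be drained by DMA. */
void configure_autonomous_sampling(void)
{
    interconnect_route(SIGNAL_TIMER0_OVERFLOW, TRIGGER_ADC0_START);
    timer_start_periodic(100);   /* one conversion every 100 ms            */
    adc_enable();
    enter_deep_sleep();          /* CPU wakes only when the data is needed */
}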

The third part of this three part article explores the tools and techniques available for energy debugging.

Boosting energy efficiency – How microcontrollers need to evolve

Monday, February 21st, 2011 by Oyvind Janbu

Whatever the end product, all designers have specific tasks to solve and their solutions will be influenced by the resources that are available and the constraints of cost, time, physical size and technology choice.  At the heart of many a good product, the ubiquitous microcontroller often has a crucial influence on system power design, and in a brave new world concerned with energy efficiency, users are entitled to demand greater service from it.  The way microcontrollers are built and operate needs to evolve dramatically if they are to achieve the best possible performance from limited battery resources.

Bearing in mind that the cost of even a typical coin cell battery can be relatively high compared to that of a microcontroller, there are obvious advantages in designing a system that offers the best possible energy efficiency.  First, it can enable designers to reduce the cost and size of the battery.  Second, it can enable them to significantly extend the battery's lifetime, reducing the frequency of battery replacement and, for certain products, the frequency, cost and 'carbon footprint' associated with product maintenance call-outs.

Microcontrollers, like many other breeds of electronic components, are these days very keen to stress their 'ultra low power' credentials, which is perfectly fine and appropriate where a device's dynamic performance merits it; however, with a finite amount of charge available from a battery cell, it is how a microcontroller uses energy (i.e. power integrated over time) that needs to be more closely borne in mind.

Microcontroller applications improve their energy efficiency by operating in several states – most notably active and sleep modes that consume different amounts of energy.

Product designers need to minimize the product of current and time over all phases of microcontroller operation, throughout both active and sleep periods (Figure 1). Not only does every microamp count, but so does every microsecond that every function takes.  This relationship between current and time makes the comparison of 8- and 16-bit microcontrollers with 32-bit microcontrollers less straightforward. Considering only their current-consumption characteristics in deep-sleep mode, it is easy to understand why 8-bit or 16-bit microcontrollers have been in an attractive position in energy-sensitive applications, where microcontroller duty cycles can be very low.  A microcontroller may, after all, stay in a deep-sleep state for perhaps 99% of the time.
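To put numbers on 'every microamp and every microsecond', the average current over one duty cycle is simply the weighted sum of the two phases; the figures below are illustrative assumptions only.

\[
I_{\text{avg}} = D\,I_{\text{active}} + (1 - D)\,I_{\text{sleep}}
\]
\[
D = 1\%,\quad I_{\text{active}} = 2\,\text{mA},\quad I_{\text{sleep}} = 1\,\mu\text{A}
\quad\Rightarrow\quad
I_{\text{avg}} \approx 20\,\mu\text{A} + 1\,\mu\text{A} = 21\,\mu\text{A}
\]

Even at a 1% duty cycle the active term dominates the average, which is why shortening the active phase with a faster core can matter more than a small difference in deep-sleep current.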

However, if product designers are concerned with every microamp and every microsecond each function takes, then a 32-bit microcontroller should be considered even for the 'simplest' of product designs.  The higher performance of 32-bit processors enables the microcontroller to finish tasks more quickly and so spend more time in the low-power sleep modes, lowering overall energy consumption.  32-bit microcontrollers are therefore not necessarily 'application overkill'.

More than that though, even simple operations on 8-bit or 16-bit variables can need the services of a 32-bit processor if system energy usage goals are to be achieved.  By harnessing the full array of low-power design techniques available today, 32-bit cores can offer a variety of low-power modes with rapid wake-up times that are on par with 8-bit microcontrollers.

There is a common misconception that switching from an 8-bit microcontroller to a 32-bit microcontroller will result in bigger code size, which directly affects the cost and power consumption of end products.  This is borne of the fact that many people have the impression that 8-bit microcontrollers use 8-bit instructions and 32-bit microcontrollers use 32-bit instructions.  In reality, many instructions in 8-bit microcontrollers are 16-bit or 24-bit in length.

The ARM Cortex-M3 and Cortex-M0 processors are based on Thumb-2 technology, which provides excellent code density.  Thumb-2 microcontrollers have 16-bit as well as 32-bit instructions, with the 32-bit instructions a functional superset of the 16-bit ones.  Typical output from a C compiler is around 90% 16-bit instructions, the 32-bit versions being used only when an operation cannot be performed with a 16-bit instruction.  As a result, most of the instructions in an ARM Cortex microcontroller program are 16 bits wide, smaller than many of the instructions in 8-bit microcontrollers, so a 32-bit processor typically produces less compiled code than an 8- or 16-bit microcontroller.

The second part in this three part series looks deeper at the issues around microcontroller sleep modes.

Dark Silicon Redux: System Design Problem or Fundamental Law?

Tuesday, February 1st, 2011 by Max Baron

Like a spotlight picking out an object in total darkness, the presentation of a solution to a problem may sometimes highlight one aspect while obscuring others. Such were the dark silicon problem and the solution by UCSD and MIT that was presented at Hot Chips 2010. Such was also the article I published in August describing the two universities’ idea that could increase a processor’s efficiency.

At the time of that writing, it appeared that the idea would be followed in time by many others that together would overcome the dark silicon problem. All would be well: Moore’s Law that provides more transistors would also provide higher compute performance.

The term 'dark silicon' was probably coined by ARM. ARM described dark silicon as a problem that must be solved by innovative design, but can it be completely solved? Can design continue to solve the problem 'forever'? To answer the question, we next try to take a qualitative look at the dependencies among the system, the die, and compute performance.

According to a 2009 article published in EE Times, ARM CTO Mike Muller said: "Without fresh innovations, designers could find themselves by 2020 in an era of 'dark silicon,' able to build dense devices they cannot afford to power."  Mr. Muller also noted in the same article that " . . . a 11nm process technology could deliver devices with 16 times more transistors . . . but those devices will only use a third as much energy as today's parts, leaving engineers with a power budget so pinched they may be able to activate only nine percent of those transistors."

The use of “only” in the quote may be misunderstood to indicate lower power consumption and higher efficiency. I believe that it indicated disappointment that compared with today’s parts the power consumption would not drop to at least one sixteenth of its 2009 value — to match the rise in the number of transistors.

The term “power budget” can have more than one interpretation. In tethered systems pursuing peak-performance, it can be the worst-case power that is die-temperature related. In mobile systems, it may have a different interpretation: it may be related to the battery-capacity and the percentage of overall system power allocated to the processor. Both interpretations will limit a chip’s power-performance but the limiting factors will be different.

The architects at UCSD/MIT made the best of the unusable silicon problem by surrounding a general-purpose processor core with very efficient small cores located in the dark silicon area. The cores could execute very short sequences of the application code faster and more efficiently than a general-purpose processor but, to keep within the boundary of a power budget, they were probably activated only when needed by the program.

The universities have shown a capability to use part of the dark silicon transistors. It would be interesting to find whether, as transistor numbers increase, the power budget might be dictated by some simple parameters. Finding some limits would rule out dark silicon as a mere problem whose solution will allow designers to utilize 100% of a die to obtain increased performance. In some implementations, the limits could define the best die size and technology of a SoC.

In a system willing to sacrifice power consumption for performance the power budget should be equal to or smaller than the power that can be delivered to the die without causing damage. It is the power (energy/time) that in steady state can be removed from the die by natural and forced cooling, without raising the die’s temperature to a level that would reduce the die’s reliability or even destroy it.

If we allow ourselves the freedom sometimes employed by physicists in simplifying problems, we can say that for a uniformly cooled die of infinite heat conductivity (hot spots can't occur), the heat generated by circuits, and therefore the power budget, are both distributed evenly across the area of the die and are proportional to it (Pbudget ∝ Adie: the larger the die, the higher the power budget).

Simplifying things once more, we define a die-wide average energy Eavg in joules required for one single imaginary circuit (the average circuit) to switch state.  The power budget (energy divided by time) can now be expressed as the power consumed by this single circuit: Pbudget ~ f * Eavg, where f is the switching frequency of the single average circuit.  The actual frequency of all logic on the chip would be f_actual = f / n, where n is the average number of switchings occurring at the same time.
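Written out, the relations above combine into the scaling argument of the next paragraph:

\[
P_{\text{budget}} \propto A_{\text{die}}, \qquad
P_{\text{budget}} \sim f \cdot E_{\text{avg}}, \qquad
f_{\text{actual}} = \frac{f}{n}
\quad\Rightarrow\quad
f = \frac{P_{\text{budget}}}{E_{\text{avg}}} \propto \frac{A_{\text{die}}}{E_{\text{avg}}}
\]

For a fixed technology (fixed Eavg), the sustainable number of average switchings per second is therefore capped by the die area.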

In other words, assuming die-area cooling, with all other semiconductor conditions (a given technology node, fabrication, leakage, environment parameters and the best circuit design innovations) and the cooling itself kept constant, the peak computing performance obtainable (the allowable number of average switchings per second) is directly related to the die area; exceed it and the chip will be destroyed.  The fate of 3D multi-layer silicon will be worse, since the sandwiched layers will enjoy less cooling than the external layers.

Power budgets assigned to processors in mobile systems are more flexible but can be more complex to determine. Camera system designers, for example, can trade-off finder screen size and brightness or fps (frames per second), or zoom and auto focus during video capture — for more processor power. Smart phones that allow non-real-time applications to run slower can save processor power. And, most mobile systems will profit from heterogeneous configurations employing CPUs and hard-wired low power accelerators.

Power budgets in mobile systems will also be affected by software and marketing considerations. Compilers affect the energy consumed by an application based on the number and kind of instructions required for the job to complete. Operating systems are important in managing a system’s resources and controlling the system power states. And, in addition to software and workload considerations the ‘bare core’ power consumption associated with a SoC must compete with claims made by competitors.

Just as local die temperature and power dissipation ended the era in which higher clock frequency meant more performance, so the limitations imposed by the allocated power budget or die area will curtail the reign of multiple-core configurations as a means of increasing performance.

Most powerful 3D computer

Many computer architects like to learn from existing architectures. It was interesting therefore to see how the most powerful known 3D computer is working around its power limitations. It was however very difficult to find much data on the Internet. The data below was compiled from a few sources and the reader is asked to help corroborate it and/or provide more reliable numbers and sources:

An adult human brain is estimated to contain 10^11 (100 billion) neurons. A firing neuron consumes an average energy of 10^-9 joules.  The neuron's maximum firing rate is estimated by some papers to be 1,000 Hz. Normal operating frequencies are lower, at 300 Hz to 400 Hz.

The maximum power that would be generated by the human brain with all neurons firing at the maximum frequency of 1,000 Hz is 10^3 * 10^11 * 10^-9 = 10^5 joules/second = 100,000 watts, enough to destroy the brain and some of its surroundings.

Some papers estimate the actual power consumption of the brain at 10 W while others peg it at 100 W. According to still other papers the power averaged over 24 hours is 20 W. Yet, even the highest number seems acceptable since the brain's 3D structure is blood-and-evaporation cooled and kept at optimal temperature. Imagine keeping a 100 W heat source cool by blood flow!  Performance-wise, the 10 W and 100 W power estimates imply that the brain is delivering 10^10 or 10^11 neuron firings per second. Using the considerations applied to semiconductor die usage, the brain may be running at 0.01% or up to 0.1% of its neuron capacity, possibly turning semi-"dark brain" sections fully "on" or partly "off" depending on workload. Compare these percentages with the much higher 9% utilization factor forecasted for 11nm silicon.
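The utilization figures follow directly from the estimates above:

\[
\text{capacity} = 10^{11}\ \text{neurons} \times 10^{3}\ \text{Hz} = 10^{14}\ \text{firings/s}
\]
\[
\frac{10\,\text{W}}{10^{-9}\,\text{J/firing}} = 10^{10}\ \text{firings/s}
\;\Rightarrow\;
\frac{10^{10}}{10^{14}} = 0.01\%,
\qquad
\frac{100\,\text{W}}{10^{-9}\,\text{J/firing}} = 10^{11}\ \text{firings/s}
\;\Rightarrow\; 0.1\%
\]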

The highly dense silicon chip and the human brain are affected by the same laws of physics.

In semiconductor technology, as Moore's Law places more transistors on the same-sized die or makes the die smaller, the power budget needed for full transistor utilization moves in the opposite direction, since full utilization requires larger die areas. Unless cost-acceptable extreme cooling can track technology nodes by removing, for example at 11nm, about five times more heat from the reference die, or technology finds ways to reduce a core's power dissipation by the same factor, Moore's Law and computing performance will follow different roadmaps.

In mobile applications the limit is affected by battery capacity versus size and weight. According to some battery developers, capacity is improving slowly, as vendors spend more effort creating custom batteries for big suppliers of mobile systems than on research. I estimate battery capacity to be improving at approximately 6% per year, leaving Moore's Law without support since it doubles transistor numbers every two years.

UCSD/MIT's approach is not a 'waste' of transistors if its use of dark silicon can deliver higher performance within the boundaries of the power budget. The Von Neumann architecture was built to save components since it was created at a time when components were expensive, bulky and hard to manufacture. Our problem today and in the near future is to conceive of an architecture that can use an affluence of components.