Extreme Processing Channel

Processors can only be so small, so cheap, and so fast. Extreme Processing explores the principles behind the practical limits of the smallest, largest, fastest, and most energy-efficient embedded devices available so that when a new device pushes the limits, embedded developers can more easily see how it affects them.

When is running warm – too hot?

Wednesday, March 21st, 2012 by Robert Cravotta

Managing the heat emanating from electronic devices has always been a challenge and design constraint. Mobile devices present an interesting set of design challenges because unlike a server operating in a strictly climate controlled room, users want to operate their mobile devices across a wider range of environments. Mobile devices place additional design burdens on developers because the size and form factor of the devices restrict the options for managing the heat generated while the device is operating.

The new iPad offers the latest device where technical specifications may or may not be compatible with what users expect from their devices. According to Consumer Reports, the new iPad can reach operating temperatures that are up to 13 degrees higher (when plugged in) than an iPad 2 performing the same tasks under the same operating conditions. Using a thermal imaging camera, the peak temperature reported by Consumer Reports is 116 degrees Fahrenheit on the front and rear of the new iPad while playing Infinity Blade II. The peak heat spot was near one corner of the device (Image at the referenced article).

This type of peak temperature feels warm to very warm to the touch for short periods of time. However, some people may consider a peak temperature of 116 degrees Fahrenheit too warm for a device that they plan to hold in their hands or rest on their lap for extended periods of time.

There are probably many engineering trade-offs that were considered in the final design of the new iPad. The feasible options for heat sinks or distributing heat away from the device were probably constrained by the iPad’s thin form factor, dense internal components, and larger battery requirements. Integrating a higher pixel density display definitely provided a design constraint on how the system and graphic processing was architected to deliver an improvement in display quality and maintain an acceptable battery life.

Are consumer electronics bumping up against edge of what designers can deliver when balancing small form factors, high performance processing and graphics, acceptable device weight, and long enough battery life? Are there design trade-offs that are still available to designers to further push where mobile devices can go while staying within the constraints of acceptable heat, weight, size, cost, and performance? Have you ever dealt with running a warm system that becomes a system that is running too hot? If so, how did you deal with it?

Low Power Design: Energy Harvesting

Friday, March 25th, 2011 by Robert Cravotta

In an online course about the Fundamentals of Low Power Design I proposed a spectrum of six categories of applications that identify the different design considerations for low power design for embedded developers. The spectrum of low power applications I propose are:

1) Energy harvesting

2) Disposable or limited use

3) Replaceable energy storage

4) Mobile

5) Tethered with passive cooling

6) Tethered with active cooling

This article focuses on the characteristics that affect energy harvesting applications. I will publish future articles that will focus on the characteristics of the other categories.

Energy harvesting designs represent the extreme low end of the low power design spectrum. In an earlier article I identified some common forms of energy harvesting that are publicly available and the magnitude of the energy typically available for harvesting (usually in the μW to mW range).

Energy harvesting designs are ideal for tasks in locations where delivering power is difficult. Examples include remote sensors, such as those in a manufacturing building where the sheer quantity of devices makes regular battery replacement infeasible. Many of the sensors may also be in locations that are difficult or dangerous for an operator to reach. For this reason, energy harvesting systems usually run autonomously, and they spend the majority of their time in a sleep state. Energy harvesting designs often trade off computation capabilities to fit within a small energy budget because the source of energy is intermittent and/or not guaranteed on demand.

Energy harvesting systems consist of a number of subsystems that work together to provide energy to the electronics of the system. The energy harvester is the subsystem that interfaces with the energy source and converts it into usable and storable electricity. Common types of energy harvesters are able to extract energy from ambient light, vibration, thermal differentials, as well as ambient RF energy.

The rate of energy captured from the environment by the energy harvester may not be sufficient to allow the system to operate directly; rather, the output of the energy harvester feeds into an energy storage and power management controller that conditions and stores the captured energy in an energy bladder, buffer, capacitor, or battery. Then, when the system is in an awake state, it draws energy from the storage module.

The asymmetry between the rate at which energy is collected and the rate at which it is consumed necessitates that the system’s functions execute only periodically, with enough time between operating cycles to capture and store sufficient new energy. Microcontrollers that support low operating or active power consumption, as well as the capability to switch quickly between on and off states, are key considerations for energy harvesting applications.
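As a rough illustration of this asymmetry, the sustainable duty cycle can be computed from the harvested power and the active and sleep power draws. The figures below are hypothetical, chosen only to show the shape of the calculation:

```python
def max_duty_cycle(p_harvest_w, p_active_w, p_sleep_w):
    """Largest fraction of time the system may stay awake so that
    average consumption never exceeds harvested power.
    Solves d * p_active + (1 - d) * p_sleep <= p_harvest for d."""
    if p_harvest_w <= p_sleep_w:
        return 0.0  # the harvester cannot even cover the sleep current
    return min((p_harvest_w - p_sleep_w) / (p_active_w - p_sleep_w), 1.0)

# Hypothetical figures: 100 uW harvested, 3 mW active, 1 uW sleep
d = max_duty_cycle(100e-6, 3e-3, 1e-6)
print(f"maximum duty cycle: {d:.2%}")   # about 3.3%
```

Even a modest active draw forces single-digit duty cycles, which is one reason these systems spend most of their time asleep.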

A consideration that makes energy harvesting designs different from the other categories in the low power spectrum is that the harvested energy must undergo a transformation to be usable by the electronics. This is in contrast to systems that can recharge their energy storage – these systems receive electricity directly in quantities that support operating the system and recharging the energy storage module.

If the available electricity ever becomes insufficient to operate the energy harvesting module, the module may not be able to capture and transform ambient energy even when there is enough energy in the environment. This operating condition means that decisions about when and how the system turns on and off must take extra precautions against drawing too much energy during operation; otherwise, the system risks being starved into an unrecoverable condition.
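One way to take those precautions in firmware is a simple guard that refuses to wake unless the stored energy covers both the worst-case task and a reserve for the harvesting module itself. This is a minimal sketch; the thresholds and the `should_wake` helper are invented for illustration, not taken from any real part:

```python
# Starvation-guard sketch; all thresholds are illustrative assumptions.
WAKE_THRESHOLD_J  = 50e-6   # enough stored energy for one worst-case task
FLOOR_THRESHOLD_J = 5e-6    # reserve that keeps the harvester/PMU alive

def should_wake(stored_j, worst_case_task_j):
    """Wake only if the task can finish without draining the reserve
    the harvesting and power-management module needs to keep running."""
    return (stored_j >= WAKE_THRESHOLD_J and
            stored_j - worst_case_task_j >= FLOOR_THRESHOLD_J)

print(should_wake(60e-6, 40e-6))   # True: task fits, reserve intact
print(should_wake(60e-6, 58e-6))   # False: would starve the harvester
```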

Energy harvesting applications are still an emerging application space. As costs continue to decrease and the efficiency of the harvesting modules continues to improve, more applications will make sense to pursue, in a fashion analogous to how microcontrollers have been replacing mechanical controls within systems for the past few decades.

What makes an embedded design low power?

Wednesday, March 2nd, 2011 by Robert Cravotta

It seems that nearly everything these days is marketed as a low power device/system. I see the phrase so much in marketing material, and in so many unsubstantiated contexts, that it has become one of those phrases that turn invisible on the page or screen I am reading. It is one of those terms that lack a set-in-concrete context – rather, it is often used as an indication of the intent of a device’s designers. Is it reasonable to declare an mW device as low power when there are μW devices in existence? Is it ever reasonable to declare a multi-watt system as low power?

The fact that low power thresholds are moving targets makes it more difficult to declare a system as low power – meaning that what is considered low power today soon becomes normal and the threshold of what constitutes low power necessarily shifts.

I recently was asked to build an online course about low power design for a processor that consumes power on the order of Watts. When I think of low power designs, I usually think of power consumption that is several orders of magnitude lower than that. While low power is not defined as a specific threshold, it can be addressed with appropriate techniques and strategies based on the context of the end design. I came up with an energy consumption spectrum that consists of six major categories. Even though the specific priorities for low power are different for each category, the techniques to address those priorities are similar and combined in different mixes.

We will be rolling out a new approach (that will eventually become fully integrated within the Embedded Processing Directory) for describing and highlighting low power features incorporated within microprocessors (including microcontrollers and DSPs) to enable developers to more easily identify those processors that will enable them to maximize the impact of the type of low power design they need.

What do you think is necessary to consider an embedded design as low power? Are there major contexts for grouping techniques and strategies for a set of application spaces? For example, energy harvesting applications are different from battery powered devices, which are different again from devices that are plugged into a wall socket. In some cases, a design may need to complete a function within a maximum amount of energy, while another may need to limit the amount of heat that is generated from performing a function. What are the different ways to consider a design a low power one?

Will Watson affect embedded systems?

Wednesday, February 23rd, 2011 by Robert Cravotta

IBM’s Watson computer system recently beat two of the strongest Jeopardy players in the world in a real match of Jeopardy. The match was the culmination of four years of work by IBM researchers. This week’s question has a dual purpose – to focus discussion on how the Watson innovations can/will/might affect the techniques and tools available to embedded developers – and to solicit questions from you that I can ask the IBM research team when I meet up with them (after the main media furor dies down a bit).

The Watson computing system is the latest example of innovations in extreme processing problem spaces. NOVA’s video “Smartest Machine on Earth” provides a nice overview of the project and the challenges the researchers faced while getting Watson ready to compete against human players in the game of Jeopardy. While Watson is able to interpret the natural language wording of Jeopardy answers and tease out appropriate responses for the questions (Jeopardy provides answers and contestants provide the questions), it was not clear from the press material or the video whether Watson was processing natural language in audio form or only in text form. A segment near the end of the NOVA video casts doubt on whether Watson was able to work with audio inputs.

In order to bump Watson’s performance into the champion “cloud” (a distribution, presented in the video, of the performance of Jeopardy champions), the team had to rely on machine learning techniques so that the computing system could improve how it recognizes the many different contexts that apply to words. Throughout the video, we see that the team kept adding more pattern recognition engines (rules?) to the Watson software so that it could handle different types of Jeopardy questions. A satisfying segment in the video was when Watson was able to change its weighting engine for a Jeopardy category that it did not understand after receiving the correct answers to four questions in that category – much like a human player would refine their understanding of a category during a match.

Watson uses 2800 processors, and I estimate that the power consumption is on the order of a megawatt or more. This is not a practical energy footprint for most embedded systems, but the technologies that make up this system might be available to distributed embedded systems if they can connect to the main system. Also, consider that the human brain is a blood-cooled 10 to 100 W system – this suggests that we may be able to drastically improve the energy efficiency of a system like Watson in the coming years.

Do you think this achievement is huff and puff? Do you think it will impact the design and capabilities of embedded systems? For what technical questions would you like to hear answers from the IBM research team in a future article?

Forward to the Past: A Different Way to Cope with Dark Silicon

Tuesday, February 8th, 2011 by Max Baron

Leigh’s comment on whether dark silicon is a design problem or a fundamental law presents an opportunity to explore an “old” processor architecture: the Ambric architecture, whose implementation made use of dark silicon but did not escape the limitations that power budgets impose on Moore’s Law.

Mike Butts introduced the Ambric architecture at the 2006 Fall Microprocessor Forum, an event at which I served as technical content chairperson. Tom Halfhill, my colleague at the time, wrote an article about Ambric’s approach and in February 2007 Ambric won In-Stat’s 2006 Microprocessor Report Analysts’ Choice Award for Innovation.

I’ll try to describe the architecture for those that are not familiar with it.

The Ambric architecture’s configuration went beyond the classical MIMD definition. It was described as a globally asynchronous – locally synchronous (GALS) architecture, a description that for chip designers held connotations of clock-less processing. The description, however, does not detract in any way from the innovation and the award for which I voted.

The streaming Ambric architecture as I saw it at the time could be described as a heterogeneous mix of two types of processing cores plus memories and interconnect.

Ambric’s programming innovation involved software objects assigned to specific combinations of cores and/or memory whose execution could proceed in their own time and at their own clock rate, which is probably the reason for the software-defined term “asynchronous architecture.” The cores were clocked, however, and some could be clocked at different rates, though probably in sync to avoid metastability.

The two types of processor cores provided by Am2045 — the chip introduced at the event — were described as SRs (Streaming RISC) engaged mainly in managing communications and utilities for the second type of cores, the high performance SRDs (Streaming RISC with DSP Extensions) that were the heavy lifter cores in the architecture.

Perhaps the most important part of Ambric’s innovation was the concept of objects assigned to combinations of one or more cores that could be considered as software/hardware black boxes. The black boxes could be interconnected via registers and control that made them behave as if they were FIFOs.

I believe that this is the most important part of the innovation because it almost removes the overhead of thread synchronization. With the removal of this major obstacle to exploiting highly parallelizable workloads, such as those encountered in DSP applications, Ambric opened the architecture to execution by hundreds and possibly thousands of cores — but at the price of reduced generality and the need for more human involvement in routing objects across interconnects for the best performance of processor cores and memory.
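The FIFO-connected black-box idea can be mimicked in ordinary software. The sketch below is a loose analogy in Python threads and bounded queues (not Ambric’s actual tools or tooling): two “objects” each run at their own pace and interact only through FIFOs, so neither contains explicit synchronization code.

```python
import queue
import threading

def producer(out_q):
    """First black box: emits a stream of values at its own pace."""
    for x in range(5):
        out_q.put(x)      # blocks only when the FIFO is full
    out_q.put(None)       # end-of-stream marker

def doubler(in_q, out_q):
    """Second black box: transforms the stream, again at its own pace."""
    while (x := in_q.get()) is not None:
        out_q.put(x * 2)
    out_q.put(None)

# Bounded queues play the role of the register-based FIFOs.
q1, q2 = queue.Queue(maxsize=2), queue.Queue(maxsize=2)
threading.Thread(target=producer, args=(q1,)).start()
threading.Thread(target=doubler, args=(q1, q2)).start()

results = []
while (v := q2.get()) is not None:
    results.append(v)
print(results)   # [0, 2, 4, 6, 8]
```

Neither worker knows about the other’s clock rate; the bounded FIFO provides all the flow control, which is the property the Ambric objects exploited.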

The Ambric architecture can cover a die with cores and memories. A lower technology node might, for example, provide four times more transistors, but the architecture cannot quadruple its computing speed (switchings per second) because of power budget limitations, whether imposed by temperature constraints or battery capacity. Designers can only decrease the chip’s VDD to match the chip’s power dissipation to its power budget, but in doing so they must also reduce clock frequency and the associated performance.

The idea of connecting “black boxes” originated with designers of analog computers and hybrid analog/digital computers at least 60 years ago. It was the approach employed in designing computers just before the introduction of the von Neumann architecture. Ambric’s innovation of creating a software/hardware combination is probably independent of that past.

Compared with Ambric’s approach, the UCSD/MIT idea is based on a number of different compiler-created, efficient small cores specialized to execute short code sequences critical to the computer’s performance. The UCSD/MIT architecture can enjoy more generality in executing workloads, on the condition that specific small cores are created for the targeted workload types. By raising small-core frequency without creating dangerous hot spots, the architecture can deliver performance while keeping within power budget boundaries – but it, too, cannot deliver increased compute performance at the same rate as Moore’s Law delivers transistors.

Dark Silicon Redux: System Design Problem or Fundamental Law?

Tuesday, February 1st, 2011 by Max Baron

Like a spotlight picking out an object in total darkness, the presentation of a solution to a problem may sometimes highlight one aspect while obscuring others. Such were the dark silicon problem and the solution by UCSD and MIT that was presented at Hot Chips 2010. Such was also the article I published in August describing the two universities’ idea that could increase a processor’s efficiency.

At the time of that writing, it appeared that the idea would be followed in time by many others that together would overcome the dark silicon problem. All would be well: Moore’s Law that provides more transistors would also provide higher compute performance.

The term ‘dark silicon’ was probably coined by ARM. ARM described dark silicon as a problem that must be solved by innovative design, but can it be completely solved? Can design continue to solve the problem ‘forever’? To answer the question, we next take a qualitative look at the dependencies among the system, the die, and compute performance.

According to a 2009 article published in EE Times, ARM CTO Mike Muller said: “Without fresh innovations, designers could find themselves by 2020 in an era of ‘dark silicon,’ able to build dense devices they cannot afford to power.” Mr. Muller also noted in the same article that “. . . a 11nm process technology could deliver devices with 16 times more transistors . . . but those devices will only use a third as much energy as today’s parts, leaving engineers with a power budget so pinched they may be able to activate only nine percent of those transistors.”

The use of “only” in the quote may be misunderstood to indicate lower power consumption and higher efficiency. I believe that it indicated disappointment that compared with today’s parts the power consumption would not drop to at least one sixteenth of its 2009 value — to match the rise in the number of transistors.

The term “power budget” can have more than one interpretation. In tethered systems pursuing peak-performance, it can be the worst-case power that is die-temperature related. In mobile systems, it may have a different interpretation: it may be related to the battery-capacity and the percentage of overall system power allocated to the processor. Both interpretations will limit a chip’s power-performance but the limiting factors will be different.

The architects at UCSD/MIT made the best of the unusable silicon problem by surrounding a general-purpose processor core with very efficient small cores located in the dark silicon area. The cores could execute very short sequences of the application code faster and more efficiently than a general-purpose processor but, to keep within the boundary of a power budget, they were probably activated only when needed by the program.
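The dispatch idea can be sketched as a toy model (purely illustrative; this is not the UCSD/MIT toolchain): short hot code sequences get a specialized “core,” modeled here as a precomputed function that is invoked only when the program hits a matching sequence, with a general-purpose fallback for everything else.

```python
# Specialized "cores": fast paths for known hot code sequences.
specialized_cores = {
    "dot3": lambda a, b: a[0]*b[0] + a[1]*b[1] + a[2]*b[2],
}

def general_purpose(op, *args):
    """Slow, fully general fallback path."""
    if op == "sum":
        return sum(args[0])
    raise NotImplementedError(op)

def run(op, *args):
    core = specialized_cores.get(op)
    if core is not None:            # "power up" the small core on demand
        return core(*args)
    return general_purpose(op, *args)

print(run("dot3", (1, 2, 3), (4, 5, 6)))   # 32, via the specialized core
print(run("sum", [1, 2, 3]))               # 6, via the general core
```

The power story maps onto the lookup: a specialized core consumes nothing until its sequence appears, mirroring how the dark-silicon cores were probably activated only when needed.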

The universities have shown a capability to use part of the dark silicon transistors. It would be interesting to find whether, as transistor numbers increase, the power budget might be dictated by some simple parameters. Finding some limits would rule out dark silicon as a mere problem whose solution will allow designers to utilize 100% of a die to obtain increased performance. In some implementations, the limits could define the best die size and technology of a SoC.

In a system willing to sacrifice power consumption for performance, the power budget should be equal to or smaller than the power that can be delivered to the die without causing damage. It is the power (energy/time) that, in steady state, can be removed from the die by natural and forced cooling without raising the die’s temperature to a level that would reduce the die’s reliability or even destroy it.

If we allow ourselves the freedom sometimes employed by physicists in simplifying problems, we can say that for a uniformly cooled die of infinite heat conductivity (hot spots can’t occur), the heat generated by circuits, and therefore the power budget, are both distributed evenly across the area of the die and are proportional to it (P_budget ∝ A_die . . . the larger the die, the higher the power budget).

Simplifying things once more, we define a die-wide average energy E_avg in joules required for one single imaginary circuit (the average circuit) to switch state. The power budget (energy divided by time) can now be expressed as the power consumed by the single circuit: P_budget ~ f * E_avg, where f is the frequency of switching the single average circuit. The actual frequency of all logic on the chip would be f_actual = f / n, where n is the average number of switchings occurring at the same time.

In other words, assuming die-area cooling, with all other semiconductor conditions (a given technology node, fabrication, leakage, environment parameters, and the best circuit design innovations) and cooling all kept constant, the peak computing performance obtainable (the allowable number of average switchings per second) is directly related to the die area. Otherwise the chip will be destroyed. The fate of 3D multi-layer silicon will be worse, since the sandwiched layers will enjoy less cooling than the external layers.
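The simplified model lends itself to a few lines of arithmetic. The sketch below uses the relations from the text (P_budget proportional to die area, P_budget ~ f * E_avg, f_actual = f / n); the heat-removal density `W_PER_MM2` and all numbers are assumed, illustrative values:

```python
W_PER_MM2 = 0.5   # assumed watts removable per mm^2 of die area

def max_switch_rate(a_die_mm2, e_avg_j, n_simultaneous):
    p_budget = W_PER_MM2 * a_die_mm2      # P_budget proportional to A_die
    f = p_budget / e_avg_j                # P_budget ~ f * E_avg
    return f / n_simultaneous             # f_actual = f / n

# Doubling the die area doubles the allowable switching rate:
f1 = max_switch_rate(100, 1e-12, 1e6)
f2 = max_switch_rate(200, 1e-12, 1e6)
print(f2 / f1)   # 2.0
```

This is the article’s point in miniature: under area-limited cooling, peak performance tracks die area, not transistor count.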

Power budgets assigned to processors in mobile systems are more flexible but can be more complex to determine. Camera system designers, for example, can trade-off finder screen size and brightness or fps (frames per second), or zoom and auto focus during video capture — for more processor power. Smart phones that allow non-real-time applications to run slower can save processor power. And, most mobile systems will profit from heterogeneous configurations employing CPUs and hard-wired low power accelerators.

Power budgets in mobile systems will also be affected by software and marketing considerations. Compilers affect the energy consumed by an application based on the number and kind of instructions required for the job to complete. Operating systems are important in managing a system’s resources and controlling the system power states. And, in addition to software and workload considerations, the ‘bare core’ power consumption associated with an SoC must compete with claims made by competitors.

Just as local die temperature and power dissipation terminated the period when higher clock frequency meant more performance, the limitations imposed by allocated power budget or die area will curtail the reign of multiple core configurations as a means of increasing performance.

Most powerful 3D computer

Many computer architects like to learn from existing architectures. It was interesting, therefore, to see how the most powerful known 3D computer works around its power limitations. It was, however, very difficult to find much data on the Internet. The data below was compiled from a few sources, and the reader is asked to help corroborate it and/or provide more reliable numbers and sources:

An adult human brain is estimated to contain 10^11 (100 billion) neurons. A firing neuron consumes an average energy of 10^-9 joules. The neuron’s maximum firing rate is estimated by some papers to be 1,000 Hz. Normal operating frequencies are lower, at 300 Hz to 400 Hz.

The maximum power that would be generated by the human brain with all neurons firing at the maximum frequency of 1,000 Hz is 10^3 * 10^11 * 10^-9 = 10^5 joules/second = 100,000 W — enough to destroy the brain and some of its surroundings.

Some papers estimate the actual power consumption of the brain at 10 W while others peg it at 100 W. According to still other papers, the power averaged over 24 hours is 20 W. Yet even the highest number seems acceptable since the brain’s 3D structure is blood-and-evaporation cooled and kept at optimal temperature. Imagine keeping a 100 W heat source cool by blood flow! Performance-wise, the 10 W and 100 W power estimates imply that the brain is delivering 10^10 or 10^11 neuron firings per second. Using the considerations applied to semiconductor die usage, the brain may be running at 0.01% or up to 0.1% of its neuron capacity, possibly turning semi-“dark brain” sections fully “on” or partly “off” depending on workload. Compare these percentages with the much higher 9% utilization factor forecasted for 11nm silicon.
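These back-of-envelope figures are easy to reproduce; the numbers below are the estimates quoted in the text, not measurements:

```python
neurons  = 1e11     # neurons in an adult brain (estimate)
e_fire_j = 1e-9     # average energy per firing, joules (estimate)
f_max_hz = 1000     # maximum firing rate, Hz (estimate)

# All neurons firing at the maximum rate:
p_max = neurons * e_fire_j * f_max_hz
print(f"all neurons at max rate: {p_max:,.0f} W")   # ~100,000 W

# Utilization implied by the 10 W and 100 W consumption estimates:
capacity = neurons * f_max_hz                 # max firings per second
utilization = {p_w: (p_w / e_fire_j) / capacity for p_w in (10, 100)}
for p_w, u in utilization.items():
    print(f"{p_w} W -> {u:.2%} of firing capacity")
```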

The highly dense silicon chip and the human brain are affected by the same laws of physics.

In semiconductor technology, as Moore’s law places more transistors on the same-sized die or makes the die smaller, the power budget needed for full transistor utilization moves in the opposite direction, since full utilization requires larger die areas. Unless cost-acceptable extreme cooling can track technology nodes (removing, for example, about five times more heat from the reference die at 11nm), or technology finds ways to reduce a core’s power dissipation by the same factor, Moore’s Law and computing performance will follow different roadmaps.

In mobile applications the limit is affected by battery capacity versus size and weight. According to some battery developers, capacity is improving slowly as vendors spend more effort creating custom batteries for big suppliers of mobile systems than on research. I estimate battery capacity to improve at approximately 6% per year, leaving Moore’s law without support since it doubles transistor numbers every two years.
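The divergence between those two growth rates compounds quickly. A quick check over a decade, using the ~6% per year battery estimate and a transistor doubling every two years:

```python
years = 10
battery_gain    = 1.06 ** years        # ~6% per year capacity improvement
transistor_gain = 2 ** (years / 2)     # Moore's law: doubling every 2 years
print(f"battery: {battery_gain:.1f}x   transistors: {transistor_gain:.0f}x")
# battery: 1.8x   transistors: 32x
```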

UCSD/MIT’s approach is not a ‘waste’ of transistors if its use of dark silicon can deliver higher performance within the boundaries of the power budget. The von Neumann architecture was built to save components because it was created at a time when components were expensive, bulky, and hard to manufacture. Our problem today and in the near future is to conceive of an architecture that can use an abundance of components.

Considerations for 4-bit processing

Friday, December 10th, 2010 by Robert Cravotta

I recently posed a question of the week about who is using 4-bit processors and for what types of systems. At the same time, I contacted some of the companies that still offer 4-bit processors. In addition to the three companies that I identified as still offering 4-bit processors (Atmel, EM Microelectronics, and Epson), a few readers mentioned parts from NEC Electronics, Renesas, Samsung, and National. NEC Electronics and Renesas merged, and Renesas Electronics America now sells the combined set of those companies’ processor offerings.

These companies do not sell their 4-bit processors to the public developer community in the same way that 8-, 16-, and 32-bit processors are sold. Atmel and Epson told me their 4-bit lines support legacy systems; the Epson lines most notably support timepiece designs. I was able to speak with EM Microelectronics at length about their 4-bit processors and gained the following insights.

Programming 4-bit processors is performed in assembly language only. In fact, the development tools cost in the range of $10,000 and the company loans the tools to their developer clients rather than sell them. 4-bit processors are made for dedicated high volume products – such as the Gillette Fusion ProGlide. The 4-bit processors from EM Microelectronics are available only as ROM-based devices, and this somewhat limits the number of designs the company will support because the process to verify the mask sets is labor intensive. The company finds the designers that can make use of these processors – not the other way around. The company approaches a developer and works to demonstrate how the 4-bit device can provide differentiation to the developer’s design and end product.

The sweet spot for 4-bit processor designs is single-battery applications that have a 10-year lifespan and where the device is active perhaps 1% of that time and in standby the other 99%. An interesting differentiator for 4-bit processors is that they can operate at 0.6V. This is a substantial advantage over the lowest power 8-bit processors, which are still fighting over the 0.9 to 1.8V space. Also, 4-bit processors have been supporting energy harvesting designs since 1990, whereas 8- and 16-bit processor vendors have only within the last year or so begun to offer development and demonstration kits for energy harvesting. These last two revelations strengthen my claim in “How low can 32-bit processors go” that smaller sized processors will reach lower price and energy thresholds years before the larger processors can feasibly support those same thresholds – and that time advantage is huge.
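That 10-year, 1%-active sweet spot is easy to sanity-check. The current figures below are hypothetical, not EM Microelectronics specifications:

```python
def avg_current_ua(active_ua, standby_ua, duty):
    """Average current draw for a duty-cycled device, in microamps."""
    return duty * active_ua + (1 - duty) * standby_ua

# Hypothetical figures: 200 uA active, 0.5 uA standby, 1% duty cycle
i_avg = avg_current_ua(active_ua=200, standby_ua=0.5, duty=0.01)
hours_10y = 10 * 365 * 24
needed_mah = i_avg * hours_10y / 1000
print(f"{i_avg:.2f} uA average -> {needed_mah:.0f} mAh over 10 years")
# ~219 mAh: roughly the capacity of a single coin cell
```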

I speculate that there may be other 4-bit designs out there, but the people using them do not want anyone else to know about them. Think about it, would you want your competitor to know you were able to simplify the problem set to fit on such a small device? Let them think you are using a larger, more expensive (cost and energy) device and wonder how you are doing it.

Are you, or would you consider, using a 4-bit microcontroller?

Wednesday, November 24th, 2010 by Robert Cravotta

Jack Ganssle recently asked me about 4-bit microcontrollers. He noted that there are no obvious 4-bit microcontrollers listed in the Embedded Processing Directory – but that is partly because there are so few of them that I “upgraded” them to the 8-bit listing a few years back. In all the years I have been doing the directory, this is the first time someone has asked about the 4-bitters.

I suspect the timing of Jack’s inquiry is related to his recent article “8 bits is dead” where he points out that the predicted death of 8-bit microcontrollers continues to be false – in fact, he predicts “that the golden age of 8 bits has not yet arisen. As prices head to zero, volumes will soar putting today’s numbers to shame.” I agree with him, the small end of the processing spectrum is loaded with potential and excitement, so much so that I started a series on extreme processing thresholds a few months ago to help define where the current state of the art for processing options is so that it is easier to identify when and how it shifts.

The timing of this inquiry also coincides with Axel Streicher’s article asking “Who said 16-bit is dead?” Axel makes a similar observation about 16-bit processors. I would have liked to see him point out that 16-bit architectures are also a sweet spot for DSCs (digital signal controllers), especially because Freescale was one of the first companies to adopt the DSC naming. A DSC is a hybrid that combines architectural features of a microcontroller and a digital signal processor in a single execution engine.

A comment on Jack’s article suggested that this topic is the result of someone needing a topic for a deadline, but I beg to differ. There are changes in the processing market that constantly raise the question of whether 8- and 16-bitters will finally become extinct. The big change this year was the introduction of the Cortex-M0 – and this provided the impetus for me to revisit this same topic, albeit from a slightly different perspective, earlier this year when I asked “How low can 32-bit processors go?” I offer that a key advantage that smaller processors have over 32-bit processors is that they reach lower cost and energy thresholds several years before 32-bit processors can get there, so the exciting new stuff will be done on the smaller processors long before it is put on a 32-bit processor.

In contrast, the humble 4-bit receives even less attention than the 8- and 16-bitters – but the 4-bit microcontroller is not dead either. Epson just posted a new data sheet for a 4-bit microcontroller a few weeks ago (I am working to get them added to the Embedded Processing Directory now). The Epson 4-bitters are legacy devices that are used in timepieces. EM Microelectronics’ EM6607 is a 4-bit microcontroller; I currently have a call to them to clarify its status and find out what types of applications it is used in. You can still find information about Atmel’s MARC4, which the company manages out of its German offices but is no longer investing new money in.

So to answer Jack’s question – no, 4-bit processors are not dead yet, and they might not die anytime soon. Are any of you using 4-bit processors in any shape or form? Would you consider using them? What types of processing characteristics define a 4-bitter’s sweet spot? Do you know of any other companies offering 4-bit processors or IP?

UCSD Turns On the Light on Dark Silicon

Friday, August 27th, 2010 by Max Baron

The session on SoCs at Hot Chips 22 featured only one academic paper among several presentations that combined technical detail with a smidgeon of marketing. Originating from a group of researchers from UCSD and MIT, the presentation titled “GreenDroid: A Mobile Application Processor for a Future of Dark Silicon,” introduced the researchers’ solution to the increase of dark silicon as the fabrication of chips evolves toward smaller semiconductor technology nodes.

The reference to dark silicon seems to have been picked up by the press when, in 2009, Mike Muller, ARM’s CTO, described the increasing limitations that power consumption imposes on driving and utilizing the growing numbers of transistors provided by technology nodes down to 11nm. As described by the media, Mike Muller’s warning spoke about power budgets that could not be increased to keep up with the escalating number of transistors provided by smaller geometries.

Why have power budgets? The word “budget” seems to imply that designers have permission to increase power simply by setting a higher budget. Carrying power increases to extreme levels, however, will generate temperatures that will destroy the chip or drastically reduce its lifetime. Thus, a die of fixed dimensions – and therefore a nearly fixed power budget – will reach a semiconductor technology node where only a small percentage of its Moore’s Law–predicted transistors can be driven. The remaining transistors are the dark silicon.
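The arithmetic behind the dark silicon trend is easy to sketch. The following toy model (my own illustration, not from Muller or UCSD – every number in it is hypothetical) assumes transistor counts double per node while per-transistor switching power drops only about 30% per node now that classic Dennard scaling has ended, so a fixed die budget can light a shrinking fraction of the transistors:

```python
# Toy model of dark silicon growth under a fixed die power budget.
# All figures are illustrative assumptions, not measured data.

BUDGET_W = 2.0            # fixed die power budget in watts
transistors = 100e6       # transistor count at the starting node
power_per_t = 20e-9       # watts per active transistor at the starting node

for node in ["45nm", "32nm", "22nm", "16nm", "11nm"]:
    total_power = transistors * power_per_t
    lit_fraction = min(1.0, BUDGET_W / total_power)
    print(f"{node}: {lit_fraction:.0%} of transistors can be active")
    transistors *= 2.0    # Moore's Law: 2x transistors per node
    power_per_t *= 0.7    # post-Dennard: only ~30% power drop per node
```

With these assumed constants, the active fraction shrinks by a factor of 1/(2 × 0.7) ≈ 0.71 at every node – 100% of the die is usable at the first node but only about a quarter of it four nodes later, and the gap widens from there.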

The solution presented at Hot Chips 22 by UCSD cannot increase the power budget of an SoC, but it can put to work dark silicon that would otherwise remain unused. The basic idea was simplicity itself: instead of employing a large power-hungry processor that expends a lot of unnecessary energy in driving logic that may not be needed for a particular application, why not create a large number of very efficient small C-cores (UCSD’s term) that could execute very short sequences of the application code very efficiently?

Imagine a processor tile such as the one in MIT’s original design, which through further refinement became Tilera’s first tile-configured chip. UCSD is envisioning a similar partition using tiles, but the tiles are different. The main and comparatively power-hungry processor of UCSD’s tile is still in place, but now, surrounding the processor’s data cache, we see a number of special-purpose compiler-generated C-cores.

According to UCSD, these miniature Tensilica-like or ARC-like workload-optimized ISA cores can execute the short repetitive code common to a few applications more efficiently than the main processor. The main processor in UCSD’s tile – a MIPS engine – still needs to execute the program sequences that will not gain efficiency if they are migrated to C-cores. We don’t know whether the C-cores should be considered coprocessors to the main processor such as might be created by a Critical Blue approach, or slave processors.

UCSD’s presentation did not discuss the limitations imposed by data cache bandwidths on the number of C-cores that by design cannot communicate with one another and must use the cache to share operands and results of computations. Nor did the presentation discuss the performance degradation and delays related to loading instructions in each and every C-core or the expected contention on accessing off-chip memory. We would like to see these details made public after the researchers take the next step in their work.

UCSD did present many charts describing the dark silicon problem plus charts depicting an application of C-cores to Android. A benchmark comparison chart was used to illustrate that the C-core approach could show up to 18x better energy efficiency (13.7x on average). The chart would imply that one could run up to 18x more processing tiles on a dense chip that had a large area of dark silicon ready for work, but the presentation did not investigate the resulting performance – we know that in most applications the relationship will not be linear.
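One way to see why an 18x energy win does not translate into an 18x performance win is a toy contention model (my own sketch, not from the UCSD presentation): even if the power budget admits many more active tiles, a shared memory interface caps the aggregate throughput they can achieve. The per-tile and memory figures below are assumed for illustration:

```python
# Toy model: more powered-up tiles, but a shared off-chip memory
# interface caps aggregate throughput, so scaling is sub-linear.
# All figures are hypothetical.

PER_TILE_OPS = 1.0        # normalized throughput of one compute-bound tile
MEM_CAP_OPS = 8.0         # aggregate throughput the memory system can feed

for tiles in [1, 2, 4, 8, 18]:
    achieved = min(tiles * PER_TILE_OPS, MEM_CAP_OPS)
    print(f"{tiles:2d} tiles -> {achieved:.1f}x throughput "
          f"({achieved / tiles:.0%} of linear scaling)")
```

Under these assumptions the first few tiles scale perfectly, but at 18 tiles the memory ceiling holds the system to 8x throughput – less than half of linear scaling – which is exactly the kind of gotcha the presentation left unexplored.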

I liked the results charts and the ideas behind them, but I was concerned that the work had not been carried out to the level of a complete SoC plus memory, which would help expose the gotchas in the approach. I was disappointed to see that most of the slides presented by the university reminded me of marketing presentations made by the industry. The academic presentation reminded me once more that some universities are looking to obtain patents and trying to accumulate IP portfolios while their researchers may be positioning their ideas to obtain the next year’s sponsors and, later, venture capital for a startup.

Exploring multiprocessing extremes

Friday, August 6th, 2010 by Robert Cravotta

Extreme multiprocessing is an interesting topic because it can mean vastly different things to different people depending on what types of problems they are trying to solve.

At one end of the spectrum, there are multiprocessing designs that maximize the amount of processing work that the system performs within a unit of time while staying within an energy budget to perform that work. These types of designs – often high-compute parallel-processing workstations or server systems – are able to deliver a higher processing throughput rate at lower power dissipation than if they used a hypothetical single-core processor running at a significantly faster clock rate. The multiple processor cores in these types of systems might operate in the GHz range.

Although multiprocessing architectures are one approach to increasing processing throughput within an energy budget, for the past few years I have been unofficially hearing from high-performance processor suppliers that some of their customers are asking for faster processors despite the higher energy budget. These designers understand how to build their software systems using a single instruction-stream model. Contemporary programming models and tools fall short in enabling software developers to scale their code across multiple instruction streams. The increased software complexity and risks outweigh the complexity of managing the higher thermal and energy thresholds.
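The reluctance these designers express is consistent with Amdahl’s law: if a fraction of the code remains serial, adding cores quickly stops paying off, while a faster single core speeds up everything. A quick sketch (the 20% serial fraction is an assumed figure, not from any supplier):

```python
# Amdahl's law: why a faster single core can beat more cores when
# part of the workload stays serial. Illustrative numbers only.

def amdahl_speedup(serial_fraction, cores):
    """Ideal speedup on `cores` cores when `serial_fraction`
    of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

for cores in [2, 4, 8, 16]:
    s = amdahl_speedup(0.2, cores)   # assume 20% of the code is serial
    print(f"{cores:2d} cores: {s:.2f}x speedup")

# A single core clocked 2x faster gives a clean 2.0x on unchanged code,
# while 16 cores with 20% serial code top out at 4.0x - and only after
# the software has been restructured for multiple instruction streams.
```

With 20% serial code, even infinite cores cap out at 5x, which helps explain why some teams prefer a hotter, faster single core over rewriting their software.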

At the other end of the spectrum, there are multiprocessing designs that rely on multiple processor cores to partition the workload among independent resources to minimize resource dependencies and design complexity. These types of designs are the meat and potatoes of the embedded multiprocessing world. The multiple processor cores in these types of systems might operate in the tens to hundreds of MHz range.

Let me clarify how I am using multiprocessing to avoid confusion. Multiprocessing designs use more than a single processing core, working together (even indirectly) to accomplish some system-level function. I do not assume what type of cores the design uses, nor whether they are identical, similar, or dissimilar. I also do not assume that the cores are co-located in the same silicon die, chip package, board, or even chassis, because the primary differences between these implementation options are energy dissipation and the latency of the data flow. The design concepts are similar at each scale as long as the implementation meets the energy and latency thresholds. To further clarify, multicore is a subset of multiprocessing where the processing cores are co-located in the same silicon die.

I will try to identify the size, speed, energy, and processing width limits of multiprocessing systems for each of these types of designers. In the next extreme processing article, I will explore how scaling multiprocessing upwards might change basic assumptions about processor architectures.