Entries Tagged ‘Dark Silicon’

Forward to the Past: A Different Way to Cope with Dark Silicon

Tuesday, February 8th, 2011 by Max Baron

Leigh’s comment on whether dark silicon is a design problem or a fundamental law presents an opportunity to explore an “old” processor architecture, the Ambric architecture, whose implementation made use of dark silicon but did not escape the limitations that power budgets impose on Moore’s Law.

Mike Butts introduced the Ambric architecture at the 2006 Fall Microprocessor Forum, an event at which I served as technical content chairperson. Tom Halfhill, my colleague at the time, wrote an article about Ambric’s approach, and in February 2007 Ambric won In-Stat’s 2006 Microprocessor Report Analysts’ Choice Award for Innovation.

I’ll try to describe the architecture for those who are not familiar with it.

The Ambric architecture’s configuration went beyond the classical MIMD definition. It was described as a globally asynchronous, locally synchronous (GALS) architecture, a description that for chip designers held connotations of clockless processing. The description, however, does not detract in any way from the innovation or from the award for which I voted.

The streaming Ambric architecture as I saw it at the time could be described as a heterogeneous mix of two types of processing cores plus memories and interconnect.

Ambric’s programming innovation involved software objects assigned to specific combinations of cores and/or memory whose execution could proceed in their own time and at their own clock rate, which is probably the reason for the software-defined term “asynchronous architecture.” The cores themselves were clocked, and some could be clocked at different rates, but probably in sync with one another to avoid metastability.

The two types of processor cores provided by the Am2045, the chip introduced at the event, were described as SRs (Streaming RISC), engaged mainly in managing communications and utilities for the second type of core, the high-performance SRDs (Streaming RISC with DSP Extensions), which were the heavy-lifting cores of the architecture.

Perhaps the most important part of Ambric’s innovation was the concept of objects assigned to combinations of one or more cores that could be considered as software/hardware black boxes. The black boxes could be interconnected via registers and control that made them behave as if they were FIFOs.

I believe that this is the most important part of the innovation because it almost removes the overhead of thread synchronization. With the removal of this major obstacle to taking advantage of highly parallelizable workloads, such as those encountered in DSP applications, Ambric opened the architecture to execution by hundreds and possibly thousands of cores, but at the price of reduced generality and the need for more human involvement in routing objects over the interconnect for the best performance of processor cores and memory.
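To make the object-and-channel idea more concrete, here is a minimal sketch in C of two software “objects” connected by a shallow FIFO channel, in the spirit of Ambric’s register-based channels. Everything here (the channel type, its depth, and the function names) is my own illustration, not Ambric’s tools or hardware; the real implementation assigned objects to SR/SRD cores and stalled them in hardware when a channel was full or empty.

```c
/* Minimal sketch of an Ambric-style object-and-channel model.
 * Hypothetical illustration only: names, sizes and the single-threaded
 * scheduler are mine, not Ambric's tools or hardware.
 */
#include <stdio.h>

#define CHAN_DEPTH 4            /* hardware channels were shallow register FIFOs */

typedef struct {
    int data[CHAN_DEPTH];
    int head, tail, count;
} channel_t;                    /* stands in for a register-based FIFO channel */

static int chan_put(channel_t *c, int v)   /* returns 0 if the channel is full  */
{
    if (c->count == CHAN_DEPTH) return 0;  /* producer object would stall here  */
    c->data[c->tail] = v;
    c->tail = (c->tail + 1) % CHAN_DEPTH;
    c->count++;
    return 1;
}

static int chan_get(channel_t *c, int *v)  /* returns 0 if the channel is empty */
{
    if (c->count == 0) return 0;           /* consumer object would stall here  */
    *v = c->data[c->head];
    c->head = (c->head + 1) % CHAN_DEPTH;
    c->count--;
    return 1;
}

/* "Object" 1: a producer that streams samples into its output channel. */
static void source_object(channel_t *out, int *next)
{
    while (chan_put(out, *next))           /* run until back-pressure stalls it */
        (*next)++;
}

/* "Object" 2: a consumer that scales each sample, standing in for DSP work. */
static void sink_object(channel_t *in)
{
    int v;
    while (chan_get(in, &v))
        printf("processed %d -> %d\n", v, v * 2);
}

int main(void)
{
    channel_t ch = {0};
    int next = 0;

    /* Each object runs "in its own time"; the only coupling is the channel. */
    for (int step = 0; step < 3; step++) {
        source_object(&ch, &next);
        sink_object(&ch);
    }
    return 0;
}
```

The point of the sketch is that the producer and consumer share no state other than the channel, so no locks or explicit thread synchronization are needed.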

At a smaller technology node, the Ambric architecture can cover with cores and memories a die that provides, for example, four times more transistors, but the architecture can’t quadruple its computing speed (switchings per second) because of power-budget limitations, whether they are imposed by temperature limits or by battery capacity. Designers can only decrease the chip’s VDD to match the chip’s power dissipation to its power budget, but in doing so they must reduce clock frequency and the associated performance.
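A first-order way to see why lowering VDD drags performance down with it is the familiar dynamic-power expression for CMOS (a textbook simplification, not a figure from Ambric’s documentation):

```latex
% Textbook first-order model of dynamic CMOS power:
%   alpha = activity factor, C = switched capacitance, f = clock frequency
P_{\mathrm{dyn}} \;\approx\; \alpha \, C \, V_{DD}^{2} \, f
```

With roughly four times the transistors, the switched capacitance C grows by about the same factor; holding P_dyn to the same power budget then forces the product VDD^2 * f down by about four, and because a lower VDD also slows the transistors, the clock frequency has to fall as well.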

The idea of connecting “black boxes” originated with designers of analog computers and hybrid analog/digital computers at least 60 years ago. It was the approach employed in designing computers just before the introduction of the Von Neumann architecture. Ambric’s innovation that created a software/hardware combination is probably independent of the past.

Compared with Ambric’s approach, the UCSD/MIT idea is based on a number of different, efficient, compiler-created small cores specialized to execute the short code sequences critical to the performance of the computer. The UCSD/MIT architecture can enjoy more generality in executing workloads, on condition that specific small cores are created for the types of workloads targeted. By raising small-core frequency without creating dangerous hot spots, the architecture can deliver performance yet keep within power-budget boundaries, but it too can’t deliver increased compute performance at the same rate as Moore’s Law delivers transistors.

Dark Silicon Redux: System Design Problem or Fundamental Law?

Tuesday, February 1st, 2011 by Max Baron

Like a spotlight picking out an object in total darkness, the presentation of a solution to a problem may sometimes highlight one aspect while obscuring others. Such were the dark silicon problem and the solution by UCSD and MIT that was presented at Hot Chips 2010. Such was also the article I published in August describing the two universities’ idea that could increase a processor’s efficiency.

At the time of that writing, it appeared that the idea would be followed in time by many others that together would overcome the dark silicon problem. All would be well: Moore’s Law that provides more transistors would also provide higher compute performance.

The term ‘dark silicon’ was probably coined by ARM. ARM described dark silicon as a problem that must be solved by innovative design, but can it be completely solved? Can design continue to solve the problem ‘forever’? To answer the question, we next take a qualitative look at the dependencies among the system, the die, and compute performance.

According to a 2009 article published in EE Times, ARM CTO Mike Muller said: “Without fresh innovations, designers could find themselves by 2020 in an era of ‘dark silicon,’ able to build dense devices they cannot afford to power.” Mr. Muller also noted in the same article that “. . . a 11nm process technology could deliver devices with 16 times more transistors . . . but those devices will only use a third as much energy as today’s parts, leaving engineers with a power budget so pinched they may be able to activate only nine percent of those transistors.”

The use of “only” in the quote may be misunderstood as indicating lower power consumption and higher efficiency. I believe it indicated disappointment that, compared with today’s parts, the power consumption would not drop to at least one sixteenth of its 2009 value to match the rise in the number of transistors.

The term “power budget” can have more than one interpretation. In tethered systems pursuing peak performance, it can be the worst-case power that is die-temperature related. In mobile systems, it may have a different interpretation: it may be related to the battery capacity and the percentage of overall system power allocated to the processor. Both interpretations will limit a chip’s power-performance, but the limiting factors will be different.

The architects at UCSD/MIT made the best of the unusable silicon problem by surrounding a general-purpose processor core with very efficient small cores located in the dark silicon area. The cores could execute very short sequences of the application code faster and more efficiently than a general-purpose processor but, to keep within the boundary of a power budget, they were probably activated only when needed by the program.

The universities have shown a capability to use part of the dark silicon transistors. It would be interesting to find whether, as transistor numbers increase, the power budget might be dictated by some simple parameters. Finding some limits would rule out dark silicon as a mere problem whose solution will allow designers to utilize 100% of a die to obtain increased performance. In some implementations, the limits could define the best die size and technology of a SoC.

In a system willing to sacrifice power consumption for performance, the power budget should be equal to or smaller than the power that can be delivered to the die without causing damage. It is the power (energy/time) that in steady state can be removed from the die by natural and forced cooling without raising the die’s temperature to a level that would reduce the die’s reliability or even destroy it.

If we allow ourselves the freedom sometimes employed by physicists in simplifying problems, we can say that for a uniformly cooled die of infinite heat conductivity (hot spots can’t occur), the heat generated by circuits, and therefore the power budget, are both distributed evenly across the area of the die and are proportional to it (P_budget ∝ A_die . . . the larger the die, the higher the power budget).

Simplifying things once more, we define a die-wide average energy E_avg in joules required for one single imaginary circuit (the average circuit) to switch state. The power budget (energy divided by time) can now be expressed as the power consumed by the single circuit: P_budget ≈ f * E_avg, where f is the switching frequency of the single average circuit. The actual frequency of all logic on the chip would be f_actual = f / n, where n is the average number of switchings occurring at the same time.
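As a purely illustrative back-of-the-envelope example (the numbers below are mine, chosen only for round arithmetic), the same relations show how the power budget caps the total switching rate:

```latex
% Assume P_budget = 2 W and E_avg = 10^{-12} J per average-circuit switch.
f = \frac{P_{\mathrm{budget}}}{E_{\mathrm{avg}}}
  = \frac{2\ \mathrm{W}}{10^{-12}\ \mathrm{J}}
  = 2 \times 10^{12}\ \text{switchings per second}
% With n = 1,000 circuits switching at the same time, the chip-wide clock
% cannot exceed f_actual = f / n = 2 GHz without exceeding the budget.
```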

In other words, assuming die-area cooling, with all other semiconductor conditions (a given technology node, fabrication, leakage, environment parameters and the best circuit-design innovations) and cooling all kept constant, the peak computing performance obtainable (the allowable number of average switchings per second) is directly related to the die area. Otherwise the chip will be destroyed. The fate of 3D multi-layer silicon will be worse, since the sandwiched layers will enjoy less cooling than the external layers.

Power budgets assigned to processors in mobile systems are more flexible but can be more complex to determine. Camera system designers, for example, can trade off finder screen size and brightness, or fps (frames per second), or zoom and auto focus during video capture, for more processor power. Smart phones that allow non-real-time applications to run slower can save processor power. And most mobile systems will profit from heterogeneous configurations employing CPUs and hard-wired low-power accelerators.

Power budgets in mobile systems will also be affected by software and marketing considerations. Compilers affect the energy consumed by an application based on the number and kind of instructions required for the job to complete. Operating systems are important in managing a system’s resources and controlling the system power states. And, in addition to software and workload considerations the ‘bare core’ power consumption associated with a SoC must compete with claims made by competitors.

Just as local die temperature and power dissipation terminated the period when higher clock frequency meant more performance, the limitations imposed by the allocated power budget or die area will curtail the reign of multiple-core configurations as a means of increasing performance.

Most powerful 3D computer

Many computer architects like to learn from existing architectures. It was interesting therefore to see how the most powerful known 3D computer is working around its power limitations. It was however very difficult to find much data on the Internet. The data below was compiled from a few sources and the reader is asked to help corroborate it and/or provide more reliable numbers and sources:

An adult human brain is estimated to contain 10^11 (100 billion) neurons. A firing neuron consumes an average energy of 10^-9 joules. The neuron’s maximum firing rate is estimated by some papers to be 1,000 Hz. Normal operating frequencies are lower, at 300 Hz to 400 Hz.

The maximum power that would be generated by the human brain with all neurons firing at the maximum frequency of 1,000 Hz is 10^3 * 10^11 * 10^-9 = 10^5 joule/second = 100,000 watts — enough to destroy the brain and some of its surroundings.

Some papers estimate the actual power consumption of the brain at 10W, while others peg it at 100W. According to still other papers, the power averaged over 24 hours is 20W. Yet even the highest number seems acceptable, since the brain’s 3D structure is blood-and-evaporation cooled and kept at optimal temperature. Imagine keeping a 100W heat source cool by blood flow! Performance-wise, the 10W and 100W power estimates imply that the brain is delivering 10^10 or 10^11 neuron firings per second. Using the considerations applied to semiconductor die usage, the brain may be running at 0.01% to 0.1% of its neuron capacity, possibly turning semi-“dark brain” sections fully “on” or partly “off” depending on workload. Compare these percentages with the much higher 9% utilization factor forecast for 11nm silicon.
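For clarity, here is how the utilization percentages follow from the rough estimates above (taking the 10W case; the 100W case simply scales the numerator by ten):

```latex
% Peak capacity: 10^{11} neurons x 10^{3} firings/s = 10^{14} firings per second.
\frac{P_{\mathrm{brain}} / E_{\mathrm{firing}}}{N_{\mathrm{neurons}} \, f_{\mathrm{max}}}
  = \frac{10\,\mathrm{W} / 10^{-9}\,\mathrm{J}}{10^{11} \times 10^{3}\,\mathrm{s^{-1}}}
  = \frac{10^{10}}{10^{14}}
  = 10^{-4} = 0.01\%
```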

The highly dense silicon chip and the human brain are affected by the same laws of physics.

In semiconductor technology, as Moore’s Law places more transistors on the same-sized die or makes the die smaller, the power budget needed for full transistor utilization moves in the opposite direction, since full utilization requires a larger die area. Unless cost-acceptable extreme cooling can track technology nodes by removing, for example, about five times more heat from the reference die at 11nm, or technology finds ways to reduce a core’s power dissipation by the same factor, Moore’s Law and computing performance will follow different roadmaps.

In mobile applications the limit is affected by battery capacity versus size and weight. According to some battery developers, capacity is improving slowly, as vendors spend more effort on creating custom batteries for big suppliers of mobile systems than on research. I estimate battery capacity to be improving at approximately 6% per year, leaving Moore’s Law without support, since it doubles transistor numbers every two years.
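To put the gap in rough numbers (using my 6%-per-year estimate and the two-year doubling above; the arithmetic is illustrative only):

```latex
% Over ten years: transistor count grows by 2^{10/2} = 32x,
% while battery capacity grows by only 1.06^{10} \approx 1.8x,
% leaving a gap of roughly 32 / 1.8 \approx 18x to be closed elsewhere.
\frac{2^{10/2}}{1.06^{10}} \;\approx\; \frac{32}{1.8} \;\approx\; 18
```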

UCSD/MIT’s approach is not a ‘waste’ of transistors if its use of dark silicon can deliver higher performance within the boundaries of the power budget. The Von Neumann architecture was built to save components, since it was created at a time when components were expensive, bulky and hard to manufacture. Our problem today and in the near future is to conceive of an architecture that can use an affluence of components.

UCSD Turns On the Light on Dark Silicon

Friday, August 27th, 2010 by Max Baron

The session on SoCs at Hot Chips 22 featured only one academic paper among several presentations that combined technical detail with a smidgeon of marketing. Originating from a group of researchers from UCSD and MIT, the presentation titled “GreenDroid: A Mobile Application Processor for a Future of Dark Silicon,” introduced the researchers’ solution to the increase of dark silicon as the fabrication of chips evolves toward smaller semiconductor technology nodes.

The reference to dark silicon seems to have been picked up by the press when, in 2009, Mike Muller, ARM’s CTO, described the increasing limitations imposed by power consumption on driving and utilizing the increasing numbers of transistors provided by technology nodes down to 11nm. As described by the media, Mike Muller’s warning spoke of power budgets that could not be increased to keep up with the escalating number of transistors provided by smaller geometries.

Why have power budgets? The word “budget” seems to imply that designers have permission to increase power by arbitrarily setting a higher budget. Carrying power increases to extreme levels, however, will generate temperatures that will destroy the chip or drastically reduce its lifetime. Thus, a fixed reference die, whose power budget is almost fixed due to the die’s fixed dimensions, will reach a semiconductor technology node where only a small percentage of its Moore’s Law–predicted transistors can be driven. The remaining transistors are the dark silicon.

The solution presented at Hot Chips 22 by UCSD cannot increase the power budget of a SoC, but it can employ more of the dark silicon that would otherwise remain unused. The basic idea was simplicity itself: instead of employing a large power-hungry processor that expends a lot of unnecessary energy driving logic that may not be needed for a particular application, why not create a large number of very efficient small C-cores (UCSD’s term) that could execute very short sequences of the application code very efficiently?

Imagine a processor tile such as the one encountered in MIT’s original design, which through further improvement became Tilera’s first tile-configured chip. UCSD is envisioning a similar partition using tiles, but the tiles are different. The main and comparatively power-hungry processor of UCSD’s tile is still in place but now, surrounding the processor’s data cache, we see a number of special-purpose compiler-generated C-cores.

According to UCSD, these miniature Tensilica-like or ARC-like workload-optimized ISA cores can execute the short repetitive code common to a few applications more efficiently than the main processor. The main processor in UCSD’s tile – a MIPS engine – still needs to execute the program sequences that will not gain efficiency if they are migrated to C-cores. We don’t know whether the C-cores should be considered coprocessors to the main processor such as might be created by a Critical Blue approach, or slave processors.
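As a conceptual sketch only (the function names and the offload boundary are hypothetical; UCSD’s C-cores are generated by its compiler, not written by hand), the software-side partition might look like this: a short, hot inner loop is the kind of region that would become a specialized C-core, while the branchy setup code stays on the main MIPS processor, with operands shared through the data cache:

```c
/* Hypothetical sketch of the C-core partitioning idea; not UCSD's interface.
 * The hot loop below is the kind of short, repetitive region a compiler
 * could turn into a specialized, energy-efficient hardware core.
 */
#include <stddef.h>
#include <stdio.h>

/* Hot region: on GreenDroid-style hardware this body would be implemented
 * by a compiler-generated C-core; here it is plain C for illustration. */
static void saxpy_ccore(float a, const float *x, float *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];     /* operands come and go via the D-cache */
}

/* Cold region: control-heavy code that would stay on the main processor. */
void process_buffer(float gain, const float *in, float *out, size_t n)
{
    if (n == 0 || in == NULL || out == NULL)
        return;                     /* branchy setup: little gain from a C-core */

    /* The compiler would insert a dispatch here: wake the matching C-core,
     * let it run the hot loop, then power it back down (dark again). */
    saxpy_ccore(gain, in, out, n);
}

int main(void)
{
    float x[4] = {1, 2, 3, 4}, y[4] = {0, 0, 0, 0};
    process_buffer(2.0f, x, y, 4);
    for (int i = 0; i < 4; i++)
        printf("%.1f ", y[i]);      /* prints: 2.0 4.0 6.0 8.0 */
    printf("\n");
    return 0;
}
```

In the real design the dispatch and the power-gating of idle C-cores would be handled by the hardware and the compiler, not by an ordinary function call.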

UCSD’s presentation did not discuss the limitations imposed by data-cache bandwidth on the number of C-cores, which by design cannot communicate with one another and must use the cache to share operands and results of computations. Nor did the presentation discuss the performance degradation and delays related to loading instructions into each and every C-core, or the expected contention on accessing off-chip memory. We would like to see these details made public after the researchers take the next step in their work.

UCSD did present many charts describing the dark silicon problem plus charts depicting an application of C-cores to Android. A benchmark comparison chart was used to illustrate that the C-core approach could show up to 18x better energy efficiency (13.7x on average). The chart would imply that one could run up to 18x more processing tiles on a dense chip that had a large area of dark silicon ready for work, but the presentation did not investigate the resulting performance; we know that in most applications the relationship will not be linear.

I liked the result charts and the ideas but was worried that they were not carried out to the level of a complete SoC plus memory to help find the gotchas in the approach. I was disappointed to see that most of the slides presented by the university reminded me of marketing presentations made by the industry. The academic presentation reminded me once more that some universities are looking to obtain patents and trying to accumulate IP portfolios while their researchers may be positioning their ideas to obtain the next year’s sponsors and later, venture capital for a startup.