Dark Silicon Redux: System Design Problem or Fundamental Law?

Tuesday, February 1st, 2011 by Max Baron

Like a spotlight picking out an object in total darkness, the presentation of a solution to a problem may sometimes highlight one aspect while obscuring others. Such were the dark silicon problem and the solution by UCSD and MIT that was presented at Hot Chips 2010. Such was also the article I published in August describing the two universities’ idea that could increase a processor’s efficiency.

At the time of that writing, it appeared that the idea would be followed in time by many others that together would overcome the dark silicon problem. All would be well: Moore’s Law that provides more transistors would also provide higher compute performance.

The term ‘dark silicon’ was probably coined by ARM. ARM described dark silicon as a problem that must be solved by innovative design, but can it be completely solved? Can design continue to solve the problem ‘forever’? To answer the question, we next try to take a qualitative look at the dependencies among the system, the die, and compute performance.

According to a 2009 article published in EE Times, ARM CTO Mike Muller said: “Without fresh innovations, designers could find themselves by 2020 in an era of ‘dark silicon,’ able to build dense devices they cannot afford to power.” Mr. Muller also noted in the same article that “. . . a 11nm process technology could deliver devices with 16 times more transistors . . . but those devices will only use a third as much energy as today’s parts, leaving engineers with a power budget so pinched they may be able to activate only nine percent of those transistors.”

The use of “only” in the quote may be misunderstood to indicate lower power consumption and higher efficiency. I believe that it indicated disappointment that, compared with today’s parts, the power consumption would not drop to one sixteenth of its 2009 value or less, which is what matching the rise in the number of transistors would require.

The term “power budget” can have more than one interpretation. In tethered systems pursuing peak performance, it can be the worst-case power that is related to die temperature. In mobile systems, it may have a different interpretation: it may be related to the battery capacity and the percentage of overall system power allocated to the processor. Both interpretations will limit a chip’s power-performance, but the limiting factors will be different.

The architects at UCSD/MIT made the best of the unusable silicon problem by surrounding a general-purpose processor core with very efficient small cores located in the dark silicon area. The cores could execute very short sequences of the application code faster and more efficiently than a general-purpose processor but, to keep within the boundary of a power budget, they were probably activated only when needed by the program.

The universities have shown a capability to use part of the dark silicon transistors. It would be interesting to find whether, as transistor numbers increase, the power budget might be dictated by some simple parameters. Finding some limits would rule out dark silicon as a mere problem whose solution will allow designers to utilize 100% of a die to obtain increased performance. In some implementations, the limits could define the best die size and technology of a SoC.

In a system willing to sacrifice power consumption for performance, the power budget should be equal to or smaller than the power that can be delivered to the die without causing damage. It is the power (energy/time) that in steady state can be removed from the die by natural and forced cooling without raising the die’s temperature to a level that would reduce the die’s reliability or even destroy it.

If we allow ourselves the freedom sometimes employed by physicists in simplifying problems, we can say that for a uniformly cooled die of infinite heat conductivity (hot spots can’t occur), the heat generated by circuits, and therefore the power budget, are both distributed evenly across the area of the die and are proportional to it (P_budget ∝ A_die: the larger the die, the higher the power budget).

Simplifying things once more, we define a die-wide average energy E_avg in joules required for one single imaginary circuit (the average circuit) to switch state. The power budget (energy divided by time) can now be expressed as the power consumed by the single circuit: P_budget ~ f * E_avg, where f is the frequency of switching the single average circuit. The actual frequency of all logic on the chip would be f_actual = f / n, where n is the average number of switchings occurring at the same time.

In other words, assuming die-area cooling, and with all other semiconductor conditions (a given technology node, fabrication, leakage, environment parameters and the best circuit design innovations) and cooling kept constant, the peak computing performance obtainable (the allowable number of average switchings per second) is directly related to the die area. Otherwise the chip will be destroyed. The fate of 3D multi-layer silicon will be worse, since the sandwiched layers will enjoy less cooling than the external layers.
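The simplified model above can be sketched in a few lines of Python. This is only an illustration of the three relations (P_budget ∝ A_die, P_budget ~ f * E_avg, f_actual = f / n); the cooling capacity per unit area, the average switching energy and the value of n are made-up assumptions, not measured data.

```python
# Illustrative model only: every numeric value below is a made-up assumption.

def power_budget_w(die_area_mm2: float, coolable_w_per_mm2: float) -> float:
    """P_budget is proportional to A_die for a uniformly cooled die."""
    return die_area_mm2 * coolable_w_per_mm2


def switching_rate_hz(p_budget_w: float, e_avg_j: float) -> float:
    """Invert P_budget ~ f * E_avg to get f, the allowable switchings per second."""
    return p_budget_w / e_avg_j


def actual_frequency_hz(f_hz: float, n_simultaneous: float) -> float:
    """f_actual = f / n, with n average circuits switching at the same time."""
    return f_hz / n_simultaneous


COOLABLE_W_PER_MM2 = 0.5   # hypothetical cooling capacity per unit area
E_AVG_J = 1e-15            # hypothetical average switching energy
N_SIMULTANEOUS = 1e7       # hypothetical number of simultaneous switchings

for area_mm2 in (50.0, 100.0, 200.0):
    budget = power_budget_w(area_mm2, COOLABLE_W_PER_MM2)
    f = switching_rate_hz(budget, E_AVG_J)
    f_actual = actual_frequency_hz(f, N_SIMULTANEOUS)
    print(f"{area_mm2:6.0f} mm^2: budget {budget:6.1f} W, "
          f"{f:.2e} switchings/s, f_actual {f_actual:.2e} Hz")
```

With everything else held constant, doubling the die area in this sketch doubles both the power budget and the allowable switching rate, which is the proportionality the argument relies on.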

Power budgets assigned to processors in mobile systems are more flexible but can be more complex to determine. Camera system designers, for example, can trade off finder screen size and brightness, fps (frames per second), or zoom and auto focus during video capture for more processor power. Smart phones that allow non-real-time applications to run slower can save processor power. And, most mobile systems will profit from heterogeneous configurations employing CPUs and hard-wired low power accelerators.

Power budgets in mobile systems will also be affected by software and marketing considerations. Compilers affect the energy consumed by an application based on the number and kind of instructions required for the job to complete. Operating systems are important in managing a system’s resources and controlling the system power states. And, in addition to software and workload considerations the ‘bare core’ power consumption associated with a SoC must compete with claims made by competitors.

If local die temperature and power dissipation terminated the period where higher clock frequency meant more performance, the limitations imposed by allocated power budget or die area will curtail the reign of multiple core configurations as a means of increasing performance.

Most powerful 3D computer

Many computer architects like to learn from existing architectures. It was interesting therefore to see how the most powerful known 3D computer is working around its power limitations. It was however very difficult to find much data on the Internet. The data below was compiled from a few sources and the reader is asked to help corroborate it and/or provide more reliable numbers and sources:

An adult human brain is estimated to contain 10^11 (100 billion) neurons. A firing neuron consumes an average energy of 10^-9 joules. The neuron’s maximum firing rate is estimated by some papers to be 1,000 Hz. Normal operating frequencies are lower at 300 Hz to 400 Hz.

The maximum power that would be generated by the human brain with all neurons firing at the maximum frequency of 1,000 Hz is 10^3 * 10^11 * 10^-9 = 10^5 joules/second = 100,000 watts, enough to destroy the brain and some of its surroundings.

Some papers estimate the actual power consumption of the brain at 10W while others peg it at 100W. According to still other papers the power averaged over 24 hours is 20W. Yet, even the highest number seems acceptable since the brain’s 3D structure is blood-and-evaporation cooled and kept at optimal temperature. Imagine keeping a 100W heat source cool by blood flow! Performance-wise the 10W and 100W power estimates imply that the brain is delivering 10^10 or 10^11 neuron firings per second. Using the considerations applied to semiconductor die usage, the brain may be running at 0.01% or up to 0.1% of its neuron capacity, possibly turning semi-“dark brain” sections fully “on” or partly “off” depending on workload. Compare these percentages with the much higher 9% utilization factor forecasted for 11nm silicon.
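The brain arithmetic can be checked with a short Python sketch. It uses only the rough estimates quoted in this article (neuron count, energy per firing, maximum firing rate, and the 10W and 100W power figures); the function and constant names are mine, and the results are only as reliable as those estimates.

```python
# Rough estimates taken from the article; they are not authoritative data.
NEURONS = 1e11               # estimated neurons in an adult human brain
ENERGY_PER_FIRING_J = 1e-9   # estimated average energy of one neuron firing
MAX_FIRING_RATE_HZ = 1e3     # estimated maximum firing rate of a neuron


def peak_power_w() -> float:
    """Power if every neuron fired at the maximum rate."""
    return NEURONS * MAX_FIRING_RATE_HZ * ENERGY_PER_FIRING_J


def firings_per_second(power_w: float) -> float:
    """Neuron firings per second that a given power level can pay for."""
    return power_w / ENERGY_PER_FIRING_J


def utilization(power_w: float) -> float:
    """Fraction of the brain's maximum firing capacity used at power_w."""
    return firings_per_second(power_w) / (NEURONS * MAX_FIRING_RATE_HZ)


print(f"peak power: {peak_power_w():,.0f} W")
for watts in (10.0, 100.0):
    print(f"{watts:5.0f} W -> {firings_per_second(watts):.0e} firings/s, "
          f"utilization {utilization(watts):.2%}")
```

The 20W figure averaged over 24 hours can be passed to `utilization` in the same way.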

The highly dense silicon chip and the human brain are affected by the same laws of physics.

In semiconductor technology, as Moore’s law places more transistors on the same-sized die or makes the die smaller, the power budget needed for full transistor utilization moves in the opposite direction since it requires larger die areas. Unless cost-acceptable extreme cooling can track technology nodes by removing, for example at 11nm, about five times more heat from the reference die, or technology finds ways to reduce a core’s power dissipation by the same factor, Moore’s Law and computing performance will be following different roadmaps.
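One plausible reading of the “about five times” figure follows from the numbers in the ARM quote: 16 times more transistors, each using a third as much energy. The Python sketch below assumes, as a simplification of mine, that every transistor is active and switches at an unchanged rate; it covers only the heat-removal factor and does not try to reproduce the nine percent figure.

```python
from fractions import Fraction

# Figures from the ARM quote; the interpretation below is an assumption.
TRANSISTOR_FACTOR = Fraction(16)    # 16 times more transistors at 11nm
ENERGY_FACTOR = Fraction(1, 3)      # each uses a third as much energy


def relative_heat(transistor_factor: Fraction, energy_factor: Fraction) -> Fraction:
    """Heat of a fully active die relative to the reference die."""
    return transistor_factor * energy_factor


def required_energy_factor(transistor_factor: Fraction) -> Fraction:
    """Per-transistor energy factor that would keep total heat unchanged."""
    return 1 / transistor_factor


heat = relative_heat(TRANSISTOR_FACTOR, ENERGY_FACTOR)
print(f"heat to remove: {heat} = {float(heat):.2f} times the reference die")
print(f"energy factor needed for equal heat: {required_energy_factor(TRANSISTOR_FACTOR)}")
```

Under these assumptions the fully utilized die produces 16/3 times the heat of the reference die, which rounds to the factor of five used above.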

In mobile applications the limit is affected by battery capacity vs. size and weight. According to some battery developers, capacity is improving slowly as vendors spend more effort in creating custom batteries for big suppliers of mobile systems than in research. I’m estimating battery capacity to improve at approximately 6% per year, leaving Moore’s law without support since it doubles transistor numbers every two years.
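The gap between the two growth rates compounds quickly. The Python sketch below compares them, taking the 6% per year battery estimate and the two-year transistor doubling period at face value; both are the rough figures used in this article, not measurements, and the function names are mine.

```python
def battery_growth(years: float, annual_rate: float = 0.06) -> float:
    """Battery capacity relative to today, compounding at annual_rate."""
    return (1.0 + annual_rate) ** years


def transistor_growth(years: float, doubling_period_years: float = 2.0) -> float:
    """Transistor count relative to today, doubling every doubling period."""
    return 2.0 ** (years / doubling_period_years)


print("years  battery  transistors  ratio")
for years in (2, 4, 6, 8, 10):
    battery = battery_growth(years)
    transistors = transistor_growth(years)
    print(f"{years:5d}  {battery:7.2f}  {transistors:11.1f}  {transistors / battery:5.1f}")
```

Over ten years the sketch gives a battery improvement of less than a factor of two against a 32-fold rise in transistor count.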

UCSD/MIT’s approach is not a ‘waste’ of transistors if its use of dark silicon can deliver higher performance within the boundaries of the power budget. The Von Neumann architecture was built to save components since it was created at a time when components were expensive, bulky and hard to manufacture. Our problem today and in the near future is to conceive of an architecture that can use an affluence of components.


One Response to “Dark Silicon Redux: System Design Problem or Fundamental Law?”

  1. Leigh says:

    One existing approach to the ‘dark silicon’ problem is kind of like the UCSD/MIT approach. Employ hundreds of simpler, lighter weight cores that are far more energy-efficient than typical SMP/Von Neumann/shared/virtual memory cores such as x86 CPUs. Then arrange them on silicon in a MIMD, hierarchical mesh of interconnect so they can still express highly complex programs such as H.264 video compression. Decompose application architectures so that the cores only do single tasks. Each core would have efficient, dedicated, distributed memories instead of constantly churning virtual/shared memory (do allow shared SDRAM access for larger data chunks such as video frame buffers). Oh, and to ease programmability and software design/debug, have the MIMD interconnect enforce task/core ‘encapsulation’ in the object-oriented sense, enabled in silicon circuits… this encapsulation being crucial to practical programmability and debug of complex software systems on hundreds or thousands of cores. For an example of this architecture, see Nethra’s ‘Ambric-Architecture’ chips with over 300 cores per chip. As an example, these massively-parallel-processor-array (MPPA) chips are deployed in a 13,000 core X-ray processing system of 40 TeraOPS and 2,000 GMACS compute, in under 500 W. This 40-chip system is housed in an ATCA chassis. If the same chassis could possibly be stuffed with 40 Intel CPUs or Nvidia GPUs of equivalent compute, it would melt into a smoking puddle of aluminum/burnt plastic from over 6 kW of power (GPUs). See http://www.nethra.us.com/products_am2045_overview.php for a description of the current generation chip. The next gen of this type of architecture can push toward 1K cores and enormous compute on one die without a ‘dark silicon’ problem. The key to practical use is programmability, which has been solved by the programming model, the encapsulation enforced by the self-synchronizing interconnect, and mature design/debug tools.
